Mon Nov 24 05:18:23 PST 2008

This is on my machine at home, an old Thinkpad.

pizza@extracheese:~/proj/palindromic-numbers/strrev$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 10
cpu MHz         : 896.135
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse up
bogomips        : 1793.80
clflush size    : 32

pizza@extracheese:~/proj/palindromic-numbers/strrev$ ./strrev
function                 sec   speedup
obvious                68.33      0.0%
obvious                43.90     55.7%
byte4_w32              24.00     82.9%
byte8_w32              23.95     83.3%
byte8_w64              27.85     57.6%
byte16_w32             18.08    142.8%
byte16_w64             18.80    133.5%
byte32_w32             15.97    174.8%
byte32_w_prefetch      15.89    176.2%
byte32_w64             16.84    160.7%
byte64_w32             14.89    194.8%
byte64_w64             24.82     76.9%
byte128_w32            15.19    189.0%
byte128_w64            26.65     64.7%
byte256_w32            30.19     45.4%
byte256_w64            15.72    179.3%
byte512_w64            18.26    140.4%
obvious_prefetch       44.86     -2.1%
obvious_check          86.36    -49.2%
obvious_pointer        32.76     34.0%
obvious_twoindex       30.41     44.4%
insideout              45.09     -2.7%
byte2_unroll           57.16    -23.2%
byte4_unroll           42.50      3.3%
byte4_unroll_prefetch  42.50      3.3%
byte4_unroll2          49.32    -11.0%
byte4_unroll2_expect   49.00    -10.4%
byte4_unroll3          29.99     46.4%
byte4_loop             36.65     19.8%
byte4_wb               44.81     -2.0%
byte4_wc               23.68     85.4%
byte8_unroll           43.65      0.6%
byte8_subloop          46.05     -4.7%
obvious                43.75      0.3%

This is on my workstation at work, an Intel Core 2 (T7200) running 32-bit Linux. Fast clock speed but a surprisingly small cache (64 KB as reported by /proc/cpuinfo).
The unrolled functions come out way ahead again, but the 32-bit versions (*_w32) beat out the 64-bit versions (*_w64).

pizza@debian:~/c$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
stepping        : 8
cpu MHz         : 1992.651
cache size      : 64 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss nx pni ds_cpl
bogomips        : 3956.73

pizza@debian:~/c$ gcc -std=c99 -march=core2 -mtune=core2 -W -Wall -pedantic -Wno-unused -O3 -o strrev strrev.c
pizza@debian:~/c$ ./strrev
function                 sec   speedup
obvious                19.11      0.0%
byte4_w32               4.71    305.8%
byte8_w32               5.39    254.8%
byte8_w64               6.21    207.6%
byte16_w32              3.44    456.1%
byte16_w64              3.87    393.8%
byte32_w32              2.24    751.7%
byte32_w_prefetch       2.35    715.0%
byte32_w64              2.41    693.7%
byte64_w32              2.25    751.1%
byte64_w64              3.72    414.1%
byte128_w32             1.72   1011.2%
byte128_w64             1.82    949.7%
obvious_prefetch        9.81     94.7%
obvious_check          15.35     24.5%
obvious_pointer         4.80    298.4%
obvious_twoindex        4.61    314.3%
insideout               8.43    126.8%
byte2_unroll           12.48     53.1%
byte4_unroll            9.41    103.1%
byte4_unroll_prefetch  10.22     86.9%
byte4_unroll2          11.79     62.1%
byte4_unroll2_expect   11.91     60.4%
byte4_unroll3           6.17    209.8%
byte4_loop              7.19    165.9%
byte4_wb                9.53    100.6%
byte4_wc                7.85    143.5%
byte8_unroll           18.26      4.6%
byte8_subloop          17.13     11.6%
obvious                 8.83    116.5%

Here's the same machine running Windows XP (via VMware; the Linux install runs in VMware as well), compiled with Visual Studio 2005. I had to make a few small changes to the file. We see that the obvious function runs much better while the unrolled functions don't do as well. Maybe it's the result of caching in the OS runtime? Also notice that many functions perform significantly worse than the 'obvious' one. Due to the way I calculate speedup, the negative percentages don't really make sense numerically...
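Judging from the numbers alone, the speedup column is consistent with (baseline / sec - 1) * 100, where the baseline is an 'obvious' run. That is inferred from the tables, not taken from strrev.c. Under that formula a function three times faster scores +200%, but one three times slower scores only about -67%, and nothing can score below -100%, which is why the negative values look odd. A minimal sketch of that calculation, next to one possible symmetric alternative; the names speedup_pct and signed_ratio are made up for illustration:

```c
#include <assert.h>

/* Speedup as the tables appear to compute it (an inference from the
 * printed numbers, not the code in strrev.c): how much faster than the
 * baseline, in percent. Positive values are unbounded; negative values
 * can never drop below -100%. */
double speedup_pct(double baseline_sec, double sec)
{
    assert(baseline_sec > 0.0 && sec > 0.0);
    return (baseline_sec / sec - 1.0) * 100.0;
}

/* Hypothetical symmetric alternative: a signed time ratio.
 * +3.0 means three times faster than the baseline,
 * -3.0 means three times slower. */
double signed_ratio(double baseline_sec, double sec)
{
    assert(baseline_sec > 0.0 && sec > 0.0);
    if (sec <= baseline_sec)
        return baseline_sec / sec;
    return -(sec / baseline_sec);
}
```

For example, with a baseline of 8.88 s, a function that takes 26.17 s comes out at roughly -66% under the first formula, while the signed ratio reports it as roughly 2.9 times slower.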
C:\src\strrev\vs2005\release>vs2005.exe
function                 sec   speedup
obvious                 8.88      0.0%
byte4_w32               3.33    166.6%
byte8_w32               3.00    195.8%
byte8_w64               3.02    194.4%
byte16_w32              2.86    210.4%
byte16_w64              2.97    198.9%
byte32_w32              2.81    215.6%
byte32_w_prefetch       2.80    217.3%
byte32_w64              2.75    222.7%
byte64_w32              2.64    236.0%
byte64_w64              4.63     91.9%
byte128_w32             2.61    240.2%
byte128_w64             2.33    281.2%
byte256_w64             2.34    278.6%
obvious_prefetch        8.70      2.0%
obvious_check          26.17    -66.1%
obvious_pointer         8.55      3.8%
obvious_twoindex        8.50      4.4%
insideout               8.67      2.3%
byte2_unroll           10.20    -13.0%
byte4_unroll           26.64    -66.7%
byte4_unroll_prefetch  26.73    -66.8%
byte4_unroll2          22.27    -60.1%
byte4_unroll2_expect    8.66      2.5%
byte4_unroll3          22.06    -59.8%
byte4_loop             24.66    -64.0%
byte4_wb               13.66    -35.0%
byte4_wc                3.28    170.5%
byte8_unroll           38.98    -77.2%
byte8_subloop          22.50    -60.6%
obvious                 8.58      3.5%

This is on a 4-way 64-bit Intel Xeon server at work. Huge L2 cache, though the clock speed is a little slower because the machine is older. Copying the most bytes possible at once, a la byte128_w64, is enormously faster than the byte-by-byte way. And since this is a 64-bit CPU, the 64-bit functions mostly beat the 32-bit ones.
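To make "more bytes at once" concrete, here is a minimal sketch of the idea, not the actual code from strrev.c. It assumes the benchmark reverses a buffer into a separate destination; the names rev_bytewise, rev_word32 and bswap32 are made up for illustration. The word version moves four bytes per iteration through a 32-bit integer and swaps their order in a register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-by-byte: write src into dst in reverse order. */
void rev_bytewise(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[n - 1 - i];
}

/* Reverse the order of the four bytes in a 32-bit word. */
static uint32_t bswap32(uint32_t w)
{
    return (w >> 24) | ((w >> 8) & 0x0000ff00u) |
           ((w << 8) & 0x00ff0000u) | (w << 24);
}

/* Four bytes per iteration: load a word from the tail of src, swap its
 * bytes, store it at the head of dst. memcpy keeps the unaligned loads
 * and stores well-defined; compilers usually turn it into a plain move.
 * dst and src must not overlap. */
void rev_word32(char *dst, const char *src, size_t n)
{
    assert(dst != src); /* this sketch is not an in-place reversal */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t w;
        memcpy(&w, src + n - 4 - i, sizeof w);
        w = bswap32(w);
        memcpy(dst + i, &w, sizeof w);
    }
    /* Fewer than four bytes left: finish byte by byte. */
    for (; i < n; i++)
        dst[i] = src[n - 1 - i];
}
```

Wider variants would follow the same pattern with a 64-bit word or several words per iteration; whether they win depends on the CPU, as the tables show.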
pizza@uranus ~ $ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU 5110 @ 1.60GHz
stepping        : 6
cpu MHz         : 1596.479
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips        : 3196.19
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

pizza@uranus ~ $ cc -std=c99 -W -Wall -Wno-unused -pedantic -O3 -o strrev strrev.c && ./strrev
function                 sec   speedup
obvious                35.74      0.0%
obvious                 8.77    307.4%
byte4_w32               4.55     92.7%
byte8_w32               4.84     81.3%
byte8_w64               4.44     97.6%
byte16_w32              6.19     41.7%
byte16_w64              5.96     47.1%
byte16_w128             7.37     19.0%
byte32_w32              4.18    109.8%
byte32_w_prefetch       4.32    103.2%
byte32_w64              3.67    139.2%
byte32_w128             4.86     80.6%
byte64_w32              3.46    153.6%
byte64_w64              5.80     51.2%
byte64_w128             3.78    132.4%
byte128_w32             3.46    153.4%
byte128_w64             2.55    244.4%
byte128_w128            3.41    157.2%
byte256_w32             6.03     45.6%
byte256_w64             2.64    232.6%
byte256_w128            3.28    167.7%
byte512_w64             2.50    250.3%
obvious_prefetch       18.85    -53.4%
obvious_check          27.80    -68.4%
obvious_pointer         8.80     -0.3%
obvious_twoindex        8.74      0.4%
insideout              10.81    -18.8%
byte2_unroll            9.03     -2.9%
byte4_unroll            9.04     -2.9%
byte4_unroll_prefetch   9.47     -7.4%
byte4_unroll2           8.89     -1.3%
byte4_unroll2_expect    9.01     -2.7%
byte4_unroll3           8.72      0.6%
byte4_loop              9.28     -5.4%
byte4_wb               10.47    -16.2%
byte4_wc                5.65     55.3%
byte8_unroll            8.08      8.6%
byte8_subloop           9.49     -7.6%
obvious                 8.77      0.0%

This is on my webserver, which is a Celeron.
$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Celeron(R) CPU 2.00GHz
stepping        : 9
cpu MHz         : 1999.764
cache size      : 128 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 3948.54