Here you can find information on how to measure the speed of your machine running spinpack.
First download spinpack-2.15.tgz (or a newer version). Uncompress and untar the file, configure the Makefile, compile the sources, and run the executable. Here is an example:
gunzip -c spinpack-2.15.tgz | tar -xf -
cd spinpack
# --- small speed test --- (1CPU, MEM=113MB, DISK=840MB, nud=30,10)
./configure --nozlib
make speed_test
sh -c "( cd ./exe; time ./spin ) 2>&1 | tee speed_test_small"
# --- big speed test --- (16CPUs, MEM=735MB, DISK=6GB, nud=28,12)
./configure --mpt --nozlib
make speed_test; grep -v small exe/daten.i1 >exe/daten.i
sh -c "( cd ./exe; time ./spin ) 2>&1 | tee speed_test_big"
Send me the output files together with the characteristic data
of your computer for comparison. Please also add the output of
grep FLAGS= Makefile
and cpu.log,
if you have it.
The next table gives an overview of the computation time for an N=40 site system (used for the speed test), started in 2003. The first column gives the numbers of up- and down-spins (nud) set in daten.i. The other columns list the time needed for writing the matrix (SH) and for the first 40 iterations (i=40), as shown by the output. The star (*) marks the default configuration used by make speed_test (see above). The double star (**) marks an example of the big speed test (see above).
new tests: t=[[hh:]mm:]ss
nud    ns    SH      i100    12sij    CPUs machine time=[hh:]mm:ss(+-ss) dflt: v2.26 -O2
------+---------------------------------+--+--- E=-8.22686823 SMag= 0.14939304 ----------------------
32,8   39    1:36    37      16:43      1  Altix330-64GB 8x2-IA-64.m2.r1-1.5GHz v2.26 g++4.1 -O2
32,8   24    50      22      8:37       2  h_get-inlined.i100=22s
32,8   15    26      17      4:41       4  h_get-inlined.i100=17s
32,8   14    16      2:40    2:58       6
32,8   15    13      3:52    2:13       8  tmp=/dev/shm or DSK cpu 0,1,6,7,10,13,14,15
32,8   14    14      1:32    2:23       8  # pmshub, linkstat, shubstats -cachetraffic,memdir,linkstats
32,8   14    12      4:12    2:13       8  # numactl -i all 2m, -l 3m
32,8   14    9       5:18    -         12  tmp=/dev/shm
32,8   14    18      5:12    1:34      12
32,8   14    12      4:32    2:01      16
30,10  6:39  19:15   12:54   -          1  inlined_h_get
30,10  4:11  9:54    10:17   -          2  inlined_h_get
30,10  2:33  4:55    6:20    -          4  inlined_h_get
30,10  2:29  2:41    4:25    -          8  inlined_h_get
28,12  42:13 2:11:21 1:48:02 23:49:57   1  inlined_h_get (157m -> 108m)
28,12  26:15 1:10:32 1:29:55 12:16:00   2  h_get_inlined (292m -> 90m)
28,12  16:33 36:05   1:19:55 6:12:54    4  h_get_inlined (346m -> 80m)
28,12  16:07 19:08   38:07   3:10:28    8  h_get_inlined (412m -> 38m)
32,8   37    2:43    1:51    27:56      1  GS160-24GB 4x4-alpha-731MHz v2.26 cxx-6.5
32,8   28    1:23    1:20    15:05      2  h_get_inlined (3:09 -> 1:20)
32,8   16    43      1:09    7:17       4  h_get_inlined (5:10 -> 1:09)
32,8   10    29      45      3:44       8  h_get-inlined (6:29 -> 0:45)
32,8   9     14      42      1:54      16  (100i: 5:21 -> 42 (under load))
30,10  6:24  32:29   38:02   -          1
30,10  4:38  16:47   25:30   -          2
30,10  2:45  8:50    20:55   -          4  h_get inlined
30,10  1:40  4:37    14:02   47:53      8  h_get inlined (87m -> 14m=8*(6-11m))
30,10  1:48  3:12    11:10   24:38     16  h_get inlined (73m -> 11m) used
28,12  11:01 35:19   1:52:40 -          8  h_get inlined (476m->123m, other jobs running)
28,12  10:59 17:10   -       -         16  h_get inlined (424m->?)
cxx -g1 -pthread -pg; ./spin; gprof -b -F inc1 ./spin   # gmon.out
34,6   16CPUs (real=12s user=147s, 67s h_get, 59s hamilton_geth_block)
34,6    8CPUs (real=13s user= 94s, 43s h_get, 34s hamilton_geth_block)
cxx -g1 -pthread -p; ./spin; prof ./spin   # mon.out ... + hiprof -pthread -run
cxx -g3 -pthread; pixie -pthread -run ./spin; prof -pixie -threads -all spin *.Counts*
35,5   1:07  1:35    1:44    -          8  # hiprof -threads -run ./spin; gprof -b -scaled -all spin.hiprof *.?.hiout (-asm|-lines -f h_get )
32,8   3:47  1:24    48      -          8  # h_get-inlined 410s(8*50s)
30,10  4:46  26:09   29:37   -          1  # under load! GS1280-128GB-striped 32-alpha-1150MHz v2.26 cxx-6.5
30,10  2:08  6:21    15:45   -          4  # under load! GS1280-128GB-striped 32-alpha-1150MHz v2.26 cxx-6.5

old tests:
nud    SH-time   i=40-time CPUs machine time=[hh:]mm:ss(+-ss) dflt: v2.15 -O2
-------+---------+---------+--+------------------- E=-8.22686823 ----------------------
32,8   3:32      9:20      1  Via-C3-1GHz-64k-gcc-3.3 v2.19 -O2 -msse -march=i586 lt=245s (-T 255MB/s)
32,8   1:25      3:10      1  Celeron-1GHz-gcc_2.95.3 v2.18 -O2 lt=59s (rl: cache=256kB disk=26MB/s dskcache=168MB/s)
32,8   0:53      1:56      1  Centrino-1.4GHz-gcc-3.3 v2.19 -O2 -msse -march=i586 lt=40s 3s/5It (-T 986MB/s)
32,8   2:02      4:10      1  Centrino-600MHz-gcc-3.3 v2.19 -O2 -msse -march=i586 lt=94s 4s/5It (-T 858MB/s) speed-step
32,8   2:11      4:18      1  Centrino-600MHz-gcc-3.3 v2.19 -O2 -msse -march=i686 lt=95s 4s/5It (-T 858MB/s) speed-step
32,8   0:50      2:04      1  Pentium4-2.5GHz-gcc-3.2 v2.17 -O2 -march=i686 -msse B_NL2=4
32,8   1:22      2:27      1  Pentium4-2.6GHz-gcc-3.3 v2.26p1 -O2 B_NL2=0 L2=512kB fam=15 model=2 stepping=9 (no inline h_get) no matter if h_get/h_put inlined, B_NL2=2
32,8   1:03      2:21      1  AthlonXP-1.7GHz-gcc-3.2 v2.17 -O2 -march=athlon-xp -m3dnow (hda=55MB/s)
32,8   0:55      2:21      1  AthlonXP-1.7GHz-gcc-3.3 v2.21 -O4 -march=athlon-xp -m3dnow lt=39s 6s/5It (-T 408MB/s hda=48MB/s) i65=2m32s+18s 66MB/48MB*s*65=89s
32,8   1:14      4:32      1  Xeon-2GHz-v2.18-gcc-3.2 -O2 -march=i686 -msse 4x4 lt=1:00
32,8   0:43      1:18      1  Xeon-2660MHz-2M-8GB-v2.25-gcc-4.1.1 -O2 amd?64bit 4x4 model6 stepping4 lt=22s n2=24s 3s/10It 2*DualCore*2HT=8vCPUs bellamy
32,8   0:43      1:19      1  Xeon-3GHz-12GB-v2.24-gcc-4.1 -O2 64bit 4x4 lt=0:22 model4 stepping10
32,8   0:49      1:31      1  Xeon-3GHz- 2GB-v2.25-gcc-4.0.2 -O2 32bit 4x4 lt=0:26 model4 stepping3
32,8   1:21      3:13      1  GS160-Alpha-731MHz-cxx v2.17 -fast -g3 -pg Compaq C++ V6.3-008
32,8   1:42      3:23      1  GS160-Alpha-731MHz-g++4.1.1 v2.24 -O2 lt=47s n2=52s 12s/10It
32,8   1:43      3:23      1  GS160-Alpha-731MHz-g++4.1.1 v2.24 -O2 -mcpu=ev67 -mtune=ev67 lt=47s n2=53s 12s/10It
32,8   1:39      3:11      1  GS160-Alpha-731MHz-gcc4.1.1 v2.24 -O3 -funroll-loops -fomit-frame-pointer -ffast-math -mcpu=ev67 -mtune=ev67 lt=41s n2=46s 11s/10It
32,8   0:59      2:16      1  ES45-Alpha-1250MHz-cxx-6.3 -fast v2.18 lt=0:59 2x2
32,8   3:16      7:04      1  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA B_NL2=2
32,8   3:11      6:39      1  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA B_NL2=0
32,8   3:13      5:56      1  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA -lpthread
32,8   1:59      4:54      2  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA -lpthread
32,8   1:23      4:36      4  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA -lpthread
---------------------------- n1=5.3e6 113MB+840MB E=-13.57780124 ---------
30,10  23m       76m       1  Pentium-1.7GHz-gcc v2.15 -lgz
30,10  21m       50m       1  Pentium-1.7GHz-gcc v2.15
30,10  12:58     44:01     1  AthlonXP-1.7GHz-gcc-3.3 v2.21 -O4 -march=athlon-xp -m3dnow (lt=6m44s hda=48MB/s, cat 40x800MB=15m, 48%idle) 3m/5It i65:r=60m,u=34m,s=4m (also pthread)
30,10  15:15     38:29     1  AthlonXP-1.7GHz-gcc-3.2 v2.17 -O2 -march=athlon-xp -m3dnow (lt=7m29s hda=55MB/s, cat 40x800MB=15m, 40%idle)
30,10  15:34     45:28     1  AthlonXP-1.7GHz-gcc-3.2 v2.18 -O2 -march=athlon-xp -m3dnow -lgz (lt=7m29s hda=55MB/s, zcat 40x450MB=13m, 1%idle)
30,10  11:51     26:31     1  Pentium4-2.5GHz-gcc-3.2 v2.17 -O2 -march=i686 -msse B_NL2=4
30,10  15:59     40:09     1  Xeon-2GHz-gcc-3.2 v2.17 -O2 -march=i686 -msse -g -pg
30,10  14:26     34:08     1  Xeon-2GHz-gcc-3.2 v2.17 -O2 -march=i686 -msse
30,10  8:16      25:09     2  Xeon-2GHz-gcc-3.2 v2.17 -O2 -march=i686 -msse (slow nfs-disk)
30,10  14:40     32:26     1  Xeon-2GHz-gcc-3.2 v2.18 -O2 -march=i686 -msse 4x4 lt=10m34
30,10  8:44      16:59     1  Xeon-3GHz-12GB-v2.24-gcc-4.1.1 -O2 64bit 4x4 lt=3m31 model4 stepping10 n2=3m55 65s/10It
30,10  9:57      19:08     1  Xeon-3GHz- 2GB-v2.25-gcc-4.0.2 -O2 32bit 4x4 lt=4m34 model4 stepping3 n2=4m58
30,10  8:52      17:53     1  Xeon-3GHz- 2GB-v2.24-gcc-4.1.1 -O2 32bit 4x4 lt=4m25 model4 stepping3 n2=4m50
30,10  8:27      16:48     1  Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit 4x4 lt=3m43 model6 stepping4 n2=4m10 62s/10It 2*DualCore*2HT=8vCPUs bellamy
30,10  3:19      11:36     4  Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit 4x4 lt=2m10 model6 stepping4 n2=2m37 76s/10It 2*DualCore*2HT=8vCPUs bellamy
30,10  1:57      7:49      8  Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit 4x4 lt=1m07 model6 stepping4 n2=1m35 55s/10It 2*DualCore*2HT=8vCPUs bellamy
30,10  6:56      15:15     1  4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=4m18 n2=04:37 (116s/10It) Knoppix-3.8-32bit
30,10  4:04      11:12     2  4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=2m40 n2=02:58 (63s/10It) cpu5+7
30,10  2:20      9:05      4  4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=1m39 n2=01:57 (72s/10It) cpu3-6 (2*HT included?)
30,10  2:47      6:33      4  4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=1m23 n2=01:40 (52s/10It) cpu3-6 (2*HT included?)
30,10  1:15      4:42      8  4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=1m37 n2=01:55 (24s/10It)
30,10  1:03      4:10    2*8  4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=1m37 n2=01:55 (17s/10It) (4CPUs * 2Cores)
30,10  1:00      4:33    4*8  4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=1m38 n2=01:56 (17s/10It) (4CPUs * 2Cores) ulimit -n 4096
30,10  2:43      8:57      2  4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=1m51 n2=02:10 (46s/10It) kanotix2005-04-64bit 16G-RAM tmpfs=...MB/s
30,10  2:02      5:34      4  4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=1m05 n2=01:23 (32s/10It) kanotix2005-04-64bit 16G-RAM tmpfs=...MB/s
30,10  1:08      3:16      8  4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=0m42 n2=01:00 (17s/10It) kanotix2005-04-64bit 16G-RAM tmpfs
30,10  5:16      13:19 a2  1  4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=2m24 n2=02:41 (i56,1m12s/10It) zizj=15m07 all=18m58s SLES10-64bit 32G-RAM (8virtCPUs) ltrace: rand=450s fread=223s
30,10  0:54      3:25  a2 16  4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=0m41 n2=00:58 (i49,2s/It) sisj=13m57 slow! SLES10-64bit 32G-RAM (8virtCPUs)
30,10  8:14      29:52     4  SunFire-880-SparcIII-750MHz-CC-5.3 v2.17 -fast -lz (sun4u) lt=4:14 4 threads
30,10  19:31     1:03:28   1  SunFire-880-SparcIII-750MHz-CC-5.3 v2.17 -fast (sun4u) lt=9:50 16 threads 2048s/40*168e6=0.30us
30,10  27:28     1:14:28   1  SunFire-880-SparcIII-750MHz-g++2.95 v2.17 -mv8 -O2 (sun4u) lt=22:32 4 threads
30,10  30:24     59:33     1  SunFire-880-SparcIII-750MHz-g++-4.1 v2.25 -mv9 -O2 32-v8+ lt=12:01 4m00/10It 8 threads
30,10  28:20     55:14     1  SunFire-880-SparcIII-750MHz-g++-4.1 v2.25 -mv9 -O2 64-v9 lt=9:59 3m54/10It 8 threads
30,10  29:47     59:49     1  SunFire-880-SparcIII-750MHz-g++-4.1 v2.25 -profile-use 64v9 lt=9:52 4m44/10It 4 threads
30,10  28:31     56:35     1  SunFire-880-SparcIII-750MHz-g++-4.1 v2.25 -mcpu=ultrasparc3 -mvis -mtune= 64v9 lt=9:19 n2=10m18 4m27/10It 4 threads
30,10  28:10     55:32     1  SunFire-880-SparcIII-750MHz-gcc-4.1 v2.25 -O3 -mcpu=ultrasparc3 -mtune= 64v9 lt=8:56 n2=9m50 4m22/10It 4 threads
30,10  9:12      25:51     4  SunFire-880-SparcIII-750MHz-gcc-4.1 v2.25 -O3 -mcpu=ultrasparc3 -mtune= 64v9 lt=3:13 n2=4m06 3m08/10It 4 threads
30,10  7:52      21:40     4  SunFire-880-SparcIII-750MHz-CC-5.3 v2.19 -fast (sun4u) lt=6:11 4 threads (55s/5It) vbuf=16M
30,10  7:24      26:45     4  SunFire-880-SparcIII-750MHz-CC-5.3 v2.17 -fast (sun4u) lt=4:11 4 threads 4*910s/40*168e6=0.54us
30,10  7:12      26:28     4  SunFire-880-SparcIII-750MHz-CC-5.3 v2.17 -fast -O4 (sun4u) lt=4:05 4 threads 4*911s/40*168e6=0.54us
30,10  3:44      16:58     8  SunFire-880-SparcIII-750MHz-CC-5.3 v2.17 -fast (sun4u) lt=4:23 16 threads 8*532s/40*168e6=0.63us
30,10  13:42     25:56     1  SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O3 -mcpu=ultrasparc3 4x4 (64bit) v2.25+ lt=4m26 n2=4m53 110s/10It (8virtCPUs,4DualCore) 32GB
30,10  9:04      20:26     1  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=4m50 n2=5m20 81s/10It (8virtCPUs)
30,10  9:01      19:30     1  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra3 -xarch=v9 (64bit) v2.25+ lt=4m37 n2=5m06 80s/10It (8virtCPUs)
30,10  5:26      13:49     2  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=3m16 n2=3m46 69s/10It (8virtCPUs)
30,10  3:07      8:16      4  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=1m49 n2=2m21 42s/10It (8virtCPUs)
30,10  1:46      7:43      8  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=2m55 n2=3m59 29s/10It (8virtCPUs)
30,10  1:23      5:05    2*8  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=1m34 n2=2m04 (8virtCPUs,4DualCore)
30,10  2:00      5:39    2*8  SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O3 -mcpu=ultrasparc3 (64bit) v2.25+ lt=1m29 n2=1m55 26s/10It (8virtCPUs,4DualCore) 32GB
30,10  2:04      5:40    2*8  SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O2 -mcpu=ultrasparc3 (64bit) v2.25+ lt=1m25 n2=1m53 26s/10It (8virtCPUs,4DualCore) 32GB
30,10  14:25     26:09     1  ES45-Alpha-1250MHz-gcc-3.2.3 -O2 v2.18 lt=6:09 2x2
30,10  12:13     22:15     1  ES45-Alpha-1250MHz-cxx-6.3 -fast v2.18 lt=4:14 2x2 (ev56)
30,10  22:55     53:52     1  GS160-Alpha-731MHz-gcc4.1.1 v2.24 -O3 -funroll-loops -fomit-frame-pointer -ffast-math -mcpu=ev67 -mtune=ev67 lt=6m56s n2=7m46 5m41s/10It 17jobs/16cpus
30,10  13:20     32:30     4  GS160-Alpha-731MHz-gcc4.1.1 v2.24 -O3 -funroll-loops -fomit-frame-pointer -ffast-math -mcpu=ev67 -mtune=ev67 lt=5m08s n2=5m59 3m21s/10It 18jobs/16cpus
* 30,10 24m      64m       1  GS160-Alpha-731MHz-cxx-6.3 v2.15
30,10  19:00     48:14     1  GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast -g3 -pg (42% geth_block, 27% b_smallest, 16% ifsmallest3)
30,10  21:12     50:37     1  GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast
30,10  19:36     59:44     1  GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast 16 threads
30,10  12:15     36:16     2  GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast 16 threads
30,10  8:24      24:17     3  GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast 16 threads
30,10  7:40      26:36     3  GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast -pthread 4 threads
30,10  7:48      53:00    10  GS160-Alpha-731MHz-cxx-6.3 v2.15
30,10  3:50      18:23    16  GS160-Alpha-731MHz-cxx-6.3 v2.15 ( 64 threads)
30,10  3:33      15:19    16  GS160-Alpha-731MHz-cxx-6.3 v2.15 (128 threads) simulates async read
30,10  21:20     43:55     1  GS160-Alpha-731MHz-cxx-6.3 v2.18 -fast lt=06:50 4m/10It
30,10  19:44     46:16     2  GS160-Alpha-731MHz-cxx-6.3 v2.18 -fast lt=05:46 (1m59s..2m54s)/5It (work load, home)
30,10  12:18     34:11 a2  2  GS160-Alpha-731MHz-cxx-6.3 v2.18 -fast lt=05:38 (20s..22s)/a2It (work load, home, a2=53It/34m)
30,10  5:35      12:41    16  GS160-Alpha-731MHz-cxx-6.3 v2.18 -fast lt=02:51 1m/10It (640%CPU)
30,10  12:55     23:15     1  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=04:33 1m26s/10It = 10*840MB/1m26s=98MB/s 10*hnz/86s/1=20e6eps/cpu 50ns (max.80ns)
30,10  2:19      5:59      8  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=02:11 22s/10It (14%user+5%sys+81%idle (0%dsk) of 32CPUs) 10*hnz/22s/8=10e6eps/cpu
30,10  1:38      4:11     16  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=01:47 12s/10It
30,10  1:46      3:48     32  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=01:25 9s/10It
30,10  1:01:10   4:25:28   1  O2100-IP27-250MHz-CC-7.30 v2.15 -O3 -lz
30,10  50:06     3:12:22   1  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA -lz
30,10  30:14     2:00:42   2  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA -lz
30,10  41:50     1:35:44   1  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA
30,10  54:00     1:45:15   1  O2100-IP27-250MHz-CC-7.30 v2.17v3 ssrun -64 -O2 -IPA lt=00:20:33 geth_bl=2200s latency?=2030s/60*168e6=0.20us (XY_NEW+sortH)
30,10  47:06     1:36:56   1  O2100-IP27-250MHz-CC-7.30 v2.17v3 ssrun -64 -O2 -IPA lt=00:20:40 geth_bl=2090s latency?=1928s/60*168e6=0.19us (XY_NEW)
30,10  26:52     1:14:28   2  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA (HBLen=1024 about same)
30,10  16:50     1:13:51   8  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA
30,10  19:23     1:16:17   4  O2100-IP27-250MHz-CC-7.30 v2.18 -64 -O2 4x4 hnz+15% lt=00:11:33
30,10  19:08     0:59:29   4  O2100-IP27-250MHz-CC-7.30 v2.18 -64 -Ofast -IPA 4x4 hnz+15% lt=00:13:00
30,10  44:22     2:11:25   2  MIPS--IP25-194MHz-CC-7.21 v2.19 -64 -Ofast -IPA 2x2 lt=23m (8m00s)/5It CFLOAT
30,10  44:14     2:12:54 a2 2 MIPS--IP25-194MHz-CC-7.21 v2.19 -64 -Ofast -IPA 2x2 lt=25m (2m05s)/a2It CFLOAT i45=2h23m
30,10  22:08     2:03:54   4  MIPS--IP25-194MHz-CC-7.21 v2.19 -64 -Ofast -IPA 4x4 lt=47m (6m49s)/5It
30,10  22:04     2:12:33 a2 4 MIPS--IP25-194MHz-CC-7.21 v2.19 -64 -Ofast -IPA 4x4 lt=47m (2m09s)/a2It i51=2h36m
30,10  23:13     1:45:24   4  MIPS--IP25-194MHz-gcc-323 v2.19 -O2 -mips4 -mabi=64 -mcpu=orion 4x4 lt=21m (7m35s)/5It read=20k
30,10  20:22     1:32:46   4  MIPS--IP25-194MHz-CC-7.30 v2.19 -64 -Ofast -IPA 4x4 lt=14m (7m15s)/5It
---------------------------- n1=35e6 -18.11159089 735MB+6GB
28,12  10:40:39  20:14:07  1  MIPS--IP25-194MHz-CC-7.21 v2.18 -64 -Ofast -IPA 1x1 lt=2h51m 50m/5It (dd_301720*20k=354s dd*5=30m /tmp1 cat=6GB/352s=17MB/s)
28,12  6:04:55   16:03:42  2  MIPS--IP25-194MHz-CC-7.21 v2.18 -64 -Ofast -IPA 2x2 lt=3h00m 52m/5It (ToDo: check time-diffs It0..It20?)
28,12  5:40:49   14:05:49  2  MIPS--IP25-194MHz-CC-7.30 v2.18 -64 -Ofast -IPA 2x2 lt=1h49m 49m/5It FLOAT npri=40 MaxSym=170, write=20480 read=? (2cat=6GB/451s 5*6GB=38m)
28,12  3:14:01   10:09:10  4  MIPS--IP25-194MHz-CC-7.30 v2.18 -64 -Ofast -IPA 4x4 lt=1h26m 41m/5It FLOAT npri=40 (was reset?) (4cat=6.6GB/469s)
28,12  3:25:09   11:04:01  4  MIPS--IP25-194MHz-CC-7.30 v2.18 -64 -Ofast -IPA 4x4 lt=1h32m 45m46s/5It CFLOAT npri=40
28,12  3:43:02   13:31:30  4  MIPS--IP25-194MHz-gcc-323 v2.19 -O2 -mips4 -mabi=64 -mcpu=orion 4x4 lt=2h11m (57m)/5It read=20k
28,12  3:42:38   13:07:44 a2 4 MIPS--IP25-194MHz-gcc-323 v2.19 -O2 -mips4 -mabi=64 -mcpu=orion 4x4 lt=2h22m (16m)/a2It read=20k i55=17h
28,12  3:14:22   12:28:31  4  MIPS--IP25-194MHz-CC-7.30 v2.19 -64 -Ofast -IPA 4x4 lt=1h25m (59m)/5It
28,12  3:15:36   12:00:46 a2 4 MIPS--IP25-194MHz-CC-7.30 v2.19 -64 -Ofast -IPA 4x4 lt=1h25m (16m)/a2It i54=15h51m
28,12  3:42:27   10:32:52  2  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA
28,12  171m      7h        1  Pentium-1.7GHz-gcc v2.15
28,12  5h        10h       1  GS160-Alpha-731MHz-cxx-6.3 v2.15
** 28,12 57:39   5:29:57  16  GS160-Alpha-731MHz-cxx-6.3 v2.15 (16 threads)
28,12  59:22     2:51:54  16  GS160-Alpha-731MHz-cxx-6.3 v2.15 (128 threads)
28,12  3:03:00   10:04:03  1  GS160-Alpha-731MHz-cxx-6.3 v2.17pre -fast
28,12  1:13:27   5:45:12   3  GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast -pthread 16
28,12  1:49:31   4:29:09   4  GS160-Alpha-731MHz-cxx-6.3 v2.18 -fast lt=25m home 10It/32..77m 7.5GB/635s=12MB/s(392s,254s,81s) tmp3=160s,40s,33s tmp3_parallel=166s,138s
28,12  52:57     2:17:00   8  GS160-Alpha-731MHz-cxx-6.5 v2.19 -fast lt=24m 13m30s/10It
28,12  2:00:56   4:08:31   1  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=53:23 17m17s/10It = 10*6GB/17m17s=58MB/s (3GB_local+3GB_far) 12e6eps/cpu
28,12  1:12:02   2:18:26   2  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=20:39 11m08s/10It = 10*6GB/11m08s=90MB/s 9e6eps/cpu
28,12  40:36     1:21:40   4  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=16:06 6m13s/10It
28,12  23:20     50:20     8  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=13:08 3m26s/10It
28,12  21:35     53:10     8  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=13:13 (2m04s..4m41s)/10It HBlen=409600 10*6GB/2m=492MB/s hnz*10/2m/8=10e6eps/cpu
28,12  14:01     32:17    16  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=10:46 1m51s/10It
28,12  13:09     27:50    32  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=08:37 1m29s/10It
28,12  15:41     30:57    32  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=08:42 1m24s/10It 70%user+7%sys+23%idle(0%io) 1user 4e6eps/cpu
28,12  51:51     1:56:55   2  ES45-Alpha-1250MHz-cxx-6.5 v2.23 -fast lt=16m 11m34s/10It under load
28,12  3:19:39   7:02:48   1  SunFire-880-SparcIII-750MHz-CC-5.3 v2.18 -fast (sun4u) lt=1h05m 1 thread (19m40s/5It)
28,12  1:48:28   4:29:24   2  SunFire-880-SparcIII-750MHz-CC-5.3 v2.18 -fast (sun4u) lt=47:17 2 threads (14m08s/5It)
28,12  58:41     2:42:08   4  SunFire-880-SparcIII-750MHz-CC-5.3 v2.18 -fast (sun4u) lt=36:36 4 threads (8m/5It, 4cat=6GB/0.5s)
28,12  1:00:27   2:44:18   4  SunFire-880-SparcIII-750MHz-CC-5.3 v2.18 -fast (sun4u) lt=35:46 4 threads (8m19s/5It) 2nd try v2.19
28,12  1:00:45   2:38:41 a2 4 SunFire-880-SparcIII-750MHz-CC-5.3 v2.18 -fast (sun4u) lt=39:25 4 threads (1m57s/1a2) 2nd try v2.19a2 i51=3h incl. EV
28,12  59:16     2:37:09   4  SunFire-880-SparcIII-750MHz-CC-5.3 v2.18 -fast (sun4u) lt=38:48 4 threads (7m17s/5It) FLOAT
28,12  35:59     1:47:38   8  SunFire-880-SparcIII-750MHz-CC-5.3 v2.18 -fast (sun4u) lt=29:39 8 threads (5m/5It)
28,12  1:31:19   3:18:24   1  SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O3 -mcpu=ultrasparc3 (64bit) v2.25+ lt=27m18 n2=29m52 9m17s/10It (8virtCPUs,4DualCore) 32GB
28,12  1:03:21   2:30:33   1  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=30m02 n2=32m57 13m32s/10It 8virtCPUs(4DualCore) 32GB
28,12  38:31     1:47:45   2  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=20m02 n2=22m58 11m32s/10It 8virtCPUs(4DualCore)
28,12  22:09     1:02:11   4  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=10m52 n2=13m58 6m30s/10It 8virtCPUs(4DualCore)
28,12  12:17     42:57     8  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=10m05 n2=13m25 4m15s/10It 8virtCPUs(4DualCore) 8threads
28,12  10:14     39:00   2*8  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=10m06 n2=13m02 3m55s/10It 8virtCPUs(4DualCore) 16threads
28,12  14:31     43:09   2*8  SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O3 -mcpu=ultrasparc3 (64bit) v2.25+ lt=9m26 n2=12m00 4m08s/10It (8virtCPUs,4DualCore) 32GB
28,12  14:44     43:48   2*8  SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O2 -mcpu=ultrasparc3 (64bit) v2.25+ lt=9m13 n2=12m00 4m15s/10It (8virtCPUs,4DualCore) 32GB
28,12  54:19     1:53:43   1  2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 lt=19m (9m36s)/10It hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)
28,12  32:42     1:20:55   2  2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 lt=13m (8m19s)/10It hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)
28,12  19:40     1:07:17   4  2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 lt=7m (9m38s)/10It hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat) memory distributed badly?
28,12  1:11:37   2:14:45   1  4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=25m52 n2=27:42 (8m36s/10It) Novel10-32bit 32G-RAM dsk=6MB/s
28,12  44:31     1:31:31   2  4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=16m25 n2=18:18 (6m59s/10It) Novel10-32bit 32G-RAM dsk=6MB/s
28,12  23:30     51:54     4  4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=10m06 n2=12:00 (4m09s/10It) Novel10-32bit 32G-RAM dsk=6MB/s
# Novel10-32bit: mount -t tmpfs -o size=30g /tmp1 /tmp1   # w=591MB/s
28,12  20:17     1:25:03   4  4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=9m05 n2=10:47 (13m33s/10It) knoppix-5.0-32bit 4of32G-RAM dsk=60MB/s
28,12  11:30     1:31:56   8  4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=8m53 n2=10:35 (16m54s/10It) knoppix-5.0-32bit 4of32G-RAM dsk=60MB/s
28,12  9:39      1:44:06 2*8  4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=8m56 n2=10:38 (20m29s/10It) knoppix-5.0-32bit 4of32G-RAM dsk=60MB/s
28,12  47:41     1:48:12   1  4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=30m n2=32:15 (7m03s/10It) kanotix2005-04-64bit 16G-RAM tmpfs
28,12  30:19     1:20:48   2  4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=22m n2=24:03 (6m36s/10It) kanotix2005-04-64bit 16G-RAM tmpfs
28,12  16:29     45:49     4  4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=6m55 n2=13:27 (3m30s/10It) kanotix2005-04-64bit 16G-RAM tmpfs
28,12  9:01      28:13     8  4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=8m29 n2=10:18 (2m13s/10It) kanotix2005-04-64bit 16G-RAM tmpfs
28,12  37:13     1:32:28 a2 1 4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=15m03 n2=16:58 (9m19s/10It) SLES10-64bit 32G-RAM
28,12  22:27     1:07:18 a2 2 4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=10m37 n2=12:21 (8m45s/10It) SLES10-64bit 32G-RAM
28,12  12:38     47:26   a2 4 4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=6m01 n2=7:47 (6m37s/10It) SLES10-64bit 32G-RAM
28,12  7:19      36:50   a2 8 4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=4m29 n2=6:24 (5m33s/10It) SLES10-64bit 32G-RAM
28,12  1:04:36   2:04:00   1  Xeon-3GHz-12GB-v2.24-gcc-4.1 64bit 4x4 lt=22m (8m36s)/10It model4 stepping10 n2=24m44 xen
28,12  37:42     1:28:24   2  Xeon-3GHz-12GB-v2.24-gcc-4.1 64bit 2x2 lt=16m (7m28s)/10It model4 stepping10 n2=18m18 xen
28,12  29:46     1:17:56   4  Xeon-3GHz-12GB-v2.24-gcc-4.1 64bit 4x4 lt=11m (8m51s)/10It model4 stepping10 n2=13m15 xen
28,12  1:13:28   2:42:16   1  Xeon-3GHz- 2GB-v2.25-gcc-4.0.2 32bit 4x4 lt=29m (13m46s)/10It model4 stepping3 n2=31m34
28,12  1:06:07   2:30:48   1  Xeon-3GHz- 2GB-v2.24-gcc-4.1.1 32bit 4x4 lt=28m (13m33s)/10It model4 stepping3 n2=30m16
28,12  13:46     1:24:21 a2 8 Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit model6 stepping4 lt=7m56 n2=10m42 13m02/10It 2*DualCore*2HT=8vCPUs bellamy
28,12  13:42     49:10     8  Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit model6 stepping4 lt=8m43 n2=11m30 7m13/10It 2*DualCore*2HT=8vCPUs bellamy san=w142MB/s,r187MB/s
28,12  12:40     48:23   2*8  Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit model6 stepping4 lt=8m08 n2=10m54 6m01/10It 2*DualCore*2HT=8vCPUs bellamy san=w142MB/s,r187MB/s
28,12  40:13     1:25:12  16  AltixIA64-1500MHz-gcc-3.3 v2.23 -O2 lt=18m (6m06s)/10It hxy_size 163840/32768=5 14+2CPUs on 1 RAM, numalink=bottleneck
28,12  34:35     1:53:10   8  AltixIA64-1500MHz-gcc-3.3 v2.23 -O2 lt=18m (14m30s)/10It hxy_size 163840/32768=5
28,12  44:21     2:33:00   4  AltixIA64-1500MHz-gcc-3.3 v2.23 -O2 lt=18m (21m50s)/10It hxy_size 163840/32768=5
28,12  1:01:59   3:30:52   2  AltixIA64-1500MHz-gcc-3.3 v2.23 -O2 lt=28m (29m29s)/10It hxy_size 163840/32768=5
28,12  1:34:21   3:21:23   1  AltixIA64-1500MHz-gcc-3.3 v2.23 -O2 lt=46m (14m37s)/10It hxy_size 163840/32768=5
----------------------------
27,13  7:41:55   29:08:50  4  MIPS--IP25-194MHz-CC-7.30 v2.19 -64 -Ofast -IPA 4x4 lt=3h10m (2h16m)/5It
27,13  2:21:06   6:24:27   4  SunFire-880-SparcIII-750MHz-CC-5.3 v2.19 -fast -xtarget=ultra -xarch=v9 -g -xipo -xO5 lt=73:15 (21m14s/5It)
27,13  1:48:39   10:35:00  8  GS160-Alpha-731MHz-cxx-6.5 v2.19 -fast lt=45m 56m/5It
27,13  54:36     14:23:16  8  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=27m (13m..1h32m)/5It
27,13  57:59     2:18:59   8  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=28m (6m38s)/5It HBLen=409600
27,13  32:15     4:26:40  16  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=22:35 (4m43s..1h38m)/10It
27,13  46:03     1:43:21  16  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=25:44 (3m50s..3m57s)/5It mfs-disk + vbuf=16MB + sah's
27,13  29:18     1:01:25  32  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=18:03 (1m43s..1h38m)/5It (2stripe-disk=150MB/s) 60%user+5%sys+35%idle(0%disk) 1user
---------------------------- n1=145068828=145e6 E0=-21.77715233 ZMag= 0.02928151
26,14  107h      212h      1  O2100-IP27-250MHz-CC-7.30 v1.4
26,14  3:30:24   12:09:12  4  ES45-Alpha-1GHz-CC-6.5 -fast -lz v2.17 lt=62m21
26,14  45:51     1:45:48  16  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=27m07 (4m00s...4m12s)/5It mfs-disk vbuf=16M + spike (optimization after linking)
26,14  47:18     1:50:45  16  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=27m08 (3m48s...5m16s)/5It mfs-disk + spike (optimization after linking)
26,14  1:31:49   3:31:01  16  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=48m34 (7m57s..10m01s)/5It mfs-disk vbuf=16M
26,14  1:08:45   16:19:00 32  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=33m17 (30m..5h)/10It HBLen=409600
26,14  4:23:00   13:58:37  1  2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 lt=76m (123m)/10It 16blocks hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)
26,14  2:49:53   10:00:26  2  2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 lt=50m ( 91m)/10It 2blocks hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat) io.r=50MB/s
26,14  1:32:16   6:53:23   4  2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 lt=26m39 ( 72m)/10It 4blocks hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)
26,14  32:06     1:42:33   8  4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=18m32 n2=25:18 (10m24s/10It) SLES10-64bit 32G-RAM
26,14  32:23     2:15:03 a2 8 4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=18m32 n2=25:52 (21m26s/10It) SLES10-64bit 32G-RAM
26,14  4:56:57   11:36:23  1  Xeon-3GHz-12GB-v2.24-gcc-4.1.1 64bit 4x4 lt=1h28m ( 75m)/10It model4 stepping10
26,14  3:03:22   8:58:02   2  Xeon-3GHz-12GB-v2.24-gcc-4.1.1 64bit 2x2 lt=1h06m ( 70m)/10It model4 stepping10
---------------------------- mem=7e9 hnz=17e9 E0= -24.52538640
25,15  6:21:08   22:13:42  4  ES45-Alpha-1GHz-CC-6.5 -fast -lz v2.17 lt=1h32m
25,15  3:03:41   15:54:51  8  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=1h24m (1h18m..1h26m)/5It HBLen=409600
24,16  4:58:56   25:21:48  8  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=2h08m (2h16m)/5It HBLen=409600
24,16  10:17:31  : :       4  ES45-Alpha-1GHz-CC-6.5 -fast -lz v2.17 lt=02:41:03 99%stored (+4h i40ca54h)
24,16  8:31:10   31:49:42  8  AltixIA64-1500MHz-gcc-3.3 v2.23 -O2 lt=3h14m (4h55m)/10It hxy_size 163840/32768=5
23,17  17:19:51  51:02:31  4  ES45-Alpha-1GHz-CC-6.5 -fast -lz v2.18 lt=04:11:14 latency=29h30m/40*63GB=42ns cat=2229s(28MB/s) zcat=5906s(11MB/s)
The next figure shows the computing time for different older program versions and computers (I update it as soon as I can). The computing time depends nearly linearly on the matrix size n1 (the time is proportional to n1^1.07; n1 is named n in the figure).
Memory usage depends on the matrix dimension n1. For the N=40 sample, two double vectors and one 5-byte vector are stored in memory, so we need n1*21 bytes, where n1 is approximately (N!/(nu!*nd!))/(4N). Disk usage is mainly the number of nonzero matrix elements hnz times 5 bytes (the disk size for tmp_l1.dat is 5*n1 and is not included here). The number of nonzero matrix elements hnz depends on n1 as hnz=11.5(10)*n1^1.064(4), which was found empirically. Here are some examples (a small estimator sketch follows the table):
nu,nd   n1      memory  hnz     disk   (zip)   (n1*21=memory, hnz*5=disk)
-----+---------------+----------------------
34,6    24e3    432kB   526e3   2.6MB  1.3MB
32,8    482e3   11MB    13e6    66MB   34MB
30,10   5.3e6   113MB   168e6   840MB  444MB   small speed test
28,12   35e6    735MB   1.2e9   6GB    3.6GB   big speed test
27,13   75e6    1.4GB   2.8e9   14GB           # n1=75214468
26,14   145e6   2.6GB   5.5e9   28GB
25,15   251e6   5.3GB   9.9e9   50GB
24,16   393e6   8.3GB   15.8e9  79GB
23,17   555e6   11.7GB  23e9    115GB  63GB
22,18   708e6   14.9GB  ...     ...
20,20   431e6   7.8GB   18e9    90GB
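The estimates above can be reproduced with a small C sketch following the formulas in the text (this is my own helper, not part of spinpack, and it simply prints all nud splits from 34,6 down to 20,20):

#include <stdio.h>
#include <math.h>

/* Rough resource estimates for the N=40 speed test, following the text:
 * n1 ~ (N!/(nu!*nd!))/(4N), memory ~ 21*n1 bytes,
 * hnz ~ 11.5*n1^1.064 (empirical fit), disk ~ 5*hnz bytes. */
int main(void) {
  const int N = 40;
  for (int nu = 34; nu >= 20; nu--) {
    int nd = N - nu;
    double binom = 1.0;                         /* N!/(nu!*nd!)        */
    for (int k = 1; k <= nd; k++) binom *= (double)(nu + k) / k;
    double n1  = binom / (4.0 * N);             /* matrix dimension    */
    double mem = 21.0 * n1;                     /* two doubles + 5B    */
    double hnz = 11.5 * pow(n1, 1.064);         /* nonzero elements    */
    double dsk = 5.0 * hnz;                     /* 5 bytes per element */
    printf("%2d,%-2d  n1=%10.3e  mem=%8.1fMB  hnz=%10.3e  disk=%7.2fGB\n",
           nu, nd, n1, mem / 1e6, hnz, dsk / 1e9);
  }
  return 0;
}

For nud=30,10 this gives n1=5.3e6, 113MB of memory and about 840MB of disk, matching the small speed test row of the table.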
A typical CPU load for an N=40 site system looks like this:
The data are generated using the following tiny script:
#!/bin/sh
# poll the job every 30s until it exits; filter out the ps header lines
while ps -o pid,pcpu,time,etime,cpu,user,args -p 115877
do sleep 30
done | grep -v CPU
115877 is the PID of the process; you have to replace it by the PID of your own run.
Alternatively, you can activate such a monitoring script via daten.i (edit it).
The machine was used by 5 users, therefore the peak load is only
about 12 CPUs. 735MB of memory and 6GB of disk space (or cache) were used.
You can see the initialization (20min),
the matrix generation (57min) and the first 4 iterations (4x8min).
The matrix generation depends mostly on CPU power.
The iteration time depends mainly on the disk speed
(try: time cat exe/tmp/ht* >/dev/null
) and on the
speed of random memory access. For example, a GS1280-1GHz needs a
disk bandwidth of 60MB/s per CPU to avoid a bottleneck.
Reading 5GB in 8min means a sequential data rate of about 12MB/s, which
is no problem for disks or memory cache. Randomly reading a 280MB
vector in 8min means 600kB/s and should also be no problem for the
machine.
You can improve
disk speed by using striped disks or files (AdvFS) and by putting every
H-block on a different disk. The maximum number
of threads was limited to 16, but this can be changed (see src/config.h).
During the iterations the multi-processor scaling is rather bad on most machines -- why? I guess this is caused by the random read access to the vector a (see the picture below). I thought a shared-memory computer should not have such scaling problems here, but probably I am wrong. I will try to solve this problem in the future.
The figure shows the dataflow during the iterations for 2 CPUs.
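To make the random access explicit, here is a schematic C sketch of such an iteration step (my own simplification in CSR style, not the actual spinpack kernel; the struct layout and names are assumptions): the stored nonzero elements stream sequentially from the ht* files, but each element hits the vector a at an essentially unpredictable index.

/* Schematic sketch of one iteration step over one H-block:
 * the nonzeros (about 5 bytes each: value plus column index)
 * are read sequentially, but a[col] is a random read. */
typedef struct { float val; unsigned col; } nonzero;

void iterate_block(double *y, const double *a, const nonzero *h,
                   const long *rowptr, long row0, long rows) {
  for (long i = 0; i < rows; i++)                 /* rows of this H-block */
    for (long k = rowptr[i]; k < rowptr[i + 1]; k++)
      y[row0 + i] += h[k].val * a[h[k].col];      /* random read of a[]   */
}

Streaming h[] is limited by disk bandwidth, while the scattered reads of a[] are limited by memory latency; with several CPUs all hammering a[] at random positions, this is a plausible reason for the poor scaling.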
Version 2.24 was very slow in calculating the expectation value of <SiSj>. A gprof analysis showed that most of the time was spent finding the index of a configuration in the configuration table (function b2i in hilbert.c). This was the reason to take a closer look at the speed of memory access. I wrote memspeed.c, which simply reads a large number of integers at different strides. Reading integers one after another (sequential read) gives the best results, on the order of 1-2GB/s. The worst case, where integers are read at a distance of about 16kB, gives a performance of about 10-40MB/s, which is a factor of 100 smaller. This corresponds to random access to the RAM. The OpSiSj function does around n1*(log2(n1)+1) such accesses to memory for every SiSj value (a binary search over a table of n1 entries needs about log2(n1)+1 probes). I think it should be possible to reduce the randomness of the index computation by using the Ising energy of each configuration to divide the configurations into blocks. (Sep2006)
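For illustration, here is a minimal stride-read benchmark in the spirit of memspeed.c (a sketch under my own assumptions, not the original file); at a stride of 4096 ints (16kB) it shows the cache-missing worst case described above:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sum a 256MB int array at increasing strides and report MB/s.
 * Each pass touches every element exactly once; large strides
 * defeat the caches and approximate random RAM access. */
int main(void) {
  const size_t n = (size_t)64 << 20;          /* 64 Mi ints = 256 MB */
  int *buf = malloc(n * sizeof *buf);
  if (!buf) return 1;
  for (size_t i = 0; i < n; i++) buf[i] = (int)i;
  long sum = 0;
  for (size_t stride = 1; stride <= 4096; stride *= 8) {
    clock_t t0 = clock();
    for (size_t off = 0; off < stride; off++)   /* cover all elements */
      for (size_t i = off; i < n; i += stride)
        sum += buf[i];
    double s = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("stride %5zu ints (%8zu B): %8.1f MB/s\n",
           stride, stride * sizeof *buf, n * sizeof *buf / s / 1e6);
  }
  free(buf);
  return (int)(sum & 1);                        /* keep sum live */
}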