Here you can find information on how to measure the speed of your machine running spinpack.
First download spinpack-2.27.tgz (or a newer version). Uncompress and untar the file, configure the Makefile, compile the sources and run the executable. Here is an example:
gunzip -c spinpack-2.27.tgz | tar -xf -
cd spinpack
./configure --mpt # configure for pthreads
make speed_test # compile spinpack, create data files
cd exe
# choose problem size (DISK should be fully cached):
# nud=30,10 - MEM=113MB, DISK=840MB, ca.10-45m/run
# nud=28,12 - MEM=735MB, DISK=6GB, ca.3-5h/run (preferred)
nedit daten.i # only for big system
for x in 1 2 4 8 16; do ./spin -t$x; done 2>&1 | tee -a o_speed
Please send me the output files together with the characteristic data
of your computer for comparison. Please also add the output of
grep FLAGS= Makefile and cpu.log if available.
The next table gives an overview of the computation time for an N=40 site system (used for the speed test), started in 2003. The first column gives the numbers of up- and down-spins (nu,nd) as set in daten.i. These numbers define the size of the problem, n1 = nn!/(nu!*nd!*4nn), with nu+nd=nn. The other columns list the time needed for writing the matrix (SH) and for the first 40 iterations (i=40) (old version), or the extrapolated time for the first 100 iterations (version 2.26) as shown in the output. For the new version, ns is the time for filtering the symmetric configurations and 12sisj the time for computing the correlations.
new tests: (verbose=1, SBase=1, S1sym=1(?), dbl)
# Altix330IA64-1500MHz 8nodes*2IA64.m2.r1
nud CPUs ns SH i100 12sij machine time=mm:ss(+-ss) dflt: v2.26 -O2
------+--++--------------------------------------------------
32,8 1 39 1:36 37 16:43 Altix330 v2.26 g++4.1 -O2 noHUP?=slow
32,8 2 24 50 22 8:37 v2.26+h_get-inlined.i100
32,8 4 15 26 17 4:41 v2.26+h_get-inlined.i100
32,8 8 15 13 3:52 2:13 tmp=/dev/shm or DSK cpu 0,1,6,7,10,13,14,15
32,8 8 14 14 1:32 2:23 ?
32,8 8 14 12 4:12 2:13 ?
32,8 16 14 12 4:32 2:01 ?
30,10 1 6.65 19.25 12.90 - inlined_h_get [min]
30,10 2 4.18 9.90 10.28 - inlined_h_get
30,10 4 2.55 4.92 6.33 - inlined_h_get
30,10 8 2.48 2.68 4.42 - inlined_h_get
28,12 1 42:13 131:21 108:02 1429:57 inlined_h_get (157m -> 108m)
28,12 2 26:15 70:32 89:55 736:00 h_get_inlined (292m -> 90m)
28,12 4 16:33 36:05 79:55 372:54 h_get_inlined (346m -> 80m)
28,12 8 16:07 19:08 38:07 190:28 h_get_inlined (412m -> 38m)
# v2.23 bad scaling i100+SH (h_get not inlined), but SH-singlespeed
# 28,12 16 18 40.2 61.0 - Altix330-gcc-3.3-O2 v2.23 hxy_size 163840/32768=5 14+2 CPUs on 1 RAM, numalink=bottleneck
# 28,12 8 18 34.6 145.0 - Altix330-gcc-3.3-O2 v2.23 hxy_size 163840/32768=5
# 28,12 4 18 44.4 218.3 - Altix330-gcc-3.3-O2 v2.23 hxy_size 163840/32768=5
# 28,12 2 28 62.0 294.8 - Altix330-gcc-3.3-O2 v2.23 hxy_size 163840/32768=5
# 28,12 1 46 94.4 146.2 - Altix330-gcc-3.3-O2 v2.23 hxy_size 163840/32768=5
20,20 15 366. 455. 11550. - v2.26-flt J2=0.59 q_10_10_0_0- n2.char=0/50m
nud CPUs ns SH i100 12sij machine time=min dflt: v2.26 -O2
------+--++------------------------------------ default=160sym
32,8 1 1.48 4.82 2.17 45.72 PentiumM-600MHz-L2=1MB-1GB-gcc-4.1.2-v2.27-Oct07 FSB=4*100MHz SSE2 fam=6 model=9 stepping=5 mem=2075MB
nud CPUs ns SH i100 12sij machine time=min dflt: v2.26 -O2
------+--++------------------------------------ default=160sym
32,8* 1 1.10 51.53 - - 0sym Pentium4-Northwood-1.3GHz-1GB-gcc-4.1-v2.26 Northwood FSB=4*100MHz L2=512kB SSE2 HT fam=15 model=2 stepping=9 mem=233MHz
32,8 1 0.83 2.58 1.48 - Pentium4-Northwood-1.3GHz-1GB-gcc-4.1-v2.26 Northwood FSB=4*100MHz L2=512kB SSE2 HT fam=15 model=2 stepping=9 mem=233MHz
32,8 2 0.70 1.93 1.33 - Pentium4-Northwood-1.3GHz-1GB-gcc-4.1-v2.26 Northwood FSB=4*100MHz L2=512kB SSE2 HT fam=15 model=2 stepping=9 mem=233MHz
30,10 1 8.57 32.02 41.78 - Pentium4-Northwood-1.3GHz-1GB-gcc-4.1-v2.26 Northwood FSB=4*100MHz L2=512kB SSE2 HT fam=15 model=2 stepping=9 mem=233MHz
30,10 2 7.27 23.68 37.00 - Pentium4-Northwood-1.3GHz-1GB-gcc-4.1-v2.26 Northwood FSB=4*100MHz L2=512kB SSE2 HT fam=15 model=2 stepping=9 mem=233MHz
32,8 1 0.40 1.27 1.00 - Pentium4-Northwood-2.6GHz-1GB-gcc-4.1-v2.26 Northwood FSB=4*200MHz L2=512kB SSE2 HT fam=15 model=2 stepping=9
32,8 2 0.35 1.07 0.82 - Hyper-Threading(2x) sse2, thermal throttling possible (passive cooling)
32,8 4 0.35 0.98 0.82 10.12 - try valgrind cachegrind,callgrind,gprof, gcov, oprof_start
32,8 1 0.45 1.17 0.78 - -mtune=pentium4 -msse2
32,8 2 0.40 0.87 0.68 - -mtune=pentium4 -msse2
32,8 1 0.40 8.80 0.78 - -mtune=pentium4 -msse2 new b_smallest.v2.17a
0.78 0.60 t_sites=int,float 10% speedup
32,8 2 23.88 69.87 16.20 - callgrind --trace-children=yes callgrind_annotate Ir-recorded: sum 249e9 b_smallest=114e9+15e9 b_ifsmallest3=62e9
30,10 2 3.67 11.95 26.43 - ToDo: __builtin_prefetch
nud CPUs ns SH i100 12sij machine time=min dflt: v2.27 -O2
------+--++-------------------------------- n1=5.3e6 113MB+840MB E=-13.57780124
30,10 2 1.68 5.72 4.27 57.78 Xeon-E5345-2.33Ghz-2*4MB-gcc-4.1.1-v2.27 32GB thor
30,10 4 1.32 2.92 2.97 29.15 Clovertown 2.33GHz FSB1333 80Watt 851$
* 30,10 8 1.07 1.47 2.67 14.72 chipset i5000P totBW=21GB/s
30,10 8 1.02 1.47 2.25 14.68 flt
28,12 1 14.90 78.78 58.78 808.77 Xeon-E5345-2.33Ghz-2*4MB-gcc-4.1.1-v2.27 32GB thor k2.6.18
28,12 2 10.35 40.88 35.10 411.73
28,12 4 7.92 20.90 25.35 208.82
* 28,12 8 6.15 10.58 22.48 105.62
28,12 8 6.13 10.60 19.48 105.33 flt
nud CPUs ns SH i100 12sij machine time=min dflt: v2.25 -O2
------+--++------------------------------------ E=-8.22686823 SMag= 0.14939304 ----------------------
28,12 1 19 54.32 96.0 - 2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)
28,12 2 13 32.70 83.16 - 2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)
28,12 4 7 19.67 96.33 - 2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat) memory distributed badly? k2.6.9 (2.6.18 is faster)
28,12 4 9.22 24.22 30.6 - 2CoreOpteron-2194MHz-gcc-4.1 v2.27 k2.6.18
28,12 1 30.00 47.68 70.50 - 4xDualOpteron885-2600MHz-gcc-4.0.3 64bit n2=32:15 (7m03s/10It) kanotix2005-04-64bit 16G-RAM tmpfs DL585
28,12 2 22.00 30.32 66.00 - 4xDualOpteron885-2600MHz-gcc-4.0.3 64bit n2=24:03 kanotix2005-04-64bit 16G-RAM tmpfs
28,12 4 16.92 16.48 35.00 ns? 4xDualOpteron885-2600MHz-gcc-4.0.3 64bit n2=13:27 kanotix2005-04-64bit 16G-RAM tmpfs
28,12 8 8.48 9.02 21.60 - 4xDualOpteron885-2600MHz-gcc-4.0.3 64bit n2=10:18 kanotix2005-04-64bit 16G-RAM tmpfs
nud CPUs ns SH i100 12sij [min] dflt: v2.26 -O2
------+--+------------------------------
30,10 1 4.78 30.72 14.17 - V490-32GB 4x2-sparcv9-ultraIV+-10*150MHz-2MB/32MB(L3)-gcc4.1 L1=64K+64K
30,10 2 3.60 15.53 8.15 - J2=0? gcc-4.1 -O2
30,10 4 1.82 7.82 4.57 -
30,10 8 1.48 3.97 3.18 -
30,10 16 1.50 3.95 3.38 -
30,10 32 1.43 3.90 3.57 -
28,12 1 29.98 205.30 143.72 -
28,12 2 22.17 104.67 82.75 -
28,12 4 11.10 53.20 46.50 -
28,12 8 9.55 26.80 31.77 -
28,12 16 9.63 26.70 33.68 -
28,12 32 9.33 26.82 35.52 -
# SunFire-T200-8GB 8x4-ultraSparc-T1-5*200MHz-3MB?-gcc-4.1-flt L1=8K+16K
# ToDo: better compiler[options]?
30,10 1 18.05 64.25 72.63 663.07
30,10 2 11.62 32.48 39.60 341.27 n2.t=1.33
30,10 4 6.68 16.35 21.63 174.22
30,10 8 6.52 8.33 13.87 90.83 n2.t=1.33 4CPUs-from-70s 1cpu-from-200s
30,10 16 6.53 5.57 10.50 59.42
30,10 32 6.53 4.68 9.08 48.85 sh=96% cat_htmp*=0.02s
28,12 8 42.40 59.90 129.18 625.87
28,12 32 42.37 32.02 103.48 - ns=(+1m=25%,+19m=88m) dd=7GB/14s
nud CPUs ns SH i100 12sij machine time=min dflt: v2.26 -O2
------+--++---------------------------------
30,10 1 6:24 32:29 38:02 - GS160-24GB 4x4-alpha-731MHz v2.26 cxx-6.5
30,10 2 4:38 16:47 25:30 -
30,10 4 3.40 8.73 21.02 - 2.27 empty machine Apr07
30,10 8 2.68 4.55 13:17 - 2.27 empty machine Apr07
30,10 16 2.08 2.30 7.58 - 2.27 empty machine Apr07
28,12 4 20.50 65.73 182.93 - 2.27 empty machine Apr07
28,12 8 16.07 34.17 112.45 - 2.27 empty machine Apr07
28,12 16 12.45 18.28 65.58 - 2.27 empty machine Apr07
*32,8 8 0.20 21.20 83.40 113.58 # 820sisj 1sym! 2.27 Apr07 n1=77e6
nud CPUs ns SH i100 12sij machine time=min dflt: v2.27 -fast
-------+--++----------------------------------
30,10 1 3.82 20.03 19.27 - # GS1280-128GB-striped 32-alpha-1150MHz v2.27 cxx-6.5
30,10 2 2.78 10.33 11.52 - # L1=2*64KB L2=1.75MB mem=4GB/CPU=82..250ns
30,10 4 2.12 5.23 8.00 - # 2.27-dbl 0user (32cpus/rad)
30,10 8 1.68 2.67 5.18 - # 2.27-dbl 0user (32cpus/rad) + vmstat.log
30,10 16 1.30 1.35 2.83 - # 2.27-dbl 0user (32cpus/rad)
30,10 32 0.93 0.78 1.72 - # 2.27-dbl 0user (32cpus/rad)
# for 32cpus/RAD memory is striped, no memory-cpu locality (see i100)
30,10 1 3.80 19.80 11.70 - # 2.27-dbl 0user (1cpus/rad)
30,10 2 2.80 10.30 7.92 - # 2.27-dbl 0user (1cpus/rad)
30,10 4 2.12 5.23 5.72 - # 2.27-dbl 0user (1cpus/rad)
30,10 8 1.67 2.67 4.33 - # 2.27-dbl 0user (1cpus/rad) + vmstat1_1cpu1rad.log i100b=4.38
30,10 16 1.28 1.37 2.85 - # 2.27-dbl 0user (1cpus/rad) i100b=2.77
30,10 32 0.92 0.68 1.75 - # 2.27-dbl 0user (1cpus/rad)
# 30,10 32 1.05 1.77 2.90 - # 2.25-dbl 0user (1cpus/rad)
# 30,10 8 1.08 2.75 7.70 - # 2.25-dbl 0user (1cpus/rad)
# 32,8 32 0.10 4.37 9.88 - # 2.25-dbl 0user (1cpus/rad) 1sym?
#
28,12 8 10.08 18.87 37.33 - # 2.27-dbl 0user (8cpus/rad) + vmstat60-R 25%RAD2+75%RAD3
#
28,12 1 23.65 139.13 167.15 - # 2.27-dbl 0user (32cpus/rad) ps=4h33m05s real=275m 31user!
28,12 2 17.02 72.33 91.22 - # 2.27-dbl 0user (32cpus/rad) ps=4h30m47s real=152m
28,12 4 12.87 36.83 65.67 - # 2.27-dbl 0user (32cpus/rad) ps=5h24m05s real=102m
28,12 8 10.08 18.80 39.67 - # 2.27-dbl 0user (32cpus/rad)
28,12 16 7.77 9.77 23.00 - # 2.27-dbl 0user (32cpus/rad)
28,12 32 5.73 4.97 13.17 - # 2.27-dbl 0user (32cpus/rad) ps=6h03m37s
#
28,12 1 23.58 136.28 89.52 - # 2.27-dbl 0user (1cpus/rad)
28,12 2 17.00 71.70 59.20 - # 2.27-dbl 0user (1cpus/rad)
28,12 4 12.83 36.48 43.47 - # 2.27-dbl 0user (1cpus/rad)
28,12 8 10.07 18.63 31.17 - # 2.27-dbl 0user (1cpus/rad) + vmstat1-R(no influence to speed)
28,12 16 7.78 9.73 23.43 - # 2.27-dbl 0user (1cpus/rad)
28,12 32 5.70 4.88 13.02 - # 2.27-dbl 0user (1cpus/rad)
old tests: (i40 is the summed time for ns+n2+SH+40 iterations, a badly chosen measure)
nud CPUs SH-time i=40-time machine time=[hh:]mm:ss(+-ss) dflt: v2.15 -O2
-------+--+---------+---------+------------------------------------------
30,10 1 23m 76m Pentium-1.7GHz-gcc v2.15 -lgz
30,10 1 21m 50m Pentium-1.7GHz-gcc v2.15
30,10 1 12:58 44:01 AthlonXP-1.7GHz-gcc-3.3 v2.21 -O4 -march=athlon-xp -m3dnow (lt=6m44s hda=48MB/s, cat 40x800MB=15m, 48%idle) 3m/5It i65:r=60m,u=34m,s=4m (also pthread)
30,10 1 15:15 38:29 AthlonXP-1.7GHz-gcc-3.2 v2.17 -O2 -march=athlon-xp -m3dnow (lt=7m29s hda=55MB/s, cat 40x800MB=15m, 40%idle)
30,10 1 15:34 45:28 AthlonXP-1.7GHz-gcc-3.2 v2.18 -O2 -march=athlon-xp -m3dnow -lgz (lt=7m29s hda=55MB/s, zcat 40x450MB=13m, 1%idle)
30,10 1 11:51 26:31 Pentium4-2.5GHz-gcc-3.2 v2.17 -O2 -march=i686 -msse B_NL2=4
30,10 1 15:59 40:09 Xeon-2GHz-gcc-3.2 v2.17 -O2 -march=i686 -msse -g -pg
30,10 1 14:26 34:08 Xeon-2GHz-gcc-3.2 v2.17 -O2 -march=i686 -msse
30,10 2 8:16 25:09 Xeon-2GHz-gcc-3.2 v2.17 -O2 -march=i686 -msse (slow nfs-disk)
30,10 1 14:40 32:26 Xeon-2GHz-gcc-3.2 v2.18 -O2 -march=i686 -msse 4x4 lt=10m34
30,10 1 8:44 16:59 Xeon-3GHz-12GB-v2.24-gcc-4.1.1 -O2 64bit 4x4 lt=3m31 model4 stepping10 n2=3m55 65s/10It
30,10 1 9:57 19:08 Xeon-3GHz- 2GB-v2.25-gcc-4.0.2 -O2 32bit 4x4 lt=4m34 model4 stepping3 n2=4m58
30,10 1 8:52 17:53 Xeon-3GHz- 2GB-v2.24-gcc-4.1.1 -O2 32bit 4x4 lt=4m25 model4 stepping3 n2=4m50
30,10 1 8:27 16:48 Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit 4x4 lt=3m43 model6 stepping4 n2=4m10 62s/10It 2*DualCore*2HT=8vCPUs bellamy
30,10 4 3:19 11:36 Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit 4x4 lt=2m10 model6 stepping4 n2=2m37 76s/10It 2*DualCore*2HT=8vCPUs bellamy
30,10 8 1:57 7:49 Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit 4x4 lt=1m07 model6 stepping4 n2=1m35 55s/10It 2*DualCore*2HT=8vCPUs bellamy
30,10 1 6:56 15:15 4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=4m18 n2=04:37 (116s/10It) Knoppix-3.8-32bit
30,10 2 4:04 11:12 4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=2m40 n2=02:58 (63s/10It) cpu5+7
30,10 4 2:20 9:05 4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=1m39 n2=01:57 (72s/10It) cpu3-6 (2*HT involved?)
30,10 4 2:47 6:33 4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=1m23 n2=01:40 (52s/10It) cpu3-6 (2*HT involved?)
30,10 8 1:15 4:42 4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=1m37 n2=01:55 (24s/10It)
30,10 2*8 1:03 4:10 4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=1m37 n2=01:55 (17s/10It) (4CPUs * 2Cores)
30,10 4*8 1:00 4:33 4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=1m38 n2=01:56 (17s/10It) (4CPUs * 2Cores) ulimit -n 4096
30,10 2 2:43 8:57 4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=1m51 n2=02:10 (46s/10It) kanotix2005-04-64bit 16G-RAM tmpfs=...MB/s
30,10 4 2:02 5:34 4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=1m05 n2=01:23 (32s/10It) kanotix2005-04-64bit 16G-RAM tmpfs=...MB/s
30,10 8 1:08 3:16 4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=0m42 n2=01:00 (17s/10It) kanotix2005-04-64bit 16G-RAM tmpfs
30,10 1 19:31 63:28 SunFire-880-SparcIII-750MHz-CC-5.3 v2.17 -fast (sun4u) lt=9:50 16 threads 2048s/40*168e6=0.30us
30,10 1 28:10 55:32 SunFire-880-SparcIII-750MHz-gcc-4.1 v2.25 -O3 -mcpu=ultrasparc3 -mtune= 64v9 lt=8:56 n2=9m50 4m22/10It 4 threads
30,10 4 9:12 25:51 SunFire-880-SparcIII-750MHz-gcc-4.1 v2.25 -O3 -mcpu=ultrasparc3 -mtune= 64v9 lt=3:13 n2=4m06 3m08/10It 4 threads
30,10 4 7:52 21:40 SunFire-880-SparcIII-750MHz-CC-5.3 v2.19 -fast (sun4u) lt=6:11 4 threads (55s/5It) vbuf=16M
30,10 4 7:24 26:45 SunFire-880-SparcIII-750MHz-CC-5.3 v2.17 -fast (sun4u) lt=4:11 4 threads 4*910s/40*168e6=0.54us
30,10 4 7:12 26:28 SunFire-880-SparcIII-750MHz-CC-5.3 v2.17 -fast -O4 (sun4u) lt=4:05 4 threads 4*911s/40*168e6=0.54us
30,10 8 3:44 16:58 SunFire-880-SparcIII-750MHz-CC-5.3 v2.17 -fast (sun4u) lt=4:23 16 threads 8*532s/40*168e6=0.63us
30,10 1 13:42 25:56 SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O3 -mcpu=ultrasparc3 4x4 (64bit) v2.25+ lt=4m26 n2=4m53 110s/10It (4DualCore) 32GB
30,10 1 9:04 20:26 SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=4m50 n2=5m20 81s/10It (4DualCore)
30,10 1 9:01 19:30 SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra3 -xarch=v9 (64bit) v2.25+ lt=4m37 n2=5m06 80s/10It (4DualCore)
30,10 2 5:26 13:49 SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=3m16 n2=3m46 69s/10It (4DualCore)
30,10 4 3:07 8:16 SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=1m49 n2=2m21 42s/10It (4DualCore)
30,10 8 1:46 7:43 SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=2m55 n2=3m59 29s/10It (4DualCore)
30,10 2*8 1:23 5:05 SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=1m34 n2=2m04 i40-n2-SH=98s/40It=245s/100It
30,10 2*8 2:00 5:39 SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O3 -mcpu=ultrasparc3 (64bit) v2.25+ lt=1m29 n2=1m55 26s/10It (4DualCore) 32GB
30,10 2*8 2:04 5:40 SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O2 -mcpu=ultrasparc3 (64bit) v2.25+ lt=1m25 n2=1m53 26s/10It (4DualCore) 32GB
30,10 1 12:13 22:15 ES45-Alpha-1250MHz-cxx-6.3 -fast v2.18 lt=4:14 2x2 (ev56)
30,10 1 19:00 48:14 GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast -g3 -pg (42% geth_block, 27% b_smallest, 16% ifsmallest3)
30,10 1 21:12 50:37 GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast
30,10 1 19:36 59:44 GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast 16 threads
30,10 2 12:15 36:16 GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast 16 threads
30,10 16 3:50 18:23 GS160-Alpha-731MHz-cxx-6.3 v2.15 ( 64 threads)
30,10 16 3:33 15:19 GS160-Alpha-731MHz-cxx-6.3 v2.15 (128 threads) simulates async read
30,10 1 21:20 43:55 GS160-Alpha-731MHz-cxx-6.3 v2.18 -fast lt=06:50 4m/10It
30,10 16 5:35 12:41 GS160-Alpha-731MHz-cxx-6.3 v2.18 -fast lt=02:51 1m/10It (640%CPU)
30,10 1 12:55 23:15 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=04:33 1m26s/10It = 10*840MB/1m26s=98MB/s 10*hnz/86s/1=20e6eps/cpu 50ns (max.80ns)
30,10 8 2:19 5:59 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=02:11 22s/10It (14%user+5%sys+81%idle (0%dsk) von 32CPUs) 10*hnz/22s/8=10e6eps/cpu
30,10 16 1:38 4:11 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=01:47 12s/10It
30,10 32 1:46 3:48 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=01:25 9s/10It
30,10 1 41:50 95:44 O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA
30,10 2 26:52 74:28 O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA (HBLen=1024 about same)
30,10 8 16:50 73:51 O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA
30,10 4 19:08 59:29 O2100-IP27-250MHz-CC-7.30 v2.18 -64 -Ofast -IPA 4x4 hnz+15% lt=00:13:00
-------- --------------------
28,12 1 10:40:39 20:14:07 MIPS--IP25-194MHz-CC-7.21 v2.18 -64 -Ofast -IPA 1x1 lt=2h51m 50m/5It (dd_301720*20k=354s dd*5=30m /tmp1 cat=6GB/352s=17MB/s)
28,12 2 6:04:55 16:03:42 MIPS--IP25-194MHz-CC-7.21 v2.18 -64 -Ofast -IPA 2x2 lt=3h00m 52m/5It (ToDo: check time-diffs It0..It20?)
28,12 4 3:14:22 12:28:31 MIPS--IP25-194MHz-CC-7.30 v2.19 -64 -Ofast -IPA 4x4 lt=1h25m (59m)/5It
28,12 1 5h 10h GS160-Alpha-731MHz-cxx-6.3 v2.15
28,12 16 57:39 5:29:57 GS160-Alpha-731MHz-cxx-6.3 v2.15 (16 threads)
28,12 16 59:22 2:51:54 GS160-Alpha-731MHz-cxx-6.3 v2.15 (128 threads) .
28,12 1 3:03:00 10:04:03 GS160-Alpha-731MHz-cxx-6.3 v2.17pre -fast
28,12 3 1:13:27 5:45:12 GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast -pthread 16
28,12 4 1:49:31 4:29:09 GS160-Alpha-731MHz-cxx-6.3 v2.18 -fast lt=25m home 10It/32..77m 7.5GB/635s=12MB/s(392s,254s,81s) tmp3=160s,40s,33s tmp3_parallel=166s,138s
28,12 8 52:57 2:17:00 GS160-Alpha-731MHz-cxx-6.5 v2.19 -fast lt=24m 13m30s/10It
28,12 1 2:00:56 4:08:31 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=53:23 17m17s/10It = 10*6GB/17m17s=58MB/s (3GB_local+3GB_far) 12e6eps/cpu
28,12 2 1:12:02 2:18:26 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=20:39 11m08s/10It = 10*6GB/11m08s=90MB/s 9e6eps/cpu
28,12 4 40:36 1:21:40 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=16:06 6m13s/10It
28,12 8 23:20 50:20 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=13:08 3m26s/10It
28,12 8 21:35 53:10 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=13:13 (2m04s..4m41s)/10It HBlen=409600 10*6GB/2m=492MB/s hnz*10/2m/8=10e6eps/cpu
28,12 16 14:01 32:17 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=10:46 1m51s/10It
28,12 32 13:09 27:50 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=08:37 1m29s/10It
28,12 32 15:41 30:57 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=08:42 1m24s/10It 70%user+7%sys+23%idle(0%io) 1user 4e6eps/cpu
28,12 2 51:51 1:56:55 ES45-Alpha-1250MHz-cxx-6.5 v2.23 -fast lt=16m 11m34s/10It under load
28,12 1 3:19:39 7:02:48 SunFire-880-SparcIII-750MHz-CC-5.3 v2.18 -fast (sun4u) lt=1h05m 1 threads (19m40s/5It)
28,12 2 1:48:28 4:29:24 SunFire-880-SparcIII-750MHz-CC-5.3 v2.18 -fast (sun4u) lt=47:17 2 threads (14m08s/5It)
28,12 4 58:41 2:42:08 SunFire-880-SparcIII-750MHz-CC-5.3 v2.18 -fast (sun4u) lt=36:36 4 threads (8m/5It, 4cat=6GB/0.5s) (FLOAT: same, sh=59:16 i40=2:37:09 lt=38:48 7m17s/5It)
28,12 8 35:59 1:47:38 SunFire-880-SparcIII-750MHz-CC-5.3 v2.18 -fast (sun4u) lt=29:39 8 threads (5m/5It)
28,12 1 91:19 198:24 SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O3 -mcpu=ultrasparc3 (64bit) v2.25+ lt=27m18 n2=29m52 9m17s/10It (4DualCore) 32GB
28,12 1 63:21 150:33 SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=30m02 n2=32m57 13m32s/10It (4DualCore) 32GB
28,12 2 38:31 107:45 SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=20m02 n2=22m58 11m32s/10It (4DualCore)
28,12 4 22:09 62:11 SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=10m52 n2=13m58 6m30s/10It (4DualCore)
28,12 8 12:17 42:57 SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=10m05 n2=13m25 4m15s/10It (4DualCore) 8threads
28,12 2*8 10:14 39:00 SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=10m06 n2=13m02 3m55s/10It (4DualCore) 16threads
28,12 2*8 14:31 43:09 SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O3 -mcpu=ultrasparc3 (64bit) v2.25+ lt=9m26 n2=12m00 4m08s/10It (4DualCore) 32GB
28,12 2*8 14:44 43:48 SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O2 -mcpu=ultrasparc3 (64bit) v2.25+ lt=9m13 n2=12m00 4m15s/10It (4DualCore) 32GB
28,12 1 54:19 1:53:43 2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 lt=19m (9m36s)/10It hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)
28,12 2 32:42 1:20:55 2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 lt=13m (8m19s)/10It hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)
28,12 4 19:40 1:07:17 2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 lt=7m (9m38s)/10It hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat) memory distributed badly?
28,12 1 71:37 2:14:45 4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=25m52 n2=27:42 (8m36s/10It) Novel10-32bit 32G-RAM dsk=6MB/s
28,12 2 44:31 1:31:31 4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=16m25 n2=18:18 (6m59s/10It) Novel10-32bit 32G-RAM dsk=6MB/s
28,12 4 23:30 51:54 4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=10m06 n2=12:00 (4m09s/10It) Novel10-32bit 32G-RAM dsk=6MB/s
# Novel10-32bit: mount -t tmpfs -o size=30g /tmp1 /tmp1 # w=591MB/s
28,12 4 20:17 85:03 4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=9m05 n2=10:47 (13m33s/10It) knoppix-5.0-32bit 4of32G-RAM dsk=60MB/s
28,12 8 11:30 91:56 4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=8m53 n2=10:35 (16m54s/10It) knoppix-5.0-32bit 4of32G-RAM dsk=60MB/s
28,12 2*8 9:39 104:06 4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=8m56 n2=10:38 (20m29s/10It) knoppix-5.0-32bit 4of32G-RAM dsk=60MB/s
28,12 1 47:41 108:12 4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=30m n2=32:15 (7m03s/10It) kanotix2005-04-64bit 16G-RAM tmpfs
28,12 2 30:19 80:48 4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=22m n2=24:03 (6m36s/10It) kanotix2005-04-64bit 16G-RAM tmpfs
28,12 4 16:29 45:49 4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=16m55? n2=13:27 (3m30s/10It) kanotix2005-04-64bit 16G-RAM tmpfs
28,12 8 9:01 28:13 4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=8m29 n2=10:18 (2m13s/10It) kanotix2005-04-64bit 16G-RAM tmpfs
28,12 1 1:04:36 2:04:00 Xeon-3GHz-12GB-v2.24-gcc-4.1 64bit 4x4 lt=22m (8m36s)/10It model4 stepping10 n2=24m44 xen
28,12 2 37:42 1:28:24 Xeon-3GHz-12GB-v2.24-gcc-4.1 64bit 2x2 lt=16m (7m28s)/10It model4 stepping10 n2=18m18 xen
28,12 4 29:46 1:17:56 Xeon-3GHz-12GB-v2.24-gcc-4.1 64bit 4x4 lt=11m (8m51s)/10It model4 stepping10 n2=13m15 xen
28,12 1 1:13:28 2:42:16 Xeon-3GHz- 2GB-v2.25-gcc-4.0.2 32bit 4x4 lt=29m (13m46s)/10It model4 stepping3 n2=31m34
28,12 1 1:06:07 2:30:48 Xeon-3GHz- 2GB-v2.24-gcc-4.1.1 32bit 4x4 lt=28m (13m33s)/10It model4 stepping3 n2=30m16
28,12 8 13:42 49:10 Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit model6 stepping4 lt=8m43 n2=11m30 7m13/10It 2*DualCore*2HT=8vCPUs bellamy san=w142MB/s,r187MB/s
28,12 2*8 12:40 48:23 Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit model6 stepping4 lt=8m08 n2=10m54 6m01/10It 2*DualCore*2HT=8vCPUs bellamy san=w142MB/s,r187MB/s
-------- -------------------- 1.4GB + 14GB
27,13 4 141:06 384:27 SunFire-880-SparcIII-750MHz-CC-5.3 v2.19 -fast -xtarget=ultra -xarch=v9 -g -xipo -xO5 lt=73:15 (21m14s/5It)
27,13 8 57:59 138:59 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=28m (6m38s)/5It HBLen=409600
27,13 16 46:03 103:21 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=25:44 (3m50s..3m57s)/5It mfs-disk + vbuf=16MB + sah's
27,13 32 29:18 61:25 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=18:03 (1m43s..1h38m)/5It (2-stripe disk=150MB/s) 60%user+5%sys+35%idle(0%disk) 1user
-------- -------------------- 2.6GB + 28GB
26,14 4 3:30:24 12:09:12 ES45-Alpha-1GHz-CC-6.5 -fast -lz v2.17 lt=62m21
26,14 16 45:51 1:45:48 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=27m07 (4m00s...4m12s)/5It mfs-disk vbuf=16M + spike (optimization after linking)
26,14 16 1:31:49 3:31:01 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=48m34 (7m57s..10m01s)/5It mfs-disk vbuf=16M
26,14 1 4:23:00 13:58:37 2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 lt=76m (123m)/10It 16blocks hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)
26,14 2 2:49:53 10:00:26 2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 lt=50m ( 91m)/10It 2blocks hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat) io.r=50MB/s
26,14 4 1:32:16 6:53:23 2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2 lt=26m39 ( 72m)/10It 4blocks hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)
26,14 8 32:06 1:42:33 4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=18m32 n2=25:18 (10m24s/10It) SLES10-64bit 32G-RAM
26,14 1 4:56:57 11:36:23 Xeon-3GHz-12GB-v2.24-gcc-4.1.1 64bit 4x4 lt=1h28m ( 75m)/10It model4 stepping10
26,14 2 3:03:22 8:58:02 Xeon-3GHz-12GB-v2.24-gcc-4.1.1 64bit 2x2 lt=1h06m ( 70m)/10It model4 stepping10
-------- --------------------
25,15 4 6:21:08 22:13:42 ES45-Alpha-1GHz-CC-6.5 -fast -lz v2.17 lt=1h32m
25,15 8 3:03:41 15:54:51 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=1h24m (1h18m..1h26m)/5It HBLen=409600
24,16 8 4:58:56 25:21:48 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=2h08m (2h16m)/5It HBLen=409600
24,16 8 8:31:10 31:49:42 Altix330IA64-1500MHz-gcc-3.3 v2.23 -O2 lt=3h14m (4h55m)/10It hxy_size 163840/32768=5
23,17 4 17:19:51 51:02:31 ES45-Alpha-1GHz-CC-6.5 -fast -lz v2.18 lt=04:11:14 latency=29h30m/40*63GB=42ns cat=2229s(28MB/s) zcat=5906s(11MB/s)
The next figure shows the computing time for different older program versions and computers (I update it as soon as I can). The computing time depends nearly linearly on the matrix size n1 (the time is proportional to n1^1.07; n1 is named n in the figure).
Memory usage depends on the matrix dimension n1. For the N=40 sample, two double vectors and one 5-byte vector are stored in memory, so we need n1*21 bytes, where n1 is approximately (N!/(nu!*nd!))/(4N). Disk usage is mainly the number of nonzero matrix elements hnz times 5 bytes (the disk space for tmp_l1.dat is 5*n1 and is not included here). The number of nonzero matrix elements hnz depends on n1 as hnz=11.5(10)*n1^1.064(4), found empirically. Here are some examples:
# hnz/n1=10.4*n1**0.069 per gnuplot fit (should be f(x)=1..x for small x)
 nud    n1      memory  hnz/n1  disk    (n1*21=memory, hnz*5=disk)
-----+---------------+----------------------
 36,4   632     13kB    15.8    60kB    160sym E=  4.69533136
 34,6   24e3    432kB   21.9    2.6MB   160sym E= -2.11313499
 32,8   482e3   11MB    27.0    66MB    160sym E= -8.22686823 SMag= 0.14939304
 30,10  5.3e6   113MB   31.7    840MB   160sym E=-13.57780124
 28,12  35e6    735MB   34.3    6GB     160sym E=-18.11159089
 27,13  75e6    1.4GB   37.3    14GB    160sym n1=75214468
 26,14  145e6   2.6GB   37.9    28GB    160sym E=-21.77715233 ZMag= 0.02928151 n1=145068828
 25,15  251e6   5.3GB   39.4    50GB    160sym
 24,16  393e6   8.3GB   40.2    79GB    160sym E=-24.52538640
 23,17  555e6   11.7GB  41.4    115GB
 22,18  708e6   14.9GB  ...     ...
 20,20  431e6   7.8GB   41.8    90GB    2*160sym E=-27.09485025
-----+---------------+----------------------
 34,6   3.8e6   80MB    22.1    422MB   1sym E= -2.11313499
 32,8   77e6    1.6GB   27.3    10GB    1sym E= -8.22686823

Performance prediction (ToDo: min,mean,max):
GS1280 i100-time estimation for 28,12 (measured: ca.50min/100=30s):
 get_vector1:  n1*(dbl/rspeed+double/wspeed)  35e6*16/2e9/s=0.28s
 get_matrix:   hnz*Hsize/rspeed               1.2e9*5/2e9/s=3s (or DSK!)
 get_vector2:  hnz*(double/rspeed...latency)  1.2e9*(8/2e9..82ns)=5s..98s
 idx+mult+add: hnz*3clocks                    1.2e9*3/1250e6=3s
 load_program_instructions: cached (small loop)
# MPI???(sort-nodes+transfer-v2=n1*dbl/MPIspeed)=
# latency=82..250e-9s clock=1/1250MHz speed=2e9B/s (estimations)
# latency L1=1clk,L2=12clk=10.4ns,mem=102.5clks clock=0.8e-9s
# 18.4GB/s max=767MHz*8*2=12.3GB/s,remote=6.2GB/s
# 1.75MB 8channels? -> 8*latency-overlap?
# random-hitrate 1.75MB/735MB=0.2% ToDo: measure hitrate
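These size estimates are easy to reproduce. Below is a minimal sketch of a stand-alone helper (not part of spinpack; the file name estimate.c and all names in it are made up here) that evaluates the formulas above:

/* estimate.c - hypothetical helper, evaluates the size formulas above.
 * compile: cc -O2 -o estimate estimate.c -lm */
#include <stdio.h>
#include <math.h>

int main(void)
{
    int nu = 30, nd = 10;   /* edit to match nud in daten.i */
    int nn = nu + nd;       /* nn = N = number of lattice sites */
    /* n1 ~ nn!/(nu!*nd!*4nn), computed via lgamma: ln(n!) = lgamma(n+1) */
    double n1 = exp(lgamma(nn + 1.0) - lgamma(nu + 1.0)
                  - lgamma(nd + 1.0) - log(4.0 * nn));
    double hnz = 11.5 * pow(n1, 1.064);   /* empirical fit from above */
    printf("nud=%d,%d n1=%.3g mem=%.0fMB disk=%.2fGB\n",
           nu, nd, n1, n1 * 21 / 1e6, hnz * 5 / 1e9);
    return 0;
}

For nud=30,10 this prints n1=5.3e6, about 111MB of memory and 0.82GB of disk, close to the table values (the table uses the exact counts).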
CPU-scaling for spinpack v2.26 (Oct2007)
The T200 has 8 cores with 4 threads per core (CMT). SH is the matrix generation time (integer operations), i100 the time for 100 Lanczos iterations (FLOPs and memory transfers). The figure was created with the gnuplot file speed.gpl.
A typical CPU load for an N=40 site system looks like this:
The data were generated using the following tiny script:
#!/bin/sh
# log the cpu load of a process every 30s; pass its PID as the first argument
while ps -o pid,pcpu,time,etime,cpu,user,args -p "$1"; \
do sleep 30; done | grep -v CPU

Call the script with the PID of the spinpack process as its argument.
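For example, assuming the script was saved as cpulog.sh (a made-up name), it can run in the background while spinpack is working:

sh cpulog.sh 115877 >> cpu.log &   # 115877 = PID of the running ./spin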
Alternatively, you can have such a script activated by daten.i (edit it).
The machine was used by 5 users, therefore the peak load is only
about 12 CPUs. 735MB of memory and 6GB of disk space (or cache) were used.
You can see the initialization (20min),
the matrix generation (57min) and the first 4 iterations (4x8min).
The matrix generation depends mostly on CPU power.
The iteration time depends mainly on the disk speed
(try: time cat exe/tmp/ht* >/dev/null) and on the
speed of random memory access. For example, a GS1280-1GHz needs a
disk bandwidth of 60MB/s per CPU to avoid a bottleneck.
Reading 5GB in 8min means a sequential data rate of 12MB/s, which
is no problem for disks or memory cache. Randomly reading a 280MB
vector in 8min means 600kB/s and should also be no problem for the
machine.
You can improve the
disk speed by using striped disks or files (AdvFS) and by putting every
H-block on a different disk. The maximum number
of threads was limited to 16, but this can be changed (see src/config.h).
During iterations the multi-processor scaling is poor on most machines -- why? I guess this is caused by the random read access to the vector a (see the picture below). I thought a shared-memory computer should not have such scaling problems here, but probably I am wrong. I will try to solve this problem in the future.

The figure shows the dataflow during iterations for 2 CPUs.
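To make the suspected problem concrete, here is a sketch of the access pattern of the iteration kernel (illustrative only; the names and storage layout are not spinpack's actual ones). Each stored nonzero element is streamed sequentially, but the source vector a is hit at random positions:

/* b += H*a with H stored as (row,col,value) triplets (sketch only) */
static void mul_sparse(long hnz, const long *row, const long *col,
                       const double *val, const double *a, double *b)
{
    for (long k = 0; k < hnz; k++)
        b[row[k]] += val[k] * a[col[k]];   /* a[col[k]] = random read */
}

With several CPUs every thread streams its own part of the matrix, but all threads read the shared vector a at random positions, so the random-access memory bandwidth rather than the CPU count limits the scaling.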
Version 2.24 was very slow in calculating the expectation value of <SiSj>. A gprof analysis showed that most of the time was spent finding the index of a configuration in the configuration table (function b2i of hilbert.c). This was the reason to take a closer look at the speed of memory access. I wrote memspeed.c, which simply reads a large number of integers at different strides. Reading integers one after another (sequential read) gives the best results, of the order of 1-2GB/s. But in the worst case, where integers are read at a distance of about 16kB, the performance drops to about 10-40MB/s, a factor of 100 smaller. This corresponds to random access to the RAM. The OpSiSj function does around n1*(log2(n1)+1) such memory accesses for every SiSj value it calculates. I think it should be possible to reduce the randomness of the index calculation by using the Ising energy of each configuration to divide the configurations into blocks. (Sep2006)
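For reference, here is a rough reconstruction of the idea behind memspeed.c (the original may differ in details): sum the same number of integers at different strides and report the effective bandwidth.

/* memspeed sketch: compile with cc -O2 -std=c99 memspeed_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t n = (size_t)64 << 20;              /* 64M ints = 256MB */
    int *a = malloc(n * sizeof *a);
    if (!a) { fprintf(stderr, "out of memory\n"); return 1; }
    for (size_t i = 0; i < n; i++) a[i] = (int)i;
    /* stride in ints: 1 = sequential, 4096 ints = 16kB steps */
    for (size_t stride = 1; stride <= 4096; stride *= 64) {
        long long sum = 0;
        clock_t t0 = clock();
        for (size_t s = 0; s < stride; s++)   /* touch every int once */
            for (size_t i = s; i < n; i += stride)
                sum += a[i];
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("stride %4zu ints: %6.0f MB/s (sum=%lld)\n",
               stride, (double)(n * sizeof *a) / sec / 1e6, sum);
    }
    free(a);
    return 0;
}

Sequential reading (stride 1) should reach the 1-2GB/s quoted above, while the 16kB stride (4096 ints) falls into the 10-40MB/s random-access regime.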