You can find the latest (important) changes here.
see also ToDo

!!! --- please report compile problems and/or send your improvements ---

ToDo: remove XY_FLAG! needed for simplifications
      makes SH_storage bigger (nzx*8B+4B)!? thbuf.ofs[]; simpler code
      first add scfg2idx_blk incl. XY_FLAG () better gprof?
ToDo: mpi_l1 + scfg2node last cfg per node? comments missing!
ToDo: unique mpi-buffer-names, dynamic alloc?! scplx nsend_rr to mpibuf_vr
      replace HBLen by dyn.nzAHmax to get same MPI-pckt-lengths
      SH.nhm_l vs. i100.geth see spins.c ah_nz[ah_blks]
ToDo17: split nhm_line to cfg2scfg_nz_blk + MPIscfg2i2vi_nz_blk?
      +nhn[](should be done by op_H_already!)=compact ?
      XY_FLAG to ofs[] (zero-nzblks!) + XY_FLAG if needed in storedH only
      see asyncMPI, rename hamilton_puth_line to store_H_nz_AHblock
ToDo17: fix v2.56 checkpointing, store min_nzx (modAH) not max NZX
      check scfg2idx is bottleneck (hashing or 2 phase scan more caching?)
      or HRidx_search
ToDo18: universal parameter input list space or comma, also per option
ToDo18: alloc big array dynamic nrecv_cfg,nsend_cfg,nrecv_vy,nsend_vy
      cause linker errors for other static vars, if too big v2.57
ToDo: use xorshift for startvec 4 and 5 also
      rnd-speed: urnd=4MB/s xorshift1024*=1GB/s libc.rand=196MB/s
      speed/info no pthread-inivec
ToDo: rename spinpack to qspinpack (quantum) ??
TODO v2.80pre stable 2-loop-nonblocking_MPI_thread0_only (speed)
TODO v2.70pre stable partly-nonblocking_MPI_thread0_only, taskfile(~threadfile)?
TODO v2.60pre stable partly-nonblocking_MPI_thread0_only improved
ToDo: more short parallel! speed_tests + slow-node-diag + rmem-diag
      (show pinning problems?)
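Note on the xorshift ToDo above: startvec=6 already uses xorshift128plus with values in -1..+1 (see the 2018-08-20 entry). A minimal sketch of such a generator in C follows. All names (xs128p_t, xs128p_seed, xs128p_next, xs128p_sym) are hypothetical and not taken from src/vector.c. The shift constants 23/18/5 and the splitmix64 seeding are those of the public-domain reference implementation; spinpack may use different ones.

```c
#include <stdint.h>

/* Hypothetical generator state; not the spinpack data structure. */
typedef struct { uint64_t s[2]; } xs128p_t;

/* splitmix64: expands one 64-bit seed into well-mixed state words. */
static uint64_t splitmix64(uint64_t *x) {
    uint64_t z = (*x += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Seed the generator; the state must never be all zero. */
static void xs128p_seed(xs128p_t *g, uint64_t seed) {
    g->s[0] = splitmix64(&seed);
    g->s[1] = splitmix64(&seed);
    if ((g->s[0] | g->s[1]) == 0) g->s[1] = 1;
}

/* One xorshift128+ step (reference constants 23, 18, 5). */
static uint64_t xs128p_next(xs128p_t *g) {
    uint64_t s1 = g->s[0];
    const uint64_t s0 = g->s[1];
    g->s[0] = s0;
    s1 ^= s1 << 23;
    g->s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
    return g->s[1] + s0;
}

/* Map the upper 53 bits to a double in [-1, 1), so both signs occur. */
static double xs128p_sym(xs128p_t *g) {
    double u = (double)(xs128p_next(g) >> 11) * (1.0 / 9007199254740992.0);
    return 2.0 * u - 1.0;
}
```

For a complex start vector one would presumably call xs128p_sym() twice per element (real and imaginary part); giving each thread its own xs128p_t avoids shared state.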
ToDo: lrzGarching slow l1_load (write 14m, read 26m vs SH 3min) 3200nodes
Todo: set thbuf.lsym-Macro, testing thread mem locality hpc18 - no speedup
      better for generator syms only
      check doc/speed_mpi.gpl hpc18 problem
      check hpc18 mpt=32 + NAH=2048 .o above 2GB do not link together
      check meggie pthread gcc412 compile error
      add mini-mpitest to show mpi-speed as info and later archive info
      better env-vars MEM_PER_THREAD or MEM_PER_CORE for hybrid-code? HT?
      use tmp = $SCRATCH + tmp.$jobid if not specified by option?
      without daten.def-mode? (pipe in daten.def from exe or incl. daten.i?)
      per stdin? switch to load_model at +++model+++ and leave at ---model---
      or load file per model_file= model_cmd= ...
      time resolution i64 ns? no (slow float)? H=float
      separate the (cplx-)cfg-phase? and multiply it in only later (test.lc5)
      remove hnz_stored=...
TODO: til8_48.dat ud=14,-14 c64.bn16=5.92ns f32.bn16=3.10ns !!!
      use symgen.products not allsysm (less L1-cache?)
TODO: remove XY_FLAG to get max_node_n1 to max32bit, remove M_d2f (allowAVX),
      fill n1 with dummies to NAH-multiple = simple loops
-----
v2.59pre
v2.58a
 2019-07-03 backport-fix multi-a0
v2.58 stable partly-nonblocking_MPI_thread0_only
   reference base for further code simplifications
 2019-06-05 pt>B_NUM warning replaced by error (avoid mistakes)
 2019-06-04 disable maxmem,maxfile on daten.i (to avoid confusion)
 2019-05-31 fix bad pointer initialization of ib4=s_ib4[-1] in speed_tests
   causing multiple speed_tests calls or other strange behaviour
   bug was not detected by valgrind (since 2019-05-20?)
 2019-05-17 fix segfault at fulldiag (nev=1, multithread-code)
 2019-05-11 use /proc/meminfo + /proc/cpuinfo to get coremem=memory/core
   can be overwritten by e.g.
   option --maxmem=1.5e9
   daten.i maxmem and maxfile are obsolete now, replaced by option
   fix output on out of memory (OOM)
   speed+scaling-tests up to 1022 nodes 48 cores (kago48) on SuperMUC-NG
 2019-05-06 some SuperMUC-NG adaptions, lc48 tests,
   use: num_fat_nodes=n1/2^31
   add 1/4s warmup for speed_test, fix t[ns]/cfgs for wud!=0,
 2019-04-15 HRMAX=0 as default to avoid HR_TYPE overflow, needs more Mem
 2019-04-12 fix sisj for blocks<NAH, caused scfg_err + segfault
 2019-04-10 fix mrule for NOSZSYM (hbuf[] as arg, not at stack,
   fix false-CPU-cache-sharing = speed),
   fix randvec=6 (xorshift) bad eigenvectors for NEV>0
 2019-04-05 fix sisj for n1%NAH!=0, mpi=1, using valgrind.memcheck
   uninitialized mem as index in nhm_line, bug reported by JR, since 2017-04
 2019-04-05 fix overflow-check models/m_square.c (nn>256), m_1d.c +bond_filter
 2019-03-31 6x speedup symcfg.ibfly in matrix-generation for NN>128
   gcc8 needed for unroll ibfly-loop (simpler code)
   but slower on gcc7 and earlier
 2019-02-05 fix models/m_sc.c for J1-J2 (was J1-J3), add J3
 2019-01-23 fix 32bit-cpu compatibility, show binary-name (DBG),
   exit on errors in input file (old was warning only, bad for Million-CPUh)
 2019-01-20 fix bad pinning (taskset 0xff 8tasks/node 1thread/task, CPUSET=2)
 2019-01-16 fix overwriting existing daten.def by make [install]
 2019-01-16 replace static b_hbuf to allow bigger AHEAD*max_threads_per_task
   also improve thread local memory usage (better?)
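Note on the 2019-05-11 entry (coremem=memory/core from /proc/meminfo + /proc/cpuinfo, overridable by --maxmem): the computation can be sketched as below. This is an illustration under assumptions, not the spinpack code. The function names are hypothetical, the parsers work on text that the caller has already read from the two /proc files, and "core" here means one "processor" line, i.e. a logical CPU (hyperthreads included).

```c
#include <stdio.h>
#include <string.h>

/* Return MemTotal in kB from /proc/meminfo-style text, or -1 if absent. */
static long long parse_memtotal_kb(const char *text) {
    const char *p = strstr(text, "MemTotal:");
    long long kb;
    if (p == NULL || sscanf(p, "MemTotal: %lld", &kb) != 1) return -1;
    return kb;
}

/* Count lines starting with "processor" in /proc/cpuinfo-style text. */
static int count_processors(const char *text) {
    int n = 0;
    const char *line = text;
    while (line != NULL && *line != '\0') {
        if (strncmp(line, "processor", 9) == 0) n++;
        line = strchr(line, '\n');   /* advance to the next line */
        if (line != NULL) line++;
    }
    return n;
}

/* Memory per core in bytes; maxmem_opt > 0 plays the role of --maxmem
 * and overrides the autodetected value. Returns -1.0 on parse failure. */
static double coremem_bytes(const char *meminfo, const char *cpuinfo,
                            double maxmem_opt) {
    long long kb;
    int cores;
    if (maxmem_opt > 0.0) return maxmem_opt;
    kb = parse_memtotal_kb(meminfo);
    cores = count_processors(cpuinfo);
    if (kb < 0 || cores < 1) return -1.0;
    return (double)kb * 1024.0 / (double)cores;
}
```

Keeping the parsers separate from the file reading makes them easy to check with fixed strings; a caller would fread() both /proc files into buffers and pass them in.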
 2018-12-13 incorrect output maxmem, maxfile "per task", correct per thread
 2018-12-12 abort on bad k input to avoid unintended waste of CPUh + power
v2.57 main-change: Hmatrix is now stored with fixed blocks of variable size
   because there are different numbers of nonzero elements per line
   to make code simpler (less errors, better to maintain)
   fixes partly stored matrix problem, since ?, size=NZXMAX*AH*B_NUM*nzSZ
   Hnz-blocks per node (not per thread) = B_NUM*bigger = hybrid speed
   old: fix nz-size blocks with data reorganization (for slow disks)
   only thread 0 is doing mpi or disk to maximize packet size for speed
 2018-12-09 more debug info, reduce speed_tests overhead to
   multiple 100*time_res, ca. 1ms, old 400ms
 2018-11-20 Makefile: add parallel test build, make -j4 test
   use CONFIG_MPI=$mpi_max_num to configure max number MPI tasks
   old CONFIG_MPI=1 sets default value in block.h
   add optional configure --mpi=$mpi_max_num --mpt=$mpt_max_num_per_task
   using CONFIG_MPI_MAX_NUM + CONFIG_MPT_MAX_NUM from host spec.
   environment
 2018-11-18 test gcc/clang -m32 ..., tested: x86_64 + i686(32bit)
 2018-11-17 enable clang complex extension (clang-3.4)
 2018-11-15 fix double free of Hmatrix-blocks (possible segfault on 2nd a0)
   fix multiple debug outputs for mpi-jobs
   remove option -o (use shell pipe >, don't split output, simplified)
   experimental v1709_maxscfg removed, too slow, not reliable
 2018-11-11 configure 2.75c changed config.log, MacOS-fixes
   compile test gcc-6.3 ok gcc-8.2.0 ok LLVM.clang-3.4.2
   check partly stored matrix + test-case AH=8
   reduce maxmem, maxfile=maxmem/2
   create hfiles only if needed, avoid 1 empty hfile per core (slow NFS)
   fix false-positive overflow binomial coefficient detection
   improve + check overflow detection (check on lowbit-types)
   on overflow old: bad H-matrix, new error + abort (test u8 + high S)
   overflow-detection is working in more but not all cases (ToDo)
   fix CONFIG_Fidelity for gcc complex vectors (compile error)
   no fabs(z) in complex-math, but in own complex.h (replaced by cabs)
   problem since 2017-01 and earlier if internal complex + fidelity used
   fix problem itr>n1 for some models (high 2s, n1=1) and parameter sets
   also getting bad FTLM outputs + results for low n1, since 2016-01-23
   check low N but high S models (NN>64, NN>128 until NN=256)
   see doc/example_lc2-3.html for exact eigenvalues and higher spin
   see doc/example_lc5.html for max b_factor_lm in hilbert.c 2s=16..40
   test Tnorm2=int,__int128,f32,dbl see doc, max N=3 2s=84 NN=256 HRMAX=0
   change daten.sym data error to fatal error, save CPUh
 2018-11-05 fix macOS-10.14 problems (no pthread_barrier, no echo -e;
   static inline, not inline, this feels bad but see no better way)
 2018-08-27 fix bad num_symcfg filtering for MPI-code, since v2.57pre
   ./configure --mpi [--lapack --ftlm] [--mpt];\
   OMP_NUM_THREADS=2 make test # mpi=2 ok
 2018-08-29 add test m_1d.c.MixedSpinKagomeChain, fix bug: ud=16+2,a32
   bad since 2017-09-11 for systems spin=1++ and site 0 spin 1/2 only
 2018-08-23 skipping EV-code on FTLM removed
   for (make test) set nev=0 for FTLM to skip ev-code
   remove Makefile.in, user must edit Make_cfg.inc, not Makefile
 2018-08-22 perf_test code broken for parallel ns-code, don't use it,
   may be reintroduced later
   clean up ns_thread-code, to fix some v2.57pre bugs
 2018-08-20 add startvec=6 + bits4..31 for xorshift128plus {-1..+1,cplx}
   old random vectors have reduced space (nonnegative, nonrational ...)
   this may be relevant for FTLM (statistics), much better randomness
   add warning about bad FTLM results using startvec.method!=6
   test: N=22 sawtooth J1=0.45,J2=-1 FTLM R=10 now converges to full-diag
   but not for startvec.method=5 or 4 (startvec=13 or 12), R=100 fails
   configure: check if autodetected compiler is working (icpc + no license)
 2018-05-27 fix bad wud (not 0 for nu!=nd) for mpi-code since 2017-09-06
   fix deadlock for 2nd parameter run for multithread(?) and hybrid
   fix bad randomness on configs for speed_test
 2018-05-25 fix bad error triggering if mpi_n*pt_n not fit to BY_TYPE
   since v2.36 2008-08 only mpi_n must fit to BY_TYPE
   fat-testrun kago48 using 2040 nodes * 2tasks/node * 16 threads = OK
 2018-01-10 rename doc/example[1-4]_*.html to doc/example_*.html
   rename f# to fz# in daten.def
   change MACRO HFIELD from def/undef to def 0 or 1 = simpler code
   ToDo: bad expectation values? fixed by wud fix?
 2018-01-09 prepare code for field wfx[] needs NOSZSYM,
   and wfy[] needs NOSZSYM,VecType+4
   ToDo: setting wfx,wfy field must be implemented
   also Kitaev + Dzyaloshinskii-Moriya, EV.SixSjx etc.
 2018-01-05 replace macro names H_JZ, H_Jadd for IFz Add34 Ham1
   H_Jadd meaning changed, h_zpara excluded, simplified nzH JJ-term
 2017-10-20 fix bad last-block-size problem (may be zero, ex: lc27_s1 950*6t)
   switched from async num const-length-HNZ-blocks
   to const (thread_n1/AH) num variable-len-HNZ-blocks to get simpler code
   for nonblocking NZX handling and mix mem+dsk
   vvv+16 shows unbalanced cpu-time and mem use (speedup 1..3 possible?)
   test: partly stored NZX, mem only, using disk crashes = ToDo
 2017-09-13 fix div0 on small n1 where blocksize may be 0, example: n1=4 -t3
 2017-09-12 fix multithread minsymcfg_dflt speed measurement, to identify
   slow (bad) nodes, old: task0 was always faster than others
   daten.i: add ":sh_dbg_cmd" for first 4 mpi-tasks ("!" is task0 only)
 2017-09-11 reintroduce 2013-01 next+LM for s=1-JJ p_ns factor 10 faster
 2017-09-06 improve [is]minsymcfg_*, simplify if wud (speedup pns)
   more thread local mem, read same mem is slow on MIPS-threads
   not cached in CPU-L1 (but x86_64 L1-caches reads of same mem)
   this makes hybrid code much slower on SC-MIPS-ICE9B than pure MPI
   (good) hybrid code wins always on large scale (bigger message size)
 2017-04-07 better output after memory bit-flip detection in l1[] +
   add l1[] auto correction on runtime (1-5 errors/h/8GB? 2009)
   t100: 511529 ECC-errors / 479d / (174*256GB) = 45/h/43.5TB 2016
   node-statistics: found on 31 of 174 nodes (18% nodes),
   DIMM-statistics: found 1 bad of 16/node = 1.1% 16GB-DDR4-DIMMs-2016
   kautz: ca. 15/972*2 DIMMs = 0.77% bad 2GB-DDR2-DIMMs-2008
 2017-04-13 add startvec+=...
   entry for daten.i for simpler scripting
 2017-04-06 fix buggy -m32 sq40=7+33 (6+34=ok) h_read.len: int to size_t
 2017-03-08 output number of iterations for FTLM extension
 2017-03-02 configure changed, Makefile.in removed, Make_cfg.inc added
   cleanup redefinitions of t_sites+tphase in header files, since 2.56
 2017-02-28 better output for task/thread-to-vcore affinity (64..128vcores)
 2017-02-24 fix bad results on partly stored matrix (AH=8 10%+last_small_blk)
   since v2.56
 2017-02-24 partly non-blocking MPI collective operations, i1.speedup 25%-90%
v2.56c known problem: partly stored matrix = bad results
 2018-09-10 backport v2.57 2018-05-25 triggers tasks*threads=2048*16
   fix bad error triggering if mpi_n*pt_n not fit to BY_TYPE
v2.56b
 2018-08-23 backport method=6, CC-test, make_test-fix for FTLM from v2.57
   ./configure --mpi --lapack --ftlm; make test # mpi=2 ok 45s
   ./configure --mpi ; make test # mpi=2 ok 48s, mpi=4 45s
   ./configure --mpt --lapack --ftlm; make test # fixed ca. 40s,
   ./configure --mpt;OMP_NUM_THREADS=4 make test # fixed 29s
   ./configure ; make test # fixed
v2.56 stable(?) blocking_MPI_thread0_only, scales better than v2.55
   bug: probably bad results on partly stored matrix, fixed in v2.57
   bug: STOREH 2 fails O_DIRECT (CentOS5), use STOREH 4 as workaround
 2017-02-22 fix problem with last block on last thread in case of zero length
 2017-02-20 make test fixed (--mpt --mpi @quantum,JJ64,cplx,sisj)
   storing nzx instead NZXMAX again, so memory consumption is like 2.55
 2017-02-18 asynchronous fully filled const_size_nz_blks (HBLen) replaced by
   _synchronous_ partly filled const_size_nz_blks (NZX*AH), more KISS
   HBLen obsolete now and replaced by NZX*AH, NZX must be minimized
   synchronous H-blocks needed to fix complicated MPI-error (since HBLen exist):
   MPI and partly written matrix may cause problems (BFACTOR-err+bad_e0)
   but eats more memory! NZXMAX*n1 instead of old nzx*n1 (subject of change)
   e.g.
   kago42k1=15+27 +49% (this is a big change in the code pipe)
 2017-01-28 use mem + disk for matrix, set maxmem + maxfile (per CPU-core)
   replace ./tmp_shared by ./tmp (shared or local scratch disk)
   STOREH=0 replaced by runtime options maxmem + maxfile
 2017-01-27 rename short names (better to read + grep), split functions
   to prepare for asynchronous communication (in progress)
 2017-01-23 fix speed_test race condition for mpi/pt/hybrid code (get nzx)
   reorder h-file-content
 2017-01-22 --perf_test=100 reduces n1 by 100, old was stopping at 100
   this gives much better statistics (nzx, performance per nz)
   - more output about mem-usage, better large scale debugging
   update perf_test: 2018-08 broken, needs reimplementation
 2017-01-21 only thread0 does MPI in hybrid mode (bit slower than v2.54!)
   bigger MPI packets, less CPU-power for MPI, better readable code
   will overcome v2.54 when using MPI_Ialltoallv=overlapping Net-I/O
   - synchronized loop over threads doing MPI replaced by thread0 only
   doing MPI using bigger data blocks for hybrid mode (MPI + pthreads)
v2.55 - stable seq_MPI_all_threads Alltoallv
   error: MPI and partly written matrix may cause problems (BFACTOR-err+bad_e0)
 2017-02-01 fix compile error, if MPI is not used, workaround bad g++412+cplx
 2017-01-30 CFG_CPUSET=1 to show CPU-affinity, CFG_CPUSET=2 to set affinity
 2017-01-28 speedup for expectation value Zi (diagonal operator)
 2017-01-26 add warnings, bigger buffers enlarge OOM risk by libmpi
 2017-01-25 disable speed_test.nhv for mpi or pthread, it's buggy
   gcc-6 + complex compile error fixed (defs.h+__STDC_LIMIT_MACROS)
v2.54 please use v2.55
 2017-01-24 fix mpi + ev=1 hanging triggered by x_sort in error.c (backport)
 2017-01-21 fix complex + CPUSET for icc (need to add to v2.60 too)
   fix compile problems g++ + complex
 2017-01-19 backport iy y inverse indexing
   old: rnd_write+seq_read, new: rnd_read+seq_write (geth_blk)
 2017-01-19 backports v2.60 output cpuset, use faster blk_scfg2idx
   bigger MPI_MAX_NUM (better defaults for
   supermuc)
v2.53 - much better MPI scaling using Alltoallv - use v2.55
 2017-01-16 fix hybrid mode code, buggy since 2017-01-15
 2017-01-15 replace loop over MPI_Sendrecv (mpi_n*sync) by MPI_Alltoallv
   faster on comp2.09.ompi14 + supermuc12.mpich3, but much slower
   (factor 15 for 5700 tasks) on SiCortex09.sc_mpich2
   A2av-emulation for 5700*SiCortex gives speedup of 100% SH and 30% i100
   Problem: My_MPI_Alltoallv was loop j over -blocked- send i to i+j
   but datasize of blocks within each loop is very different
   only sum is distributed, so alltoallv is much better way
   but output of mpi_stats is not correct anymore (use meansize per node now)
   - loop over unbalanced (blocking) Point-to-Point-communication
   (MPI_Sendrecv i-1,i+1) replaced by collective communication
   (MPI_Alltoallv) using collected data (better balanced)
 2017-01-13 option: --perf_test for reduced matrix-size performance test
 2017-01-13 temporary fix for bad alloc_mem on disk usage STOREH=2 or 4
   set bigger dflt AHEAD 64 to 1024 (see 2016/lrz + doc/speed_mpi)
v2.52
 2017-01-11 fix icpc compile error
 2017-01-10 fix hybrid code (MPI+OMP), fix (make test) include DFLAGS
 2017-01-09 rewrite to use OpenMP-2.5 only (but i100 is about 6x slower)
 2017-01-08 add OpenMP-3.0 workaround for PTHREAD (but icc10 has OMP-2.5)
   this is for systems, where we have OMP but no or buggy libpthread
 2017-01-06 found some problems using icc10 -pthread = hangs + bugs
 2016-12-27 reduce stack memory needed for speed_test by ca. 16kB
 2016-12-22 replace use of hard coded mask CFG_CPUSET by getaffinity()
   CFG_CPUSET must be set to some nonzero value only (e.g.
   1)
   example use cores 4-11: taskset 0x0FF0 ./spin -t8
 2016-12-12 fix unnecessary abortion for maximum iteration (i=MAX_ITR)
 2016-12-01 sym_k= -999999 and below stops sym-search, similar to SIGUSR2
   should be replaced by faster sym search (exclude full commuting syms)
   example: chain of repeating (N=2,3,..)-rings separated by single spins
   for N=2 dimer-plaquette-chain, changes in symmetry.c get_recursive_perm
 2016-11-10 avoid re-use of old eigenvalues on failed malloc (fulldiag)
 2016-11-10 malign alloc big Hbuf for fulldiag, since about 2015-06
v2.51
 - 2016-10-05 gcc-6.2.1 fix stronger spacing errors in models/*.c
   fix bad signal behavior of my_handler() for mpi-jobs (thanks gcc6)
   fix indentation according to gcc6.2 warnings
 - 2016-08-17 g++-6.1.1 fix stronger spacing errors in error.h "..."var
 - 2016-08-09 fix lot of sym output (LM+NOS1 only?) for mpi since 2016-01
 - 2016-05-18 add extra workspace for lapack routines, 2-3x more speed
   using multithread for a4-eigenvectors, use OMP_NUM_THREADS if no option -t
   fix bad "ERROR fclose_l1"-msg (no impact, clean output only)
   add autodetection of core number for configure --mpt, +option --ftlm
   use CFG_PTHREAD>1 for B_NUM, link example*lc2*html in doc/spins.tex
 - 2016-04-29 fix buggy transposed eigenvector-matrix of cjacobi()
   buggy since v2.42 2012-01 test: nev=1 verbose=7 a4
 - 2016-04-28 improved utils/lapack_test.c +options +benchmarks +XXXevd()
   autoswitch to faster LAPACK_dsyevd/zheevd for nev!=0
 - 2016-04-21 fix autoset maxHmem=maxfile, if more than 16GB/core needed
   since 2015-02--, improve output (normvec/startvec)
   fix bad srand(0) (==srand(1)) from 2016-04-15, 2nd = 1st random startvector
 - 2016-04-15 for JSchnack2016 Finite-Temperature Lanczos method = FTLM
   see spinpack_rel_pap/FiniteTemperatureLanczos_FTLM_JSchnack2012.pdf
   DFLAGS += -DCFG_FTLM=1 + daten.i: startvec=13 + repeat a0's (unproofed)
   update: better use max_ea=16 and one a0 only, startvec=5
 - 2016-04-15 fix bad v=0 for startvec=8++ (rng_ini+=startvec&~15)
   change
   randvec, startvec=0,1,4,8++ use, random startvectors changed now
   see randvec()-comments: grep startvec src/vector.c
v2.50d Apr2016
 - add min/max CPU/node speed to detect bad cpus or nodes 2016-04-07
 - fix segfault speed_tests.nhv_no_l1 (high node numbers ca.58*AH) 04-07
   since about 2015-03, verbose=1 is a workaround
 - reordered debug outputs
v2.50c Mar2016
 - fix multiple a0 memleak (l1) introduced 2015-02-24 for BLCR 2016-03-22
   using: valgrind --tool=memcheck --leak-check=full ./spin
 - fix out of array read access for NUM_AHEAD gt n1 since 2015-10? 2016-03-22
 - fix bad "symtable full" case (MaxSym too small) since 2016-01 2016-03-22
 - add usleep(100ms++) for nonzero tasks to relieve nfs overload 2016-03-16
   ToDo: untested, please try with/without spins.c.L4302 and report
v2.50b Mar2016
 - fix zero block crashes (SEGFAULT+badscfgs for mpi_n big, n1 small)
   introduced 2015-10-08 (benchmark adaptions) 2016-03-10
 - allow massive oversubscription (64++) removing unneeded MPI_Barrier
   +using --mca mpi_yield_when_idle 1 --mca mpi_preconnect_mpi 1
   useful for debugging problems on PCs which appear for high mpi_n only
 - fix buggy autocorrection for nu+nd!=nn (since 2016-02 +old) 2016-03-09
 - fix multiple (per-mpi-task) warning "paini overflow" (N>=68) 2016-03-07
   introduced in v2.50 2016-01
v2.50 Apr2015-Jan2016 SIMD-CPU version
 - fix bad delay tasks*seconds, result of fixed mpi-token-ring 2016-02-03
   introduced 2016-01
 - allow more iterations for small systems (better for JJ N=2 s=33/2) -old-
 - fix bad N=2 LM detection, tested: N=2 2s=1...34 using int64 2016-01-23
 - fix overflow problems using -DTnorm2=double (N=10 2s=8) 2016-01-22
 - fix mpi-token-ring for shared mymap_l1 2016-01-20
 - replace O(n^3) by O(n^2) paini-storage, simplify paini, check overflows
 - add sym_lm (0=max(S(S+1)), 1=max(J(J+1))) see doc/sym_tU.txt 2016-01-16
 - 10-20% speedup recursive ns for high-spin-systems (see cubocta.def 2s=7)
 - fix highSpin-problems with vectorized code or wud=0 and nu=nd 2016-01-11
 - fix speed_tests() for multithread + static 2016-01-11
 - symmetry.c partly rewritten for better handling of s1-systems + NOS1SYM
   N=11-s=1-lc-chains: sym_k= -21 0 ... # skip 2N syms, set LM-syms (fast)
   N=11-s=1-lc-chains: sym_k= 0 0 -60000 # skip LM-syms, set 2N k's only (fast)
 - make auto-VS depend on tbase (VS*S_tbase=const.) faster on SSE2, AVX2
 - enable benes up to 512 bit-cfgs (JJ) tested 2016-01-08
 - fix vlint - int for NN>128, improved error check + output 2016-01-08
 - speed_test bnv4+bnv8 removed, code reduction (bnVS left), +const-args
 - fix buggy warning "WARNING: NOS1SYM ..." if S1SYM set 2016-01-08
 - add L1_PACKED=0 to allow fast non packed or slow packed tbase_vector l1
 - fix tJ- and tU-model for NN=32 (%4==0) and nn=3 (%4!=0) since 2014-05-14
 - fix speed_test.check_minsymcfg for tJ- and tU-model 2016-01-05
 - fix bad spins.c.L1258 special S1-Sym speedup (works for JJ only) 2016-01-04
 - fix bad gcc.phase_eq.fabs(cplx)<1e-6 by [sqrt](Norm(cplx))<1e-12
   tJ_lc_s1 k=2/8 nu,nd=0,2 failed before (see tJ_lc_s1.gpl) 2016-01-04
 - fix sqrt-format for negative non-sqrt numbers in myprint.c (vvv&64)
 - fix bad phase output (fabs() != sqrt(Norm())) 2015-12-31
 - add new test data to doc/lc.gpl N=46 n1=45e9, lc_s1.gpl N=26, tJ_lc_s1.gpl
 - fix S1SYM + isminsymcfg_lm() return -1 for tJ+tU model (recursion)
 - Vlint as 2^n-bits to fit benes needs (hangs on old code and non log2)
 - algo=64 stop after l1 (prepare l1 serial, --ckpt_load=2 + parallel) 2015-12
 - earlier output of i100.speed at i001 2015-12-24
 - reduce error output for mpi (symmetry search, nsym>MaxSym) 2015-10-15
 - fix speed_test factor 2 too high speed outputs (since v2.45?) 2015-10-08
 - fix array overflow (NN>MaxSym) in symmetry.c sym.w initializ. 2015-09-23
 - fix benes-algo for NN>64 (JJ, but slower than lNbrk) 2015-09-23
   NN=512 JJ.nn=4+36 + nn=2+498 tested, some very slow ini parts
 - add term e_i to tJ-model (tJ as tU where U=infty) 2015-09-23
 - fix div 0 for numsym=0, FP exception (or endless loop?)
   2015-09-23
 - fix missing initialization of benes-network bnVS after 1st round 2015-09-22
 - fix buggy symcfg_bnv4 (only little speed-test relevant) 2015-09-08
   fix valgrind warnings
 - replace local hbuf (size*AH) by alloc_thbuf, stackovl speed_test 2015-07-08
 - switch back from benes to old slow lNbrk for NN gt 64 (buggy) 2015-07-07
 - add BENES-performance-decision-table in symmetry.h for different CPU-types
   also lot of detailed vector-performance data added in speed_mpi.gpl
 - improve Hbuffer allocation for multi-thread version (B_NUM>1) 2015-06-11
 - fix segfault in 2x2-diag-algo (a2) 2015-06-08
 - expand (make test) for tJ and tU-model, fix some bugs+warnings 2015-06-02
 - add string.h to spins.c + vector.c to fix linker problem gcc492 -std=c99
 - fix abort-on-non-hermitian-matrix problem for vvv=2 if terms sum to zero
 - add macro for __int128 if it is available (for base configs) 2015-05-17
   speedup 3.5 for N>32 tU on Atom-N455 64bit SSE2
 - benes network (bnVS) for bit-permute O(2Log2(n)-1) implemented 2015-05-11
   may be faster for SIMD, AVX2-advantage expected, but not seen fully
   set CFG_USE_BENES in hilbert.h (see also minsymcfg_blk_bnVS[_no_ud])
   tested on gcc41-gcc49 -m32/m64 SSE2/AVX2 NN=40
 - hbuf->el[].{bj,blkj,jj,rr} replaced by hbuf->{acj,blkj,jj,rr}[] 2015-04-15
   for better SIMD vectorization, also smaller CPU cache footprint
 - VChunkSize (load/store/checkpoint vectors) reduced to 4MB (OOM--) 2015-04-13
 - replace piecewise malloc by one pre-malloc for storeH (see bugs) 2015-04-08
   reduce problems (deadlocks + aborts) by mpi in OOM conditions (STOREH=0)
   malloc has some internal "optimization" which causes trouble near Out-Of-Mem
v2.49 Mar2015 = new stable (has 32bit-compile-problem, patch available)
 - fix n1=0; speed_test for vvv .and.
   3 (old 2) 2015-03-11
 - minimize sbase size (like old MPI+JJ) 2015-03-10
v2.49 Mar2015 (+fix above)
 - maxfile(=hfmax) should be set for STOREH=0 too, libc/ompi bad on OOM
   see bugs.txt
 - fix speed output (0-sym added) for numsym below 100 2015-03-09
 - fix missing if-condition before error for multithreads 2015-03-06
 - -DCKPT_HELP for SIGUSR2 triggered safe checkpoint window
   without MPI-messages 2015-02-26 speed-loss? no
 - fix speed_test for tJ, tU; rename b_smallest to minsymcfg 2015-02-24
   PRF output per config*num_symmetry
 - fix segfault for speed_test for sym.numall=1 2015-02-23
 - add ReserveMB to avoid Out-of-Memory in case of partly stored H 2015-02-23
   before that checkpointing could not alloc 16MB because of OOM
 - fix multiple output of checkpoint time for MPI 2015-02-21
v2.48 2015-02-05
 - fix uninitialized use of thbuf since 2014-07-14 2015-01-31
 - fix buggy hfmax use 2015-01-29
 - fix gcc-4.8.2 compile warnings (JJ,tJ,tU) using vlint
 - remove macro Ne dependency of Hubbard-e-term since 2.16 and earlier
 - replace all int= tbase(=vlint) .and. int, may cause bugs NN>64 2014-12-17
 - fix bad checkpointing "v0[eo].dat" if chkpt6.i-last is even 2014-12-03
 - fix segfaults nev!=0 introduced in 2.48 2014-10-14
 - fix compile errors for tU on wop() 2014-10-01
 - reduce messages on loadvec error + abort 2014-09-19
 - HW fault detection, memory bit flips (?) by checking tridiag range 2014-07
 - partly stored SH-matrix reenabled for maximum iteration speed
   this may cause problems with MPI calling mmap/malloc?
 - thbuf-allocs moved outside loops for better lowmem behav.
 - reduce lowmem outputs (see pipe.c L128) 2014-07-26
 - coordinated l1-file access to reduce file-system-pressure
 - replace big dyn.
   mem allocs within loops (fix lowmem probs) 2014-07-17
   buggy 2nd run (see 2015-01-31)
 - add CFG_SIMPLE_CODE for auto-micro-parallelization tests 2014-07-04
 - fix error propagation to mpi-threads for loadvec() 2014-06-19
 - improve mpi_stress.c + memspeed.c benchmarks 2014-06-19
 - fix segfault by read unini mem in b_smallest (kago36,comp2) 2014-05-23
 - fix buggy faster b_smallest v2.44 for tU (tJ untested) 2014-05-23
 - configure is detecting openblas-devel for parallel fulldiag a4 2014-05-15
   big matrices segfault sometimes for unknown reason (race cond?)
 - some more pre-benchmark outputs (eats 1-2 seconds per test) 2014-05-10
v2.47 2014-02-14 + 2014-05-25_hilbert.c
 - fix bad results for maxfile=0 a0 from a2-improvements(2.45) 2014-02-14
 - fix bad precision for lapack-3.1+.zheev.lwork=3*n1 sawt20Z6 2013-05-24
v2.46 2013-05-02 + 2014-05-25_hilbert.c
 - fix bug SiSj and wop (bad values, caused by .nhv vs. .rr) 2013-05-02
v2.45 2013-04-29
 - benchmarks and speedtests for verbose=3 added, oprofile tests 2013-04-25
 - auto load checkpoint removed to avoid problems with old runs 2013-04-25
 - better i100.t estimation (old was i/(i-1) too big) 2013-04-24
 - change ns-mode switching, a0=auto, a16=recursive, a32=old_seq 2013-04-24
 - add fstime() for seconds(+ms,us) as double, itime is obsolete 2013-04-23
 - output timings using prefix PRF for performance, clean diffs 2013-04-22
 - option maxmem removed (use ulimit or job limits vmem) 2013-04-22
 - fix ./configure (bad obsolete --mpp option, check for icc+c99) 2013-04-05
 - fixpoint16 removed, type cast revised, better float accuracy 2013-04-04
 - VecType-tests 1D-JJ-N=40 8+32 n1= 963793 gcc-4.1.2: 2013-04-04
   1 1*8B: -5.37411616 -5.14810857 g99 -std=c99 i100=0.25m i75 0.18SH -O3
   0 1*4B: -5.37411616 -5.14810857 g99 -std=c99 i100=0.20m i75 0.20SH -O3
   4 2*4B: -5.37411616 -5.14810857 g89__complex__ i100=0.34m i75 0.20SH -O3
 ! 8 fix4B: -5.37389582 -5.37385328 gcc -std=c99 i1000, after bug fixes
 ! 8 fix2B: -5.26368707 -5.26176567 gcc -std=c99 i1000, == bad
   fixpoint float results + lot of complexity, removed for simplicity
   may be C99 half float can be used much simpler (not in gcc-4.1.2!?)
 - VecType-tests 1D-JJ-N=40 6+34 nosym n1=3.8e6 gcc-4.1.2:
   4 2*4B: -1.75056342 -1.67952075 g++myclass i100=0.82m i125
   4 2*4B: -1.75056342 -1.67952075 g89__complex__ i100=0.77m i125 0.28SH (=O3)
   4 2*4B: -1.75056342 -1.67952075 g99_Complex i100=1.09m i125
   5 2*8B: -1.75056342 -1.67952075 g99_Complex i100=1.29m i125 0.38SH
   5 2*8B: -1.75056342 -1.67952075 g99_Complex i100=0.91m i125 0.33SH -msse2 -mssse3 -ffast-math -O3
   5 2*8B: -1.75056342 -1.67952075 g89__complex__ i100=0.94m i125 0.28SH -msse2 -mssse3 -ffast-math -O3
   5 2*8B: -1.75056342 -1.67952075 g++myclass i100=0.91m i125 0.30SH -msse2 -mssse3 -ffast-math -O3
   4 2*4B: -1.75056342 -1.67952075 g++myclass i100=0.83m i125 0.30SH -msse2 -mssse3 -ffast-math -O3
   1 1*8B: -1.75056342 -1.67952075 g++ i100=0.50m i125 0.23SH -msse2 -mssse3 -ffast-math -O3
   0 1*4B: -1.75056342 -1.67952075 g++ i100=0.48m i125 0.23SH -msse2 -mssse3 -ffast-math -O3
 - zahl,mzahl,mcplx replaced by double,sdouble,scplx (s=short,l=long)
 - CC=gcc -std=c99 adaptions, shbuf.shelem.rr(mcplx-to-cplx) 2013-04-03
 - typecast fix + _Complex support by A.Honecker for icc 2013-04-02
 - speedup next() for S=1++ removed for simplicity and fix S=4 bug 2013-04-02
   ToDo: add parallel code for recursive algo for compensation
   speed: N=5 S=4 nud=8+32 28s/threads vs. 0s, 12+28 30min vs. 0s (recursive)
   N=10 S=2 nud=12+28 31m/threads vs. 2s (factor 1000)
   N=14 S=3/2 nud=8+34 48s/threads vs. 1s (factor 48)
   N=20 S=1 nud=8+32 34s/threads vs. 8s (factor 4)
   N=40 noSym 8+32 120s/threads vs. 127s (factor 0.94 !!)
 - mv doc/example2.html doc/example_tU8.html 2013-03-13
 - ignore long lines (above 1022 chars) in daten.def (m_bcc N=250) 2013-03-04
 - struct shbuf thbuf changed, nhn,nhv[ahead] computation for _nhv 2013-02-07
 - add MPI code to a2 (2x2 method) (works for AH=1 only, ToDo) 2013-02-06
 - fix OOM problem for big MaxSyms (reducing static array) 2013-01-26
 - fix bug even/odd checkpoint badly set back for resume 2013-01-25
v2.44 2013-01-17 + 2013-05-02_diag.c + 2014-05-25_hilbert.c
 - fix bug in algo2 (2x2diag) faster for SH not stored + mem/2 2013-01-17
 - add start-token for send_from_all_nodes-to-node0 on wv (mpich OOM) 2012-11-14
 - fix bad MPI recv bufsize on parallel ns() causing signal 15 2012-11-14
 - get_maxscfg for parallel ns speedup disabled, bad algorithm
 - fix "access last element"-error for n1==0, improve outputs, 2012-11-05
 - fix wrong error in clrvec for n1 smaller than number of nodes 2012-10
 - infile: include + ":" + "pout" removed for simplicity 2012-10,
   infile: x[0-9]* replaced by xout=*, l* replaced by loadvec=* 2012-10
 - break on full symtables to avoid mass output (NoS1Sym + S=1-Lattices)
 - fix mistake in b_smallest ib2=ib1; s2=1; (no nzx for s=2)
 - fix problem on critical abort during checkpointing, use even/odd(n)
   instead of critical rename of chkpt n-1 to n, 2012-09-27
 - fix wrong complex phase output (1+(r-1))*phase (old: 1+(1-r))*p) 2012-08-03
 - add utils/io_latency.c to analyze storage speed (RAIDs, SSDs) 2012-07-05
 - simplify code + fix some tU lm-bugs (s1-triangle n=27*2 108bit) 2012-07-05
 - remove MaxMem (600MB), default is infty (limited by system) 2012-07-05
 - output cfgs as hex number (more compact), see N=54 sample 2012-07
 - change behavior on bad set nu,nd (try to keep smaller number) 2012-07
 - fix problem with kill after chkpt2 and disk caching l1 (missing sync)
   2012-06-26
 - speedup for parallel numsymconf ca 1..20 (s=1, small nu) 2012-06-24
   fix maxscfg() is now the maximum sym config
 - handle incomplete written chkpt6 (restore old ones)
   2012-06-21
 - handle incomplete l1-writes (for odd Bsize) on chkpt.resume 2012-06-20
 - check for possible changes of NN+Bsize after chkpt.resume 2012-06-20
 - fix problem with scmpich-lib, eating all the memory (and slowdown) on
   parallel_send_to_0/sequential_recv_from_all-sequence on numsymconf()
   2012-06-18
 - change checkpoint numbering, 2 substeps (change chkpt for resumed jobs!)
   old=0ns1nc2sh3ew45ev6 new=0ns23sh45ew67ev89 2012-06-11
 - check plausibility of chkpt1 before using it 2012-06-07
 - replace savevec.MPI_FILE_* by MPI_Send/Recv + task0-file-Ops 2012-06-06
   to work around older network file system (mpich+NFS?) locking problems
v2.43 2012-05-23 add checkpointing functionality for MPI code
 - save_mode renamed to chkpt_mode (unused) 2012-05-23
 - writing more checkpoint files tmp*/chkpt[4] (status renamed to chkpt1)
 - fix problems on parallel func. ns+next() for s=2/2++ systems 2012-05-16
 - using version.c for version_date (avoid unnecessary slow recompilation)
 - fix parallel vector save/load
 - improved checkpointing for mpi-jobs (2*USR2+USR1 + chkpt_time) 2012-05-15
   the problem: sometimes jobtime is limited to a maximum (max.
walltime) checkpointing is needed to stop the job in an orderly way and resume it later using options --chkpt_load=4 --chkpt_time=60 or similar (SH not stored only)
v2.42 Oct11-Apr12 2012-05-07 (buggy cjacobi transposed-EV until 2016-04)
- fix lapack related code 2012-05-05
- parallel ns() writes one file via task0, 2012-05-05 (see speed_mpi:asgard)
- for parallel ns(), write single file instead of 8*mpi*pt files (2012-05-02)
- ">= defined in vlint.h (for 65bit++) 2012-04
- fix 64bit n1 parallel computation problem for S1SYM JJ NN=64 (2012-03)
- fix and enable multi-threaded numsymcfg-code for s=1 (2012-01)
- add HEigensystem as replacement of real [[A,-B][B,A]] for cplx matrix, speedup about factor 5 (for complex matrix and fulldiag)
- more info on matrix image (matrix.pgm, verbose+=256[+512], a4)
- auto choose the faster method for scfg generation (a16 sets the opposite); that means the fast serial code is only chosen for pt_n*mpi_n=1; this decision may be bad for a small number of nodes or threads
- fix bad abort for mpi code and n1 above 2^32 (LC-41 n1=6.6e9 256nodes)
v2.41 Nov09-Oct11 (2.41b 2015-09-23) backport-fix bad MPI-test 32bit n1 of v2.42 2011-10-25 on 2015-09-23
add doc/lc.gpl example data for spin-1/2-afm-Heisenberg-chain N=40
add w_diag_op() replacing wop() for diagonal operators for speedup (JS-Oct11) this gives nearly the old speed for zizj without MPI ballast, see lc
better error handling if ./tmp/ is missing (JS-Oct11)
fulldiag: better error handling if malloc failed (JS-Aug11)
improved error handling for mpi code of loadvec() (JS-Mar11)
add mpi-code for loadvec() and minimalistic code for x_out() (JS-Mar11)
add CONFIG_Fidelity switch to compute overlap to last EV (JS-Mar11)
fix creation of ./tmp_shared on installation (JS-Sep10)
fix utils/defspin1.sh (output of positions as float) (JS-Nov09)
v2.40 2009-11-26
fix n1=0 problem for trivial case nu=0,nd=N where n1=1 is correct
add Zi output to fulldiag (a4), fix Zi in case of used ud-symmetry
add site-rotation-term
samples to spins.c (disabled by if_0_endif)
fix m_lattice.c + m_tilings.c output format (example3 did not work)
fix bug randvec=1 n1=1 (example: all spins up or down)
fix bug in op_sxsx, op_sisjsksl, op_jxjx, op_ninj, op_nisnjs, op_nis
add 4-spin operator op_mult_sisj_sksl(), rename op_sisjsk to op_diff_s*
fix bug for MPI where some nodes cannot store full matrix to memory
remove 11 char limit for input file name
v2.39 2009-04-20
fix wrong n1=0 for serial code a0, skip trivial case: nu=0, nd=nn
new option -m(default: daten.def)
new option -z for utils/def2fig.sh
models/lattice.c renamed to m_lattice.c, also new options added
all coordinates based on base coordinates (ex: 60degree for kagome)
add stretched kagome lattice
add CONFIG_SymSearch option to config.h (0: disable slow sym search)
reduce default static size (probably lower speed)
partly replace "\n ..." by "...\n" for better MPI output
Tru64 defines LONG_BIT instead of __WORDSIZE
v2.38 2009-02-11
fix possible deadlock for wop() (expectation values)
fix deadlock for x_sort() if last task has block length node_n1=0
fix early convergence abort for N=40 square j1=-1 j2=0.42
exe/tmp can point to local scratch space now (disk cluster)
exe/tmp_shared for shared scratch space (removed in later versions)
fix check of XY_TYPE bits for node_n1 (not n1)
fix error in mrule for mpi_n>1 (wrong results)
remove execution of spinsdef in Makefile, better for cluster
remove llong = "long long" for better MPI and ANSI compatibility, will be a problem for gcc -m32 (no C++, 32bit, NN -gt CM(32,16,16))
fix compile problems for CM=tU,tJ (Hubbard model, t-J-model)
v2.37 2008-09-17 (benchmark version)
add CFG_CPUSET option (set to 0x0005 on dual HT-Xeon boards)
fix free(static) bug in mpi version of ns()
output of SH_speed and i_speed in hnz/s
better configuration script for MPI
SH.t measures max. realtasktime which can be wrong for overload, fixed
better balance measuring from hnz[block] (max.
efficiency = t1/(n*tn))
v2.36 Jun08 2008-08-04
better default settings for mpi and pthread buffers
expectation values computed now using mpi (wop(op))
replace v0[thread] by v0 + b_ofs[thread] (better mpi code, fewer /%-ops) (one malloc, first touched by threads)
sort mpi data in SH once (like i100)
v2.35 2008-07-22
replace integer modulo operation (slow on IA64)
add anisotropy Jz(i,j) (as z$parameter_index to daten.def, default=0) for H = Sum(i,j) J(i,j)(Sx(i)Sx(j)+Sy(i)Sy(j)) + (J(i,j)+Jz(i,j))(Sz(i)Sz(j))
fix bug for moved b_smallest() b2i() (part of FPGA rewrite)
v2.34 2008-04-23
fix deadlock and errors for partly stored matrix using MPI
fix AddSS bug (was a factor of nw too big and too slow)
new option -o<outfile> for better job management (torque/PBS buffers stdout locally)
Fix: log2(n1=0) FP EXCEPTION for Tru64@alpha (-nanf on linux)
Fix: mymap.read(2GB+) failed on 64bit-systems, buggy since v2.33?
Fix: ini_thxy() hanging in endless loop if (int)2*n1<0 (square42)
v2.33 2008-03-16
mymap(ev) replaced by mymalloc(ev), else slow or hangs on NFSv3+MPI
Bug for mapped eigenvectors on MPI systems fixed (nev>0)
Bug for 32bit systems using mpi and mymap > 2GB fixed (for LFS)
norm2 vector removed, code simplified
Bug fixed, for maxfile = 0, was 10* slower (wrong number of stored blocks)
also output excitations for small n1 at "conv=" line
Bug fixed, maxfile too small or 0: wrong results (faster mpi)
Bug fixed, n1 < num_threads
less output for nodes*ppn > 8
algo:a0 replaced by a16, a0 is new fast ns(), a16 is old slow ns()
good scaling up to 128 CPUs tested
v2.32 2008-02-19
single thread workaround for S=1 systems, multithread sometimes computes n1 too big (ToDo: check for reason and fix it)
fix bug in fulldiag (a4) matrix generation
don't print pointer for verbose malloc for easier diff
bug fixed for maxfile=0 (incompletely stored H)
IA64 is very slow for div operations, MPI speed up 50%
MPI Data reduced, 30% speedup for 100Mbit
v2.31 2007-12-14
fix problems with model=tU for v2.27 or
later
fix problem of uninitialized values on tiny systems, where some threads have an empty cfg-table (since v2.27 and B_NUM>1)
multiple sublattice detection, better MPI scaling
v2.30 Dec07
STOREH=0: bug for 2nd run of storeh fixed, 0 is default now in config0.h
nzxmax fixed for multithread, STOREH=2 missing creation mode fixed
v2.29 2007-12-04
STOREH=0: realloc of more than half of main memory may fail, recoded
new sublattice generation, using defined bonds only (not correlations)
weight matrix of Ising energies for j1 and j2 bonds extended (3D plot)
v2.28 Nov07
2nd successful mpi-run (np<4 only for dsk>0%), hybrid MPI+PTHREAD
define CONFIG_DIMER_CHECK for artificial symmetry breaking (finite systems), only symmetries which are not explicitly set in daten.sym can be broken
first successful mpi-run (np=2..3 pt=1 only), but not useful (slow)
output of converged energies for better awk handling
-D STOREH=0 to store/read H into/from memory for max. performance
reduce cache (line) coherence overhead by replacing work.bi[thread]
for configurations output .oxQ for .ud3, which looks better
new sublattice (SL) generation (bi-/tripartite only)
macro Sud removed, switch off ud-symmetry by setting sym.wud=sym_ud=0
v2.27 2007-11-22
new performance data, good pthread scaling up to 32 (no disk used)
mrule speedup (5-100)x using symmetry, more comments
history.tex converted to history.html
set number of threads by option -tn (n={1..B_NUM})
noSBase is not supported anymore (use sym_k= -9999) for simplicity
update models/*.c according to modeldef.c (partly untested)
modeldef.c: only the new flexible format of daten.def is supported now
mrule parallelized, renaming korr to corr (en:correlation)
remove chk2 function for degenerate states (simplification)
remove wop_k, wop_t, wop_ud (simpler, can be taken from biggest coeff.)
renaming ckfg to ccfg (proper engl.), May07
output time in minutes (better to parse) + human abbreviation, Apr07
fix segfault for small LM-systems in wop_block function
status script removed, write PID-file instead (more flexible)
stack-output for SIGUSR1 removed for simplicity (better use gdb)
fix noncritical error in err_fulltable
v2.26 2007-02-27
h_get inlined (not done by the compiler), (2-5)x speedup for i100
better scaling of iteration (was bad before)
thread code simplified
B_NL2 removed, B_NUM used now to define #threads, code simplified
configure checks signal.h, fails on g++-4.1 on SLES9 at IA-64
four-site exchange added partly (must fulfill symmetry)
fast_a2 removed due to future adaptations (#CPU>1024, FPGA), may give wrong results for a2 using syms (LC6 was NaN, check it!)
remove HALF_HXY (store only upper half triangular H, problems with blocking, simplify code)
reduce files from B_NUM^2 to B_NUM (non-diagonal blocks carry only about 10% of diagonal blocks, wasted), problems with high file number if going to massively parallel
change output of ZiZj, SiSj and S^2 for a4 via sisj=...
(daten.i)
v2.25 2006-10-23
fix gcc-3.3.4 compiler error and warnings
bug fix for <SiSi>!=0.75 and NOS1SYM if sym_lm!=nn
a8 bug fixed (this was introduced by new parallel method)
better error report, if HRMAX is too small
symmetry search can be aborted by SIGUSR2+SIGUSR1 (useful for pyrochlore)
output a warning if maxfile limit was reached (ERR(630))
v2.24 2006-04-12
Output trace of H, sum of upper left nondiagonal elements of H and sum E as a check for correctness of matrix elements (see Bug below)
Bug a4-wrong results (!=a0) squago30 29+1
a0: wrong h12 for B_NL2>0 + CONFIG_noB_MASK + nommap, but H is ok
a4: wrong results for B_NL2>0 (fixed)
error message "unused bits are set" fixed for N=32 on int32-systems (RS)
tmp/tri.txt closed after usage (stated as memleak by valgrind)
performance data for 2-processor Dual-Core Opteron running Linux added
v2.23 2005-07-11
fix segfault for lapack + complex
add input format " p%d= %lf" for daten.i
hole-hole repulsion added for tJ+tU-model (h#)
bugfix pic2.cc (if edgevectors are negative only)
bugfix o2tower.sh (convergence warning led to duplicated values)
nev=0 for fulldiag using LAPACK (faster), less memory (cplx)
overflow dbgD for SSANISO fixed
make N=6 s=14 possible by defining Tnorm2 as double (default=long)
sisj now is a bitpattern (bit0=<sisj>,bit1=<ss>)
store_sym_tupel overflow for NN>nn fixed
v2.22 2005-05-06
utils/o2tower.sh adapted to new output format
memleak for fulldiag+pthread (fixed using valgrind),
vector.c iortho(): wrong degeneracy for complex vectors (fixed),
daten.def: change format of =pbcf= from A-B-C-D-A to A-B-C-D
script utils/def2fig.sh for xfig improved, new options
"code2" code removed, models/m_tU.c added for Hubbard chains
bug for tJ,tU+SBase removed (no nondiagonal elements since v2.20)
op_ninj now is (nu+nd)(i)*(nu+nd)(j) instead of nui*nuj
new expectation values (op_nisnjs) for tJ/tU-model
1st step to replace wopij() by wop() for any number of sites
definition of operators depending on more than
2 sites possible, see xval.c wop(), op_sisj2(), op_sisjsksl() (under work),
for s>1/2 all (2s*2s) intrabonds i-j have same correlation
old: s=5/2 ss=s*(s+1)=8.75 SiSi=0.75 SiSj=0.25 ss=5*SiSi+20*SiSj i!=j
new: s=5/2 ss=s*(s+1)=8.75 SiSi=SiSj=ss/25=0.35
NE0 replaced by ne0 in daten.i (more comfortable)
v2.21 2005-09-08
rename macro Zahl (German) to VecType, default to complex for c++,
convergence check improved (sometimes it stopped too early, pew=NEW=60),
max_NN enlarged from 127 to 32767,
speed up for s>1/2 emulation (about factor 2s)
include infile (new command in daten.i, max level 1)
v2.20 2004-03-26
error.tex translated to English (please correct bad English for me)
lintab renamed to hilbert (it's better named), buggy wopij fixed
makesym() removed, big vectors (.vec) moved to ./tmp (quota)
bug040217: nu==nd && Sud==0 && (!norm2)==ERR(600)
- adding local single site anisotropy (SSANISO, by Reimar Schmidt) h_file: testing STORE_XandY for future use (simpler code)
- N=32+8 XR: raw=hnz*5=66MB zip=34MB, XYR: raw=hnz*9=119MB zip=47MB
- try to change size of htmp by: export GZIP="-6" before starting spin
- add octahedron def-files to models
- add 16MB I/O-buffer (reducing file fragmentation a bit)
bug030620: complex + LAPACK + fulldiag uninitialized values sometimes leading to random results, fixed using valgrind-1.9.6
memleak fixed, parallel SiSj, using valgrind-1.9.6
v2.19 2004-02-27
Warning: there could be new bugs, only speed_test is checked by me
if norm2[] stored don't collect states with different orbit length (vec output),
H-blocks instead of H-stripes for better speed (local data, CPU cache)
no binary compatibility to tmp-files of older versions!
numsymconf uses max.
B_NUM threads (old: 16 threads)
bug: PTHREADS: storeh_block starts with uninitialized values (fixed)
bug: TBC + sym + nu==nd => -9.95702223 nonhermitian H also wx=0 (fixed)
v2.17 2003-04-24
pthread_attr explicitly set to PTHREAD_SCOPE_SYSTEM, because SunOS uses PTHREAD_SCOPE_PROCESS as default (all threads running on one CPU)
h_file.c completely rewritten for better performance (40%(x86)-200%(MIPS) faster), better use of CPU cache, much better speed! has much more potential!
bug: use of only one H-block (B_NL2=0) negative shift fixed
output sym-factor(norm2) (=orbital-length?) after configurations,
mmap offset must be multiple of _SC_PAGESIZE on some systems (fix),
bug: startvec=4 and if max_ea>1 and nev>0 and more than one run (fixed)
new: startvec=<4+rnd*8> (e.g. 12, 20, 28) for different startvectors
a0: max. n1 iterations (small n1)
v2.16 lapack usable for full diag (a4), try ./configure --lapack
2003-04-11
v2.15 speed_test added,
Warnings fixed: suggest explicit braces to avoid ambiguous "else",
bug fixed: r^2 was wrong (pos was set after daten.def was read),
bug fixed: (for EA>1 checkpoint not reset, wrong results for EA>1),
compatibility to cc, c89: Compaq C V6.4-014 (no //, see c89 -V)
sleeping time on error reduced, repeating errors not printed,
comments added in put_h()
v2.14 bug: forgotten close after reading tmp/tmph* (resulting in linux-2.4.10 crash, if ev=1 and a0 is repeated 1000 times) fixed
model-files with nn<NN accepted (no recompilation needed)
spins.tex further translated to English
v2.13 compiler error if float fixed
old Bug fixed, using valgrind-20020601 by Julian Seward, GNU (Bug: fulldiag a4 + float, since v1.8)
v2.12 mrule() in xval.c revised, set sublattice SL=...
for MRule use SL=...;a0;SL=...,a3 for different sublattices
v2.11 version is now real number (2nd dot removed, 2.101<2.11)
fulldiag: output E and S^2 as table (for TD, susceptibility)
precision depends on float
v2.1.0 vlint.h (C++class: very long int) added, allows any number of sites
68 sites tested, but tJ and tU not checked for correctness !
matrix dimension is still limited to 32 or 64bit value !
simplifications using C++ operations (code2 under construction)
v2.0.3 bug fixed (wrong matrix dimension on very small systems), quick-start (2 spins) added in documentation
v2.0.2 use ud+=+1,-1 and/or param+=0.1,0,0,0 to change parameters by constant steps more easily (e.g. in loops) to generate lots of data, "+="-form works for anisotropy too,
name of package is changed from spin to spinpack (more unique)
bugs fixed (noSBase,tJ(but slow),s=1,ferrimagnets)
v2.0.1 startvec changed according to v[i]<1.0 (16bit-real)
Warning: random startvec before v2.0.0 is not reproduced!
do not load/save l1 (it's mmapped to a file already)
save thxy_r-table all the time (it's only a small amount of data)
save_mode=1: bug save hxy fixed; write tmp/status n1 fixed
v2.0.0 remalloc uses fallback to mmap(tmpfile) => no memory limits (a0,a2)
1GB-File-split removed (use linux-LFS or 64bit or more blocks)
quash H-file if a reproducible error occurs during write (disk full?)
v1.9.3 mmap used to bypass memory+swap limits (use algo=2 to spare memory)
bzlib for h-zip (ca. 1/3 of size! (gz=1/2) but slower unzip 2*gz)
try lin() before get b_smallest, should save a lot of time
v1.9.2 use (unsigned Tnorm2)norm2, b_smallest, b_ifsmallest3 10% faster
algo=2 (2x2-diag) parallelized (around a0 speed!?)
v1.9.1 better error handling, lintab/numsymkonf changed (faster, but save_mode may not work)
v1.9.0 storeh(),read_h(),wzizj() parallelized using pthreads (configure -mpt), recommended for shared memory multiprocessing machines, using 3 to 4 threads is default (best number of threads depends on disk speed and size, it depends also on the value of the variable maxfile and on the CPU usage by other processes of course) and can be changed by defining B_NL2 in config.h (3 for 2^3 blocks), only lanczos-inc1 is parallelized so far (getting energy), speed-square36: numsymkonf=10x, storeh=2x, Hv_from_disk=1.1x
v1.8.3 pthread library used for multiprocessor machines, no MP_PRAGMAs
MIPSpro C++ v7.2.1 warnings fixed, configure --mpt added
v1.8.2 more output if sisj=1 (sisj-3zizj, num_of_same_bonds)
missing commas in param= for default daten.i fixed
v1.8.1 CONFIG_ABEL compiler error fixed, bug: read_sym if s1-sym fixed
div0 fixed
v1.8.0 missing close(tmp/tmph*.tmp) fixed, (fd-buf overflow after a lot of iterations)
fulldiag => ev=0 possible, no fatal error if jacobi does not converge
fulldiag output changed, some fixes
use CONFIG_TBC instead of IFww, TBC+noSym+noSud(!) fixed
daten.i: nud= and param= instead of :nu,nd,p1,p2,p3...
v1.7.20 h-field added (patch by Reimar Schmidt), add some corrections
bug in fulldiag (ev, since v1.7.3) fixed
v1.7.19 daten.def pos[x,y,z] is changed to double, m_1d.c added
v1.7.18 configure v0.4.1, configure --debug => stack-checking
v1.7.17 configure v0.4.0, bug in symmetry.c fixed (daten.sym + sym_k=-1)
v1.7.16 sbase->norm2 changed to llong allowing N=10 S=5/2 on 32bit machines
b_ifsmallest_lm() etc. added
bcc40s2 30u10d 31m/10s/18s/483s/509s => 0s/6s/12s/73s/90s i586-133MHz
S=1 and ud-sym or tJ,tU not tested
Apr2001
v1.7.15 maxfile is per block now to live with the 32bit limit
v1.7.14 symcreate now recursive, shorter and over 40x faster, ATTENTION: other generators!!!
no-degeneration bug fixed, documentation extended
v1.7.13 daten.sym: cyclic form possible, trying all permutations
v1.7.12 tJ,tU cc-errors, asm removed, n1==0 => no error
v1.7.11 new packaging
v1.7.10 h-buffer for writing, parallel (no speedup), models updated
v1.7.9 check_point=4 implemented (gzip faster (buffer))
get_h is now not parallel! change it!
Feb 2001
v1.7.8d bug removed for Sz=0, Sud not used; <Szi>=0 always!
v1.7.8c complex.h error if Zahl==4 xor 5 fixed?
v1.7.8,8b bug fixed, HIDX replaced by HRMAX>0 sizeof(thxy)=5 (8)
v1.7.7 HRMAX=0 replaces noHIDX, save_mode+checkpoint against crashes
v1.7.6 mp-bugs: hamilton(), problems with I/O + stack ???
v1.7.5 mp-bug removed: hamilton() v0x,v0y,.. must be local()
v1.7.3 vector divided in blocks, preparation for block version
v1.7.2 bug fixed in fast_nhv(), blocksize rounded to 2^n
v1.7.1 bugs removed: wrong sym if bond twice, h_stored<100%
v1.7.0 H is written/read on disk only (parallelizable)
v1.6.4 zlib used instead of pipe (no memory problems on thor???)
v1.6.3d bug removed, if v0,v1=NULL, small changes in inc1()
v1.6.3c HRMAX set in config.h, sisj=0 (daten.i)
v1.6.3b Tnorm2 can be set to char (less memory, see spins.h)
v1.6.3 bug in symmetry.c removed (delay, if daten.sym given)
v1.6.2 cache in pipe.c for better performance
v1.6.1 new pipe.c (IRIX64 needs twice the parent-mem for fork or popen)
v1.6.0c bug removed thor: -mp sym_k=-1000 Sz>0 numsymkonf is wrong
v1.6.0 TBC now also with symmetries!
v1.5.8 Twisted Boundary Conditions or Field (J^-(i+L)=e^(iw)J^-(i))
v1.5.7 fork sometimes fails, HRMAX set to 1024
v1.5.6 better error handling (daten.def, more output)
v1.5.5b bug for S=1 and sym_k!=0 removed (thanks J.R.)
v1.5.5 S1SYM built into numsymkonf() and next(), S=1 faster!
v1.5.4 no zombies, finally S1SYM faster and correct!?
v1.5.3 base2[],lintab removed, block_no(), nw<=Nw possible
automatic SUBLATTICE generation, function symkorrelationen()
v1.5.2 bug: pipe.c killed gzip before gzip finished writing
v1.5.1 Hamiltonian matrix compressed with gzip (see pipe.c,h_file.c, ca. 50%)
v1.5.0 model.h no longer exists! use daten.def only
v1.4.8 alternative creation of H, algo=8 a8 [n1max] (10x slower)
v1.4.7 doc/error.tex as a more detailed list of error descriptions
v1.4.6 ansatz.h/ansatz.c for variational ansatzes
v1.4.4 bug removed (writing hamilton twice caused errors)
defspin1.sh for generating lattices with Spin-1,Spin-3/2,etc.
v1.4.3 vectors can be kept on disk if RAM is low & C++ (array_m.h)
v1.4.2 now use the disk to store l1 (N=40 possible using 4GB mem)
Jan 2000
06.12.99 definition of HIDX in spins.h for storing indices to matrix values, less memory and disk, therefore faster, (+ 2*8*64k memory for table)
09.11.99 Bug in gcc on alphaPC (large struct object as function argument)
31.10.99 AddSS renamed to Add34, AddSS added as $S^2$ operator
07.06.99 anisotropy parameter for xy-component of Heisenberg-Hamiltonian (XXZ)
v1.4 01.02.99 anisotropy parameter for z-component of Heisenberg-Hamiltonian (XXZ)
sym_k= -n for omitting next n generated symmetries,
v1.4 23.01.99 no extra output file, better use stdout and filter via awk
Jan 1999
07.12.98 start vector for algo=1 can now be chosen correctly
30.11.98 Hamilton operator is stored in GB-sized chunks (h_file.c), this works around the 2GB file limit on 32bit UNIX
v1.3 algorithm a4 = full diag (works, bad implementation)
1998-Nov spins-lintab2 faulty for non-commuting incompatible symmetries, example: N=4 k=2/4 0/2 (sgn[udud]=0)
1998-295 spins-chk3: compute only distinct pair correlations
now correct Sz,S,ZMag for spin mixtures (S=1/2,S=1,...)
1998-293 sighand-ini: shell script status is created, use "sh status"
1998-270 m_bcc: now always R(i=0)=( 0 0 0 ), better for distance determination
1998-190 NE0=0 eigenvector=ground state, NE0=1 1st EV = 1st excitation etc.
determination of the equivalence classes (max. number of selectable k's)
1998-184 cplx+SBase: storeh >10 times faster, since local arrays are now static!
1998-183 number of eigenvectors can now be changed via nev in daten.i
1998-170 complex version tested for JJ-N=5-chain, k=1/5*2Pi, g++,Zahl=4
1998-154 tU,tJ with up/down separated (code2), now 64 sites (llong) possible!
bug removed, tU,tJ with SBase and k!=0 now correct
bug removed, number of symmetries: for cyclen(P0)!=2, P0^i was missing, etc
bug removed, wrong cycle length of the permutation (not Prod!)
1998-139 notation year-day_of_year, daten.i now with Sym_ud=<+1|-1> date '+%Y-%j'
18.05.98 bug in symkorrelation() removed, it produced unsymmetric correlations
08.04.98 version 1.2 up to N=64 JJ-Sites (long long), NoAbel, k!=0 works ?
15.01.98 lintab2(),wzizj() sped up for symmetry + correlation-symmetry ansatz
Jan 1998
15.12.97 Finally the breakthrough in symmetry search, now 0:01 instead of 8:01 for BCC30_30 (486DX4-100)! (tries the neighbor next)
12.12.97 use of non-commuting permutations works in the k=0 space (many 32-site systems can now be computed, ca. 20h,200MB)
OO.OO.97 parameters (J's,t's,U's etc.) are assigned via an index list, easier to handle, bonds classified
Jan 1997
08.01.96 ud symmetry can be used with Sud=1, H matrix is stored in hxy.tmp (tmp_hxy) ONLY if memory is short
10.07.95 rq: (H*v-<v|H*v>*v)^2 => |(H*v/<v|H*v>-v)|
19.04.95 _nhn=_nhv(m1=1) -> saving of _nhn
11.04.95 symmetrize -> v'=v +/- SymOp(v) (does not change the eigenvalue!!!)
11.04.95 simultaneous diag.
2A(-1,+1)+B(-1,+1)=(-2-1,-2+1,2-1,2+1), [A,B]=0
1995-Mar store H: tJ-16 factor 3.5 faster, tJ-16 131s/70It
14.02.95 introduction of symmetry quantum numbers (commuting permutations)
12.10.94 16-site tJ lanz=13m24s lanz2=13m42s
10.10.94 hamilton_nhv (new procedure) <n|H|v> l.615 tJ-8 1s
10.10.94 hamilton_nhn (new procedure) <n|H|n> l.695 tJ-8 24s
27.09.94 lanz2 added (new method) l.866 tJ-8 98s
29.06.94 diploma version finished (without symmetries) tJ-8 3s/6s