NWChem computation: multi-CPU run never converges



9:23:08 AM PDT - Fri, Aug 3rd 2012
Just ran it with 16 processors. I would remove the "io replicated" from the input (that's what I did).

Also, I seriously doubt you can build one binary that will work on all platforms. We have also never tested running MPI-SPAWN between heterogeneous nodes on different clusters.

Bert

[QUOTE=Jeronimo Jul 31st 12:01 pm]
Dear Bert,

first, sorry for my late response (I was on vacation).

Quote:Bert Jul 23rd 9:50 am
I do not believe this is a compile issue but rather a run issue. How much memory do you have per processor (not per node, but per core)? The memory keyword in NWChem is per core. Could you try running on multiple processors with the memory keyword set to maybe 1000 mb at most to see if this works?

I run the computation on various nodes in our computing infrastructure, ranging from nodes with fewer CPUs (Dual Core AMD Opteron 885, 16 cores and 64GB of memory) to SMP nodes (Intel Xeon E7 4860, 80 cores, 512GB of memory). On each of these clusters, I had 4 cores and 50GB of memory reserved on a single node. No matter which machine I use, all the computations fail in the same fashion (described initially). (I've also tried specifying just 1GB of memory as you suggested; however, this resulted in the same error.)

To illustrate, here are the run logs from the SMP node (80 cores, 512GB of memory), where the computation had 4 CPUs and 50GB of memory reserved by our scheduling system: the single-CPU (successful) computation, run-single.out, and the multi-CPU (failing) computation, run-multi.out (the failure to converge is visible starting at line 1031).

Quote:Bert Jul 23rd 9:50 am
What is the hardware you are running on that forces you to use MPI-SPAWN?

Since the infrastructure we run is quite heterogeneous, some clusters have an Infiniband interconnect and some do not. Thus, I decided to use MPI-SPAWN so that a single build would be runnable on all the machines we operate. I hope this is a correct deduction...

Thanks very much for any advice.

--best
Tom Rebok, MetaCentrum NGI, Czech Republic.

PS: Could anybody try running the computation above in parallel and let me know whether it converges?

PS2: Maybe it is somehow related to the problem described here...
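
For reference, the per-core memory arithmetic discussed in this thread can be sketched roughly as follows. This is only an illustration, assuming the 50GB reservation is split evenly among the 4 processes; the helper names `per_core_memory_mb` and `memory_directive` are hypothetical and not part of NWChem. Only the `memory total <n> mb` line corresponds to an actual NWChem input directive.

```python
def per_core_memory_mb(reserved_gb, cores):
    """Memory available to each process in MB, assuming an even split
    of the node reservation (an assumption, not something NWChem enforces)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return int(reserved_gb * 1024 // cores)


def memory_directive(mb):
    """Format an NWChem 'memory' input line; the value applies per core."""
    return f"memory total {mb} mb"


# Reservation quoted in the thread: 4 cores and 50GB on a single node.
budget_mb = per_core_memory_mb(50, 4)

# Bert's suggestion was to try at most about 1000 mb per core,
# so cap the request at that value even though the budget is larger.
suggested_mb = min(1000, budget_mb)

print(f"per-core budget: {budget_mb} mb")
print(memory_directive(suggested_mb))
```

The point of the cap is that the value in the input file is requested by every process, so it has to fit within the per-core share rather than the whole node reservation.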