Problem running NWChem 6.0


Hi,

I compiled NWChem 6.0 with the following script, build_nwchem:

#!/bin/bash
#export USE_GPROF=yes
export USE_SUBGROUPS=yes
export USE_MPI=yes
#export OLD_GA=yes
export MSG_COMMS=MPI
export USE_PYTHON64=yes
export MPI_LOC=/pkg/mpi/gcc/mvapich2-1.6
export MPI_INCLUDE=$MPI_LOC/include
export MPI_LIB=$MPI_LOC/lib
export LIBMPI="-lfmpich -lmpich -lpthread" # MPICH2 1.2
#export LIBMPI="-lmpichf90 -lmpich -lmpl -lpthread" # MPICH2 1.3.1
export PYTHONHOME=/usr
export PYTHONVERSION=2.6
export NWCHEM_TOP=`pwd`
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES="all python"
export NWCHEM_EXECUTABLE=$NWCHEM_TOP/bin/LINUX64/nwchem
export PYTHONPATH=./:$NWCHEM_TOP/contrib/python/
cd $NWCHEM_TOP/src
make DIAG=PAR FC=gfortran CC=gcc nwchem_config
make DIAG=PAR FC=gfortran CC=gcc $1
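
The last line passes the script's first argument straight through to make, so the script can be run with no argument (plain "make" after nwchem_config) or with an explicit target. A rough sketch of the invocation, where the log-file name is only an example:

# run from the top of the NWChem 6.0 source tree (the script sets
# NWCHEM_TOP from `pwd`); $1 is left empty here, so the default make
# target is built after nwchem_config
chmod +x build_nwchem
./build_nwchem > build_nwchem.log 2>&1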

The executable runs normally on our cluster when a single computing node (48-core SMP) is
used; however, when two or more nodes are used, I get this error message:



It appears that tasks allocated on the same host machine do not have
consecutive message-passing IDs/numbers. This is not acceptable
to the ARMCI library as it prevents SMP optimizations and would
lead to poor resource utilization.

Please contact your System Administrator or, if you can, modify the MPI
message-passing job startup configuration.

Last System Error Message from Task 0:: No such process

nwchem:4066 terminated with signal 11 at PC=29b4054 SP=7fff3b45c880. Backtrace:
./nwchem(_armci_buf_get+0x56)[0x29b4054]
./nwchem(_armci_buf_get_clear_busy+0x1f)[0x29b4b6b]
./nwchem(armci_serv_quit+0x4d)[0x29b3496]
./nwchem(armci_wait_for_server+0x28)[0x29ae47a]
./nwchem(ARMCI_Cleanup+0x5d)[0x299da51]
./nwchem(armci_abort+0x1a)[0x299dbc9]
./nwchem(dassertp_fail+0xbf)[0x299eb29]
./nwchem[0x29a31ba]
./nwchem(armci_init_clusinfo+0x193)[0x29a380b]
./nwchem(PARMCI_Init+0x52)[0x299e154]
./nwchem(PARMCI_Init_args+0x3a)[0x299df48]
./nwchem(ARMCI_Init_args+0x1d)[0x29a5725]
./nwchem(install_nxtval+0x55)[0x29bf375]
./nwchem(ALT_PBEGIN_+0xa5)[0x29be5c5]
./nwchem(PBEGIN_+0x1c)[0x29be60c]
./nwchem(pbeginf_+0x16f)[0x29be2df]
./nwchem(MAIN__+0x27)[0x5605e4]
./nwchem(main+0x2c)[0x29c115c]
/lib64/libc.so.6(__libc_start_main+0xe6)[0x2b8cff88dbc6]
./nwchem[0x55f2d5]
MPI process (rank: 0) terminated unexpectedly on alps6-01.cluster.nchc.org.tw
Exit code -5 signaled from alps6-01
handle_mt_peer: fail to read...: Success
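
If I read the ARMCI message correctly, it requires that all ranks placed on the same node be numbered consecutively, i.e. a block layout of ranks over nodes rather than a round-robin one. With mvapich2's mpirun_rsh launcher, I believe that corresponds to a hostfile in which all entries for one node are grouped together, roughly as sketched below (node names, process counts, input file, and paths are only placeholders):

# hypothetical hostfile: the 48 entries for the first node all come before
# the 48 entries for the second node, so ranks 0-47 share one host and
# ranks 48-95 share the other (block placement), instead of alternating
for i in $(seq 1 48); do echo alps6-01; done  > hosts
for i in $(seq 1 48); do echo alps6-02; done >> hosts

# launch across both nodes with mvapich2's mpirun_rsh
/pkg/mpi/gcc/mvapich2-1.6/bin/mpirun_rsh -np 96 -hostfile hosts ./nwchem input.nw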




What can be done to avoid this problem? Each computing node of the cluster has 48 cores
and 128 GB of RAM, and the nodes are connected by a QDR InfiniBand network.
mvapich2-1.6 was built with the QLogic InfiniBand driver. Thanks for any suggestions.

Jyh-Shyong