"inconsistency processing clusterinfo" error when trying to use multiple cluster nodes...


Clicked A Few Times
Any thoughts would be welcome. Here's my situation:
1) Heterogeneous commodity cluster with gigabit ethernet interconnects.
2) NWChem version Nwchem-6.5.revision26243-src.2014-09-10, with patches applied (maybe one did not work?)
3) Compilation environment:
OS - Ubuntu 14.04.2 LTS

export PATH=$PATH:/opt/intel/bin
export NWCHEM_TOP=/shared/nwchem/Nwchem-6.5.revision26243-src.2014-09-10
export NWCHEM_TARGET=LINUX64
export LARGE_FILES=TRUE
export ENABLE_COMPONENT=yes
export NWCHEM_MODULES="all python"
export NWCHEM_MPIF_WRAP=/usr/bin/mpif90
export NWCHEM_MPIC_WRAP=/usr/bin/mpicc
export NWCHEM_MPICXX_WRAP=/usr/bin/mpicxx
export USE_NOFSCHECK=Y
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_INCLUDE="-Wl,-Bsymbolic-functions -Wl,-z,relro -I/usr/include/mpich -I/usr/include/mpich"
export MPI_LIB="-L/usr/lib/x86_64-linux-gnu"
export LIBMPI="-lmpichf90 -lmpich -lopa -lmpl -lrt -lcr -lpthread"
export FC=gfortran
export CC=gcc
export CXX=g++
export ARMCI_NETWORK=SOCKETS
# export MSG_COMMS=MPI
export BLASOPT=" "
export PYTHON_EXE=/usr/bin/python
export PYTHONVERSION=2.7
export USE_PYTHON64=yes
export PYTHONCONFIGDIR=config-x86_64-linux-gnu
export PYTHONPATH=/usr/lib/python2.7/dist-packages
export PYTHONHOME=/usr
export PYTHONLIBTYPE=so
export CCSDTQ=y
export CCSDTLR=y
export IPCCSD=y
export EACCSD=y

4) Error when launching across multiple nodes (2 in this case). The same simple geometry optimization of formaldehyde works on a single node with the same total number of cores.

0:inconsistency processing clusterinfo: 1
(rank:0 hostname:<some non printable characters>pid:25050):ARMCI DASSERT fail. ../../ga-5-3/armci/src/common/clusterinfo.c:process_hostlist():203 cond:0
Last System Error Message from Task 0:: Connection refused
0:aborting

I notice that the node name is mangled (it should be either node15 or node2). Is there something in the ga-5-3 code that is messing up offsets when reading strings?
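One way to see whether the mangling is already present in the name the OS reports, rather than introduced later, is to look at the raw bytes. This is a minimal sketch; `has_only_printable` is a hypothetical helper, not part of GA or ARMCI, and `uname -n` is used as a portable stand-in for `/bin/hostname`.

```shell
#!/bin/sh
# has_only_printable: succeed if the argument holds only printable ASCII.
# Hypothetical helper for illustration; not part of NWChem or GA.
has_only_printable() {
    # Delete every printable character; whatever remains is suspicious.
    leftover=$(printf '%s' "$1" | LC_ALL=C tr -d '[:print:]')
    [ -z "$leftover" ]
}

name=$(uname -n)   # node name as reported by the kernel
if has_only_printable "$name"; then
    echo "node name looks clean: $name"
else
    echo "node name contains non-printable bytes:"
    printf '%s' "$name" | od -c   # dump the raw bytes for inspection
fi
```

If the name is clean on every node, the corruption more likely happens when the host list is exchanged between ranks than in the system configuration.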

I'd welcome any ideas..

Would I be better off with one of the development releases?

Thanks,
Jonathan

Clicked A Few Times
The "1." in front of export MSG_COMMS=MPI in the environment script is a typo...
The wiki formatting turned the pound sign (#) that comments out that line into "1."; the line is commented out in my actual script. I hadn't noticed the formatting issue.

Sorry,
Jonathan

Forum Vet
Jonathan
The ARMCI port corresponding to ARMCI_NETWORK=SOCKETS is pretty broken when using more than one node.
You should set ARMCI_NETWORK=MPI_TS instead.
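A small guard in the build script can flag the setting before a long compile. This is only a sketch: `check_armci_network` is a hypothetical helper, and it knows only the three values mentioned in this thread, so any other value is reported as not covered rather than as invalid.

```shell
#!/bin/sh
# check_armci_network: comment on an ARMCI_NETWORK value.
# Hypothetical helper; it only covers values discussed in this thread.
check_armci_network() {
    case "$1" in
        MPI_TS|MPI_MT)
            echo "ok: $1" ;;
        SOCKETS)
            echo "warning: SOCKETS is reported as unreliable on more than one node" ;;
        *)
            echo "not covered here: $1" ;;
    esac
}

# Falls back to SOCKETS (the value from the first post) when unset.
check_armci_network "${ARMCI_NETWORK:-SOCKETS}"
```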

Clicked A Few Times
SOCKETS versus MPI_TS...I think I tried that, but will again...
I'm pretty sure I tried both MPI_TS and MPI_MT instead of sockets. Those wouldn't even run properly on a single node. However, I will try again and report back.

Thanks

Forum Vet
Jonathan
I agree with you that the other ARMCI ports might be affected by the same problem since they might use the same clusterinfo code. We might need to debug this issue. What is the value that you get from the command /bin/hostname on the nodes of your cluster?

Clicked A Few Times
Tried with ARMCI_NETWORK=MPI_TS, different error...
So first, to answer the question above: running hostname on any node in my cluster returns 'node#', where # is the integer node number, so the nodes are named node1, node2, node3, ... The /etc/hosts file on every node also lists 'Node#' as an alternate name/alias for each node.

I can 'ssh node# <cmd>' from any node to any other using either the lowercase or capitalized version of the names.
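Beyond the names themselves, it may be worth checking which address each name is bound to. Debian and Ubuntu installers commonly map the machine's own name to 127.0.1.1 in /etc/hosts; whether that is the case on this cluster is an assumption, not something shown in the thread. The sketch below lists non-localhost names bound to a loopback address; `loopback_names` is a hypothetical helper.

```shell
#!/bin/sh
# loopback_names: read hosts-format text on stdin and print every name,
# other than "localhost", that is bound to a 127.x.x.x address.
# Hypothetical helper for illustration.
loopback_names() {
    awk '
        { sub(/#.*/, "") }              # strip trailing comments
        $1 ~ /^127\./ {
            for (i = 2; i <= NF; i++)
                if ($i != "localhost") print $i
        }'
}

if [ -r /etc/hosts ]; then
    loopback_names < /etc/hosts
fi
```

Any node name printed here resolves to loopback on that machine, which is fine for ssh but could confuse a library that exchanges addresses between nodes. Treat this as one thing to rule out, not a diagnosis.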

Here's the error I got using MPI_TS as my compilation choice. This time it did run properly on a single node.


nwchem: ../../ga-5-3/comex/src-mpi/comex.c:1359: comex_init: Assertion `0 == status' failed.

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
nwchem: ../../ga-5-3/comex/src-mpi/comex.c:1359: comex_init: Assertion `0 == status' failed.

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
nwchem: ../../ga-5-3/comex/src-mpi/comex.c:197: _mq_test: Assertion `0 == rc' failed.

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
nwchem: ../../ga-5-3/comex/src-mpi/comex.c:197: _mq_test: Assertion `0 == rc' failed.

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
nwchem: ../../ga-5-3/comex/src-mpi/comex.c:197: _mq_test: Assertion `0 == rc' failed.
nwchem: ../../ga-5-3/comex/src-mpi/comex.c:197: _mq_test: Assertion `0 == rc' failed.

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7FF702970777
#0  0x7FF1EA30D777
#1  0x7FF1EA30DD7E
#1  0x7FF702970D7E
#2  0x7FF1E9C5FD3F
#2  0x7FF7022C2D3F
#3  0x7FF7022C2CC9
#3  0x7FF1E9C5FCC9
#4  0x7FF7022C60D7
#4  0x7FF1E9C630D7
#5  0x7FF7022BBB85
#5  0x7FF1E9C58B85
#6  0x7FF7022BBC31
#6  0x7FF1E9C58C31
#0  0x7FC24BA6B777
#1  0x7FC24BA6BD7E
#2  0x7FC24B3BDD3F
#3  0x7FC24B3BDCC9
#4  0x7FC24B3C10D7
#5  0x7FC24B3B6B85
#6  0x7FC24B3B6C31
#0  0x7F77D791F777
#1  0x7F77D791FD7E
#2  0x7F77D7271D3F
#3  0x7F77D7271CC9
#4  0x7F77D72750D7
#5  0x7F77D726AB85
#6  0x7F77D726AC31
#0  0x7F829398B777
#1  0x7F829398BD7E
#2  0x7F82932DDD3F
#3  0x7F82932DDCC9
#4  0x7F82932E10D7
#5  0x7F82932D6B85
#6  0x7F82932D6C31
#0  0x7FEB98228777
#1  0x7FEB98228D7E
#2  0x7FEB97B7AD3F
#3  0x7FEB97B7ACC9
#4  0x7FEB97B7E0D7
#5  0x7FEB97B73B85
#6  0x7FEB97B73C31
#7  0x4B71A07 in _mq_test at comex.c:197
#8  0x4B73154 in comex_barrier at comex.c:1208
#9  0x4B735CF in comex_init at comex.c:1395
#10  0x4B7369F in comex_init_args at comex.c:1411
#11  0x4B6E7E5 in PARMCI_Init_args at armci.c:178
#12  0x4B3A42A in install_nxtval
#13  0x4B3A1CD in tcgi_alt_pbegin
#14  0x4B3A235 in tcgi_pbegin
#15  0x4B38F1B in pbeginf_
#16  0x54551D in nwchem at nwchem.F:84
#7  0x4B73622 in comex_init at comex.c:1359 (discriminator 1)
#8  0x4B7369F in comex_init_args at comex.c:1411
#9  0x4B6E7E5 in PARMCI_Init_args at armci.c:178
#7  0x4B71A07 in _mq_test at comex.c:197
#8  0x4B73154 in comex_barrier at comex.c:1208
#9  0x4B735CF in comex_init at comex.c:1395
#10  0x4B7369F in comex_init_args at comex.c:1411
#11  0x4B6E7E5 in PARMCI_Init_args at armci.c:178
#10  0x4B3A42A in install_nxtval
#11  0x4B3A1CD in tcgi_alt_pbegin
#12  0x4B3A42A in install_nxtval
#12  0x4B3A235 in tcgi_pbegin
#13  0x4B3A1CD in tcgi_alt_pbegin
#13  0x4B38F1B in pbeginf_
#14  0x4B3A235 in tcgi_pbegin
#14  0x54551D in nwchem at nwchem.F:84
#15  0x4B38F1B in pbeginf_
#16  0x54551D in nwchem at nwchem.F:84
#7  0x4B73622 in comex_init at comex.c:1359 (discriminator 1)
#7  0x4B71A07 in _mq_test at comex.c:197
#7  0x4B71A07 in _mq_test at comex.c:197
#8  0x4B73154 in comex_barrier at comex.c:1208
#8  0x4B7369F in comex_init_args at comex.c:1411
#9  0x4B735CF in comex_init at comex.c:1395
#8  0x4B73154 in comex_barrier at comex.c:1208
#10  0x4B7369F in comex_init_args at comex.c:1411
#9  0x4B735CF in comex_init at comex.c:1395
#10  0x4B7369F in comex_init_args at comex.c:1411
#11  0x4B6E7E5 in PARMCI_Init_args at armci.c:178
#9  0x4B6E7E5 in PARMCI_Init_args at armci.c:178
#11  0x4B6E7E5 in PARMCI_Init_args at armci.c:178
#12  0x4B3A42A in install_nxtval
#12  0x4B3A42A in install_nxtval
#10  0x4B3A42A in install_nxtval
#13  0x4B3A1CD in tcgi_alt_pbegin
#13  0x4B3A1CD in tcgi_alt_pbegin
#11  0x4B3A1CD in tcgi_alt_pbegin
#14  0x4B3A235 in tcgi_pbegin
#12  0x4B3A235 in tcgi_pbegin
#14  0x4B3A235 in tcgi_pbegin
#13  0x4B38F1B in pbeginf_
#15  0x4B38F1B in pbeginf_
#15  0x4B38F1B in pbeginf_
#14  0x54551D in nwchem at nwchem.F:84
#16  0x54551D in nwchem at nwchem.F:84
#16  0x54551D in nwchem at nwchem.F:84


Clicked A Few Times
A simple mpi program that reports back which node it is launched on does work...
I tried the simple hello world example code at www.mpitutorial.com. That works across all 14 active nodes on my cluster.
Jonathan
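The same placement check can be done without compiling anything by launching `hostname` under the MPI launcher and tallying the output. This is a minimal sketch; `ranks_per_host` is a hypothetical helper, and the sample input stands in for real launcher output.

```shell
#!/bin/sh
# ranks_per_host: read one host name per line (one line per MPI rank)
# and print "<host> <count>" for each distinct host.
# Hypothetical helper for illustration.
ranks_per_host() {
    LC_ALL=C sort | uniq -c | awk '{ print $2, $1 }'
}

# Sample input standing in for the output of a 4-rank launch.
printf 'node2\nnode15\nnode2\nnode15\n' | ranks_per_host
```

With MPICH's Hydra launcher, usage might look like `mpiexec -n 4 -hosts node2,node15 hostname | ranks_per_host`; the node names are the ones from this thread, and the rank count is arbitrary. Note that this only confirms process launch across nodes, not that inter-node MPI messaging works.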

Forum Vet
Jonathan
Something seems to be going wrong in the compilation of the tools directory.
Could you please upload the following file to a website so that we can access it?
$NWCHEM_TOP/src/tools/build/config.log
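Before uploading, the most relevant part of that file can be pulled out locally. Autoconf records the configure command line in config.log on a line that starts with two spaces and a dollar sign; the sketch below prints it. `configure_invocation` is a hypothetical helper, and the fallback path when NWCHEM_TOP is unset is an assumption.

```shell
#!/bin/sh
# configure_invocation: print the configure command line(s) recorded in
# an Autoconf config.log (lines of the form "  $ ./configure ...").
# Hypothetical helper for illustration.
configure_invocation() {
    grep -E '^  \$ ' "$1" | sed 's/^  \$ //'
}

log="${NWCHEM_TOP:-.}/src/tools/build/config.log"
if [ -r "$log" ]; then
    configure_invocation "$log"
    # First few lines that mention MPI, with line numbers, for context.
    grep -n -i 'mpi' "$log" | head -n 20
else
    echo "no readable config.log at $log"
fi
```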

Forum Vet
Could you send me the output you get on your Ubuntu system of the following commands

cat /usr/share/mpi-default-dev/debian_defaults

/usr/bin/mpif90 -show

which mpif90

Forum Vet
Jonathan
I have just noticed that in your first post you wrote
"Heterogeneous commodity cluster ..."
Does this mean that the compute nodes do not share the same HW/SW environment as the node where you have compiled NWChem?

Clicked A Few Times
I'll get back with the debian_defaults and mpif90 -show (that's how I set up the environment).

To explain the heterogeneous cluster:
1) The whole cluster is a mix of AMD and Intel processors. However, the tests I am running use either two identical i7 nodes with 32GB RAM each, or three i7 nodes, one of which has 64GB RAM.
2) All are running the same Linux kernel and have the same version of mpich (3.04) installed.

Jonathan

Clicked A Few Times
mpif90 -show and which mpif90 results...
As I did not compile mpich from source but used the Debian binary package (apt-get install mpich), I do not have a /usr/share/mpi-default-dev/debian_defaults file (the mpi-default-dev directory does not exist). Should I recompile from source? I do have the other results:
$ mpif90 -show
gfortran -Wl,-Bsymbolic-functions -Wl,-z,relro -I/usr/include/mpich -I/usr/include/mpich -L/usr/lib/x86_64-linux-gnu -lmpichf90 -lmpich -lopa -lmpl -lrt -lcr -lpthread

$ which mpif90
/usr/bin/mpif90

Jonathan
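To compare the wrapper output against the MPI_INCLUDE, MPI_LIB and LIBMPI settings from the first post, the `-show` line can be split by flag type. This is a sketch only: `split_mpi_flags` is a hypothetical helper, it ignores the compiler name and the `-Wl,...` linker options, and it makes no claim about which of those options NWChem expects in each variable.

```shell
#!/bin/sh
# split_mpi_flags: group the tokens of an "mpif90 -show" line into
# include paths (-I), library paths (-L) and libraries (-l).
# Hypothetical helper; other tokens are ignored.
split_mpi_flags() {
    inc="" libdir="" libs=""
    for tok in $1; do
        case "$tok" in
            -I*) inc="$inc $tok" ;;
            -L*) libdir="$libdir $tok" ;;
            -l*) libs="$libs $tok" ;;
        esac
    done
    echo "MPI_INCLUDE=${inc# }"
    echo "MPI_LIB=${libdir# }"
    echo "LIBMPI=${libs# }"
}

# The line reported above by "mpif90 -show".
split_mpi_flags "gfortran -Wl,-Bsymbolic-functions -Wl,-z,relro -I/usr/include/mpich -I/usr/include/mpich -L/usr/lib/x86_64-linux-gnu -lmpichf90 -lmpich -lopa -lmpl -lrt -lcr -lpthread"
```

On a machine with the wrapper installed, the argument could instead be `"$(mpif90 -show)"`.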

Clicked A Few Times
This is definitely an mpi problem as I run other distributed code...
I'm pretty convinced I have an MPI setting problem.
A little more information about my cluster. I regularly run distributed GAMESS-US code on this cluster. I do not use MPI (although it is an option) with this code because GAMESS-US seems to run somewhat faster using their old-fashioned sockets communication (it was also easier to set up). We have also run trajectory software on the cluster using an older MPI (no longer installed).

My main interest in NWChem is that it can do distributed tce_cc calculations, whereas GAMESS is limited to single-node computations for most CC calculations.

