CPU load drops very low


  • Guest -
Hi

Apologies if this has been discussed. I am running NWChem on a Linux cluster (12 cores/24 threads)
using the following command:

/opt/openmpi/bin/mpirun -np 24 nwchem bv6-esp-res.nw >& bv6-esp-res2.out &

The job starts normally on all threads with high CPU load (~100%), but after a few seconds the CPU load
drops very low, to ~2-15%. Also, the number of processes appears to decrease and fluctuate.

Is this behaviour normal?

Thanks in advance
George

Forum Regular
Hi,

Can you post or send me your input file?

Thanks.
-Niri
niri.govind@pnnl.gov

  • Guest -
George,

I strongly suggest running one process per core, not two; for one thing, it reduces memory and communication pressure. Also, I don't know how much data is written to disk, which could affect the CPU usage.
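
For example, something along these lines (keeping your file names) launches 12 processes, one per physical core:

/opt/openmpi/bin/mpirun -np 12 nwchem bv6-esp-res.nw >& bv6-esp-res2.out &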

The number of processes should be constant.

Bert


  • Guest -
Thanks for your reply. I am now trying to run NWChem on an Apple cluster. After a day or so of running,
the job aborts with the following message:

[xgrid-node05:95464] *** Process received signal ***
[xgrid-node05:95464] Signal: Segmentation fault (11)
[xgrid-node05:95464] Signal code: Address not mapped (1)
[xgrid-node05:95464] Failing at address: 0x17cd0a000
[xgrid-node05:95464] [ 0] 2 libSystem.B.dylib 0x00007fff8274f66a _sigtramp + 26
[xgrid-node05:95464] [ 1] 3  ??? 0x000000010de39d90 0x0 + 4527988112
[xgrid-node05:95464] [ 2] 4 nwchem 0x00000001025d4406 ga_dadd_ + 14
[xgrid-node05:95464] [ 3] 5 nwchem 0x00000001001a4959 diis_hamwgt_ + 345
[xgrid-node05:95464] [ 4] 6 nwchem 0x00000001001a3c99 diis_driver_ + 585
[xgrid-node05:95464] [ 5] 7 nwchem 0x000000010018a810 dft_scf_ + 16128
[xgrid-node05:95464] [ 6] 8 nwchem 0x0000000100186021 dft_main0d_ + 7728
[xgrid-node05:95464] [ 7] 9 nwchem 0x00000001002e7ad7 nwdft_ + 2936
[xgrid-node05:95464] [ 8] 10 nwchem 0x00000001002e7f81 dft_energy_ + 68
[xgrid-node05:95464] [ 9] 11 nwchem 0x0000000100008e92 task_energy_doit_ + 840
[xgrid-node05:95464] [10] 12 nwchem 0x000000010000a7d8 task_energy_ + 610
[xgrid-node05:95464] [11] 13 nwchem 0x0000000100014082 task_ + 3660
[xgrid-node05:95464] [12] 14 nwchem 0x000000010000312b MAIN__ + 1404
[xgrid-node05:95464] [13] 15 nwchem 0x000000010262586e main + 14
[xgrid-node05:95464] [14] 16 nwchem 0x0000000100001804 start + 52

Is this a memory problem? Should I recompile with export LARGE_FILES=TRUE?
George

  • Guest -
Same problem
I have the same problem running with MPI, but I noticed the drop starts before the SCF iterations and persists during them.
Also, this seems to happen only for systems (even S.P.E. runs) with more than roughly 380 basis functions.

Jonathan

Just Got Here

The following worked for me: adding the line

 semidirect memsize 200000000 filesize 0

which keeps the integrals in memory (up to the given size) instead of writing them to disk.
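
For reference, my understanding is that this directive goes inside the SCF (or DFT) block of the input and that the sizes are given in 64-bit words, so 200000000 corresponds to roughly 1.6 GB; something like:

scf
  semidirect memsize 200000000 filesize 0
end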

Gets Around
I guess the problem you see is NWChem's I/O: check the thread http://nwchemgit.github.io/Special_AWCforum/st/id271/junk_files%3A_what_are_they.h...

  • Guest -
Thanks for this. Indeed, it was an I/O problem. It occurred because I had the scratch directory located on the
head node while running the job on another node.
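
In case it helps anyone else: the relevant top-level directive is scratch_dir, which can be pointed at a node-local filesystem (the path below is just a placeholder for whatever is local on your compute nodes):

scratch_dir /local/scratch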

