int/Integer argument type mismatch inside peigs is causing segmentation fault


Clicked A Few Times
Most of peigs calls function reduce_maps without a complete prototype, and in many cases with a literal zero for argument nC.

src/c/pdcomplex.c:  extern void     pdiff(), xstop_(), pgexit(), mapdif_(), reduce_maps_();
src/c/pdcomplex.c:   reduce_maps( msize, mapZ, *n, mapZ, 0, mapZ, &nn_proc, proclist );
src/c/pdspeval.c:    extern void     pdiff(), xstop_(), pgexit(), mapdif_(), reduce_maps_();
src/c/pdspeval.c:    reduce_maps( *n, mapA, *n, mapZ, 0, mapZ, &nn_proc, proclist );
src/c/pdspevx.c:    extern void     pdiff(), xstop_(), pgexit(), mapdif_(), reduce_maps();
src/c/pdspevx.c:    reduce_maps( *n, mapA, *n, mapZ, 0, mapZ, &nn_proc, proclist );
src/c/pdspev_c.c:  extern void     pdiff(), xstop_(), pgexit(), mapdif_(), reduce_maps();
src/c/pdspev_c.c:   reduce_maps( *n, mapA, *n, mapZ, 0, mapZ, &nn_proc, proclist );
src/c/pdspgvx.c:  extern void     reduce_maps();
src/c/pdspgvx.c:  reduce_maps( *n, mapA, *n, mapB, *n, mapZ, &nn_proc, proclist );
src/c/pdspgv_c.c:    extern void     reduce_maps();
src/c/pdspgv_c.c:    reduce_maps( *n, mapA, *n, mapB, *n, mapZ, &nn_proc, proclist );
src/c/pstein3.c:  extern void     reduce_maps_();
src/c/pstein3.c:  reduce_maps( *meigval, mapZ, 0, mapZ, 0, mapZ, &nn_proc, proclist );
src/c/pstein4.c:  extern void reduce_maps();
src/c/pstein4.c:  reduce_maps( *meigval, mapZ, 0, mapZ, 0, mapZ, &nn_proc, proclist );
src/c/pstein44.c:  extern void     reduce_maps_();
src/c/pstein44.c:  reduce_maps( *meigval, mapZ, 0, mapZ, 0, mapZ, &nn_proc, proclist );
src/c/pstein5.c:   extern void     reduce_maps();
src/c/pstein5.c:  reduce_maps( *meigval, mapZ, 0, mapZ, 0, mapZ, &nn_proc, proclist );


Such a function call with unspecified argument list causes the C compiler to determine argument types from "the usual promotions". The type deduced for a literal 0 is "int".

But the actual type of this argument in the function's definition is "Integer".

void reduce_maps( nA, mapA, nB, mapB, nC, mapC, nproc, proclist )

  Integer      nA, nB, nC;
  Integer      mapA[], mapB[], mapC[], *nproc, proclist[];


and "Integer" is not necessarily the same type as "int".

#ifdef  STD_INT
typedef int    Integer;
typedef unsigned int unInteger;
#else
typedef long Integer;
typedef unsigned long unInteger;
#endif


In short, there is undefined behavior whenever STD_INT is not defined, and this resulted in actual bad code generation on CYGWIN64, because the compiler used 64-bit alignment for the parameters, but only zeroed the bottom 32-bits, and the high bits caused reduce_maps to see a very large value of nC and led to a segmentation fault.

From the build instructions, it sounds as if there have been problems building 64-bit nwchem on recent cygwin, possibly linked to newer compiler versions. This type mismatch would cause exactly those symptoms.

I've attempted a fix by adding a full prototype to globalp.c.h and it did cure the segmentation fault. (Unfortunately, other failures still exist)

Besides this, that "typedef long Integer" is useful on ILP32 and LP64, but still causes problems on LLP64 platforms such as Win64, where sizeof (long) < sizeof (void*). So that typedef should change, possibly to use intptr_t from inttypes.h.

Forum Vet
Win64 not supported
We cannot support Win64 since the Global Array tools have not been ported yet to Win64 (for very same issues you mentioned on LLP64 platforms). If somebody is willing to do the work of porting GlobalArrays to WIn64, then the rest is (relatively) easy.

Clicked A Few Times
Win64 not supported
I can put in some porting effort. You say that "Global Arrays" (under src/tools/ga*, correct) are the main source of incompatibility? It just happens that I already built that area with debug information (-g) as part of tracking down this bug.

Still, the bug I reported here may exhibit itself on your supported platforms as well -- the code is invalid according to the C standard and newer compiler versions are getting more and more aggressive about these things.

LP64 platforms such as Linux64 may have bogus values passed in these function calls, because sizeof (Integer) > sizeof (int), and the optimizer may choose not to zero out the padding bytes.

I mentioned CYGWIN64 only as proof that this bug has consequences on real systems, not just a theoretical correctness issue.

Forum Vet
Quote:Bvoigt Jan 20th 3:29 pm
I can put in some porting effort. You say that "Global Arrays" (under src/tools/ga*, correct) are the main source of incompatibility? It just happens that I already built that area with debug information (-g) as part of tracking down this bug.


Yes, that's the part I am calling Global Arrays. More details at
http://hpc.pnl.gov/globalarrays/

Quote:Bvoigt Jan 20th 3:29 pm

Still, the bug I reported here may exhibit itself on your supported platforms as well -- the code is invalid according to the C standard and newer compiler versions are getting more and more aggressive about these things.

LP64 platforms such as Linux64 may have bogus values passed in these function calls, because sizeof (Integer) > sizeof (int), and the optimizer may choose not to zero out the padding bytes.

I mentioned CYGWIN64 only as proof that this bug has consequences on real systems, not just a theoretical correctness issue.


I am not sure I am following you. On LP64 we use the Integer = long trick for the Fortran to C interface since we do use 64-bit integers in Fortran on LP64 (by using the -i8 compiler options). I do not see any int appearing in this sequence

Clicked A Few Times
Quote:Edoapra Jan 20th 5:56 pm

I am not sure I am following you. On LP64 we use the Integer = long trick for the Fortran to C interface since we do use 64-bit integers in Fortran on LP64 (by using the -i8 compiler options). I do not see any int appearing in this sequence


src/c/pdspev_c.c:   reduce_maps( *n, mapA, *n, mapZ, 0, mapZ, &nn_proc, proclist );
                                                     ^
                                                this is int


Conversion from (literal of type int, according to clause 6.4.4.1) to (long) is not one of the "integer promotions" (clause 6.3.1.1) and will not occur automatically when there is no prototype for the function being called (clause 6.5.2.2)

That last clause contains the following rule

Quote:
If the function is defined with a type that does not include a prototype, and the types of
the arguments after promotion are not compatible with those of the parameters after
promotion, the behavior is undefined


The type of the indicated argument after promotion is "int", the type of the parameter is "typedef long Integer", and the result is undefined behavior.

That undefined behavior manifested as the caller writing four bytes and the callee reading eight, including garbage. Since that parameter controls pointer arithmetic, a segmentation fault resulted.

Note that on my system the parameters were eight byte aligned, so the caller reserved four padding bytes for alignment purposes, and the remaining arguments were at the expected location. A platform using four byte alignment would cause the location of the other arguments to be shifted, in addition to corruption of the argument passed as literal 0.

This affects a number of files in peigs, as shown in my first post (all those function calls that have a literal zero argument are broken).

Forum Vet
Is this the change you are suggesting?

--- pdspev_c.c	(revision 27970)
+++ pdspev_c.c	(working copy)
@@ -395,7 +395,7 @@
    */
 
    proclist = iscratch;
-   reduce_maps( *n, mapA, *n, mapZ, 0, mapZ, &nn_proc, proclist );
+   reduce_maps( *n, mapA, *n, mapZ, (Integer) 0, mapZ, &nn_proc, proclist );
 
    i_scrat = iscratch + nn_proc;
 

Clicked A Few Times
Quote:Edoapra Jan 21st 11:50 am
Is this the change you are suggesting?

--- pdspev_c.c	(revision 27970)
+++ pdspev_c.c	(working copy)
@@ -395,7 +395,7 @@
    */
 
    proclist = iscratch;
-   reduce_maps( *n, mapA, *n, mapZ, 0, mapZ, &nn_proc, proclist );
+   reduce_maps( *n, mapA, *n, mapZ, (Integer) 0, mapZ, &nn_proc, proclist );
 
    i_scrat = iscratch + nn_proc;
 


That's a correct but fragile fix. It leaves the other dozen instances of calls to the exact same function broken, and requires manually keeping the call site cast in sync with the argument type.

I much prefer the fix of adding a full prototype to a header file which is already included in all of these files. Which I said in the very first post.

I have taken the time to study code in your software written in a three decade old style, isolate and fix a bug that causes UNDEFINED BEHAVIOR (which calls into question the validity of all publications using it), explain it, and your first response is to deny that there is a bug, and then after I carefully show that both the formal and practical rules are being broken, you still can't be bothered to read my explanation??!?

I have done all the work that the rest of the open source world wishes that people reporting bugs would do. What more do you want?


Forum >> NWChem's corner >> Compiling NWChem