OK, I finally found the mistake and got it to compile.
I think I am getting closer to a properly working final version. The main problem now is the following: when I run it with Open MPI (mpirun) on 8 cores, it is almost 8 times slower than running it on just one core, which I find very strange. Any suggestions?