Could you be more specific about the cases that show a large difference between wall time and CPU time?
If the case is too small, there is not enough work to parallelize, and with 8 processes you are already running out of steam. Using fewer processes (4 or even 2) is probably a more efficient way to use your computational resources.
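
One way to check this on your side is to run the same case with 1, 2, 4 and 8 processes and compare the wall times. Here is a minimal sketch in Python. The function name `parallel_efficiency` and the timings in `timings` are hypothetical placeholders, not measurements from any real run, so replace them with your own numbers.

```python
def parallel_efficiency(wall_serial, wall_parallel, n_procs):
    """Return (speedup, efficiency) for a run on n_procs processes.

    speedup    = serial wall time / parallel wall time
    efficiency = speedup / number of processes (1.0 means ideal scaling)
    """
    if wall_serial <= 0 or wall_parallel <= 0:
        raise ValueError("wall times must be positive")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    speedup = wall_serial / wall_parallel
    return speedup, speedup / n_procs


if __name__ == "__main__":
    # Hypothetical wall times in seconds, keyed by process count.
    # These are placeholders: substitute the timings from your own runs.
    timings = {1: 100.0, 2: 55.0, 4: 32.0, 8: 30.0}

    serial = timings[1]
    for n_procs, wall in sorted(timings.items()):
        speedup, efficiency = parallel_efficiency(serial, wall, n_procs)
        print(f"{n_procs} procs: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

If the efficiency drops sharply between 4 and 8 processes, the extra processes are mostly waiting rather than computing, and the smaller process count is the better choice for that case.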