You can run fewer MPI processes per node than there are cores if you use OpenMP threads to fill the remaining cores. Currently, at least in TCE, threads are used only in BLAS (i.e. DGEMM), but that accounts for roughly half the wall time in most jobs.
I have threaded code for the other dominant kernels in TCE, namely TCE_SORT_4 and the bottleneck portions of (T), but it isn't in version 6.3. I will try to make a patch in a month or two.
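
To get a feel for the trade-off, here is a rough sketch (not anything from NWChem itself) that lists the ways to split a node between MPI ranks and OpenMP threads, and applies Amdahl's law to estimate the best-case per-rank speedup when only the DGEMM part is threaded. The 16-core node and the function names are made up for illustration, the 0.5 threaded fraction is just the "roughly half" figure above taken literally, and it assumes DGEMM scales perfectly with threads, which it won't quite do in practice.

```python
def hybrid_layouts(cores_per_node):
    """Return (mpi_ranks, omp_threads) pairs whose product equals the core count."""
    return [(ranks, cores_per_node // ranks)
            for ranks in range(1, cores_per_node + 1)
            if cores_per_node % ranks == 0]


def amdahl_speedup(threaded_fraction, threads):
    """Best-case speedup of one rank when only threaded_fraction of its time is threaded."""
    return 1.0 / ((1.0 - threaded_fraction) + threaded_fraction / threads)


if __name__ == "__main__":
    cores = 16               # hypothetical node size, not from the post
    dgemm_fraction = 0.5     # assumption: "~half the wall time" taken literally
    for ranks, threads in hybrid_layouts(cores):
        bound = amdahl_speedup(dgemm_fraction, threads)
        print(f"{ranks:2d} MPI ranks x {threads:2d} threads/rank: "
              f"per-rank speedup <= {bound:.2f}x")
```

With only half the time threaded, the per-rank speedup can never reach 2x no matter how many threads you add, so a modest thread count per rank (set via OMP_NUM_THREADS) is probably the sensible starting point until the other kernels are threaded too.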