Make sure we do not get unexpected NumThreads and NumBlocks values when
launching non-bare kernels, and generalize the computation of the
dynamic block memory allocation to handle multi-dimensional blocks.
The DynBlockMem fallback is never used in a non-bare context where
`NumBlocks[1]` or `NumBlocks[2]` is not 1, so the existing code was
correct. This patch makes that assumption explicit and future-proofs
the code in case we later allow multi-dimensional blocks for the
fallback dynamic block memory in some path.
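
For illustration only, here is a minimal sketch of the two ideas above:
checking that a non-bare launch is effectively one-dimensional, and
sizing the fallback allocation from the product of all three block
dimensions. The helper names (`isOneDimensional`,
`totalDynBlockMemBytes`) and the `BytesPerBlock` parameter are
hypothetical and are not taken from the actual change.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: true when only the first dimension may differ from 1,
// which is the assumption stated above for non-bare kernel launches.
bool isOneDimensional(const std::array<uint32_t, 3> &Dims) {
  return Dims[1] == 1 && Dims[2] == 1;
}

// Hypothetical helper: total fallback size for a launch with up to three
// block dimensions. Multiplying all dimensions gives the same result as the
// one-dimensional formula whenever NumBlocks[1] == NumBlocks[2] == 1.
uint64_t totalDynBlockMemBytes(const std::array<uint32_t, 3> &NumBlocks,
                               uint64_t BytesPerBlock) {
  uint64_t TotalBlocks = 1;
  for (uint32_t Dim : NumBlocks)
    TotalBlocks *= Dim; // accumulate in 64 bits to avoid 32-bit overflow
  return TotalBlocks * BytesPerBlock;
}
```

With this shape, a one-dimensional launch is sized exactly as before,
while a multi-dimensional launch is sized by its full block count
rather than by `NumBlocks[0]` alone.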