Summary:
NVIDIA's independent thread scheduling (ITS) allows lanes to diverge
within a warp. We previously had safeguards for this, but issues could
still show up under heavy stress.
The rules state that threads can reconverge as long as their PC is the
same. This means that we can see a 'convergent' warp even when its lanes
took completely divergent paths to get there. As a result, the RPC port
lookup loop treated the lanes as one convergent group even though each
lane held a different 'index' value. Fix this with a broadcast so that
every lane in the group uses the same index.
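The broadcast fix can be illustrated with a small host-side model. This is only a sketch, not the code in this patch: `LaneValues`, `broadcast_first`, and `make_uniform` are hypothetical names, and the warp is modeled as a plain array with a 32-bit active mask. On the device, the same idea would presumably be expressed with a warp shuffle from the lowest active lane.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;

// Hypothetical host-side model: one slot per lane of a warp.
using LaneValues = std::array<uint32_t, kWarpSize>;

// Return the value held by the lowest-numbered lane set in `mask`.
uint32_t broadcast_first(uint32_t mask, const LaneValues &values) {
  assert(mask != 0 && "at least one lane must be active");
  int leader = 0;
  while (((mask >> leader) & 1u) == 0)
    ++leader;
  return values[leader];
}

// Overwrite the value of every active lane with the leader's value.
// Inactive lanes are left untouched.
LaneValues make_uniform(uint32_t mask, LaneValues values) {
  const uint32_t leader_value = broadcast_first(mask, values);
  for (int lane = 0; lane < kWarpSize; ++lane)
    if ((mask >> lane) & 1u)
      values[lane] = leader_value;
  return values;
}
```

In this model, lanes that reconverged at the same PC with different per-lane indices all end up with the leader's index after `make_uniform`, which is the property the port lookup loop relies on.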
Additionally, we did not ensure that the threads were actually done with
their 'work_fn'. If the work included something that caused divergence,
the other threads could continue and toggle the mailbox, resulting in
the server seeing unfinished work. Fix this with an explicit sync and
have a single thread toggle the mailbox.
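The ordering rule can be sketched with a sequential host-side model. Everything here is an assumption made for illustration: `Port`, `finish_work`, and `try_toggle` are hypothetical names, and the real code would block at a warp-level barrier rather than return `false`. The sketch shows only that the mailbox flips once, by the leader lane, and only after every active lane has finished its work.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;

// Hypothetical model of a port shared by the lanes of one warp.
struct Port {
  uint32_t active_mask = 0;   // lanes that opened the port
  uint32_t finished_mask = 0; // lanes whose work function has returned
  bool mailbox = false;       // flipped to hand the port to the server
  int toggles = 0;            // number of flips, for checking
};

// Lane `lane` reports that its work function has returned.
void finish_work(Port &port, int lane) {
  assert(lane >= 0 && lane < kWarpSize);
  assert(((port.active_mask >> lane) & 1u) != 0);
  port.finished_mask |= (1u << lane);
}

// Called by each lane after its work. Flips the mailbox only if all
// active lanes have finished AND the caller is the lowest active lane.
// Returns true if this call performed the flip.
bool try_toggle(Port &port, int lane) {
  assert(port.active_mask != 0);
  int leader = 0;
  while (((port.active_mask >> leader) & 1u) == 0)
    ++leader;
  const bool synced = port.finished_mask == port.active_mask;
  if (!synced || lane != leader)
    return false;
  port.mailbox = !port.mailbox;
  ++port.toggles;
  return true;
}
```

Without the `synced` check, a lane that finished early could flip the mailbox while a divergent lane was still inside its work function, which is the failure described above.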
Add a test to make sure this actually works.