When kernel record replay (RR) was enabled, operations on the current stream were not synchronized: the current stream was ignored and a new stream was used while RR was active. This is incorrect when operations are still pending on the original stream, and it can lead to invalid data in the prologue recording. This commit fixes the issue by using the original stream and synchronizing it explicitly before and after the kernel launch, which ensures that all operations have completed before the prologue and epilogue data are recorded. Additionally, the kernel record replay entry points are moved to the same layer, in `GenericKernelTy::launch()`.
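
The ordering described above can be illustrated with a small host-only model. This is a minimal sketch, not the plugin code: `MockStream`, `MockRecorder`, and `launchWithRecordReplay` are hypothetical names invented for illustration, and the "stream" is just a queue of deferred callbacks that take effect only when it is synchronized.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device stream: queued work only takes
// effect once the stream is synchronized.
struct MockStream {
  std::queue<std::function<void()>> Pending;

  void enqueue(std::function<void()> Op) { Pending.push(std::move(Op)); }

  // Run every pending operation in submission order.
  void synchronize() {
    while (!Pending.empty()) {
      Pending.front()();
      Pending.pop();
    }
  }
};

// Hypothetical recorder holding snapshots of "device" memory.
struct MockRecorder {
  std::vector<int> Prologue;
  std::vector<int> Epilogue;
};

// Sketch of a launch with record replay active: reuse the caller's
// stream and synchronize it around the kernel launch.
void launchWithRecordReplay(MockStream &Stream, std::vector<int> &DeviceMem,
                            MockRecorder &Recorder,
                            std::function<void(std::vector<int> &)> Kernel) {
  // Drain work already pending on the original stream so the prologue
  // snapshot sees its results.
  Stream.synchronize();
  Recorder.Prologue = DeviceMem;

  // Launch the kernel on the same stream, then wait for it to finish
  // before taking the epilogue snapshot.
  Stream.enqueue([&DeviceMem, Kernel] { Kernel(DeviceMem); });
  Stream.synchronize();
  Recorder.Epilogue = DeviceMem;
}
```

In this model, a copy that is still pending on the stream when the launch is requested is reflected in the prologue snapshot, because the stream is drained first. Using a separate, fresh stream would skip that pending work, which is the failure mode described above.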