On GFX1250, V_NOPs inserted for WMMA coexecution hazards are placed at
the use-site. When the hazard-consuming instruction is inside a loop and
the WMMA is outside, these NOPs execute every iteration even though the
hazard only needs to be covered once.
This patch hoists the V_NOPs to the loop preheader, reducing executions
from N iterations to 1.
```
Example (assuming a hazard requiring K V_NOPs):
Before:
bb.0 (preheader): WMMA writes vgpr0
bb.1 (loop): V_NOP xK, VALU reads vgpr0, branch bb.1
-> K NOPs executed per iteration
After:
bb.0 (preheader): WMMA writes vgpr0, V_NOP xK
bb.1 (loop): VALU reads vgpr0, branch bb.1
-> K NOPs executed once
```
For nested loops, V_NOPs are hoisted to the outermost preheader where no
WMMA hazard exists within the loop.
Hoisting is restricted to strict preheaders (not any single predecessor)
to avoid introducing V_NOPs on unrelated control flow paths.
The optimization is controlled by `-amdgpu-wmma-vnop-hoisting` (default:
on).
Fixes: SWDEV-573407
4.4 KiB
4.4 KiB