Summary: AMDGPU introduced a high-level intrinsic for wave shuffles. Its main advantages over the ds_bpermute path are that it is lowered correctly for both wave32 and wave64, and that it takes a lane index directly instead of requiring the index to be scaled to a four-byte offset. This PR adds '__builtin_amdgcn_wave_shuffle' to expose it.
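To illustrate the addressing difference described in the summary, here is a minimal host-side sketch in C++. It is a model only, not the actual lowering. The names `Wave`, `model_wave_shuffle`, `model_bpermute`, and `check_rotate_example` are hypothetical. The (value, index) argument order for the shuffle and the wrap-around of out-of-range lanes are assumptions of this sketch, not something this PR states.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side model of one wave: one 32-bit value per lane.
template <std::size_t WaveSize>
using Wave = std::array<std::uint32_t, WaveSize>;

// Models an index-based shuffle: lane i reads src[idx[i]].
// Assumption: out-of-range indices wrap; the real intrinsic may differ.
template <std::size_t WaveSize>
Wave<WaveSize> model_wave_shuffle(const Wave<WaveSize>& src,
                                  const Wave<WaveSize>& idx) {
    Wave<WaveSize> out{};
    for (std::size_t lane = 0; lane < WaveSize; ++lane)
        out[lane] = src[idx[lane] % WaveSize];
    return out;
}

// Models the byte-addressed path: the caller must pass lane * 4.
template <std::size_t WaveSize>
Wave<WaveSize> model_bpermute(const Wave<WaveSize>& byte_addr,
                              const Wave<WaveSize>& src) {
    Wave<WaveSize> out{};
    for (std::size_t lane = 0; lane < WaveSize; ++lane)
        out[lane] = src[(byte_addr[lane] / 4) % WaveSize];
    return out;
}

// Rotates the wave by one lane with both models and checks that they
// agree once the index has been scaled by four for the bpermute model.
template <std::size_t WaveSize>
void check_rotate_example() {
    Wave<WaveSize> src{}, idx{}, addr{};
    for (std::size_t lane = 0; lane < WaveSize; ++lane) {
        src[lane] = static_cast<std::uint32_t>(lane * 10);
        idx[lane] = static_cast<std::uint32_t>((lane + 1) % WaveSize);
        addr[lane] = idx[lane] * 4;  // the manual scaling the shuffle avoids
    }
    assert(model_wave_shuffle(src, idx) == model_bpermute(addr, src));
}
```

Calling `check_rotate_example<32>()` and `check_rotate_example<64>()` exercises the same source for both wave sizes. In device code the corresponding call would presumably look like `__builtin_amdgcn_wave_shuffle(value, index)`, with the argument order assumed here.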