Add a `cooperative` UnitAttr to `gpu.launch_func` that enables cooperative kernel launch semantics. Cooperative launches guarantee that all thread blocks in the grid are co-resident on the GPU simultaneously, enabling grid-wide synchronization patterns.

## Implementation

When `cooperative` is set (with or without cluster sizes), the lowering emits a call to the new `mgpuLaunchKernelCooperative` runtime function, which uses `cuLaunchKernelEx` with a `CUlaunchConfig` carrying `CU_LAUNCH_ATTRIBUTE_COOPERATIVE`. This API is guarded behind `CUDA_VERSION >= 12000`. The HIP path funnels through `hipModuleLaunchCooperativeKernel`.

## Changes

- **GPUOps.td**: add the `cooperative` UnitAttr and assembly-format keyword
- **SelectObjectAttr.cpp**: add `getKernelLaunchExFn()`; route cooperative and/or cluster launches through `mgpuLaunchKernelEx`
- **CudaRuntimeWrappers.cpp**: implement `mgpuLaunchKernelCooperative` via `cuLaunchKernelEx` or `hipModuleLaunchCooperativeKernel`, depending on the platform
- **GPUToLLVMConversion.cpp**: propagate the cooperative attribute through the legalization pattern
- **test/Dialect/GPU/ops.mlir**: round-trip tests for the cooperative keyword, with and without clusters

## Context

MLIR currently has no support for cooperative kernel launches. Flang works around this with a CUF-specific attribute (PRs #124325 and #124362), but there is no first-class support in the GPU dialect. This patch adds it at the `gpu.launch_func` level so all frontends can use it.

Assisted-by: Claude (Anthropic)
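For reviewers unfamiliar with the CUDA 12 extensible-launch API used here, a cooperative launch via `cuLaunchKernelEx` looks roughly like the sketch below. This is illustrative only, not the wrapper's actual code; `func`, `stream`, `args`, and the launch dimensions are placeholders:

```cpp
#include <cuda.h>

// Sketch of a cooperative launch through the CUDA 12 driver API
// (requires CUDA_VERSION >= 12000).
CUresult launchCooperative(CUfunction func, CUstream stream, void **args) {
  CUlaunchAttribute attr{};
  attr.id = CU_LAUNCH_ATTRIBUTE_COOPERATIVE;
  attr.value.cooperative = 1; // request co-resident grid semantics

  CUlaunchConfig config{};
  config.gridDimX = 32;       // placeholder: all blocks must be co-resident
  config.gridDimY = 1;
  config.gridDimZ = 1;
  config.blockDimX = 256;     // placeholder block shape
  config.blockDimY = 1;
  config.blockDimZ = 1;
  config.sharedMemBytes = 0;
  config.hStream = stream;
  config.attrs = &attr;
  config.numAttrs = 1;

  return cuLaunchKernelEx(&config, func, args, /*extra=*/nullptr);
}
```

Note that a cooperative launch fails if the grid is too large to be co-resident, so callers typically size the grid with `cuOccupancyMaxActiveBlocksPerMultiprocessor` times the SM count.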