Add a `cooperative` UnitAttr to `gpu.launch_func` that enables
cooperative kernel launch semantics. Cooperative launches guarantee that
all thread blocks in the grid are co-resident on the GPU simultaneously,
enabling grid-wide synchronization patterns.
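As a sketch of the intended usage (the exact position of the `cooperative` keyword in the assembly format is defined by this patch; the kernel and operand names here are illustrative):

```mlir
// Request a cooperative launch: the runtime guarantees all blocks of the
// grid are resident at once, so grid-wide sync patterns are legal.
gpu.launch_func cooperative @kernels::@grid_reduce
    blocks in (%gx, %gy, %gz)
    threads in (%bx, %by, %bz)
    args(%buf : memref<?xf32>)
```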
## Implementation
When `cooperative` is set (with or without cluster sizes), the lowering
emits a call to the new `mgpuLaunchKernelCooperative` runtime function,
which uses `cuLaunchKernelEx` with a `CUlaunchConfig` and
`CU_LAUNCH_ATTRIBUTE_COOPERATIVE`. This API is guarded behind
`CUDA_VERSION >= 12000`. The HIP path instead calls
`hipModuleLaunchCooperativeKernel`.
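The CUDA 12+ driver-API call described above looks roughly like the following sketch (error handling elided; the wrapper's name and parameter list are assumptions, only the `CUlaunchConfig`/`CUlaunchAttribute` usage is per the CUDA driver API):

```c
#include <cuda.h>

// Hypothetical helper mirroring what mgpuLaunchKernelCooperative is described
// as doing: a cuLaunchKernelEx call carrying the COOPERATIVE launch attribute.
CUresult launchCooperative(CUfunction fn, unsigned gx, unsigned gy, unsigned gz,
                           unsigned bx, unsigned by, unsigned bz,
                           unsigned smem, CUstream stream, void **params) {
  CUlaunchAttribute attr;
  attr.id = CU_LAUNCH_ATTRIBUTE_COOPERATIVE;
  attr.value.cooperative = 1;  // require the whole grid to be co-resident

  CUlaunchConfig cfg = {0};
  cfg.gridDimX = gx;  cfg.gridDimY = gy;  cfg.gridDimZ = gz;
  cfg.blockDimX = bx; cfg.blockDimY = by; cfg.blockDimZ = bz;
  cfg.sharedMemBytes = smem;
  cfg.hStream = stream;
  cfg.attrs = &attr;
  cfg.numAttrs = 1;

  return cuLaunchKernelEx(&cfg, fn, params, /*extra=*/NULL);
}
```

The driver rejects the launch (rather than silently serializing blocks) if co-residency cannot be guaranteed, which is what makes grid-wide synchronization safe to rely on.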
## Changes
- **GPUOps.td**: add `cooperative` UnitAttr and assembly format keyword
- **SelectObjectAttr.cpp**: add `getKernelLaunchExFn()`, route
  cooperative and cluster launches through `mgpuLaunchKernelEx`
- **CudaRuntimeWrappers.cpp**: implement `mgpuLaunchKernelCooperative`
via `cuLaunchKernelEx` or `hipModuleLaunchCooperativeKernel`, depending
on platform
- **GPUToLLVMConversion.cpp**: propagate cooperative attribute through
the legalization pattern
- **test/Dialect/GPU/ops.mlir**: round-trip tests for cooperative
keyword with and without clusters
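The round-trip tests follow the usual ops.mlir pattern; a representative case might look like this (function and kernel names are illustrative, not copied from the patch):

```mlir
// CHECK-LABEL: func @cooperative_launch
func.func @cooperative_launch(%sz : index) {
  // CHECK: gpu.launch_func cooperative
  gpu.launch_func cooperative @kernels::@grid_reduce
      blocks in (%sz, %sz, %sz) threads in (%sz, %sz, %sz)
  return
}
```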
## Context
MLIR currently has no support for cooperative kernel launches. Flang
works around this with a CUF-specific attribute (PRs #124325, #124362),
but there is no first-class support in the GPU dialect. This patch adds
it at the `gpu.launch_func` level so all frontends can use it.
Assisted-by: Claude (Anthropic)