Files
llvm-project/llvm/docs/AMDGPUAsyncOperations.rst
Sameer Sahasrabuddhe b02b395a1e [AMDGPU] Asynchronous loads from global/buffer to LDS on pre-GFX12 (#180466)
The existing "LDS DMA" builtins/intrinsics copy data from global/buffer
pointer to LDS. These are now augmented with their ".async" version,
where the compiler does not automatically track completion. The
completion is now tracked using explicit mark/wait intrinsics, which
must be inserted by the user. This makes it possible to write programs
with efficient waits in software pipeline loops. The program can now
wait for only the oldest outstanding operations to finish, while
launching more operations for later use.

This change only contains the new names of the builtins/intrinsics,
which continue to behave exactly like their non-async counterparts. A
later change will implement the actual mark/wait semantics in
SIInsertWaitcnts.

This is part of a stack split out from #173259:
- #180467
- #180466

Fixes: SWDEV-521121
2026-02-11 05:26:58 +00:00

239 lines
6.3 KiB
ReStructuredText

.. _amdgpu-async-operations:
===============================
AMDGPU Asynchronous Operations
===============================
.. contents::
:local:
Introduction
============
Asynchronous operations are memory transfers (usually between the global memory
and LDS) that are completed independently at an unspecified scope. A thread that
requests one or more asynchronous transfers can use *async marks* to track
their completion. The thread waits for each mark to be *completed*, which
indicates that requests initiated in program order before this mark have also
completed.
Operations
==========
Memory Accesses
---------------
LDS DMA Operations
^^^^^^^^^^^^^^^^^^
.. code-block:: llvm
; "Legacy" LDS DMA operations
void @llvm.amdgcn.load.async.to.lds(ptr %src, ptr %dst)
void @llvm.amdgcn.global.load.async.lds(ptr %src, ptr %dst)
void @llvm.amdgcn.raw.buffer.load.async.lds(ptr %src, ptr %dst)
void @llvm.amdgcn.raw.ptr.buffer.load.async.lds(ptr %src, ptr %dst)
void @llvm.amdgcn.struct.buffer.load.async.lds(ptr %src, ptr %dst)
void @llvm.amdgcn.struct.ptr.buffer.load.async.lds(ptr %src, ptr %dst)
Request an async operation that copies the specified number of bytes from the
global/buffer pointer ``%src`` to the LDS pointer ``%dst``.
.. note::
The above listing is *merely representative*. The actual function signatures
are identical to their non-async variants, and supported only on the
corresponding architectures (GFX9 and GFX10).
Async Mark Operations
---------------------
An *async mark* in the abstract machine tracks all the async operations that
are program ordered before that mark. A mark M is said to be *completed*
only when all async operations program ordered before M are reported by the
implementation as having finished, and it is said to be *outstanding* otherwise.
Thus we have the following sufficient condition:
An async operation X is *completed* at a program point P if there exists a
mark M such that X is program ordered before M, M is program ordered before
P, and M is completed. X is said to be *outstanding* at P otherwise.
The abstract machine maintains a sequence of *async marks* during the
execution of a function body, which excludes any marks produced by calls to
other functions encountered in the currently executing function.
``@llvm.amdgcn.asyncmark()``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When executed, inserts an async mark in the sequence associated with the
currently executing function body.
``@llvm.amdgcn.wait.asyncmark(i16 %N)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Waits until there are at most N outstanding marks in the sequence associated
with the currently executing function body.
Memory Consistency Model
========================
Each asynchronous operation consists of a non-atomic read on the source and a
non-atomic write on the destination. Async "LDS DMA" intrinsics result in async
accesses that guarantee visibility relative to other memory operations as
follows:
An asynchronous operation `A` program ordered before an overlapping memory
operation `X` happens-before `X` only if `A` is completed before `X`.
A memory operation `X` program ordered before an overlapping asynchronous
operation `A` happens-before `A`.
.. note::
The *only if* in the above wording implies that unlike the default LLVM
memory model, certain program order edges are not automatically included in
``happens-before``.
Examples
========
Uneven blocks of async transfers
--------------------------------
.. code-block:: c++
void foo(global int *g, local int *l) {
// first block
async_load_to_lds(l, g);
async_load_to_lds(l, g);
async_load_to_lds(l, g);
asyncmark();
// second block; longer
async_load_to_lds(l, g);
async_load_to_lds(l, g);
async_load_to_lds(l, g);
async_load_to_lds(l, g);
async_load_to_lds(l, g);
asyncmark();
// third block; shorter
async_load_to_lds(l, g);
async_load_to_lds(l, g);
asyncmark();
// Wait for first block
wait.asyncmark(2);
}
Software pipeline
-----------------
.. code-block:: c++
void foo(global int *g, local int *l) {
// first block
asyncmark();
// second block
asyncmark();
// third block
asyncmark();
for (;;) {
wait.asyncmark(2);
// use data
// next block
asyncmark();
}
// flush one block
wait.asyncmark(2);
// flush one more block
wait.asyncmark(1);
// flush last block
wait.asyncmark(0);
}
Ordinary function call
----------------------
.. code-block:: c++
extern void bar(); // may or may not make async calls
void foo(global int *g, local int *l) {
// first block
asyncmark();
// second block
asyncmark();
// function call
bar();
// third block
asyncmark();
wait.asyncmark(1); // will wait for at least the second block, possibly including bar()
wait.asyncmark(0); // will wait for third block, including bar()
}
Implementation notes
====================
[This section is informational.]
Optimization
------------
The implementation may eliminate async mark/wait intrinsics in the following cases:
1. An ``asyncmark`` operation which is not included in the wait count of a later
wait operation in the current function. In particular, an ``asyncmark`` which
is not post-dominated by any ``wait.asyncmark``.
2. A ``wait.asyncmark`` whose wait count is more than the outstanding async
marks at that point. In particular, a ``wait.asyncmark`` that is not
dominated by any ``asyncmark``.
In general, at a function call, if the caller uses sufficient waits to track
its own async operations, the actions performed by the callee cannot affect
correctness. But inlining such a call may result in redundant waits.
.. code-block:: c++
void foo() {
asyncmark(); // A
}
void bar() {
asyncmark(); // B
asyncmark(); // C
foo();
wait.asyncmark(1);
}
Before inlining, the ``wait.asyncmark`` waits for mark B to be completed.
.. code-block:: c++
void foo() {
}
void bar() {
asyncmark(); // B
asyncmark(); // C
asyncmark(); // A from call to foo()
wait.asyncmark(1);
}
After inlining, the asyncmark-wait now waits for mark C to complete, which is
longer than necessary. Ideally, the optimizer should have eliminated mark A in
the body of foo() itself.