This patch adds `olGetEventElapsedTime` to the new LLVM Offload API, as requested in [#185728](https://github.com/llvm/llvm-project/issues/185728), and adds the corresponding support in `plugins-nextgen`. A main motivation for this change is to make it possible to measure the elapsed time of work submitted to a queue, especially kernel launches. This is relevant to the intended use of the new Offload API for microbenchmarking GPU libc math functions.

### Summary

The new API returns the elapsed time, in milliseconds, between two events on the same device. To support the common pattern `create start event → enqueue kernel → create end event → sync end event → get elapsed time`, `olCreateEvent` now always creates and records a backend event through the device interface. For backends that materialize real event state, this gives the event concrete backend state that can be used for elapsed-time measurement. For backends that do not materialize backend event state, `EventInfo` may still remain null, and existing event operations continue to treat such events as trivially complete.

Previously, an event created on an empty queue could be represented only as a logical event. That representation was sufficient for sync and completion queries, but it was not suitable for elapsed-time measurement because there was no backend event state to timestamp. The new behavior preserves the meaning of completion of prior work while also allowing backends with timing support to attach real event state.

### Changes in `plugins-nextgen`

#### Common interface

Add elapsed-time support to the common device and plugin interfaces:

* `GenericPluginTy::get_event_elapsed_time`
* `GenericDeviceTy::getEventElapsedTime`
* `GenericDeviceTy::getEventElapsedTimeImpl`

#### AMDGPU

* Add the required ROCr declarations and wrappers.
* Enable queue profiling at queue creation time.
* Record events by enqueuing a real barrier marker packet on the stream.
* Retain the timing signal needed to query the recorded marker later.
* Implement `getEventElapsedTimeImpl` using `hsa_amd_profiling_get_dispatch_time`, converting the result to milliseconds with `HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY`.

This follows the ROCm/HIP approach of enabling queue profiling at HSA queue creation time, while keeping the AMDGPU queue path simpler than the lazy-enable alternative discussed during review.

#### CUDA

* Add the required CUDA driver declarations and wrappers.
* Implement `getEventElapsedTimeImpl` with `cuEventElapsedTime`.

#### Host

* Add `getEventElapsedTimeImpl` that stores `0.0f` in the output pointer, when present, and returns success.

Reason: the host plugin does not materialize backend event state and already treats event operations as trivially successful. Returning `0.0f` preserves that model without introducing a new failure mode.

#### Level Zero

* Add `getEventElapsedTimeImpl`, but leave it unimplemented.

Reason: the Level Zero plugin currently does not provide standalone backend event support for this event model. For example, `waitEventImpl` / `syncEventImpl` are still unimplemented there.

---------

Signed-off-by: Leandro Augusto Lacerda Campos <leandrolcampos@yahoo.com.br>
Signed-off-by: Leandro A. Lacerda Campos <leandrolcampos@yahoo.com.br>
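To make the Summary's `create start event → enqueue kernel → create end event → sync end event → get elapsed time` pattern concrete, here is a minimal control-flow sketch. This is purely illustrative: the parameter lists below are assumptions, not the real Offload API signatures, and every entry point is stubbed against a fake device clock so the snippet is self-contained. The real types and declarations live in the Offload headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, stubbed stand-ins for the Offload API. Real signatures
// differ (e.g. olCreateEvent takes a queue); these only model the flow.
struct ol_event_t { uint64_t Timestamp; };
using ol_result_t = int;
constexpr ol_result_t OL_SUCCESS = 0;

static uint64_t FakeClock = 0; // fake monotonically increasing device clock

ol_result_t olCreateEvent(ol_event_t *Event) {
  Event->Timestamp = FakeClock++; // record "now" on the fake device
  return OL_SUCCESS;
}

ol_result_t olSyncEvent(ol_event_t * /*Event*/) {
  return OL_SUCCESS; // stub: pretend the event has completed
}

ol_result_t olGetEventElapsedTime(ol_event_t *Start, ol_event_t *End,
                                  float *Elapsed) {
  // In the stub the "elapsed time" is just the tick difference.
  *Elapsed = static_cast<float>(End->Timestamp - Start->Timestamp);
  return OL_SUCCESS;
}

float timeKernel() {
  ol_event_t Start, End;
  olCreateEvent(&Start);   // create start event
  FakeClock += 5;          // enqueue kernel (stubbed: advance the clock)
  olCreateEvent(&End);     // create end event
  olSyncEvent(&End);       // sync end event
  float Elapsed = 0.0f;
  olGetEventElapsedTime(&Start, &End, &Elapsed); // get elapsed time
  return Elapsed;
}
```

The key point the sketch encodes is that both events carry concrete timestamped state at creation time, which is exactly what the new always-record behavior of `olCreateEvent` guarantees for backends with timing support.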
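For the AMDGPU path, the final step is plain arithmetic: the start/end timestamps are in ticks of the system timestamp clock, and `HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY` reports that clock's rate in Hz. A minimal sketch of just the conversion (the HSA calls themselves are omitted, and the tick values and frequency in the test are made up):

```cpp
#include <cstdint>

// Convert a [start, end] tick pair to milliseconds, given the timestamp
// clock frequency in Hz (as reported by HSA_SYSTEM_INFO_TIMESTAMP_FREQUENCY).
static float ticksToMilliseconds(uint64_t StartTicks, uint64_t EndTicks,
                                 uint64_t FrequencyHz) {
  double Seconds = static_cast<double>(EndTicks - StartTicks) /
                   static_cast<double>(FrequencyHz);
  return static_cast<float>(Seconds * 1000.0);
}
```

For example, with a (hypothetical) 100 MHz timestamp clock, a span of 2,500,000 ticks converts to 25 ms.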
//===-- Shared/RefCnt.h - Helper to keep track of references --- C++ ------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef OMPTARGET_SHARED_REF_CNT_H
#define OMPTARGET_SHARED_REF_CNT_H

#include <atomic>
#include <cassert>
#include <cstdint>
#include <limits>
#include <memory>

namespace llvm {
namespace omp {
namespace target {

/// Utility class for thread-safe reference counting. Any class that needs
/// objects' reference counting can inherit from this entity or have it as a
/// class data member.
template <typename Ty = uint32_t,
          std::memory_order MemoryOrder = std::memory_order_relaxed>
struct RefCountTy {
  /// Create a refcount object initialized to zero.
  RefCountTy() : Refs(0) {}

  ~RefCountTy() { assert(Refs == 0 && "Destroying with non-zero refcount"); }

  /// Increase the reference count atomically by \p Amount.
  void increase(Ty Amount = 1) { Refs.fetch_add(Amount, MemoryOrder); }

  /// Decrease the reference count by \p Amount and return whether it became
  /// zero. Decreasing the counter by more than it was previously increased
  /// results in undefined behavior.
  bool decrease(Ty Amount = 1) {
    Ty Prev = Refs.fetch_sub(Amount, MemoryOrder);
    assert(Prev >= Amount && "Invalid refcount");
    return (Prev == Amount);
  }

  Ty get() const { return Refs.load(MemoryOrder); }

private:
  /// The atomic reference counter.
  std::atomic<Ty> Refs;
};
} // namespace target
} // namespace omp
} // namespace llvm

#endif // OMPTARGET_SHARED_REF_CNT_H
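A short usage sketch of `RefCountTy`: a resource retains and releases itself, and `decrease()` returning `true` signals that the count just hit zero, i.e. the caller owns the cleanup. The class body is condensed inline here (without the destructor assert) so the snippet is self-contained; in-tree code would simply include `Shared/RefCnt.h`, and `SharedBuffer` is a hypothetical example type.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Condensed copy of RefCountTy from Shared/RefCnt.h, for a self-contained demo.
template <typename Ty = uint32_t,
          std::memory_order MemoryOrder = std::memory_order_relaxed>
struct RefCountTy {
  RefCountTy() : Refs(0) {}
  void increase(Ty Amount = 1) { Refs.fetch_add(Amount, MemoryOrder); }
  bool decrease(Ty Amount = 1) {
    Ty Prev = Refs.fetch_sub(Amount, MemoryOrder);
    assert(Prev >= Amount && "Invalid refcount");
    return (Prev == Amount); // true exactly when the count reached zero
  }
  Ty get() const { return Refs.load(MemoryOrder); }

private:
  std::atomic<Ty> Refs;
};

// Hypothetical shared resource that is released when its count returns to zero.
struct SharedBuffer {
  RefCountTy<> RefCount;
  bool Released = false;

  void retain() { RefCount.increase(); }
  void release() {
    if (RefCount.decrease()) // last reference dropped: do the cleanup
      Released = true;
  }
};
```

The relaxed default memory order is enough for pure counting; code that publishes data to the thread that performs the final release would pick a stronger `MemoryOrder` template argument instead.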