Fixes #180334 # Problem The semaphore timed wait test is flaky on Windows. It hangs from time to time. Some examples: [windows (clang-cl-no-vcruntime, false, clang-cl, clang-cl)](https://github.com/llvm/llvm-project/actions/runs/21737380876/job/62707542836#logs) [windows (mingw-static, true, cc, c++)](https://github.com/llvm/llvm-project/actions/runs/21636063482/job/62367831823?pr=179483#logs) [windows (clang-cl-static, false, clang-cl, clang-cl)](https://github.com/llvm/llvm-project/actions/runs/21453876753/job/61794464147#logs) [windows (clang-cl-dll, false, clang-cl, clang-cl)](https://github.com/llvm/llvm-project/actions/runs/21382902941/job/61556154029#logs) [windows (mingw-static, true, cc, c++)](https://github.com/llvm/llvm-project/actions/runs/21365713577/job/61502377123#logs) # Root Cause The internal dylib function takes a timeout ```cpp static void __platform_wait_on_address(void const* __ptr, void const* __val, uint64_t __timeout_ns) ``` We followed the same convention as `__libcpp_thread_poll_with_backoff`, where we used `0ns` to indicate wait indefinitely until being notified. ```cpp _LIBCPP_HIDE_FROM_ABI __poll_with_backoff_results __libcpp_thread_poll_with_backoff( _Poll&& __poll, _Backoff&& __backoff, chrono::nanoseconds __max_elapsed = chrono::nanoseconds::zero()) ``` This is problematic, if the caller indeed wants to wait `0ns` and passes `0`, the internal dylib function `__platform_wait_on_address` would wait indefinitely ```cpp __timeout_ns == 0 ? INFINITE : static_cast<DWORD>(__timeout_ns / 1'000'000) ``` This is what actually happened here. So the fix is to update internal dylib function to use `optional` and use `nullopt` to indicate wait indefinitely ```cpp static void __platform_wait_on_address(void const* __ptr, void const* __val, optional<uint64_t> __timeout_ns) ``` Edit: after code review, the code is updated to use tag type `NoTimeout` to indicate "wait indefinitely". this is superior because it is coded into the type system instead of runtime check # Why `__libcpp_thread_poll_with_backoff` never suffered from this problem? `__libcpp_thread_poll_with_backoff` has this " `0ns` means wait indefinitely " semantic for years (it has always been like that), but it never causes issues. This is because, it has ```cpp chrono::nanoseconds const __elapsed = chrono::high_resolution_clock::now() - __start; if (__max_elapsed != chrono::nanoseconds::zero() && __max_elapsed < __elapsed) return __poll_with_backoff_results::__timeout; ``` `__max_elapsed` is what user passed in, let's assume the user passed in `0ns`, and `__elapsed` is certainly a positive number so it never goes to the backoff function and directly returned. No hanging possible # Why Windows and we passed in `0ns` ? So in the test, we passed wait some time, say `1ms`, from the `semaphore` public API, which calls `__libcpp_thread_poll_with_backoff` with `1ms` timeout. `__libcpp_thread_poll_with_backoff` will do some polling loops and then calling into the backoff function, platform timed wait in this case, with timeout of `1ms - elapsed` , say `950us`. However, Windows platform wait has millisecond precision, so `950us` is rounded down to `0ms` (`static_cast<DWORD>(__timeout_ns / 1'000'000)`), so the function call almost immediately returns, and `__libcpp_thread_poll_with_backoff` will keep its polling loop like this. As time goes by in the polling loop, the timeout for platform wait will decrease from `950us` to smaller and smaller number. In the `__libcpp_thread_poll_with_backoff` ```cpp if (__max_elapsed != chrono::nanoseconds::zero() && __max_elapsed < __elapsed) return __poll_with_backoff_results::__timeout; if (auto __backoff_res = __backoff(__elapsed); __backoff_res == __backoff_results::__continue_poll) ``` `__max_elapsed` is user requested timeout, which is `1ms` in this case, and `__elapsed` gradually increases and eventually, if it becomes greater than `1ms`, we have `__max_elapsed < __elapsed`, it will return and test passes. all Good. But there is a slim chance that on one loop, `__elapsed` is exactly the same number of the user requested `__max_elapsed` `1ms`, so `__max_elapsed == __elapsed`, this will make the code path go to backoff platform wait, with the timeout `__max_elapsed - __elapsed == 0ns`. So now we are requesting `__platform_wait_on_address` to wait exactly `0ns`, and due to our ambiguous API, `0ns` means wait indefinitely ```cpp __timeout_ns == 0 ? INFINITE : static_cast<DWORD>(__timeout_ns / 1'000'000) ``` The test will just hang forever. # Why not `__max_elapsed <= __elapsed` ? in `__libcpp_thread_poll_with_backoff`, If we check `__max_elapsed <= __elapsed` instead of `__max_elapsed < __elapsed`, we would avoid the call to platform wait with `0ns`. But according to the standard, I think the current `__max_elapsed < __elapsed` is more correct > The timeout expires ([[thread.req.timing]](https://eel.is/c++draft/thread.req.timing)) when the current time is after abs_time (for try_acquire_until) or when at least rel_time has passed from the start of the function (for try_acquire_for)[.](https://eel.is/c++draft/thread.sema#cnt-18.sentence-2) https://eel.is/c++draft/thread.sema#cnt-18.2 So this **after** means that it needs strictly greater than i think. # The Fix The fix is to update the internal dylib API to use `nullopt` to indicate wait indefinitely. So a call to platform wait with `0ns` will not cause a hang. Edit: after code review, the code is updated to use tag type `NoTimeout` to indicate "wait indefinitely". this is superior because it is coded into the type system instead of runtime check I made a small change as well. it is very easy to get into a situation where the requested platform wait timeout is `<1ms`, we will be keep calling Windows platform wait with `0ns` because of the rounding, and this is effectively a spin lock . I made a change such that if the requested timeout is between `100us to 1ms`, just rounded up to `1ms` to wait a bit longer (which is conforming IIUC) . let me know if this change is necessary, happy to take this part out if it is not considered good.
46 lines
1.3 KiB
C++
46 lines
1.3 KiB
C++
//===----------------------------------------------------------------------===//
|
|
//
|
|
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
|
|
// See https://llvm.org/LICENSE.txt for license information.
|
|
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
|
|
//
|
|
//===----------------------------------------------------------------------===//
|
|
//
|
|
// UNSUPPORTED: no-threads
|
|
// UNSUPPORTED: c++03, c++11, c++14, c++17
|
|
|
|
// <semaphore>
|
|
|
|
// This is a regression test for a bug in semaphore::try_acquire_for
|
|
// where it can wait indefinitely
|
|
// https://github.com/llvm/llvm-project/issues/180334
|
|
|
|
#include <semaphore>
|
|
#include <thread>
|
|
#include <chrono>
|
|
#include <cassert>
|
|
|
|
#include "make_test_thread.h"
|
|
#include "test_macros.h"
|
|
|
|
void test() {
|
|
auto const start = std::chrono::steady_clock::now();
|
|
std::counting_semaphore<> s(0);
|
|
|
|
assert(!s.try_acquire_for(std::chrono::nanoseconds(1)));
|
|
assert(!s.try_acquire_for(std::chrono::microseconds(1)));
|
|
assert(!s.try_acquire_for(std::chrono::milliseconds(1)));
|
|
assert(!s.try_acquire_for(std::chrono::milliseconds(100)));
|
|
|
|
auto const end = std::chrono::steady_clock::now();
|
|
assert(end - start < std::chrono::seconds(10));
|
|
}
|
|
|
|
int main(int, char**) {
|
|
for (auto i = 0; i < 10; ++i) {
|
|
test();
|
|
}
|
|
|
|
return 0;
|
|
}
|