Files
llvm-project/llvm/lib/Support/MemoryBuffer.cpp
Alexandre Ganea 77b8b33b5a [LLD][COFF] Prefetch inputs early-on to improve link times (#169224)
This PR reduces outliers in terms of runtime performance, by asking the
OS to prefetch memory-mapped input files in advance, as early as
possible. I have implemented the Linux aspect, however I have only
tested this on Windows 11 version 24H2, with an active security stack
enabled. The machine is a AMD Threadripper PRO 3975WX 32c/64t with 128
GB of RAM and Samsung 990 PRO SSD.

I have used a Unreal Engine-based game to profile the link times. Here's
a quick summary of the input data:
```
                                    Summary
--------------------------------------------------------------------------------
               4,169 Input OBJ files (expanded from all cmd-line inputs)
      26,325,429,114 Size of all consumed OBJ files (non-lazy), in bytes
                   9 PDB type server dependencies
                   0 Precomp OBJ dependencies
         350,516,212 Input debug type records
      18,146,407,324 Size of all input debug type records, in bytes
          15,709,427 Merged TPI records
           4,747,187 Merged IPI records
              56,408 Output PDB strings
          23,410,278 Global symbol records
          45,482,231 Module symbol records
           1,584,608 Public symbol records
```

In normal conditions - meanning all the pages are already in RAM - this
PR has no noticeable effect:
```
>hyperfine "before\lld-link.exe @Game.exe.rsp" "with_pr\lld-link.exe @Game.exe.rsp"
Benchmark 1: before\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     29.689 s ±  0.550 s    [User: 259.873 s, System: 37.936 s]
  Range (min … max):   29.026 s … 30.880 s    10 runs

Benchmark 2: with_pr\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     29.594 s ±  0.342 s    [User: 261.434 s, System: 62.259 s]
  Range (min … max):   29.209 s … 30.171 s    10 runs

Summary
  with_pr\lld-link.exe @Game.exe.rsp ran
    1.00 ± 0.02 times faster than before\lld-link.exe @Game.exe.rsp
```

However when in production conditions, we're typically working with the
Unreal Engine Editor, with exteral DCC tools like Maya, Houdini; we have
several instances of Visual Studio open, VSCode with Rust analyzer, etc.
All this means that between code change iterations, most of the input
OBJs files might have been already evicted from the Windows RAM cache.
Consequently, in the following test, I've simulated the worst case
condition by evicting all data from RAM with
[RAMMap64](https://learn.microsoft.com/en-us/sysinternals/downloads/rammap)
(ie. `RAMMap64.exe -E[wsmt0]` with a 5-sec sleep at the end to ensure
the System thread actually has time to evict the pages)
```
>hyperfine -p cleanup.bat "before\lld-link.exe @Game.exe.rsp" "with_pr\lld-link.exe @Game.exe.rsp"
Benchmark 1: before\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     48.124 s ±  1.770 s    [User: 269.031 s, System: 41.769 s]
  Range (min … max):   46.023 s … 50.388 s    10 runs

Benchmark 2: with_pr\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     34.192 s ±  0.478 s    [User: 263.620 s, System: 40.991 s]
  Range (min … max):   33.550 s … 34.916 s    10 runs

Summary
  with_pr\lld-link.exe @Game.exe.rsp ran
    1.41 ± 0.06 times faster than before\lld-link.exe @Game.exe.rsp
```

This is similar to the work done in MachO in
https://github.com/llvm/llvm-project/pull/157917
2026-01-09 15:49:35 +00:00

23 KiB