RegReAssign hits an unreachable on AArch64 as it is a pass
(conceptually) specific to X86.
- Add a guard to RegReAssign for non-X86
- Update unsupported-passes.test
`--icp=<value>`/`--indirect-call-promotion=<value>` crashes with an
`UNIMPLEMENTED` error when invoked, as the pass is not implemented for
AArch64.
- Guard IndirectCallPromotion for non-X86
- Update unsupported-passes.test with expected error
AsmPrinter::MAI is non-null. This became more explicit after PR #194523
changed TargetMachine::getMCAsmInfo to return a reference as part of
the recent MCAsmInfo/MCTargetOptions refactoring.
Convert the member from const MCAsmInfo * to const MCAsmInfo & and
update all consumers.
The MAI member is non-null. #194280 made this clearer by making the
MCContext constructor take MCAsmInfo by reference. Convert getAsmInfo to
return const MCAsmInfo & and the member to a reference.
`--cmov-conversion` is unsupported on AArch64, as
convertMoveToConditionalMove() is only overridden for X86.
- Add a guard for non-X86
- Update unsupported-passes.test with expected error
Both MCContext::MCContext and TargetMachine::getMCAsmInfo treat
MCAsmInfo as a pointer that must be non-null. Make the contract
explicit:
* MCContext's constructor takes `const MCAsmInfo &MAI`.
* TargetMachine::getMCAsmInfo returns `const MCAsmInfo &`.
Make this change now since the MCContext ctor has recently been updated.
Since #180464 the canonical MCTargetOptions pointer is stored in
MCAsmInfo, but it is bound after construction via `setTargetOptions`
called from TargetRegistry::createMCAsmInfo.
Direct constructions in unit tests can leave the pointer null, leading
to a runtime assert failure. Add MCTargetOptions to every MCAsmInfo
subclass constructor, store it as a reference in MCAsmInfo, and remove
`setTargetOptions()`.
Handle signed values in parseHexField by falling back to int64_t parsing
when uint64_t fails. This allows pre-aggregated profile tools to use -1
for BR_ONLY, -2 for FT_EXTERNAL_ORIGIN, -3 for FT_EXTERNAL_RETURN.
Guard the external address reset loop in parseAggregatedLBREntry to
preserve sentinel values (offsets >= FT_EXTERNAL_RETURN).
Add tests for -1/-2/-3 in parseHexField and T entries with -1,
ffffffffffffffff, and buildid:-1 as BR_ONLY.
JTFootprintReduction is a no-op on AArch64 because it calls
createIJmp32Frag(), which is unimplemented for AArch64 and only
overridden for x86.
- Add a guard for non-x86
- Update unsupported-passes.test with expected error message
Fix two null pointer dereferences in BOLT's DWP processing path that
cause SIGSEGV in worker threads when -update-debug-sections is used with
a co-located .dwp file.
1. getSliceData() in updateDebugData() dereferences the result of
getContribution() without checking for null. getContribution() returns
nullptr when the requested section kind (e.g. DW_SECT_LINE) is not
present as a column in the DWP CU index. When BOLT processes a DWP where
certain section kinds are absent from the index, every worker thread
that hits this path crashes simultaneously.
2. processSplitCU() dereferences getUnitDIEbyUnit() without checking for
null. If buildDWOUnit() fails for a CU, the returned DIE* is null and
the dereference crashes.
Crash signature from dmesg:
```
llvm-worker-*: segfault at 8 ip <offset> error 4 in llvm-bolt
(multiple worker threads crash at the same instruction)
```
The faulting address 0x8 corresponds to accessing the Length field
(offset 8) of a null `DWARFUnitIndex::Entry::SectionContribution*`.
At Meta, I reproduced this building hhvm with a co-located .dwp file and
the flags `update-debug-sections -debug-thread-count=80 -lite=0` with
profile data.
I confirmed that the unfixed BOLT crashes deterministically whereas the
fixed BOLT completes successfully.
The mold linker creates a relaxation stub from TLSDESC to LE (lld
relaxes it to IE) using the sequence NOP+NOP+MOVZ+MOVK. This in itself
is not an issue: when --emit-relocs is added, the relocations
R_AARCH64_TLSDESC_ADD_LO12 and R_AARCH64_TLSDESC_CALL are associated
with the useful MOVW instructions. However, BOLT does not check for
R_AARCH64_TLSDESC_ADD_LO12 in adjustRelocation() when disassembling the
file. This later triggers a bug when the relocation is patched: the
MOVK is patched with an S_LO12 fixup kind, which is invalid.
Refer to bug: https://github.com/llvm/llvm-project/issues/190366 for
details.
When writeEHFrameHeader needs to allocate new space for .eh_frame_hdr
(because the old section is too small), it calls appendPadding to align
NextAvailableAddress. appendPadding writes zero bytes at the current
stream position, but after the section write loop in rewriteFile the
stream is positioned at the end of the last section written in
BinarySection::operator< order — not at the file offset corresponding to
NextAvailableAddress.
In the common case (single loadObject call) the write order matches file
offset order, so the stream happens to be in the right place. But when a
runtime library adds sections via additional loadObject calls, the
operator< iteration order (code-before-data) can diverge from file
offset order: a runtime library code section may have a higher file
offset than a runtime library data section that comes after it in the
write loop. The stream then ends at a lower offset than expected, and
appendPadding's zeros overwrite the beginning of the code section.
Fix by seeking to the correct file offset before calling appendPadding.
Follow-up to #192289. Swap the remaining `std::unordered_set`/
`std::unordered_map` containers in `Instrumentation.cpp` for `DenseSet`/
`DenseMap`: the `BBToSkip` param and `Visited` local in
`hasAArch64ExclusiveMemop`, and `BBToSkip`, `BBToID`, `VisitedSet` in
`instrumentFunction`. Drop the now-unused `<unordered_set>` include.
The swap removes per-element heap allocations on the hot path, stops
inserting empty buckets on probes where a miss is possible, and replaces
hashed-bucket traversal over node-based storage with lookups over inline
`DenseMap` storage. `BBToID` reads keep `operator[]` since the map is
pre-populated for every basic block of the function, so no
default-construct path is ever taken. NFC.
Measured on `llvm-bolt -instrument` against a relocations-linked
clang-23: -1.3% instrumentation-pass wall time, peak RSS unchanged
(dominated by instrumentation output size).
BOLT hardcoded 4-byte LSDA (exception table) encoding for x86-64. This
is insufficient for large code model binaries where functions in .ltext
sections may be placed at addresses above 2GB, exceeding the range of
DW_EH_PE_udata4/DW_EH_PE_sdata4 encodings.
Detect large code model by checking for .ltext sections
(SHF_X86_64_LARGE) and update LSDAEncoding to use 8-byte pointers:
- Non-PIC: DW_EH_PE_absptr (8-byte absolute)
- PIC: DW_EH_PE_pcrel | DW_EH_PE_sdata8 (8-byte PC-relative)
This was pulled out from
https://github.com/llvm/llvm-project/pull/190637
Swap `std::unordered_map<…, std::set<…>>` for
`DenseMap<…, SmallVector<…>>` in `Instrumentation::instrumentFunction`
and switch read paths from `STOutSet[&BB]` to `find()`. This removes
per-set heap allocations, stops inserting empty buckets on every probe,
and replaces linear `is_contained()` scans over a red-black tree with
linear scans over inline `SmallVector` storage (most basic blocks have
at most a couple of spanning-tree out-edges). NFC.
`AddressSize` parameter is not used by `DataExtractor` and will be
removed in the future. See #190519 for more context.
I took the liberty of switching from using the `StringRef` constructor
overload to `ArrayRef` where appropriate.
Most clients don't have a notion of "address" and pass arbitrary values
(including `0` and `sizeof(void *)`) to `DataExtractor` constructors.
This makes address-extraction methods dangerous to use.
Those clients that do have a notion of address can use other methods
like `getUnsigned()` to extract an address, or they can derive from
`DataExtractor` and add convenience methods if extracting an address is
routine. `DWARFDataExtractor` is an example, where the removed methods
were actually moved.
This does not remove `AddressSize` argument of `DataExtractor`
constructors yet, but makes it unused and overloads constructors in
preparation for their deletion. I'll be removing uses of the
to-be-deleted constructors in follow-up patches.
Fix iterator misuse in five BOLT passes, caught by _GLIBCXX_DEBUG
(enabled via LLVM_ENABLE_EXPENSIVE_CHECKS=ON).
* AllocCombiner: combineAdjustments() erases instructions while
iterating in reverse via llvm::reverse(BB), invalidating the reverse
iterator. Defer erasures to after the loop using a SmallVector.
* ShrinkWrapping: processDeletions() uses
std::prev(BB.eraseInstruction(II)) which is undefined when II ==
begin(). Restructure to standard forward iteration with erase.
* DataflowAnalysis: run() unconditionally dereferences BB->rbegin(),
which crashes on empty basic blocks (possible after the ShrinkWrapping
fix). Guard with an emptiness check.
* IndirectCallPromotion: rewriteCall() dereferences the end iterator via
&(*IndCallBlock.end()). Replace with &IndCallBlock.back().
* TailDuplication: constantAndCopyPropagate() uses
std::prev(OriginalBB.eraseInstruction(Itr)) which is undefined when Itr
== begin(). Restructure to standard forward iteration with erase.
Replace the fragile filename-based check (ends_with(".so")) with
identify_magic()/file_magic::elf_shared_object to reliably detect
shared libraries when filtering pre-aggregated profile data by
build ID.
Test Plan: pre-aggregated-perf-shlib.test
The compareSections lambda in getCodeSections() violates the strict weak
ordering requirement: when A == B, the comparator can return true (e.g.
via the HotText mover name check), which triggers a _GLIBCXX_DEBUG
assertion on self-comparison.
Add an early identity check to satisfy irreflexivity.
Except MC-internal `MCAsmInfo()` uses, MCAsmInfo is always constructed
with `const MCTargetOptions &` via `TargetRegistry::createMCAsmInfo`
(https://reviews.llvm.org/D41349). Store the pointer in MCAsmInfo and
change `MCContext::getTargetOptions()` to retrieve it from there,
removing the `MCTargetOptions const *TargetOptions` member from
MCContext.
MCContext's constructor still accepts an MCTargetOptions parameter
for now but is often omitted by call sites.
A subsequent change will remove this parameter and update all callers.
On AArch64, logical immediate instructions are used to encode some
special immediate values. Even at `-O0`, the AArch64 backend would not
generate a 4-instruction sequence (movz, movk, movk, movk) to move such
a special value into a 64-bit register.
For example, to move the 64-bit value `0x0001000100010001` to `x0`, the
AArch64 backend would not choose a 4-instruction-sequence like
```
movz x0, 0x0001
movk x0, 0x0001, lsl 16
movk x0, 0x0001, lsl 32
movk x0, 0x0001, lsl 48
```
Actually, the AArch64 backend would choose to generate one instruction
```
mov x0, 0x0001000100010001
```
which is essentially
```
orr x0, xzr, 0x0001000100010001
```
We could refer to `AArch64ExpandPseudoImpl::expandMOVImm` and
`AArch64_IMM::expandMOVImm` for related implementation.
Therefore, we could consider leveraging `expandMOVImm` in LLVM to
optimize mov-immediate-to-register operations in BOLT, which would help
speed up BOLT-instrumented binaries.
Allow `parseString()` to return an empty `StringRef` when the delimiter
appears at position 0. This enables parsing pre-aggregated profile
addresses with an omitted buildid but preserved colon (`:addr` format),
where the empty buildid corresponds to the main binary.
Previously, `parseString()` rejected zero-length fields by treating
`StringEnd == 0` the same as `StringRef::npos` (delimiter not found).
These are distinct situations: `npos` means no delimiter exists, while
`0` means the field before the delimiter is empty. The fix removes the
`StringEnd == 0` sub-condition so only the missing-delimiter case
errors.
The existing test for buildid-prefixed addresses is extended to also
verify that `:addr` input produces identical output to the plain-address
and non-empty-buildid variants.
Test Plan:
Added empty-buildid input file and extended
`pre-aggregated-perf-buildid.test` to run perf2bolt with `:addr` format
and diff the fdata output against the existing buildid-prefixed result.
perf2bolt generates empty fdata files for small binaries, and right now
BOLT only detects this after parsing, via the `(!hasBranchData() &&
!hasMemData())` check. Instead, exit early with an error message as
soon as the data file buffer has been read and found to contain no
data.
Template patchELFPHDRTable, rewriteNoteSections, markGnuRelroSections,
and discoverStorage to support both ELF32LE and ELF64LE binaries.
Previously these functions were hardcoded for ELF64LE, causing crashes
when processing 32-bit ELF binaries.
The RewriteInstance constructor now accepts ELF32LE objects in addition
to ELF64LE. The ELF_FUNCTION macro is reused (and moved earlier in the
header) to dispatch to the correct template instantiation.
These changes are in preparation for adding support for the Hexagon
architecture to BOLT.
Summary:
When the disk runs out of space during output file writing, BOLT would
crash with SIGSEGV/SIGABRT because raw_fd_ostream silently records write
errors and only reports them via abort() in its destructor. This made it
difficult to distinguish real BOLT bugs from infrastructure issues in
production monitoring.
Add an explicit error check on the output stream before calling
Out->keep(), so BOLT exits cleanly with exit code 1 and a clear error
message instead.
Test: manually verified with a full filesystem that BOLT now prints
"BOLT-ERROR: failed to write output file: No space left on device" and
exits with code 1.
In this patch I am adding the missing target hooks required for the
liveness analysis to run on AArch64. These are
- getFlagsReg()
- getRegsUsedAsParams()
- getDefaultLiveOut()
- getGPRegs()
- isCleanRegXOR()
I am also introducing the following API in LivenessAnalysis
- BitVector getLiveIn/Out(const MCInst &)
- MCPhysReg scavengeRegFromState(BitVector &)
My intention is to allow the LongJmp pass to scavenge usable registers
when injecting code.
When the compact code model is used, LongJmpPass::relaxLocalBranches
attempts to reverseBranchCondition without calling isReversibleBranch,
resulting in a runtime error. With this patch I am adding an additional
trampoline to handle irreversible FEAT_CMPBR branches.
In the future the plan is to use liveness analysis and replace the
irreversible branch with compare followed by branch (see #185731) as
long as the condition flags are dead, or emit the additional trampoline
otherwise.
Sample addresses belonging to external DSOs (buildid doesn't match the
current file) are treated as external (0).
Buildid for the main binary is expected to be omitted.
Test Plan:
added pre-aggregated-perf-buildid.test
Reviewers:
paschalis-mpeis, maksfb, yavtuk, ayermolo, yozhu, rafaelauler, yota9
Reviewed By: paschalis-mpeis
Pull Request: https://github.com/llvm/llvm-project/pull/186931
Adding a generator to perf2bolt is the initial step toward supporting
large end-to-end tests for Arm SPE. It exercises the unified
pre-parsed profile format that perf2bolt is able to consume.
Why does the test need to have a textual format SPE profile?
* To collect an Arm SPE profile with Linux Perf, an Arm development
device with SPE support is needed.
* To decode SPE data, it also needs to have the proper version of
Linux Perf.
* The minimum required version of Linux Perf is v6.15.
To bypass these technical difficulties, it is easier to test against a
pre-generated textual profile.
The generator relies on the aggregator to spawn the required
perf-script jobs based on the aggregation type, and merges the results
of the perf-script jobs into a single file.
This hybrid profile will contain all events required for the
aggregation, such as BuildID, MMAP, TASK, BRSTACK, or MEM events.
Below are two examples of how to generate pre-parsed perf data as
input for Arm SPE aggregation:
`perf2bolt -p perf.data BINARY -o perf.text --spe
--generate-perf-script`
Or for basic aggregation:
`perf2bolt -p perf.data BINARY -o perf.text --ba --generate-perf-script`
Remove some unused code in BOLT:
- `RewriteInstance::linkRuntime` is declared but not defined
- `BranchContext` typedef is never used
- `FuncBranchData::getBranch` is defined but never used
- `FuncBranchData::getDirectCallBranch` is defined but never used
The assert condition (function is not split, or is split into fewer
than three fragments) no longer always holds now that we emit more
local symbols due to #184074.
This commit makes the instrumentation-file-append-pid and
instrumentation-sleep-time options compatible. It also requires keeping
the counters mapping between the watcher process and the instrumented
binary's process in shared mode. This is useful when we instrument a
shared library used by several tasks running on the target system. In
cases where we cannot wait for every task to complete, we must use the
sleep-time option; without the append-pid option, we would overwrite
the profile at the same path with data collected from different tasks,
leading to unexpected or suboptimal optimization effects.
Co-authored-by: Vasily Leonenko <vasily.leonenko@huawei.com>
Allow `--function-order` to be combined with `--reorder-functions`
algorithms. Functions listed in the order file are pinned first
(indices 0..N-1), then the selected algorithm orders remaining
functions starting at index N.
Add separate options to enable each of the available gadget detectors.
Furthermore, add two meta-options enabling all PtrAuth scanners and all
available scanners of any type (which is only PtrAuth for now, though).
This commit renames `pacret` option to `ptrauth-pac-ret` and `pauth` to
`ptrauth-all`.
Launch this perf job with the others at the beginning of the aggregation
process.
Extracting buildid-list from perf data is not a costly process, so it
can be performed by default. This provides a distinct advantage when
this dataset is required in other perf2bolt stages as well.
Please see PR #171144.
Some binaries are built using `-gz=zstd`, but when using
`--update-debug-sections` on said binaries BOLT crashes.
This patch fixes this issue by recognising compressed debug sections in
binaries via their flag `SHF_COMPRESSED` and appropriately erroring out.
Legacy GNU-style compression is not handled.
The "private global" terminology, which likely came from
llvm/lib/IR/Mangler.cpp, is misleading: "private" is the opposite of
"global", and these prefixed symbols are not global in the object file
format sense (e.g. ELF has STB_GLOBAL, while these symbols are always
STB_LOCAL). The term "internal symbol" better describes their purpose:
symbols for internal use by compilers and assemblers, not meant to be
visible externally.
This rename is a step toward adopting the "internal symbol prefix"
terminology agreed with GNU as
(https://sourceware.org/pipermail/binutils/2026-March/148448.html).
There are cases in which `getEntryIDForSymbol` is called where the
given symbol is in a constant island, so BOLT cannot find its
function. This causes BOLT to reach `llvm_unreachable("symbol not
found")` and crash. This patch adds a check that avoids the crash.
BOLT currently strips all STT_NOTYPE STB_LOCAL zero-sized symbols
that fall inside function bodies. Some of these symbols are named
labels (loop markers and subroutine entry points) or local function
symbols in hand-written assembly. We now keep them in the local symbol
table of BOLT-processed binaries for better symbolication.
`BinaryFunction::translateInputToOutputAddress()` contains fallback
logic for the case in which querying `IOAddressMap` doesn't yield an
output address. Because this function can be called in scenarios where
`IOAddressMap` won't be set up, we should check that the map actually
exists before the lookup.
When applying BTI fixups to indirect branch targets, ignored functions
are a special case:
- these hold no instructions,
- have no CFG,
- and are not emitted in the new text section.
The solution is to patch the entry points in the original location.
If such a situation occurs in a binary, recompilation with the
-fpatchable-function-entry flag is required. This places a nop at every
function start, which BOLT can use to patch the original section.
Without the extra nop, BOLT cannot safely patch the original .text
section.
An alternative solution could be to also ignore the function from
which the stub starts. This has not been tried, as the LongJmp pass
(where most stubs are inserted) is currently not equipped to ignore
functions.
Testing: both the success and failure cases are covered with lit tests.
Insert new PT_LOAD segments right after the last existing PT_LOAD in the
program header table, instead of before PT_DYNAMIC or at the end. This
maintains the ascending p_vaddr order required by the ELF specification.
Previously, new segments could end up breaking the ascending PT_LOAD
p_vaddr order when PT_LOAD segments followed PT_DYNAMIC or
PT_GNU_STACK. This led to the runtime loader incorrectly assessing the
dynamic object size and silently corrupting memory.
Summary:
When a .bolt_reserved section is defined in the linker script, there is
no way to mark the containing segment executable other than via the
PHDRS command, which overrides the program headers entirely and is
impractical.
Since .bolt_reserved contains executable code, mark segment executable
in BOLT.
Test Plan: bolt-reserved.test