On x86-64, PLT optimization does not require the binary to be linked
with -znow because indirect calls through GOT work correctly with lazy
binding. At runtime, the dynamic linker's resolver will populate the GOT
entry on the first call, just like with a regular PLT call.
This change removes the -znow requirement specifically for x86-64 while
keeping it for other architectures. I haven't checked RISV-V, but it's
still necessary on AArch64.
Rename passes to names that better reflect their intent,
and describe their relationship to each other.
InsertNegateRAStatePass renamed to PointerAuthCFIFixup,
MarkRAStates renamed to PointerAuthCFIAnalyzer.
Added the --print-<passname> flags for these passes.
During InsertNegateRAState pass we check the annotations on
instructions,
to decide where to generate the OpNegateRAState CFIs in the output
binary.
As only instructions in the input binary were annotated, we have to make
a judgement on instructions generated by other BOLT passes.
Incorrect placement may cause issues when an (async) unwind request
is received during the new "unknown" instructions.
This patch adds more logic to make a more informed decision on by taking
into account:
- unknown instructions in a BasicBlock with other instruction have the
same RAState. Previously, if the BasicBlock started with an unknown
instruction,
the RAState was copied from the preceding block. Now, the RAState is
copied from
the succeeding instructions in the same block.
- Some BasicBlocks may only contain instructions with unknown RAState,
As explained in issue #160989, these blocks already have incorrect
unwind info. Because of this, the last known RAState based on the layout order
is copied.
Updated bolt/docs/PacRetDesign.md to reflect changes.
Major part of this PR is commit implementing support for DT_INIT_ARRAY
for BOLT runtime libraries initialization. Also, it adds related
hook-init test & fixes couple of X86 instrumentation tests.
This commit follows implementation of instrumentation hook via
DT_FINI_ARRAY (https://github.com/llvm/llvm-project/pull/67348) and
extends it for BOLT runtime libraries (including instrumentation
library) initialization hooking.
Initialization has has differences compared to finalization:
- Executables always use ELF entry point address. Update code checks it
and updates init_array entry if ELF is shared library (have no interp
entry) and have no DT_INIT entry. Also this commit introduces
"runtime-lib-init-hook" option to select primary initialization hook
(entry_point, init, init_array) with fall back to next available hook in
input binary. e.g. in case of libc we can explicitly set it to
init_array.
- Shared library init_array entries relocations usually has
R_AARCH64_ABS64 type on AArch64 binaries. We check relocation type and
adjust methods for reading init_array relocations in discovery and
update methods.
---------
Co-authored-by: Vasily Leonenko <vasily.leonenko@huawei.com>
Since the "--write-dwp" option has been removed in
[PR](https://github.com/llvm/llvm-project/pull/100771), this patch also
cleans up the corresponding document and test to avoid misleading
issues.
The `--nl` flag, originally for Non-LBR mode, is deprecated and will be
replaced by `--basic-events` (alias `--ba`).
`--nl` remains as a deprecated alias for backward compatibility.
Update guides to use brstack, with a mention to BRBE for AArch64. Use
brstack in user-facing outputs.
---------
Co-authored-by: Amir Ayupov <aaupov@fb.com>
Reapply "[BOLT][AArch64] Handle OpNegateRAState to enable optimizing
binaries with pac-ret hardening (#120064)" (#162353)
This reverts commit c7d776b068.
#120064 was reverted for breaking builders.
Fix: changed the mismatched type in MarkRAStates.cpp to `auto`.
---
Original message:
OpNegateRAState is an AArch64-specific DWARF CFI used to change the value
of the RA_SIGN_STATE pseudoregister. The RA_SIGN_STATE register records
whether the current return address has been signed with PAC.
OpNegateRAState requires special handling in BOLT because its placement
depends on the function layout. Since BOLT reorders basic blocks during
optimization, these CFIs must be regenerated after layout is finalized.
This patch introduces two new passes:
- MarkRAStates (runs before optimizations): assigns a signedness annotation to each
instruction based on OpNegateRAState CFIs in the input binary.
- InsertNegateRAStates (runs after optimizations): reads the annotations and emits
new OpNegateRAState CFIs where RA state changes between instructions.
Design details are described in: `bolt/docs/PacRetDesign.md`.
OpNegateRAState is an AArch64-specific DWARF CFI used to change the value
of the RA_SIGN_STATE pseudoregister. The RA_SIGN_STATE register records
if the current return address has been signed with PAC.
OpNegateRAState requires special handling in BOLT because its placement
depends on the function layout. Since BOLT reorders basic blocks during
optimization, these CFIs must be regenerated after layout is finalized.
This patch introduces two new passes:
- MarkRAStates (runs before optimizations): assigns a signedness annotation to each
instruction based on OpNegateRAState CFIs in the input binary.
- InsertNegateRAStates (runs after optimizations): reads the annotations and emits
new OpNegateRAState CFIs where RA state changes between instructions.
Design details are described in: `bolt/docs/PacRetDesign.md`.
The pass for inlining memcpy in BOLT was currently X86-specific and was
using the instruction `rep movsb`.
This patch implements a static size analysis system for AArch64 memcpy
inlining that extracts copy sizes from preceding instructions to then
use it to generate the optimal width-specific load/store sequences.
## Change:
* Added `--dump-dot-func` command-line option that allows users to dump
CFGs only for specific functions instead of dumping all functions (the
current only available option being `--dump-dot-all`)
## Usage:
* Users can now specify function names or regex patterns (e.g.,
`--dump-dot-func=main,helper` or `--dump-dot-func="init.*`") to generate
.dot files only for functions of interest
* Aims to save time when analysing specific functions in large binaries
(e.g., only dumping graphs for performance-critical functions identified
through profiling) and we can now avoid reduce output clutter from
generating thousands of unnecessary .dot files when analysing large
binaries
## Testing
The introduced test `dump-dot-func.test` confirms the new option does
the following:
- [x] 1. `dump-dot-func` can correctly filter a specified functions
- [x] 2. Can achieve the above with regexes
- [x] 3. Can do 1. with a list of functions
- [x] No option specified creates no dot files
- [x] Passing in a non-existent function generates no dumping messages
- [x] `dump-dot-all` continues to work as expected
Add a capability to produce multiple heatmaps with given bucket sizes.
The default heatmap block size (64B) could be too fine-grained for
large binaries. Extend the option `block-size` to accept a list of
bucket sizes for additional heatmaps with coarser granularity. The
heatmap is simply rescaled so provided sizes should be multiples of
each other. Human-readable suffixes can be used, e.g. 4K, 16kb, 1MiB.
New defaults: 64B (base bucket size), 4KB (default page size),
256KB (for large binaries).
Test Plan: updated heatmap-preagg.test
Despite our attempt (build-docs.sh)
to build the documentation with SVG,
it still uses PNG https://llvm.org/doxygen/classllvm_1_1StringRef.html,
and that renders terribly on any high dpi display.
SVG leads to smasller installation and works fine
on all browser (that has been true for _a while_
https://caniuse.com/svg), so this patch just unconditionally build all
dot graphs as SVG in all subprojects and remove the option.
Identical Code Folding (ICF) folds functions that are identical into one
function, and updates symbol addresses to the new address. This reduces
the size of a binary, but can lead to problems. For example when
function pointers are compared. This can be done either explicitly in
the code or generated IR by optimization passes like Indirect Call
Promotion (ICP). After ICF what used to be two different addresses
become the same address. This can lead to a different code path being
taken.
This is where safe ICF comes in. Linker (LLD) does it using address
significant section generated by clang. If symbol is in it, or an object
doesn't have this section symbols are not folded.
BOLT does not have the information regarding which objects do not have
this section, so can't re-use this mechanism.
This implementation scans code section and conservatively marks
functions symbols as unsafe. It treats symbols as unsafe if they are
used in non-control flow instruction. It also scans through the data
relocation sections and does the same for relocations that reference a
function symbol. The latter handles the case when function pointer is
stored in a local or global variable, etc. If a relocation address
points within a vtable these symbols are skipped.
Implemented call graph function matching. First, two call graphs are
constructed for both profiled and binary functions. Then functions are
hashed based on the names of their callee/caller functions. Finally,
functions are matched based on these neighbor hashes and the
longest common prefix of their names. The `match-with-call-graph`
flag turns this matching on.
Test Plan: Added match-with-call-graph.test. Matched 164 functions
in a large binary with 10171 profiled functions.
A mapping - from namespace to associated binary functions - is used to
match function profiles to binary based on the
'--name-similarity-function-matching-threshold' flag set edit distance
threshold. The flag is set to 0 (exact name matching) by default as it is
expensive, requiring the processing of all BFs.
Test Plan: Added name-similarity-function-matching.test. On a binary
with 5M functions, rewrite passes took ~520s without the flag and
~2018s with the flag set to 20.
Added flag '--match-profile-with-function-hash' to match functions
based on exact hash. After identical and LTO name matching, more
functions can be recovered for inference with exact hash, in the case
of function renaming with no functional changes. Collisions are
possible in the unlikely case where multiple functions share the same
exact hash. The flag is off by default as it requires the processing of
all binary functions and subsequently is expensive.
Test Plan: added hashing-based-function-matching.test.
Summary: Functions with high discrepancy
(measured by matched function blocks)
can be ignored with an added command line
argument for better performance.
Test Plan: Added
stale-matching-min-matched-block.test
---------
Co-authored-by: Amir Ayupov <aaupov@fb.com>
Update the folder titles for targets in the monorepository that have not
seen taken care of for some time. These are the folders that targets are
organized in Visual Studio and XCode (`set_property(TARGET <target>
PROPERTY FOLDER "<title>")`) when using the respective CMake's IDE
generator.
* Ensure that every target is in a folder
* Use a folder hierarchy with each LLVM subproject as a top-level folder
* Use consistent folder names between subprojects
* When using target-creating functions from AddLLVM.cmake, automatically
deduce the folder. This reduces the number of
`set_property`/`set_target_property`, but are still necessary when
`add_custom_target`, `add_executable`, `add_library`, etc. are used. A
LLVM_SUBPROJECT_TITLE definition is used for that in each subproject's
root CMakeLists.txt.
Provide secondary entry points for `EntryDiscriminator` call info field
in YAML profile.
Increases BAT section size to:
- large binary: 39655300 bytes (1.03x the original),
- medium binary: 3834328 bytes (0.65x),
- small binary: 924 bytes (0.64x).
Depends on: https://github.com/llvm/llvm-project/pull/76911
Test Plan:
- Updated bolt-address-translation{,-yaml}.test
- Added openssl test: https://github.com/rafaelauler/bolt-tests/pull/30
Reviewers: dcci, rafaelauler, maksfb, ayermolo
Reviewed By: rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/86218
YAML profile reader checks the number of basic blocks in regular,
no-stale-matching mode. Add it to BAT.
This increases the size of BAT section to:
- large binary: 39583080 bytes (1.02x of the original),
- medium binary: 3816492 bytes (0.64x),
- small binary: 920 bytes (0.64x, no change due to alignment).
Test Plan: Updated bolt-address-translation-yaml.test
Reviewers: rafaelauler, ayermolo, maksfb, dcci
Reviewed By: rafaelauler
Pull Request: https://github.com/llvm/llvm-project/pull/86045
Add input basic block index to BAT metadata. This addresses the case
where some basic blocks are eliminated, and output index is not equal
to the input block index. These indices are used in non-stale-matching
mode.
Increases BAT section size to:
- large binary: 39521512 bytes (1.02x original),
- medium binary: 3799988 bytes (0.64x),
- small binary: 920 bytes (0.64x).
Test Plan:
Updated bolt-address-translation{,-yaml}.test
Pull Request: https://github.com/llvm/llvm-project/pull/86044