This also improves the error message to be more clear for folks who haven't used a lot of rst.
350 lines
12 KiB
ReStructuredText
350 lines
12 KiB
ReStructuredText
llvm-ir2vec - IR2Vec and MIR2Vec Embedding Generation Tool
|
|
==========================================================
|
|
|
|
.. program:: llvm-ir2vec
|
|
|
|
SYNOPSIS
|
|
--------
|
|
|
|
:program:`llvm-ir2vec` [*subcommand*] [*options*]
|
|
|
|
DESCRIPTION
|
|
-----------
|
|
|
|
:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec and MIR2Vec.
|
|
It generates embeddings for both LLVM IR and Machine IR (MIR) and supports
|
|
triplet generation for vocabulary training.
|
|
|
|
The tool provides three main subcommands:
|
|
|
|
1. **triplets**: Generates numeric triplets in train2id format for vocabulary
|
|
training from LLVM IR.
|
|
|
|
2. **entities**: Generates entity mapping files (entity2id.txt) for vocabulary
|
|
training.
|
|
|
|
3. **embeddings**: Generates IR2Vec or MIR2Vec embeddings using a trained vocabulary
|
|
at different granularity levels (instruction, basic block, or function).
|
|
|
|
The tool supports two operation modes:
|
|
|
|
* **LLVM IR mode** (``--mode=llvm``): Process LLVM IR bitcode files and generate
|
|
IR2Vec embeddings
|
|
* **Machine IR mode** (``--mode=mir``): Process Machine IR (.mir) files and generate
|
|
MIR2Vec embeddings
|
|
|
|
The tool is designed to facilitate machine learning applications that work with
|
|
LLVM IR or Machine IR by converting them into numerical representations that can
|
|
be used by ML models. The `triplets` subcommand generates numeric IDs directly
|
|
instead of string triplets, streamlining the training data preparation workflow.
|
|
|
|
.. note::
|
|
|
|
For information about using IR2Vec and MIR2Vec programmatically within LLVM
|
|
passes and the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
|
|
section in the MLGO documentation.
|
|
|
|
OPERATION MODES
|
|
---------------
|
|
|
|
The tool operates in two modes: **LLVM IR mode** and **Machine IR mode**. The mode
|
|
is selected using the ``--mode`` option (default: ``llvm``).
|
|
|
|
Triplet Generation and Entity Mapping Modes are used for preparing
|
|
vocabulary and training data for knowledge graph embeddings. The Embedding Mode
|
|
is used for generating embeddings from LLVM IR using a pre-trained vocabulary.
|
|
|
|
The Seed Embedding Vocabulary of IR2Vec is trained on a large corpus of LLVM IR
|
|
by modeling the relationships between opcodes, types, and operands as a knowledge
|
|
graph. For this purpose, Triplet Generation and Entity Mapping Modes generate
|
|
triplets and entity mappings in the standard format used for knowledge graph
|
|
embedding training (see
|
|
<https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch?tab=readme-ov-file#data-format>
|
|
for details).
|
|
|
|
See `llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py` for more details on how
|
|
these two modes are used to generate the triplets and entity mappings.
|
|
|
|
Triplet Generation
|
|
~~~~~~~~~~~~~~~~~~
|
|
|
|
With the `triplets` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR or Machine IR
|
|
and extracts numeric triplets consisting of opcode IDs and operand IDs. These triplets
|
|
are generated in the standard format used for knowledge graph embedding training.
|
|
The tool outputs numeric IDs directly using the vocabulary mapping infrastructure,
|
|
eliminating the need for string-to-ID preprocessing.
|
|
|
|
Usage for LLVM IR:
|
|
|
|
.. code-block:: bash
|
|
|
|
llvm-ir2vec triplets --mode=llvm input.bc -o triplets_train2id.txt
|
|
|
|
Usage for Machine IR:
|
|
|
|
.. code-block:: bash
|
|
|
|
llvm-ir2vec triplets --mode=mir input.mir -o triplets_train2id.txt
|
|
|
|
Entity Mapping Generation
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
With the `entities` subcommand, :program:`llvm-ir2vec` generates the entity mappings
|
|
supported by IR2Vec or MIR2Vec in the standard format used for knowledge graph embedding
|
|
training. This subcommand outputs all supported entities with their corresponding numeric IDs.
|
|
|
|
For LLVM IR, entities include opcodes, types, and operands. For Machine IR, entities include
|
|
machine opcodes, common operands, and register classes (both physical and virtual).
|
|
|
|
Usage for LLVM IR:
|
|
|
|
.. code-block:: bash
|
|
|
|
llvm-ir2vec entities --mode=llvm -o entity2id.txt
|
|
|
|
Usage for Machine IR:
|
|
|
|
.. code-block:: bash
|
|
|
|
llvm-ir2vec entities --mode=mir input.mir -o entity2id.txt
|
|
|
|
.. note::
|
|
|
|
For LLVM IR mode, the entity mapping is target-independent and does not require an input file.
|
|
For Machine IR mode, an input .mir file is required to determine the target architecture,
|
|
as entity mappings vary by target (different architectures have different instruction sets
|
|
and register classes).
|
|
|
|
Embedding Generation
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
With the `embeddings` subcommand, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
|
|
generate numerical embeddings for LLVM IR or Machine IR at different levels of granularity.
|
|
|
|
Example Usage for LLVM IR:
|
|
|
|
.. code-block:: bash
|
|
|
|
llvm-ir2vec embeddings --mode=llvm --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt
|
|
|
|
Example Usage for Machine IR:
|
|
|
|
.. code-block:: bash
|
|
|
|
llvm-ir2vec embeddings --mode=mir --mir2vec-vocab-path=vocab.json --level=func input.mir -o embeddings.txt
|
|
|
|
OPTIONS
|
|
-------
|
|
|
|
Common options (applicable to both LLVM IR and Machine IR modes):
|
|
|
|
.. option:: --mode=<mode>
|
|
|
|
Specify the operation mode. Valid values are:
|
|
|
|
* ``llvm`` - Process LLVM IR bitcode files (default)
|
|
* ``mir`` - Process Machine IR (.mir) files
|
|
|
|
.. option:: -o <filename>
|
|
|
|
Specify the output filename. Use ``-`` to write to standard output (default).
|
|
|
|
.. option:: --help
|
|
|
|
Print a summary of command line options.
|
|
|
|
Subcommand-specific options:
|
|
|
|
**embeddings** subcommand:
|
|
|
|
.. option:: <input-file>
|
|
|
|
The input LLVM IR/bitcode file (.ll/.bc) or Machine IR file (.mir) to process.
|
|
This positional argument is required for the `embeddings` subcommand.
|
|
|
|
.. option:: --level=<level>
|
|
|
|
Specify the embedding generation level. Valid values are:
|
|
|
|
* ``inst`` - Generate instruction-level embeddings
|
|
* ``bb`` - Generate basic block-level embeddings
|
|
* ``func`` - Generate function-level embeddings (default)
|
|
|
|
.. option:: --function=<name>
|
|
|
|
Process only the specified function instead of all functions in the module.
|
|
|
|
**IR2Vec-specific options** (for ``--mode=llvm``):
|
|
|
|
.. option:: --ir2vec-kind=<kind>
|
|
|
|
Specify the kind of IR2Vec embeddings to generate. Valid values are:
|
|
|
|
* ``symbolic`` - Generate symbolic embeddings (default)
|
|
* ``flow-aware`` - Generate flow-aware embeddings
|
|
|
|
Flow-aware embeddings consider control flow relationships between instructions,
|
|
while symbolic embeddings focus on the symbolic representation of instructions.
|
|
|
|
.. option:: --ir2vec-vocab-path=<path>
|
|
|
|
Specify the path to the IR2Vec vocabulary file (required for LLVM IR embedding
|
|
generation). The vocabulary file should be in JSON format and contain the trained
|
|
vocabulary for embedding generation. See `llvm/lib/Analysis/models`
|
|
for pre-trained vocabulary files.
|
|
|
|
.. option:: --ir2vec-opc-weight=<weight>
|
|
|
|
Specify the weight for opcode embeddings (default: 1.0). This controls
|
|
the relative importance of instruction opcodes in the final embedding.
|
|
|
|
.. option:: --ir2vec-type-weight=<weight>
|
|
|
|
Specify the weight for type embeddings (default: 0.5). This controls
|
|
the relative importance of type information in the final embedding.
|
|
|
|
.. option:: --ir2vec-arg-weight=<weight>
|
|
|
|
Specify the weight for argument embeddings (default: 0.2). This controls
|
|
the relative importance of operand information in the final embedding.
|
|
|
|
**MIR2Vec-specific options** (for ``--mode=mir``):
|
|
|
|
.. option:: --mir2vec-vocab-path=<path>
|
|
|
|
Specify the path to the MIR2Vec vocabulary file (required for Machine IR
|
|
embedding generation). The vocabulary file should be in JSON format and
|
|
contain the trained vocabulary for embedding generation.
|
|
|
|
.. option:: --mir2vec-kind=<kind>
|
|
|
|
Specify the kind of MIR2Vec embeddings to generate. Valid values are:
|
|
|
|
* ``symbolic`` - Generate symbolic embeddings (default)
|
|
|
|
.. option:: --mir2vec-opc-weight=<weight>
|
|
|
|
Specify the weight for machine opcode embeddings (default: 1.0). This controls
|
|
the relative importance of machine instruction opcodes in the final embedding.
|
|
|
|
.. option:: --mir2vec-common-operand-weight=<weight>
|
|
|
|
Specify the weight for common operand embeddings (default: 1.0). This controls
|
|
the relative importance of common operand types in the final embedding.
|
|
|
|
.. option:: --mir2vec-reg-operand-weight=<weight>
|
|
|
|
Specify the weight for register operand embeddings (default: 1.0). This controls
|
|
the relative importance of register operands in the final embedding.
|
|
|
|
|
|
**triplets** subcommand:
|
|
|
|
.. option:: <input-file>
|
|
|
|
The input LLVM IR/bitcode file (.ll/.bc) or Machine IR file (.mir) to process.
|
|
This positional argument is required for the `triplets` subcommand.
|
|
|
|
**entities** subcommand:
|
|
|
|
.. option:: <input-file>
|
|
|
|
The input Machine IR file (.mir) to process. This positional argument is required
|
|
for the `entities` subcommand when using ``--mode=mir``, as the entity mappings
|
|
are target-specific. For ``--mode=llvm``, no input file is required as IR2Vec
|
|
entity mappings are target-independent.
|
|
|
|
OUTPUT FORMAT
|
|
-------------
|
|
|
|
Triplet Mode Output
|
|
~~~~~~~~~~~~~~~~~~~
|
|
|
|
In triplet mode, the output consists of numeric triplets in train2id format with
|
|
metadata headers. The format includes:
|
|
|
|
.. code-block:: text
|
|
|
|
MAX_RELATION=<max_relation_count>
|
|
<head_entity_id> <tail_entity_id> <relation_id>
|
|
<head_entity_id> <tail_entity_id> <relation_id>
|
|
...
|
|
|
|
Each line after the metadata header represents one instruction relationship,
|
|
with numeric IDs for head entity, tail entity, and relation type. The metadata
|
|
header (MAX_RELATION) indicates the maximum relation ID used.
|
|
|
|
**Relation Types:**
|
|
|
|
For LLVM IR (IR2Vec):
|
|
* **0** = Type relationship (instruction to its type)
|
|
* **1** = Next relationship (sequential instructions)
|
|
* **2+** = Argument relationships (Arg0, Arg1, Arg2, ...)
|
|
|
|
For Machine IR (MIR2Vec):
|
|
* **0** = Next relationship (sequential instructions)
|
|
* **1+** = Argument relationships (Arg0, Arg1, Arg2, ...)
|
|
|
|
**Entity IDs:**
|
|
|
|
For LLVM IR: Entity IDs represent opcodes, types, and operands as defined by the IR2Vec vocabulary.
|
|
|
|
For Machine IR: Entity IDs represent machine opcodes, common operands (immediate, frame index, etc.),
|
|
physical register classes, and virtual register classes as defined by the MIR2Vec vocabulary. The entity layout is target-specific.
|
|
|
|
Entity Mode Output
|
|
~~~~~~~~~~~~~~~~~~
|
|
|
|
In entity mode, the output consists of entity mappings in the format:
|
|
|
|
.. code-block:: text
|
|
|
|
<total_entities>
|
|
<entity_string> <numeric_id>
|
|
<entity_string> <numeric_id>
|
|
...
|
|
|
|
The first line contains the total number of entities, followed by one entity
|
|
mapping per line with tab-separated entity string and numeric ID.
|
|
|
|
For LLVM IR, entities include instruction opcodes (e.g., "Add", "Ret"), types
|
|
(e.g., "INT", "PTR"), and operand kinds.
|
|
|
|
For Machine IR, entities include machine opcodes (e.g., "COPY", "ADD"),
|
|
common operands (e.g., "Immediate", "FrameIndex"), physical register classes
|
|
(e.g., "PhyReg_GR32"), and virtual register classes (e.g., "VirtReg_GR32").
|
|
|
|
Embedding Mode Output
|
|
~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
In embedding mode, the output format depends on the specified level:
|
|
|
|
* **Function Level**: One embedding vector per function
|
|
* **Basic Block Level**: One embedding vector per basic block, grouped by function
|
|
* **Instruction Level**: One embedding vector per instruction, grouped by basic block and function
|
|
|
|
Each embedding is represented as a floating point vector.
|
|
|
|
EXIT STATUS
|
|
-----------
|
|
|
|
:program:`llvm-ir2vec` returns 0 on success, and a non-zero value on failure.
|
|
|
|
Common failure cases include:
|
|
|
|
* Invalid or missing input file
|
|
* Missing or invalid vocabulary file (in embedding mode)
|
|
* Specified function not found in the module
|
|
* Invalid command line options
|
|
|
|
SEE ALSO
|
|
--------
|
|
|
|
:doc:`../MLGO`
|
|
|
|
For more information about the IR2Vec algorithm and approach, see:
|
|
`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
|
|
|
|
For more information about the MIR2Vec algorithm and approach, see:
|
|
`RL4ReAl: Reinforcement Learning for Register Allocation <https://doi.org/10.1145/3578360.3580273>`_.
|