
llvm-ir2vec - IR2Vec and MIR2Vec Embedding Generation Tool
==========================================================

.. program:: llvm-ir2vec

SYNOPSIS
--------

:program:`llvm-ir2vec` [*subcommand*] [*options*]

DESCRIPTION
-----------

:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec and MIR2Vec.
It generates embeddings for both LLVM IR and Machine IR (MIR) and supports
triplet generation for vocabulary training.

The tool provides three main subcommands:

1. **triplets**: Generates numeric triplets in train2id format for vocabulary
   training from LLVM IR or Machine IR.

2. **entities**: Generates entity mapping files (entity2id.txt) for vocabulary
   training.

3. **embeddings**: Generates IR2Vec or MIR2Vec embeddings using a trained vocabulary
   at different granularity levels (instruction, basic block, or function).

The tool supports two operation modes:

* **LLVM IR mode** (``--mode=llvm``): Processes LLVM IR or bitcode files
  (.ll/.bc) and generates IR2Vec embeddings.
* **Machine IR mode** (``--mode=mir``): Processes Machine IR (.mir) files and
  generates MIR2Vec embeddings.

The tool supports machine learning applications that work with LLVM IR or
Machine IR by converting the IR into numerical representations that ML models
can consume. The ``triplets`` subcommand emits numeric IDs directly rather than
string triplets, which streamlines training data preparation.

.. note::

   For information about using IR2Vec and MIR2Vec programmatically within LLVM
   passes and the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
   section in the MLGO documentation.

OPERATION MODES
---------------

The tool operates in two modes: **LLVM IR mode** and **Machine IR mode**. The mode
is selected using the ``--mode`` option (default: ``llvm``).

The ``triplets`` and ``entities`` subcommands prepare the vocabulary and
training data for knowledge graph embeddings. The ``embeddings`` subcommand
generates embeddings from LLVM IR or Machine IR using a pre-trained vocabulary.

The Seed Embedding Vocabulary of IR2Vec is trained on a large corpus of LLVM IR
by modeling the relationships between opcodes, types, and operands as a knowledge
graph. For this purpose, the ``triplets`` and ``entities`` subcommands generate
triplets and entity mappings in the standard format used for knowledge graph
embedding training (see the
`OpenKE data format <https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch?tab=readme-ov-file#data-format>`_
for details).

See ``llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py`` for more details on how
these two subcommands are used to generate the triplets and entity mappings.

Triplet Generation
~~~~~~~~~~~~~~~~~~

With the ``triplets`` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR or Machine IR
and extracts numeric triplets consisting of opcode IDs and operand IDs. These triplets
are generated in the standard format used for knowledge graph embedding training.
The tool outputs numeric IDs directly using the vocabulary mapping infrastructure,
eliminating the need for string-to-ID preprocessing.

Usage for LLVM IR:

.. code-block:: bash

   llvm-ir2vec triplets --mode=llvm input.bc -o triplets_train2id.txt

Usage for Machine IR:

.. code-block:: bash

   llvm-ir2vec triplets --mode=mir input.mir -o triplets_train2id.txt

Entity Mapping Generation
~~~~~~~~~~~~~~~~~~~~~~~~~

With the ``entities`` subcommand, :program:`llvm-ir2vec` generates the entity mappings
supported by IR2Vec or MIR2Vec in the standard format used for knowledge graph embedding
training. This subcommand outputs all supported entities with their corresponding numeric IDs.

For LLVM IR, entities include opcodes, types, and operands. For Machine IR, entities include
machine opcodes, common operands, and register classes (both physical and virtual).

Usage for LLVM IR:

.. code-block:: bash

   llvm-ir2vec entities --mode=llvm -o entity2id.txt

Usage for Machine IR:

.. code-block:: bash

   llvm-ir2vec entities --mode=mir input.mir -o entity2id.txt

.. note::

   For LLVM IR mode, the entity mapping is target-independent and does not require an input file.
   For Machine IR mode, an input .mir file is required to determine the target architecture,
   as entity mappings vary by target (different architectures have different instruction sets
   and register classes).

Embedding Generation
~~~~~~~~~~~~~~~~~~~~

With the ``embeddings`` subcommand, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
generate numerical embeddings for LLVM IR or Machine IR at different levels of granularity.

Example Usage for LLVM IR:

.. code-block:: bash

   llvm-ir2vec embeddings --mode=llvm --ir2vec-vocab-path=vocab.json --ir2vec-kind=symbolic --level=func input.bc -o embeddings.txt

Example Usage for Machine IR:

.. code-block:: bash

   llvm-ir2vec embeddings --mode=mir --mir2vec-vocab-path=vocab.json --level=func input.mir -o embeddings.txt
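
When the tool is driven from a script, the invocation can be assembled from the
flags documented on this page. The following is a minimal sketch and not part
of LLVM: ``build_embeddings_cmd`` is a hypothetical helper name, and it only
builds the argument list without running anything.

.. code-block:: python

   from typing import List, Optional


   def build_embeddings_cmd(
       input_file: str,
       vocab_path: str,
       mode: str = "llvm",
       level: str = "func",
       function: Optional[str] = None,
       output: str = "-",
   ) -> List[str]:
       """Build an argv list for ``llvm-ir2vec embeddings`` (hypothetical helper)."""
       if mode not in ("llvm", "mir"):
           raise ValueError(f"unknown mode: {mode!r}")
       if level not in ("inst", "bb", "func"):
           raise ValueError(f"unknown level: {level!r}")
       # The vocabulary flag name depends on the selected mode.
       vocab_flag = "--ir2vec-vocab-path" if mode == "llvm" else "--mir2vec-vocab-path"
       cmd = [
           "llvm-ir2vec",
           "embeddings",
           f"--mode={mode}",
           f"{vocab_flag}={vocab_path}",
           f"--level={level}",
       ]
       # Restrict processing to one function only when a name is given.
       if function is not None:
           cmd.append(f"--function={function}")
       cmd += [input_file, "-o", output]
       return cmd

The returned list can be passed to ``subprocess.run``. Leaving ``function``
unset processes every function in the module.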

OPTIONS
-------

Common options (applicable to both LLVM IR and Machine IR modes):

.. option:: --mode=<mode>

   Specify the operation mode. Valid values are:

   * ``llvm`` - Process LLVM IR bitcode files (default)
   * ``mir`` - Process Machine IR (.mir) files

.. option:: -o <filename>

   Specify the output filename. The default, ``-``, writes to standard output.

.. option:: --help

   Print a summary of command line options.

Subcommand-specific options:

**embeddings** subcommand:

.. option:: <input-file>

   The input LLVM IR/bitcode file (.ll/.bc) or Machine IR file (.mir) to process.
   This positional argument is required for the ``embeddings`` subcommand.

.. option:: --level=<level>

   Specify the embedding generation level. Valid values are:

   * ``inst`` - Generate instruction-level embeddings
   * ``bb`` - Generate basic block-level embeddings
   * ``func`` - Generate function-level embeddings (default)

.. option:: --function=<name>

   Process only the specified function instead of all functions in the module.

**IR2Vec-specific options** (for ``--mode=llvm``):

.. option:: --ir2vec-kind=<kind>

   Specify the kind of IR2Vec embeddings to generate. Valid values are:

   * ``symbolic`` - Generate symbolic embeddings (default)
   * ``flow-aware`` - Generate flow-aware embeddings

   Flow-aware embeddings consider control flow relationships between instructions,
   while symbolic embeddings focus on the symbolic representation of instructions.

.. option:: --ir2vec-vocab-path=<path>

   Specify the path to the IR2Vec vocabulary file (required for LLVM IR embedding
   generation). The vocabulary file should be in JSON format and contain the trained
   vocabulary for embedding generation. See ``llvm/lib/Analysis/models``
   for pre-trained vocabulary files.

.. option:: --ir2vec-opc-weight=<weight>

   Specify the weight for opcode embeddings (default: 1.0). This controls
   the relative importance of instruction opcodes in the final embedding.

.. option:: --ir2vec-type-weight=<weight>

   Specify the weight for type embeddings (default: 0.5). This controls
   the relative importance of type information in the final embedding.

.. option:: --ir2vec-arg-weight=<weight>

   Specify the weight for argument embeddings (default: 0.2). This controls
   the relative importance of operand information in the final embedding.
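
To build intuition for how the three weights interact, the sketch below
combines per-component vectors using a plain weighted sum with the default
weights listed above. This is an illustration under stated assumptions:
``combine_instruction`` is a hypothetical name, the vectors are toy values, and
the weighted-sum form is assumed here rather than taken from the tool's source.

.. code-block:: python

   from typing import List, Sequence


   def combine_instruction(
       opcode_vec: Sequence[float],
       type_vec: Sequence[float],
       arg_vecs: Sequence[Sequence[float]],
       opc_weight: float = 1.0,
       type_weight: float = 0.5,
       arg_weight: float = 0.2,
   ) -> List[float]:
       """Weighted sum of opcode, type, and argument vectors (assumed form)."""
       # Start from the weighted opcode and type contributions.
       result = [opc_weight * o + type_weight * t
                 for o, t in zip(opcode_vec, type_vec)]
       # Add each argument vector scaled by the argument weight.
       for arg in arg_vecs:
           for i, value in enumerate(arg):
               result[i] += arg_weight * value
       return result

Raising one weight relative to the others makes that component dominate the
resulting vector.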

**MIR2Vec-specific options** (for ``--mode=mir``):

.. option:: --mir2vec-vocab-path=<path>

   Specify the path to the MIR2Vec vocabulary file (required for Machine IR
   embedding generation). The vocabulary file should be in JSON format and
   contain the trained vocabulary for embedding generation.

.. option:: --mir2vec-kind=<kind>

   Specify the kind of MIR2Vec embeddings to generate. Valid values are:

   * ``symbolic`` - Generate symbolic embeddings (default)

.. option:: --mir2vec-opc-weight=<weight>

   Specify the weight for machine opcode embeddings (default: 1.0). This controls
   the relative importance of machine instruction opcodes in the final embedding.

.. option:: --mir2vec-common-operand-weight=<weight>

   Specify the weight for common operand embeddings (default: 1.0). This controls
   the relative importance of common operand types in the final embedding.

.. option:: --mir2vec-reg-operand-weight=<weight>

   Specify the weight for register operand embeddings (default: 1.0). This controls
   the relative importance of register operands in the final embedding.


**triplets** subcommand:

.. option:: <input-file>

   The input LLVM IR/bitcode file (.ll/.bc) or Machine IR file (.mir) to process.
   This positional argument is required for the ``triplets`` subcommand.

**entities** subcommand:

.. option:: <input-file>

   The input Machine IR file (.mir) to process. This positional argument is required
   for the ``entities`` subcommand when using ``--mode=mir``, as the entity mappings
   are target-specific. For ``--mode=llvm``, no input file is required as IR2Vec
   entity mappings are target-independent.

OUTPUT FORMAT
-------------

Triplet Mode Output
~~~~~~~~~~~~~~~~~~~

In triplet mode, the output consists of numeric triplets in train2id format,
preceded by a single metadata header line:

.. code-block:: text

   MAX_RELATION=<max_relation_id>
   <head_entity_id> <tail_entity_id> <relation_id>
   <head_entity_id> <tail_entity_id> <relation_id>
   ...

Each line after the metadata header represents one instruction relationship,
with numeric IDs for the head entity, tail entity, and relation type. The
metadata header (``MAX_RELATION``) gives the maximum relation ID used.

**Relation Types:**

For LLVM IR (IR2Vec):
  * **0** = Type relationship (instruction to its type)
  * **1** = Next relationship (sequential instructions)
  * **2+** = Argument relationships (Arg0, Arg1, Arg2, ...)

For Machine IR (MIR2Vec):
  * **0** = Next relationship (sequential instructions)
  * **1+** = Argument relationships (Arg0, Arg1, Arg2, ...)
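
The layout and relation numbering described above can be read back with a few
lines of Python. This is a minimal sketch, not part of LLVM:
``parse_train2id`` and ``relation_name`` are hypothetical helper names, and
fields are assumed to be separated by whitespace.

.. code-block:: python

   from typing import List, Tuple

   Triplet = Tuple[int, int, int]  # (head, tail, relation)


   def parse_train2id(text: str) -> Tuple[int, List[Triplet]]:
       """Parse triplet output into (MAX_RELATION value, list of triplets)."""
       lines = [ln for ln in text.splitlines() if ln.strip()]
       if not lines or not lines[0].startswith("MAX_RELATION="):
           raise ValueError("missing MAX_RELATION header")
       max_relation = int(lines[0].split("=", 1)[1])
       triplets = []
       for ln in lines[1:]:
           # Each remaining line holds head, tail, and relation IDs.
           head, tail, rel = (int(field) for field in ln.split())
           triplets.append((head, tail, rel))
       return max_relation, triplets


   def relation_name(rel: int, mode: str = "llvm") -> str:
       """Map a relation ID to the relation names listed above."""
       if mode == "llvm":
           if rel == 0:
               return "Type"
           if rel == 1:
               return "Next"
           return f"Arg{rel - 2}"
       # Machine IR has no Type relation, so arguments start at ID 1.
       if rel == 0:
           return "Next"
       return f"Arg{rel - 1}"

For example, relation ID ``2`` maps to ``Arg0`` in LLVM IR mode and to ``Arg1``
in Machine IR mode.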

**Entity IDs:**

For LLVM IR: Entity IDs represent opcodes, types, and operands as defined by the IR2Vec vocabulary.

For Machine IR: Entity IDs represent machine opcodes, common operands (immediate, frame index, etc.),
physical register classes, and virtual register classes as defined by the MIR2Vec vocabulary. The entity layout is target-specific.

Entity Mode Output
~~~~~~~~~~~~~~~~~~

In entity mode, the output consists of entity mappings in the format:

.. code-block:: text

   <total_entities>
   <entity_string>	<numeric_id>
   <entity_string>	<numeric_id>
   ...

The first line contains the total number of entities, followed by one entity
mapping per line with tab-separated entity string and numeric ID.

For LLVM IR, entities include instruction opcodes (e.g., "Add", "Ret"), types
(e.g., "INT", "PTR"), and operand kinds.

For Machine IR, entities include machine opcodes (e.g., "COPY", "ADD"),
common operands (e.g., "Immediate", "FrameIndex"), physical register classes
(e.g., "PhyReg_GR32"), and virtual register classes (e.g., "VirtReg_GR32").
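
A consumer of this file can load it into a dictionary and use the leading count
as a sanity check. The following is a minimal sketch, not part of LLVM:
``parse_entity2id`` is a hypothetical helper name, and any IDs used with it in
examples are made up rather than taken from a real vocabulary.

.. code-block:: python

   from typing import Dict


   def parse_entity2id(text: str) -> Dict[str, int]:
       """Parse entity output into a mapping of entity string to numeric ID."""
       lines = text.splitlines()
       if not lines:
           raise ValueError("empty entity mapping")
       # The first line holds the total number of entities.
       total = int(lines[0].strip())
       mapping: Dict[str, int] = {}
       for ln in lines[1:]:
           if not ln.strip():
               continue
           # Entity string and numeric ID are separated by a tab.
           name, idx = ln.rsplit("\t", 1)
           mapping[name] = int(idx)
       if len(mapping) != total:
           raise ValueError(
               f"expected {total} entities, found {len(mapping)}"
           )
       return mapping

Checking the count against the number of parsed lines catches truncated or
concatenated files early.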

Embedding Mode Output
~~~~~~~~~~~~~~~~~~~~~

In embedding mode, the output format depends on the specified level:

* **Function Level**: One embedding vector per function
* **Basic Block Level**: One embedding vector per basic block, grouped by function
* **Instruction Level**: One embedding vector per instruction, grouped by basic block and function

Each embedding is represented as a floating-point vector.

EXIT STATUS
-----------

:program:`llvm-ir2vec` returns 0 on success, and a non-zero value on failure.

Common failure cases include:

* Invalid or missing input file
* Missing or invalid vocabulary file (in embedding mode)
* Specified function not found in the module
* Invalid command line options
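
Scripts that call the tool can rely on this convention to detect failures. The
following is a minimal sketch using Python's standard ``subprocess`` module:
``run_checked`` is a hypothetical helper name, and it works for any command
that reports errors through a non-zero exit status.

.. code-block:: python

   import subprocess
   from typing import Sequence


   def run_checked(argv: Sequence[str]) -> str:
       """Run a command and return its stdout, raising on a non-zero status."""
       proc = subprocess.run(argv, capture_output=True, text=True)
       if proc.returncode != 0:
           # Include stderr so the cause (bad input, missing vocabulary, ...)
           # is visible to the caller.
           raise RuntimeError(
               f"{argv[0]} failed with exit status {proc.returncode}: "
               f"{proc.stderr.strip()}"
           )
       return proc.stdout

A call such as ``run_checked(["llvm-ir2vec", "entities", "--mode=llvm"])``
would then return the entity mapping text or raise if the tool fails.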

SEE ALSO
--------

:doc:`../MLGO`

For more information about the IR2Vec algorithm and approach, see:
`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.

For more information about the MIR2Vec algorithm and approach, see:
`RL4ReAl: Reinforcement Learning for Register Allocation <https://doi.org/10.1145/3578360.3580273>`_.