Aug 05, 2024
8 min read

Eptalights: Why We Chose GCC GIMPLE Over LLVM IR for C/C++ Code Analysis

Advocating GCC GIMPLE IR for C/C++ over LLVM IR.

Introduction

Eptalights Research is a startup dedicated to simplifying code analysis by transforming code at its foundational level (bytecode or IR) into a user-friendly seven-instruction IR, while still providing access to the original, lowest-level information for comprehensive analysis.

Currently, we support C/C++ as one of our primary languages. In terms of code analysis for C/C++, we advocate for the use of compiler-based techniques rather than analyzing code at the source level or the abstract syntax tree (AST). For a more detailed explanation, please refer to our blog post here.

When selecting intermediate representations (IR) for compilers, developers frequently face the choice between GCC GIMPLE and LLVM IR. While both options offer distinct advantages, we believe that GCC GIMPLE IR aligns more closely with our philosophy on effective code analysis for C/C++. Despite LLVM IR’s growing popularity and its status as a preferred choice for many, we find GIMPLE IR to be particularly advantageous for our approach.

Here’s why we chose GCC GIMPLE IR over LLVM IR:

(1) Simplicity in GCC GIMPLE-IR

GCC GIMPLE is a simplified intermediate representation (IR) that emphasizes high-level constructs and uses fewer than ten instruction types. This small instruction vocabulary makes the IR far easier to manipulate than one with a wide variety of instruction types.

GCC GIMPLE focuses on high-level constructs, meaning that the IR-generated code closely resembles the original code in a simplified form. Despite some differences, it retains much of the original C/C++ code structure, making it easier to understand and reason about.
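
To give a feel for how small such an instruction vocabulary is, here is a toy model of a reduced, GIMPLE-like IR in Python. The op names and the `Step` shape are illustrative inventions for this sketch, not GCC's actual definitions:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class OpType(Enum):
    """A deliberately small instruction vocabulary, GIMPLE-like in spirit."""
    ASSIGN = auto()
    CALL = auto()
    COND = auto()
    LABEL = auto()
    GOTO = auto()
    RETURN = auto()
    NOP = auto()

@dataclass
class Step:
    op: OpType
    operands: list = field(default_factory=list)

# `x = y + 1; printf("%d", x); return 0;` flattens to three steps:
steps = [
    Step(OpType.ASSIGN, ["x", "y", "+", "1"]),
    Step(OpType.CALL, ["printf", '"%d"', "x"]),
    Step(OpType.RETURN, ["0"]),
]

# An analysis pass only ever has to branch on a handful of op kinds.
op_names = [s.op.name for s in steps]
print(op_names)  # ['ASSIGN', 'CALL', 'RETURN']
```

With so few op kinds, a complete analysis over a function body is a single loop with a short match on `step.op`.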

Using the example code provided below:

//example.c

#include <stdio.h>
struct person
{
   int age;
};

int main()
{
    struct person *personPtr, person1;
    personPtr = &person1;   

    printf("Enter age: ");
    scanf("%d", &personPtr->age);

    printf("Displaying:\n");
    printf("Age: %d\n", personPtr->age);

    return 0;
}

When we generate the GCC GIMPLE IR for the code above using the command gcc -fdump-tree-ssa-raw example.c, the resulting GIMPLE IR dump for the C code is as follows:

main ()
{
  struct person person1;
  struct person * personPtr;
  int D.2339;
  int * _1;
  int _2;
  int _9;

  <bb 2> :
  gimple_assign <addr_expr, personPtr_3, &person1, NULL, NULL>
  gimple_call <printf, NULL, "Enter age: ">
  gimple_assign <addr_expr, _1, &personPtr_3->age, NULL, NULL>
  gimple_call <scanf, NULL, "%d", _1>
  gimple_call <__builtin_puts, NULL, &"Displaying:"[0]>
  gimple_assign <component_ref, _2, personPtr_3->age, NULL, NULL>
  gimple_call <printf, NULL, "Age: %d\n", _2>
  gimple_assign <integer_cst, _9, 0, NULL, NULL>
  gimple_assign <constructor, person1, {CLOBBER}, NULL, NULL>

  <bb 3> :
  gimple_label <<L1>>
  gimple_return <_9>
}

We can see that the generated GCC GIMPLE IR dump is noticeably easier to follow than LLVM's more bytecode-like IR, giving clearer insight into the code's functionality.

Although the GIMPLE IR appears readable in text form, writing GIMPLE IR passes and managing tree-structured nodes in C/C++ can be quite complex. This complexity arises because GCC was designed as a monolithic compiler, a choice reflecting Stallman’s commitment to open-source integrity and concerns about proprietary software (source: lwn.net).

Therefore, working with GIMPLE requires understanding various concepts, such as its tree-like structured nodes and numerous functions, in addition to addressing challenges related to C/C++.

This is where our expertise comes into play. With our in-depth knowledge of GCC GIMPLE IR, our GCC plugin extracts the raw GIMPLE, and our APIs process and transform it into a clean, robust data model, accessible through our simplified Python APIs.

Examples of what you can achieve with our processed Pythonic data model include:

(i) The data model output for the main function of a simple pointer-printing example:

fid = "/example/src/01_print_pointer_value.cc:main"
fn = db.get_function(fid)
for step in fn.steps:
    print(step.op, step.readable())

# output
ASSIGN EGimpleIRAssignModel <step_index=0, lowlevel_steps=[0], dst=c_0 src=expr_type:NO_EXPR, lhs:5 >
ASSIGN EGimpleIRAssignModel <step_index=1, lowlevel_steps=[1], dst=p_4 src=expr_type:NO_EXPR, lhs:&c_0 >
ASSIGN EGimpleIRAssignModel <step_index=2, lowlevel_steps=[2], dst=$T1_1 src=expr_type:NO_EXPR, lhs:p_4*p_4 >
CALL EGimpleIRCallModel <step_index=3, lowlevel_steps=[3], fname=printf, dst=None fargs=[0=""%d"",  1=$T1_1]>
ASSIGN EGimpleIRAssignModel <step_index=4, lowlevel_steps=[4], dst=$T6_6 src=expr_type:NO_EXPR, lhs:0 >
ASSIGN EGimpleIRAssignModel <step_index=5, lowlevel_steps=[5], dst=c_0 src=expr_type:NO_EXPR, lhs: >
NOP EGimpleIRNopModel <step_index=6, lowlevel_steps=[6]>
RETURN EGimpleIRReturnModel <step_index=7, lowlevel_steps=[7], dst=$T6_6>
...[redacted]...

(ii) Working with callsites

"""
Get all `printf` callsites and the steps where they occur.
"""
for callsite in fn.callsite_manager.all(name="printf"):
    print(callsite)
    print(fn.steps[callsite.step_index].readable())

# output
step_index=3 fn_name=['printf'] num_of_args=2 variables_used_as_callsite_arg=['$T1'] variables_defined_here=[] ssa_variables_used_as_callsite_arg=['$T1_1'] ssa_variables_defined_here=[]
EGimpleIRCallModel <step_index=3, lowlevel_steps=[3], fname=printf, dst=None fargs=[0=""%d"",  1=$T1_1]>

(iii) Working with variables, regardless of their complexity.

for step in fn.steps:
    if step.op == models.OpType.ASSIGN:
        if step.dst.get_total_array_index_tokens() > 0:
            print(step.step_index, step.dst.readable())

# output
0 arr_0[0]
1 arr_0[1]
2 arr_0[2]
3 arr_0[3]
17 arr2_0[i_5]

---

step1 = fn.steps[0]
print(step1.variables_defined_here)
print(step1.ssa_variables_defined_here)

# output
['arr']
['arr_0']
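
The `arr` versus `arr_0` split above comes from SSA form, where each assignment gets a fresh version of the variable. A toy renamer illustrates the idea (the numbering scheme here is simplified and not how GCC assigns versions):

```python
from collections import defaultdict

def ssa_rename(assignments):
    """Give each assignment to a variable a fresh SSA version (toy scheme)."""
    counters = defaultdict(int)
    renamed = []
    for var in assignments:
        renamed.append(f"{var}_{counters[var]}")
        counters[var] += 1
    return renamed

# Two writes to `arr` become two distinct SSA names.
print(ssa_rename(["arr", "arr", "i"]))  # ['arr_0', 'arr_1', 'i_0']
```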

For more detailed examples, visit here.

Most importantly, you can develop better rules with the assurance that the original code closely aligns with the data models or the simplified GIMPLE IR. This makes code analysis in C/C++ more intuitive, as you can reason about the code more easily compared to working with an IR that is significantly different from the original code.
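
To make that concrete, here is a toy rule over a mocked-up step list; the `CallStep` shape is invented for this sketch and is not the Eptalights data model. The rule flags `scanf` calls whose format string contains an unbounded `%s`:

```python
import re
from dataclasses import dataclass

@dataclass
class CallStep:
    fname: str
    fargs: list

# Mock steps, standing in for a processed function body.
steps = [
    CallStep("printf", ['"Enter name: "']),
    CallStep("scanf", ['"%s"', "&name"]),    # unbounded read: flag it
    CallStep("scanf", ['"%31s"', "&name"]),  # width-limited: fine
]

def unbounded_scanf(steps):
    """Return indices of scanf calls whose format string has a bare %s."""
    # A bare %s has no width digits between % and s (unlike %31s).
    pattern = re.compile(r"%(?!\d)s")
    return [i for i, s in enumerate(steps)
            if s.fname == "scanf" and pattern.search(s.fargs[0])]

print(unbounded_scanf(steps))  # [1]
```

Because the steps stay close to the original source, the flagged index maps directly back to a recognizable line of C.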

(2) Dominance in the Open Source Ecosystem

GCC is the leading C/C++ compiler, widely used across GNU open-source projects and long the primary compiler for building the Linux kernel. By adopting GCC GIMPLE, you ensure seamless integration with a broad range of existing open-source projects and a consistent, well-supported development environment.

A substantial number of open-source projects are built using GCC, and its reliability and performance have been thoroughly tested and proven over the years. Consequently, GCC remains the most widely used C/C++ compiler today, offering easier integration with minimal overhead or modifications. This makes it our compiler of choice as a startup.

(3) Challenges of Using gcc-python-plugin

The gcc-python-plugin is an experimental proof of concept that enables writing GCC plugins in Python, but it has not been updated for several years. Although Python is renowned for its simplicity and ease of use, the plugin does not deliver those benefits: it mirrors the complexity of GCC's C APIs and is cumbersome to work with. It also supports only up to GCC version 8.x, whereas the Eptalights plugin supports GCC up to version 13.3.0.

Compiler engineer Krister Walfridsson shared his insights on the gcc-python-plugin in his blog post after experimenting with it years ago.

  • Many instructions from tree.def are not fully implemented, so I had to extend the gcc-python-plugin (or, in some cases, return “not implemented”).
  • The plugin crashes when trying to compile some files.

Eptalights currently supports all instructions from tree.def, and thanks to our straightforward and user-friendly APIs, you can work effectively without needing to understand tree.def and can focus on what truly matters.

Using the example from Krister Walfridsson’s blog post:

import gcc
import enchant

spellingdict = enchant.Dict("en_US")

def spellcheck_node(node, loc):
    if isinstance(node, gcc.StringCst):
        words = node.constant.split()
        for word in words:
            if not spellingdict.check(word):
                gcc.warning(
                    loc,
                    f"Possibly misspelt word in string constant: {word}",
                    gcc.Option("-Wall"),
                )

class SpellcheckingPass(gcc.GimplePass):
    def execute(self, fun):
        for bb in fun.cfg.basic_blocks:
            for stmt in bb.gimple:
                stmt.walk_tree(spellcheck_node, stmt.loc)

ps = SpellcheckingPass(name="spellchecker")
ps.register_after("cfg")

We can rewrite it using the Eptalights Python library as follows. This provides a simpler, more understandable version without needing extensive knowledge of GCC terms or internals.

"""
Iterate through all steps/instructions/GIMPLEs
"""
for step in fn.steps:
    """
    Iterate through all arguments used in the step or GIMPLE
    """
    for operand in step.used_tokenized_operands():
        """
        Iterate through all tokens in the operand.
        Check why we tokenize all values/variables in our examples.
        """
        for token in operand.tokens:
            """
            check If token is a constant type
            """
            if token.token_type == models.TokenType.IS_CONSTANT:
                """
                Check spelling here
                """
                words = token.value.split()
                for word in words:
                    if not spellingdict.check(word):
                        print(f"Possibly misspelled word in string constant: {word}")

Check out more examples on our GitHub page here.

Conclusion

While LLVM IR has its merits and is widely adopted for code analysis, GCC GIMPLE IR offers unique advantages that have yet to be fully exploited in this space. Its ability to preserve more of the original C/C++ code structure makes analysis of complex codebases simpler. Additionally, GCC's dominance in the compiler landscape makes it a strong candidate for developers seeking a reliable and intuitive IR for their C/C++ projects.

Thank you for your interest in our technology. We are eager to explore potential collaboration opportunities with you. For further inquiries, please do not hesitate to contact us or follow our updates on social media via X.

Blog Post by our Founder Samuel Asirifi.