Introduction to Eptalights Sophia Platform for Code Analysis

Simplifying Code Analysis

Introduction

Eptalights Research Sophia is a platform that simplifies code analysis by processing code at its foundational level (the bytecode or IR stage) and lifting it into a user-friendly Pythonic model.

Whether you’re a solo developer or a security engineer, part of a growing startup, or a member of a large enterprise team, Eptalights can help you identify potential issues, optimize performance, and adhere to best practices in your code.

We currently support C, C++, PHP, and Java, with more languages in the pipeline. All supported languages are lifted into Sophia-IR, and our APIs are consistent regardless of the programming language, bytecode, or intermediate representation (IR), which makes them easy to work with.

Our Technology

Analyzing C and C++ code has always been a notoriously complex undertaking. The language intricacies, undefined behaviors, and low-level constructs make static analysis anything but straightforward. Our journey began with a mission to tackle this challenge head-on by leveraging existing intermediate representations (IRs) like LLVM IR and GCC’s GIMPLE.

Having prior experience working with GCC and its GIMPLE IR, we initially chose to develop GCC plugins to detect specific code issues. While GIMPLE provided the abstraction we needed, the constant recompilation of C++ plugins quickly became a bottleneck in our development workflow.

In our pursuit of a more efficient and flexible approach, we explored gcc-python-plugin, an impressive attempt to bring Python scripting capabilities into the GCC plugin ecosystem. However, the project was outdated and didn’t deliver the ease-of-use or extensibility we were looking for.

This pain point sparked the creation of Sophia-IR, a lightweight, Pythonic intermediate representation inspired by GCC GIMPLE but tailored to modern development needs. Sophia-IR retains the reduced instruction set and operational semantics of GIMPLE while introducing several key enhancements:

  • Custom IR design: Built from scratch but modeled closely after GIMPLE for familiarity and clarity.
  • Python-first tooling: Enables rapid iteration without recompilation, drastically reducing development/research time.
  • Extensible data annotations: We added computed metadata and analysis hooks to enable deeper insights during static analysis.
  • Multi-language support: While it began with C/C++, Sophia-IR now supports other languages while maintaining a consistent instruction model.

We break code down into Sophia-IR, allowing you to write patterns or rules naturally to find function call sites, see where variables are defined or used, and determine the control flow paths of the code. All of this is done in Python, so you can explore your code without learning a custom query language or rule syntax such as those used by CodeQL or Semgrep. Think of it as the Binary Ninja Python API, which is used for binary analysis, but tailored to programming languages.

In Sophia-IR, everything is a FunctionModel, and instructions or steps within a FunctionModel can fall into one of the 7 OpType models:

  • SophiaIRNopModel: Represents a no-operation (IR) model.
  • SophiaIRAssignModel: Represents an assignment (IR) model.
  • SophiaIRCallModel: Represents a call (IR) model.
  • SophiaIRCondModel: Represents a conditional (IR) model.
  • SophiaIRReturnModel: Represents a return (IR) model.
  • SophiaIRGotoModel: Represents a goto (IR) model.
  • SophiaIRSwitchModel: Represents a switch (IR) model.
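
To make the idea of a small, fixed instruction set concrete, here is a minimal sketch of how an analysis might dispatch on the operation type of each step. The `OpType` enum and `Step` class are hypothetical stand-ins written for this illustration; they are not the Sophia-IR classes themselves.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class OpType(Enum):
    # Hypothetical enum mirroring the seven model names listed above.
    NOP = "nop"
    ASSIGN = "assign"
    CALL = "call"
    COND = "cond"
    RETURN = "return"
    GOTO = "goto"
    SWITCH = "switch"


@dataclass
class Step:
    # Hypothetical stand-in for one instruction inside a function.
    index: int
    op: OpType
    text: str


def count_ops(steps):
    """Count how many steps of each operation type a function contains."""
    return Counter(step.op for step in steps)


# A made-up four-step function used only for illustration.
steps = [
    Step(0, OpType.CALL, "p = malloc(16)"),
    Step(1, OpType.ASSIGN, "n = 16"),
    Step(2, OpType.CALL, "memset(p, 0, n)"),
    Step(3, OpType.RETURN, "return p"),
]

op_counts = count_ops(steps)
call_texts = [step.text for step in steps if step.op is OpType.CALL]
print(op_counts[OpType.CALL], call_texts)
```

Because every instruction falls into one of a handful of categories, an analysis pass only needs one branch per category rather than one per language construct.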

The variables defined or used at each step are precomputed and readily available from the FunctionModel:

print(step1.variables_defined_here)
print(step1.ssa_variables_defined_here)

# output
['arr']
['arr_0']
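
The output above suggests that SSA names are formed by appending a version number to the source-level name. As a rough illustration of that idea, the sketch below renames a straight-line sequence of assignments so that every definition gets a fresh version. The function `ssa_rename` and its input format are assumptions made for this example, and it deliberately ignores branches, which would require phi nodes in a real SSA construction.

```python
from collections import defaultdict


def ssa_rename(assignments):
    """Give every definition of a variable a fresh versioned name.

    `assignments` is a list of (target, operands) pairs in program order.
    Returns a list of (ssa_target, ssa_operands) pairs.
    """
    next_version = defaultdict(int)  # next unused version per variable
    current = {}                     # variable -> latest SSA name
    renamed = []
    for target, operands in assignments:
        # Uses refer to the most recent definition; names never defined
        # here (constants, globals) are passed through unchanged.
        ssa_operands = [current.get(op, op) for op in operands]
        ssa_target = f"{target}_{next_version[target]}"
        next_version[target] += 1
        current[target] = ssa_target
        renamed.append((ssa_target, ssa_operands))
    return renamed


program = [
    ("arr", ["10"]),        # arr = 10
    ("n", ["arr"]),         # n = arr
    ("arr", ["arr", "n"]),  # arr = arr + n
]
renamed = ssa_rename(program)
for ssa_target, ssa_operands in renamed:
    print(ssa_target, "<-", ssa_operands)
```

Having both the plain name and the versioned name available lets a rule ask either "where is arr touched?" or "which exact definition reaches this use?".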

Easy access to call sites and their arguments from the FunctionModel:

for callsite in fn.callsite_manager.all(name="malloc"):
    print(
      callsite.name, 
      callsite.variables_used_as_callsite_arg
    )

Easy access to the Control Flow Graph from the FunctionModel:

for block_index, block_edges in fn.cfg.basicblock_edges.items():
    print(block_index, block_edges)
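
Once the edges between basic blocks are available, many control flow questions reduce to ordinary graph traversal. The sketch below assumes, purely for illustration, that the edge map is a dictionary from a block index to a list of successor indices; the actual shape of `basicblock_edges` may differ.

```python
from collections import deque


def reachable_blocks(edges, entry):
    """Return the basic blocks reachable from `entry`, in BFS order."""
    seen = {entry}
    order = []
    queue = deque([entry])
    while queue:
        block = queue.popleft()
        order.append(block)
        for successor in edges.get(block, []):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return order


# Hypothetical edge map: block 2 branches to 3 or 4, both join at 5.
# Block 6 has no incoming edge, so it can never execute.
edges = {2: [3, 4], 3: [5], 4: [5], 5: [], 6: [5]}
live = reachable_blocks(edges, entry=2)
dead = sorted(set(edges) - set(live))
print(live, dead)
```

The same traversal can be adapted to answer questions such as whether a free() call can be reached before a later use of the same pointer.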

Sophia-IR also helps manage complex variables naturally and effortlessly, similar to their usage in the source code. This includes structures, arrays, and their fields. By representing these variables in a straightforward manner, Sophia-IR simplifies the analysis of code that involves complex data structures.

We believe that the 7 OpTypes are sufficient for performing complex code analysis. For more examples, visit here.

Analysis of C Code

One of the primary difficulties in C/C++ code analysis stems from the extensive use of macros. Macros are defined with #define and are often combined with conditional directives such as #if and #else. The preprocessor expands them before compilation, transforming the source in ways that are not immediately visible.

This makes it difficult for analysis tools to work effectively at the source or Abstract Syntax Tree (AST) level, as the code being analyzed might not reflect the actual code executed.

These macros are used extensively in databases, kernels, servers, and runtimes to support different platform configurations. In practice, analyzing the code for a specific configuration, as it appears in the actual final build, is often more useful than analyzing source that ignores macro effects and may not reflect what gets built.

For instance, consider the following macro:

#define SQUARE(x) ((x) * (x))

This simple macro can generate different code depending on its usage and can be significantly different from the original statement:

int a = SQUARE(5);   // Expands to: int a = ((5) * (5));
int b = SQUARE(a+1); // Expands to: int b = ((a+1) * (a+1));

This transformation complicates source or AST-based code analysis, as the tool needs to understand the macro expansions and their implications. Analyzing these expansions at the source level can be cumbersome and error-prone.
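
To see why expansion is purely textual, here is a toy model of what the preprocessor does with a one-parameter macro such as SQUARE. The helper `expand_macro` is hypothetical and handles only the simplest case (no nested calls, strings, or comments); a real preprocessor is far more involved.

```python
import re


def expand_macro(source, name, param, body):
    """Textually expand a one-parameter function-like macro.

    Toy model of the preprocessor: no nesting, no strings, no comments.
    """
    call = re.compile(rf"\b{re.escape(name)}\(([^()]*)\)")
    param_re = re.compile(rf"\b{re.escape(param)}\b")

    def substitute(match):
        argument = match.group(1).strip()
        # A function replacement inserts the argument text literally.
        return param_re.sub(lambda _: argument, body)

    return call.sub(substitute, source)


line = "int b = SQUARE(a+1);"
safe = expand_macro(line, "SQUARE", "x", "((x) * (x))")
unsafe = expand_macro(line, "SQUARE", "x", "x * x")
print(safe)
print(unsafe)
```

With the parenthesized body the argument is squared as intended, while the unparenthesized variant silently changes the arithmetic because the argument text is pasted in as-is. A tool that looks only at the unexpanded source sees neither form.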

Given the example below, examining it at the source level would be challenging:

#if defined(_WIN32) || defined(_WIN64)
    #define MAX_ARRAY_INDEX 10
#else
    #define MAX_ARRAY_INDEX 5
#endif

char filePath[MAX_ARRAY_INDEX];

This macro-based conditional compilation can alter the code significantly depending on the type of platform. Understanding these variations from the source alone can be complex and error-prone.

By collecting all the information at compile time after configuring everything according to your platform, you can more effectively manage and analyze these variations. This approach enables you to handle platform-specific configurations and macro expansions comprehensively.
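
As a small illustration of why the build configuration matters, the sketch below mirrors the #if/#else block above in Python and checks a single array access under two configurations. The access filePath[7] and the macro sets for each build are hypothetical values chosen for this example.

```python
def max_array_index(defined_macros):
    """Mirror the #if/#else above for a given set of predefined macros."""
    if "_WIN32" in defined_macros or "_WIN64" in defined_macros:
        return 10
    return 5


def check_access(defined_macros, index):
    """Report whether filePath[index] stays inside the array for this build."""
    size = max_array_index(defined_macros)
    return "ok" if 0 <= index < size else "out-of-bounds"


# Hypothetical build configurations and a hypothetical access filePath[7].
builds = {
    "windows": {"_WIN32", "_WIN64"},
    "linux": {"__linux__"},
}
results = {name: check_access(macros, 7) for name, macros in builds.items()}
print(results)
```

The same source line is safe in one build and an out-of-bounds access in the other, which is the kind of difference that only shows up once the configuration has been resolved.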

At Eptalights, this is one of the primary reasons we adopted a compiler-based approach rather than a source-based or AST-based approach. We chose GCC over LLVM, and you can read more about our decision to use GCC GIMPLE instead of LLVM IR here.

Analysis of C++ Code

At Eptalights, we view C++ code analysis differently from existing popular tools like CodeQL and Semgrep. We see C++ as a combination of C, code-generated structs, and code-generated functions.

Using the simple code example below:

#include <vector>

int main() 
{
    std::vector<int> vect;   
    vect.push_back(10); 
    return 0; 
}

After dumping the GIMPLE SSA IR using the GCC command g++ -fdump-tree-ssa-raw example.cpp, you’ll find numerous code-generated functions that support the std::vector type. These functions are typically not relevant to our code analysis unless you’re specifically auditing the C++ standard library. The actual main function, which contains the core logic, is just a few lines long.

Let’s examine the dumped C++ code, with all functions redacted except for the main function:

operator new (size_t D.3408, void * __p){
..[redacted]..
}

std::_Vector_base<int, std::allocator<int> >::_Vector_impl_data::_Vector_impl_data (struct _Vector_impl_data * const this){
..[redacted]..
}

std::_Vector_base<int, std::allocator<int> >::_Vector_impl::~_Vector_impl (struct _Vector_impl * const this) {
..[redacted]..
}

// more generated functions redacted ...

main ()
{
  struct vector vect;
  int D.46500;
  value_type D.41711;
  int _6;

  <bb 2> :
  gimple_call <__ct_comp , NULL, &vect>
  gimple_assign <integer_cst, D.41711, 10, NULL, NULL>
  gimple_call <push_back, NULL, &vect, &D.41711>

  <bb 3> :
  gimple_assign <constructor, D.41711, {CLOBBER}, NULL, NULL>
  gimple_assign <integer_cst, _6, 0, NULL, NULL>
  gimple_call <__dt_comp , NULL, &vect>
  gimple_assign <constructor, vect, {CLOBBER}, NULL, NULL>

  <bb 4> :
gimple_label <<L2>>
  gimple_return <_6>

..[redacted]..
}

Link to the full GCC GIMPLE IR dumped code here.
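
The raw statements in the dump follow a regular shape: a statement kind followed by a comma-separated operand list in angle brackets. As a rough sketch of how mechanical this format is, the snippet below splits such lines into a kind and its operands. The helper `parse_statement` is hypothetical, and its naive comma splitting is an assumption that holds for the lines shown here but not for every operand GCC can print.

```python
import re

STATEMENT = re.compile(r"^\s*(gimple_\w+)\s*<(.*)>\s*$")


def parse_statement(line):
    """Split one raw-dump statement into (kind, operands), or return None."""
    match = STATEMENT.match(line)
    if match is None:
        return None  # block headers, declarations, blank lines
    kind, body = match.groups()
    operands = [operand.strip() for operand in body.split(",")]
    return kind, operands


dump = """
  <bb 2> :
  gimple_call <__ct_comp , NULL, &vect>
  gimple_assign <integer_cst, D.41711, 10, NULL, NULL>
  gimple_call <push_back, NULL, &vect, &D.41711>
"""
statements = [s for s in map(parse_statement, dump.splitlines()) if s]
called = [ops[0] for kind, ops in statements if kind == "gimple_call"]
print(called)
```

Sophia-IR performs this kind of lifting for you and exposes the result as Python objects, so the sketch is only meant to show why an IR with a reduced instruction set is convenient to work with.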

Now, let’s quickly walk through the code generated for the main function:

(1) First, we see that std::vector<int> vect; was transformed into struct vector vect;. With GCC, C++ types such as vectors and maps are not primitive data types; they are templates that the compiler instantiates into struct types, much like structs in C.

struct vector vect;

(2) The values to be pushed into our vect struct must be of value_type type.

value_type D.41711;

(3) Our vect struct is initialized by passing it through a function called __ct_comp.

gimple_call <__ct_comp, NULL, &vect>

(4) Our code vect.push_back(10); pushes the integer value 10 into vect. The compiler first assigns the constant 10 to the temporary D.41711, which has the type value_type.

gimple_assign <integer_cst, D.41711, 10, NULL, NULL>

(5) After assigning 10 to D.41711 in (4), the compiler calls the function push_back to store D.41711 in vect.

gimple_call <push_back, NULL, &vect, &D.41711>

(6) The rest destroys vect and the temporary D.41711: __dt_comp is the destructor call for vect, and the {CLOBBER} assignments mark the end of each variable’s lifetime.

<bb 3> :
  gimple_assign <constructor, D.41711, {CLOBBER}, NULL, NULL>
  gimple_assign <integer_cst, _6, 0, NULL, NULL>
  gimple_call <__dt_comp , NULL, &vect>
  gimple_assign <constructor, vect, {CLOBBER}, NULL, NULL>

  <bb 4> :
gimple_label <<L2>>
  gimple_return <_6>

..[redacted]..

Based on the code above, the pseudocode for the dumped GIMPLE of the main function would be:

struct vector vect
value_type D.41711

__ct_comp(&vect)

D.41711 = 10
push_back(&vect, &D.41711)

_6 = 0
__dt_comp(&vect)

return _6

With the example above, we can use Eptalights to write a simple Python script that finds all values pushed to the vector vect.

# Retrieve all function calls containing 'push_back'
for callsite in fn.callsite_manager.search(name="::push_back"):
    # Get the vector variable name, which is our first argument
    # Code:
    #     gimple_call <push_back, NULL, &vect, &D.41711>
    vector_varname = callsite.variables_used_as_callsite_arg[0]
    if vector_varname != 'vect':
        continue

    # Get the variable and SSA variable name of the second argument,
    # which is the value pushed into our `vect`
    second_arg_varname = callsite.variables_used_as_callsite_arg[1]
    second_arg_ssaname = callsite.ssa_variables_used_as_callsite_arg[1]

    # Get the full SSA variable properties of the second argument
    var = (
        fn.variable_manager
        .get(second_arg_varname)
        .unique_ssa_variables.get(second_arg_ssaname)
    )

    # Print the value that was assigned to the second argument
    for step_index in var.variable_defined_at_steps:
        value_to_be_push_back = fn.steps[step_index].src.lhs.readable()

        print(f"callsite={callsite}")
        print(
            f"callsite={callsite.fn_name}, "
            f"vector_varname={vector_varname}, "
            f"variable_of_value_to_be_push_back={second_arg_varname}, "
            f"value_to_be_push_back={value_to_be_push_back}"
        )

# output 
callsite=step_index=2 fn_name=['std::vector<int, std::allocator<int>>::push_back'] num_of_args=2 variables_used_as_callsite_arg=['vect', 'D'] variables_defined_here=[] ssa_variables_used_as_callsite_arg=['vect_0', 'D.41711'] ssa_variables_defined_here=[]

callsite=['std::vector<int, std::allocator<int>>::push_back'], vector_varname=vect, variable_of_value_to_be_push_back=D, value_to_be_push_back=10

C/C++ code analysis is inherently complex due to the intricacies of the languages and their compilation processes. Macros and preprocessing directives introduce additional layers of difficulty.

By using Eptalights, we simplify the code analysis process. Sophia-IR’s reduced instruction set and straightforward handling of complex variables make it an appealing alternative for static analysis, allowing developers to perform effective C/C++ code analysis with less complexity.

Reasoning about C++ code this way is much easier, even without being a C++ expert, while keeping the focus on finding issues or patterns in code. We have provided more information on C++ code analysis here.

Analysis of PHP Code

At Eptalights, our philosophy around PHP code analysis is simple: analyze what actually executes.

Most existing tools operate at the source code or Abstract Syntax Tree (AST) level, which is a useful approach but one that comes with significant downsides. PHP’s syntax changes regularly across versions, and these changes often break AST-based tools or require constant patching. More importantly, AST representations don’t always align with runtime behavior.

Instead, we analyze PHP bytecode, the compiled representation of your code produced by the Zend Engine. Bytecode changes far less frequently than PHP syntax, and it gives us an accurate picture of what actually runs, even in complex, obfuscated, or encoded environments.

Here’s a simple example to demonstrate:

Given the PHP source code below:

<?php
   echo "Hello from PHP!";
?>

And the Zend bytecode (via phpdbg):

$_main:
     ; (lines=1, args=0, vars=0, tmps=0)
     ; /example.php:2-4
L0001 0000 ECHO string("Hello from PHP!")
L0004 0001 RETURN int(1)
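
Part of the appeal of bytecode is that it is a flat list of opcodes rather than a nested syntax tree. As a rough sketch, the snippet below splits the listing above into structured tuples. The helper `parse_oplines` and its regular expression are assumptions based only on the two opcode lines shown here, not a complete parser for phpdbg output.

```python
import re

# Assumed layout: L<source line> <opcode index> <OPCODE> <operands>
OPLINE = re.compile(r"^L(\d+)\s+(\d+)\s+([A-Z_]+)\s*(.*)$")


def parse_oplines(dump):
    """Extract (source_line, index, opcode, operands) from an opcode listing."""
    ops = []
    for line in dump.splitlines():
        match = OPLINE.match(line.strip())
        if match:  # headers and ';' comment lines do not match
            src_line, index, opcode, operands = match.groups()
            ops.append((int(src_line), int(index), opcode, operands))
    return ops


listing = '''
$_main:
     ; (lines=1, args=0, vars=0, tmps=0)
     ; /example.php:2-4
L0001 0000 ECHO string("Hello from PHP!")
L0004 0001 RETURN int(1)
'''
ops = parse_oplines(listing)
opcodes = [opcode for _, _, opcode, _ in ops]
print(opcodes)
```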

Here is how to decompile it with Eptalights Sophia-IR (Python API):

fn = api.get_function_by_id(fid="/example.php:main#1")
print(fn.decompile())

Decompiled Output:

main  (  )
{
    <bb 0> :
    echo  ( 'Hello from PHP!' );
    return 1;
}

With just a few lines of Python, you gain access to a fully decompiled view of your PHP bytecode that is clean, readable, and accurate to what the Zend Engine will execute. This also unlocks full access to:

  • Function and class metadata
  • Call sites and control flow graphs (CFG)
  • Data flow and variable definitions
  • Obfuscated control structures (e.g. goto)
  • Encoded or protected code paths (e.g. ionCube or Plesk runtime)

Whether you’re building advanced static analysis tools, scanning for security flaws, or reverse-engineering legacy systems, bytecode-level analysis provides a powerful, future-proof foundation.

For more information about our PHP code analysis, visit our PHP blog post.

Conclusion

Key Takeaways

  • (1) Simplified Analysis: By breaking down code into Sophia-IR, we make it easier to work with complex code structures and patterns.
  • (2) Python-Friendly: Leverage Python for writing analysis scripts, making the process more accessible and flexible.
  • (3) Focus on Practical Analysis: Eptalights allows for detailed code analysis without needing to become an expert in C++ or learn new query languages.
  • (4) Comprehensive Support: Access information about defined and used variables, function call sites, control flow graphs, and more.

Thank you for your interest in our technology. We are eager to explore potential collaboration opportunities with you. For further inquiries, please do not hesitate to contact us or follow our updates on social media via X. You can also visit our documentation site for more examples.