Introduction
Eptalights Research is a startup dedicated to simplifying code analysis by transforming and processing code at its foundational level (the bytecode or IR stage) into a user-friendly 7-instruction IR. Our solution also provides access to the original, lowest-level information for comprehensive analysis.
Currently, we support C/C++ as one of our primary languages. In terms of code analysis for C/C++, we advocate for the use of compiler-based techniques rather than analyzing code at the source level or the abstract syntax tree (AST). For a more detailed explanation, please refer to our blog post here.
At Eptalights, we view C++ code analysis differently from existing popular tools like CodeQL and Semgrep. We see C++ as a combination of C, code-generated structs, and code-generated functions. But is that really the case? Let's find out in the next section.
A view of C++ from GCC GIMPLE IR
For C/C++ codebases, we leverage GCC GIMPLE IR, the GCC compiler's intermediate representation and its counterpart to LLVM IR. Our GCC plugin extracts all IR information, transforms it into Python Pydantic models, and stores it in a database. The database can then be queried in a Pythonic way through our Python library, without learning a custom query language.
Let's first see how C++ is represented in GIMPLE when compiled with GCC, using the C++ code below:
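To make the extract, model, store, and query flow more concrete, here is a minimal sketch of the idea. It is not the Eptalights schema or API: FunctionRecord, save, and load are hypothetical names, standard-library dataclasses stand in for Pydantic models, and an in-memory SQLite table stands in for the real database.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class FunctionRecord:
    """Hypothetical record for one extracted function (not the real schema)."""
    fid: str
    name: str
    callees: List[str]


def save(conn: sqlite3.Connection, record: FunctionRecord) -> None:
    """Serialize the record to JSON and store it under its function id."""
    conn.execute(
        "INSERT INTO functions (fid, body) VALUES (?, ?)",
        (record.fid, json.dumps(asdict(record))),
    )


def load(conn: sqlite3.Connection, fid: str) -> FunctionRecord:
    """Fetch a record by id and rebuild the typed object."""
    row = conn.execute(
        "SELECT body FROM functions WHERE fid = ?", (fid,)
    ).fetchone()
    if row is None:
        raise KeyError(fid)
    return FunctionRecord(**json.loads(row[0]))


# Stand-in for the database that the plugin output would be written to.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE functions (fid TEXT PRIMARY KEY, body TEXT)")
save(conn, FunctionRecord("vector_example.cpp:main", "main", ["push_back"]))

fn = load(conn, "vector_example.cpp:main")
print(fn.name, fn.callees)
```

The point of the sketch is only that, once IR facts are typed records, querying them is ordinary Python rather than a separate query language.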
#include <iostream>
#include <vector>
using namespace std;
int main()
{
    vector<int> vect;
    vect.push_back(10);
    return 0;
}
Dumping the GCC GIMPLE IR with gcc --dump-ir vector_example.cpp gives:
main ()
{
struct vector vect; // C++ vector simplified to struct
bool retval.0; // Temporary variable for return value
value_type D.41711; // Temporary variable for value storage
bool retval.0_21;
<bb 2> :
gimple_call <__ct_comp , NULL, &vect> // Constructor call for vector
gimple_assign <integer_cst, D.41711, 10, NULL, NULL> // Assign integer 10 to D.41711
gimple_call <push_back, NULL, &vect, &D.41711> // Push D.41711 into vect
...[redacted]
}
// several redacted generated functions
//
operator new (size_t D.3408, void * __p) ...[redacted]
__gnu_cxx::__normal_iterator<int*, std::vector<int> >::operator++ (struct __normal_iterator * const this) ...[redacted]
__gnu_cxx::__normal_iterator<int*, std::vector<int> >::operator* (const struct __normal_iterator * const this) ...[redacted]
std::_Vector_base<int, std::allocator<int> >::_Vector_impl_data::_Vector_impl_data (struct _Vector_impl_data * const this) ...[redacted]
__gnu_cxx::new_allocator<int>::~new_allocator (struct new_allocator * const this) ...[redacted]
std::allocator<int>::~allocator (struct allocator * const this) ...[redacted]
std::_Vector_base<int, std::allocator<int> >::_Vector_impl::~_Vector_impl (struct _Vector_impl * const this) ...[redacted]
std::_Vector_base<int, std::allocator<int> >::_M_get_Tp_allocator (struct _Vector_base * const this) ...[redacted]
std::move<int&> (int & __t) ...[redacted]
__gnu_cxx::__normal_iterator<int*, std::vector<int> >::__normal_iterator (struct __normal_iterator * const this, int * const & __i) ...[redacted]
__gnu_cxx::__normal_iterator<int*, std::vector<int> >::base (const struct __normal_iterator * const this) ...[redacted]
__gnu_cxx::operator!=<int*, std::vector<int> > (const struct __normal_iterator & __lhs, const struct __normal_iterator & __rhs) ...[redacted]
__gnu_cxx::new_allocator<int>::new_allocator (struct new_allocator * const this) ...[redacted]
std::allocator<int>::allocator (struct allocator * const this) ...[redacted]
std::_Vector_base<int, std::allocator<int> >::_Vector_impl::_Vector_impl (struct _Vector_impl * const this) ...[redacted]
std::_Vector_base<int, std::allocator<int> >::_Vector_base (struct _Vector_base * const this) ...[redacted]
std::vector<int>::vector (struct vector * const this) ...[redacted]
std::forward<int> (type & __t) ...[redacted]
...
Since C++ relies heavily on templates, even this small snippet produces a large amount of IR, most of which is not needed for code analysis. We'll focus on the key parts relevant to the vector operations.
The full dump (linked here) is about 1892 lines of IR for roughly 12 lines of source code.
When analyzing code with Eptalights, much of the boilerplate from the C++ Standard Template Library (STL) is ignored, so the analysis focuses only on project-relevant intermediate representation (IR) code.
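One way to picture this filtering is a simple name-based pass over the function signatures in the dump. The sketch below is only an illustration, not how Eptalights actually filters: the prefix list and the is_library_function and project_functions helpers are assumptions made up for this example.

```python
# Hypothetical prefixes marking compiler- or library-generated functions.
# This list is an assumption for illustration, not Eptalights' real rule set.
LIBRARY_PREFIXES = ("std::", "__gnu_cxx::", "operator new", "operator delete")


def is_library_function(signature: str) -> bool:
    """Return True when a dumped signature looks like STL/runtime boilerplate."""
    return signature.lstrip().startswith(LIBRARY_PREFIXES)


def project_functions(signatures):
    """Keep only the signatures that do not match a library prefix."""
    return [s for s in signatures if not is_library_function(s)]


# A few signatures taken from the dump above.
dumped = [
    "main ()",
    "operator new (size_t D.3408, void * __p)",
    "std::vector<int>::vector (struct vector * const this)",
    "__gnu_cxx::new_allocator<int>::new_allocator (struct new_allocator * const this)",
    "std::move<int&> (int & __t)",
]

print(project_functions(dumped))
```

A prefix check like this is deliberately crude; a real filter would more likely rely on where a function is declared (for example, system headers) than on its name.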
Let’s walk through how typical STL containers and operations are represented in GIMPLE IR.
Take a vector declaration like the one below:
std::vector<int> vect;
When lifted to GIMPLE IR with GCC, it is transformed to:
struct vector vect;
Here, std::vector<int> is simplified to a struct vector in GIMPLE IR.
Next, let's look at how elements are pushed to vectors. The C++ code below:
vect.push_back(10);
is transformed to:
// Assign 10 to temporary variable D.41711
gimple_assign <integer_cst, D.41711, 10, NULL, NULL>
// Call push_back with vect and D.41711
gimple_call <push_back, NULL, &vect, &D.41711>
- Temporary Variable Creation: The integer 10 is stored in a temporary variable D.41711.
- Function Call: The push_back function is called, passing vect (our struct vector) and D.41711 (holding the value 10).
Full Explanation of GIMPLE Steps:
<bb 2> :
gimple_call <__ct_comp, NULL, &vect> // Constructor call for vector
gimple_assign <integer_cst, D.41711, 10, NULL, NULL> // Assign 10 to D.41711
gimple_call <push_back, NULL, &vect, &D.41711> // Call push_back with vect and D.41711
- gimple_call <__ct_comp, NULL, &vect>
  - This initializes the vect structure (constructor call).
- gimple_assign <integer_cst, D.41711, 10>
  - A temporary variable D.41711 is assigned the value 10.
- gimple_call <push_back, NULL, &vect, &D.41711>
  - The push_back operation is invoked with vect and the value stored in D.41711.
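Because each statement in this dump follows the same kind <field, field, ...> shape, it is easy to turn into structured data. The sketch below is a toy parser for exactly the three statements discussed above; GimpleStmt and parse_stmt are hypothetical names, not part of GCC or Eptalights, and the naive comma split would not handle operands that themselves contain commas, such as template arguments.

```python
import re
from dataclasses import dataclass
from typing import List

# Matches lines such as: gimple_call <push_back, NULL, &vect, &D.41711>
_TUPLE_RE = re.compile(r"^\s*(gimple_\w+)\s*<(.*)>")


@dataclass
class GimpleStmt:
    kind: str            # e.g. "gimple_call" or "gimple_assign"
    operands: List[str]  # the comma-separated fields inside <...>


def parse_stmt(line: str) -> GimpleStmt:
    """Parse one statement line, ignoring any trailing // comment."""
    code = line.split("//", 1)[0]
    match = _TUPLE_RE.match(code)
    if match is None:
        raise ValueError(f"not a GIMPLE tuple: {line!r}")
    operands = [field.strip() for field in match.group(2).split(",")]
    return GimpleStmt(kind=match.group(1), operands=operands)


block = [
    "gimple_call <__ct_comp, NULL, &vect> // Constructor call for vector",
    "gimple_assign <integer_cst, D.41711, 10, NULL, NULL>",
    "gimple_call <push_back, NULL, &vect, &D.41711>",
]

stmts = [parse_stmt(line) for line in block]
for stmt in stmts:
    print(stmt.kind, stmt.operands)
```

With the statements in this form, "which value reaches push_back" becomes a lookup: the last operand of the call names the temporary, and the matching gimple_assign holds the constant.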
Analysing C++ code with Eptalights
From Eptalights' point of view, a program is made up of functions, steps (instructions), variables, callsites, control flow graphs, and so on.
We can begin by viewing the pseudo-C code of the function that works with vectors.
fn = api.get_function_by_id(fid="/example/src/05_vector.cpp:main#1")
print(fn.decompile())
# output
"""
main ( )
{
<bb 2> :
std::vector<int, std::allocator<int>>::vector ( &vect );
689541713 = 10;
std::vector<int, std::allocator<int>>::push_back ( &vect, &689541713 );
<bb 3> :
689541713 = R"({)"R"(CLOBBER)"R"(})";
689541714 = 20;
std::vector<int, std::allocator<int>>::push_back ( &vect, &689541714 );
<bb 4> :
689541714 = R"({)"R"(CLOBBER)"R"(})";
689541715 = 30;
std::vector<int, std::allocator<int>>::push_back ( &vect, &689541715 );
...redacted
}
"""
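The <bb N> markers in this listing delimit basic blocks, the nodes of the control flow graph. As a rough sketch of how such a listing could be grouped by block, the snippet below splits a shortened copy of the output on those markers. split_basic_blocks is a hypothetical helper written for this post, not an Eptalights API, and it works on plain text only.

```python
import re
from typing import Dict, List

# Matches a basic-block header line such as "<bb 2> :"
_BB_RE = re.compile(r"^\s*<bb (\d+)>\s*:\s*$")


def split_basic_blocks(listing: str) -> Dict[int, List[str]]:
    """Map each basic-block number to the statements listed under it."""
    blocks: Dict[int, List[str]] = {}
    current = None
    for line in listing.splitlines():
        match = _BB_RE.match(line)
        if match:
            current = int(match.group(1))
            blocks[current] = []
        elif current is not None and line.strip() not in ("", "{", "}"):
            blocks[current].append(line.strip())
    return blocks


# Shortened copy of the decompiled output above.
listing = """
main ( )
{
<bb 2> :
std::vector<int, std::allocator<int>>::vector ( &vect );
689541713 = 10;
<bb 3> :
689541714 = 20;
}
"""

blocks = split_basic_blocks(listing)
print(sorted(blocks))
print(blocks[3])
```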
C++ std::vector is essentially treated as a struct. Let's inspect the type of the vect variable.
fid = "/example/src/05_vector.cpp:main#1"
fn = db.get_function(fid)
print(fn.variable_manager.get('vect').full_declaration)
print(fn.variable_manager.get('vect').type_declaration)
output
struct vector vect
struct vector
To retrieve all values pushed to the vector, we can:
- Find all push_back calls.
- Extract the second argument passed to push_back (the value).
- Track where the value is defined and retrieve its actual content.
# Get all callsites to `::push_back`.
for cs in fn.callsite_manager.search(name='::push_back'):
    # Get the second argument and its SSA version.
    second_arg = cs.variables_used_as_callsite_arg[1]
    second_arg_ssa = cs.ssa_variables_used_as_callsite_arg[1]

    # Get the full SSA variable.
    ssa_variable = (
        fn.variable_manager
        .get(second_arg)
        .get_ssa_var(second_arg_ssa)
    )

    # Visit the steps where it is defined.
    for step_index in ssa_variable.variable_defined_at_steps:
        step = fn.steps[step_index]
        # Print the lhs of the assignment step: `DST = LHS`.
        print(step.src.lhs.decompile())
output
10
20
30
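The same three steps can be mimicked without the Eptalights library on a hand-written list of steps. The sketch below is a toy model of the idea: the Step class and the pushed_values helper are hypothetical, the variable names D.1 to D.3 are made up, and real SSA bookkeeping is considerably richer than a "last definition" dictionary.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    """Toy instruction: either an assignment or a call (hypothetical model)."""
    kind: str                      # "assign" or "call"
    dst: Optional[str] = None      # assigned variable, for "assign"
    value: Optional[str] = None    # assigned constant, for "assign"
    callee: Optional[str] = None   # called function, for "call"
    args: Tuple[str, ...] = ()     # call arguments, for "call"


def pushed_values(steps: List[Step]) -> List[str]:
    """Resolve the second argument of each push_back call to its last definition."""
    last_def = {}   # variable name -> most recently assigned value
    values = []
    for step in steps:
        if step.kind == "assign":
            last_def[step.dst] = step.value
        elif step.kind == "call" and step.callee.endswith("push_back"):
            var = step.args[1].lstrip("&")   # "&D.1" -> "D.1"
            values.append(last_def[var])
    return values


steps = [
    Step("call", callee="vector", args=("&vect",)),
    Step("assign", dst="D.1", value="10"),
    Step("call", callee="push_back", args=("&vect", "&D.1")),
    Step("assign", dst="D.2", value="20"),
    Step("call", callee="push_back", args=("&vect", "&D.2")),
    Step("assign", dst="D.3", value="30"),
    Step("call", callee="push_back", args=("&vect", "&D.3")),
]

print(pushed_values(steps))
```

The toy version only works for straight-line code; tracking definitions per SSA version, as the query above does, is what keeps the lookup well-defined once branches and loops are involved.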
No matter how complex C++ syntax gets, Eptalights offers an easy, Pythonic way to access and analyze your code. This means you can focus on what matters, understanding your code's behavior, without getting bogged down by the intricacies of C++.
Why Use Eptalights for C++?
- Simplified Access: Navigate through functions, variables, and control flows in C++ using intuitive Python commands.
- Intermediate Representation (IR) Focus: Bypass unnecessary STL boilerplate and directly analyze project-relevant IR code.
- Consistent API: Whether working with vectors, maps, or overloaded functions, Eptalights provides a consistent, easy-to-use API.
Check out the Eptalights Documentation for more detailed examples on handling vectors, maps, function overloading, control structures, and more.
Conclusion
Thank you for your interest in our technology. We are eager to explore potential collaboration opportunities with you. For further inquiries, please do not hesitate to contact us or follow our updates on social media via X. You can also visit our documentation site for more examples.
Blog Post by our Founder Samuel Asirifi.