Introduction
Eptalights Research is a startup that simplifies code analysis by converting bytecode or IR into a user-friendly 7-instruction IR, while also providing access to the original low-level information for thorough analysis.
Whether you’re a solo developer, part of a growing startup, or a member of a large enterprise team, Eptalights can help you identify potential issues, optimize performance, and adhere to best practices in your code.
In this blog post, we provide an overview of Eptalights’s technology and what you can expect when working with us. We are starting out with support for C and C++ and will add other languages over time. Our APIs will remain consistent, regardless of the programming language, bytecode, or Intermediate Representation (IR) used.
Our Technology
We start by acknowledging that C/C++ code analysis has always been a challenging task. In this blog post, we explore some of the key difficulties associated with C/C++ code analysis and introduce a simplified approach using a GCC GIMPLE-like Intermediate Representation (IR) called EGIMPLE IR.
What is Eptalights’s EGIMPLE Intermediate Representation (IR)? EGIMPLE IR is heavily influenced by and modeled after GCC GIMPLE, with several customizations and additional precomputed data that make code analysis considerably easier.
We break code down into EGIMPLE, allowing you to naturally write patterns or rules that find function call sites, locate where variables are defined or used, and determine the control flow paths of the code. This is done in Python, a well-known language, so you can explore your code without needing to learn a custom query or rule language such as those used by Semgrep or CodeQL. Think of it as the Binary Ninja Python API, which is used for binary analysis, but tailored for programming languages.
In EGIMPLE, everything is a FunctionModel, and instructions or steps within a FunctionModel can fall into one of the 7 OpType models:
- EGimpleIRNopModel: Represents a no-operation (IR) model.
- EGimpleIRAssignModel: Represents an assignment (IR) model.
- EGimpleIRCallModel: Represents a call (IR) model.
- EGimpleIRCondModel: Represents a conditional (IR) model.
- EGimpleIRReturnModel: Represents a return (IR) model.
- EGimpleIRGotoModel: Represents a goto (IR) model.
- EGimpleIRSwitchModel: Represents a switch (IR) model.
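To illustrate how a fixed set of seven step kinds keeps analysis code small, here is a minimal sketch. The `OpType` enum, the `Step` dataclass, and `count_by_optype` are hypothetical stand-ins written for this example; they are not the Eptalights API, and the real models carry far more information.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class OpType(Enum):
    """Hypothetical stand-in for the seven EGIMPLE step kinds."""
    NOP = "nop"
    ASSIGN = "assign"
    CALL = "call"
    COND = "cond"
    RETURN = "return"
    GOTO = "goto"
    SWITCH = "switch"


@dataclass
class Step:
    """Hypothetical, simplified step: an operation kind plus readable text."""
    op: OpType
    text: str


def count_by_optype(steps):
    """Return how many steps of each OpType appear in a function."""
    return Counter(step.op for step in steps)


# A made-up function body, already lowered to steps.
steps = [
    Step(OpType.ASSIGN, "n = 16"),
    Step(OpType.CALL, "p = malloc(n)"),
    Step(OpType.COND, "if (p == 0)"),
    Step(OpType.CALL, "free(p)"),
    Step(OpType.RETURN, "return 0"),
]
counts = count_by_optype(steps)
print(counts[OpType.CALL], counts[OpType.ASSIGN], counts[OpType.SWITCH])  # 2 1 0
```

Because every step falls into exactly one of seven kinds, a rule usually needs only a handful of branches to cover a whole function.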
The variables defined or used at each step are already computed and readily available from the FunctionModel:
print(step1.variables_defined_here)
print(step1.ssa_variables_defined_here)
# output
['arr']
['arr_0']
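The `arr` and `arr_0` pair above suggests that each definition of a variable receives its own numbered SSA name. The sketch below shows that idea for straight-line code only. The `name_N` numbering scheme and the `ssa_names` helper are assumptions made for illustration; real SSA construction also has to merge versions where control flow joins, which this sketch ignores.

```python
from collections import defaultdict


def ssa_names(definitions):
    """Give every definition of a variable a fresh, numbered SSA name.

    `definitions` lists variable names in the order they are assigned.
    """
    next_version = defaultdict(int)
    names = []
    for var in definitions:
        names.append(f"{var}_{next_version[var]}")
        next_version[var] += 1
    return names


# `arr` is assigned twice, so it gets two distinct SSA names.
print(ssa_names(["arr", "i", "arr"]))  # ['arr_0', 'i_0', 'arr_1']
```

With unique names per definition, asking "which assignment produced this value?" becomes a simple lookup rather than a search.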
Easy access to call sites and their arguments from the FunctionModel:
for callsite in fn.callsite_manager.all(name="malloc"):
    print(
        callsite.name,
        callsite.variables_used_as_callsite_arg
    )
Easy access to the Control Flow Graph from the FunctionModel:
for block_index, block_edges in fn.cfg.basicblock_edges.items():
    print(block_index, block_edges)
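Once the edges are available as plain Python data, standard graph algorithms apply directly. The sketch below assumes that the edge mapping is a dictionary from a block index to a list of successor block indices; that shape is an assumption for illustration, and the `edges` data and `reachable_blocks` helper are made up rather than taken from the Eptalights API.

```python
from collections import deque


def reachable_blocks(edges, entry):
    """Breadth-first search over a {block: [successors]} mapping."""
    seen = {entry}
    queue = deque([entry])
    while queue:
        block = queue.popleft()
        for succ in edges.get(block, []):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return sorted(seen)


# Hypothetical edges: block 2 branches to 3 or 4, both continue to 5;
# block 6 has no incoming edge, so it is unreachable from block 2.
edges = {2: [3, 4], 3: [5], 4: [5], 5: [], 6: [5]}
print(reachable_blocks(edges, entry=2))  # [2, 3, 4, 5]
```

The same traversal can serve as the basis for dead-block detection or for checking whether one call site can reach another.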
EGIMPLE also helps manage complex variables naturally and effortlessly, similar to their usage in the source code. This includes structures, arrays, and their fields. By representing these variables in a straightforward manner, EGIMPLE simplifies the analysis of code that involves complex data structures.
We believe that the 7 OpTypes are sufficient for performing complex code analysis. For more examples, visit here.
The Challenge of Macros and Preprocessing Code
One of the primary difficulties in C/C++ code analysis stems from the extensive use of macros and other preprocessor directives. Macros are defined using #define, and directives such as #if and #else select which code is compiled. The preprocessor resolves all of them before compilation proper, transforming the source in ways that are not immediately visible.
This makes it difficult for analysis tools to work effectively at the source or Abstract Syntax Tree (AST) level, as the code being analyzed might not reflect the actual code executed.
These macros are used extensively in products such as databases, kernels, servers, and runtimes to support different platform configurations. In practice, analyzing the code for a specific configuration, as it appears in the final build, is often more useful than analyzing source whose macro-dependent branches may never be compiled.
For instance, consider the following macro:
#define SQUARE(x) ((x) * (x))
This simple macro generates different code depending on its arguments, and the expanded code can differ significantly from the statement as written:
int a = SQUARE(5); // Expands to: int a = ((5) * (5));
int b = SQUARE(a+1); // Expands to: int b = ((a+1) * (a+1));
This transformation complicates source or AST-based code analysis, as the tool needs to understand the macro expansions and their implications. Analyzing these expansions at the source level can be cumbersome and error-prone.
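To make the purely textual nature of macro expansion concrete, here is a minimal sketch of a one-parameter macro expander. It is a toy model written for this post, not how GCC's preprocessor is implemented, and the `expand_macro` helper is hypothetical. It handles only arguments without nested parentheses.

```python
import re


def expand_macro(source, name, param, body):
    """Textually expand a one-parameter, function-like macro.

    Toy model of the preprocessor: it handles only call arguments
    without nested parentheses.
    """
    call = re.compile(r"\b" + re.escape(name) + r"\(([^()]*)\)")
    param_re = re.compile(r"\b" + re.escape(param) + r"\b")

    def substitute(match):
        arg = match.group(1).strip()
        # Replace every occurrence of the parameter with the raw argument text.
        return param_re.sub(lambda _m: arg, body)

    return call.sub(substitute, source)


line = "int b = SQUARE(a+1);"
print(expand_macro(line, "SQUARE", "x", "((x) * (x))"))  # int b = ((a+1) * (a+1));
print(expand_macro(line, "SQUARE", "x", "x * x"))        # int b = a+1 * a+1;
```

The second call shows why the parentheses in the macro body matter: without them, the expanded expression means `a + (1 * a) + 1`, which a tool looking only at the unexpanded source would never see.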
Given the example below, examining it at the source level would be challenging:
#if defined(_WIN32) || defined(_WIN64)
#define MAX_ARRAY_INDEX 10
#else
#define MAX_ARRAY_INDEX 5
#endif
char filePath[MAX_ARRAY_INDEX];
This macro-based conditional compilation can alter the code significantly depending on the type of platform. Understanding these variations from the source alone can be complex and error-prone.
By collecting all the information at compile time after configuring everything according to your platform, you can more effectively manage and analyze these variations. This approach enables you to handle platform-specific configurations and macro expansions comprehensively.
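The effect of configuration on the final code can be sketched with a toy preprocessor. The `preprocess` function below is hypothetical and supports only the directive shapes used in the example above (`#if defined(A) || defined(B)`, `#else`, `#endif`, and object-like `#define`). A real analysis would rely on the compiler itself rather than a reimplementation like this.

```python
import re


def preprocess(lines, predefined):
    """Resolve a tiny subset of the C preprocessor for one configuration."""
    macros = {}
    output = []
    stack = []  # one (parent_active, condition) pair per open #if
    active = True
    for line in lines:
        text = line.strip()
        if text.startswith("#if"):
            names = re.findall(r"defined\((\w+)\)", text)
            condition = any(name in predefined for name in names)
            stack.append((active, condition))
            active = active and condition
        elif text.startswith("#else"):
            parent_active, condition = stack[-1]
            active = parent_active and not condition
        elif text.startswith("#endif"):
            parent_active, _ = stack.pop()
            active = parent_active
        elif not active:
            continue  # line belongs to a branch that is not compiled
        elif text.startswith("#define"):
            _, name, value = text.split(None, 2)
            macros[name] = value
        else:
            for name, value in macros.items():
                pattern = r"\b" + re.escape(name) + r"\b"
                text = re.sub(pattern, lambda _m, v=value: v, text)
            output.append(text)
    return output


SOURCE = """\
#if defined(_WIN32) || defined(_WIN64)
#define MAX_ARRAY_INDEX 10
#else
#define MAX_ARRAY_INDEX 5
#endif
char filePath[MAX_ARRAY_INDEX];
""".splitlines()

print(preprocess(SOURCE, {"_WIN32"}))  # ['char filePath[10];']
print(preprocess(SOURCE, set()))       # ['char filePath[5];']
```

The same source line yields two different array sizes, so any bounds check written against the source text alone has to guess which one applies.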
At Eptalights, this is one of the primary reasons we adopted a compiler-based approach rather than a source-based or AST-based approach. We chose GCC over LLVM, and you can read more about our decision to use GCC GIMPLE instead of LLVM IR here.
Analysis of C++ Code
At Eptalights, we view C++ code analysis differently from existing popular tools like CodeQL and Semgrep. We see C++ as a combination of C, code-generated structs, and code-generated functions.
Using the simple code example below:
#include <vector>
int main()
{
    std::vector<int> vect;
    vect.push_back(10);
    return 0;
}
After dumping the GIMPLE SSA IR using the GCC command g++ -fdump-tree-ssa-raw example.cpp, you’ll find numerous code-generated functions that support the use of the std::vector type. These functions are typically not relevant to our code analysis unless you’re specifically auditing the C++ std library. The actual main function, which contains the core logic, is just a few lines long.
Let’s examine the dumped C++ code, with all functions redacted except for the main function:
operator new (size_t D.3408, void * __p){
..[redacted]..
}
std::_Vector_base<int, std::allocator<int> >::_Vector_impl_data::_Vector_impl_data (struct _Vector_impl_data * const this){
..[redacted]..
}
std::_Vector_base<int, std::allocator<int> >::_Vector_impl::~_Vector_impl (struct _Vector_impl * const this) {
..[redacted]..
}
// more generated functions redacted ...
main ()
{
struct vector vect;
int D.46500;
value_type D.41711;
int _6;
<bb 2> :
gimple_call <__ct_comp , NULL, &vect>
gimple_assign <integer_cst, D.41711, 10, NULL, NULL>
gimple_call <push_back, NULL, &vect, &D.41711>
<bb 3> :
gimple_assign <constructor, D.41711, {CLOBBER}, NULL, NULL>
gimple_assign <integer_cst, _6, 0, NULL, NULL>
gimple_call <__dt_comp , NULL, &vect>
gimple_assign <constructor, vect, {CLOBBER}, NULL, NULL>
<bb 4> :
gimple_label <<L2>>
gimple_return <_6>
..[redacted]..
}
Link to the full GCC GIMPLE IR dumped code here.
Now, let’s quickly walk through the code generated for the main function:
(1) First, we see that std::vector<int> vect; was transformed into struct vector vect;. With the GCC compiler for C++, types like vectors and maps are not primitive data types; they are class templates that the compiler instantiates into struct types, much like structs in C.
struct vector vect;
(2) The values to be pushed into our vect struct must be of the value_type type.
value_type D.41711;
(3) Our vect struct is initialized by passing it to its constructor, which appears in the dump as a function called __ct_comp.
gimple_call <__ct_comp, NULL, &vect>
(4) From our code vect.push_back(10);, we are pushing the integer value 10 into vect. Because push_back takes its argument by reference, the compiler first stores the value 10 in the temporary D.41711, which has the value_type data type.
gimple_assign <integer_cst, D.41711, 10, NULL, NULL>
(5) After assigning 10 to D.41711 in (4), the compiler calls the function push_back to store D.41711 in vect.
gimple_call <push_back, NULL, &vect, &D.41711>
(6) The rest is cleanup: vect is destroyed through its destructor, __dt_comp, and the CLOBBER assignments mark the end of the lifetimes of vect and D.41711.
<bb 3> :
gimple_assign <constructor, D.41711, {CLOBBER}, NULL, NULL>
gimple_assign <integer_cst, _6, 0, NULL, NULL>
gimple_call <__dt_comp , NULL, &vect>
gimple_assign <constructor, vect, {CLOBBER}, NULL, NULL>
<bb 4> :
gimple_label <<L2>>
gimple_return <_6>
..[redacted]..
Based on the walkthrough above, the pseudocode for the dumped GIMPLE code of the main function would be:
struct vector vect
value_type D.41711
__ct_comp(&vect)
D.41711 = 10
push_back(&vect, &D.41711)
_6 = 0
__dt_comp(&vect)
return _6
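The translation from raw GIMPLE tuples to the pseudocode above is mechanical, which the following sketch illustrates. The field layout it relies on (function name, then left-hand side, then arguments for `gimple_call`; operation code, then destination, then operand for `gimple_assign`) is inferred only from the dump shown in this post. The `to_pseudocode` helper is hypothetical, covers just these tuple shapes, and splits naively on commas, so it should not be read as a general GIMPLE parser.

```python
import re

TUPLE_RE = re.compile(r"^(gimple_\w+)\s*<(.*)>$")


def to_pseudocode(line):
    """Translate one raw GIMPLE tuple line into readable pseudocode.

    Returns None for lines this sketch does not handle (labels,
    clobbers, basic-block headers).
    """
    match = TUPLE_RE.match(line.strip())
    if not match:
        return None
    kind, body = match.groups()
    fields = [field.strip() for field in body.split(",")]
    if kind == "gimple_call":
        fn_name, lhs, *args = fields
        call = f"{fn_name}({', '.join(args)})"
        return call if lhs == "NULL" else f"{lhs} = {call}"
    if kind == "gimple_assign" and fields[0] == "integer_cst":
        return f"{fields[1]} = {fields[2]}"
    if kind == "gimple_return":
        return f"return {fields[0]}"
    return None


DUMP = """\
<bb 2> :
gimple_call <__ct_comp , NULL, &vect>
gimple_assign <integer_cst, D.41711, 10, NULL, NULL>
gimple_call <push_back, NULL, &vect, &D.41711>
<bb 3> :
gimple_assign <constructor, D.41711, {CLOBBER}, NULL, NULL>
gimple_assign <integer_cst, _6, 0, NULL, NULL>
gimple_call <__dt_comp , NULL, &vect>
gimple_assign <constructor, vect, {CLOBBER}, NULL, NULL>
<bb 4> :
gimple_label <<L2>>
gimple_return <_6>
""".splitlines()

pseudocode = [p for p in map(to_pseudocode, DUMP) if p is not None]
print("\n".join(pseudocode))
```

Running it prints the six executable statements of the pseudocode, from `__ct_comp(&vect)` through `return _6`; the two declarations come from the variable list at the top of the dump, which this sketch does not read.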
With the example above, we can use Eptalights to write a simple Python script that finds all values pushed to the vector vect.
"""
Retrieve all function calls containing 'push_back'
"""
for callsite in fn.callsite_manager.search(name="push_back"):
    # Get the vector variable name, which is our first argument.
    # Code: gimple_call <push_back, NULL, &vect, &D.41711>
    vector_varname = callsite.variables_used_as_callsite_arg[0]
    if vector_varname != 'vect':
        continue
    # Get the variable and SSA variable name of the second argument,
    # which is the value pushed into our `vect`.
    # Code: gimple_call <push_back, NULL, &vect, &D.41711>
    second_arg_varname = callsite.variables_used_as_callsite_arg[1]
    second_arg_ssaname = callsite.ssa_variables_used_as_callsite_arg[1]
    # Get the full SSA variable properties of the second argument.
    var = (
        fn.variable_manager
        .get(second_arg_varname)
        .unique_ssa_variables.get(second_arg_ssaname)
    )
    # Print the value that was assigned to the second argument.
    for step_index in var.variable_defined_at_steps:
        value_to_be_push_back = fn.steps[step_index].src.lhs.readable()
        print(f"callsite={callsite}")
        print(
            f"callsite={callsite.fn_name}, "
            f"vector_varname={vector_varname}, "
            f"variable_of_value_to_be_push_back={second_arg_varname}, "
            f"value_to_be_push_back={value_to_be_push_back}"
        )
# output
callsite=step_index=2 fn_name=['std::vector<int, std::allocator<int>>::push_back'] num_of_args=2 variables_used_as_callsite_arg=['vect', 'D'] variables_defined_here=[] ssa_variables_used_as_callsite_arg=['vect_0', 'D.41711'] ssa_variables_defined_here=[]
callsite=['std::vector<int, std::allocator<int>>::push_back'], vector_varname=vect, variable_of_value_to_be_push_back=D, value_to_be_push_back=10
Reasoning about C++ code this way makes analysis much easier to handle, even if you are not a C++ expert, while keeping the focus on finding issues or patterns in the code. We will be providing more information on C++ code analysis with Eptalights. For more inquiries, feel free to contact us, or follow us on X for updates.
Conclusion
C/C++ code analysis is inherently complex due to the intricacies of the languages and their compilation processes. Macros and preprocessing directives introduce additional layers of difficulty.
Eptalights simplifies the code analysis process. EGIMPLE’s reduced instruction set and straightforward handling of complex variables make it an appealing alternative for static analysis, allowing developers to perform effective C/C++ code analysis with less complexity.
Key Takeaways
- (1) Simplified Analysis: By breaking down code into EGIMPLE, we make it easier to work with complex code structures and patterns.
- (2) Python-Friendly: Leverage Python for writing analysis scripts, making the process more accessible and flexible.
- (3) Focus on Practical Analysis: Eptalights allows for detailed code analysis without needing to become an expert in C++ or learn new query languages.
- (4) Comprehensive Support: Access information about defined and used variables, function call sites, control flow graphs, and more.
Thank you for your interest in our technology. We are eager to explore potential collaboration opportunities with you. For further inquiries, please do not hesitate to contact us or follow our updates on social media via X.
Blog Post by our Founder Samuel Asirifi.