Is C++ Just C with Structs and Functions, or Is It More?

Introduction

Eptalights Research is a startup dedicated to simplifying code analysis by processing code at its foundational level (the bytecode or IR stage) and lifting it into a user-friendly Pythonic model. Our solution also provides access to the original, lowest-level information for comprehensive analysis.

Currently, we support C/C++, Java, and PHP as our primary languages. For C/C++ code analysis, we advocate compiler-based techniques rather than analyzing code at the source level or at the abstract syntax tree (AST) level. For a more detailed explanation, please refer to our blog post here.

At Eptalights, we view C++ code analysis differently from existing popular tools like CodeQL and Semgrep. We see C++ as a combination of C, code-generated structs, and code-generated functions. But is that really the case? Let’s find out in the next section.

A view of C++ from GCC GIMPLE IR

For C/C++ codebases, we leverage GCC GIMPLE IR, an intermediate representation used by the GCC compiler and GCC's counterpart to LLVM IR. Our GCC plugin extracts all of the IR information, transforms it into Python Pydantic models, and stores it in a database. The database can then be queried with our Python library in a Pythonic way, without learning a custom query language.

Let’s first see how C++ is represented in GIMPLE when compiled with GCC, using the C++ code below:

#include <iostream> 
#include <vector> 
using namespace std; 
  
int main() 
{ 
    vector<int> vect; 
    vect.push_back(10); 
    return 0; 
}

Dumping the GCC GIMPLE IR with gcc --dump-ir vector_example.cpp gives the following:

main ()
{
  struct vector vect;          // C++ vector simplified to struct
  bool retval.0;               // Temporary variable for return value
  value_type D.41711;          // Temporary variable for value storage
  bool retval.0_21;           

  <bb 2> :
  gimple_call <__ct_comp , NULL, &vect>             // Constructor call for vector
  gimple_assign <integer_cst, D.41711, 10, NULL, NULL>  // Assign integer 10 to D.41711
  gimple_call <push_back, NULL, &vect, &D.41711>        // Push D.41711 into vect

  ...[redacted]
}

// several redacted generated functions
//
operator new (size_t D.3408, void * __p) ...[redacted]
__gnu_cxx::__normal_iterator<int*, std::vector<int> >::operator++ (struct __normal_iterator * const this) ...[redacted]
__gnu_cxx::__normal_iterator<int*, std::vector<int> >::operator* (const struct __normal_iterator * const this) ...[redacted]
std::_Vector_base<int, std::allocator<int> >::_Vector_impl_data::_Vector_impl_data (struct _Vector_impl_data * const this) ...[redacted]
__gnu_cxx::new_allocator<int>::~new_allocator (struct new_allocator * const this) ...[redacted]
std::allocator<int>::~allocator (struct allocator * const this) ...[redacted]
std::_Vector_base<int, std::allocator<int> >::_Vector_impl::~_Vector_impl (struct _Vector_impl * const this) ...[redacted]
std::_Vector_base<int, std::allocator<int> >::_M_get_Tp_allocator (struct _Vector_base * const this) ...[redacted]
std::move<int&> (int & __t) ...[redacted]
__gnu_cxx::__normal_iterator<int*, std::vector<int> >::__normal_iterator (struct __normal_iterator * const this, int * const & __i) ...[redacted]
__gnu_cxx::__normal_iterator<int*, std::vector<int> >::base (const struct __normal_iterator * const this) ...[redacted]
__gnu_cxx::operator!=<int*, std::vector<int> > (const struct __normal_iterator & __lhs, const struct __normal_iterator & __rhs) ...[redacted]
__gnu_cxx::new_allocator<int>::new_allocator (struct new_allocator * const this) ...[redacted]
std::allocator<int>::allocator (struct allocator * const this) ...[redacted]
std::_Vector_base<int, std::allocator<int> >::_Vector_impl::_Vector_impl (struct _Vector_impl * const this) ...[redacted]
std::_Vector_base<int, std::allocator<int> >::_Vector_base (struct _Vector_base * const this) ...[redacted]
std::vector<int>::vector (struct vector * const this) ...[redacted]
std::forward<int> (type & __t) ...[redacted]
...

Since C++ relies heavily on templates, even this small snippet produces a large amount of IR, most of which is not needed for code analysis. We’ll focus on the key parts relevant to the vector operations.

The full dump, linked here, is about 1892 lines of IR for just about 12 lines of source code.

When analyzing code with Eptalights, much of the boilerplate from the C++ Standard Template Library (STL) is ignored, so the analysis focuses only on project-relevant Intermediate Representation (IR) code. Filtering out unnecessary standard library details streamlines the analysis.
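To make the filtering idea concrete, here is a minimal Python sketch that separates project functions from library-generated ones by name. The post does not describe how Eptalights actually decides what to ignore, so the prefix list and the helper names (LIBRARY_PREFIXES, is_library_function, project_functions) are assumptions made purely for illustration. The function names come from the dump above.

```python
# Hypothetical filter: the prefixes below are assumptions for illustration,
# not the rules Eptalights really applies.
LIBRARY_PREFIXES = ("std::", "__gnu_cxx::", "operator new", "operator delete")


def is_library_function(qualified_name):
    """Return True when the name looks like standard-library boilerplate."""
    return qualified_name.startswith(LIBRARY_PREFIXES)


def project_functions(names):
    """Keep only the functions that do not match a library prefix."""
    return [name for name in names if not is_library_function(name)]


# A few function names taken from the GIMPLE dump above.
dumped = [
    "main",
    "operator new",
    "__gnu_cxx::__normal_iterator<int*, std::vector<int> >::operator++",
    "std::allocator<int>::~allocator",
    "std::vector<int>::vector",
    "std::move<int&>",
]

print(project_functions(dumped))  # ['main']
```

A name-based filter like this is deliberately crude. A real tool would more likely rely on where each function was declared (for example, system headers versus project files), which is information the compiler has but this sketch does not.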

Let’s walk through how typical STL containers and operations are represented in GIMPLE IR.

Take a vector declaration like the one below:

std::vector<int> vect;

When lowered to GIMPLE IR by GCC, it becomes:

struct vector vect;

Here, the std::vector<int> is simplified to a struct vector in GIMPLE IR.

Let’s also look at how elements are pushed to vectors. The C++ code below:

vect.push_back(10);

is transformed to:

// Assign 10 to temporary variable D.41711
gimple_assign <integer_cst, D.41711, 10, NULL, NULL>

// Call push_back with vect and D.41711
gimple_call <push_back, NULL, &vect, &D.41711>

Temporary Variable Creation:
The integer 10 is stored in a temporary variable D.41711.
Function Call:
The push_back function is called, passing vect (our struct vector) and D.41711 (holding the value 10).

Full Explanation of GIMPLE Steps:

<bb 2> :
gimple_call <__ct_comp, NULL, &vect>               // Constructor call for vector
gimple_assign <integer_cst, D.41711, 10, NULL, NULL>  // Assign 10 to D.41711
gimple_call <push_back, NULL, &vect, &D.41711>        // Call push_back with vect and D.41711

gimple_call <__ct_comp, NULL, &vect>
- This initializes the vect structure (constructor call).
gimple_assign <integer_cst, D.41711, 10>
- A temporary variable D.41711 is assigned the value 10.
gimple_call <push_back, NULL, &vect, &D.41711>
- The push_back operation is invoked with vect and the value stored in D.41711.
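The raw GIMPLE statements above follow a regular shape: a statement kind followed by a comma-separated tuple in angle brackets. As a rough illustration of why this form is convenient for tooling, the sketch below parses the three statements of this basic block into small Python objects. It is not the Eptalights plugin or its Pydantic models. The Call and Assign classes and parse_statement are hypothetical names, and the comma split is naive: it would break on operands that contain commas themselves.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Matches lines such as: gimple_call <push_back, NULL, &vect, &D.41711>
STATEMENT = re.compile(r"^\s*(gimple_\w+)\s*<(.*)>\s*$")


@dataclass
class Call:
    function: str
    result: Optional[str]  # variable receiving the return value, or None
    args: List[str]


@dataclass
class Assign:
    code: str  # first field as printed, e.g. integer_cst
    target: str
    operands: List[str]


def parse_statement(line):
    """Parse one raw GIMPLE line into a Call or Assign (naive sketch)."""
    line = line.split("//", 1)[0]  # drop the trailing comment, if any
    match = STATEMENT.match(line)
    if match is None:
        raise ValueError("not a raw GIMPLE statement: " + line.strip())
    kind, body = match.groups()
    # Naive split: breaks on operands that themselves contain commas.
    fields = [field.strip() for field in body.split(",")]
    if kind == "gimple_call":
        result = None if fields[1] == "NULL" else fields[1]
        return Call(fields[0], result, fields[2:])
    if kind == "gimple_assign":
        operands = [field for field in fields[2:] if field != "NULL"]
        return Assign(fields[0], fields[1], operands)
    raise ValueError("unsupported statement kind: " + kind)


# The three statements of <bb 2>, copied from the listing above.
block = [
    "gimple_call <__ct_comp, NULL, &vect>               // Constructor call for vector",
    "gimple_assign <integer_cst, D.41711, 10, NULL, NULL>  // Assign 10 to D.41711",
    "gimple_call <push_back, NULL, &vect, &D.41711>        // Call push_back",
]
for statement in map(parse_statement, block):
    print(statement)
```

Once the statements are plain objects, the member call vect.push_back(10) is just a Call whose first argument is the address of a struct, which is the "C with structs and functions" view this post argues for.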

Analysing C++ code with Eptalights

From Eptalights’ point of view, everything is made up of functions, steps (instructions), variables, callsites, control flow graphs, and so on.

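We can begin by viewing the pseudo-C code of the function that works with the vector. The function shown here comes from a slightly larger example, 05_vector.cpp, which pushes three values (10, 20, and 30) instead of one.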


fn = api.get_function_by_id(fid="/example/src/05_vector.cpp:main#1")
print(fn.decompile())

# output
"""
main  (  )
{

        <bb 2> :
        std::vector<int, std::allocator<int>>::vector  ( &vect );
        689541713 = 10;
        std::vector<int, std::allocator<int>>::push_back  ( &vect, &689541713 );

        <bb 3> :
        689541713 = R"({)"R"(CLOBBER)"R"(})";
        689541714 = 20;
        std::vector<int, std::allocator<int>>::push_back  ( &vect, &689541714 );

        <bb 4> :
        689541714 = R"({)"R"(CLOBBER)"R"(})";
        689541715 = 30;
        std::vector<int, std::allocator<int>>::push_back  ( &vect, &689541715 );

        ...redacted
}
"""

C++ std::vector is essentially treated as a struct. Let’s inspect the type of the vect variable.

fid = "/example/src/05_vector.cpp:main#1"
fn = db.get_function(fid)

print(fn.variable_manager.get('vect').full_declaration)
print(fn.variable_manager.get('vect').type_declaration)

Output:

struct vector vect
struct vector

To retrieve all values pushed to the vector, we can:

Find all push_back calls.
Extract the second argument passed to push_back (the value).
Track where the value is defined and retrieve its actual content.

# Get all callsites to `::push_back`.
for cs in fn.callsite_manager.search(name='::push_back'):
    # Get the second argument and its SSA version.
    second_arg = cs.variables_used_as_callsite_arg[1]
    second_arg_ssa = cs.ssa_variables_used_as_callsite_arg[1]

    # Get the full SSA variable. The parentheses let the chained
    # calls span several lines.
    ssa_variable = (
        fn.variable_manager
        .get(second_arg)
        .get_ssa_var(second_arg_ssa)
    )

    # Visit every step where the SSA variable is defined.
    for step_index in ssa_variable.variable_defined_at_steps:
        step = fn.steps[step_index]

        # Print the LHS of the assignment step (`DST = LHS`).
        print(step.src.lhs.decompile())

Output:

10
20
30
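The three steps listed earlier (find the push_back calls, take the second argument, trace it back to its definition) do not depend on anything C++-specific. The sketch below runs the same idea on a hand-built list of steps, without the Eptalights API. The Step class, the pushed_values function, and the temporaries D.1 to D.3 are made up for illustration. It also tracks only the most recent definition seen so far, which is correct for straight-line code but is a simplification of real SSA-based tracking.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Step:
    kind: str                      # "assign" or "call"
    target: Optional[str] = None   # variable written by an assign step
    value: Optional[str] = None    # constant stored by an assign step
    callee: Optional[str] = None   # function name for a call step
    args: Tuple[str, ...] = ()     # call arguments as printed


# Hand-built stand-in for the decompiled main(); D.1..D.3 are hypothetical.
steps = [
    Step("call", callee="std::vector<int>::vector", args=("&vect",)),
    Step("assign", target="D.1", value="10"),
    Step("call", callee="std::vector<int>::push_back", args=("&vect", "&D.1")),
    Step("assign", target="D.2", value="20"),
    Step("call", callee="std::vector<int>::push_back", args=("&vect", "&D.2")),
    Step("assign", target="D.3", value="30"),
    Step("call", callee="std::vector<int>::push_back", args=("&vect", "&D.3")),
]


def pushed_values(steps):
    """Return the constants passed as the second argument of push_back."""
    latest = {}   # most recent constant assigned to each variable
    values = []
    for step in steps:
        if step.kind == "assign":
            latest[step.target] = step.value
        elif step.kind == "call" and step.callee and step.callee.endswith("::push_back"):
            # The value is passed by address, so strip the leading '&'.
            second_arg = step.args[1].lstrip("&")
            values.append(latest.get(second_arg))
    return values


print(pushed_values(steps))  # ['10', '20', '30']
```

Because the constructor and push_back are ordinary calls that take the address of a struct, no knowledge of templates or member-function syntax is needed to follow the data.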

No matter how complex C++ syntax gets, Eptalights offers an easy, Pythonic way to access and analyze your code. You can focus on what matters, understanding your code’s behavior, without getting bogged down by the intricacies of C++.

Why Use Eptalights for C++?

Simplified Access:
Navigate through functions, variables, and control flows in C++ using intuitive Python commands.
Intermediate Representation (IR) Focus:
Bypass unnecessary STL boilerplate and directly analyze project-relevant IR code.
Consistent API:
Whether working with vectors, maps, or overloaded functions, Eptalights provides a consistent, easy-to-use API.

Check out the Eptalights Documentation for more detailed examples on handling vectors, maps, function overloading, control structures, and more.

Conclusion

Thank you for your interest in our technology. We are eager to explore potential collaboration opportunities with you. For further inquiries, please do not hesitate to contact us or follow our updates on social media via X. You can also visit our documentation site for more examples.