Introduction
Eptalights Research is a startup dedicated to simplifying code analysis by transforming and processing code at its foundational level (bytecode or IR stage) and lifting it into a user-friendly Pythonic model. Our solution also provides access to the original, lowest-level information for comprehensive analysis.
Currently, we support C, C++, and PHP as our primary languages.
At Eptalights, our approach to PHP code analysis is fundamentally different from existing tools, which typically focus on the Abstract Syntax Tree (AST) or source level.
In contrast, we analyze PHP at the bytecode level. The sections that follow explain why we chose this approach.
Comparing PHP Source-Level (AST) and Bytecode-Level Analysis
PHP, like any other language, changes its syntax from version to version, so code analysis scripts and tools written at the source or AST level require constant updates.
This often shifts the research focus from actual analysis to simply keeping up with syntax changes. Additionally, the AST does not perfectly reflect what is executed; it is the bytecode that represents the true execution path.
Bytecode, on the other hand, changes far less frequently, and this is where Eptalights comes in.
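The gap between the AST and what is executed can be illustrated with a rough analogy in Python rather than PHP. This is only a sketch of the general point using the standard library; it does not use Eptalights tooling and says nothing about how the Zend compiler behaves. The tree keeps the multiplication exactly as written, while the compiled code object carries the already folded constant.

```python
import ast

SOURCE = "x = 2 * 3"

# Source-level view: the AST keeps the multiplication exactly as written.
tree = ast.parse(SOURCE)
assigned_value = tree.body[0].value
ast_sees_multiplication = isinstance(assigned_value, ast.BinOp)

# Bytecode-level view: CPython folds the constant expression at compile time,
# so the code object stores the result (6) as a constant.
code = compile(SOURCE, "<example>", "exec")
bytecode_has_folded_constant = 6 in code.co_consts

print(ast_sees_multiplication, bytecode_has_folded_constant)
```

An analysis that only walks the tree reasons about the expression the author typed; one that reads the compiled form reasons about what the runtime will actually load.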
For PHP codebases, we leverage our PHP Bytecode Extractor to extract all Zend bytecode information from specified PHP source files into JSON.
This JSON is then lifted and transformed into our Pythonic data model.
Even though we provide low-level access to the bytecode, our Lifter transforms it into a Pythonic data model, allowing you to express your analysis naturally without worrying about syntax or AST intricacies.
Let’s first see how the simple Hello World page below is transformed into Eptalights’ Pythonic data model:
<!DOCTYPE html>
<html>
<body>
<h1>My PHP Website</h1>
<?php
echo "Hello World!";
?>
</body>
</html>
After disassembling this using phpdbg and inspecting the Zend bytecode instructions, we get the following:
$_main:
; (lines=4, args=0, vars=0, tmps=0)
; /extractor/abc.php:1-10
L0001 0000 ECHO string("<!DOCTYPE html>
<html>
<body>
<h1>My PHP Website</h1>
")
L0006 0001 ECHO string("Hello World!")
L0008 0002 ECHO string("</body>
</html>
")
L0010 0003 RETURN int(1)
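A listing like the one above is regular enough to be read by a script. The following is a minimal sketch, not the Eptalights extractor: the `Instruction` class and `parse_listing` function are hypothetical names introduced here, the line layout is inferred only from the dumps shown in this post, and string operands that span several lines are not reassembled.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# One single-line instruction: "L<source line> <index> [result =] OPCODE operands".
INSTRUCTION = re.compile(
    r"^L(?P<line>\d+)\s+(?P<index>\d+)\s+"
    r"(?:(?P<result>\S+)\s+=\s+)?"
    r"(?P<opcode>[A-Z][A-Z0-9_]*)\s*(?P<operands>.*)$"
)


@dataclass
class Instruction:
    source_line: int
    index: int
    opcode: str
    operands: str
    result: Optional[str] = None


def parse_listing(text: str) -> List[Instruction]:
    """Parse single-line instructions; lines that do not match are skipped."""
    parsed = []
    for raw in text.splitlines():
        match = INSTRUCTION.match(raw.strip())
        if match is None:
            continue  # headers, ';' comments, continuation lines of long strings
        parsed.append(
            Instruction(
                source_line=int(match.group("line")),
                index=int(match.group("index")),
                opcode=match.group("opcode"),
                operands=match.group("operands"),
                result=match.group("result"),
            )
        )
    return parsed


listing = """\
$_main:
; (lines=4, args=0, vars=0, tmps=0)
L0006 0001 ECHO string("Hello World!")
L0010 0003 RETURN int(1)
"""

for ins in parse_listing(listing):
    print(ins.index, ins.opcode, ins.operands)
```

Keeping the optional `result =` part in its own group means the same pattern handles both plain instructions such as `JMP 0019` and assignments such as `T1 = FETCH_OBJ_R ...`.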
And when lifted with Eptalights’ Cloud Lifter, we obtain a clean, Pythonic data model.
We can then print the decompiled pseudo-C to view a high-level representation of the code while still having full access to call sites, variables (including their definitions and usages), control flow graphs (CFG), and more.
fn = api.get_function_by_id(fid="/01_helloworld.php:main#1")
print(fn.decompile())
# output
"""
main ( )
{
<bb 0> :
echo ( '<!DOCTYPE html>\n<html>\n<body>\n <h1>My PHP Website</h1>\n ' );
echo ( 'Hello World!' );
echo ( '</body>\n</html>\n' );
return 1;
}
"""
Analysing Obfuscated PHP
Some PHP projects use obfuscation tools to conceal their true functionality.
Below is a code snippet from CloudPanel, a free software tool for configuring and managing servers.
As we can see, it makes extensive use of goto statements to obscure the program’s control flow.
Additionally, all strings are encoded, rendering them unreadable to the human eye.
<?php
namespace App;
use Symfony\Bundle\FrameworkBundle\Kernel\MicroKernelTrait;
use Symfony\Component\DependencyInjection\Loader\Configurator\ContainerConfigurator;
use Symfony\Component\HttpKernel\Kernel as BaseKernel;
use Symfony\Component\Routing\Loader\Configurator\RoutingConfigurator;
class Kernel extends BaseKernel
{
use MicroKernelTrait;
protected function configureContainer(
ContainerConfigurator $container
): void {
goto A4243;
d61c8:
F2ccd:
goto C97c8;
dc8bf:
goto F2ccd;
goto Cee12;
Fcd67:
$container->import(
"\x2e\56\x2f\143\x6f\156\146\151\147\x2f\x7b\160\x61\143\x6b\141\x67\x65\163\x7d\x2f" .
$this->environment .
"\x2f\52\x2e\171\141\x6d\x6c"
);
goto f9b9a;
Ab043:
$container->import(
"\56\56\x2f\143\157\x6e\x66\x69\147\57\x7b\x73\x65\162\166\151\x63\x65\x73\x7d\x5f" .
$this->environment .
"\x2e\171\x61\x6d\x6c"
);
goto d61c8;
A4243:
$container->import(
"\x2e\x2e\x2f\x63\x6f\x6e\146\151\x67\x2f\173\x70\141\x63\153\141\147\x65\163\x7d\x2f\x2a\x2e\x79\x61\155\154"
);
goto Fcd67;
a4649:
$container->import(
"\56\56\57\x63\x6f\x6e\x66\x69\x67\57\163\145\x72\x76\151\143\x65\163\56\171\x61\155\x6c"
);
goto Ab043;
f9b9a:
if (
is_file(
\dirname(__DIR__) .
"\x2f\x63\157\156\146\151\147\57\163\x65\162\x76\151\143\x65\163\56\x79\x61\155\154"
)
) {
goto A3922;
}
goto C7490;
Cee12:
A3922:
goto a4649;
C7490:
$container->import(
"\x2e\x2e\57\x63\x6f\156\x66\151\x67\57\x7b\163\x65\x72\166\151\x63\x65\x73\x7d\56\160\x68\x70"
);
goto dc8bf;
C97c8:
}
}
But when we examine the code at the bytecode level using phpdbg, it becomes readable!
We discover that the strings are fully accessible once the code is compiled into bytecode by the PHP Zend engine.
This means the only viable way to analyze such PHP code is at the bytecode level.
$_main:
; (lines=2, args=0, vars=0, tmps=0)
; /extractor/abc.php:1-64
L0007 0000 DECLARE_CLASS string("app\\kernel") string("symfony\\component\\httpkernel\\kernel")
L0064 0001 RETURN int(1)
user class: App\Kernel
1 methods: configureContainer
App\Kernel::configureContainer:
; (lines=43, args=1, vars=1, tmps=14)
; /extractor/abc.php:10-63
L0010 0000 CV0($container) = RECV 1
L0013 0001 JMP 0019
L0016 0002 JMP 0042
L0018 0003 JMP 0002
L0019 0004 JMP 0037
L0021 0005 INIT_METHOD_CALL 1 CV0($container) string("import")
L0023 0006 T1 = FETCH_OBJ_R THIS string("environment")
L0023 0007 T2 = CONCAT string("../config/{packages}/") T1
L0024 0008 T3 = CONCAT T2 string("/*.yaml")
L0024 0009 SEND_VAL_EX T3 1
L0021 0010 DO_FCALL
L0026 0011 JMP 0027
L0028 0012 INIT_METHOD_CALL 1 CV0($container) string("import")
L0030 0013 T5 = FETCH_OBJ_R THIS string("environment")
L0030 0014 T6 = CONCAT string("../config/{services}_") T5
L0031 0015 T7 = CONCAT T6 string(".yaml")
L0031 0016 SEND_VAL_EX T7 1
L0028 0017 DO_FCALL
L0033 0018 JMP 0002
L0035 0019 INIT_METHOD_CALL 1 CV0($container) string("import")
L0036 0020 SEND_VAL_EX string("../config/{packages}/*.yaml") 1
L0035 0021 DO_FCALL
L0038 0022 JMP 0005
L0040 0023 INIT_METHOD_CALL 1 CV0($container) string("import")
L0041 0024 SEND_VAL_EX string("../config/services.yaml") 1
L0040 0025 DO_FCALL
L0043 0026 JMP 0012
L0046 0027 INIT_NS_FCALL_BY_NAME 1 string("App\\is_file")
L0047 0028 INIT_FCALL 1 96 string("dirname")
L0047 0029 SEND_VAL string("/extractor") 1
L0047 0030 V11 = DO_ICALL
L0048 0031 T12 = CONCAT V11 string("/config/services.yaml")
L0048 0032 SEND_VAL_EX T12 1
L0046 0033 V13 = DO_FCALL
L0048 0034 JMPZ V13 0036
L0051 0035 JMP 0037
L0053 0036 JMP 0038
L0056 0037 JMP 0023
L0058 0038 INIT_METHOD_CALL 1 CV0($container) string("import")
L0059 0039 SEND_VAL_EX string("../config/{services}.php") 1
L0058 0040 DO_FCALL
L0061 0041 JMP 0003
L0063 0042 RETURN null
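The strings in the listing are readable because the hex and octal escape sequences in a double-quoted PHP string are resolved when the file is compiled. The sketch below mimics that one step in Python to show what the obfuscator is relying on. The helper name `decode_php_escapes` is made up for this illustration, it handles only the `\xHH` and octal forms seen in the snippet, and it does not try to cope with escaped backslashes.

```python
import re

# PHP double-quoted strings accept \xHH (1-2 hex digits) and \NNN (1-3 octal digits).
ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{1,2})|\\([0-7]{1,3})")


def decode_php_escapes(literal: str) -> str:
    """Resolve hex and octal escapes only; everything else is left untouched."""

    def replace(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        # Octal values above one byte wrap around, so keep only the low byte.
        return chr(int(match.group(2), 8) & 0xFF)

    return ESCAPE.sub(replace, literal)


# First operand of the import() call under the Fcd67 label in the obfuscated source.
obfuscated = (
    r"\x2e\56\x2f\143\x6f\156\146\151\147\x2f"
    r"\x7b\160\x61\143\x6b\141\x67\x65\163\x7d\x2f"
)
print(decode_php_escapes(obfuscated))  # ../config/{packages}/
```

The decoded value matches the operand of the first `CONCAT` instruction in the listing, which is why no separate deobfuscation pass is needed once the bytecode is available.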
After extracting and lifting the PHP bytecode, we can print out the decompiled version of the obfuscated code above.
"""
get file metadata - name of classes, functions, class properties in a file
"""
file_metadata = api.get_file_metadata_by_filepath(filepath="/09_magic_constant3.php")
"""
get file data - get the functions models of the names listed in the metadata
"""
file_data = api.get_file_data_by_metadata(file_metadata)
"""
print out pseudo-c representation of the lifted bytecode
"""
print(file_data.decompile())
# output
"""
// Class Properties and Constants
class App\Kernel {
}
App\Kernel :: configureContainer ( [unnamed] $container )
{
<bb 0> :
nop;
goto 'BB_6';
<bb 1> :
goto 'BB_13';
<bb 2> :
goto 'BB_1';
<bb 3> :
goto 'BB_13';
<bb 4> :
T1 = $this->environment;
T2 = '../config/{packages}/' + T1;
T3 = T2 + '/*.yaml';
$container->import ( T3 );
goto 'BB_8';
<bb 5> :
T5 = $this->environment;
T6 = '../config/{services}_' + T5;
T7 = T6 + '.yaml';
$container->import ( T7 );
goto 'BB_1';
<bb 6> :
$container->import ( '../config/{packages}/*.yaml' );
goto 'BB_4';
<bb 7> :
$container->import ( '../config/services.yaml' );
goto 'BB_5';
<bb 8> :
@11 = dirname ( '/src' );
T12 = @11 + '/config/services.yaml';
@13 = App\is_file ( T12 );
if ( @13 )
goto <bb 9>;
else
goto <bb 10>;
<bb 9> :
goto 'BB_11';
<bb 10> :
goto 'BB_12';
<bb 11> :
goto 'BB_7';
<bb 12> :
$container->import ( '../config/{services}.php' );
goto 'BB_2';
<bb 13> :
return 'null';
}
main ( )
{
<bb 0> :
class ( 'app\\kernel', 'symfony\\component\\httpkernel\\kernel' );
return 1;
}
"""
Now that our strings are readable, we can programmatically navigate the goto statements, something that isn’t feasible with source or AST-based analysis.
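One way to navigate those jumps is to collapse the blocks that contain nothing but a goto. The sketch below does this for the pseudo-C dump of `configureContainer`, using a block table transcribed by hand from that dump. It is plain Python and does not use the Eptalights API; `GOTO_ONLY` and `resolve` are names chosen for this example.

```python
from typing import Dict

# Blocks from the pseudo-C dump that hold only an unconditional goto,
# written as {block number: target block number}.
GOTO_ONLY: Dict[int, int] = {1: 13, 2: 1, 3: 13, 9: 11, 10: 12, 11: 7}


def resolve(block: int, goto_only: Dict[int, int]) -> int:
    """Follow goto-only blocks until a block that does real work is reached."""
    seen = set()
    while block in goto_only and block not in seen:
        seen.add(block)  # guards against a cycle of empty blocks
        block = goto_only[block]
    return block


# Where do the two branches of the is_file() check in <bb 8> really lead?
print(resolve(9, GOTO_ONLY), resolve(10, GOTO_ONLY))  # 7 12
```

With the trampolines removed, the check reads naturally: when the file exists, control reaches the block that imports `../config/services.yaml`; otherwise it reaches the block that imports `../config/{services}.php`.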
Dealing with Encoded PHP Code (ionCube, etc.)
Much enterprise PHP source code is encoded to resemble the example below.
The file typically begins with a loader, followed by the encoded segment which, in most cases, is not recoverable.
Encoded PHP code can only be executed using its corresponding loader module or, in special cases such as the example below from the Plesk server management software, a custom PHP runtime.
The obscured content that follows the loader is in fact the portion that gets executed.
<?php
die("The file {$_SERVER['SCRIPT_FILENAME']} is part of Plesk distribution. It cannot be run outside of Plesk environment.\n");
__sw_loader_pragma__('PLESK_18_0_59');
?>
²0 $®äûrs“7Õîõ\òå<µYúzR®µªó˜I…—5ßâÚœÓ!瘛”ÿ·Üxy·ò¯8¢HÆ¡çiX¢NN5zM=ºöÝ’…ðÙ½¤la]ÓB”©zÐþç ƒ�]1£I9BÛ3Rô+sì{6kd÷´¾¥ö|œnÒ«”�#pó$…™*r:#o<¼ÀýQd¦*’Ñ…Zè´U äž©~BÞç¡òFÄψ“lîv#’LýÄ:¬“’ Q(FtÇ£ê
ó_AéŠè÷Î]â$…’í³D¦ËøÄw׈Ņ¾º»µ¨>ÖVKµã¯)RýaC»uv††”h'ÁîðÐÞ}ðMºÍ¹¼L¶%±gY°j»öätÊÆ#Ô‘æÐÒB²áÑH³ã6“žÀÆë€òh÷Ï^~ýÍ-Ç&=3û«-�i c*±Kv/ÿ÷¡b'AìÛFþl—q]Ò2USóê$û»Â¾îÙbyJ š´æ2ŠäÌuËTI)½þ6ŸÈ±±HM…«t£SÖåqÀK\”láúápîµf´´´í|£YX\�ZПTJI~s1Á¨¢,å¼àaòfLÿ*¨×(ŸŽTF“AI-±Ç'˜ÐrºÌµ÷Ñ%µìpš-ˆéhJrÑéèiT]‚<!ž¬ä@‡ˆÔü
With obscured code like the above, the only option is to work at the bytecode level. After extracting the bytecode and lifting it, we can print out the decompiled code.
Remember that everything you see in the pseudo-C dump can easily be accessed from the Python API.
"""
get file metadata - name of classes, functions, class properties in a file
"""
file_metadata = api.get_file_metadata_by_filepath(filepath="/09_magic_constant3.php")
"""
get file data - get the functions models of the names listed in the metadata
"""
file_data = api.get_file_data_by_metadata(file_metadata)
"""
print out pseudo-c representation of the lifted bytecode
"""
print(file_data.decompile())
# output
"""
main ( )
{
<bb 0> :
T3 = $$fetch_class_constant ( 'Session', 'IS_UNDEFINED' );
@4 = get ( );
@5 = @4->auth ( );
@6 = @5->getUser ( );
@7 = @6->getType ( );
T8 = @7 != T3;
if ( T8 )
goto <bb 1>;
else
goto <bb 2>;
<bb 1> :
@9 = get ( 'webApplication' );
$app = @9;
$app->destroySession ( );
go_to_self ( );
<bb 2> :
$page = '/login_up.php';
@14 = getUrlFromRequest ( 'success_redirect_url', 'false', 'false' );
$successRedirectUrl = @14;
T15 = $successRedirectUrl;
if ( T15 )
goto <bb 3>;
else
goto <bb 4>;
<bb 3> :
@16 = urlencode ( $successRedirectUrl );
T17 = '?success_redirect_url=' + @16;
$page = $page + T17;
<bb 4> :
go_to ( $page );
return 1;
}
"""
This blog post focuses on showcasing what’s possible with PHP code analysis at the bytecode level, but variables, call sites, control flow graphs (CFG), data flow, and more can all be easily accessed using our Python library.
Further details are available in our documentation.
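To give a feel for what listing call sites means, the sketch below scrapes them from the pseudo-C text with a regular expression. This is a deliberately naive stand-in and not the Eptalights Python library, whose real interface is described in the documentation. The `call_sites` helper is hypothetical, and it would be confused by string arguments that themselves contain something resembling a call.

```python
import re
from typing import List, Optional, Tuple

# Optional "$var->" or "@tmp->" receiver, then a callee name, then "(".
CALL = re.compile(
    r"(?:(?P<receiver>[$@]\w+)->)?(?P<callee>[A-Za-z_][\w\\]*)\s*\("
)


def call_sites(pseudo_c: str) -> List[Tuple[Optional[str], str]]:
    """Return (receiver, callee) pairs, one per statement that makes a call."""
    sites = []
    for line in pseudo_c.splitlines():
        statement = line.strip()
        if not statement.endswith(";"):
            continue  # skips function headers, block labels and `if ( ... )` lines
        match = CALL.search(statement)
        if match:
            sites.append((match.group("receiver"), match.group("callee")))
    return sites


# A few statements copied from the decompiled output above.
dump = """
@4 = get ( );
@5 = @4->auth ( );
if ( T8 )
goto <bb 1>;
$app->destroySession ( );
@16 = urlencode ( $successRedirectUrl );
go_to ( $page );
return 1;
"""

for receiver, callee in call_sites(dump):
    print(receiver, callee)
```

A structured model avoids this kind of text matching entirely, which is the point of lifting the bytecode instead of parsing its printed form.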
Conclusion
Thank you for your interest in our technology. We are eager to explore potential collaboration opportunities with you. For further inquiries, please do not hesitate to contact us or follow our updates on social media via X. You can also visit our documentation site for more examples.
Blog Post by our Founder Samuel Asirifi.