How the Zend Engine Works: Opcodes and Op Arrays

The Zend Engine executes a script by walking it through the following steps:

1.	The script is run through a lexical analyzer (often called a lexer) to convert the human-readable code into machine-digestible tokens. These tokens are then passed to the parser.
2.	The parser parses the stream of tokens passed to it from the lexer and generates an instruction set (or intermediate code) that runs on the Zend Engine. The Zend Engine is a virtual machine that takes assembly-style, three-address instruction code and executes it. Many parsers generate an abstract syntax tree or parse tree that can then be manipulated or optimized before being passed to the code generator. The Zend Engine parser combines these steps into one and generates intermediate code directly from the tokens passed to it from the lexer. From the point of view of someone authoring PHP extensions or embedding PHP into applications, this functionality is wrapped into a single phase: compilation. Compilation takes the location of a script and returns intermediate code for it. This intermediate code is (more or less) machine-independent code that one can think of as "assembler code" for the Zend virtual machine. This intermediate code is an ordered array (an op array, short for operations array) of instructions (known as opcodes, short for operation codes) that are basically three-address code: two operands for the inputs, a third operand for the result, plus the handler that will process the operands. The operands are either constants (representing static values) or an offset to a temporary variable, which is effectively a register in the Zend virtual machine. In the simplest case, an opcode performs a basic operation on its two input operands and stores the result in a register pointed at by the result operand. In a more complex case, opcodes can also implement flow control, resetting the position in the op array for looping and conditionals.
3.	After the intermediate code is generated, it is passed to the executor. The executor steps through the op array, executing each opcode (each quad of handler, two input operands, and result operand) in turn.
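
To get a feel for what the lexer in step 1 produces, you can ask PHP for its own token stream. The sketch below assumes the tokenizer extension is loaded (it usually is) and uses its token_get_all() and token_name() functions; the lex_names() helper is a hypothetical name introduced only for this illustration, and the output shows tokens, not the engine's intermediate code.

```php
<?php
// Hypothetical helper: return the names of the tokens the lexer produces
// for a snippet of PHP source, skipping whitespace for readability.
function lex_names($source)
{
    $names = [];
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Array tokens are [id, text, line]; token_name() maps the id to a name.
            if ($token[0] === T_WHITESPACE) {
                continue;
            }
            $names[] = token_name($token[0]);
        } else {
            // Single-character tokens such as "=" and ";" come back as plain strings.
            $names[] = $token;
        }
    }
    return $names;
}

$names = lex_names("<?php \$hi = 'hello'; echo \$hi;");
echo implode(' ', $names), "\n";
```

The parser consumes a stream like this one; what it emits from it is the op array described in step 2.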

What Is a Virtual Machine?

The Zend Engine is a virtual machine (VM), which means it is a software program that simulates a physical computer. In a language such as Java, the VM architecture provides portability, allowing you to move compiled bytecode from one machine to another. The Zend Engine has no native support for precompiled programs, but the VM architecture still gives PHP flexibility.

In contrast to the 75 base operations on an x86 series processor (what most likely drives your computer), the Zend Engine implements approximately 150 base instructions (called opcodes in Zend terminology). This instruction set includes not only typical VM instructions such as logical and mathematical operations, but also complex instructions, such as calling include() (a single Zend Engine instruction) and printing a string (also a single instruction).

A VM is always slower than the physical machine it runs on, so extra speed is gained by performing complex instructions as a single VM operation. This approach is generally called a Complex Instruction Set Computer (CISC) architecture, in contrast to a Reduced Instruction Set Computer (RISC) architecture, which uses a small set of simple instructions and relies on being able to execute them extremely quickly.

These compilation and execution phases are handled by two separate functions in the Zend Engine: zend_compile and zend_execute. These are both implemented internally as function pointers, which means that you can write an extension that overloads either of these steps with custom code at runtime. (We will explore the why and how of this later in this chapter.)

Here is a representation of the intermediate code for the following simple script:

<?php
  $hi = 'hello';
  echo $hi;
?>

opnum     line                           opcode         op1        op2      result
    0        2                     ZEND_FETCH_W        "hi"                     '0
    1        2                      ZEND_ASSIGN          '0    "hello"          '1
    2        3                     ZEND_FETCH_R        "hi"                     '2
    3        3                        ZEND_ECHO          '2
    4        5                      ZEND_RETURN           1

Note

The intermediate code dumps in this chapter were all generated with a tool called op_dumper. op_dumper is fully developed as an example in Chapter 23, "Writing SAPIs and Extending the Zend Engine." VLD, developed by Derick Rethans and available at http://www.derickrethans.nl/vld.php, provides similar functionality.

Here's what is going on in this script:

opcode 0 First, you assign Register 0 to be a pointer to the variable named $hi. You use the ZEND_FETCH_W op because you need to assign to the variable (W is for "write").
opcode 1 Here the ZEND_ASSIGN handler assigns the value hello to Register 0 (the pointer to $hi). The result of the assignment is also stored in Register 1, but it is never used. Register 1 would be utilized if the assignment were being used in an expression like this:
```
if($hi = 'hello'){}
```
opcode 2 Here you re-fetch the value of $hi, now into Register 2. You use the op ZEND_FETCH_R because the variable is used in a read-only context.
opcode 3 ZEND_ECHO prints the value of Register 2 (or, more accurately, sends it to the output buffering system). echo and the closely related print are language constructs built in to PHP itself, as opposed to functions that need to be called.
opcode 4 ZEND_RETURN is called, setting the return value of the script to 1. Even though return is not explicitly called in the script, every script contains an implicit return 1, which is executed if the script completes without return being explicitly called.
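
To make this walk-through concrete, here is a toy executor written in PHP itself. It is only a sketch: the real executor is C code working on engine-internal structures, and the run_ops() name, the op layout, and the ['const', value] / ['reg', n] operand encoding are all assumptions made for illustration. Each op is a quad of handler name, two input operands, and a result operand.

```php
<?php
// Toy model of an op array executor. Operands are null, ['const', value],
// or ['reg', n]; this encoding is an assumption made for the sketch.
function run_ops(array $ops)
{
    $vars = [];   // symbol table: variable name => value
    $regs = [];   // temporary registers
    $out  = '';   // collected output, standing in for the output buffer

    // Resolve an input operand to a plain value.
    $read = function ($operand) use (&$regs) {
        if ($operand === null) {
            return null;
        }
        return $operand[0] === 'const' ? $operand[1] : $regs[$operand[1]];
    };

    foreach ($ops as $op) {
        list($handler, $op1, $op2, $result) = $op;
        switch ($handler) {
            case 'FETCH_W':
                // The register holds a "pointer": here, just the variable's name.
                $regs[$result[1]] = $read($op1);
                break;
            case 'ASSIGN':
                $vars[$read($op1)] = $read($op2);
                // The assigned value is also the result of the expression.
                $regs[$result[1]] = $read($op2);
                break;
            case 'FETCH_R':
                $regs[$result[1]] = $vars[$read($op1)];
                break;
            case 'ECHO':
                $out .= $read($op1);
                break;
            case 'RETURN':
                return ['return' => $read($op1), 'output' => $out];
        }
    }
    return ['return' => null, 'output' => $out];
}

$r = function ($n) { return ['reg', $n]; };
$c = function ($v) { return ['const', $v]; };

// The five ops from the walk-through; the assignment's own result
// goes to Register 1, as described for opcode 1.
$ops = [
    ['FETCH_W', $c('hi'), null,        $r(0)],
    ['ASSIGN',  $r(0),    $c('hello'), $r(1)],
    ['FETCH_R', $c('hi'), null,        $r(2)],
    ['ECHO',    $r(2),    null,        null],
    ['RETURN',  $c(1),    null,        null],
];

$state = run_ops($ops);
echo $state['output'], "\n";
```

Storing the variable's name in the register is a deliberate simplification of "a pointer to the variable"; it keeps the sketch short while preserving the fetch, assign, fetch, echo sequence.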

Here is a more complex example:

<?php
  $hi = 'hello';
  echo strtoupper($hi);
?>

The intermediate code dump looks similar:

     opnum    line                           opcode          op1         op2     result
         0       2                     ZEND_FETCH_W         "hi"                     '0
         1       2                      ZEND_ASSIGN           '0     "hello"         '1
         2       3                     ZEND_FETCH_R         "hi"                     '2
         3       3                    ZEND_SEND_VAR         '2
         4       3                    ZEND_DO_FCALL "strtoupper"                     '3
         5       3                        ZEND_ECHO         '3
         6       5                      ZEND_RETURN          1

Notice the differences between this dump and the previous one:

opcode 3 The ZEND_SEND_VAR op pushes a pointer to Register 2 (the variable $hi) onto the argument stack. This argument stack is how the called function receives its arguments. Because the function called here is an internal function (implemented in C and not in PHP), its operation is completely hidden from PHP. Later you will see how a userspace function receives arguments.
opcode 4 The ZEND_DO_FCALL op calls the function strtoupper and indicates that Register 3 is where its return value should be set.

Here is an example of a trivial PHP script that implements conditional flow control:

<?php
$i = 0;
while($i < 5) {
  $i++;
}
?>

opnum   line                         opcode        op1        op2    result
    0      2                   ZEND_FETCH_W        "i"                   '0
    1      2                    ZEND_ASSIGN         '0          0        '0
    2      3                   ZEND_FETCH_R        "i"                   '2
    3      3                ZEND_IS_SMALLER         '2          5        '3
    4      3                      ZEND_JMPZ         $3
    5      4                  ZEND_FETCH_RW        "i"                   '4
    6      4                  ZEND_POST_INC         '4                   '5
    7      4                      ZEND_FREE         $5
    8      5                       ZEND_JMP
    9      7                    ZEND_RETURN          1

Note here that you have a ZEND_JMPZ op to set a conditional branch point (it jumps past the end of the loop when the condition is false, that is, when $i is greater than or equal to 5) and a ZEND_JMP op to bring you back to the top of the loop to reevaluate the condition at the end of each iteration.

Observe the following in these examples:

Six registers are allocated and used in this code, even though only two registers are ever used at any one time. Register reuse is not implemented in PHP. For large scripts, thousands of registers may be allocated.
No real optimization is performed on the code. This postincrement:
```
$i++;
```
could be optimized to a pre-increment:
```
++$i;
```
because it is used in a void context (that is, it is not used in an expression where the former value of $i needs to be stored). This would save you from having to stash its value in a register.
The targets of the jump ops are not displayed in the dump. This is really the fault of the assembly dumper. The Zend Engine leaves ops used for some internal purposes marked as unused.
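
Because the dump does not show where the jump ops go, it can help to see the same loop with explicit targets. The following is a toy executor with an instruction pointer, not engine code. The targets (ZEND_JMPZ to op 9, ZEND_JMP back to op 2) are inferred from the description of the loop rather than read from the engine, and the run_loop() name and the op layout are hypothetical.

```php
<?php
// Toy executor with an explicit instruction pointer. Each op is
// [handler, op1, op2, result]; the layout is an assumption for this sketch.
function run_loop(array $ops)
{
    $vars = [];  // variable name => value
    $regs = [];  // temporary registers
    $ip   = 0;   // current position in the op array

    while (true) {
        list($handler, $op1, $op2, $result) = $ops[$ip];
        $next = $ip + 1;  // default: fall through to the following op
        switch ($handler) {
            case 'FETCH_W':
            case 'FETCH_RW':
                $regs[$result] = $op1;  // register "points" at the variable by name
                break;
            case 'ASSIGN':
                $vars[$regs[$op1]] = $op2;
                $regs[$result] = $op2;
                break;
            case 'FETCH_R':
                $regs[$result] = $vars[$op1];
                break;
            case 'IS_SMALLER':
                $regs[$result] = $regs[$op1] < $op2;
                break;
            case 'JMPZ':
                if (!$regs[$op1]) {
                    $next = $op2;  // condition false: leave the loop
                }
                break;
            case 'POST_INC':
                // The old value is stashed in the result register.
                $regs[$result] = $vars[$regs[$op1]]++;
                break;
            case 'FREE':
                unset($regs[$op1]);
                break;
            case 'JMP':
                $next = $op1;  // back to the top of the loop
                break;
            case 'RETURN':
                return ['vars' => $vars, 'return' => $op1];
        }
        $ip = $next;
    }
}

$ops = [
    ['FETCH_W',    'i', null, 0],
    ['ASSIGN',     0,   0,    1],
    ['FETCH_R',    'i', null, 2],
    ['IS_SMALLER', 2,   5,    3],
    ['JMPZ',       3,   9,    null],  // assumed target: the RETURN op
    ['FETCH_RW',   'i', null, 4],
    ['POST_INC',   4,   null, 5],
    ['FREE',       5,   null, null],
    ['JMP',        2,   null, null],  // assumed target: re-fetch $i
    ['RETURN',     1,   null, null],
];

$state = run_loop($ops);
echo $state['vars']['i'], "\n";
```

Note how the stashed old value in Register 5 is written and then immediately freed, which is exactly the work a pre-increment would avoid.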

Before we move on, there is one last important example to look at. The example showing function calls earlier in this chapter uses strtoupper, which is a built-in function. Calling a function written in PHP looks similar to calling a built-in function:

<?php
function hello($name) {
  echo "hello\n";
}
hello("George");
?>

opnum   line                         opcode        op1        op2    result
    0      2                       ZEND_NOP
    1      5                  ZEND_SEND_VAL   "George"
    2      5                  ZEND_DO_FCALL    "hello"                   '0
    3      7                    ZEND_RETURN          1

But where is the function code? This code simply sets up the argument stack (via ZEND_SEND_VAL) and calls hello, but you don't see the code for hello anywhere. This is because functions in PHP are op arrays as well, as if they were miniature scripts. For example, here is the op array for the function hello:

FUNCTION: hello
opnum   line                         opcode        op1        op2    result
    0      2                   ZEND_FETCH_W     "name"                   '0
    1      2                      ZEND_RECV          1                   '0
    2      3                      ZEND_ECHO "hello%0A"
    3      4                    ZEND_RETURN       NULL

This looks pretty similar to the inline code you've seen before. The only difference is ZEND_RECV, which reads off the argument stack. As with standalone scripts, even though you don't explicitly return at the end, a ZEND_RETURN op is implicitly added, and it returns null.
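
You can observe the implicit null return from userspace. Here is a minimal check that reuses the hello() function from this example; the $result variable is introduced only for the demonstration.

```php
<?php
function hello($name)
{
    echo "hello\n";
    // No explicit return here: the implicitly added return yields null.
}

$result = hello("George");
var_dump($result);
```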

Including a file works similarly to calling a function:

<?php
include("file.inc");
?>

opnum   line                         opcode        op1        op2    result
    0      2           ZEND_INCLUDE_OR_EVAL "file.inc"                   '0
    1      4                    ZEND_RETURN          1

This illustrates an important aspect of the PHP language: All includes and requires happen at runtime. So when a script is initially parsed, the op array for that script is generated, and any functions and classes defined in its top-level file (the one that is actually run) are inserted into the symbol table; but no potentially included scripts are parsed yet. When the script is executed, if an include statement is encountered, the include is then parsed and executed on the spot. Figure 20.1 illustrates the flow of a normal PHP script.

Figure 20.1. The execution path of a PHP script.
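
The runtime nature of includes can be checked with a short experiment. This sketch writes a throwaway include file to the system temporary directory; the file contents and the defined_by_include() function name are made up for the example. It relies on the documented behavior that a successful include evaluates to 1 when the included file does not return a value of its own.

```php
<?php
// Write a throwaway include file; the path and function name are made up.
$file = tempnam(sys_get_temp_dir(), 'inc');
file_put_contents($file, '<?php function defined_by_include() { return "loaded"; }');

// Compiling this script did not look inside $file, so the function
// does not exist yet.
$before = function_exists('defined_by_include');

// The included file is compiled and executed here, at runtime.
$status = include $file;

$after = function_exists('defined_by_include');
var_dump($before, $status, $after);

unlink($file);
```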

This design choice has a number of repercussions:

Flexibility It is an oft-vaunted fact that PHP is a runtime language. One of the important things that being a runtime language means for PHP is that it supports conditional inclusion of files and conditional declaration of functions and classes. Here's an example:
```
if($condition) {
  include("file1.inc");
}
else {
  include("file2.inc");
}
```
In this example, the runtime parsing and execution of included files makes this operation more efficient (because files are included only when needed), and it eliminates the potential hassles of symbol conflicts if two files contain different implementations of the same function or class.
Speed Having to actually compile includes on the fly means that a significant portion of a script's execution time is spent simply compiling its dependent includes. If a file is included twice, it must be parsed and executed twice. include_once and require_once partially solve that problem, but it is further exacerbated by the fact that PHP resets its compiler state completely between script executions. (We'll talk about that more in a minute, as well as some ways to minimize that effect.)
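
The difference between include and include_once is easy to see with a file that counts how often it runs. This is a small sketch using a temporary file; the $runs counter and the file contents are invented for the demonstration. It relies on the documented behavior that include_once evaluates to true, without running the file again, when the file has already been included.

```php
<?php
// A throwaway include file that bumps a counter each time it is executed.
$file = tempnam(sys_get_temp_dir(), 'inc');
file_put_contents($file, '<?php $runs++;');

$runs = 0;
include $file;   // compiled and executed
include $file;   // compiled and executed again
$afterInclude = $runs;

// Already included, so the file is skipped this time.
$again = include_once $file;
$afterOnce = $runs;

var_dump($afterInclude, $again, $afterOnce);

unlink($file);
```

The counter advances on every plain include but not on the include_once, which is why the _once variants help with repeated inclusion within a single request but do nothing for the recompilation that happens on the next request.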