

In this article we take a look at the PHP tokenizer and its potential for analyzing and processing PHP source code. We will build several working examples, which you can start using and extending for your own purposes.
When PHP has to process a request, the engine goes through several passes of parsing until the code is expressed as a set of instructions that the interpreter can execute. The first such step is “lexical scanning”, which splits the code into smaller strings called “tokens”. A token is the smallest meaningful unit of your source code, and it can represent a reserved word (for, while, class, if, etc.), an operator (+, -, *, /, && etc.), a value literal (integer, float, string etc.) or another special symbol.
The same lexical scanner that PHP uses is also available to userspace PHP developers via the function token_get_all(). It is very simple to use: you pass your PHP source code as text, and it returns an array of tokens, which we will process further in the examples in this article.
Let's see what the tokenizer will output for a little code snippet:
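A minimal sketch of such a call follows. The sample string in $code is an assumption, chosen so that the token indexes line up with the ones discussed in the text. Numeric token type codes differ between PHP versions, so the sketch prints names via token_name() rather than raw codes.

```php
<?php
// Assumed sample input; any source wrapped in PHP open and close tags would do.
$code = '<?php echo "string1"."string2"; ?>';

// token_get_all() returns a flat array of tokens in source order.
$tokens = token_get_all($code);

foreach ($tokens as $index => $token) {
    if (is_array($token)) {
        // Array tokens carry: [0] type code, [1] source text, [2] line number.
        printf(
            "%d: %s %s (line %d)\n",
            $index,
            token_name($token[0]),
            var_export($token[1], true),
            $token[2]
        );
    } else {
        // Single-character tokens come back as plain strings.
        printf("%d: %s\n", $index, var_export($token, true));
    }
}
```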
Output of var_export($tokens) (whitespace adjusted for clarity):
We see that most tokens are arrays of three items. Item [0] is the token type code, item [1] is the slice of source code that the token represents, and item [2] is the number of the source line on which the token starts.
You can fetch a readable name for a token type with the function token_name(). The same names are also defined as constants in PHP. If you have code autocompletion in an IDE like Eclipse PDT, type T_ to see a full list of token name constants. See also the list of token types in the PHP manual: www.php.net/manual/en/tokens.php.
Notice that several tokens (index 4 and 6) are single-character strings. The token_get_all() function represents single-character operators and symbols in PHP in this manner, including curly braces, parentheses etc. You can think of that character as both the token's string representation and its token type.
Since reading the raw token array is hard, and we'll need this for debugging purposes and learning, here is a simple “reporting” function, which prints the tokens for human reading:
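One possible shape for such a function is sketched below. The column layout and the escaping of newlines are assumptions made for readability, not necessarily what the article's own reportTokens() prints.

```php
<?php
/**
 * Returns a plain-text report with one line per token:
 * index, token name, and the token's source text.
 */
function reportTokens(string $code): string
{
    $report = '';
    foreach (token_get_all($code) as $index => $token) {
        if (is_array($token)) {
            $name = token_name($token[0]);
            $text = $token[1];
        } else {
            // Single-character tokens have no type code; show the character itself.
            $name = $token;
            $text = $token;
        }
        // Make newlines and tabs visible so each token stays on one report line.
        $text = addcslashes($text, "\r\n\t");
        $report .= sprintf("%3d  %-28s %s\n", $index, $name, $text);
    }
    return $report;
}

echo reportTokens('<?php echo "string1"."string2"; ?>');
```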
Output of echo reportTokens($code):
This looks better. Don't forget to prepare the output for HTML if you want to run the report in a browser: echo "<pre>", htmlspecialchars(reportTokens($code)), "</pre>".
Many of us have become used to writing elaborate in-code documentation in the form of PHPDoc comments (see http://www.phpdoc.org). Unlike regular line and block comments, PHPDoc blocks are not stripped by the PHP engine, as they are accessible at runtime via the Reflection API. Thus, software licenses, code examples and method descriptions become a permanent part of your compiled PHP files, even with opcode cache engines such as APC or XCache.
If you have collected a big library of reusable code over time, which rarely changes, and you don't use this feature of the Reflection API, you can strip some unneeded weight from your library by removing comments and whitespace. It also speeds up parsing slightly for those of us deploying without an opcode cache.
We will need a handy “token position” cursor that we can advance, token by token, and share among all the methods that handle aspects of the parsing. Normally we would have to implement this ourselves. However, PHP arrays have a built-in array position cursor, which is perfect for our purposes. We'll use the following PHP functions for interacting with that cursor and the array:
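As an illustration, the sketch below drives the array cursor with the standard functions reset(), current(), key() and next(). Which of these the filter class actually relies on is an assumption.

```php
<?php
$tokens = token_get_all('<?php echo 1;');

// reset() moves the cursor back to the first element.
reset($tokens);

// current() returns the element under the cursor, or false past the end.
// Tokens are always strings or arrays, so a strict check against false is safe.
while (($token = current($tokens)) !== false) {
    $text = is_array($token) ? $token[1] : $token;
    // key() returns the index the cursor currently points at.
    printf("%d: %s\n", key($tokens), trim($text));
    // next() advances the cursor by one element.
    next($tokens);
}
```

Because the cursor lives inside the array itself, any method that receives the same array (for example through a static property) continues from wherever the previous method stopped.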
Let's make the skeleton for our filter class:
In the above code we walk through each token and append its output to self::$out with no modification. Next, we'll check for the token types T_WHITESPACE, T_DOC_COMMENT and T_COMMENT and redirect the flow to a private method which will handle any sequence of those three tokens. You will see that splitting the parsing subtasks into separate methods helps code comprehension a lot, especially in real-world scenarios where the parsing task may become far more complex.
We'll add the method skipWhiteAndComments(), which will advance the position cursor for as long as it keeps finding whitespace and comments. However, we don't want to ignore those tokens completely, as this can produce incorrect code. Instead, we'll replace any sequence of comments and whitespace with a single space. For example, if we have a constant called “name”, we do not want echo name or echo/*comment*/name to become echoname. Curiously, the built-in PHP utility for compacting source code (php -w, php_strip_whitespace()) currently renders invalid code in the latter case.
Notice that after the flow returns from method skipWhiteAndComments(), I added continue, which is needed so $t gets updated to the current token position set by skipWhiteAndComments() and is processed fully.
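The pieces described so far can be sketched as one small class. The class name CompactFilter and the property names are assumptions for illustration; the control flow follows the description above (a main loop, a helper for runs of whitespace and comments, and a continue after the helper returns).

```php
<?php
final class CompactFilter // hypothetical name, not taken from the article
{
    private static $tokens = [];
    private static $out = '';

    public static function process(string $source): string
    {
        self::$tokens = token_get_all($source);
        self::$out = '';
        reset(self::$tokens);

        while (($t = current(self::$tokens)) !== false) {
            if (self::isSkippable($t)) {
                self::skipWhiteAndComments();
                // Re-read the token the cursor now points at.
                continue;
            }
            self::$out .= is_array($t) ? $t[1] : $t;
            next(self::$tokens);
        }
        return self::$out;
    }

    // Advances past a whole run of whitespace/comments and emits one space for it.
    private static function skipWhiteAndComments(): void
    {
        while (($t = current(self::$tokens)) !== false && self::isSkippable($t)) {
            next(self::$tokens);
        }
        self::$out .= ' ';
    }

    private static function isSkippable($token): bool
    {
        return is_array($token)
            && in_array($token[0], [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT], true);
    }
}

echo CompactFilter::process('<?php echo/*comment*/name; ?>');
```

With this input the comment between echo and name is replaced by a single space, so the two words are not fused together.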
And it really is as simple as that: we're done. Here's a sample input to test our class with:
And the resulting output:
In practice, if we compact code like this we may find it hard to find the origin of an error as the lines of the statements have changed. However, it is possible to easily strip comments and whitespace while preserving the exact original lines for all statements. All we need to do is filter the tokens in method skipWhiteAndComments() down to the newline characters (\r and \n) they contain, and add them to the output:
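A minimal sketch of the line-preserving idea, written as standalone functions rather than as a change to the class: each run of whitespace and comments is reduced to the newline characters it contains, or to a single space if it contains none. The function names are assumptions for illustration.

```php
<?php
// Reduces a run of whitespace/comments to its newlines, or to one space.
function newlinesOrSpace(string $run): string
{
    $newlines = (string) preg_replace('/[^\r\n]+/', '', $run);
    return $newlines !== '' ? $newlines : ' ';
}

// Strips comments and indentation while keeping every statement on its original line.
function compactPreservingLines(string $source): string
{
    $skippable = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT];
    $out = '';
    $run = null; // text of the current whitespace/comment run, or null if none

    foreach (token_get_all($source) as $t) {
        if (is_array($t) && in_array($t[0], $skippable, true)) {
            $run = ($run ?? '') . $t[1];
            continue;
        }
        if ($run !== null) {
            $out .= newlinesOrSpace($run);
            $run = null;
        }
        $out .= is_array($t) ? $t[1] : $t;
    }
    if ($run !== null) {
        $out .= newlinesOrSpace($run);
    }
    return $out;
}

$source = "<?php\n/**\n * Adds two numbers.\n */\nfunction add(\$a, \$b)\n{\n    return \$a + \$b; // sum\n}\n";
echo compactPreservingLines($source);
```

Collecting the whole run before reducing it means the result does not depend on whether the tokenizer attaches a line comment's trailing newline to the comment token or to the following whitespace token.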
Here is the resulting output:
We see that the indentation and comments are gone, but the line numbers are preserved.
Source preprocessing is the act of using macro commands to modify your source code before compilation and execution. This may include skipping parts of the code, or adding new code, depending on a certain predefined condition. Just like our first example, source preprocessing may be too elaborate for our day-to-day application scripting tasks, but it becomes useful when handling large libraries of reusable code. You can filter out debug-only code, for example logging and various development-time aids. We can also have one source file compile to multiple platform-specific “driver” classes, for example one for each server we target (by defining symbols such as APACHE, IIS) or one for each database engine we target (such as MYSQL, PGSQL, SQLITE).
In this example we will implement two preprocessing constructs using the PHP tokenizer:
#IF_DEFINED <Symbol> ... #END_IF
#IF_DEFINED_INSERT <Symbol1>:<Code1>, <Symbol2>:<Code2>, ...
The first construct (IF_DEFINED ... END_IF) will allow us to selectively filter out or include portions of code depending on certain “symbols” we pass as environment to the preprocessor. The second construct (IF_DEFINED_INSERT) will insert a different code snippet depending on the defined symbols (I will show you the need for this a bit later). The “#” symbol in PHP is the start of a Perl-style line comment, and since this symbol is also used for preprocessing in other languages, it's a natural choice for our macro syntax.
We will start with the same class skeleton as in Example 1, and redirect to method processMacro() when we detect a T_COMMENT token. Because not every comment is a macro, the processMacro() method will return true or false depending on whether the comment was recognized as a macro:
Now we'll add the processMacro() method. We will check with a regular expression if the comment matches one of the supported macro syntax rules, and if not just return (the comment will display normally, we don't filter regular comments in this example). But if recognized, we send each macro to its own method for further processing.
Let's implement the IF_DEFINED_INSERT macro first. We parse the expression, and then just output the relevant code snippets for all defined symbols:
And for the IF_DEFINED block implementation, we should check if the symbol is defined or not, and depending on that, output or skip all tokens right up to the next END_IF macro:
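The three steps above can be sketched together as follows. The method names other than process() and processMacro(), the regular expressions, and the way the IF_DEFINED_INSERT argument is split (on commas, then on the first colon) are assumptions; code snippets that themselves contain a comma would need a more careful parser. Nested blocks are not handled.

```php
<?php
final class Preprocessor
{
    private static $tokens = [];
    private static $symbols = [];
    private static $out = '';

    public static function process(string $source, array $symbols): string
    {
        self::$tokens = token_get_all($source);
        self::$symbols = $symbols;
        self::$out = '';
        reset(self::$tokens);
        while (($t = current(self::$tokens)) !== false) {
            if (is_array($t) && $t[0] === T_COMMENT && self::processMacro($t[1])) {
                continue; // the macro handler already moved the cursor
            }
            self::$out .= is_array($t) ? $t[1] : $t;
            next(self::$tokens);
        }
        return self::$out;
    }

    // Returns true if the comment was a macro and has been handled.
    private static function processMacro(string $comment): bool
    {
        $comment = trim($comment); // older PHP versions keep the newline in the token
        if (preg_match('/^#IF_DEFINED_INSERT\s+(.+)$/s', $comment, $m)) {
            self::insertForSymbols($m[1]);
            return true;
        }
        if (preg_match('/^#IF_DEFINED\s+(\w+)$/', $comment, $m)) {
            self::conditionalBlock($m[1]);
            return true;
        }
        return false; // an ordinary comment
    }

    // Assumed argument format: Symbol1:Code1, Symbol2:Code2, ...
    private static function insertForSymbols(string $expression): void
    {
        foreach (explode(',', $expression) as $pair) {
            $parts = explode(':', $pair, 2);
            if (count($parts) === 2 && in_array(trim($parts[0]), self::$symbols, true)) {
                self::$out .= trim($parts[1]);
            }
        }
        next(self::$tokens); // step past the macro comment
    }

    // Outputs or drops every token up to the next #END_IF comment.
    private static function conditionalBlock(string $symbol): void
    {
        $keep = in_array($symbol, self::$symbols, true);
        next(self::$tokens); // step past #IF_DEFINED
        while (($t = current(self::$tokens)) !== false) {
            next(self::$tokens);
            if (is_array($t) && $t[0] === T_COMMENT && trim($t[1]) === '#END_IF') {
                return;
            }
            if ($keep) {
                self::$out .= is_array($t) ? $t[1] : $t;
            }
        }
    }
}

// Hypothetical input, not the article's own example class.
$source = <<<'PHP'
<?php
class Db
{
    public function quote($name)
    {
        return #IF_DEFINED_INSERT MYSQL:'`' . $name . '`', PGSQL:'"' . $name . '"'
        ;
    }
    #IF_DEFINED DEBUG
    public function debug($message) { error_log($message); }
    #END_IF
}
PHP;
echo Preprocessor::process($source, ['MYSQL']);
```

Because a line comment swallows the rest of its line, the terminating semicolon in the sample sits on the line after the IF_DEFINED_INSERT macro.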
And our simple preprocessor is complete. Let's try it with an example. I have created a simple class representing a database wrapper for MySQL, Microsoft SQL and PostgreSQL. My table/column quoting method has a different syntax for each database engine. Instead of manually maintaining three files or doing plenty of “if” checks at runtime, when many of my other methods are similar, I'll use our preprocessor to compile three classes out of a single source.
I also have an example debug block, which I can filter out at will depending on what kind of distribution I'm preparing:
Output for echo Preprocessor::process($source, array('MYSQL')) (whitespace adjusted for clarity):
Output for echo Preprocessor::process($source, array('MSSQL')) (whitespace adjusted for clarity):
Output for echo Preprocessor::process($source, array('PGSQL')) (whitespace adjusted for clarity):
Output for echo Preprocessor::process($source, array('PGSQL', 'DEBUG')) (whitespace adjusted for clarity):
Two things you can do to improve the Preprocessor are to support line-preserving processing similar to Example 1, and to handle nested IF_DEFINED blocks. This is left as an exercise for the reader.
In our last example we will parse a source file and return the names of all classes, interfaces and functions defined inside it. Such a list can then be saved and used for a flexible function/class autoloader or for reporting purposes.
Yet again we start with a skeleton class similar to our first example. This time, however, we have no $out member collecting output, but a $definition member onto which we'll push definition names. I also included the same skipWhiteAndComments() method, but removed any lines that produce output; the use of this method will become clear in a bit. We will redirect all T_CLASS, T_INTERFACE and T_FUNCTION tokens (standing for the class, interface and function reserved words, respectively) to a method we'll call readDefinition().
The structure is simple: the class/interface/function token is followed by one or more comment or whitespace tokens, and the next token must be the definition name identifier (of type T_STRING). We won't check the type of that last token, since we assume the file we are scanning has no syntax errors:
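A minimal sketch of this first version, written as a standalone function (the name scanDefinitionNames is an assumption). Unlike the description above, it does check that the name token is a T_STRING, because in current PHP versions the same keywords also appear without a following name, for example in anonymous functions and in Foo::class.

```php
<?php
function scanDefinitionNames(string $source): array
{
    $definitionTypes = [T_CLASS, T_INTERFACE, T_FUNCTION];
    $skippable = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT];

    $tokens = token_get_all($source);
    $names = [];
    reset($tokens);

    while (($t = current($tokens)) !== false) {
        next($tokens);
        if (!is_array($t) || !in_array($t[0], $definitionTypes, true)) {
            continue;
        }
        // Skip the whitespace and comments between the keyword and the name.
        while (is_array($n = current($tokens)) && in_array($n[0], $skippable, true)) {
            next($tokens);
        }
        // Deviation from the text: only accept an identifier as the name.
        if (is_array($n) && $n[0] === T_STRING) {
            $names[] = $n[1];
        }
    }
    return $names;
}

var_export(scanDefinitionNames(
    "<?php\nclass A { function m() {} }\ninterface B {}\nfunction c() {}\n"
));
```

For this input the result contains A, m, B and c: the method m is reported alongside the top-level definitions, since nothing stops the scan from entering the class body.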
Here's the source sample we'll test this on:
And the output of var_export(DefinitionScanner::scan($exampleSource)) (whitespace adjusted for clarity):
It worked well, except for one thing: all class methods ended up detected as standalone functions. This happened because scan() was allowed to enter a class definition and find all the method definitions in there. To solve this, we'll now build a method that skips over the entire definition code block (as delimited by curly brackets). Curly brackets can be nested, as they are also used for if/while/foreach blocks etc., so we will have to count the nesting level until the outermost block is closed. For simplicity, again, we assume there are no syntax errors in the code, and hence all code blocks are nested properly. Let's see how this looks in code:
Now let's augment readDefinition() with that functionality:
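The block-skipping step and its use from the definition reader can be sketched together as standalone functions (skipBlock and scanTopLevelDefinitions are assumed names). One detail beyond the description above: inside double-quoted strings the tokenizer reports the opening of {$var} and ${var} as the array tokens T_CURLY_OPEN and T_DOLLAR_OPEN_CURLY_BRACES while the closing brace is still a plain "}", so those two types are counted as openers as well.

```php
<?php
// Advances the array cursor just past the "}" that closes the next "{" block.
function skipBlock(array &$tokens): void
{
    $depth = 0;
    while (($t = current($tokens)) !== false) {
        next($tokens);
        $opensBrace = $t === '{'
            || (is_array($t) && in_array($t[0], [T_CURLY_OPEN, T_DOLLAR_OPEN_CURLY_BRACES], true));
        if ($opensBrace) {
            $depth++;
        } elseif ($t === '}') {
            $depth--;
            if ($depth === 0) {
                return; // the outermost block has just been closed
            }
        }
    }
}

function scanTopLevelDefinitions(string $source): array
{
    $definitionTypes = [T_CLASS, T_INTERFACE, T_FUNCTION];
    $skippable = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT];

    $tokens = token_get_all($source);
    $names = [];
    reset($tokens);

    while (($t = current($tokens)) !== false) {
        next($tokens);
        if (!is_array($t) || !in_array($t[0], $definitionTypes, true)) {
            continue;
        }
        while (is_array($n = current($tokens)) && in_array($n[0], $skippable, true)) {
            next($tokens);
        }
        if (is_array($n) && $n[0] === T_STRING) {
            $names[] = $n[1];
            // Jump over the body so nested definitions are not reported.
            skipBlock($tokens);
        }
    }
    return $names;
}

var_export(scanTopLevelDefinitions(
    "<?php\nclass A { function m() { if (true) { } } }\nfunction c() {}\n"
));
```

Passing the token array by reference lets skipBlock() move the same cursor that the scanning loop reads, so the loop resumes right after the closing brace.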
And here's the output for the same example source:
There are no misdetected definitions this time.
Wrap the source you pass to the tokenizer in <?php ?>. Just like when you execute a regular PHP page, the tags are required, or the tokenizer will not recognize the proper context and will parse your source as token type T_INLINE_HTML.
A long run of inline HTML may come back as more than one T_INLINE_HTML token. This is expected, and allows the PHP engine to process and output the page in small chunks versus all at once.
token_get_all() format.
Stan Vassilev has been employed in the IT industry for the past 9 years as a developer, software documentation writer, and web designer. His interests include software architecture and development, web applications, and graphical user interfaces, and he has been an active contributor to the Adobe Flash community for the last several years. Since 2005, he has specialized in OSS-powered backend development for online applications and services, using technologies such as Apache, PHP, Python and MySQL.

