Embedding code in a web page like PHP

PHP (and a few other languages) has a very rewarding feature: it lets you embed code in a web page. Moreover, you can interleave the code with the actual text, i.e. you can write stuff like:

Hello World!
<?php if( $showTheText ) { ?>
This is conditional web site text.
<?php } ?>
End of the world.

and depending on the variable’s state, the text in between the two statements will be output or not. You can also do loops like that.

Where do we want to get to?

For a while, I thought this must be complex magic. I knew how I could find the code in the text file, evaluate it, and then insert the result into the text. But that would not work for conditionals or loops. Then one day I realized it’s just a very clever re-framing of the problem.

Think of how you would write this in code without the embedding:

printf("Hello World!");
if( $showTheText )
{
	printf("This is conditional web site text.");
}
printf("End of the world.");

So how do we get from the first to the second? Well, it’s magic in the tokenizer and parser.

How a Tokenizer sees your text

A tokenizer is the part of a programming language implementation that reads your file, character by character, and splits it up into words, strings, and operators. A tokenizer is usually a state machine. E.g. in our second script, it starts in the state whitespace, then encounters the first character and switches to state identifier. Once it encounters the (, it creates the first token, “printf”, and then adds the “(” token. Then it encounters a quote mark and switches to state string, reads until it encounters the second quote mark and adds a string token “Hello World!”, and so on.
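To make the state machine idea concrete, here is a minimal sketch in Python. It is not how PHP's real tokenizer is written; the function name tokenize, the three state names, and the (kind, text) token format are all assumptions made for illustration, and it ignores things like escape sequences and comments.

```python
def tokenize(source):
    """Split source into (kind, text) tokens with an explicit state machine.

    Hypothetical sketch: no escape sequences, no comments, no numbers
    at the start of a word, and an unterminated string is silently dropped.
    """
    tokens = []
    state = "whitespace"
    current = ""  # characters collected for the token in progress

    # The extra newline makes sure a trailing identifier gets flushed.
    for ch in source + "\n":
        if state == "string":
            if ch == '"':
                # Closing quote: the string token is complete.
                tokens.append(("string", current))
                current = ""
                state = "whitespace"
            else:
                current += ch
            continue

        if state == "identifier":
            if ch.isalnum() or ch in "_$":
                current += ch
                continue
            # Any other character ends the identifier; that character
            # still has to be handled by the whitespace logic below.
            tokens.append(("identifier", current))
            current = ""
            state = "whitespace"

        # From here on we are in the whitespace state.
        if ch.isspace():
            pass
        elif ch == '"':
            state = "string"
        elif ch.isalpha() or ch in "_$":
            state = "identifier"
            current = ch
        else:
            # Everything else is treated as a one-character operator.
            tokens.append(("operator", ch))

    return tokens


if __name__ == "__main__":
    for token in tokenize('printf("Hello World!");'):
        print(token)
```

Fed the first line of the second script, this walks through exactly the state changes described above: whitespace, identifier, back to whitespace for the parenthesis, then string.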

How PHP sees your text

Now, in this case, we want to turn the start of our text into a “printf” command, so we need some way to detect that this is text to be printed. How do we do that? We add a new state printf_text. Printf text is basically like a string token, just with a different name, so we can tell it apart from regular strings, and with different (and distinct) start and end indicators.

Whenever PHP parses a web page, it starts out in printf_text state. It eats all the characters, until it hits a <?php. That is basically the closing quote of our printf_text string. Once we encounter that, we switch to whitespace state, and parse just like our example above. Then, at some point, we reach the ?> indicator, which is basically the indicator telling us to start a new printf_text string, like an opening quote for regular strings.

So basically, we’ve reversed the meaning of the <?php and ?> tags. Instead of starting and ending a code section, for the parser, they end and start a printf_text section. If we parse like this, our token list looks like this:

«printf_text 'Hello World!'», «if», «open parenthesis», «variable showTheText», «closing parenthesis», «open curly brace», «printf_text 'This is conditional web site text.'», «closing curly brace», «printf_text 'End of the world.'»
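The token list above can be reproduced with a short Python sketch. The names tokenize_page and tokenize_code are made up for this example, and several shortcuts are assumptions rather than what PHP actually does: it searches for the tags with str.find instead of going character by character, it trims whitespace around each text run so the output matches the list above, it tags every code token simply as "code", and it would be fooled by a ?> inside a string literal.

```python
import re

OPEN_TAG = "<?php"   # acts as the closing quote of a printf_text run
CLOSE_TAG = "?>"     # acts as the opening quote of the next printf_text run


def tokenize_code(code):
    """Very rough code tokenizer: words (optionally $-prefixed) or single symbols."""
    return [("code", text) for text in re.findall(r"\$?\w+|\S", code)]


def tokenize_page(page):
    """Tokenize a page that starts out in the printf_text state."""
    tokens = []
    pos = 0
    state = "printf_text"

    while pos < len(page):
        if state == "printf_text":
            # Eat everything up to the next <?php (or the end of the page).
            end = page.find(OPEN_TAG, pos)
            if end == -1:
                end = len(page)
            text = page[pos:end].strip()  # assumption: drop surrounding whitespace
            if text:
                tokens.append(("printf_text", text))
            pos = end + len(OPEN_TAG)
            state = "code"
        else:
            # Ordinary code until the next ?> (or the end of the page).
            end = page.find(CLOSE_TAG, pos)
            if end == -1:
                end = len(page)
            tokens.extend(tokenize_code(page[pos:end]))
            pos = end + len(CLOSE_TAG)
            state = "printf_text"

    return tokens


if __name__ == "__main__":
    page = (
        "Hello World!\n"
        "<?php if( $showTheText ) { ?>\n"
        "This is conditional web site text.\n"
        "<?php } ?>\n"
        "End of the world.\n"
    )
    for token in tokenize_page(page):
        print(token)
```

The only structural difference from an ordinary tokenizer is the initial value of state: starting in printf_text instead of in code is what flips the meaning of the two tags.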

That token list is basically a simpler form of the token stream the hand-written printf version would produce. The PHP parser can treat every printf_text token it encounters as if it were a printf command given that text and followed by a semicolon.
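As a rough illustration of that last step, the Python sketch below rewrites a token list into the equivalent hand-written code. The function name tokens_to_code and the ("printf_text", text) / ("code", text) token format are assumptions made for this example; a real parser would build the print command directly rather than generate source text.

```python
def tokens_to_code(tokens):
    """Turn (kind, text) tokens into source code with explicit printf calls."""
    parts = []
    for kind, text in tokens:
        if kind == "printf_text":
            # Escape backslashes and quotes so the text survives as a string literal.
            escaped = text.replace("\\", "\\\\").replace('"', '\\"')
            parts.append('printf("' + escaped + '");')
        else:
            # Code tokens are passed through unchanged.
            parts.append(text)
    return " ".join(parts)


if __name__ == "__main__":
    tokens = [
        ("printf_text", "Hello World!"),
        ("code", "if"),
        ("code", "("),
        ("code", "$showTheText"),
        ("code", ")"),
        ("code", "{"),
        ("printf_text", "This is conditional web site text."),
        ("code", "}"),
        ("printf_text", "End of the world."),
    ]
    print(tokens_to_code(tokens))
```

One caveat with taking printf literally: a % in the page text would be read as a format specifier, so an implementation would more likely map printf_text to a plain output command.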

Amazing how, sometimes, when you take a few steps back and squint at a problem, it suddenly turns out to be the reverse of your problem. I was trying to figure out how to run code in a text file, when really what I needed to do was embed the text file portions in my code.