PHP Markdown, speed, MovableType (was "Michel Fortin")

Tue Dec 7 12:35:39 EST 2004

Michel Fortin <michel.fortin at michelf.com> wrote on 12/02/04 at 5:52pm:

> You are right about this. Perl is a lot faster, at least when dealing 
> with regular expressions, and it shows when using Markdown vs. PHP 
> Markdown. It would be interesting to see a benchmarking tool included 
> with the testing suite John is working on so we can get a better 
> picture.

I had no idea that this was the case -- that it's faster to call my
Perl version from PHP than to use PHP Markdown natively. 

> I still hope the optimization included in 1.0.1b2 will improve things 
> for many people. I think people who write many long paragraphs without 
> manual line breaks will see the difference. But it's still way slower 
> than Perl.

I found this very surprising at first, because there's a bit of
overhead calling Perl from PHP because you have to launch an instance
of the perl interpretter. (This is why CGI is generally slower than
PHP -- PHP generally runs *within* the parent Apache process.)

When I get the chance, I will add some simple benchmarking to my
testing tool. (And I'll release the tool soon too, I promise.)

I've been thinking about this, and I have a theory why the Perl
version is so much faster.

As anyone who's glanced at the source code can tell, the entire
Markdown.pl engine is implemented as a series of regular expression
substitutions. Rather than using a handful of monolithic monster
regex patterns, it uses many smaller patterns, each of which apply
to one bit of Markdown syntax at a time.

This raised a few eyebrows when I first released the code, but in
practice, it seems to have turned out pretty well. It's a little
over a year now since Markdown.pl was in good enough shape for me to
use it at Daring Fireball, and in that time I've found the code to
be relatively easy to maintain.

For example, here's the subroutine that turns asterisks and
underscores into `<strong>` and `<em>` tags

    sub _DoItalicsAndBold {
        my $text = shift;

        # <strong> must go first:
        $text =~ s{ (\*\*|__) (?=\S) (.+?[*_]*) (?<=\S) \1 }
            {<strong>$2</strong>}gsx;

        $text =~ s{ (\*|_) (?=\S) (.+?) (?<=\S) \1 }
            {<em>$2</em>}gsx;

        return $text;
    }

This routine gets called separately for each span of text -- once
for each paragraph, list item, header, etc. For example, when
processing the Markdown syntax documentation file
(<http://daringfireball.net/projects/markdown/syntax>), it gets
called 153 times. The same goes for most of Markdown's other
subroutines.

This "do it all with a bunch of little regexes" strategy is, in my
opinion at least, a very Perl-ish strategy. Perl treats regex
matching and substitution as first order language constructs. The
`m//` and `s///` features are *operators* in Perl, just like
arithmetical operators such as +, -, and *. In other (normal?)
langauges, regular expression matching and substitution
are accomplished through function calls.

The fact that Perl treats regexes as part of the language means more
than just the way the syntax is parsed. Perl's interpretter cheats
in a bunch of very clever ways that help in terms of performance.

All regex engines -- Perl's, PHP's, Java's, BBEdit's, etc. -- work
in two stages. In the first stage, each regex pattern is compiled
from text into an internal binary object. Then, the regex engine
uses the compiled pattern object to do the matching.

As humans, we just write regex patterns as text. For example:

    foo\w*

is a pattern that will match any word that starts with "foo". When
it goes to match that pattern against target text, a regex engine
will compile the pattern, then use the compiled pattern to find a
match. This is similar, in a general way, to how a C compiler will
take source code written as text, and compile it into a binary file
of machine code.

Compilation is slightly expensive. If you're only compiling a few
patterns, a few times each, you won't notice. But if you're talking
about many patterns, or looping over them dozens or hundreds or
thousands of times, it adds up.

Perl is smart enough to cache the compiled state of static regex
patterns. I.e. if Perl looks at a pattern that it can tell is never
going to change, no matter how many times it is invoked -- such as
those in the _DoItalicsAndBold() routine above, as well as nearly
all the other patterns in Markdown -- it will only compile them
*once*, cache the results, and re-use the same compiled regex
objects each time they're invoked.

In other words, even though _DoItalicsAndBold() gets called 153
times when processing the syntax documentation file, those two
patterns are only compiled one time each.

I'm fairly certain that PHP does no such caching. Each time you call
preg_replace(), you're starting from scratch again. Instead of
compiling once and matching 153 times, PHP has to compile the
patterns 153 times *and* match them 153 times.

Here's a segment from Jeffrey Friedl's wonderful "Mastering Regular
Expressions" (2nd ed.), p. 351, where he's discussing the benefits
of Perl's caching (he ran benchmarks to test how much time they
save):

>   The results are pretty interesting. I ran a version that has no
>   interpolation (the entire regex manually spelled out within
>   m/.../), and used that as the basis of comparison. The
>   interpolation and check, if the regex doesn't change each time,
>   takes about 25x longer. The full pre-processing (which adds the
>   recompilation of the regex each time) takes about 1,000x longer!
>   Wow.
>   
>   Just to put these numbers into context, realize that even the full
>   pre-processing, despite being over 1,000x times slower than the
>   static regex literal pre-processing, still takes only about
>   0.00026 seconds on my system. (It benchmarked at a rate of 3,846
>   per second; on the other hand, the static regex literal's
>   pre-processing benchmarked at a rate of about 3.7 million per
>   second.)

Wow, indeed.

I am not a PHP expert. And unfortunately, my cursory research
indicates there is no way to cache compiled regex patterns in PHP.
Not just automatically, as Perl does, but even manually.

Python, on the other hand, allows the programmer to optionally
control the compilation of patterns before using them:

>   The sequence
> 
>       prog = re.compile(pat)
>       result = prog.match(str)
>   
>   is equivalent to
> 
>       result = re.match(pat, str)
>   
>    but the version using compile() is more efficient when the
>    expression will be used several times in a single program.

Cf. <http://docs.python.org/lib/node111.html>.

If there's a way to do something like the above Python snippet in
PHP, I think it would significantly speed up PHP Markdown.

-J.G.