PHP Markdown, speed, MovableType (was "Michel Fortin")

Tue Dec 7 13:03:33 EST 2004

On Tue, 2004-12-07 at 12:35 -0500, John Gruber wrote:
> I had no idea that this was the case -- that it's faster to call my
> Perl version from PHP than to use PHP Markdown natively. 

perl is much faster than people often give it credit for. Even the
"cost" of starting an interpreter is fairly small - it's a small
program, and on a webserver (which often has lots of memory) you usually
don't have to pull it from disk; it's in use so much that it's already
in memory. The start-up cost of the interpreter isn't huge either.

Object-oriented CGI written in Perl can often run slowly, but there are
a variety of reasons for that. perl itself is actually pretty speedy,
faster than PHP like-for-like (although the range of expression of
Perlish code perhaps gives it an unfair advantage).

> Perl is smart enough to cache the compiled state of static regex
> patterns. I.e. if Perl looks at a pattern that it can tell is never
> going to change, no matter how many times it is invoked -- such as
> those in the _DoItalicsAndBold() routine above, as well as nearly
> all the other patterns in Markdown -- it will only compile them
> *once*, cache the results, and re-use the same compiled regex
> objects each time they're invoked.

I think you may well be right, unless there is something else seriously
wrong with the PHP version. PHP uses PCRE after all, so I would expect
that the implementation wouldn't make a difference (some *regexes* can
be ferociously slow to run, and sometimes the implementation does make a
difference).

If PHP is unable to keep the compiled form of a static regex around -
and I don't believe it can (I hate PHP with a passion; but that's mostly
because I work with it most of the time) - then I don't really think
there's an easy algorithmic solution.
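
To make the "compile once, re-use" idea concrete, here is a minimal
sketch in Python (not PHP or Perl; it's only an illustration). The two
patterns and the function name are hypothetical stand-ins, not the real
Markdown patterns. The patterns are compiled when the module loads and
every later call re-uses the same compiled objects.

```python
import re

# Hypothetical stand-ins for two Markdown-style patterns.
# They are compiled once, at import time, and re-used on every call.
STRONG = re.compile(r"\*\*(.+?)\*\*")
EM = re.compile(r"\*(.+?)\*")


def do_italics_and_bold(text):
    # Handle ** first so it is not consumed as two single-star delimiters.
    text = STRONG.sub(r"<strong>\1</strong>", text)
    return EM.sub(r"<em>\1</em>", text)
```

Calling do_italics_and_bold() in a loop then pays the compilation cost
only once, however many paragraphs go through it.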

The obvious solution is to write cleverer regexes (i.e., do more in
each pass), but I'm not sure that's a sane way to proceed.

I wonder if there is a simple solution, though? It seems to me that the
way forward, if speed is a significant issue, is to write the program as
one large state machine - realistically speaking, there are only a few
things that are going to change its state. This would be an interesting
exercise anyway (it would quite quickly find ambiguities in the Markdown
syntax), but potentially loses the bug-for-bug compatibility of the PHP
version.
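
As a rough idea of what the state-machine approach might look like,
here is a toy sketch in Python that handles only `*` and `**` emphasis
in a single pass. The function name and the three states are my own
assumptions for illustration; a real Markdown state machine would need
many more states and would have to decide what to do with unclosed
delimiters.

```python
def render_inline(text):
    """Single-pass scanner for *em* and **strong** (toy example)."""
    out = []
    state = "text"  # one of: "text", "em", "strong"
    i = 0
    while i < len(text):
        if text.startswith("**", i) and state in ("text", "strong"):
            # "**" toggles between plain text and strong.
            out.append("<strong>" if state == "text" else "</strong>")
            state = "strong" if state == "text" else "text"
            i += 2
        elif text[i] == "*" and state in ("text", "em"):
            # A single "*" toggles between plain text and em.
            out.append("<em>" if state == "text" else "</em>")
            state = "em" if state == "text" else "text"
            i += 1
        else:
            # Anything else is copied through. This includes a lone "*"
            # inside strong, one of the ambiguities such a rewrite exposes.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Even this small version forces decisions the regex version never spells
out, such as what `**` means while already inside `*...*`.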

Maybe it's also possible to reduce the difficulty of the problem - one
of the major issues in parsing Markdown is the difficulty of
disambiguating new paragraphs (it can be 100% ambiguous; text which
contains "1." and happens to wrap to the start of a line is a good
example) - if you had to leave a clear blank line between each
"structure", that would save some of the regex-ing around. That's
actually how xMarkdown kind of
works - it separates things out structurally at first, which then allows
it to apply regexes on a paragraph, not line, basis. Almost divide and
conquer. Having an intermediate 'canonical' form would probably help.
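
The divide-and-conquer idea could be sketched like this, again in
Python and only as an illustration (this is not xMarkdown's actual
code; the function names and the three block kinds are hypothetical).
The text is split on blank lines first, and each block is then
classified by how it *starts*, so a "1." that merely wraps onto a new
line inside a paragraph is not mistaken for a list.

```python
import re


def split_blocks(source):
    # One or more blank (whitespace-only) lines separate blocks.
    return re.split(r"\n\s*\n", source.strip())


def classify(block):
    # Only the start of the block decides its kind (hypothetical rules).
    if re.match(r"\d+\.\s", block):
        return "ordered_list"
    if block.startswith("#"):
        return "heading"
    return "paragraph"


sample = (
    "Some text that mentions step\n"
    "1. and happens to wrap there.\n"
    "\n"
    "1. first item\n"
    "2. second item\n"
)
kinds = [classify(b) for b in split_blocks(sample)]
```

With the sample above, the first block is treated as a paragraph and
only the second as a list, because the block boundary is decided before
any per-block regex runs.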

-- bob.

PS. Would it be possible to time the PHP version tightly? Maybe there is
one specific area which is *really* bad - going through with some
microtime() calls and seeing which large chunks are sucking up all the
time might finger one obvious culprit. If it's just generally bad...
well, lose ;)
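
The kind of coarse timing I mean looks roughly like this. It's a Python
sketch using time.perf_counter() where the PHP version would use
microtime(); the helper names and labels are made up for the example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per labelled chunk of work.
timings = defaultdict(float)


@contextmanager
def timed(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] += time.perf_counter() - start


def slowest(n=3):
    # Labels sorted by total time spent, largest first.
    return sorted(timings.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Usage: wrap each large chunk of the converter in a timed() block.
with timed("lists"):
    sum(range(1000))  # placeholder for the real work
```

Wrapping each big pass this way and printing slowest() at the end
should show fairly quickly whether one chunk dominates.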