Incremental parser
Michel Fortin
michel.fortin at michelf.com
Wed Aug 15 11:24:02 EDT 2007
On 2007-08-14, at 21:54, Jacob Rus wrote:
> Okay, but the spec could easily be changed to continue emphasis
> until the end of the current paragraph, if no ending * is found.
> This would IMO be much more logical for document readers/writers
> (as well as computer parsers), as some asterisk 60 lines later
> would no longer affect part of the document on the current line.
I disagree about it being better for readers and writers. To me, a
sole asterisk or underscore doesn't mean emphasis. If someone
deliberately writes only one asterisk in front of a word, he probably
didn't mean "emphasis until the end of the paragraph" either.
Hum, and if your paragraph takes 60 lines, you'd better fix that
first. :-)
But seriously, if you happened to put an asterisk at the start of
a word 10 lines earlier but forgot the closing asterisk, would you
prefer the whole remaining paragraph to be screwed up, or to just have
one rogue asterisk where emphasis should have started? Personally,
I'd prefer the latter, as it would be more readable.
> Given that `*`s surrounded by spaces are left alone, I don't think
> this change would have any real downside, in terms of unintentional
> emphasis, and it would make it possible to have syntax highlighting
> which actually accurately matched the expected output, which would
> make my life as a document author much easier.
I wouldn't be so sure that no one has been writing asterisks at the
start of words. You're right though that by the current rules this
wouldn't be very common.
> Well, the specification is clearly ambiguous, if it can be
> interpreted (pretty reasonably AFAICT) in two completely different
> ways.
The specification for code spans has an ambiguity, I agree. I still
think my interpretation and the parsing currently implemented by both
Markdown.pl and PHP Markdown is more useful, but I guess that's still
open for debate (in the other thread).
> You don't actually need to look to the end of the document when
> constructing the parse tree, however. Links which actually have a
> reference can easily be distinguished from those which don't in a
> quick pass at the end of parsing.
The tricky case is when deciding between a link and literal text has
consequences for parsing subsequent text. For instance, take this image:
![some *image][] and text*
If there is no corresponding link definition, then "image][] and
text" is text and should be emphasized; otherwise, if the link is
defined, you have one literal asterisk inside the alt attribute and
another in the text, and no emphasis.
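To illustrate, assuming the reference points to a hypothetical
/img.png, the two outputs would be roughly:

    <p>![some <em>image][] and text</em></p>
    <p><img src="/img.png" alt="some *image" /> and text*</p>

with the first one produced when the definition is missing and the
second when it is present.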
The same problem could apply to links, although with links -- which
can contain markup -- you could use some complicated trickery to span
the emphasis inside and outside the link with two emphasis elements.
> Yes, but that's not really in the scope of our discussion; I don't
> particularly care how clearly invalid markdown is converted. It
> might be nice to specify (as the HTML5 spec is trying to do for
> HTML), but document authors don't need to know about it.
There is no such thing as "invalid Markdown" currently. When would
you call a Markdown document "invalid"?
> It requires one pass to create a document tree, and one more pass
> for a few other things, such as assigning links.
You could also do one pass to strip link definitions and another to
do the actual parsing incrementally. :-)
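A minimal sketch of what that first pass could look like (hypothetical
code, not what PHP Markdown actually does; the regex handles only the
simplest `[id]: url` form):

    <?php
    // First pass: strip reference-style link definitions into a lookup
    // table, leaving only the prose for the incremental parser.
    $urls = array();
    function _strip_link_definition($matches) {
        global $urls;
        $urls[strtolower($matches[1])] = $matches[2];
        return ''; // remove the definition from the document
    }
    $text = "[id]: http://example.com/\nSee [an example][id].";
    $text = preg_replace_callback(
        '{^[ ]{0,3}\[([^\]]+)\]:[ ]*<?(\S+?)>?[ ]*$}m',
        '_strip_link_definition', $text);
    // $urls now maps 'id' => 'http://example.com/'.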
>> Ok, back to performance. How many times do you start a new Perl
>> process when building the manual? My benchmarks indicate that
>> starting the
>
> AFAIK it only takes starting one perl process to run markdown on a
> large file...
Sure, that's true, but that doesn't answer my question. Is the manual
parsed as one big file or many smaller ones? And if it's only one file,
what size is it? I'm interested in understanding what makes it so
slow, but I don't have much data at hand to comment on the speed issue.
>> I'm totally not convinced that creating a byte-by-byte parser in
>> Perl or PHP is going to be very useful. Using regular expressions
>> is much faster than executing equivalent PHP code in my experience.
>
> This is a silly statement. Compiled vs. interpreted languages has
> nothing to do with the big-O run-time of an algorithm. Real
> parsers are optimized for this type of task and will take linear
> time (with suitable changes in the markdown spec), while regular
> expressions are not (and it's possible even that a document could
> be created that would take exponential time with the current
> markdown implementation).
>
> [...]
>
> TextMate's parser has to do quite a bit of extra work that a
> dedicated parser for a particular format would not have to do. I
> would not be at all surprised if a Perl or PHP implementation could
> be made extremely fast.
>
> [...]
>
> I really don't understand where you're coming from on this point.
> Why would a PHP state machine be so terribly slow?
This is not about compiled vs. interpreted languages. I mentioned
coding in C++ only because it's clear that with C++ (or most
compiled languages, for that matter) you can get at the very least the
same speed as with a regular expression. What really matters here is
the speed of PHP code vs. regular expressions, because those are
the two things I have at my disposal for PHP Markdown.
In all my tests, the more work you do inside the regular expression,
the faster it is. I've observed that many times while optimizing PHP
Markdown.
As a demonstration, I've made a small test program that counts the
characters 'a', 'b', and 'c' inside a string. There are nine different
implementations counting those characters. No matter the string
size (within reasonable limits), the pure-regex
implementation is always faster in my tests: even if it needs to pass
three times over the whole string, even if it generates irrelevant
arrays containing thousands of matches, it always surpasses the
simplest PHP loop in speed by a factor of more than five. Hybrid
approaches with a single regular expression and simple PHP branching
code tend to be in-between.
You can find the test script [here][1] if you want to check for
yourself. Here's a typical output (tested on an iBook G4 1.2 GHz, PHP
5.0.4 with Zend Optimizer, 200 KB string):
1) a: 12623; b: 12606; c:12373; time: 1346 ms.
2) a: 12623; b: 12606; c:12373; time: 2572 ms.
3) a: 12623; b: 12606; c:12373; time: 180 ms.
4) a: 12623; b: 12606; c:12373; time: 184 ms.
5) a: 12623; b: 12606; c:12373; time: 407 ms.
6) a: 12623; b: 12606; c:12373; time: 397 ms.
7) a: 12623; b: 12606; c:12373; time: 208 ms.
8) a: 12623; b: 12606; c:12373; time: 25 ms.
9) a: 12623; b: 12606; c:12373; time: 15 ms.
As for the algorithms used:
1: simple for loop + switch over characters
2: split text to an array + foreach + switch over characters
3: count(explode('a', $text)) (called 3 times: one for each character)
4: preg_match_all (called 3 times: one for each character)
5: preg_match_all + foreach on matches + switch over matched characters
6: preg_replace_callback + callback + switch over matched character
7: preg_replace_callback + callback (called 3 times: one for each character)
8: preg_replace + string length difference
9: str_replace + string length difference
[1]: http://www.michelf.com/docs/charcount.zip
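To give an idea of what these implementations look like, here is
roughly what the slowest and fastest approaches (1 and 9) boil down to;
a simplified sketch, not the exact code from the test script:

    <?php
    $text = str_repeat("abcxyz", 1000);

    // 1: simple for loop + switch -- pure PHP, slowest in my tests.
    $count1 = array('a' => 0, 'b' => 0, 'c' => 0);
    for ($i = 0, $len = strlen($text); $i < $len; $i++) {
        switch ($text[$i]) {
            case 'a': $count1['a']++; break;
            case 'b': $count1['b']++; break;
            case 'c': $count1['c']++; break;
        }
    }

    // 9: str_replace + string length difference -- one native call per
    // character, fastest in my tests.
    $count9 = array();
    foreach (array('a', 'b', 'c') as $char) {
        $count9[$char] = strlen($text) - strlen(str_replace($char, '', $text));
    }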
Notice how the simple PHP loop (1) is absurdly slow compared to
even the most ridiculous approaches using a regular expression.
Common sense would tell us that the first approach should be the
fastest because all the others do extra work, but that's certainly
not the case: all the other approaches, which rely on native functions,
are faster by a wide margin, with run time almost proportional to the
amount of PHP code they have to execute. That means that executing
PHP code *is* in itself a lot more work than anything else done in
these tests.
So the issue isn't so much algorithmic complexity; it's that PHP code
is an order of magnitude slower to execute than regular expressions,
or any native code for that matter. The smaller the amount of PHP code,
and the less it is repeated, the better the performance;
that's how I have optimised PHP Markdown over the last few years.
Another case in point: PHP Markdown Extra's HTML block parser, which
works mostly as an incremental parser. Going into the details would
take too long, but it's generally much slower than PHP Markdown's
HTML block parser, even though the latter traverses the whole document
5 times with different regular expressions.
Note that all this is not about incremental parsing vs. a bunch of
regexes; it's PHP speed vs. regex speed. Regular expressions can be
used pretty effectively to speed up tokenization inside an
incremental parser (as Extra's HTML block parser does). I'm pretty sure
I could come up with something that parses Markdown's whole block gamut
incrementally by combining all the blocks into a single regex and
then looping on the result. This would surely improve performance a
little, although it could also make the code harder to maintain.
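To give an idea of the kind of thing I have in mind, here's a rough,
hypothetical sketch of such a combined-regex tokenizer, reduced to two
block types (this is not PHP Markdown code, and it cuts many corners):

    <?php
    // One combined regex locates the next block construct; a PHP loop
    // dispatches on which alternative matched.
    function parseBlocks($text) {
        $out = '';
        $pattern = '{(?:\A|\n)(?:'.
            '(?P<heading>\#{1,6}[ ]*[^\n]+)'.           // ATX-style heading
            '|(?P<hr>[ ]{0,3}(?:\*[ ]*){3,}(?=\n|\z))'. // horizontal rule
            ')}';
        while (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE)) {
            // Text before the match is treated as a plain paragraph.
            $before = trim(substr($text, 0, $m[0][1]));
            if ($before !== '') $out .= "<p>$before</p>\n";

            if (isset($m['heading']) && $m['heading'][1] != -1) {
                // Always <h1> here, just to keep the sketch short.
                $out .= '<h1>' . trim($m['heading'][0], "# \n") . "</h1>\n";
            } else {
                $out .= "<hr />\n";
            }
            // Resume scanning right after the matched block.
            $text = substr($text, $m[0][1] + strlen($m[0][0]));
        }
        if (trim($text) !== '') $out .= '<p>' . trim($text) . "</p>\n";
        return $out;
    }

    echo parseBlocks("Intro text.\n# A heading\nMore text.\n* * *\nThe end.\n");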
Now, about algorithmic complexity (big-O). It's certainly true that
backtracking in regular expressions can be pretty bad. Twice I have
had to fix problems like this: the first was a few years ago in the
header expression ported from Markdown.pl (which Perl was optimising
better, avoiding the backtracking); the second was in PHP Markdown
Extra's HTML block parser and was reported by someone using pretty
bad HTML as input. In both cases it was easy to fix, and I'm pretty
confident there are no more loopholes now. PHP supports (?> once-only
subpatterns ) (Perl does too): those have been invaluable for avoiding
unwanted backtracking.
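As a contrived illustration of the kind of pattern that used to bite me
(not PHP Markdown's actual expression), compare a nested quantifier
with and without a once-only subpattern:

    <?php
    // A failing match forces the engine to consider many ways of
    // splitting the letters between the two quantifiers.
    $input = str_repeat('word ', 25) . '!';

    // Nested quantifier: can backtrack catastrophically when the final
    // '$' fails because of the '!'.
    $slow = '/^(\w+[ ]*)+$/';

    // Once-only subpattern: whatever (?>\w+[ ]*) has matched is never
    // given back, so the failure is detected in a single pass.
    $fast = '/^(?>\w+[ ]*)+$/';

    var_dump(preg_match($fast, $input)); // int(0), returns immediately
    var_dump(preg_match($slow, $input)); // very slow, or stopped short
                                         // by PCRE's backtracking limit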
Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/