Incremental parser (was: Backtick Hickup)

Michel Fortin michel.fortin at michelf.com
Sun Sep 2 10:44:30 EDT 2007


On 2007-08-28 at 18:51, Allan Odgaard wrote:


>> Then you talk about the lack of extensibility of the language
>> grammar (which I'm not sure what you mean by that, is there a
>> language grammar for Markdown anyway?).
>
> With a formal grammar, extending the syntax is generally just
> adding or editing a rule, and we have the syntax extension. By hand-
> writing the parser, you tend to end up with code written for a very
> specific purpose generally not easy to extend. Tweak something one
> place in the source, and you break something in another place, I
> think we have seen that already on a few occasions (when something
> is fixed/changed in Markdown.pl).


A case in point would be Markdown.pl 1.0.2b1, which added a fix for
this:

<span attr='`ticks`'>like this</span>

but at the same time created a problem which did not exist previously
with this case:

`<span attr='`ticks`'>like this</span>`

In Markdown.pl, this problem is still unfixed as we speak -- you can
confirm that for yourself on the Dingus. PHP Markdown has handled both
cases correctly since 1.0.1d by making the HTML tokenizer aware of
code spans, and in yesterday's release, 1.0.1i, it's handled by a small
incremental parser for HTML tags, code spans, and backslash escapes,
all done in one stage.
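
To illustrate the "one stage" idea, here is a minimal sketch in Python
(not PHP, and not the actual PHP Markdown code) of a single-pass
tokenizer that recognizes backslash escapes, code spans, and HTML tags
together. The names ESCAPE, CODE, TAG, and tokenize are hypothetical,
and the patterns are simplified assumptions. The point is only that
whichever construct starts first wins, so a tag inside a code span
stays literal and backticks inside a tag attribute stay literal.

```python
import re

# Simplified, assumed patterns; real Markdown rules are more involved.
ESCAPE = r"(?P<escape>\\[\\`*_{}\[\]()#+.!<>-])"
# A run of N backticks, some content, then a run of exactly N backticks.
CODE = r"(?P<code>(?<!`)(?P<ticks>`+)(?!`).+?(?<!`)(?P=ticks)(?!`))"
# An opening or closing tag; quoted attribute values may contain anything.
TAG = r'''(?P<tag></?[A-Za-z][\w:-]*(?:[^'">]|'[^']*'|"[^"]*")*>)'''

TOKEN = re.compile("|".join([ESCAPE, CODE, TAG]), re.DOTALL)


def tokenize(text):
    """Split text into (kind, value) pairs in one left-to-right pass."""
    tokens = []
    pos = 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        # Exactly one of the three top-level alternatives matched.
        kind = next(k for k in ("escape", "code", "tag")
                    if m.group(k) is not None)
        tokens.append((kind, m.group()))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


if __name__ == "__main__":
    for sample in ("<span attr='`ticks`'>like this</span>",
                   "`<span attr='`ticks`'>like this</span>`"):
        print(tokenize(sample))
```

In the first sample the tag is found first, so its backticks are never
seen as a code span. In the second the opening backtick comes first, so
the tag text ends up inside code spans instead of being protected as
HTML.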



>> Then you go on the lack of performance (are you calling this a
>> syntax or parser issue or both?).
>
> I mention that because if we had a grammar and a generated parser,
> we would get a known good time complexity and pretty efficient code.


The time complexity would be known, but the speed of the generated
parser is only as good as the parser generator can get. I'm curious
to see how a generated parser can perform in PHP, or in Perl, for a
complex syntax like Markdown. Any example?


> I.e. my point was that all these problems I raise are really all
> rooted in the lack of a grammar -- sure we can address them even
> w/o a grammar, and maybe it is not (all) the case with the PHP
> Markdown implementation, I was just adding some (more) arguments
> for why I would like to see the goal of a formal grammar be taken
> more serious.


What do you mean by "taken more serious"?

Up to now, you've expressed your wish for Markdown to have a formal
grammar and backed it up with plenty of good arguments, but I'm still
not catching what you're trying to make happen. Are you hoping John
Gruber will reappear and say he has rewritten Markdown as a formal
grammar?

Or perhaps you want to convince me to do it... I'm convinced it'd be
useful for plenty of reasons you've pointed out. But it turns out
that I have plenty of other things to do and I'm not so interested in
writing a formal grammar by myself (not that I wouldn't be willing to
help if someone was doing it).

Or perhaps you just want me (maybe others) to commit to using the
grammar if you come up with one...


>> [...]
>> I don't really want to see the syntax changed in and out only to
>> make it easier to implement as an incremental parser.
>
> Yeah, that is a more interesting discussion -- how much would be
> okay to change? For example if we change the rules so that we had
> _emphasis_ and *strong*, we would solve the problem with ***, and
> IMO a welcomed change since typing four asterisks for bold is
> tedious and noisy in the text (granted, cmd-B will do the asterisks
> for me, but still…)


I think that should be decided on a case-by-case basis. A first draft
of the grammar for a particular syntax is written, then reviewed, and
we then decide if it needs to be made more complex to better handle
current Markdown documents.

But changing single asterisks to mean "strong emphasis" diverges way
too much in my opinion. I almost always use single asterisks to denote
emphasis, not strong emphasis, and I expect such a change may break
about half the Markdown documents out there. That's what I'd call
forking.



>> I don't think such a parser would be usable (read fast-enough) in
>> PHP anyway. Well, perhaps it could be, but not in the traditional
>> sense of an incremental parser; the concept would probably need to
>> be stretched a lot to fit with regular expressions.
>
> I am not sure what you base these assumptions on. What exactly is
> it that makes PHP so extremely slow that it is unfitted for a
> parser, yet the current (granted, regexp-based) PHP Markdown works
> fine?


I was thinking about a byte-by-byte parser written in plain PHP at
the time. See the second half of my recent reply to Jacob Rus... just
after "Why would a PHP state machine be so terribly slow?":

<http://six.pairlist.net/pipermail/markdown-discuss/2007-August/000740.html>


Note how it's the silliest techniques (from a compiled-language
standpoint) that perform the fastest in PHP in the benchmarks cited
in the email above. I don't know much about parser generators, but I
suspect they may not be so well-suited for performance in PHP.

Anyway, if a generated parser is not fast enough, it's always possible
to use a formal grammar as the basis for writing a parser more
optimized than what a parser generator can produce. I'm not presenting
this as an argument against a formal grammar.



>> Hum, I disagree strongly here that creating links to nowhere (#)
>> is the solution to undefined reference links. This is bad
>> usability for authors who will need to test every links in
>> resulting page to make sure they're linking where they should be
>
> On the contrary, add this to your preview style sheet:
>
> a[href="#"] {
>     background: blue;
>     border: 2px solid red;
>     color: white;
> }
>
> Now you have a very good indicator for missing links, contrary to
> now, where they easily blend in with the regular text, and there is
> no simple way to find them.


That's assuming you have a separate preview mode, are using a special
preview stylesheet, and that you actually look at the preview before
publishing.

I would not expect Markdown's usability to depend on such a precise
workflow. For instance, have you thought about the poor commenter on
a website who doesn't know the comment form uses Markdown, writing:

Type these three keys in sequence: [1] [2] [3]

getting this:

<p>Type these three keys in sequence: <a href="#">1</a> [3]

and seeing that in his browser:

Type these three keys in sequence: 1 [3]

Even assuming the user did preview his comment before posting it,
he'll probably struggle to figure out what's happening and to find a
fix.

Markdown is often used in a context where the user doesn't even
*know* that what he/she is writing will pass through a Markdown
parser, and, with a few exceptions (like underscore emphasis within a
word), Markdown works very well for that; your proposed changes would
make Markdown unsuitable for these environments.
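
For comparison, here is a rough Python sketch (again not the PHP or
Perl of the real implementations) of the behaviour I'm arguing for: a
reference link whose id has no definition is left exactly as typed.
REF_LINK, resolve_reference_links, and the definitions dictionary are
hypothetical names, and the pattern is a simplification that ignores
nested brackets and link titles.

```python
import re
from html import escape

# Simplified assumption: "[label] [id]" with an optional single space.
REF_LINK = re.compile(r"\[([^\[\]]+)\] ?\[([^\[\]]*)\]")


def resolve_reference_links(text, definitions):
    """Turn defined reference links into <a> tags; leave the rest alone."""
    out = []
    pos = 0
    while True:
        m = REF_LINK.search(text, pos)
        if m is None:
            break
        label = m.group(1)
        ref_id = (m.group(2) or label).lower()  # "[label][]" reuses the label
        url = definitions.get(ref_id)
        if url is None:
            # Undefined id: keep the opening bracket as plain text and
            # resume scanning right after it.
            out.append(text[pos:m.start() + 1])
            pos = m.start() + 1
        else:
            out.append(text[pos:m.start()])
            out.append('<a href="%s">%s</a>' % (escape(url, quote=True), label))
            pos = m.end()
    out.append(text[pos:])
    return "".join(out)


if __name__ == "__main__":
    comment = "Type these three keys in sequence: [1] [2] [3]"
    print(resolve_reference_links(comment, {}))
```

With no definitions the commenter's sentence comes back unchanged,
brackets included, which is what someone who has never heard of
Markdown would expect to see.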


Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/



