Re: Formal Grammar — some thoughts

Michel Fortin michel.fortin at michelf.com
Sun Jul 30 16:34:39 EDT 2006


Le 29 juil. 2006 à 17:54, A. Pagaltzis a écrit :


> I wouldn’t go for a pure formal grammar. If you don’t, then it’s
> easy to tolerate ambiguity in the language by deferring
> disambiguation until possible. Just accumulate potential tokens
> and only assign meaning once it’s decidable.


Personally, I'd do it with multiple passes of tokenization. I'd first
tokenize block-level elements and define a particular rendering
procedure for each of these block-level tokens. Then, when parsing of
span-level elements is needed inside block-level tokens, I'd tokenize
the text content of these blocks (with proper indentation removed as
needed) into span-level tokens. This means you'd have two grammars:
one to separate block elements, one to separate span elements.
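A minimal sketch of that two-pass idea in Python (the token names and the toy regexes here are my own illustrative assumptions, not any real Markdown implementation):

```python
import re

def tokenize_blocks(text):
    # First pass: split the document into block-level tokens.
    # Toy definition of a "block": a run of lines separated by blank lines.
    blocks = re.split(r"\n\s*\n", text.strip())
    tokens = []
    for block in blocks:
        if re.match(r"^#{1,6} ", block):
            tokens.append(("heading", block))
        elif re.match(r"^( {4}|\t)", block):
            tokens.append(("codeblock", block))
        else:
            tokens.append(("paragraph", block))
    return tokens

def tokenize_spans(text):
    # Second pass, run only on the text content of suitable blocks:
    # split into span-level tokens using a separate (toy) span grammar.
    parts = re.split(r"(\*[^*]+\*|`[^`]+`)", text)
    tokens = []
    for part in parts:
        if not part:
            continue
        if part.startswith("*"):
            tokens.append(("emphasis", part[1:-1]))
        elif part.startswith("`"):
            tokens.append(("code", part[1:-1]))
        else:
            tokens.append(("text", part))
    return tokens

doc = "# Title\n\nSome *emphasized* text."
for kind, content in tokenize_blocks(doc):
    if kind == "paragraph":
        print(tokenize_spans(content))
```

The point is only the structure: two independent grammars, with the span grammar applied to block contents after indentation handling, never to the raw document.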

I'd like to point out that, in my view, John's implementation is
already doing tokenization in some form. The most obvious case is the
replacement of HTML blocks by md5 hashes. If you consider each hash
as a token, and the text before and after it as text tokens too, you
have, in a way, a string composed of tokens. It's not really the
usual way of working with tokens, but keeping all the tokens inside a
single string makes it possible to pass the entire text through a
single regular expression.
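Here's a rough sketch of that hash-as-token idea (the function names and the deliberately naive notion of an "HTML block" are my assumptions, not John's code):

```python
import hashlib
import re

def hash_html_blocks(text, store):
    # Replace each raw HTML block with its md5 hash. The hash acts as
    # a token, and the text around it as text tokens, yet the whole
    # thing stays one string you can still run a single regex over.
    def repl(match):
        key = hashlib.md5(match.group(0).encode()).hexdigest()
        store[key] = match.group(0)
        return key
    # Naive stand-in for the real block matcher: a <div>...</div> pair.
    return re.sub(r"<div\b.*?</div>", repl, text, flags=re.S)

def unhash(text, store):
    # At output time, swap every hash token back for its HTML.
    for key, html in store.items():
        text = text.replace(key, html)
    return text
```

Everything between hashing and unhashing can then transform the string freely without ever touching the protected HTML.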

Markdown then separates blocks and renders them, replaces them in the
text with the generated HTML, then reuses the HTML block parser to
hash, or "tokenize", what it just outputted. (It would be better to
hash/tokenize blocks directly instead of relying on the HTML block
parser to catch them all later, and this is what PHP Markdown Extra
does.)

A similar strategy could be used for span-level elements too. PHP
Markdown Extra already creates hashes for some kinds of span-
level tags, which prevents Markdown from interfering with the content
of <script>, <math>, or <code>. The same strategy could be used with
emphasis, links, and other generated markup to prevent invalid
nesting. For example, let's create a link in this new "tokenized" way
from this input:

__some text [with a link__ oh!](somewhere)

When Markdown encounters the link, it'll use this markdown text:

with a link__ oh!

When processed with doSpanGamut, the text is unchanged. The link is
then formed:

<a href="somewhere">with a link__ oh!</a>

tokenized (md5 hash):

c168b0c687ed1c4696a41207dd654824

and inserted in the text:

__some text c168b0c687ed1c4696a41207dd654824

When the actual HTML output is created, hash values are replaced by
their corresponding valid HTML strings, and then you have this
perfectly valid span-level HTML snippet:

__some text <a href="somewhere">with a link__ oh!</a>

See? No invalid nesting anymore!
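The walkthrough above can be sketched like this (the function names are mine, not PHP Markdown Extra's, and the link regex is deliberately simplistic):

```python
import hashlib
import re

hashes = {}

def do_anchors(text):
    # Turn [text](url) into an <a> tag, hash the generated HTML, and
    # put the hash token into the text so that later passes (emphasis,
    # etc.) can no longer reach inside the link's markup.
    def repl(m):
        html = '<a href="%s">%s</a>' % (m.group(2), m.group(1))
        key = hashlib.md5(html.encode()).hexdigest()
        hashes[key] = html
        return key
    return re.sub(r"\[([^\]]+)\]\(([^)]+)\)", repl, text)

def unhash(text):
    # Final step: replace each hash token with its stored HTML.
    for key, html in hashes.items():
        text = text.replace(key, html)
    return text

src = "__some text [with a link__ oh!](somewhere)"
out = unhash(do_anchors(src))
print(out)  # __some text <a href="somewhere">with a link__ oh!</a>
```

Because the emphasis pass never sees the link's contents, the stray __ markers stay literal instead of producing tags that cross the link boundary.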

I recognize that md5 hashes are somewhat overkill for this process.
In fact, any alphanumeric string that isn't present in the input
text is suitable as a "token". You could, for instance, label them
"x1x", "x2x", "x3x" in their order of insertion: it'd work
beautifully, as long as you prevent any pre-existing "x<digit>x"
sequence in the input from being seen as a token.
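One cheap way to handle that collision problem (a sketch of my own, with an assumed fallback scheme): widen the label until nothing in the input looks like a token.

```python
import itertools
import re

def make_token_factory(text):
    # Hand out labels "x1x", "x2x", ... but, if the input already
    # contains an "x<digits>x" look-alike, fall back to "xx1xx",
    # "xxx1xxx", and so on until no such pattern appears in the text.
    n = 1
    while re.search(r"%s\d+%s" % ("x" * n, "x" * n), text):
        n += 1
    counter = itertools.count(1)
    def token():
        return "%s%d%s" % ("x" * n, next(counter), "x" * n)
    return token

tok = make_token_factory("plain input, nothing token-shaped here")
print(tok(), tok())  # x1x x2x
```

Scanning the input once up front is far cheaper than computing an md5 per token, and the guarantee is the same: no token string already exists in the text.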

This is far from having a formal grammar, but it shows that a lot
more could be done by reusing the current approach.


Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/




More information about the Markdown-Discuss mailing list