Re: Formal Grammar — some thoughts
    Michel Fortin 
    michel.fortin at michelf.com
       
    Sun Jul 30 16:34:39 EDT 2006
    
    
  
On 29 Jul 2006, at 17:54, A. Pagaltzis wrote:
> I wouldn’t go for a pure formal grammar. If you don’t, then it’s
> easy to tolerate ambiguity in the language by deferring
> disambiguation until possible. Just accumulate potential tokens
> and only assign meaning once it’s decidable.
Personally, I'd do it with multiple passes of tokenization. I'd first  
tokenize block-level elements and define a particular rendering  
procedure for each of these block-level tokens. Then, when parsing of  
span-level elements is needed inside block-level tokens, I'd tokenize  
the text content of these blocks (with proper indentation removed as  
needed) into span-level tokens. This means you'd have two grammars:  
one to separate block elements, one to separate span elements.
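
To make the two-grammar idea concrete, here is a minimal sketch in Python. It is not how Markdown.pl or PHP Markdown work; the function names, the three block kinds, and the two span kinds are assumptions picked only for illustration.

```python
import re

def tokenize_blocks(text):
    """First pass: split the text into (kind, content) block tokens."""
    blocks = []
    for chunk in re.split(r"\n{2,}", text.strip("\n")):
        lines = chunk.split("\n")
        if all(line.startswith("    ") for line in lines):
            # Code block: remove the four-space indentation.
            blocks.append(("code", "\n".join(line[4:] for line in lines)))
        elif all(line.startswith(">") for line in lines):
            # Blockquote: remove the "> " marker from each line.
            stripped = [re.sub(r"^> ?", "", line) for line in lines]
            blocks.append(("quote", "\n".join(stripped)))
        else:
            blocks.append(("para", chunk))
    return blocks

def tokenize_spans(text):
    """Second pass: split a block's content into (kind, content) span tokens."""
    tokens = []
    pos = 0
    for m in re.finditer(r"`([^`]+)`|\*([^*]+)\*", text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        if m.group(1) is not None:
            tokens.append(("code", m.group(1)))
        else:
            tokens.append(("em", m.group(2)))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens

def parse(text):
    """Run the block grammar, then the span grammar where it applies."""
    result = []
    for kind, content in tokenize_blocks(text):
        if kind == "code":
            result.append((kind, content))  # code blocks get no span parsing
        else:
            result.append((kind, tokenize_spans(content)))
    return result
```

Each block kind decides for itself whether the span grammar runs on its content, which is the point of keeping the two passes separate.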
I'd like to point out that in my view John's implementation is  
already doing tokenization in some form. The most obvious is the  
replacement of HTML blocks by md5 hashes. If you consider the hash as  
a token, and the text before and after it as text tokens too, you  
have, in a way, a string composed of tokens. It's not really the
usual way of working with tokens, but having all the tokens in a
single string makes it possible to pass the entire text through a
single regular expression.
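
As a rough illustration of "hash as token", the Python sketch below replaces raw HTML blocks with md5 hashes and then reads the resulting string back as a token list. The `<div>`-only pattern and the names `hash_html_blocks` and `as_tokens` are simplifying assumptions; the real HTML block matching is more involved and handles nesting.

```python
import hashlib
import re

def hash_html_blocks(text, table):
    """Replace each <div>...</div> block with its md5 hash, remembered in table."""
    def stash(match):
        key = hashlib.md5(match.group(0).encode("utf-8")).hexdigest()
        table[key] = match.group(0)
        return key
    # Simplification: non-greedy match, so nested <div> elements are not handled.
    return re.sub(r"<div\b.*?</div>", stash, text, flags=re.DOTALL)

def as_tokens(text, table):
    """View the hashed string as a list of (kind, value) tokens."""
    tokens = []
    for piece in re.split(r"([0-9a-f]{32})", text):
        if piece in table:
            tokens.append(("html", table[piece]))
        elif piece:
            tokens.append(("text", piece))
    return tokens
```

The string stays a single string throughout, so any later regular expression sees the hash as opaque text and leaves the hidden HTML alone.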
Markdown then separates blocks and renders them, replaces them in the
text with the generated HTML, then reuses the HTML block parser to
hash, or "tokenize", what it has just output. It would be better to
hash/tokenize blocks directly instead of relying on the HTML block
parser to catch them all later, and this is what PHP Markdown Extra
does.
A similar strategy could be used for span-level elements too. PHP
Markdown Extra already creates hashes for some kinds of inline-level
tags, which prevents Markdown from interfering with the content
of <script> or <math> or <code>. The same strategy could be used with
emphasis, links, and other generated markup to prevent invalid
nesting. For example, let's build a link the new "tokenized" way,
starting from this:
     __some text [with a link__ oh!](somewhere)
When Markdown encounters the link, it'll use this markdown text:
     with a link__ oh!
When processed with doSpanGamut, the text is unchanged. The link is  
then formed:
     <a href="somewhere">with a link__ oh!</a>
tokenized (md5 hash):
     c168b0c687ed1c4696a41207dd654824
and inserted in the text:
     __some text c168b0c687ed1c4696a41207dd654824
When the actual HTML output is created, hash values are replaced by
their corresponding valid HTML string, and then you have this
perfectly valid span-level HTML snippet:
     __some text <a href="somewhere">with a link__ oh!</a>
See? No invalid nesting anymore!
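
The walkthrough above can be reproduced with a small Python sketch. The helper names (`tokenize_links`, `do_emphasis`, `unhash`, `render_spans`) and the deliberately naive regular expressions are assumptions for illustration, not the actual Markdown code, and the sketch skips HTML escaping.

```python
import hashlib
import re

LINK = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")

def do_emphasis(text):
    """Naive strong emphasis: __text__ becomes <strong>text</strong>."""
    return re.sub(r"__(.+?)__", r"<strong>\1</strong>", text)

def tokenize_links(text, table):
    """Render each inline link, then replace it with an md5 hash token."""
    def stash(m):
        # The link text gets its own span processing before being hidden.
        html = '<a href="%s">%s</a>' % (m.group(2), do_emphasis(m.group(1)))
        key = hashlib.md5(html.encode("utf-8")).hexdigest()
        table[key] = html
        return key
    return LINK.sub(stash, text)

def unhash(text, table):
    """Swap every known hash token back for the HTML it stands for."""
    return re.sub(r"[0-9a-f]{32}",
                  lambda m: table.get(m.group(0), m.group(0)), text)

def render_spans(text):
    table = {}
    text = tokenize_links(text, table)  # links first, hidden behind tokens
    text = do_emphasis(text)            # cannot see inside the tokens
    return unhash(text, table)
```

With this ordering, `render_spans("__some text [with a link__ oh!](somewhere)")` leaves both `__` markers as literal text, because the second one is hidden inside the token when emphasis is processed.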
I recognize that md5 hashes are somewhat overkill for this process.  
In fact, any alphanumeric string which isn't present in the input  
text is suitable for "tokens". You could, for instance, label them as  
"x1x", "x2x", "x3x" in their order of insertion: it'd work  
beautifully, as long as you prevent any "x<digits>x" sequence already
present in the input from being seen as a token.
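
One possible way to do that, sketched in Python: before any real token is created, stash every look-alike sequence from the input as a token of its own, so it comes back unchanged at the end. The `TokenTable` class is hypothetical. Also protecting a bare "x<digits>" is an extra assumption of this sketch: otherwise an input like "x86" immediately followed by a token would read as "x86x".

```python
import re

TOKEN = re.compile(r"x(\d+)x")
# Input sequences that are, or could become, a false token.
LOOKALIKE = re.compile(r"x\d+x?")

class TokenTable:
    """Counter-based tokens: x1x, x2x, x3x, in order of insertion."""

    def __init__(self):
        self.values = []

    def stash(self, value):
        """Remember value and return the token that stands for it."""
        self.values.append(value)
        return "x%dx" % len(self.values)

    def protect_input(self, text):
        """Turn look-alike sequences in the raw input into real tokens."""
        return LOOKALIKE.sub(lambda m: self.stash(m.group(0)), text)

    def restore(self, text):
        """Replace every token with its stored value (single pass)."""
        return TOKEN.sub(lambda m: self.values[int(m.group(1)) - 1], text)
```

Usage would be to call `protect_input` once on the raw text, `stash` for each piece of generated HTML, and `restore` once at the very end.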
This is far from having a formal grammar, but it shows that a lot  
more could be done by reusing the current approach.
Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/
    
    