Re: Formal Grammar — some thoughts
Michel Fortin
michel.fortin at michelf.com
Mon Jul 31 15:03:28 EDT 2006

On 30/7/2006, at 21:29, Allan Odgaard wrote:
> On 30/7/2006, at 22:34, Michel Fortin wrote:
>
>> [...] I'd like to point out that in my view John's implementation  
>> is already doing tokenization in some form [...]
>
> Well, this here [1] is what people generally refer to when speaking  
> of tokenizing input.
Yeah, I know it isn't exactly a tokenization process. I just wanted
to draw a parallel between the way Markdown currently works and a
regular tokenizer. I called it *some form* of tokenization, and more
often than not put the word "token" in quotes to emphasise the
precariousness of the comparison.
At the same time, I'm not sure I have a better name than "token" for
these md5 hashes, in the event that they are ever replaced by some
other, non-hashing labeling scheme.
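To illustrate the parallel, here's a minimal Python sketch of what I
mean by hash-and-restore (the names and structure are mine, not
lifted from Markdown.pl or PHP Markdown): each matched span is
rendered, filed under its md5 key, and the keys are swapped back once
every pass has run.

    import hashlib

    def consume(text, pattern, render, table):
        """Replace each match with its md5 digest (the "token");
        file the rendered output in `table` for later."""
        def repl(m):
            key = hashlib.md5(m.group(0).encode()).hexdigest()
            table[key] = render(m)
            return key
        return pattern.sub(repl, text)

    def restore(text, table):
        # a real implementation would loop until no key remains,
        # in case a rendered span itself contains a key
        for key, rendered in table.items():
            text = text.replace(key, rendered)
        return text

Because a digest contains only hex digits, later passes can't see
anything "interesting" inside a consumed span, which is exactly the
protection the current implementation relies on.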
> Now try the same on these two lines of text:
>
>     This `is raw [text`](#)
>
>     This is a [`link](#) and more text`
>
> If you choose to replace links with an md5 first, then the result  
> of converting the first line will be wrong, whereas if you choose  
> to convert raw first, the second line will be wrong.
What's wrong and right here? It could be argued that, since the
syntax description doesn't define this, whichever construct comes
first should win and no priority should be given to one construct
over another. But the fact remains that it's undefined, and that
John's reference implementation prioritizes code spans over links.
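To make the priority question concrete, here's the sketch from above
applied to Allan's first example, with deliberately crude stand-in
patterns (the real expressions are much hairier). Running the
code-span pass first reproduces what the reference implementation
does; reversing the order flips the outcome:

    import re

    CODE = re.compile(r'`[^`]*`')              # crude code-span pattern
    LINK = re.compile(r'\[[^\]]*\]\([^)]*\)')  # crude inline-link pattern

    def convert(text, passes):
        table = {}
        for pattern, name in passes:
            text = consume(text, pattern,
                           lambda m, n=name: '<%s>%s</%s>' % (n, m.group(0), n),
                           table)
        return restore(text, table)

    line = 'This `is raw [text`](#)'
    print(convert(line, [(CODE, 'code'), (LINK, 'a')]))
    # -> This <code>`is raw [text`</code>](#)
    print(convert(line, [(LINK, 'a'), (CODE, 'code')]))
    # -> This `is raw <a>[text`](#)</a>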
> This is easy to handle with a real parser, actually, even a regexp  
> can do it. There is little need for this multi-pass content  
> obfuscation paradigm currently being used ;)
A while ago I thought about combining all the span-level regular
expressions into one big expression: that would implement the
whichever-comes-first rule. But I don't see the multi-pass approach
as wrong either: it simply implements a priority relationship among
the different syntax constructs.
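For what it's worth, here's roughly what that combined expression
would look like, reusing the crude patterns above (a sketch, not a
proposal): one alternation scanned left to right, so whichever
construct *starts* first wins, and branch order only matters when two
constructs start on the same character.

    import re

    SPAN = re.compile(r'(?P<code>`[^`]*`)|(?P<a>\[[^\]]*\]\([^)]*\))')

    def convert_first_wins(text):
        def repl(m):
            name = m.lastgroup  # which alternative actually matched
            return '<%s>%s</%s>' % (name, m.group(0), name)
        return SPAN.sub(repl, text)

    print(convert_first_wins('This `is raw [text`](#)'))
    # -> This <code>`is raw [text`</code>](#)
    print(convert_first_wins('This is a [`link](#) and more text`'))
    # -> This is a <a>[`link](#)</a> and more text`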
One question though: is it really that important for these border
cases to be consistent across all implementations? No doubt it would
be a good thing, but at what price in terms of implementation
complexity?
>> [...] This is far from having a formal grammar, but it shows that  
>> a lot more could be done by reusing the current approach.
>
> Well, yes, a lot more can be done. But I think the energy would be  
> better spent trying to move toward a more formal grammar and more  
> standard parsing mechanisms. This is quite a challenge, and it  
> can’t be done without revising some parts of the syntax, OTOH the  
> problematic parts (e.g. nested block elements) are often not handled  
> consistently (or properly) by the current implementation, so I’d  
> think it would be possible to tweak this a bit.
Formal grammar or not, the specification could certainly be revised
to clarify a lot of edge cases. That said, I don't think the syntax
should be allowed to *change* just to accommodate the requirements of
a formal grammar.
Michel Fortin
michel.fortin at michelf.com
http://www.michelf.com/