Backtick Hickup
    Allan Odgaard 
    29mtuz102 at sneakemail.com
       
    Sun Aug 19 10:46:30 EDT 2007
    
    
  
On Aug 14, 2007, at 9:45 AM, Michel Fortin wrote:
> [...] Your interpretation of the syntax would require that:
>
>     (mine)   ` `````````` `
>     (your's) ``````````` `````````` ```````````
Well, showing that my interpretation of Gruber’s writings leads to a  
lot of redundant back-ticks (in a fictional case) is not really  
showing that my interpretation is wrong ;)
But based on the code for Markdown.pl it would seem that the standard  
has an additional requirement, not made explicit in the syntax  
document (namely that back-ticks must not follow the end-token).
Personally, as I have said before, I find the back-tick rules
confusing (when you want to include a back-tick in the code) and we
might be better off just defining some simpler rules.
My proposal (from the thoughts on a formal grammar) was to have  
`normal raw` and ``double-quoted raw`` where the latter would support  
escape codes (at least \`).
But there are other options. Having escape-codes in raw though could  
prove to be generally useful.
> [...]
> (There's also a check for a backslash at the start, although I just  
> realised that this needs work as it doesn't give a correct result  
> for an escaped litteral backslash like this: \\`code`.)
And this is *exactly* why I think the current parser is so flawed,  
because you can’t look at things in isolation -- *everything* is  
dependent on what precedes it, not just the previous character, but  
every single character that comes before the current one (granted, it  
seems that in practice, i.e. the standard inferred from how the  
parser actually works, things are dependent only on characters  
preceding things _in the same paragraph_ -- but it seems to me that  
this is really just a side-effect of how the parser is written, and  
not always desired. For example, embedded HTML does not lend itself
well to the “split the document into paragraphs” approach).
Anyway, if we agree that everything is dependent on everything that
precedes it, I think we can slowly start to agree that *also* having
things depend on what follows is problematic, i.e. we turn parsing
into the Chinese game of pick-up sticks. The way this is presently
(mostly) solved is by doing iterative scans, where each iteration
handles a given “token”. So rather than have the placement of the
token in the document define the outcome (i.e. the closer it is to
the start of the document, the higher its precedence), the outcome is
based on the order of the iterative scans (i.e. the first token “seen”
by the parser, which might be blind to `**` the first time it scans
the document). Take this example:
     This **is `raw** text`
Here we “naively” (i.e. as a regular parser would) see the bold
start-token first, and it is paired, but since Markdown scans for raw
text before bold text, it ends up as:
     <p>This **is <code>raw** text</code></p>
If we actually addressed this edge case in the standard, would we  
really define the above to be the expected behavior? And if so, how  
do we even document the general rule used here?
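To make the two precedence rules concrete, here is a rough Python sketch. It is not Markdown.pl's code: the function names `iterative_scans` and `first_token_wins` are made up for illustration, and both only know about single back-tick code spans and `**` bold.

```python
import re

def iterative_scans(text):
    """Pass 1 finds code spans; bold is only looked for in what is left."""
    pieces = re.split(r"(`[^`]+`)", text)
    out = []
    for i, piece in enumerate(pieces):
        if i % 2 == 1:
            # Odd indices hold the captured code spans (with back-ticks).
            out.append("<code>" + piece[1:-1] + "</code>")
        else:
            out.append(re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", piece))
    return "".join(out)

def first_token_wins(text):
    """Single left-to-right pass: the first paired start-token decides."""
    out = []
    pos = 0
    while pos < len(text):
        if text.startswith("**", pos):
            end = text.find("**", pos + 2)
            if end != -1:
                inner = first_token_wins(text[pos + 2:end])
                out.append("<strong>" + inner + "</strong>")
                pos = end + 2
                continue
        elif text[pos] == "`":
            end = text.find("`", pos + 1)
            if end != -1:
                out.append("<code>" + text[pos + 1:end] + "</code>")
                pos = end + 1
                continue
        # Unpaired token or ordinary character: copy it through.
        out.append(text[pos])
        pos += 1
    return "".join(out)

sample = "This **is `raw** text`"
print(iterative_scans(sample))   # This **is <code>raw** text</code>
print(first_token_wins(sample))  # This <strong>is `raw</strong> text`
```

The first function reproduces the output shown above; the second shows what a “first token seen” rule would give for the same input.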
The “syntax” quickly becomes the implementation, because we would
have to define it like “first the document is broken into
embedded-HTML parts and non-embedded-HTML parts, the embedded-HTML
parts are found using this heuristic: …, the non-embedded-HTML parts
are then broken into paragraphs (where a paragraph is defined using
…), for each paragraph we first scan for one or more back-ticks and
see if there is an equal number in the same paragraph, if so, that
part is made raw, and that part is no longer worked on, and for the
text to the left and right of the raw text we do …” etc.
Such a specification a) can lead to a lot of misunderstandings
(already in the above I neglected to mention that escaping ` will not
cause a code-span, although Markdown 1.0.1 does turn \`this\` into
<code>, while it seems the regexp you use does not), and b) requires
the parser to be written in a certain, rather non-standard way, so
parser tools cannot help with this.
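As an illustration of how easily such a rule goes wrong, here is a small Python sketch. The pattern is an assumption written for this example (a code-span regexp that refuses to start right after a backslash); it is not claimed to be the exact regexp of any Markdown implementation.

```python
import re

# Hypothetical code-span pattern with a "no backslash before the
# opening back-tick" check; an assumption for illustration only.
CODE_SPAN = re.compile(r"(?<!\\)(`+)(.+?)(?<!`)\1(?!`)")

def find_code(text):
    """Return the contents of the first code span, or None."""
    m = CODE_SPAN.search(text)
    return m.group(2) if m else None

print(find_code("use `code` here"))  # code
print(find_code(r"\`this\`"))        # None
print(find_code(r"\\`code`"))        # None
```

The last case is the one from the quoted message: the backslash is itself escaped, so the back-tick should open a code span, but a check that only looks at the single preceding character cannot tell the difference.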
A more formal approach would be something like semi-EBNF:
markdown: html | block-element
html: '<' ID attribute* '>' html* '</' ID '>'
     | '<' ID attribute* '/>'
block-element: heading | list | blockquote | raw | inline
heading: '#'+ inline | inline '\n' ('-'|'='){3,} '\n'
inline: (ESCAPE | bold | italic | code | link | PARA-TEXT)+
bold: '**' inline '**' | '__' inline '__'
code: s-q-code | d-q-code
s-q-code: '`' CODE+ '`'
d-q-code: '``' (CODE | ESCAPE)+ '``'
ID:        [A-Za-z][A-Za-z0-9]*
CODE:      [^`]
ESCAPE:    \.
PARA-TEXT: [^\n] | \n[^\n]
…
The above is written in Mail, and not meant to be exact, just give a  
rough idea of what I am talking about, as I am not sure that is  
entirely clear to you.
And sure, we can’t get all the way with EBNF, but maybe we can get  
95% of the way, and that would be a tremendous win.
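As a rough idea of how directly such rules map to code, here is a Python sketch of just the `code` rule. The name `parse_code` is made up, and I let ESCAPE take priority over CODE inside the double-quoted form, which the grammar above leaves open.

```python
import re

# d-q-code: '``' (CODE | ESCAPE)+ '``'   (ESCAPE tried first: an assumption)
D_Q_CODE = re.compile(r"``((?:\\.|[^`\\])+)``", re.DOTALL)
# s-q-code: '`' CODE+ '`'
S_Q_CODE = re.compile(r"`([^`]+)`")
ESCAPE = re.compile(r"\\(.)", re.DOTALL)

def parse_code(text, pos=0):
    """Try the `code` rule at pos; return (html, next_pos) or None."""
    m = D_Q_CODE.match(text, pos)
    if m:
        # Double-quoted raw: resolve escape codes such as \`.
        body = ESCAPE.sub(r"\1", m.group(1))
        return "<code>" + body + "</code>", m.end()
    m = S_Q_CODE.match(text, pos)
    if m:
        # Normal raw: everything is taken literally, backslashes included.
        return "<code>" + m.group(1) + "</code>", m.end()
    return None

print(parse_code("`a\\b`"))        # ('<code>a\\b</code>', 5)
print(parse_code("``a \\` b``"))   # ('<code>a ` b</code>', 10)
```

Whether a rule applies is decided at the current position only; nothing later in the document can change the result.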
As I noted in my initial letter (last year, about thoughts on a formal
grammar), we would (unfortunately) have to break with current behavior
for (undocumented) edge-cases, like the raw example above, since with
the above specification it is the first token seen that decides which
style to switch to -- we can still require that it be paired, e.g.:
     This is **not bold.
would not have `**` start bold. But personally, I do not favor
that direction, mainly because it easily leads to parsing problems,
but also because I am not sure it really is desired.
Take e.g. a paragraph like:
     You can set the SVN_EDITOR variable.
Now someone figures it would be good to append `(similar to
CVS_EDITOR)`. This now makes the full paragraph transform in an
undesired way: the two sentences transform fine on their own, but
when one follows the other they do not (<em> is introduced in the
resulting HTML).
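A small Python sketch of that effect, assuming a deliberately simplified emphasis rule (any two underscores in the paragraph pair up); real implementations use a more involved pattern, and `emphasize` is a made-up name.

```python
import re

# Assumed, simplified emphasis rule: any two underscores pair up.
EMPHASIS = re.compile(r"_(.+?)_")

def emphasize(text):
    """Wrap text between paired underscores in <em>."""
    return EMPHASIS.sub(r"<em>\1</em>", text)

first = "You can set the SVN_EDITOR variable."
second = "(similar to CVS_EDITOR)"

print(emphasize(first))    # unchanged: only one underscore
print(emphasize(second))   # unchanged: only one underscore
print(emphasize(first + " " + second))
# You can set the SVN<em>EDITOR variable. (similar to CVS</em>EDITOR)
```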
I know that it is stated somewhere that Markdown should be all about
the person and implementation complexity is irrelevant. The problem
is that implementation complexity has led us to the current
situation where we have parsers doing different things (and syntax
highlighting not always being accurate) and we have lots of broken
edge-cases and IMO unintuitive behavior -- so we got the
implementation complexity, but I don’t think we have something which
is “better” than if this had followed more formal rules.
> [...] That said, it's certainly the very edge of an edge case. If  
> we're to define a formal syntax, let's not start there.
As should hopefully be clear from the above, I am *not* talking about  
documenting every single edge case, I am talking about defining the  
syntax using more traditional means of defining syntaxes.
Anyway, enough dead horse beating for now. Hopefully I’ll find time  
to do a mostly complete parser based on EBNF for the current Markdown  
syntax, and then I can bring up the topic again, listing the
compromises necessary for it to work, and the
advantages/disadvantages may be more apparent then.
    
    