Backtick Hickup
Allan Odgaard
29mtuz102 at sneakemail.com
Sun Aug 19 10:46:30 EDT 2007
On Aug 14, 2007, at 9:45 AM, Michel Fortin wrote:
> [...] Your interpretation of the syntax would require that:
>
> (mine) ` `````````` `
> (your's) ``````````` `````````` ```````````
Well, showing that my interpretation of Gruber’s writings leads to a
lot of redundant back-ticks (in a fictional case) is not really
showing that my interpretation is wrong ;)
But based on the code for Markdown.pl it would seem that the standard
has an additional requirement, not made explicit in the syntax
document (namely that back-ticks must not follow the end-token).
Personally, as I have said before, the back-tick rules are confusing
(when you want to include a back-tick in the code) and we might be
better off by just defining some simpler rules.
My proposal (from the thoughts on a formal grammar) was to have
`normal raw` and ``double-quoted raw`` where the latter would support
escape codes (at least \`).
But there are other options. Having escape-codes in raw though could
prove to be generally useful.
> [...]
> (There's also a check for a backslash at the start, although I just
> realised that this needs work as it doesn't give a correct result
> for an escaped litteral backslash like this: \\`code`.)
And this is *exactly* why I think the current parser is so flawed,
because you can’t look at things in isolation -- *everything* is
dependent on what precedes it, not just the previous character, but
every single character that comes before the current one (granted, it
seems that in practice, i.e. the standard inferred from how the
parser actually works, things are dependent only on characters
preceding things _in the same paragraph_ -- but it seems to me that
this is really just a side-effect of how the parser is written, and
not always desired. For example, embedded HTML does not lend itself
well to the “split the document into paragraphs” approach).
Anyway, if we agree that everything is dependent on everything that
precedes it, I think we can slowly start to agree that *also* having
things depend on what follows, is problematic. I.e. we turn parsing
into the Chinese game of pick-up sticks -- the way this is presently
(mostly) solved is by doing iterative scans, where each iteration is
handling a given “token”, so rather than have the placement of the
token in the document define the outcome (i.e. the closer it is to
the start of the document, the higher its precedence), it is based on
the order of the iterative scans (i.e. the first token “seen” by the
parser, where it might be blind to `**` the first time it scans the
document), take this example:
This **is `raw** text`
Here we “naively” (i.e. regular parser) see the bold start-token
first, and it is paired, but since Markdown scans for raw text before
bold text, it ends up as:
<p>This **is <code>raw** text</code></p>
If we actually addressed this edge case in the standard, would we
really define the above to be the expected behavior? And if so, how
do we even document the general rule used here?
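To make the contrast concrete, here is a minimal Python sketch of the two strategies. It is not the Markdown.pl code: the two regular expressions are simplified approximations, the function names `iterative_scans` and `first_token_wins` are made up for this illustration, and neither function escapes HTML or parses nested inline markup.

```python
import re

# Simplified patterns, loosely modelled on the kind of regexps a
# scan-based Markdown parser uses (an approximation, not the original).
CODE = re.compile(r"(`+)(.+?)(?<!`)\1(?!`)", re.S)
BOLD = re.compile(r"\*\*(?=\S)(.+?)(?<=\S)\*\*", re.S)


def iterative_scans(text):
    """One pass per token type: code spans first, bold second."""
    spans = []

    def stash(match):
        # Hide the code span behind a placeholder so later passes skip it.
        spans.append("<code>" + match.group(2) + "</code>")
        return "\x00%d\x00" % (len(spans) - 1)

    text = CODE.sub(stash, text)
    text = BOLD.sub(r"<strong>\1</strong>", text)
    text = re.sub(r"\x00(\d+)\x00", lambda m: spans[int(m.group(1))], text)
    return "<p>" + text + "</p>"


def first_token_wins(text):
    """Single left-to-right pass: the first start-token seen decides."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith("**", i):
            end = text.find("**", i + 2)
            if end != -1:
                out.append("<strong>" + text[i + 2:end] + "</strong>")
                i = end + 2
                continue
        elif text[i] == "`":
            end = text.find("`", i + 1)
            if end != -1:
                out.append("<code>" + text[i + 1:end] + "</code>")
                i = end + 1
                continue
        # Unpaired token or ordinary character: copy it through unchanged.
        out.append(text[i])
        i += 1
    return "<p>" + "".join(out) + "</p>"


sample = "This **is `raw** text`"
print(iterative_scans(sample))
print(first_token_wins(sample))
```

The first function gives the result shown above, because the code pass runs before the bold pass ever sees `**`. The second pairs the bold token first and leaves both back-ticks as literal text, since the opening one is swallowed by the bold span and the closing one has no partner.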
The “syntax” quickly becomes the implementation, because we would
have to define it like “first the document is broken into embedded-HTML
parts and non-embedded-HTML parts, the embedded-HTML parts are
found using this heuristic: …, the non-embedded-HTML parts are then
broken into paragraphs (where a paragraph is defined using …), for
each paragraph we first scan for one or more back-ticks and see if
there is an equal number in the same paragraph, if so, that part is
made raw, and that part is no longer worked on, and for the text to
the left and right of the raw text we do …” etc.
Such a specification a) can lead to a lot of misunderstandings (already
in the above I neglected to mention how escaping ` will not cause a
code-span, although Markdown 1.0.1 does turn \`this\` into <code>,
but it seems the regexp you use, does not), and b) requires the
parser to be written in a certain way which is rather non-standard,
so parser tools cannot help in this.
A more formal approach would be something like semi-EBNF:
markdown: html | block-element
html: '<' ID attribute* '>' html* '</' ID '>'
| '<' ID attribute* '/>'
block-element: heading | list | blockquote | raw | inline
heading: '#'+ inline | inline '\n' ('-'|'='){3,} '\n'
inline: (ESCAPE | bold | italic | code | link | PARA-TEXT)+
bold: '**' inline '**' | '__' inline '__'
code: s-q-code | d-q-code
s-q-code: '`' CODE+ '`'
d-q-code: '``' (CODE | ESCAPE)+ '``'
ID: [A-Za-z][A-Za-z0-9]*
CODE: [^`]
ESCAPE: \.
PARA-TEXT: [^\n] | \n[^\n]
…
The above is written in Mail, and not meant to be exact, just to give a
rough idea of what I am talking about, as I am not sure that is
entirely clear to you.
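As an illustration of how directly such rules map onto code, here is a small Python sketch of just the `code` productions (`s-q-code` and `d-q-code`). It rests on several assumptions that the grammar above does not settle: `ESCAPE` is tried before `CODE` inside the double-quoted form, an escape simply yields the character after the backslash, and the helper name `parse_code` is hypothetical.

```python
import html
import re

# d-q-code: '``' (CODE | ESCAPE)+ '``'  (ESCAPE tried first: an assumption)
D_Q_CODE = re.compile(r"``((?:\\.|[^`])+)``", re.S)
# s-q-code: '`' CODE+ '`'  with CODE: [^`]
S_Q_CODE = re.compile(r"`([^`]+)`")
# ESCAPE: \.
ESCAPE = re.compile(r"\\(.)", re.S)


def parse_code(text, pos=0):
    """Try to match a code span starting at pos.

    Returns (html_fragment, new_position), or None if no code span
    starts at pos.
    """
    m = D_Q_CODE.match(text, pos)
    if m:
        # Double-quoted raw: resolve escapes, so \` becomes a literal `.
        body = ESCAPE.sub(r"\1", m.group(1))
        return "<code>" + html.escape(body, quote=False) + "</code>", m.end()
    m = S_Q_CODE.match(text, pos)
    if m:
        # Normal raw: no escape processing, a backslash stays a backslash.
        return "<code>" + html.escape(m.group(1), quote=False) + "</code>", m.end()
    return None


print(parse_code("`normal raw`"))
print(parse_code("``a \\` b``"))
print(parse_code("`a \\` b`"))
```

With these rules the single-quoted form ends at the first back-tick regardless of a preceding backslash, while the double-quoted form can carry an escaped back-tick in its body.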
And sure, we can’t get all the way with EBNF, but maybe we can get
95% of the way, and that would be a tremendous win.
As I noted in my initial letter (last year about thoughts on a formal
grammar) we would (unfortunately) have to break with current behavior
for (undocumented) edge-cases, like the raw above, since with the
above specification, it is the first token seen, that decides which
style to switch to -- we can still make requirements that it needs to
be paired, e.g.:
This is **not bold.
Would not have `**` start bold. But personally, I am not favoring
that direction, mainly though because it easily leads to problems
parsing, but also because I am not sure it really is desired.
Take e.g. a paragraph like:
You can set the SVN_EDITOR variable.
Now someone figures it would be good to append `(similar to
CVS_EDITOR)`. This now makes the full paragraph transform in an
undesired way, even though the two sentences on their own transform
fine, but when they follow each other, they do not (<em> is
introduced in the resulting HTML).
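The effect can be reproduced with a short Python sketch. The emphasis pattern below is an approximation in the style of a regexp-based Markdown parser, not a verbatim copy of any implementation, and it assumes the appended text is plain text rather than a code span; the name `emphasize` is made up for this example.

```python
import re

# Approximate emphasis rule: any * or _ followed by non-space pairs up
# with the next matching delimiter that is preceded by non-space.
EM = re.compile(r"(\*|_)(?=\S)(.+?)(?<=\S)\1", re.S)


def emphasize(paragraph):
    """Wrap each paired delimiter run in <em>...</em>."""
    return EM.sub(r"<em>\2</em>", paragraph)


first = "You can set the SVN_EDITOR variable."
second = "(similar to CVS_EDITOR)"

print(emphasize(first))                 # single underscore: unchanged
print(emphasize(second))                # single underscore: unchanged
print(emphasize(first + " " + second))  # the two underscores now pair up
```

Each sentence holds a single underscore and passes through untouched, but once they share a paragraph the underscore in SVN_EDITOR pairs with the one in CVS_EDITOR and everything between them ends up inside `<em>`.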
I know that it is stated somewhere that Markdown should be all about
the person and implementation complexity is irrelevant. The problem
is that implementation complexity has led us to the current
situation where we have parsers doing different things (and syntax
highlighting not always being accurate) and we have lots of broken
edge-cases and IMO unintuitive behavior -- so we got the implementation
complexity, but I don’t think we have something which is “better”
than had this followed more formal rules.
> [...] That said, it's certainly the very edge of an edge case. If
> we're to define a formal syntax, let's not start there.
As should hopefully be clear from the above, I am *not* talking about
documenting every single edge case, I am talking about defining the
syntax using more traditional means of defining syntaxes.
Anyway, enough dead horse beating for now. Hopefully I’ll find time
to do a mostly complete parser based on EBNF for the current Markdown
syntax, and then I can bring up the topic again, listing the
compromises necessary for it to work, and the
advantages/disadvantages may be more apparent then.