29mtuz102 at sneakemail.com
Tue Aug 28 19:32:14 EDT 2007
On Aug 27, 2007, at 10:35 PM, Michel Fortin wrote:
>> Personally, as I have said before, the back-tick rules are
>> confusing (when you want to include a back-tick in the code) and
>> we might be better off by just defining some simpler rules.
> I don't find them confusing, but perhaps it's only because I'm used
> to it. Which aspect of it do you find confusing?
Maybe ‘intuitive’ would have been a better choice of word. But this
thread started because somebody did not understand how to embed back-
ticks in back-tick quoted strings -- personally I didn’t understand
it either until I looked at the implementation.
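The rule in question can be sketched in a few lines of Python. This is a simplified re-creation of Markdown.pl's code-span behaviour, not its actual code: a span is delimited by equal-length back-tick runs, so a longer delimiter run lets you embed literal back-ticks inside the span.

```python
import re

# Simplified sketch of the code-span rule: the opening run of
# back-ticks must be matched by an identical closing run that is
# not part of a longer run, so `` ... `` can contain a single `.
CODE_SPAN = re.compile(r'(?<!`)(`+)(?!`)(.+?)(?<!`)\1(?!`)', re.S)

def code_spans(text):
    # return the contents of every code span, outer spaces stripped
    return [m.group(2).strip() for m in CODE_SPAN.finditer(text)]

print(code_spans("type `` `pwd` `` to print it"))
```

With this rule, `code_spans("type `` `pwd` `` to print it")` yields a single span whose content is the literal back-ticked `pwd`.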
> I think I prefer the current behaviour. I can't really see when
> having to escape the content of code span would be useful. Perhaps
> you had something in mind when proposing that?
Yes, when you need special characters -- you can’t use entities
inside `…` so ``…`` would allow you to do e.g. \u2620 for a Unicode
character or similar -- with everybody using UTF-8 these days (knock
on wood), escape codes for special characters are less useful than in
the past.
> I have some difficulty figuring out what you mean by "embedded
> HTML does not lend itself well to the 'split the document into
> Markdown currently distinguish block-level HTML elements from span-
> level HTML elements: The former creates blocks which are left alone
> by Markdown (and left outside paragraphs) while the later gets
> wrapped into paragraphs (as valid HTML expects them to be) along
> with Markdown-formatted text.
Yes, we are dependent on Markdown finding the HTML before it does the
paragraph splitting, so that it doesn’t insert <p> into my HTML -- yet
the present heuristic for finding HTML is easily confused (talking
about Markdown.pl); for me it actually got worse when John switched to
the Perl library thing.
In fact, I presently have my own preprocessor for my Markdown pages
(on my site, which sometimes need to embed tables and such) that takes
out the HTML before giving it to Markdown -- although this is also
because Markdown does not know about <% scripting %> or <?php tags ?>,
and since there is no grammar where I can just teach it about them,
I need to handle that myself in a pre-parse step.
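For the record, the kind of pre-parse step I mean can be sketched like this (a hypothetical illustration, not my actual preprocessor): hide the script blocks behind placeholder tokens, run Markdown on the rest, then put them back.

```python
import re

# Hide <?php ... ?> and <% ... %> blocks behind placeholder tokens
# before handing the text to Markdown, then restore them afterwards.
SCRIPT_TAG = re.compile(r'<\?php.*?\?>|<%.*?%>', re.S)

def protect(text):
    saved = []
    def stash(match):
        saved.append(match.group(0))
        # control characters are unlikely to collide with real text
        return '\x02script%d\x03' % (len(saved) - 1)
    return SCRIPT_TAG.sub(stash, text), saved

def restore(text, saved):
    return re.sub(r'\x02script(\d+)\x03',
                  lambda m: saved[int(m.group(1))], text)
```

The text between `protect` and `restore` contains no `<?php` or `<%`, so Markdown never sees them.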
>> Anyway, if we agree that everything is dependent on everything
>> that precedes it, I think we can slowly start to agree that *also*
>> having things depend on what follows, is problematic.
> Well, I think you mean problematic for writing a parser, in which
> case I disagree.
No, I mean problematic as in: what the hell should we do? You and I
disagree about how to interpret the same line of Markdown exactly
because it depends on the angle you view it from (read: which token
you think is most important), i.e. it is totally subjective…
>> The “syntax” quickly becomes the implementation [...]
> Well, look at how the WHATWG is defining HTML right now: it's
> exactly that. They describe how the parser works (in english), and
> everything that match its behaviour is conforming...
Yes, and do you know *why* they are doing that?
It is because none of the initial browsers had any scent of a real
parser; they (seriously!) did things like:

    if (strcmp(tag, "<b>") == 0)
        bold = true;
    else if (strcmp(tag, "</b>") == 0)
        bold = false;
Even though there was an official specification for how to parse HTML
(well, SGML), no browser actually did it that way; authors wrote lots
of totally broken pages, browsers interpreted them differently, and
browsers didn’t even interpret valid HTML correctly. For example, SGML
has a rule that when you close a context, all missing close tags are
implied, and I haven’t seen a single browser actually do that, even
though it is a quite nice feature, since you can leave out lots of
close tags -- but since they did not have a recursive descent parser
or similar, they had no clue what the current context was, so that is
likely why they didn’t do it; that, and the fact that they probably
never read the SGML spec.
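The implied-end-tag rule is trivial once a parser tracks context. A toy sketch (nothing like real SGML parsing, just the stack idea): closing an outer element implicitly closes every still-open element nested inside it.

```python
# Toy illustration of SGML-style implied end tags: closing an outer
# element pops (implicitly closes) everything nested inside it.
def close_context(stack, tag):
    closed = []
    while stack and stack[-1] != tag:
        closed.append(stack.pop())   # implied close tags
    if stack:
        closed.append(stack.pop())   # the explicitly closed tag
    return closed

open_elements = ['ul', 'li', 'em']   # e.g. after "<ul><li><em>"
print(close_context(open_elements, 'ul'))
```

Given `<ul><li><em>` followed by `</ul>`, the `em` and `li` are closed implicitly before the `ul`.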
So the W3C said fuck this, let’s totally scrap SGML -- it was too
complex for browser implementors to wrap their heads around
(understandably!) -- so let’s do a “simple” subset (XML, which turned
out to be not that simple in the end, once they retrofitted namespaces
and all sorts of crap into it), and XHTML is the new thing, totally
strict! But no one cared about XHTML, and no browser really supported
it, because we have billions of HTML pages out there; we can’t just
drop them.
So given this rather broken situation, the WHATWG decided to try to
figure out in which ways all the browsers were broken, document that
to get them in sync, and make that the official spec, so that we can
move on with (expanding) the HTML specification without cutting
backwards compatibility -- because browser vendors don’t want existing
pages to break, because that makes them lose users; so if the W3C adds
features to HTML which require the browser to have a strict parser to
really work, browser vendors may not implement them, for the sake of
backwards compatibility, or something like that…
You really think Markdown should take the same route? ;)
> which brings out an interesting side topic: how should HTML be
> parsed (or even specified) within Markdown? :-)
I would say strict (for which a grammar is pretty simple)! There is
no reason Markdown should conform to the looser WHATWG definition:
strict HTML is a subset of the WHATWG’s definition, and they made a
superset only to be compatible with existing bad pages, which Markdown
does not need to support.
> I think the better solution to that problem would be to disallow
> emphasis starting in the middle of a word ending within another.
> And as for underscore-emphasis problem, I'd suggest doing just as
> PHP Markdown Extra does (one of its *documented* features): only
> allow it on word boundaries, not in the middle of a word. I've yet
> to get a complaint about that change in behaviour and I know some
> people switched to Extra just because of that.
I would prefer that interpretation as well -- I have even requested it
in the past, since it is the #1 mistake I see from people who post
comments on my blog (they do not escape underscores in
snake_case_words or surround them with back-ticks). I can’t find the
thread, but most thought it was useful -- though it is not uncommon
that people argue for a behavior they never actually use.
It sounds like I should switch to Markdown Extra for my blog comments…
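The word-boundary rule being discussed can be sketched as a regex (a rough illustration of the idea, not PHP Markdown Extra’s actual implementation): underscores only open or close emphasis at word boundaries, so snake_case_words pass through untouched.

```python
import re

# Underscore emphasis only at word boundaries: the opening _ must not
# be preceded by a word character, the closing _ not followed by one.
EM = re.compile(r'(?<!\w)_(?!\s)(.+?)(?<!\s)_(?!\w)')

def emphasize(text):
    return EM.sub(r'<em>\1</em>', text)

print(emphasize('use _this_ word, keep snake_case_word'))
```

With this rule, `_this_` becomes emphasis while `snake_case_word` is left alone, which is exactly the behaviour blog commenters expect.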
>> [...] I am *not* talking about documenting every single edge case,
>> I am talking about defining the syntax using more traditional
>> means of defining syntaxes.
> But how can you write a formal grammar without having to think of
> the edge cases? Are you suggesting we should ignore edge cases when
> defining the syntax? And if yes, what does qualify as an "edge case"?
I don’t think you have worked with parser generators and grammars.
Basically, the implementation is generated from the grammar (if it is
possible to specify the grammar fully) -- the grammar can be tricky
to get right, but there are no edge cases lurking in the corners in
the same way as for a hand-written, multi-pass, regular-expression
parser.
That is, compare it to a mathematical equation: people can’t arrive at
different solutions for the same equation unless they are misreading it.
This is why a formal grammar is so powerful: it really specifies
*everything* -- the question is only whether it specifies things the
way we want. E.g. in the short example I gave, I did not support
self-closing HTML tags in paragraph text, so that is simply not
supported, no argument there -- the argument is thus about whether we
should add it to the grammar, not about how to interpret the grammar.
That said, a grammar can be ‘invalid’ so to speak -- but if specified
e.g. as an ANTLR grammar, ANTLR will tell us which rules cause which
conflicts.