text/markdown effort in IETF (invite)

Wed Jul 9 18:07:19 EDT 2014

On 9 Jul 2014, at 16:08, Sean Leonard <dev+ietf at seantek.com> wrote:

> The operating question is: What metadata (companion data) is /necessary/ to reflect the creator's intent with respect to the data?
> 
> With Markdown, I think the answer is: you need the character set, and you need to know how to turn the text into HTML (or XHTML, PDF, RTF, MS Word/Office Open XML, or whatever).

Indeed.

> Markdown has no way to communicate the character set in the document (other than the Unicode Byte Order Marks, which is a generalized property about text streams, not specific to Markdown)--and it would be counterproductive to invent one. So that is a perfect example of relevant metadata.

Fun fact: PHP Markdown is mostly encoding-agnostic. It understands UTF-8 sequences, but any byte that is not part of a valid UTF-8 sequence is treated as a character in itself. This only matters when converting tabs into spaces, however, and only if you have non-ASCII characters before the tab.
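
To make that concrete, here is a minimal Python sketch of the tab-expansion behaviour described above. It is not PHP Markdown's actual code: the name `expand_tabs` and the 4-column tab width are assumptions for illustration, and Python's `surrogateescape` error handler stands in for "treat each invalid byte as one character".

```python
TAB_WIDTH = 4  # assumed tab stop width, for illustration only


def expand_tabs(raw: bytes, tab_width: int = TAB_WIDTH) -> bytes:
    """Expand tabs to spaces, counting columns in characters, not bytes."""
    # Valid UTF-8 sequences decode to one character each; every byte that is
    # not part of a valid sequence becomes one lone surrogate character.
    text = raw.decode("utf-8", errors="surrogateescape")
    out_lines = []
    for line in text.split("\n"):
        out = []
        col = 0
        for ch in line:
            if ch == "\t":
                # Pad up to the next tab stop.
                pad = tab_width - (col % tab_width)
                out.append(" " * pad)
                col += pad
            else:
                out.append(ch)
                col += 1
        out_lines.append("".join(out))
    # Encoding with the same handler restores the original bytes untouched,
    # so the output keeps whatever encoding the input had.
    return "\n".join(out_lines).encode("utf-8", errors="surrogateescape")
```

With this sketch, `b"\xc3\xa9\tx"` (UTF-8 "é") and `b"\xe9\tx"` (Latin-1 "é") both get three spaces of padding, because each "é" counts as one column. A parser counting raw bytes would pad the UTF-8 version with only two.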

So whatever the input's encoding is becomes the output's encoding (this works for HTML). The parser itself doesn't need to be told the text encoding, but specifying it is still a good idea, because that is how you know the encoding of the resulting document.

That's not really relevant though.

> And the second one, is how to turn it into something else that the author wants. If it's not communicated, it's going to be implied. Implied means "guessing" and likely "guessing wrong".

Ideally you'd use the exact same version of the same parser the author used to interpret the document in the first place.

Or you could be loose and use another version of the same parser.
Or you could be loose and use another parser claiming to be of the same flavor.
Or you could be loose and use another parser claiming to be of a superset of the given flavor.
Or you could be loose and use another Markdown parser.

It's a spectrum. Each step down increases the likelihood of something going wrong.
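
That spectrum can be written down as an ordered scale. This is only a sketch: `ParserId`, its fields, and the `Fidelity` levels are hypothetical names, and real parsers do not advertise their flavor in any standard way.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Fidelity(IntEnum):
    """Lower is closer to what the author saw; higher is riskier."""
    SAME_VERSION = 0
    SAME_PARSER = 1
    SAME_FLAVOR = 2
    SUPERSET_FLAVOR = 3
    ANY_MARKDOWN = 4


@dataclass(frozen=True)
class ParserId:
    # All fields are hypothetical labels, not a real registry.
    name: str
    version: str
    flavor: str
    claims_superset_of: Optional[str] = None


def fidelity(author: ParserId, reader: ParserId) -> Fidelity:
    """Rank how far the reader's parser is from the author's."""
    if author.name == reader.name:
        if author.version == reader.version:
            return Fidelity.SAME_VERSION
        return Fidelity.SAME_PARSER
    if author.flavor == reader.flavor:
        return Fidelity.SAME_FLAVOR
    if reader.claims_superset_of == author.flavor:
        return Fidelity.SUPERSET_FLAVOR
    return Fidelity.ANY_MARKDOWN


original = ParserId("Markdown.pl", "1.0.1", "markdown")
other = ParserId("some-other-parser", "0.1", "markdown-extra",
                 claims_superset_of="markdown")
print(fidelity(original, other).name)  # SUPERSET_FLAVOR
```

Note that the scale only records what a parser claims; it says nothing about whether the claim holds for a given document.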

> Hopefully this makes sense. I want to be more educated about this.

This makes perfect sense, but I fear there's no good answer to your second question. Since you want to know more, here's some insight.

It's important to understand that there is no notion of invalid Markdown input. As an implementer, every time you fix what looks like a parsing bug to you, or add a feature, you're also breaking some valid input that was producing something else before. The implementer will usually only choose to break valid input that was deemed very unlikely to ever have been used before, but there's no way to know for sure (and no reliable way to measure the impact either). So if you really want to be sure things are parsed in the intended way, you should use the closest version possible of the same parser that the creator of the document was using.

Also, subtle changes can make things technically incompatible. For instance, Markdown Extra is mostly a superset of the original Markdown feature-wise, except for one small incompatible change: underscore emphasis within a word is disallowed. This was a deliberate change to fix some problems users were having with words that contained underscores. So even though most people would consider Markdown Extra a superset of Markdown, it technically isn't. Other implementers might do the same thing but consider it a bug fix instead, and tell their users that their implementation implements the original syntax.

Babelmark 2 will tell you that implementations are pretty much evenly split on this:
http://johnmacfarlane.net/babelmark2/?normalize=1&text=word_with_emphasis

You'll even see that Pandoc implements both behaviours, depending on whether you're in strict mode or not.
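
The two behaviours can be sketched with a pair of regular expressions. These are deliberately simplified rules written for illustration (`render_emphasis` and both patterns are my own names, not taken from any of the implementations above), and they ignore nesting, escapes and strong emphasis.

```python
import re

# Permissive rule: any `_..._` pair is emphasis, even in the middle of a word.
PERMISSIVE = re.compile(r"_(?=\S)(.+?)(?<=\S)_")

# Restrictive rule: the underscores must not touch a letter or digit on the
# outside, so underscores inside a word are left alone.
RESTRICTIVE = re.compile(
    r"(?<![A-Za-z0-9])_(?=\S)(.+?)(?<=\S)_(?![A-Za-z0-9])"
)


def render_emphasis(text: str, intraword: bool) -> str:
    """Turn underscore emphasis into <em> using one of the two rules."""
    rule = PERMISSIVE if intraword else RESTRICTIVE
    return rule.sub(r"<em>\1</em>", text)


print(render_emphasis("word_with_emphasis", intraword=True))
# word<em>with</em>emphasis
print(render_emphasis("word_with_emphasis", intraword=False))
# word_with_emphasis
```

Both rules agree on `an _emphasized_ word`, which is why the difference goes unnoticed until a document contains something like a file name or identifier.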

Something stranger happens with the shortcut reference syntax:
http://johnmacfarlane.net/babelmark2/?normalize=1&text=%5Blink%3F%5D%0A%0A%5Blink%3F%5D%3A+http%3A%2F%2Flink.x%2F

It's pretty much universally supported. It comes from a Markdown.pl beta that was never formally released but is in wide use. If you go to the Markdown website and use the download link, you won't get the beta, and the shortcut syntax won't work. And while the beta is feature-wise a superset of the released version, technically it could in some rare situations break documents, turning square-bracketed text into links where it shouldn't:

	Someone on [street Ivanhoe Carol][sIC] told me this:

	> This is bad [sic].

	[sIC]: http://sic.sickdomain
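
Here is a rough Python sketch of why that example breaks. It is a toy resolver, not Markdown.pl's logic: `render_links`, the `shortcut_refs` switch and the regular expressions are assumptions made for illustration. The one rule it borrows from the original syntax is that reference labels are matched case-insensitively.

```python
import re

# A reference definition on its own line: "[label]: url"
DEF_RE = re.compile(r"^\[([^\]]+)\]:[ \t]*(\S+)[ \t]*$", re.MULTILINE)
# "[text][label]", "[text][]" or, optionally, a bare "[text]"
LINK_RE = re.compile(r"\[([^\]]+)\](\[([^\]]*)\])?")

EXAMPLE = (
    "Someone on [street Ivanhoe Carol][sIC] told me this:\n"
    "\n"
    "> This is bad [sic].\n"
    "\n"
    "[sIC]: http://sic.sickdomain\n"
)


def render_links(source: str, shortcut_refs: bool) -> str:
    """Resolve reference links; bare [text] only if shortcut_refs is on."""
    # Labels are compared case-insensitively, so "sIC" and "sic" collide.
    defs = {m.group(1).lower(): m.group(2) for m in DEF_RE.finditer(source)}
    body = DEF_RE.sub("", source)

    def replace(m: re.Match) -> str:
        text = m.group(1)
        has_label_part = m.group(2) is not None
        if not has_label_part and not shortcut_refs:
            return m.group(0)  # plain bracketed text, left as-is
        label = m.group(3) or text  # empty or missing label reuses the text
        url = defs.get(label.lower())
        if url is None:
            return m.group(0)
        return f'<a href="{url}">{text}</a>'

    return LINK_RE.sub(replace, body)
```

Calling `render_links(EXAMPLE, shortcut_refs=False)` leaves `[sic]` in the quotation alone, while `shortcut_refs=True` turns it into a link to the street's URL, which is certainly not what the author meant.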

I sure wish things were simpler. But as things are now, I have a hard time identifying what "flavor" could mean. Should "Markdown.pl-1.0.1" be a flavor on its own?

-- 
Michel Fortin
michel.fortin at michelf.ca
http://michelf.ca