text/markdown effort in IETF (invite)
dev+ietf at seantek.com
Wed Jul 9 16:08:41 EDT 2014
On 7/9/2014 12:06 PM, Michel Fortin wrote:
> On 9 Jul 2014, at 11:49, Sean Leonard <dev+ietf at seantek.com> wrote:
> The "flavor" parameter is a good idea in theory. [...] Nobody is going to annotate their file with the right flavor unless there's a tangible benefit[...]
> [...] HTML never got anything like a "flavor" parameter in its MIME type, and even if it did it'd not have helped clear the mess in any way.
About this "flavors" thing. I know there are several lists floating out
there of different Markdown implementations and variants (or if you
don't like them being called Markdown, you can call them Illegitimate
Sons of Markdown™). Which list is the most complete? Can someone show me
(or make for the community) a really comprehensive list, and agree to
When I wrote the -00 draft, I tried to follow the Media Type
Registration Procedures. One requirement is to list required and
optional parameters. Parameters are defined in RFC 6838 as "companion
data". See RFC 6838 and in particular, Sections 1, 4.2.1, and 4.3.
All text/ types have at least one parameter: the charset. That is
because all text data has to be interpreted according to a code (i.e.,
character set) that converts the bits of data into useful information.
Nowadays we take Unicode (specifically UTF-8) for granted, but in reality
plenty of text is not UTF-8. You can't just open a text file and hope for
the best--you have to have /metadata/, express or implied, that tells
you how to handle the blob of bits. The very fact that it is textual
data has to be inferred from other things, such as the filename
extension (when the data is in a file). A filename is just another piece
When dealing with HTML, the charset could be determined at least six ways:
1. as express external metadata, when the Content-Type has a charset
parameter in the HTTP header.
2. as implied external metadata, when the HTTP header is absent but the
client infers it from "other things" (e.g., the server, the IP address,
or by looking at the ccTLD).
3. as express internal "metadata", with <meta charset="iso-2022-jp"> or
<meta http-equiv="Content-Type" content="text/html;
charset=iso-2022-jp">; or in the case of XHTML, <?xml version="1.0"
4. as express internal *data*, that is, the first bytes are 0xFF 0xFE
(likely UTF-16LE), 0xFE 0xFF (likely UTF-16BE), or 0xEF 0xBB 0xBF
5. as implied internal *data*, that is, "take the first 256 bytes and
try to see if it decodes to something approximating HTML soup using some
common character sets; if it fits, you quit".
6. as express user preference, that is, "I'm Japanese in Japan on a
Windows machine, therefore on my browser, just assume everything is
See...there are all these crazy options...because nobody standardized on
the character set when HTTP/HTML was developed; people assumed it was
US-ASCII and then shoehorned lots of zany ways to make it something else.
At least with Markdown, we can probably safely eliminate #3 since
Markdown is not intended to generate the <head> part of (X)HTML.
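To make the remaining options concrete, here is a minimal Python sketch of how a Markdown consumer might fall through them. It is only an illustration: the function names (sniff_bom, pick_charset), the candidate list, and above all the order of precedence are my assumptions, not taken from any spec or existing implementation. Option 2 (inferring from the server or ccTLD) is left out because it depends on context outside the bytes.

```python
import codecs

# Byte order marks from option 4. UTF-32 is ignored here to keep the sketch
# short; note that FF FE is also the start of the UTF-32LE mark, which is why
# a match is only "likely" UTF-16LE.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return a charset name if the data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def pick_charset(data, declared=None, user_default=None, candidates=("utf-8",)):
    """Return (charset, reason). The precedence below is an assumption."""
    # Option 1: express external metadata (e.g. a charset parameter).
    if declared:
        return declared, "declared"
    # Option 4: express internal data (a byte order mark).
    bom = sniff_bom(data)
    if bom:
        return bom, "bom"
    # Option 5: implied internal data. Try to decode the first 256 bytes.
    # An incremental decoder with final=False tolerates a multi-byte
    # sequence that happens to be cut off at the 256-byte boundary.
    head = data[:256]
    for name in candidates:
        try:
            codecs.getincrementaldecoder(name)().decode(head, final=False)
        except UnicodeDecodeError:
            continue
        return name, "guess"
    # Option 6: express user preference, as the last resort.
    return user_default, "user-default"
```

Everything below the first branch is guessing to some degree, which is the argument for carrying the charset as express metadata in the first place.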
The operative question is: What metadata (companion data) is /necessary/
to reflect the creator's intent with respect to the data?
With Markdown, I think the answer is: you need the character set, and
you need to know how to turn the text into HTML (or XHTML, PDF, RTF, MS
Word/Office Open XML, or whatever).
Markdown has no way to communicate the character set in the document
(other than the Unicode Byte Order Mark, which is a general property of
text streams, not specific to Markdown)--and it would be
counterproductive to invent one. So the charset is a perfect example of
relevant metadata. The second piece is how to turn the text into the
something else that the author wants. If it's not communicated, it's going
to be implied. Implied means "guessing" and likely "guessing wrong".
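As a rough illustration of what "communicated" could look like to a consumer, here is a small Python sketch that reads both pieces of companion data from a Content-Type value using the standard library's email.message.Message. The helper name read_markdown_params is made up for this example, and the "flavor" parameter and its value "multimarkdown" are hypothetical: the parameter is only under discussion here, not registered.

```python
from email.message import Message


def read_markdown_params(content_type):
    """Pull the media type, charset, and (hypothetical) flavor parameter
    out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return {
        "type": msg.get_content_type(),
        # Lowercased charset name, or None when the parameter is absent.
        "charset": msg.get_content_charset(),
        # "flavor" is an assumption for illustration; None when absent.
        "flavor": msg.get_param("flavor"),
    }


express = read_markdown_params("text/markdown; charset=UTF-8; flavor=multimarkdown")
implied = read_markdown_params("text/markdown")
```

In the second call both values come back as None, and that is the point where the consumer has to start guessing.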
Hopefully this makes sense. I want to be more educated about this. Thanks!