text/markdown effort in IETF (invite)

Wed Jul 9 16:08:41 EDT 2014

On 7/9/2014 12:06 PM, Michel Fortin wrote:
> On 9 Jul 2014, at 11:49, Sean Leonard <dev+ietf at seantek.com> wrote:
> The "flavor" parameter is a good idea in theory. [...] Nobody is going to annotate their file with the right flavor unless there's a tangible benefit[...]
>
> [...] HTML never got anything like a "flavor" parameter in its MIME type, and even if it did it'd not have helped clear the mess in any way.

About this "flavors" thing. I know there are several lists floating out 
there of different Markdown implementations and variants (or if you 
don't like them being called Markdown, you can call them Illegitimate 
Sons of Markdown™). Which list is the most complete? Can someone show me 
(or make for the community) a really comprehensive list, and agree to 
update it?

When I wrote the -00 draft, I tried to follow the Media Type 
Registration Procedures. One requirement is to list required and 
optional parameters. Parameters are defined in RFC 6838 as "companion 
data". See RFC 6838 and in particular, Sections 1, 4.2.1, and 4.3.

All text/ types have at least one parameter: the charset. That is 
because all text data has to be interpreted according to a code (i.e., 
character set) that converts the bits of data into useful information. 
Nowadays we take Unicode (specifically UTF-8) for granted, but it's just 
not the case in reality. You can't just open a text file and hope for 
the best--you have to have /metadata/, express or implied, that tells 
you how to handle the blob of bits. The very fact that it is textual 
data has to be inferred from other things, such as the filename 
extension (when the data is in a file). A filename is just another piece 
of metadata.

When dealing with HTML, the charset could be determined at least six ways:
1. as express external metadata, when the Content-Type has a charset 
parameter in the HTTP header.
2. as implied external metadata, when the HTTP header is absent but the 
client infers it from "other things" (e.g., the server, the IP address, 
or by looking at the ccTLD).
3. as express internal "metadata", with <meta charset="iso-2022-jp"> or 
<meta http-equiv="Content-Type" content="text/html; 
charset=iso-2022-jp">; or in the case of XHTML, <?xml version="1.0" 
encoding="iso-2022-jp"?>.
4. as express internal *data*, that is, the first bytes are 0xFF 0xFE 
(likely UTF-16LE), 0xFE 0xFF (likely UTF-16BE), or 0xEF 0xBB 0xBF 
(likely UTF-8).
5. as implied internal *data*, that is, "take the first 256 bytes and 
try to see if it decodes to something approximating HTML soup using some 
common character sets; if it fits, you quit".
6. as express user preference, that is, "I'm Japanese in Japan on a 
Windows machine, therefore on my browser, just assume everything is 
Shift-JIS".
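
To make options 1, 4, and 6 concrete, here is a minimal Python sketch. 
The function names (sniff_bom, guess_charset) and the precedence order 
(header first, then BOM, then a user default) are my own assumptions for 
illustration, not something any spec or the draft mandates. Real clients 
order these differently.

```python
import codecs

# Byte order marks from option 4. A match only makes the encoding
# "likely": the UTF-16LE mark, for instance, is also the start of the
# UTF-32LE mark.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # 0xEF 0xBB 0xBF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # 0xFF 0xFE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # 0xFE 0xFF
]


def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def guess_charset(data: bytes, header_charset=None, user_default="us-ascii"):
    """Pick a charset: express header (1), else BOM (4), else user default (6).

    The precedence here is an assumption made for this sketch.
    """
    if header_charset:
        return header_charset.lower()
    return sniff_bom(data) or user_default
```

For example, guess_charset(b"\xef\xbb\xbf# Title") falls through to the 
BOM check, while guess_charset(b"# Title", user_default="shift_jis") 
ends up at the user preference, which is exactly the "guessing" case.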

See...there are all these crazy options...because nobody standardized on 
the character set when HTTP/HTML was developed; people assumed it was 
US-ASCII and then shoehorned lots of zany ways to make it something else.

At least with Markdown, we can probably safely eliminate #3 since 
Markdown is not intended to generate the <head> part of (X)HTML.

The operating question is: What metadata (companion data) is /necessary/ 
to reflect the creator's intent with respect to the data?

With Markdown, I think the answer is: you need the character set, and 
you need to know how to turn the text into HTML (or XHTML, PDF, RTF, MS 
Word/Office Open XML, or whatever).

Markdown has no way to communicate the character set in the document 
(other than the Unicode byte order mark, which is a general property 
of text streams, not specific to Markdown)--and it would be 
counterproductive to invent one. So the character set is a perfect 
example of relevant metadata. The second piece is how to turn the text 
into whatever output the author wants. If it's not communicated, it's 
going to be implied. Implied means "guessing" and likely "guessing wrong".
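
As a rough sketch of what "express" metadata would look like to a 
consumer, the Python below pulls the parameters out of a Content-Type 
value using the standard library's email.message module. The helper name 
media_type_params and the flavor value "example-flavor" are hypothetical. 
The sketch only shows that both pieces of companion data can ride along 
as parameters.

```python
from email.message import Message


def media_type_params(content_type: str) -> dict:
    """Return the parameters of a Content-Type value as a dict.

    Parameter names are lowercased; values are returned unquoted.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_params() yields [(media_type, ''), (name, value), ...];
    # skip the first entry, which is the media type itself.
    params = msg.get_params(failobj=[])
    return {name.lower(): value for name, value in params[1:]}


# "example-flavor" is a made-up value used only for illustration.
params = media_type_params(
    'text/markdown; charset=UTF-8; flavor="example-flavor"'
)
charset = params.get("charset")  # express metadata, no guessing needed
flavor = params.get("flavor")    # None would mean the consumer must guess
```

If either lookup comes back as None, the consumer is back to the implied 
cases.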

Hopefully this makes sense. I want to be more educated about this. Thanks!

Sean