Markdown validity Re: Agreeing on "Historical Markdown"
Sean Leonard
dev+ietf at seantek.com
Sat Jul 12 10:32:06 EDT 2014
As I'm thinking about this, I have other questions:
Can a Markdown parser/processor fail? Is there a concept of Markdown
validity--i.e., can Markdown content be invalid (from the perspective of
Markdown, not (X)HTML)?
As I understand it:
A Markdown processor identifies Markdown control sequences (aka
markdown, in lowercase) in a stream of text and converts these sequences
to the target markup--namely (X)HTML.
A Markdown processor identifies (X)HTML in markdown and passes this
content to the target markup.
<-- Do Markdown processors (i.e., existing implementations) attempt to
fix or normalize the markup (by deserializing and then reserializing the
markup), or is it a straight pass? It sounds like whether or not a
Markdown processor reserializes the markup is implementation-dependent;
Gruber's syntax rules do not say. However, if you have Markdown in the
HTML content with markdown="1" as with PHP Markdown Extra, it is
necessary to parse the HTML with something other than a straight HTML
parser since the straight HTML parser will misinterpret the Markdown
(e.g., a bare & will be a validation error).
Therefore:
Markdown has no concept of markdown validity. A Markdown processor never
fails due to invalid markdown input. If a sequence of text is not
recognized as markdown (i.e., control sequences), it is treated as text
and passed accordingly to the target markup. (This property is directly
related to the "degradation" feature of Markdown, namely, if your
processor cannot understand the markdown, the output is "worse" than an
author intended, but does not cause utter failure--the non-understood
markdown is visible in the output. This is in contrast to HTML, where
tags or attributes that are not understood have no effect on the
presentation of the HTML.)
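
A minimal sketch of that "never fails" property, in Python. This is only an illustration, not any existing implementation: render_inline is a hypothetical name, and the toy processor knows a single control sequence (*emphasis*). Anything it does not recognize is escaped and passed through as visible text, so no input string can make it raise.

```python
import html
import re


def render_inline(text: str) -> str:
    """Toy inline processor: recognizes only *emphasis*.

    Everything else, including markdown this processor does not
    understand, is escaped and emitted as literal text.
    """
    # Escape first so stray &, < and > become harmless text.
    escaped = html.escape(text, quote=False)
    # Then convert the one control sequence this processor knows about.
    return re.sub(r"\*([^*\n]+)\*", r"<em>\1</em>", escaped)


print(render_inline("*hi* and ~~gone~~"))  # unknown ~~...~~ stays visible
print(render_inline("a *b"))               # unmatched * is just text
```

The degradation is visible in the first call: a processor without strikethrough support leaves the ~~ markers in the output instead of failing.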
Markdown may have a concept of HTML validity. A Markdown processor that
identifies HTML in Markdown content may determine that the HTML is valid
or invalid. For example, it may identify <div> ... [end of document] as
HTML that is invalid because it lacks a closing </div> tag. Then, it has
five choices:
1. treat the invalid HTML as text--pass the text-as-text to the markup
(i.e., turn & into &amp; , < into &lt; , etc.)
2. treat the invalid HTML as Markdown--keep on processing the input and
look for markdown inside of it (thus *hello* inside the invalid HTML
will get marked up...and <div><a
href="http://www.example.com/">hello</a>[end of document] will become a
real link with the literal text '<div>' preceding it)
<-- this is the same behavior as "not identifying the text as HTML in
the first place"
3. pass the invalid HTML as HTML
4. attempt to fix the HTML...thus <div><a
href="http://www.example.com/">hello</a>[end of document] might become
<div><a href="http://www.example.com/">hello</a></div>
5. fail due to HTML invalidity
?
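
To make the five choices concrete, here is a rough Python sketch. Policy and handle_unclosed are hypothetical names of my own, the "markdown" pass handles only *emphasis*, and the function assumes the fragment begins with the unclosed opening tag. It is not how any particular processor behaves.

```python
import html
import re
from enum import Enum


class Policy(Enum):
    AS_TEXT = 1      # choice 1: escape everything, emit as text
    AS_MARKDOWN = 2  # choice 2: keep looking for markdown inside
    AS_HTML = 3      # choice 3: pass through untouched
    REPAIR = 4       # choice 4: append the missing closing tag
    FAIL = 5         # choice 5: reject the input


def emphasize(text: str) -> str:
    """Toy markdown pass: only *word* -> <em>word</em>."""
    return re.sub(r"\*([^*\n]+)\*", r"<em>\1</em>", text)


def handle_unclosed(fragment: str, tag: str, policy: Policy) -> str:
    """Handle a fragment that opens <tag> at its start and never closes it."""
    opener = f"<{tag}>"
    rest = fragment[len(opener):]
    if policy is Policy.AS_TEXT:
        return html.escape(fragment, quote=False)
    if policy is Policy.AS_MARKDOWN:
        # The unclosed opener becomes literal text; the rest is processed.
        return html.escape(opener, quote=False) + emphasize(rest)
    if policy is Policy.AS_HTML:
        return fragment
    if policy is Policy.REPAIR:
        return fragment + f"</{tag}>"
    raise ValueError(f"unclosed <{tag}> element")


frag = '<div><a href="http://www.example.com/">hello</a>'
for policy in (Policy.AS_TEXT, Policy.AS_MARKDOWN, Policy.AS_HTML, Policy.REPAIR):
    print(policy.name, "->", handle_unclosed(frag, "div", policy))
```

With Policy.FAIL the same call raises ValueError, which is the only one of the five outcomes that would give Markdown a real notion of invalid input.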
Sean