Markdown validity Re: Agreeing on "Historical Markdown"

Sat Jul 12 10:32:06 EDT 2014

As I'm thinking about this, I have other questions:

Can a Markdown parser/processor fail? Is there a concept of Markdown 
validity--i.e., can Markdown content be invalid (from the perspective of 
Markdown, not (X)HTML)?

As I understand it:
A Markdown processor identifies Markdown control sequences (aka 
markdown, in lowercase) in a stream of text and converts these sequences 
to the target markup--namely (X)HTML.
A Markdown processor identifies (X)HTML in markdown and passes this 
content to the target markup.
  <-- Do Markdown processors (i.e., existing implementations) attempt to 
fix or normalize the markup (by deserializing and then reserializing 
it), or do they pass it through unchanged? Whether a Markdown processor 
reserializes the markup appears to be implementation-dependent; 
Gruber's syntax rules do not say. However, if Markdown appears inside 
the HTML content, as with markdown="1" in PHP Markdown Extra, the HTML 
must be parsed with something other than a straight HTML parser, since 
a straight HTML parser will misinterpret the Markdown (e.g., a bare & 
will be a validation error).

Therefore:
Markdown has no concept of markdown validity. A Markdown processor never 
fails due to invalid markdown input. If a sequence of text is not 
recognized as markdown (i.e., control sequences), it is treated as text 
and passed accordingly to the target markup. (This property is directly 
related to the "degradation" feature of Markdown: if your processor 
cannot understand the markdown, the output is "worse" than the author 
intended, but the processor does not fail outright--the non-understood 
markdown is visible in the output. This is in contrast to HTML, where 
tags or attributes that are not understood have no effect on the 
presentation of the HTML.)
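
A minimal sketch of this property, assuming a hypothetical toy processor 
(toy_markdown is not any real implementation) that recognizes only 
*emphasis* and, for simplicity, does no HTML pass-through at all. 
Anything it does not recognize is emitted as escaped text; there is no 
code path that rejects the input:

```python
import html
import re

# Hypothetical toy grammar: the only control sequence is *emphasis*.
EMPHASIS = re.compile(r"\*([^*\n]+)\*")


def toy_markdown(text: str) -> str:
    """Convert *emphasis* to <em>; pass everything else through as text."""
    out = []
    pos = 0
    for match in EMPHASIS.finditer(text):
        # Text before the control sequence is escaped and emitted as-is.
        out.append(html.escape(text[pos:match.start()], quote=False))
        out.append("<em>" + html.escape(match.group(1), quote=False) + "</em>")
        pos = match.end()
    # Whatever is left over was not recognized, so it is plain text.
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)


if __name__ == "__main__":
    # "~~world~~" is not understood by this toy, so it stays visible.
    print(toy_markdown("*hello* and ~~world~~"))
    # An unterminated control sequence is simply text, not an error.
    print(toy_markdown("unclosed *emphasis"))
```

The second call shows the degradation case: the stray asterisk reaches 
the output unchanged instead of causing a failure.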

Markdown may have a concept of HTML validity. A Markdown processor that 
identifies HTML in Markdown content may determine that the HTML is valid 
or invalid. For example, it may identify <div> ... [end of document] as 
HTML that is invalid because it lacks a closing </div> tag. Then, it has 
five choices:
1. treat the invalid HTML as text--pass the text-as-text to the markup 
(i.e., turn & into &amp; , < into &lt; , etc.)
2. treat the invalid HTML as Markdown--keep on processing the input and 
look for markdown inside of it (thus *hello* inside the invalid HTML 
will get marked up...and <div><a 
href="http://www.example.com/">hello</a>[end of document] will become a 
real link with the literal text '<div>' preceding it)
   <-- this is the same behavior as "not identifying the text as HTML in 
the first place"
3. pass the invalid HTML as HTML
4. attempt to fix the HTML...thus <div><a 
href="http://www.example.com/">hello</a>[end of document] might become 
<div><a href="http://www.example.com/">hello</a></div>
5. fail due to HTML invalidity

?
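
To make the five choices concrete, here is a rough sketch using 
Python's standard html.parser module. The names (OpenTagTracker, 
unclosed_tags, handle_html, the policy strings) are all made up for 
illustration, "invalid" is narrowed to "has an unclosed tag", and the 
markdown argument stands in for whatever processor would re-scan the 
text under choice 2. No existing implementation is claimed to work this 
way:

```python
import html
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, for the sketch).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class OpenTagTracker(HTMLParser):
    """Record tags that are opened but never closed."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened tag with this name, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break


def unclosed_tags(fragment):
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    return tracker.open_tags


def handle_html(fragment, policy, markdown=lambda s: s):
    """Apply one of the five choices to a fragment with unclosed tags."""
    missing = unclosed_tags(fragment)
    if not missing:
        return fragment                      # nothing wrong: pass through
    if policy == "text":                     # choice 1
        return html.escape(fragment, quote=False)
    if policy == "markdown":                 # choice 2
        return markdown(fragment)
    if policy == "pass":                     # choice 3
        return fragment
    if policy == "fix":                      # choice 4
        return fragment + "".join(f"</{t}>" for t in reversed(missing))
    if policy == "fail":                     # choice 5
        raise ValueError(f"unclosed tags: {missing}")
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    src = '<div><a href="http://www.example.com/">hello</a>'
    print(handle_html(src, "fix"))
```

With the example input from choice 4, the "fix" policy appends the 
missing </div>; the other policies differ only in what they return or 
raise for the same input.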

Sean