Inline HTML legalities

Andy Bennett andyjpb at ashurst.eu.org
Wed Nov 30 09:27:24 EST 2011


Hi,

I'm writing a Markdown Parser in Scheme by porting bits of Markdown.pl.

As you're probably aware, the Perl version massages the file into the
final output with a number of regexes. In my version I'm trying to use
the regexes to detect the starts and ends of the features and then take
specific action to emit the final representation. I'm doing it this way
because I want to emit an SXML tree data structure rather than an opaque
string.

I read the source file into a list of lines and can run regexes on
individual lines.

I can successfully extract link references.

I'm currently trying to extract inline HTML, basing my code on regexes
from _HashHTMLBlocks.

I can detect the start of an inline block:
"^<(block-tags-a\b.*

...but I'm having trouble detecting the end of the blocks in the same
way as the Perl version.
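
To make the start detection concrete, here is a minimal sketch in Python (rather than Scheme, purely for brevity). The tag list is my recollection of block_tags_a from Markdown.pl 1.0.1 and should be treated as an assumption; the name find_block_starts is hypothetical.

```python
import re

# Assumed tag list: my recollection of block_tags_a in Markdown.pl 1.0.1.
BLOCK_TAGS_A = (
    r"p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|"
    r"form|fieldset|iframe|math|ins|del"
)

# An opening block tag at the very start of a line, followed by a word break.
BLOCK_START = re.compile(r"^<(" + BLOCK_TAGS_A + r")\b")


def find_block_starts(lines):
    """Return (line_index, tag_name) for every line that opens a block."""
    starts = []
    for index, line in enumerate(lines):
        match = BLOCK_START.match(line)
        if match:
            starts.append((index, match.group(1)))
    return starts


lines = "Some text\n\n<div>\ncontent\n</div>\n".splitlines()
print(find_block_starts(lines))  # [(2, 'div')]
```

An indented opening tag is deliberately not detected, which follows the wording of the syntax document.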


There seems to be a discrepancy between the "Markdown: Syntax" document
and the implementation in _HashHTMLBlocks.

The syntax document says "...and the start and end tags of the block
should not be indented with tabs or spaces."
Whilst this is true for the first regex in _HashHTMLBlocks, the second
regex (block_tags_b) will sweep up malformed entries like so:


-----
<div>
what happens when we have a <div>nested block</div> and
then the <div>nested block</div>
ends at the end of a line,
but no proper end tag?
-----
becomes
-----
<div>
what happens when we have a <div>nested block</div> and
then the <div>nested block</div>

<p>ends at the end of a line,
but no proper end tag?</p>
-----

i.e. the trailing </div> is detected as the end of the block.

If that block is not at the end of the file and is followed later by a
properly formed block with the same tag, then everything between the
first opening tag and the first properly formed closing tag will get
entirely consumed... which is correct per the Syntax document.
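
The behaviour above can be reproduced with Python approximations of the two regexes. This is a sketch only: the patterns are my reading of _HashHTMLBlocks, the tag list is abbreviated, and the names STRICT and LOOSE are hypothetical.

```python
import re

# Abbreviated, assumed tag list; div appears in both Perl lists as I read them.
TAGS = r"p|div|h[1-6]|blockquote|pre|table"

# First regex: the closing tag must sit at the start of a line.
STRICT = re.compile(
    r"^<(" + TAGS + r")\b(?:.*\n)*?</\1>[ \t]*(?=\n+|\Z)", re.M)
# Second regex: the closing tag may follow other text on its line.
LOOSE = re.compile(
    r"^<(" + TAGS + r")\b(?:.*\n)*?.*</\1>[ \t]*(?=\n+|\Z)", re.M)

text = (
    "<div>\n"
    "what happens when we have a <div>nested block</div> and\n"
    "then the <div>nested block</div>\n"
    "ends at the end of a line,\n"
    "but no proper end tag?\n"
)

strict_match = STRICT.search(text)
loose_match = LOOSE.search(text)

print(strict_match)           # None: no line starts with </div>
print(loose_match.group(0))   # the first three lines of the input
print(text[loose_match.end():].strip())  # the text that becomes a paragraph
```

The closing tag on the second line is skipped because it is followed by " and" rather than a newline, so the first </div> that ends a line is the one taken as the end of the block.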


Furthermore, the syntax document does not require the user to indent the
block contents, although its example implies it. Without indentation,
the outer block is closed by the inner end tag:

-----
<div>
<div>
Test nested HTML without indents
</div>
</div>
-----
becomes
-----
<div>
<div>
Test nested HTML without indents
</div>

<p></div></p>
-----
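
A minimal sketch of why this happens, assuming the first regex matches lazily up to the first closing tag that starts a line. The pattern is my approximation with an abbreviated tag list, not a copy of the Perl source.

```python
import re

# Approximation of the first regex in _HashHTMLBlocks (abbreviated tag list).
STRICT = re.compile(
    r"^<(p|div|pre|table)\b(?:.*\n)*?</\1>[ \t]*(?=\n+|\Z)", re.M)

text = (
    "<div>\n"
    "<div>\n"
    "Test nested HTML without indents\n"
    "</div>\n"
    "</div>\n"
)

match = STRICT.search(text)
block = match.group(0)
leftover = text[match.end():].strip()

print(block)     # ends at the inner </div>
print(leftover)  # </div>
```

The lazy match has no notion of nesting depth, so the inner </div> ends the outer block and the outer </div> is left behind to be treated as paragraph text.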


Finally, capitalised tag names appear to get wrapped in <p>s:
-----
<div>
<div>
tags for inner block must be indented.
</div>
</div>

<DIV>
<DIV>
TAGS FOR INNER BLOCK MUST BE INDENTED.
</DIV>
</DIV>
-----
becomes
-----
<div>
<div>
tags for inner block must be indented.
</div>
</div>

<p><DIV>
<DIV>
TAGS FOR INNER BLOCK MUST BE INDENTED.
</DIV>
</DIV></p>
-----
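
My reading is that the Perl substitutions are case-sensitive (I do not see an /i modifier), which would explain this. A small Python sketch of the difference, using an approximated pattern restricted to div:

```python
import re

# Approximated block pattern, restricted to div for brevity.
PATTERN = r"^<(div)\b(?:.*\n)*?</\1>[ \t]*(?=\n+|\Z)"

case_sensitive = re.compile(PATTERN, re.M)
case_insensitive = re.compile(PATTERN, re.M | re.I)

upper = "<DIV>\nTAGS\n</DIV>\n"

print(case_sensitive.search(upper))             # None
print(case_insensitive.search(upper).group(0))  # the whole <DIV> block
```

Whether a port should match upper-case tags is a design choice: doing so departs from the Perl output shown above.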


What is the correct way to parse these examples? Should I aim to produce
the same output as the Perl implementation in all cases?

I'm not entirely sure what purpose the second regex in _HashHTMLBlocks
serves (block_tags_b): I can't find any reference to that type of syntax
in the syntax document. Why is the tag list different from block_tags_a?
It strikes me that perhaps the block_tags_b regex shouldn't match over
multiple lines.

In my line based parser, to match the same way as the Perl parser I'd
have to backtrack when I didn't find a valid end tag before the end of
the document and then sweep up with the same logic as the block_tags_b
regex.
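
Roughly, the backtracking I have in mind looks like the sketch below (Python for brevity; find_block_end is a hypothetical name). It is only an approximation: as far as I can tell, Markdown.pl runs the first regex over the whole document and then the second, rather than falling back block by block, so corner cases may differ.

```python
import re


def find_block_end(lines, start, tag):
    """Return the index of the line that closes the block opened at `start`.

    First scan for a strict end: a line consisting of the closing tag.
    If none is found before the end of the document, go back to `start`
    and accept the first line that merely ends with the closing tag.
    Return None when neither scan finds an end.
    """
    close = re.escape("</" + tag + ">")
    strict = re.compile(r"^" + close + r"[ \t]*$")
    loose = re.compile(r".*" + close + r"[ \t]*$")
    for pattern in (strict, loose):
        for index in range(start, len(lines)):
            if pattern.match(lines[index]):
                return index
    return None


malformed = [
    "<div>",
    "what happens when we have a <div>nested block</div> and",
    "then the <div>nested block</div>",
    "ends at the end of a line,",
    "but no proper end tag?",
]
print(find_block_end(malformed, 0, "div"))  # 2
```

The second scan is the costly part: it only runs after the first scan has reached the end of the document without success.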


I've attached the test cases that I've thought of so far.




I felt inclined to build up the SXML tree by parsing the original
document, rather than transforming the original into XHTML and then
parsing that into SXML at the end, because if I can detect the features
myself then I don't need to handle escaping and encoding in the parser.
SXML data structures are escaped and encoded when they are finally rendered.
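
As an illustration of that last point, here is a small Python stand-in for an SXML tree, with tuples in place of Scheme lists and attributes omitted. The render function is hypothetical; the point is only that escaping happens in one place, at serialisation time.

```python
from html import escape


def render(node):
    """Serialise an SXML-like node; escaping happens only here."""
    if isinstance(node, str):
        return escape(node, quote=False)
    tag, *children = node
    inner = "".join(render(child) for child in children)
    return "<{0}>{1}</{0}>".format(tag, inner)


# The parser stores raw text; it never needs to escape anything itself.
tree = ("p", "1 < 2 & ", ("em", "so on"))
print(render(tree))  # <p>1 &lt; 2 &amp; <em>so on</em></p>
```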



Many thanks for any guidance you can offer.




Regards,
@ndy

--
andyjpb at ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: html.md
Url: <http://six.pairlist.net/pipermail/markdown-discuss/attachments/20111130/06ddd578/attachment.ksh>

