evolving the spec (was: forking Markdown.pl?)
Allan Odgaard
29mtuz102 at sneakemail.com
Tue Mar 4 00:49:24 EST 2008
On 3 Mar 2008, at 13:30, Michel Fortin wrote:
> [...]
>> 1. A regexp that makes the parser enter the context the rule
>> represents (e.g. block quote, list, raw, etc.).
>>
>> 2. A list of which rules are allowed in the context of this rule.
>>
>> 3. A regexp for leaving the context of this rule.
>>
>> 4. A regexp which is pushed onto a stack when entering the context of
>> this rule, and popped again when leaving this rule.
>>
>> The fourth item here is really the interesting part, because it is
>> what made Markdown nesting work (99% of the time) despite this being
>> 100% rule-driven.
>
> I'm not sure what the regular expression in 4 does, besides being
> pushed and popped from the stack.
Yeah, I accidentally sent the letter w/o noticing I forgot to explain
the fourth rule.
The regexps which end up on this stack are used to preprocess the
current line, so for example the rule for code blocks is:
RAW[1] = /\g {4}/           # Four spaces starts raw.
RAW[2] = [ RAW_TEXT ]       # No other rules are active inside raw;
                            # RAW_TEXT is a dummy .+ rule.
RAW[4] = /\g( {4}| {,3}$)/  # While in the raw context, we need to eat
                            # the first four spaces of each line, or
                            # the line must be empty.
Two things to notice here:
1. I don’t use an explicit ‘end’ rule since we automatically leave
the context if RAW[4] doesn’t successfully match.
2. I use \g instead of ^ since we need to anchor to where the last
block-rule stopped matching, not necessarily BOL.
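A small Python sketch of these two points. Python's standard `re` module has no `\g`-style anchor, so the sketch assumes it can be emulated with `Pattern.match(line, pos)`, which only matches starting exactly at `pos`. The helper name `strip_raw_prefix` is made up for illustration.

```python
import re

RAW_CONT = re.compile(r"( {4}| {,3}$)")   # RAW[4] without the \g

def strip_raw_prefix(line, pos=0):
    """Return the position after RAW[4]'s match, or None meaning 'leave raw'."""
    m = RAW_CONT.match(line, pos)          # anchored at pos, standing in for \g
    return m.end() if m else None

print(strip_raw_prefix("    code"))        # 4    (four spaces eaten)
print(strip_raw_prefix(""))                # 0    (empty line keeps us in raw)
print(strip_raw_prefix("text"))            # None (no match: leave the raw context)
print(strip_raw_prefix(">     code", 2))   # 6    (anchored after an eaten '> ')

# ^ is tied to the real start of the line, so it cannot play the role of \g:
print(re.compile(r"^ {4}").match(">     code", 2))   # None
```

The last line shows why `^` is not enough: once an outer rule has consumed part of the line, the next rule has to anchor at the current position rather than at the beginning of the line.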
Now take the rule for block quote:
BQ[1] = /\g {,3}> {,3}/      # We start it for lines with >, allowing
                             # up to 3 spaces before/after.
BQ[2] = [ BQ, RAW, PAR, … ]  # Basically all block elements
                             # can go inside block quote.
BQ[3] = /\g( *$|«hr»)/       # We leave block quote at empty lines or
                             # horizontal rulers¹. The actual pattern
                             # for «hr» is something like:
                             # [ ]{,3}(?<M>[-*_])([ ]{,2}\k<M>){2,}[ \t]*+$
BQ[4] = /\g( {,3}> ?)?/      # While in BQ, eat leading quote characters.
¹ I am actually not sure if this is “the spec” or just a bug. But
placing a horizontal ruler just below a block quoted paragraph does
not give the expected “lazy mode”, which would place the <hr> inside
the block quote; instead it leaves the block quote.
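To see what the «hr» pattern accepts, here is a rough translation into Python's `re` syntax. It is a sketch under two assumptions: the named group and back-reference are spelled `(?P<M>…)` and `(?P=M)`, and the possessive `*+` is written as a plain `*`, which accepts the same lines here because only the end of the line may follow.

```python
import re

# Python spelling of the «hr» pattern quoted in the BQ[3] comment.
HR = re.compile(r"[ ]{,3}(?P<M>[-*_])([ ]{,2}(?P=M)){2,}[ \t]*$")

for line in ["---", "* * *", "   _  _  _  ", "--", "-*-", "    ---"]:
    print(repr(line), bool(HR.match(line)))
# '---' True
# '* * *' True
# '   _  _  _  ' True
# '--' False          (needs at least three markers)
# '-*-' False         (markers must all be the same character)
# '    ---' False     (four leading spaces is one too many)
```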
Just to make the example more complete, let us also have a paragraph
rule:
PAR[1] = /\g {,3}(?=[^ >])/          # Any non-special character with fewer
                                     # than 4 leading spaces starts a paragraph.
PAR[2] = [ B, EM, LINK, TEXT, … ]    # All the inline stuff works in this
                                     # context.
PAR[3] = /\g(?= {4}| {,3}>| {,3}$)/  # We exit the paragraph when the line
                                     # is starting raw, block quote, or is
                                     # empty. In practice paragraphs do end
                                     # with block quote, but not with raw.
Now we have 3 rules. Be aware that I typed all this just now without
actually testing it, and the goal is not to replicate Markdown.pl 100%,
just to give an example of how the rule system works.
So our ROOT rule looks like this:
ROOT[1] = //
ROOT[2] = [ RAW, BQ, PAR ]
So when we start to process a document, using this root rule, we will
get a match (without actually advancing our position in the document,
since zero characters were matched).
After this match we have RAW, BQ, and PAR as active rules. Say our
document looks like this:
> A normal paragraph
>     Some raw text
> Normal text again
Out of the block quote
The first line is ‘> A normal paragraph’ and we have 3 rules to apply,
BQ[1], RAW[1], and PAR[1].
Since all of these regexps start with \g, they are anchored to the
first byte of the document, and only BQ[1] will match.
This “eats” the ‘> ’ prefix, pushes BQ[4] on our stack, and makes BQ,
RAW, and PAR our new active rules (yeah, the same as before).
So we now have ‘A normal paragraph’ and again apply our 3 active rules.
This time PAR[1] will match. It won’t actually eat any characters, and
it won’t push additional rules onto our stack, but it will change the
active rules to: B, EM, LINK, TEXT, …
I didn’t define TEXT, but that is a fallback rule for non-special text
runs. We apply these rules to the line, and TEXT will match the line.
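The handling of the first line can be traced with a few lines of Python. This is only a sketch: `\g` is again assumed to be emulated by `Pattern.match(line, pos)`, and `START` and `matching_rules` are names invented for the example.

```python
import re

# Item 1 of each rule, without the \g (match(line, pos) supplies the anchor).
START = {
    "RAW": re.compile(r" {4}"),
    "BQ":  re.compile(r" {,3}> {,3}"),
    "PAR": re.compile(r" {,3}(?=[^ >])"),
}

def matching_rules(line, pos):
    """Map each start rule that matches at pos to the position after its match."""
    hits = {}
    for name, pattern in START.items():
        m = pattern.match(line, pos)
        if m:
            hits[name] = m.end()
    return hits

line = "> A normal paragraph"
print(matching_rules(line, 0))   # {'BQ': 2}   only BQ[1] matches and eats '> '
print(matching_rules(line, 2))   # {'PAR': 2}  PAR[1] matches without eating anything
```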
Now comes the special part. When we move to the next line, which is
‘>     Some raw text’, we start by applying the rules from our stack
to this line. We have BQ[4] on the stack, which will eat the leading
‘> ’. The line is now ‘    Some raw text’ and we have no more rules on
the stack. Before we apply the active rules, though, we need to check
if we need to leave the current context, which is PAR. Thus we try to
apply PAR[3], and we do get a match, so we leave PAR.
The active rules now revert to those active before we entered PAR,
i.e. RAW, BQ, and PAR. Applying these will give a match for RAW, so we
eat the match (the leading four spaces), push RAW[4] on the stack, and
set the new active rules to RAW[2], i.e. RAW_TEXT.
The line is now ‘Some raw text’ which will be eaten by the RAW_TEXT
rule.
The next line is ‘> Normal text again’, and we have both BQ[4] and
RAW[4] on the stack. We apply these in FIFO order: first BQ[4], which
eats ‘> ’, then RAW[4], which fails to match, instructing us to leave
RAW, …
Okay, enough writing — I hope the above gives a better understanding
of how the rules are used.
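The whole line-by-line procedure can be sketched in Python. This is a minimal illustration under stated assumptions, not the actual implementation: `Rule`, `RULES` and `parse` are hypothetical names; `\g` is emulated with `Pattern.match(line, pos)`; the inline rules and RAW_TEXT are reduced to "the rest of the line"; the «hr» alternative of BQ[3] is left out; PAR[3] is taken to look ahead for four spaces; and only the first three lines of the sample document are run, as in the walkthrough.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    name: str
    start: re.Pattern                    # item 1: enter the context
    children: tuple = ()                 # item 2: rules allowed inside (empty = leaf)
    end: Optional[re.Pattern] = None     # item 3: leave the context
    cont: Optional[re.Pattern] = None    # item 4: replayed on every following line

C = re.compile
RULES = {
    "ROOT": Rule("ROOT", C(r""), ("RAW", "BQ", "PAR")),
    "RAW": Rule("RAW", C(r" {4}"), (), None, C(r"( {4}| {,3}$)")),
    "BQ": Rule("BQ", C(r" {,3}> {,3}"), ("BQ", "RAW", "PAR"),
               C(r" *$"), C(r"( {,3}> ?)?")),
    "PAR": Rule("PAR", C(r" {,3}(?=[^ >])"), (), C(r"(?= {4}| {,3}>| {,3}$)")),
}

def parse(lines):
    stack = []   # open contexts, outermost first; ROOT is implicit
    out = []
    for line in lines:
        pos = 0
        # Replay the item-4 regexps, oldest first. The first one that fails
        # closes its own context and everything nested inside it.
        for depth, rule in enumerate(stack):
            if rule.cont is None:
                continue
            m = rule.cont.match(line, pos)
            if m is None:
                del stack[depth:]
                break
            pos = m.end()
        # Leave the current context while its item-3 regexp matches.
        while stack and stack[-1].end and stack[-1].end.match(line, pos):
            stack.pop()
        # Enter new contexts using the active rules of the current context.
        while True:
            current = stack[-1] if stack else RULES["ROOT"]
            for name in current.children:
                m = RULES[name].start.match(line, pos)
                if m:
                    stack.append(RULES[name])
                    pos = m.end()
                    break
            else:
                break   # leaf context, or no active rule matched
        out.append(("/".join(r.name for r in stack), line[pos:]))
    return out

DOC = ["> A normal paragraph", ">     Some raw text", "> Normal text again"]
for path, text in parse(DOC):
    print(path, "|", text)
# BQ/PAR | A normal paragraph
# BQ/RAW | Some raw text
# BQ/PAR | Normal text again
```

Keeping a single list of open contexts, each with an optional item-4 regexp, is a simplification; PAR sits on that list without contributing anything to the preprocessing step, which mirrors the description that it pushes no additional rule.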
> [...] You also need a way for the regular expression in 3 to be
> variable depending on what you caught in 1 (to match the same number
> of backticks in a code span for instance; to catch a matching
> closing HTML tag, etc.).
I allow captures from the match done by 1 to be referenced in 3.
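For the code span case, this could look roughly like the following Python sketch. The opening pattern and the way the closing regexp is built are assumptions made for the example (the original rules for code spans are not shown here), and `code_span` is a hypothetical helper.

```python
import re

OPEN = re.compile(r"(`+)")   # assumed item 1: a run of backticks, captured

def code_span(text, pos=0):
    """Return (content, position after the span), or None if no span starts at pos."""
    m = OPEN.match(text, pos)
    if not m:
        return None
    # Assumed item 3, built from the capture of item 1: the same run of
    # backticks, not adjacent to any further backtick.
    close = re.compile(r"(?<!`)" + re.escape(m.group(1)) + r"(?!`)")
    end = close.search(text, m.end())
    if not end:
        return None
    return text[m.end():end.start()], end.end()

print(code_span("``a ` b`` rest"))   # ('a ` b', 9)
print(code_span("`x`"))              # ('x', 3)
print(code_span("`open"))            # None
```

The single backtick inside the first example does not close the span, because the closing regexp demands the same two-backtick run that opened it.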