From jgm at berkeley.edu Fri May 2 20:52:56 2008
From: jgm at berkeley.edu (John MacFarlane)
Date: Fri, 2 May 2008 17:52:56 -0700
Subject: markdown implementation in C using PEG grammar
Message-ID: <20080503005255.GA15423@berkeley.edu>

I've just uploaded an implementation of markdown in C. It defines the syntax using a PEG grammar, so it should be easy to extend and modify. Right now it can produce output in either HTML or LaTeX, but it would be simple to add other output formats.

It's very fast: on my machine, it converts a 178K markdown file in 0.14 seconds (vs. 9.6 seconds for the latest Markdown.pl and 0.57 seconds for phpmarkdown). It passes all the tests in the Markdown 1.0.3 test suite, with one exception (an edge case where there is room for disagreement).

It's on github at http://github.com/jgm/peg-markdown/tree/master. If you use git, you can clone the repository:

    git clone git://github.com/jgm/peg-markdown.git

Otherwise, download a tarball:

    http://github.com/jgm/peg-markdown/tarball/master

Once you're in the peg-markdown directory, you can compile the program by typing 'make'. For convenience, all required dependencies are included, including Ian Piumarta's excellent peg/leg parser generator.

If you just want to have a look at the formal grammar, it's here:

    http://github.com/jgm/peg-markdown/tree/master/markdown.leg#L675

John

From bobtfish at bobtfish.net Fri May 2 21:23:05 2008
From: bobtfish at bobtfish.net (Tomas Doran)
Date: Sat, 3 May 2008 02:23:05 +0100
Subject: markdown implementation in C using PEG grammar
In-Reply-To: <20080503005255.GA15423@berkeley.edu>
References: <20080503005255.GA15423@berkeley.edu>
Message-ID:

On 3 May 2008, at 01:52, John MacFarlane wrote:

> I've just uploaded an implementation of markdown in C. It defines
> the syntax using a PEG grammar, so it should be easy to extend and
> modify. Right now it can produce output in either HTML or LaTeX,
> but it
> would be simple to add other output formats.
>

I've added support for this to my babelmark:

http://babelmark.bobtfish.net/?markdown=++*+foo%0D%0A%0D%0Abar

Enjoy.

It is not, however, showing up in the 'compare' option. As it's 2:30am here, I'm going to ignore this and go to bed in the short term.

Have fun.

Cheers
Tom

From michel.fortin at michelf.com Mon May 5 22:33:09 2008
From: michel.fortin at michelf.com (Michel Fortin)
Date: Mon, 05 May 2008 22:33:09 -0400
Subject: Markdown Extra Specification (First Draft)
Message-ID: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com>

It took much more time than I expected, and it is currently less complete than I had hoped, but I've finally made a first draft of the Markdown Extra spec. You can find it at

Currently, the specification defines its goals and a document model for Markdown Extra. It lacks the most important part though: the parsing section, which I'm going to write next. As I update the document locally on my computer, I'll update the public specification page to always reflect the latest version.

I'm working on this in my free time, alongside a full-time job, a few other projects, and other duties. I'm hoping to have a draft of the parsing section covering all parts of the syntax by July, and to have completed the whole specification by 2009. I'm not sure how realistic this timetable is, but we'll see.

To take part in this work, simply leave your comments, suggestions, and potential issues on this list, or send them to me by private email if you don't want your comment to be public for some reason.

Michel Fortin
michel.fortin at michelf.com
http://michelf.com/

From sgbotsford at gmail.com Tue May 6 18:18:53 2008
From: sgbotsford at gmail.com (Sherwood Botsford)
Date: Tue, 06 May 2008 16:18:53 -0600
Subject: Markdown Extra Specification (First Draft)
In-Reply-To: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com>
References: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com>
Message-ID: <4820D94D.8090807@gmail.com>

As a suggestion for the next pass at this, add an example of each, and how it should be rendered.
I found fireball's web site fairly lucid for this.

E.g.
Example abbreviation definition
*[TLA] :Three Letter Acronym

Note, If I'm reading this correctly then
* [TLA]: Three Letter Acronym

would be incorrect, as there is a space after the asterisk and after the colon.
It's also not clear if this critter has to appear on a line by itself.

If the abbreviation is long enough that it spans line ends, is that ok?

* [ODITLOID]: A Day in the Life of Ivan
Denisovich (Alexander Solzhenitsyn)

***
A lot of the critters that appear as references aren't clear about how they appear in the text, and how they appear when resolved. The footnote definition starts with [^ but does the footnote in the text also start that way?

2.2.1 Link Reference

Quote:
A link reference is alone on a line. It begins with the reference name inside square brackets, optionally followed by a space or a no-break-space, a colon, a URI (either enclosed in angle brackets or not), and an optional title enclosed in single or double quotes, or in parenthesis (which can be preceded by a newline).

four things:
1. Sometimes the link is *(@*&^(^ long. So if I'm editing with vi, I have everything else in 60-70 column lines, then this great bloody honker. I'd like an optional syntax for breaking a long URI into chunks.

Eg. the usual unix convention of \ with optional trailing whitespace means continued on next line, with the \ and whitespace going to the bit bucket.

2. I'd like some way to hook references to an external file, or database lookup instead of doing them internally.

3. Why three versions of quoting characters for the title?

4. Why the <> around the URI?

5. Only if you put the title in () can you start on a newline?

2.2.2 Abbreviation
Again, I'd like a hook so that I can put these in an external file. In my tree farm web page I'd like to use botanical descriptions, but be able to let users see the definition on mouse over or click. But the word 'glabrous' may appear on 40 pages. Be nice if I only had to define it once. If someone is creating an annotated Shakespeare they would want to use an Elizabethan English dictionary as their external file, style it so that defined words are barely different from the text, and let the confused reader click for enlightenment.

2.2.2 footnotes

Note possible numbering error both abbreviation and footnotes are 2.2.2

How does the footnote appear in the text? For clarity in reading, all the things that refer to something else should be visibly different. E.g. In markdown we presently have
[link text ][LINKREF]
![Image alt text][IMGREF]
so
^[FOOTNOTE] (although I'd prefer _[FOOTNOTE] as it tells me it's below the ruler line at the bottom of the page)
Except from your text it appears it should be [^FOOTNOTE] which is at odds with the image and abbreviation syntax.
How are footnotes numbered?

I think you could make a case for a footnote being the child of the block element that the reference appears in.
This may potentially allow clever people with CSS to have the footnote appear as a sidebar div, adjacent to the reference.

2.3.7 Table syntax

Suggested syntax
[TT] Table title
|[TH] elements | separated by pipes | with white space | on either side |
| anything | that | appears | with | leading and trailing |
| is | formatted | as a | table row|

|> This cell spans two columns | and | so forth|
| This cell also spans two columns <| and | so forth|
|>> This cell spans three columns | in the | table |
| This cell spans two rows | in the | table|
| " " | because it has ditto marks | in the cell below |

Since many tables are done without a title or header, the pipe syntax is the usual.
You can spend some time pretty printing it.
Suggested implementation would have warnings when the number of cells per row is inconsistent.

2.4.2 and 2.4.3 Emphasis and strong emphasis.
The current markdown uses either _ or * for emphasis and any combination of the two doubled for strong emphasis. I suggested that * be used for strong (default bold) and _ be used for emphasis. (default italic) This gives three combinations possible with the same set of symbols, and fits the general intuitive nature of markdown.

2.4.6 Hard line break
This one bites me regularly, as I learned to touch type in high school and to end a sentence with a period followed by two spaces. This means that every time I end a sentence on a line end, I get an involuntary break. Lots of head scratching over this one. I don't like markup that depends on invisible trailing characters.

I would favour ending a line with a forward slash. You sometimes see this in poetry where line length exceeds the column width. And it has an easy mnemonic: If a back slash means concatenate the next line onto this one then forward slash means, force the line break here.

Thus my address would appear

Sherwood Botsford /
RR 1 Site 2 Box 5 /
Warburg, Alberta T0C 2T0 /

I would propose that any amount of white space surrounding the / would be allowed. So if you wanted to add extra space so the /'s would line up, you could.
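
The two hard-break conventions in section 2.4.6 can be compared with a small sketch. It is illustrative only: `two_space_breaks` approximates the existing rule (two or more trailing spaces force a break) and `slash_breaks` implements the trailing-slash proposal above. Both function names are invented for this example, and neither reflects the code of any actual Markdown implementation.

```python
import re

def two_space_breaks(text):
    """Existing rule (approximation): a line ending in 2+ spaces gets a hard break."""
    return re.sub(r" {2,}\n", "<br />\n", text)

def slash_breaks(text):
    """Proposed rule: a line ending in '/' (whitespace allowed around it) gets a hard break."""
    return re.sub(r"[ \t]*/[ \t]*\n", "<br />\n", text)

# The address example, with extra padding on the second line so the slashes could line up.
address = "Sherwood Botsford /\nRR 1 Site 2 Box 5    /\nWarburg, Alberta T0C 2T0\n"
print(slash_breaks(address))

# The touch-typist problem: two spaces after a full stop at a line end force a break.
print(two_space_breaks("End of sentence.  \nNext line\n"))
```

One consequence the sketch makes visible: under the slash rule, any line that legitimately ends in a slash (a bare URL ending in `/`, for instance) would also produce a break, so a real rule would need an escape or a stricter context.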

Abbreviation is element 2.2.2 and 2.4.7. Is this correct? It is both a document element and a span element? Ditto Link. Will this cause trouble for designing the parser?

From waylan at gmail.com Tue May 6 21:43:34 2008
From: waylan at gmail.com (Waylan Limberg)
Date: Tue, 6 May 2008 21:43:34 -0400
Subject: Markdown Extra Specification (First Draft)
In-Reply-To: <4820D94D.8090807@gmail.com>
References: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com> <4820D94D.8090807@gmail.com>
Message-ID:

Sherwood,

First of all, realize that Michel is currently documenting existing behavior. I like some of your suggestions, but they should have happened years ago when the discussion happened here on the list. Various other implementations have copied the existing behavior and there are countless documents already using them, so I doubt we'll see any changes, unless we move to Markdown 2.0 or something. I get the impression that's not likely any time soon.

Anyway, there were a few things I'll comment on individually:

On Tue, May 6, 2008 at 6:18 PM, Sherwood Botsford wrote:
> As a suggestion for the next pass at this, add an example of each, and how
> it should be rendered.

I agree. I was going to make the same suggestion. This would be helpful.

[snip]

> 2.2.2 Abbreviation
> Again, I'd like a hook so that I can put these in an external file. In my
> tree farm web page I'd like to use botanical descriptions, but be able to
> let users see the definition on mouse over or click. But the word 'glabrous'
> may appear on 40 pages. Be nice if I only had to define it once. If someone
> is creating an annotated Shakespeare they would want to use an Elizabethan
> English dictionary as their external file, style it so that defined words
> are barely different from the text, and let the confused reader click for
> enlightenment.

An excellent idea! After all, I had the same idea some time ago and implemented it in the Abbreviation Extension [1] for Python-Markdown.
However, I'm not sure this should be a requirement of a syntax specification.

[1]: http://achinghead.com/markdown/abbr/

As I noted above, the rest I'll leave to existing behavior, even if I like your suggestion better. We don't want to forget J.G.'s motivation and goals for creating markdown to start with. A review of that will answer some of your questions about current behavior. And we must also never forget that (even as implementors) we should not care so much about how hard it is to implement if it makes it easier (and more permissive/relaxed) for the document author.

--
----
Waylan Limberg
waylan at gmail.com

From michel.fortin at michelf.com Tue May 6 23:57:19 2008
From: michel.fortin at michelf.com (Michel Fortin)
Date: Tue, 06 May 2008 23:57:19 -0400
Subject: Markdown Extra Specification (First Draft)
In-Reply-To: <4820D94D.8090807@gmail.com>
References: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com> <4820D94D.8090807@gmail.com>
Message-ID:

On 2008-05-06 at 18:18, Sherwood Botsford wrote:

> As a suggestion for the next pass at this, add an example of each,
> and how it should be rendered.
> I found fireball's web site fairly lucid for this.

The content model section currently includes a basic description of each syntax element, but its primary goal is to explain the structure of a Markdown Extra document. The parsing section is intended to define without ambiguity the syntax for each element.

Nothing is set in stone though. I could, for instance, remove entirely the syntax description part from the document model and leave that entirely to the parsing section. I'll see what fits when I write the parsing section.

> E.g.
> Example abbreviation definition
> *[TLA] :Three Letter Acronym
>
> Note, If I'm reading this correctly then
> * [TLA]: Three Letter Acronym
>
> would be incorrect, as there is a space after the asterisk and after
> the colon.
> It's also not clear if this critter has to appear on a line by itself.

This will be defined in more precise terms in due time. I'm not there just yet.

That said, I think the first one should be allowed, but not the second one. Putting a space before a colon is common in the French-speaking world, and I don't think it causes a problem to allow it; whereas the second one is ambiguous with a list item and I'd rather have the user see a list item where he doesn't expect one than see a list item disappear because it looks like an abbreviation definition.

In fact, the first one is allowed by PHP Markdown Extra, but not the second one, which becomes a list item.

> If the abbreviation is long enough that it spans line ends, is that
> ok?
>
> * [ODITLOID]: A Day in the Life of Ivan
> Denisovich (Alexander Solzhenitsyn)

Currently, PHP Markdown Extra doesn't handle that very well. I'm not sure what the final specification should say about this. I'll look at it when doing the parsing section.

> ***
> A lot of the critters that appear as references aren't clear about
> how they
> appear in the text, and how they appear when resolved. The footnote
> definition starts with [^ but does the footnote in the text also
> start that way?

I'm not sure I understand your concern here. The spec may not be clear about that currently, but the spec's extra features are coming from PHP Markdown Extra, and so will follow PHP Markdown Extra's syntax. Please take a look at this document:

> 2.2.1 Link Reference
>
> Quote:
> A link reference is alone on a line. It begins with the reference
> name inside square brackets, optionally followed by a space or a
> no-break-space, a colon, a URI (either enclosed in angle brackets or
> not), and an optional title enclosed in single or double quotes, or
> in parenthesis (which can be preceded by a newline).
>
>
> four things:
> 1. Sometimes the link is *(@*&^(^ long. So if I'm editing with vi, I
> have everything else in 60-70 column
> lines, then this great bloody honker. I'd like an optional syntax
> for breaking a long URI into chunks.
>
> Eg. the usual unix convention of \ with optional trailing whitespace
> means continued on next line, with the \ and whitespace going to the
> bit bucket.

I'm not sure that's useful enough to justify the added complexity for parsers. But still, please remind me of this problem after I've done the part of the parsing section that deals with URLs.

> 2. I'd like some way to hook references to an external file, or
> database lookup instead of doing them internally.

That can be a parser feature; it's out of the scope of the specification.

> 3. Why three versions of quoting characters for the title?
>
> 4. Why the <> around the URI?

Because that's what Markdown.pl supports, alongside other implementations based on it such as PHP Markdown. I think that's the exact kind of detail that is under-documented right now and that is making life difficult for other implementers who want to be compatible with existing documents.

> 5. Only if you put the title in () can you start on a newline?

No idea. I haven't looked at the implications of this yet, but perhaps it could be done.

> 2.2.2 Abbreviation
> Again, I'd like a hook so that I can put these in an external file.
> In my tree farm web page I'd like to use botanical descriptions, but
> be able to let users see the definition on mouse over or click. But
> the word 'glabrous' may appear on 40 pages. Be nice if I only had to
> define it once. If someone is creating an annotated Shakespeare they
> would want to use an Elizabethan English dictionary as their
> external file, style it so that defined words are barely different
> from the text, and let the confused reader click for enlightenment.

That should be an implementation-specific feature; perhaps we should make sure the spec doesn't disallow that.

> 2.2.2 footnotes
>
> Note possible numbering error both abbreviation and footnotes are
> 2.2.2

Oops... should be fixed now.
[Note to self: I really need to add an automatic numbering system to my publishing system.]

> How does the footnote appear in the text? For clarity in reading,
> all the things that refer to something else
> should be visibly different. E.g. In markdown we presently have
> [link text ][LINKREF]
> ![Image alt text][IMGREF]
> so
> ^[FOOTNOTE] (although I'd prefer _[FOOTNOTE] as it tells me it's
> below the ruler line at the bottom of the page)
> Except from your text it appears it should be [^FOOTNOTE] which is
> at odds with the image and abbreviation
> syntax.
> How are footnotes numbered?

That feature, along with others, should follow how PHP Markdown Extra does it. Footnote numbering could be left implementation-defined, however.

> I think you could make a case for a footnote being the child of the
> block element that the reference appears in.
> This may potentially allow clever people with CSS to have the
> footnote appear as a sidebar div, adjacent to the reference.

You'll probably need a different HTML output if you want to have sidenotes, but that shouldn't be disallowed. I intend to describe a reference HTML output in the spec, but that section will probably be non-normative so that implementers are free to give any output they feel is right for their users.

> 2.3.7 Table syntax
>
> Suggested syntax
> [TT] Table title
> |[TH] elements | separated by pipes | with white space | on either
> side |
> | anything | that | appears | with | leading and trailing |
> | is | formatted | as a | table row|
>
> |> This cell spans two columns | and | so forth|
> | This cell also spans two columns <| and | so forth|
> |>> This cell spans three columns | in the | table |
> | This cell spans two rows | in the | table|
> | " " | because it has ditto marks | in the cell below |
>
> Since many tables are done without a title or header, the pipe
> syntax is the usual.
> You can spend some time pretty printing it.
> Suggested implementation would have warnings when the number of
> cells per row is inconsistent.
>
> 2.4.2 and 2.4.3 Emphasis and strong emphasis.
> The current markdown uses either _ or * for emphasis and any
> combination of the two
> doubled for strong emphasis. I suggested that * be used for strong
> (default bold) and _ be used for emphasis.
> (default italic) This gives three combinations possible with the
> same set of symbols, and fits the general
> intuitive nature of markdown.

The plan is to use the syntax implemented by PHP Markdown Extra. For emphasis with underscore, there's going to be a special note about the difference between Markdown Extra and plain Markdown (about middle-word emphasis), so that implementers of plain Markdown can implement the thing correctly.

> 2.4.6 Hard line break
> This one bites me regularly, as I learned to touch type in high
> school and to end a sentence with
> a period followed by two spaces. This means that every time I end a
> sentence on a line end, I
> get an involuntary break. Lots of head scratching over this one. I
> don't like markup that depends on invisible
> trailing characters.
>
>
> I would favour ending a line with a forward slash. You sometimes see
> this in poetry where line length exceeds
> the column width. And it has an easy mnemonic: If a back slash
> means concatenate the next line onto this one
> then forward slash means, force the line break here.
>
> Thus my address would appear
>
> Sherwood Botsford /
> RR 1 Site 2 Box 5 /
> Warburg, Alberta T0C 2T0 /
>
> I would propose that any amount of white space surrounding the /
> would be allowed. So if you wanted
> to add extra space so the /'s would line up, you could.

Removing the double-space-at-the-end rule isn't on the table; such a change would break all documents that are already out there using the current hard line break syntax.
That said, I agree with you that it can bite easily if you have the habit of writing two spaces after a sentence (and I know this is quite common among people where I live). But, sadly perhaps, I think it's too late to change.

> Abbreviation is element 2.2.2 and 2.4.7 Is this correct? It is both
> a document element and a span element? Ditto Link. Will this cause
> trouble for designing the parser?

The document element 2.2.2 is "Abbreviation definition", telling what word means what, and the span element 2.4.7 is "Abbreviation", representing an instance of an abbreviation in the text (deduced automatically by the parser). I'm not sure what the problem is there. It's pretty much like 2.2.1 Link Reference and 2.4.4 Link: one is the definition of the URL and title of a link; the other is the actual link.

* * *

I'm sorry to ditch most of your suggestions like that, but I can't really make any breaking change to the syntax, or that syntax wouldn't be Markdown anymore. The idea behind the spec is to give implementors an unambiguous reference about how to implement Markdown (and Markdown Extra), allowing documents tested with one parser to work with any other, unchanged.

Given the current situation, it may be a little utopian to believe no current document will be broken as implementations adjust themselves to the spec, but we should try to minimize that.

Michel Fortin
michel.fortin at michelf.com
http://michelf.com/

From sgbotsford at gmail.com Wed May 7 16:08:35 2008
From: sgbotsford at gmail.com (Sherwood Botsford)
Date: Wed, 07 May 2008 14:08:35 -0600
Subject: Markdown Extra Specification (First Draft)
In-Reply-To:
References: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com> <4820D94D.8090807@gmail.com>
Message-ID: <48220C43.4050401@gmail.com>

> The document element 2.2.2 is "Abbreviation definition", telling what
> word means what, and the span element 2.4.7 is "Abbreviation",
> representing an instance of an abbreviation in the text (deduced
> automatically by the parser). I'm not sure what the problem is there.
> It's pretty much like 2.2.1 Link Reference and 2.4.4 Link: one is the
> definition of the URL and title of a link; the other is the actual link.
>
> * * *
>
> I'm sorry to ditch most of your suggestions like that, but I can't
> really make any breaking change to the syntax, or that syntax wouldn't be
> Markdown anymore. The idea behind the spec is to give implementors an
> unambiguous reference about how to implement Markdown (and Markdown
> Extra), allowing documents tested with one parser to work with any
> other, unchanged.
>

Not to worry. I wasn't expecting backward compatibility, so that flavoured much of what I said. I was not aware of PHP Markdown Extra. I will read further before commenting again.

THAT said, however, maintaining perfect backward compatibility slows down progress.

Can Markdown Extra have a configuration file? The default behaviour is to emulate markdown. The configuration file allows for new features that don't fit well into the old set.

Implementation specs: The program should have a compiled in set of locations to look for the config file, a command line option, and an environment option.

Consider too, if it is truly an improvement, it can be given a new name, and a new calling convention, "MarkdownX". This allows both systems to be in use while a system is in transition. Couple this with a program that scans old markdown files for 'gotchas' that have changed in the new one.

> Given the current situation, it may be a little utopian to believe no
> current document will be broken as implementations adjust themselves to
> the spec, but we should try to minimize that.
>
>

I agree that you need a way for people to gracefully make the transition. The best approach is a method that allows old and new systems to co-exist in the same environment. If you call it with a new name, there shouldn't be a problem.
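
The configuration lookup described above (a built-in list of locations, a command-line option, and an environment option) could be sketched roughly as follows. Everything named here is hypothetical: the `MARKDOWNX_CONFIG` variable, the default paths, and the `find_config` function are invented for illustration. The precedence order (command line first, then environment, then built-in locations) is also an assumption; the message only lists the three sources.

```python
import os

# Hypothetical names throughout: "MARKDOWNX_CONFIG", these default paths, and
# find_config() are invented for this sketch and belong to no real tool.
DEFAULT_LOCATIONS = ["./markdownx.conf", "~/.markdownx.conf", "/etc/markdownx.conf"]

def find_config(cli_path=None, environ=None, exists=os.path.exists):
    """Return the first config path that exists, or None to keep plain Markdown behaviour."""
    environ = os.environ if environ is None else environ
    candidates = []
    if cli_path:
        candidates.append(cli_path)          # 1. explicit command-line option
    env_path = environ.get("MARKDOWNX_CONFIG")
    if env_path:
        candidates.append(env_path)          # 2. environment variable
    candidates.extend(DEFAULT_LOCATIONS)     # 3. built-in search list
    for path in candidates:
        expanded = os.path.expanduser(path)
        if exists(expanded):
            return expanded
    return None                              # no config found: emulate old behaviour

if __name__ == "__main__":
    print(find_config())
```

Returning `None` when nothing is found matches the stated default of emulating existing Markdown unless a configuration file opts in to new behaviour.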

From jacobolus at gmail.com Wed May 7 16:40:03 2008
From: jacobolus at gmail.com (Jacob Rus)
Date: Wed, 07 May 2008 16:40:03 -0400
Subject: Markdown Extra Specification (First Draft)
In-Reply-To: <48220C43.4050401@gmail.com>
References: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com> <4820D94D.8090807@gmail.com> <48220C43.4050401@gmail.com>
Message-ID:

Sherwood Botsford wrote:
> Not to worry. I wasn't expecting backward compatibility, so that
[...]
> THAT said, however, maintaining perfect backward compatibility slows
> down progress.

If this is your view, you shouldn't put "markdown" in the name.

> Implementation specs: The program should have a compiled in
> set of locations to look for the config file, a command line option, and
> an environment option.

Wait, compiled? Environment options? This is getting way more complex than necessary. Keep it simple, on general principle.

> Consider too, if it is truly an improvement, it can be given a

Yes, I'd guess that's unlikely. There have been a half dozen attempts to "improve" markdown; I don't particularly like any of them. (no offense intended to those implementors)

> I agree that you need a way for people to gracefully make the
> transition. The best approach is a method that allows old
> and new systems to co-exist in the same environment. If you call it
> with a new name, there shouldn't be a problem.

The new one is unlikely to gain much mindshare unless it is a) unquestionably better, and b) gets used by some prominent system/tool/etc. Good luck.

Jacob

(not trying to rain on parades here :)

From jacobolus at gmail.com Wed May 7 16:54:43 2008
From: jacobolus at gmail.com (Jacob Rus)
Date: Wed, 07 May 2008 16:54:43 -0400
Subject: Markdown Extra Specification (First Draft)
In-Reply-To: <4820D94D.8090807@gmail.com>
References: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com> <4820D94D.8090807@gmail.com>
Message-ID:

Sherwood Botsford wrote:
> four things:
> 1. Sometimes the link is *(@*&^(^ long. So if I'm editing with vi, I
> have everything else in 60-70 column
> lines, then this great bloody honker.

Vi can't handle line wrap? Why do you need to look at the end of the URL?

> Eg. the usual unix convention of \ with optional trailing whitespace
> means continued on next line, with the \ and whitespace going to the bit
> bucket.

There's an RFC which recommends wrapping URIs in plain text in `<>`s. I suggest if you are bothered by long lines, you use that format.

> 2. I'd like some way to hook references to an external file, or database
> lookup instead of doing them internally.

Just make a tool which concatenates your files first, then runs markdown on the result.

> 3. Why three versions of quoting characters for the title?

Presumably either in case you want to use the others in the title, or just so you don't have to think about what the quoting character is when you're writing. I think it's unnecessary.

> 4. Why the <> around the URI?

There is, as mentioned, an RFC which recommends this. It has been the accepted format in plain-text emails for decades.

> Since many tables are done without a title or header, the pipe syntax is
> the usual.
> You can spend some time pretty printing it.
> Suggested implementation would have warnings when the number of cells
> per row is inconsistent.

Go back and look through this list's archives. 3-4 years ago there was a long and fruitful discussion of table syntax.

> I suggested that * be used for strong (default bold) and _ be used for
> emphasis. (default italic) This gives three combinations possible with
> the same set of symbols, and fits the general intuitive nature of
> markdown.

> 2.4.6 Hard line break
> This one bites me regularly, as I learned to touch type in high school
> and to end a sentence with
[...]
>
> I would favour ending a line with a forward slash. You sometimes see

I put your chances of selling John on these at about zero.

-Jacob

From michel.fortin at michelf.com Thu May 8 23:59:46 2008
From: michel.fortin at michelf.com (Michel Fortin)
Date: Thu, 08 May 2008 23:59:46 -0400
Subject: Markdown Extra Spec: Parsing Section
Message-ID: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com>

Hello all,

I've begun writing the parsing section of the spec, and I thought I'd let you know about where I'm heading with all this.

Basically, parsing is defined as three consecutive passes: parsing document elements, parsing block elements and parsing span elements. Each pass is going to contain a set of rules the parser should attempt to match while parsing the input. Rules are expressed in English, but are highly structured so that it should be pretty straightforward to convert to a formal grammar if the grammar is powerful enough to express them.

I'm not saying too much here; elaborate explanations are better in the spec than in this volatile email. If you're interested, take a look and tell me what you think:

Michel Fortin
michel.fortin at michelf.com
http://michelf.com/

From andrea at censi.org Fri May 9 00:27:54 2008
From: andrea at censi.org (Andrea Censi)
Date: Thu, 8 May 2008 21:27:54 -0700
Subject: Markdown Extra Spec: Parsing Section
In-Reply-To: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com>
References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com>
Message-ID:

On Thu, May 8, 2008 at 8:59 PM, Michel Fortin wrote:
> Basically, parsing is defined as three consecutive passes: parsing document
> elements, parsing block elements and parsing span elements.

Looks good so far.

The most delicate part is still to come (defining indentation for lists, and (X)(HT)ML fragments in the text flow).
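
The overall shape of the three passes can be shown with a deliberately tiny sketch. It is not the spec's rule set, which this thread does not reproduce: here the document pass knows only simple link reference definitions, the block pass knows only paragraphs, and the span pass knows only reference-style links. All function names are invented for the example.

```python
import re

# Pass 1 pattern: a link reference alone on a line, e.g. "[id]: http://example.com".
# Simplified on purpose: titles and angle brackets around the URL are not handled.
REF_DEF = re.compile(r"^\[([^\]\n]+)\]:[ \t]*(\S+)[ \t]*$", re.MULTILINE)

def parse_document_elements(text):
    """Pass 1: collect reference definitions and remove them from the text."""
    refs = {m.group(1).lower(): m.group(2) for m in REF_DEF.finditer(text)}
    return refs, REF_DEF.sub("", text)

def parse_block_elements(text):
    """Pass 2: split what is left into blocks (here: paragraphs only)."""
    return [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]

def parse_span_elements(block, refs):
    """Pass 3: resolve spans inside one block (here: reference-style links only)."""
    def link(m):
        url = refs.get(m.group(2).lower())
        # Leave the text untouched when the reference name is not defined.
        return f'<a href="{url}">{m.group(1)}</a>' if url else m.group(0)
    return re.sub(r"\[([^\]]+)\]\[([^\]]+)\]", link, block)

def render(text):
    refs, body = parse_document_elements(text)
    return [f"<p>{parse_span_elements(b, refs)}</p>" for b in parse_block_elements(body)]

if __name__ == "__main__":
    sample = "See [the spec][spec].\n\n[spec]: http://example.com/spec\n"
    print(render(sample))
```

The point of running the document pass first is that definitions can appear anywhere in the input, yet every span that refers to them can be resolved in a single later pass.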

--
Andrea Censi
PhD student, Control & Dynamical Systems, Caltech
http://www.cds.caltech.edu/~andrea/
"Life is too important to be taken seriously" (Oscar Wilde)

From qaramazov at gmail.com Fri May 9 01:13:05 2008
From: qaramazov at gmail.com (Yuri Takhteyev)
Date: Thu, 8 May 2008 22:13:05 -0700
Subject: Markdown Extra Spec: Parsing Section
In-Reply-To:
References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com>
Message-ID:

I agree. This is starting to look very good and I think this is the right approach. I am eagerly waiting for the really hard parts, though.

- yuri

> > Basically, parsing is defined as three consecutive passes: parsing document
> > elements, parsing block elements and parsing span elements.
>
> Looks good so far.
>
> The most delicate part is still to come (defining indentation for
> lists, and (X)(HT)ML fragments in the text flow).

--
http://sputnik.freewisdom.org/

From jgm at berkeley.edu Fri May 9 13:39:42 2008
From: jgm at berkeley.edu (John MacFarlane)
Date: Fri, 9 May 2008 10:39:42 -0700
Subject: Markdown Extra Spec: Parsing Section
In-Reply-To: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com>
References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com>
Message-ID: <20080509173941.GA10740@berkeley.edu>

Michel,

I think there's a problem with:

> refname
>
> A run of one or more characters, excluding any newline and U+005D
> Closing Square Bracket.

This doesn't allow refnames with embedded brackets. But PHP Markdown allows

    [[hi]](/url)

as a valid link. Also, PHP Markdown currently allows embedded newlines, which are excluded by your definition:

    [hi
    there](/url)

Of course, embedded *blank* lines should be excluded.

John

+++ Michel Fortin [May 08 08 23:59 ]:
> Hello all,
>
> I've begun writing the parsing section of the spec, and I thought I'd let
> you know about where I'm heading with all this.
>
> Basically, parsing is defined as three consecutive passes: parsing
> document elements, parsing block elements and parsing span elements.
> Each pass is going to contain a set of rules the parser should attempt > to match while parsing the input. Rules are expressed in English, but > are highly structured so that it should be pretty straightforward to > convert to a formal grammar if the grammar is powerful enough to express > them. > > I'm not saying too much here; elaborate explanations are better in the > spec than in this volatile email. If you're interested, take a look and > tell me what you think: > > > > Michel Fortin > michel.fortin at michelf.com > http://michelf.com/ > > > _______________________________________________ > Markdown-Discuss mailing list > Markdown-Discuss at six.pairlist.net > http://six.pairlist.net/mailman/listinfo/markdown-discuss > From qaramazov at gmail.com Fri May 9 13:50:21 2008 From: qaramazov at gmail.com (Yuri Takhteyev) Date: Fri, 9 May 2008 10:50:21 -0700 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <20080509173941.GA10740@berkeley.edu> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <20080509173941.GA10740@berkeley.edu> Message-ID: > This doesn't allow refnames with embedded brackets. But PHP Markdown > allows > > [[hi]](/url) > > as a valid link. Also, PHP Markdown currently allows embedded newlines, > which are excluded by your definition: In my reading of Michel's document, refnames are labels used to connect "reference-style" links and footnotes to their definition. That's different from link _text_. You can have [[hi]](/url) but you cannot have [Hi!][[hi]] [[hi]]: http://hi.com I am assuming that there will be a different type to deal with link text. 
- yuri -- http://sputnik.freewisdom.org/ From jgm at berkeley.edu Fri May 9 14:35:33 2008 From: jgm at berkeley.edu (John MacFarlane) Date: Fri, 9 May 2008 11:35:33 -0700 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <20080509173941.GA10740@berkeley.edu> Message-ID: <20080509183533.GA15602@berkeley.edu> > In my reading of Michel's document, refnames are labels used to > connect "reference-style" links and footnotes to their definition. > That's different from link _text_. You can have > > [[hi]](/url) > > but you cannot have > > [Hi!][[hi]] > > [[hi]]: http://hi.com > > I am assuming that there will be a different type to deal with link text. Ah yes, you're right about the document, and about the current behavior of PHP markdown. Still, why shouldn't refnames be allowed to have embedded brackets and newlines, if explicit links can? (Pandoc allows both, as does peg-markdown; I hadn't noticed the divergence before.) It seems a bad idea to expect users to be aware of the distinction. A user would naturally expect that a link like [[link with embedded brackets]](/url) could be converted into a reference-style link like this: [[link with embedded brackets]] [[link with embedded brackets]]: /url But this won't work in the current implementations. The markdown syntax documentation on daringfireball.net doesn't say anything that would lead one to expect this behavior, either. If people think there's a good reason to keep the current behavior, though, I can change my programs to conform. 
John From qaramazov at gmail.com Fri May 9 14:57:01 2008 From: qaramazov at gmail.com (Yuri Takhteyev) Date: Fri, 9 May 2008 11:57:01 -0700 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <20080509183533.GA15602@berkeley.edu> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <20080509173941.GA10740@berkeley.edu> <20080509183533.GA15602@berkeley.edu> Message-ID: > Still, why shouldn't refnames be allowed to have embedded brackets > and newlines, if explicit links can? To me those seem to be two entirely different things... The _text_ of the link has to be flexible to allow almost anything. refnames, on the other hand are identifiers and as such it makes sense for them to be more constrained. For instance, if you allow new lines in them it opens a whole bunch of questions as to what white space "counts". > could be converted into a reference-style link like this: > > [[link with embedded brackets]] I don't believe Markdown allows for links to be defined like this. A reference style link would be defined as: [[link with embedded brackets]][link with embedded brackets] or [link with embedded brackets][] The latter case is positioned as a shortcut for the special (though common) case where the link text is simple enough that you might as well use use it as the ID. - yuri -- http://sputnik.freewisdom.org/ From jgm at berkeley.edu Fri May 9 15:06:08 2008 From: jgm at berkeley.edu (John MacFarlane) Date: Fri, 9 May 2008 12:06:08 -0700 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <20080509173941.GA10740@berkeley.edu> <20080509183533.GA15602@berkeley.edu> Message-ID: <20080509190608.GA17937@berkeley.edu> > > could be converted into a reference-style link like this: > > > > [[link with embedded brackets]] > > I don't believe Markdown allows for links to be defined like this. 
A > reference style link would be defined as: > > [[link with embedded brackets]][link with embedded brackets] > > or > > [link with embedded brackets][] Newer versions of Markdown.pl and PHP Markdown do allow the trailing '[]' to be omitted. % Markdown.pl --version This is Markdown, version 1.0.2b8. Copyright 2004 John Gruber http://daringfireball.net/projects/markdown/ % Markdown.pl [hi] [hi]: /url

hi

I agree that without this style of reference links, it would make more sense to distinguish between link text and 'refnames'. But this style of link gives a reason not to distinguish them. John From bobtfish at bobtfish.net Fri May 9 16:19:02 2008 From: bobtfish at bobtfish.net (Tomas Doran) Date: Fri, 9 May 2008 21:19:02 +0100 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <20080509173941.GA10740@berkeley.edu> <20080509183533.GA15602@berkeley.edu> Message-ID: <2506BBF6-931E-4D91-AA53-175B099CE734@bobtfish.net> On 9 May 2008, at 19:57, Yuri Takhteyev wrote: >> Still, why shouldn't refnames be allowed to have embedded brackets >> and newlines, if explicit links can? > > To me those seem to be two entirely different things... The _text_ of > the link has to be flexible to allow almost anything. refnames, on > the other hand are identifiers and as such it makes sense for them to > be more constrained. For instance, if you allow new lines in them it > opens a whole bunch of questions as to what white space "counts". 
> In Text::Markdown, I'm allowing new lines in link text: http://svn.kulp.ch/cpan/text_multimarkdown/trunk/t/Text- Markdown.mdtest/Links_multiline_bugs_1.text http://svn.kulp.ch/cpan/text_multimarkdown/trunk/t/Text- Markdown.mdtest/Links_multiline_bugs_2.text As it was specifically reported to me as a bug: http://bugs.debian.org/459885 Cheers Tom From jgm at berkeley.edu Fri May 9 16:51:18 2008 From: jgm at berkeley.edu (John MacFarlane) Date: Fri, 9 May 2008 13:51:18 -0700 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <2506BBF6-931E-4D91-AA53-175B099CE734@bobtfish.net> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <20080509173941.GA10740@berkeley.edu> <20080509183533.GA15602@berkeley.edu> <2506BBF6-931E-4D91-AA53-175B099CE734@bobtfish.net> Message-ID: <20080509205118.GA23364@berkeley.edu> Just to clarify: Most implementations allow newlines and embedded brackets in link text. But pandoc and peg-markdown seem to be the only ones that currently allow them in link reference definitions: http://babelmark.bobtfish.net/?markdown=[hi%0D%0Athere][]%0D%0A%0D%0A[hi+there]%3A+%2Furl%0D%0A%0D%0A[hi%0D%0Aagain][]%0D%0A%0D%0A[hi%0D%0Aagain]%3A+%2Furl%0D%0A%0D%0A[[hi]](%2Furl)%0D%0A%0D%0A[[hello]]%0D%0A%0D%0A[[hello]]%3A+%2Furl%0D%0A&normalize=on&src=1&dest=2 +++ Tomas Doran [May 09 08 21:19 ]: >>> Still, why shouldn't refnames be allowed to have embedded brackets >>> and newlines, if explicit links can? >> >> To me those seem to be two entirely different things... The _text_ of >> the link has to be flexible to allow almost anything. refnames, on >> the other hand are identifiers and as such it makes sense for them to >> be more constrained. For instance, if you allow new lines in them it >> opens a whole bunch of questions as to what white space "counts". 
>> > > In Text::Markdown, I'm allowing new lines in link text: > > http://svn.kulp.ch/cpan/text_multimarkdown/trunk/t/Text- > Markdown.mdtest/Links_multiline_bugs_1.text > http://svn.kulp.ch/cpan/text_multimarkdown/trunk/t/Text- > Markdown.mdtest/Links_multiline_bugs_2.text > > As it was specifically reported to me as a bug: > > http://bugs.debian.org/459885 > > Cheers > Tom > > > _______________________________________________ > Markdown-Discuss mailing list > Markdown-Discuss at six.pairlist.net > http://six.pairlist.net/mailman/listinfo/markdown-discuss > From michel.fortin at michelf.com Fri May 9 22:26:36 2008 From: michel.fortin at michelf.com (Michel Fortin) Date: Fri, 09 May 2008 22:26:36 -0400 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <20080509190608.GA17937@berkeley.edu> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <20080509173941.GA10740@berkeley.edu> <20080509183533.GA15602@berkeley.edu> <20080509190608.GA17937@berkeley.edu> Message-ID: <55774298-C8A3-4BA1-8FE3-43AC6DD7F1B5@michelf.com> Le 2008-05-09 ? 15:06, John MacFarlane a ?crit : > Newer versions of Markdown.pl and PHP Markdown do allow the trailing > '[]' to be omitted. I think you should say "semi-private betas" of Markdown.pl and PHP Markdown allow trailing [] to be omitted. This feature isn't supported in the publicly available versions linked from Daring Fireball's Markdown page or my own project page for PHP Markdown. In the case of PHP Markdown, the code is there but only activated in some beta versions; otherwise it is commented out until a corresponding Markdown.pl goes out of beta with that feature. I've been wondering if I should enable it for PHP Markdown Extra though, in which case it'd have to be in the spec. 
Michel Fortin michel.fortin at michelf.com http://michelf.com/ From michel.fortin at michelf.com Fri May 9 22:44:29 2008 From: michel.fortin at michelf.com (Michel Fortin) Date: Fri, 09 May 2008 22:44:29 -0400 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <20080509173941.GA10740@berkeley.edu> Message-ID: <85347E33-285E-4469-9DC1-793537BD1032@michelf.com> Le 2008-05-09 ? 13:50, Yuri Takhteyev a ?crit : > In my reading of Michel's document, refnames are labels used to > connect "reference-style" links and footnotes to their definition. > That's different from link _text_. This is correct. > You can have > > [[hi]](/url) > > but you cannot have > > [Hi!][[hi]] > > [[hi]]: http://hi.com This is what the current spec says. It could still change though. I've added a note to the refname rule so I can look deeper into this later. > I am assuming that there will be a different type to deal with link > text. There will. Michel Fortin michel.fortin at michelf.com http://michelf.com/ From michel.fortin at michelf.com Fri May 9 23:06:03 2008 From: michel.fortin at michelf.com (Michel Fortin) Date: Fri, 09 May 2008 23:06:03 -0400 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> Message-ID: I've clarified and changed a few things about some parsing rules and started defining new rules for the block elements pass. Of notice is the "flat code block" in the block elements pass, which is is going to be part of the next version of PHP Markdown Extra. 

Michel Fortin
michel.fortin at michelf.com
http://michelf.com/

From michel.fortin at michelf.com  Sun May 11 08:31:59 2008
From: michel.fortin at michelf.com (Michel Fortin)
Date: Sun, 11 May 2008 08:31:59 -0400
Subject: PHP Markdown 1.0.1l & Extra 1.2
Message-ID: <2571D1DF-4F3C-4513-AFB3-E15EE035FA09@michelf.com>

Time for an update to PHP Markdown and PHP Markdown Extra.

This new version of PHP Markdown Extra adds support for "fenced" code
blocks (which I was previously calling "flat"). Fenced code blocks
overcome many limitations of Markdown's indented code blocks: they can
be put immediately following a list item, can start and end with
blank lines, and can be put one after the other as two consecutive code
blocks. Also, if you're using an editor which cannot automatically
indent a selected block of text, such as a text box in your web
browser, it's easier to paste code in.


1.0.1l (11 May 2008):

*   Now removing the UTF-8 BOM at the start of a document, if present.

*   Now accepting capitalized URI schemes (such as HTTP:) in automatic
    links, such as ``.

*   Fixed a problem where `` was seen as a horizontal rule instead of
    an automatic link.

*   Fixed an issue where some characters in Markdown-generated HTML
    attributes weren't properly escaped with entities.

*   Fix for code blocks as first element of a list item. Previously,
    this didn't create any code block for item 2:

        * Item 1 (regular paragraph)
        * Item 2 (code block)

*   A code block starting on the second line of a document wasn't seen
    as a code block. This has been fixed.

*   Added programmatically-settable parser properties `predef_urls` and
    `predef_titles` for predefined URLs and titles for reference-style
    links. To use this, your PHP code must call the parser this way:

        $parser = new Markdown_Parser;
        $parser->predef_urls = array('linkref' => 'http://example.com');
        $html = $parser->transform($text);

    You can then use the URL as a normal link reference:

        [my link][linkref]
        [my link][linkRef]

    Reference names in the parser properties *must* be lowercase.
    Reference names in the Markdown source may have any case.

*   Added `setup` and `teardown` methods which can be used by subclassers
    as hook points to arrange the state of some parser variables before and
    after parsing.


Extra 1.2 (11 May 2008):

*   Added fenced code block syntax which doesn't require indentation
    and can start and end with blank lines. A fenced code block
    starts with a line of consecutive tildes (~) and ends on the
    next line with the same number of consecutive tildes. Here's an
    example:

        ~~~~~~~~~~~~
        Hello World!
        ~~~~~~~~~~~~

*   Rewrote parts of the HTML block parser to better accommodate
    fenced code blocks.

*   Footnotes may now be referenced from within another footnote.

*   Added programmatically-settable parser property `predef_attr` for
    predefined attribute definitions.

*   Fixed an issue where an indented code block preceded by a blank
    line containing some other whitespace would confuse the HTML
    block parser into creating an HTML block when it should have
    been code.

Michel Fortin michel.fortin at michelf.com http://michelf.com/ From jacobolus at gmail.com Sun May 11 20:55:59 2008 From: jacobolus at gmail.com (Jacob Rus) Date: Sun, 11 May 2008 20:55:59 -0400 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> Message-ID: Michel Fortin wrote: > I've began writing the parsing section of the spec, and I though I'd let > you know about where I'm heading with all this. You should write it in something closer to a BNF-like format. The current version is about 10x more verbose than necessary, and it makes reading the spec considerably more difficult. From jacobolus at gmail.com Sun May 11 21:01:31 2008 From: jacobolus at gmail.com (Jacob Rus) Date: Sun, 11 May 2008 21:01:31 -0400 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> Message-ID: Michel Fortin wrote: > I've began writing the parsing section of the spec, and I though I'd > let you know about where I'm heading with all this. Also, you're still going to have quite a few sticky edge cases with your current parsing model. What happens when we have a `<>`-delimited URL inside a blockquote? For instance: > what about this google.com/> case? -Jacob From michel.fortin at michelf.com Sun May 11 22:26:33 2008 From: michel.fortin at michelf.com (Michel Fortin) Date: Sun, 11 May 2008 22:26:33 -0400 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> Message-ID: <445A584E-A771-436F-86B2-ACBBF71F8377@michelf.com> Le 2008-05-11 ? 20:55, Jacob Rus a ?crit : > You should write it in something closer to a BNF-like format. The > current version is about 10x more verbose than necessary, and it > makes reading the spec considerably more difficult. 
The reason I'm doing it like this is that I doubt everything will be expressible in a BNF format. Using plain english descriptions allows me to not bother about fitting things to a specific grammar and just write what I feel is the most natural and the easier to understand. Shopping for a more formal and less verbose grammar, if we need one, will be much easier once we know what we need, once we can compare existing grammars against a checklist of what is necessary to implement the given parsing algorithm. If you remember the timetable I've given, you'll see that I've booked about half a year for polishing things out. This includes rephrasing sentences, refactorizing the syntax, and reformatting the spec to make it easier to understand. This *could* include switching to a new grammar format if it makes things more intuitive and readable. > Also, you're still going to have quite a few sticky edge cases with > your current parsing model. What happens when we have a `<>`- > delimited URL inside a blockquote? For instance: > > > what about this > google.com/> case? Well, currently newlines aren't allowed inside automatic links in Markdown.pl, PHP Markdown and some others. Implementations who see an automatic link there sees it as a link to "http:// google.com/" (notice the space) or "http://" (notice what's missing). Anyway, with the parsing model in three passes I'm currently defining it's pretty trivial to do correctly: the block elements pass extracts the text of the blockquote, leaving this to parse by the span element pass: what about this case? The span element pass would then see an autolink and just ignore any newline it finds in the URL. 
Michel Fortin michel.fortin at michelf.com http://michelf.com/ From waylan at gmail.com Mon May 12 01:01:37 2008 From: waylan at gmail.com (Waylan Limberg) Date: Mon, 12 May 2008 01:01:37 -0400 Subject: Fenced-Code-Blocks in Python-Markdown Message-ID: I'd like to announce a beta release of the Fenced-Code-Blocks Extension for Python-Markdown. The latest code for Python-Markdown and packaged extensions are now available on Gitorious. The same syntax is used as the just released PHP Markdown Extra 1.2. I did add the option to define a class on the block for language identification. Here's an example: ~~~~~~~~~~~~

Hello World!

~~~~~~~~~~~~{.html} Becomes:
<p>Hello World!</p>
    
This should work nicely with Highlight.js [1] if one so desires. Of course, as this is optional, if you leave the class definition off, it works like PHP Markdown Extra. Unfortunately, including the class definition makes PHP Markdown Extra fail to match the block. Consider yourself warned. [1]: http://softwaremaniacs.org/soft/highlight/en/ On Sun, May 11, 2008 at 8:31 AM, Michel Fortin wrote: [snip] > > This new version of PHP Markdown Extra adds support for "fenced" code blocks > (which I was previously calling "flat"). Fenced code blocks overcome many > limitations of Markdown's indented code blocks: they can can be put > immediately following a list item, can start and end with blank lines, and > can be put one after the other as two consecutive code blocks. Also, if > you're using an editor which cannot indent automatically a selected block of > text, such as a text box in your web browser, it's easier to paste code in. > [snip] > > Extra 1.2 (11 May 2008): > > * Added fenced code block syntax which don't require indentation > and can start and end with blank lines. A fenced code block > starts with a line of consecutive tilde (~) and ends on the > next line with the same number of consecutive tilde. Here's an > example: > > ~~~~~~~~~~~~ > Hello World! > ~~~~~~~~~~~~ > [snip] -- ---- Waylan Limberg waylan at gmail.com From mar.nospam at anomy.net Mon May 12 06:58:29 2008 From: mar.nospam at anomy.net (=?ISO-8859-1?Q?M=E1r_=D6rlygsson?=) Date: Mon, 12 May 2008 10:58:29 +0000 Subject: Fenced-Code-Blocks in Python-Markdown In-Reply-To: References: Message-ID: >
<p>Hello World!</p>
>    
It would be a much better idea to place the class-name on the outer
element (`<pre>`)


-- 
Már

From michel.fortin at michelf.com  Mon May 12 07:16:12 2008
From: michel.fortin at michelf.com (Michel Fortin)
Date: Mon, 12 May 2008 07:16:12 -0400
Subject: Fenced-Code-Blocks in Python-Markdown
In-Reply-To: 
References: 
Message-ID: <642AFD2C-E2CC-4849-976E-925D539472B1@michelf.com>

On 2008-05-12 at 1:01, Waylan Limberg wrote:

> I'd like to announce a beta release of the Fenced-Code-Blocks
> Extension for Python-Markdown.
>
>  >

Wow, that was fast! :-)

There is one peculiarity in my implementation I think you've missed,
though (which is not surprising given that I haven't mentioned it
anywhere yet). It affects the output when there are leading blank lines.

If you try this in various browsers:

	
	a
	
You'll notice, when serving the file as text/html, that the first newline is lost. When parsing the file as XML, it stays there however. If you then add:

	a
	
you'll then notice that the first line is there in Gecko, but still lost with WebKit. So basically, you can't just add newlines at the start of a code block and expect it to work. PHP Markdown Extra's output when you have a leading newline is as follows:

a
which solves the situation in both Safari and Gecko-based browsers (haven't tested in others). If you have two leading newlines, then you'll get two
s, etc. > The same syntax is used as the just released PHP Markdown Extra 1.2. I > did add the option to define a class on the block for language > identification. Here's an example: > > ~~~~~~~~~~~~ >

Hello World!

> ~~~~~~~~~~~~{.html} > > Becomes: > >
<p>Hello World!</p>
>    
While it has been suggested some time ago that {.class-name} stand for a class attribute applied to an arbitrary element, I'm wondering if we can't do something better than that for code blocks. I'm currently thinking of allowing the following, which I find more appealing visually: ~~~~~~~~~~~~~~ .html

Hello World!

~~~~~~~~~~~~~~ > Unfortunately, including the class > definition makes PHP Markdown Extra fail to match the block. Consider > yourself warned. Indeed. Michel Fortin michel.fortin at michelf.com http://michelf.com/ From waylan at gmail.com Mon May 12 08:44:55 2008 From: waylan at gmail.com (Waylan Limberg) Date: Mon, 12 May 2008 08:44:55 -0400 Subject: Fenced-Code-Blocks in Python-Markdown In-Reply-To: References: Message-ID: On Mon, May 12, 2008 at 6:58 AM, M?r ?rlygsson wrote: > >
<p>Hello World!</p>
>  >    
>
>  It would be a much better idea to place the class-name on the outer
>  element (`<pre>`)
>
>
I had initially intended on doing that myself. However, I figured
support for Highlight.js would be more important. Besides, while I
generally like styling hooks on the outermost level in html, I can't
imagine any benefit in this scenario. And when you think about it, it
is the "code" that determines the styling, so logically the "code" tag
should get the styling hook. I suspect that was the reasoning behind
Highlight.js.



-- 
----
Waylan Limberg
waylan at gmail.com
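
The two placements debated above can be compared side by side. What follows is a minimal sketch in Python, assuming a hypothetical helper `render_code_block` (it is not part of Python-Markdown or Highlight.js) whose `class_on` parameter selects which element receives the class:

```python
import html


def render_code_block(code, lang=None, class_on="code"):
    """Wrap escaped code in <pre><code>, putting the class on one element.

    Hypothetical helper for illustration only; `class_on` is "code" or "pre".
    """
    if class_on not in ("code", "pre"):
        raise ValueError("class_on must be 'code' or 'pre'")
    # Build the class attribute once, then attach it to the chosen element.
    attr = ' class="%s"' % html.escape(lang, quote=True) if lang else ""
    pre_attr = attr if class_on == "pre" else ""
    code_attr = attr if class_on == "code" else ""
    # Escape <, > and & so the code is shown literally rather than parsed.
    body = html.escape(code, quote=False)
    return "<pre%s><code%s>%s</code></pre>" % (pre_attr, code_attr, body)


# Class on the inner element, then on the outer one.
print(render_code_block("<p>Hello World!</p>\n", "html"))
print(render_code_block("<p>Hello World!</p>\n", "html", class_on="pre"))
```

Either way the nesting is the same, so a stylesheet or script can reach both elements; only the selector it needs differs.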

From waylan at gmail.com  Mon May 12 09:02:49 2008
From: waylan at gmail.com (Waylan Limberg)
Date: Mon, 12 May 2008 09:02:49 -0400
Subject: Fenced-Code-Blocks in Python-Markdown
In-Reply-To: <642AFD2C-E2CC-4849-976E-925D539472B1@michelf.com>
References: 
	<642AFD2C-E2CC-4849-976E-925D539472B1@michelf.com>
Message-ID: 

On Mon, May 12, 2008 at 7:16 AM, Michel Fortin
 wrote:
> On 2008-05-12 at 1:01, Waylan Limberg wrote:
>
>
>
> > I'd like to announce a beta release of the Fenced-Code-Blocks
> > Extension for Python-Markdown.
> >
> > 
> >
>
>  Wow, that was fast! :-)

Thanks. Python-Markdown's extension API made it easy. Wasn't much more
than getting the regex put together (due to differences in regex
implementations, I couldn't just copy your PHP regex).
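
A minimal sketch of the kind of regex involved, in Python. This is neither the extension's actual pattern nor PHP Markdown Extra's; the names `FENCE_RE` and `find_fenced_blocks` and the minimum of three tildes are assumptions made for illustration. It requires a closing fence of exactly the same length as the opening one, optionally followed by a `{.class}` label:

```python
import re

# Illustrative pattern only (an assumption, not the real extension's regex).
FENCE_RE = re.compile(
    r"^(?P<fence>~{3,})[ ]*\n"                # opening fence alone on its line
    r"(?P<code>.*?)"                          # body, matched as lazily as possible
    r"(?<=\n)(?P=fence)"                      # closing fence: same run of tildes
    r"[ ]*(?:\{\.(?P<lang>[\w-]+)\})?[ ]*$",  # optional {.class} label
    re.MULTILINE | re.DOTALL,
)


def find_fenced_blocks(text):
    """Return a list of (lang, code) pairs; lang is None when no label is given."""
    return [(m.group("lang"), m.group("code")) for m in FENCE_RE.finditer(text)]


sample = "~~~~\n<p>Hello World!</p>\n~~~~{.html}\n"
print(find_fenced_blocks(sample))
```

The backreference `(?P=fence)` followed by the end-of-line anchor is what enforces the equal-length rule: a shorter or longer closing line does not match.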

>
>  There is one particularity in my implementation I think you've missed
> though (which is not surprising given I haven't mentioned it anywhere, yet).
> It touches the output when there are leading blank lines.

Actually, I did see that in testing your implementation, but wasn't
sure exactly why you were doing it. I meant to come back to it later
and forgot about it. Shouldn't be too hard to add though. Your notes
will no doubt be helpful.
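
A minimal sketch of the leading-blank-line handling, in Python. The exact markup PHP Markdown Extra emits did not survive in the archived message, so the `<br />` default and the helper name `render_fenced_body` are assumptions. The idea is only that each leading newline is replaced by an explicit element, because an HTML parser drops a newline that immediately follows the opening `<pre>` tag:

```python
import html


def render_fenced_body(code, filler="<br />"):
    """Escape a code block body, replacing each leading newline with `filler`.

    Hypothetical helper; `filler` is an assumed stand-in for whatever
    element the real implementation emits.
    """
    stripped = code.lstrip("\n")
    leading = len(code) - len(stripped)  # number of leading newlines removed
    return filler * leading + html.escape(stripped, quote=False)


# Two leading blank lines become two filler elements.
print(render_fenced_body("\n\na\n"))
```

Newlines elsewhere in the body are left alone, since only the one directly after the opening tag is at risk.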

[snip]
>
>  While it has been suggested some time ago that {.class-name} stand for a
> class attribute applied to an arbitrary element, I'm wondering if we can't
> do something better than that for code blocks.
>
>  I'm currently thinking of allowing the following, which I find more
> appealing visually:
>
>     ~~~~~~~~~~~~~~ .html
>     

Hello World!

> ~~~~~~~~~~~~~~ > Actually, I had done it that way first. Then I went back and reviewed the previous discussions. Interestingly, I had originally made the case for having the label on the top, rather than the bottom. But after further thinking, I realized that my current implementation is consistent with the HeaderID syntax. Given the number of complaints we've been getting on the list recently about styling inconsistencies in the syntax, I figured that made for a stronger argument so I used curly brackets at the end. If the consensus is on something different, I can work with it. -- ---- Waylan Limberg waylan at gmail.com From nichols7 at googlemail.com Mon May 12 17:16:15 2008 From: nichols7 at googlemail.com (Thomas Nichols) Date: Mon, 12 May 2008 22:16:15 +0100 Subject: Fenced-Code-Blocks in Python-Markdown In-Reply-To: References: <642AFD2C-E2CC-4849-976E-925D539472B1@michelf.com> Message-ID: <4828B39F.1010207@googlemail.com> Waylan Limberg wrote on 2008/05/12 14:02: > On Mon, May 12, 2008 at 7:16 AM, Michel Fortin > wrote: > [snip] >> While it has been suggested some time ago that {.class-name} stand for a >> class attribute applied to an arbitrary element, I'm wondering if we can't >> do something better than that for code blocks. >> >> I'm currently thinking of allowing the following, which I find more >> appealing visually: >> >> ~~~~~~~~~~~~~~ .html >>

Hello World!

>> ~~~~~~~~~~~~~~ >> >> > > Actually, I had done it that way first. Then I went back and reviewed > the previous discussions. Interestingly, I had originally made the > case for having the label on the top, rather than the bottom. But > after further thinking, I realized that my current implementation is > consistent with the HeaderID syntax. Given the number of complaints > we've been getting on the list recently about styling inconsistencies > in the syntax, I figured that made for a stronger argument so I used > curly brackets at the end. If the consensus is on something different, > I can work with it. > > > Styling consistency is surely a boon, but having the open-fence and close-fence markers indistinguishable seems problematic, as per http://six.pairlist.net/pipermail/markdown-discuss/2007-December/000899.html An attribute list could be used to make this distinction, though it doesn't seem a strikingly elegant solution. Perhaps any text immediately following the ~~~~ that is _not_ an attribute list (i.e. has no {braces}) could be silently ignored? This would allow ~~~~ begin $eix app-misc/anki [I] app-misc/anki [1] Homepage: http://ichi2.net/anki/ Description: Anki - a friendly, intelligent spaced learning system. ~~~~ The closing fence could optionally have `{.html}` appended. Or would ignoring any non-braced content after the ~~~~ cause other problems? -- Thomas From jgm at berkeley.edu Mon May 12 17:50:12 2008 From: jgm at berkeley.edu (John MacFarlane) Date: Mon, 12 May 2008 14:50:12 -0700 Subject: PHP Markdown 1.0.1l & Extra 1.2 In-Reply-To: <2571D1DF-4F3C-4513-AFB3-E15EE035FA09@michelf.com> References: <2571D1DF-4F3C-4513-AFB3-E15EE035FA09@michelf.com> Message-ID: <20080512215012.GA21000@berkeley.edu> The development version of pandoc has for some time implemented "fenced" code blocks. See http://code.google.com/p/pandoc/source/browse/trunk/README#717 for documentation. Two small differences with your syntax: 1. 
pandoc allows the block to be ended by a line of tildes of the same length as *or longer than* the start line. Reason: It's easy to produce a longer line by just eyeballing it, whereas to produce a line of exactly the same length, you generally need to cut and paste. I'm following the general markdown philosophy of allowing slop when it's harmless. 2. pandoc allows { .haskell .number-lines } (or whatever) after the *top* line of tildes. If pandoc is compiled with highlighting support, it uses this information to highlight the block; otherwise, it just includes it as a class. Thanks for pointing out the complexity involving leading blank lines in code blocks; I'll make a note to fix that. If anyone wants to try pandoc with the delimited ("fenced") code blocks and built-in syntax highlighting, see http://groups.google.com/group/pandoc-discuss/browse_thread/thread/b59714fcfc7a7e69 for instructions. John +++ Michel Fortin [May 11 08 08:31 ]: > Time for an update to PHP Markdown and PHP Markdown Extra. > > > > This new version of PHP Markdown Extra adds support for "fenced" code > blocks (which I was previously calling "flat"). Fenced code blocks > overcome many limitations of Markdown's indented code blocks: they can > can be put immediately following a list item, can start and end with > blank lines, and can be put one after the other as two consecutive code > blocks. Also, if you're using an editor which cannot indent > automatically a selected block of text, such as a text box in your web > browser, it's easier to paste code in. > > > > > 1.0.1l (11 May 2008): > > * Now removing the UTF-8 BOM at the start of a document, if present. > > * Now accepting capitalized URI schemes (such as HTTP:) in automatic > links, such as ``. > > * Fixed a problem where `
` was seen as a horizontal > rule instead of an automatic link. > > * Fixed an issue where some characters in Markdown-generated HTML > attributes weren't properly escaped with entities. > > * Fix for code blocks as first element of a list item. Previously, > this didn't create any code block for item 2: > > * Item 1 (regular paragraph) > > * Item 2 (code block) > > * A code block starting on the second line of a document wasn't seen > as a code block. This has been fixed. > > * Added programatically-settable parser properties `predef_urls` and > `predef_titles` for predefined URLs and titles for reference-style > links. To use this, your PHP code must call the parser this way: > > $parser = new Markdwon_Parser; > $parser->predef_urls = array('linkref' => 'http://example.com'); > $html = $parser->transform($text); > > You can then use the URL as a normal link reference: > > [my link][linkref] > [my link][linkRef] > > Reference names in the parser properties *must* be lowercase. > Reference names in the Markdown source may have any case. > > * Added `setup` and `teardown` methods which can be used by subclassers > as hook points to arrange the state of some parser variables before and > after parsing. > > > Extra 1.2 (11 May 2008): > > * Added fenced code block syntax which don't require indentation > and can start and end with blank lines. A fenced code block > starts with a line of consecutive tilde (~) and ends on the > next line with the same number of consecutive tilde. Here's an > example: > > ~~~~~~~~~~~~ > Hello World! > ~~~~~~~~~~~~ > > * Rewrote parts of the HTML block parser to better accomodate > fenced code blocks. > > * Footnotes may now be referenced from within another footnote. > > * Added programatically-settable parser property `predef_attr` for > predefined attribute definitions. 
> > * Fixed an issue where an indented code block preceded by a blank > line containing some other whitespace would confuse the HTML > block parser into creating an HTML block when it should have > been code. > > > Michel Fortin > michel.fortin at michelf.com > http://michelf.com/ > > > _______________________________________________ > Markdown-Discuss mailing list > Markdown-Discuss at six.pairlist.net > http://six.pairlist.net/mailman/listinfo/markdown-discuss > From jgm at berkeley.edu Mon May 12 18:14:32 2008 From: jgm at berkeley.edu (John MacFarlane) Date: Mon, 12 May 2008 15:14:32 -0700 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <445A584E-A771-436F-86B2-ACBBF71F8377@michelf.com> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <445A584E-A771-436F-86B2-ACBBF71F8377@michelf.com> Message-ID: <20080512221432.GB21000@berkeley.edu> +++ Michel Fortin [May 11 08 22:26 ]: > Le 2008-05-11 ? 20:55, Jacob Rus a ?crit : > >> You should write it in something closer to a BNF-like format. The >> current version is about 10x more verbose than necessary, and it makes >> reading the spec considerably more difficult. > > The reason I'm doing it like this is that I doubt everything will be > expressible in a BNF format. You can come pretty close with a PEG grammar: http://github.com/jgm/peg-markdown/tree/master/markdown_parser.leg#L236 I have implemented the basic markdown syntax + the footnote syntax from PHP markdown extra, and so far I've found only two things that can't be cleanly expressed using a PEG: 1. Indented block contexts like lists and blockquotes. Here I use a multi-pass approach. The first pass takes, say, a list item 1. my list item - with - nested list and returns a listitem with "raw" contents my list item - with - nested list which are piped through the markdown parser again. 2. Inline code. 
PEG can't express "a row of backticks, followed by a string of characters not containing an equally long row of backticks, followed by an equally long row of backticks." It can express, for particular values of N, "a row of N backticks, followed by a string of characters not containing a row of N backticks, followed by a row of N backticks." So if you have a fixed limit on the number of backticks that can start a stretch of inline code, you're okay. peg-markdown sets this limit at 5, which should be enough for most purposes. But one could set it higher without much of a performance penalty. The PEG representation is concise, precise, and readable. But the big advantage is that it can be converted automatically into a fast parser. This means that you can be sure that your markdown program really does implement the formal specification. An informal English specification won't give you that. John From mar.nospam at anomy.net Mon May 12 20:23:25 2008 From: mar.nospam at anomy.net (=?ISO-8859-1?Q?M=E1r_=D6rlygsson?=) Date: Tue, 13 May 2008 00:23:25 +0000 Subject: Fenced-Code-Blocks in Python-Markdown In-Reply-To: References: Message-ID: Waylan Limberg wrote: > I had initially intended on doing that myself. However, I figured > support for Highlight.js would be more important. Placing the class-name on the outer element gives more styling flexibility. I'd like to think Markdown will be around much longer than highlight.js. Also: libraries change, gain features, etc... 
-- Már From jgm at berkeley.edu Mon May 12 21:55:19 2008 From: jgm at berkeley.edu (John MacFarlane) Date: Mon, 12 May 2008 18:55:19 -0700 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <85347E33-285E-4469-9DC1-793537BD1032@michelf.com> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <20080509173941.GA10740@berkeley.edu> <85347E33-285E-4469-9DC1-793537BD1032@michelf.com> Message-ID: <20080513015519.GA30186@berkeley.edu> >> I am assuming that there will be a different type to deal with link >> text. > > There will. Is there any good reason for having two different types here? As far as I can see, allowing anything that can serve as link text to be a refname would not contradict anything in the official Markdown syntax specification. In addition, it is hard to imagine a realistic case where allowing brackets and newlines in refnames would break an existing document. Why make users remember extra restrictions? (I didn't even know about them until a few days ago, and I've used markdown for years.) And why expose users to the risk that their documents will break if they hard-wrap a long refname? I think the current behavior of phpmarkdown and Markdown.pl is very confusing. This produces a link: [[hi]][] [[hi]]: /url But this doesn't produce a link: [hello][[hi]] [[hi]]: /url So either (a) not all link references begin with a refname, or (b) refnames can sometimes (but not always!) contain embedded brackets. Either option would conflict with Michel's syntax specification as it now stands. John From michel.fortin at michelf.com Mon May 12 23:28:59 2008 From: michel.fortin at michelf.com (Michel Fortin) Date: Mon, 12 May 2008 23:28:59 -0400 Subject: Fenced-Code-Blocks in Python-Markdown In-Reply-To: References: <642AFD2C-E2CC-4849-976E-925D539472B1@michelf.com> Message-ID: <9DE21550-099F-41F0-8532-BDDB260FC97F@michelf.com> Le 2008-05-12 à 
9:02, Waylan Limberg a ?crit : >> I'm currently thinking of allowing the following, which I find more >> appealing visually: >> >> ~~~~~~~~~~~~~~ .html >>

Hello World!

>> ~~~~~~~~~~~~~~ >> > > Actually, I had done it that way first. Then I went back and reviewed > the previous discussions. Interestingly, I had originally made the > case for having the label on the top, rather than the bottom. But > after further thinking, I realized that my current implementation is > consistent with the HeaderID syntax. More or less. The header id syntax always puts the id value on the first line. :-) > Given the number of complaints we've been getting on the list > recently about styling inconsistencies in the syntax, I figured that > made for a stronger argument so I used curly brackets at the end. If > the consensus is on something different, I can work with it. If I'm not mistaken, most of the complains are about the inconsistencies between different implementations of the syntax, resulting, at least in part, of the lack of a precise and unambiguous grammar. In the example above, I try to think of the ".html" as a file extension indicating the format of the code snippet and that using it as a class name in the HTML representation is just coincidental. I think it feels very natural that way. I wouldn't recommend doing class names outside curly brakets in general. Only here, it fits pretty nicely. What do other people think? Michel Fortin michel.fortin at michelf.com http://michelf.com/ From michel.fortin at michelf.com Tue May 13 00:18:00 2008 From: michel.fortin at michelf.com (Michel Fortin) Date: Tue, 13 May 2008 00:18:00 -0400 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <20080513015519.GA30186@berkeley.edu> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <20080509173941.GA10740@berkeley.edu> <85347E33-285E-4469-9DC1-793537BD1032@michelf.com> <20080513015519.GA30186@berkeley.edu> Message-ID: <400AA59A-8D5D-4DC8-9D92-75BB0414C30C@michelf.com> Le 2008-05-12 ? 21:55, John MacFarlane a ?crit : >>> I am assuming that there will be a different type to deal with link >>> text. >> >> There will. 
> > Is there any good reason for having two different types here? The link text can contain other span-level elements, such as emphasis, code blocks, etc. This *has* to be taken into account while parsing. On the other hand, text in the reference part is just plain text. > As far as > I can see, allowing anything that can serve as link text to be a > refname > would not contradict anything in the official Markdown syntax > specification. In addition, it is hard to imagine a realistic case > where > allowing brackets and newlines in refnames would break an existing > document. Why make users remember extra restrictions? (I didn't even > know about them until a few days ago, and I've used markdown for > years.) > And why expose users to the risk that their documents will break if > they > hard-wrap a long refname? I'm in favor of allowing hard-wrapped reference names where the line break is not significant, so that will probably end up in the spec when I write the part about parsing the link span element. Please keep in mind that the current refname construct is for the reference name in link definitions, and may be different from the one used in the link span element. > I think the current behavior of phpmarkdown and Markdown.pl is very > confusing. This produces a link: > > [[hi]][] > > [[hi]]: /url > > But this doesn't produce a link: > > [hello][[hi]] > > [[hi]]: /url > > So either (a) not all link references begin with a refname, or (b) > refnames can sometimes (but not always!) contain embedded brackets. > Either option would conflict with Michel's syntax specification > as it now stands. This situation is indeed inconsistant. I'd be in favor of allowing balanced square brakets in link reference, even though John Gruber seems (or seemed in 2006) to think they should be disallowed completely. 
Michel Fortin michel.fortin at michelf.com http://michelf.com/ From michel.fortin at michelf.com Tue May 13 00:32:19 2008 From: michel.fortin at michelf.com (Michel Fortin) Date: Tue, 13 May 2008 00:32:19 -0400 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <20080512221432.GB21000@berkeley.edu> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <445A584E-A771-436F-86B2-ACBBF71F8377@michelf.com> <20080512221432.GB21000@berkeley.edu> Message-ID: <3E7AFA5E-1E46-4750-B322-13FA8C716866@michelf.com> Le 2008-05-12 ? 18:14, John MacFarlane a ?crit : > The PEG representation is concise, precise, and readable. Readable, hum... if I look at this rule from PEG Markdown: ListContinuationBlock = a:StartList ( BlankLines { if (strlen($$.contents.str) == 0) $$.contents.str = strdup("\001"); /* block separator */ pushelt($$, &a); } ) ( Indent ListBlock { pushelt($$, &a); } )+ { $$ = mk_str(concat_string_list(reverse(a.children))); } it looks a lot like code to me, half of it I don't understand. If we're going this way, there's going to be a learning curve: for me, and for everyone trying to understand the syntax. I'd prefer to avoid forcing people to learn a new language only to understand the specification. Michel Fortin michel.fortin at michelf.com http://michelf.com/ From jacobolus at gmail.com Tue May 13 01:11:27 2008 From: jacobolus at gmail.com (Jacob Rus) Date: Tue, 13 May 2008 01:11:27 -0400 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <445A584E-A771-436F-86B2-ACBBF71F8377@michelf.com> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <445A584E-A771-436F-86B2-ACBBF71F8377@michelf.com> Message-ID: Michel Fortin wrote: > Anyway, with the parsing model in three passes I'm currently defining > it's pretty trivial to do correctly: the block elements pass extracts > the text of the blockquote, leaving this to parse by the span element pass: > > what about this google.com/> case? 
> > The span element pass would then see an autolink and just ignore any > newline it finds in the URL. Ah, okay. Somehow I misread that. Yes, that seems about right. -Jacob From jacobolus at gmail.com Tue May 13 01:14:49 2008 From: jacobolus at gmail.com (Jacob Rus) Date: Tue, 13 May 2008 01:14:49 -0400 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <3E7AFA5E-1E46-4750-B322-13FA8C716866@michelf.com> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <445A584E-A771-436F-86B2-ACBBF71F8377@michelf.com> <20080512221432.GB21000@berkeley.edu> <3E7AFA5E-1E46-4750-B322-13FA8C716866@michelf.com> Message-ID: Michel Fortin wrote: > Le 2008-05-12 ? 18:14, John MacFarlane a ?crit : > >> The PEG representation is concise, precise, and readable. > > Readable, hum... if I look at this rule from PEG Markdown: > > ListContinuationBlock = a:StartList > ( BlankLines > { if (strlen($$.contents.str) == 0) > $$.contents.str = strdup("\001"); /* block separator */ > pushelt($$, &a); } ) > ( Indent ListBlock { pushelt($$, &a); } )+ > { $$ = mk_str(concat_string_list(reverse(a.children))); } > > it looks a lot like code to me, half of it I don't understand. If we're > going this way, there's going to be a learning curve: for me, and for > everyone trying to understand the syntax. I'd prefer to avoid forcing > people to learn a new language only to understand the specification. Yeah, that's worse. Mainly I just would suggest taking all those numbered lists of things, and putting them on a single line. It's not that it has to be BNF or EBNF/ABNF/whatever, but parts which *can* be expressed in such a way, and can be condensed to fit in a more compact space, should be. The current numbered lists + English approach, in many parts of your current work, just add visual clutter. 
:) -Jacob From jgm at berkeley.edu Tue May 13 01:28:18 2008 From: jgm at berkeley.edu (John MacFarlane) Date: Mon, 12 May 2008 22:28:18 -0700 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <400AA59A-8D5D-4DC8-9D92-75BB0414C30C@michelf.com> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <20080509173941.GA10740@berkeley.edu> <85347E33-285E-4469-9DC1-793537BD1032@michelf.com> <20080513015519.GA30186@berkeley.edu> <400AA59A-8D5D-4DC8-9D92-75BB0414C30C@michelf.com> Message-ID: <20080513052818.GA11967@berkeley.edu> +++ Michel Fortin [May 13 08 00:18 ]: >> Is there any good reason for having two different types here? > > The link text can contain other span-level elements, such as emphasis, > code blocks, etc. This *has* to be taken into account while parsing. On > the other hand, text in the reference part is just plain text. But it needn't be. Both constructs could be parsed as sequences of inline elements. That's how pandoc and peg-markdown treat them. John From jgm at berkeley.edu Tue May 13 02:06:11 2008 From: jgm at berkeley.edu (John MacFarlane) Date: Mon, 12 May 2008 23:06:11 -0700 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <3E7AFA5E-1E46-4750-B322-13FA8C716866@michelf.com> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <445A584E-A771-436F-86B2-ACBBF71F8377@michelf.com> <20080512221432.GB21000@berkeley.edu> <3E7AFA5E-1E46-4750-B322-13FA8C716866@michelf.com> Message-ID: <20080513060611.GB11967@berkeley.edu> +++ Michel Fortin [May 13 08 00:32 ]: > Le 2008-05-12 ? 18:14, John MacFarlane a ?crit : > >> The PEG representation is concise, precise, and readable. > > Readable, hum... 
if I look at this rule from PEG Markdown: > > ListContinuationBlock = a:StartList > ( BlankLines > { if (strlen($$.contents.str) == 0) > $$.contents.str = strdup("\001"); /* block separator */ > pushelt($$, &a); } ) > ( Indent ListBlock { pushelt($$, &a); } )+ > { $$ = mk_str(concat_string_list(reverse(a.children))); } > > it looks a lot like code to me, half of it I don't understand. Well, you've picked the ugliest part. But don't be repelled too quickly. Note that the stuff between { } is C code that constructs the syntax tree. If you just want to see the syntax specification, you can pretty much ignore those parts. The "StartList" bit can also be ignored, as it just initializes a list. With that stripped out, you get: ListContinuationBlock = Blanklines (Indent ListBlock)+ That is, a list continuation block is some blank lines followed by one or more ListBlocks, each preceded by indentation. That seems pretty readable to me. Here's the part that concerns the recent discussion about "refname" (again, I've omitted the {} parts and parts that modify the rules depending on which syntax extensions have been selected). Reference = NonindentSpace Label ':' Spnl RefSrc Spnl RefTitle BlankLine* A reference is some space of less than one indent, followed by a Label, followed by ':', followed by optional blank space including at most one newline, followed by a RefSrc, followed by optional blank space including at most one newline, followed by a RefTitle, followed by optional blanklines. (You may not agree with that. But it's easy to see how to modify the rule above if, for example, you don't think leading space should be allowed.) Label = '[' (!']' Inline)+ ']' A label is a '[' followed by a string of one or more Inline elements that don't begin with ']', followed by ']'. (Note: this allows text within balanced brackets, which will be parsed as a single Inline element.) RefSrc = Nonspacechar+ A RefSrc is a string of one or more nonspace characters. And so on. 
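As an aside for readers following along, the Label rule above can be read as a small recursive-descent matcher. The sketch below is only an illustration under loud assumptions: `match_inline` is a crude, hypothetical stand-in for peg-markdown's real Inline rule (it knows only "balanced bracket group" and "any single character"), and the function names are invented here, not taken from markdown.leg.

```python
def match_inline(text, pos):
    """Hypothetical, simplified Inline: a balanced [...] group, else one character."""
    if pos >= len(text):
        return None
    if text[pos] == "[":
        end = match_label(text, pos)      # try a nested, balanced bracket group first
        if end is not None:
            return end
    return pos + 1                        # ordered choice: fall back to a single character


def match_label(text, pos=0):
    """Label = '[' (!']' Inline)+ ']'  -> end position of the match, or None."""
    if pos >= len(text) or text[pos] != "[":
        return None
    pos += 1
    count = 0
    while pos < len(text) and text[pos] != "]":   # the !']' lookahead
        nxt = match_inline(text, pos)
        if nxt is None:
            break
        pos = nxt
        count += 1
    if count == 0 or pos >= len(text) or text[pos] != "]":
        return None                       # need at least one Inline and a closing ']'
    return pos + 1


print(match_label("[hi]"))      # 4
print(match_label("[[hi]]"))    # 6
print(match_label("[]"))        # None
```

Under this reading, `[[hi]]` is a perfectly good label, because the inner `[hi]` is swallowed as one balanced unit before the outer `]` is looked for.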
Again, a lot of the ugliness of the specification is due to the C code that constructs the parse tree. If that bothers you, you might like the Haskell version better (though there are a few problems with that grammar that I have corrected in the C version). It contains a minimum of embedded code, and the whole grammar fits in just 160 lines: http://github.com/jgm/markdown-peg/tree/4438c336444a714f15ed619c9897d91c3ab6b40e/Markdown.hs#L68 I've been working on the C version mainly because more people have access to a C compiler, and because it's significantly faster than the Haskell. John From jgm at berkeley.edu Tue May 13 02:20:13 2008 From: jgm at berkeley.edu (John MacFarlane) Date: Mon, 12 May 2008 23:20:13 -0700 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <3E7AFA5E-1E46-4750-B322-13FA8C716866@michelf.com> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <445A584E-A771-436F-86B2-ACBBF71F8377@michelf.com> <20080512221432.GB21000@berkeley.edu> <3E7AFA5E-1E46-4750-B322-13FA8C716866@michelf.com> Message-ID: <20080513062013.GA14024@berkeley.edu> > If we're going this way, there's going to be a learning curve: for > me, and for everyone trying to understand the syntax. I'd prefer to > avoid forcing people to learn a new language only to understand the > specification. PS. Here's all you have to learn in order to write or read a PEG grammar.

    A B C       A followed by B followed by C
    A | B       A or B (ordered choice)
    A+          one or more As
    A*          zero or more As
    A?          optional A
    !A          not followed by A
    &A          followed by A (but does not consume A)
    (A B)       grouping
    .           matches any character
    'x'         matches the character 'x'
    "string"    matches the string "string"
    [a-z]       matches a character from 'a' to 'z'

English could be used to specify how a semantic value is to be constructed for each matching rule. This part would be implemented differently in different languages, but the basic PEG grammar would be the same. 
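To make the operator list above concrete, here is a minimal sketch of each operator as a Python function. It is an illustration only, not how peg/leg or Frisby are implemented: every name below is made up for this example, and a "parser" is assumed to be a function taking `(text, pos)` and returning the new position or `None`. The closing example borrows the fixed-N inline-code idea mentioned earlier in the thread, with N limited to 1 and 2 for brevity.

```python
# Each parser: (text, pos) -> new position, or None on failure.
# All names are hypothetical; nothing here comes from peg/leg or Frisby.

def lit(s):                      # 'x' or "string"
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None

def any_char(text, pos):         # .
    return pos + 1 if pos < len(text) else None

def char_range(lo, hi):          # [a-z]
    return lambda text, pos: (
        pos + 1 if pos < len(text) and lo <= text[pos] <= hi else None)

def seq(*parsers):               # A B C
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def choice(*parsers):            # A | B  (ordered: the first success wins)
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse

def many(p):                     # A*
    def parse(text, pos):
        while True:
            end = p(text, pos)
            if end is None or end == pos:   # stop on failure or no progress
                return pos
            pos = end
    return parse

def many1(p):                    # A+
    return seq(p, many(p))

def opt(p):                      # A?
    def parse(text, pos):
        end = p(text, pos)
        return pos if end is None else end
    return parse

def not_(p):                     # !A  (succeeds without consuming input)
    return lambda text, pos: pos if p(text, pos) is None else None

def and_(p):                     # &A  (succeeds without consuming input)
    return lambda text, pos: pos if p(text, pos) is not None else None

def code_span(n):
    """N backticks, one or more characters not starting N backticks, N backticks."""
    ticks = lit("`" * n)
    return seq(ticks, many1(seq(not_(ticks), any_char)), ticks)

# Ordered choice over a fixed set of N values, longest fence first.
inline_code = choice(code_span(2), code_span(1))

print(inline_code("`a`", 0))       # 3
print(inline_code("``a`b``", 0))   # 7
print(inline_code("``a`", 0))      # None
```

Raising the limit on N would just mean adding more alternatives to `choice`, which is why a fixed cap is cheap but an unbounded "equally long row" is not expressible this way.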
John From michel.fortin at michelf.com Tue May 13 08:52:56 2008 From: michel.fortin at michelf.com (Michel Fortin) Date: Tue, 13 May 2008 08:52:56 -0400 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <20080513062013.GA14024@berkeley.edu> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <445A584E-A771-436F-86B2-ACBBF71F8377@michelf.com> <20080512221432.GB21000@berkeley.edu> <3E7AFA5E-1E46-4750-B322-13FA8C716866@michelf.com> <20080513062013.GA14024@berkeley.edu> Message-ID: Le 2008-05-13 à 2:20, John MacFarlane a écrit : > PS. Here's all you have to learn in order to write or read a PEG > grammar.
>
> A B C       A followed by B followed by C
> A | B       A or B (ordered choice)
> A+          one or more As
> A*          zero or more As
> A?          optional A
> !A          not followed by A
> &A          followed by A (but does not consume A)
> (A B)       grouping
> .           matches any character
> 'x'         matches the character 'x'
> "string"    matches the string "string"
> [a-z]       matches a character from 'a' to 'z'

It's certainly true that many parts could be converted to this and be less verbose, and I find this idea appealing. I doubt the whole Markdown Extra ruleset can be expressed in this format though. Can a PEG grammar have parametrized rules? I've just added nested block element support in the spec. This is done by having the block element generator (formerly the block element pass) have a stack of rules to match when starting each line. This idea comes straight from Allan Odgaard's explanation of his lost Markdown parser. 
Michel Fortin michel.fortin at michelf.com http://michelf.com/ From jgm at berkeley.edu Tue May 13 10:02:48 2008 From: jgm at berkeley.edu (John MacFarlane) Date: Tue, 13 May 2008 07:02:48 -0700 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <445A584E-A771-436F-86B2-ACBBF71F8377@michelf.com> <20080512221432.GB21000@berkeley.edu> <3E7AFA5E-1E46-4750-B322-13FA8C716866@michelf.com> <20080513062013.GA14024@berkeley.edu> Message-ID: <20080513140248.GA2358@berkeley.edu> > It certainly true that many parts could be converted to this and be less > verbose, and I find this idea appealing. I doubt the whole Markdown Extra > ruleset can be expressed in this format though. Can a PEG grammar have > parametrized rules? > > I've just added nested block element support in the spec. This is done > by having the block element generator (formerly the block element pass) > have a stack of rules to match when starting each line. This idea coming > straight from Allan Odgaard's explanation of his lost Markdown parser. > > indentedLine) <++> (many (many1 (optional indent ->> blankline) <++> many1 (doesNotMatch blankline ->> indentedLine)) ## concat) <<- many blankline ## Verbatim . concat John From michel.fortin at michelf.com Thu May 15 23:58:38 2008 From: michel.fortin at michelf.com (Michel Fortin) Date: Thu, 15 May 2008 23:58:38 -0400 Subject: Parsing Code Blocks Message-ID: <7B9AB192-DCEF-4625-AF5A-5F498351161E@michelf.com> I've rewritten the code block grammar in the Markdown Extra [spec][] to match what Markdown.pl and PHP Markdown do. 
It should now handle things such as this: ~~~ > One Two > Three Four Five ~~~ as one blockquote containing only one code block with five lines, equivalent to this one (using fenced code blocks instead for clarity): ~~~ > One > Two > > Three > Four Five ~~~ I'm wondering though if code blocks shouldn't force a "non-lazy" syntax, which would mean yielding a result identical to this instead: ~~~ > One Two > Three Four Five ~~~ Thoughts? [spec]: Michel Fortin michel.fortin at michelf.com http://michelf.com/ From michel.fortin at michelf.com Thu May 15 23:58:56 2008 From: michel.fortin at michelf.com (Michel Fortin) Date: Thu, 15 May 2008 23:58:56 -0400 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: <20080513140248.GA2358@berkeley.edu> References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <445A584E-A771-436F-86B2-ACBBF71F8377@michelf.com> <20080512221432.GB21000@berkeley.edu> <3E7AFA5E-1E46-4750-B322-13FA8C716866@michelf.com> <20080513062013.GA14024@berkeley.edu> <20080513140248.GA2358@berkeley.edu> Message-ID: Le 2008-05-13 ? 10:02, John MacFarlane a ?crit : > No, PEG can't do this. But there is a different approach that works > (described in my earlier email). You're right, and I'm quite familiar with this approach for parsing nested blocks as it is what Markdown.pl and PHP Markdown do. I may switch back to that solution if problems arise with the current approach or if it proves to be more useful. By not choosing a grammar early, I'm less constrained about what I can do in the spec, and I think that's desirable at this early stage. > By the way: if I understand it correctly, your description of "Code > block" would parse the following as two code blocks, not one code > block > containing a blank line: > > some code > > more code > > (Note: there is no tab on the middle line.) I don't think that's the > desired behavior. Indeed. Fixed, but please read the new thread about the subject. 
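The behavior being agreed on here (indented lines separated by a blank, unindented line still form one code block) can be sketched as follows. This is a rough illustration under stated assumptions, not the spec's algorithm or any implementation's code: an indent is assumed to be exactly four spaces, tabs and paragraph continuation are ignored, and `code_blocks` is a name invented for the example.

```python
INDENT = "    "   # assumption: one indent is four spaces; tabs are not handled


def is_blank(line):
    return line.strip() == ""


def is_indented(line):
    return line.startswith(INDENT) and not is_blank(line)


def code_blocks(lines):
    """Group indented lines into code blocks, keeping interior blank lines.

    Blank lines are kept only when more indented lines follow them, so
    trailing blank lines never end up inside a block.
    """
    blocks = []
    i, n = 0, len(lines)
    while i < n:
        if not is_indented(lines[i]):
            i += 1
            continue
        block = []
        while i < n:
            if is_indented(lines[i]):
                block.append(lines[i][len(INDENT):])
                i += 1
                continue
            j = i                                  # look ahead over blank lines
            while j < n and is_blank(lines[j]):
                j += 1
            if j > i and j < n and is_indented(lines[j]):
                block.extend([""] * (j - i))       # blanks belong to this block
                i = j
            else:
                break                              # block ends here
        blocks.append(block)
    return blocks


print(code_blocks(["    some code", "", "    more code"]))
# [['some code', '', 'more code']]
```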
> Here's the markdown-peg version (and remember, this is "runnable"): > > verbatim <- newRule $ > many1 (doesNotMatch blankline ->> indentedLine) <++> > (many (many1 (optional indent ->> blankline) <++> > many1 (doesNotMatch blankline ->> indentedLine)) ## concat) > <<- > many blankline ## Verbatim . concat And I was under the impression that you had given me a nearly complete cheatsheet of the PEG grammar in that previous email. What do $, ->>, <++>, and ## mean? Michel Fortin michel.fortin at michelf.com http://michelf.com/ From qaramazov at gmail.com Fri May 16 00:31:09 2008 From: qaramazov at gmail.com (Yuri Takhteyev) Date: Thu, 15 May 2008 21:31:09 -0700 Subject: Parsing Code Blocks In-Reply-To: <7B9AB192-DCEF-4625-AF5A-5F498351161E@michelf.com> References: <7B9AB192-DCEF-4625-AF5A-5F498351161E@michelf.com> Message-ID: Your first two examples are not treated as the same by any implementation. It seems that all implementations interpret this: ~~~ > One Two > Three Four Five ~~~ as meaning that "One" is in a code block, but "Two" is not. Or did you mean to put a few more spaces in front of "Two"? > [spec]: I think it would help if the spec made it clearer what part of each line of the blockquote is consumed before we go looking for sub-elements, especially as far as consuming initial whitespace goes. 
- yuri -- http://sputnik.freewisdom.org/ From jgm at berkeley.edu Fri May 16 01:37:19 2008 From: jgm at berkeley.edu (John MacFarlane) Date: Thu, 15 May 2008 22:37:19 -0700 Subject: Markdown Extra Spec: Parsing Section In-Reply-To: References: <80AE6E03-1FF1-40D2-92FB-7C58D3F8A253@michelf.com> <445A584E-A771-436F-86B2-ACBBF71F8377@michelf.com> <20080512221432.GB21000@berkeley.edu> <3E7AFA5E-1E46-4750-B322-13FA8C716866@michelf.com> <20080513062013.GA14024@berkeley.edu> <20080513140248.GA2358@berkeley.edu> Message-ID: <20080516053719.GA13625@berkeley.edu> > And I was under the impression that you had given me a nearly complete > cheatsheet of the PEG grammar in that previous email. What does $, ->>, > <++>, and ## mean? Sorry, these are not standard PEG symbols, but they are used in the Haskell PEG library I'm using (Frisby). If you look at the source code of Markdown.hs, you'll find a table correlating Frisby notation with standard PEG notation: http://github.com/jgm/markdown-peg/tree/master/Markdown.hs#L69 The only symbol not explained there is '$'. '$' is a standard Haskell sign for function application; basically it's just a way to write things with fewer parentheses. So, for example, newRule $ blah blah blah is the same as newRule (blah blah blah) John From jgm at berkeley.edu Fri May 16 02:12:48 2008 From: jgm at berkeley.edu (John MacFarlane) Date: Thu, 15 May 2008 23:12:48 -0700 Subject: Parsing Code Blocks In-Reply-To: <7B9AB192-DCEF-4625-AF5A-5F498351161E@michelf.com> References: <7B9AB192-DCEF-4625-AF5A-5F498351161E@michelf.com> Message-ID: <20080516061248.GA15649@berkeley.edu> > I'm wondering though if code blocks shouldn't force a "non-lazy" syntax, > which would mean yielding a result identical to this instead: I'd say no. It's unlikely that anyone's going to combine lazy blockquote syntax with code blocks anyway, but if they do, it seems to me that they ought to expect the lazy syntax to work in the usual way. 
John From pagaltzis at gmx.de Thu May 22 02:10:39 2008 From: pagaltzis at gmx.de (Aristotle Pagaltzis) Date: Thu, 22 May 2008 08:10:39 +0200 Subject: Optional features (was: Markdown Extra Specification (First Draft)) In-Reply-To: <48220C43.4050401@gmail.com> References: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com> <4820D94D.8090807@gmail.com> <48220C43.4050401@gmail.com> Message-ID: <20080522061039.GA23076@klangraum.plasmasturm.org> * Sherwood Botsford [2008-05-07 22:10]: > THAT said, however, maintaining perfect backward compatibility > slows down progress. I don't know. It seems to me perfect backward compatibility is not even possible, considering that Markdown.pl is not set in stone (John takes bug reports and writes fixes, every so often) and yet is not formally defined anywhere. As such, there is no way to say what is backward compatible and what isn't. I think at most, backcompat for the purposes of a spec for Markdown can only be defined as targeting a particular feature set, but not an exact implementation of it. That is, after all, the entire reason for the spec effort in the first place. > Can markdown extra have a configuration file: > The default behaviour is to emulate markdown. > The configuration file allows for new features that don't fit > well into the old set. Optional features are dangerous and impede interoperability. Everyone who ever thinks about chipping in on the design of a spec should read [section 5 of RFC 3339][1]. That RFC is a spec for a particular datetime format, but section 5 is largely agnostic of the nature of the format, and lays down the principles according to which the design decisions for this format were made. [Section 5.3][2] is the part with direct relevance to your stipulation, but the entire section is worth reading. 
[1]: http://tools.ietf.org/html/rfc3339#section-5
[2]: http://tools.ietf.org/html/rfc3339#section-5.3
One problem is that every new option leads to a geometric increase in the number of feature combinations that have to be tested. Another issue is that Markdown is a document format. If it has many optional features, what are the chances that if I send you a document ostensibly written in Markdown, it will work in your implementation of Markdown exactly as it did in mine? You really really don't want to have to wonder. This was a major reason why SGML mostly failed, for example, and only gained traction when it was restandardised as XML. SGML had legions of optional author-friendly features that made it an extreme amount of work to implement a parser that correctly implemented the entire spec. The XML working group sat down and basically chucked out 95% of the optional features and made the rest mandatory. The rest is history. Optional features in a document format are an invitation for interoperability problems. Since the entire point of the Markdown spec effort was to reduce existing interoperability problems, I strongly advise that as little as possible in the spec be made optional. Ideally, nothing would be. It is, mind, perfectly fine to have two (or maybe three?) specs of which one is a superset of the other, as seems to be Michel's current thrust with Markdown vs Markdown Extra. Assuming that no feature in either spec is optional, that means you would be able to expect Markdown Extra documents to work in all Markdown Extra processors, and all Markdown documents to work in all Markdown and Markdown Extra processors. The scope of the problem is much smaller in such a scenario, enough so to be perfectly tractable. 
Regards, -- Aristotle Pagaltzis // From 29mtuz102 at sneakemail.com Thu May 22 06:03:50 2008 From: 29mtuz102 at sneakemail.com (Allan Odgaard) Date: Thu, 22 May 2008 12:03:50 +0200 Subject: Optional features (was: Markdown Extra Specification (First Draft)) In-Reply-To: <20080522061039.GA23076@klangraum.plasmasturm.org> References: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com> <4820D94D.8090807@gmail.com> <48220C43.4050401@gmail.com> <20080522061039.GA23076@klangraum.plasmasturm.org> Message-ID: <22877-63346@sneakemail.com> On 22 May 2008, at 08:10, Aristotle Pagaltzis wrote: > [...] > Optional features are dangerous and impede interoperability. > > Everyone who ever thinks about chipping in on the design of > a spec should read [section 5 of RFC 3339][1]. [...] I love how it says: [...] A format which includes rarely used options is likely to cause interoperability problems [...] The format defined below includes only one rarely used option: fractions of a second. [...] Which reminds me of when svn started to report fractions of seconds in their "svn log --xml" output, causing a few log visualizers to break :) From michel.fortin at michelf.com Thu May 22 22:33:45 2008 From: michel.fortin at michelf.com (Michel Fortin) Date: Thu, 22 May 2008 22:33:45 -0400 Subject: Optional features (was: Markdown Extra Specification (First Draft)) In-Reply-To: <20080522061039.GA23076@klangraum.plasmasturm.org> References: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com> <4820D94D.8090807@gmail.com> <48220C43.4050401@gmail.com> <20080522061039.GA23076@klangraum.plasmasturm.org> Message-ID: Le 2008-05-22 à 2:10, Aristotle Pagaltzis a écrit : > It is, mind, perfectly fine to have two (or maybe three?) specs > of which one is a superset of the other, as seems to be Michel's > current thrust with Markdown vs Markdown Extra. 
> Assuming that no
> feature in either spec is optional, that means you would be able
> to expect Markdown Extra documents to work in all Markdown Extra
> processors, and all Markdown documents to work in all Markdown
> and Markdown Extra processors. The scope of the problem is much
> smaller in such a scenario, enough so to be perfectly tractable.

I fully agree with this, by the way: optional features should be
kept to a minimum.

It may be interesting to note there are currently two configurable
parsing-related settings[^1] in PHP Markdown:

Tab width (default = 4)
:   This one comes from a similar configuration option in Markdown.pl
    and is essentially the size in spaces for one indent through a
    Markdown document. When John Gruber says "four spaces or one tab"
    in his syntax description document, he really means
    "<tab-width> spaces or one tab", where tab-width is a configurable
    parameter defaulting to 4. I'm not aware of anyone changing this
    parameter, and I'm not even sure how well it works, but it is
    clear that changing it will break many documents written with a
    different tab width in mind.

No markup (default = false)
No entities (default = false)
:   This one prevents the parser from skipping over HTML tags and/or
    HTML character entities. I was originally opposed to it, and in
    some way I still am. I decided to add it because there were too
    many people attempting to disable HTML by preprocessing the input
    with strip_tags or a substitution regular expression without
    realizing they were breaking automatic links, code spans and code
    blocks with HTML in them, and sometimes blockquotes. I'm no fan of
    this mode, but I feel it was the best way to avoid people breaking
    the syntax by accident, so I've added it in.

I'm not sure those "features" should be formally part of the spec.
I believe however that if the spec is well written it should be
pretty trivial to see what must be changed to achieve them.
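The breakage described for tag-stripping preprocessors can be
reproduced with a few lines. This is only a sketch: the regular
expression below is a stand-in for "a substitution regular expression"
and does not claim to match what PHP's strip_tags does in detail.

```python
import re

def naive_strip_tags(text):
    """Remove anything that looks like <...> before Markdown sees it."""
    return re.sub(r"<[^>]*>", "", text)

# An automatic link and a code span that mentions a tag.
source = "See <http://example.com/> and use `<em>` for emphasis."

print(naive_strip_tags(source))
# → See  and use `` for emphasis.
```

Both the automatic link and the content of the code span are gone
before the Markdown parser ever runs, which is why handling this inside
the parser is the less damaging approach.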
[^1]: A "parsing-related setting" is a setting that changes the
interpretation of the document given in output. The opposite is an
"output-related setting", which changes the HTML output but does not
affect the interpretation the parser makes of the document.

Michel Fortin
michel.fortin at michelf.com
http://michelf.com/

From michel.fortin at michelf.com Thu May 22 23:25:27 2008
From: michel.fortin at michelf.com (Michel Fortin)
Date: Thu, 22 May 2008 23:25:27 -0400
Subject: Parsing Code Blocks
In-Reply-To:
References: <7B9AB192-DCEF-4625-AF5A-5F498351161E@michelf.com>
Message-ID:

On 2008-05-16 at 0:31, Yuri Takhteyev wrote:

> Your first two examples are not treated as the same by any
> implementation. It seems that all implementations interpret this:
>
> ~~~
>> One
> Two
>
>> Three
> Four
>
> Five
> ~~~
>
> as meaning that "One" is in a code block, but "Two" is not.
>
> Or did you mean to put a few more spaces in front of "Two"?

Hum, yes I did, and in fact I had. It just looks like my email client
(Mac OS X's Mail) ate the first space on each line that begins with a
space... I really wish it wasn't using WebKit as its text editor when
in text-only mode.

>> [spec]:
>
> I think it would help if the spec made it more clear what part of
> each line of the blockquote is consumed before we go looking for
> sub-elements, especially as far as consuming initial whitespace goes.

Quoting item 2 of blockquote (at the moment you wrote the above):

> A run of the [block element generator](#block-element-generator) by
> pushing the following sequence to the context-line-prefix
> stack:
> 1. Zero or one [insignificant-indent](#insignificant-indent)
> 2. ">"
> 3. Zero or one [space](#space)

This means that the block element generator is used as a grammar rule
at this point. It matches if it can generate one or more block
elements.
Since each rule in the block generator first checks for a
hard-block-content-line-prefix, you could check for yourself that you
can match a hard-block-content-line-prefix prior to calling the
generator (this *could* be more performant).

I've added this to the block element generator section:

> The block element generator is used as a parsing rule in the grammar of
> the document element generator and the block element generator. The block
> element generator matches if one of the following rules matches and creates
> an element.

That said, I decided to revamp the blockquote rule so that it no
longer uses the block element generator directly. Everything now
passes through a rule named block-element-run, matching one or more
block elements (using the block-element generator), and the
blockquote's first ">" is parsed separately in the blockquote rule
instead of indirectly from attempting to parse block elements.

Does this make it clearer?

By the way, I agree things are not optimal at the moment. They are
also way off the tracks of what PHP Markdown and Markdown.pl actually
do in many cases. The plan is to start by making something that
mostly works. Then I'll compare with the actual regular expressions
used in the code and make adjustments as necessary. After that, I'll
compare with test cases in MDTest, and with the output given by other
implementations in Babelmark. And I might mix the order a bit.
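The quoted prefix sequence (optional insignificant indent, ">",
optional space) can be sketched as a stack of patterns that each line
must satisfy before block parsing continues. This is an illustration
only, not the draft spec's algorithm: the name strip_prefixes is
hypothetical, and it assumes an insignificant-indent means at most
three spaces, which the thread does not state.

```python
import re

# Assumption: an insignificant indent is one to three spaces, so
# "zero or one" of it is written as " {0,3}". Then ">" and an optional space.
BLOCKQUOTE_PREFIX = re.compile(r" {0,3}> ?")

def strip_prefixes(line, prefix_stack):
    """Consume each prefix on the stack in order.

    Returns the remainder of the line, or None if a prefix does not match.
    """
    pos = 0
    for prefix in prefix_stack:
        match = prefix.match(line, pos)
        if match is None:
            return None
        pos = match.end()
    return line[pos:]

stack = [BLOCKQUOTE_PREFIX]
print(repr(strip_prefixes("> quoted text", stack)))   # 'quoted text'
print(repr(strip_prefixes(">     code", stack)))      # '    code'

stack.append(BLOCKQUOTE_PREFIX)                       # nested blockquote
print(repr(strip_prefixes("> > nested", stack)))      # 'nested'
print(repr(strip_prefixes("plain paragraph", stack))) # None
```

The second call shows why the "zero or one space" rule matters for code
blocks: only one space after ">" is consumed, so the four spaces that
mark an indented code block survive for the inner block parser.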
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/

From pagaltzis at gmx.de Thu May 22 23:38:41 2008
From: pagaltzis at gmx.de (Aristotle Pagaltzis)
Date: Fri, 23 May 2008 05:38:41 +0200
Subject: Optional features (was: Markdown Extra Specification (First Draft))
In-Reply-To:
References: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com> <4820D94D.8090807@gmail.com> <48220C43.4050401@gmail.com> <20080522061039.GA23076@klangraum.plasmasturm.org>
Message-ID: <20080523033841.GE23076@klangraum.plasmasturm.org>

* Michel Fortin [2008-05-23 04:35]:
> I'm not sure those "features" should be formally part of the
> spec. I believe however that if the spec is well written it
> should be pretty trivial to see what must be changed to achieve
> them.

Yeah, I don't think they belong in the spec itself either.

I also agree with your opposition to them; if anything, one
should filter the *output* of a Markdown-to-HTML conversion so
that it won't matter whether people write literal `<em>` tags or
use asterisks. In both cases the resulting HTML is benign, so why
disallow those tags? OTOH disallowing literal tags in the input
means you cannot write a `<blockquote>
` with a `cite` attribute, since Markdown provides no syntax for that.
Scrubbing the input indiscriminately therefore removes functionality
for no benefit at all.

Regards,
-- 
Aristotle Pagaltzis //

From qaramazov at gmail.com Fri May 23 02:31:20 2008
From: qaramazov at gmail.com (Yuri Takhteyev)
Date: Thu, 22 May 2008 23:31:20 -0700
Subject: Optional features (was: Markdown Extra Specification (First Draft))
In-Reply-To: <20080523033841.GE23076@klangraum.plasmasturm.org>
References: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com> <4820D94D.8090807@gmail.com> <48220C43.4050401@gmail.com> <20080522061039.GA23076@klangraum.plasmasturm.org> <20080523033841.GE23076@klangraum.plasmasturm.org>
Message-ID:

> I also agree with your opposition to them; if anything, one
> should filter the *output* of a Markdown-to-HTML conversion
> so that it won't matter whether people write literal `<em>`
> tags or use asterisks.

This is true in theory... I actually just recently wrote something
along those lines in Lua [1] to use with my Lua wiki. The idea is to
do as you suggest: convert from MD to HTML first, then filter the
HTML. To make it safe, I parse the HTML as XHTML and complain if it
doesn't parse. Hence a problem: if the user screws up with their HTML
(and my filter is pretty unforgiving), it becomes hard to communicate
to them what went wrong. I can tell them where there is a problem in
the overall HTML, but this doesn't help much, since the user didn't
know there was all of this HTML to begin with. There is no easy way
to show them where the problem occurred relative to the input that
they provided, or to show them the content with just _their_ HTML
escaped. So, a good solution in Markdown itself actually would be a
good thing.
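The reporting problem is easy to see with any strict XML parser. A
minimal sketch in Python rather than Lua, using the standard library's
xml.etree.ElementTree; the function name is hypothetical, and named
HTML entities such as `&nbsp;` would also be rejected by this plain XML
parser, which a real filter would have to deal with separately.

```python
import xml.etree.ElementTree as ET

def check_well_formed(html_fragment):
    """Return None if the fragment parses as XML, else (line, column) of the error.

    The position refers to the generated HTML (wrapped in a <div>),
    not to the Markdown text the user actually typed.
    """
    try:
        ET.fromstring("<div>" + html_fragment + "</div>")
    except ET.ParseError as err:
        return err.position
    return None

# Output of a converter for "Some *emphasis*." is balanced.
print(check_well_formed("<p>Some <em>emphasis</em>.</p>"))

# A user-typed, unclosed <em> passed through by the converter is not.
print(check_well_formed("<p>Some <em>emphasis.</p>"))
```

The second call yields a line and column inside the converter's output,
which is exactly the information that is hard to map back to what the
user wrote.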
- yuri

[1]: http://sputnik.freewisdom.org/lib/xssfilter/

-- 
http://sputnik.freewisdom.org/

From pagaltzis at gmx.de Sat May 24 05:10:43 2008
From: pagaltzis at gmx.de (Aristotle Pagaltzis)
Date: Sat, 24 May 2008 11:10:43 +0200
Subject: Optional features (was: Markdown Extra Specification (First Draft))
In-Reply-To:
References: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com> <4820D94D.8090807@gmail.com> <48220C43.4050401@gmail.com> <20080522061039.GA23076@klangraum.plasmasturm.org> <20080523033841.GE23076@klangraum.plasmasturm.org>
Message-ID: <20080524091043.GI23076@klangraum.plasmasturm.org>

* Yuri Takhteyev [2008-05-23 08:35]:
> * Aristotle Pagaltzis [2008-05-23 05:40]:
> > I also agree with your opposition to them; if anything, one
> > should filter the *output* of a Markdown-to-HTML conversion
> > so that it won't matter whether people write literal `<em>`
> > tags or use asterisks.
>
> This is true in theory... I actually just recently wrote
> something along those lines in Lua [1] to use with my Lua wiki.
> The idea is to do as you suggest: Convert from MD to HTML
> first, then filter the HTML. To make it safe, I parse HTML as
> XHTML and complain if it doesn't parse. Hence a problem: if the
> user screws up with their HTML (and my filter is pretty
> unforgiving), it becomes hard to communicate to them what went
> wrong. I can tell them where there is a problem in the overall
> HTML, but this doesn't help much, since the user didn't know
> there was all of this HTML to begin with.

It seems to me that filtering is a red herring in your case. If
you want to allow users to enter literal tags, you will have this
problem whether you filter the ultimate output or not.

> There is no easy way to show them where the problem occurred
> relative to the input that they provided, or to show them the
> content with just _their_ HTML escaped. So, a good solution in
> Markdown itself actually would be a good thing.
If your XHTML parser has a streaming input mode, you can couple
your Markdown converter directly to the XHTML parser and feed the
HTML output to it as you go. If the XHTML parser throws a
well-formedness error, you can then relate it to the vicinity of the
last Markdown chunk you converted to HTML and passed into the
XHTML parser. It will sometimes be an earlier chunk; e.g. if the
user writes `&nbsp` (notice the missing semicolon) and this is
exactly at the end of the string in the HTML chunk you pass to the
XHTML parser, then the XHTML parser will have to wait until the next
chunk before it can decide that the entity is broken.

If you don't want to couple the Markdown converter with an XHTML
parser that closely, it's still possible to do this, but the
Markdown converter will have to be able to accept streaming input
itself and will need to generate output sufficiently frequently
that you can track the correlation of input and output with a
useful amount of precision. The glue code that combines the
Markdown converter with the XHTML parser will have to do some
relatively hairy (though not very complex) bookkeeping in that case.

Regards,
-- 
Aristotle Pagaltzis //

From qaramazov at gmail.com Sat May 24 15:34:04 2008
From: qaramazov at gmail.com (Yuri Takhteyev)
Date: Sat, 24 May 2008 12:34:04 -0700
Subject: Optional features (was: Markdown Extra Specification (First Draft))
In-Reply-To: <20080524091043.GI23076@klangraum.plasmasturm.org>
References: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com> <4820D94D.8090807@gmail.com> <48220C43.4050401@gmail.com> <20080522061039.GA23076@klangraum.plasmasturm.org> <20080523033841.GE23076@klangraum.plasmasturm.org> <20080524091043.GI23076@klangraum.plasmasturm.org>
Message-ID:

> It seems to me that filtering is a red herring in your case. If
> you want to allow users to enter literal tags, you will have this
> problem whether you filter the ultimate output or not.

If I want to allow them, then yes, but this is not the case I was
considering.
Suppose I do _not_ want to allow them to enter HTML tags. This is
easy to implement as an option in a Markdown converter. However, if
the converter doesn't do that, then I have a much harder task: the
user's tags are now mixed with Markdown's tags, and I have to figure
out how to sort them out. There _is_ a difference between the `<em>`
inserted by Markdown and the `<em>` inserted by the user. I know
Markdown's em will be balanced. I am not sure that the user's will
be. At this point the only way to be sure that the HTML is valid is
to parse it.

> If your XHTML parser has a streaming input mode, you can couple
> your Markdown converter directly to the XHTML parser and feed the
> HTML output to it as you go. If the XHTML parser throws a
> well-formedness error, you can then relate it to the vicinity of
> the last Markdown chunk you converted to HTML and passed into the
> XHTML parser.

I am not quite sure what you mean, but Markdown documents can't
always be processed on a chunk by chunk basis. Consider:

    Here is a [link][id].

    ... 100KB of text...

    [id]: http://example.com/ "Optional Title Here"

This document cannot be processed correctly unless it's considered
all at the same time.

> If you don't want to couple the Markdown converter with an XHTML
> parser that closely, it's still possible to do this, but the
> Markdown converter will have to be able to accept streaming input
> itself and will need to generate output sufficiently frequently
> that you can track the correlation of input and output with a
> useful amount of precision.

Sure, if you want to drop support for references, footnotes, etc.
But it's much simpler to implement a "safe mode" that escapes or
validates all HTML submitted by the user.
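The reference-link example can be made concrete with a two-pass
sketch: the first pass has to read the whole document to collect
definitions before the second pass can resolve any link. The regular
expressions and the function name below are simplified assumptions for
illustration (no HTML escaping, no multi-line titles) and are not taken
from any existing Markdown implementation.

```python
import re

# Simplified patterns: '[id]: url "optional title"' and '[text][id]'.
DEFINITION = re.compile(r'^ {0,3}\[([^\]]+)\]:\s*(\S+)(?:\s+"([^"]*)")?\s*$')
REFERENCE = re.compile(r'\[([^\]]+)\]\[([^\]]*)\]')

def resolve_reference_links(text):
    # Pass 1: scan the entire input for link definitions.
    definitions = {}
    body = []
    for line in text.split("\n"):
        match = DEFINITION.match(line)
        if match:
            definitions[match.group(1).lower()] = (match.group(2), match.group(3))
        else:
            body.append(line)

    # Pass 2: only now can references be turned into links.
    def replace(match):
        label = (match.group(2) or match.group(1)).lower()
        if label not in definitions:
            return match.group(0)  # leave unresolved references untouched
        url, title = definitions[label]
        title_attr = f' title="{title}"' if title else ""
        return f'<a href="{url}"{title_attr}>{match.group(1)}</a>'

    return REFERENCE.sub(replace, "\n".join(body))

document = (
    "Here is a [link][id].\n"
    "\n"
    '[id]: http://example.com/ "Optional Title Here"'
)
print(resolve_reference_links(document))
```

Because the definition may sit anywhere after the reference, no output
for the first line can be finalized until the last line has been read,
which is the constraint on streaming input raised above.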
- yuri

-- 
http://sputnik.freewisdom.org/

From pagaltzis at gmx.de Sat May 24 18:20:22 2008
From: pagaltzis at gmx.de (Aristotle Pagaltzis)
Date: Sun, 25 May 2008 00:20:22 +0200
Subject: Optional features (was: Markdown Extra Specification (First Draft))
In-Reply-To:
References: <3B65A0B5-8A1A-485C-B065-302D5A9AA94F@michelf.com> <4820D94D.8090807@gmail.com> <48220C43.4050401@gmail.com> <20080522061039.GA23076@klangraum.plasmasturm.org> <20080523033841.GE23076@klangraum.plasmasturm.org> <20080524091043.GI23076@klangraum.plasmasturm.org>
Message-ID: <20080524222022.GJ23076@klangraum.plasmasturm.org>

* Yuri Takhteyev [2008-05-24 21:35]:
> * Aristotle Pagaltzis [2008-05-24 11:15]:
> > If your XHTML parser has a streaming input mode, you can
> > couple your Markdown converter directly to the XHTML parser
> > and feed the HTML output to it as you go. If the XHTML parser
> > throws a well-formedness error, you can then relate it to
> > the vicinity of the last Markdown chunk you converted to HTML
> > and passed into the XHTML parser.
>
> I am not quite sure what you mean, but Markdown documents can't
> always be processed on a chunk by chunk basis. Consider:
>
>     Here is a [link][id].
>
>     ... 100KB of text...
>
>     [id]: http://example.com/ "Optional Title Here"
>
> This document cannot be processed correctly unless it's
> considered all at the same time.

Good point, so streaming the Markdown input is not possible. But
that doesn't mean you can't generate the output piecemeal and
also feed it to the XHTML parser that way.

Regards,
-- 
Aristotle Pagaltzis //

From michel.fortin at michelf.com Tue May 27 07:07:40 2008
From: michel.fortin at michelf.com (Michel Fortin)
Date: Tue, 27 May 2008 07:07:40 -0400
Subject: PHP Markdown Extra 1.2.1
Message-ID: <1D8A0CAC-A4D0-4C3C-AC65-6E0441CE0DF4@michelf.com>

This is an Extra-only release for fixing an embarrassing bug with
fenced code blocks.
Extra 1.2.1 (27 May 2008):

*   Fixed a problem where Markdown headers and horizontal rules were
    transformed into their HTML equivalent inside fenced code blocks.
    In other words, this:

        ~~~
        # hello

        - - -
        ~~~

    would generate a header and a horizontal rule *inside* the code
    block.

Michel Fortin
michel.fortin at michelf.com
http://michelf.com/