Greedy versus non-greedy repeat (was: Minor regexp oversight for setext headings)

Allan Odgaard 29mtuz102 at sneakemail.com
Mon Oct 9 12:44:01 EDT 2006


Greedy matching is generally more efficient (so it’s a good idea to
use it when possible), but the reason it is faster is because of
changed semantics.

Take this string:

foo...«lots of text»...bar...«lots more text»...end

If we match it against ‘foo.*bar’ the regexp engine will first match
‘foo’, then it will match to the end of the string and look for ‘bar’
there, if there is no match, it will backtrack (i.e. going back one
character), it will continue until it finds ‘bar’.

So if the string is 1,000 characters long and ‘bar’ is in the middle,
we make around 1,500 compares.

If instead we search for ‘foo.*?bar’ it is non-greedy, basically
meaning it will let ‘.*?’ match as few characters as possible (0)
then look for ‘bar’, if that didn’t work, it will match another
character, etc. until it finds ‘bar’. So here we only match around
500 characters.

The problem is of course if the string contains ‘bar’ in two
locations, then the two patterns will match a sequence of different
length.

So the OP is right in that non-greedy matching is generally faster
and should be preferred, but the user should also know why, otherwise
he might get a wrong match -- coincidentally a common newbie mistake
is actually the opposite, like doing: ‘<.*>’ to match the contents of
a tag -- this should be ‘<.*?>’ (or <[^>]*>) to handle the case where
the line contains multiple tags -- and coincidentally this was also
sort of the problem with the Markdown heading pattern (‘.*’ matching
too much).





On 9. Oct 2006, at 16:07, Angie Ahl wrote:


> Oi.

>

> All due respect,

>

> If you're going to correct/flame people publicly at least have to

> politeness to point out/qualify what you're saying eg chapter or

> page number otherwise you come across as an ignorant troll with no

> social skills.

>

> Unless of course you are an ignorant troll in which case please

> disregard this message. I was under the impression you weren't but

> I could be wrong.

>

> And I've read the book about 5 times by now so I'd really

> appreciate you qualifying what you say to the list for the benefit

> of the list as I'm pretty sure you're mistaken.

>

> Angie

>

> On 7 Oct 2006, at 23:21, A. Pagaltzis wrote:

>

>> * Angie Ahl <alists at vertebrate.co.uk> [2006-10-07 22:35]:

>>> Not sure that it would but I've read in several places that non

>>> greedy should be used whereever it can as it's more efficient

>>> for the regex parser and processor. Don't have any benchmarks

>>> to back this up but I'm sure it says so in the mastering regex

>>> book.

>>

>> You are mistaken on all counts. Please don’t parrot silly cargo

>> cult advice. Read the book you mention instead of assuming what

>> it says.

>>

>> Regards,

>> --

>> Aristotle Pagaltzis // <http://plasmasturm.org/>




More information about the Markdown-Discuss mailing list