Thursday 19 October 2017

regex - Greedy vs. Reluctant vs. Possessive Quantifiers

itemprop="text">


I found this href="http://download.oracle.com/javase/tutorial/essential/regex/quant.html"
rel="noreferrer">excellent tutorial on regular expressions and while I
intuitively understand what "greedy", "reluctant" and "possessive" quantifiers do, there
seems to be a serious hole in my
understanding.



Specifically, in the following
example:



Enter your regex: .*foo
// greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I
found the text "xfooxxxxxxfoo" starting at index 0 and ending at index
13.

Enter your regex: .*?foo // reluctant
quantifier

Enter input string to search: xfooxxxxxxfoo
I
found the text "xfoo" starting at index 0 and ending at index 4.
I found the
text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter
your regex: .*+foo // possessive quantifier
Enter input string to search:
xfooxxxxxxfoo
No match
found.


The explanation
mentions eating the entire input string, letters been
consumed, matcher backing off,
rightmost occurrence of "foo" has been regurgitated,
etc.




Unfortunately, despite the nice
metaphors, I still don't understand what is eaten by whom... Do you know of another
tutorial that explains (concisely) how regular expressions engines
work?



Alternatively, if someone can explain in
somewhat different phrasing the following paragraph, that would be much
appreciated:




The
first example uses the greedy
quantifier .* to find "anything",
zero
or more times, followed by the letters
"f" "o" "o". Because
the quantifier is

greedy, the .* portion of the

expression first eats the entire input
string. At this point, the
overall
expression cannot succeed, because the
last three letters
("f" "o" "o") have
already been consumed (by
whom?
). So the matcher
slowly backs off (from
right-to-left?
) one letter at a time
until the rightmost
occurrence of
"foo" has been regurgitated (what does this
mean?
), at which
point the match succeeds and
the

search ends.



The
second example, however, is
reluctant, so it starts by first

consuming (by whom?) "nothing". Because "foo"

doesn't appear at the beginning of the
string, it's forced to swallow
(who swallows?) the
first letter (an "x"), which
triggers
the first match at 0 and 4. Our test
harness continues
the process until

the input string is exhausted. It

finds another match at 4 and 13.



The third
example fails to find a
match because the quantifier is

possessive. In this case, the entire
input string is consumed by .*+,
(how?)
leaving nothing left over to
satisfy
the "foo" at the end of the
expression. Use a
possessive

quantifier for situations where you
want to
seize all of something without
ever backing off (what does back
off mean?
); it will outperform
the equivalent greedy
quantifier in
cases where the match is not
immediately
found.



class="post-text" itemprop="text">
class="normal">Answer



I'll give
it a shot.




A
greedy quantifier first matches as much as possible. So the
.* matches the entire string. Then the matcher tries to match
the f following, but there are no characters left. So it
"backtracks", making the greedy quantifier match one less thing (leaving the "o" at the
end of the string unmatched). That still doesn't match the f in
the regex, so it "backtracks" one more step, making the greedy quantifier match one less
thing again (leaving the "oo" at the end of the string unmatched). That
still doesn't match the f in the regex, so
it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now,
the matcher finally matches the f in the regex, and the
o and the next o are matched too.
Success!



A
reluctant or "non-greedy" quantifier first matches as
little as possible. So the .* matches nothing at first, leaving
the entire string unmatched. Then the matcher tries to match the
f following, but the unmatched portion of the string starts
with "x" so that doesn't work. So the matcher backtracks, making the non-greedy
quantifier match one more thing (now it matches the "x", leaving "fooxxxxxxfoo"
unmatched). Then it tries to match the f, which succeeds, and
the o and the next o in the regex
match too. Success!



In your example, it then
starts the process over with the remaining unmatched portion of the string, following
the same process.



A
possessive quantifier is just like the greedy quantifier,
but it doesn't backtrack. So it starts out with .* matching the
entire string, leaving nothing unmatched. Then there is nothing left for it to match
with the f in the regex. Since the possessive quantifier
doesn't backtrack, the match fails there.



No comments:

Post a Comment

php - file_get_contents shows unexpected output while reading a file

I want to output an inline jpg image as a base64 encoded string, however when I do this : $contents = file_get_contents($filename); print &q...