Wednesday 8 May 2019

regex - Greedy vs. Reluctant vs. Possessive Quantifiers



I found this excellent tutorial on regular expressions and while I intuitively understand what "greedy", "reluctant" and "possessive" quantifiers do, there seems to be a serious hole in my understanding.



Specifically, in the following example:



Enter your regex: .*foo  // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.


Enter your regex: .*?foo // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.



The explanation mentions eating the entire input string, letters been consumed, matcher backing off, rightmost occurrence of "foo" has been regurgitated, etc.



Unfortunately, despite the nice metaphors, I still don't understand what is eaten by whom... Do you know of another tutorial that explains (concisely) how regular expressions engines work?



Alternatively, if someone can explain in somewhat different phrasing the following paragraph, that would be much appreciated:




The first example uses the greedy
quantifier .* to find "anything", zero
or more times, followed by the letters

"f" "o" "o". Because the quantifier is
greedy, the .* portion of the
expression first eats the entire input
string. At this point, the overall
expression cannot succeed, because the
last three letters ("f" "o" "o") have
already been consumed (by whom?). So the matcher
slowly backs off (from right-to-left?) one letter at a time
until the rightmost occurrence of
"foo" has been regurgitated (what does this mean?), at which

point the match succeeds and the
search ends.



The second example, however, is
reluctant, so it starts by first
consuming (by whom?) "nothing". Because "foo"
doesn't appear at the beginning of the
string, it's forced to swallow (who swallows?) the
first letter (an "x"), which triggers
the first match at 0 and 4. Our test

harness continues the process until
the input string is exhausted. It
finds another match at 4 and 13.



The third example fails to find a
match because the quantifier is
possessive. In this case, the entire
input string is consumed by .*+, (how?)
leaving nothing left over to satisfy
the "foo" at the end of the

expression. Use a possessive
quantifier for situations where you
want to seize all of something without
ever backing off (what does back off mean?); it will outperform
the equivalent greedy quantifier in
cases where the match is not
immediately found.



Answer



I'll give it a shot.




A greedy quantifier first matches as much as possible. So the .* matches the entire string. Then the matcher tries to match the f following, but there are no characters left. So it "backtracks", making the greedy quantifier match one less thing (leaving the "o" at the end of the string unmatched). That still doesn't match the f in the regex, so it "backtracks" one more step, making the greedy quantifier match one less thing again (leaving the "oo" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now, the matcher finally matches the f in the regex, and the o and the next o are matched too. Success!



A reluctant or "non-greedy" quantifier first matches as little as possible. So the .* matches nothing at first, leaving the entire string unmatched. Then the matcher tries to match the f following, but the unmatched portion of the string starts with "x" so that doesn't work. So the matcher backtracks, making the non-greedy quantifier match one more thing (now it matches the "x", leaving "fooxxxxxxfoo" unmatched). Then it tries to match the f, which succeeds, and the o and the next o in the regex match too. Success!



In your example, it then starts the process over with the remaining unmatched portion of the string, following the same process.



A possessive quantifier is just like the greedy quantifier, but it doesn't backtrack. So it starts out with .* matching the entire string, leaving nothing unmatched. Then there is nothing left for it to match with the f in the regex. Since the possessive quantifier doesn't backtrack, the match fails there.


No comments:

Post a Comment

php - file_get_contents shows unexpected output while reading a file

I want to output an inline jpg image as a base64 encoded string, however when I do this : $contents = file_get_contents($filename); print &q...