Tuesday 9 July 2019

java - Regex greedy parse direction



I found there're two different opinions about how greedy regex is executed:




  • one is, read all the input string and match the pattern from the back, first match entire input,the first attempt is entire string. Some articles support this opinion are Oracle offical Java tutorial:





Greedy quantifiers are considered "greedy" because they force the
matcher to read in, or eat, the entire input string prior to
attempting the first match. If the first match attempt (the entire
input string) fails, the matcher backs off the input string by one
character and tries again, repeating the process until a match is
found or there are no more characters left to back off from.




also see this article: Performance of Greedy vs. Lazy Regex Quantifiers





  • the other is matching from the front, the first match attempt is from the 0 index in the left. when a match is found, the engine doesn't stop, keep matching the rest until it fails then it'll backtrack. Articles supports this opinion I found is:



Repetition with Star and Plus the Looking Inside The Regex Engine section talk about <.+>:




The first token in the regex is <. This is a literal. As we already
know, the first place where it will match is the first < in the
string.





I want to know which one is correct? This matters because it will affect the efficiency of regex. I added various language tags, because I want to know if it's implemented differently in each language.


Answer



"The matcher backs off the input string by one character and tries again" is just describing backtracking, so "then it'll backtrack" is therefore saying the same thing. Since both of your quotes regarding greediness say the same thing, both are correct. (Your third quote has nothing to do with greediness.)






Let's provide an example.




'xxabbbbbxxabbbbbbbbb' =~ /([ab]*)bb/;



  1. Try at pos 0.

    1. [ab]* matches 0 chars "".

      1. At pos 0, bb fails to match ⇒ Backtrack.



    2. [ab]* can't match any less ⇒ Backtrack.


  2. Try at pos 1.

    1. [ab]* matches 0 chars "".

      1. At pos 1, bb fails to match ⇒ Backtrack.


    2. [ab]* can't match any less ⇒ Backtrack.



  3. Try at pos 2.

    1. [ab]* matches 6 chars "abbbbb".

      1. At pos 8, bb fails to match ⇒ Backtrack.


    2. [ab]* matches 5 chars "abbbb". (Back off one)

      1. At pos 7, bb fails to match ⇒ Backtrack.



    3. [ab]* matches 4 chars "abbb". (Back off one)

      1. At pos 6, bb matches.

        1. Success.







So $1 is "abbb". (Not abbbbbbb. "Greedy" doesn't mean "longest match possible".)






Now, let's see what happens if we make the "*" non-greedy.



'xxabbbbbxxabbbbbbbbb' =~ /([ab]*?)bb/;




  1. Try at pos 0.

    1. [ab]*? matches 0 chars "".

      1. At pos 0, bb fails to match ⇒ Backtrack.


    2. [ab]* can't match any more ⇒ Backtrack.


  2. Try at pos 1.


    1. [ab]*? matches 0 chars "".

      1. At pos 1, bb fails to match ⇒ Backtrack.


    2. [ab]* can't match any more ⇒ Backtrack.


  3. Try at pos 2.

    1. [ab]*? matches 0 chars "".


      1. At pos 2, bb fails to match ⇒ Backtrack.


    2. [ab]*? matches 1 char "a". (Add one)

      1. At pos 3, bb matches.

        1. Success.







So $1 is "a".






Specific implementation might do things differently as an optimisation, as long as it gives the same result as presented here. You can see Perl at work using



perl -Mre=debug -E'say "xxabbbbbxxabbbbbbbbb" =~ /([ab]*)bb/;'
perl -Mre=debug -E'say "xxabbbbbxxabbbbbbbbb" =~ /([ab]*?)bb/;'


No comments:

Post a Comment

php - file_get_contents shows unexpected output while reading a file

I want to output an inline jpg image as a base64 encoded string, however when I do this : $contents = file_get_contents($filename); print &q...