Monday 8 January 2018

How can we match a^n b^n with Java regex?

itemprop="text">



This is the second part of a series of educational regex
articles. It shows how lookaheads and nested references can be used to match the
non-regular languge anbn. Nested
references are first introduced in: href="https://stackoverflow.com/questions/3627681/how-does-this-regex-find-triangular-numbers/">How
does this regex find triangular
numbers?





One
of the archetypal non- rel="noreferrer">regular languages
is:




L
= {
a
nbn:
n > 0 }




This is the
language of all non-empty strings consisting of some number of
a's followed by an equal number of
b's. Examples of strings in this language are
ab, aabb,
aaabbb.



This language
can be show to be non-regular by the href="http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages"
rel="noreferrer">pumping lemma. It is in fact an archetypal href="http://en.wikipedia.org/wiki/Context-free_language"
rel="noreferrer">context-free language, which can be generated by the href="http://en.wikipedia.org/wiki/Context-free_language"
rel="noreferrer">context-free grammar S → aSb |
ab
.




Nonetheless, modern
day regex implementations clearly recognize more than just regular languages. That is,
they are not "regular" by formal language theory definition. PCRE and Perl supports
recursive regex, and .NET supports balancing groups definition. Even less "fancy"
features, e.g. backreference matching, means that regex is not
regular.



But just how powerful is this "basic"
features? Can we recognize L with Java regex, for example? Can
we perhaps combine lookarounds and nested references and have a pattern that works with
e.g. href="http://download.oracle.com/javase/6/docs/api/java/lang/String.html#matches%28java.lang.String%29"
rel="noreferrer">String.matches to match strings
like ab, aabb,
aaabbb,
etc?



References





Linked
questions





class="post-text" itemprop="text">
class="normal">Answer



The answer
is, needless to say, YES! You can most certainly write a Java regex
pattern to match
anbn. It uses a
positive lookahead for assertion, and one nested reference for
"counting".



Rather than immediately giving out
the pattern, this answer will guide readers through the process of
deriving it. Various hints are given as the solution is slowly constructed. In this
aspect, hopefully this answer will contain much more than just another neat regex
pattern. Hopefully readers will also learn how to "think in regex", and how to put
various constructs harmoniously together, so they can derive more patterns on their own
in the future.



The language used to develop the
solution will be PHP for its conciseness. The final test once the pattern is finalized
will be done in Java.



/>


Step 1: Lookahead for
assertion



Let's start with a simpler problem:
we want to match a+ at the beginning of a string, but only if
it's followed immediately by b+. We can use
^ to href="http://www.regular-expressions.info/anchors.html"
rel="noreferrer">anchor our match, and since we only want to match the
a+ without the b+, we can use href="http://www.regular-expressions.info/lookaround.html"
rel="noreferrer">lookahead assertion
(?=…).



Here is our
pattern with a simple test
harness:



function testAll($r,
$tests) {
foreach ($tests as $test) {
$isMatch = preg_match($r,
$test, $groups);
$groupsJoined = join('|', $groups);


print("$test $isMatch $groupsJoined\n");

}
}

$tests = array('aaa', 'aaab', 'aaaxb', 'xaaab', 'b',
'abbb');

$r1 = '/^a+(?=b+)/';
# └────┘
#
lookahead


testAll($r1,
$tests);


The output is
(as seen on
ideone.com
):



aaa
0
aaab 1 aaa
aaaxb 0
xaaab 0
b
0

abbb 1
a


This is exactly the
output we want: we match a+, only if it's at the beginning of
the string, and only if it's immediately followed by
b+.



Lesson:
You can use patterns in lookarounds to make
assertions.



/>

Step 2: Capturing in a lookahead (and f r e e - s p
a c i n g mode)




Now let's say that
even though we don't want the b+ to be part of the match, we do
want to rel="noreferrer">capture it anyway into group 1. Also, as we anticipate
having a more complicated pattern, let's use x modifier for
rel="noreferrer">free-spacing so we can make our regex more
readable.



Building on our previous PHP snippet,
we now have the following
pattern:



$r2 = '/ ^ a+ (?= (b+) )
/x';
# │ └──┘ │
# │ 1 │
# └────────┘
#
lookahead


testAll($r2,
$tests);


The output is
now (as seen on
ideone.com
):



aaa
0
aaab 1 aaa|b
aaaxb 0
xaaab 0

b
0
abbb 1
a|bbb


Note that e.g.
aaa|b is the result of join-ing what
each group captured with '|'. In this case, group 0 (i.e. what
the pattern matched) captured aaa, and group 1 captured
b.



Lesson:
You can capture inside a lookaround. You can use free-spacing to enhance
readability.



/>


Step 3: Refactoring the lookahead into
the "loop"



Before we can introduce our counting
mechanism, we need to do one modification to our pattern. Currently, the lookahead is
outside of the + repetition "loop". This is fine so far because
we just wanted to assert that there's a b+ following our
a+, but what we really want to do
eventually is assert that for each a that we match inside the
"loop", there's a corresponding b to go with
it.



Let's not worry about the counting mechanism
for now and just do the refactoring as
follows:




  • First refactor
    a+ to (?: a )+ (note that
    (?:…) is a non-capturing
    group)

  • Then move the lookahead inside this non-capturing
    group



    • Note that we must
      now "skip" a* before we can "see" the
      b+, so modify the pattern
      accordingly




So
we now have the following:



$r3 =
'/ ^ (?: a (?= a* (b+) ) )+ /x';
# │ │ └──┘ │ │
# │ │ 1 │


# │ └───────────┘ │
# │ lookahead │
#
└───────────────────┘
# non-capturing
group


The output is
the same as before (as seen on
ideone.com
), so there's no change in that regard. The important thing is that
now we are making the assertion at every iteration of the
+ "loop". With our current pattern, this is not necessary, but
next we'll make group 1 "count" for us using
self-reference.



Lesson:
You can capture inside a non-capturing group. Lookarounds can be
repeated.




/>

Step 4: This is the step where we start
counting



Here's what we're going to do: we'll
rewrite group 1 such
that:




  • At the end of the
    first iteration of the +, when the first
    a is matched, it should capture
    b

  • At the end of the second
    iteration, when another a is matched, it should capture
    bb

  • At the end of the third
    iteration, it should capture
    bbb


  • ...

  • At
    the end of the n-th iteration, group 1 should capture
    bn

  • If there
    aren't enough b to capture into group 1 then the assertion
    simply fails



So group 1,
which is now (b+), will have to be rewritten to something like
(\1 b). That is, we try to "add" a b
to what group 1 captured in the previous
iteration.



There's a slight problem here in that
this pattern is missing the "base case", i.e. the case where it can match without the
self-reference. A base case is required because group 1 starts "uninitialized"; it
hasn't captured anything yet (not even an empty string), so a self-reference attempt
will always fail.



There are many ways around
this, but for now let's just make the self-reference matching href="http://www.regular-expressions.info/optional.html"
rel="noreferrer">optional, i.e. \1?. This may or
may not work perfectly, but let's just see what that does, and if there's any problem
then we'll cross that bridge when we come to it. Also, we'll add some more test cases
while we're at
it.




$tests =
array(
'aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb', 'aabb', 'aaabbbbb',
'aaaaabbb'
);

$r4 = '/ ^ (?: a (?= a* (\1? b) ) )+
/x';
# │ │ └─────┘ | │
# │ │ 1 | │
# │ └──────────────┘

# │ lookahead │

# └──────────────────────┘
#
non-capturing
group


The output is
now (as seen on
ideone.com
):



aaa
0
aaab 1 aaa|b # (*gasp!*)
aaaxb 0
xaaab
0

b 0
abbb 1 a|b # yes!
aabb 1 aa|bb #
YES!!
aaabbbbb 1 aaa|bbb # YESS!!!
aaaaabbb 1 aaaaa|bb #
NOOOOOoooooo....


A-ha!
It looks like we're really close to the solution now! We managed to get group 1 to
"count" using self-reference! But wait... something is wrong with the second and the
last test cases!! There aren't enough bs, and somehow it
counted wrong! We'll examine why this happened in the next
step.



Lesson:
One way to "initialize" a self-referencing group is to make the self-reference matching
optional.




/>

Step 4½: Understanding what went
wrong



The problem is that since we made the
self-reference matching optional, the "counter" can "reset" back to 0 when there aren't
enough b's. Let's closely examine what happens at every
iteration of our pattern with aaaaabbb as
input.



 a a a a a b b
b

# Initial state: Group 1 is
"uninitialized".

_
a a a a a b b b


# 1st iteration: Group 1 couldn't match \1 since it was
"uninitialized",
# so it matched and captured just b

___
a a a a a b b b

# 2nd iteration: Group 1 matched
\1b and captured bb
_____

a a a a a b b b


# 3rd iteration: Group 1 matched \1b and captured bbb

_
a a a a a b b b

# 4th iteration: Group 1 could
still match \1, but not \1b,
# (!!!) so it matched and captured just
b
___
a a a a a b b b


# 5th
iteration: Group 1 matched \1b and captured bb
#
# No more a, +
"loop"
terminates


A-ha! On
our 4th iteration, we could still match \1, but we couldn't
match \1b! Since we allow the self-reference matching to be
optional with \1?, the engine backtracks and took the "no
thanks" option, which then allows us to match and capture just
b!



Do note, however,
that except on the very first iteration, you could always match just the self-reference
\1. This is obvious, of course, since it's what we just
captured on our previous iteration, and in our setup we can always match it again (e.g.
if we captured bbb last time, we're guaranteed that there will
still be bbb, but there may or may not be
bbbb this
time).




Lesson:
Beware of backtracking. The regex engine will do as much backtracking as you allow until
the given pattern matches. This may impact performance (i.e. href="http://www.regular-expressions.info/catastrophic.html"
rel="noreferrer">catastrophic backtracking) and/or
correctness.



/>

Step 5: Self-possession to the
rescue!



The "fix" should now be obvious:
combine optional repetition with href="http://www.regular-expressions.info/possessive.html"
rel="noreferrer">possessive quantifier. That is, instead of simply
?, use ?+ instead (remember that a
repetition that is quantified as possessive does not backtrack, even if such
"cooperation" may result in a match of the overall
pattern).



In very informal terms, this is what
?+, ? and ??
says:






?+





  • (optional) "It doesn't have to be there,"



    • (possessive) "but if it is there, you must
      take it and not let go!"







?





  • (optional) "It doesn't have to be there,"



    • (greedy) "but if it is you can take it for
      now,"



      • (backtracking)
        "but you may be asked to let it go later!"








??





  • (optional) "It doesn't have to be there,"




    • (reluctant) "and even if it is
      you don't have to take it just yet,"



      • (backtracking) "but you may be asked to take it later!"









In
our setup, \1 will not be there the very first time, but it
will always be there any time after that, and we
always want to match it then. Thus, \1?+
would accomplish exactly what we
want.



$r5 = '/ ^ (?: a (?= a*
(\1?+ b) ) )+ /x';
# │ │ └──────┘ │ │
# │ │ 1 │ │
# │
└───────────────┘ │
# │ lookahead │
#
└───────────────────────┘
# non-capturing
group



Now
the output is (as seen on
ideone.com
):



aaa
0
aaab 1 a|b # Yay! Fixed!
aaaxb 0
xaaab 0
b
0
abbb 1 a|b
aabb 1 aa|bb

aaabbbbb 1
aaa|bbb
aaaaabbb 1 aaa|bbb #
Hurrahh!!!


VoilĂ !!!
Problem solved!!! We are now counting properly, exactly the way we want it
to!



Lesson:
Learn the difference between greedy, reluctant, and possessive repetition.
Optional-possessive can be a powerful combination.



/>


Step 6: Finishing
touches



So what we have right now is a pattern
that matches a repeatedly, and for every
a that was matched, there is a corresponding
b captured in group 1. The +
terminates when there are no more a, or if the assertion failed
because there isn't a corresponding b for an
a.



To finish the job,
we simply need to append to our pattern \1 $. This is now a
back reference to what group 1 matched, followed by the end of the line anchor. The
anchor ensures that there aren't any extra b's in the string;
in other words, that in fact we have
anbn.



Here's
the finalized pattern, with additional test cases, including one that's 10,000
characters long:



$tests =
array(
'aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb', 'aabb', 'aaabbbbb',
'aaaaabbb',

'', 'ab', 'abb', 'aab', 'aaaabb', 'aaabbb', 'bbbaaa',
'ababab', 'abc',
str_repeat('a', 5000).str_repeat('b',
5000)
);

$r6 = '/ ^ (?: a (?= a* (\1?+ b) ) )+ \1 $
/x';
# │ │ └──────┘ │ │
# │ │ 1 │ │
# │ └───────────────┘

# │ lookahead │
# └───────────────────────┘

#
non-capturing group


It
finds 4 matches: ab, aabb,
aaabbb, and the
a5000b5000. It takes
only 0.06s to run on
ideone.com
.



/>

Step 7: The Java
test



So the pattern works in PHP, but the
ultimate goal is to write a pattern that works in
Java.




public static
void main(String[] args) {

String aNbN = "(?x) (?: a (?= a* (\\1?+
b)) )+ \\1";
String[] tests = {
"", // false
"ab", //
true
"abb", // false
"aab", // false
"aabb", //
true

"abab", // false
"abc", // false

repeat('a', 5000) + repeat('b', 4999), // false
repeat('a', 5000) +
repeat('b', 5000), // true
repeat('a', 5000) + repeat('b', 5001), //
false
};
for (String test : tests) {

System.out.printf("[%s]%n %s%n%n", test, test.matches(aNbN));

}


}

static String repeat(char ch,
int n) {
return new String(new char[n]).replace('\0',
ch);
}


The
pattern works as expected (as
seen on ideone.com
).



/>


And now we come to the
conclusion...



It needs to be said that the
a* in the lookahead, and indeed the "main
+ loop", both permit backtracking. Readers are encouraged to
confirm why this is not a problem in terms of correctness, and why at the same time
making both possessive would also work (though perhaps mixing mandatory and
non-mandatory possessive quantifier in the same pattern may lead to
misperceptions).



It should also be said that
while it's neat that there's a regex pattern that will match
anbn, this is in not
always the "best" solution in practice. A much better solution is to simply match
^(a+)(b+)$, and then compare the length of the strings captured
by groups 1 and 2 in the hosting programming
language.



In PHP, it may look something like
this (as seen in
ideone.com
):



function
is_anbn($s) {

return (preg_match('/^(a+)(b+)$/', $s, $groups))
&&
(strlen($groups[1]) ==
strlen($groups[2]));
}


The
purpose of this article is NOT to convince readers that regex can
do almost anything; it clearly can't, and even for the things it can do, at least
partial delegation to the hosting language should be considered if it leads to a simpler
solution.



As mentioned at the top, while this
article is necessarily tagged [regex] for stackoverflow, it is
perhaps about more than that. While certainly there's value in learning about
assertions, nested references, possessive quantifier, etc, perhaps the bigger lesson
here is the creative process by which one can try to solve problems, the determination
and hard work that it often requires when you're subjected to various constraints, the
systematic composition from various parts to build a working solution,
etc.



/>


Bonus material! PCRE recursive
pattern!



Since we did bring up PHP, it needs to
be said that PCRE supports recursive pattern and subroutines. Thus, following pattern
works for preg_match ( rel="noreferrer">as seen on
ideone.com
):



$rRecursive
= '/ ^ (a (?1)? b) $
/x';


Currently Java's
regex does not support recursive
pattern.




/>

Even more bonus material! Matching
anbncn
!!



So we've seen how to match
anbn which is
non-regular, but still context-free, but can we also match
anbncn,
which isn't even context-free?



The answer is, of
course, YES! Readers are encouraged to try to solve this on their
own, but the solution is provided below (with rel="noreferrer">implementation in Java on
ideone.com
).



class="spoiler">

^ (?: a (?= a* (\1?+ b) b* (\2?+ c)
) )+ \1 \2 $





No comments:

Post a Comment

php - file_get_contents shows unexpected output while reading a file

I want to output an inline jpg image as a base64 encoded string, however when I do this : $contents = file_get_contents($filename); print &q...