Thursday, 14 December 2017

Why does this regular expression kill the Java regex engine?

itemprop="text">


I have this naive regex
"<([\s]|[^<])+?>" (excluding the quotation marks). It seems
so
straightforward but it is indeed evil when it works against the below HTML
text. It sends the Java regular expression engine to an infinite loop.



I have another regex ("<.+?>"), which
does somewhat the same thing, but it doesn't kill anything. Do you know why this
happens?



            language="JavaScript" type="text/javascript">
var numDivs,
layerName;
layerName = "lnavLayer";
catLinkName =
"category";

numDivs = 2;
function
toggleLayer(layerID){
if (!(navigator.appName == "Netscape" &&
navigator.appVersion.substr(0, 1) < 5)){
thisLayer =
document.getElementById(layerName + layerID);
categoryLink =
document.getElementById(catLinkName + layerID);
closeThem();
if
(thisLayer.className == 'subnavDefault'){
thisLayer.className =
'subnavToggled';
categoryLink.className =
'leftnavLinkSelectedSection';
}

}

}
function closeThem(){
for(x = 0; x < numDivs;
x++){
theLayer = document.getElementById(layerName + (x
+
1));
thecategoryLink = document.getElementById(catLinkName + (x +
1));
theLayer.className = 'subnavDefault';

thecategoryLink.className = 'leftnavLink';
}

} var flag
= 0; var lastClicked = 0
//-->




it
even keeps looping with an online Java regex tool (such as href="http://www.fileformat.info/tool/regex.htm"
rel="noreferrer">www.fileformat.info/tool/regex.htm) or a utility like
rel="noreferrer">RegexBuddy.


class="post-text" itemprop="text">
class="normal">Answer



The reason
the Java regex engine crashes is that this part of your regex causes a
(indeed!):



[\s]|[^<]



What
happens here is that every character matched by \s can also be matched by [^<]. That
means there are two ways to match each whitespace character. If we represent the two
character classes with A and
B:



A|B


Then
a string of three spaces could be matched as AAA, AAB, ABA, ABB, BAA, BAB, BBA, or BBB.
In other words the complexity of this part of the regex is 2^N. This will kill any regex
engine that doesn't have any safeguards against what I call href="http://www.regular-expressions.info/catastrophic.html" rel="noreferrer"
title="Catastrophic backtracking">catastrophic
backtracking.



When using alternation
(vertical bar) in a regex, always make sure the alternatives are mutually exclusive.
That is, at most one of the alternatives may be allowed to match any given bit of
text.



No comments:

Post a Comment

php - file_get_contents shows unexpected output while reading a file

I want to output an inline jpg image as a base64 encoded string, however when I do this : $contents = file_get_contents($filename); print &q...