Sunday 29 October 2017

regex - Regular expression to match a line that doesn't contain a word



I know it's possible to match a word and then reverse the matches using other tools (e.g. grep -v). However, is it possible to match lines that do not contain a specific word, e.g. hede, using a regular expression?



Input:



hoho
hihi
haha
hede



Code:



grep "" input


Desired output:



hoho
hihi
haha


Answer




The notion that regex doesn't support inverse
matching is not entirely true. You can mimic this behavior by using negative
look-arounds:



^((?!hede).)*$


The regex above matches any string (or any line, provided it has no line break) that does not contain the (sub)string 'hede'. This is not something regex is "good" at (or should do), but it is possible.
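As a quick check of the pattern above, here is a minimal sketch in Python that filters the question's sample input. Python's re module is an assumption made for illustration (the question itself uses grep), and the names NO_HEDE, lines and kept are hypothetical.

```python
import re

# Pattern from the answer: at each position, assert that "hede" does not
# start here, then consume one character; repeat until the end of the line.
NO_HEDE = re.compile(r"^((?!hede).)*$")

# Sample input from the question.
lines = ["hoho", "hihi", "haha", "hede"]

# Keep only the lines that the pattern matches, i.e. lines without "hede".
kept = [line for line in lines if NO_HEDE.match(line)]
print(kept)
```

The word is rejected anywhere in the line, not only at the start, because the look-ahead is re-checked before every single character.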




And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):



/^((?!hede).)*$/s


or use it inline:



/(?s)^((?!hede).)*$/



(where the /.../ are the regex delimiters, i.e., not part of the pattern)



If the DOT-ALL modifier is
not available, you can mimic the same behavior with the character class
[\s\S]:



/^((?!hede)[\s\S])*$/
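The three variants can be compared side by side. The sketch below again assumes Python's re module, where the DOT-ALL modifier is spelled re.DOTALL (or (?s) inline) and patterns are written without the /.../ delimiters. The sample strings and variable names are hypothetical.

```python
import re

# Two multi-line inputs: one without "hede", one with it on the second line.
clean = "hoho\nhihi"
dirty = "hoho\nhede"

plain = re.compile(r"^((?!hede).)*$")              # dot stops at line breaks
dotall = re.compile(r"^((?!hede).)*$", re.DOTALL)  # DOT-ALL as a flag
inline = re.compile(r"(?s)^((?!hede).)*$")         # DOT-ALL inline
char_class = re.compile(r"^((?!hede)[\s\S])*$")    # no modifier needed

for name, pattern in [("plain", plain), ("dotall", dotall),
                      ("inline", inline), ("char_class", char_class)]:
    print(name, bool(pattern.match(clean)), bool(pattern.match(dirty)))
```

Without DOT-ALL the plain pattern cannot get past the first line break, so it rejects even the clean multi-line input. The other three accept the clean input and reject the one containing 'hede'.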


Explanation



A string is just a list of n characters. Before and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":





    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘

index    0      1      2      3      4      5      6      7


where the e's are the empty strings. The regex (?!hede). looks ahead to check that the substring "hede" does not start at the current position, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width assertions because they don't consume any characters. They only assert/validate something.




So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$



As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
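The walk through the empty strings can be reproduced position by position. This is a rough sketch, assuming Python's re module and its ability to start a match at a given index. The names s, lookahead and first_fail are hypothetical, and the e-labels follow the diagram (e1 sits before index 0).

```python
import re

s = "ABhedeCD"

# The bare negative look-ahead: succeeds (matching nothing) wherever
# "hede" does NOT start at the current position.
lookahead = re.compile(r"(?!hede)")

# Test the look-ahead at the empty string in front of every character.
for i, ch in enumerate(s):
    ok = lookahead.match(s, i) is not None
    status = "passes" if ok else "fails"
    print(f"e{i + 1} (before {ch!r} at index {i}): look-ahead {status}")

# First index at which the look-ahead fails.
first_fail = next(i for i in range(len(s)) if lookahead.match(s, i) is None)

# The full anchored pattern therefore rejects the whole string.
full_match = re.match(r"^((?!hede).)*$", s)
print(first_fail, full_match)
```

The look-ahead fails only at index 2, which is e3 in the diagram. The group can consume "AB" at most, so $ can never be reached and the overall match fails.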


