Wednesday 5 December 2018

Match text against multiple regexes in Python




I have a text corpus of 11 files, each with about 190,000 lines.
I have 10 strings, one or more of which may appear in each line of the corpus.



When I encounter any of the 10 strings in a line, I need to record which of them appear in that line.
The brute-force way of looping through the regexes for every line and marking each match is taking a long time. Is there a more efficient way of doing this?



How do I record which regexes matched the line? I found a post (Match a line with multiple regex using Python), but its approach only gives a True or False result:



any(regex.match(line) for regex in [regex1, regex2, regex3])



Edit: adding example



regex = ['quick','brown','fox']
line1 = "quick brown fox jumps on the lazy dog" # I need to record all of quick, brown and fox
line2 = "quick dog and brown rabbit ran together" # I should record quick and brown
line3 = "fox was quick an rabit was slow" # I should record quick and fox


Looping through the regexes and recording the ones that match is one solution, but at this scale (11 * 190000 * 10) my script has been running for a while now. I need to repeat this quite often in my work, so I am looking for a more efficient way.



Answer



The approach below applies if you want the matched text. If you need to know which regular expression in the list triggered a match, you are out of luck and will probably need to loop.



Based on the link you have provided:



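import re

regexes = 'quick', 'brown', 'fox'
# Join all patterns into one alternation; (?:...) keeps each pattern self-contained.
combinedRegex = re.compile('|'.join('(?:{0})'.format(x) for x in regexes))

lines = 'The quick brown fox jumps over the lazy dog', 'Lorem ipsum dolor sit amet', 'The lazy dog jumps over the fox'

for line in lines:
    # findall returns every non-overlapping match in the line
    print(combinedRegex.findall(line))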


outputs:



['quick', 'brown', 'fox']
[]
['fox']
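
If the search terms are plain strings, as in the question, one possible way to trace each match back to its entry in the list while still using a single combined regex is to wrap every alternative in a named group and read the match object's lastgroup attribute. This is only a sketch and not part of the original answer: the names patterns, combined and matched_patterns are made up for illustration, and it assumes the entries are literals (hence re.escape) without named groups of their own. It has the same non-overlapping limitation as findall.

```python
import re

# Hypothetical list of literal search strings (an assumption for this sketch).
patterns = ['quick', 'brown', 'fox']

# Give each alternative its own named group: p0, p1, p2, ...
combined = re.compile('|'.join(
    '(?P<p{0}>{1})'.format(i, re.escape(p)) for i, p in enumerate(patterns)))

def matched_patterns(line):
    """Return the patterns found in line, in order of appearance."""
    # lastgroup is the name of the group that matched, e.g. 'p2' -> index 2
    return [patterns[int(m.lastgroup[1:])] for m in combined.finditer(line)]

print(matched_patterns('fox was quick an rabit was slow'))  # ['fox', 'quick']
```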



The point here is that you do not loop over the regexes but combine them into a single one.
The difference from the looping approach is that re.findall does not find overlapping matches. For instance, if your regexes were regexes = 'bro', 'own', the output for the lines above would be:



['bro']
[]
[]



whereas the looping approach would result in:



['bro', 'own']
[]
[]
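
For comparison, the looping approach that produces the output above could be sketched as follows. The names compiled and patterns_found are illustrative only. The sketch compiles each pattern once up front and uses search rather than match, since match only tests the beginning of the line.

```python
import re

patterns = ['bro', 'own']
# Compile each pattern once, outside the per-line loop.
compiled = [(p, re.compile(p)) for p in patterns]

def patterns_found(line):
    """Return every pattern that occurs anywhere in line, overlaps included."""
    return [p for p, rx in compiled if rx.search(line)]

lines = ('The quick brown fox jumps over the lazy dog',
         'Lorem ipsum dolor sit amet',
         'The lazy dog jumps over the fox')

for line in lines:
    print(patterns_found(line))
```

Each pattern is tested independently, so 'bro' and 'own' are both reported for "brown" even though they share the letter "o".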
