Tuesday 8 October 2019

Building Regular Expression (RegEx) to extract text of HTML tag





I am trying to build a regular expression to extract the text inside the HTML tag shown below. However, I have limited experience with regular expressions, and I'm having trouble building the pattern.



How can I extract the text from this tag:



text




That is just a sample of the HTML source of the page. Basically, I need a regex string to match the "text" inside of the tag. Can anyone assist me with this? Thank you. I hope my question wasn't phrased too horribly.



UPDATE: Just for clarification, report_drilldown is absolute (it never changes), but I don't really care whether it appears in the regex as a literal or not.



145817 is a random six-digit number that is actually a database id. "text" is just simple plain text, so it shouldn't be invalid HTML. Also, most people are saying that it's best not to use regex in this situation, so what would be best to use instead? Thanks so much!


Answer



([^<]*)


This won't really solve the problem, but it may just barely scrape by. In particular, it's very brittle: the slightest change to the markup and it won't match. If report_drilldown isn't meant to be absolute, replace it with [^']*, and/or capture both it and the number if you need them.
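To make the brittleness concrete, here is a minimal sketch in Python. The tag from the question is not reproduced here, so the `<a href='...'>` markup, the `SAMPLE` string, and the `extract` helper are all assumptions made up for illustration; only report_drilldown, the six-digit id, and the ([^<]*) capture come from the discussion above.

```python
import re

# Hypothetical markup: the element name and attribute layout are assumptions,
# not the asker's real tag.
SAMPLE = "<a href='report_drilldown/145817'>text</a>"

# Literal prefix, a six-digit id, then capture everything up to the next '<'.
PATTERN = re.compile(r"<a href='report_drilldown/(\d{6})'>([^<]*)</a>")


def extract(html):
    """Return (id, text) for the first match, or None if the markup differs."""
    match = PATTERN.search(html)
    return (match.group(1), match.group(2)) if match else None


print(extract(SAMPLE))                    # ('145817', 'text')
# Swapping the quote style is enough to defeat the pattern.
print(extract(SAMPLE.replace("'", '"')))  # None
```

The second call shows why this only scrapes by: the pattern encodes the exact quoting and attribute order, so any harmless change to the page breaks it.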




If you need something that actually parses HTML, it's a bit of a nightmare if you have to deal with tag soup. If you were using Python, I'd suggest BeautifulSoup, but I don't know of an equivalent for C#. (Does anyone know of a similar tag soup parsing library for C#?)
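As a rough sketch of the parser-based approach, the following uses Python's standard-library `html.parser` rather than BeautifulSoup, so it runs without extra packages. It assumes, hypothetically, that the target is an `<a>` element whose `href` contains report_drilldown; the class name `LinkTextExtractor` and the helper `link_texts` are invented for this example.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collect the text of <a> elements whose href contains a marker string."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.results = []     # finished link texts
        self._buffer = None   # text pieces of the link being read, if any

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; values may be None.
        href = dict(attrs).get("href") or ""
        if tag == "a" and self.marker in href:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.results.append("".join(self._buffer).strip())
            self._buffer = None


def link_texts(html, marker="report_drilldown"):
    """Return the text of every matching link in the given HTML string."""
    parser = LinkTextExtractor(marker)
    parser.feed(html)
    parser.close()
    return parser.results


# Upper-case tags, double quotes and extra attributes do not matter here.
print(link_texts('<p><A HREF="report_drilldown/145817" class=x>text</a></p>'))
```

Unlike the regex, this keeps working when quoting, letter case, or attribute order changes, which is the main reason a parser is usually preferred for this kind of job.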

