Tuesday, 26 December 2017

php - Multiple matches within a regex group?


I need to match all 'tags' (e.g. %thisIsATag%) that occur within XML attributes. (Note: I'm guaranteed to receive valid XML, so there is no need to use full DOM traversal.) My regex works, except that when there are two tags in a single attribute, only the last one is returned.



In other words, this regex should find tag1, tag2, ..., tag6. However, it omits tag2 and tag5.



Here's a fun little test harness for you (PHP):





<?php
$xml = <<<XML
height="250">


x="30%" y="50%" animatefromx="800">
fontstyle="bold" text="Screen One!%tag2% %tag3%"/>





delay='%tag4%'>










fontstyle="bold" text="Screen Tres!"/>






animatefromx="800">








XML;

$matches = null;
preg_match_all('#<[^>]+("([^%>"]*%([^%>"]+)%[^%>"]*)+"|\'([^%>\']*%([^%>\']+)%[^%>\']*)+\')[^>]*>#i', $xml, $matches);

print_r($matches);
?>



Thanks! :)



Answer




What you're trying to do is recover intermediate captures from groups that match more than once per regex match. As far as I know, only .NET and Perl 6 provide that capability. In PHP, a repeated group keeps only its last iteration, which is why tag2 and tag5 go missing. You'll have to do the job in two stages: match an attribute value with one or more %tag% sequences in it, then break out the individual sequences.
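
To see the limitation in isolation, here is a minimal sketch that applies a repeated group to a single attribute value taken from the question. The simplified patterns are illustrative assumptions, not the regex from the question; they only show that the repeated group ends up holding its last iteration, while a second pass over the value recovers every tag.

```php
<?php
// One attribute value from the question's test data.
$value = 'Screen One!%tag2% %tag3%';

// Group 1 sits inside a repeated (?:...)+, so each pass overwrites it.
preg_match('/^(?:[^%]*%([^%]+)%)+$/', $value, $m);
// $m[1] now holds only the capture from the final iteration.

// Second stage: scan the isolated value for every %tag% sequence.
preg_match_all('/%([^%]+)%/', $value, $all);
$tags = $all[1];

print_r($m[1]);
echo "\n";
print_r($tags);
```

The second call works because preg_match_all restarts the whole pattern after each match instead of repeating a group inside one match.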



You don't seem to care which XML tag or attribute the values are associated with, so you could use this somewhat simpler regex to find the values with %tag% sequences in them:



'#"([^"%<>]*+%[^%"]++%[^"]*+)"|\'([^\'%<>]*+%[^%\']++%[^\']*+)\'#'
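
A rough sketch of how the two stages might fit together with that regex. The XML sample and its element names are made up for illustration (only the attribute values echo the question), and the second-stage pattern %([^%]+)% is an assumption about what counts as a tag.

```php
<?php
// Hypothetical sample; element names are invented for illustration.
$xml = <<<'XML'
<screen name="one" height="250">
  <text x="30%" y="50%" fontstyle="bold" text="Screen One!%tag2% %tag3%"/>
  <anim delay='%tag4%'/>
</screen>
XML;

// Stage 1: find quoted values that contain at least one %tag% sequence.
$valueRegex = '#"([^"%<>]*+%[^%"]++%[^"]*+)"|\'([^\'%<>]*+%[^%\']++%[^\']*+)\'#';
preg_match_all($valueRegex, $xml, $sets, PREG_SET_ORDER);

$tags = [];
foreach ($sets as $set) {
    // Group 1 holds double-quoted values, group 2 single-quoted ones.
    $value = ($set[1] ?? '') !== '' ? $set[1] : ($set[2] ?? '');

    // Stage 2: pull every %tag% out of the isolated value.
    preg_match_all('/%([^%]+)%/', $value, $found);
    $tags = array_merge($tags, $found[1]);
}

print_r($tags);
```

Values such as "30%" are skipped in stage 1 because they contain only a single percent sign.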


EDIT: That regex captures the attribute value in group 1 or group 2, depending on which kind of quotes the attribute used. Here's another version that merges the alternatives so it can always save the value in group 2:




'#(["\'])((?:(?![%<>]|\1).)*+%(?:(?!%|\1).)++%(?:(?!\1).)*+)\1#'
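
With the merged version, the first stage no longer has to pick between two groups. A minimal sketch under the same assumptions as before (an invented XML sample and an assumed second-stage pattern):

```php
<?php
// Hypothetical sample; element names are invented for illustration.
$xml = <<<'XML'
<screen name="one" height="250">
  <text x="30%" y="50%" fontstyle="bold" text="Screen One!%tag2% %tag3%"/>
  <anim delay='%tag4%'/>
</screen>
XML;

// Group 1 captures the opening quote; \1 requires the same quote to close.
$mergedRegex = '#(["\'])((?:(?![%<>]|\1).)*+%(?:(?!%|\1).)++%(?:(?!\1).)*+)\1#';
preg_match_all($mergedRegex, $xml, $m);

// Every attribute value lands in group 2, whichever quote style was used.
$values = $m[2];

$tags = [];
foreach ($values as $value) {
    // Second stage: break out the individual %tag% sequences.
    preg_match_all('/%([^%]+)%/', $value, $found);
    $tags = array_merge($tags, $found[1]);
}

print_r($tags);
```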

