I need to match all 'tags' (e.g. %thisIsATag%) that occur within XML attributes. (Note: I'm guaranteed to receive valid XML, so there is no need for full DOM traversal.) My regex works, except that when there are two tags in a single attribute, only the last one is returned.
In other words, this regex should find tag1, tag2, ..., tag6. However, it omits tag2 and tag5.
Here's a fun little test harness for you (PHP):
$xml = <<<XML
height="250">
x="30%" y="50%" animatefromx="800">
fontstyle="bold" text="Screen One!%tag2% %tag3%"/>
delay='%tag4%'>
fontstyle="bold" text="Screen Tres!"/>
animatefromx="800">
XML;
$matches = null;
preg_match_all('#<[^>]+("([^%>"]*%([^%>"]+)%[^%>"]*)+"|\'([^%>\']*%([^%>\']+)%[^%>\']*)+\')[^>]*>#i',
$xml,
$matches);
print_r($matches);
?>
Thanks! :)
Answer
What you're trying to do is recover intermediate captures from groups that match more than once per regex match. In PCRE, which PHP uses, a repeated group keeps only the text of its last iteration, which is why you only get the last tag in each attribute. As far as I know, only .NET and Perl 6 provide that capability. You'll have to do the job in two stages: match an attribute value with one or more %tag% sequences in it, then break out the individual sequences.
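To illustrate the limitation, here is a minimal sketch. The pattern and sample string are made up for this demonstration and are not taken from the question. A group that repeats inside a single match holds only its last iteration, while a separate preg_match_all pass over the same text recovers every sequence:

```php
<?php
// Hypothetical attribute value containing two %tag% sequences.
$value = '%tag2% %tag3%';

// One match, repeated group: group 1 is overwritten on each iteration.
preg_match('/(?:%(\w+)%\s*)+/', $value, $single);
$lastOnly = $single[1];

// Separate pass over the value: every sequence is a match of its own.
preg_match_all('/%(\w+)%/', $value, $all);
$everyTag = $all[1];

var_dump($lastOnly);  // string(4) "tag3"
print_r($everyTag);   // contains tag2 and tag3
```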
You don't seem to care which XML tag or attribute the values are associated with, so you could use this somewhat simpler regex to find the values with %tag% sequences in them:
'#"([^"%<>]*+%[^%"]++%[^"]*+)"|\'([^\'%<>]*+%[^%\']++%[^\']*+)\'#'
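As a rough sketch of the two stages with that regex: the sample XML below is invented for illustration (the element and attribute names are assumptions, not the question's original markup), and the second-stage pattern %([^%]+)% is one plausible way to split a value into its sequences.

```php
<?php
// Hypothetical sample input; element and attribute names are made up.
$xml = <<<'XML'
<screen width="%tag1%" height="250">
  <label text="Screen One! %tag2% %tag3%"/>
  <pause delay='%tag4%'/>
</screen>
XML;

// Stage 1: find attribute values that contain at least one %tag% sequence.
$valueRegex = '#"([^"%<>]*+%[^%"]++%[^"]*+)"|\'([^\'%<>]*+%[^%\']++%[^\']*+)\'#';
preg_match_all($valueRegex, $xml, $valueMatches, PREG_SET_ORDER);

$tags = [];
foreach ($valueMatches as $m) {
    // Group 1 holds a double-quoted value, group 2 a single-quoted one.
    $value = $m[1] !== '' ? $m[1] : $m[2];
    // Stage 2: break out each %tag% sequence within that value.
    preg_match_all('#%([^%]+)%#', $value, $tagMatches);
    $tags = array_merge($tags, $tagMatches[1]);
}

print_r($tags);  // tag1, tag2, tag3, tag4 in document order
```

Checking which group is non-empty is needed here because the value lands in a different group depending on the quote style.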
EDIT: That regex captures the attribute value in group 1 or group 2, depending on which quotes were used. Here's another version that merges the alternatives so it always saves the value in group 2:
'#(["\'])((?:(?![%<>]|\1).)*+%(?:(?!%|\1).)++%(?:(?!\1).)*+)\1#'
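With the value always in group 2, the second stage no longer has to check which alternative matched. A minimal sketch, assuming a helper named extractTags (a hypothetical name, not part of the answer) and an invented sample element:

```php
<?php
// extractTags() is a hypothetical helper name used only for this sketch.
function extractTags(string $xml): array
{
    // Stage 1: group 1 is the opening quote, group 2 the attribute value.
    $valueRegex = '#(["\'])((?:(?![%<>]|\1).)*+%(?:(?!%|\1).)++%(?:(?!\1).)*+)\1#';
    preg_match_all($valueRegex, $xml, $values);

    $tags = [];
    foreach ($values[2] as $value) {
        // Stage 2: pull each %tag% sequence out of the value.
        preg_match_all('#%([^%]+)%#', $value, $found);
        foreach ($found[1] as $tag) {
            $tags[] = $tag;
        }
    }
    return $tags;
}

// Made-up sample element mixing both quote styles.
$sample = '<item a="%tag1% and %tag2%" b=\'%tag3%\' c="plain"/>';
print_r(extractTags($sample));  // tag1, tag2, tag3
```

Note that the dot in this pattern does not match a newline unless the s modifier is added, so attribute values that span lines would be skipped by this sketch.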