Monday, 27 August 2018

regex - Understanding Perl regular expression modifiers /m and /s





I have been reading about Perl regular expressions with the modifiers s, m and g. I understand that //g is global matching, where it will be a greedy search.



But I am confused by the modifiers s and m. Can anyone explain the difference between s and m, with a code example to show how they differ? I have tried searching online, and it only gives the explanation in the link http://perldoc.perl.org/perlre.html#Modifiers. On Stack Overflow I have even seen people using s and m together. Isn't s the opposite of m?



//s 
//m
//g


I am not able to match multiple lines using m.




use warnings;
use strict;
use 5.012;

my $file;
{
    local $/ = undef;   # slurp mode: read the whole of DATA in one go
    $file = <DATA>;
}

my @strings = $file =~ /".*"/mg; #returns all except the last string across multiple lines
#/"String"/mg; tried with this as well and returns nothing except String
say for @strings;

__DATA__
"This is string"
"1!=2"
"This is \"string\""
"string1"."string2"
"String"

"S
t
r
i
n
g"

Answer



The documentation that you link to yourself seems very clear to me. It would help if you would explain what problem you had with understanding it, and how you came to think that /s and /m were opposites.




Very briefly, /s changes the behaviour of the dot metacharacter . so that it matches any character at all. Normally it matches anything except a newline "\n". With /s the string is therefore treated as a single line, even if it contains newlines.
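
As a minimal sketch of the effect of /s, the snippet below applies the same pattern to a two-line string with and without the modifier. The string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical two-line sample string
my $text = "first\nsecond";

# Without /s the dot stops at the newline, so only the first line is captured
my ($plain) = $text =~ /(.*)/;

# With /s the dot also matches "\n", so the capture runs across both lines
my ($single) = $text =~ /(.*)/s;

printf "without /s: %d characters\n", length $plain;    # 5
printf "with /s:    %d characters\n", length $single;   # 12
```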



/m modifies the caret ^ and dollar $ metacharacters so that they also match at newlines within the string, treating it as a multi-line string. Normally they match only at the beginning and end of the string (strictly, $ also matches just before a newline that ends the string).
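
A short sketch of the effect of /m, again with an invented sample string. The last pattern uses /m and /s together, which suggests why the two are not opposites: one changes the anchors and the other changes the dot.

```perl
use strict;
use warnings;

# Hypothetical three-line sample string
my $text = "alpha\nbeta\ngamma\n";

# Without /m, ^ matches only at the very start of the string
my @without_m = $text =~ /^(\w+)/g;

# With /m, ^ also matches just after each embedded newline
my @with_m = $text =~ /^(\w+)/mg;

# Both together: /m lets ^ and $ anchor at the line boundaries,
# and /s lets the dot cross the newline between "beta" and "gamma"
my ($both) = $text =~ /^(beta.*gamma)$/ms;

print "without /m: @without_m\n";   # alpha
print "with /m:    @with_m\n";      # alpha beta gamma
```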



Don't confuse the /g modifier with being "greedy". It is for global matches, which find all occurrences of the pattern within the string. The term greedy is usually used for the behaviour of quantifiers within the pattern. For instance, .* is said to be greedy because it will match as many characters as possible, as opposed to .*? which will match as few characters as possible.
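
To separate the two ideas, here is a small sketch on a single line of made-up text. The /g modifier controls how many matches are returned, while the choice between .* and .*? controls how long each match is.

```perl
use strict;
use warnings;

# Hypothetical one-line sample with two quoted words
my $line = '"one" and "two"';

# Greedy quantifier with /g: one long match from the first quote to the last
my @greedy = $line =~ /".*"/g;

# Non-greedy quantifier with /g: each quoted word is matched separately
my @non_greedy = $line =~ /".*?"/g;

# Non-greedy quantifier without /g: only the first quoted word is found
my ($first_only) = $line =~ /(".*?")/;

print scalar(@greedy), " greedy match\n";            # 1
print scalar(@non_greedy), " non-greedy matches\n";  # 2
```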






Update




In your modified question you are using /".*"/mg, in which the /m is irrelevant because, as noted above, that modifier alters only the behaviour of the $ and ^ metacharacters, and there are none in your pattern.



Changing it to /".*"/sg improves things a little in that the . can now match the newline at the end of each line, and so the pattern can match multi-line strings. (Note that it is the object string that is considered to be "single line" here, i.e. the match behaves just as if there were no newlines in it as far as . is concerned.) However, here you see the conventional meaning of greedy, because the pattern now matches everything from the first double-quote in the first line to the last double-quote at the end of the last line. I assume that isn't what you want.
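
The difference between the two attempts can be sketched on a cut-down version of the data. The sample string here is invented and only stands in for the contents of the DATA section.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the file contents: two one-line strings
# followed by one string that is split across two lines
my $text = qq{"one"\n"two"\n"th\nree"\n};

# /m has no effect on the dot, so the string that spans lines is never matched
my @with_m = $text =~ /".*"/mg;

# /s lets the greedy .* run across newlines, so everything from the
# first double-quote to the last one comes back as a single match
my @with_s = $text =~ /".*"/sg;

print scalar(@with_m), " matches with /mg\n";   # 2
print scalar(@with_s), " match with /sg\n";     # 1
```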



There are a few ways to fix this. I recommend changing your pattern so that the string you want is a double-quote, followed by any sequence of characters except double-quotes, followed by another double-quote. This is written /"[^"]*"/g (note that the /s modifier is no longer necessary as there are now no dots in the pattern) and very nearly does what you want, except that the escaped double-quotes are seen as ending the string.



Take a look at this program and its output, noting that I have put a chevron >> at the start of each match so that they can be distinguished.



use strict;
use warnings;


my $file = do {
    local $/;   # undefine the input record separator to read all of DATA at once
    <DATA>;
};

my @strings = $file =~ /"[^"]*"/g;

print ">> $_\n\n" for @strings;


__DATA__
"This is string"
"1!=2"
"This is \"string\""
"string1"."string2"
"String"
"S
t
r
i
n
g"


output



>> "This is string"

>> "1!=2"


>> "This is \"

>> ""

>> "string1"

>> "string2"

>> "String"


>> "S
t
r
i
n
g"


As you can see everything is now in order except that in "This is \"string\"" it has found two matches, "This is \", and "". Fixing that may be more complicated than you want to go but it's perfectly possible. Please say so if you need that fixed too.







Update



I may as well finish this off. To ignore escaped double-quotes and treat them as just part of the string, we need to accept either \" or any character except double-quote. That is done using the regex alternation operator | and must be grouped inside non-capturing parentheses (?: ... ). The end result is /"(?:\\"|[^"])*"/g (the backslash itself must be escaped, so it is doubled up). When put into the above program it produces this output, which I assume is what you wanted.



>> "This is string"

>> "1!=2"


>> "This is \"string\""

>> "string1"

>> "string2"

>> "String"

>> "S
t
r
i
n
g"
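
For anyone who wants to try the final pattern without a DATA section, here is a minimal sketch that runs it against an in-memory string. The sample text is a shortened, hypothetical version of the data above, held in a single-quoted heredoc so that the backslashes stay literal.

```perl
use strict;
use warnings;

# Hypothetical shortened sample; <<'END' keeps \" exactly as typed
my $text = <<'END';
"This is \"string\""
"string1"."string2"
"S
t"
END

# Either an escaped double-quote or any character other than a double-quote,
# repeated, between an opening and a closing double-quote
my @strings = $text =~ /"(?:\\"|[^"])*"/g;

print ">> $_\n\n" for @strings;
```

With this sample the pattern should return four matches: the string containing the escaped quotes as a whole, the two concatenated strings separately, and the string that spans two lines.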
