Friday 10 November 2017

Non-greedy (reluctant) regex matching in sed?


I'm trying to use sed to clean up lines of URLs to extract just the domain.



So from:



http://www.suepearson.co.uk/product/174/71/3816/


I want:



http://www.suepearson.co.uk/


(either with or without the trailing slash, it doesn't matter)



I have tried:



sed 's|\(http:\/\/.*?\/\).*|\1|'


and (escaping the non-greedy quantifier)



sed 's|\(http:\/\/.*\?\/\).*|\1|'


but I cannot seem to get the non-greedy quantifier (?) to work, so it always ends up matching the whole string.


Answer



Neither basic nor extended POSIX/GNU regular expressions recognize the non-greedy quantifier; you need a more recent regex flavor. Fortunately, a Perl regex for this case is easy to write:



perl -pe 's|(http://.*?/).*|$1|'
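
If Perl is not an option, a common workaround is to stay in sed and swap the non-greedy `.*?` for a negated character class. This is not part of the answer above, so treat it as a minimal sketch. It assumes the URLs start with `http://` as in the question, and the function name `extract_domain` is purely illustrative.

```shell
# Sketch: emulate non-greedy matching in sed with a negated character class.
# [^/]* matches any run of characters that are not a slash, so the match
# stops at the first slash after the host name.
# extract_domain is a hypothetical helper name, not something from the question.
extract_domain() {
  sed 's|\(http://[^/]*/\).*|\1|'
}

printf '%s\n' 'http://www.suepearson.co.uk/product/174/71/3816/' | extract_domain
```

Because `|` is the delimiter, the slashes inside the pattern need no escaping.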

