Sunday, 8 October 2017

linux - Regex parsing issue of multi-line file, replacing two consistent patterns around arbitrary persistent text

I've crafted a series of regex statements using sed in bash to parse HTML. I'm aware this isn't recommended, but to be honest, this is a temporary fix and I'm not looking to do anything (too) complicated.




Anytime this pattern is matched:

<div class="section1-header">
<div class="section1-number">GROUP 1 ARBITRARY CONTENT</div>
<div class="section1-title">GROUP 2 ARBITRARY CONTENT</div>
</div>



It should be replaced with:

<h2>GROUP 1 ARBITRARY CONTENT - GROUP 2 ARBITRARY CONTENT</h2>





And this is repeated for section[1-3]-header, with h[2-4] tags.



sed -Ei 's/[<]div class=\"section1-header\"[>][<]div class=\"section1-number\"[>](.*?)[<]\/div[>][<]div class=\"section1-title\"[>](.*?)[<]\/div[>][<]\/div[>]/<h2>\1 - \2<\/h2>/g' ${1}
sed -Ei 's/[<]div class=\"section2-header\"[>][<]div class=\"section2-number\"[>](.*?)[<]\/div[>][<]div class=\"section2-title\"[>](.*?)[<]\/div[>][<]\/div[>]/<h3>\1 - \2<\/h3>/g' ${1}
sed -Ei 's/[<]div class=\"section3-header\"[>][<]div class=\"section3-number\"[>](.*?)[<]\/div[>][<]div class=\"section3-title\"[>](.*?)[<]\/div[>][<]\/div[>]/<h4>\1 - \2<\/h4>/g' ${1}



When I test my regex on various online sites, every instance I need to hit is matched correctly, with no additional content grabbed. When I actually execute it, it seems to grab more than necessary at random (even though the regex tester matched the correct sequence of characters, lazy-style).
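One likely reason for the mismatch is that most online testers use PCRE-style engines, where .*? is a lazy quantifier, while sed -E uses POSIX extended regular expressions, which have no lazy quantifiers. The sketch below illustrates the difference on a made-up one-line sample (the class name "t" and the bracket markers are placeholders, not taken from the real file). With GNU sed, the first command is expected to capture everything up to the last closing tag on the line, whereas the negated character class stops at the nearest one.

```shell
# Hypothetical one-line sample: two sibling divs on the same line.
html='<div class="t">A</div><div class="t">B</div>'

# ".*?" is not lazy in POSIX ERE; with GNU sed the group runs on to the last </div>.
greedy=$(printf '%s\n' "$html" | sed -E 's#<div class="t">(.*?)</div>#[\1]#')

# "[^<]*" cannot cross a "<", so each match ends at the nearest closing tag.
bounded=$(printf '%s\n' "$html" | sed -E 's#<div class="t">([^<]*)</div>#[\1]#g')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

The negated class only works when the captured text contains no nested tags, which is not true of the titles in the sample content.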




Before, with sample content:



            class="section1-title">Archive Get Command (            class="id">archive-get)
class="section-intro">WAL segments are required for restoring a class="postgres">PostgreSQL cluster or maintaining a
replica.
id="command-archive-get/category-command"> class="section2-header">
2.1
class="section2-title">Command Options
class="section-body">
id="command-archive-get/category-command/option-archive-async"> class="section3-header">
2.1.1
class="section3-title">Asynchronous Archiving Option ( class="id">--archive-async)



After, with sample content:



2 -
Archive Get Command ( class="id">archive-get)

class="section-intro">WAL segments are required for restoring a class="postgres">PostgreSQL cluster or maintaining a
replica.
id="command-archive-get/category-command"> class="section2-header">
2.1
class="section2-title">Command Options
class="section-body">
id="command-archive-get/category-command/option-archive-async"> class="section3-header">
2.1.1
class="section3-title">Asynchronous Archiving Option ( class="id">--archive-async)



If you look carefully, you'll notice <h2> is substituted correctly before 2 - Archive Get Command, but the closing </div></div> is not correctly substituted with </h2>; instead, the </h2> is thrown in after Asynchronous Archiving Option (--archive-async).



At this point I'm thinking this might be some kind of multi-line processing issue with sed, but am stuck in the troubleshooting stage and am unsure where to go from here.
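As a starting point for troubleshooting, here is a minimal sketch that avoids lazy quantifiers altogether. It assumes the markup has the shape implied by the sed commands in this question, with each header on a single line; convert_headers is a hypothetical helper name. The number group uses [^<]*, and the title group accepts any text, including nested inline tags such as <span>, until it reaches something that starts with "</d", so it cannot run past the title's own closing </div>.

```shell
# Hypothetical helper: reads HTML on stdin, writes converted HTML to stdout.
# The body runs in a subshell so its variables do not leak into the caller.
convert_headers() (
  script=''
  for level in 1 2 3; do
    h=$((level + 1))   # section1 -> h2, section2 -> h3, section3 -> h4
    # \1 = section number (no tags allowed), \2 = title (anything except "</d...").
    script="${script}s#<div class=\"section${level}-header\"><div class=\"section${level}-number\">([^<]*)</div><div class=\"section${level}-title\">(([^<]|<[^/]|</[^d])*)</div></div>#<h${h}>\\1 - \\2</h${h}>#g
"
  done
  sed -E "$script"
)

# Usage on an assumed sample line reconstructed from the question:
printf '%s\n' '<div class="section1-header"><div class="section1-number">2</div><div class="section1-title">Archive Get Command (<span class="id">archive-get</span>)</div></div>' | convert_headers
```

To edit a file in place as in the original commands, the same script string could be passed to sed -Ei instead. If a header really is split across several lines in the input, which cannot be confirmed from the sample, GNU sed's -z option reads the whole file as one record, and the pattern would then also need to allow whitespace between the tags.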
