unicode - Using awk to remove the Byte-order mark

Thursday, 14 December 2017


What would an awk script (presumably a one-liner) for removing a BOM (https://en.wikipedia.org/wiki/Byte_order_mark) look like?

Specification:

Print every line after the first (NR > 1).

For the first line: if it starts with #FE #FF or #FF #FE, remove those bytes and print the rest.

Answer

Try this:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE

On the first record (line), remove the BOM characters. Print every record.

Or slightly shorter, relying on the fact that the default action in awk is to print the record:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE

1 is the shortest condition that always evaluates to true, so each record is printed.
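To check that the one-liner does what it claims, here is a minimal sketch that builds a small test file and compares the leading bytes before and after. The file names bom_sample.txt and bom_stripped.txt are made up for illustration. The BOM is written with octal escapes (\357\273\277) rather than \xef\xbb\xbf, on the assumption that your awk may not accept \x escapes, which POSIX awk does not specify. LC_ALL=C is also an assumption, used so that awk works on raw bytes.

```shell
# Build a two-line sample file whose first three bytes are the UTF-8 BOM
# (EF BB BF, written in octal as 357 273 277).
printf '\357\273\277first line\nsecond line\n' > bom_sample.txt

# Same logic as the one-liner above, with the BOM bytes in octal.
# LC_ALL=C is an assumption: it asks awk to treat the input as raw bytes.
LC_ALL=C awk 'NR==1{sub(/^\357\273\277/,"")}1' bom_sample.txt > bom_stripped.txt

# Dump the first three bytes of each file as hex to compare them.
od -An -tx1 -N3 bom_sample.txt
od -An -tx1 -N3 bom_stripped.txt
```

The first dump should show the bytes ef bb bf, and the second should show the first three bytes of the text instead.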

Enjoy!

-- ADDENDUM --

The Unicode Byte Order Mark (BOM) FAQ (http://unicode.org/faq/utf_bom.html#BOM) includes the following table listing the exact BOM bytes for each encoding:

Bytes       | Encoding Form
--------------------------------------
00 00 FE FF | UTF-32, big-endian
FF FE 00 00 | UTF-32, little-endian
FE FF       | UTF-16, big-endian
FF FE       | UTF-16, little-endian
EF BB BF    | UTF-8

Thus, you can see how \xef\xbb\xbf corresponds to the EF BB BF UTF-8 BOM bytes in the table above.
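Before stripping anything, it can help to know which BOM, if any, a file actually starts with. The following is a rough sketch, not part of the original answer: detect_bom is a hypothetical helper name, and it simply compares the first four bytes of a file against the table above using od.

```shell
# detect_bom FILE: print the encoding form whose BOM (per the table above)
# matches the first bytes of FILE, or "no BOM" if none matches.
# detect_bom is a hypothetical helper, written for illustration only.
detect_bom() {
    # First four bytes as lowercase hex with separators removed, e.g. "efbbbf66".
    head_hex=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case "$head_hex" in
        0000feff) echo "UTF-32, big-endian" ;;
        fffe0000) echo "UTF-32, little-endian" ;;  # checked before UTF-16 LE, which shares FF FE
        feff*)    echo "UTF-16, big-endian" ;;
        fffe*)    echo "UTF-16, little-endian" ;;
        efbbbf*)  echo "UTF-8" ;;
        *)        echo "no BOM" ;;
    esac
}
```

Usage would look like `detect_bom INFILE`. Note that a UTF-16 little-endian file whose first character is U+0000 is indistinguishable from UTF-32 little-endian by this check, so treat the result as a hint.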
