unicode - Using awk to remove the Byte-order mark
How would an awk script (presumably a one-liner) for removing a BOM (https://en.wikipedia.org/wiki/Byte_order_mark) look?
Specification:
- print every line after the first (NR > 1)
- for the first line: if it starts with #FE #FF or #FF #FE, remove those and print the rest
Try this:
    awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE
On the first record (line), remove the UTF-8 BOM bytes. Print every record.
Or slightly shorter, using the knowledge that the default action in awk is to print the record:
    awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE
1 is the shortest condition that always evaluates to true, so each record is printed.
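To check the one-liner on a small input, a sketch along these lines may help. The function name strip_utf8_bom is hypothetical, and two choices here are assumptions rather than part of the answer: the BOM is written with octal escapes (\357\273\277, the same three bytes as \xef\xbb\xbf), because POSIX awk specifies octal escapes while \x hex escapes are an extension that not every awk is guaranteed to support, and LC_ALL=C is set so that awk compares raw bytes.

```shell
# Hypothetical wrapper around the answer's one-liner.
# Assumption: octal escapes replace \xef\xbb\xbf (same bytes, EF BB BF).
# Assumption: LC_ALL=C forces byte-wise matching regardless of the user's locale.
strip_utf8_bom() {
  LC_ALL=C awk 'NR==1{sub(/^\357\273\277/,"")}1' "$@"
}

# Sample input that starts with the UTF-8 BOM, followed by two lines of text.
printf '\357\273\277first line\nsecond line\n' | strip_utf8_bom
# prints:
# first line
# second line
```

Called with file names (strip_utf8_bom INFILE > OUTFILE), the function behaves like the commands above; with no arguments it reads standard input. Input without a BOM passes through unchanged.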
Enjoy!
-- ADDENDUM --
The Unicode Byte Order Mark (BOM) FAQ (http://unicode.org/faq/utf_bom.html#BOM) includes the following table listing the exact BOM bytes for each encoding:
Bytes       | Encoding Form
--------------------------------------
00 00 FE FF | UTF-32, big-endian
FF FE 00 00 | UTF-32, little-endian
FE FF       | UTF-16, big-endian
FF FE       | UTF-16, little-endian
EF BB BF    | UTF-8
Thus, you can see how \xef\xbb\xbf corresponds to the EF BB BF UTF-8 BOM bytes from the above table.
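Before stripping anything, it can be useful to see which row of the table a file actually matches. The following is only a sketch: detect_bom is a hypothetical helper name, and it assumes a head that accepts -c (GNU, BSD and BusyBox versions do). It reads the first four bytes as hex and compares them against the table, testing the UTF-32 patterns first because FF FE 00 00 also begins with the UTF-16 little-endian BOM.

```shell
# Hypothetical helper: report which BOM (if any) a file starts with.
detect_bom() {
  # First four bytes as lowercase hex without separators, e.g. "efbbbf41".
  first=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \t\n')
  case "$first" in
    0000feff) echo "UTF-32, big-endian" ;;
    # Ambiguous by nature: UTF-16 LE text whose first character is U+0000
    # has the same leading bytes.
    fffe0000) echo "UTF-32, little-endian" ;;
    efbbbf*)  echo "UTF-8" ;;
    feff*)    echo "UTF-16, big-endian" ;;
    fffe*)    echo "UTF-16, little-endian" ;;
    *)        echo "no BOM" ;;
  esac
}
```

For example, detect_bom INFILE printing "UTF-8" suggests the awk commands above apply as written, while a UTF-16 or UTF-32 result means the file contains different BOM bytes than the ones those commands remove.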