Thursday 14 December 2017

unicode - Using awk to remove the Byte-order mark


What would an awk script (presumably a one-liner) for removing a BOM (https://en.wikipedia.org/wiki/Byte_order_mark) look like?



Specification:




  • print every line after the first (NR > 1)


  • for the first line: if it starts with #FE #FF or #FF #FE, remove those and print the rest


Answer



Try this:



awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE


On the first record (line), remove the BOM bytes if they are present. Then print every record.




Or slightly shorter, using the knowledge that the default action in awk is to print the record:



awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE


1 is the shortest condition that always evaluates to true, so each record is printed.
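The `\xhh` hex escapes used above are accepted by gawk but, as far as I know, are not part of POSIX awk, so other implementations may not understand them. Below is a minimal sketch of the same idea with octal escapes (`\357\273\277` is `EF BB BF`). The function name `strip_utf8_bom` is hypothetical and not from the original answer, and `LC_ALL=C` is an assumption made so that awk matches raw bytes rather than multibyte characters.

```shell
# strip_utf8_bom: hypothetical helper, reads stdin and writes stdout.
# Assumption: LC_ALL=C makes awk treat the input as raw bytes.
strip_utf8_bom() {
    # \357\273\277 is the octal spelling of the UTF-8 BOM bytes EF BB BF.
    # The trailing 1 is the always-true pattern that prints every record.
    LC_ALL=C awk 'NR==1{sub(/^\357\273\277/,"")}1'
}
```

It would be used as `strip_utf8_bom < INFILE > OUTFILE`. Input without a BOM should pass through unchanged, because `sub` does nothing when the pattern does not match.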



Enjoy!



-- ADDENDUM --




The Unicode Byte Order Mark (BOM) FAQ (http://unicode.org/faq/utf_bom.html#BOM) includes the following table listing the exact BOM bytes for each encoding:



Bytes       | Encoding Form
--------------------------------------
00 00 FE FF | UTF-32, big-endian
FF FE 00 00 | UTF-32, little-endian
FE FF       | UTF-16, big-endian
FF FE       | UTF-16, little-endian
EF BB BF    | UTF-8
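To see which row of this table applies to a given file, the first four bytes can be dumped as hex and compared against each entry. This is only a sketch: `detect_bom` and `head4` are hypothetical names, and it assumes the standard `od` and `tr` utilities are available.

```shell
# detect_bom: hypothetical helper, prints the encoding implied by a file's BOM.
detect_bom() {
    # First four bytes as lowercase hex, with all whitespace removed.
    head4=$(od -An -tx1 -N4 "$1" | tr -d '[:space:]')
    case $head4 in
        0000feff) echo "UTF-32, big-endian" ;;
        fffe0000) echo "UTF-32, little-endian" ;;  # checked before the shorter fffe prefix
        feff*)    echo "UTF-16, big-endian" ;;
        fffe*)    echo "UTF-16, little-endian" ;;
        efbbbf*)  echo "UTF-8" ;;
        *)        echo "no BOM" ;;
    esac
}
```

The four-byte patterns are tested first because `FF FE 00 00` begins with the same two bytes as the UTF-16 little-endian mark. That sequence is ambiguous in principle (it could also be UTF-16 text that starts with a NUL character), so the result is a hint rather than a guarantee.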



Thus, you can see how \xef\xbb\xbf corresponds to the EF BB BF UTF-8 BOM bytes from the above table.
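The question's specification names FE FF and FF FE, which the table identifies as the UTF-16 marks rather than the UTF-8 one. A sketch that strips any of the three shorter marks from the first record might look like the following. The name `strip_bom_any` is hypothetical, octal escapes are used instead of `\xhh`, and `LC_ALL=C` is assumed so that awk works on raw bytes. Real UTF-16 text normally contains NUL bytes, which not every awk handles, so this only removes the leading bytes and does not convert the file.

```shell
# strip_bom_any: hypothetical helper, reads stdin and writes stdout.
# Removes a leading UTF-8 (EF BB BF) or UTF-16 (FE FF / FF FE) mark.
strip_bom_any() {
    # Octal: \357\273\277 = EF BB BF, \376\377 = FE FF, \377\376 = FF FE.
    LC_ALL=C awk 'NR==1{sub(/^(\357\273\277|\376\377|\377\376)/,"")}1'
}
```

As with the original one-liner, it would be run as `strip_bom_any < INFILE > OUTFILE`.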

