I wonder why most modern solutions built using Perl don't enable UTF-8 by default.
I understand there are many legacy problems for core Perl scripts, where it may break things. But, from my point of view, in the 21st century, big new projects (or projects with big ambitions) should make their software UTF-8 proof from scratch. Still I don't see it happening. For example, Moose enables strict and warnings, but not Unicode. Modern::Perl (http://search.cpan.org/~chromatic/Modern-Perl-1.03/lib/Modern/Perl.pm) reduces boilerplate too, but offers no UTF-8 handling.
Why? Are there some reasons
to avoid UTF-8 in modern Perl projects in the year
2011?
My reply to @tchrist got too long for a comment, so I'm adding it here.
It seems that I did not make myself clear.
Let me try to add some
things.
tchrist and I see the situation pretty similarly, but our conclusions are at completely opposite ends. I agree, the situation with Unicode is complicated, but this is exactly why we (Perl users and coders) need some layer (or pragma) which makes UTF-8 handling as easy as it must be nowadays.
tchrist pointed to many aspects to cover; I will read and think about them for days or even weeks. Still, this is not my point. tchrist tries to prove that there is no single way "to enable UTF-8". I don't have enough knowledge to argue with that. So, I stick to live examples.
I played around with Rakudo, and UTF-8 was just there as I needed it. I didn't have any problems; it just worked. Maybe there are some limitations somewhere deeper, but at the start, everything I tested worked as I expected.
Shouldn't that be a goal in modern Perl 5 too? Let me stress it more: I'm not suggesting UTF-8 as the default character set for core Perl; I'm suggesting the possibility to enable it in a snap for those who develop new projects.
Another example, but with a more negative tone. Frameworks should make development easier. Some years ago, I tried web frameworks, but just threw them away because "enabling UTF-8" was so obscure. I did not find how and where to hook in Unicode support. It was so time-consuming that I found it easier to go the old way. Now I saw here that there was a bounty to deal with the same problem with Mason 2: How to make Mason2 UTF-8 clean? (https://stackoverflow.com/questions/5858596/how-to-make-mason2-utf8-clean). So, it is a pretty new framework, but using it with UTF-8 needs deep knowledge of its internals. It is like a big red sign: STOP, don't use me!
I really like Perl. But dealing with Unicode is painful. I still find myself running against walls. In some ways tchrist is right and answers my questions: new projects don't attract UTF-8 because it is too complicated in Perl 5.
Set your PERL_UNICODE envariable to AS. This makes all Perl scripts decode @ARGV as UTF-8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTF-8. Both these are global effects, not lexical ones.
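From the shell, here is a quick way to see the effect (a sketch: it assumes a UTF-8 terminal, and uses the -C command-line switch, which takes the same letters as PERL_UNICODE):

perl -E 'say length shift' é          # 2: the é arrives as two undecoded bytes
perl -CAS -E 'say length shift' é     # 1: @ARGV has been decoded as UTF-8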
At the top of your source file (program, module, library, or do hickey), prominently assert that you are running perl version 5.12 or better via:

use v5.12;  # minimal for unicode string feature
use v5.14;  # optimal for unicode string feature
Enable warnings, since the previous declaration only enables strictures and features, not warnings. I also suggest promoting Unicode warnings into exceptions, so use both these lines, not just one of them. Note however that under v5.14, the utf8 warning class comprises three other subwarnings which can all be separately enabled: nonchar, surrogate, and non_unicode. These you may wish to exert greater control over.

use warnings;
use warnings qw( FATAL utf8 );
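To see the kind of mistake this catches, here is a minimal sketch (it assumes no -C switches or PERL_UNICODE are in effect, so STDOUT has no encoding layer):

use warnings qw( FATAL utf8 );
# printing a code point above 255 to a handle that has no :encoding
# layer emits "Wide character in print" - fatal under FATAL utf8
print "\x{263A}\n";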
Declare that this source unit is encoded as UTF-8. Although once upon a time this pragma did other things, it now serves this one singular purpose alone and no other:

use utf8;
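A small illustration of what that buys you (a sketch; the string is just an example):

use utf8;
# because the source is decoded as UTF-8, this literal is five
# characters, not the six bytes it occupies on disk
my $word = "naïve";
print length($word), "\n";   # 5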
Declare that anything that opens a filehandle within this lexical scope but not elsewhere is to assume that that stream is encoded in UTF-8 unless you tell it otherwise. That way you do not affect other modules' or other programs' code.

use open qw( :encoding(UTF-8) :std );
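For example, a sketch of reading a file under that pragma ("example.txt" is a hypothetical file):

use open qw( :encoding(UTF-8) :std );
# every open() in this lexical scope now applies the UTF-8 layer
open my $fh, '<', 'example.txt' or die "open: $!";
while (my $line = <$fh>) {
    print $line;   # :std means STDOUT is UTF-8 too, so this is safe
}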
Enable named characters via \N{CHARNAME}.

use charnames qw( :full :short );
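A brief sketch of the notation (printing assumes STDOUT already has a UTF-8 layer, for example from the use open above):

use charnames qw( :full :short );
print "\N{GREEK SMALL LETTER ALPHA}\n";          # α
print "\N{LATIN SMALL LETTER E WITH ACUTE}\n";   # é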
If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF-8, then say:

binmode(DATA, ":encoding(UTF-8)");
There is of course no end of other matters with which you may eventually find yourself concerned, but these will suffice to approximate the stated goal to “make everything just work with UTF-8”, albeit for a somewhat weakened sense of those terms.
One other pragma, although it is not Unicode related, is:

use autodie;

It is strongly recommended.
Go Thou and Do Likewise: Boilerplate for Unicode-Aware Code
My own boilerplate these days tends to look like this:

use 5.014;

use utf8;
use strict;
use autodie;
use warnings;
use warnings  qw< FATAL utf8 >;
use open      qw< :std :utf8 >;
use charnames qw< :full >;
use feature   qw< unicode_strings >;

use File::Basename qw< basename >;
use Carp qw< carp croak confess cluck >;
use Encode qw< encode decode >;
use Unicode::Normalize qw< NFD NFC >;

END { close STDOUT }

if (grep /\P{ASCII}/ => @ARGV) {
   @ARGV = map { decode("UTF-8", $_) } @ARGV;
}

$0 = basename($0);  # shorter messages
$| = 1;

binmode(DATA, ":utf8");

# give a full stack dump on any untrapped exceptions
local $SIG{__DIE__} = sub {
    confess "Uncaught exception: @_" unless $^S;
};

# now promote run-time warnings into stack-dumped exceptions
# *unless* we're in a try block, in which case just cluck the
# stack dump instead
local $SIG{__WARN__} = sub {
    if ($^S) { cluck   "Trapped warning: @_" }
    else     { confess "Deadly warning: @_" }
};

while (<>) {
    chomp;
    $_ = NFD($_);
    ...
} continue {
    say NFC($_);
}

__END__
Saying that “Perl should
[somehow!] enable Unicode by default” doesn’t even start to
begin to think about getting around to saying enough to be even marginally useful in
some sort of rare and isolated case. Unicode is much much more than just a larger
character repertoire; it’s also how those characters all interact in many, many
ways.
Even the simple-minded minimal
measures that (some) people seem to think they want are guaranteed to miserably break
millions of lines of code, code that has no chance to “upgrade” to your spiffy new
Brave New World modernity.
It is way way way more complicated than people
pretend. I’ve thought about this a huge, whole lot over the past few years. I would love
to be shown that I am wrong. But I don’t think I am. Unicode is fundamentally more
complex than the model that you would like to impose on it, and there is complexity here
that you can never sweep under the carpet. If you try, you’ll break either your own code
or somebody else’s. At some point, you simply have to break down and learn what Unicode
is about. You cannot pretend it is something it is not.
🐪 goes out of its way to make Unicode easy, far more than anything else I’ve ever used. If you think this is bad, try something else for a while. Then come back to 🐪: either you will have returned to a better world, or else you will bring knowledge of the same with you so that we can make use of your new knowledge to make 🐪 better at these things.
💡 Ideas for a Unicode-Aware 🐪 Laundry List 💡
At a minimum, here are some things that would appear to be required for 🐪 to “enable Unicode by default”, as you put it:
All 🐪 source code should be in UTF-8 by default. You can get that with use utf8 or export PERL5OPTS=-Mutf8.
The 🐪 DATA handle should be UTF-8. You will have to do this on a per-package basis, as in binmode(DATA, ":encoding(UTF-8)").
Program arguments to 🐪 scripts should be understood to be UTF-8 by default. export PERL_UNICODE=A, or perl -CA, or export PERL5OPTS=-CA.
The standard input, output, and error streams should default to UTF-8. export PERL_UNICODE=S for all of them, or I, O, and/or E for just some of them. This is like perl -CS.
Any other handles opened by 🐪 should be considered UTF-8 unless declared otherwise; export PERL_UNICODE=D, or i and o for particular ones of these; export PERL5OPTS=-CD would work. That makes -CSAD for all of them.
Cover both bases plus all the streams you open with export PERL5OPTS=-Mopen=:utf8,:std. See uniquote (http://training.perl.com/scripts/uniquote).
You don’t want to miss UTF-8 encoding errors. Try export PERL5OPTS=-Mwarnings=FATAL,utf8. And make sure your input streams are always binmoded to :encoding(UTF-8), not just to :utf8.
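That last distinction matters; a minimal sketch ("input.txt" is a hypothetical file):

# :utf8 merely flags the data as UTF-8 without validating it;
# :encoding(UTF-8) actually checks the bytes and warns (or dies,
# under FATAL utf8) on malformed input
open my $lax,    '<:utf8',            'input.txt' or die $!;
open my $strict, '<:encoding(UTF-8)', 'input.txt' or die $!;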
Code points between 128 and 255 should be understood by 🐪 to be the corresponding Unicode code points, not just unpropertied binary values. use feature "unicode_strings" or export PERL5OPTS=-Mfeature=unicode_strings. That will make uc("\xDF") eq "SS" and "\xE9" =~ /\w/. A simple export PERL5OPTS=-Mv5.12 or better will also get that.
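A sketch of the difference the feature makes:

{
    no feature 'unicode_strings';
    print uc("\xDF") eq "\xDF" ? "unchanged\n" : "changed\n";   # unchanged: byte semantics
}
{
    use feature 'unicode_strings';
    print uc("\xDF") eq "SS" ? "SS\n" : "something else\n";     # SS: the real Unicode casemap
}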
Named Unicode characters are not enabled by default, so add export PERL5OPTS=-Mcharnames=:full,:short,latin,greek or some such. See uninames (http://training.perl.com/scripts/uninames) and tcgrep (http://training.perl.com/scripts/tcgrep).
You almost always need access to the functions from the standard Unicode::Normalize module for various types of decompositions. export PERL5OPTS=-MUnicode::Normalize=NFD,NFKD,NFC,NFKC, and then always run incoming stuff through NFD and outbound stuff through NFC. There’s no I/O layer for these yet that I’m aware of, but see the nfc, nfd, nfkd, and nfkc scripts at http://training.perl.com/scripts/.
String comparisons in 🐪 using eq, ne, lc, cmp, sort, &c&cc are always wrong. So instead of @a = sort @b, you need @a = Unicode::Collate->new->sort(@b). Might as well add that to your export PERL5OPTS=-MUnicode::Collate. You can cache the key for binary comparisons.
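For instance, a sketch with a hypothetical word list:

use utf8;
use Unicode::Collate;
binmode STDOUT, ":encoding(UTF-8)";
my @words = ( "déjà", "data", "dzeta" );
print "naive: @{[ sort @words ]}\n";                          # data dzeta déjà - by raw code point
print "UCA:   @{[ Unicode::Collate->new->sort(@words) ]}\n";  # data déjà dzeta - as a human expects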
🐪 built-ins like printf and write do the wrong thing with Unicode data. You need to use the Unicode::GCString module for the former, and both that and also the Unicode::LineBreak module for the latter. See uwc and unifmt.
If you want them to count as integers, then you are going to have to run your \d+ captures through the Unicode::UCD::num function, because 🐪’s built-in atoi(3) isn’t currently clever enough.
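A sketch with Devanagari digits, which \d+ happily matches:

use Unicode::UCD qw( num );
my $deva = "\x{967}\x{968}";                  # DEVANAGARI DIGITS ONE and TWO
print "matches \\d+\n" if $deva =~ /^\d+$/;   # it does
print num($deva), "\n";                       # 12 - something atoi(3) can't do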
You are going to have filesystem issues on 👽 filesystems. Some filesystems silently enforce a conversion to NFC; others silently enforce a conversion to NFD. And others do something else still. Some even ignore the matter altogether, which leads to even greater problems. So you have to do your own NFC/NFD handling to keep sane.
All your 🐪 code involving a-z or A-Z and such MUST BE CHANGED, including m//, s///, and tr///. It should stand out as a screaming red flag that your code is broken. But it is not clear how it must change. Getting the right properties, and understanding their casefolds, is harder than you might think. I use unichars and uniprops (http://training.perl.com/scripts/uniprops) every single day.
Code that uses \p{Lu} is almost as wrong as code that uses [A-Za-z]. You need to use \p{Upper} instead, and know the reason why. Yes, \p{Lowercase} and \p{Lower} are different from \p{Ll} and \p{Lowercase_Letter}.
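A sketch of that distinction, using a character that comes up again further below:

my $sup_a = "\x{1D43}";   # MODIFIER LETTER SMALL A (superscript a)
print "Lowercase\n"        if $sup_a =~ /\p{Lowercase}/;          # matches
print "Lowercase_Letter\n" if $sup_a =~ /\p{Lowercase_Letter}/;   # does not match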
Code that uses [a-zA-Z] is even worse. And it can’t use \pL or \p{Letter}; it needs to use \p{Alphabetic}. Not all alphabetics are letters, you know!
If you are looking for 🐪 variables with /[\$\@\%]\w+/, then you have a problem. You need to look for /[\$\@\%]\p{IDS}\p{IDC}*/, and even that isn’t thinking about the punctuation variables or package variables.
If you are checking for whitespace, then you should choose between \h and \v, depending. And you should never use \s, since it DOES NOT MEAN [\h\v], contrary to popular belief.
If you are using \n for a line boundary, or even \r\n, then you are doing it wrong. You have to use \R, which is not the same!
If you don’t know when and whether to call Unicode::Stringprep, then you had better learn.
Case-insensitive comparisons need to check for whether two things are the same letters no matter their diacritics and such. The easiest way to do that is with the standard Unicode::Collate module: Unicode::Collate->new(level => 1)->cmp($a, $b). There are also eq methods and such, and you should probably learn about the match and substr methods, too. These have distinct advantages over the 🐪 built-ins.
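For example, a primary-strength (level 1) comparison ignores both case and diacritics:

use utf8;
use Unicode::Collate;
my $coll = Unicode::Collate->new(level => 1);
print "same letters\n" if $coll->eq("résumé", "RESUME");   # prints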
Sometimes that’s still not enough, and you need the Unicode::Collate::Locale module instead, as in Unicode::Collate::Locale->new(locale => "de__phonebook", level => 1)->cmp($a, $b). Consider that Unicode::Collate::->new(level => 1)->eq("d", "ð") is true, but Unicode::Collate::Locale->new(locale => "is", level => 1)->eq("d", "ð") is false. Similarly, "ae" and "æ" are eq if you don’t use locales, or if you use the English one, but they are different in the Icelandic locale. Now what? It’s tough, I tell you. You can play with ucsort (http://training.perl.com/scripts/ucsort) to test some of these things out.
Consider how to match the pattern CVCV (consonant, vowel, consonant, vowel) in the string “niño”. Its NFD form — which you had darned well better have remembered to put it in — becomes “nin\x{303}o”. Now what are you going to do? Even pretending that a vowel is [aeiou] (which is wrong, by the way), you won’t be able to do something like (?=[aeiou])\X either, because even in NFD a code point like ‘ø’ does not decompose! However, it will test equal to an ‘o’ using the UCA comparison I just showed you. You can’t rely on NFD; you have to rely on UCA.
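A sketch of that difficulty with a toy consonant/vowel class (real vowel-hood is, as noted, far messier):

use utf8;
use Unicode::Normalize qw( NFD );
my $word = NFD("niño");    # "nin\x{303}o" after decomposition

# per-code-point matching fails: the combining tilde is in the way
print "code points: no match\n"
    unless $word =~ /^[^aeiou][aeiou][^aeiou][aeiou]$/;

# matching per grapheme cluster with \X handles this case
print "graphemes: match\n"
    if $word =~ /^(?:(?=[^aeiou])\X(?=[aeiou])\X){2}$/;

It still does nothing for ‘ø’, which is exactly why the UCA comparison is the real answer.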
Assume Brokenness
And that’s not all. There are a million broken assumptions that people make about Unicode. Until they understand these things, their 🐪 code will be broken.
Code
that assumes it can open a text file without specifying the encoding is
broken.
Code that assumes the default
encoding is some sort of native platform encoding is
broken.
Code that assumes that web
pages in Japanese or Chinese take up less space in UTF‑16 than in UTF‑8 is
wrong.
Code that assumes Perl uses
UTF‑8 internally is wrong.
Code that
assumes that encoding errors will always raise an exception is
wrong.
Code that assumes Perl code
points are limited to 0x10_FFFF is
wrong.
Code that assumes you can set
$/
to something that will work with any valid line separator is
wrong.
Code that assumes roundtrip equality on casefolding, like lc(uc($s)) eq $s or uc(lc($s)) eq $s, is completely broken and wrong. Consider that uc("σ") and uc("ς") are both "Σ", but lc("Σ") cannot possibly return both of those.
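A quick check of that example (assumes a UTF-8 capable STDOUT):

use utf8;
binmode STDOUT, ":encoding(UTF-8)";
print uc("σ"), " ", uc("ς"), " ", lc("Σ"), "\n";   # Σ Σ σ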
Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example, "ª" is a lowercase letter with no uppercase; whereas both "ᵃ" and "ᴬ" are letters, but they are not lowercase letters; however, they are both lowercase code points without corresponding uppercase versions. Got that? They are not \p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.
Code
that assumes changing the case doesn’t change the length of the string is
broken.
Code that assumes
there are only two cases is broken. There’s also
titlecase.
Code that assumes only
letters have case is broken. Beyond just letters, it turns out that numbers, symbols,
and even marks have case. In fact, changing the case can even make something change its
main general category, like a \p{Mark}
turning into a
\p{Letter}
. It can also make it switch from one script to
another.
Code that assumes that case
is never locale-dependent is
broken.
Code that assumes Unicode
gives a fig about POSIX locales is
broken.
Code that assumes you can
remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged,
wrong, and justification for capital
punishment.
Code that assumes that
diacritics \p{Diacritic}
and marks
\p{Mark}
are the same thing is
broken.
Code that assumes
\p{GC=Dash_Punctuation}
covers as much as
\p{Dash}
is
broken.
Code that assumes dashes, hyphens, and minuses are the same thing as each other, or that there is only one of each, is broken and wrong.
Code that
assumes every code point takes up no more than one print column is
broken.
Code that assumes that all
\p{Mark}
characters take up zero print columns is
broken.
Code that assumes
that characters which look alike are alike is
broken.
Code that assumes that
characters which do not look alike are not
alike is broken.
Code that assumes
there is a limit to the number of code points in a row that just one
\X
can match is
wrong.
Code that assumes
\X
can never start with a \p{Mark}
character is wrong.
Code that assumes
that \X
can never hold two
non-\p{Mark}
characters is
wrong.
Code that assumes that it
cannot use "\x{FFFF}"
is
wrong.
Code that assumes a non-BMP code point that requires two UTF-16 (surrogate) code units will encode to two separate UTF-8 characters, one per code unit, is wrong. It doesn’t: it encodes to a single code point.
Code that transcodes from
UTF‐16 or UTF‐32 with leading BOMs into UTF‐8 is broken if it puts a BOM at the start of
the resulting UTF-8. This is so stupid the engineer should have their eyelids
removed.
Code that assumes the CESU-8
is a valid UTF encoding is wrong. Likewise, code that thinks encoding U+0000 as
"\xC0\x80"
is UTF-8 is broken and wrong. These guys also
deserve the eyelid treatment.
Code that assumes that characters like > always point to the right and < always point to the left is wrong — because they in fact do not.
Code that
assumes if you first output character X
and then character
Y
, that those will show up as XY
is
wrong. Sometimes they
don’t.
Code that assumes
that ASCII is good enough for writing English properly is stupid, shortsighted,
illiterate, broken, evil, and wrong. Off with their heads! If that seems
too extreme, we can compromise: henceforth they may type only with their big toe from
one foot. (The rest will be duct
taped.)
Code that assumes that all
\p{Math}
code points are visible characters is
wrong.
Code that assumes
\w
contains only letters, digits, and underscores is
wrong.
Code that assumes that
^
and ~
are punctuation marks is
wrong.
Code that assumes that ü has an umlaut is wrong.
Code that believes things like
₨
contain any letters in them is
wrong.
Code that believes \p{InLatin} is the same as \p{Latin} is heinously broken.

Code that believes that \p{InLatin} is almost ever useful is almost certainly wrong.
Code that believes that given $FIRST_LETTER as the first letter in some alphabet and $LAST_LETTER as the last letter in that same alphabet, [${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost always completely broken and wrong and meaningless.
Code that
believes someone’s name can only contain certain characters is stupid, offensive, and
wrong.
Code that tries to reduce
Unicode to ASCII is not merely wrong, its perpetrator should never be allowed to work in
programming again. Period. I’m not even positive they should even be allowed to see
again, since it obviously hasn’t done them much good so
far.
Code that believes there’s some
way to pretend textfile encodings don’t exist is broken and dangerous. Might as well
poke the other eye out, too.
Code that
converts unknown characters to ?
is broken, stupid, braindead,
and runs contrary to the standard recommendation, which says NOT TO DO
THAT! RTFM for why
not.
Code that believes it can
reliably guess the encoding of an unmarked textfile is guilty of a fatal mรฉlange of
hubris and naรฏvetรฉ that only a lightning bolt from Zeus will
fix.
Code that believes you can use 🐪 printf widths to pad and justify Unicode data is broken and wrong.
Code that believes once you
successfully create a file by a given name, that when you run
ls
or readdir
on its enclosing
directory, you’ll actually find that file with the name you created it under is buggy,
broken, and wrong. Stop being surprised by
this!
Code that believes UTF-16 is a
fixed-width encoding is stupid, broken, and wrong. Revoke their programming
licence.
Code that treats code points
from one plane one whit differently than those from any other plane is ipso
facto broken and wrong. Go back to
school.
Code that believes that stuff
like /s/i
can only match "S"
or
"s"
is broken and wrong. You’d be
surprised.
Code that uses
\PM\pM*
to find grapheme clusters instead of using
\X
is broken and
wrong.
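A sketch of one way they differ ("\r\n" is a single grapheme cluster):

my $crlf = "\r\n";
print "\\X: one cluster\n"   if     $crlf =~ /\A\X\z/;       # matches: \X eats the whole CRLF
print "\\PM\\pM*: split\n"   unless $crlf =~ /\A\PM\pM*\z/;  # fails: \PM\pM* stops after the CR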
People who want to go back to the ASCII world should be whole-heartedly encouraged to do so, and in honor of their glorious upgrade they should be provided gratis with a pre-electric manual typewriter for all their data-entry needs. Messages sent to them should be sent via an ᴀʟʟᴄᴀᴘs telegraph at 40 characters per line and hand-delivered by a courier. STOP.
I don’t know how much more “default Unicode in 🐪” you can get than what I’ve written. Well, yes I do: you should be using Unicode::Collate and Unicode::LineBreak, too. And probably more.
As you see, there are far too many Unicode
things that you really do have to worry about for there to
ever exist any such thing as “default to
Unicode”.
What you’re going to discover, just as we did back in 🐪 5.8, is that it is simply impossible to impose all these things on code that hasn’t been designed right from the beginning to account for them. Your well-meaning selfishness just broke the entire world.
And even once you do, there are still
critical issues that require a great deal of thought to get right. There is no switch
you can flip. Nothing but brain, and I mean real brain, will
suffice here. There’s a heck of a lot of stuff you have to learn. Modulo the retreat to
the manual typewriter, you simply cannot hope to sneak by in ignorance. This is the 21หขแต
century, and you cannot wish Unicode away by willful ignorance.
You have to learn it. Period. It will never be
so easy that “everything just works,” because that will guarantee that a lot of things
don’t work — which invalidates the assumption that there
can ever be a way to “make it all
work.”
You may be able to get a few
reasonable defaults for a very few and very limited operations, but not without thinking
about things a whole lot more than I think you
have.
As just one example, canonical ordering is going to cause some real headaches. "\x{F5}" ‘õ’, "o\x{303}" ‘õ’, "o\x{303}\x{304}" ‘ȭ’, and "o\x{304}\x{303}" ‘ō̃’ should all match ‘õ’, but how in the world are you going to do that? This is harder than it looks, but it’s something you need to account for.
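A sketch of doing exactly that with a primary-strength UCA comparison:

use Unicode::Collate;
my $coll = Unicode::Collate->new(level => 1);
for my $cand ("\x{F5}", "o\x{303}", "o\x{303}\x{304}", "o\x{304}\x{303}") {
    print $coll->eq($cand, "\x{F5}") ? "matches\n" : "differs\n";   # all four match
}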
If there’s one thing I know about Perl, it is what its Unicode bits do and do not do, and this thing I promise you: “There is no Unicode magic bullet.”
You cannot just change some defaults and get smooth sailing. It’s true that I run 🐪 with PERL_UNICODE set to "SA", but that’s all, and even that is mostly for command-line stuff. For real work, I go through all the many steps outlined above, and I do it very, very carefully.