
unicode - Why does modern Perl avoid UTF-8 by default?


I wonder why most modern solutions built using Perl don't enable UTF-8 by default.



I understand there are many legacy problems for core Perl scripts, where it may break things. But, from my point of view, in the 21st century, big new projects (or projects with a big perspective) should make their software UTF-8 proof from scratch. Still I don't see it happening. For example, Moose enables strict and warnings, but not Unicode (http://en.wikipedia.org/wiki/Unicode). Modern::Perl (http://search.cpan.org/~chromatic/Modern-Perl-1.03/lib/Modern/Perl.pm) reduces boilerplate too, but no UTF-8 handling.




Why? Are there some reasons
to avoid UTF-8 in modern Perl projects in the year
2011?




My reply to @tchrist's comments got too long, so I'm adding it here.



It seems that I did not make myself clear.
Let me try to add some
things.



tchrist and I see the situation pretty similarly, but our conclusions are at opposite ends. I agree, the situation with Unicode is complicated, but this is exactly why we (Perl users and coders) need some layer (or pragma) that makes UTF-8 handling as easy as it ought to be nowadays.




tchrist points to many aspects to cover, and I will read and think about them for days or even weeks. Still, this is not my point. tchrist tries to prove that there is no single way "to enable UTF-8". I don't have enough knowledge to argue with that, so I stick to live examples.



I played around with Rakudo, and UTF-8 was just there as I needed it. I didn't have any problems; it just worked. Maybe there are some limitations somewhere deeper, but at the start, everything I tested worked as I expected.



Shouldn't that be a goal in modern Perl 5 too? Let me stress it: I'm not suggesting UTF-8 as the default character set for core Perl; I'm suggesting the possibility of triggering it with a snap for those who develop new projects.



Another example, but with a more negative tone. Frameworks should make development easier. Some years ago, I tried web frameworks, but just threw them away because "enabling UTF-8" was so obscure. I did not find how or where to hook in Unicode support. It was so time-consuming that I found it easier to go the old way. Now I see there was a bounty to deal with the same problem with Mason 2: How to make Mason2 UTF-8 clean? (https://stackoverflow.com/questions/5858596/how-to-make-mason2-utf8-clean). So, it is a pretty new framework, but using it with UTF-8 needs deep knowledge of its internals. It is like a big red sign: STOP, don't use me!



I really like Perl. But dealing with Unicode is painful. I still find myself running into walls. In some ways tchrist is right and answers my question: new projects don't adopt UTF-8 because it is too complicated in Perl 5.



Answer








  1. Set
    your PERL_UNICODE envariable to AS.
    This makes all Perl scripts decode @ARGV as UTF‑8 strings, and
    sets the encoding of all three of stdin, stdout, and stderr to UTF‑8. Both these are
    global effects, not lexical ones.


  2. At
    the top of your source file (program, module, library,
    dohickey), prominently assert that you are running perl version
    5.12 or better via:



    use v5.12;  # minimal for unicode string feature
    use v5.14;  # optimal for unicode string feature

  3. Enable
    warnings, since the previous declaration only enables strictures and features, not
    warnings. I also suggest promoting Unicode warnings into exceptions, so use both these
    lines, not just one of them. Note however that under v5.14, the
    utf8 warning class comprises three other subwarnings which can
    all be separately enabled: nonchar,
    surrogate, and non_unicode. These you
    may wish to exert greater control
    over.




    use warnings;
    use warnings qw( FATAL utf8 );

  4. Declare
    that this source unit is encoded as UTF‑8. Although once upon a time this pragma did
    other things, it now serves this one singular purpose alone and no
    other:



    use utf8;

  5. Declare
    that anything that opens a filehandle within this lexical scope but not
    elsewhere
    is to assume that that stream is encoded in UTF‑8 unless you tell
    it otherwise. That way you do not affect other modules’ or other programs’ code.




    use open qw( :encoding(UTF-8) :std );

  6. Enable
    named characters via
    \N{CHARNAME}.



    use charnames qw( :full :short );

  7. If you
    have a DATA handle, you must explicitly set its encoding. If
    you want this to be UTF‑8, then
    say:



    binmode(DATA, ":encoding(UTF-8)");




There is of course no end of other matters with which you may eventually find yourself concerned, but these will suffice to approximate the stated goal of “making everything just work with UTF-8”, albeit for a somewhat weakened sense of those terms.
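
For instance, here is a minimal, hedged sketch of a program that pulls steps 2 through 6 together (step 1 lives in the environment rather than in the source, and step 7 matters only if you have a DATA section); the character names and strings are just illustrations:

    #!/usr/bin/env perl
    # Hedged sketch only: combines the pragmas recommended in steps 2-6 above.
    use v5.14;                                  # 5.12 is the minimum; 5.14 is better
    use utf8;                                   # this source file is UTF-8
    use warnings;
    use warnings qw( FATAL utf8 );              # encoding errors become fatal
    use open     qw( :encoding(UTF-8) :std );   # UTF-8 default for new handles and std streams
    use charnames qw( :full :short );           # enables \N{CHARNAME}

    say "Resum\N{LATIN SMALL LETTER E WITH ACUTE}";            # prints "Resumé"
    say "\N{GREEK SMALL LETTER PI} is not \N{N-ARY PRODUCT}";  # π is not ∏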



One other pragma, although it is not Unicode
related, is:



    use autodie;


It is
strongly recommended.
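
As a small, hedged illustration of why (the filename is made up): with autodie, a failed open throws an exception with a useful message instead of needing an explicit or die after every call, and it composes nicely with the UTF-8 layers from step 5:

    use v5.14;
    use autodie;                              # failed system calls now throw exceptions
    use open qw( :encoding(UTF-8) :std );     # step 5: default to UTF-8 I/O

    # No "or die ..." needed; autodie raises an exception if this open fails.
    open my $fh, "<", "example.txt";          # hypothetical filename
    while (my $line = <$fh>) {
        print $line;
    }
    close $fh;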




🌴 🐪🐫🐪 🌞 𝕲𝖔 𝕿𝖍𝖔𝖚 𝖆𝖓𝖉 𝕯𝖔 𝕷𝖎𝖐𝖊𝖜𝖎𝖘𝖊 🌞 🐪🐫🐪 🐁






🎁 🐪 𝕭𝖔𝖎𝖑𝖊𝖗⸗𝖕𝖑𝖆𝖙𝖊 𝖋𝖔𝖗 𝖀𝖓𝖎𝖈𝖔𝖉𝖊⸗𝕬𝖜𝖆𝖗𝖊 𝕮𝖔𝖉𝖊 🐪 🎁






My own
boilerplate these days tends to look like
this:




use 5.014;

use utf8;
use strict;
use autodie;
use warnings;
use warnings    qw< FATAL utf8 >;
use open        qw< :std :utf8 >;
use charnames   qw< :full >;

use feature     qw< unicode_strings >;

use File::Basename     qw< basename >;
use Carp               qw< carp croak confess cluck >;
use Encode             qw< encode decode >;
use Unicode::Normalize qw< NFD NFC >;

END { close STDOUT }

if (grep /\P{ASCII}/ => @ARGV) {
    @ARGV = map { decode("UTF-8", $_) } @ARGV;
}

$0 = basename($0);  # shorter messages
$| = 1;

binmode(DATA, ":utf8");

# give a full stack dump on any untrapped exceptions
local $SIG{__DIE__} = sub {
    confess "Uncaught exception: @_" unless $^S;
};

# now promote run-time warnings into stack-dumped
# exceptions *unless* we're in a try block, in
# which case just cluck the stack dump instead
local $SIG{__WARN__} = sub {
    if ($^S) { cluck   "Trapped warning: @_" }
    else     { confess "Deadly warning: @_"  }
};

while (<>) {
    chomp;
    $_ = NFD($_);
    ...
} continue {
    say NFC($_);
}

__END__







Saying that “Perl should
[somehow!] enable Unicode by default” doesn’t even start to
begin to think about getting around to saying enough to be even marginally useful in
some sort of rare and isolated case. Unicode is much much more than just a larger
character repertoire; it’s also how those characters all interact in many, many
ways.




Even the simple-minded minimal
measures that (some) people seem to think they want are guaranteed to miserably break
millions of lines of code, code that has no chance to “upgrade” to your spiffy new
Brave New World modernity.



It is way way way more complicated than people
pretend. I’ve thought about this a huge, whole lot over the past few years. I would love
to be shown that I am wrong. But I don’t think I am. Unicode is fundamentally more
complex than the model that you would like to impose on it, and there is complexity here
that you can never sweep under the carpet. If you try, you’ll break either your own code
or somebody else’s. At some point, you simply have to break down and learn what Unicode
is about. You cannot pretend it is something it is not.



🐪 goes out of its way to make Unicode easy, far more than anything else I’ve ever used. If you think this is bad, try something else for a while. Then come back to 🐪: either you will have returned to a better world, or else you will bring knowledge of the same with you so that we can make use of your new knowledge to make 🐪 better at these things.




💡 𝕴𝖉𝖊𝖆𝖘 𝖋𝖔𝖗 𝖆 𝖀𝖓𝖎𝖈𝖔𝖉𝖊⸗𝕬𝖜𝖆𝖗𝖊 🐪 𝕷𝖆𝖚𝖓𝖉𝖗𝖞 𝕷𝖎𝖘𝖙 💡





At a minimum, here are some things that would appear to be required for 🐪 to “enable Unicode by default”, as you put it:




  1. All 🐪 source code should be in UTF-8 by default. You can get that with use utf8 or export PERL5OPTS=-Mutf8.


  2. The 🐪 DATA handle should be UTF-8. You will have to do this on a per-package basis, as in binmode(DATA, ":encoding(UTF-8)").


  3. Program arguments to 🐪 scripts should be understood to be UTF-8 by default. export PERL_UNICODE=A, or perl -CA, or export PERL5OPTS=-CA.


  4. The standard input, output, and error streams should default to UTF-8. export PERL_UNICODE=S for all of them, or I, O, and/or E for just some of them. This is like perl -CS.


  5. Any other handles opened by 🐪 should be considered UTF-8 unless declared otherwise; export PERL_UNICODE=D, or with i and o for particular ones of these; export PERL5OPTS=-CD would work. That makes -CSAD for all of them.



  6. Cover both bases plus all the streams you open with export PERL5OPTS=-Mopen=:utf8,:std. See uniquote (http://training.perl.com/scripts/uniquote).


  7. You don’t want to miss UTF-8 encoding errors. Try export PERL5OPTS=-Mwarnings=FATAL,utf8. And make sure your input streams are always binmoded to :encoding(UTF-8), not just to :utf8.


  8. Code points between 128–255 should be understood by 🐪 to be the corresponding Unicode code points, not just unpropertied binary values. use feature "unicode_strings" or export PERL5OPTS=-Mfeature=unicode_strings. That will make uc("\xDF") eq "SS" and "\xE9" =~ /\w/. A simple export PERL5OPTS=-Mv5.12 or better will also get that.


  9. Named Unicode characters are not by default enabled, so add export PERL5OPTS=-Mcharnames=:full,:short,latin,greek or some such. See uninames (http://training.perl.com/scripts/uninames) and tcgrep (http://training.perl.com/scripts/tcgrep).


  10. You almost always need access to the functions from the standard Unicode::Normalize module (http://search.cpan.org/perldoc?Unicode::Normalize) for various types of decompositions: export PERL5OPTS=-MUnicode::Normalize=NFD,NFKD,NFC,NFKC, and then always run incoming stuff through NFD and outbound stuff through NFC. There’s no I/O layer for these yet that I’m aware of, but see nfc, nfd (http://training.perl.com/scripts/nfd), nfkd (http://training.perl.com/scripts/nfkd), and nfkc (http://training.perl.com/scripts/nfkc).


  11. String comparisons in 🐪 using eq, ne, lc, cmp, sort, &c&cc are always wrong. So instead of @a = sort @b, you need @a = Unicode::Collate->new->sort(@b). Might as well add that to your export PERL5OPTS=-MUnicode::Collate. You can cache the key for binary comparisons. (There is a short sketch of this, together with NFD/NFC, just after this list.)


  12. 🐪 built-ins like printf and write do the wrong thing with Unicode data. You need to use the Unicode::GCString module (http://search.cpan.org/perldoc?Unicode::GCString) for the former, and both that and also the Unicode::LineBreak module as well for the latter. See uwc and unifmt.


  13. If
    you want them to count as integers, then you are going to have to run your
    \d+ captures through href="http://search.cpan.org/~jesse/perl-5.14.0/lib/Unicode/UCD.pm#num"
    rel="noreferrer">the Unicode::UCD::num function
    because ๐Ÿช’s built-in atoi(3) isn’t currently clever
    enough.


  14. You are going to have filesystem issues on 👽 filesystems. Some filesystems silently enforce a conversion to NFC; others silently enforce a conversion to NFD. And others do something else still. Some even ignore the matter altogether, which leads to even greater problems. So you have to do your own NFC/NFD handling to keep sane.


  15. All your 🐪 code involving a-z or A-Z and such MUST BE CHANGED, including m//, s///, and tr///. It should stand out as a screaming red flag that your code is broken. But it is not clear how it must change. Getting the right properties, and understanding their casefolds, is harder than you might think. I use unichars and uniprops (http://training.perl.com/scripts/uniprops) every single day.



  16. Code that uses
    \p{Lu} is almost as wrong as code that uses
    [A-Za-z]. You need to use \p{Upper}
    instead, and know the reason why. Yes, \p{Lowercase} and
    \p{Lower} are different from \p{Ll}
    and
    \p{Lowercase_Letter}.


  17. Code
    that uses [a-zA-Z] is even worse. And it can’t use
    \pL or \p{Letter}; it needs to use
    \p{Alphabetic}. Not all alphabetics are letters, you
    know!


  18. If you are looking for 🐪 variables with /[\$\@\%]\w+/, then you have a problem. You need to look for /[\$\@\%]\p{IDS}\p{IDC}*/, and even that isn’t thinking about the punctuation variables or package variables.


  19. If you are checking for
    whitespace, then you should choose between \h and
    \v, depending. And you should never use
    \s, since it DOES NOT MEAN
    [\h\v], contrary to popular
    belief.


  20. If you are using
    \n for a line boundary, or even \r\n,
    then you are doing it wrong. You have to use \R, which is not
    the same!


  21. If you don’t know when and whether to call Unicode::Stringprep, then you had better learn.


  22. Case-insensitive comparisons need to check for whether two things are the same letters no matter their diacritics and such. The easiest way to do that is with the standard Unicode::Collate module (http://search.cpan.org/perldoc?Unicode::Collate): Unicode::Collate->new(level => 1)->cmp($a, $b). There are also eq methods and such, and you should probably learn about the match and substr methods, too. These have distinct advantages over the 🐪 built-ins.


  23. Sometimes that’s still not enough, and you need the Unicode::Collate::Locale module (http://search.cpan.org/perldoc?Unicode::Collate::Locale) instead, as in Unicode::Collate::Locale->new(locale => "de__phonebook", level => 1)->cmp($a, $b). Consider that Unicode::Collate::->new(level => 1)->eq("d", "ð") is true, but Unicode::Collate::Locale->new(locale => "is", level => 1)->eq("d", "ð") is false. Similarly, "ae" and "æ" are eq if you don’t use locales, or if you use the English one, but they are different in the Icelandic locale. Now what? It’s tough, I tell you. You can play with ucsort to test some of these things out.


  24. Consider how to match the pattern CVCV (consonant, vowel, consonant, vowel) in the string “niño”. Its NFD form — which you had darned well better have remembered to put it in — becomes “nin\x{303}o”. Now what are you going to do? Even pretending that a vowel is [aeiou] (which is wrong, by the way), you won’t be able to do something like (?=[aeiou])\X either, because even in NFD a code point like ‘ø’ does not decompose! However, it will test equal to an ‘o’ using the UCA comparison I just showed you. You can’t rely on NFD, you have to rely on UCA.
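
Pulling a few of those threads together (chiefly numbers 10, 11, 22, and 24), here is a short, hedged sketch; the strings are invented, and the exact behaviour depends on the Unicode tables your Unicode::Collate ships with:

    use v5.14;
    use utf8;
    use open qw( :std :encoding(UTF-8) );
    use Unicode::Normalize qw( NFD NFC );
    use Unicode::Collate;

    # Canonical equivalence: precomposed vs. decomposed spellings of the same text.
    my $composed   = "\x{F5}";       # õ as a single code point
    my $decomposed = "o\x{303}";     # o followed by COMBINING TILDE
    say $composed eq $decomposed           ? "eq" : "not eq";   # not eq
    say NFD($composed) eq NFD($decomposed) ? "eq" : "not eq";   # eq
    say NFC($decomposed) eq $composed      ? "eq" : "not eq";   # eq

    # A UCA comparison at level 1 (primary strength) ignores case and diacritics,
    # which is what lets ‘ø’ test equal to ‘o’ even though it never decomposes.
    my $uca = Unicode::Collate->new(level => 1);
    say $uca->eq("\x{F8}", "o") ? "same at level 1" : "different";

    # And sorting the UCA way instead of by raw code point, as in item 11.
    say join " ", Unicode::Collate->new->sort(qw( Éclair zebra déjà apple ));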






💩 𝔸 𝕤 𝕤 𝕦 𝕞 𝕖 𝔹 𝕣 𝕠 𝕜 𝕖 𝕟 𝕟 𝕖 𝕤 𝕤 💩






And that’s not all. There are a million broken assumptions that people make about Unicode. Until they understand these things, their 🐪 code will be broken.





  1. Code
    that assumes it can open a text file without specifying the encoding is
    broken.


  2. Code that assumes the default
    encoding is some sort of native platform encoding is
    broken.


  3. Code that assumes that web
    pages in Japanese or Chinese take up less space in UTF‑16 than in UTF‑8 is
    wrong.


  4. Code that assumes Perl uses
    UTF‑8 internally is wrong.


  5. Code that
    assumes that encoding errors will always raise an exception is
    wrong.


  6. Code that assumes Perl code
    points are limited to 0x10_FFFF is
    wrong.


  7. Code that assumes you can set
    $/ to something that will work with any valid line separator is
    wrong.


  8. Code that assumes roundtrip equality on casefolding, like lc(uc($s)) eq $s or uc(lc($s)) eq $s, is completely broken and wrong. Consider that uc("σ") and uc("ς") are both "Σ", but lc("Σ") cannot possibly return both of those. (There is a short sketch of this after this list.)


  9. Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example, "ª" is a lowercase letter with no uppercase; whereas both "ᵃ" and "ᴬ" are letters, but they are not lowercase letters; however, they are both lowercase code points without corresponding uppercase versions. Got that? They are not \p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.


  10. Code
    that assumes changing the case doesn’t change the length of the string is
    broken.



  11. Code that assumes
    there are only two cases is broken. There’s also
    titlecase.


  12. Code that assumes only
    letters have case is broken. Beyond just letters, it turns out that numbers, symbols,
    and even marks have case. In fact, changing the case can even make something change its
    main general category, like a \p{Mark} turning into a
    \p{Letter}. It can also make it switch from one script to
    another.


  13. Code that assumes that case
    is never locale-dependent is
    broken.


  14. Code that assumes Unicode
    gives a fig about POSIX locales is
    broken.


  15. Code that assumes you can
    remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged,
    wrong, and justification for capital
    punishment.


  16. Code that assumes that
    diacritics \p{Diacritic} and marks
    \p{Mark} are the same thing is
    broken.


  17. Code that assumes
    \p{GC=Dash_Punctuation} covers as much as
    \p{Dash} is
    broken.


  18. Code that assumes dash,
    hyphens, and minuses are the same thing as each other, or that there is only one of
    each, is broken and wrong.


  19. Code that
    assumes every code point takes up no more than one print column is
    broken.


  20. Code that assumes that all
    \p{Mark} characters take up zero print columns is
    broken.



  21. Code that assumes
    that characters which look alike are alike is
    broken.


  22. Code that assumes that
    characters which do not look alike are not
    alike is broken.


  23. Code that assumes
    there is a limit to the number of code points in a row that just one
    \X can match is
    wrong.


  24. Code that assumes
    \X can never start with a \p{Mark}
    character is wrong.


  25. Code that assumes
    that \X can never hold two
    non-\p{Mark} characters is
    wrong.


  26. Code that assumes that it
    cannot use "\x{FFFF}" is
    wrong.


  27. Code that assumes a non-BMP
    code point that requires two UTF-16 (surrogate) code units will encode to two separate
    UTF-8 characters, one per code unit, is wrong. It doesn’t: it encodes to a single code point.


  28. Code that transcodes from
    UTF‐16 or UTF‐32 with leading BOMs into UTF‐8 is broken if it puts a BOM at the start of
    the resulting UTF-8. This is so stupid the engineer should have their eyelids
    removed.


  29. Code that assumes the CESU-8
    is a valid UTF encoding is wrong. Likewise, code that thinks encoding U+0000 as
    "\xC0\x80" is UTF-8 is broken and wrong. These guys also
    deserve the eyelid treatment.


  30. Code
    that assumes characters like > always points to the right
    and < always points to the left are wrong — because they in
    fact do not.



  31. Code that
    assumes if you first output character X and then character
    Y, that those will show up as XY is
    wrong. Sometimes they
    don’t.


  32. Code that assumes
    that ASCII is good enough for writing English properly is stupid, shortsighted,
    illiterate, broken, evil, and wrong.
    Off with their heads! If that seems
    too extreme, we can compromise: henceforth they may type only with their big toe from
    one foot. (The rest will be duct
    taped.)


  33. Code that assumes that all
    \p{Math} code points are visible characters is
    wrong.


  34. Code that assumes
    \w contains only letters, digits, and underscores is
    wrong.


  35. Code that assumes that
    ^ and ~ are punctuation marks is
    wrong.


  36. Code that assumes that ü has an umlaut is wrong.


  37. Code that believes things like ₨ contain any letters in them is wrong.


  38. Code that believes
    \p{InLatin} is the same as \p{Latin}
    is heinously broken.


  39. Code that believes that \p{InLatin} is almost ever useful is almost certainly wrong.


  40. Code that believes
    that given $FIRST_LETTER as the first letter in some alphabet
    and $LAST_LETTER as the last letter in that same alphabet, that
    [${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is
    almost always completely broken and wrong and meaningless.



  41. Code that
    believes someone’s name can only contain certain characters is stupid, offensive, and
    wrong.


  42. Code that tries to reduce
    Unicode to ASCII is not merely wrong, its perpetrator should never be allowed to work in
    programming again. Period. I’m not even positive they should even be allowed to see
    again, since it obviously hasn’t done them much good so
    far.


  43. Code that believes there’s some
    way to pretend textfile encodings don’t exist is broken and dangerous. Might as well
    poke the other eye out, too.


  44. Code that
    converts unknown characters to ? is broken, stupid, braindead,
    and runs contrary to the standard recommendation, which says NOT TO DO
    THAT!
    RTFM for why
    not.


  45. Code that believes it can reliably guess the encoding of an unmarked textfile is guilty of a fatal mélange of hubris and naïveté that only a lightning bolt from Zeus will fix.


  46. Code that believes you can use 🐪 printf widths to pad and justify Unicode data is broken and wrong.


  47. Code that believes once you
    successfully create a file by a given name, that when you run
    ls or readdir on its enclosing
    directory, you’ll actually find that file with the name you created it under is buggy,
    broken, and wrong. Stop being surprised by
    this!


  48. Code that believes UTF-16 is a
    fixed-width encoding is stupid, broken, and wrong. Revoke their programming
    licence.


  49. Code that treats code points
    from one plane one whit differently than those from any other plane is ipso
    facto
    broken and wrong. Go back to
    school.


  50. Code that believes that stuff
    like /s/i can only match "S" or
    "s" is broken and wrong. You’d be
    surprised.



  51. Code that uses
    \PM\pM* to find grapheme clusters instead of using
    \X is broken and
    wrong.


  52. People who want to go back to the ASCII world should be whole-heartedly encouraged to do so, and in honor of their glorious upgrade they should be provided gratis with a pre-electric manual typewriter for all their data-entry needs. Messages sent to them should be sent via an ᴀʟʟᴄᴀᴘs telegraph at 40 characters per line and hand-delivered by a courier. STOP.
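
A hedged sketch of a few of these, chiefly numbers 8, 9, and 50 above; it assumes a reasonably recent perl (v5.14 or later):

    use v5.14;
    use utf8;
    use open qw( :std :encoding(UTF-8) );

    # Number 8: both lowercase sigmas uppercase to the same letter, so
    # lc(uc($s)) cannot possibly round-trip for one of them.
    say uc("σ");        # Σ
    say uc("ς");        # Σ
    say lc("Σ");        # σ, never ς

    # Number 9: ª is a Letter and Lowercase, yet not a Lowercase_Letter.
    say "ª" =~ /\p{Letter}/           ? "Letter"           : "not Letter";
    say "ª" =~ /\p{Lowercase}/        ? "Lowercase"        : "not Lowercase";
    say "ª" =~ /\p{Lowercase_Letter}/ ? "Lowercase_Letter" : "not Lowercase_Letter";

    # Number 50: /s/i matches more than just "S" and "s".
    say "ſ" =~ /s/i ? "LATIN SMALL LETTER LONG S matches /s/i" : "no match";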









I don’t know how much more “default Unicode in 🐪” you can get than what I’ve written. Well, yes I do: you should be using Unicode::Collate and Unicode::LineBreak, too. And probably more.
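
For the width and padding problems mentioned back in item 12, a hedged sketch of what Unicode::GCString buys you (it is a CPAN module from the same distribution as Unicode::LineBreak, not core); the string is invented:

    use v5.14;
    use utf8;
    use open qw( :std :encoding(UTF-8) );
    use Unicode::GCString;

    # length() counts code points, which is not what a terminal shows once
    # combining marks enter the picture; columns() counts display columns.
    my $s   = "cre\x{300}me";                # "crème" spelled with a combining grave
    my $gcs = Unicode::GCString->new($s);
    say length($s);                          # 6 code points
    say $gcs->columns;                       # 5 columns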



As you see, there are far too many Unicode
things that you really do have to worry about for there to
ever exist any such thing as “default to
Unicode”.



What you’re going to discover, just as we did back in 🐪 5.8, is that it is simply impossible to impose all these things on code that hasn’t been designed right from the beginning to account for them. Your well-meaning selfishness just broke the entire world.



And even once you do, there are still
critical issues that require a great deal of thought to get right. There is no switch
you can flip. Nothing but brain, and I mean real brain, will
suffice here. There’s a heck of a lot of stuff you have to learn. Modulo the retreat to
the manual typewriter, you simply cannot hope to sneak by in ignorance. This is the 21ˢᵗ
century, and you cannot wish Unicode away by willful ignorance.



You have to learn it. Period. It will never be
so easy that “everything just works,” because that will guarantee that a lot of things
don’t work — which invalidates the assumption that there
can ever be a way to “make it all
work.”




You may be able to get a few
reasonable defaults for a very few and very limited operations, but not without thinking
about things a whole lot more than I think you
have.



As just one example, canonical ordering is going to cause some real headaches. 😭 "\x{F5}" ‘õ’, "o\x{303}" ‘õ’, "o\x{303}\x{304}" ‘ȭ’, and "o\x{304}\x{303}" ‘ō̃’ should all match ‘õ’, but how in the world are you going to do that? This is harder than it looks, but it’s something you need to account for. 💣
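
One hedged way to attack that particular headache, following the advice above to lean on the UCA rather than on NFD alone: at primary strength (level 1) all four spellings compare equal, though note that at that strength they would also compare equal to a bare ‘o’:

    use v5.14;
    use utf8;
    use open qw( :std :encoding(UTF-8) );
    use Unicode::Collate;

    my $uca       = Unicode::Collate->new(level => 1);
    my @spellings = ("\x{F5}", "o\x{303}", "o\x{303}\x{304}", "o\x{304}\x{303}");

    for my $s (@spellings) {
        say $uca->eq($s, "\x{F5}") ? "matches õ at level 1" : "does not match";
    }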



If there’s one thing I know about Perl, it is what its Unicode bits do and do not do, and this thing I promise you: “ᴛ̲ʜ̲ᴇ̲ʀ̲ᴇ̲ ̲ɪ̲s̲ ̲ɴ̲ᴏ̲ ̲U̲ɴ̲ɪ̲ᴄ̲ᴏ̲ᴅ̲ᴇ̲ ̲ᴍ̲ᴀ̲ɢ̲ɪ̲ᴄ̲ ̲ʙ̲ᴜ̲ʟ̲ʟ̲ᴇ̲ᴛ̲” 😞



You cannot just change some defaults and get smooth sailing. It’s true that I run 🐪 with PERL_UNICODE set to "SA", but that’s all, and even that is mostly for command-line stuff. For real work, I go through all the many steps outlined above, and I do it very, very carefully.





