One of the things that made the Python 2 to 3 upgrade complicated was the redesigned handling of Unicode strings. It caused untold breakage in the name of better Unicode support - and after all of that dust had settled - you still have len("👨👩👧👦") == 7 instead of 1.
Why did that happen?
Just using naive len calls in different programming languages will tell you it has a “length” of:
25 (UTF-8 bytes)
- Python 2: len("👨👩👧👦") == 25
- PHP 7.4: strlen("👨👩👧👦") == 25
11 (UTF-16 code units)
- JavaScript: "👨👩👧👦".length === 11
- Python 2 as Unicode: len(u"👨👩👧👦") == 11
7 (Unicode codepoints)
- Python 3: len("👨👩👧👦") == 7
- Ruby 3: "👨👩👧👦".length == 7
1 (Grapheme clusters)
- Raku (née Perl 6): "👨👩👧👦".chars == 1
At first glance it seems that only Raku does the correct thing here, reporting “👨👩👧👦” as having length 1 when doing the simple, obvious thing.
But it’s unstable
So the Python 3 design is that len gives Unicode codepoints, not grapheme clusters (“visible characters”, as an approximation). So len("👨👩👧👦") is 7 instead of 1, which might seem unusual.
One interesting problem with having this return len == 1 is that it is not a stable result. New versions of Unicode define new grapheme clusters out of existing codepoints. Python 2 to 3 was a major transition, but every year’s new emoji definitions could cause ongoing subtle breakage. This affects every programming language and library that does not explicitly pin a version of Unicode.
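As an aside, you can check which Unicode version a given Python build has pinned at runtime (the exact version string varies by Python release, so the value below is only illustrative):

```python
import unicodedata

# Each CPython release bundles one specific version of the Unicode
# Character Database; anything built on unicodedata inherits it.
print(unicodedata.unidata_version)  # e.g. "15.0.0"
```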
From a discussion on LWN.net:
I was thinking more of cases where you mix up strings as sequences of Unicode-8.0-graphemes and as sequences of Unicode-9.0-graphemes. Like you implement a login form where the password must be at least 8 characters (using Raku’s built-in definition of ‘character’), and a user registers with an 8-character password, but then you upgrade Raku and now that user’s password is only 7 characters and the form won’t let them log in.
Different versions of software will see it as a different number of grapheme clusters. There are several points in time to consider.
Any version of these interpreters prior to Emoji 2.0 / Unicode 8 in 2015 has no grapheme-cluster definition covering these 7 codepoints (and a renderer from that era would display it as 👨👩👧👦).
We can demonstrate this using Rakudo Perl 2013.12-1 from Ubuntu 14.04:
LC_ALL=C.UTF-8 perl6 -e $'print "\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa9\xe2\x80\x8d\xf0\x9f\x91\xa7\xe2\x80\x8d\xf0\x9f\x91\xa6".chars'
7
Then Emoji 2.0 explicitly defined this 👨 ZWJ 👩 ZWJ 👧 ZWJ 👦 sequence of Unicode codepoints to render as the combined 👨👩👧👦 grapheme cluster.
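The seven codepoints are easy to inspect from Python (the string is built from explicit escapes here, since the ZWJs are invisible and tend to get lost in copy-paste):

```python
import unicodedata

# MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY - 7 codepoints in total
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
for ch in family:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+1F468 MAN
# U+200D ZERO WIDTH JOINER
# U+1F469 WOMAN
# U+200D ZERO WIDTH JOINER
# U+1F467 GIRL
# U+200D ZERO WIDTH JOINER
# U+1F466 BOY
```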
Then the reality of mobile devices set in:
Originally Unicode documented all the ZWJ sequences, so tools dealing with grapheme clusters could correctly report them as a single character. But iOS/Android kept adding new ones, so Unicode gave up and made a generic rule that any EMOJI+ZWJ+EMOJI sequence is a grapheme cluster
In the current implicit timeframe, this means that “fake emojis” such as 👨 ZWJ 👨 ZWJ 👨 ZWJ 👨 ZWJ 👨 ZWJ 👨 (…) will appear to have a “length” of 1, no matter how many codepoints they contain. At least this should be a stable definition going forward.
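You can see the effect of the generic EMOJI+ZWJ+EMOJI rule with a toy cluster counter. This is a deliberately simplified sketch, not a real UAX #29 implementation: it only merges codepoints joined by U+200D and ignores every other boundary rule, which is enough to show why length stays 1 however long the ZWJ chain grows.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

def toy_cluster_count(s: str) -> int:
    """Count 'clusters', treating any ZWJ-joined run as one unit.

    Simplified illustration only - real segmentation follows UAX #29.
    """
    count = 0
    i = 0
    while i < len(s):
        count += 1  # start a new cluster at this codepoint
        i += 1
        # absorb every following (ZWJ, codepoint) pair into this cluster
        while i + 1 < len(s) and s[i] == ZWJ:
            i += 2
    return count

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
fake = ZWJ.join(["\U0001F468"] * 6)  # 👨 ZWJ 👨 ZWJ ... : 11 codepoints

print(toy_cluster_count(family))  # 1
print(toy_cluster_count(fake))    # 1 - arbitrary length, still one "character"
print(toy_cluster_count("abc"))   # 3
```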
If Python 3 had used grapheme clusters for len when it was released in December 2008, it would have (A) gotten this wrong for most of its life anyway and (B) caused ongoing breakage, as the output of len would not be stable across versions and OS environments.
Of course, you can get any of the 25, 11, 7, and 1 numbers in most languages if you explicitly ask for what you want:
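For example, in Python 3 the first three counts fall out of the standard library (the string is built from explicit escapes so the ZWJs survive; the grapheme count needs a third-party library, so it is only noted in a comment):

```python
# MAN ZWJ WOMAN ZWJ GIRL ZWJ BOY - the family emoji as 7 codepoints
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

print(len(family.encode("utf-8")))           # 25 UTF-8 bytes
print(len(family.encode("utf-16-le")) // 2)  # 11 UTF-16 code units
print(len(family))                           # 7  Unicode codepoints
# 1 grapheme cluster, e.g. via the third-party "regex" module:
#   len(regex.findall(r"\X", family)) == 1
```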