One of the things that made Python 2 to 3 upgrades complicated was the new design for how Unicode strings are handled. It caused untold breakage for the sake of better Unicode support and, after all of that dust had settled, len("👨‍👩‍👧‍👦") is still 7 instead of 1.
Why did that happen?
Naive len calls in different programming languages will tell you it has a “length” of:
25 (UTF-8 bytes)
- Go: len("👨‍👩‍👧‍👦") == 25
- Python 2: len("👨‍👩‍👧‍👦") == 25
- PHP 7.4: strlen("👨‍👩‍👧‍👦") == 25
11 (UTF-16 code units)
- JavaScript: "👨‍👩‍👧‍👦".length === 11
- Python 2 as Unicode (narrow build): len(u"👨‍👩‍👧‍👦") == 11
7 (Unicode codepoints)
- Python 3: len("👨‍👩‍👧‍👦") == 7
- Ruby 3: "👨‍👩‍👧‍👦".length == 7
1 (Grapheme clusters)
- Raku (née Perl 6): "👨‍👩‍👧‍👦".chars == 1
At first glance it seems like only Raku does the correct thing here by declaring “👨‍👩‍👧‍👦” to have length 1 (when doing the simple thing).
But it’s unstable
So the Python 3 design is that len gives Unicode codepoints, not grapheme clusters (“visible characters” as an approximation). So len("👨‍👩‍👧‍👦") is 7 instead of 1, which might seem unusual.
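To make the count of 7 concrete, the short Python 3 sketch below lists each codepoint in the string with its name from the standard unicodedata module. The string is written with escapes so that it survives any editor or font; the helper name describe_codepoints is only an illustrative choice.

```python
import unicodedata

# The family emoji spelled out as escapes: four emoji joined by three
# ZERO WIDTH JOINER (U+200D) codepoints.
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"


def describe_codepoints(text):
    """Return a (hex codepoint, Unicode name) pair for each codepoint in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


if __name__ == "__main__":
    for code, name in describe_codepoints(FAMILY):
        print(code, name)
    print(len(FAMILY))  # number of codepoints in the string
```

Each emoji and each joiner is its own codepoint, which is exactly what len counts in Python 3.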
One interesting problem with having this return len == 1 is that it is not a stable result. New versions of Unicode define new grapheme clusters out of existing codepoints. Python 2 to 3 was a major one-off transition, but every year’s new emoji definitions could cause ongoing subtle breakage. This affects all programming languages and libraries where you are not explicitly pinning a version of Unicode.
From a discussion on LWN.net:
I was thinking more of cases where you mix up strings as sequences of Unicode-8.0-graphemes and as sequences of Unicode-9.0-graphemes. Like you implement a login form where the password must be at least 8 characters (using Raku’s built-in definition of ‘character’), and a user registers with an 8-character password, but then you upgrade Raku and now that user’s password is only 7 characters and the form won’t let them log in.
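The scenario in the quote can be avoided by counting something that does not change between Unicode versions. The Python 3 sketch below checks a minimum password length in codepoints and reports which Unicode version the interpreter's character database was built from. MIN_PASSWORD_CODEPOINTS and password_long_enough are hypothetical names for illustration, not part of any real login system.

```python
import unicodedata

MIN_PASSWORD_CODEPOINTS = 8  # hypothetical policy value


def password_long_enough(password):
    """Length check in codepoints: the count for a given string does not
    depend on which Unicode version the interpreter ships with."""
    return len(password) >= MIN_PASSWORD_CODEPOINTS


if __name__ == "__main__":
    # The Unicode version of this interpreter's character database.
    print("Unicode data version:", unicodedata.unidata_version)
    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
    print(password_long_enough("a" + family))  # 1 + 7 codepoints
```

A grapheme-based check would instead have to record which Unicode version each stored password was validated against.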
Different versions of software will see it as being a different number of grapheme clusters. There are several time stages to consider.
Old versions
Any version of these interpreters prior to Emoji 2.0 / Unicode 8 in 2015 does not have a grapheme-cluster definition for these 7 separate codepoints (and a renderer from this time would display it as 👨👩👧👦).
We can demonstrate this using Rakudo Perl 2013.12-1 from Ubuntu 14.04:
LC_ALL=C.UTF-8 perl6 -e $'print "\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa9\xe2\x80\x8d\xf0\x9f\x91\xa7\xe2\x80\x8d\xf0\x9f\x91\xa6".chars'
7
Explicit timeframe
Then Emoji 2.0 explicitly defined this 👨 ZWJ 👩 ZWJ 👧 ZWJ 👦 sequence of Unicode codepoints to render as the combined 👨‍👩‍👧‍👦 grapheme cluster.
Implicit timeframe
Then the reality of mobile devices set in:
Originally Unicode documented all the ZWJ sequences, so tools dealing with grapheme clusters could correctly report them as a single character. But iOS/Android kept adding new ones, so Unicode gave up and made a generic rule that any EMOJI+ZWJ+EMOJI sequence is a grapheme cluster
In the current implicit timeframe, this means that “fake emojis” such as 👨 ZWJ 👨 ZWJ 👨 ZWJ 👨 ZWJ 👨 ZWJ 👨 (…) will appear to have a “length” of 1 no matter how long the sequence is. At least this should be a stable definition for the future.
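The effect of the generic rule can be illustrated with a deliberately simplified Python 3 sketch. It is not the full Unicode text segmentation algorithm: it treats every ZWJ as gluing its two neighbours together, whatever they are, and ignores combining marks, variation selectors, skin-tone modifiers, flags, and everything else the real rules cover. The function name count_zwj_joined_units is hypothetical.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER


def count_zwj_joined_units(text):
    """Count units where ZWJ glues neighbouring codepoints together.

    Simplification (an assumption of this sketch): any codepoint next to a
    ZWJ is joined, without checking that it is actually an emoji.
    """
    count = 0
    for i, ch in enumerate(text):
        # A new unit starts at the first codepoint, or at any codepoint that
        # is not a ZWJ and does not directly follow a ZWJ.
        if i == 0 or (ch != ZWJ and text[i - 1] != ZWJ):
            count += 1
    return count


if __name__ == "__main__":
    family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])
    fake = ZWJ.join(["\U0001F468"] * 6)  # six MAN emoji chained with ZWJ
    print(count_zwj_joined_units(family))
    print(count_zwj_joined_units(fake))
```

Under this simplified rule, both the real family sequence and the arbitrarily long chain collapse to a single unit, while len still reports every codepoint.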
Conclusion
If Python 3 had used grapheme clusters for len when it was released in December 2008, it would have (A) got this wrong for most of its life anyway and (B) caused ongoing breakage, as the output of len would not be stable across versions and OS environments.
Appendix
Of course you can get any of the 25, 11, 7, and 1 numbers output in most languages if you explicitly ask for what you want:
Language | 25 (UTF-8 bytes) | 11 (UTF-16 code units) | 7 (codepoints) | 1 (grapheme clusters) |
---|---|---|---|---|
PHP 7.4 | strlen("👨‍👩‍👧‍👦") | strlen(iconv("UTF-8", "UTF-16LE", "👨‍👩‍👧‍👦")) / 2 | mb_strlen("👨‍👩‍👧‍👦") | grapheme_strlen("👨‍👩‍👧‍👦") (with php-intl) |
Ruby 3 | "👨‍👩‍👧‍👦".bytes.length | | "👨‍👩‍👧‍👦".length or "👨‍👩‍👧‍👦".codepoints.length | "👨‍👩‍👧‍👦".grapheme_clusters.length |
Raku | | | "👨‍👩‍👧‍👦".codes | "👨‍👩‍👧‍👦".chars |
Go | len("👨‍👩‍👧‍👦") | | for _, _ = range "👨‍👩‍👧‍👦" { counter += 1 } | |
Shell | echo -n "👨‍👩‍👧‍👦" \| wc -c | $(( $(echo -n '👨‍👩‍👧‍👦' \| iconv -f UTF-8 -t UTF-16LE \| wc -c) / 2)) | $(( $(echo -n '👨‍👩‍👧‍👦' \| iconv -f UTF-8 -t UTF-32LE \| wc -c) / 4)) | |
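
Python 3 is not in the table, but the first three numbers can be obtained with the standard library alone, as the sketch below shows. The function names are illustrative. The grapheme-cluster count of 1 is left out because the standard library has no grapheme segmentation; that would need a third-party package.

```python
# The family emoji spelled out as escapes so the source stays ASCII-only.
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"


def utf8_bytes(text):
    """Number of bytes in the UTF-8 encoding."""
    return len(text.encode("utf-8"))


def utf16_code_units(text):
    """Number of 16-bit code units in UTF-16 (LE variant, so no BOM is added)."""
    return len(text.encode("utf-16-le")) // 2


def codepoints(text):
    """Number of Unicode codepoints, which is what len gives in Python 3."""
    return len(text)


if __name__ == "__main__":
    print(utf8_bytes(FAMILY), utf16_code_units(FAMILY), codepoints(FAMILY))
```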