Grapheme Clusters

One of the things that made the Python 2 to 3 upgrade complicated was the new design for how Unicode strings are handled. It caused untold breakage in exchange for better Unicode support, and after all of that dust had settled, len("👨‍👩‍👧‍👦") is still 7 instead of 1.

Why did that happen?

A naive length call in different programming languages will tell you that this emoji has a “length” of:

25 (UTF-8 bytes)

  • Go: len("👨‍👩‍👧‍👦") == 25
  • Python 2: len("👨‍👩‍👧‍👦") == 25
  • PHP 7.4: strlen("👨‍👩‍👧‍👦") == 25

11 (UTF-16 code units)

  • JavaScript: "👨‍👩‍👧‍👦".length === 11
  • Python 2 as Unicode: len(u"👨‍👩‍👧‍👦") == 11 (on a narrow build; a wide build reports 7)

7 (Unicode codepoints)

  • Python 3: len("👨‍👩‍👧‍👦") == 7
  • Ruby 3: "👨‍👩‍👧‍👦".length == 7

1 (Grapheme clusters)

  • Raku (née Perl 6): "👨‍👩‍👧‍👦".chars == 1

At first glance it seems that only Raku does the correct thing here: its simplest length call declares “👨‍👩‍👧‍👦” to have length 1.
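As a quick cross-check, the first three numbers can be reproduced from a single Python 3 string by encoding it explicitly. This is a minimal sketch: the emoji is spelled out with escape sequences so that it survives copy and paste, and the variable names are only illustrative.

```python
# The family emoji as explicit codepoints: four people joined by
# three ZERO WIDTH JOINER (U+200D) characters.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

# Each person takes 4 bytes in UTF-8 and each joiner takes 3.
utf8_bytes = len(family.encode("utf-8"))

# Each person takes 2 code units (a surrogate pair) in UTF-16, each joiner 1.
# Every UTF-16 code unit is 2 bytes, hence the division.
utf16_units = len(family.encode("utf-16-le")) // 2

# len() on a Python 3 str counts codepoints.
codepoints = len(family)

print(utf8_bytes, utf16_units, codepoints)  # 25 11 7
```

Getting the fourth number, 1, needs a grapheme-aware library, because the Python standard library does not segment strings into grapheme clusters.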

But it’s unstable

The Python 3 design is that len counts Unicode codepoints, not grapheme clusters (roughly, “visible characters”). So len("👨‍👩‍👧‍👦") is 7 instead of 1, which might seem unusual.

One interesting problem with returning len == 1 is that the result is not stable. New versions of Unicode define new grapheme clusters out of existing codepoints. Python 2 to 3 was a one-off major transition, but every year’s new emoji definitions could cause ongoing subtle breakage. This affects every programming language and library that does not explicitly pin a version of Unicode.

From a discussion on LWN.net:

I was thinking more of cases where you mix up strings as sequences of Unicode-8.0-graphemes and as sequences of Unicode-9.0-graphemes. Like you implement a login form where the password must be at least 8 characters (using Raku’s built-in definition of ‘character’), and a user registers with an 8-character password, but then you upgrade Raku and now that user’s password is only 7 characters and the form won’t let them log in.

Different versions of software will see it as being a different number of grapheme clusters. There are several time stages to consider.
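One way to sidestep that kind of drift is to base any persisted rule on a count that cannot change for a given string, such as codepoints, and to treat grapheme counts as display-only. The sketch below is an assumption about how such a check might look, not something taken from the LWN discussion; `MIN_PASSWORD_CODEPOINTS` and `password_long_enough` are hypothetical names.

```python
import unicodedata

MIN_PASSWORD_CODEPOINTS = 8  # hypothetical policy value


def password_long_enough(password: str) -> bool:
    # len() on a Python 3 str counts codepoints, so the answer for a given
    # password does not depend on the Unicode version of the interpreter.
    return len(password) >= MIN_PASSWORD_CODEPOINTS


# Version of the Unicode character database bundled with this interpreter.
# Grapheme segmentation relies on version-specific tables of this kind,
# which is where the drift between releases comes from.
print(unicodedata.unidata_version)
print(password_long_enough("hunter2!"))  # True: 8 codepoints
```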

Old versions

Any version of these interpreters prior to Emoji 2.0 / Unicode 8 in 2015 does not have a grapheme-cluster definition for these 7 separate codepoints (and a renderer from that time would display it as 👨👩👧👦).

We can demonstrate this using Rakudo Perl 2013.12-1 from Ubuntu 14.04:

LC_ALL=C.UTF-8 perl6 -e $'print "\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa9\xe2\x80\x8d\xf0\x9f\x91\xa7\xe2\x80\x8d\xf0\x9f\x91\xa6".chars'
7
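The escaped string in that command is the UTF-8 encoding of the family emoji. As a sanity check, here is a small Python 3 sketch that decodes the same bytes and shows how many bytes each codepoint occupies; the variable names are illustrative.

```python
# UTF-8 bytes of the family emoji, with every escape written in full.
raw = (
    b"\xf0\x9f\x91\xa8\xe2\x80\x8d"  # man + zero width joiner
    b"\xf0\x9f\x91\xa9\xe2\x80\x8d"  # woman + zero width joiner
    b"\xf0\x9f\x91\xa7\xe2\x80\x8d"  # girl + zero width joiner
    b"\xf0\x9f\x91\xa6"              # boy
)

text = raw.decode("utf-8")

# Number of UTF-8 bytes used by each codepoint, in order.
widths = [len(ch.encode("utf-8")) for ch in text]

print(len(raw), len(text))  # 25 7
print(widths)               # [4, 3, 4, 3, 4, 3, 4]
```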

Explicit timeframe

Then Emoji 2.0 explicitly defined this 👨 ZWJ 👩 ZWJ 👧 ZWJ 👦 sequence of Unicode codepoints to render as the combined 👨‍👩‍👧‍👦 grapheme cluster.
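To make the sequence concrete, the following Python 3 sketch lists the seven codepoints with their official Unicode names, using the standard `unicodedata` module. The variable names are illustrative.

```python
import unicodedata

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

# Official character name of each codepoint, in order.
names = [unicodedata.name(ch) for ch in family]

for ch, name in zip(family, names):
    print(f"U+{ord(ch):04X} {name}")
# U+1F468 MAN
# U+200D ZERO WIDTH JOINER
# U+1F469 WOMAN
# U+200D ZERO WIDTH JOINER
# U+1F467 GIRL
# U+200D ZERO WIDTH JOINER
# U+1F466 BOY
```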

Implicit timeframe

Then the reality of mobile devices set in:

Originally Unicode documented all the ZWJ sequences, so tools dealing with grapheme clusters could correctly report them as a single character. But iOS/Android kept adding new ones, so Unicode gave up and made a generic rule that any EMOJI+ZWJ+EMOJI sequence is a grapheme cluster

In the current implicit timeframe, this means that “fake emojis” such as 👨 ZWJ 👨 ZWJ 👨 ZWJ 👨 ZWJ 👨 ZWJ 👨 (…) will appear to have a “length” of 1 no matter how many codepoints they contain. At least this should be a stable definition for the future.
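The effect of such a generic rule can be illustrated with a deliberately crude approximation in Python 3. This is not the real segmentation algorithm (Unicode Standard Annex #29 has many more rules, and the actual emoji rule also checks that the joined codepoints are emoji); the function `approx_cluster_count` is a hypothetical helper that only glues codepoints together around a ZWJ.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER


def approx_cluster_count(text: str) -> int:
    """Count clusters, treating anything joined by a ZWJ as one cluster.

    Simplification: every other codepoint starts a new cluster, so combining
    marks, flags, skin-tone modifiers and so on are NOT handled correctly.
    """
    count = 0
    prev = None
    for ch in text:
        # A ZWJ attaches to what came before it, and whatever follows
        # a ZWJ attaches to the cluster the ZWJ belongs to.
        joins_previous = prev is not None and (ch == ZWJ or prev == ZWJ)
        if not joins_previous:
            count += 1
        prev = ch
    return count


family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
six_men = ZWJ.join(["\U0001F468"] * 6)  # the "fake emoji" from the text

print(approx_cluster_count(family))   # 1
print(approx_cluster_count(six_men))  # 1
print(len(six_men))                   # 11 codepoints
```

Under this kind of rule the chain can be made arbitrarily long and still count as one cluster, which is exactly why the count says little about storage size or rendering width.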

Conclusion

If Python 3 had used grapheme clusters for len when it was released in December 2008, it would have (A) got this wrong for most of its life anyway and (B) caused ongoing breakage as the output of len would not be stable across versions and OS environments.

Appendix

Of course you can get any of the 25, 11, 7, and 1 numbers output in most languages if you explicitly ask for what you want:

25 (UTF-8 bytes)

  • PHP 7.4: strlen("👨‍👩‍👧‍👦")
  • Ruby 3: "👨‍👩‍👧‍👦".bytes.length
  • Go: len("👨‍👩‍👧‍👦")
  • Shell: echo -n "👨‍👩‍👧‍👦" | wc -c

11 (UTF-16 code units)

  • PHP 7.4: strlen(iconv("UTF-8", "UTF-16LE", "👨‍👩‍👧‍👦")) / 2
  • Shell: $(( $(echo -n '👨‍👩‍👧‍👦' | iconv -f UTF-8 -t UTF-16LE | wc -c) / 2))

7 (Unicode codepoints)

  • PHP 7.4: mb_strlen("👨‍👩‍👧‍👦")
  • Ruby 3: "👨‍👩‍👧‍👦".length or "👨‍👩‍👧‍👦".codepoints.length
  • Raku: "👨‍👩‍👧‍👦".codes
  • Go: for _, _ = range "👨‍👩‍👧‍👦" { counter += 1 }
  • Shell: $(( $(echo -n '👨‍👩‍👧‍👦' | iconv -f UTF-8 -t UTF-32LE | wc -c) / 4))

1 (Grapheme clusters)

  • PHP 7.4: grapheme_strlen("👨‍👩‍👧‍👦") (with php-intl)
  • Ruby 3: "👨‍👩‍👧‍👦".grapheme_clusters.length
  • Raku: "👨‍👩‍👧‍👦".chars