What Is the “Length” of a String?

mjevans · on Oct 3, 2019

IIRC from some other article; most JavaScript engines use a UTF-16 Unicode encoding for characters, and it looks like the .length attribute is reporting the memory footprint in terms of 16 bit codepoints. A single grapheme (display 'character', pictograph, or fixed sized space) may be composed of one or more codepoints.

'length' is subjective. It makes sense to want to know how many storage units are required.

Knowing the number of printed 'graphemes' (do we count non-printing 'characters'?) might also be useful.

The display length (at a given scaling size, taking in to consideration font kerning/etc) can also be useful.

That's where the linked article really goes off the rails. Characters don't have printing widths, not outside of extremely specific circumstances. You have to ask the layout engine in use what dimensions are occupied after it solves the very complex problem.

udp · on Oct 3, 2019

It really doesn't matter. A string is a binary blob until you need to parse it, or display it. If you need to parse a string and it's UTF-8, you can pretend it's an ASCII string because the control characters you need for parsing are probably ASCII characters such as {, [, or ". UTF-8 continuation characters set a high bit to ensure that UTF-8 is a superset of ASCII, so you can use all of the standard C library functions. If you need to display a string, you're already using a font library which, in addition to providing the logic to iterate through the string character by character, can work out kerning etc.

I never understood the rationale for using encodings such as UTF-16. They seem to be the worst of both worlds: strings for which ASCII would be adequate take 2x the space, and the encoding is still multi-byte. I once worked with a Windows developer who swore blind that UTF-16 was not a multi-byte encoding. When I provided evidence to the contrary, they responded something along the lines of "ok, but who would ever need more than 16 bits worth of characters?". ¯\_(ツ)_/¯

dataflow · on Oct 3, 2019

The(/a) rationale for UTF-16 is that it's more space-efficient for languages that use high code points, like East Asian languages.

jandrese · on Oct 3, 2019

IIRC the real rationale is back in the early days people bet that 16 bits would be enough for a fixed length encoding, but the bet didn't pay off and now they're stuck with the worst of both worlds.

lokedhs · on Oct 4, 2019

Your explanation is correct. UTF-16 is a hack on top of a hack, and no one uses it in new software. However, Java and Javascript are stuck with it for legacy reasons.

It wouldn't be so bad if people just understood that there are (almost) no valid uses-cases for measuring the size of a string. As someone else mentioned, it's a binary blob for all intents and purposes, and if processing of its content needs to be done (such as displaying it on the screen, or performing, say, word-wrap) then it should be handed over to libraries that have been designed for this purpose, because these things are very complicated.

Simply by having a single non-ASCII character in my name, I'm seeing software fail on this on a regular basis.

dataflow · on Oct 3, 2019

Depends what you mean by "rationale" I guess. I was trying to explain why one might still want to use it, not why one might be stuck using it.

nwallin · on Oct 4, 2019

East Asian HTML and XML documents are typically smaller in utf8 than utf16 because the of the markup. (on mobile so u can't prove it right now. maybe when I get home tonight) utf16 is a bad default for almost everybody.

The actual reason utf16 is used is because it was easier to port to utf16 from ucs16. Much the way it's often acceptable to assume utf8 is ASCII it's often acceptable to assume utf16 is utf8. Even today, I imagine there are a lot of applications which assume utf16 is actually ucs16.

edflsafoiewq · on Oct 4, 2019

That depends on what kind documents of course. It isn't usually true for text-centric content.

Here's Dazai Osamu's Hashire Melos: https://www.aozora.gr.jp/cards/000035/files/1567_14913.html. Despite having an unusually large amount of markup (most Japanese text wouldn't add the ruby characters), it's smaller as UTF-16 than UTF-8.

udp · on Oct 3, 2019

Sure, but the majority of the text computers have to parse has a syntax consisting of low codepoint ASCII characters (think XML tags vs. data). The fragments of text using a high codepoint language are usually only interesting at render time.

dataflow · on Oct 3, 2019

I'm not supporting the position; I'm just explaining what the argument for it is, since you said you never understood the rationale behind it and felt it gave you the worst of both worlds.

But I will say that the concern you have does not take into account the concerns other people have who make this argument.

SAI_Peregrinus · on Oct 4, 2019

The other aspect of importance is that languages with Chinese characters have higher information density per character, so they tend to even out in terms of actual storage requirements.

furyofantares · on Oct 3, 2019

It took me a while to realize that by multi-byte you must mean variable-width. Of course UTF-16 is multi-byte, how could it not be? :)

fireattack · on Oct 3, 2019

I prefer this article: https://hsivonen.fi/string-length/ HN discussion: https://news.ycombinator.com/item?id=20914184)

maxdamantus · on Oct 4, 2019

> Instead of rendering all the strings in each column, we can split the strings into their corresponding graphemes and render them individually. This allows us to cache the pixel length of each grapheme we encounter.

I wonder if they're aware of the `measureText` method used in canvas: https://developer.mozilla.org/en-US/docs/Web/API/CanvasRende...

Seems a lot simpler than trying to split strings up by graphemes, and probably more reliable. Pairs of letters such as "fi" are rendered as ligatures (single glyphs) in some fonts, and Unicode no longer standardises them (since the combinations are completely arbitrary). Also, grapheme clustering is going to change according to Unicode version, and not everyone's browser is going to be using the same Unicode version.

If you want to know the "string length", you basically either want the number of code units to store it (in languages that handle strings sensibly, this is just the number of bytes, since UTF-8 should always be used, except for historical reasons), or you want to pass it to a font rendering library to tell you something about pixels.

kevincox · on Oct 4, 2019

Of course measuring each character also doesn't work because of kerning, ligatures and probably other rendering features.

wintorez · on Oct 4, 2019

There are two data types that the more you think about them, the weirder they become: String & Date