
Java still uses UTF-16; it has been the internal format since the language was created. I don't know how much of a problem this is, but it shows that UTF-16 is still important.


I think it's a huge problem for Java. Try doing proper string collation (standard library or ICU4J) or regular expression matching in a context where your strings are all UTF-8 and your output should also be UTF-8. Operations that shouldn't require allocation do, because you have to transcode to UTF-16. Not to mention that in some cases the transcoding is the most expensive part of the operation.
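To make the round trip concrete, here is a minimal sketch of what a UTF-8-in, UTF-8-out regex operation tends to look like with the standard library. The class name `Utf8RegexRoundTrip` and the whitespace-collapsing task are made up for illustration; only `String`, `Pattern`, and `StandardCharsets` are real JDK APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class Utf8RegexRoundTrip {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Collapses each run of whitespace to a single space.
    // Input and output are both UTF-8 byte arrays.
    static byte[] collapseWhitespace(byte[] utf8Input) {
        // Allocation 1: decode the bytes into a String, because
        // java.util.regex only accepts a CharSequence of 16-bit chars.
        String decoded = new String(utf8Input, StandardCharsets.UTF_8);

        // Allocation 2: the regex engine builds a new String for the result.
        String collapsed = WHITESPACE.matcher(decoded).replaceAll(" ");

        // Allocation 3: encode the result back to UTF-8 for output.
        return collapsed.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] in = "na\u00EFve   caf\u00E9".getBytes(StandardCharsets.UTF_8);
        byte[] out = collapseWhitespace(in);
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```

Only the middle step is the work you actually asked for; the decode and encode steps exist purely to get in and out of the String world.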

All the core Java APIs are built around String or CharSequence (more the latter in releases after Java 8). CharSequence is a terrible interface for supporting UTF-8 or any encoding besides Latin-1 or UTF-16. If Java's interfaces had been designed around Unicode code point iteration rather than random access to chars, the coupling to UTF-16 wouldn't have been so tight. But as things stand, you aren't doing anything interesting to text in Java without either (1) re-implementing everything from scratch, from integer parsing to regexp, or (2) paying the transcode cost on everything your program consumes and emits.
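As a rough illustration of the char-versus-code-point gap, the sketch below compares what `CharSequence` random access sees with what code point iteration sees. The class and method names are hypothetical; `CharSequence.length()`, `charAt()`, and `codePoints()` (a default method since Java 8) are the real APIs.

```java
// Hypothetical demo class, for illustration only.
final class CodePointVsChar {
    // Counts UTF-16 code units, which is what length() and charAt() index.
    static int countChars(CharSequence text) {
        return text.length();
    }

    // Counts Unicode code points by iterating rather than indexing.
    static int countCodePoints(CharSequence text) {
        return (int) text.codePoints().count();
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which needs a surrogate pair in UTF-16.
        String s = "a\uD83D\uDE00";

        System.out.println(countChars(s));       // 3
        System.out.println(countCodePoints(s));  // 2

        // Random access can land in the middle of a character:
        // index 1 is only the high half of the surrogate pair.
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```

An interface built on the iteration style could be backed by UTF-8 bytes without much trouble, whereas `charAt(int)` forces an implementation to behave as if it were an array of UTF-16 code units.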


It's a huge problem. UTF-16 is a big big pain.

JavaScript (ECMAScript) has this problem too.



