Arc Forum | kmag's comments

I should also point out that dealing with non-normalized Unicode has caused many security bugs in the past. For instance, sometimes input validation routines have bugs where they only test for the normalized forms of unsafe inputs.

Imagine a website that checks that user-submitted JavaScript fragments contain only a "safe" subset of JavaScript, disallowing eval(). Now imagine that the user input contains eval with the "e" represented as a Latin wide (fullwidth) character, and that either some browser understands Latin wide versions of JavaScript keywords or the code that renders web pages to HTML transforms Latin wide characters into standard Latin characters. The validator never sees the string "eval", yet eval() still runs.
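A minimal sketch of that failure mode in Python, assuming a toy validator that just searches for the substring "eval" (the function names here are hypothetical, and a real sandbox would need far more than a substring check). NFKC is the standard normalization form that folds fullwidth Latin letters into ordinary ones:

```python
import unicodedata

def naive_is_safe(fragment: str) -> bool:
    # Hypothetical validator: rejects only the exact ASCII spelling.
    return "eval" not in fragment

def normalized_is_safe(fragment: str) -> bool:
    # Fold compatibility characters (including fullwidth Latin) before checking.
    return "eval" not in unicodedata.normalize("NFKC", fragment)

# U+FF45 is FULLWIDTH LATIN SMALL LETTER E.
payload = "\uff45val('alert(1)')"

print(naive_is_safe(payload))       # the naive check lets the payload through
print(normalized_is_safe(payload))  # the normalizing check rejects it
```

The point is only that validation has to look at the same normalized form that the downstream consumer will end up seeing.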

-----


The simple answer is "Yes, UTF-8 is backward compatible with ASCII, and the designers of UTF-8 were very clever to make the common use cases efficient and robust in the face of minor data corruption" ... but ...

For a start, the substring and string length code would need to be rewritten. If a lot of substring operations were to be performed, you'd probably want to lazily construct skiplists or some other data structure to memoize the character/codepoint indexing operations.
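As a rough illustration of the memoized indexing idea, here is a Python sketch. It uses a flat, lazily built table of codepoint start offsets rather than a skiplist (simpler, but it costs one integer per codepoint), and it assumes the bytes are already valid UTF-8. The class name Utf8String is made up for the example.

```python
class Utf8String:
    """Byte-backed string with a lazily built codepoint offset table."""

    def __init__(self, data: bytes):
        self.data = data          # assumed to be valid UTF-8
        self._offsets = None      # built on first length/substring call

    def _index(self):
        if self._offsets is None:
            # A byte starts a codepoint unless it is a continuation
            # byte of the form 0b10xxxxxx.
            offsets = [i for i, b in enumerate(self.data)
                       if (b & 0xC0) != 0x80]
            offsets.append(len(self.data))  # sentinel: end of the last codepoint
            self._offsets = offsets
        return self._offsets

    def __len__(self) -> int:
        # Length in codepoints, not bytes.
        return len(self._index()) - 1

    def substring(self, start: int, end: int) -> "Utf8String":
        # start and end are codepoint indices, end exclusive.
        offsets = self._index()
        if not 0 <= start <= end <= len(self):
            raise IndexError("codepoint index out of range")
        return Utf8String(self.data[offsets[start]:offsets[end]])


s = Utf8String("na\u00efve caf\u00e9".encode("utf-8"))
print(len(s.data), len(s))                    # byte length vs codepoint length
print(s.substring(6, 10).data.decode("utf-8"))
```

Strings that never get indexed never pay for the table, which is the reason for building it lazily.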

Unicode has a notion of "codepoints" that correspond roughly with what you probably think of as characters. However, what you think of as a single character may sometimes be a single codepoint and may sometimes be multiple codepoints. The simplest way of treating Unicode will force all of the complexity of Unicode onto all of the users of the language, regardless of which encoding is used.

You'd probably want to declare a standard Unicode normalization that gets performed on all byte streams when converting them into strings. (You may want an optional parameter that allows programmers to override the default normalization, but that adds a lot of headaches.) What you or I probably think of as a single character can potentially be represented in more than one way in Unicode. Presumably, string comparison would work codepoint by codepoint for efficiency reasons. In order for string comparison to not be horribly confusing, strings should probably be normalized internally so that there's only one internal representation for each character.

For instance, there's a Unicode codepoint for the letter u, there's a codepoint for u with an umlaut, and there's a combining codepoint that adds an umlaut to the preceding character. In this way, a u with an umlaut can be represented with one or two codepoints. Similarly, each Korean syllable block can be represented as a single codepoint or as a sequence of two or three codepoints (jamo), one for each letter in the block (including a codepoint for a silent consonant). As I remember, Han characters (Chinese logograms and their equivalents in written Japanese and Korean) can also be described with codepoints for simple graphical elements and codepoints for composing the elements into characters, though I don't believe normalization treats such a description as equivalent to the single codepoint.

There are four standard Unicode normalization forms (NFC, NFD, NFKC, and NFKD), and they roughly boil down to either always representing each character with as few codepoints as possible, or else always decomposing each character into codepoints for its most basic elements.
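To make the u-with-umlaut and Korean cases concrete, here is a small Python sketch using the standard unicodedata module. NFC is the composing form and NFD is the decomposing form; the variable names are just for illustration.

```python
import unicodedata

composed = "\u00fc"      # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # 'u' followed by COMBINING DIAERESIS

# Same character to a human reader, different codepoint sequences.
print(composed == decomposed)

# Normalizing both sides to one form makes the comparison behave.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed

# A precomposed Hangul syllable decomposes into its jamo under NFD.
hangul = "\ud55c"        # the syllable block U+D55C
jamo = unicodedata.normalize("NFD", hangul)
print(len(hangul), len(jamo))
```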

Perhaps UTF-8 strings could be represented internally as a byte array and an enum indicating the normalization used (plus an array length and a codepoint count). This would allow a fast code path for indexing, substring operations, pattern matching, and comparison of pure ASCII strings, as well as a fast code path for pattern matching and comparison of two strings that used the default normalization. Comparison between two strings that used non-default normalization (even if the two strings use the same normalization) would need to involve a renormalization step in order to prevent the case where a > b, b > c and c > a. (Different normalizations could potentially result in different orderings, so for a globally consistent ordering, the same normalization must be used in all comparisons.)
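A sketch of what that representation might look like, in Python for brevity. Everything here (NStr, Norm, DEFAULT_NORM, compare) is a hypothetical name, NFC is picked as the default purely for illustration, and the array length is simply len(data). Comparison always goes through the default normalization, except on the fast paths where the bytes are already known to be in that form.

```python
import unicodedata
from dataclasses import dataclass
from enum import Enum

class Norm(Enum):
    NFC = "NFC"
    NFD = "NFD"
    NFKC = "NFKC"
    NFKD = "NFKD"

DEFAULT_NORM = Norm.NFC  # assumption: the language picks one default

@dataclass(frozen=True)
class NStr:
    data: bytes        # UTF-8 bytes, already normalized to `norm`
    norm: Norm         # which normalization `data` is in
    codepoints: int    # cached codepoint count

    @classmethod
    def from_text(cls, text: str, norm: Norm = DEFAULT_NORM) -> "NStr":
        normalized = unicodedata.normalize(norm.value, text)
        return cls(normalized.encode("utf-8"), norm, len(normalized))

    @property
    def is_ascii(self) -> bool:
        # One byte per codepoint means every codepoint is ASCII.
        return len(self.data) == self.codepoints

    def default_bytes(self) -> bytes:
        # Fast path: ASCII is unchanged by all four normalization forms.
        if self.norm is DEFAULT_NORM or self.is_ascii:
            return self.data
        text = self.data.decode("utf-8")
        return unicodedata.normalize(DEFAULT_NORM.value, text).encode("utf-8")

def compare(a: NStr, b: NStr) -> int:
    # UTF-8 byte order matches codepoint order, so comparing bytes is enough.
    x, y = a.default_bytes(), b.default_bytes()
    return (x > y) - (x < y)

a = NStr.from_text("f\u00fcr", Norm.NFD)
b = NStr.from_text("f\u00fcr", Norm.NFC)
print(a.data == b.data, compare(a, b))
```

Because every comparison is done in the one default form, the ordering stays consistent no matter which normalization each operand was stored in.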

It may be wise for a string to internally keep track of normalization inconsistencies in the original input byte array, so that arrays of bytes can be converted into strings that can then be converted back to the original byte arrays. One would hope that most strings being imported into the system would be internally consistent in their normalization so that this "inconsistencies annotations" structure would be nil in the vast majority of cases.
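A deliberately crude Python sketch of the round-trip idea: instead of per-span annotations it just keeps the whole original byte array when normalization changed anything, and None otherwise. A real implementation would presumably record only the spans that differ. The function names are hypothetical, and the input is assumed to be valid UTF-8.

```python
import unicodedata

def import_bytes(raw: bytes, form: str = "NFC"):
    """Return (normalized text, annotations); annotations is None if clean."""
    text = raw.decode("utf-8")
    normalized = unicodedata.normalize(form, text)
    # Coarse stand-in for an "inconsistencies annotations" structure.
    annotations = None if normalized == text else raw
    return normalized, annotations

def export_bytes(normalized: str, annotations) -> bytes:
    """Recover the exact bytes that were originally imported."""
    if annotations is None:
        return normalized.encode("utf-8")
    return annotations

raw = "u\u0308ber".encode("utf-8")   # decomposed u-umlaut in the input
text, notes = import_bytes(raw)
print(export_bytes(text, notes) == raw)
```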

Pattern matching, such as regexes, should also involve normalization so as to not trap the unwary programmer.
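Here is the kind of trap I mean, sketched in Python. The helper search_normalized is hypothetical, and normalizing the pattern text like this is only safe for simple literal patterns; a pattern that applies a quantifier or character class to a combining mark could change meaning when composed.

```python
import re
import unicodedata

def search_normalized(pattern: str, text: str, form: str = "NFC"):
    # Normalize both sides to the same form before matching.
    # Assumption: `pattern` is a plain literal, not arbitrary regex syntax.
    return re.search(unicodedata.normalize(form, pattern),
                     unicodedata.normalize(form, text))

pattern = "f\u00fcr"        # precomposed u-umlaut
text = "gut fu\u0308r dich"  # same word, decomposed in the input

print(re.search(pattern, text))             # no match: codepoints differ
print(search_normalized(pattern, text))     # match once both are normalized
```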

There are also corner cases for things like the Latin wide (fullwidth) characters. Basically, CJK (Chinese, Japanese, and Korean) characters are about as wide as they are tall, but Latin characters are about half as wide as they are tall. When mixed with CJK characters, Latin characters look much better when they're stretched horizontally. Unicode has a set of codepoints for normal Latin characters and a different set of codepoints for the wide versions of the Latin characters. Should the default Unicode normalization turn the wide Latin characters into normal Latin characters?
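The standard normalization forms actually split on this question, which a short Python check shows. The canonical forms (NFC, NFD) leave wide Latin characters alone, while the compatibility forms (NFKC, NFKD) fold them into ordinary Latin, at the cost of no longer being able to get the original back.

```python
import unicodedata

wide = "\uff21\uff22\uff23"   # fullwidth A, B, C

canonical = unicodedata.normalize("NFC", wide)
compat = unicodedata.normalize("NFKC", wide)

print(canonical == wide)   # canonical normalization keeps the wide forms
print(compat)              # compatibility normalization yields plain "ABC"
```

So picking NFKC as the default answers "yes" to the question above, and picking NFC answers "no".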

In summary, UTF-8 is absolutely brilliant in the way it treats the subtleties of being backward-compatible with ASCII. However, there's no simple way to deal with Unicode in a way that both preserves certain identities and also isn't a hazard for unwary programmers.

-----

1 point by papersmith 6184 days ago

Wow, thanks for the thorough reply. That explains it a lot.

-----