wchar_t: Unsafe at any size
One of today’s fads in software engineering is supporting multiple languages. It used to be that each language or script had its own code point system (or encoding), with each code point representing a different character. For reasons of convenience, the various scripts were incompatible: they could not be identified by simply looking at the code points, and an identifier describing which script the text was in was not allowed in the same zip code. Sometimes this caused problems for engineers with weak constitutions; was that a ‘c’ or a ‘¥’? Experienced programmers knew the correct answer was to cycle through all the known scripts, interpreting the text with each in turn, and ask the user to tell them when they could read it or saw the hidden picture. These were the earliest known captchas.
The Unicode Consortium was unhappy with this because they were not the cause of the mass confusion, having been late to the party. They devised a scheme in which each character had its own unique code point. They also allocated enough code points to represent all the characters of a lot of different languages. They even added a unique byte sequence at the start of any Unicode text to mark it as Unicode. And thus, all was well and good as long as you didn’t mind text that took four times the space and wasted three out of every four bytes. The Unicode Consortium at first wasn’t interested in fixing this problem, until they realized they could use it to add more “features” (read: confusion). The Consortium begat UTF-8 and UTF-16 to fill this need. UTF-8 allowed most characters to be encoded in 8 bits, with the rest spilling into multi-byte sequences, and UTF-16 allowed most characters to be encoded in 16 bits.
Originally people implemented these types in C by using unsigned char (UTF-8), unsigned short (UTF-16), or unsigned int (UTF-32). At the time Unicode was adopted, both Win32 and the Mac Toolbox used UTF-16. It was a nice tradeoff between size and efficiency. For most characters they were only wasting one byte (as opposed to three bytes in UTF-32), but could still assume most characters were just 16 bits (as opposed to UTF-8, which needed multiple bytes for anything outside ASCII). Life was good.
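For the record, here is a rough sketch of what those hand-rolled types looked like, with the string "A¥" spelled out by hand in each encoding. The type names are illustrative rather than from any particular codebase, and the byte counts assume the usual 16-bit short and 32-bit int of the platforms in question.

    #include <iostream>

    // Hand-rolled Unicode code unit types, one per encoding.
    typedef unsigned char  utf8_unit;   // UTF-8  code unit: 1 byte
    typedef unsigned short utf16_unit;  // UTF-16 code unit: 2 bytes (on these platforms)
    typedef unsigned int   utf32_unit;  // UTF-32 code unit: 4 bytes (on these platforms)

    int main() {
        // "A¥" (U+0041, U+00A5), encoded by hand in each scheme.
        const utf8_unit  utf8 [] = { 0x41, 0xC2, 0xA5 };  // 3 bytes
        const utf16_unit utf16[] = { 0x0041, 0x00A5 };    // 4 bytes
        const utf32_unit utf32[] = { 0x0041, 0x00A5 };    // 8 bytes

        std::cout << "UTF-8: "  << sizeof(utf8)  << " bytes, "
                  << "UTF-16: " << sizeof(utf16) << " bytes, "
                  << "UTF-32: " << sizeof(utf32) << " bytes\n";
        return 0;
    }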
Like most standards committees, the C/C++ standards committee was bent on death and destruction. They saw people were using this newfangled Unicode, and that it was almost sufficiently confusing. The standards committee wanted to champion this confusion while adding even more. To achieve their demented objective, they introduced wchar_t and std::wstring. But which encoding of Unicode did it use: UTF-16 or UTF-32? BUHAHAHAHA! In their greatest show of leadership to date, the standards committee refused to say. It would be a surprise, and they would hate to spoil a surprise. wchar_t was defined to be more than a byte but no larger than a jet liner.
With this new edict in hand, compiler and library writers quickly got to work. Instead of following each other’s lead, they each implemented wchar_t and its supporting libraries as they saw fit. Some saw the benefit of making wchar_t UTF-16. Others wanted it to be UTF-32. And thus, the standards committee bided their time.
Since both Windows and Mac OS (Classic) had adopted UTF-16 already, the compiler makers implemented wchar_t as UTF-16. But this was just a trap, meant to ensnare hard-working cross platform engineers. Engineers who worked on software that ran on Windows and Mac OS started using wchar_t. It was easy and worked well. A little too well.
Meanwhile, Unix vendors had decided that wasting one byte was insufficient, and that wasting three bytes per character was definitely the way to go. Besides, it’s not like anyone on Unix was using Unicode for anything other than Klingon.
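If you want to know which camp your compiler fell into, a one-line check settles it. Nothing here beyond standard C++; the expected values are just the ones described above.

    #include <iostream>

    int main() {
        // The standard says wchar_t exists; it never says how wide it is.
        // Expect 2 bytes from Windows toolchains and 4 from most Unix ones.
        std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << " bytes\n";
        return 0;
    }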
The trap was sprung in 1996 when Apple purchased NeXT and its Unix-based operating system. Like all good traps, no one realized what had happened for several more years. It wouldn’t be until 2001, when Mac OS X was released and Steve Jobs went after developers with cattle prods to get them to port to it, that the jaws began to close. Unfortunately for the standards committee, some developers continued to use the old developer tools, like CodeWarrior, and old executable formats, like CFM/PEF, that implemented wchar_t as UTF-16. But the standards committee was patient. They knew they would prevail in the end.
Apple would turn out to be the instrument of the standards committee. They continued to improve Xcode until it was good enough to actually build most of their own sample code. At the same time, Metrowerks finally won its game of Russian Roulette, and stopped development of CodeWarrior. Apple delivered the final blow when they announced they were moving to the Intel architecture and that they had the only compiler that supported it. A compiler with a secret.
There were screams of anguish when it dawned on engineers what a cruel trick Apple and the standards committee had played. Mac OS X, being a Unix variant, had implemented wchar_t as UTF-32! All the cross platform code, code that used to work on Windows and Mac, no longer worked. Apple felt their pain, and issued this technical note, which essentially says: “instead of using wchar_t, which used to be cross platform before we destroyed it, use CFStringRef, which is not cross platform, has never been, and never will be. P.S. This is really your own fault for ever using wchar_t. Suckers.”
While all this was happening, I happened to work for Macromedia (now Adobe). Macromedia being the most important company that implements Flash, some of the Apple execs came down and talked to the Mac engineers there. When the appropriate time came, I sprang into action, demanding to know what would be done about wchar_t. There was stunned silence. “What’s wchar_t?” was the first answer. After explaining it, the next answer was “We don’t implement that.” After pointing them to their own documentation, the next answer was “Oh. Huh. Well, why did you use it? We don’t use that crap. Use CFString instead!” After slamming my head against the table, I attempted to explain that wchar_t was used everywhere in our codebase and that CFString wasn’t cross platform. “Sure it is! It works on both Mac OS 9 and Mac OS X!”
The solution, in the end, for those duped into using wchar_t is to go back and use unsigned short instead. Unfortunately, that means doing a lot of find and replace (find: wchar_t, replace: char16_t, where char16_t is typedef’d to unsigned short) and then re-implementing the wchar_t library (including wstring) for the new type. Yep. Reimplement the wchar_t library. The lucky jumped into a pit of rabid ice weasels, where they were torn limb from limb. The unlucky had to repurpose all the old CodeWarrior MSL code to re-implement the wchar_t library as a char16_t library.
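Here is a minimal sketch of that find-and-replace fix, assuming a pre-C++11 toolchain where char16_t is not yet a keyword and the name is yours to define. The basic_string typedef is where the pain starts: the standard never promised a char_traits for unsigned short, and supplying it (plus everything else wstring gave you for free) is the re-implementation described above.

    #include <string>

    // Pre-C++11: char16_t is not a keyword yet, so typedef it yourself,
    // exactly as described above. (On a C++11 or later compiler this line
    // won't compile, because char16_t and std::u16string are built in by then.)
    typedef unsigned short char16_t;

    // std::basic_string needs std::char_traits<unsigned short>, which the
    // standard does not require the library to provide. Writing it, plus the
    // conversion, comparison, and I/O routines that came with wstring, is the
    // "re-implementing the wchar_t library" part.
    typedef std::basic_string<char16_t> char16_string;

    int main() {
        char16_string s;
        s.push_back(0x00A5);  // a lonely '¥'
        return s.empty() ? 1 : 0;
    }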
The moral of the story is: don’t trust the standards committee. Especially on standards that aren’t really defined, or when they start snickering behind your back. Usually that means they stuck a note on your back that says “Standardize me.” I’m not sure why that’s funny, but they think it’s hilarious. If you need to use a Unicode encoding, use UTF-8. You can just use char and std::string for that.
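That last bit of advice needs no new machinery at all. A plain std::string is perfectly happy carrying UTF-8 bytes; the escape below spells ‘¥’ (split into two literals so the hex escape doesn’t swallow the digits that follow):

    #include <iostream>
    #include <string>

    int main() {
        // "¥100" in UTF-8: 0xC2 0xA5 encodes U+00A5, the rest is plain ASCII.
        std::string price = "\xC2\xA5" "100";
        std::cout << price << " (" << price.size() << " bytes)\n";  // 5 bytes
        return 0;
    }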
Besides, who doesn’t speak English?