
This article is part of a series about implementing Unicode support for std::string:

  1. Unicode for std::string
  2. Fast Lookup of Unicode Properties

While researching and brainstorming for a completely unrelated project idea, I decided to attempt to write a Unicode library for std::string, the canonical string type in C++. This post explains why I made that decision and develops a first draft of how such a library could be designed.

The Sad Reality

Text processing in 2019 is easy, isn’t it? Almost everybody uses UTF-8 or UTF-16 anyway. The bad old days of 8-bit codepages are mostly behind us. So, no problem, right?

Unfortunately, for a C++ programmer the situation is a bit grimmer. What we have is std::string. It is a sequence of bytes without any notion of encoding, deeply entrenched in the obsolete assumption that a byte and a character are the same thing. None of its member functions is Unicode-aware in any way. The same is true for the standard algorithms that work with std::string: they too see it as a simple sequence of bytes.
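A small demonstration of the problem (my own snippet, not from any library):

```cpp
#include <iostream>
#include <string>

int main() {
    // "naïve" spelled out as UTF-8 bytes; the ï is the two-byte
    // sequence 0xC3 0xAF
    std::string s = "na\xc3\xafve";

    std::cout << s.size() << '\n';        // 6 -- size() counts bytes, not characters
    std::cout << s.substr(0, 3) << '\n';  // cuts the ï in half: the result
                                          // is no longer valid UTF-8
}
```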

The other string variants – std::wstring, std::u16string, std::u32string and, in C++20, std::u8string – don’t improve the picture a lot. For the most part they do carry a notion of the encoding their content should use, but that’s as far as it goes. There’s no validation, and neither the member functions nor the algorithms can do anything meaningful with correctly encoded data.

As a result, alternative Unicode-capable string classes exist, for example Qt’s QString or ICU’s UnicodeString. ICU in particular provides everything and the kitchen sink when it comes to Unicode and internationalization – but it’s a huge library! And how useful is a special string type when std::string is everywhere in APIs? Any third-party string type likely leads to a bunch of conversions back and forth.

An Idea for a Solution

Wouldn’t it be great if we could build Unicode support on top of std::string? The standard library barely scratches the surface with its localization functionality, and what I could find on the web either introduces a separate string type or stops at barebones UTF-8 and maybe UTF-16 support.

OK… Unicode is an interesting topic anyway. Let’s write such a library. If nothing else, it’ll have quite a bit of educational value.

From the situation outlined above, I decided on these major design goals:

  • The library is small and only depends on the standard library. I’m not trying to build another ICU.
  • It is non-intrusive. In particular, it does not require its own string type.
  • It is as string-type agnostic as possible while never compromising compatibility with std::string.
  • It provides a high-level abstraction for text processing. For the common use cases the intricacies of Unicode should be completely transparent.
  • Supported text encodings are UTF-8 and UTF-16. UTF-8 because it is and should be everywhere, and UTF-16 because it is used in the Windows API and some prominent 3rd party string types, for example Qt’s QString.
  • The library is about handling Unicode text:
    • Conversion from/to other text encoding systems is out of scope (think especially the 8-bit codepages like ISO-8859-1).
    • The broader range of internationalization and localization issues is out of scope (think date/time formats, number formats, formatting of monetary values, etc.)

It’s becoming clear that this library will be a set of algorithms on top of the string types. While searching around for what’s already available, I got inspired by utfcpp – a simple UTF-8 library. It builds on iterator pairs, much like the standard <algorithm> header. That’s a neat and idiomatic approach. I’ll do the same and add simple range-like wrappers so you can pass string objects directly instead of iterator pairs.
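To make that concrete, here is a rough sketch of the shape such an interface could take. The function name and all details are purely illustrative – nothing here is final, and counting code points this way assumes the input is already valid UTF-8:

```cpp
#include <cstddef>
#include <iterator>

namespace unicode_algorithm {

// Iterator-pair interface in the spirit of <algorithm> and of utfcpp.
// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx);
// every other byte starts a new code point.
template <typename Iterator>
std::size_t count_code_points(Iterator first, Iterator last) {
    std::size_t count = 0;
    for (; first != last; ++first) {
        if ((static_cast<unsigned char>(*first) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Range-style wrapper so a std::string (or any other container of
// bytes) can be passed directly.
template <typename Range>
std::size_t count_code_points(const Range& range) {
    using std::begin;
    using std::end;
    return count_code_points(begin(range), end(range));
}

} // namespace unicode_algorithm
```

With the wrapper in place, count_code_points(my_string) and count_code_points(first, last) behave the same; for the "na\xc3\xafve" string from above, both return 5.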

Alright. Let’s get started. First hurdle: a library needs a name. I’m good at getting sidetracked at this point, but no! Not this time! :-) I’m calling it unicode_algorithm. Not very creative, but it does the job.

For the first milestone I’m planning this functionality:

  • validation of UTF-8 data including BOM detection
  • counting characters in UTF-8
  • iteration over characters in UTF-8
  • decoding of a UTF-8 code unit sequence into a code point number (see the sketch after this list)
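For the decoding step, here is a minimal sketch of what I have in mind. It handles the UTF-8 bit layout but deliberately omits checks a production decoder needs, such as rejecting overlong encodings, surrogate code points, and values above U+10FFFF; again, the function name is illustrative only:

```cpp
#include <stdexcept>

namespace unicode_algorithm {

// Decodes one UTF-8 encoded code point starting at `first` and
// advances `first` past it.
template <typename Iterator>
char32_t decode_code_point(Iterator& first, Iterator last) {
    const auto lead = static_cast<unsigned char>(*first++);

    int trailing;          // number of continuation bytes to follow
    char32_t code_point;   // payload bits collected so far
    if (lead < 0x80)                { trailing = 0; code_point = lead; }
    else if ((lead & 0xE0) == 0xC0) { trailing = 1; code_point = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { trailing = 2; code_point = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { trailing = 3; code_point = lead & 0x07; }
    else throw std::invalid_argument("invalid UTF-8 lead byte");

    for (int i = 0; i < trailing; ++i) {
        if (first == last)
            throw std::invalid_argument("truncated UTF-8 sequence");
        const auto cont = static_cast<unsigned char>(*first++);
        if ((cont & 0xC0) != 0x80)
            throw std::invalid_argument("invalid UTF-8 continuation byte");
        // Each continuation byte contributes its low six bits.
        code_point = (code_point << 6) | (cont & 0x3F);
    }
    return code_point;
}

} // namespace unicode_algorithm
```

Fed the two bytes 0xC3 0xAF from the earlier example, this returns U+00EF, the code point for ï.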

Most of this is easy. Basically, it boils down to understanding the relatively simple UTF-8 encoding and writing a parser for it. That’s my way of getting some straightforward functionality implemented while creating the opportunity to figure out how to structure the library. Counting and iteration are slightly more complicated because recognizing characters involves detecting the boundaries between grapheme clusters. That’s my rabbit hole into the details of Unicode.
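To see why counting is the harder part, consider a user-perceived character that spans several code points (again just my own illustration, not library code):

```cpp
#include <iostream>
#include <string>

int main() {
    // "é" written as the two code points U+0065 (e) and U+0301
    // (combining acute accent) -- three bytes in UTF-8.
    std::string s = "e\xcc\x81";

    std::cout << s.size() << '\n';  // 3: the byte count
    // A code point counter (like the sketch above) reports 2, but a
    // reader perceives a single character: one grapheme cluster.
    // Character-level counting and iteration must therefore operate
    // on grapheme cluster boundaries, not on code points.
}
```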

So, let’s see how this goes. My next post will probably come when the first milestone is done.
