Matt Truty

September 20, 2023

Beyond Character Chaos: Converting Latin-1 to UTF-8

In 2018, I got back into PHP development, and one of my first projects meant navigating the maze of UTF-8 encoding. The particular challenge was Latin-1 encoded data scattered throughout predominantly UTF-8 encoded MySQL tables.

The Problem: Latin-1 encoded data displayed as UTF-8 (i.e., garbled-looking text)

Example:
Bad data:
"Life is not about waiting for the storm to pass, but learning to dance in the rain. La vie est bellÃ©!"
Converted data:
"Life is not about waiting for the storm to pass, but learning to dance in the rain. La vie est bellé!"

The Objective: Convert billions of rows of Latin-1 encoded data to UTF-8 with minimal data loss.

The Solution: A WordPress Codex technique proposed a two-step conversion using a binary intermediary.

The Result: Mission accomplished!

Though this method was effective, it demanded precision to prevent data loss, as emphasized in the referenced article. Having tackled this problem, I've gained deeper insights into the workings of UTF-8 encoding. Here's what I learned.
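
To make the two-step idea concrete, here is a minimal Python sketch. It is not the WordPress Codex procedure itself, which operates on MySQL columns; it only illustrates the same principle on a single string. It assumes the text was garbled exactly once by reading UTF-8 bytes as Latin-1, and `repair_mojibake` is a hypothetical helper name. The string is first turned back into raw bytes (the binary intermediary) and then decoded as UTF-8:

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes read as Latin-1' garbling.

    Hypothetical helper for illustration; returns the input unchanged
    when it does not look like garbled UTF-8.
    """
    try:
        raw = text.encode("latin-1")   # step 1: back to the original bytes
        return raw.decode("utf-8")     # step 2: read those bytes as UTF-8
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes
        # are not valid UTF-8: leave it alone rather than lose data.
        return text


# "\u00c3\u00a9" is the garbled pair that a mis-decoded "é" turns into.
print(repair_mojibake("La vie est bell\u00c3\u00a9!"))
```

Returning the input untouched on failure is a deliberate choice in this sketch: text that was already correct (a real Latin-1 "é", for example) does not form valid UTF-8 and is left alone. MySQL's latin1 character set is closer to Windows-1252 than to strict ISO-8859-1, so real data may need "cp1252" in place of "latin-1"; treat that as something to verify against your own tables.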

The Connection Between Single-byte (ASCII) and Multi-byte UTF-8 Characters

  • A multi-byte UTF-8 character, like "â", is stored as a sequence of bytes, and each of those bytes can be misread as a single-byte character of its own. For example, the two bytes of the UTF-8 "â" show up as "Ã¢" when interpreted as Latin-1.
  • UTF-8 is flexible in its length. Characters can be anywhere from 1 to 4 bytes long. To give some specifics: 1 byte for code points 0-127 (ASCII range), 2 bytes for 128-2,047, 3 bytes for 2,048-65,535, and 4 bytes for 65,536-1,114,111. Only a fraction of those code points are assigned so far (roughly 150,000 characters as of Unicode 15), so there's space for more in the future. For this article, I'm sticking with the 4-byte maximum assumption for UTF-8 characters.
  • While an 8-bit byte can represent numbers up to 255, ASCII only uses 0-127. Importantly, characters 0-127 are encoded identically in ASCII and UTF-8. This becomes crucial when dealing with multi-byte UTF-8 characters.
  • Some byte-value ranges in UTF-8 have specific functions. For instance, values 0-127 are ASCII, 192-247 act like "Shift" keys that start a multi-byte character, and 128-191 are continuation bytes that only appear after one of those "Shift" bytes. A detailed article by Smashing Magazine outlines these ranges.
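
The ranges in the list above can be checked with a short Python sketch. `utf8_length` is a hypothetical helper written for this illustration; it encodes the range table directly and is compared against Python's built-in encoder for a few sample characters:

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for a code point (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point <= 127:
        return 1
    if code_point <= 2047:
        return 2
    if code_point <= 65535:
        return 3
    return 4


# Sample characters: "A", "â", the euro sign, and an emoji.
for ch in ["A", "\u00e2", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    # The built-in encoder and the range table should agree.
    assert len(encoded) == utf8_length(ord(ch))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# The two bytes of "â", read one byte at a time as Latin-1 characters.
single_byte_view = "\u00e2".encode("utf-8").decode("latin-1")
print(single_byte_view)
```

The last two lines reproduce the first bullet: the same two bytes are one character under UTF-8 and two characters under a single-byte encoding.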


The Significance of the First Byte

In UTF-8, the initial byte of a character holds valuable info. It can tell if a character belongs to the ASCII set and also show how many bytes make up that character.

If the first of the eight bits is 0, the byte is an ASCII character. If it's 1, the byte belongs to a multi-byte character, and the number of consecutive 1s at the start of the first byte tells us how many bytes are involved.

To illustrate:

  • 110xxxxx indicates a character that uses two bytes.
  • 1110xxxx shows it's a three-byte character.

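The table below demonstrates this for 1 to 4-byte characters:

Bytes   First byte   Following bytes
1       0xxxxxxx     (none)
2       110xxxxx     10xxxxxx
3       1110xxxx     10xxxxxx 10xxxxxx
4       11110xxx     10xxxxxx 10xxxxxx 10xxxxxx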
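
As a rough sketch, the first-byte rule can also be written in Python. `sequence_length` is a hypothetical helper that only inspects the leading bit pattern; it does not reject lead bytes that match a pattern but are never valid in real UTF-8 (such as 192 and 193):

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character uses, judged by its first byte.

    Hypothetical helper: checks the leading bit pattern only.
    """
    if first_byte >> 7 == 0b0:        # 0xxxxxxx: ASCII, one byte
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx: two bytes
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx: three bytes
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx: four bytes
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx is not used by UTF-8.
    raise ValueError(f"{first_byte:08b} cannot start a UTF-8 character")


for ch in ["A", "\u00e2", "\u20ac", "\U0001F600"]:
    first = ch.encode("utf-8")[0]
    print(f"{first:08b} -> {sequence_length(first)} byte(s)")
```

Shifting the byte right and comparing against the prefix is one of several equivalent ways to test the leading bits; masking with `&` would work just as well.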


From Multi-byte to Unicode Numbers

If we want to get the Unicode number of a character from its UTF-8 byte sequence, we can use the following formulas:

For 1-byte characters:
U = C1

For 2-byte characters:
U = (C1 – 192) * 64 + C2 – 128

For 3-byte characters:
U = (C1 – 224) * 4,096 + (C2 – 128) * 64 + C3 – 128

For 4-byte characters:
U = (C1 – 240) * 262,144 + (C2 – 128) * 4,096 + (C3 – 128) * 64 + C4 – 128
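
The four formulas above translate almost line for line into Python. This is a minimal sketch that assumes the input is one well-formed UTF-8 character; `code_point_from_utf8` is a hypothetical name, and no validation of continuation bytes is attempted:

```python
def code_point_from_utf8(seq: bytes) -> int:
    """Apply the 1- to 4-byte formulas to a single UTF-8 character.

    Hypothetical helper; assumes `seq` is exactly one well-formed character.
    """
    if len(seq) == 1:
        (c1,) = seq
        return c1
    if len(seq) == 2:
        c1, c2 = seq
        return (c1 - 192) * 64 + c2 - 128
    if len(seq) == 3:
        c1, c2, c3 = seq
        return (c1 - 224) * 4096 + (c2 - 128) * 64 + c3 - 128
    if len(seq) == 4:
        c1, c2, c3, c4 = seq
        return ((c1 - 240) * 262144 + (c2 - 128) * 4096
                + (c3 - 128) * 64 + c4 - 128)
    raise ValueError("a UTF-8 character is 1 to 4 bytes long")


for ch in ["A", "\u00e2", "\u20ac", "\U0001F600"]:
    # ord() gives Python's own answer, so the two columns should match.
    print(ord(ch), code_point_from_utf8(ch.encode("utf-8")))
```

The multipliers 64, 4,096, and 262,144 are powers of 64 because every continuation byte contributes six bits of the final number.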

Dissecting a Character: The Case of â

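Using the formulas, let's decipher the character "â", which has a Unicode number of 226. In UTF-8, it's stored as two bytes that look like "Ã¢" when read as single-byte characters. When broken down into bits, it looks like this:

First byte: 11000011 (which is 195 in decimal)

  • This isn't ASCII (since the first bit is 1).
  • It's a two-byte character (from the initial two 1s).

Second byte: 10100010 (which is 162 in decimal)

  • This is a continuation byte (it starts with 10).

Plugging both bytes into our formula for 2-byte characters gives U = (195 – 192) * 64 + 162 – 128 = 226, which confirms the Unicode number is indeed 226, or "â".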
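
The same dissection can be done at the bit level. In this sketch (the function name `dissect_two_byte` is made up for the example), the marker bits are stripped from each byte and the remaining payload bits are joined to form the Unicode number:

```python
def dissect_two_byte(ch: str):
    """Split a two-byte UTF-8 character into bits, payload, and code point.

    Hypothetical helper; only meaningful for characters that encode to
    exactly two bytes.
    """
    encoded = ch.encode("utf-8")
    if len(encoded) != 2:
        raise ValueError("expected a two-byte UTF-8 character")
    bits = [format(b, "08b") for b in encoded]
    # Drop the "110" marker from the first byte and "10" from the second.
    payload = bits[0][3:] + bits[1][2:]
    return bits, payload, int(payload, 2)


bits, payload, value = dissect_two_byte("\u00e2")  # "\u00e2" is "â"
print(bits)     # the two bytes as bit strings
print(payload)  # the 11 payload bits
print(value)    # the Unicode number
```

Joining the payload bits is equivalent to the arithmetic formula: multiplying by 64 shifts the first byte's five payload bits left by six positions to make room for the second byte's six.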

Conclusion: This conversion can be done, but be careful out there, folks.

Go Deeper:


About Matt Truty

I'm a hands-on tech leader who really enjoys building with software. I have experience in growing engineering teams, developing leaders, creating scalable software, and designing workflows that engineers enjoy and that deliver real value.

Let's connect! https://www.linkedin.com/in/mtruty/