Jaime Cepeda Villamayor

June 23, 2022

Migrate MySQL's UTF-8 to Real UTF-8

Where I work, we normally use MySQL as our main database. In the beginning we didn't have any trouble using it, but one day, we found out the following error in our production server:

(1366, u"Incorrect string value: '\\xF0\\x9F\\x98\\x80' for column 'name' at row 1")

How was that possible? We thought our database uses UTF-8. The truth is, when checking our encodings we saw they were utf-8.

> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+


Although the character sets showed that we were using UTF-8, MySQL's UTF-8 is not real UTF-8. Standard UTF-8 uses 4-byte characters whereas MySQL's UTF-8 uses 3-byte characters. To deal with this problem, MySQL added a new encoding, called utf8mb4. Nobody at my company thought MySQL's UTF-8 would be different.

What we needed to do is to find out how to transform all character sets from our database to UTF-8.

Character sets in MySQL


One of the first solutions that you may attempt is to make a script that converts your tables to utf8mb4. This will not work. MySQL have different character encoding settings for the server, client, database, table, columns, etc. If you're going to change one of those settings and convert it to utf8mb4 without converting the others, MySQL will try to make transformations between those types "transparently", and you will still have problems. Moreover, you can also set different charset for your database clients.

How to move to utf8mb4


The only way to solve this problem is to convert all charsets to utf8mb4. In order to do it properly those steps are needed:

- Add utf8mb4 support for all the clients that are accessing the database
- Add utf8mb4 encoding to MySQL's server
- Migrate tables to utf8mb4

The steps


1. Make a backup just in case something goes wrong.
2. Make sure all the clients that are accessing our database use utf8mb4 charset and use utf8mb4_unicode_ci collation.
3. Convert MySQL's default character
   1. If you're using GCP or similar, you need to add the following flag: character-set-server: utf8mb4
   2. If you're using Docker, you need to add the following flags: 
  • --collation-server=utf8mb4_unicode_ci 
  • --character-set-server=utf8mb4
   3. If you're using standard MySQL, you can add the following options at server startup:

[mysqld]        
character-set-server=utf8mb4        
collation-server=utf8mb4_unicode_ci

4. Check the character set:

    > show variables like 'char%'
    +--------------------------+----------------------------+
    | Variable_name            | Value                      |
    +--------------------------+----------------------------+
    | character_set_client     | utf8mb4                    |
    | character_set_connection | utf8mb4                    |
    | character_set_database   | utf8mb4                    |
    | character_set_filesystem | binary                     |
    | character_set_results    | utf8mb4                    |
    | character_set_server     | utf8mb4                    |
    | character_set_system     | utf8                       |
    | character_sets_dir       | /usr/share/mysql/charsets/ |
    +--------------------------+----------------------------+

    If you've a similar result, then you're ready for the next step

5. Convert tables to utf8mb4.
   1. Now what you need to do is to convert every table's collation and charset to utf8mb4_unicode_ci and utf8mb4 respectively.
   2. One of the easiest solutions is to generate the queries that transform your encodings to utf8mb4. I used the following one:
SELECT CONCAT("alter table `", t.table_name,
      "` convert to character set utf8mb4 collate utf8mb4_unicode_ci;")
AS _sql
FROM   information_schema.TABLES t
WHERE  t.table_schema = '<your_db_name>';

        After executing your query, you should have something similar to this:

+---------------------------------------------------------------------------------------------------+
| _sql                                                                                              |
+---------------------------------------------------------------------------------------------------+
| ALTER TABLE `your_table_1` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;           |
| ALTER TABLE `your_table_2` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;           |
+---------------------------------------------------------------------------------------------------+
        ```

        Copy the queries and put them in a migration file

   3. Keep in mind that if you have foreign keys that point to text columns such as VARCHAR, you need to remove your foreign keys, convert your tables to utf8mb4 and add the foreign keys again. If you do not perform this, you will have the following error:

Error 1832: Cannot change column 'your_column_name': used in a forign key constraint 'your_foreign_key_name'

   4. Once you've prepared your migration file, execute it.

Now the final step is to check that your tables use utf8mb4. Execute:

> SELECT t.table_name,
         t.table_collation
  FROM   information_schema.TABLES t
  WHERE  t.table_schema = 'your_database_schema';

+------------------------+--------------------+
| table_name             | table_collation    |
+------------------------+--------------------+
| your_table_1           | utf8mb4_unicode_ci |
| your_table_2           | utf8mb4_unicode_ci |
| your_table_3           | utf8mb4_unicode_ci |
+------------------------+--------------------+

You're finished! Now you won't have any encoding conversions nor character sets that are not UTF-8.

Conclusion


Character set encodings are messy. It is something important to handle, since your application can crash at any time if your user enters any unexpected character, and you need to be prepared for this kind of situations.

This post wouldn't be created without the help of this great article

Jaime Cepeda