Migrating MySQL to UTF-8 encoding

While developing a content management system (CMS) using default MySQL settings an issue came to light in regards to UTF-8 encoding. Clients were sending in documents in Microsoft Word format that were encoded with UTF-8. When the data was copied from the document and pasted into the CMS WYSIWYG editor strange characters would be displayed after saving the document.

A second issue was identified where exported MySQL tables that contained UTF-8 encoded characters were being not being read correctly by Debian. Note that I am using an old installation of Debian (fully updated of course) that was rolled out before widespread adoption of UTF-8. Newer installations may already have UTF-8 enabled by default.

The goals of this HOWTO are as follows:

  1. set newly created tables in MySQL to use UTF-8 encoding
  2. convert existing MySQL tables to UTF-8
  3. set default environment (command line) encoding to use UTF-8

Set Newly Created Tables in MySQL to use UTF-8 Encoding

I am using MySQL on Debian and the default MySQL settings use latin1_swedish_ci as the character set for newly created tables. To change the default encoding to UTF-8 open /etc/mysql/my.cnf and add 'default-character-set = utf8' to the '[client]' section. The configuration should read like this:

[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock
default-character-set = utf

Convert Existing MySQL tables to UTF-8

The basis of converting a table to UTF-8 is:

ALTER TABLE your_table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

NOTE: When I ran this script for a migration of a table with a compound index I received an error:

MySQL Error: Speficied key was too long; max key length is 1000 bytes

For more information on this issue please refer to the article here.

Helpful Script

I wrote a script to automate the migration of larger MySQL databases to use UTF-8. Note that I rely on ~/my.cnf to provide the username and password.

#!/bin/bash

DATABASE=$1

if [ ! "$DATABASE" ]; then
echo "Please specify a database"
exit
fi

BACKUPDIR="/root/tmp/mysql_backups/"

if [ ! -d "$BACKUPDIR" ]; then
mkdir -p "$BACKUPDIR"
fi

BACKUP="$BACKUPDIR""$DATABASE.sql"

mysqldump --add-drop-table --extended-insert "$DATABASE" > "$BACKUP"

TABLES=`mysql --batch --execute 'SHOW TABLES' "$DATABASE" | grep -v "^Tables_in"`

for TABLE in $TABLES; do
echo 'ALTER TABLE `'"$TABLE"'` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;';
# uncomment the following line to process the commands
#mysql --execute 'ALTER TABLE `'"$TABLE"'` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;' "$DATABASE"
done

Set Default Environment (command line) Encoding to use UTF-8

There are a number of ways to set the default locale. The easiest is to reconfigure the locales. Run the command below and select the UTF locales you want. During the reconfigure process you will be able to select your default locale.

sudo dpkg-reconfigure locales

Log out and log back in and you will have a default UTF-8 charset. To check run the command:

locale charmap

The output should read 'UTF-8'.

Sources:

MySQL, PHP, Apache, and UTF-8 Issues
Convert MySQL table to utf8
Step by step introduction to switching your debian installation to utf-8 encoding
How to change your default locale on Ubuntu Linux

Comments

Post new comment

The content of this field is kept private and will not be shown publicly.
By submitting this form, you accept the Mollom privacy policy.