collation

What is the difference utf8_unicode_ci and utf8_general_ci?

From the MySQL manual:

For any Unicode character set, operations performed using the xxx_general_ci collation are faster than those for the xxx_unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci.

They have a amusing “examples of the effect of collation” set on “sorting German umlauts,” but it unhelpfully uses latin1_* collations. And another table that helpfully explains:

A difference between the collations is that this is true for utf8_general_ci:

ß = s

Whereas this is true for utf8_unicode_ci, which supports the German DIN-1 ordering (also known as dictionary order):

ß = ss

This forum post adds more info, but nowhere do they explain how a ☃ sorts against ☁ or ⛅.

How much faster is utf8_general_ci than utf8_unicode_ci, though? An August 2010 message in the MySQL forums seems to suggest the performance for specific operations could be 30% faster, but then dismisses the performance difference as unimportant compared to good indexing and writing efficient queries.

The Difference Between MySQL’s utf8_unicode_ci and. utf8_general_ci Collations

MySQL answer: utf8_unicode_ci vs. utf8_general_ci.

Collation controls sorting behavior. Unicode rationalizes the character set, but doesn’t, on it’s own, rationalize sorting behavior for all the various languages it supports. utf8_general_ci (ci = case insensitive) is apparently a bit faster, but sloppier, and only appropriate for English language data sets.