unzip can fail on macOS when UTF-8 characters are in the archive. The solution is to use ditto. Via a GitHub issue:
ditto -V -x -k --sequesterRsrc --rsrc FILENAME.ZIP DESTINATIONDIRECTORY
From the MySQL manual:
For any Unicode character set, operations performed using the xxx_general_ci collation are faster than those for the xxx_unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci.
They have an amusing “examples of the effect of collation” set on “sorting German umlauts,” but it unhelpfully uses latin1_* collations. And another table that helpfully explains:
A difference between the collations is that this is true for utf8_general_ci:
ß = s
Whereas this is true for utf8_unicode_ci, which supports the German DIN-1 ordering (also known as dictionary order):
ß = ss
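One way to check that difference on your own server is to ask MySQL directly. The sketch below is an illustration, not something taken from the manual: collation_probe_sql is a hypothetical helper name, and it assumes a mysql client whose connection can use the utf8 character set. The function only prints the statements; pipe them into the client to run them.

```shell
# Hypothetical helper: print SQL that compares ß against "s" and "ss"
# under each collation. A result of 1 means MySQL treats the pair as equal.
collation_probe_sql() {
  cat <<'SQL'
SET NAMES utf8;
SELECT 'ß' = 's'  COLLATE utf8_general_ci AS general_ci_s,
       'ß' = 'ss' COLLATE utf8_general_ci AS general_ci_ss,
       'ß' = 's'  COLLATE utf8_unicode_ci AS unicode_ci_s,
       'ß' = 'ss' COLLATE utf8_unicode_ci AS unicode_ci_ss;
SQL
}

# Usage, assuming credentials are already configured for the client:
#   collation_probe_sql | mysql
```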
This forum post adds more info, but nowhere do they explain how a ☃ sorts against ☁ or ⛅.
How much faster is utf8_general_ci than utf8_unicode_ci, though? An August 2010 message in the MySQL forums suggests specific operations could be 30% faster, but then dismisses the performance difference as unimportant compared to good indexing and writing efficient queries.
MySQL answer: utf8_unicode_ci vs. utf8_general_ci.
Collation controls sorting behavior. Unicode rationalizes the character set, but doesn’t, on its own, rationalize sorting behavior for all the various languages it supports. utf8_general_ci (ci = case insensitive) is apparently a bit faster, but sloppier, and only appropriate for English-language data sets.
This Gentoo Wiki page suggests dumping the table, using iconv to convert the characters, then inserting the dump into a new table with the new charset.
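The conversion step of that approach might look something like the sketch below. It is an illustration rather than the wiki's exact commands: latin1_to_utf8 is a hypothetical helper, it assumes the existing data really is Latin-1 (ISO-8859-1), and the database, table, and file names in the usage comments are placeholders.

```shell
# Hypothetical helper: re-encode a dump from Latin-1 to UTF-8.
# Reads the dump on stdin and writes the converted text to stdout.
latin1_to_utf8() {
  iconv -f ISO-8859-1 -t UTF-8
}

# Usage (all names are placeholders):
#   mysqldump mydb mytable > dump-latin1.sql
#   latin1_to_utf8 < dump-latin1.sql > dump-utf8.sql
#   mysql newdb < dump-utf8.sql
# iconv only re-encodes bytes. Any charset named in the dump's
# CREATE TABLE statements still has to be edited separately.
```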
Alex King solved a different problem: his apps were talking UTF8, but his tables were Latin1. His solution was to dump the tables, change the charset info in the dump file, then re-insert the contents.
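A rough sketch of that dump, edit, and reload cycle follows. It is not his exact procedure: rewrite_dump_charset is a hypothetical helper, the sed patterns only cover the common latin1 declarations, and the database and file names are placeholders.

```shell
# Hypothetical helper: rewrite latin1 charset declarations in a dump.
# Reads the dump on stdin and writes the edited dump to stdout.
# Row data is left alone unless it happens to contain the same strings.
rewrite_dump_charset() {
  sed -e 's/CHARSET=latin1/CHARSET=utf8/g' \
      -e 's/CHARACTER SET latin1/CHARACTER SET utf8/g' \
      -e 's/SET NAMES latin1/SET NAMES utf8/g'
}

# Usage (all names are placeholders):
#   mysqldump --default-character-set=latin1 mydb > dump.sql
#   rewrite_dump_charset < dump.sql > dump-utf8.sql
#   mysql --default-character-set=utf8 mydb < dump-utf8.sql
```

Keep the original dump until the reloaded tables have been checked, since a blanket substitution can also touch row data that contains the same strings.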