There is a huge profusion of terms around sets of letters, symbols, and characters and how they are encoded onto computers. I found myself embroiled in this mess working at a speech-technology company back in the day as we struggled to adapt a speech recognizer designed for English to cope with Korean — which turns out to have a surprisingly regular orthographic system. (Never mind that none of us on the team knew Korean.)
But I’m here today to talk about the key distinctions I learned in dealing with those letters, symbols, and characters on a computer. More after the jump, if you don’t mind me overexplaining a little.
A character is not a glyph
A glyph is a representation of a character on screen or paper; it has a font, a point size, and other visual properties, while a character is the abstract notion denoting the class of all such glyphs. (For the anthropologists and linguists: glyph is an “etic” term; character is an “emic” term.) Graphic designers get very upset about glyph mismatches; computational linguists, on the other hand, get very upset about character mismatches; we count on the graphic designers to sort out the way our words appear on the page.
A coded character set is not a character encoding
A coded character set assigns a distinct integer (a code point) to each character in the set, while a character encoding maps those code points to sequences of bytes (eight-bit chunks).
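Python makes the distinction easy to see, since its strings are sequences of code points and encoding is an explicit step. A minimal sketch:

```python
# A coded character set assigns integers (code points) to characters;
# a character encoding maps those integers to bytes.
s = "é"

# The code point is a property of the coded character set (Unicode here).
print(ord(s))                # 233, i.e. U+00E9

# The bytes depend entirely on which encoding you pick.
print(s.encode("utf-8"))     # b'\xc3\xa9'  (two bytes)
print(s.encode("latin-1"))   # b'\xe9'      (one byte)
```

Same character, same code point, two different byte sequences: the code point belongs to the coded character set, the bytes belong to the encoding.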
Unicode is a coded character set; UTF-8 is a character encoding
Unicode is a very large CCS: its code space contains over a million code points (1,114,112, to be exact), though only a fraction of those have been assigned characters so far. But there is more than one scheme for mapping this very large set of code points onto byte sequences. UTF-32, for example, represents every single code point as four bytes (32 bits; hence the name). Most popular these days is UTF-8, which has several backwards-compatibility virtues and tends to be space-efficient for European (Latin-script) text.
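You can watch the trade-off between these encodings directly by encoding the same short string each way. A quick sketch:

```python
# Three code points: 'A' (U+0041), 'b' (U+0062), '€' (U+20AC).
text = "Ab€"

# Same code points, different byte sequences under different encodings.
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    encoded = text.encode(enc)
    print(enc, len(encoded), encoded.hex())

# UTF-8 spends 1 byte each on 'A' and 'b' and 3 on '€' (5 bytes total);
# UTF-16 spends 2 bytes per code point here (6 total);
# UTF-32 always spends 4 (12 total).
```

For mostly-ASCII text UTF-8 wins handily; for text dominated by higher code points the gap narrows or reverses.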
Microsoft tools frequently get the CCS/character-encoding distinction confusingly wrong: many of their “Save As” menus offer “Unicode”, but that is underspecified, since it doesn’t say which character encoding is meant. (Anybody else know?)
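For what it’s worth, Windows’ “Unicode” option has historically meant UTF-16 little-endian with a leading byte-order mark (BOM), which at least makes such files sniffable. A sketch of that sniffing, using a hypothetical helper `sniff_bom`:

```python
import codecs

def sniff_bom(raw):
    """Return a likely encoding name based on a leading BOM, or None.

    UTF-32-LE must be checked before UTF-16-LE, because its BOM
    (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE).
    """
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8,     "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if raw.startswith(bom):
            return name
    return None

# What the "Unicode" choice in a Windows Save As dialog typically produces:
raw = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")
print(sniff_bom(raw))   # utf-16-le
```

A BOM only tells you the encoding when one is present; plain UTF-8 files usually have none, so this is a heuristic, not a guarantee.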
If you’re working with non-English data, you will encounter this headache. You’ll hit it sooner (rather than later) if you work with the so-called CJKV languages (Chinese, Japanese, Korean, Vietnamese: anything that uses Chinese characters — hanzi, kanji, hanja, chữ Hán — even a little).
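One concrete reason CJKV text hits this headache early: character counts and byte counts stop agreeing. A minimal sketch with Korean:

```python
# One Hangul syllable is one character (one code point) but three bytes
# in UTF-8 -- conflating the two lengths is a classic source of bugs
# (truncated buffers, wrong string-length limits, corrupted output).
word = "한국어"   # "Korean language": three syllables

print(len(word))                   # 3 code points
print(len(word.encode("utf-8")))   # 9 bytes
```

Any code that sizes a buffer by character count, or slices a byte array at an arbitrary offset, will eventually cut a multi-byte character in half.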
Back in the dark ages (I think it was 2002), I was exploring graduate school as a non-matriculated student and took my first graduate-level computational linguistics class, from John Goldsmith, who was guest-teaching at the University of Washington. We were working our way through (what was then) the only standard CL textbook. John liked me (I hope he still does!) and noticed that I had some industry experience, and (correspondingly) I noticed that there wasn’t a chapter about character encodings.
So John asked me to give a guest lecture on this subject, and I was flattered to do so. I updated the slides a few years later to give (roughly) the same talk to the newborn UW Computational Linguistics Lab students. Those slides are available, and I’m posting them here because Steven Moran asked me to: Character encodings as PDF.