All you ever wanted to know about character encoding

Mike O'Dell mo at ccr.org
Fri Mar 11 16:18:43 CST 2011


yes, but it may not be all you *need* to know.

the last time I looked, Unicode didn't have glyphs for Hawaiian.
even more contentious, Unicode has decided that some glyphs in different
languages are sufficiently "alike" that they have been replaced with a
single codepoint (common glyph) which is shared between the languages.
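
a minimal sketch of what that sharing looks like from a program's point of view, assuming Python's standard unicodedata module. the helper name describe and the choice of U+76F4 (an ideograph often cited as being drawn differently in Chinese and Japanese typefaces) are only for illustration:

```python
import unicodedata

def describe(ch):
    """Return (code point label, Unicode character name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch))

# U+76F4 is an illustrative pick, not a claim about which characters are disputed.
ideograph = "\u76f4"

# The name says "CJK UNIFIED IDEOGRAPH": one code point, no language attached.
print(describe(ideograph))
```

nothing in the string itself says whether the text is Chinese or Japanese; that has to come from outside it (language tagging, font choice), which is exactly where the argument starts.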

as you might imagine, similarity is in the eye of the beholder, and the
native speakers of those languages don't share the opinions of the
Unicode priests who made the decisions. that's a long, heated argument
that's raged for years and will likely continue.

not unrelated is the issue that the notion of a "typeface" (incorrectly
called a "font" in the article) and a "glyph" are not cleanly independent,
and hence a single Unicode codepoint is insufficient to represent the
English letter A displayed in both Helvetica and Bookman Old Style
typefaces (not to mention bold vs non-bold vs italic vs character size).
In a mathematics textbook, all those letters which we read as "A" in spite
of their different typeface *are* extremely distinct - that's why they
are displayed differently!
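
as a hedged illustration of how tangled this gets: Unicode does give a set of styled math letters their own code points (the Mathematical Alphanumeric Symbols block), and compatibility normalization (NFKC) folds them back to a plain "A". the sketch below uses Python's standard unicodedata module; the helper name styled_a_report and the particular letters listed are just assumptions for the example:

```python
import unicodedata

# A plain A plus a few styled math variants, looked up by official character name.
NAMES = [
    "LATIN CAPITAL LETTER A",
    "MATHEMATICAL BOLD CAPITAL A",
    "MATHEMATICAL ITALIC CAPITAL A",
    "MATHEMATICAL FRAKTUR CAPITAL A",
]

def styled_a_report(names):
    """Return (code point label, name, NFKC-folded form) for each named character."""
    rows = []
    for name in names:
        ch = unicodedata.lookup(name)
        folded = unicodedata.normalize("NFKC", ch)
        rows.append((f"U+{ord(ch):04X}", name, folded))
    return rows

for row in styled_a_report(NAMES):
    print(row)
```

each entry has a distinct code point, yet all of them fold to the same "A" under NFKC, so whether they count as "the same letter" depends on which comparison you ask for.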

so "when is an A not the same as another A?" is the same question as
whether a particular Chinese ideogram is different from a particular
Japanese ideogram when the two "look *very* similar" but mean entirely
different things in the two languages. is the similarity real, or is it
an illusion created by the way the two ideograms are rendered (ie,
"typefaces")?

this is all very complex and very subtle, and gets to the very heart of
the way humans represent information in written or stored form, and how
much fidelity is required to capture the author's intent sufficiently
well to allow it to be reproduced in a different form.

this is why "grep" for much beyond ASCII (including 8-bit ASCII+"other"
codes) gets to be very difficult to understand very quickly. Yes, "grep"
can be defined for Unicode, but exactly what that means gets softer and
softer the further you get from ASCII.
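
one small example of that softness, sketched in Python with the standard unicodedata module: the same visible word can be stored as different code point sequences, so a plain substring match misses it unless both sides are normalized first. the function names naive_find and normalized_find are made up for this sketch, and NFC is just one possible choice of normalization form:

```python
import unicodedata

def naive_find(pattern, text):
    """Substring match on raw code points, roughly what a byte-wise grep does."""
    return pattern in text

def normalized_find(pattern, text, form="NFC"):
    """Substring match after normalizing both sides to the same Unicode form."""
    norm_pattern = unicodedata.normalize(form, pattern)
    norm_text = unicodedata.normalize(form, text)
    return norm_pattern in norm_text

composed = "caf\u00e9"     # e-acute as a single precomposed code point
decomposed = "cafe\u0301"  # plain e followed by a combining acute accent

print(naive_find(composed, decomposed))       # False: the code points differ
print(normalized_find(composed, decomposed))  # True: both normalize to the same NFC string
```

even this only covers canonical equivalence; case folding, compatibility forms, and locale rules each move the definition of a "match" a little further.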

And all this is almost certainly *more* than you wanted to know about the
collision of coding, orthography, typefaces, linguistics, and the
geopolitics of 1000-year-old grudges.

     cheers,
     -mo



On 3/11/11 10:07 AM, Andre Kesteloot wrote:
> http://www.howtogeek.com/howto/45765/htg-explains-what-are-character-encodings-and-how-do-they-differ/?utm_source=newsletter&utm_medium=email&utm_campaign=110311 
>
>
>
> _______________________________________________
> Tacos mailing list
> Tacos at amrad.org
> https://amrad.org/mailman/listinfo/tacos