All you ever wanted to know about character encoding

Mike O'Dell mo at ccr.org
Sun Mar 13 12:41:10 CDT 2011


Andre can't have the accent in his email address,
just like i can't type it in the body of a message
without resorting to MIME encoding.

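that's because the characters allowed in an RFC822 email address
are restricted to a subset of 7-bit ASCII, with the domain part
interpreted as "all upper-case" (ie, the lower-case letters are "folded"
to their upper-case counterparts for the purposes of comparing character
strings). strictly speaking the local part is case-sensitive, though
most mail systems fold it as well.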

so there is no way to put a "wide" character into a standards-compliant
email address.
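
a minimal sketch of that restriction, in Python. the function name and
the example addresses are made up for illustration, and it only checks
the "is it all 7-bit ASCII" part, not the full RFC822 address grammar:

```python
def is_ascii_address(addr: str) -> bool:
    """Return True if every character in addr is 7-bit ASCII.

    Hypothetical helper: this is only the character-set test, not a
    full RFC822 address parser.
    """
    try:
        addr.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # Plain ASCII address: passes the character-set test.
    print(is_ascii_address("andre@example.net"))
    # Same address with an e-acute (U+00E9): fails it.
    print(is_ascii_address("andr\u00e9@example.net"))
```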

Note: same character set restriction applies to the text of an email
message, so that's where base64 coding comes from for message bodies
without invoking full-blown MIME. This is not strictly kosher, but it
was a hack that predated MIME by years and nobody wanted to break it.
Now, you may just get asked if sending raw "8-bit email" is OK (if it
asks at all - there is probably a knob in the preferences somewhere)
since most email clients just let it slide as "extended ASCII" and you
get what you get from the "code page" in force at the time it's
displayed.
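
to make the "you get what you get" part concrete, here is a small
sketch in Python. it assumes the sender's client wrote the accented e
as the single byte 0xE9 (as Latin-1 would); the function name and the
choice of code pages are only for illustration:

```python
def show_under_code_pages(raw: bytes, codecs=("latin-1", "cp437")) -> dict:
    """Decode the same raw bytes under several code pages.

    Hypothetical helper: returns {codec name: decoded text} so the
    results can be compared side by side.
    """
    return {codec: raw.decode(codec) for codec in codecs}


if __name__ == "__main__":
    raw = b"Andr\xe9"  # assumed: e-acute sent as the single byte 0xE9

    # The same byte turns into different characters depending on the
    # code page the reader's client happens to use.
    for codec, text in show_under_code_pages(raw).items():
        print(codec, text)

    # A client that insists on 7-bit ASCII has nothing to map the
    # character to and substitutes a question mark.
    print("Andr\u00e9".encode("ascii", errors="replace"))
```

the last line is the same effect as the "Andr?" in the quoted message
further down.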

this gets very tricky, though, because with the introduction of
"international domain names", there are "issues" with what characters
can appear in domain names.

DNS itself doesn't care internally - it's a byte sequence, so in theory,
there should be no problem.  however, as with RFC822, there are
significant restrictions on byte values which can appear in domain names
as handled by many protocols, so just spewing anything you want into a
domain name violates protocol specs coming and going.
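
the usual form of that restriction is the "letters, digits, hyphen"
host name rule: each dot-separated label is 1 to 63 characters, made of
ASCII letters, digits and hyphens, and neither starts nor ends with a
hyphen. a rough sketch in Python, with a made-up function name, which
does not check the overall length limit:

```python
import re

# One host name label: 1-63 characters, ASCII letters, digits and
# hyphens only, with no hyphen at either end.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_hostname(name: str) -> bool:
    """Return True if every label of name follows the LDH rule.

    Hypothetical helper: a single trailing dot is accepted, and the
    total-length limit is deliberately not checked.
    """
    if name.endswith("."):
        name = name[:-1]
    return all(_LDH_LABEL.fullmatch(label) for label in name.split("."))


if __name__ == "__main__":
    print(is_ldh_hostname("amrad.org"))
    print(is_ldh_hostname("andr\u00e9.example"))
```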

however, with all the discontent over the geopolitics of Unicode and
UTF-8, not to mention Microsoft's brain-damaged adoption of UTF-16
instead of UTF-8, just extending the protocols to allow UTF-8 coding of
domain names was deemed politically impossible.
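
for reference, the route the IETF's IDNA work took was to leave the
protocols alone and squeeze non-ASCII labels into plain ASCII letters,
digits and hyphens behind an "xn--" prefix. a minimal sketch using
Python's built-in "idna" codec, which implements the older IDNA 2003
rules; the function names and the example domain are made up:

```python
def to_ascii_hostname(name: str) -> str:
    """Convert a Unicode host name to its ASCII-compatible form.

    Hypothetical helper around Python's built-in "idna" codec
    (IDNA 2003 rules).
    """
    return name.encode("idna").decode("ascii")


def from_ascii_hostname(name: str) -> str:
    """Convert an ASCII-compatible host name back to Unicode."""
    return name.encode("ascii").decode("idna")


if __name__ == "__main__":
    wire = to_ascii_hostname("andr\u00e9.example")
    print(wire)                       # the label now starts with "xn--"
    print(from_ascii_hostname(wire))  # and converts back to the original
```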

Therefore, everyone decided to disagree and "just do what they want".
This may be converging; I haven't had the stomach to go look in that
particular cesspool recently so am not completely up to speed on the
latest palace intrigue, but i'd vote against it sight unseen.

Welcome to the inside of the sausage factory.

     -mo




On 3/12/11 11:21 PM, Chip Fetrow wrote:
> And, I find it interesting that in the "From" field, your name can be 
> read, though there is no accent over the "e."
>
> However, when you type your name at the bottom of the message the 
> accented e is replaced with a question mark in my e-mail client.
>
> --chip
>
> On Mar 12, 2011, at 11:38 AM, tacos-request at amrad.org wrote:
>
>> Message: 2
>> Date: Fri, 11 Mar 2011 18:02:33 -0500
>> From: Andre Kesteloot <andre.kesteloot at verizon.net>
>> Subject: Re: All you ever wanted to know about character encoding
>> [...]
>> indeed.
>> Fascinating stuff, communication between human beings.
>> Andr?  (with an acute accent on the "e"  :-)
>
> _______________________________________________
> Tacos mailing list
> Tacos at amrad.org
> https://amrad.org/mailman/listinfo/tacos

