Beyond ASCII

Stephen Edmonds

Melbourne Perl Mongers - August 2008

What?

How?

What about text?

Character encoding

An example

Chinese character for 'dog'          狗
Unicode character number             72D7 (hex), 29399 (decimal)
HTML entity                          &#x72D7; or &#29399;
Three bytes in UTF-8                 E7 8B 97
Two bytes in UTF-16                  72 D7
Two bytes in GB2312 (Chinese)        B9 B7
Two bytes in Shift JIS (Japanese)    8B E7
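
The byte values in the table can be reproduced with the core Encode module. This is a minimal sketch, not part of the original slides: hex_bytes is a hypothetical helper name, and 'euc-cn' and 'shiftjis' are assumed here to be the Encode names for the GB2312 and Shift JIS encodings.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+72D7, the character from the table
my $dog = "\x{72D7}";

# Hypothetical helper: encode a character string and
# return its bytes as space-separated hex pairs
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# The two numeric HTML entity forms for the same character
printf "%-9s &#x%X; &#%d;\n", 'entity', ord($dog), ord($dog);

# One character, four different byte sequences
for my $encoding ('UTF-8', 'UTF-16BE', 'euc-cn', 'shiftjis') {
    printf "%-9s %s\n", $encoding, hex_bytes($encoding, $dog);
}
```

UTF-16BE is used rather than plain UTF-16 so that no byte order mark is prepended to the output.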

In Perl

Some confusion

A key concept

Input

Output

Which encoding?

Email

Reading an email

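use MIME::Parser;
use Encode;

# Parse the raw message into a MIME::Entity
my $entity = MIME::Parser->new()->parse_data($message);

# Look up the encoding named in the Content-Type charset parameter
my $enc = Encode::find_encoding(
   $entity->head()->mime_attr('content-type.charset')
);

# Decode the body bytes into a Perl character string
my $text = $enc->decode(
   $entity->bodyhandle()->as_string()
);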
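
Encode::find_encoding returns undef when it does not recognise a name, and some messages carry no charset parameter at all. One possible way to guard the decode step is sketched below, using Encode only. The helper name decode_with_charset and the ISO-8859-1 fallback are assumptions for illustration, not something the slides prescribe.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: decode raw bytes using a declared charset label,
# falling back to a default when the label is missing or unrecognised
sub decode_with_charset {
    my ($bytes, $label, $fallback) = @_;
    $fallback = 'ISO-8859-1' unless defined $fallback;   # assumed default

    my $enc;
    $enc = find_encoding($label) if defined $label && length $label;
    $enc ||= find_encoding($fallback);
    die "Unknown fallback encoding: $fallback\n" unless $enc;

    return $enc->decode($bytes);
}

# Three UTF-8 bytes become a single character
my $text = decode_with_charset("\xE7\x8B\x97", 'UTF-8');
printf "%d character(s), U+%04X\n", length($text), ord($text);
```

In the email example, the label would be the value of mime_attr('content-type.charset') and the bytes would come from bodyhandle()->as_string().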

Writing an email

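use MIME::Entity;
use Encode;

# $subject and $body are character strings; encode them to bytes on the way out
my $entity = MIME::Entity->build(
   'Subject'  => Encode::encode('MIME-Header', $subject),
   'Data'     => Encode::encode('UTF-8',       $body),
   'Charset'  => 'UTF-8',
   'Encoding' => '-SUGGEST',
);

# Serialise the whole message (headers and body) to a string
my $message = $entity->as_string();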
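
To see what the 'MIME-Header' encoding does to a subject line, the sketch below encodes a subject containing one non-ASCII character and decodes it again. It uses only the core Encode module; the sample subject is an assumption made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character string containing U+72D7 (sample value, not from the slides)
my $subject = "Dog: \x{72D7}";

# Encode to the ASCII-only "encoded-word" form used in mail headers
my $header = encode('MIME-Header', $subject);

# Decode the header form back into a character string
my $back = decode('MIME-Header', $header);

print "$header\n";
print $back eq $subject ? "round trip ok\n" : "round trip differs\n";
```

The encoded header contains only ASCII bytes, which is why the Subject is encoded this way while the body is encoded as UTF-8 and labelled through the Charset parameter.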

Web forms

HTML charset

accept-charset

HTML output

HTML encode everything

Partial HTML encode

Overview

References