Beyond ASCII

Stephen Edmonds

Melbourne Perl Mongers - August 2008

What?

We want to communicate information.
What actually gets transmitted is bytes.
Information -> bytes -> Information

How?

Many representations are well known:
Images - PNG, JPEG, GIF, etc...
Audio - WAV, MP3, etc...
Video - MPEG, AVI, etc...
Text - not so well known/understood

What about text?

In the 1960s, ASCII and EBCDIC were among the first standards
Now lots more: ISO 8859-1, ISO 8859-2, ISO 8859-3, ISO 8859-4, ISO 8859-5, ISO 8859-6, ISO 8859-7, ISO 8859-8, ISO 8859-9, ISO 8859-10, ISO 8859-11, ISO 8859-13, ISO 8859-14, ISO 8859-15, ISO 8859-16, CP437, CP737, CP850, CP852, CP855, CP857, CP858, CP860, CP861, CP863, CP865, CP866, CP869, Windows-1250, Windows-1251, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257, Windows-1258, Mac OS Roman, KOI8-R, KOI8-U, KOI7, MIK, Cork or T1, ISCII, VISCII, HKSCS, GB2312, GB18030, Shift JIS, EUC-KR, ISO-2022, UTF-8, UTF-16, etc...
These are known as 'character encodings'

Character encoding

Character set - the list of supported characters
Encoding - how each of those characters is represented as bytes
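
A minimal sketch of the distinction, assuming only the core Encode module. The character é is in the ISO 8859-1 character set and 狗 is not, while Unicode contains both and its encodings differ only in the bytes they produce. The helper name `hex_bytes` is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $e_acute = "\x{E9}";      # é
my $dog     = "\x{72D7}";    # 狗

# Same character, two encodings, different bytes.
print hex_bytes(encode('ISO-8859-1', $e_acute)), "\n";
print hex_bytes(encode('UTF-8',      $e_acute)), "\n";

# 狗 is outside the ISO 8859-1 character set, so a strict encode fails.
my $ok = eval {
    encode('ISO-8859-1', $dog, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print $ok ? "encoded\n" : "not in ISO 8859-1\n";
```

Without the `FB_CROAK` check, Encode substitutes a placeholder character instead of failing, which is how text gets silently damaged.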

An example

Chinese character for 'dog'	狗
Unicode character number	72D7 (hex)	29399 (decimal)
HTML entity	&#x72D7;	&#29399;
Three bytes in UTF-8	E7 8B 97
Two bytes in UTF-16 (big-endian)	72 D7
Two bytes in GB2312 (Chinese)	B9 B7
Two bytes in Shift JIS (Japanese)	8B E7
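
The byte values above can be reproduced with Encode. This is a sketch that assumes Encode's canonical names for these encodings: `euc-cn` for GB2312 and `shiftjis` for Shift JIS.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $dog = "\x{72D7}";    # 狗, U+72D7

# UTF-16BE is big-endian UTF-16 with no byte order mark.
for my $name (qw(UTF-8 UTF-16BE euc-cn shiftjis)) {
    my $bytes = encode($name, $dog);
    printf "%-9s %s\n", $name,
        join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}
```

Plain `UTF-16` in Encode adds a byte order mark on output, which is why the sketch uses `UTF-16BE` to get just the two bytes.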

In Perl

For years I thought the answer was:
```
    use utf8;
```
but this only tells perl that the source code itself is encoded as UTF-8
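
A small sketch of what the pragma does and does not do, assuming the script file itself is saved as UTF-8. It changes how string literals in the source are read, and nothing about input or output.

```perl
use strict;
use warnings;

my ($with, $without);
{
    use utf8;     # literals in this block are read as characters
    $with = length "狗";
}
{
    no utf8;      # literals in this block are read as raw bytes
    $without = length "狗";
}
print "with 'use utf8': $with, without: $without\n";
```

Under that assumption the two lengths should be 1 and 3: one character, or the three UTF-8 bytes that make it up.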

In Perl

The real answer is:
```
    use Encode;
```
It provides functions to convert to or from pretty much any character encoding
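
A minimal sketch of a conversion between two encodings using `decode` and `encode` from Encode. The sample string is an assumption chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "café" as ISO 8859-1 bytes: é is the single byte E9.
my $latin1_bytes = "caf\xE9";

my $text       = decode('ISO-8859-1', $latin1_bytes);   # bytes -> characters
my $utf8_bytes = encode('UTF-8', $text);                # characters -> bytes

printf "%d bytes in ISO 8859-1, %d bytes in UTF-8\n",
    length($latin1_bytes), length($utf8_bytes);

# Encode->encodings(':all') lists the names this installation supports.
my @names = Encode->encodings(':all');
printf "%d encodings available\n", scalar @names;
```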

Some confusion

Different things use different terms:
encoding, character encoding, charset, character set, ...
May mean the same thing, but might not

A key concept

Perl strings can be anything
- internally they are Unicode if they need to be
but you need to think of them in two ways:
- binary strings
- text strings

Input

What comes in is a series of bytes, the binary string
Might not behave as expected:
- byte count may not equal the character count
- regexes might not work as you want
So first you decode the binary into text:
```
   $text = decode('UTF-8', $binary);
```
Now you have a text string in which each character is an actual character
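
A short sketch of the difference, assuming the input is the three UTF-8 bytes for 狗:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $binary = "\xE7\x8B\x97";              # UTF-8 bytes for 狗, as read from input
my $text   = decode('UTF-8', $binary);    # one character, U+72D7

printf "binary length: %d, text length: %d\n", length($binary), length($text);

# /^.$/ asks for exactly one character: true for the text, false for the bytes.
printf "binary matches: %s, text matches: %s\n",
    ($binary =~ /^.$/ ? 'yes' : 'no'),
    ($text   =~ /^.$/ ? 'yes' : 'no');
```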

Output

You shouldn't just output a text string
You should first encode it back to binary:
```
    $binary = encode('UTF-8', $text);
```
and then output it
possibly with a transfer encoding
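
A sketch of the output side, using base64 from the core MIME::Base64 module as one example of a transfer encoding. The choice of base64 is an assumption for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

my $text   = "\x{72D7}";                # text string: 狗
my $binary = encode('UTF-8', $text);    # binary string: E7 8B 97

# Write the bytes through a raw handle so nothing re-encodes them.
binmode STDOUT, ':raw';
print $binary, "\n";

# Base64 as the transfer encoding: the bytes become 7-bit safe ASCII.
my $transfer = encode_base64($binary, '');    # '' means no trailing newline
print $transfer, "\n";
```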

Which encoding?

You need to know which encoding to use
Use the wrong one and the text will be broken
Unfortunately this means you need to make assumptions
- your assumption depends on what you are doing
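
One possible way to mitigate a wrong assumption is to decode strictly and fall back when the bytes are not valid. This is a sketch: `decode_with_fallback` is a hypothetical helper, and the choice of ISO 8859-1 as the fallback is itself an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: assume UTF-8, fall back to ISO 8859-1 if the
# bytes are not valid UTF-8.
sub decode_with_fallback {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

my $from_utf8   = decode_with_fallback("caf\xC3\xA9");   # valid UTF-8
my $from_latin1 = decode_with_fallback("caf\xE9");       # not valid UTF-8

# Decoding with the wrong encoding "works" but breaks the text:
my $broken = decode('ISO-8859-1', "caf\xC3\xA9");        # 5 characters, not 4
printf "%d %d %d\n", length($from_utf8), length($from_latin1), length($broken);
```

The fallback only catches bytes that are invalid as UTF-8. Bytes that happen to be valid in both encodings will still be decoded under the first assumption.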

Email

MIME - Multipurpose Internet Mail Extensions
Straightforward, except for terminology:
- 'encoding' means the transfer encoding
(how to send binary data over systems that might only support 7-bit)
- the charset parameter of the Content-Type is the character encoding we want
```
    text/plain; charset=US-ASCII
```
Also a special encoding (encoded-word) for header fields
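
Encode ships a `MIME-Header` encoding that handles encoded-words. A minimal sketch, assuming a Subject value that carries the UTF-8 bytes for 狗 in base64:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# A header value as it might appear on the wire.
my $raw_subject = '=?UTF-8?B?54uX?=';

my $subject = decode('MIME-Header', $raw_subject);    # text string, U+72D7

# Going the other way produces an ASCII-only header value.
my $wire = encode('MIME-Header', $subject);
print "$wire\n";
```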

Reading an email

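```
    use MIME::Parser;
    use Encode;

    my $entity = MIME::Parser->new()->parse_data($message);

    # The charset parameter of Content-Type names the character encoding
    my $enc = Encode::find_encoding(
       $entity->head()->mime_attr('content-type.charset')
    );

    # Decode the body bytes into a text string
    my $text = $enc->decode(
       $entity->bodyhandle()->as_string()
    );
```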

(assume that the sender built the message correctly)

Writing an email

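```
    use MIME::Entity;
    use Encode;

    my $entity = MIME::Entity->build(
       'Subject'  => Encode::encode('MIME-Header', $subject),  # encoded-word
       'Data'     => Encode::encode('UTF-8',       $body),     # body as bytes
       'Charset'  => 'UTF-8',                                  # must match Data
       'Encoding' => '-SUGGEST',                               # pick a transfer encoding
    );

    my $message = $entity->as_string();
```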

Web forms

Potentially more problems
Browsers might behave differently
Users can override page charset

HTML charset

Set as an HTTP header:
```
    Content-Type: text/html; charset=ISO-8859-1
```
as a meta element:
```
    <meta http-equiv="Content-Type"
       content="text/html; charset=US-ASCII">
```
or for XHTML:
```
    <?xml version="1.0" encoding="ISO-8859-1"?>
```
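
One way to keep the declared charset and the actual bytes in step is to produce both from the same value. A minimal sketch: `build_response` is a hypothetical helper, not part of any framework.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build a minimal CGI-style response whose declared
# charset always matches the encoding actually applied to the body.
sub build_response {
    my ($html_text, $charset) = @_;
    my $body = encode($charset, $html_text);
    return "Content-Type: text/html; charset=$charset\r\n"
         . 'Content-Length: ' . length($body) . "\r\n"
         . "\r\n"
         . $body;
}

my $response = build_response("<p>\x{72D7}</p>", 'UTF-8');
```

Content-Length is taken from the encoded body, so it counts bytes rather than characters.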

HTML charset

Browser should display the page using the specified charset
But users can override this
Form submissions will be sent back to the server in the specified charset
But not if the user has overridden it
And different browsers degrade in different ways for characters not supported by the charset

accept-charset

You can specify the charset on an HTML form:
```
   <form ... accept-charset="UTF-8">
```
The browser must now submit in the specified charset, whatever the page/browser charset is
This makes it safer to assume the charset of the input
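
A sketch of decoding one submitted value by hand, assuming the form was sent as UTF-8 in the usual application/x-www-form-urlencoded format. `decode_form_value` is a hypothetical helper; a real application would normally let its CGI or framework layer do the unescaping.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper for a single form value submitted as UTF-8.
sub decode_form_value {
    my ($raw) = @_;
    (my $bytes = $raw) =~ tr/+/ /;                    # '+' means space
    $bytes =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # %XX -> one byte
    return decode('UTF-8', $bytes);                   # bytes -> text
}

my $value = decode_form_value('%E7%8B%97+dog');
printf "%d characters\n", length($value);
```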

HTML output

Two approaches:
- encode everything
- only encode unsafe characters

HTML encode everything

charset of the page can be any one that includes ASCII

All text that may contain non-ASCII is run through:
```
    $binary = HTML::Entities::encode_entities($text);
```
By default this entity-encodes all non-ASCII characters, plus control characters and < > & " '
Page size will increase
but the characters will be correct
- even if the user changes it in browser
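
A small sketch of the trade-off, assuming HTML::Entities is installed. It compares the size of fully entity-encoded output with the same text as raw UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTML::Entities qw(encode_entities decode_entities);

my $text = "\x{72D7} & <b>";               # 狗 & <b>

my $entities = encode_entities($text);     # ASCII only, whatever the page charset
my $utf8     = encode('UTF-8', $text);     # raw UTF-8 bytes, for comparison

printf "entity-encoded: %d bytes, UTF-8: %d bytes\n",
    length($entities), length($utf8);
```

The entity-encoded form is longer, but it contains only ASCII, so it survives any ASCII-compatible page charset.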

Partial HTML encode

Select a charset, e.g. UTF-8
Everything has to be in that charset

More work on output:
```
    my $binary = Encode::encode('UTF-8',
        HTML::Entities::encode_entities($text, '<>&"')
    );
```
(This example does not encode control characters)

Overview

At first it appears to be really complex ..
.. but it can be simpler than you think
there are subtle complexities
you have to make assumptions (but try to mitigate them)
text strings != binary strings
decode binary to text as early as possible
encode text to binary as late as possible
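
The last two points can be sketched as a tiny filter. `shout` is a hypothetical example function, and UTF-8 on both sides is an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical filter: upper-case a UTF-8 byte string.
sub shout {
    my ($bytes_in) = @_;
    my $text = decode('UTF-8', $bytes_in);    # decode as early as possible
    $text = uc $text;                         # work on characters
    return encode('UTF-8', $text);            # encode as late as possible
}

my $out = shout("caf\xC3\xA9");    # UTF-8 bytes for "café"
```

Calling `uc` on the undecoded bytes would leave the two bytes of é untouched, since neither is a lower-case letter on its own.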

References

Perl documentation:
From Juerd Waalboer:
- Perl Unicode Tutorial (YAPC::Europe 2007)
- Perl Unicode Advice
Wikipedia:
- Unicode
- Character encoding