Melbourne Perl Mongers - August 2008
- Communicate information.
- Transmitted as bytes.
- Information -> bytes -> Information
- Many representations are well known:
- Images - PNG, JPEG, GIF, etc...
- Audio - WAV, MP3, etc...
- Video - MPEG, AVI, etc...
- Text - not so well known/understood
What about text?
- In the 1960s, ASCII and EBCDIC were the first standards
- Now lots more: ISO 8859-1, ISO 8859-2, ISO 8859-3, ISO 8859-4, ISO
8859-5, ISO 8859-6, ISO 8859-7, ISO 8859-8, ISO 8859-9, ISO 8859-10, ISO
8859-11, ISO 8859-13, ISO 8859-14, ISO 8859-15, ISO 8859-16, CP437,
CP737, CP850, CP852, CP855, CP857, CP858, CP860, CP861, CP863, CP865,
CP866, CP869, Windows-1250, Windows-1251, Windows-1252, Windows-1253,
Windows-1254, Windows-1255, Windows-1256, Windows-1257, Windows-1258,
Mac OS Roman, KOI8-R, KOI8-U, KOI7, MIK, Cork or T1, ISCII, VISCII,
HKSCS, GB2312, GB18030, Shift JIS, EUC-KR, ISO-2022, UTF-8, UTF-16, etc...
- Known as 'character encoding'
- Character set - list of supported characters
- Encoding is how to represent each one of them
|Chinese character for 'dog'||狗|
|Unicode character number||72D7 (hex)||29399 (decimal)|
|HTML entity||&#x72D7;||&#29399;|
|Three bytes in UTF-8||E7 8B 97|
|Two bytes in UTF-16 (big-endian)||72 D7|
|Two bytes in GB2312 (Chinese)||B9 B7|
|Two bytes in Shift JIS (Japanese)||8B E7|
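The rows of the table can be reproduced with the core Encode module. This is a minimal sketch: `hex_bytes` is a hypothetical helper written for this example, and `euc-cn` and `shiftjis` are the names under which Encode provides GB2312 and Shift JIS.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the bytes of $text in the named encoding as space-separated hex.
sub hex_bytes {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $dog = "\x{72D7}";    # U+72D7, the character from the table

# One character, four different byte sequences.
for my $encoding ('UTF-8', 'UTF-16BE', 'euc-cn', 'shiftjis') {
    printf "%-9s %s\n", $encoding, hex_bytes($encoding, $dog);
}
```

The character is written as a code point escape so the example does not depend on how the source file itself is encoded.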
- For years I thought the answer was:
use utf8;
- but this only declares the encoding of the source code
- The real answer is:
use Encode;
- functions to convert to or from pretty much any character encoding
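A small sketch of the difference between declaring the source encoding and actually converting data. It assumes this file is saved as UTF-8; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Without the utf8 pragma, a UTF-8 source literal is a string of raw bytes.
my $as_bytes = do { no utf8; "狗" };

# With the pragma, the same literal is a single character.
my $as_text = do { use utf8; "狗" };

printf "no utf8:  length %d\n", length $as_bytes;
printf "use utf8: length %d\n", length $as_text;

# Encode does the real work: converting text to bytes (and back).
my $encoded = encode('UTF-8', $as_text);
print "encoded text equals the raw source bytes\n" if $encoded eq $as_bytes;
```

The pragma only affects literals in the source file. Data arriving from files, sockets or databases still has to go through Encode (or an I/O layer).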
- Different things use different terms:
- encoding, character encoding, charset, character set, ...
- May mean the same thing, but might not
A key concept
- Perl strings can be anything
- - internally they are stored as Unicode if they need to be
- but you need to think of them in two ways:
- - binary strings
- - text strings
- You need to know which encoding to use
- Use the wrong one and the text will be broken
- Unfortunately this means you need to make assumptions
- - your assumption depends on what you are doing
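To make "the text will be broken" concrete, here is a minimal sketch that decodes the same two bytes with the right and the wrong encoding. The byte values are chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA9";    # UTF-8 encoding of U+00E9 (e with acute accent)

my $right = decode('UTF-8',      $bytes);    # one character
my $wrong = decode('ISO-8859-1', $bytes);    # two unrelated characters

# Print code points rather than the characters themselves, so the output
# does not depend on the terminal's own encoding.
for my $pair ([right => $right], [wrong => $wrong]) {
    my ($label, $text) = @$pair;
    printf "%-5s %s\n", $label,
        join ' ', map { sprintf 'U+%04X', ord } split //, $text;
}
```

Neither call fails: ISO-8859-1 accepts any byte sequence, so the wrong assumption goes unnoticed until someone looks at the text.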
Reading an email
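use Encode ();
use MIME::Parser;
my $entity = MIME::Parser->new()->parse_data($message);
# The arguments to the next two calls were cut off in the source; these are assumed completions
my $enc = Encode::find_encoding($entity->head->mime_attr('content-type.charset'));
my $text = $enc->decode($entity->bodyhandle->as_string);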
- (assume that the sender built the message correctly)
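When the sender did not build the message correctly, the declared charset may be missing or unknown, and `Encode::find_encoding` then returns undef. One possible mitigation is sketched below. `decode_with_fallback` is a hypothetical helper, and ISO-8859-1 as the default fallback is an assumption, not a recommendation from the talk.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Decode $bytes using $charset if Encode knows it, otherwise use $fallback.
sub decode_with_fallback {
    my ($bytes, $charset, $fallback) = @_;
    $fallback //= 'ISO-8859-1';    # assumed default for this sketch

    my $enc = defined $charset ? find_encoding($charset) : undef;
    $enc //= find_encoding($fallback);

    return $enc->decode($bytes);
}

my $known   = decode_with_fallback("\xE7\x8B\x97", 'UTF-8');
my $unknown = decode_with_fallback("\xE9", 'x-no-such-charset');

printf "known charset:   U+%04X\n", ord $known;
printf "unknown charset: U+%04X\n", ord $unknown;
```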
Writing an email
my $entity = MIME::Entity->build(
    'Subject'  => Encode::encode('MIME-Header', $subject),
    'Data'     => Encode::encode('UTF-8', $body),
    'Charset'  => 'UTF-8',
    'Encoding' => '-SUGGEST',
);
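The `MIME-Header` encoding used for the subject can be tried on its own, since it ships with Encode. This sketch round-trips a sample subject. The exact encoded form can differ between Encode versions, so the example prints it rather than assuming it.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $subject = "Dog: \x{72D7}";    # sample text, not from the talk

# Text string -> ASCII-only header value (an RFC 2047 encoded-word).
my $header = encode('MIME-Header', $subject);

# Header value -> text string again.
my $back = decode('MIME-Header', $header);

print "$header\n";
print "round trip ok\n" if $back eq $subject;
```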
- Potentially more problems
- Browsers might behave differently
- Users can override page charset
- Set as an HTTP header:
Content-Type: text/html; charset=ISO-8859-1
- as a meta element:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
- or for XHTML:
<?xml version="1.0" encoding="ISO-8859-1"?>
- Browser should display the page using the specified charset
- But users can override this
- Form submissions will be sent back to the server in the specified charset
- But not if the user has overridden it
- And different browsers degrade in different ways for characters not
supported by the charset
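On the server side, the way unsupported characters degrade can at least be made predictable. The sketch below encodes a page body for a declared charset and uses Encode's `FB_HTMLCREF` fallback, which turns characters the charset cannot represent into numeric character references. `render_page` is a hypothetical helper, and the response it builds is deliberately minimal.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build a minimal HTTP response whose body matches the declared charset.
sub render_page {
    my ($text, $charset) = @_;

    # Characters missing from $charset become &#NNN; references.
    my $body = encode($charset, $text, Encode::FB_HTMLCREF);

    return "Content-Type: text/html; charset=$charset\r\n\r\n" . $body;
}

print render_page("dog: \x{72D7}", 'ISO-8859-1'), "\n";
```

The header and the body are produced from the same `$charset` value, so they cannot drift apart.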
- Two approaches:
- - encode everything
- - only encode unsafe characters
- At first appears to be really complex ..
- .. but can be simpler than you think
- subtle complexities
- have to assume (but try to mitigate)
- text strings != binary strings
- decode binary to text as early as possible
- encode text to binary as late as possible
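The "decode early, encode late" rule can be sketched as a function that takes bytes in, works on text, and hands bytes back out. `shout` is a hypothetical example function, and UTF-8 on both sides is an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

sub shout {
    my ($bytes_in) = @_;
    my $text = decode('UTF-8', $bytes_in);    # decode as early as possible
    $text = uc $text;                         # work on characters, not bytes
    return encode('UTF-8', $text);            # encode as late as possible
}

my $in = "caf\xC3\xA9";    # UTF-8 bytes for "cafe" with an accented e

# Print bytes as hex so the output is terminal-independent.
my $hex = sub { join ' ', map { sprintf '%02X', ord } split //, shift };

print "uc on bytes: ", $hex->(uc $in),     "\n";
print "uc on text:  ", $hex->(shout($in)), "\n";
```

Applied directly to the bytes, `uc` only changes the ASCII letters. Applied to the decoded text, it also uppercases the accented character.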
- Perl documentation:
- From Juerd Waalboer: