Use iconv to change character string encoding

Introduction

The iconv API converts strings between different character encodings by exposing the libiconv functionality embedded within Iguana. If you are already familiar with this, you will probably want to skip to the Examples using iconv below and take a look at our iconv API reference. For everyone else, please read on…

Task

Use iconv to change character string encoding.

The most common uses of iconv are converting incoming text from language-specific encodings into UTF-8 (Unicode), and converting from UTF-8 back to a language-specific encoding. Why would we want to convert text to UTF-8? First, UTF-8 is a standard Unicode encoding, and the Unicode character set is a superset of every single-language encoding, so any text can be represented in it. Second, internet browsers (like Chrome, Firefox etc.) recognize UTF-8 and display the characters correctly, which makes strings easier to work with in the Translator.

For example, if you receive some Spanish text encoded as CP1252 (Windows code page 1252), it might look like this: “El hardware inal\225mbrico no autorizado se puede introducir f\225cilmente.” Once converted to UTF-8 the “\225” is displayed as “á”, which is much easier to read: “El hardware inalámbrico no autorizado se puede introducir fácilmente.”

Tip: In this case we could also have converted directly into ISO-8859-1 (Western European) encoding, which covers most Western European languages and is very similar to CP1252.

However, UTF-8 is “safer” as it works as a target for any encoding, so if you are unsure just convert to UTF-8.

You can also use iconv to convert directly between different language encodings. Take care when doing this: most encodings are not fully compatible, because some characters in the source encoding are missing from the target encoding. For example, if you convert from Central European encodings (ISO-8859-2, CP1250, used for languages such as Polish, Czech, Slovak, Hungarian, Slovene, Bosnian, Croatian, Serbian etc.) to standard western encodings (ISO-8859-1, CP1252), a few characters are missing. iconv offers two workarounds: transliterate and ignore (append “//TRANSLIT” or “//IGNORE” to the target encoding when using our iconv.convert() function). With Hungarian, for instance, the long vowels Ő, ő, Ű, ű are missing; if you use the transliterate option they are converted to O, o, U, u respectively. Transliteration is usually the best option, but you can choose to ignore characters if that better suits your requirements.
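As a minimal sketch of how the two suffixes are attached, the helper below builds the target string that is passed to iconv.convert(). The name targetWithMode is hypothetical (it is not part of the iconv API), and the main() function only runs inside the Translator, where iconv and trace are available.

```lua
-- Hypothetical helper (not part of the iconv API): builds the target
-- encoding string, optionally with a "//TRANSLIT" or "//IGNORE" suffix.
function targetWithMode(Encoding, Mode)
   if Mode then
      return Encoding .. "//" .. Mode
   end
   return Encoding
end

function main()
   -- Hungarian long vowels that are missing from ISO-8859-1
   local Vowels = "Ő ő Ű ű"
   -- transliterate to the nearest equivalents
   local Translit = iconv.convert(Vowels, "UTF-8", targetWithMode("ISO-8859-1", "TRANSLIT"))
   -- or drop the characters that cannot be represented
   local Ignored = iconv.convert(Vowels, "UTF-8", targetWithMode("ISO-8859-1", "IGNORE"))
   trace(Translit)
   trace(Ignored)
end
```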

Note: You may find it more convenient to use UTF-8 as an intermediary, rather than translating directly to the final target encoding, because UTF-8 is easier to work with in Iguana Translator (as it displays correctly in the browser).

So basically three steps:

  1. Convert from your source encoding to UTF-8
  2. View and manipulate the UTF-8 encoded data in the Translator
  3. Convert the UTF-8 encoded data to the target encoding (or even multiple encodings) before transmitting/saving the data
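The three steps above can be sketched as a single function. This is only an illustration: viaUtf8 and its Edit callback are hypothetical names, and the only API call used is iconv.convert() as described on this page.

```lua
-- Hypothetical helper: routes a conversion through UTF-8 so the data can
-- be viewed and edited in the Translator before it is re-encoded.
function viaUtf8(Data, SourceEncoding, TargetEncoding, Edit)
   -- 1. convert from the source encoding to UTF-8
   local Utf8Data = iconv.convert(Data, SourceEncoding, "UTF-8")
   -- 2. view and manipulate the UTF-8 encoded data
   Utf8Data = Edit(Utf8Data)
   -- 3. convert the UTF-8 data to the target encoding before sending/saving
   return iconv.convert(Utf8Data, "UTF-8", TargetEncoding)
end

function main()
   -- "\225" is the CP1252 byte for "á"
   local Out = viaUtf8("f\225cilmente", "CP1252", "ISO-8859-1",
      function(Text) return Text .. " (revisado)" end)
   trace(Out)
end
```

Calling viaUtf8 once per target encoding covers the "multiple encodings" case mentioned in step 3.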

Some historical background (skip if you want):

A long time ago in a Galaxy far away there were ASCII (a 7 bit encoding system) and EBCDIC (a technically superior 8 bit encoding system). A great battle ensued and ASCII became the common standard and eventually evolved from a 7 bit (max 128 characters) to an 8 bit (max 256 characters) standard called Extended ASCII. Unfortunately 256 characters is insufficient for all characters in all languages so different “code pages” for individual languages were developed (see Windows code page).

This made for much pain (and switching between code pages) when dealing with multiple languages. The solution was to develop Unicode, which includes characters for all languages (it uses up to 32 bits, so it can handle a lot of characters…). There are various Unicode flavours: UTF-8, UTF-16 and UTF-32. UTF-8 and UTF-16 are “variable width” implementations using a minimum of 8 and 16 bits respectively; UTF-32 is fixed width and always uses 32 bits. The most commonly used is UTF-8 (probably because it uses the least space for mostly-ASCII text). All three flavours can represent the same characters, so text converts between them without loss; see Comparison of Unicode encodings for more information.

So now we all use Unicode and everything is simple! Unfortunately that is not the case, as many computer systems still use language or country dependent character encodings. So we need to convert between the different encodings, which is why iconv was developed. The iconv API comes from Unix (originally HP-UX); it is now included with most Linux distributions and is also available for Windows.

iconv functions

We have provided conversion functions to and from UTF-8 for the three most common encodings (ASCII, CP1252, and ISO-8859-1). For example iconv.ascii.dec() converts ASCII to UTF-8 and iconv.iso8859_1.enc() converts UTF-8 to ISO-8859-1.

For all other conversions you can use the iconv.convert() function, for example use iconv.convert("Hello", "ASCII", "cp1252") to convert ASCII to CP1252.

We also supply three utility functions:

  • iconv.list() – that returns a list of all character encodings understood by iconv
  • iconv.aliases(encoding) – that returns a list of all aliases for a specified encoding
  • iconv.supported(encoding) – that returns true if a specified encoding is supported, or false if it is not
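A short sketch of how iconv.supported() might guard a conversion. The helper name convertIfSupported and its error message are hypothetical, and it assumes plain encoding names without a “//TRANSLIT” or “//IGNORE” suffix, since this page does not say whether iconv.supported() accepts suffixed names.

```lua
-- Hypothetical helper: checks both encodings with iconv.supported()
-- before calling iconv.convert(). Returns nil plus a message on failure.
function convertIfSupported(Data, From, To)
   if not iconv.supported(From) then
      return nil, "unsupported encoding: " .. From
   end
   if not iconv.supported(To) then
      return nil, "unsupported encoding: " .. To
   end
   return iconv.convert(Data, From, To)
end

function main()
   local Result, Message = convertIfSupported("Hello", "ASCII", "CP1252")
   trace(Result or Message)
end
```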

Note: Those of you familiar with character encoding will probably spot the iconv.convert("Hello", "ASCII", "cp1252") example as a trivial conversion, because the source and result strings are identical. This is because both ASCII and CP1252 use the same byte-codes for alphabetic characters (as does UTF-8).

This screenshot demonstrates the point by representing “Hello” as byte codes in the second conversion:

However there are occasions with special characters where the byte codes are different for different encodings:
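A byte dump shows both cases. This sketch uses only standard Lua string functions; hexDump is a hypothetical helper name, not part of the iconv API.

```lua
-- Hypothetical helper: returns the bytes of a string as hex pairs.
function hexDump(Data)
   local Parts = {}
   for I = 1, #Data do
      Parts[#Parts + 1] = string.format("%02X", Data:byte(I))
   end
   return table.concat(Parts, " ")
end

-- "Hello" has the same bytes in ASCII, CP1252 and UTF-8
local HelloHex = hexDump("Hello")
-- the euro sign is one byte in CP1252 ("\128") ...
local EuroCp1252Hex = hexDump("\128")
-- ... but three bytes in UTF-8 ("\226\130\172")
local EuroUtf8Hex = hexDump("\226\130\172")
```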

Examples using iconv

I suggest that you copy the sample code from below (it contains all the examples used here), and work through it as you are reading this section.

First we will demonstrate converting the three common encodings (ASCII, CP1252, and ISO-8859-1) to/from UTF-8, then using iconv.convert() to convert between arbitrary encodings, and finally we will show you the three utility functions.

Converting between ASCII and UTF-8

Conversions between ASCII and UTF-8 are always trivial (the source and target strings are identical). This is because the first 128 characters of UTF-8 are the same as ASCII.

However UTF-8 is a variable length encoding that uses 1 to 4 bytes. So to prove that the UTF-8 results are single bytes you can also inspect the hex values in the Translator Editor:

However if you try to convert a non-ascii character like the euro symbol “€” then the conversion will fail:
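One way to anticipate that failure is to check whether a UTF-8 string contains only ASCII bytes before encoding it. A minimal sketch in plain Lua follows; isAscii is a hypothetical helper, not part of the iconv API.

```lua
-- Hypothetical helper: returns true when every byte is below 128.
-- Otherwise returns false and the position of the first non-ASCII byte.
function isAscii(Data)
   for I = 1, #Data do
      if Data:byte(I) > 127 then
         return false, I
      end
   end
   return true
end
```

Any UTF-8 string that passes this check is already valid ASCII, so converting it is trivial; a string that fails it contains at least one multi-byte character.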

If you need to deal with non-ASCII characters like the euro symbol “€” you must use the iconv.convert() function. You can choose between two options when converting: “transliteration” (//TRANSLIT), which converts to the nearest equivalent, or “ignore” (//IGNORE), which simply ignores the character.

However, you should realize that the transliteration option does not work in all cases. For example, if you try to transliterate a Greek letter like “Ω” (Omega) you will get an error, which makes sense as there is no similar letter in ASCII.
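If you want transliteration where possible and silent dropping otherwise, the two options can be combined. The sketch below assumes that a failed iconv.convert() raises an ordinary Lua error that pcall can catch; toAsciiLossy is a hypothetical helper name.

```lua
-- Hypothetical helper: tries "//TRANSLIT" first and falls back to
-- "//IGNORE" when transliteration raises an error.
-- Assumption: iconv.convert() signals failure with a Lua error.
function toAsciiLossy(Utf8Data)
   local Ok, Result = pcall(iconv.convert, Utf8Data, "UTF-8", "ASCII//TRANSLIT")
   if Ok then
      return Result, "translit"
   end
   return iconv.convert(Utf8Data, "UTF-8", "ASCII//IGNORE"), "ignore"
end

function main()
   local Text, How = toAsciiLossy("Ω Greek capital Omega")
   trace(Text)
   trace(How)
end
```

The second return value records which option was actually used, which is useful when you need to log lossy conversions.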

Converting between CP1252 and UTF-8

Conversions between the first 128 characters of CP1252 and UTF-8 are always trivial. This is because the first 128 characters of CP1252 are the same as ASCII, and therefore they are the same as the first 128 characters in UTF-8.

So our conversion of “Hello” is once again trivial, as we again show by inspecting the hex values:

Conversion of the second 128 characters is, however, non-trivial (the source and target strings are different), as UTF-8 uses multi-byte encodings for all characters after the first 128.

This means that our second conversion, of the euro symbol “€”, is non-trivial (the internal representation in UTF-8 is different). The simplest way to show this is to inspect the hex codes as we did above. It is also necessary to use “\128” in our source string, which is the decimal code for “€” in CP1252. As you can see, the single-byte CP1252 representation (hex 80) is converted into a 3-byte representation (hex E2 82 AC) in UTF-8.

Tip: This is exactly what we would expect.

Consider the encoding of the Euro sign, €, as described on the Wikipedia UTF-8 page:

  1. The Unicode code point for “€” is U+20AC.
  2. According to the scheme table, this will take three bytes to encode, since it is between U+0800 and U+FFFF.
    The scheme table shows UTF-8 as it is since 2003 (the x characters are replaced by the bits of the code point):

    Number    Bits for    First       Last
    of bytes  code point  code point  code point  Byte 1    Byte 2    Byte 3    Byte 4
    1         7           U+0000      U+007F      0xxxxxxx
    2         11          U+0080      U+07FF      110xxxxx  10xxxxxx
    3         16          U+0800      U+FFFF      1110xxxx  10xxxxxx  10xxxxxx
    4         21          U+10000     U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
  3. Hexadecimal 20AC is binary 0010 0000 1010 1100. The two leading zeros are added because, as the scheme table shows, a three-byte encoding needs exactly sixteen bits from the code point.
  4. Because the encoding will be three bytes long, its leading byte starts with three 1s, then a 0 (1110...)
  5. The first four bits of the code point are stored in the remaining low order four bits of this byte (1110 0010), leaving 12 bits of the code point yet to be encoded (...0000 1010 1100).
  6. All continuation bytes contain exactly six bits from the code point. So the next six bits of the code point are stored in the low order six bits of the next byte, and 10 is stored in the high order two bits to mark it as a continuation byte (so 1000 0010).
  7. Finally the last six bits of the code point are stored in the low order six bits of the final byte, and again 10 is stored in the high order two bits (1010 1100).

The three bytes 1110 0010 1000 0010 1010 1100 can be more concisely written in hexadecimal, as E2 82 AC.

The following table summarizes this conversion, as well as others with different lengths in UTF-8.

Character  Code point  Binary code point           Binary UTF-8                         Hexadecimal UTF-8
$          U+0024      010 0100                    00100100                             24
¢          U+00A2      000 1010 0010               11000010 10100010                    C2 A2
€          U+20AC      0010 0000 1010 1100         11100010 10000010 10101100           E2 82 AC
𐍈          U+10348     0 0001 0000 0011 0100 1000  11110000 10010000 10001101 10001000  F0 90 8D 88
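The encoding scheme can also be written out in a few lines of Lua. This is a sketch for illustration only: utf8Encode is a hypothetical helper (Iguana's iconv does this work for you), and it uses division and modulo for the bit manipulation because it targets a Lua version without bitwise operators.

```lua
-- Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
-- Dividing by 0x40 (64) shifts right by six bits; "% 0x40" keeps the
-- low six bits that go into each continuation byte (10xxxxxx).
function utf8Encode(Code)
   if Code < 0x80 then
      -- 1 byte: 0xxxxxxx
      return string.char(Code)
   elseif Code < 0x800 then
      -- 2 bytes: 110xxxxx 10xxxxxx
      return string.char(0xC0 + math.floor(Code / 0x40),
                         0x80 + Code % 0x40)
   elseif Code < 0x10000 then
      -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
      return string.char(0xE0 + math.floor(Code / 0x1000),
                         0x80 + math.floor(Code / 0x40) % 0x40,
                         0x80 + Code % 0x40)
   else
      -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      return string.char(0xF0 + math.floor(Code / 0x40000),
                         0x80 + math.floor(Code / 0x1000) % 0x40,
                         0x80 + math.floor(Code / 0x40) % 0x40,
                         0x80 + Code % 0x40)
   end
end

-- the euro sign U+20AC becomes the three bytes E2 82 AC
local Euro = utf8Encode(0x20AC)
```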

So why can’t we just use the euro symbol “€” instead of “\128” in the source string? (Yes, I made this mistake.) The reason is simple enough: the browser (and hence the Translator) stores “€” as a UTF-8 character of 3 bytes (hex E2 82 AC), and the CP1252 conversion treats each of those bytes as a separate CP1252 character and converts each one into its own UTF-8 character. The result is therefore three characters, “â‚¬”, which is not what we intended.

Once again if you try to convert a non-CP1252 character like the “Ā” (A with macron) then the conversion will fail:

If you need to deal with non-CP1252 characters like the “Ā” (A with macron) you must use the iconv.convert() function. You can choose between two options when converting: “transliteration” (//TRANSLIT), which converts to the nearest equivalent, or “ignore” (//IGNORE), which simply ignores the character.

However, you should realize that the transliteration option does not work in all cases. For example, if you try to transliterate a Greek letter like “Ω” (Omega) you will get an error, which makes sense as there is no similar letter in CP1252.

Converting between ISO-8859-1 and UTF-8

Conversions of printable characters in the first 128 code positions of ISO-8859-1 to and from UTF-8 are always trivial (the source and target strings are identical). This is because the printable characters in the first 128 positions of ISO-8859-1 are the same as ASCII (the non-printable ASCII characters are not defined in ISO-8859-1).

So our conversion of “Hello” is once again trivial, as we can see by inspecting the hex values:

However, if you try to convert a non-ISO-8859-1 character like the euro symbol “€” then the conversion will fail:

If you need to deal with non-ISO-8859-1 characters like the euro symbol “€” you must use the iconv.convert() function. You can choose between two options when converting: “transliteration” (//TRANSLIT), which converts to the nearest equivalent, or “ignore” (//IGNORE), which simply ignores the character.

However, you should realize that the transliteration option does not work in all cases. For example, if you try to transliterate a Greek letter like “Ω” (Omega) you will get an error, which makes sense as there is no similar letter in ISO-8859-1.

Once again, conversion of the second 128 characters is non-trivial (the source and target strings are different), as UTF-8 uses multi-byte encodings for all characters after the first 128.

Let’s use the plus-minus symbol “±” as an example of a non-trivial conversion, and inspect the hex codes (as we did for “€” with CP1252 above). It is also necessary to use “\177” in our source string, which is the decimal code for “±” in ISO-8859-1. As you can see, the single-byte ISO-8859-1 representation (hex B1) is converted into a 2-byte representation (hex C2 B1) in UTF-8.

So why can’t we just use the plus-minus symbol “±” instead of “\177” in the source string? (Yes, I made this mistake.) The reason is simple enough: the browser (and hence the Translator) stores “±” as a UTF-8 character of 2 bytes (hex C2 B1), and the ISO-8859-1 conversion treats each of those bytes as a separate ISO-8859-1 character and converts each one into its own UTF-8 character. The result is therefore two characters, “Â±”, which is not what we intended.
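The double conversion is easy to reproduce in plain Lua, because every ISO-8859-1 byte maps to the Unicode code point with the same number. The function below is a hypothetical stand-in that mimics what an ISO-8859-1 to UTF-8 decode does; it is not how iconv is implemented.

```lua
-- Hypothetical stand-in for an ISO-8859-1 to UTF-8 decode: bytes below
-- 128 are copied, every other byte becomes a two-byte UTF-8 sequence.
function latin1ToUtf8(Data)
   return (Data:gsub("[\128-\255]", function(Char)
      local Code = Char:byte()
      return string.char(0xC0 + math.floor(Code / 0x40), 0x80 + Code % 0x40)
   end))
end

-- correct: the single ISO-8859-1 byte B1 becomes C2 B1 ("±" in UTF-8)
local Correct = latin1ToUtf8("\177")
-- mistake: the two UTF-8 bytes C2 B1 are each decoded again,
-- giving C3 82 C2 B1, which displays as "Â±"
local Mistake = latin1ToUtf8("\194\177")
```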

Convert between other encodings

You can use iconv.convert() to convert between different Unicode encodings like UTF-8, UTF-16 and UTF-32. As you can see, the results encoded in UTF-16 and UTF-32 look completely different.

As expected converting back from the UTF-16 and UTF-32 strings gives us the original UTF-8 string.

You can also convert directly between non-Unicode encodings. If you use a character from the source encoding that does not exist in the target encoding, the conversion will fail.

You can handle this using the transliteration and ignore options.

However, transliteration does not always work, as there may not be a similar character to map to. For example, transliterating the currency sign “¤” from CP1252 to ASCII will fail.

Example Code

function main()   

   --------------------------------------------------
   -- ASCII conversions
   --------------------------------------------------
   
   local AsciiData = iconv.ascii.enc("Hello")
   local Utf8Data = iconv.ascii.dec("Hello")
   
   -- notice that both of these are "trivial" conversions as the 
   -- source and target strings are identical
   -- trivial: ascii > utf-8 (inspect the byte code of each character)
   Utf8Data:sub(1,1):byte()
   Utf8Data:sub(2,2):byte()
   Utf8Data:sub(3,3):byte() 
   Utf8Data:sub(4,4):byte()
   Utf8Data:sub(5,5):byte()
   -- trivial: utf-8 > ascii
   AsciiData:sub(1,1):byte()
   AsciiData:sub(2,2):byte()
   AsciiData:sub(3,3):byte() 
   AsciiData:sub(4,4):byte()
   AsciiData:sub(5,5):byte()

   -- trying to convert a non-ascii character "€" gives an error
   ------ uncomment next line to demonstrate error ------
   -- local AsciiData = iconv.ascii.enc("Price: €100.00") -- error 

   -- Use iconv.convert() to handle non-ascii characters like "€"
   -- the "transliterate" option converts "€" to "EUR"
   local AsciiData = iconv.convert("Price: €100.00", "UTF-8", "ASCII//TRANSLIT")
   -- the "ignore" option simply ignores the euro sign "€" 
   local AsciiData = iconv.convert("Price: €100.00", "UTF-8", "ASCII//IGNORE")
   
   -- trying to convert the Greek "Ω" Omega works with ignore but fails with transliteration 
   local AsciiData = iconv.convert("Ω Greek capital Omega", "UTF-8", "ASCII//IGNORE")
   ------ uncomment next line to demonstrate error ------
   -- local AsciiData = iconv.convert("Ω Greek capital Omega", "UTF-8", "ASCII//TRANSLIT") -- error
      
   --------------------------------------------------
   -- CP1252 conversions
   --------------------------------------------------
   
   -- notice that both of these are "trivial" conversions as the 
   -- source and target strings are identical (just as they are with ASCII)
   local cp1252Data = iconv.cp1252.enc("Hello")
   local Utf8Data = iconv.cp1252.dec("Hello")

   -- the "€" character is included in CP1252 as "\128" 
   -- so iconv.convert() is not needed
   local Utf8Data = iconv.cp1252.dec("\128100.00")
 
   -- this doesn't work as the browser recognises "€" as UTF-8
   -- so you are actually converting the 3 byte UTF-8 representation
   -- into 3 UTF-8 characters - not at all what was intended
   local Utf8Data = iconv.cp1252.dec("€")
   
   -- this is what actually happens internally when we represent each 
   -- of the 3 bytes in decimal format hex E2=226, hex 82=130, hex AC=172
   local Utf8Data = iconv.cp1252.dec("\226\130\172")
 
   -- trying to convert a non-CP1252 character "Ā" (A with macron) gives an error
   ------ uncomment next line to demonstrate error ------
   -- local cp1252Data = iconv.cp1252.enc("Ā latin A with macron") -- error
  
   -- Use iconv.convert() to handle non-CP1252 characters like "Ā" (A with macron)
   -- the "transliterate" option converts "Ā" to "A"
   local cp1252Data = iconv.convert("Ā latin A with macron", "UTF-8", "CP1252//TRANSLIT")
   -- the "ignore" option simply ignores the "Ā" 
   local cp1252Data = iconv.convert("Ā latin A with macron", "UTF-8", "CP1252//IGNORE")

   -- trying to convert the Greek "Ω" Omega works with ignore but fails with transliteration 
   local cp1252Data = iconv.convert("Ω Greek capital Omega", "UTF-8", "CP1252//IGNORE")
   ------ uncomment next line to demonstrate error ------
   -- local cp1252Data = iconv.convert("Ω Greek capital Omega", "UTF-8", "CP1252//TRANSLIT") -- error
   
   --------------------------------------------------
   -- ISO 8859_1 conversions
   --------------------------------------------------
   
   -- notice that both of these are "trivial" conversions as the 
   -- source and target strings are identical (like ASCII and CP1252)   
   local iso8859_1Data = iconv.iso8859_1.enc("Hello")
   local Utf8Data = iconv.iso8859_1.dec("Hello")
  
   -- trying to convert a non-ISO-8859-1 character "€" gives an error
   ------ uncomment next line to demonstrate error ------
   -- local iso8859_1Data = iconv.iso8859_1.enc("Price: €100.00") -- error

   -- Use iconv.convert() to handle non-ISO-8859-1 characters like "€"
   -- the "transliterate" option converts "€" to "EUR"
   local iso8859_1Data = iconv.convert("Price: €100.00", "UTF-8", "ISO8859-1//TRANSLIT")
   -- the "ignore" option simply ignores the euro sign "€" 
   local iso8859_1Data = iconv.convert("Price: €100.00", "UTF-8", "ISO8859-1//IGNORE")
   
   -- trying to convert the Greek "Ω" Omega works with ignore but fails with transliteration 
   local iso8859_1Data = iconv.convert("Ω Greek capital Omega", "UTF-8", "ISO8859-1//IGNORE")
   ------ uncomment next line to demonstrate error ------
   -- local iso8859_1Data = iconv.convert("Ω Greek capital Omega", "UTF-8", "ISO8859-1//TRANSLIT") -- error

   -- the "±" character is included in ISO-8859-1 as "\177" 
   -- so iconv.convert() is not needed   
   local Utf8Data = iconv.iso8859_1.dec("\177 plus-minus sign")
 
   -- this doesn't work as the browser recognises "±" as UTF-8
   -- so you are actually converting the 2 byte UTF-8 representation
   -- into 2 UTF-8 characters - not at all what was intended
   local Utf8Data = iconv.iso8859_1.dec("±")
   
   -- this is what actually happens internally when we represent each 
   -- of the 2 bytes in decimal format hex C2=194, hex B1=177
   local Utf8Data = iconv.iso8859_1.dec("\194\177")

   --------------------------------------------------
   -- convert between other encodings
   --------------------------------------------------
   
   -- you can convert between different Unicode encoding like UTF-8, UTF-16 and UTF-32
   local Utf16Data = iconv.convert("Unicode: ȸ ȹ Ѿ א Չ ن","UTF-8","UTF-16")
   local Utf32Data = iconv.convert("Unicode: ȸ ȹ Ѿ א Չ ن", "UTF-8", "UTF-32")
   
   -- converting them back produces the original string
   local Utf8Data = iconv.convert(Utf16Data,"UTF-16","UTF-8")
   trace(Utf8Data)
   local Utf8Data = iconv.convert(Utf32Data,"UTF-32","UTF-8")
   trace(Utf8Data)
   
   -- you can also convert directly between non-Unicode encodings
   local AsciiData = iconv.convert("This string works","CP1252","ASCII")
   -- trying to convert the euro sign "€" (="\128" code in CP1252)
   ------ uncomment next line to demonstrate error ------
   -- local AsciiData = iconv.convert("This string fails \128","CP1252","ASCII") -- error
   
   -- unless you transliterate or ignore
   local AsciiData = iconv.convert("Works using transliterate \128","CP1252","ASCII//TRANSLIT")
   local AsciiData = iconv.convert("Works using ignore \128","CP1252","ASCII//IGNORE")
   
   -- transliteration does not always work, for example with the currency sign "¤" (\164)
   ------ uncomment next line to demonstrate error ------
   --local AsciiData = iconv.convert("Transliterate \164","CP1252","ASCII//TRANSLIT") -- error
   
end

More Information