This topic contains 0 replies, has 1 voice, and was last updated by  lev 10 years, 3 months ago.

Thai cp874 (Windows 874) to UTF-8

  • Iguana supports text encoded in UTF-8. Unfortunately, some systems use cp874 (Windows 874), so Thai-language users cannot pipe their inbound data into Iguana because of the encoding mismatch.

    It turns out to be easy to transcode from cp874 to UTF-8 using the fairly trivial module offered below.

    Say your inbound data comes in files encoded in cp874.

    In the Translator script of the channel's Source component, read the file content in binary mode (i.e. mode 'rb').

    Pipe the file content through the module below by calling the function cp874toUTF8():

    utf8sOutData = transcode.cp874toUTF8(cp874InData)

    Push the UTF-8 result to the Iguana queue by calling queue.push{}, for normal further processing by the channel's Filter or Destination component.
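
    The three steps above could be sketched roughly as follows. This is only an illustration: readBinary and processFile are made-up helper names, the file path is a placeholder, and the commented-out main() assumes that queue.push takes a table with a data field. The transcoder and the push function are passed in as parameters so the helpers can also be tried outside Iguana.

```lua
-- Read a whole file as raw bytes; 'rb' prevents newline translation on Windows.
local function readBinary(path)
   local f = assert(io.open(path, 'rb'))
   local data = f:read('*a')
   f:close()
   return data
end

-- Hypothetical helper: read one cp874 file, transcode it, and hand the
-- UTF-8 result to `push`. Returns the transcoded text as well.
local function processFile(path, toUTF8, push)
   local cp874InData = readBinary(path)
   local utf8sOutData = toUTF8(cp874InData)
   push(utf8sOutData)
   return utf8sOutData
end

-- Inside an Iguana Source Translator the call might look like this
-- (assumed wiring, placeholder path):
-- function main()
--    processFile('/path/to/inbound.txt', transcode.cp874toUTF8,
--       function(msg) queue.push{data=msg} end)
-- end
```

    Keeping the transcoder as a parameter also makes it easy to swap in a different code page converter later without touching the file handling.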

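    transcode={}
    
    -- Replace every byte of Data that has an entry in Map with the mapped string;
    -- all other bytes are copied through unchanged.
    function transcode.convert(Data, Map)
       local j = 1
       local p = {}
    
       for i=1, #Data do
          local C = Map[Data:byte(i)]
          if C then
             p[#p+1] = Data:sub(j,i-1)
             p[#p+1] = C
             j = i + 1
          end
       end
       p[#p+1] = Data:sub(j,#Data)
       return table.concat(p)
    end
    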
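    function transcode.cp874toUTF8(D)
       -- Two-byte prefixes of the Thai block in UTF-8:
       -- U+0E00..U+0E3F encode as E0 B8 xx, U+0E40..U+0E7F as E0 B9 xx.
       local d1=string.char(0xe0, 0xb8)
       local d2=string.char(0xe0, 0xb9)
    
       -- cp874 bytes outside the Thai block, mapped to complete UTF-8 sequences.
       local cp874CodeSet={
          [0x80]='\226\130\172',  -- U+20AC euro sign
          [0x85]='\226\128\166',  -- U+2026 horizontal ellipsis
          [0x91]='\226\128\152',  -- U+2018 left single quotation mark
          [0x92]='\226\128\153',  -- U+2019 right single quotation mark
          [0x93]='\226\128\156',  -- U+201C left double quotation mark
          [0x94]='\226\128\157',  -- U+201D right double quotation mark
          [0x95]='\226\128\162',  -- U+2022 bullet
          [0x96]='\226\128\147',  -- U+2013 en dash
          [0x97]='\226\128\148',  -- U+2014 em dash
          [0xa0]='\194\160'       -- U+00A0 no-break space
       }
       
       local out={}
       for i=1,#D do
          local b=string.byte(D,i)
          if b < 0x80 then
             out[#out+1]=string.char(b)                  -- ASCII passes through
          elseif cp874CodeSet[b] then
             out[#out+1]=transcode.convert(string.char(b),cp874CodeSet)
          elseif b > 0xa0 and b < 0xe0 then
             out[#out+1]=d1..string.char(b-0x20)         -- first half of Thai block
          elseif b > 0xdf and b < 0xfc then
             out[#out+1]=d2..string.char(b-0x60)         -- second half of Thai block
          end
          -- Any other byte (unmapped 0x81..0x9f, 0xfc..0xff) is dropped.
       end
       
       return table.concat(out)
    end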
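
    The subtractions of 0x20 and 0x60 in cp874toUTF8() can be checked against the general UTF-8 encoding rule. Below is a minimal sketch, assuming the standard Windows-874 layout in which bytes 0xA1..0xDA and 0xDF..0xFB correspond to U+0E01..U+0E3A and U+0E3F..U+0E5B (that is, code point = byte + 0x0D60). The helper names utf8Encode3, thaiShortcut and checkRange are made up for this illustration and are not part of the module.

```lua
-- Encode a code point in U+0800..U+FFFF as three UTF-8 bytes
-- (plain arithmetic, so it also runs on Lua 5.1, which has no bit operators).
local function utf8Encode3(cp)
   local b1 = 0xE0 + math.floor(cp / 0x1000)
   local b2 = 0x80 + math.floor(cp / 0x40) % 0x40
   local b3 = 0x80 + cp % 0x40
   return string.char(b1, b2, b3)
end

-- The shortcut: a fixed two-byte prefix plus the cp874 byte minus an offset.
local function thaiShortcut(b)
   if b < 0xE0 then
      return string.char(0xE0, 0xB8, b - 0x20)
   end
   return string.char(0xE0, 0xB9, b - 0x60)
end

-- Compare both routes for every byte in the given cp874 range.
local function checkRange(first, last)
   for b = first, last do
      assert(thaiShortcut(b) == utf8Encode3(b + 0x0D60),
         string.format('mismatch at byte 0x%02X', b))
   end
end

checkRange(0xA1, 0xDA)
checkRange(0xDF, 0xFB)
```

    The same encoder can be used to check the punctuation bytes: utf8Encode3(0x20AC) yields the bytes E2 82 AC for the euro sign that cp874 stores at 0x80.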

    Let me know if it works for you. Comments needed. Happy transcoding.
