Parsing CSV Files

This code is designed to handle well formed CSV files. However CSV implementations can vary, so you should test it using realistic samples.

Please contact support at support@interfaceware.com if you need further help.

Sample Code [top]

You can download the Parse_CSV_To_Translator.zip project file (code and test messages) or copy the code from below.

Code for main():

local csv = require 'csv'

function main(Data)
   local Csv = csv.parseCsv(Data)       -- comma separated (default)
   --local Csv = csv.parseCsv(Data, '\t') -- tab separated (sample message 11)
   --local Csv = csv.parseCsv(Data, '|')  -- bar separated (sample message 12)
   trace(Csv)
   
   -- Examples of what you can do:
   
   -- 1) Use in a To Translator and add code
   --    for saving patients to a database
   
   -- 2) Use in a Filter Component and map to
   --    XML/JSON then queue for further processing
end

Code for the module:

-- module is likely to be re-used so use a descriptive name like "csv"

local function parseCsvLine (line,sep) 
   local res = {}
   local pos = 1
   sep = sep or ','
   while true do 
      local c = string.sub(line,pos,pos)
      if (c == "") then break end
      local posn = pos 
      local ctest = string.sub(line,pos,pos)
      trace(ctest)
      while ctest == ' ' do
         -- handle space(s) at the start of the line (with quoted values)
         posn = posn + 1
         ctest = string.sub(line,posn,posn) 
         if ctest == '"' then
            pos = posn
            c = ctest
         end
      end
      if (c == '"') then
         -- quoted value (ignore separator within)
         local txt = ""
         repeat
            local startp,endp = string.find(line,'^%b""',pos)
            txt = txt..string.sub(line,startp+1,endp-1)
            pos = endp + 1
            c = string.sub(line,pos,pos) 
            if (c == '"') then 
               txt = txt..'"' 
               -- check first char AFTER quoted string, if it is another
               -- quoted string without separator, then append it
               -- this is the way to "escape" the quote char in a quote. example:
               --   value1,"blub""blip""boing",value3  will result in blub"blip"boing  for the middle
            elseif c == ' ' then
               -- handle space(s) before the delimiter (with quoted values)
               while c == ' ' do
                  pos = pos + 1
                  c = string.sub(line,pos,pos) 
               end
            end
         until (c ~= '"')
         table.insert(res,txt)
         trace(c,pos,i)
         if not (c == sep or c == "") then 
            error("ERROR: Invalid CSV field - near character "..pos.." in this line of the CSV file: \n"..line, 3)
         end
         pos = pos + 1
         posn = pos 
         ctest = string.sub(line,pos,pos)
         trace(ctest)
         while ctest == ' ' do
            -- handle space(s) after the delimiter (with quoted values)
            posn = posn + 1
            ctest = string.sub(line,posn,posn) 
            if ctest == '"' then
               pos = posn
               c = ctest
            end
         end
      else    
         -- no quotes used, just look for the first separator
         local startp,endp = string.find(line,sep,pos)
         if (startp) then 
            table.insert(res,string.sub(line,pos,startp-1))
            pos = endp + 1
         else
            -- no separator found -> use rest of string and terminate
            table.insert(res,string.sub(line,pos))
            break
         end 
      end
   end
   return res
end

------------------------------------
---- Module Interface functions ----
------------------------------------
local csv = {}

function csv.parseCsv(Data, Separator)
   -- handle '\r\n\' as line separator
   Data = Data:gsub('\r\n','\n')
   -- handle '\r' (bad form) as line separator  
   Data = Data:gsub('\r','\n')
   local Result={}
   
   for Line in Data:gmatch("([^\n]+)") do
      local ParsedLine = parseCsvLine(Line, Separator)
      table.insert(Result, ParsedLine)
   end
   
   return Result
end

return csv

Using the code [top]

  • This code would usually be used in a Filter or Destination component script, for example:
    • You could save patient details to a  database table in a Destination component
    • You could map patient details to XML/JSON in Filter and then queue for further processing (either in the Destination or by forwarding to another channel)
  • You will need to write From Translator code to read your CSV files and push them onto the Iguana queue for processing in your Filter or To Translator component
  • Because the code is a good candidate for re-use by other channels, give it a descriptive name, we used “csv”
    Note: If you already have a module called “csv” you can add this code (or use a different module name).

How it works [top]

  • The code reads CSV messages from the queue and then parses them using the csv.parseCsv() function
  • There are definite advantages loading and queuing the CSV data files in the From Component, and then processing the CSV messages in a Filter or Destination component:
    • The main advantage is that you can reprocess CSV messages using the Iguana log rather than having to reload the original files
    • Separating the file handling and parsing/processing also keeps the code cleaner and easier to maintain
  • The parsing code is based on this code from Lua user’s wiki, with the following additions:
    • We changed the assert to an if statement to give a more informative error
    • We changed the code to allow for spaces before/after separators when using quoted fields
    • We changed the code to allow for spaces at the start of a line when using quoted fields
    • We wrapped the parseCsvLine() function in the parseCsv() function that accepts multiple lines
  • You can use different field separators by passing them as the Separator (2nd) parameter to the parseCsv() function, for example:
    • To use a tab separator: parseCsv(Data, '\t')
    • To use a “|” separator: parseCsv(Data, '|')
  • The parse will handle the following:
    • Quoted and unquoted fields
    • Separators within quoted fields
    • Escaped quotes (“”) within quoted fields
      Note: See the “Speedie” nickname in the 2nd row of 9th sample message
    • The parse will raise an error for an un-escaped (“) quote in quoted field
      Note: The 10th sample message raises an error for this reason
  • The parse will handle spaces before/after separators when using quoted fields
  • The parse will handle spaces at the start of a line when using quoted fields
  • Limitation: The parse does not handle quotes in an unquoted field, it just reads them like any other character
    Note: Technically this violates CSV rules but I have seen it occasionally

    • A un-escaped  quote (“) will be shown as a single quote (“), when it could be considered an error
    • An escaped quote (“”) will be show as two quotes (“”) when it should probably be shown as one quote (“)

Best Practices [top]

  • Test the code to make sure it will handle the CSV files you receive correctly, and adjust it if necessary
    Note: There a CSV standard (RFC 4180) but it is not always followed exactly,  so parsing CSV is never completely cut and dried

What not to do [top]

  • Don’t expect the code to handle all possible CSV file content (you need to test)
    Note: As mentioned in best practices CSV files are not truly standard so parsing them is not an exact science