Using rxmatch() and rxsub() with PCRE regex

Using rxmatch()

Tip: It is very easy to adapt PHP examples for Iguana, because rxmatch() corresponds to preg_grep (and to a lesser extent preg_match).

  1. This first example is very simple: it loops through a string and prints each word. There are two slight differences. First, PCRE uses the “\” escape character where Lua uses “%”, so PCRE uses “\w” and Lua uses “%w” to match word characters. Second, the set of “word characters” differs: Lua's “%w” matches alphanumeric characters only, whereas PCRE's “\w” matches alphanumeric characters plus the “_” underscore.
    Note: The “\” escape character must itself be escaped as “\\” in ordinary Lua strings, or you can use double square brackets as string delimiters, “[[<string value>]]”, which need no escaping.

       -- Iterate over all the words from string s, printing one per line:
       
       -- using gmatch
       s = "hello world from Lua"
       for w in s:gmatch("%w+") do
          print(w)
       end
       
       -- using rxmatch
       s = "hello world from Lua and PCRE"
       for w in s:rxmatch("\\w+") do         -- the \ character must be escaped as \\ in Lua strings
       -- for w in s:rxmatch([[\w+]]) do     -- alternatively you can use the [[ ]] syntax without escaping
          print(w)
       end 
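
The underscore difference described above matters for identifiers such as patient_id. The following is a minimal sketch in plain Lua (no Iguana functions; the helper name words() is made up for illustration) showing how "%w+" splits on the underscore, and how the character class "[%w_]+" approximates PCRE's \w+ for ASCII text:

```lua
-- Collect every match of a Lua pattern into a table (hypothetical helper).
function words(s, pattern)
   local out = {}
   for w in s:gmatch(pattern) do
      out[#out + 1] = w
   end
   return out
end

local s = "patient_id = 42"
print(table.concat(words(s, "%w+"), ","))     -- %w stops at the underscore
print(table.concat(words(s, "[%w_]+"), ","))  -- [%w_] treats "_" as a word character
```

If a Lua pattern needs to behave like \w, adding the underscore to the class is usually enough for ASCII input.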
    
  2. This example uses captures “(<captured value>)” to collect all key=value pairs and write them into a Lua table. The only change to the pattern is that PCRE uses the “\” escape character where Lua uses “%”.

       -- Collect all pairs key=value from the given string into a table:
       
       -- using gmatch
       t = {}
       s = "from=world, to=Lua"
       for k, v in s:gmatch("(%w+)=(%w+)") do
          t[k] = v
       end
        
       -- using rxmatch
       t = {}
       s = "from=world, to=Lua"
       for k, v in s:rxmatch("(\\w+)=(\\w+)") do    -- the \ character must be escaped as \\ in Lua strings
       --for k, v in s:rxmatch([[(\w+)=(\w+)]]) do  -- alternatively you can use the [[ ]] syntax without escaping
          t[k] = v
       end
    
  3. You can use captures to find duplicated words:
       -- find duplicate words
       
       -- using gmatch   
       local s = "hello hello world world"
       for k, v in s:gmatch("(%w+)(%s+%1)") do
          trace(k, v)
       end
       
       -- using rxmatch  
       local s = "hello hello world world"
       for k, v in s:rxmatch("(\\b\\w+\\b)(\\W+\\1)") do
          trace(k, v)
       end
  4. And now for a regex trick that you can’t do with Lua patterns: finding a word (or pattern) in a string that is not followed by another word. To achieve this we use a PCRE “lookaround” subpattern.
    To find “hello” only when “world” does not appear anywhere after it, use a negative lookahead:

       -- not possible with gmatch()
       local s = "hello everyone my name is ..."
       for k in s:rxmatch("(hello)(?!.*world)") do -- negative lookahead - PCRE only
          trace(k)
       end
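
When rxmatch() is not available, a similar check can be made with two plain-text searches instead of a single pattern. This is only a rough sketch in plain Lua: the function name find_not_followed() is hypothetical, it works on literal text rather than patterns, and it looks at the whole remainder of the string (the regex above only looks to the end of the current line, because "." does not match a newline by default).

```lua
-- Return the start positions of `word` that are NOT followed, anywhere
-- later in the string, by `other`. Both arguments are treated as plain text.
function find_not_followed(s, word, other)
   assert(word ~= "", "word must not be empty")
   local hits = {}
   local pos = 1
   while true do
      local first, last = s:find(word, pos, true)   -- true = plain-text search
      if not first then break end
      if not s:find(other, last + 1, true) then     -- nothing unwanted after it
         hits[#hits + 1] = first
      end
      pos = last + 1
   end
   return hits
end

local hits = find_not_followed("hello everyone my name is ...", "hello", "world")
print(#hits)
```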
  5. Capture quoted text in a string. Using a back-reference to a capture allows us to write one regex that works for both single (') and double (") quotes.
       -- capture text in quotes
       
       local s = 'She said "Hello everyone"'
       for k in s:rxmatch([[("[^"]+")]]) do        -- capture text including quotes
          trace(k)
       end
       --> "Hello everyone"
       
       local s = 'She said "Hello everyone"'
       for k in s:rxmatch([["([^"]+)"]]) do        -- capture text excluding the quotes
          trace(k)
       end
       --> Hello everyone
     
       local s = [[She said "Don't wait up"]]
       for k in s:rxmatch([[(['"][^'"]+['"])]]) do -- capture text including quotes = FAIL stops at (matches) single quote
          trace(k)
       end
       --> "Don' (stops incorrectly on single quote)
     
       local s = [[She said "Don't wait up"]]
       -- NOTE: a back-reference such as \2 is not recognised inside a character class [...],
       -- so use a lazy .+? followed by the back-reference rather than [^\2]+
       for k in s:rxmatch([[((["']).+?\2)]]) do    -- capture text including quotes = SUCCESS using back-reference \2 to require
          trace(k)                                 -- the same type of quote that opened the text
       end
       --> "Don't wait up"
    
       local s = [[She said "Don't wait up"]]
       for k,v in s:rxmatch([[(["'])(.+?)\1]]) do    -- capture text excluding the quotes = SUCCESS using back-reference \1 to require
          trace(k,v)                                 -- the same type of quote that opened the text
       end
       --> Don't wait up (NOTE: this is the second return "v")
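
Lua patterns also support back-references to captures (written %1, %2 and so on) and a lazy quantifier ("-"), so the same idea can be sketched without PCRE. The example below is plain Lua and the function name quoted_text() is made up for illustration; it returns only the first quoted section it finds.

```lua
-- Return the text inside the first pair of matching quotes (and the quote
-- character used), or nil if there is none.
-- %1 refers back to whichever quote character capture 1 matched,
-- and ".-" is the lazy (shortest-match) form of ".*".
function quoted_text(s)
   local quote, text = s:match([[(["'])(.-)%1]])
   return text, quote
end

print(quoted_text([[She said "Don't wait up"]]))
print(quoted_text([[She said 'Hello everyone']]))
```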
  6. Capture text inside brackets or other delimiters:
       -- capture text inside delimiters like brackets, etc.
       
       local s = "Hello to the world (and everyone)"
       for k in s:rxmatch('(\\([^)]+\\))') do        -- capture text including brackets (or substitute other delimiters)
          trace(k)
       end
       -->(and everyone)
    
       local s = "Hello to the world (and everyone)"
       for k in s:rxmatch('\\(([^)]+)\\)') do        -- just capture text (NOT including brackets)
          trace(k)
       end
       -->and everyone
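
The [^)]+ patterns above stop at the first closing bracket, so they do not cope with nested brackets. Plain Lua patterns offer the %b item for balanced delimiters, which may be useful alongside rxmatch(). A minimal sketch follows; the function names balanced() and inside() are hypothetical:

```lua
-- %b() matches from "(" to its balancing ")", including nested pairs.
function balanced(s)
   return s:match("%b()")
end

-- Strip the outer delimiters from the balanced match (nil if no match).
function inside(s)
   local m = balanced(s)
   return m and m:sub(2, -2)
end

print(balanced("Hello to the world (and everyone)"))
print(inside("Hello to the world (and everyone)"))
print(balanced("f(a(b)c) tail"))
```

Other delimiter pairs work the same way, for example "%b[]" or "%b{}".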
  7. Extract HTML or XML tags:
       -- extract HTML or XML tags
       
       local s = '<a href="#hello_world">Hello world link</a>'
       for k in s:rxmatch('(<[^>]+>)') do              -- extract HTML tags
          trace(k)
       end
    
       local s = [[<?xml version="1.0"?>
       <patients>
          <patient id="123">
             <first-name>John</first-name>
             <last-name>Smith</last-name>
          </patient>
       </patients>]]
       for k in s:rxmatch('(<[^>]+>)') do              -- extract XML tags
          trace(k)
       end
    
       local s = [[<?xml version="1.0"?>
       <patients>
          <patient id = "123">
             <first-name>John</first-name>
             <last-name>Smith</last-name>
          </patient>
       </patients>]]
       for k in s:rxmatch('(<[^>]+\\sid\\b[^>]+>)') do -- only extract XML tags containing an id attribute
          trace(k)
       end
  8. White-listing and black-listing are both useful techniques.
    1. White-listing is simple with rxmatch() but awkward with gmatch():
         -- white-list
      
         -- using rxmatch  
      
         -- match single word
         local s = "Hello hello world I was here world"
         for k in s:rxmatch('(\\bhello\\b)', 'i') do       -- \b (word boundaries) to only match whole words
            trace(k)                                       -- 'i' (3rd param) for case insensitive match
         end
      
         -- extend to matching a list
         local s = "Hello hello world I was here world"
         for k in s:rxmatch('\\b(hello|world)\\b', 'i') do -- move the \b (word boundaries) outside the capture group
            trace(k)
         end
      
         -- simply extend the list as required
         local s = "Hello to the world, Mars, Venus and the Universe"
         for k in s:rxmatch('\\b(hello|world|mars|venus|universe)\\b', 'i') do 
            trace(k) 
         end
      
         -- using gmatch - cannot be done easily
      
         -- matching a single word/phrase is easy
         local s = "Hello hello world I was here"
         for k in s:lower():gmatch('(hello)') do           -- using lower() for case insensitive match
            trace(k)
         end
      
         -- unfortunately Lua will also match partial words
         local s = "Hello to the worldwide web"
         for k in s:lower():gmatch('(world)') do           -- matches "world" in "worldwide"
            trace(k)
         end
         -- NOTE: Requiring a non-word character (%W) on each side excludes partial matches,
         -- but it also misses a word at the very start or end of the string (here the first "hello")
         for k in s:lower():gmatch('%W(hello)%W') do 
            trace(k)
         end
         
         -- also Lua patterns do not support "|" (OR) so you cannot match members in a list
         local s = "Hello hello|world world I was here"
         for k in s:lower():gmatch('(hello|world)') do     -- matches string "hello|world"
            trace(k)
         end
      
         -- though you could loop through a white-list stored in table
         local s = "Hello hello world I was here world"
         local wlist = {'hello', 'world'}
         for i=1,#wlist do  
            trace(wlist[i])
            for k in s:lower():gmatch(wlist[i]) do         -- using lower() for case insensitive match
               trace(k)
            end
         end
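
If only Lua patterns are available, whole-word matching can be approximated with the frontier pattern %f, which matches at the transition into or out of a character class. The sketch below is plain Lua and makes several assumptions: the function name whitelist_matches() is hypothetical, the list entries are plain lowercase alphanumeric words, and the Lua version in use supports %f (it is documented from Lua 5.2 onward, so check your version). Results are grouped by list entry rather than by position in the string.

```lua
-- Return every whole-word, case-insensitive match of any word in `list`.
function whitelist_matches(s, list)
   local found = {}
   local lowered = s:lower()
   for _, word in ipairs(list) do
      -- %f[%w] = transition into a word, %f[%W] = transition out of a word
      for m in lowered:gmatch("%f[%w]" .. word .. "%f[%W]") do
         found[#found + 1] = m
      end
   end
   return found
end

print(table.concat(whitelist_matches("Hello to the worldwide web", {"hello", "world"}), ","))
```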
    2. A black-list is also simple with rxmatch() but we are not even going to try it with gmatch():
         -- black-list
         
         -- first let's exclude a single word
         local s = "Hello hello world I was here world"
         for k in s:rxmatch([[\bhello\b(*SKIP)(*FAIL)|(\w+)]], 'i') do
            trace(k) 
         end
      
         -- then exclude words in a list
         for k in s:rxmatch([[\bhello\b(*SKIP)(*FAIL)|\bworld\b(*SKIP)(*FAIL)|\w+]], 'i') do
            trace(k) 
         end
         -- Compact Version: place \b and (*SKIP)(*FAIL) outside a non-capturing group
         for k in s:rxmatch([[\b(?:hello|world)\b(*SKIP)(*FAIL)|\w+]], 'i') do
            trace(k) 
         end
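
For comparison, the same filtering can be sketched in plain Lua by splitting the string into words and checking each one against a lookup table, rather than expressing the exclusion in a single pattern. The function names make_set() and not_blacklisted() are hypothetical, and "[%w_]+" is used as an ASCII-only approximation of \w+.

```lua
-- Build a set (lowercased word -> true) for fast membership tests.
function make_set(list)
   local set = {}
   for _, word in ipairs(list) do
      set[word:lower()] = true
   end
   return set
end

-- Return every word in `s` that is NOT in the black-list, in original order.
function not_blacklisted(s, blacklist)
   local banned = make_set(blacklist)
   local kept = {}
   for word in s:gmatch("[%w_]+") do
      if not banned[word:lower()] then   -- case-insensitive comparison
         kept[#kept + 1] = word
      end
   end
   return kept
end

local s = "Hello hello world I was here world"
print(table.concat(not_blacklisted(s, {"hello", "world"}), " "))
```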
      
      
  9. Here are some examples of Unicode matching.
    1. First a trivial example to help understand matching a specific Unicode character like “à”:
      Note: The precomposed character “à” is a single code point, U+00E0, which UTF-8 encodes as two bytes: 0xC3 0xA0 (decimal 195, 160). It can also be written in decomposed form as two code points, U+0061 (a) followed by U+0300 (combining grave accent), which together form one grapheme.

         -- demonstrate that "à" is stored as two bytes
         string.byte('à',1,2)       --> 195,160 = the two UTF-8 bytes of U+00E0
      
         -- using gmatch  
         local s = "à"
         for k in s:gmatch(".") do  --> trace('\195') - matches the first byte (195)
            trace(k)                --> trace('\160') - then the 2nd
         end
         for k in s:gmatch("..") do --> trace('à') - matches the whole character (2 bytes)
            trace(k)
         end
         for k in s:gmatch("à") do  --> trivially matches "à"
            trace(k)
         end
            
         -- using rxmatch  
         local s = "à"
         for k in s:rxmatch(".") do  --> trace('\195') - matches the first byte (195)
            trace(k)                 --> trace('\160') - then the 2nd
         end
         for k in s:rxmatch("..") do --> trace('à') - matches the whole character (2 bytes)
            trace(k)
         end
         for k in s:rxmatch("à") do  --> trivially matches "à"
            trace(k)
         end
      
         -- using rxmatch Unicode specific features 
         -- NOTE: you must include 'u' (unicode) as the 3rd parameter
         local s = "à"
         for k in s:rxmatch([[\X]], 'u') do    -- trace('à') - match the grapheme using "\X" (the unicode equivalent of ".")
            trace(k)
         end
         for k in s:rxmatch([[\p{L}]], 'u') do -- trace('à') - match any unicode letter grapheme 
            trace(k)
         end
         for k in s:rxmatch([[\p{Ll}]], 'u') do -- trace('à') - match any unicode lower case letter grapheme 
            trace(k)
         end
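
The difference between bytes and code points can also be checked in plain Lua. The sketch below counts UTF-8 code points by matching one lead byte followed by its continuation bytes; the function name utf8_count() is hypothetical, and it assumes the input is valid UTF-8 with no NUL bytes. The test string spells “à” as explicit byte escapes so the example does not depend on the encoding of the script file.

```lua
-- Count UTF-8 code points: each one is a lead byte (ASCII or 194-244)
-- followed by zero or more continuation bytes (128-191).
function utf8_count(s)
   local _, n = s:gsub("[\1-\127\194-\244][\128-\191]*", "")
   return n
end

local s = "\195\160bc"     -- "àbc" with "à" written as its two UTF-8 bytes
print(#s)                  -- number of bytes
print(utf8_count(s))       -- number of code points
```

Note that this counts code points, not graphemes: a decomposed “à” (a followed by U+0300) counts as two.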
    2. Now let's look at matching multiple Unicode graphemes in a string. This can only be done with PCRE (rxmatch).
         -- matching unicode graphemes
         
         -- using rxmatch
         local s = "Ábcd éfgh ©copyright"
         local cnt = 0
         for k in s:rxmatch([[\X]], 'u') do     -- match all unicode graphemes "\X"
            trace(k)
            cnt = cnt + 1
         end
         trace(cnt)                             --> cnt = 20 matches each letter (the unicode graphemes "Áé©" each count as a single letter)
      
         cnt = 0
         for k in s:rxmatch([[\p{L}]], 'u') do  -- match all unicode letter graphemes "\p{L}"
            trace(k)
            cnt = cnt + 1
         end
         trace(cnt)                             --> cnt = 17 two spaces and the "©" are not matched
      
         cnt = 0
         for k in s:rxmatch([[\p{Ll}]], 'u') do -- match lowercase unicode letter graphemes "\p{Ll}"
            trace(k)
            cnt = cnt + 1
         end
         trace(cnt)                             --> cnt = 16 the spaces, the "Á" and the "©" are not matched
      
         cnt = 0
         for k in s:rxmatch([[\p{Lu}]], 'u') do -- match uppercase unicode letter graphemes "\p{Lu}"
            trace(k)
            cnt = cnt + 1
         end
         trace(cnt)                             --> cnt = 1 only "Á" is matched
         
         cnt = 0
         for k in s:rxmatch([[\p{S}]], 'u') do  -- match unicode symbol graphemes "\p{S}"
            trace(k)
            cnt = cnt + 1
         end
         trace(cnt)                             --> cnt = 1 only "©" is matched
    3. Detect if a string contains graphemes for a specified language, using unicode scripts like \p{Greek} or \p{Cyrillic}, etc.
         -- matching unicode scripts like \p{Greek} or \p{Cyrillic}, etc.
         
         -- using rxmatch
         local s = "Hello world in Greek Γειά σου Κόσμε (from google translate)"
         local cnt = 0
         for k in s:rxmatch([[\p{Greek}]], 'u') do -- match all Greek graphemes
            trace(k)
            cnt = cnt + 1
         end
         trace(cnt)                                --> cnt = 12 Greek letters
      
         local s = "Hello world in Greek Γειά σου Κόσμε (from google translate)"
         local cnt = 0
         for k in s:rxmatch([[\P{Greek}]], 'u') do -- match all NON Greek graphemes
            trace(k)
            cnt = cnt + 1
         end
         trace(cnt)                                --> cnt = 47 NON Greek characters (letters, spaces and punctuation)