Using rxmatch() and rxsub() with PCRE regex

Using rxmatch()

Tip: It is very easy to adapt PHP examples for Iguana: rxmatch() corresponds to preg_match_all (and to a lesser extent preg_match).

This first example is very simple: it loops through a string and prints each word. There are two slight differences. The first is that PCRE uses the “\” escape character where Lua uses “%”, so PCRE uses “\w” and Lua uses “%w” to match word characters. The second is in the “word characters” matched: Lua's “%w” matches alphanumeric characters, but PCRE's “\w” matches alphanumeric characters plus the “_” underscore.
Note: The “\” escape character must be escaped as “\\” in ordinary Lua strings, or you can use double square brackets as string delimiters “[[<string value>]]” without escaping.
```
   -- Iterate over all the words from string s, printing one per line:
   
   -- using gmatch
   s = "hello world from Lua"
   for w in s:gmatch("%w+") do
      print(w)
   end
   
   -- using rxmatch
   s = "hello world from Lua and PCRE"
   for w in s:rxmatch("\\w+") do         -- the \ character must be escaped as \\ in Lua strings
   -- for w in s:rxmatch([[\w+]]) do     -- alternatively you can use the [[ ]] syntax without escaping
      print(w)
   end 
```
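
The escaping note and the word-character difference described above can be checked in plain Lua, outside Iguana. This is a minimal sketch: `count_words` is a hypothetical helper introduced only for illustration, and the rxmatch() behaviour mentioned in the last comment is taken from the text above rather than executed here.

```lua
-- Both literals produce the same three-character string: \w+
local escaped = "\\w+"      -- backslash escaped inside an ordinary string
local long    = [[\w+]]     -- long-bracket string: no escaping needed

-- Hypothetical helper: count the matches of a Lua pattern in a string
function count_words(s, pattern)
   local n = 0
   for _ in s:gmatch(pattern) do
      n = n + 1
   end
   return n
end

print(escaped == long, #escaped)           -- true   3
print(count_words("first_name", "%w+"))    -- 2: Lua's %w excludes "_", giving "first" and "name"
-- PCRE's \w includes "_", so "\\w+" would treat "first_name" as a single word
```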

This example uses captures “(<captured value>)” to collect all key value pairs and write them into a Lua table. As you can see the only difference is that PCRE uses the “\” escape character where Lua uses “%”.

   -- Collect all pairs key=value from the given string into a table:
   
   -- using gmatch
   t = {}
   s = "from=world, to=Lua"
   for k, v in s:gmatch("(%w+)=(%w+)") do
      t[k] = v
   end
    
   -- using rxmatch
   t = {}
   s = "from=world, to=Lua"
   for k, v in s:rxmatch("(\\w+)=(\\w+)") do    -- the \ character must be escaped as \\ in Lua strings
   --for k, v in s:rxmatch([[(\w+)=(\w+)]]) do  -- alternatively you can use the [[ ]] syntax without escaping
      t[k] = v
   end

You can use captures to find duplicated words:

  -- find duplicate words
   
   -- using gmatch   
   local s = "hello hello world world"
   for k, v in s:gmatch("(%w+)(%s+%1)") do
      trace(k, v)
   end
   
   -- using rxmatch  
   local s = "hello hello world world"
   for k, v in s:rxmatch("(\\b\\w+\\b)(\\W+\\1)") do
      trace(k, v)
   end

And now for a trick with regex that you can’t do with Lua patterns: finding a word (or pattern) in a string only when it is not followed by another word. To achieve this we will use a PCRE “lookaround” subpattern.
To find “hello” only when “world” does not appear anywhere after it, use a negative lookahead:
```
   -- not possible with gmatch()
   local s = "hello everyone my name is ..."
   for k in s:rxmatch("(hello)(?!.*world)") do -- negative lookahead - PCRE only
      trace(k)
   end
```
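
A single Lua pattern cannot express the lookahead, but for single-line input the same check can be approximated in plain Lua with two plain-text searches. The sketch below is only an illustration under that assumption: `find_not_followed` is a hypothetical helper, not part of Iguana or of the Lua standard library.

```lua
-- Hypothetical helper: return the start index of every occurrence of `word`
-- in `s` for which `blocker` does not appear anywhere later in the string.
function find_not_followed(s, word, blocker)
   local hits = {}
   local init = 1
   while true do
      local first, last = s:find(word, init, true)   -- true = plain search, no pattern magic
      if not first then break end
      if not s:find(blocker, last + 1, true) then     -- emulates (?!.*blocker)
         hits[#hits + 1] = first
      end
      init = last + 1
   end
   return hits
end

print(#find_not_followed("hello everyone my name is ...", "hello", "world"))  -- 1
print(#find_not_followed("hello everyone, hello world", "hello", "world"))    -- 0
```

The regex version stays shorter and also handles patterns rather than fixed words, which is the reason to prefer rxmatch() here.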

Capture quoted text in a string. Using a backreference to the captured opening quote allows us to write one regex that works for both single (') and double (") quotes.

   -- capture text in quotes
   
   local s = 'She said "Hello everyone"'
   for k in s:rxmatch([[("[^"]+")]]) do        -- capture text including quotes
      trace(k)
   end
   --> "Hello everyone"
   
   local s = 'She said "Hello everyone"'
   for k in s:rxmatch([["([^"]+)"]]) do        -- capture text excluding the quotes
      trace(k)
   end
   --> Hello everyone
 
   local s = [[She said "Don't wait up"]]
   for k in s:rxmatch([[(['"][^'"]+['"])]]) do -- capture text including quotes = FAIL stops at (matches) single quote
      trace(k)
   end
   --> "Don' (stops incorrectly on single quote)
 
   local s = [[She said "Don't wait up"]]
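   -- NOTE: a backreference cannot be used inside a character class ([^\2] means "not the character with octal code 2"),
   -- so use a lazy quantifier followed by the backreference instead
   for k in s:rxmatch([[((["']).+?\2)]]) do    -- capture text including quotes = SUCCESS using backreference \2 to match
      trace(k)                                 -- the same type of quote that opened the text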
   end
   --> "Don't wait up"

   local s = [[She said "Don't wait up"]]
   for k,v in s:rxmatch([[(["'])(.+?)\1]]) do    -- capture text excluding the quotes = SUCCESS using backreference \1 to match
      trace(k,v)                                 -- the same type of quote that opened the text
   end
   --> Don't wait up (NOTE: this is the second return "v")

Capture text inside brackets or other delimiters:

   -- capture text inside delimiters like brackets, etc.
   
   local s = "Hello to the world (and everyone)"
   for k in s:rxmatch('(\\([^)]+\\))') do        -- capture text including brackets (or substitute other delimiters)
      trace(k)
   end
   -->(and everyone)

   local s = "Hello to the world (and everyone)"
   for k in s:rxmatch('\\(([^)]+)\\)') do        -- just capture text (NOT including brackets)
      trace(k)
   end
   -->and everyone

Extract HTML or XML tags:

   -- extract HTML or XML tags
   
   local s = '<a href="#hello_world">Hello world link</a>'
   for k in s:rxmatch('(<[^>]+>)') do              -- extract HTML tags
      trace(k)
   end

   local s = [[<?xml version="1.0"?>
   <patients>
      <patient id="123">
         <first-name>John</first-name>
         <last-name>Smith</last-name>
      </patient>
   </patients>]]
   for k in s:rxmatch('(<[^>]+>)') do              -- extract XML tags
      trace(k)
   end

   local s = [[<?xml version="1.0"?>
   <patients>
      <patient id = "123">
         <first-name>John</first-name>
         <last-name>Smith</last-name>
      </patient>
   </patients>]]
   for k in s:rxmatch('(<[^>]+\\sid\\b[^>]+>)') do -- only extract XML tags containing an id attribute
      trace(k)
   end

White-listing and black-listing are both useful techniques.

White-listing is simple with rxmatch() but next to impossible with gmatch():

   -- white-list

   -- using rxmatch  

   -- match single word
   local s = "Hello hello world I was here world"
   for k in s:rxmatch('(\\bhello\\b)', 'i') do       -- \b (word boundaries) to only match whole words
      trace(k)                                       -- 'i' (3rd param) for case insensitive match
   end

   -- extend to matching a list
   local s = "Hello hello world I was here world"
   for k in s:rxmatch('\\b(hello|world)\\b', 'i') do -- move the \b (word boundaries) outside the capture group
      trace(k)
   end

   -- simply extend the list as required
   local s = "Hello to the world, Mars, Venus and the Universe"
   for k in s:rxmatch('\\b(hello|world|mars|venus|universe)\\b', 'i') do 
      trace(k) 
   end

   -- using gmatch - cannot be done easily

   -- matching a single word/phrase is easy
   local s = "Hello hello world I was here"
   for k in s:lower():gmatch('(hello)') do           -- using lower() for case insensitive match
      trace(k)
   end

   -- unfortunately Lua will also match partial words
   local s = "Hello to the worldwide web"
   for k in s:lower():gmatch('(world)') do           -- matches "world" in "worldwide"
      trace(k)
   end
   -- NOTE: Requiring a non-word character on each side excludes partial matches,
   -- but it also misses "hello" at the start of the string (nothing precedes it)
   for k in s:lower():gmatch('%W(hello)%W') do 
      trace(k)
   end
   
   -- also Lua patterns do not support "|" (OR) so you cannot match members in a list
   local s = "Hello hello|world world I was here"
   for k in s:lower():gmatch('(hello|world)') do     -- matches string "hello|world"
      trace(k)
   end

   -- though you could loop through a white-list stored in table
   local s = "Hello hello world I was here world"
   local wlist = {'hello', 'world'}
   for i=1,#wlist do  
      trace(wlist[i])
      for k in s:lower():gmatch(wlist[i]) do         -- using lower() for case insensitive match
         trace(k)
      end
   end
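
One way to get whole-word, case-insensitive white-listing out of plain Lua is to split the string into words first and then look each word up in a table. This is a sketch under stated assumptions rather than a replacement for rxmatch(): `whitelist_words` is a hypothetical helper, and it treats a word as a run of `%w` characters, so it splits on “_” and does not handle non-ASCII letters.

```lua
-- Hypothetical helper: return the whole words of `s` that appear in `list`,
-- ignoring case. A word is a run of alphanumeric characters (%w+).
function whitelist_words(s, list)
   local allowed = {}
   for _, w in ipairs(list) do
      allowed[w:lower()] = true       -- build a lookup set of lower-cased entries
   end
   local found = {}
   for w in s:gmatch("%w+") do
      if allowed[w:lower()] then      -- whole-word comparison, so "worldwide" is skipped
         found[#found + 1] = w
      end
   end
   return found
end

local hits = whitelist_words("Hello to the worldwide web, world", {"hello", "world"})
print(table.concat(hits, ","))        -- Hello,world
```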

A black-list is also simple with rxmatch(), but we are not even going to try it with gmatch():

   -- black-list
   
   -- first let's exclude a single word
   local s = "Hello hello world I was here world"
   for k in s:rxmatch([[\bhello\b(*SKIP)(*FAIL)|(\w+)]], 'i') do
      trace(k) 
   end

   -- then exclude words in a list
   for k in s:rxmatch([[\bhello\b(*SKIP)(*FAIL)|\bworld\b(*SKIP)(*FAIL)|\w+]], 'i') do
      trace(k) 
   end
   -- Compact Version: place \b and (*SKIP)(*FAIL) outside a non-capturing group
   for k in s:rxmatch([[\b(?:hello|world)\b(*SKIP)(*FAIL)|\w+]], 'i') do
      trace(k) 
   end

Here are some examples of unicode matching.

First a trivial example to help understand matching a specific unicode grapheme (a single user-perceived character) like “à”:
Note: The unicode “à” can be stored as the single precomposed code point U+00E0, which UTF-8 encodes as two bytes (195, 160), or as two code points: U+0061 (a) followed by U+0300 (combining grave accent). The examples below assume the precomposed form.

   -- demonstrate that "à" (U+00E0) is stored as two bytes in UTF-8
   string.byte('à',1,2)       --> 195,160 = decimal values of the two UTF-8 bytes

   -- using gmatch  
   local s = "à"
   for k in s:gmatch(".") do  --> trace('\195') - matches the first byte (195)
      trace(k)                --> trace('\160') - then the 2nd
   end
   for k in s:gmatch("..") do --> trace('à') - matches the whole grapheme (2 bytes)
      trace(k)
   end
   for k in s:gmatch("à") do  --> trivially matches "à"
      trace(k)
   end
      
   -- using rxmatch  
   local s = "à"
   for k in s:rxmatch(".") do  --> trace('\195') - matches the first byte (195)
      trace(k)                 --> trace('\160') - then the 2nd
   end
   for k in s:rxmatch("..") do --> trace('à') - matches the whole grapheme (2 bytes)
      trace(k)
   end
   for k in s:rxmatch("à") do  --> trivially matches "à"
      trace(k)
   end

   -- using rxmatch Unicode specific features 
   -- NOTE: you must include 'u' (unicode) as the 3rd parameter
   local s = "à"
   for k in s:rxmatch([[\X]], 'u') do    -- trace('à') - match the grapheme using "\X" (the unicode equivalent of ".")
      trace(k)
   end
   for k in s:rxmatch([[\p{L}]], 'u') do -- trace('à') - match any unicode letter grapheme 
      trace(k)
   end
   for k in s:rxmatch([[\p{Ll}]], 'u') do -- trace('à') - match any unicode lower case letter grapheme 
      trace(k)
   end
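
The difference between bytes and code points can be made visible in plain Lua. The sketch below writes “à” in its two possible UTF-8 forms using decimal byte escapes, so the result does not depend on how the editor saves the file. The byte values are standard UTF-8; the variable names are only illustrative.

```lua
-- The same visible character "à" in two encodings
local precomposed = "\195\160"      -- U+00E0 as UTF-8: 2 bytes, 1 code point
local decomposed  = "a\204\128"     -- U+0061 + U+0300 as UTF-8: 3 bytes, 2 code points

print(precomposed:byte(1, -1))      -- 195   160
print(decomposed:byte(1, -1))       -- 97    204   128
print(#precomposed, #decomposed)    -- 2     3
print(precomposed == decomposed)    -- false: a byte-wise comparison sees two different strings
```

This is why byte-oriented patterns such as "." or ".." are fragile for non-ASCII text, while "\X" with the 'u' modifier matches one grapheme in either form.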

Now let's look at matching multiple unicode graphemes in a string. This can only be done with PCRE (rxmatch).

   -- matching unicode graphemes
   
   -- using rxmatch
   local s = "Ábcd éfgh ©copyright"
   local cnt = 0
   for k in s:rxmatch([[\X]], 'u') do     -- match all unicode graphemes "\X"
      trace(k)
      cnt = cnt + 1
   end
   trace(cnt)                             --> cnt = 20 matches each letter (the unicode graphemes "Áé©" each count as a single letter)

   cnt = 0
   for k in s:rxmatch([[\p{L}]], 'u') do  -- match all unicode letter graphemes "\p{L}"
      trace(k)
      cnt = cnt + 1
   end
   trace(cnt)                             --> cnt = 17 two spaces and the "©" are not matched

   cnt = 0
   for k in s:rxmatch([[\p{Ll}]], 'u') do -- match lowercase unicode letter graphemes "\p{Ll}"
      trace(k)
      cnt = cnt + 1
   end
   trace(cnt)                             --> cnt = 16 spaces "Á" and the "©" are not matched

   cnt = 0
   for k in s:rxmatch([[\p{Lu}]], 'u') do -- match uppercase unicode letter graphemes "\p{Lu}"
      trace(k)
      cnt = cnt + 1
   end
   trace(cnt)                             --> cnt = 1 only "Á" is matched
   
   cnt = 0
   for k in s:rxmatch([[\p{S}]], 'u') do  -- match unicode symbol graphemes "\p{S}"
      trace(k)
      cnt = cnt + 1
   end
   trace(cnt)                             --> cnt = 1 only "©" is matched
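
As a cross-check on the first count above, the number of code points in a UTF-8 string can be obtained in plain Lua by counting every byte that is not a continuation byte. This is a sketch, not an equivalent of "\X": `utf8_codepoints` is a hypothetical helper, and its result equals the grapheme count only when every character is stored in precomposed form, as assumed here.

```lua
-- Hypothetical helper: count UTF-8 code points by counting every byte that
-- is not a continuation byte (continuation bytes are in the range 128-191)
function utf8_codepoints(s)
   local _, n = s:gsub("[^\128-\191]", "")
   return n
end

-- "Ábcd éfgh ©copyright" with the three non-ASCII characters written as byte escapes
local s = "\195\129bcd \195\169fgh \194\169copyright"
print(#s)                    -- 23 bytes
print(utf8_codepoints(s))    -- 20 code points
```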

Detect if a string contains graphemes for a specified language, using unicode scripts like \p{Greek} or \p{Cyrillic}, etc.

   -- matching unicode scripts like \p{Greek} or \p{Cyrillic}, etc.
   
   -- using rxmatch
   local s = "Hello world in Greek Γειά σου Κόσμε (from google translate)"
   local cnt = 0
   for k in s:rxmatch([[\p{Greek}]], 'u') do -- match all Greek graphemes
      trace(k)
      cnt = cnt + 1
   end
   trace(cnt)                                --> cnt = 12 Greek letters

   local s = "Hello world in Greek Γειά σου Κόσμε (from google translate)"
   local cnt = 0
   for k in s:rxmatch([[\P{Greek}]], 'u') do -- match all NON Greek graphemes
      trace(k)
      cnt = cnt + 1
   end
   trace(cnt)                                --> cnt = 47 NON Greek characters (including spaces and punctuation)

Continue: Using rxsub()
