Using rxmatch() and rxsub() with PCRE regex

Using rxsub()

Tip: PHP examples are easy to adapt for Iguana because rxsub() corresponds to PHP's preg_replace() and preg_filter().

The first two examples are very simple: each replaces a word or a phrase. As you can see, in these cases there is no difference between the string.gsub() and string.rxsub() syntax.

Replace a word:

   -- match and replace a word

   -- using gsub   
   local s = "hello world my name is ..."
   x = s:gsub( "world", "everyone")           
   --> hello everyone my name is ... 

   -- using rxsub
   local s = "hello world my name is ..."
   x = s:rxsub( "world", "everyone")
   --> hello everyone my name is ...

Replace a phrase:

   -- match and replace a phrase

   -- using gsub
   local s = "hello everyone my name is ... "
   x = s:gsub( "hello everyone", "Hello World") 
   --> Hello World my name is ...

   -- using rxsub   
   local s = "hello everyone my name is ... "
   x = s:rxsub( "hello everyone", "Hello World") 
   --> Hello World my name is ...
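One detail worth knowing when assigning the result, as the examples above do: in standard Lua, string.gsub() returns two values, the new string and the number of substitutions made. The sketch below uses plain gsub() only. Whether rxsub() also returns a count is not shown on this page, so treat that as something to check before relying on it.

```lua
-- string.gsub() returns the modified string and a substitution count
local s = "hello world my name is ..."

local x, n = s:gsub("world", "everyone")
print(x, n) --> hello everyone my name is ...   1

-- assigning to a single variable silently drops the count
local y = s:gsub("world", "everyone")

-- extra parentheses keep only the first value, which is useful inside function calls
print((s:gsub("world", "everyone")))
```

A count of 0 is a cheap way to detect that nothing matched.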

This next example shows how to collapse repeated spaces or whitespace characters into a single space.
Note: A space is just the space character " ", whereas whitespace includes all spacing characters (space, tab, newline, carriage return, vertical tab).

   -- remove duplicate spaces

   -- using gsub   
   local s = "hello world  I     was  here"
   x = s:gsub(" +", " ")                         -- replace multiple spaces with a single space
   --> hello world I was here
   local s = "hello world    I \t was \r\n here"
   x = s:gsub("%s+", " ")                        -- replace *any* multiple whitespace characters with a single space 
   --> hello world I was here
   
   -- using rxsub   
   local s = "hello world  I    was     here"
   x = s:rxsub(" +", " ")                        -- replace multiple spaces with a single space
   --> hello world I was here
   local s = "hello world    I \t was \r\n here"
   x = s:rxsub("\\s+", " ")                      -- replace *any* multiple whitespace characters with a single space 
   --> hello world I was here
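Collapsing whitespace can still leave one space at the very start or end of the string. As a minimal sketch in plain Lua patterns, the hypothetical helper normalize_whitespace() below (not part of Iguana) collapses runs of whitespace and then trims the leftover edge spaces.

```lua
-- Hypothetical helper: collapse whitespace runs, then trim the edges
local function normalize_whitespace(s)
   s = s:gsub("%s+", " ")   -- collapse any run of whitespace to one space
   s = s:gsub("^ ", "")     -- drop the single leading space, if any
   s = s:gsub(" $", "")     -- drop the single trailing space, if any
   return s
end

local x = normalize_whitespace("  hello world    I \t was \r\n here  ")
print(x) --> hello world I was here
```

Assigning each gsub() result back to s keeps only the string and discards the substitution count.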

This example collapses multiple spaces or whitespace characters after a fullstop (point) into a single space.

   -- collapse multiple spaces after a fullstop/point

   -- using gsub   
   local s = "Hello world.      I was here."
   x = s:gsub("%. +", ". ")                    -- replace multiple spaces with a single space
   --> Hello world. I was here.
   local s = "Hello world. \t\r\n I was here."
   x = s:gsub("%.%s+", ". ")                   -- replace *any* multiple whitespace characters with a single space
   --> Hello world. I was here.
   
   -- using rxsub   
   local s = "Hello world.      I was here.  "
   x = s:rxsub("\\. +", ". ")                  -- replace multiple spaces with a single space
   --> Hello world. I was here. (followed by one trailing space)
   local s = "Hello world. \t\r\n I was here.  "
   x = s:rxsub("\\.\\s+", ". ")                -- replace *any* multiple whitespace characters with a single space
   --> Hello world. I was here. (followed by one trailing space)

This example demonstrates the use of a capture to duplicate words. To create a capture, enclose part of the pattern in parentheses; you can then refer to it in the replacement as $1 to $9 (regex) or %1 to %9 (Lua pattern). You can also use $0 or %0 to refer to the whole match. If the pattern has no explicit capture, string.gsub() treats the whole match as %1 (equivalent to %0 in this case). We prefer %0 because it is more obvious and is consistent with regex, where $1 fails without an explicit capture (see the example below).

   -- using gsub   
   local s = "hello world"
   x = s:gsub("(%w+)", "%1 %1") -- %1 = first capture
   x = s:gsub("(%w+)", "%0 %0") -- %0 = whole match
   x = s:gsub("(%w+)", "%1 %0") -- mixed = same result
   x = s:gsub("%w+", "%0 %0")   -- %0 = whole match
   x = s:gsub("%w+", "%1 %1")   -- no explicit capture, so %1 = whole match (same result as %0); the equivalent regex (below) fails, so this is not recommended
   --> x="hello hello world world"   
   
   -- using rxsub   
   local s = "hello world"
   x = s:rxsub("(\\w+)", "$1 $1") -- $1 = first capture
   x = s:rxsub("(\\w+)", "$0 $0") -- $0 = whole match
   x = s:rxsub("\\w+", "$0 $0")   -- $0 = whole match
   --> x="hello hello world world"
   x = s:rxsub("\\w+", "$1 $1")   -- fails: $1 needs an explicit capture (the equivalent Lua pattern above works)
   
   -- notice how %0 or $0 is different from %1, %2 or $1, $2 when using multiple captures
   -- using gsub   
   local s = "hello world"
   x = s:gsub("(%w+) (%w+)", "%0 %0")   -- %0 = whole match   --> hello world hello world
   x = s:gsub("(%w+) (%w+)", "%1 %1")   -- %1 = first capture --> hello hello
   -- using rxsub   
   local s = "hello world"
   x = s:rxsub("(\\w+) (\\w+)", "$0 $0")   -- $0 = whole match   --> hello world hello world
   x = s:rxsub("(\\w+) (\\w+)", "$1 $1")   -- $1 = first capture --> hello hello

A more useful example with captures is removing duplicate words. Notice that string.rxsub() can remove a word repeated any number of times in one call, but string.gsub() cannot. This is because PCRE allows a quantifier (* or +) to be applied to a capture group, whereas Lua patterns do not.

   -- remove duplicate words

   -- using gsub   
   local s = "hello hello world world I was here"
   x = s:gsub("(%w+)%s+%1","%1")                         -- removes one repeat per word (not multiples like PCRE below)
   --> hello world I was here
   x = s:gsub("(%w+)(%s+%1)+","%1")                      -- Lua cannot quantify a capture group, so THIS DOES NOT WORK
   --> (no match: the string is returned unchanged)

   -- using rxsub   
   local s = "hello hello  hello world world I was here"
   x = s:rxsub("(\\b\\w+\\b)(\\W+\\1)+","$1")            -- using a word boundary \b
   --> hello world I was here                            -- our PCRE regex can remove multiple repeats (unlike gsub)
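If only Lua patterns are available, one possible workaround is to apply the single-repeat gsub() until it stops making substitutions. The sketch below is an assumption-laden illustration, not Iguana's method: remove_repeated_words() is a hypothetical helper, and it adds the frontier pattern %f (documented from Lua 5.2, and as far as we know also supported by Lua 5.1) so that only whole words are compared.

```lua
-- Hypothetical helper: remove repeated words using only Lua patterns
local function remove_repeated_words(s)
   local n
   repeat
      -- %f[%w] and %f[%W] act as word boundaries around the repeated word
      s, n = s:gsub("%f[%w](%w+)%s+%1%f[%W]", "%1")
   until n == 0   -- stop once a pass makes no substitutions
   return s
end

local x = remove_repeated_words("hello hello  hello world world I was here")
print(x) --> hello world I was here
```

The word boundaries matter: without them, a pattern like "(%w+)%s+%1" would also treat "this is" as a repeat, because "is" ends the first word.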

Suppose we occasionally receive HL7 messages with an extra "|" character before the encoding characters ("MSH||^~\&"). We can remove it by using a "^" anchor to check and fix only the start of the message.
Note: You might think the anchor is overkill, but it prevents matching things like embedded HL7 messages.

   -- Remove extra bar "|" character

   -- Note: The use of the [[<string>]] syntax to reduce escaping 
   -- i.e., for gsub() [[MSH|^~\&|]] rather than 'MSH|^~\\&|'
 
   -- using gsub   
   local s = [[MSH||^~\&|iNTERFACEWARE|Lab|]] -- partial HL7 message for brevity
   x = s:gsub([[^MSH||^~\&|]], [[MSH|^~\&|]])
   --> MSH|^~\&|iNTERFACEWARE|Lab|
   
   -- using rxsub   
   local s = [[MSH||^~\&|iNTERFACEWARE|Lab|]] -- partial HL7 message for brevity
   x = s:rxsub([[^MSH\|\|\^~\\&\|]], [[MSH|^~\\&|]])
   --> MSH|^~\&|iNTERFACEWARE|Lab|
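When the text to match comes from a variable rather than a hand-written pattern, escaping it by hand is error-prone. As a rough sketch for the gsub() side only, the hypothetical helper escape_pattern() below (not an Iguana function) prefixes every Lua pattern magic character with "%", so the text is matched literally.

```lua
-- Hypothetical helper: escape Lua pattern magic characters so text matches literally
local function escape_pattern(text)
   return (text:gsub("[%^%$%(%)%%%.%[%]%*%+%-%?]", "%%%0"))
end

local s    = [[MSH||^~\&|iNTERFACEWARE|Lab|]] -- partial HL7 message for brevity
local bad  = [[MSH||^~\&|]]
local good = [[MSH|^~\&|]]                    -- contains no "%", so it is safe as a replacement

-- the "^" anchor is added outside the escaped text so it still acts as an anchor
local x = s:gsub("^" .. escape_pattern(bad), good)
print(x) --> MSH|^~\&|iNTERFACEWARE|Lab|
```

A replacement string that may contain "%" would need its own escaping, since "%" is special there too.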

We can also use a “$” anchor to add a fullstop/point at the end of a string.

   -- put a fullstop/point at the end of the string
   
   -- using gsub   
   local s = "Hello world I was here"
   x = s:gsub('[^.]$', '%0.')
   --> Hello world I was here.
   
   -- using rxsub   
   local s = "Hello world I was here"
   x = s:rxsub('[^.]$', '$0.')
   --> Hello world I was here.

This example demonstrates the use of multiple captures to reverse the order of consecutive words. Notice the use of the POSIX class [:alpha:] with string.rxsub() to match alphabetic characters.

   -- reverse the order of two consecutive words
   
   -- using gsub   
   local s = "one two three  four 5 6"
   x = s:gsub("(%w+)%s*(%w+)", "%2 %1")    -- %w for alphanumeric, %s* matches multiple spaces
   --> x="two one four three 6 5"          -- but multiple spaces are not included in the result
   local s = "one two three  four 5 6"
   x = s:gsub("(%w+)(%s*)(%w+)", "%3%2%1") -- 3 captures will include spaces in the result
   --> x="two one four  three 6 5"          
   local s = "one two three  four 5 6"
   x = s:gsub("(%a+)%s*(%a+)", "%2 %1")    -- %a for alphabetic only
   --> x="two one four three 5 6"
   
   -- using rxsub   
   local s = "one two three four 5 6"
   x = s:rxsub("(\\w+)\\s*(\\w+)", "$2 $1")                 -- \w for alphanumeric, \s* matches multiple spaces
   --> x="two one four three 6 5"
   local s = "one two three  four 5 6"
   x = s:rxsub("(\\w+)(\\s*)(\\w+)", "$3$2$1")              -- 3 captures will include spaces in the result
   --> x="two one four  three 6 5"          
   local s = "one two three four 5 6"
   x = s:rxsub("([[:alpha:]]+)\\s*([[:alpha:]]+)", "$2 $1") -- [:alpha:] (POSIX class) for alphabetic only
   --> x="two one four three 5 6"

Convert URLs in text to hyperlinks:
Note: This will not find "shorthand" URLs like "www.google.com"; they need to start with https:// (or http://, ftp:// or ftps://).

   -- convert URLs to hyperlinks
   
   local r = [[(http|https|ftp|ftps)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/[^\s,;:(?:. )]*)?]] -- URL regex
   
   -- text to match and update
   local s = 'We should match https://css-tricks.com/snippets/php/find-urls-in-text-make-links/, '.. 
             'this https://gist.github.com/dperini/729294, this http://php.net/manual/en/book.pcre.php '..
             'and this https://www.google.com, but not this www.google.com (without the https://)'
   
   if(s:rxmatch(r)) then      
      x = s:rxsub(r, "<a href=$0>$0</a>") -- create hyperlinks
   end

Replace words written in another script, such as Greek or Cyrillic, by using Unicode script properties.
Note: You could use a call to a translation web service rather than "<Greek word>".

   -- Replace words in a foreign language
   
   local s = "Hello world in Greek Γειά σου Κόσμε (from google translate)"
   x = s:rxsub([[\p{Greek}+]], '<Greek word>', 'u') -- use a call to translation web service instead of "<Greek word>"
   trace(x)

