Using rxmatch()
Tip: It is very easy to adapt PHP examples for Iguana: rxmatch() corresponds to preg_grep (and, to a lesser extent, preg_match).
- This first example is very simple: it just loops through a string and prints each word. There are two slight differences. First, PCRE uses the “\” escape character where Lua uses “%”, so PCRE uses “\w” and Lua uses “%w” to match word characters. Second, the “word characters” matched differ: Lua's “%w” matches alphanumeric characters, while PCRE's “\w” matches alphanumeric characters plus the “_” underscore.
Note: The “\” escape character must itself be escaped as “\\” in quoted Lua strings, or you can use double square brackets as string delimiters, “[[<string value>]]”, which need no escaping.

    -- Iterate over all the words from string s, printing one per line:

    -- using gmatch
    s = "hello world from Lua"
    for w in s:gmatch("%w+") do
      print(w)
    end

    -- using rxmatch
    s = "hello world from Lua and PCRE"
    for w in s:rxmatch("\\w+") do        -- the \ character must be escaped as \\ in quoted Lua strings
    -- for w in s:rxmatch([[\w+]]) do    -- alternatively, use the [[ ]] syntax without escaping
      print(w)
    end
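Both points in the note can be checked in plain Lua, without Iguana. The following is a minimal sketch that uses only the standard string library; the sample string and variable names are illustrative assumptions.

```lua
-- A quoted string needs the backslash doubled; a long-bracket string does not.
local quoted  = "\\w+"
local bracket = [[\w+]]
print(quoted == bracket)                 --> true

-- %w does not match "_", so an identifier is split in two.
local words = {}
for w in ("user_id = 42"):gmatch("%w+") do
  words[#words + 1] = w
end
print(table.concat(words, ","))          --> user,id,42

-- Adding "_" to a character set approximates PCRE's \w for ASCII text.
local pcre_like = {}
for w in ("user_id = 42"):gmatch("[%w_]+") do
  pcre_like[#pcre_like + 1] = w
end
print(table.concat(pcre_like, ","))      --> user_id,42
```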
- This example uses captures “(<captured value>)” to collect all key value pairs and write them into a Lua table. As you can see the only difference is that PCRE uses the “\” escape character where Lua uses “%”.
    -- Collect all pairs key=value from the given string into a table:

    -- using gmatch
    t = {}
    s = "from=world, to=Lua"
    for k, v in s:gmatch("(%w+)=(%w+)") do
      t[k] = v
    end

    -- using rxmatch
    t = {}
    s = "from=world, to=Lua"
    for k, v in s:rxmatch("(\\w+)=(\\w+)") do        -- the \ character must be escaped as \\ in quoted Lua strings
    -- for k, v in s:rxmatch([[(\w+)=(\w+)]]) do     -- alternatively, use the [[ ]] syntax without escaping
      t[k] = v
    end
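The gmatch version can be wrapped in a small function so the resulting table is easy to inspect. This is only a sketch in standard Lua: `parse_pairs` is a hypothetical helper name, and allowing spaces around “=” is an added assumption, not part of the example above.

```lua
-- Hypothetical helper: collect key=value pairs, allowing spaces around "=".
local function parse_pairs(s)
  local t = {}
  for k, v in s:gmatch("(%w+)%s*=%s*(%w+)") do
    t[k] = v   -- a repeated key keeps its last value
  end
  return t
end

local pairs_found = parse_pairs("from = world, to=Lua, to=Iguana")
print(pairs_found.from)   --> world
print(pairs_found.to)     --> Iguana
```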
- You can use captures to find duplicated words:
    -- find duplicate words

    -- using gmatch
    local s = "hello hello world world"
    for k, v in s:gmatch("(%w+)(%s+%1)") do
      trace(k, v)
    end

    -- using rxmatch
    local s = "hello hello world world"
    for k, v in s:rxmatch("(\\b\\w+\\b)(\\W+\\1)") do
      trace(k, v)
    end
- And now for a trick with regex that you can’t do with Lua patterns, finding a word (or pattern) in a string that is not followed by another word. To achieve this we will use a PCRE “lookaround” subpattern.
To find “hello” only when it is not followed by “world”, use a negative lookahead:

    -- not possible with gmatch()
    t = {}
    local s = "hello everyone my name is ..."
    for k in s:rxmatch("(hello)(?!.*world)") do   -- negative lookahead - PCRE only
      trace(k)
    end
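To make clear what the lookahead is testing, here is a rough plain-Lua equivalent that does the check by hand. It is a sketch under assumptions: `find_not_followed_by` is a hypothetical helper, both words are treated as plain text rather than patterns, and it searches to the end of the whole string (PCRE's “.” stops at a newline by default, so the two only agree for single-line input).

```lua
-- Hypothetical helper: return the start positions of every `word` in `s`
-- that is not followed, anywhere later in `s`, by `forbidden`.
local function find_not_followed_by(s, word, forbidden)
  assert(word ~= "", "word must not be empty")
  local hits = {}
  local init = 1
  while true do
    local first, last = s:find(word, init, true)    -- true = plain-text find
    if not first then break end
    if not s:find(forbidden, last + 1, true) then   -- nothing forbidden after it
      hits[#hits + 1] = first
    end
    init = last + 1
  end
  return hits
end

print(#find_not_followed_by("hello everyone my name is ...", "hello", "world"))  --> 1
print(#find_not_followed_by("hello world, hello again", "hello", "world"))       --> 1
print(#find_not_followed_by("hello there, world", "hello", "world"))             --> 0
```

The second call reports only the later “hello” (position 14), because the first one still has “world” after it.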
- Capture quoted text in a string. Using captures allows us to write one regex that works for both single (‘) and double (“) quotes.

    -- capture text in quotes
    local s = 'She said "Hello everyone"'
    for k in s:rxmatch([[("[^"]+")]]) do          -- capture text including quotes
      trace(k)
    end
    --> "Hello everyone"

    local s = 'She said "Hello everyone"'
    for k in s:rxmatch([["([^"]+)"]]) do          -- capture text excluding the quotes
      trace(k)
    end
    --> Hello everyone

    local s = [[She said "Don't wait up"]]
    for k in s:rxmatch([[(['"][^'"]+['"])]]) do   -- capture text including quotes = FAIL: stops at (matches) the single quote
      trace(k)
    end
    --> "Don' (stops incorrectly on single quote)

    -- NOTE: a backreference has no special meaning inside a character class
    -- (PCRE reads [^\2] as "not the character with code 2"), so the patterns
    -- below use a lazy .+? that runs up to the matching closing quote instead
    local s = [[She said "Don't wait up"]]
    for k in s:rxmatch([[((["']).+?\2)]]) do      -- capture text including quotes = SUCCESS using capture \2 to identify
      trace(k)                                    -- which type of opening quote is used
    end
    --> "Don't wait up"

    local s = [[She said "Don't wait up"]]
    for k,v in s:rxmatch([[(["'])(.+?)\1]]) do    -- capture text excluding the quotes = SUCCESS using capture \1 to identify
      trace(k,v)                                  -- which type of opening quote is used
    end
    --> Don't wait up (NOTE: this is the second return "v")
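For comparison, Lua patterns also support back-references (written “%1”) and a lazy quantifier (“-”), so the same idea can be sketched with the standard string library. `first_quoted` is a hypothetical helper name used only for this illustration.

```lua
-- Return the text inside the first quoted span, whichever quote opens it.
-- (["']) captures the opening quote, (.-) lazily captures the contents,
-- and %1 requires the same quote character to close the span.
local function first_quoted(s)
  local quote, text = s:match([[(["'])(.-)%1]])
  return text
end

print(first_quoted([[She said "Don't wait up"]]))     --> Don't wait up
print(first_quoted([[He wrote 'say "hi" to her']]))   --> say "hi" to her
```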
- Capture text inside brackets or other delimiters:
    -- capture text inside delimiters like brackets, etc.
    local s = "Hello to the world (and everyone)"
    for k in s:rxmatch('(\\([^)]+\\))') do   -- capture text including brackets (or substitute other delimiters)
      trace(k)
    end
    --> (and everyone)

    local s = "Hello to the world (and everyone)"
    for k in s:rxmatch('\\(([^)]+)\\)') do   -- just capture text (NOT including brackets)
      trace(k)
    end
    --> and everyone
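When the delimiters can be nested, the “[^)]+” approach stops at the first closing bracket. Standard Lua patterns offer “%b()” for balanced pairs; the sketch below uses only the standard string library, and the nested sample string is an illustrative assumption.

```lua
local s = "Hello to the world (and everyone (yes, everyone))"

-- %b() matches from "(" to its balancing ")", including nested pairs.
local with_brackets = s:match("%b()")
print(with_brackets)            --> (and everyone (yes, everyone))

-- Strip the outer pair to keep only the contents.
local inner = with_brackets:sub(2, -2)
print(inner)                    --> and everyone (yes, everyone)

-- The [^)]+ style pattern stops at the first ")" instead.
local first_close = s:match("%(([^)]+)%)")
print(first_close)              --> and everyone (yes, everyone
```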
- Extract HTML or XML tags:
    -- extract HTML or XML tags
    local s = '<a href="#hello_world">Hello world link</a>'
    for k in s:rxmatch('(<[^>]+>)') do   -- extract HTML tags
      trace(k)
    end

    local s = [[<?xml version="1.0"?>
    <patients>
      <patient id="123">
        <first-name>John</first-name>
        <last-name>Smith</last-name>
      </patient>
    </patients>]]
    for k in s:rxmatch('(<[^>]+>)') do   -- extract XML tags
      trace(k)
    end

    local s = [[<?xml version="1.0"?>
    <patients>
      <patient id = "123">
        <first-name>John</first-name>
        <last-name>Smith</last-name>
      </patient>
    </patients>]]
    for k in s:rxmatch('(<[^>]+\\sid\\b[^>]+>)') do   -- only extract XML tags containing an id attribute
      trace(k)
    end
- White-listing and black-listing are both useful techniques.
- White-listing is simple with rxmatch() but next to impossible with gmatch():

    -- white-list

    -- using rxmatch
    -- match a single word
    local s = "Hello hello world I was here world"
    for k in s:rxmatch('(\\bhello\\b)', 'i') do   -- \b (word boundaries) to only match whole words
      trace(k)                                    -- 'i' (3rd param) for case insensitive match
    end

    -- extend to matching a list
    local s = "Hello hello world I was here world"
    for k in s:rxmatch('\\b(hello|world)\\b', 'i') do   -- move the \b (word boundaries) outside the capture group
      trace(k)
    end

    -- simply extend the list as required
    local s = "Hello to the world, Mars, Venus and the Universe"
    for k in s:rxmatch('\\b(hello|world|mars|venus|universe)\\b', 'i') do
      trace(k)
    end

    -- using gmatch - cannot be done easily
    -- matching a single word/phrase is easy
    local s = "Hello hello world I was here"
    for k in s:lower():gmatch('(hello)') do   -- using lower() for case insensitive match
      trace(k)
    end

    -- unfortunately Lua will also match partial words
    local s = "Hello to the worldwide web"
    for k in s:lower():gmatch('(world)') do   -- matches "world" in "worldwide"
      trace(k)
    end

    -- NOTE: requiring %W on both sides excludes partial matches,
    -- but it also misses the "hello" at the start of the string
    for k in s:lower():gmatch('%W(hello)%W') do
      trace(k)
    end

    -- also, Lua patterns do not support "|" (OR), so you cannot match members in a list
    local s = "Hello hello|world world I was here"
    for k in s:lower():gmatch('(hello|world)') do   -- matches the literal string "hello|world"
      trace(k)
    end

    -- though you could loop through a white-list stored in a table
    local s = "Hello hello world I was here world"
    local wlist = {'hello', 'world'}
    for i=1,#wlist do
      trace(wlist[i])
      for k in s:lower():gmatch(wlist[i]) do   -- using lower() for case insensitive match
        trace(k)
      end
    end
- A black-list is also simple with rxmatch(), but we are not even going to try it with gmatch():

    -- black-list

    -- first let's exclude a single word
    local s = "Hello hello world I was here world"
    for k in s:rxmatch([[\bhello\b(*SKIP)(*FAIL)|(\w+)]], 'i') do
      trace(k)
    end

    -- then exclude words in a list
    for k in s:rxmatch([[\bhello\b(*SKIP)(*FAIL)|\bworld\b(*SKIP)(*FAIL)|\w+]], 'i') do
      trace(k)
    end

    -- Compact Version: place \b and (*SKIP)(*FAIL) outside a non-capturing group
    for k in s:rxmatch([[\b(?:hello|world)\b(*SKIP)(*FAIL)|\w+]], 'i') do
      trace(k)
    end
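Where rxmatch() is not available, one possible workaround for both lists is to split the string into words first and then look each word up in a table. This is a sketch in standard Lua, not a replacement for the regex versions: `to_set` and `filter_words` are hypothetical helpers, and words are split on “%w+”, so “_” separates words here, unlike PCRE's “\w”.

```lua
-- Build a set (word -> true) from a list, lower-casing each entry.
local function to_set(list)
  local set = {}
  for _, word in ipairs(list) do
    set[word:lower()] = true
  end
  return set
end

-- Return the whole words of `s` that are in `list` (keep = true, white-list)
-- or not in `list` (keep = false, black-list), compared case-insensitively.
local function filter_words(s, list, keep)
  local set, out = to_set(list), {}
  for word in s:gmatch("%w+") do
    local listed = set[word:lower()] == true
    if listed == keep then
      out[#out + 1] = word
    end
  end
  return out
end

local s = "Hello to the worldwide web, world"
print(table.concat(filter_words(s, {"hello", "world"}, true), " "))    --> Hello world
print(table.concat(filter_words(s, {"hello", "world"}, false), " "))   --> to the worldwide web
```

Because each word is compared as a whole, “worldwide” is not mistaken for “world”.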
- Here are some examples of Unicode matching.
- First a trivial example to help understand matching a specific Unicode character like “à”:
Note: In UTF-8 the precomposed “à” (U+00E0) is a single code point that is encoded as two bytes, 195 and 160 (0xC3 0xA0). Patterns that work byte by byte see those two bytes separately. The same grapheme can also be written as two code points, U+0061 (a) followed by U+0300 (combining grave accent).

    -- demonstrate that "à" is stored as two UTF-8 bytes
    string.byte('à',1,2)            --> 195,160 = the UTF-8 bytes (0xC3 0xA0) that encode U+00E0

    -- using gmatch
    local s = "à"
    for k in s:gmatch(".") do       --> trace('\195') - matches the first byte (195)
      trace(k)                      --> trace('\160') - then the 2nd
    end
    for k in s:gmatch("..") do      --> trace('à') - matches the whole character (2 bytes)
      trace(k)
    end
    for k in s:gmatch("à") do       --> trivially matches "à"
      trace(k)
    end

    -- using rxmatch
    local s = "à"
    for k in s:rxmatch(".") do      --> trace('\195') - matches the first byte (195)
      trace(k)                      --> trace('\160') - then the 2nd
    end
    for k in s:rxmatch("..") do     --> trace('à') - matches the whole character (2 bytes)
      trace(k)
    end
    for k in s:rxmatch("à") do      --> trivially matches "à"
      trace(k)
    end

    -- using rxmatch Unicode specific features
    -- NOTE: you must include 'u' (unicode) as the 3rd parameter
    local s = "à"
    for k in s:rxmatch([[\X]], 'u') do       --> trace('à') - match the grapheme using "\X" (the unicode equivalent of ".")
      trace(k)
    end
    for k in s:rxmatch([[\p{L}]], 'u') do    --> trace('à') - match any unicode letter
      trace(k)
    end
    for k in s:rxmatch([[\p{Ll}]], 'u') do   --> trace('à') - match any unicode lower case letter
      trace(k)
    end
- Now let's look at matching multiple Unicode graphemes in a string. This can only be done with PCRE (rxmatch).

    -- matching unicode graphemes
    -- using rxmatch
    local s = "Ábcd éfgh ©copyright"
    local cnt = 0
    for k in s:rxmatch([[\X]], 'u') do        -- match all unicode graphemes "\X"
      trace(k)
      cnt = cnt + 1
    end
    trace(cnt)   --> cnt = 20 matches each character (the unicode graphemes "Áé©" each count as a single character)

    cnt = 0
    for k in s:rxmatch([[\p{L}]], 'u') do     -- match all unicode letter graphemes "\p{L}"
      trace(k)
      cnt = cnt + 1
    end
    trace(cnt)   --> cnt = 17 two spaces and the "©" are not matched

    cnt = 0
    for k in s:rxmatch([[\p{Ll}]], 'u') do    -- match lowercase unicode letter graphemes "\p{Ll}"
      trace(k)
      cnt = cnt + 1
    end
    trace(cnt)   --> cnt = 16 spaces, "Á" and the "©" are not matched

    cnt = 0
    for k in s:rxmatch([[\p{Lu}]], 'u') do    -- match uppercase unicode letter graphemes "\p{Lu}"
      trace(k)
      cnt = cnt + 1
    end
    trace(cnt)   --> cnt = 1 only "Á" is matched

    cnt = 0
    for k in s:rxmatch([[\p{S}]], 'u') do     -- match unicode symbol graphemes "\p{S}"
      trace(k)
      cnt = cnt + 1
    end
    trace(cnt)   --> cnt = 1 only "©" is matched
- Detect if a string contains graphemes for a specified language, using unicode scripts like \p{Greek} or \p{Cyrillic}, etc.
    -- matching unicode scripts like \p{Greek} or \p{Cyrillic}, etc.
    -- using rxmatch
    local s = "Hello world in Greek Γειά σου Κόσμε (from google translate)"
    local cnt = 0
    for k in s:rxmatch([[\p{Greek}]], 'u') do   -- match all Greek graphemes
      trace(k)
      cnt = cnt + 1
    end
    trace(cnt)   --> cnt = 12 Greek letters

    local s = "Hello world in Greek Γειά σου Κόσμε (from google translate)"
    local cnt = 0
    for k in s:rxmatch([[\P{Greek}]], 'u') do   -- match all NON Greek graphemes
      trace(k)
      cnt = cnt + 1
    end
    trace(cnt)   --> cnt = 47 NON Greek characters (letters, spaces and brackets)
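To see why byte-oriented patterns struggle here, it can help to walk a UTF-8 string by hand. The sketch below uses only the standard string library and is a rough approximation under stated assumptions: `count_greek` is a hypothetical helper, it counts code points rather than graphemes, it assumes valid UTF-8 input, and it only recognises the Greek and Coptic block (U+0370 to U+03FF), which is narrower than the \p{Greek} script property.

```lua
-- Pattern for one UTF-8 encoded code point: a lead byte followed by any
-- continuation bytes (128-191). Byte values are written as decimal escapes.
local UTF8_CHAR = "[\1-\127\194-\244][\128-\191]*"

-- Count all code points in `s`, and those in the Greek and Coptic block,
-- whose two-byte encodings run from 205,176 (U+0370) to 207,191 (U+03FF).
local function count_greek(s)
  local total, greek = 0, 0
  for ch in s:gmatch(UTF8_CHAR) do
    total = total + 1
    local lead, cont = ch:byte(1, 2)
    if #ch == 2 and ((lead == 205 and cont >= 176) or lead == 206 or lead == 207) then
      greek = greek + 1
    end
  end
  return total, greek
end

-- "ab " followed by alpha (U+03B1) and beta (U+03B2), spelled out as bytes
local total, greek = count_greek("ab \206\177\206\178")
print(total)   --> 5
print(greek)   --> 2
```

The same loop treats the two bytes of “à” (195, 160) as one code point, which is what the 'u' option does for rxmatch().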
Continue: Using rxsub()