Lua Patterns and Captures (Regular Expressions)

From OpenEUO
Revision as of 12:26, 12 December 2010 by Ximan (Talk | contribs) (Magic Characters)


Intro

Lua's string library contains several functions that work with patterns, which are a lightweight subset of regular expressions (RegEx). Regular expressions aren't unique to Lua; they're used across many languages and tools and have solid roots in automata theory. In this tutorial, I will explain how you can use these very useful expressions to your advantage.

The following Lua functions support RegEx:

string.find (s, pattern [, init [, plain]])
string.gmatch (s, pattern)
string.gsub (s, pattern, repl [, n])
string.match (s, pattern [, init])

You can find the technical documentation for Lua's RegEx patterns here: http://www.lua.org/manual/5.1/manual.html#5.4.1
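As a quick orientation, here is a minimal sketch that calls each of the four functions once on the same string. The sample string and variable names are made up for illustration.

```lua
local s = "hello world"

-- string.find returns the start and end index of the first match
local first, last = string.find(s, "wor")

-- with plain = true the pattern is treated as literal text, so "." is just a dot
local dotStart = string.find("a.b", ".", 1, true)

-- string.match returns the matched text itself
local word = string.match(s, "wor")

-- string.gsub returns the new string and the number of replacements made
local replaced, count = string.gsub(s, "o", "0")

-- string.gmatch returns an iterator over all matches
local letters = {}
for ch in string.gmatch(s, "o") do
  letters[#letters + 1] = ch
end

print(first, last)      --> 7  9
print(dotStart)         --> 2
print(word)             --> wor
print(replaced, count)  --> hell0 w0rld  2
print(#letters)         --> 2
```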

Pattern Matching

In the following examples, we will use the string.match function. It allows us to specify a pattern and returns the specific part of the string that matches the pattern or nil if no match was found.

So if we have the string "hello" and are looking for pattern "el" then it'll return "el" (or nil in case hello was spelled incorrectly):

s = "hello"
print(s:match("el")) --returns "el"
s = "hallo"
print(s:match("el")) --returns nil

Not very useful so far.

Magic Characters

It gets more interesting when we start using magic characters. The most basic of those is . (dot) which represents any character.

s = "hello"
print(s:match("h.l")) --returns "hel"
s = "hallo"
print(s:match("h.l")) --returns "hal"

Another magic character is ? which declares that the previous character may appear 0 or 1 times.

So if we wanted to allow both "hello" and "helo" we could define the pattern as "hell?o" which would make the second "l" optional:

s = "hello"
print(s:match("hell?o")) --returns "hello"
s = "helo"
print(s:match("hell?o")) --returns "helo"
s = "heo"
print(s:match("hell?o")) --returns nil

? is one of three basic repetition characters:

?: previous character may appear 0 to 1 times.
+: previous character may appear 1 to infinite times.
*: previous character may appear 0 to infinite times.

So the pattern "hel+o" matches "helo" with as many "l" characters as you like, but there must be at least one. The pattern "hel*o" additionally matches the string "heo".
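The difference between "+" and "*" is easy to check directly. The strings below are made-up test inputs:

```lua
-- "+" needs at least one "l"
print(("helllo"):match("hel+o"))  --> helllo
print(("heo"):match("hel+o"))     --> nil

-- "*" also accepts no "l" at all
print(("helllo"):match("hel*o"))  --> helllo
print(("heo"):match("hel*o"))     --> heo
```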

Of course, you can combine those things as well:

s = "hello!"
print(s:match("h.+o!?")) --returns "hello!"

The above matches any string that contains "h" and "o" with at least one character in between, followed by an optional exclamation mark. (Note that "!" is not a magic character.)
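One thing to keep in mind with ".+" is that "+" and "*" are greedy: they take the longest run they can. Lua also has "-", which repeats like "*" but takes the shortest possible run (it shows up again in the real world example further down). A small sketch with a made-up sentence shows the difference:

```lua
local s = "hello to you!"

-- greedy: ".+" runs up to the LAST "o" in the string
print(s:match("h.+o!?"))  --> hello to yo

-- lazy: ".-" stops at the FIRST "o" it can
print(s:match("h.-o!?"))  --> hello
```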

Character Classes

Apart from "." which can be any character, there are also more restrictive subclasses:

From the Lua documentation:

%a: represents all letters.
%c: represents all control characters.
%d: represents all digits.
%l: represents all lowercase letters.
%p: represents all punctuation characters.
%s: represents all space characters.
%u: represents all uppercase letters.
%w: represents all alphanumeric characters.
%x: represents all hexadecimal digits.
%z: represents the character with representation 0.

The pattern "he%l%lo" matches any part of the string that contains "he" and "o" with exactly two lowercase letters in between.
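A few made-up inputs show how the classes behave. The last line relies on a rule from the Lua manual that is not listed above: an uppercase class letter stands for the complement of the class, so %D is any character that is not a digit.

```lua
print(("hello"):match("he%l%lo"))  --> hello
print(("heLLo"):match("he%l%lo"))  --> nil (uppercase letters are not in %l)
print(("he11o"):match("he%d%do"))  --> he11o

-- %D = anything except a digit, followed by one digit
print(("a1b2"):match("%D%d"))      --> a1
```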

You can also combine several character classes using brackets. E.g. in most languages, a valid identifier has to start with a letter or underscore followed by any number of letters, underscores or digits. So let's match that:

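s = "_myvar91"
-- exactly one letter or underscore first, then any number of letters, digits or underscores
-- ^ and $ (see Anchors below) make the pattern cover the whole string
if s:match("^[%a_][%a%d_]*$") then
  print("valid identifier!")
end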


% is also used as the escape character in Lua RegEx patterns, so "%?+" matches 1 or more question marks in the string ("%?" represents the literal character "?").
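Here is a short sketch of escaping in practice. The helper escapePattern is a hypothetical name, not part of Lua's library; it simply puts a % in front of every punctuation character, which the Lua manual allows even for non-magic ones.

```lua
print(("really???"):match("%?+"))         --> ???
print(("price: 2.50"):match("%d+%.%d+"))  --> 2.50

-- without the escape, "." matches any character
print(("2x50"):match("%d+.%d+"))          --> 2x50

-- a literal percent sign is written as "%%"
print(("50%"):match("%d+%%"))             --> 50%

-- hypothetical helper: make arbitrary text safe to use as a pattern
local function escapePattern(text)
  -- "%%%0" means: a literal "%" followed by the matched character
  return (text:gsub("%p", "%%%0"))
end

print(escapePattern("1.5?"))              --> 1%.5%?
```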

Anchors

There are two additional magic characters we haven't discussed yet:

^: marks beginning of string
$: marks end of string

In all of the above examples we always looked inside a longer string to find shorter matches. So the pattern "hello" may be found inside a longer string like "he said hello and went away". The pattern "^hello$" matches exactly "hello" and nothing else (no preceding or trailing characters allowed!). You could also say that we anchored the pattern at the string's beginning and end.

s = "hello world!"
print(s:match("^he..")) --returns "hell"
s = "123 hello world!"
print(s:match("^he..")) --returns nil
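
The fully anchored pattern "^hello$" from the text can be tried the same way. The last two lines are an extra illustration, assuming you want to check that a whole string consists of digits only:

```lua
print(("hello"):match("^hello$"))                        --> hello
print(("he said hello and went away"):match("^hello$"))  --> nil
print(("he said hello and went away"):match("hello"))    --> hello

-- "$" on its own anchors only the end
print(("say hello"):match("hello$"))                     --> hello

-- whole-string check: digits only
print(("12345"):match("^%d+$"))                          --> 12345
print(("123a5"):match("^%d+$"))                          --> nil
```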

Captures

Now we come to the most useful feature of RegEx patterns: Captures.

A capture (defined through parentheses) lets us mark which part or parts of the pattern we want string.match to return to us:

s = "abc"
print(s:match("a.c")) --returns "abc"
print(s:match("a(.)c")) --returns "b"


While both of the above patterns match "abc", the second one uses a capture to say that we are only interested in the single character between "a" and "c" and want neither "a" nor "c" returned in the result. You can also specify multiple captures; they are returned in the order in which they appear in the pattern.
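
A minimal sketch of multiple captures, using made-up input strings. Note that captures always come back as strings, so tonumber is needed if you want to calculate with them.

```lua
-- three captures, returned in pattern order ("-" is magic, so it is escaped)
local date = "2010-12-12"
local year, month, day = date:match("(%d+)%-(%d+)%-(%d+)")
print(year, month, day)      --> 2010  12  12

-- captures are strings; convert before doing arithmetic
print(tonumber(year) + 1)    --> 2011

-- two captures with optional whitespace around "="
local key, value = ("Weight = 2"):match("(%a+)%s*=%s*(%d+)")
print(key, value)            --> Weight  2
```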

Real World Example

Here is a real world example of proper use of RegEx:

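a,b = UO.Property(UO.LTargetID) -- a: item name, b: property text (one property per line)
print("***"..a.."***")
b = "\r\n"..b -- prepend a line break so the first line also matches %c+
for s,n in b:gmatch("%c+([^%c]-)%s([%d%.]+)") do -- s: label, n: number
  print('Str: "'..s..'", Num: '..n)
end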


Output:

***Cleaver***
Str: "Weight:", Num: 2
Str: "Physical Damage", Num: 100
Str: "Weapon Damage", Num: 11
Str: "Weapon Speed", Num: 2.5
Str: "Strength Requirement", Num: 10
Str: "Durability", Num: 45


Explanation of used pattern:

1) %c+
2)    ([^%c]-)
3)            %s
4)              ([%d%.]+)


  • 1) at least one control character (line break)
  • 2) any number of characters except control characters; the "-" takes the shortest possible combination; returned as capture
  • 3) a single whitespace character (here, a space)
  • 4) any number of (but at least one) digits or dots, returned as capture
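
To try the pattern without a running client, the sketch below feeds it a made-up property text. The variable propertyText and its contents are assumptions that only imitate the kind of text the example above works on; they are not real output of UO.Property.

```lua
-- Made-up sample text: one property per line, separated by "\r\n"
local propertyText = "Weight: 2\r\nPhysical Damage 100%\r\nWeapon Speed 2.5"

-- prepend a line break so the first line also starts with control characters
local text = "\r\n" .. propertyText

local found = {}
for label, number in text:gmatch("%c+([^%c]-)%s([%d%.]+)") do
  -- captures are strings, so convert the number part
  found[#found + 1] = { label = label, number = tonumber(number) }
  print('Str: "' .. label .. '", Num: ' .. number)
end
```

With this sample text the loop runs three times, and the trailing "%" after 100 is left out because it is neither a digit nor a dot.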

Conclusion

To keep its file size and memory requirements small, Lua only supports a subset of what other RegEx packages provide. Nonetheless, you can do A LOT with the available expressions. And if your parsing requirements turn out to be too complex, you can always break a string down into several parts and parse them individually.
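
Breaking a string down can look like this: a first, very simple pattern splits the line at the commas, and a second pattern parses each part on its own. The input line and the table name stats are made up for this sketch.

```lua
local line = "str=10, dex=25, int=100"
local stats = {}

-- step 1: split at commas ("[^,]+" = a run of characters that are not commas)
for part in line:gmatch("[^,]+") do
  -- step 2: parse one "key=value" part, ignoring surrounding whitespace
  local key, value = part:match("^%s*(%a+)=(%d+)%s*$")
  if key then
    stats[key] = tonumber(value)
  end
end

print(stats.str, stats.dex, stats.int)  --> 10  25  100
```

Each of the two patterns stays short enough to read at a glance, which is often easier to maintain than one large pattern that tries to do everything.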

The only disadvantage is that your code can become truly unreadable for anyone who doesn't know RegEx, so you should always add plenty of comments!
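
One possible way to comment a pattern, shown here as a sketch rather than a fixed convention, is to build it from pieces with string concatenation so that every piece gets its own comment. The pattern is the one from the real world example; the test string is made up.

```lua
local pattern =
     "%c+"        -- at least one control character (the line break)
  .. "([^%c]-)"   -- capture 1: the label, as short as possible
  .. "%s"         -- one whitespace character
  .. "([%d%.]+)"  -- capture 2: digits and dots (the number)

local label, number = ("\r\nWeapon Speed 2.5"):match(pattern)
print(label, number)  --> Weapon Speed  2.5
```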

Further Information