A library for parsing, compiling, and executing regular expressions. The match time is linear in the length of the text and of the regular expression, so it can safely handle input from untrusted users. The syntax is similar to PCRE, but it lacks a few features that cannot be implemented while keeping the space/time complexity guarantees, e.g. backreferences.
Syntax
Matching one character
.           any character except new line (includes new line with s flag)
\d          digit (\p{Nd})
\D          not digit
\pN         one-letter name Unicode character class
\p{Greek}   Unicode character class (general category or script)
\PN         negated one-letter name Unicode character class
\P{Greek}   negated Unicode character class (general category or script)
Character classes
[xyz]         A character class matching either x, y or z (union).
[^xyz]        A character class matching any character except x, y and z.
[a-z]         A character class matching any character in range a-z.
[[:alpha:]]   ASCII character class ([A-Za-z])
[[:^alpha:]]  Negated ASCII character class ([^A-Za-z])
[\[\]]        Escaping in character classes (matching [ or ])
Composites
xy    concatenation (x followed by y)
x|y   alternation (x or y, prefer x)
Repetitions
x*        zero or more of x (greedy)
x+        one or more of x (greedy)
x?        zero or one of x (greedy)
x*?       zero or more of x (ungreedy/lazy)
x+?       one or more of x (ungreedy/lazy)
x??       zero or one of x (ungreedy/lazy)
x{n,m}    at least n x and at most m x (greedy)
x{n,}     at least n x (greedy)
x{n}      exactly n x
x{n,m}?   at least n x and at most m x (ungreedy/lazy)
x{n,}?    at least n x (ungreedy/lazy)
x{n}?     exactly n x
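The difference between greedy and lazy repetition is easiest to see with find. The following is a minimal sketch, assuming the library is installed and imported as regex; the sample text is made up for illustration.

```nim
import regex

let text = "aaa"
var m = RegexMatch2()

# Greedy: a+ consumes as many characters as it can.
doAssert find(text, re2"a+", m)
doAssert m.boundaries == 0 .. 2

# Lazy: a+? stops as soon as one character has matched.
doAssert find(text, re2"a+?", m)
doAssert m.boundaries == 0 .. 0
```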
Empty matches
^    the beginning of text (or start-of-line with multi-line mode)
$    the end of text (or end-of-line with multi-line mode)
\A   only the beginning of text (even with multi-line mode enabled)
\z   only the end of text (even with multi-line mode enabled)
\b   a Unicode word boundary (\w on one side and \W, \A, or \z on other)
\B   not a Unicode word boundary
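As a rough illustration of the assertions above, the sketch below checks word boundaries and line anchors with the contains function (used through in and notin). It assumes the library is imported as regex; the sample strings are made up.

```nim
import regex

# \b matches only between a word character and a non-word character
# (or the edge of the text).
doAssert re2"\bcat\b" in "a cat sat"
doAssert re2"\bcat\b" notin "concatenate"

# ^ and $ anchor to the whole text unless multi-line mode is set.
doAssert re2"^two$" notin "one\ntwo"
doAssert re2"(?m)^two$" in "one\ntwo"
```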
Grouping and flags
(exp)            numbered capture group (indexed by opening parenthesis)
(?P<name>exp)    named (also numbered) capture group (allowed chars: [_0-9a-zA-Z])
(?:exp)          non-capturing group
(?flags)         set flags within current group
(?flags:exp)     set flags for exp (non-capturing)
Flags are each a single character. For example, (?x) sets the flag x and (?-x) clears the flag x. Multiple flags can be set or cleared at the same time: (?xy) sets both the x and y flags, (?x-y) sets the x flag and clears the y flag, and (?-xy) clears both the x and y flags.
i   case-insensitive: letters match both upper and lower case
m   multi-line mode: ^ and $ match begin/end of line
s   allow . to match \L (new line)
U   swap the meaning of x* and x*? (un-greedy mode)
u   Unicode support (enabled by default)
x   ignore whitespace and allow line comments (starting with #)
All flags are disabled by default unless stated otherwise.
The regex accepts passing a set of flags:
regexCaseless         same as (?i)
regexMultiline        same as (?m)
regexDotAll           same as (?s)
regexUngreedy         same as (?U)
regexAscii            same as (?-u)
regexExtended         same as (?x)
regexArbitraryBytes   treat both the regex and the input text as arbitrary byte sequences
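The two ways of setting flags can be compared side by side. This is a small sketch, assuming the library is imported as regex; the patterns and sample strings are illustrative only.

```nim
import regex

# (?i) applies to the rest of the enclosing group.
doAssert match("ABC", re2"(?i)abc")
# (?i:exp) limits the flag to exp; the trailing c stays case-sensitive.
doAssert match("ABc", re2"(?i:ab)c")
doAssert not match("ABC", re2"(?i:ab)c")

# (?s) lets . match a new line.
doAssert not match("a\nb", re2"a.b")
doAssert match("a\nb", re2"(?s)a.b")

# The same flags passed as a set instead of inline.
doAssert match("ABC", re2(r"abc", {regexCaseless}))
doAssert match("a\nb", re2(r"a.b", {regexDotAll}))
```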
Escape sequences
\*          literal *, works for any punctuation character: \.+*?()|[]{}^$
\a          bell (\x07)
\f          form feed (\x0C)
\t          horizontal tab
\n          new line (\L)
\r          carriage return
\v          vertical tab (\x0B)
\123        octal character code (up to three digits)
\x7F        hex character code (exactly two digits)
\x{10FFFF}  any hex character code corresponding to a Unicode code point
\u007F      hex character code (exactly four digits)
\U0010FFFF  hex character code (exactly eight digits)
Perl character classes (Unicode friendly)
These classes are based on the definitions provided in UTS#18.
\d   digit (\p{Nd})
\D   not digit
\s   whitespace (\p{White_Space})
\S   not whitespace
\w   word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control})
\W   not word character
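To illustrate what "Unicode friendly" means for \d, the sketch below matches Arabic-Indic digits, which belong to \p{Nd}, and then repeats the match with Unicode support cleared through (?-u). It assumes the library is imported as regex; the sample strings are made up.

```nim
import regex

# Arabic-Indic digits are in \p{Nd}, so the Unicode-aware \d matches them.
doAssert match("١٢٣", re2"\d+")

# With Unicode support cleared, \d is limited to ASCII digits.
doAssert not match("١٢٣", re2"(?-u)\d+")
doAssert match("123", re2"(?-u)\d+")
```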
ASCII character classes
[[:alnum:]]    alphanumeric ([0-9A-Za-z])
[[:alpha:]]    alphabetic ([A-Za-z])
[[:ascii:]]    ASCII ([\x00-\x7F])
[[:blank:]]    blank ([\t ])
[[:cntrl:]]    control ([\x00-\x1F\x7F])
[[:digit:]]    digits ([0-9])
[[:graph:]]    graphical ([!-~])
[[:lower:]]    lower case ([a-z])
[[:print:]]    printable ([ -~])
[[:punct:]]    punctuation ([!-/:-@\[-`{-~])
[[:space:]]    whitespace ([\t\n\v\f\r ])
[[:upper:]]    upper case ([A-Z])
[[:word:]]     word characters ([0-9A-Za-z_])
[[:xdigit:]]   hex digit ([0-9A-Fa-f])
Lookaround Assertions
(?=regex)    A positive lookahead assertion
(?!regex)    A negative lookahead assertion
(?<=regex)   A positive lookbehind assertion
(?<!regex)   A negative lookbehind assertion
Any regular expression is valid inside a lookaround, and groups inside it are captured as well. Beware: lookarounds containing repetitions (*, +, and {n,}) may run in polynomial time.
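A lookaround checks the surrounding text without consuming it, so it does not become part of the match boundaries. The sketch below shows this with single-character assertions; it assumes the library is imported as regex, and the sample strings are made up.

```nim
import regex

var m = RegexMatch2()

# Positive lookahead: match a only when b follows; b is not consumed.
doAssert find("ab", re2"a(?=b)", m)
doAssert m.boundaries == 0 .. 0

# Negative lookahead: match a only when b does not follow.
doAssert not find("ab", re2"a(?!b)", m)
doAssert find("ac", re2"a(?!b)", m)

# Positive lookbehind: match b only when a precedes it.
doAssert find("ab", re2"(?<=a)b", m)
doAssert m.boundaries == 1 .. 1
```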
Examples
Match
The match function matches a text from start to end, similar to ^regex$. This means the whole text needs to match the regex for this function to return true.
let text = "nim c --styleCheck:hint --colors:off regex.nim"
var m = RegexMatch2()
if match(text, re2"nim c (?:--(\w+:\w+) *)+ (\w+).nim", m):
  doAssert text[m.group(0)] == "colors:off"
  doAssert text[m.group(1)] == "regex"
else:
  doAssert false, "no match"
Captures
Like most other regex engines, this library only captures the last repetition in a repeated group (*, +, {n}). Note how in the previous example both styleCheck:hint and colors:off are matched in the same group but only the last captured match (colors:off) is returned.
To check whether a capture group matched, compare it against reNonCapture, for example doAssert m.group(0) != reNonCapture. This is useful to distinguish empty captures from non-matched captures, since both return an empty string when slicing the text.
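The distinction can be sketched with two patterns that both match "b" but treat the optional a differently. This assumes the library is imported as regex; the patterns are made up for illustration.

```nim
import regex

var m = RegexMatch2()

# (a)? does not take part in the match, so the group is a non-capture.
doAssert match("b", re2"(a)?b", m)
doAssert m.group(0) == reNonCapture

# (a?) takes part in the match but captures the empty string.
doAssert match("b", re2"(a?)b", m)
doAssert m.group(0) != reNonCapture
doAssert "b"[m.group(0)] == ""
```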
The space complexity for captures is O(regex_len * groups_count), and so it can be used to match untrusted text.
Find
The find function finds the first piece of text that matches a given regex.
let text = """
The Continental's email list:
john_wick@continental.com
winston@continental.com
ms_perkins@continental.com
"""
var match = ""
var capture = ""
var m = RegexMatch2()
if find(text, re2"(\w+)@\w+\.\w+", m):
  match = text[m.boundaries]
  capture = text[m.group(0)]
doAssert match == "john_wick@continental.com"
doAssert capture == "john_wick"
Find All
The findAll function finds all pieces of text that match a given regex, returning their boundaries and captures/submatches.
let text = """
The Continental's email list:
john_wick@continental.com
winston@continental.com
ms_perkins@continental.com
"""
var matches = newSeq[string]()
var captures = newSeq[string]()
for m in findAll(text, re2"(\w+)@\w+\.\w+"):
  matches.add text[m.boundaries]
  captures.add text[m.group(0)]
doAssert matches == @[
  "john_wick@continental.com",
  "winston@continental.com",
  "ms_perkins@continental.com"
]
doAssert captures == @["john_wick", "winston", "ms_perkins"]
Verbose Mode
Verbose mode (?x) makes regexes more readable by allowing comments and line breaks within the regular expression itself. The caveat is that spaces and pound signs must be escaped to be matched.
const exp = re2"""(?x)
\#   # the hashtag
\w+  # hashtag words
"""
let text = "#NimLang"
doAssert match(text, exp)
Match Macro
The match macro is sometimes more convenient, and faster than the function version. It will run a full match on the whole string, similar to ^regex$.
A matches: seq[string] variable is injected into the scope, and it contains the submatches for every capture group.
var matched = false
let text = "[my link](https://example.com)"
match text, rex"\[([^\]]+)\]\((https?://[^)]+)\)":
  doAssert matches == @["my link", "https://example.com"]
  matched = true
doAssert matched
Invalid UTF-8 input text
UTF-8 validation on the input text is only done in debug mode for performance reasons. The behaviour on invalid UTF-8 input (e.g. malformed, corrupted, or truncated text) when compiling in release/danger mode is currently undefined, and it will likely result in an internal AssertionDefect or some other error.
To avoid this, validate the input text before passing it to the match function.
import unicode
# good input text
doAssert validateUtf8("abc") == -1
# bad input text
doAssert validateUtf8("\xf8\xa1\xa1\xa1\xa1") != -1
Note that, at the time of writing, Nim's validateUtf8 is not strict enough, so you are better off using nim-unicodeplus's verifyUtf8 function.
Match arbitrary bytes
Setting the regexArbitraryBytes flag will treat both the regex and the input text as byte sequences. This flag makes ascii mode the default.
const flags = {regexArbitraryBytes}
doAssert match("\xff", re2(r"\xff", flags))
doAssert match("\xf8\xa1\xa1\xa1\xa1", re2(r".+", flags))
Beware of (un)expected behaviour when mixing in UTF-8 characters.
const flags = {regexArbitraryBytes}
doAssert match("Ⓐ", re2(r"Ⓐ", flags))
doAssert match("ⒶⒶ", re2(r"(Ⓐ)+", flags))
doAssert not match("ⒶⒶ", re2(r"Ⓐ+", flags))  # ???
The last line in the above example won't match because the regex is parsed as a byte sequence. The Ⓐ character is composed of multiple bytes (\xe2\x92\xb6), and only the last byte is affected by the + operator.
Compile the regex at compile time
Passing a regex literal or assigning it to a const will compile the regex at compile time. Errors in the expression will be caught at compile time this way.
Do not confuse the regex compilation with the matching operation. The following examples do the matching at runtime, although matching at compile time is supported as well.
let text = "abc"
block:
  const rexp = re2".+"
  doAssert match(text, rexp)
block:
  doAssert match(text, re2".+")
block:
  func myFn(s: string, exp: static string) =
    const rexp = re2(exp)
    doAssert match(s, rexp)
  myFn(text, r".+")
Using a const can avoid confusion when passing flags:
let text = "abc"
block:
  const rexp = re2(r".+", {regexDotAll})
  doAssert match(text, rexp)
block:
  doAssert match(text, re2(r".+", {regexDotAll}))
block:
  # this will compile the expression at runtime
  # because flags is a let rather than a const, avoid it!
  let flags = {regexDotAll}
  doAssert match(text, re2(r".+", flags))
Compile the regex at runtime
Most of the time compiling the regex at runtime can be avoided, and it should be avoided. Nim has really good compile-time capabilities like reading files, constructing strings, and so on. However, it cannot be helped in cases where the regex is passed to the program at runtime (from terminal input, network, or text files).
To compile the regex at runtime, pass the regex expression as a var/let.
let text = "abc"
block:
  var rexp = r".+"
  doAssert match(text, re2(rexp))
block:
  let rexp = r".+"
  doAssert match(text, re2(rexp))
block:
  func myFn(s: string, exp: string) =
    doAssert match(s, re2(exp))
  myFn(text, r".+")
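When the expression comes from outside the program, an invalid pattern can only be detected at runtime: the runtime re2 overload is declared as raising RegexError. The sketch below wraps that in a hypothetical helper, tryCompile, which is not part of the library; it assumes the library is imported as regex.

```nim
import regex

# tryCompile is a hypothetical helper for this example, not a library proc.
proc tryCompile(pattern: string; compiled: var Regex2): bool =
  ## Returns false instead of raising when the pattern is invalid.
  try:
    compiled = re2(pattern)  # runtime overload, may raise RegexError
    result = true
  except RegexError:
    result = false

var rexp: Regex2
# A well-formed pattern compiles and can be used right away.
doAssert tryCompile(r"\w+@\w+", rexp)
doAssert match("a@b", rexp)
# An unbalanced parenthesis is rejected.
doAssert not tryCompile(r"(\w+", rexp)
```

Reporting the failure through a bool keeps the caller free of exception handling; re-raising or returning an Option would work equally well.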
Consts
reNonCapture = (a: -1, b: -2)
Procs
func contains(s: string; pattern: Regex): bool {....raises: [], deprecated: "use contains(string, Regex2) instead", tags: [RootEffect].}
func contains(s: string; pattern: Regex2): bool {....raises: [], tags: [RootEffect].}
Example:
doAssert re2"bc" in "abcd"
doAssert re2"(23)+" in "23232"
doAssert re2"^(23)+$" notin "23232"
func endsWith(s: string; pattern: Regex): bool {....raises: [], deprecated: "use endsWith(string, Regex2) instead", tags: [RootEffect].}
func endsWith(s: string; pattern: Regex2): bool {....raises: [], tags: [RootEffect].}
return whether the string ends with the pattern or not
Example:
doAssert "abc".endsWith(re2"\w")
doAssert not "abc".endsWith(re2"\d")
func escapeRe(s: string): string {....raises: [], tags: [].}
Escape special regex characters in s so that it can be matched verbatim
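A typical use is matching user-supplied text literally. The sketch below is a minimal illustration, assuming the library is imported as regex; the sample text is made up. Since the input is a let, both patterns are compiled at runtime.

```nim
import regex

let literal = "1+1=2(really?)"

# Unescaped, the text is read as a regex: + and ? are operators
# and the parentheses form a group.
doAssert match("11=2really", re2(literal))
doAssert not match(literal, re2(literal))

# Escaped, it only matches itself.
let verbatim = re2(escapeRe(literal))
doAssert match(literal, verbatim)
doAssert not match("11=2really", verbatim)
```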
func find(s: string; pattern: Regex2; m: var RegexMatch2; start = 0): bool {. ...raises: [], tags: [RootEffect].}
search through the string looking for the first location where there is a match
Example:
var m = RegexMatch2()
doAssert "abcd".find(re2"bc", m) and m.boundaries == 1 .. 2
doAssert not "abcd".find(re2"de", m)
doAssert "2222".find(re2"(22)*", m) and m.group(0) == 2 .. 3
func find(s: string; pattern: Regex; m: var RegexMatch; start = 0): bool {. ...raises: [], deprecated: "use find(string, Regex2, var RegexMatch2) instead", tags: [RootEffect].}
func findAll(s: string; pattern: Regex2; start = 0): seq[RegexMatch2] {. ...raises: [], tags: [RootEffect].}
func findAll(s: string; pattern: Regex; start = 0): seq[RegexMatch] {. ...raises: [], deprecated: "use findAll(string, Regex2) instead", tags: [RootEffect].}
func findAllBounds(s: string; pattern: Regex2; start = 0): seq[Slice[int]] {. ...raises: [], tags: [RootEffect].}
func findAllBounds(s: string; pattern: Regex; start = 0): seq[Slice[int]] {. ...raises: [], deprecated: "use findAllBounds(string, Regex2) instead", tags: [RootEffect].}
func findAndCaptureAll(s: string; pattern: Regex): seq[string] {....raises: [], deprecated: "use findAll(string, Regex2) instead", tags: [RootEffect].}
func group(m: RegexMatch2; i: int): Slice[int] {.inline, ...raises: [], tags: [].}
return the slice for a given group. Slices with start > end are empty matches (e.g. re2"(\d?)"), and they are included, same as in PCRE.
Example:
let text = "abc"
var m = RegexMatch2()
doAssert text.match(re2"(\w)+", m)
doAssert text[m.group(0)] == "c"
func group(m: RegexMatch2; s: string): Slice[int] {.inline, ...raises: [KeyError], tags: [].}
return the slice for a given named group
Example:
let text = "abc"
var m = RegexMatch2()
doAssert text.match(re2"(?P<foo>\w)+", m)
doAssert text[m.group("foo")] == "c"
func group(m: RegexMatch; groupName: string; text: string): seq[string] {. inline, ...raises: [KeyError], deprecated, tags: [].}
func group(m: RegexMatch; i: int): seq[Slice[int]] {.inline, ...raises: [], deprecated: "use group(RegexMatch2, int)", tags: [].}
func group(m: RegexMatch; i: int; text: string): seq[string] {.inline, ...raises: [], deprecated, tags: [].}
func group(m: RegexMatch; s: string): seq[Slice[int]] {.inline, ...raises: [KeyError], deprecated: "use group(RegexMatch2, string)", tags: [].}
func groupFirstCapture(m: RegexMatch; groupName: string; text: string): string {. inline, ...raises: [KeyError], deprecated, tags: [].}
func groupFirstCapture(m: RegexMatch; i: int; text: string): string {.inline, ...raises: [], deprecated, tags: [].}
func groupLastCapture(m: RegexMatch; groupName: string; text: string): string {. inline, ...raises: [KeyError], deprecated: "use group(RegexMatch2, string) instead", tags: [].}
func groupLastCapture(m: RegexMatch; i: int; text: string): string {.inline, ...raises: [], deprecated: "use group(RegexMatch2, int) instead", tags: [].}
func groupNames(m: RegexMatch): seq[string] {.inline, ...raises: [], deprecated: "use groupNames(RegexMatch2)", tags: [].}
func groupNames(m: RegexMatch2): seq[string] {.inline, ...raises: [], tags: [].}
return the names of capturing groups.
Example:
let text = "hello world"
var m = RegexMatch2()
doAssert text.match(re2"(?P<greet>hello) (?P<who>world)", m)
doAssert m.groupNames == @["greet", "who"]
func groupsCount(m: RegexMatch): int {.inline, ...raises: [], deprecated: "use groupsCount(RegexMatch2)", tags: [].}
func groupsCount(m: RegexMatch2): int {.inline, ...raises: [], tags: [].}
return the number of capturing groups
Example:
var m = RegexMatch2()
doAssert "ab".match(re2"(a)(b)", m)
doAssert m.groupsCount == 2
func isInitialized(re: Regex): bool {.inline, ...raises: [], deprecated: "use isInitialized(Regex2) instead", tags: [].}
func isInitialized(re: Regex2): bool {.inline, ...raises: [], tags: [].}
Check whether the regex has been initialized
Example:
var re: Regex2
doAssert not re.isInitialized
re = re2"foo"
doAssert re.isInitialized
func match(s: string; pattern: Regex): bool {....raises: [], deprecated: "use match(string, Regex2) instead", tags: [RootEffect].}
func match(s: string; pattern: Regex2; m: var RegexMatch2; start = 0): bool {. ...raises: [], tags: [RootEffect].}
return a match if the whole string matches the regular expression. This is similar to find(text, re2"^regex$", m) but has better performance
Example:
var m = RegexMatch2()
doAssert "abcd".match(re2"abcd", m)
doAssert not "abcd".match(re2"abc", m)
func match(s: string; pattern: Regex; m: var RegexMatch; start = 0): bool {. ...raises: [], deprecated: "use match(string, Regex2, var RegexMatch2) instead", tags: [RootEffect].}
func re(s: string): Regex {....raises: [RegexError], deprecated: "use re2(string) instead", tags: [].}
func re2(s: static string; flags: static RegexFlags = {}): static[Regex2]
Parse and compile a regular expression at compile-time
func re2(s: string; flags: RegexFlags = {}): Regex2 {....raises: [RegexError], tags: [].}
Parse and compile a regular expression at run-time
Example:
let abcx = re2"abc\w"
let abcx2 = re2(r"abc\w")
let pat = r"abc\w"
let abcx3 = re2(pat)
func replace(s: string; pattern: Regex2; by: proc (m: RegexMatch2; s: string): string; limit = 0): string {. ...raises: [], effectsOf: by, ...tags: [RootEffect].}
Replace matched substrings.
If limit is given, at most limit replacements are done. limit of 0 means there is no limit
Example:
proc removeStars(m: RegexMatch2, s: string): string =
  result = s[m.group(0)]
  if result == "*":
    result = ""
let text = "**this is a test**"
doAssert text.replace(re2"(\*)", removeStars) == "this is a test"
func replace(s: string; pattern: Regex2; by: string; limit = 0): string {. ...raises: [ValueError], tags: [RootEffect].}
Replace matched substrings.
Matched groups can be accessed with $N notation, where N is the group's index, starting at 1 (1-indexed). $$ means literal $.
If limit is given, at most limit replacements are done. limit of 0 means there is no limit
Example:
doAssert "aaa".replace(re2"a", "b", 1) == "baa"
doAssert "abc".replace(re2"(a(b)c)", "m($1) m($2)") == "m(abc) m(b)"
doAssert "Nim is awesome!".replace(re2"(\w\B)", "$1_") == "N_i_m i_s a_w_e_s_o_m_e!"
func replace(s: string; pattern: Regex; by: proc (m: RegexMatch; s: string): string; limit = 0): string {. ...raises: [], effectsOf: by, ...deprecated: "use replace(string, Regex2, proc(RegexMatch2, string): string) instead", tags: [RootEffect].}
func replace(s: string; pattern: Regex; by: string; limit = 0): string {. ...raises: [ValueError], deprecated: "use replace(string, Regex2, string) instead", tags: [RootEffect].}
func split(s: string; sep: Regex): seq[string] {....raises: [], deprecated: "use split(string, Regex2) instead", tags: [RootEffect].}
func split(s: string; sep: Regex2): seq[string] {....raises: [], tags: [RootEffect].}
return not matched substrings
Example:
doAssert split("11a22Ϊ33Ⓐ44弢55", re2"\d+") == @["", "a", "Ϊ", "Ⓐ", "弢", ""]
func splitIncl(s: string; sep: Regex): seq[string] {....raises: [], deprecated: "use splitIncl(string, Regex2) instead", tags: [RootEffect].}
func splitIncl(s: string; sep: Regex2): seq[string] {....raises: [], tags: [RootEffect].}
return not matched substrings, including captured groups
Example:
let
  parts = splitIncl("a,b", re2"(,)")
  expected = @["a", ",", "b"]
doAssert parts == expected
func startsWith(s: string; pattern: Regex2; start = 0): bool {....raises: [], tags: [RootEffect].}
return whether the string starts with the pattern or not
Example:
doAssert "abc".startsWith(re2"\w")
doAssert not "abc".startsWith(re2"\d")
func startsWith(s: string; pattern: Regex; start = 0): bool {....raises: [], deprecated: "use startsWith(string, Regex2) instead", tags: [RootEffect].}
func toPattern(s: string): Regex {....raises: [RegexError], deprecated: "Use `re2(string)` instead", tags: [].}
Iterators
iterator findAll(s: string; pattern: Regex2; start = 0): RegexMatch2 {.inline, ...raises: [], tags: [RootEffect].}
search through the string and return each match. Empty matches (start > end) are included
Example:
let text = "abcabc"
var bounds = newSeq[Slice[int]]()
var found = newSeq[string]()
for m in findAll(text, re2"bc"):
  bounds.add m.boundaries
  found.add text[m.boundaries]
doAssert bounds == @[1 .. 2, 4 .. 5]
doAssert found == @["bc", "bc"]
iterator findAll(s: string; pattern: Regex; start = 0): RegexMatch {.inline, ...raises: [], deprecated: "use findAll(string, Regex2) instead", tags: [RootEffect].}
iterator findAllBounds(s: string; pattern: Regex2; start = 0): Slice[int] {. inline, ...raises: [], tags: [RootEffect].}
search through the string and return each match. Empty matches (start > end) are included
Example:
let text = "abcabc"
var bounds = newSeq[Slice[int]]()
for bd in findAllBounds(text, re2"bc"):
  bounds.add bd
doAssert bounds == @[1 .. 2, 4 .. 5]
iterator findAllBounds(s: string; pattern: Regex; start = 0): Slice[int] {. inline, ...raises: [], deprecated: "use findAllBounds(string, Regex2) instead", tags: [RootEffect].}
iterator group(m: RegexMatch; i: int): Slice[int] {.inline, ...raises: [], deprecated, tags: [].}
iterator group(m: RegexMatch; s: string): Slice[int] {.inline, ...raises: [KeyError], deprecated, tags: [].}
Macros
macro match(text: string; regex: RegexLit; body: untyped): untyped
return a match if the whole string matches the regular expression. This is similar to the match function, but faster. Notice it requires a raw regex literal string as second parameter; the regex must be known at compile time, and cannot be a var/let/const
A matches: seq[string] variable is injected into the scope, and it contains the submatches for every capture group. If a group is repeated (e.g. (\w)+), it will contain the last capture for that group.
Note: Only available in Nim 1.1 and later
Example:
match "abc", rex"(a(b)c)":
  doAssert matches == @["abc", "b"]