nregex

A library for parsing, compiling, and executing regular expressions. Match time is linear in the length of the input, so it is safe to run against input from untrusted users. The syntax is similar to PCRE but omits the features that cannot be implemented while keeping the space/time complexity guarantees, namely backreferences and look-around assertions.

Syntax

Matching one character

.          any character except new line (includes new line with s flag)
\d         digit (\p{Nd})
\D         not digit
\pN        one-letter name Unicode character class
\p{Greek}  Unicode character class (general category or script)
\PN        negated one-letter name Unicode character class
\P{Greek}  negated Unicode character class (general category or script)

Character classes

[xyz]         A character class matching either x, y or z (union).
[^xyz]        A character class matching any character except x, y and z.
[a-z]         A character class matching any character in range a-z.
[[:alpha:]]   ASCII character class ([A-Za-z])
[[:^alpha:]]  Negated ASCII character class ([^A-Za-z])
[\[\]]        Escaping in character classes (matching [ or ])

Composites

xy   concatenation (x followed by y)
x|y  alternation (x or y, prefer x)

Repetitions

x*       zero or more of x (greedy)
x+       one or more of x (greedy)
x?       zero or one of x (greedy)
x*?      zero or more of x (ungreedy/lazy)
x+?      one or more of x (ungreedy/lazy)
x??      zero or one of x (ungreedy/lazy)
x{n,m}   at least n x and at most m x (greedy)
x{n,}    at least n x (greedy)
x{n}     exactly n x
x{n,m}?  at least n x and at most m x (ungreedy/lazy)
x{n,}?   at least n x (ungreedy/lazy)
x{n}?    exactly n x

Empty matches

^   the beginning of text (or start-of-line with multi-line mode)
$   the end of text (or end-of-line with multi-line mode)
\A  only the beginning of text (even with multi-line mode enabled)
\z  only the end of text (even with multi-line mode enabled)
\b  a Unicode word boundary (\w on one side and \W, \A, or \z on the other)
\B  not a Unicode word boundary

Grouping and flags

(exp)          numbered capture group (indexed by opening parenthesis)
(?P<name>exp)  named (also numbered) capture group (allowed chars: [_0-9a-zA-Z])
(?:exp)        non-capturing group
(?flags)       set flags within current group
(?flags:exp)   set flags for exp (non-capturing)

Flags are each a single character. For example, (?x) sets the flag x and (?-x) clears the flag x. Multiple flags can be set or cleared at the same time: (?xy) sets both the x and y flags, (?x-y) sets the x flag and clears the y flag, and (?-xy) clears both the x and y flags.

i  case-insensitive: letters match both upper and lower case
m  multi-line mode: ^ and $ match begin/end of line
s  allow . to match \L (new line)
U  swap the meaning of x* and x*? (un-greedy mode)
u  Unicode support (enabled by default)
x  ignore whitespace and allow line comments (starting with `#`)

All flags are disabled by default unless stated otherwise.
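
As a minimal sketch of the inline flag syntax described above, the following checks the i and s flags and the scoped (?flags:exp) form. The import line is an assumption: it presumes the module is importable under the name in the title.

```nim
import nregex  # assumption: module name taken from the title

# i: letters match both upper and lower case
doAssert not "HELLO".match(re"hello")
doAssert "HELLO".match(re"(?i)hello")

# s: allow . to match a new line
doAssert not "a\nb".match(re"a.b")
doAssert "a\nb".match(re"(?s)a.b")

# (?flags:exp) limits the flag to exp; the rest stays case-sensitive
doAssert "Hello".match(re"(?i:h)ello")
doAssert not "HELLO".match(re"(?i:h)ello")
```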

Escape sequences

\*         literal *, works for any punctuation character: \.+*?()|[]{}^$
\a         bell (\x07)
\f         form feed (\x0C)
\t         horizontal tab
\n         new line (\L)
\r         carriage return
\v         vertical tab (\x0B)
\123       octal character code (up to three digits)
\x7F       hex character code (exactly two digits)
\x{10FFFF} any hex character code corresponding to a Unicode code point
\u007F     hex character code (exactly four digits)
\U0010FFFF hex character code (exactly eight digits)

Perl character classes (Unicode friendly)

These classes are based on the definitions provided in UTS#18.

\d  digit (\p{Nd})
\D  not digit
\s  whitespace (\p{White_Space})
\S  not whitespace
\w  word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control})
\W  not word character

ASCII character classes

[[:alnum:]]   alphanumeric ([0-9A-Za-z])
[[:alpha:]]   alphabetic ([A-Za-z])
[[:ascii:]]   ASCII ([\x00-\x7F])
[[:blank:]]   blank ([\t ])
[[:cntrl:]]   control ([\x00-\x1F\x7F])
[[:digit:]]   digits ([0-9])
[[:graph:]]   graphical ([!-~])
[[:lower:]]   lower case ([a-z])
[[:print:]]   printable ([ -~])
[[:punct:]]   punctuation ([!-/:-@\[-`{-~])
[[:space:]]   whitespace ([\t\n\v\f\r ])
[[:upper:]]   upper case ([A-Z])
[[:word:]]    word characters ([0-9A-Za-z_])
[[:xdigit:]]  hex digit ([0-9A-Fa-f])

Funcs

func re(s: string; flags: set[RegexFlag] = {}): Regex {.inline,
    raises: [KeyError, IndexError, RegexError], tags: [].}
func re(s: static string; flags: static set[RegexFlag] = {}): static Regex {.inline.}
func group(m: RegexMatch; i: int): seq[Slice[int]] {.raises: [], tags: [].}
Return the slices for a given group. Use the iterator version if performance matters.
func group(m: RegexMatch; s: string): seq[Slice[int]] {.raises: [KeyError], tags: [].}
Return the slices for a given named group. Use the iterator version if performance matters.
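
A short sketch of both group funcs, by index and by name. The input string and the group names k and v are made up for illustration, and the import line assumes the module is importable under the name in the title. Each returned slice indexes into the matched text.

```nim
import nregex  # assumption: module name taken from the title

let text = "key=value"
var m: RegexMatch
doAssert text.match(re"(?P<k>\w+)=(?P<v>\w+)", m)

# Groups are indexed from 0, in order of their opening parenthesis.
doAssert m.group(0) == @[0 .. 2]
# The same captures are reachable by name.
doAssert m.group("v") == @[4 .. 8]
# A slice can be applied to the text to recover the captured substring.
doAssert text[m.group("v")[0]] == "value"
```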
func groupsCount(m: RegexMatch): int {.raises: [], tags: [].}
Return the number of capturing groups.

Examples:

var m: RegexMatch
doAssert "ab".match(re"(a)(b)", m)
doAssert m.groupsCount == 2
func groupNames(m: RegexMatch): seq[string] {.raises: [], tags: [].}
Return the names of the capturing groups.

Examples:

let text = "hello world"
var m: RegexMatch
doAssert text.match(re"(?P<greet>hello) (?P<who>world)", m)
doAssert m.groupNames() == @["greet", "who"]
func group(m: RegexMatch; groupName: string; text: string): seq[string] {.
    raises: [KeyError], tags: [].}
Return a seq of the text captured by the group groupName.

Examples:

let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(?P<greet>hello) (?:(?P<who>[^\s]+)\s?)+", m)
doAssert m.group("greet", text) == @["hello"]
doAssert m.group("who", text) == @["beautiful", "world"]
func groupFirstCapture(m: RegexMatch; groupName: string; text: string): string {.
    raises: [KeyError], tags: [].}
Return the first capture for a given capturing group.

Examples:

let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(?P<greet>hello) (?:(?P<who>[^\s]+)\s?)+", m)
doAssert m.groupFirstCapture("greet", text) == "hello"
doAssert m.groupFirstCapture("who", text) == "beautiful"
func groupLastCapture(m: RegexMatch; groupName: string; text: string): string {.
    raises: [KeyError], tags: [].}
Return the last capture for a given capturing group.

Examples:

let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(?P<greet>hello) (?:(?P<who>[^\s]+)\s?)+", m)
doAssert m.groupLastCapture("greet", text) == "hello"
doAssert m.groupLastCapture("who", text) == "world"
func isInitialized(re: Regex): bool {.inline, raises: [], tags: [].}
Check whether the regex has been initialized.

Examples:

var re: Regex
doAssert not re.isInitialized
re = re"foo"
doAssert re.isInitialized
func match(s: string; pattern: static Regex; m: var RegexMatch; start = 0): bool {.inline.}
Return true if the whole string matches the regular expression, storing the match in m. This is similar to find(text, re"^regex$", m) but performs better.

Examples:

var m: RegexMatch
doAssert "abcd".match(re"abcd", m)
doAssert not "abcd".match(re"abc", m)
func match(s: string; pattern: static Regex): bool {.inline.}

Examples:

doAssert "abcd".match(re"abcd")
doAssert not "abcd".match(re"abc")
func match(s: string; pattern: Regex; m: var RegexMatch; start = 0): bool {.inline,
    raises: [KeyError], tags: [].}
func match(s: string; pattern: Regex): bool {.inline, raises: [KeyError], tags: [].}
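
The non-static overloads accept a Regex built at run time. The sketch below compiles a pattern that is only known at run time and treats an invalid pattern as "no match". The helper name matchesPattern is hypothetical, the import line assumes the module is importable under the name in the title, and the code assumes RegexError (listed in the raises of re) is exported and is what an unbalanced parenthesis raises.

```nim
import nregex  # assumption: module name taken from the title

# Hypothetical helper: false for "no match" and for an invalid pattern alike.
proc matchesPattern(text, pattern: string): bool =
  try:
    let rx = re(pattern)      # run-time overload; may raise RegexError
    result = text.match(rx)   # whole-string match against the run-time Regex
  except RegexError:
    result = false

doAssert matchesPattern("abcd", "abc.")
doAssert not matchesPattern("abcd", "abc")   # must match the whole string
doAssert not matchesPattern("abcd", "(abc")  # assumed to be rejected as invalid
```

Swallowing the error keeps the sketch short; code that takes patterns from users would more likely report the RegexError message back to them.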
func contains(s: string; pattern: Regex): bool {.inline, raises: [KeyError], tags: [].}
Search for the pattern anywhere in the string. It returns as soon as there is a match, even when the expression has repetitions.

Examples:

doAssert re"bc" in "abcd"
doAssert re"(23)+" in "23232"
doAssert re"^(23)+$" notin "23232"
func find(s: string; pattern: Regex; m: var RegexMatch; start = 0): bool {.inline,
    raises: [KeyError], tags: [].}
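
find carries no example of its own, so here is a minimal sketch. Unlike match, it does not require the whole string to match. The input text is made up for illustration, and the import line assumes the module is importable under the name in the title.

```nim
import nregex  # assumption: module name taken from the title

let text = "order 66 shipped"
var m: RegexMatch

# find succeeds on a partial match and fills m with the captures.
doAssert text.find(re"(\d+)", m)
doAssert text[m.group(0)[0]] == "66"

# It returns false when the pattern occurs nowhere in the string.
doAssert not text.find(re"(\d+)x", m)
```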
func findAll(s: string; pattern: Regex; start = 0): seq[RegexMatch] {.inline,
    raises: [KeyError], tags: [].}

Iterators

iterator group(m: RegexMatch; i: int): Slice[int] {.raises: [], tags: [].}
Return the slices for a given group. Slices with start > end are empty matches (e.g. re"(\d?)"), and they are included, as in PCRE.

Examples:

let text = "abc"
var m: RegexMatch
doAssert text.match(re"(\w)+", m)
var captures = newSeq[string]()
for bounds in m.group(0):
  captures.add(text[bounds])
doAssert captures == @["a", "b", "c"]
iterator group(m: RegexMatch; s: string): Slice[int] {.raises: [KeyError], tags: [].}
Return the slices for a given named group.

Examples:

let text = "abc"
var m: RegexMatch
doAssert text.match(re"(?P<foo>\w)+", m)
var captures = newSeq[string]()
for bounds in m.group("foo"):
  captures.add(text[bounds])
doAssert captures == @["a", "b", "c"]
iterator findAll(s: string; pattern: Regex; start = 0): RegexMatch {.inline,
    raises: [KeyError], tags: [].}
Find all non-overlapping matches.
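
A minimal sketch of iterating over the matches and collecting the text of the first capture group from each one. The input text is made up for illustration, and the import line assumes the module is importable under the name in the title.

```nim
import nregex  # assumption: module name taken from the title

let text = "a1 b22 c333"
var found = newSeq[string]()

# Each iteration yields one RegexMatch; its captures are read with group.
for m in text.findAll(re"(\d+)"):
  for bounds in m.group(0):
    found.add(text[bounds])

doAssert found == @["1", "22", "333"]
```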