Using Regular Expressions

G. Wade Johnson

What are Regular Expressions?

A family of languages for matching strings
A system for describing strings abstractly
A tool for extracting data from strings

There are a lot of variations of regular expression syntax and functionality. Be sure you know which set you are using to avoid over/under matching.

Ways to Use Regexes

Matching strings
Parsing strings
Validating strings

Regular expressions are very good at fuzzy descriptions of strings. I'm going to focus on three broad uses of regular expressions.

Learning/Testing Tools

rxrx from Regexp::Debugger
RegexPal
Search online; there are many more

Hopefully, I'll get some time later to demonstrate rxrx from Perl. Its regular expressions are a little different from the ones you will be using, but its ability to show how a match happens is pretty amazing.

Terms

Literals: a, %, 3, etc.
Escape sequences: \0, \n, \r, \t, \f, etc.
Any single character: .
Escaped metacharacters: \., \*, etc.
Anchors/assertions: ^, $, \A, \z

You can think of these as the fundamental terms in specifying a regular expression. We use these by themselves or with various operators to construct our regular expressions. Although it seems obvious, all but the last set match one character.

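The anchors don't actually match a character; instead they match a location in a string. In multi-line mode (the /m flag in Perl), ^ matches at the beginning of the string or immediately after a newline character, and $ matches at the end of the string or immediately before a newline character. Without that mode, ^ matches only at the beginning of the string, and $ matches only at the end or before a final trailing newline. \A and \z always match only at the very beginning and very end of the string.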
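As a rough illustration of anchors matching positions rather than characters, here is a minimal sketch using Python's re module. Python is an assumption made for these examples, since the talk does not name a target language, and the variable names are hypothetical. The sketch lists the offsets where ^ and $ match, with and without multi-line mode.

```python
import re

two_lines = "first\nsecond"  # 12 characters, newline at offset 5

# Default mode: ^ matches only at the start, $ only at the end.
starts_default = [m.start() for m in re.finditer(r"^", two_lines)]
ends_default = [m.start() for m in re.finditer(r"$", two_lines)]

# Multi-line mode: ^ also matches right after a newline,
# and $ also matches right before one.
starts_multiline = [m.start() for m in re.finditer(r"^", two_lines, re.MULTILINE)]
ends_multiline = [m.start() for m in re.finditer(r"$", two_lines, re.MULTILINE)]

print(starts_default, ends_default)      # [0] [12]
print(starts_multiline, ends_multiline)  # [0, 6] [5, 12]
```

Every match here is zero characters long, which is what "matching a location" means in practice.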

Basic Operations

Concatenation: abc
Alternation: abc|def
Quantifiers: a+, a*, a?
Generalized Quantifiers: a{2,5}, a{2,}

The concatenation of a series of terms matches the first term, followed immediately by the second, etc.

Alternation matches either the expression to its left or the expression to its right. It will attempt a complete match on the left one before attempting the right one. If the first matches, the second won't be tried. This means it finds the first match, not the best match.
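The "first match, not best match" behavior is easy to see with two alternatives where one is a prefix of the other. This is a small sketch in Python's re module (an assumed language for illustration); the sample word is made up.

```python
import re

# The left alternative is tried first, so "cat" wins even though
# "category" would be a longer match at the same position.
first = re.search(r"cat|category", "category").group()

# Putting the longer alternative on the left changes the result.
longest = re.search(r"category|cat", "category").group()

print(first)    # cat
print(longest)  # category
```

When alternatives overlap like this, ordering them from most specific to least specific is usually the simplest fix.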

More Operations

Bracketed Character Class: [abc], [^abc]
Character Class: \d, \D, \w
Grouping: (abc), (?:abc)
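The operations above can be sketched briefly in Python's re module (assumed here for illustration; the sample line and variable names are hypothetical). The sketch shows a capturing group, a non-capturing group, and a negated bracketed class.

```python
import re

line = "order 66 shipped on day 7"

# \d+ inside a capturing group: findall returns what the group captured.
numbers = re.findall(r"(\d+)", line)

# (?:...) groups the alternation without capturing it,
# so group 1 is still the run of digits.
m = re.search(r"(?:order|invoice) (\d+)", line)
order_id = m.group(1)

# A negated bracketed class: everything up to the first space.
first_word = re.match(r"[^ ]+", line).group()

print(numbers, order_id, first_word)
```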

Matching

Mostly assuming positive intent
Is there a double quoted string?
Does the anchor tag have a target attribute?
Any duplicated words in a sentence?

The regex we use in these cases can be sloppier. In most cases, a false positive is not really a big problem. Most of the time, this kind of match is followed by more code that gets more specific about the data.

Sloppier Match: Quoted String

Regex: /".*"/
String: abcdef "ghi" jkl

The regex is straightforward. It matches any string containing at least two double quotes on the same line. Because .* is greedy, the match runs from the first double quote to the last one, so a string with three or more double quotes also matches.
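A quick check of the sloppy pattern, sketched in Python's re module (an assumption for illustration). The first input is the slide's own example; the second is a made-up string with three double quotes.

```python
import re

sloppy = re.compile(r'".*"')

# The slide's example: exactly one quoted string.
one_pair = sloppy.search('abcdef "ghi" jkl').group()

# Three double quotes: the match spans from the first to the last.
three_quotes = sloppy.search('a "b" c "d').group()

print(one_pair)      # "ghi"
print(three_quotes)  # "b" c "
```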

Sloppier Match: Mismatches

abc "def
abc "def" ghi "jkl" mno
abc "def\" ghi
abc "def" ghi "mno
abc 'def" ghi "jkl' mno
abc \"def" ghi
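To see how the sloppy regex /".*"/ behaves on each of the strings above, here is a small sketch in Python's re module (an assumed language; the helper name matched_text is hypothetical). It reports the text matched for each sample, or None when nothing matches.

```python
import re

sloppy = re.compile(r'".*"')

samples = [
    'abc "def',
    'abc "def" ghi "jkl" mno',
    r'abc "def\" ghi',
    'abc "def" ghi "mno',
    "abc 'def\" ghi \"jkl' mno",
    r'abc \"def" ghi',
]

def matched_text(s):
    """Return the text the sloppy regex matches in s, or None."""
    m = sloppy.search(s)
    return m.group() if m else None

results = [matched_text(s) for s in samples]
for sample, found in zip(samples, results):
    print(f"{sample!r:32} -> {found!r}")
```

Only the first sample is rejected. The others all match, but the matched text is either too long, treats an escaped quote as a delimiter, or pairs quotes that do not belong together.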

Parsing

Stricter than matching
Find and extract particular parts of the string
Not necessarily matching the whole string
Target string may have errors, but mostly as expected

Parsing: Quoted Strings

/("(?:\\.|[^"\\]+)+")/

This extracts the double-quoted string and is more precise than the matching case. It handles escaped quotes correctly, and it matches only the first quoted string without extending past its closing quote.
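The parsing pattern can be exercised with a few short inputs. This is a minimal sketch in Python's re module (assumed for illustration; first_quoted is a hypothetical helper). The inputs are deliberately short, because the pattern nests one quantifier inside another.

```python
import re

# The slide's pattern: an opening quote, then escaped characters or runs
# of non-quote, non-backslash characters, then the closing quote.
quoted = re.compile(r'("(?:\\.|[^"\\]+)+")')

def first_quoted(s):
    """Return the first double-quoted string in s, or None."""
    m = quoted.search(s)
    return m.group(1) if m else None

plain = first_quoted('abc "def" ghi "jkl" mno')
escaped = first_quoted(r'abc "def\" ghi" jkl')
unterminated = first_quoted('abc "def')
# The outer + needs at least one character between the quotes,
# so an empty quoted string is not matched by this pattern.
empty = first_quoted('abc "" def')

print(plain, escaped, unterminated, empty)
```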

Parsing: Floating Point

Match: /[\d.]+/
Parse: /([0-9]+\.[0-9]*)/

So, what's different between these two?

The first case matches a string of digits and decimal points.

The second case matches something that can be converted into a floating point number.
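The difference between the two patterns shows up as soon as the input is not a clean number. Here is a small sketch in Python's re module (an assumption for illustration; the sample strings are made up).

```python
import re

match_re = re.compile(r"[\d.]+")
parse_re = re.compile(r"([0-9]+\.[0-9]*)")

# The loose pattern accepts text that is not a number at all.
loose_hits = [match_re.search(s).group() for s in ["1.2.3", "...", "42"]]

# The stricter pattern captures digits, a point, and optional digits,
# which float() can convert.
value = float(parse_re.search("width=3.25cm").group(1))

# From "1.2.3" it captures only the leading "1.2".
partial = parse_re.search("1.2.3").group(1)

# An integer has no decimal point, so nothing is captured.
integer = parse_re.search("42")

print(loose_hits, value, partial, integer)
```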

Validation

Most strict
Match the whole string
Assume any level of error
Assume malicious input

Unlike the previous two patterns, for validation you need to be paranoid. Assume that someone is trying to slip invalid input by you. Or, possibly, trying to get you to ignore valid input.

Validation: Floating Point Number

Are integers allowed as well?
Are negative numbers allowed?
How many digits after the decimal point? 0, 1, 2, 3, n
Are digits required before the decimal point?
Are unnecessary leading zeros allowed?
Is exponential notation allowed?
Are special floating point values allowed? (NaN, Inf, -Inf, etc)

A large portion of validation is figuring out what is valid and what is invalid. Depending on the circumstance, any of the above questions could be answered either way. (There are a few combinations that I haven't personally needed to deal with, but I have dealt with most combinations.)

Validation: Floating Point

/\A-?(?:0|[1-9][0-9]*)\.[0-9]{0,3}\z/

Can be negative, no leading zeros, can end in '.', up to 3 decimals
No integers allowed
No exponential notation, or special float values
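The rules listed above can be checked against the pattern directly. This is a sketch in Python's re module (assumed for illustration), with one loud caveat: Python spells the end-of-string anchor \Z where the slide writes \z. The names VALID and is_valid and the sample inputs are hypothetical.

```python
import re

# \Z is Python's spelling of the \z used on the slide.
VALID = re.compile(r"\A-?(?:0|[1-9][0-9]*)\.[0-9]{0,3}\Z")

def is_valid(s):
    """True if s satisfies the slide's floating point rules."""
    return VALID.search(s) is not None

accepted = ["0.5", "-12.345", "12.", "-0.0"]
rejected = ["12", "01.5", "1.2345", "1e5", ".5", "1.5\n", "NaN"]

for s in accepted + rejected:
    print(f"{s!r:12} {is_valid(s)}")
```

Note that a trailing newline is rejected, which is the reason for anchoring with \A and the strict end anchor rather than ^ and $.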

Validation: Floating Point 2

/\A(?:0|[1-9][0-9]*)(?:\.[0-9]{1,2})?\z/

Non-negative, no leading zeros, up to 2 decimals
Integers allowed
No exponential notation, or special float values
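The second set of rules can be sketched the same way in Python's re module (an assumed language; \Z again stands in for the slide's \z, and the names VALID2 and is_valid2 are hypothetical).

```python
import re

# \Z is Python's spelling of the \z used on the slide.
VALID2 = re.compile(r"\A(?:0|[1-9][0-9]*)(?:\.[0-9]{1,2})?\Z")

def is_valid2(s):
    """True if s is a non-negative number with at most 2 decimals."""
    return VALID2.search(s) is not None

accepted2 = ["0", "42", "42.5", "42.50", "0.07"]
rejected2 = ["42.", "-1", "007", "1.234", ".5", "1e3", ""]

for s in accepted2 + rejected2:
    print(f"{s!r:10} {is_valid2(s)}")
```

Because the whole fractional part is one optional group, a decimal point must be followed by at least one digit, so "42." is rejected here even though the previous pattern accepted "12.".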

Matching Tips

Anchor if it helps, otherwise not
Be liberal in what you match, but not too liberal
Loose character classes can be useful
Series of simple matches may be better than one complicated match

Parsing Tips

Anchor if it helps
Match around what you want to extract.
If you know what you want to keep, regex works well
If you know what you want to remove, split can be better
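The last two tips can be contrasted on one input. This is a minimal sketch in Python's re module (assumed for illustration); the record format is invented for the example.

```python
import re

record = "alice, bob ,carol;dave"

# Know what to keep: describe the wanted pieces and collect them.
names_kept = re.findall(r"[a-z]+", record)

# Know what to remove: describe the separators and split on them.
names_split = re.split(r"\s*[,;]\s*", record)

print(names_kept)
print(names_split)
```

Both give the same four names here. They diverge when the fields contain characters the "keep" pattern did not anticipate, which is when describing the separators instead tends to be more robust.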

Validation Tips

Positive match: Always anchor both ends
Use the right anchors
Be careful not to match too much
- Greedy matches
- Loose character classes
- .
Try not to get too complicated/clever
Negative matches can be very effective
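Two of these tips, using the right anchors and using a negative match, can be sketched together in Python's re module (an assumption for illustration). The validation rule itself, one or more ASCII letters or digits, is hypothetical, as are the names. Python writes the strict end-of-string anchor as \Z.

```python
import re

# Hypothetical rule: a name is one or more ASCII letters or digits.
loose = re.compile(r"^[A-Za-z0-9]+$")     # $ tolerates a trailing newline
strict = re.compile(r"\A[A-Za-z0-9]+\Z")  # \Z matches only at the very end

sneaky = "admin\n"
loose_accepts = loose.search(sneaky) is not None
strict_accepts = strict.search(sneaky) is not None

# Negative match: search for the first character that is NOT allowed.
bad_char = re.compile(r"[^A-Za-z0-9]")

def is_valid_name(name):
    """Valid when non-empty and no disallowed character is found."""
    return name != "" and bad_char.search(name) is None

print(loose_accepts, strict_accepts)
print(is_valid_name("admin"), is_valid_name("admin\n"))
```

The negative form needs no anchors at all, which removes one whole class of mistakes.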

General Regex Tips

Character classes rather than alternates of characters:
- /(a|b|c)/
- /[abc]/
Watch anchors and alternates: /\Acat|dog\z/
Matching "not this string" is harder than you think
Lookahead (positive and negative) can be your friend
Be careful with greedy matches
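The anchor-and-alternate pitfall and the lookahead tip can both be shown in a few lines. This is a sketch in Python's re module (assumed for illustration; \Z is Python's spelling of \z, and the word lists are made up).

```python
import re

words = ["cat", "dog", "catalog", "hotdog"]

# Alternation binds more loosely than the anchors, so this reads as
# "starts with cat" OR "ends with dog".
leaky = re.compile(r"\Acat|dog\Z")
# Grouping applies both anchors to both alternatives.
exact = re.compile(r"\A(?:cat|dog)\Z")

leaky_hits = [w for w in words if leaky.search(w)]
exact_hits = [w for w in words if exact.search(w)]

# Negative lookahead: any lowercase word except exactly "cat".
not_cat = re.compile(r"\A(?!cat\Z)[a-z]+\Z")
not_cat_hits = [w for w in ["cat", "cats", "dog"] if not_cat.search(w)]

print(leaky_hits)
print(exact_hits)
print(not_cat_hits)
```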

Tricks

Greedy vs Non-greedy vs Possessive
- /".+"/
- /".+?"/
- /"[^"]++"/
Possessive doesn't backtrack
Watch for matching empty strings: /ab*c/
Watch for nested quantifiers, with backtracking: /a(b*c*)*d/
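Greedy and non-greedy quantifiers are easy to compare side by side. This is a sketch in Python's re module (an assumed language; the sample text is made up). Python's re accepts the possessive ++ form only from version 3.11, so the sketch uses a plain negated class in its place; that substitution is mine, not the slide's.

```python
import re

text = 'say "one" then "two"'

# Greedy: .+ grabs as much as possible, then backs off to the last quote.
greedy = re.search(r'".+"', text).group()

# Non-greedy: .+? grows one character at a time until a quote fits.
lazy = re.search(r'".+?"', text).group()

# Negated class: can never run past a quote in the first place.
negated = re.search(r'"[^"]+"', text).group()

# b* may match nothing at all, so "ac" satisfies /ab*c/.
empty_ok = re.search(r"ab*c", "ac") is not None

print(greedy)   # "one" then "two"
print(lazy)     # "one"
print(negated)  # "one"
print(empty_ok)
```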



Houston.pm

November 14, 2019
