Using Regular Expressions

G. Wade Johnson

What are Regular Expressions?

There are a lot of variations of regular expression syntax and functionality. Be sure you know which set you are using to avoid over/under matching.

Ways to Use Regexes

Regular expressions are really good at fuzzy descriptions of strings. I'm going to focus on three broad uses of regular expressions: matching, parsing, and validation.

Learning/Testing Tools

Hopefully, I'll get some time later to demonstrate rxrx from Perl. Its regular expressions are a little different from the ones you will be using, but its ability to show how a match happens is pretty amazing.

Terms

You can think of these as the fundamental terms in specifying a regular expression. We use these by themselves or with various operators to construct our regular expressions. It may seem obvious, but all except the last set match exactly one character.

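The anchors don't actually match a character. Instead, they match a location in a string. The ^ matches at the beginning of the string or, in multi-line mode, immediately after a newline character. The $ matches at the end of the string or, in multi-line mode, immediately before a newline character. In many flavors, $ also matches just before a single trailing newline at the end of the string, even outside multi-line mode.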
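A small sketch of the anchor behavior, assuming Python's re module (my choice for illustration; the talk's own patterns use Perl-style syntax). In Python, ^ and $ only match around interior newlines when the re.MULTILINE flag is set.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ only matches at the very start of the string.
starts_default = re.findall(r"^\w+", text)

# With MULTILINE, ^ also matches immediately after each newline.
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Likewise, $ matches before an interior newline only in MULTILINE mode.
ends_default = re.findall(r"\w+$", text)
ends_multiline = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(ends_default)      # ['line']
print(ends_multiline)    # ['line', 'line']
```

Note that nothing is consumed by the anchors themselves: the matched text contains only the word characters.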

Basic Operations

The concatenation of a series of terms matches the first term, followed immediately by the second, etc.

Alternation matches either the expression to its left or the expression to its right. It will attempt a complete match on the left one before attempting the right one. If the first matches, the second won't be tried. This means it finds the first match, not the best match.
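To make "first match, not best match" concrete, here is a minimal sketch in Python's re module (an assumption on my part; like Perl, it is a backtracking engine, while POSIX-style leftmost-longest engines behave differently). The sample strings are made up for illustration.

```python
import re

# The left alternative is tried first; once it succeeds, the right one
# is never attempted, even though it would match more of the string.
first = re.match(r"foo|foobar", "foobar").group()

# Reordering the alternatives changes what is matched.
longer = re.match(r"foobar|foo", "foobar").group()

# Concatenation: each term must match immediately after the previous one.
concat = re.search(r"ab[0-9]", "xxab7yy").group()

print(first)   # foo
print(longer)  # foobar
print(concat)  # ab7
```

When alternatives share a prefix, putting the longer one first is usually what you want.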

More Operations

Matching

The regex we use in these cases can be sloppier. In most cases, a false positive is not really a big problem. Most of the time, this kind of match is followed by more code that gets more specific about the data.

Sloppier Match: Quoted String

The regex is pretty straightforward and simple. It will match any string with at least two double quotes. It would also match strings with three double quotes, etc.
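The pattern from the slide is not reproduced in these notes, so the sketch below assumes the simplest candidate: a quote, anything, and another quote. It is written for Python's re module, and the helper name has_quoted_string is hypothetical.

```python
import re

# Assumed pattern (not taken from the slide): quote, anything, quote.
sloppy = re.compile(r'".*"')

def has_quoted_string(line):
    """Cheap test: does this line appear to contain a quoted string?"""
    return sloppy.search(line) is not None

print(has_quoted_string('say "hello" now'))  # True
print(has_quoted_string('no quotes here'))   # False

# With three quotes, the greedy .* runs to the last quote on the line.
m = sloppy.search('a "b" c " d')
print(m.group())  # "b" c "
```

For a yes/no test that gates more specific code, the over-long match in the last case does no harm.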

Sloppier Match: Mismatches

Parsing

Parsing: Quoted Strings

Extract the double-quoted string. This is more precise than the matching case: it does not fail on escaped quotes, it matches only the first quoted string, and it does not extend past the end of that string.
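One way to get those properties, sketched in Python's re module. The pattern is my own assumption, not necessarily the one on the slide: inside the quotes it accepts either a character that is neither a quote nor a backslash, or a backslash followed by any character. The helper name first_quoted is hypothetical.

```python
import re

# Inside the quotes: any non-quote, non-backslash character,
# or a backslash escape such as \" or \\.
quoted = re.compile(r'"((?:[^"\\]|\\.)*)"')

def first_quoted(text):
    """Return the contents of the first double-quoted string, or None."""
    m = quoted.search(text)
    return m.group(1) if m else None

line = r'name = "a \"quoted\" word" and "second"'
print(first_quoted(line))              # a \"quoted\" word
print(first_quoted("nothing quoted"))  # None
```

The capture still contains the backslashes; unescaping them would be a separate step after the parse.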

Parsing: Floating Point

So, what's different between these two?

The first case matches a string of digits and decimal points.

The second case matches something that can be converted into a floating point number.
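The two slide patterns are not reproduced in these notes, so the pair below is an assumed stand-in that shows the same contrast, written for Python's re module. The names loose, strict, and convertible are hypothetical.

```python
import re

# Assumed "first case": any run of digits and decimal points.
loose = re.compile(r"[0-9.]+")

# Assumed "second case": digits with an optional fraction,
# or a fraction with a leading decimal point.
strict = re.compile(r"(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)")

def convertible(s):
    """True if Python's float() accepts the string."""
    try:
        float(s)
        return True
    except ValueError:
        return False

text = "version 1.2.3"
loose_hit = loose.search(text).group()
strict_hit = strict.search(text).group()

print(loose_hit, convertible(loose_hit))    # 1.2.3 False
print(strict_hit, convertible(strict_hit))  # 1.2 True
```

The loose pattern also happily matches an ellipsis, which the stricter one rejects because it requires at least one digit.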

Validation

Unlike the previous two patterns, for validation you need to be paranoid. Assume that someone is trying to slip invalid input by you or, possibly, trying to get you to ignore valid input.
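A sketch of why the paranoia matters, assuming Python's re module, where \Z plays the role of Perl's \z (end of string, with no allowance for a trailing newline). The helper name accepts and the sample inputs are hypothetical.

```python
import re

unanchored = re.compile(r"[0-9]+")    # matches digits anywhere
dollar = re.compile(r"^[0-9]+$")      # $ tolerates a trailing newline
strict = re.compile(r"\A[0-9]+\Z")    # whole string, nothing else

def accepts(pattern, s):
    """True if the pattern finds a match anywhere in s."""
    return pattern.search(s) is not None

print(accepts(unanchored, "12abc"))  # True: junk after the digits slips by
print(accepts(dollar, "12\n"))       # True: trailing newline slips by
print(accepts(strict, "12\n"))       # False
print(accepts(strict, "12"))         # True
```

For validation, the pattern has to account for every character of the input, not just find something plausible inside it.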

Validation: Floating Point Number

A large portion of validation is figuring out what is valid and what is invalid. Depending on the circumstance, any of the above questions could be answered in any variation. (There are a few combinations that I haven't personally needed to deal with, but I have dealt with most combinations.)

Validation: Floating Point

/\A-?(?:0|[1-9][0-9]*)\.[0-9]{0,3}\z/
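Reading the pattern piece by piece: an optional minus sign, an integer part with no leading zeros, a required decimal point, and zero to three fraction digits. Below is a rough translation to Python's re module for experimenting; the only change is \Z in place of Perl's \z. The function name is hypothetical.

```python
import re

# Same pattern as above, with Python's \Z standing in for Perl's \z.
float_re = re.compile(r"\A-?(?:0|[1-9][0-9]*)\.[0-9]{0,3}\Z")

def is_valid_float(s):
    """True if s satisfies the validation pattern exactly."""
    return float_re.search(s) is not None

accepted = ["0.5", "-12.345", "3.", "-0."]
rejected = ["3", "01.5", "1.2345", ".5", "+1.0", "1.5\n"]

for s in accepted + rejected:
    print(repr(s), is_valid_float(s))
```

The decimal point is mandatory here, so a bare integer such as "3" is rejected while "3." is accepted.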

Validation: Floating Point 2

/\A(?:0|[1-9][0-9]*)(?:\.[0-9]{1,2})?\z/
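This variation answers the questions differently: no sign is allowed, the fraction is optional, and when a decimal point is present it must be followed by one or two digits. That might suit something like a non-negative amount with at most two decimal places, though that reading is my guess. A rough Python translation, again with \Z replacing Perl's \z and a hypothetical function name:

```python
import re

# Same pattern as above, with Python's \Z standing in for Perl's \z.
amount_re = re.compile(r"\A(?:0|[1-9][0-9]*)(?:\.[0-9]{1,2})?\Z")

def is_valid_amount(s):
    """True if s satisfies the validation pattern exactly."""
    return amount_re.search(s) is not None

samples = {
    "12": True,       # fraction is optional
    "12.5": True,
    "0.50": True,
    "12.": False,     # a decimal point needs at least one digit after it
    "12.505": False,  # at most two fraction digits
    "-1": False,      # no sign allowed
    "007": False,     # no leading zeros
}

for s, expected in samples.items():
    print(repr(s), is_valid_amount(s), expected)
```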

Matching Tips

Parsing Tips

Validation Tips

General Regex Tips

Tricks
