There are a lot of variations of regular expression syntax and functionality. Be sure you know which set you are using to avoid over/under matching.
Regular expressions are really good at fuzzy descriptions of strings. I'm going to focus on three broad uses of regular expressions.
rxrx from Regexp::Debugger
Hopefully, I'll get some time later to demonstrate rxrx from Perl. Its regular expressions are a little different from the ones you will be using, but its ability to show how a match happens is pretty amazing.
a, %, 3, etc.
\0, \n, \r, \t, \f, etc.
\., \*, etc.
^, $, \A, \z
You can think of these as the fundamental terms in specifying a regular expression. We use these by themselves or with various operators to construct our regular expressions. Although it seems obvious, all but the last set match one character.
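The anchors don't actually match a character; instead, they match a location in a string. The ^ matches at the beginning of the string or, in multi-line mode, immediately after a newline character. The $ matches at the end of the string or immediately before a newline character (in many flavors, only a final newline unless multi-line mode is on). The \A and \z match only at the very beginning and the very end of the string.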
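To make the anchor behavior concrete, here is a small sketch in Python. Python is an assumption here, as is the made-up sample text; your flavor may differ. In Python, ^ and $ only match next to interior newlines when re.MULTILINE is set, and the end-of-string anchor is spelled \Z (older Python versions do not accept \z).

```python
import re

# Made-up sample text with two lines and a trailing newline.
text = "first\nsecond\n"

# ^ without multi-line mode: only the very start of the string.
start_default = re.findall(r"^\w+", text)
# ^ with multi-line mode: also right after each newline.
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# $ without multi-line mode: end of string, or just before a final newline.
end_default = re.findall(r"\w+$", text)
# $ with multi-line mode: just before every newline as well.
end_multiline = re.findall(r"\w+$", text, re.MULTILINE)

# \Z (Python's spelling of \z): only the very end, after the final newline.
very_end = re.findall(r"\w+\Z", text)

print(start_default)    # ['first']
print(start_multiline)  # ['first', 'second']
print(end_default)      # ['second']
print(end_multiline)    # ['first', 'second']
print(very_end)         # []
```

The last result is empty because the string ends with a newline, so no word character sits directly before the true end of the string.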
abc
abc|def
a+, a*, a?
a{2,5}, a{2,}
The concatenation of a series of terms matches the first term, followed immediately by the second, etc.
Alternation matches either the expression to its left or the expression to its right. It will attempt a complete match on the left one before attempting the right one. If the first matches, the second won't be tried. This means it finds the first match, not the best match.
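A quick way to see "first match, not best match" is to put a short alternative ahead of a longer one that starts the same way. The sketch below uses Python and a made-up word; other backtracking engines should behave similarly, but check your own flavor.

```python
import re

word = "catalog"  # made-up sample input

# The left alternative "cat" succeeds, so "catalog" is never tried.
short_first = re.match(r"cat|catalog", word).group()

# Reordering puts the longer alternative first, so it wins.
long_first = re.match(r"catalog|cat", word).group()

print(short_first)  # cat
print(long_first)   # catalog
```

When one alternative is a prefix of another, the order you write them in decides which one matches.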
[abc], [^abc]
\d, \D, \w
(abc), (?:abc)
The regex we use in these cases can be sloppier. In most cases, a false positive is not really a big problem. Most of the time, this kind of match is followed by more code that gets more specific about the data.
/".*"/
abcdef "ghi" jkl
The regex is straightforward and simple. It will match any string with at least two double quotes. It would also match strings with three double quotes, etc.
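Here is a rough Python sketch of how sloppy this pattern is when run against a few sample strings. The helper name find_quoted is hypothetical, used only for illustration.

```python
import re

# The sloppy "is there a quoted string?" pattern.
QUOTED = re.compile(r'".*"')

def find_quoted(s):
    """Return the matched text, or None if there is no match."""
    m = QUOTED.search(s)
    return m.group() if m else None

print(find_quoted('abcdef "ghi" jkl'))         # "ghi"
print(find_quoted('abc "def'))                 # None
print(find_quoted('abc "def" ghi "jkl" mno'))  # "def" ghi "jkl"
```

The last case shows the sloppiness: .* is greedy, so the match runs from the first double quote to the last one. That is fine for a yes/no test, but not for pulling out the quoted text.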
abc "def
abc "def" ghi "jkl" mno
abc "def\" ghi
abc "def" ghi "mno
abc 'def" ghi "jkl' mno
abc \"def" ghi
/("(?:\\.|[^"\\]+)+")/
Extracts the double-quoted string. This is more precise than the matching case: it does not fail on escaped quotes, it matches only the first quoted string, and it does not extend past that string.
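The extraction pattern can be tried in Python as sketched below. The helper name extract_quoted is hypothetical, and the sample strings are variations on the test strings above.

```python
import re

# Capture one double-quoted string, allowing backslash escapes inside it.
QUOTED = re.compile(r'("(?:\\.|[^"\\]+)+")')

def extract_quoted(s):
    """Return the first quoted string (quotes included), or None."""
    m = QUOTED.search(s)
    return m.group(1) if m else None

# Stops at the end of the first quoted string.
print(extract_quoted('abc "def" ghi "jkl" mno'))  # "def"
# An escaped quote does not end the string.
print(extract_quoted(r'abc "def\" ghi" jkl'))     # "def\" ghi"
# No closing quote, so nothing is extracted.
print(extract_quoted('abc "def'))                 # None
# The only other quote is escaped, so the string never closes.
print(extract_quoted(r'abc "def\" ghi'))          # None
```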
/[\d.]+/
/([0-9]+\.[0-9]*)/
So, what's different between these two?
The first case matches a string of digits and decimal points.
The second case matches something that can be converted into a floating point number.
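The difference shows up when you try to convert what each pattern extracts. This is a small Python sketch; the names LOOSE, STRICT, and converts are hypothetical, and the sample text is made up.

```python
import re

LOOSE = re.compile(r'[\d.]+')             # any run of digits and dots
STRICT = re.compile(r'([0-9]+\.[0-9]*)')  # digits, a dot, optional digits

def converts(s):
    """Return True if Python's float() accepts the string."""
    try:
        float(s)
        return True
    except ValueError:
        return False

text = "version 1.2.3"
loose_hit = LOOSE.search(text).group()
strict_hit = STRICT.search(text).group(1)

print(loose_hit, converts(loose_hit))    # 1.2.3 False
print(strict_hit, converts(strict_hit))  # 1.2 True

# The loose pattern is even happy with dots alone.
print(LOOSE.search("...").group())       # ...
print(STRICT.search("..."))              # None
```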
Unlike the previous two uses, for validation you need to be paranoid. Assume that someone is trying to slip invalid input past you. Or, possibly, trying to get you to ignore valid input.
A large portion of validation is figuring out what is valid and what is invalid. Depending on the circumstance, any of the above questions could be answered in any variation. (There are a few combinations that I haven't personally needed to deal with, but I have dealt with most combinations.)
/\A-?(?:0|[1-9][0-9]*)\.[0-9]{0,3}\z/
/\A(?:0|[1-9][0-9]*)(?:\.[0-9]{1,2})?\z/
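As a sketch of what these two validation patterns accept and reject, here they are in Python. One assumption: \z is written as \Z, which is Python's spelling of the "very end of string" anchor. The names SIGNED_3DP, UNSIGNED_2DP, WITH_DOLLAR, and is_valid are hypothetical.

```python
import re

# Optional sign, no leading zeros, required dot, up to 3 decimals.
SIGNED_3DP = re.compile(r'\A-?(?:0|[1-9][0-9]*)\.[0-9]{0,3}\Z')
# No sign, no leading zeros, optional fraction of 1 or 2 digits.
UNSIGNED_2DP = re.compile(r'\A(?:0|[1-9][0-9]*)(?:\.[0-9]{1,2})?\Z')
# Same as SIGNED_3DP but with ^ and $, for comparison only.
WITH_DOLLAR = re.compile(r'^-?(?:0|[1-9][0-9]*)\.[0-9]{0,3}$')

def is_valid(pattern, s):
    """Return True if the whole string satisfies the pattern."""
    return pattern.search(s) is not None

for s in ["0.5", "-12.345", "3.", "007.1", "1.2345", "12", "1.5\n"]:
    print(repr(s), is_valid(SIGNED_3DP, s))

for s in ["12", "12.5", "0.99", "12.", "12.345", "-1", "01"]:
    print(repr(s), is_valid(UNSIGNED_2DP, s))

# $ tolerates a trailing newline; \Z does not.
print(is_valid(WITH_DOLLAR, "1.5\n"))  # True
print(is_valid(SIGNED_3DP, "1.5\n"))   # False
```

The last two lines are the reason for the paranoia: with $ instead of the end-of-string anchor, a trailing newline slips through.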
split can be better.
/(a|b|c)/
/[abc]/
/\Acat|dog\z/
/".+"/
/".+?"/
/"[^"]++"/
/ab*c/
/a(b*c*)*d/