Here’s another great example of a subject I’ve learned, but don’t frequently use as a tool because I don’t understand it well enough. If you’re a loyal reader, you may remember that I’ve started to work with a new codebase at work (the one that’s written in TypeScript). This code was written to improve some of our existing processes by using templating to make our scripts smaller, more efficient, and easier to predict. So now, instead of writing custom JavaScript, I plug some strings into the framework — awesome!
A big part of my job is finding and parsing strings, so I would frequently use functions like filter and replace in my JavaScript. This week, I was working with a phone number element whose innerText looked like this:
T: (808) 808–8088
We’ve got a phone number here, but it’s in an unusual format, so it’s not suitable to be passed into a DB where it may need to be matched with other phone numbers. Fortunately, the new codebase includes a sanitizer which automatically removes dashes and spaces from collected phone numbers. The system works! Unfortunately, that sanitizer won’t remove the T:, so I have to do that before I pass along the value. Normally I’d use replace, but because I’m no longer able to write custom JS to parse my string, I instead have to pass in a string containing a regular expression.
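For reference, here is a minimal sketch of the kind of pattern this calls for. The framework's configuration isn't shown in this post, so the sketch uses plain replace, and the names rawText and labelPattern are made up for illustration (the sample value also uses a plain hyphen).

```javascript
// Hypothetical value standing in for the element's innerText
const rawText = 'T: (808) 808-8088';

// ^    start of the string
// T:   the literal label
// \s*  zero or more whitespace characters after the label
const labelPattern = /^T:\s*/;

const withoutLabel = rawText.replace(labelPattern, '');
console.log(withoutLabel); // (808) 808-8088
```

In a setup that only accepts the pattern as a string, the part between the slashes is what would be passed along, though how backslashes need to be escaped there depends on the framework.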
A Regular Expression
Regex is short for regular expression, meaning a series of characters that represent patterns in text. The term originated with Stephen Cole Kleene, a mathematician who discussed regular events in his Representation of Events in Nerve Nets and Finite Automata in 1951. While Kleene was seeking to prove that such events could be mapped by the aforementioned nerve nets and automata, his representation of the events using regular expressions was reminiscent of the regex we might see today:
By a regular expression, we shall mean a particular way of expressing a regular set of tables starting with single-table sets and applying zero or more times the three operations (passing from E and F to E ∨ F, EF or E∗F).
In 1968, programmer Ken Thompson built Kleene’s notation into a text editor called QED to help match patterns in text files. When he later added it to ed, a Unix editor, it gave us the name grep, from the ed command g/re/p: globally search for a regular expression and print the matching lines. Regular expressions really took off in the 1980s when Perl adopted and extended a regex library written by Henry Spencer, and now of course they’re completely ubiquitous. So what should we expect from our regular expressions?
Character Types
As mentioned earlier, regular expressions are made up of different characters. Some of these are literal or regular characters, for example:
regex
This is technically a regular expression made completely of regular characters. We can use it to find any instance of the pattern where a character r is followed by a character e followed by g and so on. It’s really easy to understand, but it’s not terribly unique or useful unless we know exactly what we’re looking for. That’s why regular expressions can also be made up of metacharacters which can more openly describe a pattern. A few popular metacharacters include:
- | to represent a boolean “or”: we could match realize or realise by using reali(z|s)e
- Quantifiers such as ?, *, and +: these characters indicate how many times something may occur in our pattern (? for zero or one, * for zero or more, + for one or more).
- . as a wildcard: cat, bat, sat, and rat can all be matched with the regex .at
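The three metacharacters in this list can be tried out with a short sketch like the one below. The sample words and variable names are my own, picked only for illustration.

```javascript
// | as "or": matches both spellings
const spelling = /reali(z|s)e/;
console.log(spelling.test('realize')); // true
console.log(spelling.test('realise')); // true

// Quantifiers: ? is zero or one, * is zero or more, + is one or more
console.log(/colou?r/.test('color')); // true, the u is optional
console.log(/ab*c/.test('ac'));       // true, zero b's is fine
console.log(/ab+c/.test('ac'));       // false, at least one b is required

// . as a wildcard: any single character followed by "at"
const words = ['cat', 'bat', 'sat', 'rat', 'at'];
const matches = words.filter((word) => /.at/.test(word));
console.log(matches); // every word except 'at', which has nothing before "at"
```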
^[ \t]+|[ \t]+$
The Wikipedia article on regular expressions presents two examples in their section on Patterns. Since I didn’t immediately understand them despite having just read the section, I figure it would be good practice to try and decode them together. Here’s the first:
^[ \t]+|[ \t]+$
Our first character is a caret: ^. This is a metacharacter that represents the start of a string. Next we see an open square bracket, which starts a character class: a set of characters, any one of which can match at that position. Parentheses can also group characters, but they match the whole sequence inside them in order, while square brackets match a single character from the set. In other words, square brackets use an implicit or:
(a|b|c)
is the same as
[abc]
Inside of the brackets is a space, a backslash, and the letter t. We might think that we’ll be looking for each of these three characters, but there’s something special about the backslash.
In regex, a backslash is an “escape character,” which acts as a switch to flip the meaning of the character that follows it. If the next character is normally a literal, the backslash can give it a special meaning, and if it’s normally a metacharacter, the backslash makes it a literal. This means that instead of matching the letter t in our string, we’re matching the special meaning of \t, which is a tab character. We’re starting to get the idea that maybe we’re looking to match whitespace at the beginning of a string.
After we close the brackets, we see a plus sign, which as we learned earlier is a quantifier. The sign, +, means that we’ll match if these characters appear one or more times. So we’ll match whether there’s a single space or whether there are 18 spaces and tabs in any order. Knowing that we’re looking for these characters at the start of a string, we now understand the first half of our regular expression.
Next we see a pipe: |, which we remember means or. After the pipe is the same pattern we just saw, [ \t]+, but this time followed by $. The dollar sign represents the end of a string, so we’re looking for whitespace at the beginning of a string or at the end of a string. This regular expression would probably be used to parse or clean up a document that has inconsistent formatting!
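Here is a small sketch of that cleanup in JavaScript. The g flag and the sample string are my additions; without g, replace would stop after the first match and leave the trailing whitespace in place.

```javascript
// g flag so that both the leading and the trailing run get replaced
const edgeWhitespace = /^[ \t]+|[ \t]+$/g;

// Leading space, tab, two spaces; trailing space and tab
const messy = ' \t  hello world \t';
const cleaned = messy.replace(edgeWhitespace, '');

// JSON.stringify makes any leftover whitespace visible
console.log(JSON.stringify(cleaned)); // "hello world"
```

The space between the two words survives because it sits at neither the start nor the end of the string.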
[+-]?(\d+(\.\d+)?|\.\d+)([eE][+-]?\d+)?
Here we see the second example in the Wikipedia article. I’ll reprint it for clarity:
[+-]?(\d+(\.\d+)?|\.\d+)([eE][+-]?\d+)?
We’ll start with our first set of square brackets. Here we see a plus and a minus sign, which might remind us of the quantifier from the previous example. In this case, however, since the characters are inside square brackets, they are being used as literals (the - stays literal because it comes last in the brackets, so it can’t be read as a range like a-z). Our string will match if it starts with a + or -. After the brackets, we see a question mark that is being used as a quantifier. The ? means that the previous part of the expression will occur either 0 or 1 times, essentially making it optional. So a matching string might start with +, with -, or with no sign at all.
Next we open parentheses, which means we’re going to be looking for all of the values to exist next to each other, rather than one or another. Let’s reprint again to simplify, this time just inside the parentheses:
\d+(\.\d+)?|\.\d+
That’s a lot of characters, but note the repetition. We won’t have to evaluate too many different patterns to decode this. First we see a backslash, which again means we’re escaping the default value of the next character. When we see \d together, we’re switching from a literal letter d to the meta function of d: a numeric digit (0, 1, 2, etc). Next we see the quantifier +, meaning we’re expecting one or more digits. Both 1 and 12355356 would match in that case.
We open another set of parentheses to find more backslashes, so we can break our expression into \., \d, and +. We’re familiar with the last two, but what about \. ? We’ve previously seen \ precede a metacharacter, but in this case the period is actually a literal. Because . is a wildcard by default, the “escaped” version is a literal period. So a match for the regex inside these parentheses might be .1 or .62643432. The ? right after the closing parenthesis makes that decimal part optional. Next we have a pipe |, so we know that we’re going to start a new alternative. That means that we can put together everything we’ve seen so far and know that matches for the first half of this or statement include 1, 1.5, or 2.84531. It seems like we’re looking for a number.
After the pipe we see the same series of characters that we just evaluated. This time, there’s no digit in front, so this would match a number between 0 and 1 in decimal notation, like .2 or .4532.
With the outside parentheses finally closed, we’re getting a much better idea what we’re looking for. If we want to account for all sorts of different numeric notation, we might expect our numbers to have a + or - before them to indicate positive or negative. But our regex will also match if that’s missing. So what’s the rest of the expression for?
([eE][+-]?\d+)?
After our number, we can have the letter e or E, followed optionally by a + or -, then one or more digits. This whole addition is made optional by the question mark outside of the parentheses. This clever addition means that we’ll also capture numbers written in scientific notation. Not only can we find 1000000 (one million), but we can also find 1e6 (one million in E notation) or the equivalent 10e5.
And now that we’re regex experts, we can say that this expression will match for a lot of different number types, but won’t help us find numbers with commas in them like 1,000 or fractions if they’re written as 1/2 etc. When we’re writing our regex, we have to be thoughtful about what we’re looking to pull in and what we’re willing to miss, or what we explicitly want to exclude.
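To check those claims, here is a sketch that runs the pattern against a few sample strings. One assumption to flag: I’ve wrapped the Wikipedia pattern in ^ and $, which are not part of the original, so that the whole string has to be a number rather than just contain one.

```javascript
// ^ and $ are additions for this sketch; the rest is the Wikipedia pattern
const numberPattern = /^[+-]?(\d+(\.\d+)?|\.\d+)([eE][+-]?\d+)?$/;

const samples = ['42', '-1.5', '.25', '+6.02e23', '1,000', '1/2', 'abc'];
const results = samples.map((sample) => numberPattern.test(sample));

samples.forEach((sample, i) => console.log(sample, results[i]));
// the first four print true, the last three print false
```

Without the anchors, test would also return true for '1,000' and '1/2', because each of them contains a digit that matches on its own.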
^\s*(?:\+?(\d{1,3}))?[-. (]*(\d{3})[-. )]*(\d{3})[-. ]*(\d{4})(?: *x(\d+))?\s*$
One last example here because I search for this often: regex to match phone numbers. Like with all regex, there’s no single or “best” way to do this — this particular expression comes from a Stack Overflow post and includes a few extra wrinkles that make it more exciting. Let’s get started!
Our first chunk is ^\s*, which means the string can start with any amount of whitespace. In an earlier example, we saw whitespace indicated as [ \t], and \s serves a similar purpose but is broader: it matches spaces and tabs as well as line breaks and other whitespace characters. The asterisk indicates that the previous part of the expression can occur 0 or more times in the string. Maybe our string starts with no space, but maybe it starts with 100 spaces.
Next we see an open parenthesis followed by a question mark. That’s strange because ? is a quantifier, so we might assume that we’re accounting for a potential ( at the start of a phone number. That was my guess until I went on to the next character, a colon, :. I learned that (?: opens a non-capturing group, which groups part of a pattern without saving what it matched as a separate numbered result. The text it matches is still part of the overall match. This Stack Overflow post helped explain the principle, but didn’t actually help me figure out why the author decided to use it in this case. Within the non-capturing group, we see this:
\+?(\d{1,3})
This is an optional plus sign and a series of digits. Curly brackets indicate a minimum and/or maximum, so here we want somewhere between 1 and 3 digits. So we could match +123, +1, 24, etc. Since this is right up front in our number and can carry a plus, it’s most likely a country code. At first I didn’t see why the author would make this group non-capturing, but the digits aren’t being ignored: the inner parentheses (\d{1,3}) still capture them. The outer (?: ) only exists so that the question mark after it can make the plus sign and digits optional as a unit, without adding an extra captured value that includes the plus.
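A quick check in the console shows what the non-capturing group does to the result. This is only a sketch: the ^ anchor, the sample value '+44', and the variable names are mine, and the second pattern is a variation I wrote for comparison, not something from the Stack Overflow answer.

```javascript
// Outer group is non-capturing, inner group captures the digits
const nonCapturing = /^(?:\+?(\d{1,3}))?/;
// Same pattern with an ordinary (capturing) outer group, for comparison
const capturing = /^(\+?(\d{1,3}))?/;

const withNonCapturing = '+44'.match(nonCapturing);
console.log(withNonCapturing[0]);     // +44, the full match keeps the plus
console.log(withNonCapturing[1]);     // 44, the only captured value
console.log(withNonCapturing.length); // 2

const withCapturing = '+44'.match(capturing);
console.log(withCapturing[1]);     // +44, an extra capture with the plus
console.log(withCapturing[2]);     // 44
console.log(withCapturing.length); // 3
```

So the digits are still captured either way. The non-capturing version just avoids a second, redundant capture that includes the plus sign.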
Next up we have a portion that appears with slight variations three times in our expression: [-. (]*. By now, we probably know that this means we’re looking for any number of dashes, periods, spaces, or open parentheses. That makes sense if we’re trying to parse a phone number that could come through in different forms.
Then we see another small group that comes up more than once in the expression: (\d{3}). This one’s easy: \d means digit and {3} means it happens three times. This makes sense since we’re dealing with a phone number. Next we get the option for another space, dash, or a closed parenthesis. Then three more digits, another space/period/dash, then four digits.
I would have figured we’d be done with a phone number after 10 digits, but there’s something else here:
(?: *x(\d+))?
Now we recognize the non-capturing group, so what we’re really looking at is a space, then *x(\d+), with a question mark after it all to make this portion optional for our match. Space and asterisk means that our last 4 digits can be followed by any number of spaces, then the literal character x. After that, there must be one or more digits. So our phone number might look something like this:
808 808 8088 x1234
It makes sense after writing it out — it’s accounting for an extension! Then at the very end we get another \s* with a $ — our matching string can end with any amount of whitespace.
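Putting the whole expression to work might look like the sketch below. The parsePhone helper and the field names (country, area, prefix, line, extension) are hypothetical labels I chose for the five capturing groups; the pattern itself is the one from the Stack Overflow post.

```javascript
const phonePattern =
  /^\s*(?:\+?(\d{1,3}))?[-. (]*(\d{3})[-. )]*(\d{3})[-. ]*(\d{4})(?: *x(\d+))?\s*$/;

// Hypothetical helper: returns the captured pieces, or null if nothing matches
function parsePhone(text) {
  const result = text.match(phonePattern);
  if (result === null) return null;
  // result[0] is the full match; the capture groups follow in order
  const [, country, area, prefix, line, extension] = result;
  return { country, area, prefix, line, extension };
}

// area '808', prefix '808', line '8088', extension '1234', country undefined
console.log(parsePhone('808 808 8088 x1234'));

// country '1', area '808', prefix '808', line '8088', extension undefined
console.log(parsePhone('+1 (808) 808-8088'));

// null: the leading "T:" is not something this pattern allows
console.log(parsePhone('T: (808) 808-8088'));
```

Groups that did not take part in the match, like a missing country code or extension, come back as undefined.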
Regex in JavaScript
Regular expressions are used in slightly different ways depending on the language. As a JavaScript native, I most frequently see two functions related to regex: test and match.
Test is a function associated with the RegExp prototype, meaning we can call it on an object of the regex type. Before we get into what test does, let’s look at two ways to create a regex object in JavaScript:
// we can put our regex between two slashes
/\d{3}/
// we can use the new keyword, passing in a string without the surrounding slashes
const regex = new RegExp('\\d{3}')
Note that the string version requires an extra backslash, which acts as an escape character for the backslash that follows it. This is because \ is also the escape character in JavaScript strings, so if we want a literal \ in a string, we have to preface it with another \.
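To see what happens when the extra backslash is left out, here is a short sketch comparing the three forms. The variable names are only for illustration.

```javascript
const literal = /\d{3}/;
const escaped = new RegExp('\\d{3}');
// Without the extra backslash, the string is just 'd{3}'
const unescaped = new RegExp('\d{3}');

console.log(literal.source === escaped.source); // true, same pattern
console.log(unescaped.source);                  // d{3}

console.log(escaped.test('123'));   // true
console.log(unescaped.test('123')); // false, it wants the letter d three times
console.log(unescaped.test('ddd')); // true
```

The unescaped version doesn’t throw an error, which makes this mistake easy to miss: the pattern is still valid, it just matches something else.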
Now that we’ve got a regex object, we can call test on it with a string argument. We’ll receive a boolean back indicating whether any part of our string matches the regex, which is why '1234' passes here:
/\d{3}/.test('123') => true
/\d{3}/.test('1234') => true
/\d{3}/.test('12') => false
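If we want test to check the whole string instead of looking for a match anywhere inside it, one option is to add the ^ and $ anchors from the earlier examples. A small sketch, with variable names of my own choosing:

```javascript
const containsThreeDigits = /\d{3}/;
const exactlyThreeDigits = /^\d{3}$/;

console.log(containsThreeDigits.test('1234')); // true, '123' is found inside
console.log(exactlyThreeDigits.test('1234'));  // false, one digit too many
console.log(exactlyThreeDigits.test('123'));   // true
```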
The other popular JS regex function, match, gives us a bit more information in its return value. Match is called on a string and receives a regex as an argument. If the string matches the passed pattern, we’ll receive an array that includes the match or matches as strings; if nothing matches, we’ll receive null. We can indicate a global search by appending a g flag onto the end of our regex, which will give us every match:
'who, what where when'.match(/w/g) => ["w", "w", "w", "w"]
If we leave off the g flag, match returns only the first result, as an array with the extra properties index, input, and groups:
'who, what where when'.match(/w/) => ["w", index: 0, input: "who, what where when", groups: undefined]
Index represents where the match first occurs in the string and input returns the original string itself. Groups is an object that will be populated if we add a named capturing group to our expression, written as (?<group_name>...) with the name in angle brackets. Feel free to try it out in the console:
'who what where when'.match(/(?<ws>w)/) => ["w", "w", index: 0, input: "who what where when", groups: {ws: "w"}]
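A few more things worth trying in the console are sketched below: what match returns when nothing matches, how to read a named group, and matchAll, a related string method that returns the index details for every match rather than only the first. The group name firstWord and the other variable names are made up for this example.

```javascript
const sentence = 'who, what where when';

// No match: we get null, not an empty array, with or without the g flag
console.log(sentence.match(/z/));  // null
console.log(sentence.match(/z/g)); // null

// A named group is read through the groups property
const first = sentence.match(/(?<firstWord>\w+)/);
console.log(first.groups.firstWord); // who
console.log(first.index);            // 0

// matchAll requires the g flag and yields one result per match
const positions = [...sentence.matchAll(/w/g)].map((m) => m.index);
console.log(positions); // the four starting positions: 0, 5, 10, 16
```

Because of that null, it’s safest to check the result of match before indexing into it.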
Regular expressions can be very hard to read and their efficiency can be unpredictable, so they’re not always the best option for string parsing. Still, it’s important to be able to understand them if and when our code demands the knowledge. Express yourself regularly, and wisely.
Sources:
- Regular expression on Wikipedia
- Kleene, Stephen C. (1951). Shannon, Claude E.; McCarthy, John (eds.). Representation of Events in Nerve Nets and Finite Automata (PDF)
- Regular expression to match standard 10 digit phone number on Stack Overflow
- What is a non-capturing group in regular expressions? on Stack Overflow
- RegExp.prototype.test(), String.prototype.match(), and Groups and ranges on MDN