Introducing Regular Expressions

Regular expressions are a powerful tool for searching and manipulating text. They are used in many programming languages, text editors, and other tools. They are also a source of confusion and frustration for many people. My aim with this article is to demystify regular expressions and help you understand how to use them by introducing the concepts and terminology, demonstrating some examples, and then pointing you to some resources for further learning.

Through this article we’ll look at a few examples. If you want to explore along as we go, you can add the example text discussed in each example to a simple text file and then use a tool like PowerShell Select-String on Windows, or grep on Linux and macOS, to practice searching. Later on, in Getting Some Practice, I have some links to practice materials I’ve put together for you too. Alternatively, a tool like regex101.com is a great way to experiment with regular expressions and see the results in real time.

Introduction

If you’ve spent any time around large volumes of log files, or programming where you need to validate incoming text data then you’ve likely stumbled across references to “just use a regex” or words to that effect. You then get presented with impenetrable strings of characters that look like they’ve been generated by a cat walking across a keyboard such as these (w)o\1 or \b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b and you reach for the paracetamol, a stiff drink and look for another way. This is the way most people, in my experience, start off with regular expressions and it’s no wonder that they strike fear and terror into the hearts of many. Ten points for anyone who can already explain what these examples mean…

So, in this article I want to relieve those headaches, remove the fears, and take you through what all the chaos means and how you can use regular expressions to make your life easier!

Why use regular expressions?

Before we get into anything technical, let’s start by understanding the problems regular expressions can help us solve. I’ll start by listing some common problems anyone working in tech for long enough will likely encounter:

Searching through a large volume of text, typically log files, looking for a specific pattern of text. For example, you need to find every IP address that has connected to a web server in the last 24 hours, or the time and date that a specific user logged in.
- Perhaps you also need to reformat that log data to produce a report of just the information you’re interested in from the extensive log data
You need to validate that a user has entered a valid email address, or phone number, or credit card number, or any other type of data that has a specific format.
You need to update or refactor some code to change every reference to a specific name or variable to a new name or variable. Maybe your company was acquired, and you need to replace all instances of “My Company”, or “MyCompany” with “New Company”.

Remember also that regular expressions are just a tool, and they have a specific purpose and problem to solve. They’re not the solution to all problems, and they’re not always the best tool for the job. In some cases, other options, specific parsers for example, exist and may well be a better choice. Particularly when programmatically parsing data from sources such as JSON or XML many languages have parsing libraries built in, or available, which will be a much more robust and easier to maintain solution than a regular expression.

All that said, for situations such as the ones above, regular expressions are a great tool to have in your toolbox, so let’s explore them in more detail.

What is a regular expression?

As with any good story we should start at the beginning. Before we even describe regular expressions let’s talk about names and abbreviations, as there are many in circulation all referring to the same thing. Regular expressions are also known as regex, RegEx, Reg Ex, regexp, or RE to name the common variations I see. I’ll use the terms regular expression or regex throughout this article, but you may see any of the other terms used elsewhere.

So, what is a regular expression? A regular expression is a sequence of characters to represent a known pattern of data without needing to know the exact content of any single specific instance of that pattern. That’s a bit of a mouthful, so let’s break it down.

Telephone numbers in the USA are 10 digits long and typically grouped as 3 digits, followed by a dash or hyphen, another 3 digits, another dash/hyphen, and then a final 4 digit grouping. E.g. 800-555-1234 or 777-555-2345. We know the pattern, but to search for every possible combination in a dataset would be extremely time consuming and error prone if we had to do it manually. How would you approach it?

Similarly, entities like email addresses or IPv4 addresses have a very clear, known pattern, but the sheer number of permutations of valid addresses makes it impractical to search for them manually.

Regular expressions allow us to describe the pattern of data we’re looking for, and then search for that pattern in a dataset without needing to know the exact content of any specific instance of that pattern. They can include, or exclude, Latin alphabet characters, numbers, punctuation, and other special characters. They can also include or exclude whitespace, and can be case sensitive or case insensitive. They can be as simple as a single character, or as complex as you can imagine. As we gain understanding and skill, they can also be used to replace, or substitute, text in a dataset with other text.
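To make the telephone number example concrete, here is a minimal sketch in Python (one of the languages listed later in this article). The sample sentence is made up for illustration, and the pattern pieces it uses, such as \d and {3}, are explained in the sections that follow.

```python
import re

# Three digits, a hyphen, three digits, a hyphen, then four digits.
phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

# Hypothetical sample text, invented for this illustration.
sample = "Call 800-555-1234 or 777-555-2345 before 5pm."

# findall returns every non-overlapping match in the text.
phone_numbers = phone_pattern.findall(sample)
print(phone_numbers)  # ['800-555-1234', '777-555-2345']
```

One short pattern finds both numbers without us knowing either of them in advance.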

Where can I use regular expressions?

The answer is… lots of places. Regular expressions are supported in many programming languages, text editors, and other tools. I’ll list some of the more common ones below, but this is by no means an exhaustive list.

Visual Text Editors and IDEs
- Visual Studio Code
- Notepad++
- Sublime Text
- Atom
Command Line Tools and Text Editors
- grep
- sed
- awk
- vi
- vim
- PowerShell Select-String
- Windows Command Line findstr
Programming Languages
- JavaScript
- Python
- C#
- Java
- etc.
Cloud Tools
- AWS CloudWatch Logs
- Azure Monitor Logs
- etc.

The regular expression character set

Regular expressions are made up of a set of characters, each with a specific meaning.

Character	Meaning	Character	Meaning
`abc...`	Lowercase alphabet	`\s`	Any whitespace character
`ABC...`	Uppercase alphabet	`\S`	Any non-whitespace character
`123...`	Numerical digits	`\A`	Start of string
`\d`	Any numerical digit	`\Z`	End of string
`\D`	Any non-digit character	`\b`	Word boundary
`\w`	Any alphanumeric character or underscore	`\B`	Not a word boundary
`\W`	Any non-alphanumeric character	`*`	Zero or more of the preceding character
`.`	Any character	`+`	One or more of the preceding character
`\`	Escape character	`?`	Preceding character is optional
`\.`	A literal period character	`{n}`	Exactly n of the preceding character
`[abc]`	Any character in the set a, b, or c	`{n,}`	n or more of the preceding character
`[^abc]`	Any character not in the set a, b, or c	`{n,m}`	Between n and m of the preceding character
`[a-z]`	Any character in the range a to z	`{,m}`	Up to m of the preceding character
`[0-9]`	Any digit in the range 0 to 9	`(...)`	Capture group
`^`	Start of line	`(?:...)`	Non-capturing group
`$`	End of line	`(a(bc))`	Nested capture group
`\1`	Contents of first capture group	`(abc|def)`	Match abc or def

Gotchas

There are a few caveats and gotchas to be aware of when working with regular expressions. Some of the more common ones are:

Regular expressions are case sensitive by default. This means that if you search for “abc” it will match “abc” but not “ABC”. This can be changed in some tools and languages, but not all.
- Check your specific tool or language documentation for details
- Or use [a-zA-Z] to match both upper and lower case characters for example
Not all languages, particularly older implementations, support all of the characters in the table above. For example, some languages don’t support the \A and \Z characters.
- A common one I see is that grep -E does not support the \d character, but grep -P does
Some tools require the regex to be bounded by marker or delimiting characters such as / or %
- I see this in my daily work when working with AWS CloudWatch Logs
- This varies by tool so check the documentation for the tool you are using
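As a rough illustration of the case sensitivity gotcha, the sketch below uses Python’s re module, where the re.IGNORECASE flag switches matching to case insensitive. The sample text is made up, and other tools will have their own switch (or none at all).

```python
import re

# Hypothetical sample text with three spellings of the same word.
text = "abc ABC Abc"

# Default behaviour: only the exact lowercase spelling matches.
case_sensitive = re.findall(r"abc", text)

# With the IGNORECASE flag every spelling matches.
case_insensitive = re.findall(r"abc", text, flags=re.IGNORECASE)

# Without a flag, character sets can spell out both cases per letter.
char_class = re.findall(r"[aA][bB][cC]", text)

print(case_sensitive)    # ['abc']
print(case_insensitive)  # ['abc', 'ABC', 'Abc']
print(char_class)        # ['abc', 'ABC', 'Abc']
```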

Greedy vs. Lazy

A common challenge when starting out with regex is trying to get the match you want when the text has similar patterns in it. Take, for example, the demonstration HTML excerpt below:

<p>Hello</p><span>Awesome</span><p>World</p>

Say that we want to match the text from the first <p> to the end of the first </p>, i.e. we want to match <p>Hello</p>. We could try to use the following regex to do that:

<p>(.*)</p>

However, this will actually match the entire string up until the last </p> as shown below:

<p>Hello</p><span>Awesome</span><p>World</p>

This is because the .* is “greedy” and will match as much as possible. However, we can use the optional character ? to make the match “lazy” and only match as little as possible. This is shown below:

<p>(.*?)</p>

Which would match the following:

<p>Hello</p>

I won’t try and get into the technical details of why this happens, but if you’re interested you can read a much better write up than I could do here: https://blog.kiprosh.com/regular-expressions-greedy-vs-non-greedy/
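The greedy and lazy behaviour described above can be reproduced with a short Python sketch. It assumes Python’s re module, where group(0) is the whole match and group(1) is the content of the first set of brackets.

```python
import re

html = "<p>Hello</p><span>Awesome</span><p>World</p>"

# Greedy: .* grabs as much as it can, so the match runs to the LAST </p>.
greedy = re.search(r"<p>(.*)</p>", html)

# Lazy: .*? grabs as little as it can, so the match stops at the FIRST </p>.
lazy = re.search(r"<p>(.*?)</p>", html)

print(greedy.group(0))  # <p>Hello</p><span>Awesome</span><p>World</p>
print(lazy.group(0))    # <p>Hello</p>
print(lazy.group(1))    # Hello
```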

Special characters and delimiters

Some characters have special meaning in regular expressions, and so need to be escaped with a \ character to be matched literally. For example, the . character normally matches any character, so to match a literal . you would escape it like this: \. The same is true of other characters such as *, +, ?, (, ), [, ], {, }, ^, $, \, and |, and of / in tools that use it as a delimiter around the regex.

As an example, imagine that you wanted to match the text “Hello World?”. You could try matching the regex Hello World? as a literal representation of the text you wished to match; however you might get some unexpected results as “Hello Worl” and “Hello World” would match, but not explicitly “Hello World?”. This is because the ? character has a special meaning in regular expressions, as we’ve seen above it makes the preceding character optional. So by including d? we’ve expressed that the d character is optional, hence matching “Hello Worl” as well as “Hello World” and not explicitly matching with the ? character.

To ensure that we match the ? character literally we need to escape it with a \ character like this Hello World\?. This will match the text “Hello World?” explicitly.
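Here is a small Python sketch of the “Hello World?” example. It uses re.fullmatch, which only succeeds when the whole text matches the pattern, to make the difference obvious. Python’s re.escape helper, which escapes special characters for you, is an extra not discussed above; other languages may or may not offer an equivalent.

```python
import re

unescaped = r"Hello World?"   # ? makes the d optional
escaped = r"Hello World\?"    # \? matches a literal question mark

# The unescaped pattern accepts the text with or without the final d...
matches_worl = re.fullmatch(unescaped, "Hello Worl") is not None
matches_world = re.fullmatch(unescaped, "Hello World") is not None
# ...but it cannot match the question mark itself.
matches_question = re.fullmatch(unescaped, "Hello World?") is not None

# The escaped pattern matches the literal text, question mark included.
escaped_matches = re.fullmatch(escaped, "Hello World?") is not None

# re.escape builds an escaped pattern from literal text for us.
auto_escaped = re.escape("Hello World?")
auto_matches = re.fullmatch(auto_escaped, "Hello World?") is not None

print(matches_worl, matches_world, matches_question)  # True True False
print(escaped_matches, auto_matches)                  # True True
```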

Anchors and boundaries

Anchors and boundaries are special characters that allow us to match specific locations in a string. The most common ones are ^ and $ which match the start and end of a string respectively. For example, if we wanted to match the text “Hello World” at the start of a string we could use the regex ^Hello World. This would match the following:

Hello World is awesome

But would not match the following:

This is awesome, Hello World

Similarly, if we wanted to match the text “Hello World” at the end of a string we could use the regex Hello World$. This would match the following:

This is awesome, Hello World

But would not match the following:

Hello World is awesome

Other common anchors are \A and \Z, which also match the start and end of a string. The difference appears in multi-line mode, where ^ and $ match at the start and end of every line while \A and \Z still match only at the start and end of the entire string. This is useful when working with multi-line text, such as log files, where you want to match the start and end of the entire string, but not the start and end of each line.

You can also use \b to match a word boundary and \B to match anywhere that is not a word boundary. For example, if you wanted to match the text “Hello World” but not “Hello Worldly” you could use the regex \bHello World\b which would require a word boundary before and after the text “Hello World”.

You can also use \s to match any whitespace character as a boundary, and \S to match any non-whitespace character as a boundary. Unlike \b, which matches a position between characters, these match (and consume) an actual character.
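The anchors and boundaries above can be tried out with the following Python sketch. The two-line log string is invented for illustration, and re.MULTILINE is Python’s name for the multi-line mode; other tools use different switches.

```python
import re

# ^ anchors to the start, $ anchors to the end.
start_hit = re.search(r"^Hello World", "Hello World is awesome") is not None
start_miss = re.search(r"^Hello World", "This is awesome, Hello World") is not None
end_hit = re.search(r"Hello World$", "This is awesome, Hello World") is not None

# \b requires a word boundary, so "Worldly" is rejected.
boundary_miss = re.search(r"\bHello World\b", "Hello Worldly") is not None
boundary_hit = re.search(r"\bHello World\b", "Say Hello World!") is not None

# Hypothetical two-line text to compare ^ with \A.
log = "first line\nHello World on line two"
caret_default = re.search(r"^Hello", log) is not None
caret_multiline = re.search(r"^Hello", log, flags=re.MULTILINE) is not None
string_start = re.search(r"\AHello", log, flags=re.MULTILINE) is not None

print(start_hit, start_miss, end_hit)                # True False True
print(boundary_miss, boundary_hit)                   # False True
print(caret_default, caret_multiline, string_start)  # False True False
```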

Simple examples

A regex of a{3} would match 3 consecutive a characters
- E.g. aaa would match, but aa would not
- Note: if you have a string of aaaa the pattern could match either the first 3 a characters or the last 3 a characters (most tools report the first match they find), so you may need to consider using boundaries or anchors to ensure you match the correct text
  - For example a{3}\b would match the last 3 a characters in aaaa, but not the first 3 as it requires a word boundary after the 3rd a
A regex of a{3,} would match 3 or more consecutive a characters
- E.g. aaa and aaaa would match, but aa would not
A regex of a{3,5} would match between 3 and 5 consecutive a characters
- E.g. aaa, aaaa, and aaaaa would match
- Again, similar to the first example, aaaaaa would not match in full, but its first 5 a characters would, so you may need to consider using boundaries or anchors to ensure you match the correct text
A regex of [a-c]{2} would match 2 consecutive characters that are either a, b, or c
- E.g. aa, ab, ac, ba, bb, bc, ca, cb, and cc would all match
A regex of [0-9]{4} would match 4 consecutive numerical digits
- E.g. 1234 would match, but 123 would not
- This could also be written \d{4} as \d is a shorthand for [0-9]
  - Be cautious, as mentioned before, some tools may not support the \d character
  - If you need to be more specific with the digits match, for example 1-5, you could use [1-5]{4} but \d{4} would match any digit
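The simple examples above can be checked with a few lines of Python. This is only a sketch: the test strings are made up, re.fullmatch is used where we care whether the whole text matches, and re.search or re.findall where we want to see which part matches.

```python
import re

# a{3} finds the first three a characters in aaaa...
first_three = re.findall(r"a{3}", "aaaa")
# ...while a{3}\b is forced onto the last three (positions 1 to 4).
last_three_span = re.search(r"a{3}\b", "aaaa").span()

# a{3,} needs at least three a characters.
two_as = re.fullmatch(r"a{3,}", "aa") is not None
four_as = re.fullmatch(r"a{3,}", "aaaa") is not None

# a{3,5} rejects six as a whole, but still finds five of them inside.
six_whole = re.fullmatch(r"a{3,5}", "aaaaaa") is not None
six_partial = re.search(r"a{3,5}", "aaaaaa").group()

# Character sets with a count.
pairs = re.findall(r"[a-c]{2}", "ab cd ca")
four_digits = re.findall(r"[0-9]{4}", "123 1234")
low_digits = re.findall(r"[1-5]{4}", "1234 6789")

print(first_three, last_three_span)    # ['aaa'] (1, 4)
print(two_as, four_as)                 # False True
print(six_whole, six_partial)          # False aaaaa
print(pairs, four_digits, low_digits)  # ['ab', 'ca'] ['1234'] ['1234']
```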

A simple real world example

When I worked for Vocera one of my roles was integrating 3rd party systems, such as Nurse Call systems, with our platform. In part, this involved processing incoming text-based data and extracting sections of it to be stored in the database. The data was received as a string of text, and the data we needed to extract was in a specific format. For example, we might receive a string of text like this:

ICU Room 101 Nurse

We might need to extract the room details, such as Room 101 but we wouldn’t know the room number ahead of time. So we might use a regex such as Room\s\d{3,}. This would match as follows:

Room - matches the literal text “Room”
\s - matches the space character between Room and the room number
\d{3,} - matches 3 or more consecutive numerical digits which would be the room number
- \d is a shorthand for [0-9] so this would match any digit between 0 and 9
- {3,} means 3 or more of the preceding character, so this would match 3 or more consecutive digits such as “101” or “1234”

With this regex we could match incoming data such as:

ICU Room 101 Nurse

or

ED Room 1234 Toilet
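As a sketch of how this might look in code, the Python below applies the same regex to the two messages above plus a third, made-up message with a two digit room number to show a non-match. The find_room helper is a hypothetical name for illustration, not anything from the real integration.

```python
import re

room_pattern = re.compile(r"Room\s\d{3,}")

def find_room(message):
    """Return the 'Room NNN' part of a message, or None if there is none."""
    match = room_pattern.search(message)
    return match.group() if match else None

messages = [
    "ICU Room 101 Nurse",
    "ED Room 1234 Toilet",
    "Ward Room 12 Bed",  # hypothetical: only 2 digits, so no match
]
rooms = [find_room(message) for message in messages]
print(rooms)  # ['Room 101', 'Room 1234', None]
```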

Capture groups and back references

Capture groups are a way of capturing a specific part of a match which can then be referenced either later in the same regex, a back reference, or as part of reformatting the string.

Capture groups

Capture groups are defined by wrapping the part of the regex you want to capture in ( and ). For example, if we wanted to capture the room number from the example above we could use the regex Room\s(\d{3,}). This would match the following:

Room - matches the literal text Room
\s - matches the space character between Room and the room number
( - start of the capture group
- \d{3,} - matches 3 or more consecutive numerical digits which would be the room number
) - end of the capture group

The capture group would then contain specifically just the room number, such as 101 from “ICU Room 101 Nurse”
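In Python, for example, the captured text is available through the match object’s group method, where group(0) is the whole match and group(1) is the first capture group. A minimal sketch:

```python
import re

match = re.search(r"Room\s(\d{3,})", "ICU Room 101 Nurse")

whole_match = match.group(0)  # everything the regex matched
room_number = match.group(1)  # just the part inside the brackets

print(whole_match)  # Room 101
print(room_number)  # 101
```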

Back references

Back references allow us to reference the content of a capture group later in the same regex. Imagine a simple example where we want to match a word where the first letter and the last letter are the same, such as wow or dad. We could use the regex (\w)\w\1 to match this. This would match as follows:

( - start of the capture group
- \w - matches any alphanumeric character
) - end of the capture group
\w - matches any alphanumeric character - not being captured
\1 - matches the contents of the first capture group

For the example of “wow” this would match as follows:

( - start of the capture group
- \w - matches the first “w” character
) - end of the capture group
\w - matches the “o” character - not being captured
\1 - matches the contents of the first capture group, which is “w” from the first capture group
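To see the back reference at work, the Python sketch below tests a handful of three letter words against the same regex. The word list is made up, and re.fullmatch is used so that the whole word has to fit the pattern.

```python
import re

same_ends = re.compile(r"(\w)\w\1")

# Hypothetical word list for illustration.
words = ["wow", "dad", "cat", "mum"]

# Keep only words whose first and last letters are the same.
matching_words = [word for word in words if same_ends.fullmatch(word)]
print(matching_words)  # ['wow', 'dad', 'mum']
```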

Capture groups and reformatting

Another example from my days at Vocera. We would often receive incoming data such as Patient Critical - D405 - 1. To familiarise you, “Patient Critical” would be a patient’s status, so it could be “Patient Critical” or “Needs Toilet” or any other status defined by the 3rd party systems. The “D405” would be a room number, and the “1” would be a bed number within the room. We would need to reformat the message to start with the room number, then the bed number, before finally including the status. In this format, care staff such as nurses would already be able to start heading towards a given room before they’d finished listening to the message.

We would use a regex such as (.*?)\s-\s(\w\d+)\s-\s(\d) to match the incoming data. This would match as follows:

( - start of the first capture group
- .*? - matches any character, zero or more times, but as few as possible
  - This would match the status, such as “Patient Critical” or “Needs Toilet”
  - The ? makes the match lazy, so it will match as few characters as possible before matching the next part of the regex
) - end of the first capture group
\s-\s - matches the literal text “ - “ with a space either side
( - start of the second capture group
- \w - matches a single alphanumeric character
  - This would match the letter at the start of the room number, such as “D”
  - The \w is a shorthand for [a-zA-Z0-9_] which means any alphanumeric character or underscore
- \d+ - matches any numerical digit, one or more times
  - This would match the digits of the room number, such as “405”
  - The + is greedy, so it will match as many digits as possible before matching the next part of the regex
  - The + is a shorthand for {1,} which means one or more of the preceding character
  - The \d is a shorthand for [0-9] which means any numerical digit
) - end of the second capture group
\s-\s - matches the literal text “ - “ with a space either side
( - start of the third capture group
- \d - matches any numerical digit
  - This would match the bed number, such as “1”
  - The \d is a shorthand for [0-9] which means any numerical digit
) - end of the third capture group

With this regex we would have 3 capture groups that our software could then continue to process. We might then have a reformatting string such as this: Room $2 Bed $3 Status: $1. This would reformat the incoming data to be “Room D405 Bed 1 Status: Patient Critical”.
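The same reformatting can be sketched in Python with re.sub. Note that Python writes group references in the replacement string as \1, \2, \3 rather than the $1, $2, $3 style shown above; which style applies depends on the tool or language you are using.

```python
import re

pattern = re.compile(r"(.*?)\s-\s(\w\d+)\s-\s(\d)")
message = "Patient Critical - D405 - 1"

# \2 is the room, \3 is the bed, \1 is the status.
reformatted = pattern.sub(r"Room \2 Bed \3 Status: \1", message)
print(reformatted)  # Room D405 Bed 1 Status: Patient Critical
```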

Context

Particularly when working with log files being able to use a regex to search for specific text but then gather log events that happened immediately before or after the match can be extremely useful. This is where context comes in.

In grep we have the switches -A, -B, and -C for example where -A means “after”, -B means “before”, and -C means “context”. Similarly PowerShell provides the -Context switch which takes a comma separated list of numbers to specify the number of lines before and after the match to include in the output, such as -Context 1,2 to include 1 line before and 2 lines after the match or -Context 2 to include 2 lines before and 2 lines after the match.
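If you are working in code rather than at the command line, the same idea can be approximated by hand. The Python sketch below is only an illustration of the concept, not how grep or Select-String work internally; the search_with_context function and the log lines are hypothetical.

```python
import re

def search_with_context(lines, pattern, before=0, after=0):
    """Return (line_number, line) pairs for matches plus surrounding lines."""
    regex = re.compile(pattern)
    wanted = set()
    for index, line in enumerate(lines):
        if regex.search(line):
            start = max(0, index - before)
            end = min(len(lines), index + after + 1)
            wanted.update(range(start, end))
    # Line numbers are reported starting from 1, as grep -n would.
    return [(i + 1, lines[i]) for i in sorted(wanted)]

# Hypothetical log excerpt.
log_lines = [
    "10:00 service started",
    "10:01 user alice logged in",
    "10:02 ERROR disk full",
    "10:03 retrying write",
    "10:04 write succeeded",
]

# One line before and two lines after each match.
result = search_with_context(log_lines, r"ERROR", before=1, after=2)
for number, line in result:
    print(number, line)
```

Here the match on line 3 is returned together with line 2 before it and lines 4 and 5 after it.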

Getting Some Practice

I’ve prepared some practice materials, available from my GitHub account here: https://github.com/GingerGraham/DemosAndPracticeFiles/tree/main/RegularExpressions

The materials include some simple sample text and log files and a tutorial to walk you through some examples.

Useful resources

If this article helped inspire you please consider sharing this article with your friends and colleagues, or let me know via LinkedIn or X / Twitter. If you have any ideas for further content you might like to see please let me know too.
