Regex Basics: A Regular Expression Tutorial

by Canonical SEO on October 29, 2009

Learn regex basics

One might wonder, “Why on earth is Canonical writing about regex basics?” So many new webmasters don’t have a clue what it is, and if they do then they have no clue how it relates to SEO.  Believe it or not, there is a major connection between regex and search engine optimization.

Having at least a basic knowledge of the regex vocabulary and how to create simple to intermediate regular expressions is crucial to the success of any webmaster.  Mod Rewrite (like many Un*x-based tools) is built on top of a regular expression parser.  Understanding the basics of regex is a prerequisite for learning Mod Rewrite.  And knowing how to use Mod Rewrite is a mandatory requirement for calling yourself a webmaster IMO.

So let’s start learning some regex basics.

What is regex?

Regex is a robust language or syntax used to “express” or specify patterns in data, typically in text data or strings.  It is short for regular expression.  It is also frequently referred to as regexp.   Regex can be thought of as “wildcarding on steroids”.

A regular expression is made up of a combination of literal characters and/or metacharacters.  The metacharacters have special meaning to a regex parser.  The parser takes two inputs, a regular expression and an input string, and evaluates them to determine whether the input string contains the sequence of characters defined by the regex pattern.  The regex parser then returns a Boolean value to indicate the result of the matching.
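To make the two-inputs-in, Boolean-out idea concrete, here is a minimal sketch in Python using its standard re module.  Python is simply my choice for illustration, and the helper name contains_match is made up for this example.  I use re.search because it looks for the pattern anywhere in the input string, which is the "contains" behavior described above.

```python
import re

def contains_match(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if `pattern` matches anywhere in `text`."""
    # re.search returns a match object on success and None on failure,
    # so comparing against None yields the Boolean result.
    return re.search(pattern, text) is not None

print(contains_match(r"abc", "xxabcxx"))  # True: abc appears in the input
print(contains_match(r"abc", "xxabxcx"))  # False: abc never appears intact
```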

The first step to understanding regular expressions is to understand the metacharacters used for pattern matching.

Regular Expression Cheat Sheet for Metacharacters

The following cheat sheet should prove to be an invaluable reference to anyone just learning regex. It contains the most commonly used metacharacters recognized by the regex parser – those used in most simple regular expressions.

Regex Metacharacter Meaning
^ An anchor representing the start of a string.
$ An anchor representing the end of a string.
. Matches any single character.
\ Escapes a regex metacharacter so that it will be treated as a literal by the regular expression parser. It can also be used to add special meaning to characters that would otherwise be treated as a literal.
* Matches zero or more occurrences of the previous construct.
? Matches zero or one occurrences of the previous construct.
+ Matches one or more occurrences of the previous construct.
[ ] Called a character class. Matches only one of the characters contained inside the brackets.
( ) Provides a grouping functionality so that you can treat a group of characters as a single unit. Also captures the characters matched by the group so that you can use them later; the captured value is called a back reference.

 

Become familiar with the above regular expression metacharacters.  It pays to understand regular expressions, especially if you work in a Un*x environment and/or host a web site on Apache.

Basic regex pattern matching

The best way to learn is to go through some simple regex examples.  It can be confusing at first due to the cryptic syntax of regular expressions. But once you have walked through several basic examples, you will likely start to catch on quickly. 

As I mentioned previously, a regular expression parser returns a true or false Boolean value to indicate whether the input string matched the regex pattern.  In some cases where the regex pattern uses the () characters to group literal and metacharacters in the pattern, the regex pattern can be used to capture substrings from the input string as a back reference for later use.

You will find that there are many ways to write a regular expression that matches a particular pattern.  Some regular expressions are more efficient than others.  Writing efficient expressions will come with practice.  But initially you should simply focus on learning the basics regardless of whether the patterns you write are efficient or not.

Anchoring matches to the start and end of string (^ and $)

The ^ character is used to anchor a regex pattern to the beginning of the input string.  The $ character is used as an end of string anchor.  Below are some examples of basic regular expression patterns utilizing the start and end of string anchors:

Regex Pattern Matches
^$ Any input string where there is nothing between the start of string and end of string (in other words, it matches only the empty string)
^abc Any input string that begins with abc.
abc$ Any input string that ends with abc
^abc$ Only the input string abc
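If you want to check the anchor patterns above for yourself, the following sketch runs each one through Python's re.search.  The input strings are examples I made up, and Python is just one of many regex parsers you could use.

```python
import re

# (pattern, input string) pairs; the inputs are illustrative only.
anchor_cases = [
    (r"^$", ""),          # empty string: match
    (r"^$", "abc"),       # something between start and end: no match
    (r"^abc", "abcdef"),  # begins with abc: match
    (r"^abc", "xabc"),    # abc is not at the start: no match
    (r"abc$", "xyzabc"),  # ends with abc: match
    (r"abc$", "abcx"),    # abc is not at the end: no match
    (r"^abc$", "abc"),    # exactly abc: match
    (r"^abc$", "abcabc"), # more than abc: no match
]

anchor_results = [re.search(p, s) is not None for p, s in anchor_cases]

for (pattern, text), matched in zip(anchor_cases, anchor_results):
    print(f"{pattern!r} vs {text!r}: {matched}")
```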

 

Matching any character (.)

The . character is used in a regex pattern to match any single character.  Below are some examples of using . in regular expressions:

Regex Pattern Matches
a.c Any input string that contains the letter a followed immediately by any character followed immediately by the letter c. Examples:
abc
1a2c3
accept
match
^.$ Any input string that is exactly one character in length.
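Here is a quick sketch of the . examples in Python.  The extra inputs ("ac", "x", "xy" and the empty string) are my own additions to show cases that do not match.

```python
import re

# The four example inputs from the table, plus "ac" as a non-matching case.
dot_inputs = ["abc", "1a2c3", "accept", "match", "ac"]
dot_found = {s: re.search(r"a.c", s) is not None for s in dot_inputs}
print(dot_found)

# ^.$ only matches inputs that are exactly one character long.
one_char = [s for s in ["x", "", "xy"] if re.search(r"^.$", s)]
print(one_char)  # ['x']
```

Note that "ac" fails because the . must consume one character between the a and the c.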

 

Escaping characters (\)

The \ (backslash) or escape character is used in a regular expression to remove the special meaning from a metacharacter, causing it to be treated as a literal.  The following regex examples demonstrate the use of the \ to escape metacharacters so that they are treated as literal characters:

Regex Pattern Matches
\.jpg$ Any input string that ends in .jpg
\$1\.00$ Any input string that ends in $1.00

 

The backslash character can also be used to add special meaning to characters which would otherwise be treated as literals.  For example, \d can be used to match a decimal digit (i.e. 0, 1, …, 9) or \s to match a whitespace character such as a space or tab.
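The sketch below shows both uses of the backslash in Python.  The file names and strings are invented for the example.  I write the patterns as raw strings (the r prefix) so that Python itself leaves the backslashes alone and passes them through to the regex parser.

```python
import re

# Without the backslash, . is a wildcard, so "_jpg" is accepted by mistake.
unescaped_hit = re.search(r".jpg$", "photo_jpg") is not None
# With the backslash, only a literal dot is accepted.
escaped_miss = re.search(r"\.jpg$", "photo_jpg") is not None
escaped_hit = re.search(r"\.jpg$", "photo.jpg") is not None

# \$ is a literal dollar sign; the final unescaped $ is the end anchor.
price_hit = re.search(r"\$1\.00$", "Total: $1.00") is not None

# \d and \s add special meaning to the ordinary letters d and s.
first_digit = re.search(r"\d", "room 7").group(0)
has_whitespace = re.search(r"\s", "no_spaces_here") is not None

print(unescaped_hit, escaped_miss, escaped_hit)  # True False True
print(price_hit, first_digit, has_whitespace)    # True 7 False
```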

Matching zero or more characters (*)

The * character is used in a regular expression to match the previous character in the pattern zero or more times.  Some simple examples of using the * character in a regular expression are as follows:

Regex Pattern Matches
.* Any input string (actually, the entire input string) even if it is the empty string.
^\s*# Any input string that begins with zero or more spaces followed immediately by a # character.
^a*bc$ Any input string that begins with zero or more occurrences of the a character immediately followed by bc at the end of the string. Examples:
bc
abc
aabc
aaabc
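The * examples can be tried out with a short Python sketch like the one below.  The inputs "abbc" and "xbc" and the two sample lines are made up to show what does not match.

```python
import re

# ^a*bc$ : the four inputs from the table, then two that should fail.
star_inputs = ["bc", "abc", "aabc", "aaabc", "abbc", "xbc"]
star_results = [re.search(r"^a*bc$", s) is not None for s in star_inputs]
print(star_results)  # [True, True, True, True, False, False]

# ^\s*# : a # preceded only by optional whitespace at the start.
comment_line = re.search(r"^\s*#", "   # a comment") is not None
code_line = re.search(r"^\s*#", "x = 1  # trailing") is not None
print(comment_line, code_line)  # True False

# .* grabs the entire (single-line) input, and still matches an empty one.
whole = re.search(r".*", "hello world").group(0)
empty = re.search(r".*", "").group(0)
print(repr(whole), repr(empty))  # 'hello world' ''
```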

 

Matching an optional character (?)

The ? character is used in a regular expression to make the previous character in the pattern optional.  In other words, it matches zero or one occurrences of the previous character in the pattern.

Regex Pattern Matches
e-?mail Any input string that contains the string email or e-mail.
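A small Python sketch of the optional hyphen follows.  The last two inputs are my own, chosen to show that ? allows at most one hyphen and does not stand in for other characters.

```python
import re

optional_inputs = ["email", "e-mail", "my e-mail address", "e--mail", "e mail"]
optional_results = [re.search(r"e-?mail", s) is not None for s in optional_inputs]

for text, matched in zip(optional_inputs, optional_results):
    print(f"{text!r}: {matched}")
# The first three match; 'e--mail' and 'e mail' do not.
```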

 

Matching one or more characters (+)

The + character is used in regex to match the previous character in the pattern one or more times.  Some simple examples of using + in a regular expression are as follows:

Regex Pattern Matches
^\d+\.jpg$ Any input string that begins with one or more decimal digits immediately followed by .jpg at the end of the string
a+bc Any input string that contains a series of one or more consecutive occurrences of the letter a immediately followed by bc. Examples:
abc
aabc
aaabc
aabcc
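The difference between + and * is easiest to see with an input that has no a at all.  The sketch below checks the + examples in Python; "bc" and the file names are inputs I made up.

```python
import re

# a+bc needs at least one a, so "bc" fails here (it passed with a*bc).
plus_inputs = ["abc", "aabc", "aaabc", "aabcc", "bc"]
plus_results = [re.search(r"a+bc", s) is not None for s in plus_inputs]
print(plus_results)  # [True, True, True, True, False]

# ^\d+\.jpg$ : the whole name must be digits followed by .jpg
file_names = ["123.jpg", ".jpg", "12a.jpg"]
file_results = [re.search(r"^\d+\.jpg$", s) is not None for s in file_names]
print(file_results)  # [True, False, False]
```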

 

Matching a character class ( [ and ] )

You can use [ and ] to match a character class or character list in regex.  The construct matches one and only one of the characters between the [ and ]. Below are some examples of using character classes in a regular expression:

Regex Pattern Matches
[bcf]at Any input string that contains either the letter b or c or f immediately followed by at (in other words, any input string containing bat, cat, or fat)

 

You can also use a hyphen inside the start and end brackets of a character class to indicate a range of characters. 

Regex Pattern Matches
^[b-f]at$ Any input string that is exactly 3 characters in length where the first character is b, c, d, e, or f followed immediately by at (in other words, any input string that is exactly bat, cat, dat, eat, or fat)
^image[0-9]\.jpg$ Any input string that starts with image followed immediately by a single decimal digit followed immediately by .jpg at the end of the string (in other words, image0.jpg, image1.jpg, …, image9.jpg only)
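Here is a Python sketch covering both the simple character class and the range examples.  The test words and file names are invented for illustration.

```python
import re

# [bcf]at : "combat" matches because it contains bat; "rat" does not match.
class_words = ["bat", "cat", "fat", "rat", "combat"]
class_results = [re.search(r"[bcf]at", w) is not None for w in class_words]
print(class_results)  # [True, True, True, False, True]

# ^[b-f]at$ : g is outside the b-f range, and "bats" is too long.
range_words = ["dat", "eat", "gat", "bats"]
range_results = [re.search(r"^[b-f]at$", w) is not None for w in range_words]
print(range_results)  # [True, True, False, False]

# ^image[0-9]\.jpg$ : [0-9] stands for exactly one digit.
image_names = ["image0.jpg", "image9.jpg", "image10.jpg", "imageA.jpg"]
image_results = [re.search(r"^image[0-9]\.jpg$", n) is not None for n in image_names]
print(image_results)  # [True, True, False, False]
```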

 

Grouping patterns and capturing results ( ( and ) )

Parentheses can be used to group characters in a regular expression so that they can be treated as a single unit.  This is very useful in pattern matching as it allows you to apply metacharacters to sub-patterns within a bigger pattern. An example of using parentheses in a regex pattern for grouping is as follows:

Regex Pattern Matches
(xyz)+ Any input string that contains one or more consecutive occurrences of xyz (in other words, xyz, xyzxyz, xyzxyzxyz, etc.)
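To see what the parentheses buy you, compare (xyz)+ with xyz+ in the Python sketch below.  Without the group, the + applies only to the z.  The input strings are made up for the example.

```python
import re

# With grouping, + repeats the whole unit xyz.
grouped = re.search(r"(xyz)+", "abxyzxyzxyzcd").group(0)
print(grouped)  # xyzxyzxyz

# Without grouping, + repeats only the preceding z.
ungrouped = re.search(r"xyz+", "xyzzz").group(0)
print(ungrouped)  # xyzzz

# The grouped pattern stops after one xyz on the same input.
grouped_on_z = re.search(r"(xyz)+", "xyzzz").group(0)
print(grouped_on_z)  # xyz
```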

 

Applications like Mod Rewrite which are built on top of regex parsers also utilize parentheses to capture the results of a match for later use.  These captured values are stored in variables called back references and can be used to determine what string of characters matched the pattern.  I will discuss back references when I get around to writing a post on the basics of Mod Rewrite.
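In the meantime, here is a rough sketch of capturing in Python rather than Mod Rewrite.  The file names and the photo- prefix are invented for the example, and the \1 notation shown is Python's own; other tools spell their back references differently.

```python
import re

pattern = r"^image([0-9]+)\.jpg$"

# The parentheses capture whatever the [0-9]+ inside them matched.
match = re.search(pattern, "image42.jpg")
captured = match.group(1)
print(captured)  # 42

# The captured value can be reused when building a new string.
# In Python's re.sub, \1 refers to the first captured group.
renamed = re.sub(pattern, r"photo-\1.jpg", "image42.jpg")
print(renamed)  # photo-42.jpg
```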

Learning more about regular expressions

Familiarize yourself with the basics of regular expressions by experimenting with an online regex parser.   Once you have the regex basics down, you can move on to more advanced techniques.  This post barely scratches the surface of what you can do using regular expressions.   The sky is the limit!
