Exploring Perl

Regular Expression Tutorial

Introduction

Regular expressions are patterns of characters used to search text. They are implemented in the Unix program egrep(1), and a limited version (wildcards) is used in shells like sh(1) and in MS-DOS.

This document is not a complete description of what is and is not available with regular expressions. It is a simple introduction to a language that looks foreign but is very powerful.

Regular Expressions

Any ordinary character matches itself. Therefore, the expression:

bob

will match the word bob.

While this is all fine and dandy, what if you wanted to match either bob or bobby ... or anything containing an "o"? This is where the special characters of regular expressions come in.

Some characters have special significance. For instance, the "." (period) character means, "Match any single character." Therefore, the expression b.b will match any of the following:

        bob
        bib
        bbb

The "*" character means, "Match 0 or more of the preceding character." Therefore, the expression bob* will match any of the following:

        bob
        bobbbb
        bo

Together, these characters will allow you to search for any sequence with certain starting characters. For instance, the expression: bob.*, will match any of the following:

        bob
        bobby
        bob barker
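
The three patterns so far can be checked with a short Python sketch. It assumes Python's re module, which the tutorial itself does not use, and a hypothetical helper named matches(). The helper uses re.fullmatch() so that the whole string has to match, which mirrors the lists above; egrep(1) would instead report any line that contains a match.

```python
import re

def matches(pattern, text):
    # Hypothetical helper: True only when the WHOLE string matches the pattern.
    return re.fullmatch(pattern, text) is not None

# "." stands for exactly one character, so "bb" is too short.
dot_results = {w: matches(r"b.b", w) for w in ["bob", "bib", "bbb", "bb"]}

# "*" applies to the single character before it (the final "b").
star_results = {w: matches(r"bob*", w) for w in ["bob", "bobbbb", "bo", "boo"]}

# ".*" means "any run of characters", so anything starting with bob matches.
dot_star_results = {w: matches(r"bob.*", w) for w in ["bob", "bobby", "bob barker", "bo"]}

print(dot_results)
print(star_results)
print(dot_star_results)
```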

What if we wanted to use the pattern b.b, but only wanted the middle character to be a vowel? In this case, you would use b[aeiou]b, where the brackets [...] match a single character that is any one of the characters listed inside them. This is often used to ignore capitalization (i.e. [Bb]ob will match either Bob or bob).

The bracket sequence allows ranges, so a pattern like, [b-d]op will match:

        bop
        cop
        dop

Of course the bracket sequence can be used with the "*", so the pattern, b[aeiou]*b, will match:

        bob
        bab
        baaab
        boab
        beieb
        bb
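
As a rough check of the bracket patterns, the following Python sketch filters a few candidate words through each one. The re module and the helper name matches() are assumptions of this example, and whole-string matching is used to mirror the lists above.

```python
import re

def matches(pattern, text):
    # Hypothetical helper: True only when the whole string matches.
    return re.fullmatch(pattern, text) is not None

# Exactly one vowel between the two b's.
vowel = [w for w in ["bob", "bab", "bbb", "bxb"] if matches(r"b[aeiou]b", w)]

# Either capitalization of the first letter only.
either_case = [w for w in ["Bob", "bob", "BOB"] if matches(r"[Bb]ob", w)]

# A range: b, c or d.
in_range = [w for w in ["bop", "cop", "dop", "eop"] if matches(r"[b-d]op", w)]

# Zero or more vowels between the two b's.
any_vowels = [w for w in ["bob", "baaab", "boab", "beieb", "bb", "bxb"]
              if matches(r"b[aeiou]*b", w)]

print(vowel, either_case, in_range, any_vowels)
```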

The "?" is similar to the "*" character, but means, "Match 0 or 1 occurrence of the previous character." So, the pattern bo?b will only match:

        bob
        bb

The "+" character is also similar, but means, "Match 1 or more occurrences of the previous character." So, the pattern be+ will match:

        be
        bee
        beeeeeeeeee
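
A small Python sketch (assuming the re module and whole-string matching, neither of which comes from the tutorial) shows how "?" and "+" differ from "*":

```python
import re

def matches(pattern, text):
    # Hypothetical helper: True only when the whole string matches.
    return re.fullmatch(pattern, text) is not None

# "?" allows zero or one "o", so "boob" is rejected.
optional = [w for w in ["bob", "bb", "boob"] if matches(r"bo?b", w)]

# "+" needs at least one "e", so a bare "b" is rejected.
one_or_more = [w for w in ["b", "be", "bee", "beeeeeeeeee"] if matches(r"be+", w)]

# "*" would accept the bare "b" as well.
zero_or_more = [w for w in ["b", "be", "bee"] if matches(r"be*", w)]

print(optional, one_or_more, zero_or_more)
```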

To match either "bob" or "bobby," we would make use of the grouping characters, (...), to write our search pattern as: bob(by)?. Remember, the "?" character says to match 0 or 1 occurrence of the previous character, but in this case the characters "by" have been grouped together and are treated as one unit, so this will match bob or bobby.

Often, we would like to match either "bob" or "dog." In order to do this, we would use the pattern: (bob)|(dog), where we have created two groups and use the "or" symbol (the "|" character).
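
Grouping and the "or" symbol can be tried out the same way. This is only a sketch: the re module and the helper matches() are assumptions of the example, and it tests whole strings rather than searching inside lines.

```python
import re

def matches(pattern, text):
    # Hypothetical helper: True only when the whole string matches.
    return re.fullmatch(pattern, text) is not None

# The group (by) is optional as a unit: "bobb" has only half of it.
grouped = [w for w in ["bob", "bobby", "bobb", "bobbyby"] if matches(r"bob(by)?", w)]

# Either alternative may match.
either = [w for w in ["bob", "dog", "cat"] if matches(r"(bob)|(dog)", w)]

print(grouped, either)
```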

Metacharacter Behavior

We have just given a list of the "special" characters (called metacharacters), but what if the data we are searching contains those characters? In this case, we "quote" them using the "\" (backslash) character. So to match a phone number with an area code like: (801)555-1212, we would use the following:

        \([0-9]*\)[0-9]*-[0-9]*

(Yes, the "-" is a special character, but only when it is in the form: [x-y], where x and y are any two characters. If you want to use a "-" as a character in brackets, but not as a range, place it at the beginning of the [...] sequence, like: [-abc]).

Some characters that are not easily typed from the keyboard (or are used so much that we get tired of typing the entire thing in) have a metacharacter sequence that represents them. The following are some of the more popular ones:

\t - The Tab character
\n - The newline character
\s - Match any whitespace (space, tab, etc)
\d - Match any digit (same as [0-9]).
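
The sequences above can be exercised with a short Python sketch. The re module is an assumption of this example. One caveat is specific to Python 3: on ordinary strings, \d also accepts digits from other scripts unless the re.ASCII flag is given, so the flag is used here to keep \d equal to [0-9].

```python
import re

# \t and \n match the literal tab and newline characters.
tab_ok = re.fullmatch(r"a\tb", "a\tb") is not None
newline_ok = re.fullmatch(r"a\nb", "a\nb") is not None

# \s matches any single whitespace character, but not a letter.
whitespace = [c for c in [" ", "\t", "\n", "x"] if re.fullmatch(r"\s", c)]

# \d matches one digit; re.ASCII restricts it to 0-9.
digits = [s for s in ["7", "42", "4x"] if re.fullmatch(r"\d+", s, re.ASCII)]

print(tab_ok, newline_ok, whitespace, digits)
```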

Advanced Stuff

The telephone pattern above won't work exactly as we would like, because any of the following would also match:

        ()-
        (3453)34-2355

But the following won't:

        (801) 555-1212          --- Notice the space

To reject the first two, we use the braces, {#}, and specify a number. So now, our telephone pattern should look like:

        \([0-9]{3}\)[0-9]{3}-[0-9]{4}

Here we have replaced each "*" character with "{3}" or "{4}". This form has two other variants: {n,} means "Match the previous character at least n times," and {n,m} means "Match the previous character at least n times, but not more than m times."
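
The two telephone patterns can be compared with a Python sketch. The re module, the names LOOSE, STRICT and SPACED, and the sample list are assumptions of this example. SPACED is an illustrative variation that is not part of the tutorial: it adds " ?" so that an optional space after the area code is accepted.

```python
import re

LOOSE = r"\([0-9]*\)[0-9]*-[0-9]*"          # the "*" version
STRICT = r"\([0-9]{3}\)[0-9]{3}-[0-9]{4}"   # the braces version
SPACED = r"\([0-9]{3}\) ?[0-9]{3}-[0-9]{4}" # assumption: optional space added

samples = ["(801)555-1212", "()-", "(3453)34-2355", "(801) 555-1212"]

loose = [s for s in samples if re.fullmatch(LOOSE, s)]
strict = [s for s in samples if re.fullmatch(STRICT, s)]
spaced = [s for s in samples if re.fullmatch(SPACED, s)]

# The open-ended and bounded forms of the braces.
at_least_two = [s for s in ["1", "12", "12345"] if re.fullmatch(r"[0-9]{2,}", s)]
two_to_four = [s for s in ["1", "12", "12345"] if re.fullmatch(r"[0-9]{2,4}", s)]

print(loose, strict, spaced, at_least_two, two_to_four)
```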

Limited vs. Full Expressions

While Full Regular Expressions have been discussed in detail above, most people are familiar with more Limited versions. For instance, many shell and DOS users expect that the pattern: bob*, should match anything that starts with the characters, bob. However, if a process supports the full regular expression codes, then this should really be: bob.* (notice the period).
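
The difference between the two readings of bob* can be seen in a Python sketch. It assumes Python's fnmatch module as a stand-in for shell-style wildcards and the re module for full regular expressions; neither is what a shell or DOS actually runs.

```python
import fnmatch
import re

# Shell-style wildcard: "*" alone means "any run of characters".
wildcard = fnmatch.fnmatchcase("bobby", "bob*")

# Full regular expression: "*" only repeats the "b" before it.
regex_star = re.fullmatch(r"bob*", "bobby") is not None

# The full-expression spelling of the same intent needs the period.
regex_dot_star = re.fullmatch(r"bob.*", "bobby") is not None

print(wildcard, regex_star, regex_dot_star)
```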

Another assumption that many people make when using the Catalog program is that if they type the pattern "bob," it should match "Bob," "Bob Barker," or "Joe Bob Briggs."

In order to make everyone happy, the "Query Catalog" program implements some of these more limited versions by placing a ".*" sequence around every pattern that is entered. So, the pattern, bob, really becomes, .*bob.* and will now act as expected.

Of course, now the regular expression purists (REPs) begin to complain, because their pattern of john.* will not only match "John Doe," but will also match "Barry Johnny." This is why the "Catalog Query" program has an option to select either version, with Limited expressions as the default.

I should also note that the Limited regular expressions are "case insensitive," meaning that the pattern john will match both "john" and "John" ... and "jOhN" for that matter. Selecting the full regular expression option turns this off and makes you enter [Jj]ohn.
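
The two modes can be imitated with a short Python sketch. This is not the Catalog program's code. The function names limited_match() and full_match() are hypothetical, and the sketch simply assumes that Limited mode wraps the pattern in ".*" and ignores case, as described above.

```python
import re

def limited_match(pattern, text):
    # Hypothetical Limited mode: surround with ".*" and ignore case.
    return re.fullmatch(".*" + pattern + ".*", text, re.IGNORECASE) is not None

def full_match(pattern, text):
    # Hypothetical Full mode: the pattern is used exactly as typed.
    return re.fullmatch(pattern, text) is not None

names = ["Bob", "Bob Barker", "Joe Bob Briggs", "John Doe", "Barry Johnny"]

limited_bob = [n for n in names if limited_match("bob", n)]
limited_john = [n for n in names if limited_match("john.*", n)]
full_john = [n for n in names if full_match("john.*", n)]
full_john_either_case = [n for n in names if full_match("[Jj]ohn.*", n)]

print(limited_bob, limited_john, full_john, full_john_either_case)
```

In Full mode the lowercase pattern john.* matches none of the sample names, which is why the purist has to type [Jj]ohn.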