Exploring Perl
This document is not a complete description of what regular expressions can and cannot do, but a simple introduction to a language that looks foreign yet is very powerful.
The pattern bob will match the word bob. While this is all fine and dandy, what if you wanted to match the word bob or bobby ... or anything containing an "o"? This is where regular expressions are used.
Some characters have special significance. For instance, the "." (period) character means, "Match any single character." Therefore, the expression
    b.b
will match any of the following:
    bob, bib, bbb
The "*" character means, "Match 0 or more of the preceding character." Therefore, the expression
    bob*
will match any of the following:
    bob, bobbbb, bo
Together, these characters allow you to search for any sequence with certain starting characters. For instance, the expression
    bob.*
will match any of the following:
    bob, bobby, bob barker
What if we wanted to use the pattern b.b, but only wanted to match when the middle character is a vowel? In this case, you would use
    b[aeiou]b
where the brackets, [...], match a single character against any of the characters listed between them. This is often used to ignore capitalization (e.g., the pattern
    [Bb]ob
will match either Bob or bob).
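The examples so far can be checked with a short script. This is only a sketch: it assumes Python's `re` module (which shares these basic constructs with Perl) and uses `re.fullmatch`, so that "match" means the pattern accounts for the whole word, as in the lists above. The helper name `matches` is made up for illustration.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# "." stands for exactly one character, so "bb" and "boob" are rejected.
dot = matches(r"b.b", ["bob", "bib", "bbb", "bb", "boob"])

# "*" applies to the preceding character only (the final "b").
star = matches(r"bob*", ["bob", "bobbbb", "bo", "b"])

# ".*" means "any run of characters, possibly empty".
dot_star = matches(r"bob.*", ["bob", "bobby", "bob barker", "bo"])

# Brackets match one character out of the listed set.
vowel = matches(r"b[aeiou]b", ["bob", "bib", "bbb"])
either_case = matches(r"[Bb]ob", ["Bob", "bob", "BOB"])

for name, result in [("b.b", dot), ("bob*", star), ("bob.*", dot_star),
                     ("b[aeiou]b", vowel), ("[Bb]ob", either_case)]:
    print(name, "->", result)
```

Candidates such as "bb" and "BOB" are included only to show what each pattern rejects.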
The bracket sequence allows ranges, so a pattern like
    [b-d]op
will match:
    bop, cop, dop
Of course, the bracket sequence can be used with the "*", so the pattern
    b[aeiou]*b
will match:
    bob, bab, baaab, boab, beieb, bb
The "?" is similar to the "*" character, but means, "Match 0 or 1 occurrence of the previous character." So, the pattern
    bo?b
will only match:
    bob, bb
The "+" character is also similar, but means, "Match 1 or more occurrences of the previous character." So, the pattern
    be+
will match:
    be, bee, beeeeeeeeee
To match either "bob" or "bobby," we would make use of the grouping characters, (...), to make our search pattern:
    bob(by)?
Remember, the "?" character says to match 0 or 1 occurrence of the previous character, but in this case the characters "by" have been grouped together, so this will match bob or bobby.
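As a rough check of ranges, "?", "+", and grouping, the following sketch again assumes Python's `re` module and whole-word matching via `re.fullmatch`. The helper `matches` is a hypothetical convenience, not part of any library.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# A range inside brackets: b, c, or d followed by "op".
ranged = matches(r"[b-d]op", ["bop", "cop", "dop", "top"])

# Brackets combined with "*": any number of vowels between two b's.
vowels = matches(r"b[aeiou]*b", ["bob", "baaab", "beieb", "bb", "bcb"])

# "?" allows zero or one "o"; two are too many.
optional = matches(r"bo?b", ["bob", "bb", "boob"])

# "+" requires at least one "e".
plus = matches(r"be+", ["be", "bee", "b"])

# Grouping makes "?" apply to the whole sequence "by".
grouped = matches(r"bob(by)?", ["bob", "bobby", "bobb"])

print(ranged, vowels, optional, plus, grouped, sep="\n")
```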
Often, we would like to match either "bob" or "dog." In order to do this, we would use the pattern
    (bob)|(dog)
where we have created two groups and joined them with the "or" symbol (the "|" character).
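A small sketch of alternation, assuming Python's `re` module. It also compares the grouped form with the ungrouped `bob|dog`, which behaves the same here because "|" already separates the two whole alternatives. The word list is made up for illustration.

```python
import re

words = ["bob", "dog", "cat", "bobdog"]

# The grouped form from the text.
grouped = re.compile(r"(bob)|(dog)")
grouped_hits = [w for w in words if grouped.fullmatch(w)]

# The same alternation without the parentheses.
plain = re.compile(r"bob|dog")
plain_hits = [w for w in words if plain.fullmatch(w)]

print(grouped_hits)
print(plain_hits)
```

The parentheses start to matter once the alternation is only part of a longer pattern, since they limit how far the "|" reaches.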
Consider a pattern intended to match a telephone number (the backslashes make the parentheses literal characters rather than grouping characters):
    \([0-9]*\)[0-9]*-[0-9]*
(Yes, the "-" is a special character, but only when it is in the form [x-x], where x is any character. If you want to use a "-" as a character in brackets, but not as a range, place it at the beginning of the [...] sequence, like [-abc].)
Some characters that are not easily typed from the keyboard (or are used so much that we get tired of typing the entire thing) have a metacharacter sequence that represents them.
Because "*" accepts any number of digits, including none, the telephone pattern also matches strings such as:
    ()-
    (3453)34-2355
But the following won't match:
    (801) 555-1212 --- Notice the space
To control how many characters are matched, we use the braces, {#}, and specify a number. So now, our telephone pattern should look like:
    \([0-9]{3}\)[0-9]{3}-[0-9]{4}
where we have replaced each "*" character with "{3}" or "{4}". This form has two other modifiers:
    {n,}   means "Match the previous character at least n times," and
    {n,m}  means "Match the previous character at least n times, but not more than m times."
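The difference between the loose and the counted telephone patterns can be sketched as follows, assuming Python's `re` module and whole-string matching. The sample list reuses the strings from the text plus one well-formed number, and the `bo{...}b` patterns are made-up illustrations of the two other brace forms.

```python
import re

loose = re.compile(r"\([0-9]*\)[0-9]*-[0-9]*")
strict = re.compile(r"\([0-9]{3}\)[0-9]{3}-[0-9]{4}")

samples = ["(801)555-1212", "()-", "(3453)34-2355", "(801) 555-1212"]

# "*" lets every digit run be empty or overlong, so junk gets through.
loose_hits = [s for s in samples if loose.fullmatch(s)]

# Exact counts accept only the 3-3-4 digit layout; the space still fails.
strict_hits = [s for s in samples if strict.fullmatch(s)]

# {n,} is "at least n"; {n,m} is "between n and m".
at_least_two = [s for s in ["bob", "boob", "booooob"]
                if re.fullmatch(r"bo{2,}b", s)]
one_or_two = [s for s in ["bob", "boob", "booob"]
              if re.fullmatch(r"bo{1,2}b", s)]

print(loose_hits)
print(strict_hits)
print(at_least_two, one_or_two)
```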
A common assumption, based on more limited versions of regular expressions, is that the pattern
    bob*
should match anything that starts with the characters bob. However, if a program supports the full regular expression codes, then this should really be
    bob.*
(notice the period). Another assumption that many people make when using the Catalog program is that if they type the pattern "bob," it should match "Bob," "Bob Barker," or "Joe Bob Briggs."
In order to make everyone happy, the "Query Catalog" program implements some of these more limited versions by placing a ".*" sequence around every pattern that is entered. So, the pattern bob really becomes
    .*bob.*
and will now act as expected.
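The wrapping step might look something like the sketch below. How the Catalog program actually does it is not shown in this document, so the function `wrap_limited` is purely hypothetical, and Python's `re` module stands in for whatever engine the program uses. The names are kept lowercase so that capitalization does not affect the result.

```python
import re

def wrap_limited(pattern):
    """Hypothetical helper: surround a pattern with ".*" on both sides."""
    return ".*" + pattern + ".*"

wrapped = wrap_limited("bob")

names = ["bob", "bob barker", "joe bob briggs", "alice"]

# With the wrapper, a whole-string match succeeds whenever "bob"
# appears anywhere in the name.
wrapped_hits = [n for n in names if re.fullmatch(wrapped, n)]

# That is the same set a plain substring-style search would find.
search_hits = [n for n in names if re.search("bob", n)]

print(wrapped)
print(wrapped_hits)
```

Simple concatenation is enough for a pattern like bob. A pattern containing "|" would need to be grouped before wrapping, which this sketch does not attempt.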
Of course, now the regular expression purists (REPs) begin to complain because their pattern
    john.*
will not only match "John Doe," but will also match "Barry Johnny." This is why the "Catalog Query" program has an option to select either version, with Limited expressions as the default.
I should also note that the Limited regular expressions are "case insensitive," meaning that the pattern john will match both "john" and "John" ... and "jOhN" for that matter. Selecting the full regular expression option turns this off and makes you enter
    [Jj]ohn
instead.
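The two modes can be contrasted in a short sketch. Everything here is an assumption about behavior the text only describes: `limited_match` and `full_match` are hypothetical names, Limited mode is modeled as ".*" wrapping plus case-insensitive matching, and full mode is modeled as a case-sensitive match of the whole string using Python's `re` module.

```python
import re

def limited_match(pattern, text):
    """Assumed Limited mode: wrap in ".*" and ignore case."""
    return re.fullmatch(".*" + pattern + ".*", text, re.IGNORECASE) is not None

def full_match(pattern, text):
    """Assumed full mode: the pattern as typed, case-sensitive."""
    return re.fullmatch(pattern, text) is not None

# Limited mode is forgiving about case and position.
print(limited_match("john", "jOhN"))            # True
print(limited_match("john.*", "Barry Johnny"))  # True

# Full mode matches only what was actually written.
print(full_match("john.*", "Barry Johnny"))     # False
print(full_match("john", "John"))               # False
print(full_match("[Jj]ohn", "John"))            # True
print(full_match("[Jj]ohn.*", "John Doe"))      # True
```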