Perl Tutorial

Hi!

Yes, it is time for the obligatory ``hello world'' program:

print "Hello world!\n";

The point of a ``hello world'' program is not what the program does, but how you get it to execute. This really depends on the system you are using, but for this class, I will assume it is a Unix system. In fact, I will assume you are using one of the RCI or ICI machines.

  1. Log into RCI (or ICI) and start up your favorite editor.
  2. Name the file hello.
  3. Write the following: (Make sure the first line is really the first line of the file.)

         #!/usr/local/bin/perl5
         print "Hello world!\n";

  4. Save the file.
  5. At your account prompt, execute the command: chmod u+x hello

Your program should be ready to go. Just type hello at your account prompt and it should run. (If your shell does not look in the current directory for commands, type ./hello instead.) I won't insult you by including the output of the program.

What was That!?

While there are two interesting points to make about the program itself, I first want to explain what the above sequence of actions did. (This is the part where people's eyes start glazing over. This is also the part where they shouldn't. Pay attention here and you will be able to figure out how to write and run a perl program on any machine.)

The first four steps took you through the creation of a perl program. This is nothing fancy. If you have edited files on a Unix system in the past, you didn't do anything new. However, notice that I added something to the beginning of the program. That line that starts with the tic-tac-toe thing (#) and an exclamation point (!). (It is pronounced hash-bang in Unixese.)

This new line of text is not part of the program itself. It is a command to the Unix operating system telling it to take the rest of the file and hand it to the program listed after the hash-bang. If you don't follow that, just keep in mind that it tells Unix that the rest of the file is written in perl.

The ``chmod'' thing in step five might be a little cryptic if you only use Unix to read your email. It changes the protection of the file you just created so that it can be executed by you, the user. (chmod u+x hello means that the user who owns the file gets execute access to the hello file.)

So, if you ever are stuck with the task of running a perl program on a non-Unix machine, you know that you have to find a way to make the file executable and make it so the operating system knows what language the file is in.

When you executed the hello program that you wrote, the perl interpreter started (because of the first line in the file) and it was handed the rest of the file as input.

Notice I said ``perl interpreter.''

Like awk and most BASICs, and unlike a compiled language like C, perl is an interpreted language. Programs written in a compiled language are first compiled into ``machine code'' and the resulting file can be run, independently of any other program, directly by the machine's CPU. The compiler is not required to run the compiled program.

Every time an interpreted language is run, it requires the presence of another program that can read the language and make sense of it for the CPU. (Some will argue that perl compiles your program into an internal representation, but the bottom line is that your perl programs require the constant presence of another program, the interpreter, to run.)

Oh, yeah. I said there were two interesting things about the program that were worth mentioning. Here they are:

  1. The statement ends in a semi-colon. Perl, like C, uses semi-colons to end statements. You should always put them there.
  2. The string that is printed ends in "\n". The \n tells perl to print a line-feed at the end of the output, putting the cursor on the next line. Try it without the \n and see what happens. This is the same behavior as C's printf function. (Leaving out the \n has the same effect as putting a semi-colon at the end of a BASIC print statement.)

Jumping In

That is the last time I plan to get into that much detail. For the rest of this, I am going to assume that you have had some experience writing programs. If you have no such experience, I would recommend you start your programming experience with a different language, as perl is rather a mess as languages go.

Let's see, where to start... It is often customary to give the briefest version of a simple program to show how easy it is to program in perl. The truth is that such an example is often a lie. It is filled with shorthand that only works in limited (but useful!) situations. So, I am going to write the full blown version of a simple program and go through the rest of the class explaining it using other examples along the way. (I expect the class to go on for the rest of an hour. If time permits, I am going to have a small project for you to do on your own. It will be a question and answer session that will make things clearer to you.)

$_ = <STDIN>;
while ($_ ne '') {
    chop($_);
    @words = split(/\s+/, $_);
    @reversed = reverse(@words);
    print join(' ', @reversed), "\n";
    $_ = <STDIN>;
}

This program will take any sentence you give it (one line at a time) and produce it with the words reversed. Try it.

Here is the same program with all of perl's shorthand:

while (<STDIN>) {
    chop;
    print join(' ', reverse(split)), "\n";
}

For now, don't pay too much attention to the shorthand version. Concentrate on the long version. There is quite a lot going on in it and you will have to understand that first.

Let's start with the variables and literals...

Variables and Literals

Scalars

Variables in perl are much like variables in other languages. They are simply symbolic places to store information to be used throughout the program. C requires you to declare your variables to be a certain type. Integer, long, ... C needs to know it all. This is true to some extent with perl, but perl is not nearly as strict.

Perl needs to know what ``kind'' of variable you need and it figures out (for the most part) what type the variable is. It will also convert the value in the variable to whatever form it needs to be in.

The first kind of variable is a scalar. It always begins with a $. It holds single values, be they integers, floating point numbers, or strings.

Here is an example that demonstrates how to do variable assignments in perl and how type conversion happens automatically.

$a = 'Life =';
$b = 35;            # Could also be $b = 043 or $b = 0x23
                    # or $b = 35.0
$c = '7';
$d = $b + $c;
$e = "$a $d\n";
print $e;

The first line assigns the string 'Life =' to the scalar variable $a. Next the numeric value 35 is assigned to $b. Then, $c gets the string '7' and $d gets the numeric value 42. Notice that $c had to be converted from a string to a number before the addition could be done.

Next, $e gets a string with $a and $d replaced with their current values. Notice that the number in $d needs to be converted back into a string. Finally, the result is printed to the screen.

Life = 42
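
To see the conversion go in both directions, here is a small sketch. The names add_numbers and join_strings are made up for this illustration, and it uses sub and my, which this tutorial does not cover, so treat it as a side demonstration rather than part of the lesson's program. The point is that the operator, not the variable, decides whether a value is used as a number or as a string.

```perl
# Hypothetical helpers, not part of the tutorial's program.

# + always treats its operands as numbers.
sub add_numbers {
    my ($x, $y) = @_;
    return $x + $y;
}

# . (dot) always treats its operands as strings.
sub join_strings {
    my ($x, $y) = @_;
    return $x . $y;
}

print add_numbers('7', 35), "\n";     # prints 42
print join_strings('7', 35), "\n";    # prints 735
```

The same two values give 42 or 735 depending only on which operator is applied to them.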

Arrays

The next kind of variable is an array. It holds several values (scalars). It begins with an @ when you are referring to the whole (or part of the whole) array and begins with a $ when you are referring to one of the scalars it contains.

Here is the same program but using arrays:

$a[0] = 'Life =';
$a[1] = 35;
$a[2] = '7';
$a[3] = $a[1] + $a[2];
$a[4] = "$a[0] $a[3]\n";
print $a[4];

Here is another version of the same program that illustrates another way to assign scalars to an array:

@a = ('Life =', 35, '7');
$a[3] = $a[1] + $a[2];
$a[4] = "$a[0] $a[3]\n";
print $a[4];

You can think of an array as a column in a spreadsheet. The variable name is the column header, and the index in the square brackets is the row number. One difference is that the row numbers start at zero instead of one, usually. An array can hold as many values as there is memory on the computer.

You can also assign whole or partial arrays to other arrays directly:

@array1 = ('a', 'b', 'c', 'd');
@array2 = @array1;
@array3 = @array2[1..2];
print "$array3[0] $array3[1]\n";

b c

And if you have a mind just as twisted as the implementors of the language, you would have probably guessed that this works as you might hope:

# Swap two values
($one, $two) = (2, 1);
($one, $two) = ($two, $one);
print "$one, $two.\n";

1, 2.
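
Two other things you will often want from an array are its size and the index of its last element. The sketch below is an illustration with made-up variable names (@letters, $count, $last, @tail); scalar() and $# are standard perl, but they are not covered elsewhere in this tutorial.

```perl
@letters = ('a', 'b', 'c', 'd');

$count = scalar(@letters);      # number of elements: 4
$last  = $#letters;             # index of the last element: 3
@tail  = @letters[2..$last];    # slice from index 2 to the end

print "$count elements, last is $letters[$last]\n";    # 4 elements, last is d
print "tail: @tail\n";                                 # tail: c d
```

Notice that the size is one more than the last index, because the rows are numbered from zero.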

Hash Tables

The last kind of variable I plan to cover is a hash table. It associates two scalars with each other. It is a lot like an array except you can use strings (or any other scalar) as index values. It is often used to associate strings with numbers. (Like associating a month with the numeric form of the month.)

When you are referring to the whole hash table, you put a % in front of it. When you are referring to a single scalar in the hash table, you put a $ in front of it. (Do you see the trend yet? When you need a scalar value, you better have a $ in front of the variable, no matter what kind of variable it is.)

$a{'life'} = 'Life =';
$a{'start'} = '35';
$a{'next'} = 7;
$a{'sum'} = $a{'start'} + $a{'next'};
$a{'string'} = "$a{'life'} $a{'sum'}\n";
print $a{'string'};

The part that is in the curly braces ({}) is called the ``key'' and the part being assigned (or retrieved) is the value.

Here is another program that illustrates another way to assign to a hash table:

%month = (
    'January'   => 1,
    'February'  => 2,
    'March'     => 3,
    'April'     => 4,
    'May'       => 5,
    'June'      => 6,
    'July'      => 7,
    'August'    => 8,
    'September' => 9,
    'October'   => 10,
    'November'  => 11,
    'December'  => 12
);
print "$month{'March'}\n";

Like arrays, whole hash tables can be assigned to other hash tables. In fact, the `=>' symbol is actually an alternative comma. So, when you assign to a hash table, you are actually doing this:

%hash_table = (key1, value1, key2, value2, ...);

You can do a couple of other things that are often useful. You can get all the keys or all the values out of the hash table.

@keys = keys(%hash_table);
@vals = values(%hash_table);

This is dangerous in situations where the contents of the hash table can outstrip the memory of the machine. (There are situations where a hash table can hold more information than memory can hold... but we will not be covering that in this tutorial.)

A safer thing to do is to iterate through all the keys and values in the hash, one at a time. Like this:

while (($key, $val) = each(%hash_table)) { ... }

However, I have not talked about while loops yet. That is the next topic.
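
If you want to peek ahead, here is a small sketch of both approaches on a three-entry table. The variable names are made up for the illustration, and sort is a standard perl function that this tutorial does not cover. Keep in mind that each hands back the pairs in whatever order perl stores them, so sort the keys when the order matters.

```perl
%month = ('January' => 1, 'February' => 2, 'March' => 3);

# Visit every key/value pair. The order is whatever perl chooses.
$total = 0;
while (($name, $number) = each(%month)) {
    $total = $total + $number;
}
print "Sum of month numbers: $total\n";    # Sum of month numbers: 6

# Sorting the keys gives a predictable (alphabetical) order.
@names = sort(keys(%month));
print join(', ', @names), "\n";            # February, January, March
```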

Flow Control

If all our perl programs could do was execute from the top of the file to the end, we would be able to do very little with perl. Fortunately, we can control the flow of execution through a program.

Usually, different parts of code are executed based on the truth or falsity of some condition. So... we need to talk about conditions, what it means to be true or false. That is actually the hard part. The control structures are easy. Here is a simple condition:

...
if ($value == 3) {
    print "The value is three.\n";
} else {
    print "The value is not three.\n";
}

The behavior of this program is rather clear. If the value of the variable $value is three, then the first print statement is executed. If not, the second gets done.

So, now we know that ``=='' means numeric equality. What are the other comparisons? Here they all are:

Comparison        Numeric  String  Return Value
Equal             ==       eq      True if $a is equal to $b
Not Equal         !=       ne      True if $a is not equal to $b
Less than         <        lt      True if $a is less than $b
Greater than      >        gt      True if $a is greater than $b
Less or Equal     <=       le      True if $a is less than or equal to $b
Greater or Equal  >=       ge      True if $a is greater than or equal to $b
Comparison        <=>      cmp     0 if equal, 1 if $a greater, -1 if $b greater

You can also string a bunch of comparisons together using the logical operators. They are the same as they are in C.

$a && $b    True if $a and $b are true.
$a || $b    True if $a or $b are true.
! $a        True if $a is false.

You can actually use the words and, or and not instead of the symbols. That is less common.
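
Picking the wrong column of the table is an easy mistake, so here is a short sketch of how the numeric and string operators can disagree about the same values. The variable names are invented for the illustration, and the ?: operator (the same one C has) is not covered in this tutorial.

```perl
# Numeric operators compare values; string operators compare characters.
$num_equal = ('10' == '10.0') ? 'yes' : 'no';   # yes: both are the number 10
$str_equal = ('10' eq '10.0') ? 'yes' : 'no';   # no: the strings differ

$num_order = (10 <=> 9);        #  1: ten is greater than nine
$str_order = ('10' cmp '9');    # -1: the character '1' sorts before '9'

print "numeric equal: $num_equal, string equal: $str_equal\n";
print "numeric order: $num_order, string order: $str_order\n";
```

This prints ``numeric equal: yes, string equal: no'' and then ``numeric order: 1, string order: -1''.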

Truth?

It is easy to say that something is ``true'' or ``false'' but what is true or false to perl? It turns out that this is a really important question. Some things return strings... others numeric values... others are just not defined. What does perl do when each of these things is the condition on which a decision is being made?

The easiest way to think about it is that the number zero, the string '0', an empty string, and an undefined value are considered false. Everything else is true, including strings such as '0.0' that would be zero if they were used as numbers.

Just to make sure you follow me, consider the following code. If we replace condition with things in the left column, we get the output in the right.

if (condition) {
    print "True";
} else {
    print "False";
}

condition   Output
''          False
0           False
'0'         False
undef       False
'okay'      True
'wrong'     True
'12'        True
12          True

(Undef is a subroutine that always returns an undefined value. And, yes, it is very useful.)
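
If you would rather check the table than take my word for it, something like the following sketch will do. The sub name truth and the variable $candidate are made up for this illustration, and sub and my are not covered in this tutorial. The last two strings in the list are extra cases that are not in the table: they would be zero as numbers, but as strings they are neither empty nor exactly '0', so perl treats them as true.

```perl
# Hypothetical helper: report how perl judges a value used as a condition.
sub truth {
    my ($value) = @_;
    if ($value) {
        return 'True';
    } else {
        return 'False';
    }
}

foreach $candidate ('', 0, '0', 'okay', '12', '0.0', '00') {
    print "'$candidate' is ", truth($candidate), "\n";
}
print "undef is ", truth(undef), "\n";
```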

More Control Flow

if (condition) {
    statements...
}
    If condition is true, execute the statements.

if (condition) {
    statements...
} else {
    statements...
}
    If condition is true, execute the first set of statements. If condition is false, execute the second.

if (condition) {
    statements...
} elsif (condition) {
    statements...
} else {
    statements...
}
    If the first condition is true, execute the first set of statements. If the second condition is true, execute the second set of statements. Otherwise execute the last set of statements. You can have as many elsifs as you want. (No, I did not misspell elsif.)

statement if (condition);
    If condition is true, execute the statement. (Not statements.)

while (condition) {
    statements...
}
    As long as condition is true, execute the statements. (Only check condition before starting the statements each time.) There is also an until loop which is logically opposite to while. However, showing that loop at the same time tends to confuse people.

for ($i = 0; $i <= 10; $i++) {
    statements...
}
    This is really shorthand for this:

    $i = 0;
    while ($i <= 10) {
        statements...
        $i++;
    }

As it turns out, there is no reason that you must make each part of a for loop as the example shows. They can actually be quite strange. You can even leave parts out.

foreach $i (@list) {
    statements...
}
    Execute statements once for each item in the array @list. The array can also be a literal array, like this:

    foreach $i (1, 2, 3, 4) {
        print "$i\n";
    }
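
Here is a short sketch that puts several of these structures to work at once. The sub name classify and the variables $for_total and $while_total are made up for the illustration, and sub and my are not covered in this tutorial. It also shows that a for loop and the while loop it stands for arrive at the same answer.

```perl
# Hypothetical helper: describe a number using if / elsif / else.
sub classify {
    my ($n) = @_;
    if ($n < 0) {
        return 'negative';
    } elsif ($n == 0) {
        return 'zero';
    } else {
        return 'positive';
    }
}

# foreach over a literal list.
foreach $n (-2, 0, 5) {
    print "$n is ", classify($n), "\n";
}

# A for loop and the equivalent while loop reach the same total.
$for_total = 0;
for ($i = 0; $i <= 10; $i++) {
    $for_total += $i;
}

$while_total = 0;
$i = 0;
while ($i <= 10) {
    $while_total += $i;
    $i++;
}

print "for: $for_total, while: $while_total\n";    # for: 55, while: 55
```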

File Handles

There are only two more things in that original program that I will need to cover before we can (almost) completely understand it. The first is something called file handles. File handles are the ``variables'' used to reference files, the terminal, sockets, or pipes.

There are three default file handles that are available without having to do any extra work. People who are familiar with Unix programming should recognize them: STDIN, STDOUT, and STDERR. What each of these file handles refers to depends on how you start the program. We will concern ourselves with just five situations. All the others require us to know more about perl than we do right now. (Remember, I am assuming that you are using perl on a Unix machine.)

Situation         STDIN                    STDOUT                  STDERR          Example Unix Command
Simple Command    User's Keyboard          User's Monitor          User's Monitor  % perlprog
Input Redirect    File                     User's Monitor          User's Monitor  % perlprog < file
Output Redirect   User's Keyboard          File                    User's Monitor  % perlprog > file
Input from Pipe   STDOUT of Other Program  User's Monitor          User's Monitor  % otherprog | perlprog
Output to Pipe    User's Keyboard          STDIN of Other Program  User's Monitor  % perlprog | otherprog

Any reasonable combination of these things is possible, as they would be for any program on a Unix system.

Using File Handles

If you think about it, the two basic things you would like to be able to do with a file handle are read from it and write to it. This would enable you to interact with the user or process the contents of a file.

There are several ways to accomplish this. Here are just two:

Reading from a File Handle

To read from a file handle one line at a time, do something like this:

$line = <STDIN>;

This will take the next line of information from STDIN, which could be the user's keyboard, a file, or the output of another running program.

As it turns out, STDIN is read only. (Well, you should certainly assume that it is.) So, it doesn't make sense to write to it.

Writing to a File Handle

To write something out to a file handle, do something like this:

print STDOUT $something;

This will take whatever is in the $something variable and print it to STDOUT. STDOUT can be the user's monitor, a file, or the input to another running program.

As it turns out, by default, print will print to STDOUT. So, this will do the same as the above:

print $something;

Just as we have been doing it all along. You can change the default file handle print will use by doing this:

select STDERR;

Any subsequent print (without an explicit file handle) will print to STDERR.

By the way, STDERR is where you would put the error messages your program should generate if there is a problem.
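
To tie reading and writing together, here is a sketch that copies lines from one file handle to another and reports on STDERR when it is done. Everything named here (number_lines, $text, $result, $copied) is made up for the illustration. So that the example runs without typing anything, it opens file handles on ordinary strings, which assumes perl 5.8 or later; open, sub and my are not covered in this tutorial.

```perl
# Hypothetical helper: copy lines from one file handle to another,
# numbering them along the way. Returns the number of lines copied.
sub number_lines {
    my ($in, $out) = @_;
    my $count = 0;
    while ($line = <$in>) {
        $count++;
        print $out "$count: $line";
    }
    return $count;
}

# Stand-ins for a real file and the screen (assumes perl 5.8 or later).
$text = "first\nsecond\n";
$result = '';
open($in, '<', \$text) or die "cannot read from string: $!";
open($out, '>', \$result) or die "cannot write to string: $!";

$copied = number_lines($in, $out);
close($in);
close($out);

print $result;                             # goes to STDOUT
print STDERR "Copied $copied lines.\n";    # goes to STDERR
```

To run the same sub against the keyboard and the screen, you could call number_lines(\*STDIN, \*STDOUT) instead. If you then redirect the program's output to a file, the ``Copied'' message still shows up on your monitor, because only STDOUT was redirected.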

Reverse Revisited

Now, go back and see if you can understand the original program. You should know that there are functions used that I did not cover, but their function should be obvious.

Here is the program again:

$_ = <STDIN>;
while ($_ ne '') {
    chop($_);
    @words = split(/\s+/, $_);
    @reversed = reverse(@words);
    print join(' ', @reversed), "\n";
    $_ = <STDIN>;
}

Things to notice:

It turns out that there are a lot of things you can do to shorten this program. First of all, $_ is a special variable. It is the default variable for many operations. Some examples of places where it can be left out are in split, in chop, and within a while's condition. Also, there is no reason to take up a new array when a function already returns an array:

while (<STDIN>) {
    chop;
    print join(' ', reverse(split)), "\n";
}
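
Two details are worth a closer look. First, chop removes the last character of the string whatever it is, so a final line with no newline loses a letter; the standard chomp function removes only a trailing newline. Second, a bare split behaves like split(' ', $_), which ignores leading whitespace, while split(/\s+/, $_) gives you an empty first element when the line starts with spaces. The sketch below uses a made-up sub name, reverse_words, along with chomp, sub and my, none of which this tutorial covers.

```perl
# Hypothetical helper built from the same pieces as the long program.
sub reverse_words {
    my ($line) = @_;
    chomp($line);                       # removes a trailing newline, if any
    my @words = split(' ', $line);      # what a bare split does to $_
    return join(' ', reverse(@words));
}

print reverse_words("the quick brown fox\n"), "\n";    # fox brown quick the
print reverse_words("  no newline here"), "\n";        # here newline no
```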

Regular Expressions

Regular expressions are hard to explain to people. To encourage experimentation, I have written a regular expression simulator. It is a Java application that will take a regular expression that you write and some input text and tell you how the regular expression worked. Don't try it just yet, though. Let me try to motivate them first.

Regular expressions are used in all kinds of places, but let's look at them as they would function in conditions first. Look at the following code:

if ($text eq 'stop') { print "Stopping\n"; }

This code fragment will print ``Stopping'' if the variable $text is equal to ``stop''. But what if $text was a complete sentence and we wanted to stop if the string ``stop'' occurred anywhere in that sentence?

A regular expression will let you specify a pattern to match against. It is much more powerful than just a simple comparison. Here is the program that does what we want:

if ($text =~ /stop/) { print "Stopping\n"; }

The =~ is the perl pattern match operator. The left side is the text you are checking and the right is the regular expression that represents the pattern you are checking against.

In this case, if the text has the string ``stop'' anywhere in it (even in the middle of a word) the pattern match operator will return true and ``Stopping'' will be printed.

Patterns can get more complicated. In fact, they can get so complicated that it is nearly impossible for anyone but the programmer to understand them.

Here are some things you can stick in your patterns that match special things.

^
    At the beginning of a pattern, the caret means match starting at the beginning of the string. For example:

    "stop that" =~ /^stop/;     # Matches.
    "don't stop" =~ /^stop/;    # Does not match.

$
    At the end of a pattern, the dollar sign means to match ending at the end of the string. For example:

    "stop that" =~ /stop$/;     # Does not match.
    "don't stop" =~ /stop$/;    # Matches.

[]
    One of the characters inside of square brackets will match. For example:

    "top check" =~ /[tp]op check/;    # Matches.
    "pop check" =~ /[tp]op check/;    # Matches.
    "cop check" =~ /[tp]op check/;    # Does not match.

You can use `-' between letters to denote a set of letters. For example, [a-d] is the same as [abcd]. You can do the same with numbers, upper case letters and any sequence of characters in the ASCII character set.

If you put a caret (^) just after the open square bracket, it negates the meaning. For example, [^a-d] would match everything except a, b, c, or d.

\d
    Digit. Same as [0-9].
\D
    Non-digit. Same as [^0-9].
\w
    Word character. Same as [a-zA-Z0-9_].
\W
    Non-word character. Same as [^a-zA-Z0-9_].
\s
    Whitespace character. Same as [ \t\n\r\f]. (Space, tab, newline, return, formfeed.)
\S
    Non-space character. Same as [^ \t\n\r\f]. (Anything except space, tab, newline, return, formfeed.)
|
    The regular expression will match anything to the left or to the right of a |. For example:

    "left" =~ /left|right/;     # Matches.
    "right" =~ /left|right/;    # Matches.

.
    A period will match any single character except a newline.
+
    A plus sign after something will match one or more of that something.

    "boooo" =~ /^bo+/;          # Matches.
    "b" =~ /^bo+/;              # Does not match.
    "bobobobo" =~ /^[bo]+/;     # Matches.
    "booobo" =~ /^[bo]+/;       # Matches.

*
    An asterisk after something will match zero or more of that something.

    "boooo" =~ /^bo*/;          # Matches.
    "b" =~ /^bo*/;              # Matches.
    "bobobobo" =~ /^[bo]*/;     # Matches.
    "booobo" =~ /^[bo]*/;       # Matches.
    "" =~ /^[bo]*/;             # Matches.

?
    A question mark after something will match zero or one of that something.
()
    If you put parentheses around something and that something gets matched, that something will be put into the variable $n for the nth set of parentheses. (The parentheses themselves do not match anything.)

For example:

"Hey there!" =~ /(Hey) (there)/;
print "One: $1, Two: $2\n";

One: Hey, Two: there

This is very useful if you are processing lines of text from a file or user input and you want to pull parts of the text out... How about pulling the subject out of a mail message:

while ($line = <STDIN>) {
    if ($line =~ /^Subject: (.*)$/) {
        print "The subject is: $1\n";
    }
}

Or shorter:

while (<STDIN>) {
    print "The subject is: $1\n" if (/^Subject: (.*)$/);
}

There are several more, but this is confusing enough.
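
To see a handful of these pieces working together (^, $, \d, ? and parentheses), here is a sketch that pulls the hour and minute out of text such as ``9:05''. The sub name parse_time and the variables are made up for the illustration, and sub and my are not covered in this tutorial.

```perl
# Hypothetical helper: pull the hour and minute out of text like "9:05".
# Returns an empty list when the text does not look like a time.
sub parse_time {
    my ($text) = @_;
    if ($text =~ /^(\d\d?):(\d\d)$/) {
        return ($1, $2);
    }
    return ();
}

($hour, $minute) = parse_time('9:05');
print "Hour: $hour, Minute: $minute\n";    # Hour: 9, Minute: 05

@none = parse_time('nine oh five');
print "No match\n" if (@none == 0);        # No match
```

The \d\d? allows one or two digits for the hour, and the ^ and $ keep the pattern from matching a time buried in the middle of other text.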

Besides $n, there are three other variables that have useful values after a match. $` will have the part of the string that was before the match, $' has the part that was after, and $& has the part that actually matched. For example:

"Hello there" =~ /o t/;
print "$`\n$'\n$&\n";

Hell
here
o t

Try the perl regular expression simulator now and see how regular expressions operate.

Looking Back at Split

If we look back at the ``split'' that was in the original program, we can now get an understanding of what it does. Here is the line:

... @words = split(/\s+/, $_); ...

Split will take a regular expression and divide a string into pieces. It breaks the string at the places where the regular expression matches. In the above case, the regular expression matches on one or more occurrences of a whitespace character. It removes the whitespace characters and returns all the pieces in between as the elements of an array. So, we have something that breaks a sentence into words. (Of course, it acts funny with punctuation.)
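
Here is a quick sketch of what ``acts funny with punctuation'' means, followed by a split on a different pattern. The sample strings and variable names are made up for the illustration.

```perl
$sentence = "Hello, world!  How are you?";
@words = split(/\s+/, $sentence);
print scalar(@words), " words\n";       # 5 words
print "First: $words[0]\n";             # First: Hello,

# Splitting on a different pattern: commas with optional spaces around them.
@fields = split(/\s*,\s*/, "red , green,blue");
print join('|', @fields), "\n";         # red|green|blue
```

The comma stays glued to ``Hello'' because only whitespace was named as the separator. In the second split the commas and the spaces around them are the separator, so they disappear from the pieces.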

Assignment

Write a perl program that will take a file with each line in the following format:

Last Name First Name Phone Number

For each line, it should print out the information with the proper labels. For example, if the line of the file is this:

Hagan Sue (908)445-2425

it should print this:

Name: Sue Hagan
Phone Number: (908)445-2425

It is a simple enough program, but you should try to make it as concise as possible. (short, but understandable)