Regular expressions

Like all lecture modules in this series, this is not meant to be a substitute for reading the man page, the course text, or a reference book. I recommend you also obtain :

sed & awk
Dale Dougherty
O'Reilly & Associates, Inc
ISBN 0-937175-59-5

Many Unix utilities are designed to search and manipulate text files, whether they be documents, program code, or data sets. Regular expressions were created to help with this kind of searching.

Regular expressions provide a way of describing the general property of a string with a set of rules. They are used by grep, vi, sed, awk, perl, and other programs.

Regular expressions are composed of standard keyboard characters that have been given a secondary or meta meaning by the programs that recognize regular expressions.

Although most programs that use regular expressions recognize a basic set of meta-characters, some programs recognize a second, "extended" set of rules that provides additional or alternative meta-characters to simplify the expression or make it more flexible.

We will start by looking at the basic regular expression set.

For our examples, we will use grep on one or more data files. grep does not modify the file or even the data retrieved from the file. It displays any lines for which a provided regular expression evaluates as true. In most cases, the regular expression should be single quoted to avoid conflicts with the command shell meta-characters.

You may copy the data ~berezin/lec/data into your work directory under the name data. As each example is discussed, run it to see the output. If you prefer, redirect the output to a file and diff data and your output to spot differences.

A regular expression can be composed of both literals and meta-characters. A literal is just that: the characters stated are exactly what will be matched. A meta-character can be one of three types: a position indicator or anchor, a character descriptor, or an operator.

Literal example :

grep 'dog' data

will search the file "data" for the 3 characters d, o, and g in sequence with no spacing between the three characters. Each line for which this condition is true will be displayed. If dog is part of a larger word or string of characters, it is still a match to the specific regular expression.
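As a small sketch of the literal match described above, the following uses a few made-up sample lines (an assumption; they are not the contents of the course data file) piped into grep in place of the file.

```shell
# Hypothetical sample lines standing in for the course data file.
sample='The dog barked.
A doghouse sits in the yard.
Cats only.
Dogs in caps are skipped.'

# The literal dog matches anywhere on the line, including inside doghouse.
# The capitalized Dogs does not match because literals are case sensitive.
matches=$(printf '%s\n' "$sample" | grep 'dog')
printf '%s\n' "$matches"
```

Only the first two sample lines are displayed.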

Characters such as space or tab can be specified simply by entering them in the regular expression.

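Non-printable literal characters can be expressed by using the escape-quote or backslash (\) followed by a representative character. Support for these escapes varies by program; awk recognizes them in its regular expressions, while grep generally does not.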

\a  (bell)
\b  (backspace)
\t  (horizontal tab)
\n  (new line)
\v  (vertical tab)
\f  (form feed)
\r  (carriage return)

Positional meta-characters or anchors.

Positional meta-characters allow you to describe the string to be matched in relation to some anchor point.

You would use anchors when looking for something like a word. If you are looking for the word dog, you know it is the word because the d in dog is at the beginning of the word and the g is at the end. And you know it is a word because anything before the d is not part of a word and anything after the g is not part of a word. And even if the word appears at the beginning or end of a line, you still recognize it.

The positional or anchor meta-characters describe this same property. There are 4 meta-characters that describe anchor points :

^ - the caret, if it appears at the beginning of the regular expression, represents the beginning of the line. If it appears elsewhere in the expression, it either loses its meaning or serves a different function.

grep '^He ' data

This statement looks for the word He at the beginning of a line. It must be in caps and followed by 1 space.

$ - the dollar, if it appears at the end of the regular expression, represents the end of the line. If it appears elsewhere in the expression it is treated as a literal. However, the $ is also the reference symbol for accessing a variable, so you should use single quotes around the regular expression if entering it on the command line.

grep -n '^$' data

This statement will find all lines that are completely empty and the -n option will print their line number.
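The two line anchors can be tried on a short sample. The lines below are invented for illustration (an assumption, not the course data); lines 2 and 4 of the sample are empty.

```shell
# Hypothetical sample: lines 2 and 4 are empty.
sample='He ran home.

Then He stopped.

He-man is not matched.'

# ^ anchors the match to the start of the line; the trailing space is literal.
starts=$(printf '%s\n' "$sample" | grep '^He ')
printf '%s\n' "$starts"

# ^$ describes a line with nothing between its beginning and its end.
# -n prefixes each match with its line number.
empties=$(printf '%s\n' "$sample" | grep -n '^$')
printf '%s\n' "$empties"
```

The first grep displays only the first sample line. The second displays 2: and 4:, the line numbers of the empty lines followed by their (empty) content.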

When matching a word, it is tempting to include spaces before and after the word.

grep ' dog ' data

This literal regular expression describes the characters that spell the word dog with a space on either side. However, if the word is followed by punctuation or occurs at the beginning or end of the line, there will be no match. The regular expression syntax includes meta-characters that mark the beginning and end of a word.

\< - beginning of a word. A word is recognized as any combination of alpha (a-zA-Z), numeric (0-9), and underscore (_) characters. Punctuation, the hyphen, or any other character is considered not part of the word and acts as a delimiter to the sequence of characters composing the word.

\> - end of word. This is the complement of \<.

The following regular expression finds lines that contain the word dog.

grep '\<dog\>' data

The beginning and end of word delimiters function independently of each other. You do not need to use them in pairs, although often you will. The following will match the words dog, doghouse, dog-tired, dogma, etc.

grep '\<dog' data

Without the backslash, the < and > are treated as literal characters.

Using the beginning and end anchors to enclose non-word characters may produce unpredictable matches.
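To compare the three approaches to matching a word, here is a sketch that counts matching lines with grep -c. The sample lines are made up for the comparison (an assumption), and \< and \> are used as described above.

```shell
# Hypothetical sample lines.
sample='dog
My dog, Rex.
A dog barks.
The doghouse leaks.
I am dog-tired.
No bulldog here.'

# Spaces on both sides: misses dog at line start/end or next to punctuation.
spaced=$(printf '%s\n' "$sample" | grep -c ' dog ')

# Both word anchors: dog as a whole word, including dog-tired (hyphen delimits).
whole=$(printf '%s\n' "$sample" | grep -c '\<dog\>')

# Beginning-of-word anchor only: also accepts doghouse, but not bulldog.
prefix=$(printf '%s\n' "$sample" | grep -c '\<dog')

printf '%s %s %s\n' "$spaced" "$whole" "$prefix"
```

For this sample the counts are 1, 4, and 5.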

Character descriptors

A character descriptor or character variable is a meta-character that represents one or more possible alternative values for a particular character in the string being matched.

The two meta-characters used to signify an individual character are the . (period) and [] (brackets).

The . represents 1 character and matches any character of any value except the newline. Think of the . as the character wild-card.

Example :

grep '..' data

will search the file "data" for any lines containing at least two characters. If a match is found, the line is displayed. Keep in mind that a line containing 8 characters can be said to contain at least 2 characters. The period is most often used in combination with literals or one of the positional meta-characters.

To match any lines containing exactly 2 characters, use the period meta-character with the beginning and end of line anchors :

grep '^..$' data

To search for a literal period, escape it with a backslash. The following matches all lines that end with a period.

grep '\.$' data
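The three uses of the period can be checked side by side. This is a sketch on invented sample lines (an assumption), not on the course data.

```shell
# Hypothetical sample lines of different lengths.
sample='a
ab
abc
Done.
ok'

# Two wild-card characters anywhere: any line with at least 2 characters.
at_least_two=$(printf '%s\n' "$sample" | grep -c '..')

# Anchored at both ends: lines of exactly 2 characters.
exactly_two=$(printf '%s\n' "$sample" | grep '^..$')

# Escaped period: a literal period at the end of the line.
ends_with_period=$(printf '%s\n' "$sample" | grep '\.$')

printf '%s\n' "$at_least_two" "$exactly_two" "$ends_with_period"
```

Four lines have at least two characters, ab and ok have exactly two, and only Done. ends with a period.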

When we wish to provide a more limited list of matching characters, we use the [] or character class expression. This expression allows us to specify a list or range of specific characters to look for.

Example :

grep '\<[dD]og\>' data

will search the file "data" for any line containing the word dog with the d in either upper or lower case. Note we included the word delimiters.

The character class expression, [], describes a list of possibilities for a single character being matched. Because it represents a single character, don't use a comma in the list of enclosed characters unless you want to match a comma. The same is true of spaces.

You may specify a range of possible values if the list is large.

grep '\<[0-9][0-9][02468]\>' data

The preceding looks for a 3 digit number ending with an even digit. We use the word delimiters to indicate we want only 3 digit numbers (words).
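Both bracket examples can be tried on a few made-up lines (an assumption; the course data file will differ).

```shell
# Hypothetical sample lines.
sample='Dog days
hot dog
dogma
room 128
room 1280
code 457'

# [dD] accepts either case for the first letter; the word anchors reject dogma.
dogs=$(printf '%s\n' "$sample" | grep '\<[dD]og\>')
printf '%s\n' "$dogs"

# Two digits of any value, then one even digit, bounded as a complete word.
# 1280 has four digits and 457 ends in an odd digit, so neither matches.
even3=$(printf '%s\n' "$sample" | grep '\<[0-9][0-9][02468]\>')
printf '%s\n' "$even3"
```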

Most meta-characters that have special meaning outside of the [] lose that meaning and simply represent themselves, so grep '[.]' data will actually look for a period on each line.

There are a few exceptions to this action.

The caret (^) immediately following the opening bracket causes the list to be treated as a list of disqualified characters.

Example :

grep '[^4689]' data

indicates we are looking for all lines that contain a character that is not 4, 6, 8, or 9. You need to be careful with this logic. The line may contain one of these digits, but as long as it contains a character that is not one of these, the match is still true.

This is not the same as identifying all lines that do not contain a single one of the listed digits. To perform that test, use the inverse option, -v.

grep -v '[4689]' data
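The difference between a negated list and the -v option is easy to see on a small sample. The lines below are invented for the purpose (an assumption).

```shell
# Hypothetical sample lines.
sample='4689
4a6
abc
123'

# Lines containing at least one character that is NOT 4, 6, 8, or 9.
# 4a6 qualifies because of the a, even though it also contains 4 and 6.
has_other=$(printf '%s\n' "$sample" | grep '[^4689]')
printf '%s\n' "$has_other"

# Lines containing NONE of the digits 4, 6, 8, or 9.
has_none=$(printf '%s\n' "$sample" | grep -v '[4689]')
printf '%s\n' "$has_none"
```

The first grep displays 4a6, abc, and 123; the second displays only abc and 123.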

Another is the hyphen, -, which denotes a range if it occurs between two values in ascending order. To specify a hyphen as a literal inside the [], place it first or last in the list of possible characters.

grep '[.;:,-]' data

The character class specifiers, such as [:digit:] are valid inside the brackets, but remember to include the [] inside the [].

grep '[[:digit:]]' data
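A short sketch of both bracket forms, the literal hyphen placed last in the list and a character class specifier inside the brackets, run on made-up sample lines (an assumption):

```shell
# Hypothetical sample lines.
sample='well-known
plain words
wait; what
room 12'

# The hyphen is last in the list, so it is a literal rather than a range.
punct=$(printf '%s\n' "$sample" | grep '[.;:,-]')
printf '%s\n' "$punct"

# [:digit:] must itself sit inside a bracket expression.
digits=$(printf '%s\n' "$sample" | grep '[[:digit:]]')
printf '%s\n' "$digits"
```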

We looked at these in the Filename wild-card module. As a reminder :

[:alnum:] - a-zA-Z0-9
[:alpha:] - a-zA-Z
[:blank:] - the [space] and [tab]
[:cntrl:] - such as [ctrl]c, ASCII characters 0-31 and 127
[:digit:] - 0-9
[:graph:] - all printable characters except [space], ASCII 33-126
[:lower:] - a-z
[:print:] - all printable characters including [space], ASCII 32-126
[:punct:] - all printable characters not alpha-numeric.
[:space:] - [tab],[space],[newline],[verttab],[formfeed],[carriagereturn]
[:upper:] - A-Z
[:xdigit:] - 0-9a-fA-F

These are defined as a POSIX standard and are being implemented more widely every year. They offer the advantage of being language independent. If you are working on a machine whose locale is French, [[:alpha:]] will recognize the appropriate characters.

It is possible that some application that uses regular expressions may not recognize character classes and you will have to resort to range specification, such as [a-zA-Z].

Operators

Regular expressions include several operators that allow you to expand the symbolic or literal character description.

* - asterisk, the multiplier. * repeats the preceding regular expression zero or more times. Keep in mind it is zero or more.

This first example matches any line with any number of any type of character on the line, including zero occurrences of any character, so basically, every line.

grep '^.*$' data

If you redirect the output to a file and diff the original and grep output, you can confirm this.

The following matches all lines that have at least one character of any value and zero or more additional characters. The multiplier multiplies the regular expression, not the matched character, so the characters identified do not need to match each other.

grep '^..*' data

More often the * multiplier is used with a literal value or the []. The following matches lines containing an even number. Any number of digits, including none, may precede the final even digit.

grep '\<[0-9]*[24680]\>' data

If applied after a literal, then the regular expression will attempt to match multiples of that character.
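The asterisk examples above can be checked on a small sample. The lines are invented (an assumption), and the first sample line is deliberately empty.

```shell
# Hypothetical sample; the first line is empty.
sample='
x
count 42
count 7
1358'

# Zero or more of any character between the anchors: every line, even the empty one.
all_lines=$(printf '%s\n' "$sample" | grep -c '^.*$')

# One character plus zero or more others: every non-empty line.
non_empty=$(printf '%s\n' "$sample" | grep -c '^..*')

# Zero or more digits followed by an even digit, as a complete word.
even=$(printf '%s\n' "$sample" | grep '\<[0-9]*[24680]\>')

printf '%s\n' "$all_lines" "$non_empty" "$even"
```

All 5 lines match the first expression, 4 match the second, and the lines containing 42 and 1358 match the third.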

The bound set

The regular expression set provides a more limited repeater. This is the {} pair or bound set. The bound set allows you to repeat a preceding character or character descriptor a specific number of repetitions, a minimum number of repetitions, or a minimum and maximum range of repetitions.

If applied to a literal, then that character is repeated. If applied to a character descriptor, the descriptor is repeated, not a matched character.

In most cases, you will use a bound along with the anchor meta-characters or with other characters in the regular expression to distinguish between the repeating characters and everything else.

The bound set may take a single unsigned integer number of one or more digits representing an exact number of occurrences. The following matches a 7 digit number. Note the word anchors.

grep '\<[0-9]\{7\}\>' data

The bound set may take a single number followed by a comma, representing a minimum number of repetitions. Do not use spaces between the bounds. The following matches words with at least 9 letters.

grep '\<[[:alpha:]]\{9,\}\>' data

Finally, the bound set may take a minimum and maximum comma separated numeric pair. The following matches all lines of at least 40 and not more than 80 characters. Piping through tee /dev/tty shows the matching lines on the terminal while wc counts them.

grep '^.\{40,80\}$' data |tee /dev/tty | wc

You are not required to use anchors when working with the bounds but in most cases the anchors or specified literals will add proper context to the bounds.
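The three forms of the bound set can be sketched on a short sample. The sample lines are made up (an assumption), and the line-length range is reduced to 12 through 15 so that it suits such short lines.

```shell
# Hypothetical sample lines.
sample='call 5551234 now
call 555123 now
id 12345678
extraordinary effort
short words'

# Exact count: a complete word of exactly 7 digits.
seven=$(printf '%s\n' "$sample" | grep '\<[0-9]\{7\}\>')

# Minimum only: a word of at least 9 letters.
long_word=$(printf '%s\n' "$sample" | grep '\<[[:alpha:]]\{9,\}\>')

# Minimum and maximum: lines of 12 to 15 characters (range chosen for this sample).
mid_len=$(printf '%s\n' "$sample" | grep '^.\{12,15\}$')

printf '%s\n' "$seven" "$long_word" "$mid_len"
```

Without the word anchors, the first expression would also accept the 8 digit number, since it contains 7 digits in a row.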

There are some differences among programs that use regular expressions. The POSIX standard indicates that bounds in basic regular expressions should be escaped with \ to be in effect. However, egrep recognizes bounds only if they are stated without the backslash; egrep treats a back-slashed brace as a literal character.
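As a sketch of this difference, the same 7 digit search is written below in both styles. It uses grep -E, the POSIX spelling of egrep, and assumes a grep that accepts the \< and \> word anchors in both modes (GNU grep does). The sample lines are invented.

```shell
# Hypothetical sample lines.
sample='call 5551234 now
call 555123 now'

# Basic regular expression: the braces are escaped.
basic=$(printf '%s\n' "$sample" | grep '\<[0-9]\{7\}\>')

# Extended regular expression: the braces are written without backslashes.
extended=$(printf '%s\n' "$sample" | grep -E '\<[0-9]{7}\>')

printf '%s\n' "$basic" "$extended"
```

Both commands display the same line, the one containing the 7 digit number.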

Grouping and back-reference

The escaped parentheses, \( and \), provide a mechanism to group and recall previously identified characters on a line. When applied around a sequence of literals and/or regular expressions, a match found is stored in memory. It can then be recalled at a later position in the regular expression to be matched again.

To find all lines with a particular character repeated twice in a row use :

grep '\(.\)\1' data

The period wild card is matched with a character. That character temporarily replaces the \1 (back-reference) in the regular expression and a match is attempted: the same character repeated. If there is no match, the next character is matched to the period, inserted into the regular expression, and a match attempted. This continues until the end of the line is reached or two matching characters are identified. For grep, as soon as the regular expression scan indicates a true, the line is displayed and the next line fetched.
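Here is the doubled-character search run on a few made-up lines (an assumption), with the group written using the escaped parentheses \( and \).

```shell
# Hypothetical sample lines.
sample='a book
abc
see it
1 2 3'

# \(.\) remembers one character; \1 requires that same character to follow at once.
doubled=$(printf '%s\n' "$sample" | grep '\(.\)\1')
printf '%s\n' "$doubled"
```

Only the lines with oo and ee are displayed. Note that two spaces in a row would also count as a doubled character.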

The following looks to match a number at the beginning of a line with its equivalent at the end of the line. The size of the number is unbounded but it must be a complete number at both beginning and end of line. The ^ and \> bound the 1st number if it exists and the \< and $ bound the number at the end of the line, if it exists. The .* allows any number of any characters in between.

grep '^\([0-9]*[0-9]\)\>.*\<\1$' data
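A sketch of this search on invented sample lines (an assumption), with the group written using \( and \). The second and third lines show why the word anchors matter.

```shell
# Hypothetical sample lines.
sample='42 is the answer 42
42 is not 142
7 or 77
100 items cost 100'

# Group a complete number at the start of the line, then require the same
# digits as a complete number at the end of the line.
same=$(printf '%s\n' "$sample" | grep '^\([0-9]*[0-9]\)\>.*\<\1$')
printf '%s\n' "$same"
```

The line ending in 142 is rejected because the 42 there does not start a word, and the line ending in 77 is rejected because a lone 7 is not the complete final number.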

The following looks for a word repeated twice in a row but only if it is separated by one or more spaces.

grep '\<\([[:alpha:]]\{1,\}\)  *\1\>' data

Note the word beginning delimiter before the 1st occurrence of the word and that it is placed before the \(.

The word itself is described as a string of one or more alpha characters. This regular expression is placed inside the \( \) grouping and is expanded to an actual string when possible. The grouping remembers the actual string, not the regular expression. So if you tell it to remember a word, it will remember the characters that made up the word and only those, but the fact that it was a word is lost once the regular expression is evaluated and the string is stored in memory.

The two spaces followed by * describe 1 literal space and zero or more additional occurrences of spaces: the first space is a literal and the * repeats the second. These act as a natural delimiter between the two instances of the word.

The \1 recalls the string matched by the regular expression in the group and looks for a match immediately following the spaces. The word end delimiter is required after the back reference to ensure we found a complete word.
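The repeated-word search can be tried on a few made-up lines (an assumption). The expression is written with \( and \) around the word and with two spaces before the *, so that at least one space is required.

```shell
# Hypothetical sample lines; the last one has two spaces between the words.
sample='Paris in the the spring
this is fine
the theory holds
both  both work'

# A word, one or more spaces, then the same string ending at a word boundary.
repeated=$(printf '%s\n' "$sample" | grep '\<\([[:alpha:]]\{1,\}\)  *\1\>')
printf '%s\n' "$repeated"
```

The leading \< keeps the is at the end of this from pairing with the following is, and the trailing \> keeps the from pairing with the start of theory.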

Grouping changes what the brace and asterisk operators repeat. When they are applied directly to a regular expression, the expression itself is repeated, so each repetition may match a different character. When they are applied to a back-reference, the string matched by the group is repeated.

The following looks for a 3 to 5 digit number, each position can be any numeric value :

grep '\<[0-9]\{3,5\}\>' data

The following looks for a 3 to 5 digit number. However, each successive digit must match the 1st digit identified. Also, because the 1st character has already been found, the counts inside the braces are decremented by one to compensate.

grep '\<\([0-9]\)\1\{2,4\}\>' data

If the grouped expression evaluates to a multi-character match, the recall and repetition apply to the full matched string.
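The two searches for a 3 to 5 digit number can be compared on the same sample. The lines are invented (an assumption), and the sketch assumes a grep that accepts a bound after a back-reference, as GNU grep does.

```shell
# Hypothetical sample lines.
sample='pin 4444
pin 1234
pin 77
pin 999999'

# Bound applied to the descriptor: 3 to 5 digits, each of any value.
any_digits=$(printf '%s\n' "$sample" | grep '\<[0-9]\{3,5\}\>')
printf '%s\n' "$any_digits"

# Bound applied to the back-reference: the first digit, then 2 to 4 copies of it.
same_digit=$(printf '%s\n' "$sample" | grep '\<\([0-9]\)\1\{2,4\}\>')
printf '%s\n' "$same_digit"
```

The first grep displays the lines with 4444 and 1234; the second displays only the line with 4444.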

If you are working on a system that does not support the POSIX standards for regular expressions, you may find that the grouping and braces are not compatible with each other.

The grouping and back-reference support up to 9 groups. Groups are assigned in ascending order, but they may be recalled in any order and repeated or even not used. It is possible to nest them, but this should be rare.

Working with multiple groups is more common in an editor such as sed or vi. The following swaps the 1st two words on the line.

sed 's/^\([a-z]\{1,\}\) \([a-z]\{1,\}\)\>/\2 \1/' data

It begins by grouping characters on the line into two groups. From the beginning of the line, group 1 or more lower-case alpha characters up to a space. Skip the space and group the next set of lower-case alpha characters to the end of a word. The sed command then substitutes all of the characters found with the two groups in reverse order. Because we did not include the space inside a grouping, we have to manually insert it between the two back references.
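A sketch of the word swap on made-up input (an assumption). The groups are written with \( and \), and LC_ALL=C is set here so that [a-z] means only the ASCII lower-case letters regardless of the machine's locale.

```shell
# Hypothetical sample lines.
sample='hello world again
Hello world
one two'

# Group the first two lower-case words, then write them back in reverse order.
# The second line is left unchanged because it begins with an upper-case letter.
swapped=$(printf '%s\n' "$sample" |
  LC_ALL=C sed 's/^\([a-z]\{1,\}\) \([a-z]\{1,\}\)\>/\2 \1/')
printf '%s\n' "$swapped"
```

Unlike grep, sed displays every line, whether or not the substitution applied to it.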

We will look at sed in more detail after we discuss the extended regular expressions.