Regular expressions

Because of its genesis in the research community, the Unix system has a large collection of text scanning and editing utilities. Users employ these to write manuals, manipulate data files, format text and documents for various output systems (line printers, commercial offset printers, etc.), batch edit large text files or collections of files, and perform other chores.

To make these utilities simpler to learn, a common system for describing the text in a file was developed: regular expression pattern matching. A regular expression combines literal characters with meta-characters to describe a string of text without stating the exact composition of that string. For example, suppose we want to describe a specific word that could be all lower case or capitalized. To describe the word dog, we would use the regular expression \<[dD]og\>. The \< and \> match the beginning and end of a word. The [dD] matches either an upper or lower case d, and the og are literals.
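The dog pattern can be tried out with a short sketch. It is written in Python purely for illustration, and that choice is an assumption rather than one of the utilities discussed here. Python's re module has no \< or \> anchors, so \b (a word boundary) is used as the nearest equivalent. The name has_dog is hypothetical.

```python
import re

# Assumption: \b (word boundary) stands in for the \< and \> anchors,
# which Python's re module does not provide.
DOG = re.compile(r"\b[dD]og\b")

def has_dog(text):
    """Return True if text contains the word dog or Dog as a whole word."""
    return DOG.search(text) is not None

for sample in ["my dog", "Dog days", "dogma", "hotdog", "DOG"]:
    print(sample, "->", has_dog(sample))
```

Only the first two samples match. dogma and hotdog fail because of the word anchors, and DOG fails because og is a lower case literal.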

Another example would be to describe any word that appears twice in a row, but only if the two occurrences are separated by a single space. What this word might be is unknown until you encounter it. You can intuitively identify such an occurrence with minimal effort. A computer, however, must use an algorithm and a concise set of rules that consistently match the pattern to the correct target.

The regular expression for this problem is:

\<\([a-zA-Z][a-zA-Z]*\) \1\>

We will analyze this syntax in detail later.
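In the meantime, here is a rough sketch of the repeated-word pattern in Python, chosen only for illustration. Python's syntax differs from the form shown above: grouping is written ( ) rather than \( \), and \b is assumed as a stand-in for the \< and \> word anchors. The name find_doubled is hypothetical.

```python
import re

# Translation of the repeated-word pattern into Python's syntax (an
# assumption: \b replaces the word anchors, and ( ) replaces \( \)).
# \1 refers back to whatever text the parenthesized group matched.
DOUBLED = re.compile(r"\b([a-zA-Z][a-zA-Z]*) \1\b")

def find_doubled(text):
    """Return each word that is immediately repeated after one space."""
    return [m.group(1) for m in DOUBLED.finditer(text)]

print(find_doubled("Paris in the the spring"))
print(find_doubled("this is fine"))
print(find_doubled("the then"))
```

The word anchors matter here. Without the leading anchor, the is at the end of this followed by is would count as a repeat, and without the trailing anchor, the followed by then would as well.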

The regular expression pattern matching rule set is used in a variety of text manipulation utilities on the Linux/Unix system.

Among these are:

  • grep/egrep - global regular expression print. grep is used to search a file for the string specified by a regular expression. If found, the line containing the string is displayed. The original text is not modified. egrep is a variation of grep that introduces some additional regular expressions and sacrifices some of the more esoteric original regular expressions. Both grep and egrep are present on most systems, and versions of grep are available for most other operating systems.

  • sed - stream editor. sed is a programmable batch editor that is often used to process extremely large text files or to apply a particular set of edits to a large number of similar files. Although not interactive, the use of regular expressions and a flow control programming language makes sed a very powerful administrative tool.

  • awk - report generator. awk is a programmable editor/report generator that uses regular expressions and a flow control programming language modeled after the C language. Like sed, awk batch processes a specified text file or files.

  • perl - an interpreted programming language with a strong emphasis on manipulating text files. When you visit a web page that responds to your input, it is often a perl program that generates the resulting web page.

  • vi - visual editor. vi is the visual interface to the ex(tended) editor. It also relies on regular expressions to perform search and search-and-replace actions on the file being edited.

  • Several other utilities also take advantage of the regular expression rule set, along with some of the newer versions of the various shells.
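The behavior described for grep (display each line that contains a match and leave the input untouched) can be sketched in a few lines. This is a minimal illustration in Python, not how grep itself is implemented. The name grep_lines and the sample text are hypothetical, and Python's regular expression syntax is assumed in place of grep's.

```python
import re

def grep_lines(pattern, lines):
    """Return the lines that contain a match for pattern, in order.

    The input list is only read, never modified, mirroring the way grep
    leaves the original text alone.
    """
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

text = ["The dog barked.", "A cat slept.", "Dog tired."]
matches = grep_lines(r"\b[dD]og\b", text)
for line in matches:
    print(line)
```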

    Basic regular expressions and grep