Regular Expressions

A way of describing strings using a variety of wild-card symbols.

Possible identifiers :
  beginning of line.
  end of line.
  beginning of word.
  end of word.
  character lists
  repeating characters
  repeating phrases.
  character wild-cards.

Used by :

grep - global regular expression print (from the ed command g/re/p).
Searches files for lines containing the string described.

egrep - extended grep.
Uses a slightly different set of regular expressions.
On many current systems it is just a script that invokes grep -E

less - file browser. Supports regular expression searches.
/pattern - search forward through file.
?pattern - search backwards through file.

vi, vim - visual editor.

sed - stream editor. sed edits a single line at a time,
so it can edit a file of any size.

awk - report generator/editor.

A number of languages include a regular expression library.
 perl, ruby, java, javascript, etc.

fgrep - fixed-string grep (sometimes read as "fast grep").
Doesn't actually accept regular expressions; the pattern is taken literally.
Meant to be faster.
On many current systems it is just a script that invokes grep -F

Regular expression pattern

Attempts to match a sequence of characters anywhere in the line being processed.
Matches as many characters as possible (so be precise in your pattern).
Once the pattern is matched, the result is true.
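
The "as many characters as possible" rule can be seen with the -o option,
which prints only the matched text. A minimal sketch, assuming a grep that
supports -o (GNU and BSD grep do; POSIX does not require it). The sample
line is invented for illustration.

```shell
# Sample line, invented for this illustration.
line='<b>one</b> and <b>two</b>'

# .* is greedy: the match runs from the first <b> to the LAST </b>.
greedy=$(printf '%s\n' "$line" | grep -o '<b>.*</b>')

# A more precise pattern cannot run past the next < character,
# so each tagged word is matched separately (one per output line).
precise=$(printf '%s\n' "$line" | grep -o '<b>[^<]*</b>')

echo "greedy : $greedy"
echo "precise:"
echo "$precise"
```

Without -o, grep prints the whole line either way; the difference matters
mostly in tools that act on the matched text, such as sed.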

Simplest match - literal string.
  Search for lines with string bin in them.

ps -ef | grep 'bin' | less

  Any occurrence of the 3 character sequence 'bin' results in a match.

Because regex uses many of the same meta-characters as file wildcards, quote
the pattern to avoid interpretation by the shell's command parser.

If using the dollar sign, consider using single quotes.

Line anchors

^ - Beginning of line.
  Must be stated at the beginning of the regular expression.

grep "^b" /etc/passwd | less
  Search for lines beginning with a lower case b

$ - End of line.
  Must be stated at the end of the regular expression.

ps -ef | grep "conf$" | less
  Search for lines that end with conf

These can be combined. Search for a blank line.

grep "^$" file
  Only lines that have no characters on them match. Lines with spaces or
  tabs are not blank and won't be matched.
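
A minimal sketch of the line anchors on a few made-up lines. The helper
function sample is hypothetical and only feeds test data to grep; the
second line is empty and the third holds a single space.

```shell
# Hypothetical sample data: "bash", an empty line, a line with one space,
# and "httpd.conf".
sample() { printf 'bash\n\n \nhttpd.conf\n'; }

starts_b=$(sample | grep '^b')                 # lines beginning with b
ends_conf=$(sample | grep 'conf$')             # lines ending with conf
empty=$(sample | grep -c '^$')                 # truly empty lines only
blankish=$(sample | grep -c '^[[:blank:]]*$')  # empty, or only spaces/tabs

echo "starts with b : $starts_b"
echo "ends with conf: $ends_conf"
echo "empty lines   : $empty"
echo "blank-looking : $blankish"
```

The last pattern is one way to also catch lines that only look blank.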

Caution :

1. When working with $, use backslash or use single quotes around the regular 
expression. 

$ inside double quotes can/will be viewed as variable reference.

2. When searching for an actual $ at end of a pattern, use the backslash
to escape its meaning.

grep '[0-9]\$' datafile

  looks for a digit followed by a dollar sign.
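
The quoting cautions above can be sketched as follows. The helper prices
and its data are invented for illustration. Note how the shell removes the
backslash inside double quotes before grep ever sees it.

```shell
# Hypothetical price list, invented for this sketch.
prices() { printf 'cost 5$\ncost 5\n$5 cost\n'; }

# Single quotes: grep receives [0-9]\$ and looks for a literal dollar sign.
literal=$(prices | grep '[0-9]\$')

# No backslash: $ is the end-of-line anchor.
anchored=$(prices | grep '[0-9]$')

# Double quotes: the shell turns \$ into $, so grep receives [0-9]$
# and this behaves like the anchored search, not the literal one.
dq=$(prices | grep "[0-9]\$")

# Double quotes done right: \\ becomes \ and \$ becomes $,
# so grep receives [0-9]\$ again.
dq_literal=$(prices | grep "[0-9]\\\$")

echo "literal   : $literal"
echo "anchored  : $anchored"
echo "dq        : $dq"
echo "dq_literal: $dq_literal"
```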


Word anchors.

A word consists of a sequence of alpha-numeric and/or underscore _ characters.
A character different than these indicates the beginning or end of a word.

\< - Beginning of word.

ps -ef | grep "\<bin" | less
  Search for words that begin with bin

\> - End of word

grep "ing\>" data
  Search for words ending in ing

  Can be used to describe a complete word, provided the regular expression
  between the two anchors also matches only word type characters.

Note that \< and \> are independent of each other, but are often
paired.

grep "\<that's\>" data
  Matches the text    that's
  (the apostrophe is not a word character; the anchors only test the
  t at the start and the s at the end)
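
A short sketch of the two word anchors, used separately and as a pair.
It assumes a grep that supports \< and \> (GNU grep does; they are an
extension to POSIX). The helper words and its data are invented.

```shell
# Hypothetical word list, one entry per line.
words() { printf 'bin\nsbin\nbinary\ncabinet\n/usr/bin/env\n'; }

begins=$(words | grep '\<bin')    # bin at the start of a word
ends=$(words | grep 'bin\>')      # bin at the end of a word
whole=$(words | grep '\<bin\>')   # bin as a complete word

echo "begins with bin:"; echo "$begins"
echo "ends with bin:";   echo "$ends"
echo "whole word bin:";  echo "$whole"
```

The slash in /usr/bin/env is not a word character, so bin counts as a
complete word there. cabinet never matches because bin is embedded.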


Character matches.

literal - actual sequence of characters to match.


Single character wild-card 
. (period)
  Any one character, note : spaces and tabs are actual characters.

grep "...." data
  Match any line that has at least 4 characters in it.
  A six character string has 4 characters in it, so it matches too.
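
A small sketch of the difference between "has 4 characters in it" and
"has exactly 4 characters". The helper lines and its data are made up.

```shell
# Hypothetical sample: lengths 3, 4, 4 (with a space) and 6.
lines() { printf 'abc\nabcd\nab d\nabcdef\n'; }

at_least_four=$(lines | grep -c '....')     # any 4 characters in a row
exactly_four=$(lines | grep -c '^....$')    # anchors pin the length to 4

echo "at least 4 characters: $at_least_four"
echo "exactly 4 characters : $exactly_four"
```

The line "ab d" counts in both cases because the space is a character.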


Single character list match 
[] (open and closed brackets)
  Range or list of alternative characters for a single position.

grep "\<th[iu]s\>" data
  Matches the words this or thus

You may use hyphen to indicate a range as long as the left hand
character precedes the right hand character in the ASCII list.

You may state multiple ranges in a single set of brackets.

You may use multiple brackets, indicating one character for each
set of brackets.

grep "\<th[a-zA-Z][a-zA-Z]s\>" data
  Matches any word with 5 characters that starts with th, ends with s,
  and has any combination of alpha characters, in either upper or lower
  case, in the third and fourth positions.

grep "\<[0-9][02468]\>" data
  Matches a 2 digit even number.
  Number cannot be embedded in a longer word or a longer number.
  But can be embedded between other punctuation.
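
The three bracket examples above, run against a few invented lines.
The helper data is hypothetical, and \< \> assume GNU grep.

```shell
# Hypothetical sample data, one item per line.
data() { printf 'this\nthus\nthat\nthees\n42\n7\n135\nx24\n(68)\n'; }

this_thus=$(data | grep '\<th[iu]s\>')                 # this or thus
five=$(data | grep '\<th[a-zA-Z][a-zA-Z]s\>')          # th + 2 letters + s
even2=$(data | grep '\<[0-9][02468]\>')                # 2 digit even number

echo "this/thus    :"; echo "$this_thus"
echo "5 letter word:"; echo "$five"
echo "2 digit even :"; echo "$even2"
```

135 and x24 are rejected because the two digits are embedded in a longer
word; (68) is accepted because parentheses are punctuation.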

Single character list of characters to NOT match.
[^] Place ^ first inside the brackets to negate the list that follows.
  Very useful but trickier to work with.

  To look for a literal ^, place it elsewhere in the list
  [0-9a-fA-F^]

grep "[^a]" data
  Find a match in which there is a character other than lower case a.
  Any non-empty line that has anything other than all a's would match.
  Usually used with additional anchor.

grep "^[^a]*$" data
  Find lines in which none of the characters on the line is an a.
  Empty lines match as well.

grep "^[^z]" /etc/passwd
  Matches lines that start with any character other than lower case z
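
A sketch comparing the three negated patterns above on made-up data
(the helper data is hypothetical; its fourth line is empty).

```shell
# Hypothetical sample: aaa, abc, xyz, an empty line, zebra.
data() { printf 'aaa\nabc\nxyz\n\nzebra\n'; }

has_non_a=$(data | grep -c '[^a]')     # at least one character that is not a
no_a=$(data | grep -c '^[^a]*$')       # no a anywhere on the line
not_z=$(data | grep -c '^[^z]')        # first character exists and is not z

echo "contain a non-a character: $has_non_a"
echo "contain no a at all      : $no_a"
echo "do not start with z      : $not_z"
```

Note that [^z] still needs a character to match, so the empty line is not
counted by the last pattern even though it does not start with z.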

Predefined ranges within brackets.

List :

[:alnum:] - a-zA-Z0-9
[:alpha:] - a-zA-Z
[:blank:] - the [space] and [tab]
[:cntrl:] - control characters such as [ctrl]c, ASCII characters 0-31 and 127
[:digit:] - 0-9
[:graph:] - all printable characters except [space], ASCII 33-126
[:lower:] - a-z
[:print:] - all printable characters including [space], ASCII 32-126
[:punct:] - all printable characters that are neither alpha-numeric nor [space].
[:space:] - [tab],[space],[newline],[verttab],[formfeed],[carriage-return]
[:upper:] - A-Z
[:xdigit:] - 0-9a-fA-F

grep "^[[:xdigit:]][[:xdigit:]] " data
  Match on any line that starts with 2 hexadecimal digits 
  followed by a space.

Remember to use a separate pair of brackets around the keyword.
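
A brief sketch of the predefined ranges, including two keywords combined
inside one pair of outer brackets. The helper data and its contents are
invented; the third line contains a tab.

```shell
# Hypothetical sample data; \t in the printf format produces a tab.
data() { printf '1f ok\nZZ no\na\tb\nab cd\n'; }

# Two hexadecimal digits followed by a space at the start of the line.
hex=$(data | grep '^[[:xdigit:]][[:xdigit:]] ')

# Keywords can share one outer pair of brackets: upper case OR digit.
upper_or_digit=$(data | grep -c '[[:upper:][:digit:]]')

# [:blank:] covers both the space and the tab.
with_blank=$(data | grep -c '[[:blank:]]')

echo "hex prefix     :"; echo "$hex"
echo "upper or digit : $upper_or_digit"
echo "space or tab   : $with_blank"
```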


Operators

The regular expression library contains an additional set of symbols
that act as operators to further modify the expression.

asterisk multiplier (0 or more).
* - 
  Multiplies the preceding literal character or regular expression 
  identifier zero or more times.

  Normally applies only to the single character preceding it.

grep '^..*$' data
  Find all lines that have 1 or more characters of any value.

grep '\<[0-9]*[24680]\>' data
  Find any word string consisting of one or more digits with the last digit 
  being even.  Remember * means 0 or more.
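
A sketch of both asterisk examples on invented data (the helper data is
hypothetical; \< \> assume GNU grep).

```shell
# Hypothetical sample: 8, 124, 13, an empty line, abc.
data() { printf '8\n124\n13\n\nabc\n'; }

# One character followed by zero or more characters: any non-empty line.
non_empty=$(data | grep -c '^..*$')

# Zero or more digits, then one even digit: even numbers of any length.
even=$(data | grep '\<[0-9]*[24680]\>')

echo "non-empty lines: $non_empty"
echo "even numbers   :"; echo "$even"
```

The single digit 8 matches because [0-9]* is allowed to match nothing.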


plus multiplier (1 or more).
+ 
  One or more of preceding character or regular expression.
  Only some commands such as egrep recognize the +
  
egrep '^.+$' data
grep '^..*$' data
  are functionally equivalent.
Commands that recognize the + meta-character require \+ to identify a real plus.

GNU grep in its default (basic) mode reverses this: + is a literal plus
and \+ is the operator.
grep '^.\+$' data  # matches one or more occurrences of any character
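
A sketch showing that the three spellings select the same lines, and how a
real plus sign is written in each mode. grep -E is used in place of egrep;
the \+ form assumes GNU grep. The helper data is invented.

```shell
# Hypothetical sample: "one", an empty line, "a+b".
data() { printf 'one\n\na+b\n'; }

ere=$(data | grep -E '^.+$')    # extended: + is the operator
bre=$(data | grep '^..*$')      # basic: spelled out with *
gnu=$(data | grep '^.\+$')      # GNU basic mode: \+ is the operator

# A literal plus: escaped in extended mode, plain in basic mode.
ere_plus=$(data | grep -E 'a\+b')
bre_plus=$(data | grep 'a+b')

echo "ere: $ere"
echo "bre: $bre"
echo "gnu: $gnu"
echo "literal plus (ERE): $ere_plus"
echo "literal plus (BRE): $bre_plus"
```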


question multiplier (0 or 1).
?
  Matches zero or 1 occurrences of preceding character or regular expression.
  Only some commands such as egrep recognize the ?
  On its own it can match nothing at all, so it needs surrounding context
  (anchors, other characters) to do an accurate match.

egrep '[0-9]?' data
# will match any line, including empty lines.

egrep '^[0-9]+\.?[0-9]*$' data
  Searches for lines containing a single number of 1 or more integer digits 
  followed by zero or one decimal point followed by zero or more digits.

GNU grep in basic mode supports this if + and ? are written with a backslash.
grep '^[0-9]\+\.\?[0-9]*$' data
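
A sketch of why ? needs context, using invented data. grep -E stands in for
egrep, and the backslash forms assume GNU grep. The helper data is
hypothetical; its last line is empty.

```shell
# Hypothetical sample: 42, 3.14, 7., .5, 1.2.3, abc and an empty line.
data() { printf '42\n3.14\n7.\n.5\n1.2.3\nabc\n\n'; }

# Zero digits is an acceptable match, so every line is selected.
any=$(data | grep -c -E '[0-9]?')

# Anchored: digits, an optional decimal point, optional digits.
nums=$(data | grep -E '^[0-9]+\.?[0-9]*$')

# Same search in GNU basic mode, with \+ and \? as the operators.
nums_bre=$(data | grep '^[0-9]\+\.\?[0-9]*$')

echo "lines matched by [0-9]?: $any"
echo "numbers (ERE):"; echo "$nums"
echo "numbers (BRE):"; echo "$nums_bre"
```

.5 is rejected because the pattern demands at least one digit before the
decimal point, and 1.2.3 because only one decimal point is allowed.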


range specifier.
\{ \} 
  Defines min and/or max number of recurrences of preceding pattern.
  Note, in egrep braces are used without the backslash.

grep "^.\{40\}$" data
  Match lines that have exactly 40 characters of any value.

grep "^.\{40,80\}$" data
  Match lines that have 40 to 80 characters of any value.

grep "^[a-z]\{40,\}$" data
  Match lines that have 40 or more characters that are only lower case alpha.

grep "^[a-z]\{40\}" data
  By removing the end of line anchor, 
    This also matches lines that have 40 or more lower case alpha at the
    beginning of the line and anything else after that.

Normally applied to single preceding character pattern match.

However, can be used with parentheses to multiply pattern.

grep '\<\(ha\)\{2\}\>' data
  Will match a line containing the word haha

Note : egrep does not backslash-escape the braces or the parentheses.
egrep '\<(ha){2}\>' data
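
A sketch of the range specifier with short, invented lines so the counts
are easy to check by hand. grep -E stands in for egrep; \< \> assume GNU
grep. The helper data is hypothetical.

```shell
# Hypothetical sample; line lengths are 3, 5, 8, 4, 2 and 6.
data() { printf 'abc\nabcde\nabcdefgh\nhaha\nha\nhahaha\n'; }

exact=$(data | grep -c '^.\{5\}$')           # exactly 5 characters
range=$(data | grep -c '^.\{3,5\}$')         # 3 to 5 characters
at_least=$(data | grep -c '^[a-z]\{6,\}$')   # 6 or more lower case letters
haha=$(data | grep '\<\(ha\)\{2\}\>')        # the group ha, exactly twice
range_ere=$(data | grep -c -E '^.{3,5}$')    # extended syntax, no backslashes

echo "exactly 5 : $exact"
echo "3 to 5    : $range"
echo "6 or more : $at_least"
echo "haha word : $haha"
echo "3 to 5 ERE: $range_ere"
```

hahaha is not reported by the haha search because the end-of-word anchor
fails after the second ha.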


parentheses grouping.
\(\)
  The parentheses allow you to 'remember' and repeat a previously matched
  string on the line.

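  Each \( \) pair is assigned a positional index, counted from left to
  right. \1 repeats the text matched by the first pair, \2 the second,
  and so on.

grep "^\([^:]*\):\([^:]*\):.*\2:\1$" data
           1         2

  With colon as the field separator : the line starts with field 1, then
  field 2, and must end with a repeat of field 2's text, a colon, and a
  repeat of field 1's text.
If all of that is true, there is a match.

It is possible to reference a parentheses set inside another parentheses set, but results may not be predictable.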
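
A sketch of back-references on invented data: first the mirrored-fields
idea with colon-separated lines, then the common doubled-word check. The
helpers fields and prose are hypothetical; \< \> assume GNU grep.

```shell
# Hypothetical colon-separated records.
fields() { printf 'alice:x:1000:x:alice\nbob:x:1001:y:bob\ncarol:x:carol\n'; }

# Line must end with field 2, a colon, and field 1 repeated.
mirror=$(fields | grep '^\([^:]*\):\([^:]*\):.*\2:\1$')

# Hypothetical prose lines.
prose() { printf 'the the cat\nthe then\nit is is fine\n'; }

# A word, one space, and the very same word again.
doubled=$(prose | grep '\<\([a-z][a-z]*\) \1\>')

echo "mirrored fields:"; echo "$mirror"
echo "doubled words  :"; echo "$doubled"
```

"the then" is not reported: \1 matches the, but the end-of-word anchor
fails because an n follows.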

parentheses grouping - alternative.
()
  Some programs, such as egrep, use an alternative definition of the
  parentheses, written without the backslash. Note a particular program
  can only support one version of the parentheses.
  Combined with the | (or) operator, they list two or more alternate
  patterns for a particular position in the string.

egrep '(dog|hen|out)house' data
  Will match any line containing either doghouse, henhouse, or outhouse.

  The other regular expression meta-characters may be used inside of the parentheses.
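
A sketch of alternation inside parentheses, on invented data. grep -E
stands in for egrep, and the helper data is hypothetical.

```shell
# Hypothetical word list.
data() { printf 'doghouse\nhenhouse\nhouse\nouthouses\ngreenhouse\n'; }

# One of dog, hen or out, immediately followed by house.
houses=$(data | grep -E '(dog|hen|out)house')

# Anchors and other meta-characters combine with the group as usual.
exact=$(data | grep -c -E '^(dog|hen)house$')

echo "matches:"; echo "$houses"
echo "exactly doghouse or henhouse: $exact"
```

house and greenhouse are skipped because none of the three alternatives
appears directly before house.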

See : man -s 7 regex

Also : Sed & Awk
       by Dale Dougherty
       O'Reilly & Associates, Inc.