Regular Expressions
A way of describing strings using a variety of wild-card symbols.
A pattern can identify :
beginning of line.
end of line.
beginning of word.
end of word.
character lists
repeating characters
repeating phrases.
character wild-cards.
Used by :
- grep - global regular expression print.
Searches files for lines containing the string described.
- egrep - extended global regular expression print.
Uses a slightly different set of regular expressions.
Now just a script that invokes grep -E
- less - file browser. Supports regular expression searches.
/ - forward through file.
? - backwards through file.
- vi, vim - visual editor.
- sed - stream editor. sed edits a single line at a time,
so it can edit a file of any size.
- awk - report generator/editor.
- A number of languages include a regular expression library.
perl, ruby, java, javascript, etc.
- fgrep - fixed-string global regular expression print (often read as "fast").
Doesn't actually accept regular expressions; the pattern is a fixed string. Meant to be faster.
Now just a script that invokes grep -F
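A minimal sketch of how the three grep flavors treat the same kind of pattern. The sample file and its contents are made up for illustration, and the + operator used with -E is only one example of the extended set.

```shell
# Hypothetical sample data, written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'a.c' 'abc' 'abbc' 'ac' > "$sample"

# grep : the period is a wild-card for any one character.
basic=$(grep 'a.c' "$sample")          # a.c and abc

# grep -E (egrep) : + means one or more of the preceding character.
extended=$(grep -E 'ab+c' "$sample")   # abc and abbc

# grep -F (fgrep) : the period is an ordinary character.
fixed=$(grep -F 'a.c' "$sample")       # a.c only

printf 'basic:\n%s\nextended:\n%s\nfixed:\n%s\n' "$basic" "$extended" "$fixed"
rm -f "$sample"
```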
Regular expression pattern
- Attempts to match a sequence of characters anywhere in the line being processed.
- Matches as many characters as possible, so be precise in your pattern.
- Once the pattern is matched, the result is true.
Simplest match - literal string.
Search for lines with string bin in them.
ps -ef | grep 'bin' | less
Any occurrence of the 3 character sequence bin, anywhere in the line, results in a match.
Because regex uses some of the same meta-characters as file wildcards, quote the pattern
to avoid interpretation by the command parser.
If using the dollar sign, consider using single quotes.
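The literal match can be tried without ps by using a small file. This is only a sketch; the file contents are invented so that bin shows up at the start, middle, and end of a word.

```shell
# Hypothetical sample data.
data=$(mktemp)
printf '%s\n' '/usr/bin/ls' 'cabinet' 'robin' '/usr/lib' > "$data"

# Every line containing the sequence b-i-n matches, wherever it sits.
literal=$(grep 'bin' "$data")     # /usr/bin/ls, cabinet, robin
count=$(grep -c 'bin' "$data")    # -c prints the number of matching lines

printf '%s\nmatching lines: %s\n' "$literal" "$count"
rm -f "$data"
```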
Line anchors
- ^ - Beginning of line.
Must be stated at the beginning of the regular expression.
grep "^b" /etc/passwd | less
Search for line beginning with a lower case b
- $ - End of line.
Must be stated at the end of the regular expression.
ps -ef | grep "conf$" | less
Search for lines that end with conf
These can be combined. Search for a blank line.
grep "^$" file
Only lines that have no characters on them match. Lines with spaces or
tabs are not blank and won't be matched.
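A short sketch of the line anchors on made-up data. Line 4 of the sample is truly empty and line 5 holds only spaces, to show that just the empty one matches "^$".

```shell
# Hypothetical sample data: line 4 is empty, line 5 is three spaces.
data=$(mktemp)
printf '%s\n' 'bin:x:2' 'abc.conf' 'conf file' '' '   ' 'httpd.conf' > "$data"

starts_b=$(grep '^b' "$data")        # bin:x:2
ends_conf=$(grep 'conf$' "$data")    # abc.conf and httpd.conf, not 'conf file'
empty_lines=$(grep -n '^$' "$data")  # -n prefixes the line number: 4:

printf '%s\n---\n%s\n---\n%s\n' "$starts_b" "$ends_conf" "$empty_lines"
rm -f "$data"
```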
Caution :
1. When working with $, use backslash or use single quotes around the regular
expression.
$ inside double quotes can/will be viewed as variable reference.
2. When searching for an actual $ at end of a pattern, use the backslash
to escape its meaning.
grep '[0-9]\$' datafile
looks for a digit followed by a dollar sign.
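The two cautions can be seen side by side in the sketch below. The variable name suffix and the sample lines are assumptions made up for the demonstration.

```shell
# Caution 1 : the shell expands $name inside double quotes, not single quotes.
suffix='OOPS'
expanded=$(printf '%s' "conf$suffix")   # shell substitutes: confOOPS
kept=$(printf '%s' 'conf$suffix')       # left alone: conf$suffix

# Caution 2 : \$ is a literal dollar sign, a bare trailing $ is the anchor.
data=$(mktemp)
printf '%s\n' 'cost 5$' 'cost 5' '5$ each' > "$data"
literal_dollar=$(grep '[0-9]\$' "$data")   # cost 5$ and 5$ each
anchored=$(grep '[0-9]$' "$data")          # cost 5

printf '%s\n%s\n---\n%s\n---\n%s\n' "$expanded" "$kept" "$literal_dollar" "$anchored"
rm -f "$data"
```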
Word anchors.
A word consists of a sequence of alpha-numeric and/or underscore _ characters.
A character different than these indicates the beginning or end of a word.
- \< - Beginning of word.
ps -ef | grep "\<bin" | less
Search for words that begin with bin
- \> - End of word
grep "ing\>" data
Search for words ending in ing
Can be used to describe a complete word, provided the regular expression
between the two anchors also matches only word type characters.
Note that \< and \> are independent of each other, but are often
paired.
grep "\<that's\>" data
Matches that's. The apostrophe is not a word character, so the anchors
only require that a word begins at the t and that a word ends at the s.
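A sketch of the word anchors on invented sample lines. It assumes a grep that supports \< and \> (GNU grep does).

```shell
# Hypothetical sample data.
data=$(mktemp)
printf '%s\n' '/usr/bin' 'cabinet' 'binary file' 'singing loud' 'ingot' \
    "that's all" 'thats all' > "$data"

# bin must start a word: the / before it is not a word character.
word_start=$(grep '\<bin' "$data")    # /usr/bin and binary file, not cabinet

# ing must end a word.
word_end=$(grep 'ing\>' "$data")      # singing loud, not ingot

# Both anchors paired around a literal.
whole=$(grep "\<that's\>" "$data")    # that's all, not thats all

printf '%s\n---\n%s\n---\n%s\n' "$word_start" "$word_end" "$whole"
rm -f "$data"
```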
Character matches.
- literal - actual sequence of characters to match.
- Single character wild-card
. (period)
Any one character, note : spaces and tabs are actual characters.
grep "...." data
Matches any line that has at least 4 characters in it.
A six character string has 4 characters in it, so it matches too.
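A quick sketch of the period wild-card on made-up data. The second command adds the line anchors to require exactly 4 characters.

```shell
# Hypothetical sample data; the last line is 'a', space, 'b', space.
data=$(mktemp)
printf '%s\n' 'abc' 'abcd' 'abcdef' 'a b ' > "$data"

# Four periods: any line with at least 4 characters, spaces included.
four_plus=$(grep '....' "$data")        # abcd, abcdef, 'a b '

# Anchored at both ends: exactly 4 characters on the line.
exactly_four=$(grep '^....$' "$data")   # abcd, 'a b '

printf '%s\n---\n%s\n' "$four_plus" "$exactly_four"
rm -f "$data"
```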
- Single character list match
[] (open and closed brackets)
Range or list of alternative characters for a single position.
grep "\<th[iu]s\>" data
Matches the words this or thus
You may use hyphen to indicate a range as long as the left hand
character precedes the right hand character in the ASCII list.
You may state multiple ranges in a single set of brackets.
You may use multiple brackets, indicating one character for each
set of brackets.
grep "\<th[a-zA-Z][a-zA-Z]s\>" data
Matches any word with 5 characters that starts with th, ends with s,
and has any two alpha characters, in either upper or lower case, in between.
grep "\<[0-9][02468]\>" data
Matches a 2 digit even number.
Number cannot be embedded in a longer word.
But can be embedded between other punctuation.
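The three bracket examples can be checked against a small invented file, as sketched below. It assumes a grep that supports \< and \> (GNU grep does).

```shell
# Hypothetical sample data.
data=$(mktemp)
printf '%s\n' 'this' 'thus' 'thas' 'those' 'thaws' \
    'room 42' 'room 43' 'id142' '(18)' '7' > "$data"

# One position, two alternatives.
this_thus=$(grep '\<th[iu]s\>' "$data")               # this and thus

# Two bracket sets, one character each, with two ranges per set.
five_letters=$(grep '\<th[a-zA-Z][a-zA-Z]s\>' "$data")   # thaws

# Any digit followed by an even digit, standing alone as a word.
even=$(grep '\<[0-9][02468]\>' "$data")   # room 42 and (18), not id142

printf '%s\n---\n%s\n---\n%s\n' "$this_thus" "$five_letters" "$even"
rm -f "$data"
```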
- Single character list of characters to NOT match.
[^] Use ^ as the first character inside the brackets, ahead of the list of characters not to match.
Very useful but trickier to work with.
To look for a literal ^, place it elsewhere in the list
[0-9a-fA-F^]
grep "[^a]" data
Matches any line with at least one character that is not a lower case a.
Any line that has anything other than all a's would match. An empty line does not.
Usually used with additional anchor.
grep "^[^a]*$" data
Find a match in which none of the characters on the line is an a.
grep "^[^z]" /etc/passwd
Matches lines that start with any character other than lower case z
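A sketch comparing the three negated-list patterns on invented data. Note that "^[^a]*$" also matches an empty line, since zero characters contain no a; -n is used there so the empty match stays visible.

```shell
# Hypothetical sample data; line 5 is empty.
data=$(mktemp)
printf '%s\n' 'aaa' 'abc' 'xyz' 'zebra' '' > "$data"

# At least one character that is not a.
some_non_a=$(grep '[^a]' "$data")       # abc, xyz, zebra

# No a anywhere on the line (line numbers shown by -n).
no_a=$(grep -n '^[^a]*$' "$data")       # 3:xyz and 5:

# First character exists and is not z.
not_z_start=$(grep '^[^z]' "$data")     # aaa, abc, xyz

printf '%s\n---\n%s\n---\n%s\n' "$some_non_a" "$no_a" "$not_z_start"
rm -f "$data"
```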
Predefined ranges within brackets.
List :
[:alnum:] - a-zA-Z0-9
[:alpha:] - a-zA-Z
[:blank:] - the [space] and [tab]
[:cntrl:] - such as [ctrl]c, ASCII characters 0-31 and 127
[:digit:] - 0-9
[:graph:] - all printable characters except [space], ASCII 33-126
[:lower:] - a-z
[:print:] - all printable characters including [space], ASCII 32-126
[:punct:] - all printable characters that are not alpha-numeric or [space].
[:space:] - [tab],[space],[newline],[verttab],[formfeed],[carriage-return]
[:upper:] - A-Z
[:xdigit:] - 0-9a-fA-F
grep "^[[:xdigit:]][[:xdigit:]] " data
Match on any line that starts with 2 hexadecimal digits
followed by a space.
Remember to use a separate pair of brackets around the keyword.
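A sketch of the predefined ranges on made-up data. The second pattern puts two keywords inside one outer pair of brackets, which is allowed because each keyword keeps its own [: :] pair.

```shell
# Hypothetical sample data.
data=$(mktemp)
printf '%s\n' '4F data' 'ff more' 'G1 nope' '4Fdata' '123 x' > "$data"

# Two hexadecimal digits, then a space, at the start of the line.
hex=$(grep '^[[:xdigit:]][[:xdigit:]] ' "$data")    # 4F data and ff more

# Line starts with an upper case letter or a digit.
upper_or_digit=$(grep '^[[:upper:][:digit:]]' "$data")   # all but ff more

printf '%s\n---\n%s\n' "$hex" "$upper_or_digit"
rm -f "$data"
```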
Operators
The regular expression library contains an additional set of symbols
that act as operators to further modify the expression.