In the previous modules, we examined basic regular expressions and used grep as the application for putting them through their paces.
However, grep only searches for text matches. With sed, we can modify the matches we have found.
sed, the stream editor, is a non-interactive, programmable text editor. It takes a list of edit commands and a text file to edit, and applies every command in the list to each line of the file. The file is processed one line at a time: as each line is retrieved, all edit commands are applied to it before the next line is read.
Because the text file is processed one line at a time, the overall size of the file to be edited is unlimited. The only restriction is that it is an ASCII text file and that line breaks occur at reasonable intervals.
The output of sed is sent to standard output or the terminal. So in most cases, you will use redirection to capture any output. The original file is unmodified. Make sure you don't redirect your output to the same filename as the original.
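A minimal sketch of the redirection pattern just described. The scratch directory, file names, and sample text are made up for illustration, and mktemp -d is assumed to be available on your system.

```shell
# Hypothetical scratch files, created only for this demonstration.
workdir=$(mktemp -d)
printf 'one\ntwo\n' > "$workdir/original.txt"

# Send the edited stream to a different file, never to original.txt itself.
sed 's/one/1/' "$workdir/original.txt" > "$workdir/edited.txt"

original=$(cat "$workdir/original.txt")
edited=$(cat "$workdir/edited.txt")
printf 'original:\n%s\nedited:\n%s\n' "$original" "$edited"
rm -r "$workdir"
```

The original file still holds its two unedited lines; only the redirected copy shows the change.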
A word on the examples. Where possible, I have used actual files and redirected output to less. You may copy the example to the command line and run it. Use less to become familiar with the original text, so you can note the differences. You may choose to redirect output to a temporary file instead and use diff to compare the original against the newly edited text. Actual examples will usually be preceded by a comment line explaining their actions. Examples which do not start with sed are generic and usually demonstrate a simple concept.
Invoking sed has the general form :
sed 'editcmd' file_to_edit > edited_file
When the edit command is provided as an argument, only one is allowed on the command line unless the -e option is specified in the general form :
sed -e 'editcmd1' -e 'editcmd2' file_to_edit > edited_file
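As a small illustration of the -e form, the following applies two substitutes to one made-up line of input. The sample text is an assumption chosen only to make the result easy to see.

```shell
# Two edit commands, each introduced by its own -e, applied in order.
result=$(printf 'red apple\n' | sed -e 's/red/green/' -e 's/apple/pear/')
printf '%s\n' "$result"   # green pear
```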
Because many of the meta-characters used in regular expressions have meaning to the command interpreter, you will need to single quote the edit command in most cases.
It is also possible to specify multiple text files as input. However, output is sent as a single stream of data to standard out, and redirection allows only a single target file.
sed -e 'editcmd1' file_to_edit1 file_to_edit2 > edited_composite_file
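A short sketch of what the single output stream looks like in practice. The two input files and their contents are hypothetical, created in a scratch directory with mktemp -d (assumed to be available).

```shell
# Hypothetical input files for the demonstration.
workdir=$(mktemp -d)
printf 'cat\n' > "$workdir/first.txt"
printf 'cat\ndog\n' > "$workdir/second.txt"

# Both files are edited and written, in order, to one output file.
sed -e 's/cat/bird/' "$workdir/first.txt" "$workdir/second.txt" > "$workdir/combined.txt"

combined=$(cat "$workdir/combined.txt")
printf '%s\n' "$combined"
rm -r "$workdir"
```

The combined file holds the edited lines of the first file followed by those of the second, with nothing marking where one ends and the other begins.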
Because you will often apply several edit commands to a file and not just one or two, sed can read edit commands from a specified file. To access a file of commands, use the -f option and the file's name.
sed -f 'edit_cmd_file' file_to_edit > edited_file
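The following sketch builds a tiny command file and runs it with -f. The file name edits.sed and the two substitutes in it are made up for illustration; mktemp -d is assumed to be available.

```shell
workdir=$(mktemp -d)

# A hypothetical command file: one edit command per line, no quoting needed.
cat > "$workdir/edits.sed" <<'EOF'
s/colour/color/g
s/centre/center/g
EOF

result=$(printf 'The colour of the centre\n' | sed -f "$workdir/edits.sed")
printf '%s\n' "$result"   # The color of the center
rm -r "$workdir"
```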
Using a command file has several advantages :
Another useful option is -n, which suppresses sed's normal output of each line. Combined with sed edit commands that override the suppression, it gives you the equivalent of grep with the ability to edit what it found.
# Change student ids (zid) on line from lower to upper case
sed 's/\<z\([0-9][0-9]*\)\>/Z\1/g' /etc/passwd | less
The preceding searches the line for any student z-ids, an id that starts with lower case z followed by one or more digits and nothing else to the end of the "word", remembering the digit sequence, and substitutes it with an upper case Z and the found digits. The lower case g at the end indicates that all occurrences (global) on the line should be changed.
Any lines not containing any z-ids will be output unchanged.
The GNU version of sed has some additional options you may want to check out. But the three above are the POSIX standard options.
The outer loop reads each line of the file to be edited until the end of file is encountered. Each line is read into a work area or pattern space for processing.
The inner loop applies each edit command, provided either on the command line or in a command file, to the current line in the pattern space. The edits are applied in the order they were read until all of them have been processed. A particular edit may have no effect on the text in the pattern space, but it is still read and analysed.
Once the last edit has been applied to the pattern space, the contents are sent to standard output.
The outer loop fetches the next line of text to edit and the process repeats.
Later, we will look at several commands that modify this double loop behavior.
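One way to see the inner loop at work is to give sed two commands where the second can act on the result of the first. This is only a sketch with made-up input: each command sees the pattern space as the previous command left it, not the line as it was read.

```shell
# Line "a": the 1st command turns it into "b", then the 2nd turns that into "c".
# Line "x": neither command matches, so it is output unchanged.
result=$(printf 'a\nx\n' | sed -e 's/a/b/' -e 's/b/c/')
printf '%s\n' "$result"
```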
The general form of the substitute command is :
s/match_pattern/replacement_string/
s = The s is the substitute command.
/ = The delimiter. Note there are three in the command. It separates the command, the match_pattern, and replacement_string. While it is most often the /, it does not have to be.
Let's say you want to replace a path name :
Using / delimiter :
# Substitute the 1st occurrence on the line of /home/lx with /export/home
sed 's/\/home\/lx/\/export\/home/' /etc/passwd | less
Because the / is the delimiter, you must escape it to specify it as a literal.
Using # delimiter :
# Substitute the 1st occurrence on the line of /home/lx with /export/home
sed 's#/home/lx#/export/home#' /etc/passwd | less
Using # as a delimiter in this situation greatly simplifies the command. Because the delimiter must follow the initial s command, sed treats the next character as the delimiter. It may be any character other than a backslash or a newline. The form shown here applies to the substitute command.
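To check that the two delimiter styles are equivalent, the sketch below runs both against the same line. The passwd-style entry is invented for the example.

```shell
# A hypothetical passwd-style entry.
line='lx01:x:1001:100::/home/lx/lx01:/bin/sh'

# Same edit twice: once with escaped slashes, once with # as the delimiter.
with_slash=$(printf '%s\n' "$line" | sed 's/\/home\/lx/\/export\/home/')
with_hash=$(printf '%s\n' "$line" | sed 's#/home/lx#/export/home#')

printf '%s\n%s\n' "$with_slash" "$with_hash"
```

Both commands produce the same edited line; the second is simply easier to read.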
match_pattern = The match_pattern is composed of regular expression characters and literals. It can use any and all of the expressions covered in the basic expression module. Once the pattern is matched on the line being edited, the matched string is removed from the line in the pattern space and the replacement is substituted. You may use grouping and back reference in the match_pattern side of the substitute.
replacement_string = This is the replacement. Do not use any regular expressions in the replacement side of the substitute.
There are two exceptions to this. You may use the back reference if you used grouping in the match_pattern side of the substitute.
You may also use the & which refers to the whole string matched to the match_pattern.
# Place the 1st string of non-colon characters on the line, and the colon that follows it, inside brackets
sed 's/^[^:][^:]*:/[ & ]/' /etc/passwd | less
The preceding substitute removes the string of non-colon characters at the start of the line, up to and including the 1st colon, and replaces it with the same string (including the colon) inside a pair of brackets. Note that the grouping \(\) was not needed.
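Here is the same substitute run against a single invented passwd-style line, so the effect of & is visible without opening /etc/passwd. The account details are made up.

```shell
# & stands for whatever the match_pattern matched, here "berezin:".
result=$(printf 'berezin:x:500:500::/home/berezin:/bin/sh\n' | sed 's/^[^:][^:]*:/[ & ]/')
printf '%s\n' "$result"   # [ berezin: ]x:500:500::/home/berezin:/bin/sh
```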
The substitute allows certain flags appended to it. s/match_pattern/replacement_string/[flag]
where flag may be :
g - global. The substitute should apply to all strings on the line matching the match_pattern. Without this, only the 1st match is substituted.
#Substitute all occurrences of z912730 with johnb on the line.
sed 's/\<z912730\>/johnb/g' /etc/passwd | less
'n' - a numeric value. The substitute should apply to the specific 'nth' occurrence of the string matching the match_pattern.
# Break the line before the second occurrence of the <td>, table data
# tag. Note the < and > are html tag syntax not the regex word delimiters,
# there are no \s.
sed 's/<td>/\
&/2' /home/lx/berezin/ph/330.html | less
For example, s/house/home/2 substitutes only the second occurrence of the word house with home on each line. GNU's sed allows you to combine the digit with the g to give a range from the nth occurrence to the end of the line. Many seds won't do this, and we will learn a different way to do this later.
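A quick comparison of no flag, a numeric flag, and the g flag on one made-up line. Only forms discussed above are used; the GNU-only combination of a digit with g is left out.

```shell
line='house house house'

first_only=$(printf '%s\n' "$line" | sed 's/house/home/')    # home house house
second_only=$(printf '%s\n' "$line" | sed 's/house/home/2')  # house home house
every=$(printf '%s\n' "$line" | sed 's/house/home/g')        # home home home

printf '%s\n%s\n%s\n' "$first_only" "$second_only" "$every"
```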
p - print. Prints or sends the edited line to standard output. Under normal circumstances, this would result in two copies of the line being output, so it is used most often in combination with the -n command line option. When used as a flag on the substitute command, the print occurs only if the substitution was actually made.
# Change the string /home/lx to /export/home and only if substitute made,
# display changed lines
sed -n 's#/home/lx#/export/home#p' /etc/passwd | less
The -n suppresses normal output to standard out. The p flag overrides this but only if the substitute is performed.
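The grep-like behavior of -n with the p flag can be seen on three invented lines, only two of which contain the path being replaced.

```shell
# Hypothetical input: the middle line has no /home/lx and should disappear.
input=$(printf 'a:/home/lx/a\nb:/usr/b\nc:/home/lx/c\n')

changed=$(printf '%s\n' "$input" | sed -n 's#/home/lx#/export/home#p')
printf '%s\n' "$changed"
```

Only the two lines on which the substitute succeeded are printed, each in its edited form.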
Because all edit commands are applied sequentially to each line of the text file, the order in which you specify the edit commands can make a difference.
For instance, you are working on a web page and you want to convert all paragraph markups, <p>, to a horizontal rule, <hr>. You also want to convert all line breaks, <br>, to paragraph markups.
The following will work incorrectly :
sed -e 's/<br>/<p>/g' -e 's/<p>/<hr>/g' /home/lx/berezin/ph/doc.html |less
If you perform the <br> to <p> first, there is no way to distinguish between the original and new <p> markers. And the two commands together convert all <br> and <p> to <hr>.
The correct way :
sed -e 's/<p>/<hr>/g' -e 's/<br>/<p>/g' /home/lx/berezin/ph/doc.html | less
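Running both orderings against one made-up line of markup shows the difference directly. The sample line is an assumption, not taken from the actual doc.html.

```shell
page='<p>one<br>two'

# Wrong order: every <br> becomes <p>, then every <p> (old and new) becomes <hr>.
wrong=$(printf '%s\n' "$page" | sed -e 's/<br>/<p>/g' -e 's/<p>/<hr>/g')

# Right order: the original <p> is converted before any new <p> exists.
right=$(printf '%s\n' "$page" | sed -e 's/<p>/<hr>/g' -e 's/<br>/<p>/g')

printf '%s\n%s\n' "$wrong" "$right"
```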
There is another way to handle this type of situation. Suppose we want to swap the use of house and home in the document. Obviously, changing either one first creates the problem of losing the other.
Instead, use a third temporary string.
#Swap the login ids for berezin and z912730.
sed -e 's/^berezin\>/BEREZIN/' -e 's/^z912730\>/berezin/' \
 -e 's/^BEREZIN\>/z912730/' /etc/passwd | less
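The house and home swap mentioned above can be sketched the same way. The placeholder @TMP@ is an arbitrary choice for this example; any string works as long as it cannot already occur in the text being edited.

```shell
# Step 1: park "house" under a placeholder.
# Step 2: turn "home" into "house".
# Step 3: turn the placeholder into "home".
result=$(printf 'my house is my home\n' |
  sed -e 's/house/@TMP@/g' -e 's/home/house/g' -e 's/@TMP@/home/g')
printf '%s\n' "$result"   # my home is my house
```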
So far we have looked at applying the substitution to all lines in a text file. However, there may be cases where we wish to limit the lines to even consider for editing.
sed provides two types of addressing for line recognition.
The first is the line's literal address. You simply specify the line or line range you wish to apply the edit command to.
'1 s/\<U\.S\./United States/'
Change the 1st occurrence of U.S. to United States but only on the 1st line of the text file.
Because sed is a line editor and does not know what the last line is until it is read, the $ may be used to represent the last line.
'$ s/\<The End\>/&, But not really/'
On encountering the last line of the file, if it contains the string "The End", replace it with itself and ", But not really"
Addresses may be combined to form a range.
'10,20 s/\<up\>/down/g'
Ranges are inclusive, so this change applies to lines 10 through 20, including lines 10 and 20 themselves, wherever the pattern matches.
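The three kinds of line address can be compared on a small made-up input of four identical lines, which makes it easy to see which lines were touched.

```shell
input=$(printf 'up\nup\nup\nup\n')

first=$(printf '%s\n' "$input" | sed '1 s/up/down/')     # 1st line only
last=$(printf '%s\n' "$input" | sed '$ s/up/down/')      # last line only
middle=$(printf '%s\n' "$input" | sed '2,3 s/up/down/')  # lines 2 and 3, inclusive

printf '%s\n---\n%s\n---\n%s\n' "$first" "$last" "$middle"
```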
The second address form is a specified regular expression. This is basically a grep preceding the command.
To use a regular expression, place the regular expression at the beginning of the command between two forward slashes.
'/^Totals: / s/\<periodically\>/weekly/g'
For each line starting with the word Totals: and a space, substitute all occurrences of the word periodically with weekly.
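A small sketch of a regular expression address, using two invented lines that both contain the word to be replaced. The word delimiters are left out here only to keep the sample short.

```shell
# Only the line that starts with "Totals: " is considered for the substitute.
result=$(printf 'Totals: paid periodically\nNotes: paid periodically\n' |
  sed '/^Totals: / s/periodically/weekly/g')
printf '%s\n' "$result"
```

The second line contains the same word but is output unchanged, because its address test failed.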
Like line number addresses, you may specify a range using regular expressions.
#Insert 2 spaces at the beginning of each line of a block of preformatted
# text.
sed '/<pre>/,/<\/pre>/ s/^/  /' /home/lx/berezin/ph/330.html | less
<pre> and </pre> indicate the start and end of a pre-formatted text block in a web page. This sed command will start applying the edit to the lines in the webpage text on finding the start html marker <pre> and will continue applying the edit to all lines read from the file until it encounters the end marker </pre>. sed will not apply the edit to any lines encountered after the end marker unless it encounters another <pre>.
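The range behavior can be tried on a five-line made-up page fragment, so that lines before, inside, and after the block are all present.

```shell
input=$(printf 'intro\n<pre>\ncode\n</pre>\noutro\n')

# Indent by 2 spaces from the <pre> line through the </pre> line, inclusive.
result=$(printf '%s\n' "$input" | sed '/<pre>/,/<\/pre>/ s/^/  /')
printf '%s\n' "$result"
```

The intro and outro lines are left alone; the three lines of the block, markers included, are indented.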
It is possible to mix lines and patterns in range specification. The following uses the p (print) command and -n option to make sed act like a cross between head and grep.
#This shows all lines in a file from the 1st to the line containing
# <body>.
sed -n '1,/<body>/ p' /home/lx/berezin/ph/index.html | less
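The same mixed range, run on a four-line invented page so the cut-off point is easy to see.

```shell
input=$(printf '<html>\n<head>\n<body>\n<p>text\n')

# Print from line 1 through the first line containing <body>, and nothing else.
result=$(printf '%s\n' "$input" | sed -n '1,/<body>/ p')
printf '%s\n' "$result"
```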
You may nest addresses with the braces.
For the following example, we match only the first part of the body markup, <body, because the tag may carry additional formatting information before the closing >. We do something similar with <table /* l1.
sed '/<body/,/<\/body>/ {
  # indent all lines in body by 2 spaces
  s/^/  /
  /<table \/\* l1/,/<\/table \/\* l1/ {
    # substitute capitalized paragraph markers with lower case
    s/<P>/<p>/g
    # make sure all table data markups are in lower case.
    s/<[Tt][Dd]/<td/g
    # Indent all lines in the table
    s/^/  /
  }
}' /home/lx/berezin/ph/index.html | less
Important : the braces after the address filter must be on the same line as the filter and should have no characters after, not even spaces.
Other issues raised in the example above. If the ranges you are searching for are themselves embedded, such as a table inside a table, the range toggling conditions may be mis-interpreted. In this case, the author (me) was kind enough to add documentation, the /* l1, which can be used to clarify which table markups are the target.
Also, note that several backslashes were needed to suppress the meta-meaning of / and *
The ! (not or invert) negates the address condition. So :
'1 ! s/^/  /'
indents with two spaces all lines except the 1st line of the file.
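A three-line made-up input shows the effect of the negated address. The command is written here in its compact form, with no spaces around the !.

```shell
# Indent every line except line 1.
result=$(printf 'title\nbody\nmore\n' | sed '1!s/^/  /')
printf '%s\n' "$result"
```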
In the above range examples, the edits were applied to all lines including the line that starts the range and the line that ends the range. Here is how to skip the start and end line while acting on everything in between.
sed '/<table \/\* l1/,/<\/table \/\* l1/ {
  # Indent all lines except the lines with the <table> and </table> markups.
  # if not beginning table marker
  #   if not ending table marker
  #     indent lines
  #   endif
  # endif
  /<table \/\* l1/! {
    /<\/table \/\* l1/! {
      s/^/  /
    }
  }
}' /home/lx/berezin/ph/index.html | less
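A reduced version of the same idea, using plain <table> and </table> markers on made-up input instead of the documented /* l1 markers, so it can be run without the original page.

```shell
input=$(printf 'before\n<table>\nrow\n</table>\nafter\n')

# Inside the range, skip the line that opens it and the line that closes it.
result=$(printf '%s\n' "$input" | sed '/<table>/,/<\/table>/ {
/<table>/! {
/<\/table>/! {
s/^/  /
}
}
}')
printf '%s\n' "$result"
```

Only the row line between the two markers is indented; the marker lines and everything outside the range are unchanged.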
When used with a range, the not selects every line outside the range. The following indents all lines except those in the marked table block, from the start marker through the end marker.
sed '/<table \/\* l1/,/<\/table \/\* l1/ ! { s/^/ / }' /home/lx/berezin/ph/index.html | less
Remember, the not applies to the address, not the command.
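To see a negated range in isolation, the sketch below uses plain <table> and </table> markers on made-up input. The ! is written without surrounding spaces.

```shell
input=$(printf 'before\n<table>\nrow\n</table>\nafter\n')

# The substitute runs only on lines that are NOT inside the range.
result=$(printf '%s\n' "$input" | sed '/<table>/,/<\/table>/!s/^/  /')
printf '%s\n' "$result"
```

The lines before and after the table are indented, while the whole table block, markers included, is left as it was.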