In the previous modules, we examined basic regular expressions and used grep as the application for putting them through their paces.
However, grep only searches for text matches. With sed, we can modify the matches we have found.
sed, the stream editor, is a non-interactive, programmable text editor.
sed takes a list of edit commands and a text file to edit, and applies each command in the edit list to each line in the text file. It processes the text file one line at a time, applying all edit commands to each line as it is retrieved from the text file.
Because the text file is processed one line at a time, the overall size of the file to be edited is unlimited. The only restriction is that it is an ASCII text file and that line breaks occur at reasonable intervals.
The output of sed is sent to standard output, which is normally the terminal, so in most cases you will use redirection to capture the output. The original file is unmodified. Make sure you don't redirect the output to the same file name as the original; the shell empties the target file before sed reads it, and the original text is lost.
A word on the examples. Where possible, I have used actual files and redirected output to less. You may copy the example to the command line and run it. Use less to become familiar with the original text, so you can note the differences. You may choose to redirect output to a temporary file instead and use diff to compare the original against the newly edited text. Actual examples will usually be preceded by a comment line explaining their actions. Examples which do not start with sed are generic and usually demonstrate a simple concept.
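The temporary-file-and-diff approach can be tried without touching any real file. This is a minimal sketch; the directory, file names, and sample text are made up for illustration.

```shell
# Work in a throwaway directory; original.txt is a hypothetical sample file.
workdir=$(mktemp -d)
printf 'one house\ntwo houses\n' > "$workdir/original.txt"

# Send the edited text to a different file; the original stays as it was.
sed 's/house/home/' "$workdir/original.txt" > "$workdir/edited.txt"

# diff exits non-zero when the files differ, which is expected here.
diff "$workdir/original.txt" "$workdir/edited.txt" || true
```

diff lists only the lines that sed changed, which is easier to check than paging through both files.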
Invoking sed has the general form :
sed 'editcmd' file_to_edit > edited_file
When the edit command is provided as an argument, only one is allowed on the command line unless the -e option is specified in the general form :
sed -e 'editcmd1' -e 'editcmd2' file_to_edit > edited_file
Because many of the meta-characters used in regular expressions have meaning to the command interpreter, you should single-quote the edit command in most cases.
It is also possible to specify multiple text files as input. However, output is sent as a single stream of data to standard output, and redirection allows only a single target file.
sed -e 'editcmd1' file_to_edit1 file_to_edit2 > edited_composite_file
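As a small illustration of both ideas, several -e commands and several input files, here is a sketch using two made-up files. The file names and their contents are assumptions, not files from this course.

```shell
# Two hypothetical input files in a throwaway directory.
workdir=$(mktemp -d)
printf 'red apple\n' > "$workdir/part1.txt"
printf 'green apple\n' > "$workdir/part2.txt"

# Two edit commands (-e) applied to two input files; the output is one stream.
sed -e 's/apple/pear/' -e 's/red/yellow/' \
    "$workdir/part1.txt" "$workdir/part2.txt" > "$workdir/combined.txt"

cat "$workdir/combined.txt"
```

The combined file holds the edited lines of part1.txt followed by those of part2.txt; nothing in the output marks where one input ended and the next began.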
Because you will often apply several edit commands to a file and not just one or two, sed can read edit commands from a specified file. To access a file of commands, use the -f option and the file's name.
sed -f 'edit_cmd_file' file_to_edit > edited_file
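A command file is plain text with one edit command per line. The sketch below builds a hypothetical one (edits.sed) and applies it; the names and sample text are made up for illustration.

```shell
workdir=$(mktemp -d)

# edits.sed is a hypothetical command file: one edit command per line.
# The shell never sees these lines as arguments, so they are not quoted.
cat > "$workdir/edits.sed" <<'EOF'
s/colour/color/g
s/centre/center/g
EOF

printf 'The colour of the centre.\n' > "$workdir/notes.txt"

# Apply every command in the file to each line of notes.txt.
sed -f "$workdir/edits.sed" "$workdir/notes.txt" > "$workdir/notes.edited"
cat "$workdir/notes.edited"
```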
Using a command file has several advantages :
When -n is combined with edit commands which override the suppression, you have the equivalent of grep, with the added ability to display or save to a file only the lines it has found and possibly changed.
The preceding searches the line for any student z-ids, an id that starts with lower case z followed by one or more digits and nothing else to the end of the "word", remembering the digit sequence, and substitutes it with an upper case Z and the found digits. The lower case g at the end indicates that all occurrences (global) on the line should be changed.
Any lines not containing a z-id will be output unchanged.
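The command this description refers to is not shown in the text above. The following is a sketch of one way it could be written, not necessarily the original: \( and \) remember the digits and \1 puts them back after the upper case Z. The word-boundary test mentioned in the description is left out here, since the \< and \> anchors used elsewhere in these notes are not supported by every sed. The sample lines are made up.

```shell
# Made-up sample text; z123456, z42 and z77 stand in for student z-ids.
sample=$(printf 'z123456 logged in\nroot logged in\nz42 and z77 logged out\n')

# Without -n every line is output, changed or not.
all=$(printf '%s\n' "$sample" | sed 's/z\([0-9][0-9]*\)/Z\1/g')

# With -n and the p flag, only lines where the substitute was made are output.
changed=$(printf '%s\n' "$sample" | sed -n 's/z\([0-9][0-9]*\)/Z\1/gp')

printf 'all lines:\n%s\n\nchanged lines only:\n%s\n' "$all" "$changed"
```

The second form is the grep-like use of -n: the line for root never appears because no substitute was made on it.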
The GNU version of sed has some additional options you may want to check out, but the three above are the POSIX standard options.
The outer loop reads each line of the file to be edited until the end of file is encountered. Each line is read into a work area or pattern space for processing.
The inner loop applies each edit command, provided either on the command line or in a command file, to the current line in the pattern space. The edits are applied in the order they are read until all of them have been processed. A particular edit may have no effect on the text in the pattern space, but it is still read and analyzed.
Once the last edit has been applied to the pattern space, the contents are sent to standard output.
The outer loop fetches the next line of text to edit and the process repeats.
Later, we will look at several commands that modify this double loop behavior.
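The double loop can be observed with two edits where the second one acts on the result of the first. This is a minimal sketch with made-up input.

```shell
# Each input line passes through both edits, in order, before it is output.
#   "cat" -> 1st edit makes it "dog" -> 2nd edit makes it "bird"
#   "dog" -> 1st edit has no effect  -> 2nd edit makes it "bird"
result=$(printf 'cat\ndog\n' | sed -e 's/cat/dog/' -e 's/dog/bird/')
printf '%s\n' "$result"
```

Both lines come out as bird, which shows that the second edit works on the pattern space as the first edit left it, not on the original line.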
The general form of the substitute command is :
s/match_pattern/replacement_string/
s = The s is the substitute command.
/ = The delimiter. Note there are three in the command. It separates the command, the match_pattern, and replacement_string. While it is most often the /, it does not have to be.
Let's say you want to replace a path name :
Using / delimiter :
# Substitute the 1st occurrence on the line of /home/hopper with /export/home
sed 's/\/home\/hopper/\/export\/home/' /etc/passwd | less
Because the / is the delimiter, you must escape it to specify it as a literal.
Using # delimiter :
# Substitute the 1st occurrence on the line of /home/hopper with /export/home
sed 's#/home/hopper#/export/home#' /etc/passwd | less
Using # as a delimiter in this situation greatly simplifies the command. Because the delimiter must follow the initial s command, the command will recognize the next character as the delimiter. It may be almost any character; POSIX excludes only the backslash and the newline. Of the commands covered so far, only the substitute works this way.
match_pattern = The match_pattern is composed of regular expression characters and literals. It can use any and all of the expressions covered in the basic expression module. Once the pattern is matched on the line being edited, the matched string is removed from the line in the pattern space and the replacement is substituted. You may use grouping and back reference in the match_pattern side of the substitute.
replacement_string = This is the replacement. Do not use any regular expressions in the replacement side of the substitute.
There are two exceptions to this. You may use the back reference if you used grouping in the match_pattern side of the substitute.
You may also use the & which refers to the whole string matched to the match_pattern.
# Place the 1st colon-terminated field on the line inside brackets
sed 's/^[^:][^:]*:/[ & ]/' /etc/passwd | less
The preceding substitute removes the string of all non-colon characters up to and including the 1st colon and replaces it with the same string (including the colon) in a pair of brackets. Note that the grouping \(\) was not needed.
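For comparison, here is a sketch of the other exception, a back reference in the replacement. It swaps the first two colon-separated fields of a made-up passwd-style line; the sample line is an assumption, not taken from a real /etc/passwd.

```shell
# \(...\) remembers each of the first two fields; \1 and \2 recall them
# in the replacement, here in the opposite order.
swapped=$(printf 'hopper:x:1001\n' |
    sed 's/^\([^:]*\):\([^:]*\):/\2:\1:/')
printf '%s\n' "$swapped"
```

Unlike &, which always stands for the whole match, the numbered back references let the replacement reuse selected pieces of it.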
The substitute allows certain flags to be appended to it :
s/match_pattern/replacement_string/[flag]
where flag may be :
g - global. The substitute should apply to all strings on the line matching the match_pattern. Without this, only the 1st match is substituted.
#Substitute all occurrences of z912730 with johnb on the line.
sed 's/\<z912730\>/johnb/g' /etc/passwd | less
'n' - a numeric value. The substitute should apply to the specific 'nth' occurrence of the string matching the match_pattern.
# Break the line before the second occurrence of the <td> (table data) tag.
# Note the < and > are HTML tag syntax, not the reg-ex word delimiters;
# there are no backslashes.
sed 's/<td>/\
&/2' /home/hopper/berezin/ph/330.html | less
Substitute the second occurrence of the word house with home on each line. GNU's sed allows you to combine the digit with the g to give a range from the nth occurrence to the end of the line. Many seds won't do this, and we will learn a different way to do this later.
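A minimal sketch of the house-to-home substitution described above, run on a made-up line. Only the plain numeric flag is used; the GNU combination of a digit with g (for example 2g) is left out because many seds reject it.

```shell
# Only the 2nd match on the line is replaced; the 1st and 3rd are untouched.
second=$(printf 'house, house, house\n' | sed 's/house/home/2')
printf '%s\n' "$second"
```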
p - print. Prints, or sends, the edited line to standard output. Under normal circumstances, this results in two copies of a changed line being output, so it is used most often in combination with the -n command line option. When used as a flag on the substitute command, the print occurs only if the substitute was performed.
# Change the string /home/hopper to /export/home and display only the lines
# on which the substitute was made
sed -n 's#/home/hopper#/export/home#p' /etc/passwd | less
The -n suppresses normal output to standard out. The p flag overrides this but only if the substitute is performed.
Because all edit commands are applied sequentially to each line of the text file, the order in which you specify the edit commands can make a difference.
For instance, you are working on a web page and you want to convert all paragraph markups, <p>, to a horizontal rule, <hr>, and you want to convert all line breaks, <br>, to paragraphs.
The following will work incorrectly :
sed -e 's/<br>/<p>/g' -e 's/<p>/<hr>/g' /home/hopper/berezin/ph/doc.html |less
If you perform the <br> to <p> first, there is no way to distinguish between the original and new <p> markers. And the two commands together convert all <br> and <p> to <hr>.
The correct way :
sed -e 's/<p>/<hr>/g' -e 's/<br>/<p>/g' /home/hopper/berezin/ph/doc.html | less
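The difference between the two orderings is easy to see on a single made-up line of markup. This sketch uses an inline string rather than the doc.html file above.

```shell
# A made-up fragment containing one <p> and one <br>.
page='<p>intro<br>more'

# Wrong order: <br> becomes <p> first, then every <p> becomes <hr>.
wrong=$(printf '%s\n' "$page" | sed -e 's/<br>/<p>/g' -e 's/<p>/<hr>/g')

# Right order: the original <p> is converted before any new <p> exists.
right=$(printf '%s\n' "$page" | sed -e 's/<p>/<hr>/g' -e 's/<br>/<p>/g')

printf 'wrong: %s\nright: %s\n' "$wrong" "$right"
```

In the wrong ordering both tags end up as <hr>; in the right ordering the <br> survives as <p>.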
Here is another possible way to solve this type of situation. Suppose we want to swap the use of house and home in the document. Obviously, changing either one first creates the problem of losing the other.
Instead, use a third temporary string.
# Swap the login ids for berezin and z912730.
sed -e 's/^berezin\>/BEREZIN/' -e 's/^z912730\>/berezin/' -e 's/^BEREZIN\>/z912730/' /etc/passwd | less
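The same three-step technique applied to the house and home case mentioned earlier, as a sketch on a made-up line. The placeholder @TMP@ is an arbitrary choice; any string that cannot occur in the text being edited would do.

```shell
# Step 1: park every "house" under a temporary marker.
# Step 2: turn every "home" into "house".
# Step 3: turn the marker into "home".
swapped=$(printf 'my house is my home\n' |
    sed -e 's/house/@TMP@/g' -e 's/home/house/g' -e 's/@TMP@/home/g')
printf '%s\n' "$swapped"
```

If the placeholder did appear in the original text, step 3 would change those occurrences as well, so choose it with care.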