LINUX Redirection and the Command-line Pipe

Output Redirection

Suppose we use this command at the LINUX command line:

          prompt:  Program1 > Program1.out

What does it do?

The effect of this is to redirect the standard output of Program1 into a file called Program1.out.

The file may be new or it may already exist. If it already exists, its previous contents are overwritten.

Suppose we used this command instead:

          prompt:  Program1 >> Program1.out

In this case, the output is appended to the end of the file, rather than replacing the contents of the file.

Suppose we used this command instead:

          prompt:  Program1 2> Program1.out

In this case, what is redirected is output written to stderr, not stdout. Likewise,

          prompt:  Program1 2>> Program1.out

will redirect output written to stderr, appending it to the file.

Note: Redirection of this sort depends on which command shell we are using. The 2> and 2>> forms above work in Bourne-style shells such as sh and bash. They do not redirect stderr in csh or tcsh; in those shells, try using

          prompt:  Program1 >& Program1.out

to redirect output written to either stdout or stderr into the file instead, or

          prompt:  Program1 >>& Program1.out

to redirect output written to either stdout or stderr, appending it to the file instead.

How does it work?

The shell asks the operating system to open the file, creating it if necessary, and then to modify a particular entry in the file descriptor table of the process that will run Program1. One of the pieces of information in each entry of the file descriptor table is a pointer to something in the LINUX file system (which includes files, pipes, sockets, etc.). The entry for standard output (number 1) or for standard error (number 2) will be changed so its pointer leads to "Program1.out".
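
As a rough illustration of this mechanism, here is a sketch in Python, whose os module wraps the underlying system calls (open, dup2, fork, exec). The function name run_redirected and its parameters are invented for this example, and a real shell does considerably more; the point is only to show entry 1 or 2 being re-aimed at the file before the program starts.

```python
import os

def run_redirected(argv, path, fd=1, append=False):
    """Run argv with descriptor fd (1 = stdout, 2 = stderr) sent to path."""
    # ">" empties an existing file first; ">>" adds to the end of it.
    flags = os.O_WRONLY | os.O_CREAT | (os.O_APPEND if append else os.O_TRUNC)
    pid = os.fork()
    if pid == 0:                           # child: will become the program
        try:
            target = os.open(path, flags, 0o644)
            if target != fd:
                os.dup2(target, fd)        # entry fd now leads to the file
                os.close(target)           # the extra entry is not needed
            os.execvp(argv[0], argv)       # returns only if it fails
        finally:
            os._exit(127)
    _, status = os.waitpid(pid, 0)         # parent: wait for the child
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

For instance, run_redirected(["ls"], "Program1.out") plays the role of "ls > Program1.out", and passing fd=2 or append=True corresponds to the 2> and >> forms.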

Related command

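It may be worthwhile to look into the "tee" command as well. It copies its standard input both to its standard output and to a file, so output can be watched and saved at the same time.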
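
To give a feel for what tee does, here is a small Python sketch of the same idea: each line read is written both to an output stream and to a file. The function name and parameters are made up for this illustration, and this is not how the real tee program is written.

```python
import sys

def tee(source, path, sink=sys.stdout, append=False):
    """Copy every line of source to sink and also to the file at path."""
    with open(path, "a" if append else "w") as copy:
        for line in source:
            sink.write(line)      # what the next program (or the user) sees
            copy.write(line)      # what is saved in the file
```

On the command line, the usual pattern is "Program1 | tee Program1.out", which shows the output and keeps a copy of it.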

Input Redirection

Suppose we use this command at the LINUX command line:

          prompt:  Program1 < Datafile.txt

What does it do?

The effect of this is to redirect the standard input of Program1 to come from a file called Datafile.txt.

How does it work?

The operating system will open the file, which must already exist, and then modify a particular entry in the file descriptor table. One of the pieces of information in each entry of the file descriptor table is a pointer to something in the LINUX file system (which includes files, pipes, sockets, etc.). The entry for standard input (number 0) will be changed so its pointer leads to "Datafile.txt".
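
A comparable sketch for input redirection, again in Python with an invented function name (run_with_input): the file is opened for reading only, and entry 0 of the child's file descriptor table is pointed at it before the program is started.

```python
import os

def run_with_input(argv, path):
    """Run argv with standard input (descriptor 0) coming from path."""
    source = os.open(path, os.O_RDONLY)    # raises an error if path is missing
    pid = os.fork()
    if pid == 0:                           # child: will become the program
        try:
            if source != 0:
                os.dup2(source, 0)         # entry 0 now leads to the file
                os.close(source)
            os.execvp(argv[0], argv)       # returns only if it fails
        finally:
            os._exit(127)
    os.close(source)                       # parent does not need the file
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

Unlike the output case, no file is created here: if Datafile.txt does not exist, the open fails and the program is never run.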

Command-line pipe

Suppose we use the following command at the LINUX command line:

          prompt:  Program1 | Program2

This is a "command-line pipe". Let's explore what is going on here.

What does it do?

The standard output of Program1 is redirected into the standard input of Program2.
Nothing happens to Program1's standard input or standard error.
Nothing happens to Program2's standard output or standard error.
Program1's output must make sense as input for Program2.

It is worth noticing that Program1 may do all of its processing (such as sorting the contents of a file) before it writes any of its output, in which case Program2 simply waits until that output arrives.

If these are programs that normally are interacting with a user, there can be disconcerting effects. Program2 may be printing messages which prompt its user to provide input, but the user cannot do so because all of Program2's standard input comes from Program1. If Program1 prints messages which prompt its user to provide input, the user will never see them, as those messages will be fed to Program2. Will Program2 know what to do when it finds these messages in its input?

It is therefore more appropriate to use the | pipe with programs that do not print such messages; they simply read from standard input, process it and write to standard output. Such programs are sometimes called "filters".
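
As a small illustration, a filter can be as simple as the following Python sketch. The function name upcase_filter and the choice of transformation (converting to upper case) are arbitrary; what matters is that it prints no prompts and only reads, processes and writes.

```python
def upcase_filter(instream, outstream):
    """A filter: read each line, transform it, write it out. No prompts."""
    for line in instream:
        outstream.write(line.upper())
```

Saved in a script that calls upcase_filter(sys.stdin, sys.stdout), this could sit on either side of a | without confusing its neighbor.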

How does it work?

The system creates two processes and a buffer. The buffer has a fixed default size (64 KB on current LINUX systems). Data is added to the buffer at one end and removed from the other end, so it is essentially a queue. The system is using the same mechanism as in the pipe() system call. The use of pipe() always involves a buffer.
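
The queue behavior and the fixed size can both be observed directly. The following Python sketch (the function name pipe_demo is invented) writes two pieces of data and reads them back in order, then fills the empty pipe in non-blocking mode to measure how much it holds. The measured figure depends on the system and is not guaranteed to be 64 KB.

```python
import os

def pipe_demo():
    """Return (data read back from a pipe, measured capacity in bytes)."""
    read_end, write_end = os.pipe()
    os.write(write_end, b"first ")
    os.write(write_end, b"second")
    in_order = os.read(read_end, 100)      # comes out in the order written

    # Fill the now-empty pipe without blocking to see how much it holds.
    os.set_blocking(write_end, False)
    capacity = 0
    try:
        while True:
            capacity += os.write(write_end, b"x" * 1024)
    except BlockingIOError:
        pass                               # the buffer is full
    os.close(read_end)
    os.close(write_end)
    return in_order, capacity
```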

The operating system is doing something like this:

Use pipe(). This is the step that actually creates the buffer.
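Use fork() twice, both times from the parent, to create processes P1 and P2. (If the return value of the first fork() is not checked, the first child also runs the second fork(), and there will be a leftover grandchild process which should be terminated.) P1 and P2 now share the pipe.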
In P1, close the read end of the pipe.
In P2, close the write end of the pipe.
In P1, redirect standard output to the write end of the pipe.
In P2, redirect standard input to the read end of the pipe.
In P1, use execlp() to execute Program1; this replaces the previous executable code with the code for Program1.
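In P2, use execlp() to execute Program2.
In the parent, close both ends of the pipe; otherwise Program2 never sees end-of-file, because a write end is still open.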

(Instead of execlp(), this may be another member of the exec family of functions, such as execvp().)

At this point, the output from Program1 (in P1) is going into the pipe and the input for Program2 (in P2) is coming out of the pipe.
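
The steps above can be sketched with Python's os module, whose pipe(), fork(), dup2() and execlp() wrap the system calls of the same names. The function name run_pipeline is made up for this illustration, and a real shell handles many more details. In this sketch both children are forked from the parent, so no grandchild appears, and the parent closes its own copies of the pipe ends so that Program2 can reach end-of-file.

```python
import os

def run_pipeline(argv1, argv2):
    """Run argv1 | argv2 and return the exit status of argv2."""
    read_end, write_end = os.pipe()          # this creates the buffer

    p1 = os.fork()
    if p1 == 0:                              # P1: will become Program1
        try:
            os.close(read_end)
            os.dup2(write_end, 1)            # stdout now leads into the pipe
            os.close(write_end)
            os.execlp(argv1[0], *argv1)      # returns only if it fails
        finally:
            os._exit(127)

    p2 = os.fork()                           # only the parent gets this far
    if p2 == 0:                              # P2: will become Program2
        try:
            os.close(write_end)
            os.dup2(read_end, 0)             # stdin now comes from the pipe
            os.close(read_end)
            os.execlp(argv2[0], *argv2)
        finally:
            os._exit(127)

    # Parent: close both ends, or Program2 never sees end-of-file.
    os.close(read_end)
    os.close(write_end)
    os.waitpid(p1, 0)
    _, status = os.waitpid(p2, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

For example, run_pipeline(["ls"], ["sort"]) corresponds to "ls | sort".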

Fake pipe

An alternative to a command-line pipe is to use a file: We interpret

          prompt:  Program1 | Program2

as three commands:

          prompt:  Program1 > tempfile.txt
          prompt:  Program2 < tempfile.txt
          prompt:  rm tempfile.txt

This is called a "fake pipe". It is actually how the MS-DOS operating system implemented the command-line pipe. It's easy enough to understand, but it has a disadvantage: because only one program is running at a time, the overall execution may be slower.

On the other hand, the operating system does not have to do anything very complex to make it work.

In a real pipe, the intermediate storage is a fixed-size buffer in memory, in which case the system has to worry about overfilling the buffer. In a fake pipe, the temporary file ("tempfile.txt" above) is a disk file, so we do not have that worry, but the name of the file needs to be unique (based on the time or date, etc.) so we do not accidentally overwrite an existing (valuable) file.
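
Here is a sketch of the fake pipe in Python. The function name fake_pipe is invented, and tempfile.mkstemp() from the standard library is used to obtain a file name that is guaranteed not to collide with an existing file. The final output of Program2 is captured only so the function can return it.

```python
import os
import subprocess
import tempfile

def fake_pipe(argv1, argv2):
    """Emulate argv1 | argv2 with a temporary file; return argv2's output."""
    fd, path = tempfile.mkstemp()            # unique name, nothing overwritten
    try:
        with os.fdopen(fd, "wb") as out:     # Program1 > tempfile
            subprocess.run(argv1, stdout=out, check=True)
        with open(path, "rb") as inp:        # Program2 < tempfile
            done = subprocess.run(argv2, stdin=inp,
                                  stdout=subprocess.PIPE, check=True)
        return done.stdout
    finally:
        os.remove(path)                      # rm tempfile
```

Note that Program2 is not started until Program1 has finished, which is exactly the one-at-a-time behavior described above.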

Comment and Speculation

This is an example of a Producer-Consumer situation. There is a danger of Program1 trying to write into the buffer even though it is full at present, and there is a danger of Program2 trying to read from the buffer even though it is empty at present.

If we were writing the code to do this for ourselves, one way to do it would be to have an integer N counting bytes in the buffer and a semaphore S controlling access to the counter. Suppose P1 wants to write a byte to the pipe. It will want to increment the counter. So:

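     wait(S);                 /* lock the counter */
     if (N < BUFFERSIZE)
      {
       write one byte
       increment N
      }
     /* otherwise the buffer is full and P1 must try again later */
     post(S);                 /* unlock the counter */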

Likewise, if P2 wants to read a byte from the buffer, it will want to decrement the counter. So:

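     wait(S);                 /* lock the counter */
     if (N > 0)
      {
       read one byte
       decrement N
      }
     /* otherwise the buffer is empty and P2 must try again later */
     post(S);                 /* unlock the counter */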

We are using the semaphore S to ensure that only one process has access to the counter N (and thus to the buffer) at a time.
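
For anyone who wants to experiment, the scheme can be tried out in Python. This is only a sketch under several assumptions: two threads stand in for the processes P1 and P2, the names BoundedBuffer and transfer are invented, and a side that finds the buffer full or empty simply gives way and tries again.

```python
import threading
import time
from collections import deque

class BoundedBuffer:
    """A counter n guarded by a single semaphore s, as described above."""

    def __init__(self, size):
        self.size = size
        self.items = deque()
        self.n = 0                           # N: bytes now in the buffer
        self.s = threading.Semaphore(1)      # S: one holder at a time

    def try_write(self, byte):
        """Store byte and return True, or return False if full."""
        with self.s:                         # wait(S) ... post(S)
            if self.n < self.size:
                self.items.append(byte)
                self.n += 1
                return True
            return False

    def try_read(self):
        """Return the oldest byte, or None if the buffer is empty."""
        with self.s:                         # wait(S) ... post(S)
            if self.n > 0:
                self.n -= 1
                return self.items.popleft()
            return None

def transfer(data, size=4):
    """Move data from a writer thread (P1) to the caller (P2)."""
    buf = BoundedBuffer(size)

    def writer():
        for byte in data:
            while not buf.try_write(byte):   # full: give way, then retry
                time.sleep(0)

    thread = threading.Thread(target=writer)
    thread.start()
    received = bytearray()
    while len(received) < len(data):
        byte = buf.try_read()
        if byte is None:                     # empty: give way, then retry
            time.sleep(0)
        else:
            received.append(byte)
    thread.join()
    return bytes(received)
```

Retrying in a loop wastes processor time. A more refined design would add semaphores that count free and filled slots, so each side sleeps until it can proceed.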

However, do Program1 and Program2 include such code? No, so who is managing this?

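The system manages the buffer, and there is presumably some additional structure of this sort inside it. What the programs see is simple: a write to a full pipe waits until there is room, and a read from an empty pipe waits until data arrives (or reports end-of-file once every write end has been closed).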
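
The behavior at an empty pipe can be checked from a program. In the Python sketch below (the function name pipe_edge_cases is invented), the read end is put in non-blocking mode so that a read which would otherwise have to wait raises an error instead. After the write end is closed and the data has been drained, a read returns an empty result, which is how end-of-file is reported.

```python
import os

def pipe_edge_cases():
    """Return (empty read would block?, data read, result at end-of-file)."""
    read_end, write_end = os.pipe()
    os.set_blocking(read_end, False)
    try:
        os.read(read_end, 1)
        empty_would_block = False
    except BlockingIOError:
        empty_would_block = True     # a blocking read would have waited here
    os.write(write_end, b"last")
    os.close(write_end)              # no writers remain
    data = os.read(read_end, 100)
    at_eof = os.read(read_end, 100)  # empty bytes: end-of-file
    os.close(read_end)
    return empty_would_block, data, at_eof
```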