There are several ways to perform the editing, and the one you choose will affect the resulting document.

The data document itself is a rather simple html document. It contains no tables or other advanced html features.

Remember that it is not always possible to automate a procedure so that it produces perfect results. What we want is to move the document far enough along that the final polishing of the converted document requires relatively few actions. For example, while we can break all lines longer than 79 characters, this may leave a few very short lines in the middle of a paragraph. Or it may not be easy to correctly represent indented lists nested inside other indented lists.

I will post a link to the actual web page being converted, so that you can see what it looks like. Keep in mind that certain aspects of its appearance are affected by the browser and its font choices. I will provide a description of the general sequence of actions I have taken to get my final document. If you wish to attempt a better solution, you may.

When searching for html markers, remember that they are case insensitive, so you must check for both upper and lower case characters in the marker. For example, <br>, <BR>, <Br>, and <bR> are all valid line break markers.
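As a minimal sketch of that check, a bracket expression can list both cases of each letter. The function name mark_br and the [BR] placeholder are only illustrative and are not part of my own solution. Some versions of sed also offer a case-insensitive flag, but that is not assumed here.

```shell
# mark_br is a hypothetical helper: it replaces every <br>, in any mix of
# upper and lower case, with the visible placeholder [BR].
mark_br() {
    sed 's/<[Bb][Rr]>/[BR]/g'
}

# All four spellings of the marker are matched by the same expression.
printf '%s\n' 'a<br>b<BR>c<Br>d<bR>e' | mark_br
```

The same bracket trick extends to any other marker, at the cost of a longer expression for longer marker names.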

Each of the following paragraphs represents one invocation of sed, and I have redirected the output of each step to a temporary file. This allows the intermediate state of the data to be examined between actions.

First, remove any spaces before the < that starts a tag and all spaces after the > that ends a tag.
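The first step, together with the redirect to a temporary file, might look roughly like this. The names strip_tag_spaces, page.html, and step1.tmp are assumptions made for the example, and only plain spaces (not tabs) are handled.

```shell
# strip_tag_spaces is a hypothetical helper: it removes runs of spaces
# that sit directly before a "<" or directly after a ">".
strip_tag_spaces() {
    sed -e 's/ *</</g' -e 's/> */>/g'
}

# Work in a scratch directory so the intermediate file can be inspected.
workdir=$(mktemp -d)
printf '%s\n' 'Some   <b> bold </b>  text' > "$workdir/page.html"

# One sed invocation per step, output redirected to a temporary file.
strip_tag_spaces < "$workdir/page.html" > "$workdir/step1.tmp"
cat "$workdir/step1.tmp"

rm -r "$workdir"
```

Note that the words on either side of a tag end up touching the tag. Later steps that replace a marker with text have to put any wanted spacing back.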

Replace each <p> marker with a blank line. Also remove any </p> end-of-paragraph markers. Keep in mind that </p> is not required and may not be matched to every <p>, so simply remove it. Also, break the line at each <br>, provided it is not already at the end of the line.
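One possible sketch of this step follows. The function name paragraphs_and_breaks is made up for the example. A newline is written into the replacement as a backslash followed by a real line break, which is the form older versions of sed expect. Back-to-back markers such as <br><br> are not handled.

```shell
# paragraphs_and_breaks is a hypothetical helper for the <p> and <br> step.
paragraphs_and_breaks() {
    sed '
s/<\/[Pp]>//g
s/<[Bb][Rr]>\(.\)/\
\1/g
s/^<[Pp]>/\
/
s/<[Pp]>/\
\
/g
'
}

# Command 1 drops </p>. Command 2 breaks the line at a <br> that is
# followed by more text, so a <br> at the end of a line is left alone.
# Command 3 turns a leading <p> into one blank line, and command 4
# turns any other <p> into a line break plus a blank line.
printf '%s\n' 'First<p>Second</p>' 'one<br>two<br>' '<p>Third' |
    paragraphs_and_breaks
```

The <br> left at the end of a line is harmless, since the later step that removes all remaining markers will take it out.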

Break the line at each <li> marker and add a * to the beginning of the new line.

Replace the <li> marker with an asterisk and a space, and double-space the line being tagged.
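A rough sketch of the <li> step is given here. The name bullet_items is an assumption for the example. Each marker becomes a line break followed by "* ", so a marker at the start of a line ends up with a blank line in front of it, while a marker in the middle of a line only starts a new line.

```shell
# bullet_items is a hypothetical helper: every <li>, in either case,
# is replaced by a newline followed by an asterisk and a space.
bullet_items() {
    sed 's/<[Ll][Ii]>/\
* /g'
}

printf '%s\n' '<li>one' '<li>two<li>three' | bullet_items
```

Any </li> markers are left in place here, on the assumption that the later step that strips all remaining markers will remove them.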

Replace the start <b> and end </b> bold markers with double quotes.

Replace <b> and </b> with quotes, placing a space before the opening quote and a space after the closing quote.
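A small sketch of the bold step, using the variant with the extra spaces. The function name quote_bold is only illustrative.

```shell
# quote_bold is a hypothetical helper: <b> becomes a space and an opening
# quote, </b> becomes a closing quote and a space.
quote_bold() {
    sed -e 's/<[Bb]>/ "/g' -e 's/<\/[Bb]>/" /g'
}

printf '%s\n' 'Some<b>bold</b>text' | quote_bold
```

The added spaces matter when the spaces around the markers have already been removed, because otherwise the quoted word would run into its neighbours.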

Break any lines longer than 79 characters into a set of lines around 60 characters long. Note that a line may have to be reprocessed several times. See the hints below.

Indent all lines between the <ul> and </ul> markers, then remove the markers themselves. Also leave a blank line where each <ul> and </ul> marker was.
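The list step can be sketched with an address range. This example assumes that each <ul> and </ul> marker sits on a line by itself and that lists are not nested; the name indent_lists and the four-space indent are choices made for the example.

```shell
# indent_lists is a hypothetical helper. Inside each <ul> ... </ul> range
# the marker lines are emptied and every other non-empty line is indented.
indent_lists() {
    sed '
/<[Uu][Ll]>/,/<\/[Uu][Ll]>/{
    s/^<\/*[Uu][Ll]>$//
    /./s/^/    /
}
'
}

printf '%s\n' 'Intro' '<ul>' '* one' '* two' '</ul>' 'Outro' | indent_lists
```

A nested list would close the range at the first </ul>, which is one reason lists inside lists are hard to get right automatically.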

Replace the <hr> horizontal rule marker with a line of stars on its own line, and just remove any additional html markers. Note: replace every string that starts with < and ends with > with nothing.
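A possible sketch of this step is shown next. The name rules_and_tags and the choice of 20 stars are assumptions for the example. Markers that are split across two lines are not handled.

```shell
# rules_and_tags is a hypothetical helper. A line holding only <hr>
# becomes a row of stars; an <hr> inside a line is moved onto its own
# line as stars; then whatever markers remain are deleted.
rules_and_tags() {
    sed '
s/^<[Hh][Rr]>$/********************/
s/<[Hh][Rr]>/\
********************\
/g
s/<[^>]*>//g
'
}

printf '%s\n' '<hr>' 'before<hr>after' '<a href="x">link</a> text' |
    rules_and_tags
```

The expression <[^>]*> matches from a < up to the nearest >, so two markers on the same line are removed separately instead of swallowing the text between them.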

Replace all occurrences of &lt; with < and &gt; with >.
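A short sketch of the entity step follows; restore_entities is a made-up name. Running it after the markers have been stripped keeps the restored < and > characters from being mistaken for markers.

```shell
# restore_entities is a hypothetical helper: it turns the two html
# entities back into the plain characters they stand for.
restore_entities() {
    sed -e 's/&lt;/</g' -e 's/&gt;/>/g'
}

printf '%s\n' 'x &lt; y and y &gt; z' | restore_entities
```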


Hints for breaking the long lines:

Test for a line longer than 79 characters.
While the line is longer than 79 characters, break it near character 60 and print the first part.

Look at the N, P, and D commands to print and loop back to the test. You may not need all three.

If done right, even a line of several hundred characters can be broken into lines 60 to 80 characters long.

Remember:

N - joins the next line in the data file to the current line, leaving a newline character between them.

P - prints (outputs) the contents of the current edit buffer up to and including the first newline character.

D - deletes the contents of the current edit buffer up to and including the first newline character, then branches to the top of the list of edit commands without reading a new line of data.
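Putting the hints together, one way to write the loop is sketched here. It is not the only solution; the name break_long_lines is an assumption, and this version gets by with P and D alone because each pass already starts with a full line in the edit buffer.

```shell
# break_long_lines is a hypothetical helper. While the edit buffer is
# longer than 79 characters, the last space within the first 61
# characters is turned into a newline, P prints the part before it, and
# D removes that part and restarts the commands on what is left.
break_long_lines() {
    sed '
/^.\{80,\}$/{
    s/^\(.\{1,60\}\) /\1\
/
    P
    D
}
'
}

# Build a 199-character test line made of 40 four-letter words.
long='abcd'
i=1
while [ "$i" -lt 40 ]; do
    long="$long abcd"
    i=$((i + 1))
done

printf '%s\n' "$long" | break_long_lines
```

If a long line has no space in its first 61 characters, the substitution fails, P prints the whole buffer, and D simply ends the pass, so the line comes out unbroken and the loop still terminates.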