HTML to LaTeX (version 2.7)

This page describes version 2.7 of html2tex, a program that converts a single HTML file or a collection of related HTML files into a single LaTeX file. Such a LaTeX file can be processed into a PostScript file. To generate a single LaTeX file from a collection of HTML files, the user has to provide a skeleton LaTeX file and indicate where the translated versions of the HTML files should be included. The user also has to specify for each HTML file at which level (chapter, section, subsection, ...) it should be included. Links between the different HTML files are mapped to references in the LaTeX file. External links can be included as footnotes or as a bibliography.

The generation of LaTeX is configurable. The mapping of each HTML tag to LaTeX commands can be specified. (This mapping can even be changed dynamically during the processing of the HTML file.) It is also possible to exclude certain parts of the HTML files from the generated LaTeX file, or to include LaTeX parts in HTML comment lines, which are ignored by HTML viewers. This makes it possible to maintain sources for both HTML and LaTeX in the same HTML files.

The program performs some checking of the HTML files in order to be able to generate correct LaTeX output, but this checking is not guaranteed to conform to any HTML standard. In some places the checking might be more relaxed, and in other places more restrictive, than HTML 2.0. So far, there is not much support for extensions beyond HTML 2.0.

The program does extensive checking of links between the different files. For this reason it can also be used as a link checker, either by giving it a single HTML file and the option -c, or by renaming the program to chkhtml. To also check all referenced pages in the local directory (and its sub-directories), the option -s should be used as well.

Links to excluded HTML files (and other URLs) can be reported either as footnotes or as a sorted bibliography in the LaTeX file.

Error messages are reported on standard output. The program can also generate an extensive cross-reference file mentioning all the anchor tags.

Functionality

The HTML to LaTeX conversion program is implemented by the C program html2tex.c, which needs to be compiled first. The program was developed with the popular gcc compiler, which is freely available under the GNU General Public License. Under UNIX the program can be compiled with the command: 'make html2tex'.

The program can be used either to convert a single HTML file into a LaTeX file, or to convert a collection of related HTML files into a single LaTeX file. These two modes of operation are described below.

Converting a single HTML file

If the program is executed with a single HTML file, a LaTeX file will be generated. For example, the command 'html2tex test.html' will generate the file test.tex. However, files generated in this manner are not complete LaTeX files. To make them complete, some LaTeX commands have to be prefixed and appended to the file. A LaTeX file starts with commands that specify the document style, the title page, and such.

Instead of adding the required LaTeX commands manually, it is also possible to place them inside comments in the HTML file. See below for a description of the commands which are recognized by html2tex inside HTML files. This page can be taken as an example of this. Execute the following command to get a LaTeX file of this page: 'html2tex html2tex.html'. After this the file html2tex.tex can be processed and made, for example, into a PDF file: html2tex.pdf.

Converting a collection of HTML files

To produce a single LaTeX file from a collection of linked HTML files, a skeleton LaTeX file has to be provided. In this skeleton there are commands embedded in comments which specify which HTML files should be included at which place.

When html2tex is executed with a skeleton file on the command line, a LaTeX file with the same name as the skeleton file, but with the extension .tex added to it, will be created.

A real-life example of a skeleton file is transcoop, which includes pages from the original TransCoop pages, which are gone now. The LaTeX file transcoop.tex was generated when the following command was executed in the TransCoop home directory: 'html2tex transcoop'. From this, the PostScript file transcoop.ps can be produced with the help of latex and dvips.
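
The naming rule for the two modes described above can be summarized in a short Python sketch. The function name output_name is made up for illustration and is not part of html2tex (which is written in C); the sketch only restates the two examples given on this page.

```python
def output_name(input_name: str) -> str:
    """Return the LaTeX file name this page describes for a given input name.

    Illustrative helper only; html2tex itself does not expose such a function.
    """
    if input_name.endswith(".html"):
        # Single HTML file: test.html -> test.tex
        return input_name[: -len(".html")] + ".tex"
    # Skeleton file: the extension .tex is added to the name
    return input_name + ".tex"


print(output_name("test.html"))
print(output_name("transcoop"))
```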

The skeleton file

The skeleton input file should contain valid LaTeX commands. All lines in the file that start with %html are interpreted as special lines by the conversion program. These are used to indicate which HTML files should be included, and to set the various options. The following special commands are recognized by html2tex:

Special commands in the HTML files

The following special commands (inside HTML comments) are recognized in the HTML files:

The program recognizes comments inside a pair of double dashes (--) in any of the HTML tags, including <! >. It also treats any text in a <! > tag that is not surrounded by double dashes as a comment, but generates a warning message for it.

Defining mappings

As noted above, the mappings of HTML tags to LaTeX can be changed both in the input file (as a line of the form %html -d tag-name options "LaTeX-open" "LaTeX-close") and inside comments in the HTML files (in the form of latex-def tag-name options "LaTeX-open" "LaTeX-close").

They change the mapping of the tag-name HTML tag to the given LaTeX formatting commands. The strings LaTeX-open and LaTeX-close are put around the text that is marked by the HTML tag. (The string in LaTeX-close is generated at the proper place in case the closing tag is not obligatory in the HTML syntax.) If the LaTeX command has to include a double quote, use two double quotes in the string. If a real newline (the '\n' character) has to be included, use '\nl' instead. (There is no LaTeX command starting with this sequence, but there are many starting with '\n'.)
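
The two quoting rules above (a doubled double quote for a literal quote, and '\nl' for a newline) can be illustrated with a minimal Python sketch. The function name decode_mapping_string is hypothetical, and the sketch assumes that no other escape sequences are interpreted; the actual C implementation may differ.

```python
def decode_mapping_string(body: str) -> str:
    """Decode the text found between the outer double quotes of a
    LaTeX-open or LaTeX-close string.

    Assumption: only the two rules stated in the text are applied.
    """
    # Two double quotes stand for one literal double quote.
    text = body.replace('""', '"')
    # The sequence \nl stands for a real newline character.
    return text.replace("\\nl", "\n")


# The default opening string for <ul> becomes a newline plus \begin{itemize}.
print(repr(decode_mapping_string(r"\nl\begin{itemize}")))
```

Note that '\newline' is left alone by this rule, because it does not contain the sequence '\nl'.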

The options select special kinds of translation. The following options are possible:

The pseudo HTML tags (which cannot occur in the HTML files) L1 to L9 specify what LaTeX commands should be generated for which section level. The definition of these pseudo-tags is changed by the command %html -s style for setting the document style.

The default settings are the ones given below, using the format to be used in the input file:

%html -d html    ""  ""
%html -d head    ""  ""
%html -d title   ""  ""
%html -d body    -on ""  ""
%html -d address ""  ""
%html -d h1      -l1 "{\\LARGE \\textbf{" "}}"
%html -d h2      -l2 "{\\Large \\textbf{" "}}"
%html -d h3      -l3 "{\\large \\textbf{" "}}"
%html -d h4      -l4 "\\textbf{" "}"
%html -d h5      -l5 "{\\small \\textbf{" "}}"
%html -d h6      -l6 "{\\footnotesize \\textbf{" "}}"
%html -d p       "\nl\nl"  ""
%html -d ul      -igh "\nl\begin{itemize}"  "\nl\end{itemize}\nl"
%html -d menu    -igh "\nl\begin{itemize}"  "\nl\end{itemize}\nl"
%html -d dir     -gnh "\nl\begin{itemize}"  "\nl\end{itemize}\nl"
%html -d ol      -igh "\nl\begin{enumerate}"  "\nl\end{enumerate}\nl"
%html -d li      "\nl\item "  ""
%html -d lh      "\nl\item "  ""
%html -d dl      -igh "\nl\begin{description}"  "\nl\end{description}\nl"
%html -d dt      "\nl\item["  "]"
%html -d dd      ""  ""
%html -d a       ""  ""
%html -d q       "``"  "''"
%html -d i       -iim "\textit{"  "}"
%html -d em      "\emph{"  "}"
%html -d b       "\textbf{"  "}"
%html -d strong  "\textbf{"  "}"
%html -d tt      "\texttt{"  "}"
%html -d samp    "\texttt{"  "}"
%html -d kbd     "\texttt{"  "}"
%html -d var     "\textsl{"  "}"
%html -d dfn     "\textsc{"  "}"
%html -d code    "\texttt{"  "}"
%html -d blink   ""  ""
%html -d cite    "\emph{"  "}"
%html -d blockquote  -igh "\begin{quotation} "  "\end{quotation}\nl"
%html -d bq      -igh "\begin{quotation} "  "\end{quotation}\nl"
%html -d u       "\underbar{"  "}"

%html -d pre     -verb "\begin{verbatim} "  "\end{verbatim}\nl"
%html -d xmp     -verb "\begin{verbatim} "  "\end{verbatim}\nl"
%html -d listing -verb "\begin{verbatim} "  "\end{verbatim}\nl"

%html -d br      -br "\newline\nl"  ""
%html -d hr      "\vspace{1mm}\hrule "  ""
%html -d img     ""  ""
%html -d isindex ""  ""
%html -d select  ""  ""
%html -d link    ""  ""
%html -d center  "{\centering "  "}"
%html -d meta    ""  ""
%html -d table   ""  ""
%html -d tr      ""  ""
%html -d td      ""  ""
%html -d sup     "$^{" "}$"
%html -d sub     "$_{" "}$"
%html -d caption ""  ""
%html -d script  -off ""  ""
%html -d noscript ""  ""
%html -d style   -off ""  ""
%html -d font    ""  ""

Suggested alternative settings for the various tags are:

%html -d title -on "\newpage\thispagestyle{myheadings}\markright{\sc{}" "}\pagenumbering{arabic}\nl\nl"
%html -d h1 -l1 "{\nl\nl\smallskip\LARGE\bf\noindent " "}\nl\nl\noindent{}"
%html -d h2 -l2 "{\nl\nl\smallskip\Large\bf\noindent " "}\nl\nl\noindent{}"
%html -d h3 -l3 "{\nl\nl\smallskip\large\bf\noindent " "}\nl\nl\noindent{}"
%html -d h4 -l4 "{\nl\nl\smallskip\bf\noindent " "}\nl\nl\noindent{}"
%html -d h5 -l5 "{\nl\nl\smallskip\small\bf\noindent " "}\nl\nl\noindent{}"
%html -d h6 -l6 "{\nl\nl\smallskip\footnotesize\bf\noindent " "}\nl\nl\noindent{}"
%html -d code -math
%html -d blockquote "\nl{\parindent=2em\narrower\nl" "\nl}\nl"

The default settings for the pseudo tags for the book and report styles are:

%html -d l1      "\nl\nl\chapter{"  "}\nl\nl"
%html -d l2      "\nl\nl\section{"  "}\nl\nl"
%html -d l3      "\nl\nl\subsection{"  "}\nl\nl"
%html -d l4      "\nl\nl\subsubsection{"  "}\nl\nl"
%html -d l5      "\nl\nl\paragraph{"  "}\nl"
%html -d l6      "\nl\nl\subparagraph{"  "}\nl"
%html -d l7      ""  ""
%html -d l8      ""  ""
%html -d l9      ""  ""

The default settings for the pseudo tags for the article style are:

%html -d l1      "\nl\nl\section{"  "}\nl\nl"
%html -d l2      "\nl\nl\subsection{"  "}\nl\nl"
%html -d l3      "\nl\nl\subsubsection{"  "}\nl\nl"
%html -d l4      "\nl\nl\paragraph{"  "}\nl"
%html -d l5      "\nl\nl\subparagraph{"  "}\nl"
%html -d l6      ""  ""
%html -d l7      ""  ""
%html -d l8      ""  ""
%html -d l9      ""  ""

The default settings for the pseudo tags for the plain style are:

%html -d l1      "\nl\nl\section*{"  "}\nl\nl"
%html -d l2      "\nl\nl\subsection*{"  "}\nl\nl"
%html -d l3      "\nl\nl\subsubsection*{"  "}\nl\nl"
%html -d l4      "\nl\nl\paragraph*{"  "}\nl"
%html -d l5      "\nl\nl\subparagraph*{"  "}\nl"
%html -d l6      ""  ""
%html -d l7      ""  ""
%html -d l8      ""  ""
%html -d l9      ""  ""
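
The three default tables above amount to a lookup from document style and section level to a sectioning command. The following Python sketch restates them; the names SECTION_COMMANDS and section_open are made up for illustration, and the style keys are assumed spellings, since the exact argument accepted by %html -s is not shown here.

```python
# Sectioning commands per style, taken from the default tables above.
# Levels beyond the end of a list map to the empty string.
_BOOK = ["chapter", "section", "subsection", "subsubsection",
         "paragraph", "subparagraph"]
_ARTICLE = _BOOK[1:]

SECTION_COMMANDS = {
    "book": _BOOK,
    "report": _BOOK,
    "article": _ARTICLE,
    "plain": [name + "*" for name in _ARTICLE],  # unnumbered variants
}


def section_open(style: str, level: int) -> str:
    """Return the default opening string for pseudo tag L<level>."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    names = SECTION_COMMANDS[style]
    if level > len(names):
        return ""
    # The defaults start with \nl\nl, i.e. two real newlines.
    return "\n\n\\" + names[level - 1] + "{"


print(repr(section_open("book", 1)))
print(repr(section_open("plain", 2)))
```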

Options

The options can be used to configure the LaTeX fragments which are generated by the program for the various kinds of references. The options can be given in the input file (as a line of the form %html -o option-name option-value), and inside comments in the HTML files (in the form of latex-opt option-name option-value).

There are options that determine in which cases references should be generated. For example, an HTML file will often contain an HREF tag wherever an email address is given, so that the address can be used to send an email. As the essential information is already present in the text, it is not necessary to repeat it in a footnote or a bibliographic entry. The following options can be used for this purpose:

By default all these options are on.

The references can be divided into internal and external ones. Internal references are HREF tags that point to a file that is included in the LaTeX output; external references are those that do not. Internal references can be mapped to phrases that direct the reader to the corresponding section. External references have to be given in full, either as a footnote at the bottom of the page or as a bibliographic entry. They are generated as bibliographic entries if the input file contains a line with '%html -b' (or if the program option -b is given); otherwise they are generated as footnotes. There are four generation modes:

These four modes can be set for three different environments, namely the headers, LaTeX alltt environments, and all the remaining parts. The options for this are:

There are also options that determine the format in which the various kinds of references are generated (including the format of the bibliographic entries). All these options make use of format strings (like those used in C), where a percentage symbol followed by a letter indicates a placeholder for a string or number that has to be output. A double percentage symbol causes a single percentage symbol to be printed. All these options should contain LaTeX formatting commands. Because references can be generated in fragile environments, '%p' has to be used at places where a '\protect' is required in a fragile environment. Also, because a '\footnote' is not allowed everywhere, a '%F' has to be used instead.
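
As a rough illustration of how such format strings behave, here is a minimal Python sketch. It models only '%%' and '%p'; the function name expand_format, the placeholder letter 's' in the example, and the assumption that '%p' expands to nothing outside fragile environments are all made up for illustration. '%F' and the program's real placeholder letters are not modelled.

```python
def expand_format(fmt: str, values: dict, fragile: bool = False) -> str:
    """Expand a C-like format string (illustrative sketch only).

    values maps hypothetical placeholder letters to their replacement.
    """
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch != "%" or i + 1 == len(fmt):
            out.append(ch)  # ordinary character, or a trailing lone %
            i += 1
            continue
        code = fmt[i + 1]
        if code == "%":
            out.append("%")  # %% prints a single percentage symbol
        elif code == "p":
            # Assumed: \protect is emitted only in fragile environments.
            out.append("\\protect" if fragile else "")
        else:
            out.append(str(values[code]))
        i += 2
    return "".join(out)


print(expand_format("see %p\\ref{%s}", {"s": "sec:intro"}, fragile=True))
```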

These are the options for internal references:

The options for external references as footnotes are:

The options for citations are:

The options for the bibliographic entries are:

The following options deal with the formatting of all kinds of references. They make it possible to add additional formatting around the anchor text or the image tag. The "%R" indicates the place where the reference should be placed. This can be either an internal or an external reference, in the running text or as a footnote. If the "%R" appears in a fragile environment, it should be changed into "%fR". If it appears in a place where a \footnote is not allowed, a combination of an "%mR" and an "%tR" can be used to indicate the place of the footnote marker and the footnote text, respectively. (An "f" can be added if they occur in a fragile environment.)

Support for tables is still minimal, but the following two options are related to converting tables:

Below is an example HTML fragment that converts a table to the LaTeX tabular environment:
<!--latex-def table " \begin{tabular}{|p{3.5cm}|p{8cm}|}\hline " " \end{tabular} "-->
<!--latex-opt tab_row_sep " \\ "-->
<!--latex-opt tab_cell_sep " & "-->
<!--latex-def th " \textbf{" " } "-->

<TABLE>
<TR><TH>A</TH><TH>B</TH></TR>
<TR><TD>1</TD><TD>2</TD></TR>
</TABLE>
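
To give an idea of how the two separator options combine with the table mapping, here is a small Python sketch. It assumes that tab_cell_sep is written between the cells of a row and tab_row_sep after every row; the exact placement and whitespace in html2tex's real output are not documented here and may differ. The function name render_table is hypothetical.

```python
def render_table(rows, table_open, table_close, cell_sep, row_sep):
    """Join cells and rows with the given separators (illustrative only)."""
    body = "".join(cell_sep.join(row) + row_sep for row in rows)
    return table_open + body + table_close


# Values copied from the latex-def and latex-opt lines in the fragment above.
rows = [
    [r" \textbf{A } ", r" \textbf{B } "],  # <TH> cells wrapped by the th mapping
    ["1", "2"],
]
print(render_table(
    rows,
    table_open=r" \begin{tabular}{|p{3.5cm}|p{8cm}|}\hline ",
    table_close=r" \end{tabular} ",
    cell_sep=" & ",
    row_sep=r" \\ ",
))
```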

Program options

If the program is given an input file with the extension .html, it does not generate a LaTeX output file, but only analyses the file and the files it references (if the -s option is given).

The program recognizes the following command line options:

Bugs

There is still a long way to go with respect to bugs. I still cannot process the web testing pages correctly.

Known bugs are:

The source

The source of html2tex falls under the GNU General Public License, and thus no warranties whatsoever are implied! Earlier versions are available on request. I cannot give much support, because I am busy with my kids Annabel and Andy.

Support for non-Western alphabets (Japanese, Cyrillic)

For support of non-Western alphabets (character encodings) it can be desirable not to translate the characters in the range 127 to 255. To enforce this, compile the program with the -DASCII8 switch, or add the following line at the start of the source:
#define ASCII8

Version history

For all versions: no warranties whatsoever are implied! Each version has a version number and a date at the top of the source file. Please use these for bug reports. Please check the revision history in the source for more information. (What happened to version 2.3? I guess I skipped that number by accident.)

Future plans

There are a number of things that I would like to work on if I had the time. These are:

Acknowledgements

I would like to thank the following people for their contributions:

Other converters

Another interesting, and probably more powerful, HTML to LaTeX converter using Perl can be found here. See also Converting from HTML for more information.

As a spin-off of this program, I developed the program chkhtml.c, which I use as part of my tools for maintaining this web site.

