Reverse engineering the Quark Xpress file format

by Frans

In the period from February 2001 to May 2002, I spent many hours reverse engineering the binary file format used by Quark Xpress, a widely used DTP program. I have decided to make my results public in the form of a source distribution under the GNU General Public License. I would be very happy if any additional discoveries about the Quark Xpress file formats made with the help of this program were also made public under the GNU General Public License.

Although the program can read all the files I needed to read (for the Study Bible Project), it is by no means complete and could well crash on any other file. The biggest limitation is that the program can only read files produced by some earlier Mac versions of Quark Xpress. Files saved by the Windows version use a different byte order for integers. (On February 22, 2001, I released a very first version of the program, which was able to read some Windows files.)
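The byte-order difference can be confined to the lowest-level integer readers. The following is a minimal sketch in modern C++ and is not code from the distribution: the names ByteOrder, read_u16 and read_u32 are hypothetical, and it assumes that the Mac files store integers most significant byte first and the Windows files least significant byte first.

```cpp
#include <cstdint>

// Hypothetical helpers, not part of the distribution.
enum class ByteOrder { BigEndian, LittleEndian };

// Read a 16-bit unsigned integer from the two bytes at p.
inline std::uint16_t read_u16(const unsigned char* p, ByteOrder order) {
    if (order == ByteOrder::BigEndian)
        return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
    return static_cast<std::uint16_t>((p[1] << 8) | p[0]);
}

// Read a 32-bit unsigned integer from the four bytes at p.
inline std::uint32_t read_u32(const unsigned char* p, ByteOrder order) {
    const std::uint32_t b0 = p[0], b1 = p[1], b2 = p[2], b3 = p[3];
    if (order == ByteOrder::BigEndian)
        return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
}
```

With readers of this kind, the rest of a scanner would not need to know which platform wrote the file, provided the byte order is determined once when the file is opened.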

At the moment I have only very limited time available to support anyone continuing the reverse engineering of the Quark Xpress formats. Please do not ask me any questions about the code: if you are not able to read the code as provided, you will very likely not be able to reverse engineer the binary file format any further. (Read: Requirements.) If you want to continue working on the Windows file formats, please read the section on the Windows file formats near the end of this page.

For professional conversions of Quark Xpress to XML, I point to the following resources:

The story in eighteen parts

Below are the eighteen entries in my online diary in which I tell about my progress.

Description of the format

To describe a binary format is difficult, because the description needs to be exact, concise, and easy to read. Grammars (such as BNF) can be of some help. A good way of describing a binary format is to provide a program that can parse it. So far, I have not had time to write a well-documented description of the format, so the only documentation is the program provided here. Please read the above story to get some idea of the general structure that Quark Xpress uses. For a detailed description, study the file scanQXDoc.cpp, starting with the function scan_file. I have tried to write the scan_* functions in such a way that they represent the "grammar". All these functions operate on CReadBuf objects, which represent a buffer of a given length containing part of the data from the file. I have made use of a set of macro defines (in capitals and starting with an underscore character) for the various elements of the grammar. Some arguments of these macros are only useful for producing readable output. Below is a short description of some of these macros, which read or process some data:

_SUB_BUFFER and _SAFE_SUB_BUFFER: create a sub-buffer (second argument) of a given length (third argument) from a given buffer (first argument), known by a certain name (fourth argument).
_SKIPBYTE: skips a byte.
_SKIPWORD: skips a word (two bytes).
_SKIPLWORD: skips a long word (four bytes).
_SKIPBYTES: skips a number of bytes.
_SKIPBYTES_S: same, but with printing.
_BYTE: reads a byte into an already defined variable.
_WORD: likewise for a word.
_LWORD: likewise for a long word.
_VARBYTE: reads a byte into a newly defined variable.
_VARWORD: likewise for a word.
_VARLWORD: likewise for a long word.
_VARPASCALSTRING: reads a Pascal-like string, where the length of the string is specified by the first byte.
_VARPASCAL2STRING: likewise, but the data that follows starts at an even number of bytes from the first character.
_VARPASCALFIXSTRING: likewise, but extended to a fixed length.
_EXPECTBYTE: expect a byte with the given value (first argument).
_EXPECTWORD: likewise for a word.
_EXPECTLWORD: likewise for a long word.
_CALL and _CALL_IC: for calling another scanning function.
_DONE: checks if the given buffer has been read till the end.

The remaining macros only serve to format the output in case an error is detected.
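To give an impression of the kind of operations behind these macros, here is a minimal sketch of a bounds-checked reader. It is not the CReadBuf class from the distribution: the class name Reader and all its member names are hypothetical, the comments only indicate which macro each member roughly resembles, and integers are assumed to be stored most significant byte first.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the idea behind CReadBuf.
class Reader {
public:
    Reader(const unsigned char* data, std::size_t len)
        : p_(data), end_(data + len) {}

    // Roughly _VARBYTE, _VARWORD and _VARLWORD.
    std::uint8_t byte() { need(1); return *p_++; }
    std::uint16_t word() {
        const std::uint16_t hi = byte();
        return static_cast<std::uint16_t>((hi << 8) | byte());
    }
    std::uint32_t lword() {
        const std::uint32_t hi = word();
        return (hi << 16) | word();
    }

    // Roughly _SKIPBYTES.
    void skip(std::size_t n) { need(n); p_ += n; }

    // Roughly _VARPASCALSTRING: the first byte holds the length.
    std::string pascal_string() {
        const std::size_t n = byte();
        need(n);
        std::string s(reinterpret_cast<const char*>(p_), n);
        p_ += n;
        return s;
    }

    // Roughly _EXPECTWORD: fail if the next word has another value.
    void expect_word(std::uint16_t value) {
        if (word() != value) throw std::runtime_error("unexpected word");
    }

    // Roughly _SUB_BUFFER: split off the next len bytes as a new reader.
    Reader sub(std::size_t len) {
        need(len);
        Reader r(p_, len);
        p_ += len;
        return r;
    }

    // Roughly _DONE: true if the buffer has been read to the end.
    bool done() const { return p_ == end_; }

private:
    void need(std::size_t n) const {
        if (static_cast<std::size_t>(end_ - p_) < n)
            throw std::runtime_error("read past end of buffer");
    }
    const unsigned char* p_;
    const unsigned char* end_;
};
```

A scanning function written against such a reader reads almost like a grammar rule: expect a marker, split off a sub-buffer of the announced length, read its fields, and check that the sub-buffer has been consumed completely.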

The sources

You can download the sources in a single zip file from here. The sources compile with the Cygnus gcc compiler (version 2.95.2) in the Cygnus Unix environment under Windows. (Compilation problems can occur with newer versions of gcc.) To build the program, simply compile the file scan.cpp, as it includes all the other sources.

Please note that the files CQXDoc.cpp and CDatabase.cpp are generated by the cls2cpp program from the files CQXDoc.cls and CDatabase.cls. Please do not edit these .cpp files, but regenerate them from the .cls files. You could use the following shell script for building the program:

#!/bin/sh
make cls2cpp
cls2cpp CDatabase
cls2cpp CQXDoc
gcc -g -Wall scan.cpp -o scan.exe

Of course, you could also write a small makefile to do the job. I did not bother, since it would only save the half second it takes to run the program each time.

Below, a short description of the files found in the source distribution is given.

The file `scan.cpp`

The main file in the source distribution is scan.cpp. This file includes all the other files; no header files are used. With present-day computers, it is often much faster to simply include all the sources into a single file than to compile all the C++ files into separate object files and link them together. Even for larger projects, where most of the time is spent reading large numbers of include files, this can be much faster than the traditional way of compiling and linking.

The file `stddef.c`

Just a collection of handy functions and macros that I often use in my C/C++ programs.

The files `CBuf.cpp` and `CReadBuf.cpp`

These files implement a number of classes for reading data from a buffer. The class CBuf implements the buffer, and the classes CReadBuf and CReadButWithBlocks implement procedures to read various kinds of values from a CBuf buffer.

The files `MMFile.cpp` and `MMFileDummy.cpp`

The file MMFile.cpp implements a persistent store (database) making use of a memory-mapped file. The file MMFileDummy.cpp implements a replacement for MMFile.cpp which is not persistent. The scan.cpp provided in the distribution uses the non-persistent implementation. If you want to use the persistent implementation, you might want to change the filename used in the open method and increase the size of the store. The program may crash in case of an overflow.
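The overflow risk comes from the store having a fixed size. As an illustration only, the sketch below shows a non-persistent store of fixed capacity that reports an overflow instead of writing past its end. The class FixedStore and its members are hypothetical and do not reflect the interface of MMFile.cpp or MMFileDummy.cpp.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-size, in-memory store that hands out space
// sequentially.
class FixedStore {
public:
    explicit FixedStore(std::size_t capacity) : mem_(capacity), used_(0) {}

    // Reserve n bytes and return their offset within the store.
    // Throws std::bad_alloc if the store is too small.
    std::size_t allocate(std::size_t n) {
        if (n > mem_.size() - used_) throw std::bad_alloc();
        const std::size_t offset = used_;
        used_ += n;
        return offset;
    }

    // Translate an offset into an address valid for this run.
    unsigned char* at(std::size_t offset) { return mem_.data() + offset; }

    std::size_t used() const { return used_; }
    std::size_t capacity() const { return mem_.size(); }

private:
    std::vector<unsigned char> mem_;
    std::size_t used_;
};
```

Handing out offsets rather than pointers is a common choice for stores that may be mapped at a different address on each run; whether MMFile.cpp works this way is not stated here.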

The files `CQXDoc.cls` (and `CQXDoc.cpp`)

This defines the classes for storing the logical structure of a Quark Xpress document, including many of its style definitions.

It also contains the class CTextAccessor, which is an accessor to formatted text from a text fragment with all its formatting instructions. For an example of how to use it, see the file DumpQXDoc.cpp.

It also contains the class CTextOnFramesAccessor, which could be used to walk over the whole text of a book. There are no examples of its use in the code distribution, but you should be able to figure out how to use it yourself. It also contains some elementary parsing methods.

The files `CDatabase.cls` (and `CDatabase.cpp`)

This defines a few classes for organizing some Quark Xpress files into books and maintaining a collection of books.

The file `scanQXDoc.cpp`

This contains the actual scanner. It makes heavy use of some tricky defines. The idea is that the code describes the grammar, but in case of an error, the parser jumps back to a certain point and repeats the parsing, this time dumping information. This makes it easier to figure out what went wrong. The system does not always work perfectly.
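The idea of parsing quietly first and repeating the parse with dumping after an error can be sketched without macros. The code below is an illustration under assumptions, not the mechanism used in scanQXDoc.cpp: the names Scanner, scan_record and scan_or_dump are hypothetical, and the record layout (a tag byte 1, a count, and that many items) is invented for the example.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical scanner state. trace is null during the quiet pass.
struct Scanner {
    const std::vector<unsigned char>& data;
    std::size_t pos;
    std::ostream* trace;

    // Read one byte; log its name and value when tracing.
    unsigned byte(const char* name) {
        if (pos >= data.size())
            throw std::runtime_error("unexpected end of data");
        const unsigned v = data[pos++];
        if (trace) *trace << name << " = " << v << "\n";
        return v;
    }

    void expect_byte(unsigned want, const char* name) {
        if (byte(name) != want)
            throw std::runtime_error(std::string("bad value for ") + name);
    }
};

// Invented example grammar: tag 1, a count, then count items.
inline void scan_record(Scanner& s) {
    s.expect_byte(1, "tag");
    const unsigned n = s.byte("count");
    for (unsigned i = 0; i < n; ++i) s.byte("item");
}

// Returns an empty string on success, otherwise a dump of the
// repeated parse up to and including the error.
inline std::string scan_or_dump(const std::vector<unsigned char>& data) {
    try {
        Scanner quiet{data, 0, nullptr};
        scan_record(quiet);
        return "";
    } catch (const std::runtime_error&) {
        std::ostringstream log;
        Scanner verbose{data, 0, &log};
        try {
            scan_record(verbose);
        } catch (const std::runtime_error& e) {
            log << "error: " << e.what() << "\n";
        }
        return log.str();
    }
}
```

Because the quiet pass produces no output, a correct file costs nothing extra, while a failing file yields a trace that ends exactly where the grammar and the data disagree.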

The file `FrameGeom.cpp`

This file contains some code for determining the natural reading order of the frames. It also deals with nested frames. The algorithm used is probably not perfect, but it served my purpose well. After the main routine has been called, all frames found in the documents of a "book" are linked through first_frame_reading_order and next_reading_order.
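The linking of frames through next_reading_order can be illustrated with a deliberately naive ordering: by page, then by top edge, then by left edge. This is only a sketch under assumptions and not the algorithm from FrameGeom.cpp, which also handles nested frames. The Frame struct, its geometry fields and the function link_reading_order are hypothetical; only the name next_reading_order is taken from the text.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical frame record with assumed geometry fields.
struct Frame {
    int page;
    double top, left;
    Frame* next_reading_order;
};

// Sorts the frames by page, top edge and left edge, links them in
// that order, and returns the first frame (nullptr if there are none).
inline Frame* link_reading_order(std::vector<Frame>& frames) {
    std::vector<Frame*> order;
    for (Frame& f : frames) order.push_back(&f);
    std::stable_sort(order.begin(), order.end(),
                     [](const Frame* a, const Frame* b) {
                         if (a->page != b->page) return a->page < b->page;
                         if (a->top != b->top) return a->top < b->top;
                         return a->left < b->left;
                     });
    for (std::size_t i = 0; i < order.size(); ++i)
        order[i]->next_reading_order =
            (i + 1 < order.size()) ? order[i + 1] : nullptr;
    return order.empty() ? nullptr : order.front();
}
```

An ordering this simple breaks down on multi-column layouts, where a column should be read to the bottom before moving right, which is one reason a real implementation needs more than a sort.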

The file `DumpQXDoc.cpp`

This file contains some routines to dump the information to a file, either as plain text or as HTML, but it could be modified to dump to any format you want. It is more an example than a working piece of code. A lot of the intelligence is in the class CTextAccessor from the file CQXDoc.cls.

Latest version for Windows file formats

For those who want to continue the work on reverse engineering the Windows file formats, I hereby also give access to the latest version of the program that can read some files produced by Quark Xpress 4.1 for Windows. I have not been able to date the version. It is definitely later than May 22, 2001. I think it is from earlier this year, as it makes use of an early implementation of the class CBuf. Actually, this version was produced on October 12, 2002, when I made some last modifications to make it generate an XML file that can be viewed with IE!

I am not very proud of this program, because the code contains a lot of rubbish. Please do not look at it if you are not an expert programmer. At some points it might even cause more confusion than be of help. (I am afraid it contains some amount of dead code.) When run, it produces a lot of debugging output on stdout, which I usually redirect to a file. A file with the extension .xml will be generated if the program does not crash, which I am afraid is very likely if you feed it an arbitrary Quark Xpress 4.1 document.

If you want to contribute to the reverse engineering of the Quark Xpress file formats, do not develop this program further, but rather make modifications to the latest source base. I am not willing to publish any modifications to the qq.cpp program. You may do so yourself, of course.

Only extracting text

Based on the above sources, sed developed a program which simply extracts the raw texts from a Quark Xpress file for Mac versions 3.3 and 4.0. The texts are extracted in the order in which they occur in the file, which is not very likely to match the order in which they occur in the document.
