How to crack a Binary File Format
By Frans Faase
Important notice
At the moment I have a fixed contract with an employer
who is not interested in allowing me to perform work
(paid or unpaid) for a third party. Also my family
circumstances (having a son
with a mental handicap and a wife diagnosed with Alzheimer's Disease) are such that my spare time
is very limited.
Please do not send me binary files
and expecting me to reverse engineer them for you.
Request for cracking passwords and other forms of
encryption are deleted immediately by me.
|
Introduction
I often receive request from people to help them
with cracking a particular binary file format.
It seems that many software vendors are not willing
to document the format of the binary files that they
are using. Maybe there are good reasons for this.
Personally, I feel that users have the right to
access the information that is stored in the document
that they manipulate with a program that they are
legally allowed to use. That is why I do not consider
reverse engineering of a binary file format as an
illegal act (opposite to what the use of the verb
"cracking" is suggesting).
The most likely reason why people ask me for help is
because I already have reversed engineerd some
binary file formats. So far, I have worked on:
Although, cracking a binary file format can be very
rewarding, like a form of mental mountain climbing,
there are also some basic skills that are absolutely
neccessary, if you ever want to succeed. These are
(in order of importance):
- A good reason why you want to crack the format at all.
- A lot of time.
- A lot of preseverence.
- A lot of creativity.
- Willingness to do repeative tasks for hours and hours.
- Good understanding of how numbers and strings
are represented in binary manner.
- Excellent command of the C programming language.
Cracking a file format is not easy. It can drive you crazy.
Try to avoid it if you can. And, of course, you first
search the Internet for at least one day seeing what is
already known about the format, or to find out if someone
already offers some conversion service or tool to a format
that you can read. One place to look at, is
Wotsit, a site dedicated
to file formats.
Also, I think you should have read the stories of attempts
to crack a binary file format, and studied some programs
that read binary files. (Follow the links mentioned in the
introduction to start with.)
Besides this, you also need:
- a computer (quite obvious),
- a simple text editor capable of handling long lines,
- (optional) a hex viewer or editor,
- a calculator with hexadecimal-to-decimal conversion and
vice-versa functionality,
- a C or C++ compiler,
- enough sample files about which you know what data
they contain, and
- (optional) the application which uses the files.
Recently, a professional hex editor was brought under my
attention with some nice features for cracking binary
files. 010 Editor
allows you to write templates for describing the
structure of the binary file, and having it parse
on the fly showing the results. The languages used
has some powerful features. It some what resembles
my ideas for BFF: a grammar
for Binary File Formats.
What is the fun?
Indeed, cracking a format is like climbing a
mountain, when you reach the top it gives you
a great reward. But it is also like entering
the mind of someone else. With every step you
come further, you know something more about the
program being used to read and write the file.
And by this, you also know more about the mind
of the person (or persons) who designed that
program. However, sometimes this person becomes
your enemy, when attempts have been made to prevent
total reverse engineering of the format. Yes,
those people designing a format sometimes are
given the assignment to make reverse engineering
difficult. There are many ways to do it. Custom
made compression and encrytion are two techniques.
Luckily, both these can reduce performance of
loading and saving. But there are also small
tricks you can use, which make it a lot harder.
Programming style
Although, cracking a binary file format may tempt
you to write instant code, you should realize that
it is a long and complicated task, which often
requires the code to undergo major revisions. Code
revisions are much easier if your code is clean and
well structured. (The principles of
eXtreme Programming do also
apply to this kind of programming work.)
Some practical remarks:
- Document your code (either through comments
or, better, by giving meaningful names to
variables and procedures).
- Make regular back-ups. Although this should
be a standerd procedure, with cracking a file
format you will see yourself more often going
back to an earlier version of the program
whenever you get stuck in a certain area.
- Build into your program switches (simple
Boolean variables) which allow you to dump
any intermediate data during any stage of
the parsing. You will find that you often
need to go back one step to make a step
forward.
How to work
Where to start
The first thing you need to do is to understand the
global structure of the binary files. Look at the
sample files with a hex-viewer, or write a small
hex-dump program, which dumps each next 32 bytes
on a separate line, generating an empty line
after each 256 bytes. (Alternatively, you can also
dump 256 bytes on each line.)
You can use one of the following statements to
print a single byte:
printf("%02X", byte);
printf("%02X ", byte);
printf("%c%02X", (byte > ' ' && byte < 127)?byte:' ', byte);
printf((byte > ' ' && byte < 127)?" %c":"%02X", byte);
Take care that byte is defined as a unsigned
char, otherwise you may get confusing effects. (If
you don't understand the above code fragments, you clearly
do not meet the requirements, so,
don't ask me to explain it to you.)
Once the global structure has been discovered, you
can start writing a C program, which simply dumps
the file along this structure.
A lot the work, just consists of searching for
structure in the hex-dumps and editing them in long
lines representing the various records on a single line,
in order to discover the structure and determine the
meaning of the various parts.
For a primary investigation of a format, the
strings program might be useful. Sometimes
binary format contain whole chunks of text.
Block structure
Some binary file formats use a block structure
where each chunk of data is stored in a number
of equal sized blocks. A block size of 256 is
often used. Examples of these are MS Word 5.0 and
Quark Xpress. Whenever a block structure is
used, data structures are often referenced by the
block number in which the data start.
Number representation
The most common used manner to represent numbers
is to interpret a sequence of bytes as a binary
number. Numbers can also be stored as strings,
where each character is representes one digit.
(These are relatively easy to recognize in a
binary file.) Because only a limit set of characters
is needed to represent all kinds of numbers (including
fractions and exponents), two characters can be
put in a single byte. In the past there were even
processors that had instructions to perform
arithematic operations on such numbers. (The two
parts of the byte were often refered to as nibbles.)
Nowadays this format is not in use any more (but
it might still occur in some binary file format,
as these sometimes survive their original platform).
Negative integers
For simple positive numbers interpreting a sequence of
bits as a binary number is the rather obvious choice.
But for numbers that can be both positive and negative
this is not. One interpretation is to consider one of the
bits (the first usually) as the sign bit, and the rest of
the bits as a binary number. A problem with this
interpretation is that there are two representations for
zero. Another solution is to interpret a sequence of
bits as a binary number, and to subtract a large
value from it, in case the number is larger than some
number. There are two options here, with respect to
which numbers is subtracted with a certain amouth of
available bits. These are called 1's compliment and 2's compliment
number representation. Nowadays, the 2's compliment method
is most often used. This means that for a byte (8 bits)
the number 256 is subtracted, when the number is larger
than 127. That means that the unsigned number 128
represent the signed number -128, and the unsigned number
255 represent the signed number -1. (With the 1's compliment
number representation, the number 255 is subtracted.
Again, this introduces two representations for zero, as the
unsigned number 255 becomes the signed number 0.)
Note that with the 2' compliment the first bit can be
seen as the sign bit. The only disadvantage the 2' compliment
has is that there is always a negative number that does not
have a positive counterpart, e.g., for the byte there is
a representation for -128 but not for 128.
An important advantage of the 2-compliments notation is
that the arithmetic operations for the signed and unsigned
representation are exactly the same. The only time the
difference become obvious is when a number is cast to a
larger representation form. For example the hexadecimal
byte value 0xFF needs to be changed to 0xFFFF, when the
signed number -1 is cast from a byte to a word, and has
to be changed to 0x00FF, when the unsigned number 255 is
cast from a byte to a word.
In a binary file format, it can occur that a unsigned byte
value is stored as a word with the signed expansion applied.
As a result of this it can happen that 127 is stored
hexadecimal as 0x007F, where 128 is stored hexadecimal
as 0xFF80.
little and big endian
Processors can be divide into little and big endian.
This determines the order of the bytes in for short
and long integer values. little endian processors
start with the least significant byte. Big endian
processors start with the most significant byte.
The reason why little endian processors start with
the least significant by first, is because it gives
an advantage for 8 bits processors when having to
perform adding and subtraction operations. Although
this plays no role in current 32 bit and 64 bit
processors, some processors are still little endian
because of their origin. The Pentium processor is
a little endian. The PowerPC (used in Macs) is
big endian. This difference is especially important
for exchanging binary data between different processors.
One should be aware that some applications running on
a big endian processor, may use the little endian
byte order, because the application was originally
developed to run on a little endian processor.
Sometimes, long values (32 bits) are stored as two
big endian words, with the least significant word
first.
floating point formats
The possibilities for representing floating number
far exceed those of integers. Luckily, nowadays it
has been pretty much standardized. Most of the time
it matches the way common processors are using. So,
a simple cast to float or double
will work, sometimes in combination with a byte
reordering with respect to little and big endian
differences.
Strings
Usually, it is rather easy to recognize strings
in a binary format. One should realize that each
character in a string can either use one or two
bytes (unicode), and that little and big endian
could play a role. There is a difference between
fixed and variable size strings. Fixed strings
take a fixed amouth of space in the file. They
are often filled up with spaces or the
null-character. Variable sized strings, either
can have the length at the start, or use a
termination character (most often the null-character)
to indicate the end of the string. Sometimes, the
next piece of data does not directly start at the
next byte following the string, but could be at
a word or long word boundary.
Packed format
To save space, some binary format may make use
of bit packing, which means that the values stored
in the binary file do not necessary start at byte/word
boundaries. (In C this can be achieved by using the ":"
followed by a number in the struct/class member
definitions.) One such format is the
AutoCAD DWG13 file format.
See the program for some suggestions.
More often it happens that a number of Boolean values
are stored in a single byte or word. Sometimes it
can be helpful to print numbers in their binary
representation.
Rubbish
Be aware that many binary files just contain
chunks of rubbish, especially those that allocate
fixed blocks for storing data in the binary file.
Also memory mapped files could contain large
sections of rubbish. Rubbish can be very misleading
and a great trouble in reverse engineering binary
files, because sometimes it just seems to contain
valid data. On the other hand, something what looks
like rubbish could actually contain some information.
Reading files
When reading a binary file, one often has to access
the file in a random manner, or at least often look
ahead from the current reading position.
Direct random access of a binary file is slow. To
avoid this, it is best (if size permits this) to
store the whole file into a single byte array.
(For an example see my CBuf class, which
I used for reading Quark Xpress
files.)
If size does not permit it, you could try the
Memory Mapped file option. For an example of
this under Windows see the class Mmfile
in the Suneido
sources that uses the functions CreateFileMapping
and MapViewOfFile. Or study my own implementation of an Object-Oriented persistent
store based on Memory Mapped files.
The difference strategy
One way of cracking a binary file format is to make
many, preferable small, files with the application
that contain small changes compared to each other.
For example, you could start to write a document with
the application and each time save a document under
a different name.
You can use a Binary File Compare
utility to compare the files and discover how certain
changes are reflected in the binary file, which gives
you information about the where the information is
stored.

Links
Tools
(Background image was taken from
here)
My life as a hacker |
Software engineering |
home and email address