In this page such a grammar, named BFF, is described. It has several construct that are not traditionally found in context free grammars for programming language. Due to the nature of binary file formats, it is important to be able to reference information that has been read before. For example, a string of characters might be preceded by a number that indicates the lengt of the string.
The terminal symbols of the grammar consist of a number of bytes, representing one of the basic data types, such as: char, short int, long int, float and double. Differences in byte ordering for integers, and the different formats for floating point numbers should be taken into account.
Due to the nature of binary formats, it is not too restrictive to use only recursive descent grammars, e.i., grammars that can be parsed top-down, and belong to the LL(1) class.
The BFF is tested for the DWG file format. As we start with this file format, BFF will naturally first focus on the requirements based on this format, and because of this it will be slightly biased.
To specify the grammar of BFF we will use a form of extended BNF.
This tool should also support the reverse engineering of binary file formats.
In a later state, a tool could be made that generates a parser and the needed data structures for reading a binary file into memory according to the grammar. As it is not always required (nor possible) to read the whole file into memory, it should be possible to generate procedures to read the file interactively.
I never found the time (nor the need) for developing this tool. For a good professional alternative have a look at 010 Editor - Hex Editor, which makes use of a template language which has similar features as proposed here.
type word := byte : first, byte : second return ((word)first | ((word)second << 8)).A word representation where the lower order byte comes before the higher order is usually used by small Endian processors. The expression used on will be based on C. We assume that the following types have been defined on top of the default types of C:
typedef unsigned char byte; typedef unsigned short int word; typedef unsigned long int longword;This leeds us to the definition of the basic types that will be supported:
C_data_types := "char" | "byte" | "short" | "word" | "long" | "longword" | "float" | "double" .(We assume for the moment that float and double represent floating numbers of 4, respectively 10 bytes.) The grammar of the rule used for defining types is:
type_def_rule := "typedef" C_data_type basic_type_name ":=" ("byte" ":" byte_name) LIST "return" expr "." .Here expr stands for C-like expression using the byte_names as they are used in the rule. The basic_type_names should not be confused with the C_type_names. It is possible that the same name is both used as a basic_type_name and a C_type_name.
BFF_grammar := type_def_rule SEQ "root" rule rule SEQ.Each rule has a non-terminal symbol on the left-hand side, and a list of elements on the right-hand side. Each element is either a elementary element, a non-terminal symbol, or a grouping of elements. Because BFF assumes a top-down parsing method, it is possible to give each non-terminal symbol a number of parameters. This leads to the following grammar for the rules:
rule := non_term_name ( "(" param LIST ")" )OPT ":=" elem LIST ".".Each element consist of the following parts:
elem := range OPT data_type ( "[" expr "]" | "*" ) ( ":" elem_name )OPT ( "=" C_expr )OPT. range := "[" file_pos (":" file_pos)OPT "]". file_pos := "begin" | "end" | "cur" | expr. data_type := "(" elem LIST ")" | basic_type_name | non_term_name ( "(" expr LIST ")" )OPT.