WHY2025: Reviewing live-bootstrap

Linux is an open-source operating system build from available sources. For closed source operating systems, such as Windows and MacOS, there is no way to verify that it is reliable.

Although the sources of Linux are all known, a compiler is needed for building the Linux kernel from the C-source files into executables for the target platform.

A compiler is a program which translates a program from an input language to an output language. The input language is often a high-level, human readable language, while the output language is a low-level language, including machine code, which can be executed by the CPU. A compiler, like any other program, is programmed in a language, usually in machine code, or otherwise, depends on some other program based on machine code.

A malicious compiler could insert a back-door in some executables or libraries of the Linux kernel. The solution to use this compiler to compile itself from original sources, does not work, because it could recognise when it is compiled by itself and duplicate the malicious code in the resulting compiler, making it malicious as well.

In 1984 Ken Thompson already presented the idea of a malicious compiler during the Turing Award Lecture Reflections on Trusting Trust.

Machine code is usually in a binary form that is not easily readable by humans. This poses a problem for a non-trivial program, such as an optimising compiler. Although reverse engineering executables is possible, it is a lot of work.

The live-bootstrap approach

The live-bootstrap approach is to build the necessary tools for compiling the Tiny C Compiler in a number of steps from a small piece of machine code. Next the Tiny C Compiler, again with many steps, is used to compile GNU C Compiler 4.0.4, 4.7.4 (the last one written in C), 10,4.0, and finally 13.1.0. This last compiler can be used to compiler the Linux kernel sources.

hex0-seed

The small piece of machine code is a program called hex0, which converts a file with hexadecimal numbers into a binary file. This program is specific for the target CPU for which Linux needs to be build. These programs can be found in the bootstrap-seeds repository. In the README.md file it states: 'NEVER TRUST ANYTHING IN HERE'. For x86 (32 bits) the hexadecimal representation of this program is given in the file hex0_x86.hex0, meaning that if this file is compiled with hex0 it will return hex0, assuming that it is not a malicious variant of hex0. The hex0 program requires two arguments, the names of the input and output files. The function of the program can also be achieved with following command line:

sed 's/[;#].*$//g' $input_file | xxd -r -p > $output_file

This makes use of the stream editing program sed and the xxd (hex dump) program that can be used to convert, in both directions, between binary files and hexadecimal files.

The important question is: Can we verify hex0? First we should verify that it indeed does what it does through an independent verify cations.

Independent verification

Brainfuck is an esoteric program language with only eight commands, represented by the letters '+-<>[].,'. I wanted to see if it was possible to write a Brainfuck program to replace hex0. I wrote some JavaScript to generate a program from a simple programming language: BF generator. I verified that with some Brainfuck interpreter running it on hex0_x86.hex0 did produce a file equal to hex0.

Blog posts:

Verify hex0_x86.hex0

For executables, Linux makes use of Executable and Linkable Format (ELF) files. These files start with a header, followed by the program header table, an number of machine code and/or data segments, and (optionally) a number of section header tables, which include symbol tables with information that can be used by a debugger.

To interpret the machine code in the body of the ELF file, the following references can be used:

Because the program requires the help of the operating system to read and write to files, it is also important to verify the system calls that are used. For information about the system calls, see for example:

Linux System Call Table

Review of the other sources

I decided to look some deeper into live-bootstrap to figure out which programs are executed and which files are being read. There is a global description of all the steps that is found in parts.rst. live-bootstrap comes with a script to execute all steps and a minimal shell, called the kaem shell, to execute the script. I wrote a program, kaem_parser.cpp, to parse the kaem files, which generates the HTML page live-bootstrap with all the information. To determine which header files are used when compiling C programs, the parser also implements a C preprocessor to parse the C files to find all the header files. At some point the bash shell is compiled and used to execute bash scripts. One would have to write a bash interpreter to parse the files involved in the remaining steps.

Blog posts:

Execute with emulator

To even dive deeper, I decided to implement an x86 emulator that implements only the x86 instructions that are actually used. The emulator also implements systems calls and supports multiple processes. I decided to simply map file operations on file operations on the file system, instead of simulating those in memory.

This allowed me also to verify the code of hex0 and to see if no other binary seeds were used and/or executed.

The development of the emulator required some very complicated debugging, but it was also insightful.

The emulator turned out to be too slow to compile the GNU Mes compiler, which is the C compiler used to compile a stripped version of the Tiny C Compiler.

This also means, I even got less far with this approach than the previous.

Blog posts:

Using strace

With strace command it is possible to trace system calls that are performed during the execution of a command. I wrote a shell script to execute all the steps with an alternative root (using the chroot command) and trace relevant system calls. I also wrote a program, called scan_trace.cpp to process the produced log file and generate an HTML page listing the processes and source files. It also generates a JSON file with similar information, which is used to generate a T-diagram.

T-diagram

A T-diagram (or tombstone diagram) uses T-shaped outlines that explain the languages that are involved in a program. It shows the source language (on the left), the target language (on the right) and the implementation language (on the bottom). In the above diagram it show how a program in C for reading hexadecimal data to a binary file is compiled to an executable that does the same. Notice how the execution of the compiler changes the implementation language of the source program into the implementation language of the target program.

GNU Mes Compiler

GNU Mes consist of Scheme (a dialect of the LISP family) interpreter programmed in a subset of C and a C compiler written in Scheme. It is a rather complete compiler including a separate linker. I get the idea that it is on par with the Tiny C compiler, which does not need a separate linker. It can compile it own sources without a separate linker. The MES compiler is rather slow, which is understandable because it uses an interpreting parser running an interpreting compiler. It is also slow because it compiles many individual C files into object files, which are linked at the end. Every time all the Scheme files are loaded before the compiler can start its actual work.

The GNU Mes compiler requires a compiler for a minimal subset of C. Because of this the bootstrap does contain include a subset of C compiler and a partial implementation of the C preprocessor. These are written in another subset of C for which there is compiler written in machine code, one for each supported CPU. I have the impression that the GNU Mes compiler is a bit of overkill for the gap it needs to bridge.

On several forums, I have read people raising the question why Forth was not used in the implementation instead of LISP.

I made some investigation and attempts to bridge the gap. To see if it would be possible to implement a compiler with M2-Mesoplanet. I encountered some bugs in M2-Mesoplanet.

Blog posts:

Observations

Although hex0 is short, it is also a bit cryptic. It looks like it was hand coded and that it uses different patterns for similar operations. It also uses registers to store certain variables. Besides having to verify hex0, one also has to analyse the following assembly programs:

hex1_x86.hex0: 329 lines
hex2_x86.hex1: 626 lines
M0_x86.hex2: 869 lines
cc_x86.M1: 4983 lines

Intermediate language

The above example shows how a single assignment statement in C can be translated to the intermediate language, and how this mapped on M1 assembly, which can than be converted into hexedecimal numbers accomplished with comments showing the relationship with the intermediate langeuage. This makes it possible to see the relationship between C statement and the hexadecimal numbers.

Blog posts:

Properties

The intermediate language is a stack base language, like Forth, that would be close to C and simple to compile. It has a simple method for defining global and local variables. It has a similar block structure as C with respect to local variables. It has functions and some support for structs.

The compilation to M1 makes use of two stacks. One stack contains the intermediate values, including the parameter values passed between functions, which makes use of the normal stack. The other stack contains the return address for function calls and the local variables, which uses the ebp register as a base offset.

Because it is intended to be used as a target language for a C compiler, it uses only C keywords to avoid possible clashes with variable names in C programs. This has lead to some keyword abuse, such as 'void' being used to define a function, 'int' being used to define a variable, and 'const' being used to define a numeric constant.

Blog posts:

C compiler for TCC

The goal for this C compiler is that it is able to compile the TCC sources and also itself, the stack_c compiler, the assmbler and the hexadecimal to binary converter.

A large part of the compiler involves the C preprocessor that is implemented with several character and token iterators. This was also an experiment to see if it would be possible to implement the preprocessor with iterators.

Although you need an LR-parser if you want to parse C purely based on the syntax, it appears that you can write an LL(1), recursive decent parser, if you know the kind of the identifier, whether it is a typedef or not.

It appears that it is possible to generate code during the parser, only using abstract syntax trees.

Blog posts:

Reviewing live-bootstrap

Contents

Can we trust Linux?

The live-bootstrap approach

hex0-seed

Independent verification

Verify hex0_x86.hex0

Review of the other sources

Execute with emulator

Using strace

T-diagram

GNU Mes Compiler

Observations

Intermediate language

Properties

C compiler for TCC