Linux is an open-source
operating system build from available sources. For closed source operating
systems, such as Windows and MacOS, there is no way to verify that it is reliable.
Although the sources of Linux are all known, a compiler is needed for building
the Linux kernel from the C-source files into executables for the
target platform.
A compiler is a program which translates a program from an input language to
an output language. The input language is often a high-level, human readable
language, while the output language is a low-level language, including
machine code, which can be executed by the CPU. A compiler, like any
other program, is programmed in a language, usually in machine code, or
otherwise, depends on some other program based on machine code.
A malicious compiler could insert a back-door in some executables or libraries
of the Linux kernel. The solution to use this compiler to compile itself from
original sources, does not work, because it could recognise when it is compiled by
itself and duplicate the malicious code in the resulting compiler, making it
malicious as well.
Machine code is usually in a binary form that is not easily readable by
humans. This poses a problem for a non-trivial program, such as an optimising
compiler. Although reverse engineering executables is possible, it is a lot
of work.
The live-bootstrap approach
Start with a tiny piece of machine code
Build necessary tools from this
To compile the Tiny C Compiler
gcc 4.0.4 → gcc 4.7.4 → gcc 10.4.0 → gcc 13.1.0
The live-bootstrap
approach is to build the necessary tools for compiling the Tiny C Compiler in a number of steps from a small piece of machine code. Next the
Tiny C Compiler, again with many steps, is used to compile GNU C Compiler 4.0.4,
4.7.4 (the last one written in C), 10,4.0, and finally 13.1.0. This last
compiler can be used to compiler the Linux kernel sources.
hex0-seed
The bootstrap-seeds repository
NEVER TRUST ANYTHING IN HERE
hex0_x86.hex0
> hex0-seed hex0_x86.hex0 hex0
Can we verify hex0?
The small piece of machine code is a program called hex0, which converts a file with
hexadecimal numbers into a binary file. This program is specific for the
target CPU for which Linux needs to be build. These programs can be found in
the bootstrap-seeds
repository. In the README.md file it states: 'NEVER TRUST ANYTHING IN
HERE'. For x86 (32 bits)
the hexadecimal representation of this program is given in the file hex0_x86.hex0, meaning that if this file is compiled with hex0 it will
return hex0, assuming that it is not a malicious variant of hex0. The hex0
program requires two arguments, the names of the input and output files. The
function of the program can also be achieved with following command line:
sed 's/[;#].*$//g' $input_file | xxd -r -p > $output_file
This makes use of the stream editing program sed and the xxd
(hex dump) program that can be used to convert, in both directions, between
binary files and hexadecimal files.
The important question is: Can we verify hex0? First we should verify that it
indeed does what it does through an independent verify cations.
Independent verification
hex0 in Brainfuck
Esoteric programming language
Only eight commands: + - < > [ ] . ,
Verified hex0 is binary form of hex0_x86.hex0
Is this enough?
Brainfuck is an esoteric
program language with only eight commands, represented by the letters
'+-<>[].,'. I wanted to see if it was possible to write a Brainfuck
program to replace hex0. I wrote some JavaScript to generate a program from a
simple programming language: BF generator.
I verified that with some Brainfuck interpreter running it on hex0_x86.hex0 did
produce a file equal to hex0.
For executables, Linux makes use of Executable and Linkable Format (ELF) files. These files start with a
header, followed by the program header table, an number of machine code and/or data
segments, and (optionally) a number of section header tables, which include
symbol tables with information that can be used by a debugger.
To interpret the machine code in the body of the ELF file, the following
references can be used:
Because the program requires the help of the operating system to read and write
to files, it is also important to verify the system calls that are used. For
information about the system calls, see for example:
# SPDX-FileCopyrightText: 2019 Jeremiah Orians
# SPDX-FileCopyrightText: 2022 Andrius Štikonas
#
# SPDX-License-Identifier: GPL-3.0-or-later
## ELF Header
#:ELF_base
7F 45 4C 46 # e_ident[EI_MAG0-3] ELF's magic number
01 # e_ident[EI_CLASS] Indicating 32 bit
01 # e_ident[EI_DATA] Indicating little endianness
01 # e_ident[EI_VERSION] Indicating original elf
03 # e_ident[EI_OSABI] Set at 3 because FreeBSD is strict
00 # e_ident[EI_ABIVERSION] Set at 0 because no one cares
00 00 00 00 00 00 00 # e_ident[EI_PAD]
02 00 # e_type Indicating Executable
03 00 # e_machine Indicating x86
01 00 00 00 # e_version Indicating original elf
54 80 04 08 # e_entry Address of the entry point
34 00 00 00 # e_phoff Address of program header table
00 00 00 00 # e_shoff Address of section header table
00 00 00 00 # e_flags
34 00 # e_ehsize Indicating our 52 Byte header
20 00 # e_phentsize size of a program header table
01 00 # e_phnum number of entries in program table
00 00 # e_shentsize size of a section header table
00 00 # e_shnum number of entries in section table
00 00 # e_shstrndx index of the section names
## Program Header
#:ELF_program_headers
#:ELF_program_header__text
01 00 00 00 # ph_type: PT-LOAD = 1
00 00 00 00 # ph_offset
00 80 04 08 # ph_vaddr
00 80 04 08 # ph_physaddr
00 01 00 00 # ph_filesz
00 01 00 00 # ph_memsz
07 00 00 00 # ph_flags: PF-X|PF-W|PF-R = 7
01 00 00 00 # ph_align
#:ELF_text
# Where the ELF Header is going to hit
# Simply jump to _start
# Our main function
# :_start ; (0x8048054)
58 ; pop_eax # Get the number of arguments
5B ; pop_ebx # Get the program name
5B ; pop_ebx # Get the actual input name
31C9 ; xor_ecx,ecx # prepare read_only, ecx = 0
31D2 ; xor_edx,edx # Extra sure, edx = 0
6A 05 ; push !5 # prepare to set eax to 5
58 ; pop_eax # the syscall number for open()
CD 80 ; int !0x80 # Now open that damn file
89C6 ; mov_esi,eax # Preserve the file pointer we were given
5B ; pop_ebx # Get the actual output name
66B9 4102 ; mov_cx, @577 # Prepare file as O_WRONLY|O_CREAT|O_TRUNC
66BA C001 ; mov_dx, @448 # Prepare file as RWX for owner only (700 in octal)
6A 05 ; push !5 # prepare to set eax to 5
58 ; pop_eax # the syscall number for open()
CD 80 ; int !0x80 # Now open that damn file
89C2 ; mov_edx,eax # Preserve the file pointer we were given
# Our flag for byte processing
6A FF ; push !-1
5D ; pop_ebp # ebp = -1
# temp storage for the sum
31FF ; xor_edi,edi # edi = 0
#:loop ; (0x8048077)
# Read a byte
E8 68000000 ; call %Read_byte
# process byte
E8 1B000000 ; call %hex
# Deal with -1 values
85C0 ; test_eax,eax
7C F2 ; jl !loop
# deal with toggle
85ED ; test_ebp,ebp # jump if ebp >= 0
7D 06 ; jge !print
# process first byte of pair
89C7 ; mov_edi,eax
31ED ; xor_ebp,ebp # ebp = 0
EB E8 ; jmp !loop
# process second byte of pair
#:print ; (0x804808F)
# update the sum and store in output
C1E7 04 ; shl_edi, !4
01F8 ; add_eax,edi
# flip the toggle
4D ; dec_ebp # ebp = -1
E8 39000000 ; call %write_byte
EB DB ; jmp !loop
#:hex ; (0x804809C)
# Purge Comment Lines (#)
3C 23 ; cmp_al, !35
74 1E ; je !purge_comment
# Purge Comment Lines (;)
3C 3B ; cmp_al, !59
74 1A ; je !purge_comment
# deal all ascii less than 0
3C 30 ; cmp_al, !48
7C 1F ; jl !ascii_other
# deal with 0-9
3C 3A ; cmp_al, !58
7C 1F ; jl !ascii_num
# deal with all ascii less than A
3C 41 ; cmp_al, !65
7C 17 ; jl !ascii_other
# deal with A-F
3C 47 ; cmp_al, !71
7C 1C ; jl !ascii_high
# deal with all ascii less than a
3C 61 ; cmp_al, !97
7C 0F ; jl !ascii_other
# deal with a-f
3C 67 ; cmp_al, !103
7C 12 ; jl !ascii_low
# The rest that remains needs to be ignored
EB 09 ; jmp !ascii_other
#:purge_comment ; (0x80480BE)
# Read a byte
E8 21000000 ; call %Read_byte
# Loop if not LF
3C 0A ; cmp_al, !10
75 F7 ; jne !purge_comment
# Otherwise return -1
#:ascii_other ; (0x80480C7)
6A FF ; push !-1
58 ; pop_eax # return -1
C3 ; ret
#:ascii_num ; (0x80480CB)
2C 30 ; sub_al, !48
C3 ; ret
#:ascii_low ; (0x80480CE)
2C 20 ; sub_al, !32 # convert to uppercase
#:ascii_high ; (0x80480D0)
2C 37 ; sub_al, !55
C3 ; ret
# Writes byte stored in al
#:write_byte ; (0x80480D3)
# Print our Hex
89D3 ; mov_ebx,edx # Where are we writing to
52 ; push_edx # protect fout
6A 01 ; push !1 # prepare to set edx to 1
5A ; pop_edx # set the size of chars we want
50 ; push_eax # Move output to stack
89E1 ; mov_ecx,esp # What we are writing
6A 04 ; push !4 # prepare to set eax to 4
58 ; pop_eax # the syscall number for write
CD 80 ; int !0x80 # call the Kernel
5B ; pop_ebx # deallocate stack
5A ; pop_edx # restore fout
C3 ; ret
#:Read_byte ; (0x80480E4)
# Attempt to read 1 byte from Input file
52 ; push_edx # protect fout
6A 01 ; push !1 # prepare to set edx to 1
5A ; pop_edx # set the size of chars we want
57 ; push_edi # allocate stack
89E1 ; mov_ecx,esp # Where to put it
89F3 ; mov_ebx,esi # Where are we reading from
6A 03 ; push !3 # prepare to set eax to 3
58 ; pop_eax # the syscall number for read
CD 80 ; int !0x80 # call the Kernel
85C0 ; test_eax,eax # check what we got
74 03 ; je !Done # Got EOF call it done
# load byte
58 ; pop_eax # load char
5A ; pop_edx # restore fout
C3 ; ret
#:Done ; (0x80480F9)
# program completed Successfully
31DB ; xor_ebx,ebx # All is well, ebx = 0
6A 01 ; push !1
58 ; pop_eax # put the exit syscall number in eax
CD 80 ; int !0x80 # Call it a good day
#:ELF_end
I decided to look some deeper into live-bootstrap to figure out which programs
are executed and which files are being read. There is a global description of
all the steps that is found in parts.rst. live-bootstrap comes with a script to execute all steps and a
minimal shell, called the kaem shell, to execute the script. I wrote a
program, kaem_parser.cpp, to parse
the kaem files, which generates the HTML page live-bootstrap with all the information. To determine which header files
are used when compiling C programs, the parser also implements a C preprocessor to parse the C files to find all the header files. At some
point the bash shell is compiled and used to execute bash scripts. One would
have to write a bash interpreter to parse the files involved in the remaining
steps.
Too slow for compiling GNU Mes Compiler,
which is used to compile the Tiny C Compiler
To even dive deeper, I decided to implement an x86 emulator that implements
only the x86 instructions that are actually used. The emulator also implements
systems calls and supports multiple processes. I decided to simply map file
operations on file operations on the file system, instead of simulating those
in memory.
This allowed me also to verify the code of hex0 and to see if no other binary
seeds were used and/or executed.
The development of the emulator required some very complicated debugging, but
it was also insightful.
The emulator turned out to be too slow to compile the GNU Mes compiler, which is the C compiler used to compile a stripped
version of the Tiny C Compiler.
This also means, I even got less far with this approach than the previous.
With strace
command it is possible to trace system calls that are performed during the
execution of a command. I wrote a shell script to execute all the steps with an alternative root (using the
chroot
command) and trace relevant system calls. I also wrote a program, called
scan_trace.cpp to process the produced log file and generate
an HTML page
listing the processes and source files. It also generates
a JSON
file with similar information, which is used to generate
a T-diagram.
A T-diagram (or tombstone diagram) uses T-shaped outlines that explain the languages that
are involved in a program. It shows the source language (on the left), the
target language (on the right) and the implementation language (on the bottom).
In the above diagram it show how a program in C for reading hexadecimal data
to a binary file is compiled to an executable that does the same. Notice how
the execution of the compiler changes the implementation language of the source
program into the implementation language of the target program.
Legend:
Red: the binary seeds
Green box: Programs run by
C-1, C-2: subset of C languages
GNU Mes Compiler
C compiler in Scheme (LISP)
Scheme interpreter in a subset of C language
Preceded by two subset of C compilers
Probably on the level of Tiny C Compiler
Compiles rather slow
The odd one out
Why not Forth?
GNU Mes replacement
GNU Mes consist of Scheme (a dialect of the LISP family) interpreter programmed in a subset
of C and a C compiler written in Scheme. It is a rather complete compiler
including a separate linker. I get the idea that it is on par with the Tiny C
compiler, which does not need a separate linker. It can compile it own sources
without a separate linker. The MES compiler is rather slow, which is
understandable because it uses an interpreting parser running an interpreting
compiler. It is also slow because it compiles many individual C files into
object files, which are linked at the end. Every time all the Scheme files are
loaded before the compiler can start its actual work.
The GNU Mes compiler requires a compiler for a minimal subset of C. Because of
this the bootstrap does contain include a subset of C compiler and a partial
implementation of the C preprocessor. These are written in another subset of C for which there
is compiler written in machine code, one for each supported CPU. I have the
impression that the GNU Mes compiler is a bit of overkill for the gap it needs
to bridge.
On several forums, I have read people raising the question why Forth was not
used in the implementation instead of LISP.
I made some investigation and attempts to bridge the gap. To see if it would
be possible to implement a compiler with M2-Mesoplanet. I encountered some bugs
in M2-Mesoplanet.
Although hex0 is short, it is also a bit cryptic. It looks like it was hand
coded and that it uses different patterns for similar operations. It also
uses registers to store certain variables. Besides having to verify hex0, one
also has to analyse the following assembly programs:
The above example shows how a single assignment statement in C can be
translated to the intermediate language, and how this mapped on M1 assembly,
which can than be converted into hexedecimal numbers accomplished with
comments showing the relationship with the intermediate langeuage. This makes
it possible to see the relationship between C statement and the hexadecimal
numbers.
The intermediate language is a stack base language, like Forth, that would be close to C and simple to compile. It has a simple
method for defining global and local variables. It has a similar block
structure as C with respect to local variables. It has functions and some
support for structs.
The compilation to M1 makes use of two stacks. One stack contains the
intermediate values, including the parameter values passed between functions,
which makes use of the normal stack. The other stack contains the return
address for function calls and the local variables, which uses the ebp
register as a base offset.
Because it is intended to be used as a target language for a C compiler, it
uses only C keywords to avoid possible clashes with variable names in
C programs. This has lead to some keyword abuse, such as 'void' being used to
define a function, 'int' being used to define a variable, and 'const' being
used to define a numeric constant.
Goal: Being able to compile the Tiny C Compiler sources
Build-in preprocessor based on iterators
Single pass recursive decent parser
On the fly code generation to stack language
(Only AST for expressions)
Self-hosting: Can compile itself
Now working on compiling TCC sources
The goal for this C compiler is that it is able to compile the TCC sources
and also itself, the stack_c compiler, the assmbler and the hexadecimal to
binary converter.
A large part of the compiler involves the C preprocessor that is implemented
with several character and token iterators. This was also an experiment to see
if it would be possible to implement the preprocessor with iterators.
Although you need an LR-parser if you want to parse C purely based on the
syntax, it appears that you can write an LL(1), recursive decent parser, if
you know the kind of the identifier, whether it is a typedef or not.
It appears that it is possible to generate code during the parser, only using
abstract syntax trees.