Difference between revisions of "Linux"
Byteobserver (talk | contribs) (Clarify setting up section) |
Byteobserver (talk | contribs) (→Case study: 45-byte generative music intro) |
Revision as of 15:31, 29 November 2021
Introduction
This section of the sizecoding.org wiki is about creating very small (<=256byte) 32-bit X86 based Linux binaries (ELF format). For X86 related information, please check the main pages on this website, as a lot of the same tricks will also work with X86 Linux sizecoding.
A huge thanks goes out to byteobserver (Xorchitecture (2021) - https://www.pouet.net/prod.php?which=88982) as well as some early work by frag/fsqrt (Lintro (2012) - https://www.pouet.net/prod.php?which=58560) for all their research and hard work in producing tiny ELF binaries for linux.
Alternative methods and expectations
As the development of actual tiny ELF assembler executables on Linux is still in its early days, with only a handful of actual <256-byte ELF productions, let's look at some of the other methods of getting tiny intros onto Linux.
1) Self-compilation tricks (using gcc or python): The executable runs a gcc (or python) compilation of the embedded code and executes the result. This requires GCC and/or a specific version of Python, and potentially dynamically linked libraries, to be installed.
2) Linking a piece of compiled C code to a stripped ELF header + /dev/fb0 setup: This method has been used by The Orz to create several sizecoded procedural graphics entries. For more information about this check out https://github.com/grz0zrg/tinycelfgraphics
So what can we realistically expect from a 256-byte intro on Linux?
Expect a cost of about 100 bytes for the ELF header, setting up fb0, some form of update loop, a frame counter and either an mmap setup or copying via pwrite64 to get you started. If you want audio as well, the available byte budget will shrink even more.
Additionally, since we're dealing with 32-bit code, expect some instructions (especially when dealing with direct (immediate) values) to take up a bit more space.
Let's hope this wiki page will inspire and help people to get started and create newer, better Linux tiny intros ;-)
Setting up
Setting up your development platform for Linux development:
- Suggested Distributions : Any X86-based Linux distribution that allows for execution of 32-bit executables.
- Assembler: NASM (or any other linux compatible 32-bit X86 assembler)
Furthermore, it is important that the user has access to the /dev/fb0 framebuffer. This can be achieved by switching to a virtual (fullscreen) console using Ctrl+Alt+F3/F4 in most distributions, logging in and making sure the user is in the video group. You can test if you are in the video group by running:
$ cp /dev/urandom /dev/fb0
which should cause the screen to fill with white noise before printing "no space left on device". If this is not the case, you can add your user to the video group like so (substituting username for your username):
$ sudo usermod -a -G video username
After doing this you will need to log out and log back in for the changes to take effect.
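If you would rather not fill the screen with noise, a non-destructive check is also possible. The following is a minimal Python sketch, assuming the device lives at /dev/fb0; the helper name can_write_framebuffer is made up for this example.

```python
import os


def can_write_framebuffer(path="/dev/fb0"):
    """Return True if the current user is allowed to write to the device.

    The default path is an assumption; pass another path if your
    framebuffer lives elsewhere.
    """
    return os.access(path, os.W_OK)
```

Calling can_write_framebuffer() from a virtual console should return True once the group change has taken effect.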
Note: Make sure your binary is executable after assembling it, e.g. with chmod +x (or chmod 777 if you want it executable for everyone :D)
System Calls
Interaction with the Linux OS is mostly done via int 0x80 system calls. This usually includes dealing with opening files/framebuffer/audio and handling timers.
A full list of system calls and their expected register arguments is available at: https://syscalls32.paolostivanin.com/
ELF Header Information
Like a 32-bit Windows executable, a 32-bit binary for Linux comes with a pretty hefty header, in this case the ELF header:
bits 32
org 0x00010000
ehdr: ; Elf32_Ehdr
db 0x7F, "ELF", 1, 1, 1, 0 ; e_ident
times 8 db 0
dw 2 ; e_type (executable)
dw 3 ; e_machine (i386)
dd 1 ; e_version
dd _start ; e_entry
dd phdr - $$ ; e_phoff
dd 0 ; e_shoff
dd 0 ; e_flags
dw ehdrsize ; e_ehsize
dw phdrsize ; e_phentsize
dw 1 ; e_phnum
dw 0 ; e_shentsize
dw 0 ; e_shnum
dw 0 ; e_shstrndx
ehdrsize equ $ - ehdr ; size of the ELF header (52 bytes)
phdr: ; Elf32_Phdr
dd 1 ; p_type
dd 0 ; p_offset
dd $$ ; p_vaddr
dd $$ ; p_paddr
dd filesize ; p_filesz
dd filesize ; p_memsz
dd 5 ; p_flags (read + execute)
dd 0x1000 ; p_align
phdrsize equ $ - phdr ; size of one program header entry (32 bytes)
_start:
; your program here
filesize equ $ - $$ ; must stay after the last byte of your program
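When experimenting with the header it helps to dump the fields of an assembled binary. The following is a minimal Python sketch that decodes the 52-byte Elf32_Ehdr with the struct module; the helper name parse_ehdr is made up for this example, and the field order follows the listing above.

```python
import struct

# Little-endian Elf32_Ehdr: 16 bytes of e_ident, then the fields in the
# same order as in the listing (52 bytes in total).
EHDR_FORMAT = "<16sHHIIIIIHHHHHH"
EHDR_NAMES = (
    "e_ident", "e_type", "e_machine", "e_version", "e_entry",
    "e_phoff", "e_shoff", "e_flags", "e_ehsize", "e_phentsize",
    "e_phnum", "e_shentsize", "e_shnum", "e_shstrndx",
)


def parse_ehdr(data):
    """Decode the first 52 bytes of an ELF32 file into a dict of fields."""
    size = struct.calcsize(EHDR_FORMAT)
    values = struct.unpack(EHDR_FORMAT, data[:size])
    return dict(zip(EHDR_NAMES, values))
```

For example, parse_ehdr(open("munch", "rb").read()) would show what the loader sees in a binary called munch, including any code bytes that have been folded into the header fields.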
Luckily, some parts of the ELF header can be repurposed to store data and code. An extensive walkthrough of such header optimisations is available at http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html for those who are interested.
After merging the ehdr and phdr parts and changing the entry point, we can get the header down to around 48 bytes, with a nifty /dev/fb0 string inserted that we'll be able to use later for setting up the framebuffer.
org $00010000
db $7F,"ELF" ; e_ident
dd 1 ; p_type
dd 0 ; p_offset
dd $$ ; p_vaddr
dw 2 ; e_type, p_paddr
dw 3 ; e_machine
dd entry ; e_version, p_filesz
dd entry ; e_entry, p_memsz
dd 4 ; e_phoff, p_flags
fname:
db "/dev/fb0",0 ; e_shoff, p_align, e_flags, e_ehsize
entry:
; this next instruction overlaps with a critical part of the elf header
; it needs to look like XX YY YY YY YY where YYYYYYYY=fname
; so you can change the register to something else or use push
; but the four byte pointer to fname cannot be changed.
mov ebx,fname ; e_phentsize, e_phnum
; e_shentsize, e_shnum, e_shstrndx are below but we can put whatever code/bytes we want there
mov cl,1 ; open flags: O_WRONLY (1) is sufficient for the pwrite64 method (inc ecx also works), O_RDWR (2) is needed for mmap
mov al,5 ; 5 = open syscall
int 0x80 ; open /dev/fb0 = 3
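The field comments in the merged header can be double-checked with a little arithmetic. The following Python sketch assumes the program header starts at file offset 4 (which is what the dd 4 in e_phoff says) and lists which ELF header fields each program header field shares bytes with. It also shows why the pointer to fname has to sit where it does. The function names are made up for this example.

```python
# (name, size in bytes) for both structures, in file order.
EHDR = [("e_ident", 16), ("e_type", 2), ("e_machine", 2), ("e_version", 4),
        ("e_entry", 4), ("e_phoff", 4), ("e_shoff", 4), ("e_flags", 4),
        ("e_ehsize", 2), ("e_phentsize", 2), ("e_phnum", 2),
        ("e_shentsize", 2), ("e_shnum", 2), ("e_shstrndx", 2)]
PHDR = [("p_type", 4), ("p_offset", 4), ("p_vaddr", 4), ("p_paddr", 4),
        ("p_filesz", 4), ("p_memsz", 4), ("p_flags", 4), ("p_align", 4)]


def layout(fields, base=0):
    """Map each field name to its (start, end) byte range in the file."""
    ranges = {}
    pos = base
    for name, size in fields:
        ranges[name] = (pos, pos + size)
        pos += size
    return ranges


def overlapping_fields(phdr_base=4):
    """For each phdr field, list the ehdr fields it shares bytes with."""
    ehdr = layout(EHDR)
    phdr = layout(PHDR, phdr_base)
    return {
        pname: [ename for ename, (es, ee) in ehdr.items() if es < pe and ps < ee]
        for pname, (ps, pe) in phdr.items()
    }


def split_pointer(address):
    """Split a 32-bit pointer into the (e_phentsize, e_phnum) it doubles as."""
    return address & 0xFFFF, address >> 16
```

With org $00010000 and fname at file offset 32, the pointer is 0x00010020, so split_pointer(0x00010020) gives (0x20, 1): a program header entry size of 32 bytes and exactly one program header, which is what the loader needs to see.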
Displaying Graphics
Graphics can be produced by accessing the Linux /dev/fb0 framebuffer. First the framebuffer has to be opened during intro initialisation. After that, you can either copy a piece of memory to it using the pwrite64 syscall (0xb5) or map a piece of memory directly onto the framebuffer using the mmap syscall.
Setting up the framebuffer
The /dev/fb0 framebuffer can best be accessed from a virtual console (Ctrl+Alt+F3/F4 in most distributions).
To make sure your /dev/fb0 framebuffer is set up properly, you can install the fbset tool (e.g. with apt-get) and use it to display and/or alter the framebuffer resolution, as most intros make an assumption about the resolution of your framebuffer.
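The resolution determines how many bytes one frame takes. The following Python sketch assumes your kernel exposes the usual sysfs files /sys/class/graphics/fb0/virtual_size (containing text like 1024,768) and /sys/class/graphics/fb0/bits_per_pixel. If your system differs, take the numbers from fbset instead. The function name is made up for this example.

```python
def framebuffer_bytes(virtual_size_text, bits_per_pixel_text):
    """Return (width, height, bytes per frame) from the two sysfs strings."""
    width, height = (int(v) for v in virtual_size_text.strip().split(","))
    bits_per_pixel = int(bits_per_pixel_text.strip())
    return width, height, width * height * bits_per_pixel // 8
```

Feeding it the contents of the two files gives the value an intro would hardcode as its screen size, for example width*height*4 for a 32-bit mode.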
To test access to the /dev/fb0 framebuffer, you can use the following cat command:
cat /dev/urandom > /dev/fb0
which should produce random noise on the screen (ignoring the "No space left on device" error that is expected from cat).
Alternatively, if you don't like to use the virtual console during the development of your intro, or the framebuffer setup is somehow giving you problems, there is a small fbe.c / fbe binary supplied with the Xorchitecture intro by byteobserver that has an SDL window mmap'ed to tmp/fb0, which you can launch alongside your intro (don't forget to redirect the /dev/fb0 path in your intro to tmp/fb0).
Getting something on screen
First we need to fill our local memory buffer with pixel data, so let's start doing that using the old AND pattern:
mov ecx,width*height
setpixels:
mov ebx,width
mov eax,ecx
cdq
div ebx ; edx = x-coord , eax=y coord
and eax,edx ; and pattern
mov [esp+ecx*4+0],al ; b
mov [esp+ecx*4+1],al ; g
mov [esp+ecx*4+2],al ; r
loop setpixels
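For prototyping the pattern before counting bytes, the same idea can be modelled in Python. This is only a sketch: it assumes 4 bytes per pixel with the fourth byte left at zero, and it walks the pixels from index 0 upward rather than counting ecx down like the loop above. The function name is made up.

```python
def and_pattern(width, height):
    """Fill a BGRx buffer with the x AND y pattern, 4 bytes per pixel."""
    buf = bytearray(width * height * 4)
    for index in range(width * height):
        y, x = divmod(index, width)  # same split as the div instruction
        value = (x & y) & 0xFF       # only al is stored, so keep the low byte
        buf[index * 4:index * 4 + 3] = bytes((value, value, value))  # b, g, r
    return buf
```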
Once your buffer (in this case marked by the esp stack pointer) is filled with pixel data, you can copy it to /dev/fb0 using the pwrite64 syscall like so:
; copy memorybuffer to screen (/dev/fb0) using the pwrite64 syscall
mov ecx,esp ; buffer ptr
mov edx,ebp ; screen size
xor esi,esi ; seek to beginning of screen
xor edi,edi
mov ebx,3 ; fd of framebuffer
mov eax,0xb5 ; pwrite64
int 0x80 ; pwrite64 to framebuffer
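The same call is available from Python as os.pwrite, which is handy for testing a frame generator against the real device or an ordinary file. This is a sketch under the assumption that fd is already open for writing; unlike the assembly it also copes with short writes. The name blit is made up.

```python
import os


def blit(fd, frame, offset=0):
    """Write the whole frame starting at offset, like pwrite64(fd, buf, len, 0)."""
    view = memoryview(frame)
    written = 0
    while written < len(view):
        count = os.pwrite(fd, view[written:], offset + written)
        if count == 0:  # nothing accepted, stop instead of spinning
            break
        written += count
    return written
```

Because pwrite takes an explicit offset, there is no need to seek back to the start of the framebuffer between frames.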
As an alternative to using pwrite64 you can also use mmap (check out intros by The Orz for an example) to map a piece of memory to /dev/fb0. However, mmap has drawbacks: you can get tearing, and you can't realistically do feedback effects without implementing a second buffer, as reading from the mmaped memory is VERY slow.
;mmap(NULL, buflen, PROT_WRITE, MAP_SHARED, fd, 0);
push edx ;edx = 0
push eax ;fd
push byte 1 ;MAP_SHARED
mov al, 90
push eax ;prot: bit 1 must be set for PROT_WRITE. 90 = 01011010b has it set (and 90 is also the mmap syscall number, left in eax for the int below). On x86, PROT_WRITE implies PROT_READ
push width*height*4 ;buffer size
push edx ;NULL
mov ebx, esp ;args pointer
int 80h ;eax <- buffer pointer
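For comparison, here is what the mapping looks like with Python's mmap module. This is a sketch, not a translation of the code above: it asks for PROT_READ as well as PROT_WRITE, and the helper name map_for_drawing is made up. The path is a parameter so it can be tried on an ordinary file first.

```python
import mmap
import os


def map_for_drawing(path, length):
    """Map `length` bytes of `path` as shared, writable memory."""
    fd = os.open(path, os.O_RDWR)  # mmap needs read/write, as noted earlier
    try:
        return mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)  # the mapping keeps its own duplicate of the descriptor
```

Writes such as screen[0:4] = b"\x00\x00\xff\x00" then land directly in the mapped file or device.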
Example Framework
Munching squares
So when we put all the above together, we can get a minimal kind of framework running that will look something like this munching square example provided to us by byteobserver:
; byte.observer's munching square linux example
; assembles with nasm -fbin munch.asm -o munch
width equ 1024
height equ 768
bits 32
org $00010000
db $7F,"ELF" ; e_ident
dd 1 ; p_type
dd 0 ; p_offset
dd $$ ; p_vaddr
dw 2 ; e_type, p_paddr
dw 3 ; e_machine
dd entry ; e_version, p_filesz
dd entry ; e_entry, p_memsz
dd 4 ; e_phoff, p_flags
fname:
db "/dev/fb0",0 ; e_shoff, p_align, e_flags, e_ehsize
entry:
mov ebx,fname ; e_phentsize, e_phnum
inc ecx ; = 1 = O_WRONLY
mov al,5 ; 5 = open syscall
int 0x80 ; open /dev/fb0 = 3
mov ebp,width*height*4 ; ebp = screen size
sub esp,ebp ; make room on the stack for the video memory
mainloop:
mov ecx,ebp ; init pixel index
shr ecx,2 ; divide by bytes per pixel (4)
inc edi ; frame counter
setpixels:
mov ebx,width
mov eax,ecx
cdq
div ebx ; edx = x-coord , eax=y coord
xor eax,edx ; xor pattern
add eax,edi ; make it munch
mov [esp+ecx*4+0],al ; b
mov [esp+ecx*4+1],al ; g
mov [esp+ecx*4+2],al ; r
mov [esp+ecx*4+3],al ; a
loop setpixels
; dump the whole thing to the screen using pwrite64 syscall
mov ecx,esp ; buffer ptr
mov edx,ebp ; screen size
push edi ; save frame counter
xor esi,esi ; seek to beginning of screen
xor edi,edi
mov ebx,3 ; fd of framebuffer
mov eax,0xb5 ; pwrite64
int 0x80 ; pwrite64 to framebuffer
pop edi
jmp mainloop
Adding Sound
It is possible to output digital audio by binding the aplay command into your intro. aplay is available on almost all Linux setups. You can test it by running the following, which should produce some white noise:
$ aplay /dev/urandom
By default, aplay will play 8-bit mono audio at 8000Hz, but the format can be changed easily by specifying arguments. If no filename is passed to aplay, it will read audio data from standard input, which we will use to our advantage.
To use aplay in the context of an intro, there is a bit of setup work involved. The simplest method is to write audio data to standard output and pipe it to aplay on the command line. A cleaner and more self-contained method uses 4 syscalls to start aplay as a child process, so that audio data can then simply be written to the appropriate file descriptor to send it to the speakers.
Method 1: pipe to aplay on the command line
The first method is the simplest, but will not be usable in all circumstances. With this method, instead of running your intro with ./intro, you have to do ./intro | aplay 2>/dev/null, which may not be allowed in some compos. However, with this method you can produce some extremely small generative music intros, including some that are only 45 bytes, where the entire intro fits inside of the ELF header!
Some basic code to produce a bytebeat sound is below. It simply consists of a loop which repeatedly writes a single 8-bit audio sample to standard output using the write syscall (0x4).
audio:
; some bytebeat
inc esi
mov eax,esi
pop ebx ; grab previous sample from stack
add ebx,eax
shr eax,5
or ebx,eax
shr eax,5
and ebx,eax
push ebx ; push next sample to stack
mov ecx,esp ; pointer to audio data (the top of the stack)
xor eax,eax
mov al,4 ; write syscall
xor ebx,ebx
inc ebx ; write to stdout (1)
mov edx,ebx ; write 1 byte
int 0x80
jmp audio
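To preview what the loop above produces without assembling anything, the formula can be modelled in Python. This is a sketch under two assumptions: esi starts at zero, and the value popped on the very first pass is passed in as first (on a real run it is whatever sits on top of the stack). The function name is made up.

```python
MASK32 = 0xFFFFFFFF


def bytebeat(count, first=0):
    """Return `count` samples of ((prev + t) | (t >> 5)) & (t >> 10)."""
    previous = first
    samples = bytearray()
    for t in range(1, count + 1):      # inc esi happens before the first use
        value = (previous + t) & MASK32
        value |= t >> 5
        value &= t >> 10
        previous = value               # pushed back for the next pass
        samples.append(value & 0xFF)   # write sends only the lowest byte
    return bytes(samples)
```

Note that t >> 10 is zero for the first 1023 samples, so at aplay's default 8000 Hz the tune starts with roughly an eighth of a second of silence.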
Case study: 45-byte generative music intro
Below is the complete 45-byte generative music intro. Try saving it in a file called daemon45.asm and running it with:
$ nasm daemon45.asm
$ chmod +x daemon45
$ ./daemon45 0>&1 | aplay -fS32_LE -r16000 -c2
; daemon 45 by byteobserver
bits 32
org $25500000
db $7F,"ELF" ; e_ident
dd 1 ; p_type
dd 0 ; p_offset
dd $$ ; p_vaddr
dw 2 ; e_type, p_paddr
dw 3 ; e_machine
dd entry ; e_version, p_filesz
dw $001a ; e_entry, p_memsz
entry:
push eax
and eax,strict dword 4; e_phoff, p_flags
mov ecx,esp ; e_shoff, p_align
int $80
rdtsc ; e_flags
and edx,eax
loop entry ; e_ehsize
dw $20 ; e_phentsize
db 1 ; e_phnum
; e_shentsize
; e_shnum
; e_shstrndx
Let's pull this apart and see how it works...
The first thing to notice is that the origin is 0x25500000, unlike the origin of 0x00010000 that we used before. This is intentional. 0x50 is the encoding of the push eax instruction, which appears after entry:, and 0x25 is the first byte of the and eax,strict dword 4 instruction. These instructions actually overlap with the e_entry field of the header, so the high two bytes of the entry point (and origin) must match these instructions. The word 0x001a that appears immediately before entry: forms the low two bytes of the entry point, i.e., 0x001a == (entry-$$)&0xffff.
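The byte arithmetic behind this trick is easy to check. The following Python sketch packs the entry address as a little-endian dword and shows that its last two bytes are exactly the two opcodes mentioned above. The constant names are made up for this example.

```python
import struct

ORIGIN = 0x25500000    # org value from the listing
ENTRY_OFFSET = 0x1A    # file offset of the entry: label
PUSH_EAX = 0x50        # opcode of push eax
AND_EAX_IMM32 = 0x25   # opcode of and eax,imm32


def e_entry_bytes(origin=ORIGIN, offset=ENTRY_OFFSET):
    """Return the four bytes stored in the e_entry field."""
    return struct.pack("<I", origin + offset)
```

The result is 1a 00 50 25: the first two bytes are the dw $001a, and the last two are executed as push eax and the start of and eax,strict dword 4.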
The instruction and eax,strict dword 4 may seem odd, since this instruction is encoded as 2504000000, which seems wasteful. However, doing and eax,4 (which only takes 3 bytes) would not work here, because the dword 4 needs to be stored here so that e_phoff (the program header offset) is correct.
The rest of the code also overlaps the header, of course, but the fields that it overlaps are effectively ignored when the program is loaded, so we don't need to worry about setting them to the correct values.
Now, let's strip away the header and see why this code actually produces the sound it does:
entry:
push eax
and eax,strict dword 4
mov ecx,esp
int $80
rdtsc
and edx,eax
loop entry
Clearly, this code is calling some syscall (because of the int $80), but which one is it calling? The syscall number is stored in eax, and because of the instruction and eax,strict dword 4 we know that eax must either be 4 or 0. Syscall 4 is the write syscall, which is what we want. But what is syscall 0? The docs say that syscall 0 is restart_syscall, which is used to "restart a system call after interruption by a stop signal." The man page says "This system call is designed only for internal use by the kernel." And, luckily for us, when called from user space when the only other syscall we are using is write, this has absolutely no effect, and will probably always fail with EINTR. So, depending on eax&4, this code will either invoke the write syscall, or do nothing at all. Which one is it? Well, the rdtsc instruction at the end of the previous pass through the loop loads the low 32 bits of the CPU's cycle counter into eax, and the high 32 bits into edx. So, whether we do a write or not is effectively random depending on the current cycle count. Note that on oldskool platforms, this may be quite deterministic, but on Linux the code is interrupted many times a second, which causes effectively random fluctuations in the cycle count that the program reads.
The write syscall takes three arguments: the file to write to (in ebx), a pointer to the data to write (in ecx), and the amount of data to write (in edx).
The code doesn't even mention ebx at all, and it is zeroed at program start. So this program writes to file descriptor 0, which is standard input. Now, that sounds weird. It writes to standard input? It turns out this is a perfectly fine thing to do, and we can redirect standard input to standard output by using ./daemon45 0>&1 on the command line.
ecx is clearly set to equal esp, so the write syscall will be getting its data from the stack.
edx, the amount of data to write, is a bit more tricky. It is set to the bitwise and of the low 32 and high 32 bits of the CPU's cycle counter (rdtsc). This may seem problematic, because this number might be larger than the size of the stack. However, the write syscall will stop either after writing edx bytes, or when it encounters a memory access violation. So, it doesn't matter if edx is huge, because then it will just write the entire contents of the stack.
You may have noticed that the code has a push instruction but no corresponding pop. Won't it overflow the stack? Yes, eventually. But this takes quite a while.
So, overall, what the program does is push the low 32 bits of the CPU's cycle counter to the stack, then randomly either play a variable-length chunk of the stack as audio or do nothing. This repeats until the stack overflows, which on my machine takes at least half an hour (note: you can change the stack size by running ulimit -s unlimited beforehand, so that it will run until your RAM fills up completely).
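The decision made on each pass can be written down as a tiny model. This Python sketch takes one 64-bit cycle counter reading and returns the syscall number and the byte count that the next int $80 would see. It only models the register arithmetic, not the stack or the audio. The names are made up.

```python
MASK32 = 0xFFFFFFFF
SYS_RESTART_SYSCALL = 0
SYS_WRITE = 4


def next_syscall(tsc):
    """Return (syscall number, edx byte count) for one rdtsc reading."""
    low = tsc & MASK32            # rdtsc puts this in eax
    high = (tsc >> 32) & MASK32   # and this in edx
    return low & 4, high & low    # and eax,4 / and edx,eax
```

So a write happens only when bit 2 of the low half is set, and it requests at most high & low bytes.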
Method 2: pipe,fork,dup2,execve
This approach is as follows. First, we create a pipe using the pipe syscall (0x2a). This syscall takes a pointer to an array of 2 ints, which it fills with the file descriptors of the two ends of the pipe. In the following, we simply overwrite the top of the stack with the file descriptors. The first file descriptor is the read only/output side and the second is the write only/input side.
mov ebx,esp
xor eax,eax
mov al,0x2a ; pipe
int 0x80
Next, we fork the process (syscall 0x2). The child process will be used to exec aplay. If you do this right after creating the pipe, you don't need to zero eax before setting it to 2, because eax should already be zero (indicating that the pipe was created successfully).
mov al,2 ; fork
int 0x80 ; returns eax=0 in child process and eax=childpid in parent process
dec eax
js child
parent:
; code for the rest of your intro goes here
Now, we bind the standard input of the child (which aplay receives audio data from) to the output of the pipe, using the dup2 syscall (0x3f).
child:
xor eax,eax
mov al,0x3f ; dup2
pop ebx ; get file descriptor of output side of pipe
xor ecx,ecx ; stdin is file descriptor 0
int 0x80
The following is optional. aplay will usually print a message describing the parameters of the stream that it is playing. If this interferes with your intro, you can close stderr to stop it from printing, with the close syscall (0x6).
xor eax,eax
mov al,6 ; close
mov bl,2 ; stderr
int 0x80
Finally, we just have to execute aplay with the execve syscall (0xb). Constructing the arguments to this syscall takes a bit of work. Here we are doing it in a simple way which is a bit wasteful. You can save some bytes by constructing the arguments array on the stack.
xor eax,eax ; shouldn't be necessary given the above
mov al,0xb ; execve
mov ebx,aplay ; pointer to aplay filename
mov ecx,args ; pointer to null terminated array of arguments
lea edx,[esp+12] ; get pointer to environ. this assumes nothing has been
; pushed/popped yet, and there are no args passed to your program
; (after the pop ebx in the dup2 step above, it would be [esp+8] instead).
; see here: http://www.mindfruit.co.uk/2012/01/initial-stack-reading-process-arguments.html
; (we are trying to get the beginning of "Environment pointers")
int 0x80 ; nothing after this point will be executed
args:
dd aplay+5
dd 0
aplay:
db "/bin/aplay", 0
Now everything should be set up, and we can start writing audio data with the write syscall (0x4). The following will produce a buzzing sound.
parent:
audioloop:
xor eax,eax
mov al,4 ; write
mov ebx,[esp+4] ; input side of pipe created earlier
mov ecx,esp ; pointer to audio data
mov edx,1 ; length of audio data (in bytes)
int 0x80
inc byte [esp] ; increment the sample
jmp audioloop
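The four syscalls map one-to-one onto functions in Python's os module, which makes it easy to try the plumbing before writing it in assembly. The following is a sketch with made-up names. The program to run is a parameter, so for this section it would be called with "/bin/aplay" and ["aplay"]. One difference from the assembly: Python creates pipe descriptors as close-on-exec, so the child does not keep stray copies of them after execve.

```python
import os


def spawn_with_stdin_pipe(path, argv, env=None):
    """Start `path` with its stdin fed from a pipe; return (pid, write_fd)."""
    read_fd, write_fd = os.pipe()      # pipe: read end first, write end second
    pid = os.fork()                    # fork: 0 in the child, child pid in the parent
    if pid == 0:
        try:
            os.dup2(read_fd, 0)        # dup2: bind the read end to stdin
            os.execve(path, argv, os.environ if env is None else env)
        finally:
            os._exit(127)              # only reached if execve failed
    os.close(read_fd)                  # the parent only needs the write end
    return pid, write_fd
```

After pid, audio_fd = spawn_with_stdin_pipe("/bin/aplay", ["aplay"]), every os.write(audio_fd, samples) goes straight to the player.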
Putting it all together
Combining the above snippets and optimizing a bit, we can arrive at the following 118 byte program which plays a familiar bytebeat track.
bits 32
org $00010000
db $7F,"ELF" ; e_ident
dd 1 ; p_type
dd 0 ; p_offset
dd $$ ; p_vaddr
dw 2 ; e_type, p_paddr
dw 3 ; e_machine
dd entry ; e_version, p_filesz
dd entry ; e_entry, p_memsz
dd 4 ; e_phoff, p_flags
entry:
mov al,0x2a ; pipe
mov ebx,esp ; store output of pipe on stack
int 0x80
lea edx,[ebx+12] ; environ pointer, to be used later
mov ebp, entry ; e_phentsize, this must be here for the ELF header
mov al,2 ; fork
int 0x80 ; returns eax=0 in child process and eax=childpid in parent process
dec eax
js child
audioloop:
pusha
xor eax,eax
mov al,4 ; write
mov ebx,eax ; fd 4 = input side of pipe created earlier (assumes pipe returned fds 3 and 4)
lea ecx,[edx-12] ; pointer to audio data
xor edx,edx
inc edx ; set size to one byte
int 0x80
popa
; some bytebeat
inc esi
mov eax,esi
pop ebx
add ebx,eax
shr eax,5
or ebx,eax
shr eax,5
and ebx,eax
push ebx
jmp audioloop
child:
inc eax ; eax is -1 after the dec above, so this resets it to 0
mov al,0x3f ; dup2
pop ebx ; get file descriptor of output side of pipe
; ecx is already zero
int 0x80
mov al,0xb ; execve
lea ebx,[ebp+((aplay+5-entry)&0xff)] ; pointer to "aplay"
push 0 ; null terminator for args list
push ebx ; pointer to "aplay" aka argv[0]
; if you want to add more args to aplay, you can push pointers to them here
mov ecx,esp ; pointer to null terminated array of arguments
mov bl,(aplay-$$)&0xff ; pointer to "/bin/aplay"
; edx is already set up as the environ pointer
int 0x80 ; nothing after this point will be executed
aplay:
db "/bin/aplay"
; no null terminator is necessary because memory past the end of the file is always zero
Can you make this smaller? Feel free to edit it!
Additional Resources
- Pouet: 256byte productions on Linux
- Pouet: 128byte productions on Linux
- A dev/fb0 framebuffer binding + ELF header for small C programs
- A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux
Larger productions (1k and 4k intros)
Creating 1k and 4k intros on Linux usually requires a different setup. For more information on this, check out the following links: