Floating-point Opcodes
The FPU offers many operations that are not available on the classic x86 CPU, like SIN, COS, ATAN, SQRT, etc. SIMPLY FPU by Raymond Filiatreault has a compact overview of all FPU commands. Using and communicating with the FPU is a bit unusual and takes some getting used to. It's recommended to first read about the creation of the snippet we want to modify. This is what it looks like originally:
cwd ; "clear" DX for perfect alignment
mov al,0x13
X: int 0x10 ; set video mode AND draw pixel
mov ax,cx ; get column in AH
add ax,di ; offset by framecounter <-- REPLACE THIS WITH FPU CODE
xor al,ah ; the famous XOR pattern
and al,32+8 ; a more interesting variation of it
mov ah,0x0C ; set subfunction "set pixel" for int 0x10
loop X ; loop 65536 times
inc di ; increment framecounter
in al,0x60 ; check keyboard ...
dec al ; ... for ESC
jnz X ; rinse and repeat
ret ; quit program
And this is what it looks like once the marked instruction is replaced with FPU code:
cwd ; "clear" DX for perfect alignment
mov al,0x13
X: int 0x10 ; set video mode AND draw pixel
mov ax,cx ; get column in AH
fninit ; init FPU first
mov [si],ax ; write first addend to a memory location
fild word [si] ; F(PU) I(nteger) L(oa)D: push a WORD from the memory location onto the FPU stack
mov [si],di ; write second addend to a memory location
fiadd word [si] ; add the word at the memory location directly to the top of the FPU stack
fist word [si] ; F(pu) I(nteger) ST(ore) the result into a memory location
mov ax,[si] ; Get the word from the memory location into AX
xor al,ah ; the famous XOR pattern
and al,32+8 ; a more interesting variation of it
mov ah,0x0C ; set subfunction "set pixel" for int 0x10
loop X ; loop 65536 times
inc di ; increment framecounter
in al,0x60 ; check keyboard ...
dec al ; ... for ESC
jnz X ; rinse and repeat
ret ; quit program
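To see what the FPU round trip computes compared to the plain `add ax,di`, here is a minimal Python sketch (Python rather than assembly, so it runs anywhere). The names `to_signed16`, `cpu_add` and `fpu_add` are made up for this illustration. It assumes the default FPU state after FNINIT, where the invalid-operation exception is masked: in that state FIST stores the "integer indefinite" value 0x8000 when the result does not fit into a signed word, instead of wrapping around.

```python
def to_signed16(value):
    """Interpret the low 16 bits as a signed word, the way FILD word does."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def cpu_add(ax, di):
    """add ax,di: plain 16-bit addition that wraps around."""
    return (ax + di) & 0xFFFF


def fpu_add(ax, di):
    """Model of the fild word / fiadd word / fist word round trip."""
    total = to_signed16(ax) + to_signed16(di)   # exact inside the FPU
    if not -32768 <= total <= 32767:
        return 0x8000                           # integer indefinite on overflow
    return total & 0xFFFF


print(hex(cpu_add(0x1234, 5)), hex(fpu_add(0x1234, 5)))
print(hex(cpu_add(0x7FFF, 2)), hex(fpu_add(0x7FFF, 2)))
```

In this model both versions agree as long as the signed sum stays within [-32768,32767]; only sums outside that range come back as 0x8000 from the FPU path.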
The usual interaction with the FPU is as follows:
- F(N)INIT: initialization of the FPU
- store register content in memory location(s)
- transfer from memory location onto FPU stack
- actual calculations on the FPU (more on this soon)
- transfer from FPU stack into memory location(s)
- get register from memory location
That is a lot of extra code for a single integer addition, but once more complex floating-point operations are involved, it starts to pay off. For more advanced FPU operations, let's start from scratch with an unoptimized program that plots the distance of each pixel to the screen center as a color, in 49 bytes.
push 0a000h
pop es ; get start of video memory in ES
mov al,0x13 ; switch to video mode 13h
int 0x10 ; 320 * 200 in 256 colors
fninit ; -
; it's useful to comment what's on the
; stack after each FPU operation
; to not get lost ;) start is : empty (-)
X:
xor dx,dx ; reset the high word before division
mov bx,320 ; 320 columns
mov ax,di ; get screen pointer in AX
div bx ; divide DX:AX by 320: row (Y) in AX, column (X) in DX
sub ax,100 ; subtract the origin
sub dx,160 ; = (160,100) ... center of 320x200 screen
mov [si],ax ; move Y into a memory location
fild word [si] ; Y
fmul st0 ; Y²
mov [si],dx ; move X into a memory location
fild word [si] ; X Y²
fmul st0 ; X² Y²
faddp st1,st0 ; Y²+X² (add and pop, same size, so no value is left behind on the FPU stack)
fsqrt ; R
fistp word [si] ; -
mov ax,[si] ; get the result from memory
stosb ; write to screen (DI) and increment DI
jmp short X ; next pixel
A few words on this:
- The FPU registers (st0, st1, ...) are organized as a stack. When you load something onto the FPU, everything else implicitly moves one location further away from the top. Some FPU instructions work only on the top, others allow explicit parametrization with arbitrary FPU registers.
- Depending on what you do, F(N)INIT can sometimes be omitted. Real hardware will refuse to work more often than emulators, but it's always worth a try.
- Accessing memory in a size-efficient way can be a real pain. The safest way is to reference absolute memory locations (e.g. [1234]), but that costs two bytes more per instruction than referencing memory with [BX], [SI], [BX+SI], [BP+DI], [BP+SI], [DI] or [BX+DI]. When working with the FPU and this classic approach to FPU communication, you have to design your code flow so that one or more of these locations are available.
- Memory is always accessed relative to the segment register DS unless you use segment overrides. When accessing memory with [BP+??], be aware that it is accessed relative to the segment register SS (see 4.6.2.2 The Register Indirect Addressing Modes).
- There are a few conventions that help you identify FPU commands. "i" stands for integer (WORD or DWORD) and "p" means "pop the stack afterwards", so FST means just "store" while FISTP means "store as integer, then pop the stack".
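The stack behaviour described in these notes can be played through in a toy Python model of the distance program. This is only a sketch: the class `FpuStack` and the function `distance_color` are hypothetical names, Python floats stand in for the FPU's extended precision, and Python's `round` (round half to even) stands in for the default FPU rounding mode. The sketch takes the sum with the popping form `faddp`, which leaves the stack empty after `fistp`; the non-popping `fadd st0,st1` would leave one value behind per pixel, and the FPU has only eight stack registers.

```python
import math


class FpuStack:
    """Toy model of the x87 register stack; st0 is the last list element."""

    def __init__(self):
        self.regs = []

    def fild(self, value):
        """Push an integer; everything else moves one slot away from the top."""
        if len(self.regs) == 8:
            raise OverflowError("the x87 has only eight stack registers")
        self.regs.append(float(value))

    def fmul_st0(self):
        """fmul st0: st0 = st0 * st0."""
        self.regs[-1] *= self.regs[-1]

    def faddp(self):
        """faddp st1,st0: st1 = st1 + st0, then pop."""
        top = self.regs.pop()
        self.regs[-1] += top

    def fsqrt(self):
        self.regs[-1] = math.sqrt(self.regs[-1])

    def fistp(self):
        """Store st0 as a rounded integer, then pop."""
        return round(self.regs.pop())


def distance_color(di):
    """Color of the pixel at screen offset di (0..63999 is visible)."""
    row, column = divmod(di, 320)   # div bx: quotient in AX, remainder in DX
    fpu = FpuStack()
    fpu.fild(row - 100)
    fpu.fmul_st0()
    fpu.fild(column - 160)
    fpu.fmul_st0()
    fpu.faddp()
    fpu.fsqrt()
    radius = fpu.fistp()
    assert fpu.regs == []           # the stack is empty again after each pixel
    return radius & 0xFF            # stosb writes only AL


print(distance_color(100 * 320 + 160), distance_color(0))
```

Writing the stack content next to each call, as the assembly listing does in its comments, makes it easy to spot a value that never gets popped.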
Now let's unleash the state-of-the-art sizecoding arsenal on this to bring it down to 37 bytes (40 bytes with aspect correction):
push 0a000h - 70 ; modified to center to 160,100
aas ; aspect ratio constant part
pop es ; get start of video memory in ES
mov al,0x13 ; switch to video mode 13h
int 0x10 ; 320 * 200 in 256 colors
X:
mov ax,0xCCCD ; perform the famous...
mul di ; ... Rrrola trick =)
sub dh,[si] ; align vertically
pusha ; push all registers on stack
fild word [bx-8] ; Y (pushed DX: row in DH, column in DL)
fmul st0 ; Y²
fild word [bx-9] ; X Y² (DL as high byte, BH=0 as low byte)
fmul dword [bx+si] ; X*1.24 Y² (aspect ratio correction)
fmul st0 ; X² Y²
faddp st1,st0 ; Y²+X² (add and pop, same size, so no value is left behind on the FPU stack)
fsqrt ; R
fistp dword [bx-5] ; -
popa ; pop all registers from stack
stosb ; write to screen (DI) and increment DI
jmp short X ; next pixel
The resulting image is almost identical to the former one. Let's go through this step by step:
- push 0a000h - 70: Instead of aligning horizontally with sub dx,160 we can code this implicitly by moving our segment register ten units - that is 10 * 16 = 160 pixels - to the left (see Real Mode Addressing). By subtracting further multiples of 20 units - that is 320 pixels, or one line, each - we can shift the visible screen towards the top to fine-tune the vertical alignment. Here 70 = 10 + 3 * 20, i.e. 160 pixels plus three lines. As long as this shift is no more than 4 lines (65536 / 320 - 200 = 4.8) there is no further visual impact.
- aas: This is the high byte of a constant, placed so that [SI] or [BX+SI] resolves to ~1.24 when read as a 32-bit float. The last byte of the ES segment value also matters, since it is part of that constant. Check for yourself with the IEEE 754 Converter.
- mov ax,0xCCCD & mul di: Instead of constructing X and Y from the screen pointer DI with DIV, you can get a decent estimate by multiplying the screen pointer by 0xCCCD and reading Y (the row) from DH (or DX as a 16-bit value) and X (the column) from DL (or DL:AH as a 16-bit value). The idea is to interpret DI as a kind of 16-bit fraction in the range [0,1], from start to end. Multiplying this number in [0,1] by 65536 / 320 = 204.8 yields the row before the point and, again as a kind of fraction, the column after the point. The representation 0xCCCD is the nearest rounding of 204.8 * 256 (= 52428.8 ~ 52429 = 0xCCCD). As long as the 16-bit representations are used, there is no precision loss.
- sub dh,[si]: The instruction at [SI] is push <word>, whose opcode is 0x68, which is 104 in decimal. Combined with the fine-tuned vertical alignment above (~4 lines) this results in (virtually) subtracting 100 for perfect vertical alignment. This is one byte shorter than sub dh,100.
- pusha / popa: Instead of going the classical way of communicating with the FPU, we push all registers, read/write values to/from the FPU with memory addressing, then pop all registers again. This works when DS = SS and SP is "close enough" to BX (initially zero and kept that way) to allow [BX+<signed byte>] addressing. It comes with the special benefit of implicit 8-bit shifts. One serious drawback is the loss of precision, since the registers DL and AH "lose connection" when using PUSHA (see the order of registers in the PUSHA/PUSHAD documentation).
- fild word [bx+<signed byte>] & fistp dword [bx+<signed byte>]: This is the so-called "stack addressing". We assume that BX=0 and SP=0xFFFE at start, so we know where the registers are in memory after pusha (AX at [BX-4], CX at [BX-6] etc.). It's important to realize that we work with signed 16-bit values now, in the full range of [-32768,32767]. That is also why we need DWORD when storing the result: sqrt(x²+y²) exceeds the signed 16-bit range for quite a few value pairs. Note that the odd offsets (bx-9, bx-5) already provide implicit 8-bit shifts.
- fmul dword [bx+si]: With the "Rrrola" trick above, the row number is 204 at maximum, but the column also spans only 256 values. This results in a wrong aspect ratio, but it can almost completely be fixed with this two-byte instruction (+ one byte for the AAS instruction): 256 * 1.24 = 317.44, which is quite close to 320. If the aspect ratio does not matter for the effect, these three bytes can be shaved off.
to be continued