Prototyping DOS effects with ShaderToy
Sometimes it is useful to prototype ideas for DOS effects before going through the trouble of writing them in x86 / x87 assembly. ShaderToy is a popular choice for making such prototypes. However, ShaderToy shaders are written in GLSL, the shading language of WebGL, which is relatively powerful: it includes native support for vectors, matrices, and many built-in functions, most of which are not available in x86 assembly. Thus, it is fairly easy to write code that looks tiny in ShaderToy but goes way over 256 bytes when finally ported to DOS.
To make sure your ShaderToy prototype is portable to DOS, you should avoid all the operations that are going to be costly (in terms of bytes) and only use ones that will be cheap in assembly. Below you will find some size estimates for GLSL code once ported to x87 math.
Scalar operators
ShaderToy | Bytes | Rough x87 equivalent |
---|---|---|
x+=y | 2 | faddp st1, st0 |
x+y | 4 | If both x and y are needed later: fld st0; fadd st0, st2 |
The cost for -, *, and / scalar operations is identical. A lot of this depends on how your x87 stack is organized (which variable is at the top of the stack at st0) and whether you need to keep copies of the variables for later use. In the last optimization phases, you can often save a few bytes by reorganizing your stack, to avoid unnecessary fld or fxch instructions.
Notice the existence of the reversed instructions fsubr and fdivr, so x=(y-x) and x=(y/x) can still be just 2 bytes each, even if they look more complicated in ShaderToy.
Also notice that operating on a single component of a vector (b.x += a.x) is actually a scalar operation and thus takes the same 2-4 bytes.
Scalar functions
ShaderToy | Bytes | x87 equivalent |
---|---|---|
-x | 2 | fchs |
abs(x) | 2 | fabs |
sqrt(x) | 2 | fsqrt |
sin(x) | 2 | fsin |
cos(x) | 2 | fcos |
sin(x) ... cos(x) | 2 | fsincos |
tan(x) | 2-4 | fptan, which also pushes 1.0 on top of the result; add fstp st0 (2 bytes) if the 1.0 is of no use |
atan(y,x) | 2 | fpatan |
log2(x) | 4 | fld1 ... fyl2x |
exp2(x) | 14 | fld1; fld st1; fprem; f2xm1; faddp st1,st0; fscale; fstp st1 |
pow(x,y) | 16 | Computed as 2^(y*log2(x)) i.e. fyl2x, followed by the exp2(x) code |
exp(x) | 18 | Computed as 2^(x*log2(e)) i.e. fldl2e and fmulp, followed by the exp2(x) code |
acos, asin, sinh, cosh, tanh, asinh, acosh, and atanh are probably not worth your time, which is a pity, as tanh is a classic "squash" function to get any number into the -1 .. 1 range.
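The exp2(x) row in the table above is long because f2xm1 only accepts arguments between -1 and 1, so x has to be split into a fractional part (fprem against 1.0) and an integer part (applied by fscale). The following ShaderToy sketch mirrors that sequence step by step, so that a prototype can be checked against the same formulation. The helper name exp2X87 is hypothetical and used for illustration only.

```glsl
// Hypothetical helper (not from any original source): mirrors the x87
// sequence fld1; fld st1; fprem; f2xm1; faddp; fscale; fstp st1.
float exp2X87(float x) {
    float i = trunc(x);          // fscale truncates its scale factor toward zero
    float f = x - i;             // fprem by 1.0: fractional part, same sign as x
    float m = exp2(f) - 1.0;     // f2xm1: 2^f - 1, only defined for -1 <= f <= 1
    return (m + 1.0) * exp2(i);  // faddp adds the 1.0 back, fscale multiplies by 2^i
}

void mainImage(out vec4 fragColor, in vec2 fragCoord) {
    float x = 8.0 * fragCoord.x / iResolution.x - 4.0;  // x in -4 .. 4
    // Difference from the built-in; the image should stay essentially black.
    float err = abs(exp2X87(x) - exp2(x));
    fragColor = vec4(vec3(100.0 * err), 1.0);
}
```

Because pow and exp reuse the same tail, each call costs the full sequence unless the code is shared, which is worth keeping in mind when a prototype uses them more than once.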
Rounding and remainders
ShaderToy | Bytes | x87 equivalent |
---|---|---|
round(x) | 2 | frndint (the default rounding mode is to nearest) |
x % y | 2 | fprem or fprem1 |
ceil(x) | 2 + up to 5 | Up to 5 bytes to setup the rounding mode with fldcw, followed by frndint |
floor(x) | 2 + up to 5 | Up to 5 bytes to setup the rounding mode with fldcw, followed by frndint |
Notice that x-round(x) is a very compact way to do domain repetition for raymarchers.
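As a minimal sketch of that idea, the ShaderToy snippet below repeats a point at every integer coordinate using only operations from the tables above. The function name repeatedPoint and the scale factor 8.0 are assumptions made for illustration. Note that GLSL leaves the rounding direction of exact .5 values to the implementation, while frndint in its default mode rounds them to the nearest even integer, so the two can differ exactly on cell borders.

```glsl
// Hypothetical helper: distance from x to the nearest integer.
float repeatedPoint(float x) {
    x = x - round(x);  // fld st0; frndint; fsubp st1, st0 -> x in -0.5 .. 0.5
    return abs(x);     // fabs
}

void mainImage(out vec4 fragColor, in vec2 fragCoord) {
    vec2 p = 8.0 * fragCoord / iResolution.y;  // roughly 8 cells vertically
    // The same scalar code runs on each axis; the sum stays within 0 .. 1.
    float d = repeatedPoint(p.x) + repeatedPoint(p.y);
    fragColor = vec4(vec3(d), 1.0);
}
```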
Vector arithmetic (examples)
ShaderToy | Bytes | x87 equivalent |
---|---|---|
a.xy = a.yx | 2 | fxch st0, st1 |
a.xyz = a.yzx | 4 | fxch st0, st2; fxch st0, st1; |
a+=b | 5-6 | Assuming b is not needed later. 6 bytes: faddp st3, st0; faddp st3, st0; faddp st3, st0; 5 bytes: if you have a trashable register with suitable parity, use the looping three times trick |
dot(a,b) | 9-10 | If neither a nor b is needed later, compute this as a*=b followed by a.z+=a.y+=a.x |
a+=b | 10 | If b is needed later: fadd st3, st0; fld st1; faddp st5, st0; fld st2; faddp st6, st0 |
length(a) | 16 | If a is not needed later: fmul st0, st0; fld st1; fmul st0, st0; faddp st1, st0; fld st2; fmul st0, st0; faddp st1, st0; fsqrt; |
From this you can already see that a simple normalize(x) is going to take a lot of bytes, as it has to be computed as x/=length(x). Therefore, normalizing your raymarcher's rays is usually best avoided. cross, reflect, and refract are probably also too costly for sizecoding.
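One way to keep a prototype honest about these costs is to write the vector built-ins out as the scalar steps the x87 stack will perform. The sketch below does this for dot and length and compares the results with the built-ins. The helper names dotScalar and lengthScalar and the test vectors are hypothetical.

```glsl
// Hypothetical helper: dot(a, b) as a*=b followed by a.z+=a.y+=a.x.
float dotScalar(vec3 a, vec3 b) {
    a *= b;      // three multiplies, b is consumed
    a.y += a.x;  // running sum ...
    a.z += a.y;  // ... ends up in a.z
    return a.z;
}

// Hypothetical helper: length(a) as three squares, two adds, and one sqrt.
float lengthScalar(vec3 a) {
    float s = a.x * a.x;  // fmul st0, st0
    s += a.y * a.y;       // fld st1; fmul st0, st0; faddp st1, st0
    s += a.z * a.z;       // fld st2; fmul st0, st0; faddp st1, st0
    return sqrt(s);       // fsqrt
}

void mainImage(out vec4 fragColor, in vec2 fragCoord) {
    vec3 a = vec3(fragCoord / iResolution.xy, 0.5);
    vec3 b = vec3(0.3, 0.6, 0.9);
    // Difference from the built-ins; the image should stay essentially black.
    float err = abs(dotScalar(a, b) - dot(a, b))
              + abs(lengthScalar(a) - length(a));
    fragColor = vec4(vec3(1000.0 * err), 1.0);
}
```

Counting the scalar lines in such a helper gives a quick first estimate of the size of the port: each line is roughly one to three 2-byte instructions.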
Floating point constants
x87 has the following constants built in, and loading each takes just 2 bytes:
Constant | Approximation | Instruction |
---|---|---|
0.0 | 0.0 | fldz |
1.0 | 1.0 | fld1 |
pi | 3.14159... | fldpi |
log2(e) | 1.44270... | fldl2e |
loge(2) | 0.69315... | fldln2 |
log2(10) | 3.32193... | fldl2t |
log10(2) | 0.30103... | fldlg2 |
Thus, if you just need "some random constant" in your shader, using one of these can save bytes. Notice, however, that fldpi; fmulp st1, st0 is still 4 bytes, whereas fmul dword [bp+offset] can be as little as 3 bytes, if the offset fits in a single signed byte and you can reuse code or another value as the constant.
Even if you need to define a new constant, you don't always need the full 4 bytes to define a single IEEE floating point number; sometimes even a single byte suffices. The most significant byte of a float holds the sign and most of the exponent, so with a single byte the order of magnitude is already correct. You can then try to place this somewhere in your code/data where at least the first few bits of the mantissa are correct, to increase the accuracy slightly. You can use a floating point converter tool to see what floating point values encode to, and what different byte patterns decode to as floating point constants.
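The effect of truncating a constant can be previewed directly in the prototype. The sketch below zeroes the low bytes of a float with floatBitsToUint and uintBitsToFloat, which assumes a WebGL 2 capable browser. The helper name truncFloat and the test pattern are hypothetical. In the real binary the dropped bytes are not zero but whatever happens to precede the constant in memory, so this shows only one of the possible values.

```glsl
// Hypothetical helper: keep the top `bytes` (1 .. 4) bytes of a float
// and zero the rest, as if only those bytes were defined with db / dw.
float truncFloat(float v, int bytes) {
    uint mask = 0xFFFFFFFFu << uint(8 * (4 - bytes));
    return uintBitsToFloat(floatBitsToUint(v) & mask);
}

void mainImage(out vec4 fragColor, in vec2 fragCoord) {
    vec2 uv = fragCoord / iResolution.xy;
    float exact = 0.055;                 // bit pattern 0x3d6147ae
    float cheap = truncFloat(exact, 2);  // bit pattern 0x3d610000, about 0.0549
    // Left half uses the exact constant, right half the 2-byte version.
    float k = uv.x < 0.5 ? exact : cheap;
    fragColor = vec4(vec3(0.5 + 0.5 * sin(uv.y * k * 1000.0)), 1.0);
}
```

If the two halves of the image are indistinguishable, the constant is a good candidate for truncation.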
Case study: Balrog
With all of the above in mind, the Balrog 256b executable graphics can serve as a case study. Balrog is a fractal raymarcher, with the following innermost loop:
    for(int j=0;j<ITERS;j++){
        t.x = abs(t.x - round(t.x)); // abs is folding, t.x - round(t.x) is domain repetition
        t.x += t.x;                  // domain scaling
        r *= RSCALE;
        r += t.x*t.x;
        t.xyz = t.yzx;               // shuffle coordinates so next time we operate on previous y etc.
        t.x += t.z * o;              // rotation, but using very poor math
        t.z -= t.x * o;
    }
Even though there are vectors, the code mostly does scalar math, and then uses coordinate shuffling (t.xyz = t.yzx) to do the same math on the other coordinates. That code ports to:
        mov cl, ITERS
    .maploop:
        fld st0                       ; t.x t.x
        frndint
        fsubp st1, st0                ; t.x-round(t.x)
        fabs                          ; t.x = abs(t.x - round(t.x))
        fadd st0                      ; t.x += t.x;
        fld dword [c_rscale+bp-BASE]
        fmulp st4, st0                ; r *= RSCALE
        fld st0
        fmul st0
        faddp st4, st0                ; r += t.x*t.x
        fxch st2, st0
        fxch st1, st0                 ; t.xyz = t.yzx
        fld st2
        fmul dword [si]
        faddp st1, st0                ; t.x += t.z * o;
        fld st0
        fmul dword [si]
        fsubp st3, st0                ; t.z -= t.x * o
        loop .maploop
The comments show exactly how each ShaderToy line maps to different x87 instructions.
The Balrog code also later exemplifies the floating point truncation technique:
    c_mindist equ $-3
        db 0x38 ; 0.0001
    c_glowamount equ $-2
    c_colorscale equ $-2
        dw 0x3d61 ; 0.055
    c_stepsizediv equ $-1
        db 0x03 ; 807
    c_stepsizediv_z equ $-3
        db 0x40 ; 2.1006666666666662
    c_glowdecay equ $-2
        dw 0x461c ; 1e4
    c_rscale equ $-2
        db 0xa1, 0x3f ; 1.2599210498948732
    c_rdiv equ $-2
        dw 0x434b ; 203.18733465192963
    c_camz equ $-1
        db 0xcc, 0x12, 0x42 ; 36.7
    c_xdiv equ $-1
        db 0x09, 0x00, 0x40 ; 2.0006
    c_xmult equ $-2
        dw 0x3f2a
    c_camy equ $-2
        dw 0x3f1c ; 0.61
Two of the constants ended up being the same constant (c_glowamount and c_colorscale), many define only the exponent (a single db), and two of the constants required as much as 3 bytes to get enough precision (c_camz and c_xdiv). The ordering of the constants was carefully chosen, so that when the exponent of one constant serves as a part of the mantissa of the next constant, the value is at least roughly correct.