Difference between revisions of "Prototyping DOS effects with ShaderToy"

From SizeCoding
Jump to: navigation, search
m
 
(15 intermediate revisions by the same user not shown)
Line 1: Line 1:
Sometimes it is useful to prototype ideas for DOS effects before going through the trouble of writing it in x86 / x87 assembly. [https://shadertoy.com/ Shadertoy] is a popular choice for making such prototypes. However, the ShaderToy language is a WebGL, which is a relatively powerful language, and includes native supports for vectors, matrices, built-in functions, and arithmetic, most of which are not available in x86 assembly. Thus, it is fairly easy to write stuff that looks tiny in ShaderToy but goes way over 256b when porting it finally to DOS.
+
Sometimes it is useful to prototype ideas for DOS effects before going through the trouble of writing them in x86/x87 assembly. [https://shadertoy.com/ Shadertoy] is a popular choice for creating such prototypes. However, the shaders are written in WebGL, which is a relatively powerful language and includes native support for vectors, matrices, many built-in functions, and arithmetic operations. Most of these features are not available in x86 assembly. It is fairly easy to write tiny shaders in Shadertoy that end up well over 256 bytes once finally ported to DOS. Furthermore, the x87's limit of 8 stack items imposes significant constraints on the number of temporary variables you can use: for example, two 3D vectors already occupy 6 stack slots, and you typically need at least one additional item as scratch space, so that's almost the entire stack already.
  
To make sure your ShaderToy prototype is portable to DOS, you should avoid all the operations that are going to be costly (in terms of bytes) and only use ones that will be cheap in assembly. Below you find some size estimates for WebGL code once ported to x87 math.
+
To make sure your Shadertoy prototype is portable to DOS, you should avoid operations that are going to be costly (in terms of bytes) and only use ones that will be cheap in assembly. Below you’ll find some size estimates for WebGL code once ported to x87 math.
  
 
== Scalar operators ==
 
== Scalar operators ==
Line 7: Line 7:
 
{| class="wikitable"
 
{| class="wikitable"
 
|-
 
|-
! ShaderToy !! Bytes !! Rough x87 equivalent  
+
! ShaderToy !! Bytes !! x87 equivalent  
 
|-
 
|-
 
| x+=y || 2 || faddp st1, st0  
 
| x+=y || 2 || faddp st1, st0  
Line 16: Line 16:
 
The cost for <code>-</code>, <code>*</code>, and <code>/</code> scalar operations is identical. A lot of this depends on how your x87 stack is organized (which variable is at the top of the stack at st0) and whether you need to keep copies of the variables for later use. In the last optimization phases, you can often save a few bytes by reorganizing your stack, to avoid unnecessary <code>fld</code> or <code>fxch</code> instructions.
 
The cost for <code>-</code>, <code>*</code>, and <code>/</code> scalar operations is identical. A lot of this depends on how your x87 stack is organized (which variable is at the top of the stack at st0) and whether you need to keep copies of the variables for later use. In the last optimization phases, you can often save a few bytes by reorganizing your stack, to avoid unnecessary <code>fld</code> or <code>fxch</code> instructions.
  
Notice the existence of <code>fsubr</code> instruction, so <code>x=(y/x)</code> can still be just 2 bytes, even if it looks more complicated in ShaderToy.
+
Notice the existence of <code>fsubr</code> and <code>fdivr</code> instructions, so <code>x=(y/x)</code> can still be just 2 bytes, even if it looks more complicated in ShaderToy.
  
 
Also notice that operating on a single component of a vector (<code>b.x += a.x</code>) is actually a scalar operation and thus takes the same 2-4 bytes.
 
Also notice that operating on a single component of a vector (<code>b.x += a.x</code>) is actually a scalar operation and thus takes the same 2-4 bytes.
Line 44: Line 44:
 
| log2(x) || 4 || fld1<br>...<br>fyl2x
 
| log2(x) || 4 || fld1<br>...<br>fyl2x
 
|-
 
|-
| exp2(x) || 14 || fld1; fld st1; fprem; f2xm1; faddp st1,st0; fscale; fstp st1
+
| 2*min(x,y) || 10 || Computed as a+b-abs(a-b) i.e.<br/>fld st0; fsub st0, st2; fabs; fsubp st1, st0; faddp st1, st0<br/>Replace fsubp with faddp for 2*max(x,y)
 +
|-
 +
| min(x,y) || 11 || fcom st0, st1; fnstsw ax; sahf; jc .S; fxch st0, st1; .S: fstp st1, st0<br/>Replace jc with jnc for max(x,y)
 +
|-
 +
| exp2(y) || 14 || fld1; fld st1; fprem; f2xm1; faddp st1,st0; fscale; fstp st1
 
|-
 
|-
 
| pow(x,y) || 16 || Computed as 2^(y*log2(x)) i.e. fyl2x, followed by the exp2(x) code
 
| pow(x,y) || 16 || Computed as 2^(y*log2(x)) i.e. fyl2x, followed by the exp2(x) code
 
|-
 
|-
| exp(x) || 18 || Computed as 2^(x*log2(e)) i.e. fldl2e and fmulp, followed by the exp2(x) code
+
| exp(y) || 18 || Computed as 2^(y*log2(e)) i.e. fldl2e and fmulp, followed by the exp2(y) code
 
|}
 
|}
  
<code>acos</code>, <code>asin</code>, <code>sinh</code>, <code>cosh</code>, <code>tanh</code>, <code>asinh</code>, <code>acosh</code>, and <code>atanh</code> are probably not worth your time, which is a pity, as <code>tanh</code> is a classic "squash" function to get any number into -1 .. 1 range.
+
There are no <code>acos</code>, <code>asin</code>, <code>sinh</code>, <code>cosh</code>, <code>tanh</code>, <code>asinh</code>, <code>acosh</code>, and <code>atanh</code> instructions on x87 and implementing them yourself is probably not worth your bytes. This is a pity, as <code>tanh</code> is a classic "squash" function to get any number into -1 .. 1 range.
 +
 
 +
Notice the significant byte cost of min and max operations, primarily due to the absence of fcomi and fmovcc instructions on older processors. This can catch people off guard, as many shaders use min/max extensively. For example, raymarchers often rely on them to compute shape unions. Avoiding unnecessary use of min and max can prevent a few headaches later.
  
 
== Rounding and remainders ==
 
== Rounding and remainders ==
Line 59: Line 65:
 
! ShaderToy !! Bytes !! x87 equivalent  
 
! ShaderToy !! Bytes !! x87 equivalent  
 
|-
 
|-
| round(x) || 2 || frndint (the default rounding mode is to nearest)
+
| round(x) || 2 || frndint with the default rounding mode, which is nearest
 
|-
 
|-
| x % y || 2 || fprem or fprem1
+
| mod(x,y) || 2 || fprem or fprem1. Notice they compute the remainder, not modulo, so they are the same only for positive values of x.
 
|-
 
|-
| ceil(x) || 2 + up to 5 || Up to 5 bytes to setup the rounding mode with fldcw, followed by frndint
+
| ceil(x) || 2-7 || Up to 5 bytes to setup the rounding mode with fldcw, followed by frndint
 
|-
 
|-
| floor(x) || 2 + up to 5 || Up to 5 bytes to setup the rounding mode with fldcw, followed by frndint
+
| floor(x) || 2-7 || Up to 5 bytes to setup the rounding mode with fldcw, followed by frndint
 
|}
 
|}
  
Line 86: Line 92:
 
|| a+=b || 10 || If b is needed later: fadd st3; fld st1; faddp st5; fld st2; faddp st6
 
|| a+=b || 10 || If b is needed later: fadd st3; fld st1; faddp st5; fld st2; faddp st6
 
|-
 
|-
|| length(a) || 16 || If a is not needed later: fmul st0, st0; fld st1; fmul st0, st0; faddp st1, st0; fld st2; fmul st0, st0; faddp st1, st0; fsqrt;
+
|| length(a) || 14 || If a is not needed later: fmul st0, st0; fxch st0, st2; fmul st0, st0; faddp st2, st0; fmul st0, st0; faddp st1, st0; fsqrt
 
|}
 
|}
  
From this you can already see that a simple <code>normalize(x)</code> is going to take a lot of bytes, as it has to be computed as <code>x/=length(x)</code>. Therefore, normalizing your raymarchers rays is usually to be avoided. <code>cross</code>, <code>reflect</code>, and <code>refract</code> are probably also too costly for sizecoding.
+
From this you can already see that a simple <code>normalize(a)</code> is going to take a lot of bytes, as it has to be computed as <code>a/=length(a)</code>. Therefore, normalizing your raymarchers rays is usually to be avoided. <code>cross</code>, <code>reflect</code>, and <code>refract</code> are probably also too costly for sizecoding.
  
 
== Floating point constants ==
 
== Floating point constants ==
Line 114: Line 120:
 
|}
 
|}
  
Thus, if you just need "some random constant" in your shader, using one of these can save bytes. Notice, however, that <code>fldpi; fmulp st1, st0</code> is still 4 bytes, whereas <code>fmul st0, dword [bp+offset]</code> can be as little as 3 bytes, if the offset is a short and you can reuse code or another value as the constant.
+
Thus, if you just need "some random constant" in your shader, using one of these can save bytes. Notice, however, that <code>fldpi; fmulp st1, st0</code> is still 4 bytes, whereas <code>fmul st0, dword [bp+offset]</code> can be as little as 3 bytes, if the offset is short and you can reuse code or another value as the constant.
  
Even if you need to define a new constant, you don't always need all full 4 bytes to define a single IEEE floating point number, but sometimes you even a single byte suffices. With a single byte, you can already define the exponent of a float, so the order of magnitude is already correct. You can then try to place this somewhere in your code/data where at least the first few bits of mantissa are correct, to increase the accuracy slightly. You can use tools like [https://www.h-schmidt.net/FloatConverter/IEEE754.html this] to see what floating point values encode to, and what different byte patterns are as floating point constants.
+
Even if you need to define a new constant, you don't always need the full 4 bytes to define a single IEEE floating-point number—sometimes even a single byte suffices. With a single byte, you can already define the exponent of a float, so the order of magnitude is already correct. You can then try to place this somewhere in your code or data where at least the first few bits of the mantissa are correct, to slightly increase the accuracy. You can use tools like [https://www.h-schmidt.net/FloatConverter/IEEE754.html this] to see what floating-point values encode to, and what different byte patterns represent as floating-point constants.
  
 
== Case study: Balrog ==  
 
== Case study: Balrog ==  
Line 138: Line 144:
 
<pre>
 
<pre>
 
     mov    cl, ITERS
 
     mov    cl, ITERS
.maploop:
+
.maploop:               ; for(int j=0;j<ITERS;j++) {
     fld    st0         ; t.x t.x
+
     fld    st0
 
     frndint
 
     frndint
     fsubp  st1, st0     ; t.x-round(t.x)
+
     fsubp  st1, st0
 
     fabs                ; t.x = abs(t.x - round(t.x))
 
     fabs                ; t.x = abs(t.x - round(t.x))
 
     fadd    st0          ; t.x += t.x;
 
     fadd    st0          ; t.x += t.x;
Line 157: Line 163:
 
     fmul    dword [si]
 
     fmul    dword [si]
 
     fsubp  st3, st0    ; t.z -= t.x * o
 
     fsubp  st3, st0    ; t.z -= t.x * o
     loop    .maploop
+
     loop    .maploop     ; }
 
</pre>
 
</pre>
  

Latest revision as of 12:51, 7 October 2025

Sometimes it is useful to prototype ideas for DOS effects before going through the trouble of writing them in x86/x87 assembly. Shadertoy is a popular choice for creating such prototypes. However, the shaders are written in WebGL, which is a relatively powerful language and includes native support for vectors, matrices, many built-in functions, and arithmetic operations. Most of these features are not available in x86 assembly. It is fairly easy to write tiny shaders in Shadertoy that end up well over 256 bytes once finally ported to DOS. Furthermore, the x87's limit of 8 stack items imposes significant constraints on the number of temporary variables you can use: for example, two 3D vectors already occupy 6 stack slots, and you typically need at least one additional item as scratch space, so that's almost the entire stack already.

To make sure your Shadertoy prototype is portable to DOS, you should avoid operations that are going to be costly (in terms of bytes) and only use ones that will be cheap in assembly. Below you’ll find some size estimates for WebGL code once ported to x87 math.

Scalar operators

ShaderToy Bytes x87 equivalent
x+=y 2 faddp st1, st0
x+y 4 If both x and y are needed later:
fld st0; fadd st0, st2

The cost for -, *, and / scalar operations is identical. A lot of this depends on how your x87 stack is organized (which variable is at the top of the stack at st0) and whether you need to keep copies of the variables for later use. In the last optimization phases, you can often save a few bytes by reorganizing your stack, to avoid unnecessary fld or fxch instructions.

Notice the existence of fsubr and fdivr instructions, so x=(y/x) can still be just 2 bytes, even if it looks more complicated in ShaderToy.

Also notice that operating on a single component of a vector (b.x += a.x) is actually a scalar operation and thus takes the same 2-4 bytes.

Scalar functions

ShaderToy Bytes x87 equivalent
-x 2 fchs
abs(x) 2 fabs
sqrt(x) 2 fsqrt
sin(x) 2 fsin
cos(x) 2 fcos
sin(x) ... cos(x) 2 fsincos
tan(x) 2 fptan
atan(y,x) 2 fpatan
log2(x) 4 fld1
...
fyl2x
2*min(x,y) 10 Computed as a+b-abs(a-b) i.e.
fld st0; fsub st0, st2; fabs; fsubp st1, st0; faddp st1, st0
Replace fsubp with faddp for 2*max(x,y)
min(x,y) 11 fcom st0, st1; fnstsw ax; sahf; jc .S; fxch st0, st1; .S: fstp st1, st0
Replace jc with jnc for max(x,y)
exp2(y) 14 fld1; fld st1; fprem; f2xm1; faddp st1,st0; fscale; fstp st1
pow(x,y) 16 Computed as 2^(y*log2(x)) i.e. fyl2x, followed by the exp2(x) code
exp(y) 18 Computed as 2^(y*log2(e)) i.e. fldl2e and fmulp, followed by the exp2(y) code

There are no acos, asin, sinh, cosh, tanh, asinh, acosh, and atanh instructions on x87 and implementing them yourself is probably not worth your bytes. This is a pity, as tanh is a classic "squash" function to get any number into -1 .. 1 range.

Notice the significant byte cost of min and max operations, primarily due to the absence of fcomi and fmovcc instructions on older processors. This can catch people off guard, as many shaders use min/max extensively. For example, raymarchers often rely on them to compute shape unions. Avoiding unnecessary use of min and max can prevent a few headaches later.

Rounding and remainders

ShaderToy Bytes x87 equivalent
round(x) 2 frndint with the default rounding mode, which is nearest
mod(x,y) 2 fprem or fprem1. Notice they compute the remainder, not modulo, so they are the same only for positive values of x.
ceil(x) 2-7 Up to 5 bytes to setup the rounding mode with fldcw, followed by frndint
floor(x) 2-7 Up to 5 bytes to setup the rounding mode with fldcw, followed by frndint

Notice that x-round(x) is a very compact way to do domain repetition for raymarchers.

Vector arithmetic (examples)

ShaderToy Bytes x87 equivalent
a.xy = a.yx 2 fxch st0, st1
a.xyz = a.yzx 4 fxch st0, st2; fxch st0, st1;
a+=b 5-6 Assuming b is not needed later.
6 bytes: faddp st3, st0; faddp st3, st0; faddp st3, st0;
5 bytes: if you have a trashable register with suitable parity, use the looping three times trick
dot(a,b) 9-10 If neither a or b is needed later, compute this as a*=b followed by a.z+=a.y+=a.x
a+=b 10 If b is needed later: fadd st3; fld st1; faddp st5; fld st2; faddp st6
length(a) 14 If a is not needed later: fmul st0, st0; fxch st0, st2; fmul st0, st0; faddp st2, st0; fmul st0, st0; faddp st1, st0; fsqrt

From this you can already see that a simple normalize(a) is going to take a lot of bytes, as it has to be computed as a/=length(a). Therefore, normalizing your raymarchers rays is usually to be avoided. cross, reflect, and refract are probably also too costly for sizecoding.

Floating point constants

x87 has the following constants built-in and loading each takes just 2 bytes:

Constant Approximation Instruction
0.0 0.0 fldz
1.0 1.0 fld1
pi 3.14159... fldpi
log2(e) 1.44270... fldl2e
loge(2) 0.69315... fldln2
log2(10) 3.32193... fldl2t
log10(2) 0.30103... fldlg2

Thus, if you just need "some random constant" in your shader, using one of these can save bytes. Notice, however, that fldpi; fmulp st1, st0 is still 4 bytes, whereas fmul st0, dword [bp+offset] can be as little as 3 bytes, if the offset is short and you can reuse code or another value as the constant.

Even if you need to define a new constant, you don't always need the full 4 bytes to define a single IEEE floating-point number—sometimes even a single byte suffices. With a single byte, you can already define the exponent of a float, so the order of magnitude is already correct. You can then try to place this somewhere in your code or data where at least the first few bits of the mantissa are correct, to slightly increase the accuracy. You can use tools like this to see what floating-point values encode to, and what different byte patterns represent as floating-point constants.

Case study: Balrog

With all the earlier in mind, Balrog 256b executable graphics can serve as a case study. Balrog is a fractal raymarcher, with the innermost loop of:

for(int j=0;j<ITERS;j++){                    
    t.x = abs(t.x - round(t.x)); // abs is folding, t.x - round(t.x) is domain repetition               
    t.x += t.x; // domain scaling
    r *= RSCALE;          
    r += t.x*t.x;
    t.xyz = t.yzx; // shuffle coordinates so next time we operate on previous y etc.
    t.x += t.z * o; // rotation, but using very poor math
    t.z -= t.x * o;               
}

Even if there's vectors, the code mostly does scalar math, and then uses coordinate shuffling (t.xyz = t.yzx) to do math on other coordinates. That code ports to:

    mov     cl, ITERS
.maploop:                ; for(int j=0;j<ITERS;j++) {
    fld     st0
    frndint
    fsubp   st1, st0
    fabs                 ; t.x = abs(t.x - round(t.x))
    fadd    st0          ; t.x += t.x;
    fld     dword [c_rscale+bp-BASE]
    fmulp   st4, st0     ; r *= RSCALE
    fld     st0
    fmul    st0
    faddp   st4, st0     ; r += t.x*t.x
    fxch    st2, st0
    fxch    st1, st0     ; t.xyz = t.yzx
    fld     st2
    fmul    dword [si]
    faddp   st1, st0     ; t.x += t.z * o;
    fld     st0
    fmul    dword [si]
    fsubp   st3, st0     ; t.z -= t.x * o
    loop    .maploop     ; }

The comments show exactly how each ShaderToy line maps to different x87 instructions.

The Balrog code also later exemplifies the floating point truncation technique:

c_mindist equ $-3
    db      0x38  ; 0.0001
c_glowamount equ $-2
c_colorscale equ $-2
    dw      0x3d61  ; 0.055
c_stepsizediv equ $-1
    db      0x03 ; 807
c_stepsizediv_z equ $-3
    db      0x40 ; 2.1006666666666662
c_glowdecay equ $-2
    dw      0x461c ; 1e4
c_rscale equ $-2
    db      0xa1, 0x3f  ; 1.2599210498948732
c_rdiv equ $-2
    dw      0x434b ; 203.18733465192963
c_camz equ $-1
    db      0xcc, 0x12, 0x42 ; 36.7
c_xdiv equ $-1
    db      0x09, 0x00, 0x40 ; 2.0006
c_xmult equ $-2
    dw      0x3f2a
c_camy equ $-2
    dw      0x3f1c ; 0.61

Two of the constants were finally the same constant (c_glowamount and c_colorscale), many are only have the exponents (single db), and two of the constants required as much as 3 bytes to get enough precision (c_camz and c_xdiv). The ordering of the constants was carefully chosen, so that when the exponent of one constant serves as a part of the mantissa of next constant, the value is at least roughly correct.