Difference between revisions of "Game of Life 32b"
m (→Switching to Textmode : 38 bytes) |
|||
(26 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
+ | You can [https://www.pouet.net/prod.php?which=85485 download and comment] the intro. | ||
+ | |||
==Original version : 65 bytes== | ==Original version : 65 bytes== | ||
− | We'll start with the old 65 bytes version and bring it down to 32 bytes. | + | We'll start with the old 65 bytes version and bring it down to 32 bytes.<br /> |
+ | |||
+ | <br /> | ||
+ | It will help to understand what the core algorithm does, before optimizing it. I will not go into the details of random number generation and key handling since these parts are removed in the final version anyway. Setting up screen mode and putting pixels to the screen is described in the basic sectionn of this Wiki. The core routine computes the [https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life#Rules "normal" game of life rules], but with a twist. Instead of regarding only eight neighbour cells inside a 3x3 neighbourhood , ALL nine cells are taken into consideration, and the rules are reinterpreted as: | ||
+ | <br /> | ||
+ | * If the number of cells is 3, the center cell will be alive. | ||
+ | * If the number of cells is 4, the center cell keeps its state. | ||
+ | * Otherwise, the cell dies (or stays dead). | ||
+ | <br /> | ||
+ | Like in other (trivial) implementations, the 2D space is parsed cell by cell, from left to right, and from top to bottom. Since the game of life does not work "in situ" (updating the current cell instantly will lead to wrong results of following calculations), current cells are "marked", and when the calculations are advanced far enough that the cell in question does not influence any calculation of the current iteration, it will be "corrected" by <code>shr [byte di-65],5</code> to the target value of the next iteration. The summation is as usual, an inner loop, adding up 3 cells of one column, and the outer loop, shifting from right (+1) to the left (-1), thus adding up 9 cells of a 3x3 neighbourhood. <br /> | ||
+ | <br /> | ||
+ | At the start of the loop, there is already the first "trick" happening. The register of summation <code>cl</code> is not properly cleaned, but at this point it can either contain 0 or 32 from the instruction <code>and al,0x20</code> after <code>xchg cx,ax</code>. If an arbitrary amount of cells has this on bit set, that won't hurt the calculation because of a special property of the <code>rcr</code> instruction. [https://kernfunny.org/x86/html/file_module_x86_id_273.html "The processor restricts the count to a number between 0 and 31 by masking all the bits in the count operand except the 5 leastsignificant bits."]<br /> | ||
+ | <br /> | ||
+ | When the summation is complete, the aforementioned <code>rcr</code> is executed, but not before setting the carry flag (<code>stc</code>) which will be rotated in from the left, and directly right of the original cell value. By extracting the 6th bit of this rotated value (with <code>and al,0x20</code> we get exactly the value according to the rules defined above. <br /> | ||
+ | <br /> | ||
+ | This value is now set in the original cell with <code>or [si-1],al</code>, which as shown before, does not hurt the computation, besides the cell value has a temporary value of 32 or 33, thus being visible as brighter blue pixel in the short time span between marking and correction. | ||
+ | |||
+ | |||
<syntaxhighlight lang="nasm"> | <syntaxhighlight lang="nasm"> | ||
Line 72: | Line 91: | ||
End Start | End Start | ||
</syntaxhighlight> | </syntaxhighlight> | ||
+ | |||
==Remove key handler and RNG : 44 bytes== | ==Remove key handler and RNG : 44 bytes== | ||
+ | In order to reach 32 bytes, all the convenient stuff has to be removed. In case there is space left, parts of it could be reintegrated again. There are tiny changes to make this work as intended. The segment where all the calculation takes place has been changed to <code>1000h</code>, pointing to a lower memory location. (Note: this might be working just with DosBox) The activity there (visible on the screen) helps spawning actual game of life structures. | ||
+ | [[File:Gol44.png|thumb|right|game of life, 44 bytes]] | ||
+ | [[File:Gol44c.png|thumb|game of life, 44 bytes, increased contrast]] | ||
+ | <code>mov al,[si]</code> and <code>inc si</code> have been replaced with <code>lodsb</code> since that saves one byte. | ||
<syntaxhighlight lang="nasm"> | <syntaxhighlight lang="nasm"> | ||
Start: | Start: | ||
− | mov al, | + | mov al,13h ; Set mode 13h |
int 10h | int 10h | ||
− | push | + | push 01000h ; DS = low memory segment |
pop ds | pop ds | ||
push 0A000h ; ES = video memory | push 0A000h ; ES = video memory | ||
Line 107: | Line 131: | ||
jmp short LifeLoop | jmp short LifeLoop | ||
</syntaxhighlight> | </syntaxhighlight> | ||
− | == Switching to Textmode : | + | |
+ | [[File:Golt38.png|thumb|game of life, text mode, 32b - 38b]] | ||
+ | == Switching to Textmode : 38 bytes== | ||
+ | Setting up screen mode and pixel access is requiring quite a bit of space, so in this version, it is removed. That is directly punished with an additional byte, because <code>DI</code> is no longer involved in the process, thus, an optimization had to be removed. The assumption is that the computer this runs on, is already in text mode (80x25 chars, colors). This also helps with the calculation, since now it takes place directly on the screen (only one segment has to be set up) and no content has to be generated initially, since there is always at least something on the screen that works as seed value. A small downside is the alignment of cells, because in textmode, one cell occupies TWO bytes (one for color information). Luckily, the color information is by default set to "gray on black". An additional <code>dec bx</code>, replacing <code>lodsb</code> with <code>lodsw</code> and changing <code>mov bl,3</code> to <code>mov bl,6</code> helps fixing the alignment issue. Additionally, the screen address changed (<code>push 0xb800</code> <code>pop ds</code>) Another lucky coincident is, that instead of blue pixels, we now have a "smiley char" with orthogonal borders, which is a decent representation of a living cell. Inbetween marking and correction it shortly changes to an exclamation mark (!), which is barely visible. | ||
+ | |||
<syntaxhighlight lang="nasm"> | <syntaxhighlight lang="nasm"> | ||
push 0xb800 | push 0xb800 | ||
Line 126: | Line 154: | ||
jnz AccLoop | jnz AccLoop | ||
lodsw | lodsw | ||
− | jmp LifeLoop | + | jmp short LifeLoop |
+ | </syntaxhighlight> | ||
+ | |||
+ | == Using instructions as segment adress : 36 bytes == | ||
+ | [[File:LDStrick.png|frame|center|possible instructions for LDS trick in textmode]] | ||
+ | Instead of using <code>push</code> and <code>pop</code> to get the screen adress, there is also the instruction <code>lds</code> available, which reads the segment value from memory. A value "close" to <code>0xb800</code> would be sufficient, because the visible screen in textmode is just a tiny part of the 64 kilobytes addressable by one segment. The idea is now to reuse parts of the code as segment address, which is possible when the instructions is one of the above. If there is such an instruction, it can start at the 4th byte (<code>[si]</code> points to the start of the code and <code>lds bx,[si]</code> puts the first two bytes into BX and the 3rd and 4th into DS, reversed). In this case <code>lodsw</code> can be reused as the first (higher) byte of the segment. The 3rd byte would be only relevant for alignment, so instead of putting "0x00" there, a one-byte-instruction can be used there. The whole process saves two bytes. | ||
+ | |||
+ | <syntaxhighlight lang="nasm"> | ||
+ | lds bx,[si] | ||
+ | LifeLoop: | ||
+ | stc ; 3 = birth, 4 = stay (tricky): | ||
+ | lodsw | ||
+ | rcr al,cl ; 1.00?0000x --> 0.0x100?00 (rcr 3) | ||
+ | and al,20h ; ^carry | ^ | ||
+ | or [si-2],al ; Add in new cell ^ | ||
+ | shr byte [si-160-6],5 ; Shift previous value | ||
+ | mov bl,6 | ||
+ | xchg cx,ax | ||
+ | AccLoop: add cl,[si+bx-160-4] ; Add in this column | ||
+ | add cl,[si+bx-4] | ||
+ | add cl,[si+bx+160-4] | ||
+ | dec bx ; Loop back | ||
+ | dec bx ; Loop back | ||
+ | jnz AccLoop | ||
+ | jmp short LifeLoop | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | == Synchronizing SI/DI, Improved cleanup : 34 bytes == | ||
+ | A lot of tiny changes were the result of just one idea: How to optimize the clean up step? After all it is not really neccessary to correct a marked cell as soon as possible, instead, it can be waited for a certain amount of time/steps. But any nontrivial version of <code>shr byte [si-160-6],5</code> still uses four bytes, unless it is brought into one of the "pure" forms that only take up THREE bytes: <code>shr byte[(bp/bx)+si/di],x</code>. Since SI and BX were already in use, and the usage of BP would implicate that the register SS is used instead of DS, the only remaining register possible is DI.<br/> | ||
+ | Now there are very short instructions available to advance the registers SI and DI, some of them at the same time, and one of them is <code>cmpsw</code>. Not only does it not "hurt" the intended computation (the "compare" part of the instruction can be ignored), it also advances both SI and DI by TWO, so that the alignment of the screen in text mode is perfectly matched.<br/> | ||
+ | The usage of <code>cmpsw</code> requires to remove <code>lodsw</code> since there is no simple command to advance SI in the opposite direction (without involving direction flags), so it had been changed again to <code>lodsb</code> to be one of the commands that also works as high byte of a segment adress, and an additional <code>dec si</code> to align DI and SI, so that the clean up step is always in the same distance "behind" the current calculation. The assumption DI = SI - 258 is true on almost every DOS system. As a byproduct, one of the memory access instruction can now be rewritten to use DI instead of SI (like in the original), to save one byte. | ||
+ | <syntaxhighlight lang="nasm"> | ||
+ | lds bx,[si] | ||
+ | LifeLoop: | ||
+ | stc ; 3 = birth, 4 = stay (tricky): | ||
+ | lodsb | ||
+ | dec si | ||
+ | rcr al,cl ; 1.00?0000x --> 0.0x100?00 (rcr 3) | ||
+ | and al,20h ; ^carry | ^ | ||
+ | or [si],al ; Add in new cell ^ | ||
+ | cmpsw | ||
+ | shr byte [di],5 ; Shift previous value | ||
+ | mov bl,6 | ||
+ | xchg cx,ax | ||
+ | AccLoop: add cl,[di+bx+94] ; Add in this column | ||
+ | add cl,[si+bx-4] | ||
+ | add cl,[si+bx+160-4] | ||
+ | dec bx ; Loop back | ||
+ | dec bx ; Loop back | ||
+ | jnz AccLoop | ||
+ | jmp short LifeLoop | ||
+ | </syntaxhighlight> | ||
+ | == Combining exchange with alignment : 33 bytes == | ||
+ | When thinking about <code>xchg cx,ax</code> and how to skip one row to get rid of one of the double <code>dec bx</code>, my own production "M8trix" (2015) came to mind, where i did [https://www.pouet.net/prod.php?which=63126#c713378 pretty much the same as here], pulling the <code>xchg</code>into the loop and doing alternating counting, so that <code>cl</code> counts the acual cells, while <code>al</code> is never actually used (it "counts" the colors). To make that little dance work, <code>bl</code> has to start at 7. | ||
+ | |||
+ | <syntaxhighlight lang="nasm"> | ||
+ | lds bx,[si] | ||
+ | LifeLoop: | ||
+ | stc ; 3 = birth, 4 = stay (tricky): | ||
+ | lodsb | ||
+ | dec si | ||
+ | rcr al,cl ; 1.00?0000x --> 0.0x100?00 (rcr 3) | ||
+ | and al,20h ; ^carry | ^ | ||
+ | or [si],al ; Add in new cell ^ | ||
+ | cmpsw | ||
+ | shr byte [di],5 ; Shift previous value | ||
+ | mov bl,7 | ||
+ | AccLoop: xchg cx,ax | ||
+ | add al,[di+bx+94] ; Add in this column | ||
+ | add al,[si+bx-4] | ||
+ | add al,[si+bx+160-4] | ||
+ | dec bx ; Loop back | ||
+ | jnz AccLoop | ||
+ | jmp short LifeLoop | ||
</syntaxhighlight> | </syntaxhighlight> | ||
− | == | + | ==Modbyte tuning, jumping into modbytes, code path alignment : 32 bytes== |
+ | [[File:Modbyte to 100h.png|frame|right|lds_modbyte_instructions]] | ||
+ | Sometimes, an instruction has several degrees of "freedom". That means, that the effect of that instruction can also be achieved by an alternative version of that instruction. In this case, the <code>lds</code> instruction, which puts two bytes of the code into the segment DS, also loads two bytes into a register we (almost) don't care about. The only requirement is that <code>lds</code> points to the start of the code, which can either be done by [SI] or [BX+SI]. The right image shows which modbyte numbers would be satisfying (highlighted green). Now, this selection can be applied to the instruction table below (highlighted red). It becomes clear that the used instruction<code>and al,0x20</code> would, interpreted as modbyte, be <code>SP,[SI]</code> and thus it would be possible to jump into this modbyte to execute. <br/><br/> | ||
+ | To be more clear: <code>lds sp,[si]</code> is <code>0xc5 0x24</code>, <code>and al, 0x20</code> is <code>0x24 0x20</code>, so TWO <code>0x24</code>are merged into ONE. <br/> | ||
+ | <br/>To make this work, the "host" instruction has to be only executed once (it would not work in a loop). Also, the parameter of the injected instruction has to be put "behind" the "host" instruction (a single <code>db 32</code> in the code). Finally, it has to be made sure that this second code path aligns with the rest of the code, and does no damage to the intended effect (for example, critical registers could be modified, or worse, illegal instructions could be created that way). In this case the new codepath consists of <code>and [bp+di+0807h],dh</code> and <code>add al,0a7h</code>, after which it aligns normally. These instructions are executed only once and do not modify critical registers.<br/> | ||
+ | Sometimes, a bit of code shuffling has to be performed to make such a trick work. Here, the <code>lodsb</code> and <code>dec si</code> have been replaced with <code>mov al,[si]</code>. The critical function of being also a good segment value has been overtaken by <code>mov bl</code> (see table above).<br/> | ||
+ | [[File:Instruction selection.png|frame|center|instruction table for modbytes]] | ||
<syntaxhighlight lang="nasm"> | <syntaxhighlight lang="nasm"> | ||
+ | lds sp,[si] | ||
+ | X: db 32 | ||
+ | mov bl,7 ; O: 3 iterations | ||
+ | or [si],al ; O: Add in new cell | ||
+ | cmpsw | ||
+ | shr byte [di],5 ; O: Shift previous value | ||
+ | C: xchg cx,ax | ||
+ | add al,[di+bx+94] ; O: Add in this column | ||
+ | add al,[si+bx-4] | ||
+ | add al,[si+bx+156] | ||
+ | dec bx ; O: Loop back | ||
+ | jnz C | ||
+ | mov al,[si] ; O: 3 = birth, 4 = stay (tricky): | ||
+ | stc ; O: 1.00?0000x --> 0.0x100?00 (rcr 3) | ||
+ | rcr al,cl ; O: +---> 0.00x100?0 (rcr 4) | ||
+ | jmp short X-1 | ||
</syntaxhighlight> | </syntaxhighlight> |
Latest revision as of 04:27, 4 May 2020
You can download and comment the intro.
Contents
- 1 Original version : 65 bytes
- 2 Remove key handler and RNG : 44 bytes
- 3 Switching to Textmode : 38 bytes
- 4 Using instructions as segment adress : 36 bytes
- 5 Synchronizing SI/DI, Improved cleanup : 34 bytes
- 6 Combining exchange with alignment : 33 bytes
- 7 Modbyte tuning, jumping into modbytes, code path alignment : 32 bytes
Original version : 65 bytes
We'll start with the old 65 bytes version and bring it down to 32 bytes.
It will help to understand what the core algorithm does, before optimizing it. I will not go into the details of random number generation and key handling since these parts are removed in the final version anyway. Setting up screen mode and putting pixels to the screen is described in the basic sectionn of this Wiki. The core routine computes the "normal" game of life rules, but with a twist. Instead of regarding only eight neighbour cells inside a 3x3 neighbourhood , ALL nine cells are taken into consideration, and the rules are reinterpreted as:
- If the number of cells is 3, the center cell will be alive.
- If the number of cells is 4, the center cell keeps its state.
- Otherwise, the cell dies (or stays dead).
Like in other (trivial) implementations, the 2D space is parsed cell by cell, from left to right, and from top to bottom. Since the game of life does not work "in situ" (updating the current cell instantly will lead to wrong results of following calculations), current cells are "marked", and when the calculations are advanced far enough that the cell in question does not influence any calculation of the current iteration, it will be "corrected" by shr [byte di-65],5
to the target value of the next iteration. The summation is as usual, an inner loop, adding up 3 cells of one column, and the outer loop, shifting from right (+1) to the left (-1), thus adding up 9 cells of a 3x3 neighbourhood.
At the start of the loop, there is already the first "trick" happening. The register of summation cl
is not properly cleaned, but at this point it can either contain 0 or 32 from the instruction and al,0x20
after xchg cx,ax
. If an arbitrary amount of cells has this on bit set, that won't hurt the calculation because of a special property of the rcr
instruction. "The processor restricts the count to a number between 0 and 31 by masking all the bits in the count operand except the 5 leastsignificant bits."
When the summation is complete, the aforementioned rcr
is executed, but not before setting the carry flag (stc
) which will be rotated in from the left, and directly right of the original cell value. By extracting the 6th bit of this rotated value (with and al,0x20
we get exactly the value according to the rules defined above.
This value is now set in the original cell with or [si-1],al
, which as shown before, does not hurt the computation, besides the cell value has a temporary value of 32 or 33, thus being visible as brighter blue pixel in the short time span between marking and correction.
; http://read.pudn.com/downloads208/sourcecode/asm/981812/LIFE65.ASM__.htm
; Life simulator, 72 bytes - Vladislav Kaipetsky and Tenie Remmel
; 65 bytes - Mark Andreas
; If no args, regs on startup are:
; AX = BX = 0000h
; SI = IP = 0100h
; DI = SP = FFFEh
IDEAL
MODEL TINY
P386
CODESEG
ORG 100h
Start: int 1ah ; ah=00: cx=hours, dx=tic counter
mov al,13h ; Set mode 13h
int 10h
xchg dx,ax
push 09000h ; DS = last 64K segment
pop ds
push 0A000h ; ES = video memory
pop es
; BX is already zero
RandLoop:
rol ax,1 ; Generate random number
adc [bx],al
dec bx
jnz RandLoop
; BX will not be equal to 3 the first time this loop is executed, but
; it will be for all other times. As SI = 0100h and DI = FFFEh on
; startup, SI - DI will be equal to 258.
LifeLoop:
xchg cx,ax
AccLoop:
add cl,[di+bx-64] ; Add in this column
add cl,[si+bx-2]
add cl,[si+bx+318]
dec bx ; Loop back
jnz AccLoop
mov al,[si] ; Get center cell, set pixel
stosb
stc ; 3 = birth, 4 = stay (tricky):
rcr al,cl ; 1.00?0000x --> 0.0x100?00 (rcr 3)
and al,20h ; ^carry | ^
; +---> 0.00x100?0 (rcr 4)
or [si-1],al ; Add in new cell ^
shr [byte di-65],5 ; Shift previous value
mov bl,3 ; 3 iterations in AccLoop
inc si ; Loop while not zero
jnz LifeLoop
mov ah,1 ; Check for key
int 16h
jz LifeLoop ; Loop if no key
xchg ax,bx ; Set text mode
int 10h
ret ; Return
End Start
Remove key handler and RNG : 44 bytes
In order to reach 32 bytes, all the convenient stuff has to be removed. In case there is space left, parts of it could be reintegrated again. There are tiny changes to make this work as intended. The segment where all the calculation takes place has been changed to 1000h
, pointing to a lower memory location. (Note: this might be working just with DosBox) The activity there (visible on the screen) helps spawning actual game of life structures.
mov al,[si]
and inc si
have been replaced with lodsb
since that saves one byte.
Start:
mov al,13h ; Set mode 13h
int 10h
push 01000h ; DS = low memory segment
pop ds
push 0A000h ; ES = video memory
pop es
; BX is already zero
LifeLoop:
xchg cx,ax
AccLoop:
add cl,[di+bx-64] ; Add in this column
add cl,[si+bx-2]
add cl,[si+bx+318]
dec bx ; Loop back
jnz AccLoop
;mov al,[si] ; Get center cell, set pixel
lodsb
stosb
stc ; 3 = birth, 4 = stay (tricky):
rcr al,cl ; 1.00?0000x --> 0.0x100?00 (rcr 3)
and al,20h ; ^carry | ^
; +---> 0.00x100?0 (rcr 4)
or [si-1],al ; Add in new cell ^
shr [byte di-65],5 ; Shift previous value
mov bl,3 ; 3 iterations in AccLoop
; inc si ; Loop while not zero
jmp short LifeLoop
Switching to Textmode : 38 bytes
Setting up screen mode and pixel access is requiring quite a bit of space, so in this version, it is removed. That is directly punished with an additional byte, because DI
is no longer involved in the process, thus, an optimization had to be removed. The assumption is that the computer this runs on, is already in text mode (80x25 chars, colors). This also helps with the calculation, since now it takes place directly on the screen (only one segment has to be set up) and no content has to be generated initially, since there is always at least something on the screen that works as seed value. A small downside is the alignment of cells, because in textmode, one cell occupies TWO bytes (one for color information). Luckily, the color information is by default set to "gray on black". An additional dec bx
, replacing lodsb
with lodsw
and changing mov bl,3
to mov bl,6
helps fixing the alignment issue. Additionally, the screen address changed (push 0xb800
pop ds
) Another lucky coincident is, that instead of blue pixels, we now have a "smiley char" with orthogonal borders, which is a decent representation of a living cell. Inbetween marking and correction it shortly changes to an exclamation mark (!), which is barely visible.
push 0xb800
pop ds
LifeLoop:
stc ; 3 = birth, 4 = stay (tricky):
rcr al,cl ; 1.00?0000x --> 0.0x100?00 (rcr 3)
and al,20h ; ^carry | ^
or [si-2],al ; Add in new cell ^
shr byte [si-160-6],5 ; Shift previous value
mov bl,6
xchg cx,ax
AccLoop: add cl,[si+bx-160-4] ; Add in this column
add cl,[si+bx-4]
add cl,[si+bx+160-4]
dec bx ; Loop back
dec bx ; Loop back
jnz AccLoop
lodsw
jmp short LifeLoop
Using instructions as segment adress : 36 bytes
Instead of using push
and pop
to get the screen adress, there is also the instruction lds
available, which reads the segment value from memory. A value "close" to 0xb800
would be sufficient, because the visible screen in textmode is just a tiny part of the 64 kilobytes addressable by one segment. The idea is now to reuse parts of the code as segment address, which is possible when the instructions is one of the above. If there is such an instruction, it can start at the 4th byte ([si]
points to the start of the code and lds bx,[si]
puts the first two bytes into BX and the 3rd and 4th into DS, reversed). In this case lodsw
can be reused as the first (higher) byte of the segment. The 3rd byte would be only relevant for alignment, so instead of putting "0x00" there, a one-byte-instruction can be used there. The whole process saves two bytes.
lds bx,[si]
LifeLoop:
stc ; 3 = birth, 4 = stay (tricky):
lodsw
rcr al,cl ; 1.00?0000x --> 0.0x100?00 (rcr 3)
and al,20h ; ^carry | ^
or [si-2],al ; Add in new cell ^
shr byte [si-160-6],5 ; Shift previous value
mov bl,6
xchg cx,ax
AccLoop: add cl,[si+bx-160-4] ; Add in this column
add cl,[si+bx-4]
add cl,[si+bx+160-4]
dec bx ; Loop back
dec bx ; Loop back
jnz AccLoop
jmp short LifeLoop
Synchronizing SI/DI, Improved cleanup : 34 bytes
A lot of tiny changes were the result of just one idea: How to optimize the clean up step? After all it is not really neccessary to correct a marked cell as soon as possible, instead, it can be waited for a certain amount of time/steps. But any nontrivial version of shr byte [si-160-6],5
still uses four bytes, unless it is brought into one of the "pure" forms that only take up THREE bytes: shr byte[(bp/bx)+si/di],x
. Since SI and BX were already in use, and the usage of BP would implicate that the register SS is used instead of DS, the only remaining register possible is DI.
Now there are very short instructions available to advance the registers SI and DI, some of them at the same time, and one of them is cmpsw
. Not only does it not "hurt" the intended computation (the "compare" part of the instruction can be ignored), it also advances both SI and DI by TWO, so that the alignment of the screen in text mode is perfectly matched.
The usage of cmpsw
requires to remove lodsw
since there is no simple command to advance SI in the opposite direction (without involving direction flags), so it had been changed again to lodsb
to be one of the commands that also works as high byte of a segment adress, and an additional dec si
to align DI and SI, so that the clean up step is always in the same distance "behind" the current calculation. The assumption DI = SI - 258 is true on almost every DOS system. As a byproduct, one of the memory access instruction can now be rewritten to use DI instead of SI (like in the original), to save one byte.
lds bx,[si]
LifeLoop:
stc ; 3 = birth, 4 = stay (tricky):
lodsb
dec si
rcr al,cl ; 1.00?0000x --> 0.0x100?00 (rcr 3)
and al,20h ; ^carry | ^
or [si],al ; Add in new cell ^
cmpsw
shr byte [di],5 ; Shift previous value
mov bl,6
xchg cx,ax
AccLoop: add cl,[di+bx+94] ; Add in this column
add cl,[si+bx-4]
add cl,[si+bx+160-4]
dec bx ; Loop back
dec bx ; Loop back
jnz AccLoop
jmp short LifeLoop
Combining exchange with alignment : 33 bytes
When thinking about xchg cx,ax
and how to skip one row to get rid of one of the double dec bx
, my own production "M8trix" (2015) came to mind, where i did pretty much the same as here, pulling the xchg
into the loop and doing alternating counting, so that cl
counts the acual cells, while al
is never actually used (it "counts" the colors). To make that little dance work, bl
has to start at 7.
lds bx,[si]
LifeLoop:
stc ; 3 = birth, 4 = stay (tricky):
lodsb
dec si
rcr al,cl ; 1.00?0000x --> 0.0x100?00 (rcr 3)
and al,20h ; ^carry | ^
or [si],al ; Add in new cell ^
cmpsw
shr byte [di],5 ; Shift previous value
mov bl,7
AccLoop: xchg cx,ax
add al,[di+bx+94] ; Add in this column
add al,[si+bx-4]
add al,[si+bx+160-4]
dec bx ; Loop back
jnz AccLoop
jmp short LifeLoop
Modbyte tuning, jumping into modbytes, code path alignment : 32 bytes
Sometimes, an instruction has several degrees of "freedom". That means, that the effect of that instruction can also be achieved by an alternative version of that instruction. In this case, the lds
instruction, which puts two bytes of the code into the segment DS, also loads two bytes into a register we (almost) don't care about. The only requirement is that lds
points to the start of the code, which can either be done by [SI] or [BX+SI]. The right image shows which modbyte numbers would be satisfying (highlighted green). Now, this selection can be applied to the instruction table below (highlighted red). It becomes clear that the used instructionand al,0x20
would, interpreted as modbyte, be SP,[SI]
and thus it would be possible to jump into this modbyte to execute.
To be more clear: lds sp,[si]
is 0xc5 0x24
, and al, 0x20
is 0x24 0x20
, so TWO 0x24
are merged into ONE.
To make this work, the "host" instruction has to be only executed once (it would not work in a loop). Also, the parameter of the injected instruction has to be put "behind" the "host" instruction (a single db 32
in the code). Finally, it has to be made sure that this second code path aligns with the rest of the code, and does no damage to the intended effect (for example, critical registers could be modified, or worse, illegal instructions could be created that way). In this case the new codepath consists of and [bp+di+0807h],dh
and add al,0a7h
, after which it aligns normally. These instructions are executed only once and do not modify critical registers.
Sometimes, a bit of code shuffling has to be performed to make such a trick work. Here, the lodsb
and dec si
have been replaced with mov al,[si]
. The critical function of being also a good segment value has been overtaken by mov bl
(see table above).
lds sp,[si]
X: db 32
mov bl,7 ; O: 3 iterations
or [si],al ; O: Add in new cell
cmpsw
shr byte [di],5 ; O: Shift previous value
C: xchg cx,ax
add al,[di+bx+94] ; O: Add in this column
add al,[si+bx-4]
add al,[si+bx+156]
dec bx ; O: Loop back
jnz C
mov al,[si] ; O: 3 = birth, 4 = stay (tricky):
stc ; O: 1.00?0000x --> 0.0x100?00 (rcr 3)
rcr al,cl ; O: +---> 0.00x100?0 (rcr 4)
jmp short X-1