Risc OS on ARM based CPUs

From SizeCoding

Revision as of 01:15, 19 June 2020

Why ARM and why on Risc OS ?

x86 and ARM are the two major CPU architectures of modern times, the latter especially for any kind of mobile device. Back in the 80s ARM was founded to power the successor of the BBC Micro. The most popular and best known machines are probably the Acorn Archimedes range (1987) and the Acorn Risc PC. All those home computers ran Risc OS, a unique operating system for ARM CPUs.

Nowadays, thanks to the work of a few enthusiasts, Risc OS is still in development and you can run it on popular single-board computers. The Raspberry Pi range is cheap and especially recommended. The fastest CPU to run Risc OS natively at the time of writing is an overclocked RPi4 at 2147 MHz.

Actually I'm not aware whether Android or any kind of Linux would be a better platform for sizecoding on ARM hardware. Just prove us wrong and write to us about it.

What does ARM offer compared to x86 ?

If you come from x86, coding on ARM will be a very different experience, as the architecture never had any inherited obstacles from an 8 or 16 Bit age. It was purely RISC and 32 Bit from the beginning regarding instruction set and register size. Over the years a lot of enhancements took place. In general you get:

  • 16 full size 32-Bit registers (r0...15, usually: r13: stack pointer, r14: link register, r15: program counter)
  • VFP/NEON (SIMD) units with 32 32-Bit single precision registers (s0...s31), 32 64-Bit multi purpose or double precision registers for SIMD (d0...d31), and 16 128-Bit multi purpose registers for SIMD (q0...q15). All those registers are mapped onto each other
  • THUMB/THUMB-2 instruction set (especially useful regarding sizecoding)

...and of course the individual instructions are in general very different from x86. Some things might be familiar, some not at all. Over the years the ARM instruction set became quite huge. Nowadays there's hardware integer divide and various SIMD approaches in either ARM or NEON instructions. Only the FPU still lacks trigonometric and other fancy instructions compared to x87. There is a so called FPEmulator in Risc OS to take care of that, but it's rather slow as it is implemented in software, and it is not available for THUMB/THUMB-2. For now, though, it might be an option for e.g. precalc when you use the Basic Assembler from RISC OS.

In ARM mode every instruction is 4 Bytes long; only THUMB offers a limited instruction set with a length of 2 Bytes (THUMB-2 mixes 2 and 4 Byte instructions).

That may sound like a bit of a handicap regarding sizecoding, and for some tasks that is definitely true. For others it's not, due to the things even a single instruction can do (e.g. conditional execution and shifts for free). The following shows an example:

ARM (8 Bytes)

cmp   r0,r1            //compare r0 with r1
addhi r0,r2,r3,lsr#4   //if r0>r1 (unsigned) then r0 = r2 + (r3>>4)  (r3 is only shifted for the add and remains unchanged)

x86 (11 Bytes)

cmp eax,ebx
jna skip 
   mov eax,edx
   shr eax,4
   add eax,ecx
skip:

The conditional execution in ARM mode isn't limited to the next instruction. You can continue with conditional instructions for as long as you like, until the code executes an instruction that updates the flags, e.g. cmp or an instruction with the suffix s added, e.g. adds r0,r1,r2.
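To make the semantics of the two-instruction ARM example above a bit more tangible, here is a small Python model of it. It is only a sketch and not ARM code: the helper name `cmp_addhi` is made up for illustration, and it assumes that `hi` is the unsigned "higher" condition and that registers wrap around at 32 bits.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide

def cmp_addhi(r0, r1, r2, r3):
    """Model of: cmp r0,r1 / addhi r0,r2,r3,lsr#4 (hypothetical helper name)."""
    r0 &= MASK32
    r1 &= MASK32
    if r0 > r1:                                    # 'hi' = unsigned higher
        # the shift is applied to the operand only, r3 itself stays unchanged
        r0 = (r2 + ((r3 & MASK32) >> 4)) & MASK32
    return r0                                      # untouched if the condition fails
```

For example `cmp_addhi(5, 3, 10, 0x40)` takes the add path, while `cmp_addhi(3, 5, 10, 0x40)` leaves r0 as it was.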

In THUMB mode unfortunately only branches are conditional. But THUMB-2 introduced the it instruction, which makes up to 4 following instructions conditional. Some code from the ARM Information Center explains this using the GCD algorithm (Greatest Common Divisor).

ARM (16 Bytes)

gcd:
   cmp   r0,r1
   subgt r0,r0,r1
   suble r1,r1,r0
   bne gcd

THUMB-2 (10 Bytes)

gcd:
   cmp   r0,r1
   ite   gt 
   subgt r0,r0,r1
   suble r1,r1,r0
   bne gcd
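For checking the logic of the two gcd listings off-target, the loop can be mirrored in Python. This is just a sketch: the function name `gcd_arm_style` is my own, and it assumes both inputs are positive (with a zero input the loop would never terminate).

```python
def gcd_arm_style(r0, r1):
    """Subtractive GCD mirroring the cmp/subgt/suble/bne loop (positive inputs assumed)."""
    while True:
        greater = r0 > r1     # cmp r0,r1 sets the flags once per pass
        equal = r0 == r1
        if greater:
            r0 -= r1          # subgt r0,r0,r1
        else:
            r1 -= r0          # suble r1,r1,r0 (also runs when equal, leaving r1 = 0)
        if equal:             # bne falls through once the cmp found equality
            return r0         # result is left in r0
```

Note that the subtractions carry no s suffix, so the branch at the end still tests the flags of the cmp. That is why the model evaluates both conditions before changing any register.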

What does Risc OS offer for sizecoding ?

  • more or less easy access to common screen modes
  • all screen modes have a linear frame buffer, no 16Bit screen banks limit like on DOS
  • 256 colour modes have a full size RGB (8 Bit each value) palette compared to DOS (6 Bit)
  • convenient access to operating system/kernel routines (so called SWI's (SoftWare Interrupt), comparable to 'int' on x86).
  • up to date 16-Bit sound system, for e.g. generating bytebeat based stuff
  • built in BBC Basic including an Assembler

What does it lack (only partly relevant to tiny intro sizecoding) ?

  • an FPU (like x87) with trigonometric or logarithmic functions
  • multicore support
  • shader access or any kind of OpenGL or DirectX
  • software development in general, so web browsing is there but a bit limited

Code Examples - Simple sizecoding framework and output to screen

So what would a common intro framework look like ? For now we will use the GNU assembler to assemble our code, as the built in BASIC Assembler doesn't support THUMB code.

Before we start with the actual code it's best to define some of the mentioned SWI's for OS interaction by their number. Here's a list of some basic ones.

.set OS_ScreenMode, 0x65
.set OS_RemoveCursors, 0x36
.set OS_ReadVduVariables, 0x31
.set OS_ReadMonotonicTime, 0x42
.set OS_ReadEscapeState, 0x2c
.set OS_Exit, 0x11
.set OS_CallASWI, 0x6f

So a basic intro loop in THUMB-2 would look like this:

.syntax unified
.thumb                   //assemble using thumb mode
movs r0,#0               //reason code to set screen mode by number
movs r1,#13              //screen mode 13 = 320x256 256 colours
swi OS_ScreenMode        //set screen mode 
adr.n r0,screen_address  //address of input block to read screen mode address
movs r1,r0               //address of output block where screen mode address is stored  
swi OS_ReadVduVariables  //read and write screen mode address from/to blocks 

mainloop:
ldr.n r7,screen_address  //read screen address
swi OS_ReadMonotonicTime //get OS timer to r0
movs r2,#255             //screen y
yloop:
   movs r1,#320          //screen x
   xloop:
      adds r3,r1,r0      //p = x+timer
      eors r3,r3,r2      //p = (x+timer) xor y
      strb r3,[r7],#1    //plot result as byte (with standard palette)
      subs r1,r1,#1      //dec x 
   bne xloop
   subs r2,r2,#1         //dec y
bge yloop
swi OS_ReadEscapeState   //ESC pressed ?
bcc mainloop
swi OS_Exit              //if yes exit to OS

.align 2                 //align
screen_address:
.word 148                //input block to read screen address
.word -1                 //request block needs to be terminated by -1

This assembles to 52 Bytes.
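If you want to try out the pixel formula before assembling anything, the two loops of the listing above can be modelled in Python. This is a rough sketch only: `render_frame` is a made-up name, and the frame is returned as a byte string instead of being written to the screen address.

```python
WIDTH, HEIGHT = 320, 256   # screen mode 13

def render_frame(timer):
    """Return one frame as bytes, in the order the assembly stores it."""
    frame = bytearray()
    for y in range(HEIGHT - 1, -1, -1):            # r2 counts 255..0
        for x in range(WIDTH, 0, -1):              # r1 counts 320..1
            frame.append(((x + timer) ^ y) & 0xFF) # strb keeps only the low byte
    return bytes(frame)
```

As only the low byte is stored, the pattern repeats every 256 timer ticks.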

As you can see, for setting the screen mode you can rely on smaller old school modes with up to e.g. 800x600x256 colours by just choosing a mode by its number (listed here: Screen Modes). After you set the screen mode you have to read its start address with OS_ReadVduVariables, as that is not a fixed address. On one specific device it should work to read that address once and hardcode it into your code, but then of course you would be restricted to your device (e.g. an RPi4 shows different results than an RPi3 for the same screen mode).

An intro showing that technique is e.g. Exoticorn's Edgedancer

If you want to go for true colour it's a bit more complex. Probably the shortest way is to kind of upgrade those old school screen modes by a string, using reason code 15 of the SWI (check out this link for further information). That would look like this code snippet:

.syntax unified
.thumb                   //assemble using thumb mode
movs r0,#15              //reason code to request screen mode by string     
adr.n r1,mode_string     //pointer to string
swi OS_ScreenMode        //set screen mode 
adr.n r0,screen_address  //address of input block to read screen mode address
movs r1,r0               //address of output block where screen mode address is stored  
swi OS_ReadVduVariables  //read and write screen mode address from/to blocks 

mainloop:
ldr.n r7,screen_address  //read screen address
swi OS_ReadMonotonicTime //get OS timer to r0
movs r2,#255             //screen y
ands r0,r0,r2            //get lowest byte of timer
lsls r0,r0,#8            //create 'B' for RGB from timer
yloop:
   lsls r4,r2,#16        //create 'R' for RGB from y
   orrs r4,r4,r0         //combine 'R' and 'B'
   movs r1,#320          //screen x
   xloop:
      lsrs r3,r1,#1      //x>>1 for 'G' as x>256
      orrs r3,r3,r4      //finalize RGB value 
      stmia r7!,{r3}     //store true colour pixel and increment address
      subs r1,r1,#1      //dec x 
   bne xloop
   subs r2,r2,#1         //dec y
bge yloop
swi OS_ReadEscapeState   //ESC pressed ?
bcc mainloop
swi OS_Exit              //if yes exit to OS

.align 2                 //align
mode_string:
.string "13 C16M"        //screen mode string (terminated by 0) => 13 = 320*256 C16M = true colour
screen_address:
.word 148                //input block to read screen address
.word -1                 //request block needs to be terminated by -1

This assembles to 68 Bytes.
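How the true colour listing above builds its 32-bit pixel word can be sketched in Python as follows. The function name `pixel_word` is hypothetical, and the model only describes which bits come from which counter. Which colour channel each byte ends up in depends on the pixel format of the mode, so that part is left open here.

```python
def pixel_word(x, y, timer):
    """32-bit value stored by stmia for column x (320..1) and row y (255..0)."""
    from_timer = (timer & 255) << 8   # ands + lsls on r0: bits 8..15
    from_y = y << 16                  # lsls r4,r2,#16: bits 16..23
    from_x = x >> 1                   # lsrs r3,r1,#1: bits 0..7, as x can exceed 255
    return from_y | from_timer | from_x
```

For instance `hex(pixel_word(320, 255, 0x1234))` shows the three fields side by side.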

An intro showing that technique is e.g. Exoticorn's Elsecaller

Another approach is to read the current screen mode instead of setting one, as most users would run in 1920x1080x32Bit anyway. This also makes the intro independent of the resolution.

An intro showing that technique is e.g. Kuemmel's Risc OS 3dball. In a later upgrade to that intro you can also see the combined use of THUMB-2 and NEON within the code, which led to a reduction in code size of around 44 Bytes compared to the initial non-THUMB version. For more insights and requirements of the use of VFP/NEON check out the section below.

To trigger THUMB mode for the resulting executable you can conveniently set the lowest Bit of the start address (executables in Risc OS have a load address and a start address stored in the filesystem as attributes) with the following command on the command line in Risc OS (&8000 is the general start address for executables in Risc OS). The best way to do so is to use a batch file, as shown in most of the above mentioned intros:

SYS "OS_File",1,"filename",&8000,&8001,,19

Regarding THUMB mode on Risc OS there's a small thing to address. A very ancient module has to be removed from the OS, otherwise it crashes your code. As of today that bug is still not fixed. The module's name is "SpecialFX" and it needs to be removed by "rmkill SpecialFX" on the command line or by a batch file, as shown in the intro links from above.

To exit your intro and go back to the desktop you simply use the shown SWI OS_Exit. If you didn't change the mode you have to use e.g. the SWI "OS_NewLine" to re-trigger a desktop redraw. Of course all of those can be omitted if your tiny intro compo rules allow you to...

Code Examples - Using VFP/NEON code

...work in progress...for now check out the source code of Kuemmel's Risc OS 3dball

Code Examples - Sound output by interrupt driven bytebeat

For basic sound output the principle of a so called timer based bytebeat can be used. For further reference check out this thread on pouet: Experimental music from very short C programs. I took an example bytebeat from rrrola (shortened by ryg).
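The bytebeat formula used in the listing further down is t*(0xca98>>(t>>9&14)&15)|t>>8. Before moving it into an interrupt handler it can be prototyped in a few lines of Python, as the operators involved have the same precedence there as in C. This is only a sketch: the names `bytebeat_sample` and `render` are made up, and the rate of 8000 samples per second is taken from the sample rate set in the listing.

```python
def bytebeat_sample(t):
    """8-bit sample of t*(0xca98>>(t>>9&14)&15)|t>>8 (precedence as in C)."""
    return (t * (0xca98 >> (t >> 9 & 14) & 15) | t >> 8) & 0xFF

def render(seconds, rate=8000):
    """Unsigned 8-bit mono samples, one per tick of t."""
    return bytes(bytebeat_sample(t) for t in range(seconds * rate))
```

The constant 0xca98 acts as a tiny lookup table: every 1024 ticks the next 4-bit nibble (shifted in steps of 2 bits) is picked as the multiplier for t.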

To achieve that we need to set up an interrupt handler to take care of a timed output to the system's sound buffer. Here comes a bit of an obstacle. The SWI's for that purpose have a number that exceeds 0xff, which would be fine for normal ARM code but not for THUMB. So here we've got to use the SWI OS_CallASWI to call those SWI's indirectly. The SWI number to be called has to be set in r10. As we need 3 different SWI's for that in total (install handler, sample rate, remove handler) and those SWI's are within a short range of numbers, we can save some bytes by just adding/subtracting an offset for the other calls. Check out the code here:

.syntax unified
.thumb
//--- set up shared sound interrupt handler ---------------
adr.w r0,soundcode+1    //+1 as code address for interrupt routine needs to be in thumb state also
movs  r2,#0             //immediate handler
adr.n r3,soundhandler_title
str   r2,[r3]           //dummy title string
movw  r10,#0xb440
movt  r10,#0x6          //install XSharedSound handler (SWI 0x6b440)
swi   OS_CallASWI
mov   r4,r0             //backup handler number (r0 gets corrupted by SharedSound_SampleRate)
mov   r1,#8000*1024     //sample rate *1024
add   r10,r10,#6        //XSharedSound_SampleRate (SWI 0x6b446)
swi   OS_CallASWI
sub   r10,r10,#5        //prepare r10 for XSharedSound_RemoveHandler (SWI 0x6b441) on exit later
//--- main intro loop -------------------------------------
mainloop:
//any graphics code or whatever would be here
swi OS_ReadEscapeState
bcc mainloop
mov r0,r4               //restore handler number
swi OS_CallASWI         //Remove XSharedSound handler
swi OS_Exit
//--- interrupt routine/sound generation ------------------
// r1 -> base of buffer, r2 -> end of buffer, r6 = 8.24 fractional step
// ByteBeat formula is = t*(0xca98>>(t>>9&14)&15)|t>>8
soundcode:
push {r0-r7,LR}
lsrs  r6,r6,#8          //adjust fractional step
ldr.n r0,soundtimer     //t = soundtimer
soundloop:
   lsrs r5,r0,#16       //adjust timer for bytebeat
   movw r7,#0xca98      //bytebeat multi constant
   lsrs r4,r5,#9        //t>>9
   and  r4,r4,#14       //(t>>9)&14
   lsrs r7,r7,r4        //0xca98>>(t>>9)&14
   and  r7,r7,#15       //(0xca98>>(t>>9)&14)&15
   muls r7,r5,r7        //t*((0xca98>>(t>>9)&14)&15)
   orr  r7,r7,r5,lsr#8  //t*(0xca98>>(t>>9&14)&15)|t>>8
   lsls r7,r7,#8        //8Bit => 16Bit sound
   orr  r7,r7,r7,lsl#16 //mono => stereo copy
   stm  r1!,{r7}        //store sound word
   adds r0,r0,r6        //inc timer by fractional step
   cmp  r1,r2           //check if buffer filled
bne soundloop
adr.n r4,soundtimer
str r0,[r4]             //save timer...no pc relative str in Thumb...
pop {r0-r7,PC}
//--- data ----------------------------------------------
.align 2
soundhandler_title:
soundtimer:

This assembles to 96 Bytes.
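The SWI number juggling in r10 from the listing above can be followed with a little Python arithmetic. Treat this as a sketch: the variable names and the helper `fits_thumb_swi` are made up, the three numbers are the ones given in the listing's comments, and the only thing added is the assumption that bit 17 marks the error-returning "X" form of a SWI.

```python
X_BIT = 0x20000   # assumption: bit 17 selects the "X" form of a SWI

# movw r10,#0xb440 / movt r10,#0x6 build the number from two 16-bit halves
install_handler = (0x6 << 16) | 0xB440    # 0x6b440, per the listing
sample_rate = install_handler + 6         # add r10,r10,#6 -> 0x6b446
remove_handler = sample_rate - 5          # sub r10,r10,#5 -> 0x6b441
plain_install = install_handler & ~X_BIT  # same SWI without the X bit

def fits_thumb_swi(number):
    """True if the number fits the 8-bit immediate of a 16-bit THUMB swi."""
    return 0 <= number <= 0xFF

print(hex(install_handler), fits_thumb_swi(install_handler))  # needs OS_CallASWI
print(hex(0x6F), fits_thumb_swi(0x6F))                        # OS_CallASWI itself
```

The point of the add/sub trick is that each follow-up number costs one short instruction instead of another movw/movt pair.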

There are other ways to do sound on Risc OS, but those were not evaluated at the time of writing. BBC Basic also has ways to create sounds by note or frequency (Link is here). Check out the Sound SWI calls in detail here: Sound SWI Calls. Some further insights on the sound system can be found here: The Risc OS sound system by J. Lesurf.

Resources

Links on the OS

RISC OS Open - Home of the current OS version, documentation on the OS (e.g. SWI's) and discussion forum

RISC OS Direct - Easy installation package for Risc OS and all needed sizecoding tools for your Raspberry Pi including !GCC (includes gnu assembler) and !StrongED (most popular text editor)

Links on ARM coding

Thumb 16-bit Instruction Set Quick Reference Card

ARM and Thumb-2 Instruction Set Quick Reference Card

Vector Floating Point Instruction Set Quick Reference Card

NEON Programmer's Guide

Instruction Set Overview

Coding for NEON - Part 1 - load and stores

Coding for NEON - Part 2 - dealing with leftovers

Coding for NEON - Part 3 - matrix multiplication

Coding for NEON - Part 4 - shifting left and right

Coding for NEON - Part 5 - rearranging vectors

Condition Codes 1: Condition Flags and Codes

Condition Codes 2: Conditional Execution

Condition Codes 3: Conditional Execution in Thumb-2

Condition Codes 4: Floating-Point Comparisons Using VFP