Dialectronics - the dialect of electronic communications





About Us

Our Mission

PowerPC

Lua

XCOFF Bootloader

Old World Macs

Open Firmware

Words

Journeys

Home Repair

Top Level




Some Assembly Required, Part III


A possible compromise can be to handroll in assembly specific functions that need to have maximum performance.  Handrolled functions can outperform compilers in many situations because compilers have different requirements that lead to decisions like creating a stack frame without using it.  A programmer with a good grasp of PowerPC assembly can write efficent code with minimal effort.  The effort also provides the programmer with a better grasp of the underlying architecture which immediately leads to the programmer being able to make better decisions about maximizing overall performance.

In order to help beginners learn assembly, each installment of Some Assembly Required that generates any handrolled code will also include a downloadable tarball that builds the code into an executable.  The standard operating system the tarball has been built and tested against is currently NetBSD 3.99.3, with gcc 3.3.3 and as 2.1.5 for the C and assembly compilers, respectively.  The C source code will be kept in separate files from the assembly source code.  Regardless of whether the code is in C or assembly, tabs will be used to align text in an easily read display in the default editor pico 4.9. (For the web version of code, we'll be using non-breaking space characters until we can find a better way to make stuff line up.)

Each downloadable will include a Makefile with fairly generic instructions.  Typically a Makefile in the downloadable will have a structure similar to


SHELL=  /bin/sh
PATH=   ../
PROG=   sum

OBJPATH= ${PATH}obj/
OBJS=   main.o summation.o

CC?=    cc
CFLAGS= -c
LD?=    ld
LDFLAGS=
AS?=    as
ASFLAGS= -mregnames


all: ${PROG}

${PROG}:  ${OBJS}
        ${CC} -o ${PROG} $(OBJS)

main.o: main.c
        ${CC} ${CFLAGS} -o ${OBJPATH}main.o ${PATH}main.c

summation.o: summation.s
        ${AS} ${ASFLAGS} -o ${OBJPATH}summation.o ${PATH}summation.s

clean:
        rm -f ${OBJPATH}*.o

.PHONY: clean



The use of variables in the Makefile will assist in programmers making changes easily in order to build on their own systems.  The object files will be built into a subdirectory obj/, while the source files will be at the top level.  In general, C source files will end in .c, while assembly source files will end in .s.  Other conventions do include .a and .S, but these are arbitrary as well.  The as flag -mregnames allows for the use of "r3" instead of "%r3" for consistent syntax with objdump outputs.  The default C compilation does not include any compiler optimizations; these can be specified on the command line:


make CFLAGS=-Wall\ -O3\ -c


An  easy place to start with assembly code is simple incrementing.  Consider the lowly for loop:


for (i = 0 ; i<= 100; i++)   
              j = j + i;


Although there are slickly bummed code snippets that can do this in 1.5 instructions, it's more informative to examine this loop using a wide range of aspects.  First, there are three values to consider: i, j, and the maximum value 100.  For the purpose of this article, each of those values is best manipulated in general purpose registers (GPR).  Also, for maximum readability, the assembly code syntax will not be specific to a version of an Assembly compiler, so while some compilers might need mr 3,4 or mr %r3, %r4, this article will use mr r3, r4 (remember that for as the flag -mregnames allows avoidance of "%" in the register name).  This article will also use mnemonics when convenient, but will also use the expanded instruction when clarity is preferred.

So, getting back to the for loop above, it is useful to expand the for loop above into its own function call.  One note about handrolled assembly code - if this is done inline with another programming language within the same function call, special syntax must be used to tell the compiler what variables the assembly instructions need.  For clarity, this article will use fully self-contained assembly function calls.  Here is the C equivalent of the function:


int DoSummation(int value)
{
       int i, j;

       for (i = 0; i <= value; i++)
              j = j+i;
       
       return j;
       
}

A quick refresher on PowerPC function calling: arguments passed to functions are placed in order starting at r3; the return value comes back in r3.  r4 is a good place to store the summation before returning it (don't forget to zero r4 first).  The quickest way to do this summation is do to it backwards - start at the value to sum and add each (value-1) until value is reduced to zero.

(Illustration r4 = 100 + 99 + 98 + ... + 2 + 1 )


       .text     
       .globl  DoSummation
DoSummation:
       xor  r4, r4, r4

loop:
       cmpi cr0, 0, r3, 0
       add  r4, r4, r3
       subi r3, r3, 1
       bgt+ cr0, loop
       mr   r3, r4
       blr

After a while the above code will look like the same old language you've always programmed in, sort like spending a lot of time immersed in a new spoken language.  What does the code do?  xor (Exclusive OR) zero'd r4 by xor'ing r4 with itself.  cmpi (Compare Word Immediate) compared r3 (value) to zero and stored the results in cr0 (Condition Register 0).  cr0 and cr1 are volatile, so it is not necessary to preserve their original contents.  The "0" in cmpi indicates this is a 32 bit compare, in which case r3 is not extended to 64 bits.  There is also cmp (Compare Word) that uses a value in a register instead of a fixed integer.  Since there is some latency in cmp*, going ahead and doing compares early means they are likely to have completed by the time the result of the compare is needed.  The flipside to this is that the loop will execute one additional time before the results of the comparison are evaluated.  add takes the two rightmost registers, adds them, and stores the result in the leftmost register.  Registers can overwrite themselves during instructions, so there is no need to put the result in a separate register and then move it before adding the next value.  

subi subtracts a 16 bit integer from r3 and stores the result in the same r3.  Other registers can be substituted.  Why 16 bits?  The opcode only allocates 16 bits to the value to be subtracted.  If values with more than 16 bits are needed, use sub with lis, which is subtract using registers and Load Immediate Shift, which takes a 16 bit value, shifts it a specified number of bits, and stores the result in a register.  sub and subi are actually mnemonics for add and addi, respectively, but automatically two's compliment the integer value (adding the negative of the value).

The bgt+ refers back to the compare instruction earlier.  The "+" indicates a hint for the branch instruction, such that most likely the branch will be taken (back to the location "loop:").  "-" would indicate the branch was not likely to be taken.  For the loop above, hinting for any summation of a number larger than 2 will result in improved performance with a penalty only on the last compare.

The results are moved r4 to r3, the expected location for return values, with mr r3, r4.  blr (Branch To Link Register) returns the execution to the address in the Link Register, which was actually set by the calling function to be one instruction address greater than the function call jump.

The above function call does not need a stack frame, so the normal prolog and epilog are skipped, saving a few instructions (each one is 4 bytes, by the way).  Some caveats about the above code: it doesn't work for negative numbers, and it uses registers that are labeled volatile but the compiler may not expect those registers to be used unless specified as passed arguments to the function call.  Normally preserving volatile register contents is the responsibility of the calling function; check with your assembly compiler before assuming this holds.

Since a goal is fast and small code, it is possible to reduce the number of total instructions executed by utilizing a trick first documented by Blaise Pascal (at age eight).  The summation of a number is the number times the number plus one divided by two (S = ((n+1)*n)/2):

       .text
       .globl  DoPascalSummation
       
DoPascalSummation:
       addi    r4, r3, 1
       mullw   r3, r4, r3
       li      r4, 1
       srw     r3, r3, r4
       blr


This completely eliminates looping and does the summation in a total of four instructions.  The addi instruction adds 1 to r3 and stores the result in r4, leaving r3 untouched.  The mullw instruction multiplies r3 and r3 together, with the result going to r3.  Dividing by two is the same as shifting to the right one bit, and this is accomplished first by li (Load Immediate) the number of bits, 1, into r4, and then srw uses r4 to specify the number of bits to shift r3 to the right (Shift Right Word), leaving the result in r3 for return from the function call.  Nothing here that can't be done in C or another higher language, but there's a certain elegance to simplicity.

However, as noted earlier, this does not handle negative numbers.  Extending this to properly calculate the summation of negative numbers is interesting and worth exploring.  Analytically the identity can be seen to be identical to evaluating the summation of the positive value and then sign reversed.  This is the same as taking the two's compliment of the number, which is done with the neg instruction.  The twist is the location at which the negate must occur.  Division by 2 of a two's compliment number does not yield the same answer as dividing by two of a positive number and then negating it.

For example, the summation of 0 to 100 is 5050, and the summation of -100 to 0 is -5050.  The step prior to obtaining the result -5050 is to divide -10100 by 2 (shifting the value one bit to the right).  In two's compliment -10100 is 4294957195; dividing this by 2 is 2147478597, not the desired result.  The approach that yields the correct result is to carry a flag the indicates the value was originally negative and work with the positive version, then two's compliment the final result to be returned.


    .text
    .globl  DoPascalSummation

DoPascalSummation:
        cmpi    cr0, 0, r3, 0
        bge+    cr0, 0x08
        neg     r3, r3
        addi    r4, r3, 1
        mullw   r3, r4, r3
        li      r4, 1
        srw     r3, r3, r4
        bge+    cr0, 0x08
        neg     r3, r3
        blr

The cmpi checks if r3 (the value to be summed) is positive or negative and stores the result in cr0.  At no point in the code subsequent to cmpi is cr0 overwritten.  If the value is positive, the code skips over the next instruction, which is to take the two's compliment of the value (making it positive).  The balance of the code is identical to the earlier version, with the exception of again checking if the value was positive or negative (using the stored comparison result) and two's complimenting the result if the original value was negative.


The above summation identity is a special case of a summation where the lower bound is zero.  The general identity is S = ((n-m+1)(n+m)/2).  This requires passing a lower bound and an upper bound, and as such, it's best to set up a stack frame and use non-volatile registers to hold intermediate results:

/* S = (n-m+1)(n+m)/2           */
/* r3 = n                       */
/* r4 = m                       */
/* r30 = n+m                    */
/* r31 = n-m+1                  */

asm (
       .text
       .globl  DoPascalSummationTwo

DoPascalSummationTwo:
       stwu    r1, -16(r1)
       stw     r31, 4(r1)
       stw     r30, 8(r1)
       add     r30, r4, r3
       addi    r4, r4, 1
       sub     r31, r4, r3
       mullw   r3, r31, r30
       li      r4, 1
       srw     r3, r3, r4
       lwz     r30, 8(r1)
       lwz     r31, 4(r1)
       addi    r1, r1, 16
       blr

The examination of the code is left to the reader.

In this article we looked at the fundamentals of working with PowerPC at the instruction level.  This includes some discussion of the philosophy "registers are good, and the more the better," and what the overhead is regarding this philosophy.  In the next installment of this series we'll take a look at moving data in memory with a platform dependent implementation of memcopy.  Resources:

PEM,

UM for a couple processors

Lightsoft's Guide to PPC Assembler   Motorola's Macsbug

XCOFF, ELF, PEF, Mach-O formats