Dialectronics - the dialect of electronic communications









PowerPC

As noted in our Words area, IBM's developerWorks Power Architecture Zone was at one time going to carry a series written by tim on hand-rolling PowerPC assembly code. That series was never published there, so we decided to publish the first installments ourselves. Part I is an overview of the PowerPC architecture registers and a quick introduction to the syntax of a few instructions. Part II digs into some disassembly of object code in order to discuss what a programmer might run into when debugging. Part III does some simple math, including using Pascal's Summation to sum between two numbers (neither of which has to be zero) in twelve assembly instructions. (The included version that sums from zero to a number takes only four instructions, while the twelve-instruction version deals with upper and lower bounds.) None of these pieces should be considered the final authority, and any of them may contain one or more factual errors.
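For readers who want the arithmetic behind Part III before reading the assembly, here is a rough C sketch. It assumes the summation in question is the familiar closed form n(n+1)/2 and its two-bound generalization; the function names sum_to and sum_between are made up for this illustration and do not come from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Sum 0 + 1 + ... + n using the closed form n(n+1)/2.
 * The 64-bit intermediate keeps the product from overflowing. */
static uint64_t sum_to(uint32_t n)
{
    return (uint64_t)n * ((uint64_t)n + 1) / 2;
}

/* Sum lo + (lo+1) + ... + hi, both bounds inclusive.
 * (first term + last term) * (number of terms) / 2 */
static uint64_t sum_between(uint32_t lo, uint32_t hi)
{
    uint64_t count;

    assert(lo <= hi);                 /* caller must pass ordered bounds */
    count = (uint64_t)hi - lo + 1;    /* how many terms are being summed */
    return ((uint64_t)lo + hi) * count / 2;
}
```

With these definitions, sum_to(10) and sum_between(0, 10) both give 55, and sum_between(3, 7) gives 25. The extra work of handling two bounds is in line with the two-bound assembly version needing more instructions than the zero-based one.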

Currently tim is working on a faster bcopy/memcpy using hand-coded PowerPC assembly. The principle is simple: first copy the leftover bytes of the total size that are not divisible by four, then copy the rest in four-byte chunks. Since the penalty for using GPRs on unaligned data is negligible, it is not necessary to drop back to single-byte copies just because the size or address isn't divisible by four. This can be extended to use AltiVec, but doing so introduces additional boundary conditions to handle misaligned data (e.g., addresses not on 16-byte boundaries, offsets, etc.).
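The two steps can be sketched in C as follows. This is only an illustration of the principle, not a translation of the assembly: the name copy_fourbytes is hypothetical, the buffers are assumed not to overlap, and the small fixed-size memcpy calls stand in for the unaligned word loads and stores that the assembly performs directly with GPRs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy len bytes from src to dst.
 * Assumes the two buffers do not overlap. */
static void *copy_fourbytes(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t odd = len & 3;     /* 0-3 bytes left over after dividing by four */
    size_t words = len >> 2;  /* number of whole four-byte chunks */

    assert(len == 0 || (dst != NULL && src != NULL));

    /* Step 1: copy the odd bytes one at a time. */
    while (odd-- > 0)
        *d++ = *s++;

    /* Step 2: copy the rest four bytes at a time, whatever the alignment. */
    while (words-- > 0) {
        uint32_t w;
        memcpy(&w, s, sizeof w);  /* portable stand-in for an unaligned load */
        memcpy(d, &w, sizeof w);  /* portable stand-in for an unaligned store */
        s += 4;
        d += 4;
    }
    return dst;
}
```

Note that the address is never tested at all: only the size decides how many single-byte copies happen, which is the point of the approach.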

There is a bcopy.S version for incorporation into NetBSD, which is the version we have used for several months without problems. It contains both bcopy and memcpy replacements. For NetBSD, this file goes at /usr/src/lib/libc/arch/powerpc/string/bcopy.S

The four-byte PowerPC-optimized code handles copies about 10% faster than the stock NetBSD code for four-byte-aligned data and about 80% faster for unaligned data.

The alternative fourbytes.s can be used in applications, though, which is how tim tested it. The code needs to be assembled with the -mregnames flag to as, which allows the symbolic register syntax.

tim has also begun an AltiVec version that will copy 16 bytes at a time, but this requires handling cases where the source, the destination, or both are not aligned on 16-byte boundaries, in addition to sizes that are not divisible by sixteen. Don't look for it any time soon. Oh yeah, IBM might hold patents on some of the techniques needed to do this, if our research is correct, but we could never get a definitive answer from their legal department (no doubt they are waiting until we actually use code that infringes on their patents).
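To give a feel for the bookkeeping those boundary cases involve, here is a small C sketch that only plans a 16-byte copy; it does not use AltiVec and is not tim's code. The struct copy_plan, the function plan_16byte_copy, and the choice to align the destination first are all assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of how a copy splits around 16-byte boundaries. */
struct copy_plan {
    size_t head;        /* bytes copied singly until dst is 16-byte aligned */
    size_t blocks;      /* whole 16-byte blocks after the head */
    size_t tail;        /* bytes left over after the last whole block */
    int    src_aligned; /* nonzero if src is also aligned once head is done */
};

/* Addresses are passed as integers so the arithmetic is easy to follow. */
static struct copy_plan plan_16byte_copy(uintptr_t dst, uintptr_t src, size_t len)
{
    struct copy_plan p;
    size_t head = (size_t)((16 - (dst & 15)) & 15); /* distance to next boundary */

    if (head > len)
        head = len;               /* short copy ends before the boundary */

    p.head = head;
    p.blocks = (len - head) / 16;
    p.tail = (len - head) % 16;
    p.src_aligned = (((src + head) & 15) == 0);

    assert(p.head + p.blocks * 16 + p.tail == len); /* every byte accounted for */
    return p;
}
```

When src_aligned comes back zero, the source and destination sit at different offsets within their 16-byte lines, so whole-block loads from the source cannot be stored unchanged; that is the awkward case where both addresses disagree about alignment.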

Also, there is the L2 cache configuring code. It is a little out of date, so contact tim if you want a newer version. (You know the routine - gtkelly at this domain.)