Old World Macs
While reviewing how other RISC CPUs handle argument register overflow so that I could implement it correctly on RISC-V, I realized that the powerpc arch was not putting overflow parameters in the correct location. The main problem was that arch/powerpc/code.c calculated the overflow location from r3, which is the first parameter register, not r11, which is the first register that will overflow. The second issue was that the offset between the stack pointer and the overflow location was calculated as 24 bytes instead of the 8 specified by the ELF ABI. Last, the stack frame calculation itself was overstating its needs. Along the way I fixed an issue where r0 and r2 were incorrectly labeled, leading r2 to be used as a volatile register and r0 as non-volatile (the reverse is true).
    printf("sequence %s (%d) found at %d %d %d sequence %s (%d) found at %d %d %d\n", sq, sq_l, x, y, x, sq, sq_l, x, y, x);

would result in

    stw r4,64(r1)
    stw r5,60(r1)
    stw r4,56(r1)
    mr r10,r11 ; assign r11 to r10
    mr r9,r3 ; assign r3 to r9
    mr r8,r4 ; assign r4 to r8
    mr r7,r5 ; assign r5 to r7
    mr r6,r4 ; assign r4 to r6
    mr r5,r11 ; assign r11 to r5
    mr r4,r3 ; assign r3 to r4
    lis r3,L411@ha
    addi r3,r3,L411@l
    bl printf ; call (args, no result) to scon

The 56(r1) is too high. The patches below correct this to place the overflow parameters at 8 bytes above the SP, consistent with ELF ABI. Possibly Mach-O needs it elsewhere, in which case let me know. I also completely revamped the prolog and epilog code.
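The placement rule described above can be written down as a small C sketch. This is only an illustration, not the code in arch/powerpc/code.c: the function and constant names are made up here, and it assumes 32-bit words, integer-only arguments, and r3 through r10 as the eight parameter registers.

```c
/* Hypothetical names, for illustration only; not taken from pcc. */
enum {
    FIRST_ARG_REG = 3,  /* r3 holds the first integer argument */
    GPR_ARG_COUNT = 8,  /* assumed: r3..r10 are the parameter registers */
    OVERFLOW_BASE = 8,  /* first overflow word sits 8 bytes above r1 (ELF ABI) */
    WORD_SIZE     = 4   /* 32-bit PowerPC */
};

/* Register number for the zero-based integer argument `index`,
 * or -1 if that argument has overflowed to the stack. */
int arg_register(int index)
{
    return index < GPR_ARG_COUNT ? FIRST_ARG_REG + index : -1;
}

/* Byte offset from r1 for the zero-based integer argument `index`,
 * or -1 if that argument still fits in a register. */
int overflow_offset(int index)
{
    if (index < GPR_ARG_COUNT)
        return -1;
    return OVERFLOW_BASE + WORD_SIZE * (index - GPR_ARG_COUNT);
}
```

Under these assumptions the ninth and tenth integer arguments (indexes 8 and 9) land at 8(r1) and 12(r1). Floating-point arguments travel in FPRs and are not counted in the index.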
Correct code, including a floating point argument substituted at the fifth parameter, which does not count against the overflow of GPRs:
    mr r0,r30 ; preserve FPREG
    mr r30,r1 ; establish frame pointer
    stwu r1,-32(r1) ; move the stack pointer
    stw r0,-4(r30) ; save FPREG relative to frame pointer
    stw r30,0(r1) ; save previous stack pointer
    mflr r0
    stw r0,4(r1)
    .L403:
    mr r6,r4 ; assign r4 to r6
    mr r7,r5 ; assign r5 to r7
    .L407:
    lis r4,.L409@ha
    lfd f0,.L409@l(r4)
    lis r4,.L411@ha ; load sname
    lfd f2,.L411@l(r4)
    fadd f1,f0,f2 ; (l)double add
    stw r6,12(r1)
    stw r7,8(r1)
    mr r10,r6 ; assign r6 to r10
    mr r9,r0 ; assign r0 to r9
    mr r8,r3 ; assign r3 to r8
    mr r5,r0 ; assign r0 to r5
    mr r4,r3 ; assign r3 to r4
    lis r3,.L413@ha
    addi r3,r3,.L413@l
    bl printf ; call (args, no result) to scon
    li r3,0
    lwz r30,-4(r30) ; restore FPREG
    lwz r0,4(r1) ; reload stack pointer
    mtlr r0 ; restore link register
    lwz r1,0(r1) ; restore stack pointer
    blr

macdefs.h.diff
A long time ago: As noted in our Words area, at one time IBM's developerWorks Power Architecture Zone was going to carry a series written by tim on hand-rolling PowerPC assembly code. While that series never did get published, we decided to publish the first two installments. Part I is an overview of the PowerPC architecture registers and a quick introduction to the syntax of a few instructions. Part II digs into some disassembling of object code in order to discuss what a programmer might run into when debugging. Neither piece should be considered the final authority, and either may contain one or more factual errors. Part III does some simple math, including using Pascal's Summation to sum between two numbers (neither of which has to be zero) in twelve assembly instructions. (The included version that sums from zero to a number takes only four instructions, while the twelve-instruction version deals with upper and lower bounds.)
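For readers who want the arithmetic behind Part III without the assembly, here is a minimal C sketch of the standard closed-form sums. It is not a transcription of the assembly in the article. The function names are ours, and it assumes lo <= hi and values small enough not to overflow a long.

```c
/* Sum of the integers 0..n inclusive: n(n+1)/2. */
long sum_to(long n)
{
    return n * (n + 1) / 2;
}

/* Sum of the integers lo..hi inclusive (assumes lo <= hi).
 * (lo + hi) and (hi - lo + 1) always differ in parity, so their
 * product is even and the division by two is exact. */
long sum_range(long lo, long hi)
{
    return (lo + hi) * (hi - lo + 1) / 2;
}
```

The bounded version needs the extra subtraction and addition for the two limits, which is in keeping with it costing more instructions than the zero-based one.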
Currently tim is working on a faster bcopy/memcpy using hand-coded PowerPC assembly. The principle is simple: first copy the leftover bytes (the remainder when the total size is divided by four), then copy the rest in chunks of four bytes. Since the penalty for using GPRs with unaligned data is negligible, it is not necessary to drop back to single-byte copies just because the size or address isn't divisible by four. This can be extended to use Altivec, but doing so introduces additional boundary conditions to handle misaligned data (i.e., addresses not on 16-byte boundaries, offsets, and so on).
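The principle can be sketched in C. This is an illustration of the idea only, not the contents of bcopy.S or fourbytes.s: the function name is made up, it does not handle overlapping buffers the way bcopy must, and it uses fixed-size memcpy calls for the word moves because dereferencing a misaligned uint32_t pointer is undefined in portable C (a compiler will typically turn each one into a single load or store).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy len bytes from src to dst, odd bytes first,
 * then whole 32-bit words. Buffers must not overlap. */
void *copy_by_words(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Step 1: copy the 0..3 bytes that keep len from being a multiple of 4. */
    size_t odd = len & 3;
    while (odd-- > 0)
        *d++ = *s++;

    /* Step 2: copy the remainder four bytes at a time, whatever the alignment. */
    for (size_t words = len >> 2; words > 0; words--) {
        uint32_t w;
        memcpy(&w, s, sizeof w);  /* one word load */
        memcpy(d, &w, sizeof w);  /* one word store */
        s += 4;
        d += 4;
    }
    return dst;
}
```

A call such as copy_by_words(buf + 1, text + 3, 19) copies three single bytes and then four words, even though neither pointer is word aligned.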
There is a bcopy.S version for incorporation into NetBSD, which is the version we have used for several months without problems. It contains both bcopy and memcpy replacements. For NetBSD, this file goes at /usr/src/lib/libc/arch/powerpc/string/bcopy.S
The four-byte PowerPC-optimized code handles four-byte-aligned copies about 10% faster than the stock NetBSD code, and unaligned copies about 80% faster.
The alternative fourbytes.s can be used in applications, though, which is what tim used to test it. The code must be assembled with the -mregnames flag to as so that the register syntax is accepted.
tim has also begun an Altivec version that will copy 16 bytes at a time, but this requires handling cases where either or both source and destination are not aligned on 16 byte boundaries in addition to sizes not sixteen byte divisible. Don't look for it any time soon. Oh yeah, IBM might hold patents on some of the techniques needed to do this, if our research on this is correct, but we could never get a definitive answer from their legal department (no doubt waiting until we actually used the code that infringes on their patents).
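To show the kind of bookkeeping involved, here is a hedged C sketch that only plans a 16-byte copy. It is not tim's Altivec code and contains no vector instructions. The struct and function names are invented for illustration, and aligning on the destination first is just one possible strategy.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plan for a copy done mostly in 16-byte blocks. */
struct copy_plan {
    size_t head;      /* bytes copied singly until dst reaches a 16-byte boundary */
    size_t blocks;    /* number of full 16-byte blocks */
    size_t tail;      /* bytes left over after the last full block */
    int src_aligned;  /* 1 if src is also 16-byte aligned once the head is done */
};

struct copy_plan plan_16(uintptr_t dst, uintptr_t src, size_t len)
{
    struct copy_plan p;

    /* Distance from dst up to the next 16-byte boundary (0 if already there). */
    size_t head = (size_t)((16 - (dst & 15)) & 15);
    if (head > len)
        head = len;  /* short copies may never reach the boundary */

    p.head = head;
    p.blocks = (len - head) / 16;
    p.tail = (len - head) % 16;
    p.src_aligned = ((src + head) & 15) == 0;
    return p;
}
```

When src_aligned comes back 0, the source and destination sit at different offsets within their 16-byte lines, which is the case that needs the extra shifting or merging work mentioned above.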
Also, there is the L2 cache configuring code. It is a little out of date, so contact tim if you want a newer version. (You know the routine - gtkelly at this domain.)