Dialectronics - the dialect of electronic communications









Some Assembly Required, Part I

Maximizing PowerPC performance includes embracing one key mantra: registers are good, and the more of them the better. The PowerPC specification calls for 32 General Purpose Registers (GPRs), each 32 or 64 bits wide, and 32 64-bit Floating Point Registers (FPRs). There is also an extension called VMX (Vector Multimedia Extensions), also known as AltiVec, which adds 32 128-bit SIMD registers. VMX is present on several PowerPC chips, including the MPC7xxx family and the IBM 970 family. Xenon, the three-core PowerPC chip at the heart of the Xbox 360, takes VMX one step further and extends it to 128 128-bit VMX registers. The PPE core of Cell also contains GPRs, FPRs, and VMX registers, in addition to the eight SPEs, which use their own instruction set for their 128 128-bit SIMD registers.

At the architectural level the removal of register starvation (also known as "register pressure") alters the optimizations available to a compiler. As fast as L1 caches are, registers are even faster. A 3x3 matrix can be loaded entirely into GPRs and there will be more than a dozen more registers available for holding scalar values and intermediate results. Registers are good, and the more of them the better.

GPRs are named r0-r31. The GPRs are used for 32-bit integer operations and code execution. Although the specific Application Binary Interface (ABI) will vary in the details of how specific GPRs are used, in general there are two kinds of GPRs, volatile and non-volatile. Volatile registers are those that can be destructively used during a function call, while non-volatile registers must be restored before returning from the function call. The non-volatile set usually includes the GPR being used to hold the stack frame address. In some ABIs, GPRs will be used to store addresses into library functions. Later this piece will discuss a few specific examples of calling conventions for function calls.

In general, there is very limited penalty for accessing unaligned data in memory with GPRs. While the GPRs are 32-bit, they can read and write on byte boundaries. The only exception is that instruction execution must occur on 32-bit (4 byte) boundaries. When loading a library to jump into for a function call, a GPR will be used in many cases and as such, 4 byte alignment is critical. GPRs can be used for both data and addressing, including dynamically generating data that becomes an address (such as a run-time installed function that goes into different low-energy modes depending on the processor the code is running on).

FPR and VMX registers have alignment constraints on data access, of differing natures. FPR access needs 8 byte alignment, but can be made to mimic GPR behavior and access misaligned data, at a significant performance penalty (see Resources regarding FP exceptions and OS handling). VMX, on the other hand, only accesses data on 16 byte boundaries and will truncate data addresses to the first lower boundary. Data corruption is possible if the software assumes otherwise.
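The VMX truncation described above can be modeled in a few lines of C. This is only a sketch of the address arithmetic, not of any real load instruction; the function names are hypothetical and chosen for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: the address a 16-byte VMX access would actually use.
   The low four bits of the requested address are dropped, which rounds the
   address down to the nearest 16-byte boundary. */
uintptr_t vmx_effective_address(uintptr_t requested)
{
    return requested & ~(uintptr_t)0xF;
}

/* Hypothetical helper: how many bytes the requested address sits past the
   boundary that will actually be accessed (0 means properly aligned). */
unsigned vmx_misalignment(uintptr_t requested)
{
    return (unsigned)(requested & 0xF);
}
```

A request for address 0x1007 would therefore touch the 16 bytes starting at 0x1000, so data the software expected at 0x1007 through 0x1016 is not what arrives in the register.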


Special Purpose Registers

There are several non-data oriented registers in PowerPC, known as Special Purpose Registers (SPRs). These include system registers like the processor version register (PVR), but also include processor specific implementations such as Hardware Implementation-Dependent (HID) registers. For operating system research, the needed documentation will include the "User's Manual." Usually a web search on terms like "MPC7400UM" will turn up relevant links. The links to IBM-specific processors can be found in the Resources section at the end of this article. Most SPRs are supervisor mode registers, but many have user mode read-only versions. To access supervisor level SPRs without user mode versions, applications must rely on the operating system's implementation for access to those registers.

The PowerPC specification defines several very important non-data registers that are not supervisor mode SPRs. Two registers used for branching are the Link Register (LR) and the Count Register (CTR). These registers are 64 bits wide on 64-bit PowerPC CPUs, and 32 bits wide on 32-bit CPUs. Additionally, there are three registers for assisting in determining conditionals and overflow situations. The Condition Register (CR) and Floating Point Status and Control Register (FPSCR) are 32 bits regardless of the architecture, while the Fixed-Point Exception Register (XER) can be 32 or 64 bits wide. This series will deal initially with 32-bit implementations but look at 64-bit implementations when interesting, and later move to a 64-bit focus. Unlike other standards, the PowerPC standard was designed as both 32-bit and 64-bit at the start, rather than a 32-bit standard with 64-bit extensions added. Usually a 32-bit example will apply to a 64-bit situation (emphasis on the "usually").

The Link Register and Count Register can both be used to store return addresses, although it is more common to use the LR instead of the CTR. A typical function jump will place the parameters to be passed to the function into registers, and then a Branch and Link (bl) instruction will store the address of the instruction after the branch instruction in the LR. This is so that the function being called can return to the address in the LR with a Branch to Link Register (blr) instruction, the most common manner to return from a function call.

(Illustration here)

It is possible to store an address to the Count Register and branch to that address with Branch Conditionally To Count Register (bcctr), but usually these branches are within a function call and do not return (emphasis on the "usually"). Another example of a possible use for the CTR is in a C++ class, where the object will branch to a method within the class through the CTR, but return to the code that called the class through the Link Register. This approach allows the Link Register contents to remain untouched and not have to be restored via the stack, which presents problems if the code attempting to return does not have the same stack frame. Any time a non-linear branch and return is needed (for example, two jumps but one return), the CTR may be involved.

Conditionals

Conditionals are generated during many PowerPC instructions. Some are implicit and some are explicit. A Compare Immediate (cmpi) will explicitly set a conditional based on whether the value in the register being compared is less than, greater than, or equal to the immediate (a 16-bit value). Some operations, such as addition and subtraction, have instruction forms that will implicitly set the condition register. These forms differ from the base instruction by only one or two bits, which specify whether the CR and XER should be updated, while the core opcode group is the same as the original.

There are actually _eight_ condition segments within the 32-bit Condition Register. Each one can be set individually, and the ABI typically defines which half-byte segments are volatile and non-volatile. In most cases, CR0 and CR1 are volatile, meaning that after a function call the software should not rely on the values having been unchanged. This is important to keep in mind because there is latency in many compare operations (and others). This latency is a normal by-product of pipelining. With PowerPC, if a branch is taken based on the results of a compare that had not completed and the branch is wrong, the results to that point are discarded and the other branch executes. Hinting which branch is likely to be taken helps the execution performance, but setting up compares several instructions before they are needed helps even more. Therefore, understanding when a CR segment was set is important during debugging and hand-rolling assembly code, and having a condition register segment written over by a function call can lead to undesired results.
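To make the eight segments concrete, here is a minimal C model of the Condition Register as a 32-bit value split into eight 4-bit fields. It assumes the PowerPC numbering in which CR0 is the most significant half-byte and each field holds the bits LT, GT, EQ, and SO in that order. The function names are hypothetical, and the sketch leaves the SO bit clear, whereas real hardware copies it from the XER.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values within one 4-bit CR field, most significant first. */
enum { CR_LT = 0x8, CR_GT = 0x4, CR_EQ = 0x2, CR_SO = 0x1 };

/* Hypothetical helper: read field CRn (n = 0..7) from a CR image. */
unsigned cr_field(uint32_t cr, unsigned n)
{
    assert(n < 8);
    return (cr >> (28u - 4u * n)) & 0xFu;
}

/* Hypothetical helper: model a signed compare of a against b that targets
   field CRn. Only that field changes; the other seven are left untouched. */
uint32_t cr_compare(uint32_t cr, unsigned n, int32_t a, int32_t b)
{
    assert(n < 8);
    uint32_t bits  = (a < b) ? CR_LT : (a > b) ? CR_GT : CR_EQ;
    unsigned shift = 28u - 4u * n;
    return (cr & ~(0xFu << shift)) | (bits << shift);
}
```

Because each compare names its target field, a result parked in one field survives later compares that target a different field, which is what makes setting up compares early practical.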

Scalar Operations

There are two kinds of operations regarding scalar values: those on values contained in registers and those on values hard coded into the instruction. For example, the operation

B + 3

involves one register and one hard coded scalar. The variable "B" would be contained in the register, while the scalar "3" is contained in the instruction. This has implications because, as mentioned earlier, instruction addresses must occur on 4 byte boundaries. This limits instruction lengths to 32 bits, and within an instruction such as addition of a scalar, the source register, destination register, and scalar must all be specified. For GPR instructions, each register requires 5 bits to specify (2^5 = 32), so an instruction that requires two registers requires that 10 bits be dedicated to specifying the registers. With bits being required for setting different conditions or operations, there are at most 16 bits left for specifying the scalar value. It is not possible to add a scalar greater than 65,535 to a value in a register, and usually the scalar is signed, thereby reducing the range to -32,768 through 32,767.
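The bit budget above can be checked by packing an add immediate instruction by hand. The sketch below assumes the standard layout of a 6-bit primary opcode (14 for addi), a 5-bit destination register, a 5-bit source register, and a 16-bit signed immediate. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if v fits in a signed 16-bit immediate field. */
bool fits_simm16(int32_t v)
{
    return v >= -32768 && v <= 32767;
}

/* Hypothetical helper: pack "addi rd, ra, simm" into one 32-bit word.
   Layout: opcode (6 bits) | rd (5 bits) | ra (5 bits) | immediate (16 bits). */
uint32_t encode_addi(unsigned rd, unsigned ra, int32_t simm)
{
    assert(rd < 32 && ra < 32);   /* 5 bits per register */
    assert(fits_simm16(simm));    /* 16 bits left for the scalar */
    return (14u << 26)
         | ((uint32_t)rd << 21)
         | ((uint32_t)ra << 16)
         | ((uint32_t)simm & 0xFFFFu);
}
```

Under these assumptions, encode_addi(3, 3, 3) yields 0x38630003: the 6 + 5 + 5 + 16 bits account for the whole 32-bit instruction word, which is why a larger scalar cannot be carried inside it.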

In order to use scalars larger than 16 bits, instructions that use three registers must be used, and an intermediate step of crafting the 32-bit scalar must be taken. The intermediate step of crafting the scalar involves loading the upper-most 16 bits into a register (through "Load Immediate," li), shifting the contents to the left ("Shift Left Word," slw, or its immediate form, slwi), and then loading the lower-most 16 bits into the same register through an OR operation ("OR Immediate," ori). The "Load Immediate Shifted" (lis) mnemonic combines the first two steps into one instruction. In the case of a 64-bit PowerPC CPU, which will still use 32-bit length instructions (XXX verify if there are exceptions), an additional two shifts must occur, either through operation on the same register or by crafting the upper 32 bits in one register and the lower 32 bits in another register and OR'ing them. There are a variety of instructions that can assist with shifting operations to the left or to the right.
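The three steps for crafting a 32-bit scalar can be modeled in C as follows. This is a sketch of the arithmetic only: each function stands in for one step, the names are hypothetical, and registers are treated as plain unsigned values, so the sign extension that a real load immediate performs is not modeled. The 64-bit variant follows the "two registers OR'ed together" route.

```c
#include <stdint.h>

/* Step 1: place the upper 16 bits of the constant in a register. */
uint32_t step_load_upper(uint16_t hi)
{
    return hi;
}

/* Step 2: shift the register contents left by 16 bits. */
uint32_t step_shift_left16(uint32_t reg)
{
    return reg << 16;
}

/* Step 3: OR the lower 16 bits into the same register. */
uint32_t step_or_lower(uint32_t reg, uint16_t lo)
{
    return reg | lo;
}

/* Craft a full 32-bit constant from its two 16-bit halves. */
uint32_t build_constant32(uint32_t value)
{
    uint16_t hi = (uint16_t)(value >> 16);
    uint16_t lo = (uint16_t)(value & 0xFFFFu);
    return step_or_lower(step_shift_left16(step_load_upper(hi)), lo);
}

/* Craft a 64-bit constant: build each 32-bit half separately, shift the
   upper half into place, and OR the two together. */
uint64_t build_constant64(uint64_t value)
{
    uint64_t upper = build_constant32((uint32_t)(value >> 32));
    uint64_t lower = build_constant32((uint32_t)value);
    return (upper << 32) | lower;
}
```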

(Illustration here)

Given the above, there are usually two forms of mathematical operations. One form will use two registers and a scalar (also referred to as using an "immediate value"), the other form will use three registers, with two registers forming the operands and the third being the destination register. Operations can almost always overwrite source registers with the resulting value. For example, in order to increment the value B by three using the add immediate (addi) form, the source and destination register would be the register containing the value B.

B = B + 3

translates to

addi r3, r3, 3

if the value of B was contained in r3.



Register usage and Function Calling Conventions

Calling a function requires adhering to an agreed-upon calling convention, which is defined as part of the Application Binary Interface (ABI). The calling convention determines how registers will be used. For most PowerPC ABIs (typically OSes using PEF, ELF, XCOFF, or Mach-O object file formats), parameters are passed to functions in registers starting at r3, and the function result is returned in r3. The common ABIs allow parameters in r3 through r10, and use r4 along with r3 for returning double words (64-bit long integers on a 32-bit PowerPC CPU). The order of parameters being placed into registers is from left to right in the function call, starting with the left value in r3.

a = sum(5, 4);

would result in 5 being placed into r3, 4 being placed into r4, and the value 9 being in r3 upon return from the sum function (which then has to be put into whatever is storing "a"). The determined use of registers is purely arbitrary with respect to the function of the registers themselves; there are no requirements architecturally for any given GPR, FPR, or VMX register to behave in a specific manner. It is simply a matter of designating registers to have certain functions and then adhering to the designations as a matter of agreement within the ABI. There are well developed reasons for using the conventions, though, so while there are differences between ABIs, in general they are very similar.

Registers used to pass parameters to function calls are called "volatile" and do not need to be restored by the called function before it returns. They are assumed to be destructively used. This in turn places the responsibility of saving the volatile register contents (if needed) on the caller. This has implications for both the calling function and the called function regarding creating storage areas. That storage area, known as the "stack frame," is in memory, and accessing memory is one of the slowest parts of CPU operation.

A common implementation of C++ uses r3 to store "this" when calling a method. Others may use r12 if "this" is inferred, depending on the scope. A null value in the register used to hold the object's address can indicate that the object was destroyed previously and is now being referenced, something sure to cause application failure.

Scalars or Addresses

As mentioned earlier, registers can contain data or addresses, so pointers to data can be passed in registers.

err = sum(5, 4, &a);

would place 5 in r3, 4 in r4, the 32-bit address of the variable a in r5, and return a result in r3, which the compiler would then put into the variable err. The actual sum of 5 and 4 would be stored at the memory address passed to the function. This allows a function to report status information in addition to its result.
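A C definition matching that call might look like the sketch below. The error codes and the null check are assumptions made for illustration; the article does not specify what sum reports through its return value.

```c
#include <stddef.h>

/* Hypothetical status codes; the values are chosen only for this sketch. */
enum { SUM_OK = 0, SUM_BAD_POINTER = -1 };

/* On a typical 32-bit PowerPC ABI, a would arrive in r3, b in r4, and the
   address in result would arrive in r5. The status goes back in r3, while
   the actual sum is written to memory through the pointer. */
int sum(int a, int b, int *result)
{
    if (result == NULL) {
        return SUM_BAD_POINTER;
    }
    *result = a + b;
    return SUM_OK;
}
```

Called as err = sum(5, 4, &a), this leaves 9 in a and SUM_OK in err.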

Local Variables

Typically a function call has one or more local variables. These variables can be in registers or in locations on the stack. As per our mantra "registers are good and the more the better," for the fastest manipulation of local variables they will be stored in registers. These registers typically begin at r13 and go through r31, and collectively are known as "non-volatile" registers (some ABIs start at r14). The specific range of non-volatile registers is determined by the ABI. Non-volatile registers must be restored before exiting a function, and this includes Condition Register segments. The use of non-volatile registers most often starts with r31 and moves towards r13 (or r14, if so limited).

For example, if the function sum takes two scalars and returns one scalar, it might be declared

int sum(int a, int b);

a and b would be local variables. This leads to two paths, which in more technical terms is also known as "optimization." A non-optimized approach would be to put a and b into r31 and r30, respectively, and store the result of adding them in r29, or alternatively load them onto the stack and retrieve them for manipulation (a very x86-ish approach and avoided at all costs on PowerPC). This requires three non-volatile registers, and in order to restore those registers the original values must be saved. This introduces the concept of prolog and epilog to called functions.

Due to the high number of registers on PowerPC CPUs, it is not always necessary to create a stack frame for a function. If the expected operation can be done within the number of volatile registers used to pass parameters and there are no nested function calls, there is no need to create a stack frame. If a stack frame is needed, all of the common ABIs use r1 as the stack pointer, and require stacks to be aligned on 16 byte boundaries. One element of the created stack frame will be the previous stack address. Stack frames vary in size based on the number of local variables and non-volatile registers needing to be saved, so it is not always possible to simply expect the previous stack address to occur every "x" number of bytes. Most implementations store the previous stack address at a specific location in the stack frame. For example, if a stack grows downward, a function creating a frame will save off the existing stack pointer to an address lower than the current value and save additional values back up towards the previous location.

(Illustration here)

By knowing the convention of where previous stack addresses are, it is possible to recreate a stack calling chain in the event of a terminal error in software or during a debugging session. Another common element to be stored in the stack is the current value in the Link Register. The instruction mnemonic mflr moves the current value in the LR to a specified register, and then the contents of the specified register are moved to a memory address, which is usually an offset from the stack pointer, with a stw ("Store Word"). The two values, the previous stack frame address and the LR, take up eight of the sixteen bytes the smallest stack frame can be. If more than two non-volatile registers are needed, additional space on the stack frame will need to be allocated, in sixteen byte increments.
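The sizing rule in the preceding paragraph can be written out as a small calculation. The sketch below assumes the simplified layout described here (4 bytes for the previous stack address, 4 bytes for the LR, and 4 bytes per saved 32-bit GPR) and computes only the minimum; the function and constant names are hypothetical.

```c
#include <stdint.h>

enum {
    FRAME_ALIGN   = 16, /* frames grow in sixteen byte increments */
    LINKAGE_BYTES = 8,  /* previous stack address plus saved LR */
    GPR_BYTES     = 4   /* one saved 32-bit non-volatile register */
};

/* Round n up to the next multiple of align (align must be a power of two). */
uint32_t round_up(uint32_t n, uint32_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Minimum frame size, in bytes, for a function that saves the given number
   of non-volatile GPRs under the simplified layout assumed above. */
uint32_t frame_size(uint32_t saved_gprs)
{
    return round_up(LINKAGE_BYTES + saved_gprs * GPR_BYTES, FRAME_ALIGN);
}
```

Two saved registers still fit in the smallest 16 byte frame, while a third pushes the minimum to 32 bytes. Real frames are often larger than this minimum, since ABIs reserve additional areas and a compiler is free to allocate more.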

In the example above, sum requires three local variables. This leads to a function prolog that has the following steps, in a least optimized approach:

1: Save r1 to an address 48 bytes less than the current value in r1
2: Decrement r1 by 48 bytes
3: Save LR to an address 4 bytes greater than the current value in r1
4: Save r31 to r1+8 bytes
5: Save r30 to r1+12 bytes
6: Save r29 to r1+16 bytes

In many unoptimized approaches, the values in r3 and r4 will then be moved to r31 and r30. In some really unoptimized approaches, the values in r3 and r4 will be first saved to locations in the stack and _then_ put into r31 and r30, but most modern compilers have left this approach behind. The next step will add r31 and r30, storing the results in r29. As per the calling convention, the value in r29 is then moved to r3 for return to the calling function.

The epilog of a function is the reverse of the prolog. The stored values of r29, r30, and r31 are restored from their respective locations on the stack, the LR is restored, the stack pointer is returned to its previous location, and the function executes a Branch to Link Register (blr) instruction. However, not all compilers have the blr as the last instruction in a function. In some cases, if a condition exists where the function may return prematurely, such as when encountering an error while checking the validity of passed parameters, the epilog may occur anywhere within the function, terminating with the blr, and subsequent code will branch back to the epilog location. This occasionally makes for more difficult debugging, but is not forbidden.
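The prolog, body, and epilog described above can be traced with a toy C model. This is a simulation for following the steps, not PowerPC code: the Cpu structure, the byte array standing in for memory, and all function names are assumptions made for this sketch, and r1 is treated as an offset into the array rather than a real address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { STACK_BYTES = 256, FRAME_BYTES = 48 };

typedef struct {
    uint32_t r1, r3, r4, r29, r30, r31, lr;
    uint8_t  stack[STACK_BYTES]; /* simulated memory, indexed by r1 */
} Cpu;

void store_word(Cpu *c, uint32_t addr, uint32_t value)
{
    assert(addr + 4 <= STACK_BYTES);
    memcpy(&c->stack[addr], &value, 4);
}

uint32_t load_word(const Cpu *c, uint32_t addr)
{
    uint32_t value;
    assert(addr + 4 <= STACK_BYTES);
    memcpy(&value, &c->stack[addr], 4);
    return value;
}

void prolog(Cpu *c)
{
    store_word(c, c->r1 - FRAME_BYTES, c->r1); /* 1: save previous r1 */
    c->r1 -= FRAME_BYTES;                      /* 2: decrement r1 */
    store_word(c, c->r1 + 4,  c->lr);          /* 3: save LR */
    store_word(c, c->r1 + 8,  c->r31);         /* 4: save r31 */
    store_word(c, c->r1 + 12, c->r30);         /* 5: save r30 */
    store_word(c, c->r1 + 16, c->r29);         /* 6: save r29 */
}

void epilog(Cpu *c)
{
    c->r29 = load_word(c, c->r1 + 16);
    c->r30 = load_word(c, c->r1 + 12);
    c->r31 = load_word(c, c->r1 + 8);
    c->lr  = load_word(c, c->r1 + 4);
    c->r1  = load_word(c, c->r1);  /* follow the saved previous address */
}

/* The least optimized sum: parameters arrive in r3 and r4, are copied to
   non-volatile registers, added, and the result is moved back to r3. */
uint32_t run_sum(Cpu *c)
{
    prolog(c);
    c->r31 = c->r3;
    c->r30 = c->r4;
    c->r29 = c->r31 + c->r30;
    c->r3  = c->r29;
    epilog(c);
    return c->r3;
}
```

After run_sum returns, r1, LR, r29, r30, and r31 hold whatever they held before the call, and only r3 carries the result, which is the contract the caller relies on.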

In considering if a stack frame needs to be created or not, knowing if a called function will call another function is important. If the called function needs control returned to it, a stack frame is necessary in order to save off the previous LR before branching, although it is possible to return via the CTR. If the first function is passed three parameters (r3, r4, and r5), but the function it calls only needs two (r3 and r4), this leaves r5 unprotected. As mentioned earlier, preserving volatile registers is the responsibility of the caller (a called function becomes a caller if it calls another function). The safe (and common) approach is to create a stack frame, save off a non-volatile register, and then move r5 into that non-volatile register. It will be preserved across the function call. This invokes a prolog and epilog to the first function, which typically adds a minimum of eight instructions if there are local variables. A later installment of Some Assembly Required will examine inlining versus unrolling.

Other GPRs

This leaves only a few other GPRs to discuss. r0, r2, r11, and r12 are typically used by the ABI for infrastructure issues. For example, under the Preferred Executable Format for MacOS 9.x, r11 and r12 were used for tracking calls outside the local code fragment. In ELF 32-bit, r11 and r12 are function linkage related. r0 is also a register whose function may vary depending on ABI. r2 is in most cases a restricted register as well. In the case of a global variable, a restricted register may store the address of the application's global stack and an offset from that location would be the global variable.

FP and VMX

Floating point and VMX also have volatile and non-volatile registers defined by the ABIs. However, since FPR and VMX registers are almost never involved in program execution or addressing, there usually are not infrastructure considerations, just volatile and non-volatile ones.



This concludes the initial installment of Some Assembly Required. It covered the basics of how GPRs are used, and what other registers may be used during the course of code execution. The next installment will examine some examples of PowerPC assembly through disassembly of some simple math operations.

Resources:

The PowerPC Programming Environments Manual (PEM) is a must-have document.

Searching on terms like "MPC7450UM.pdf" and "970FX User's Manual" will usually result in successfully locating the manufacturer's manual for the processor. Each processor will adhere to the PowerPC specification, but may have implementation specific details. The 970FX User's Manual covers the basics of the 970 family and the MPC7450 User's Manual covers several implementations of the G4 processors.

Lightsoft used to market a compiler for PowerPC on MacOS. They've halted development and made it open source. You can read Lightsoft's Beginner's Guide to PPC Assembler and download the compiler code.

The System V R4 Application Binary Interface PowerPC supplement defines register usage for several operating systems, either directly or through adoption.

Although MacOS 9.x has passed (unable to reach the knives in its back), the documentation written about it was quite instructional. Assembler for Macintosh with PowerPC gives the function calling convention for Preferred Executable Format (PEF) and contains easily readable discussions of several important aspects of PowerPC assembly.

Apple's MacOS X ABI Function Call Guide is quite explicit in explaining the location of key parameters to function calls in OS X.

Not mentioned above, here is a link to the ABI used on Motorola's e500 processor.

Hollis Blanchard did an introductory article on PowerPC assembly a few years ago with a different perspective.