Old World Macs
Some Assembly Required, Part II
So far we haven't done any specific PowerPC instruction discussion. Since we've established the general outline for a function call prolog and epilog (See Resources, first installment), the obvious step is to actually show a function as simple as a s
um of two numbers.
int add_stuff(int a, int b)
c = a + b;
When compiled with gcc 3.3.3 on NetBSD without optimizations and run through objdump -d, the above code snippet results in:
1800888: 94 21 ff d0 stwu r1,-48(r1)
180088c: 93 e1 00 2c stw r31,44(r1)
1800890: 7c 3f 0b 78 mr r31,r1
1800894: 90 7f 00 08 stw r3,8(r31)
1800898: 90 9f 00 0c stw r4,12(r31)
180089c: 81 3f 00 08 lwz r9,8(r31)
18008a0: 80 1f 00 0c lwz r0,12(r31)
18008a4: 7c 09 02 14 add r0,r9,r0
18008a8: 90 1f 00 10 stw r0,16(r31)
18008ac: 80 1f 00 10 lwz r0,16(r31)
18008b0: 7c 03 03 78 mr r3,r0
18008b4: 81 61 00 00 lwz r11,0(r1)
18008b8: 83 eb ff fc lwz r31,-4(r11)
18008bc: 7d 61 5b 78 mr r1,r11
18008c0: 4e 80 00 20 blr
18008c4: 00 01 81 5c .long 0x1815c
There's a lot of other object code in the application (also known as "junk") that isn't pertinent to the discussion and so has been omitted for clarity. The line 1800888 introduces the first of several variants for storing values into memory.
The expansion of the instruction stwu is "Store Word with Update." Previously we established that the first operation upon entering a function call would be to save off the stack to a new location. By inference, it can be seen that the instruction does
just that - saves the value in r1 to an address 48 bytes lower than the existing address in r1. The "update" feature then stores the new address in r1. Here is the definition from PEM:
(should be a graphic)
stwu stwu Store Word with Update
stwu rS,d(rA) [POWER mnemonic: stu]
EA <- (rA) + EXTS(d) MEM(EA, 4) <- rS rA <- EA
EA is the sum (rA) + d. The contents of the low-order 32 bits of rS are stored into the word in memory addressed by EA.
EA is placed into rA. <-- this is what tells us what "update" means
If rA = 0, the instruction form is invalid.
Other registers altered: € None
"EA" stands for "Effective Address," which can be as much as 80 bits in length but for this article it will be 32 bits on 32 bit processors and 64 bits on 64 bit processors. "EXTS" refers to "Extend Signed," meaning that the offset wil be extend to a fu
ll 32 bits. The
-48(r1) component of the instruction is indicating a memory location 48 bytes lower than the value currently in r1. The very next instruction
shows storing to a location 44 bytes greater than the value in r1. In the case of stwu and stw, the integer outside the parenthesis refers to the offset from the memory address in the register listed. Later, when loading a value from memory, instruction
s will take the same approach to specify what memory address to load from. These offsets are limited to signed 16 bit values. This is due to the construction of opcodes.
94 21 ff d0
is the opcode for the stwu instruction. PEM lists the opcode bit values:
Bit: Value: 0-5 37 6-10 rS 11-15 rA 16-31 d
The least significant 16 bits of the opcode are
0xffd0, or -48 in decimal. Deciphering the remaining bits is left to the reader but will be explored in greater detail in a later article. Just remember that it takes five bits to identify one
of 32 possible values.
By listing r1 as both the source register for the value to be stored and the destination register for receiving the updated EA, r1 now points to the bottom of the stack frame. The next instruction, line 180088c, which was shown above as the second form o
f storing memory, saves the contents of a non-volatile register to a location in the new stack frame, but does not "update" the register.
The next line (1800890)
introduces a very common tool in PowerPC, the "simplified mnemonic." There is no instruction "mr." The expansion in general means "Move Register," but there is no actual "mr" instruction. Instead, "mr" is a mnemonic for
or. The left opera
nd is the destination register (where the result will be stored), and the right operand is the source register that will be OR'd with itself. Here is PEM (8-191):
(should be a graphic)
or rA,rS,rB (Rc = 0)
or. rA,rS,rB (Rc = 1)
rA <- (rS) | (rB)
The contents of rS are ORed with the contents of rB and the result is placed into rA.
The simplified mnemonic mr (shown below) demonstrates the use of the or instruction to move register contents.
Other registers altered:
€ Condition Register (CR0 field): Affected: LT, GT, EQ, SO (if Rc = 1)
mr rA, rS equivalent to or rA, rS, rS
As per PEM,
or can set a value in a condition register segment by specifying the form with the Rc bit set. Many PowerPC instructions have this option.
The instruction can be seen as moving r1 into r31. This is an important aspect for several reasons. It shows the use of r31, a non-volatile register, and explains why it was necessary at all to save off the contents of the non-volatile register (it was
going to be used). However, it also shows the inefficiency of not optimizing code. The next two instructions highlight this:
stw r3,8(r31) stw r4,12(r31)
r3 and r4 are the parameters passed to the function, as per the discussion earlier on ABIs. The instructions are saving those values to the stack, because that is what is in r31. Note the difference in storing locations. The non-volatile register was s
tored to 44 bytes from the stack bottom; the parameters are stored 8 and 12 bytes from the stack bottom. Another sign that this code is not compiled with optimizations are the next two lines (180089c):
After saving the parameters to the stack, the two instructions above load them right back, but into r9 and r0, respectively. The Load Word Zero (PEM 8-145) loads a 32 bit value from memory (the Effective Address is constructed from a register and an offs
et) and zero's the upper 32 bits on a 64 bit processor. There are forms of Load that do half words (two bytes) and bytes that will zero any non-used bits.
The core of the function call is actually a single line:
The instruction specifies r9 and r0 as the operands and also specifies r0 as the destination register.
"add" also has a "." form that allows the CR to be set, as well as forms that will record in the XER register if overflow occurs (the resu
lting value is greater than 32 bits). There are also two variants on the add, one which uses registers and one which uses 16 bit integers, as discussed earlier.
Of the remaining code, only the line (18008b0):
is particularly interesting. This instruction takes the summed value and places it into r3 for return to the callee, as per the ABI. Much of the remaining code is the epilog that restores the non-volatile register and restores the original stack locatio
n. I don't know what the last instruction does. Looks like junk. It's not a valid opcode. It might be the placeholder for the next function name's symbol.
The last meaningful instruction in the function is
18008c0: 4e 80 00 20 blr
This is a "Branch To Link Register" instruction, which returns the processor to the callee function. The callee function will have done something (in object code) similar to
1800834: 38 60 00 01 li r3,1
1800838: 38 80 00 02 li r4,2
180083c: 48 00 00 29 bl 1800864
to actually call the function. The above snippet places the scalar "1" into r3 and the scalar "2" into r4, and then calls "bl" (Branch and Link). The
bl instruction will place an address four bytes after the current instruction into the Lin
k Register. When the called function branches to the LR, it returns execution to the next instruction after the function call.
Optimization by the Compiler
How much is the code improved when compiled with even basic optimization (-O1)?
1800864: 94 21 ff f0 stwu r1,-16(r1)
1800868: 7c 63 22 14 add r3,r3,r4
180086c: 38 21 00 10 addi r1,r1,16
1800870: 4e 80 00 20 blr
1800874: 00 01 81 5c .long 0x1815c
The code does create a stack frame on a 16 byte boundary, but never uses it. The
addi instruction adds 16 bytes back to the address in r1, which was decremented 16 bytes when the stack frame was created. The add operation is done in one ins
truction, with no intermediate storage of parameters. The add does not require the use of non-volatile registers and performs its operation with only the volatile registers used to pass the parameters.
Optimization by the Programmer
There are several aspects to consider when deciding to optimize and when to inline. In general, optimization will create much faster executable code, but can be more difficult to debug in object code. There will _always_ be times where source code debug
gers will not help, and when this occurs having clean object code is important. There are also conditions where code that is optimized leads to different results (in particular when some assembly is inlined in C code), so compiling a version that is not
optimized and one that is optimized can help pinpoint problems. It should also be noted that compilers may schedule instructions differently depending on the optimization level desired. This is in order to maximize pipelining, but reading such code can
give the impression that many steps are being taken in a non-logical manner. An old adage for writing code, in a high level language or a low level language, is to first make the code work, then make the code fast.
Inlining functions is controversial, at best. Some feel inlining is a crime against nature, while others think it is good programming. As fast as caches are, registers are even faster and on a CPU such as the PowerPC where register availability is rarel
y an issue (meaning more local variables can be stored in registers instead of retrieved from the stack, which requires a memory access), there are legitimate reasons to inline. When done responsibly, inlining is an effective way to reduce execution time
s significantly, but when done too much can lead to code sizes that do not easily fit within today's CPU caches.
The balancing point with inlining focuses on understanding the costs of a prolog and epilog on the PowerPC. In the case of the example of adding two numbers above, even when optimized the prolog and epilog of the function are larger than the actual core
of the function (two instructions versus one). If local variables are needed, the prolog and epilog get larger, and use of VMX or FP registers can drive this cost even higher. Each instruction in the prolog and epilog is a cycle (or more) in which the C
PU is not executing core functionality. If a small function is not inlined, that prolog and epilog have to be executed each time the function is called. If a small function is rarely called, Amdahl's Law indicates that performance gains or losses will b
e little noticed. If the small function is called regularly, that prolog and epilog may have noticeable affects on performance that inlining may eliminate.
On the other hand, not inlining reduces code size overall, and this may affect how much of the code can be stored in the caches. Code that is called frequently remains resident in caches more often. To maximize this, functions that are not inlined are m
ore exposed for remaining resident. Code that is serialized (inlined instead of function calls) has to spill least-recently-used entries when execution reaches the end of cache boundaries. The overhead of cache misses may eclipse the benefits of elimina
ting prologs and epilogs by inlining. Understanding the specifics of the architecture that the code is executing on can help a programmer make intelligent decisions about performance optimizations.
Understanding the specifics of the architecture that the code is executing on can also help a programmer isolate limitations to performance that have nothing to do with the architecture itself. Consider the following example of some code in a commercial
operating system that used to use PowerPC processors:
29854: 2c 03 00 00 cmpwi r3,0
29858: 38 60 00 04 li r3,4
2985c: 4d 82 00 20 beqlr
29860: 38 60 00 00 li r3,0
29864: 3d 20 00 32 lis r9,50
29868: 90 89 fa 18 stw r4,-1512(r9)
2986c: 4e 80 00 20 blr
29870: 38 60 00 05 li r3,5
29874: 4e 80 00 20 blr
29878: 38 60 00 05 li r3,5
2987c: 4e 80 00 20 blr
29880: 3c 60 00 32 lis r3,50
29884: 38 63 fa 00 addi r3,r3,-1536
29888: 4e 80 00 20 blr
2988c: 3c 60 00 32 lis r3,50
29890: 38 63 fa 00 addi r3,r3,-1536
29894: 4e 80 00 20 blr
29898: 3c 60 00 32 lis r3,50
2989c: 38 63 fa 00 addi r3,r3,-1536
298a0: 4e 80 00 20 blr
In the above snippet, six (six!) functions exist, two of which return the
scalar 5, three return 0x0032fa00, and one which returns either 4 or 0, but
stores the value in r4 to a memory address if the passed value in r3 is
not zero. The first function can be legitimately argued as not
inlinable. However, the two functions that return 5 certainly could have
been implemented as constants. The three functions that return
0x0032fa00 could have gone either way. While each function does not
create a stack frame and has no prolog or epilog, each function call
does require a Branch and Link (bl) to call and a Branch to Link
Register (blr) to return, even though in two of the five functions the
actual function call itself was only a single instruction. Additionally,
the result is always in r3; this requires that the register contents be
moved into the location of the variable that is being set to the result
of the function call. If the value was compiled as a constant, the
compiler would simply move the scalar into the location with lis
(mnemonic for addis, Add Immediate Shifted). There is also the
preservation of values in volatile registers across function calls,
which is the responsibility of
the calling function, not the callee. By inference, the above approach
requires a minimum of three additional instructions to carry out an
operation properly done in one or two instructions. If these functions
are rarely called, perhaps it is a non-issue. However, if these
functions are called repeatedly, it is rather disingenuous to blame
processor clock speed for poor performance and suggests further analysis
is warranted regarding rumors that the operating system was
intentionally distributed with minimal compiler optimizations and
programming design in order to justify a preconceived goal.
It should be noted that the above code was taken from the file on disk,
not the linked image in memory. Also of note, the value returned in
three of the functions (0x0032fa00) is remarkably close to the address
to which the first function stores a passed value. This suggests the
address being returned is in a range of global variables that might
change across kernel builds. However, this approach is one of design
convenience and not one optimized for performance.
This concludes the second installment of Some Assembly Required. It
looked at some examples of disassembling C code for examination of
prologs, epilogs and optimizations. The next installment of Some
Assembly Required will continue with math examples by writing some
handrolled assembly code to calculate two summation identities, one
being the simple case of either the lower or upper bound being zero and
the other being the more general case of non-zero bounds.