Dialectronics Some Assembly Required, Part II

Dialectronics - the dialect of electronic communications

About Us

Our Mission

Minix

RISCV

PowerPC

Lua

XCOFF Bootloader

Old World Macs

Open Firmware

Words

Journeys

Home Repair

Top Level

Some Assembly Required, Part II

So far we haven't done any specific PowerPC instruction discussion. Since we've established the general outline for a function call prolog and epilog (See Resources, first installment), the obvious step is to actually show a function as simple as a s um of two numbers.

int add_stuff(int a, int b)
{
       int c;

       c = a + b;

       return c;        }

When compiled with gcc 3.3.3 on NetBSD without optimizations and run through objdump -d, the above code snippet results in:

01800888 <add_stuff>: 1800888:       94 21 ff d0    stwu   r1,-48(r1) 180088c:       93 e1 00 2c    stw    r31,44(r1) 1800890:       7c 3f 0b 78    mr     r31,r1 1800894:       90 7f 00 08    stw    r3,8(r31) 1800898:       90 9f 00 0c    stw    r4,12(r31) 180089c:       81 3f 00 08    lwz    r9,8(r31) 18008a0:       80 1f 00 0c    lwz    r0,12(r31) 18008a4:       7c 09 02 14    add    r0,r9,r0 18008a8:       90 1f 00 10    stw    r0,16(r31) 18008ac:       80 1f 00 10    lwz    r0,16(r31) 18008b0:       7c 03 03 78    mr     r3,r0 18008b4:       81 61 00 00    lwz    r11,0(r1) 18008b8:       83 eb ff fc    lwz    r31,-4(r11) 18008bc:       7d 61 5b 78    mr     r1,r11 18008c0:       4e 80 00 20    blr 18008c4:       00 01 81 5c    .long 0x1815c

There's a lot of other object code in the application (also known as "junk") that isn't pertinent to the discussion and so has been omitted for clarity. The line 1800888 introduces the first of several variants for storing values into memory.

stwu   r1,-48(r1)

The expansion of the instruction stwu is "Store Word with Update." Previously we established that the first operation upon entering a function call would be to save off the stack to a new location. By inference, it can be seen that the instruction does just that - saves the value in r1 to an address 48 bytes lower than the existing address in r1. The "update" feature then stores the new address in r1. Here is the definition from PEM:

(should be a graphic) stwu stwu Store Word with Update stwu rS,d(rA) [POWER mnemonic: stu] EA <- (rA) + EXTS(d) MEM(EA, 4) <- rS[3263] rA <- EA EA is the sum (rA) + d. The contents of the low-order 32 bits of rS are stored into the word in memory addressed by EA. EA is placed into rA. <-- this is what tells us what "update" means If rA = 0, the instruction form is invalid. Other registers altered: € None

"EA" stands for "Effective Address," which can be as much as 80 bits in length but for this article it will be 32 bits on 32 bit processors and 64 bits on 64 bit processors. "EXTS" refers to "Extend Signed," meaning that the offset wil be extend to a fu ll 32 bits. The -48(r1) component of the instruction is indicating a memory location 48 bytes lower than the value currently in r1. The very next instruction

stw    r31,44(r1)

shows storing to a location 44 bytes greater than the value in r1. In the case of stwu and stw, the integer outside the parenthesis refers to the offset from the memory address in the register listed. Later, when loading a value from memory, instruction s will take the same approach to specify what memory address to load from. These offsets are limited to signed 16 bit values. This is due to the construction of opcodes.

94 21 ff d0

is the opcode for the stwu instruction. PEM lists the opcode bit values:

Bit: Value: 0-5 37 6-10 rS 11-15 rA 16-31 d

The least significant 16 bits of the opcode are 0xffd0, or -48 in decimal. Deciphering the remaining bits is left to the reader but will be explored in greater detail in a later article. Just remember that it takes five bits to identify one of 32 possible values.

By listing r1 as both the source register for the value to be stored and the destination register for receiving the updated EA, r1 now points to the bottom of the stack frame. The next instruction, line 180088c, which was shown above as the second form o f storing memory, saves the contents of a non-volatile register to a location in the new stack frame, but does not "update" the register.

The next line (1800890)

mr     r31,r1

introduces a very common tool in PowerPC, the "simplified mnemonic." There is no instruction "mr." The expansion in general means "Move Register," but there is no actual "mr" instruction. Instead, "mr" is a mnemonic for or. The left opera nd is the destination register (where the result will be stored), and the right operand is the source register that will be OR'd with itself. Here is PEM (8-191):

(should be a graphic) orx OR or rA,rS,rB (Rc = 0) or. rA,rS,rB (Rc = 1) rA <- (rS) | (rB) The contents of rS are ORed with the contents of rB and the result is placed into rA. The simplified mnemonic mr (shown below) demonstrates the use of the or instruction to move register contents. Other registers altered: € Condition Register (CR0 field): Affected: LT, GT, EQ, SO (if Rc = 1) Simplified mnemonics: mr rA, rS equivalent to or rA, rS, rS

As per PEM, or can set a value in a condition register segment by specifying the form with the Rc bit set. Many PowerPC instructions have this option.

The instruction can be seen as moving r1 into r31. This is an important aspect for several reasons. It shows the use of r31, a non-volatile register, and explains why it was necessary at all to save off the contents of the non-volatile register (it was going to be used). However, it also shows the inefficiency of not optimizing code. The next two instructions highlight this:

stw    r3,8(r31) stw    r4,12(r31)

r3 and r4 are the parameters passed to the function, as per the discussion earlier on ABIs. The instructions are saving those values to the stack, because that is what is in r31. Note the difference in storing locations. The non-volatile register was s tored to 44 bytes from the stack bottom; the parameters are stored 8 and 12 bytes from the stack bottom. Another sign that this code is not compiled with optimizations are the next two lines (180089c):

lwz    r9,8(r31) lwz    r0,12(r31)

After saving the parameters to the stack, the two instructions above load them right back, but into r9 and r0, respectively. The Load Word Zero (PEM 8-145) loads a 32 bit value from memory (the Effective Address is constructed from a register and an offs et) and zero's the upper 32 bits on a 64 bit processor. There are forms of Load that do half words (two bytes) and bytes that will zero any non-used bits.

The core of the function call is actually a single line:

add    r0,r9,r0

The instruction specifies r9 and r0 as the operands and also specifies r0 as the destination register. "add" also has a "." form that allows the CR to be set, as well as forms that will record in the XER register if overflow occurs (the resu lting value is greater than 32 bits). There are also two variants on the add, one which uses registers and one which uses 16 bit integers, as discussed earlier.

Of the remaining code, only the line (18008b0):

mr     r3,r0

is particularly interesting. This instruction takes the summed value and places it into r3 for return to the callee, as per the ABI. Much of the remaining code is the epilog that restores the non-volatile register and restores the original stack locatio n. I don't know what the last instruction does. Looks like junk. It's not a valid opcode. It might be the placeholder for the next function name's symbol.

The last meaningful instruction in the function is

18008c0:       4e 80 00 20    blr

This is a "Branch To Link Register" instruction, which returns the processor to the callee function. The callee function will have done something (in object code) similar to

1800834:       38 60 00 01    li     r3,1 1800838:       38 80 00 02    li     r4,2 180083c:       48 00 00 29    bl     1800864

to actually call the function. The above snippet places the scalar "1" into r3 and the scalar "2" into r4, and then calls "bl" (Branch and Link). The bl instruction will place an address four bytes after the current instruction into the Lin k Register. When the called function branches to the LR, it returns execution to the next instruction after the function call.

Optimization by the Compiler

How much is the code improved when compiled with even basic optimization (-O1)?

01800864 <add_stuff>: 1800864:       94 21 ff f0    stwu   r1,-16(r1) 1800868:       7c 63 22 14    add    r3,r3,r4 180086c:       38 21 00 10    addi   r1,r1,16 1800870:       4e 80 00 20    blr 1800874:       00 01 81 5c    .long 0x1815c

The code does create a stack frame on a 16 byte boundary, but never uses it. The addi instruction adds 16 bytes back to the address in r1, which was decremented 16 bytes when the stack frame was created. The add operation is done in one ins truction, with no intermediate storage of parameters. The add does not require the use of non-volatile registers and performs its operation with only the volatile registers used to pass the parameters.

Optimization by the Programmer

There are several aspects to consider when deciding to optimize and when to inline. In general, optimization will create much faster executable code, but can be more difficult to debug in object code. There will _always_ be times where source code debug gers will not help, and when this occurs having clean object code is important. There are also conditions where code that is optimized leads to different results (in particular when some assembly is inlined in C code), so compiling a version that is not optimized and one that is optimized can help pinpoint problems. It should also be noted that compilers may schedule instructions differently depending on the optimization level desired. This is in order to maximize pipelining, but reading such code can give the impression that many steps are being taken in a non-logical manner. An old adage for writing code, in a high level language or a low level language, is to first make the code work, then make the code fast.

Inlining functions is controversial, at best. Some feel inlining is a crime against nature, while others think it is good programming. As fast as caches are, registers are even faster and on a CPU such as the PowerPC where register availability is rarel y an issue (meaning more local variables can be stored in registers instead of retrieved from the stack, which requires a memory access), there are legitimate reasons to inline. When done responsibly, inlining is an effective way to reduce execution time s significantly, but when done too much can lead to code sizes that do not easily fit within today's CPU caches.

The balancing point with inlining focuses on understanding the costs of a prolog and epilog on the PowerPC. In the case of the example of adding two numbers above, even when optimized the prolog and epilog of the function are larger than the actual core of the function (two instructions versus one). If local variables are needed, the prolog and epilog get larger, and use of VMX or FP registers can drive this cost even higher. Each instruction in the prolog and epilog is a cycle (or more) in which the C PU is not executing core functionality. If a small function is not inlined, that prolog and epilog have to be executed each time the function is called. If a small function is rarely called, Amdahl's Law indicates that performance gains or losses will b e little noticed. If the small function is called regularly, that prolog and epilog may have noticeable affects on performance that inlining may eliminate.

On the other hand, not inlining reduces code size overall, and this may affect how much of the code can be stored in the caches. Code that is called frequently remains resident in caches more often. To maximize this, functions that are not inlined are m ore exposed for remaining resident. Code that is serialized (inlined instead of function calls) has to spill least-recently-used entries when execution reaches the end of cache boundaries. The overhead of cache misses may eclipse the benefits of elimina ting prologs and epilogs by inlining. Understanding the specifics of the architecture that the code is executing on can help a programmer make intelligent decisions about performance optimizations.

Understanding the specifics of the architecture that the code is executing on can also help a programmer isolate limitations to performance that have nothing to do with the architecture itself. Consider the following example of some code in a commercial operating system that used to use PowerPC processors:

   29854:       2c 03 00 00    cmpwi   r3,0    29858:       38 60 00 04    li     r3,4    2985c:       4d 82 00 20    beqlr    29860:       38 60 00 00    li     r3,0    29864:       3d 20 00 32    lis    r9,50    29868:       90 89 fa 18    stw    r4,-1512(r9)    2986c:       4e 80 00 20    blr    29870:       38 60 00 05    li     r3,5    29874:       4e 80 00 20    blr    29878:       38 60 00 05    li     r3,5    2987c:       4e 80 00 20    blr    29880:       3c 60 00 32    lis    r3,50    29884:       38 63 fa 00    addi   r3,r3,-1536    29888:       4e 80 00 20    blr    2988c:       3c 60 00 32    lis    r3,50    29890:       38 63 fa 00    addi   r3,r3,-1536    29894:       4e 80 00 20    blr    29898:       3c 60 00 32    lis    r3,50    2989c:       38 63 fa 00    addi   r3,r3,-1536    298a0:       4e 80 00 20    blr

In the above snippet, six (six!) functions exist, two of which return the scalar 5, three return 0x0032fa00, and one which returns either 4 or 0, but stores the value in r4 to a memory address if the passed value in r3 is not zero. The first function can be legitimately argued as not inlinable. However, the two functions that return 5 certainly could have been implemented as constants. The three functions that return 0x0032fa00 could have gone either way. While each function does not create a stack frame and has no prolog or epilog, each function call does require a Branch and Link (bl) to call and a Branch to Link Register (blr) to return, even though in two of the five functions the actual function call itself was only a single instruction. Additionally, the result is always in r3; this requires that the register contents be moved into the location of the variable that is being set to the result of the function call. If the value was compiled as a constant, the compiler would simply move the scalar into the location with lis (mnemonic for addis, Add Immediate Shifted). There is also the preservation of values in volatile registers across function calls, which is the responsibility of the calling function, not the callee. By inference, the above approach requires a minimum of three additional instructions to carry out an operation properly done in one or two instructions. If these functions are rarely called, perhaps it is a non-issue. However, if these functions are called repeatedly, it is rather disingenuous to blame processor clock speed for poor performance and suggests further analysis is warranted regarding rumors that the operating system was intentionally distributed with minimal compiler optimizations and programming design in order to justify a preconceived goal.

It should be noted that the above code was taken from the file on disk, not the linked image in memory. Also of note, the value returned in three of the functions (0x0032fa00) is remarkably close to the address to which the first function stores a passed value. This suggests the address being returned is in a range of global variables that might change across kernel builds. However, this approach is one of design convenience and not one optimized for performance.

This concludes the second installment of Some Assembly Required. It looked at some examples of disassembling C code for examination of prologs, epilogs and optimizations. The next installment of Some Assembly Required will continue with math examples by writing some handrolled assembly code to calculate two summation identities, one being the simple case of either the lower or upper bound being zero and the other being the more general case of non-zero bounds.