Optimising for code size might not do what you expect - a GCC and PowerPC example

2015-02-09

Getting tracing libraries to run on a new system is hard, but it's something that we regularly have to do here at Rapita as part of our support for timing analysis on diverse platforms. In the past few weeks I've been experimenting with creating a tracing library for Freescale's P4080DS development board, which comes fully loaded with an 8-core P4080 SoC and plenty of trace options, including Aurora-based NEXUS tracing, multiple Ethernet links and lots of DRAM.

However, while doing this, I've come across some intricacies in GCC's PowerPC implementation that might make for interesting reading. To understand how I got to this point, let's have a look at the process I'm following to implement some tracing code on a new platform:

1. How might I get to bare metal?

First things first, if we want to start playing directly with the hardware to see what we can do with a trace, we're going to want to get as close to bare metal as we can. There were three possible options here:

Write a Linux kernel module.
Upload and run a binary directly with a debugger (a Lauterbach Power Trace II in our case).
Build and run a u-boot 'standalone' binary.

It seemed to me that the most obvious route was to build a standalone u-boot binary for accessing the bare metal level of the machine, as we then get a bunch of niceties from u-boot's API, such as printf, getc and malloc.

2. Have I got an existing example to build on?

Now that we've decided on the method we want to use, we need to find somewhere to start. Thankfully, the u-boot developers have provided some examples of how to build and run standalone applications as part of their distribution. If we look at U-Boot Standalone Applications, Denx have given us some clear instructions on how to build a classic 'Hello World' example which runs standalone. In the (then current) tutorial, we're told that this example is to be loaded at 0x40000 and then executed from 0x40004, four bytes (or one instruction) ahead in the file. This will be important later. On a first try, however, this example ran smoothly.

3. Can I pull apart and use the example to build my own isolated code?

So here's the final step we'll be taking towards writing our bare metal tracing code, where we pull out the relevant libraries and examples we need from u-boot to create something new. In this instance, I built an even smaller test application which builds against the u-boot source by overriding the 'SUBDIR_EXAMPLES' variable in the build system and loaded it in the same way as the example.

That's when things started to go wrong ...

In my example, I simply pulled out some of the extra printing done by the 'Hello World' code, trimming down what the code did slightly.

The original code:

int i;

/* Print the ABI version */
app_startup(argv);
printf ("Example expects ABI version %d\n", XF_VERSION);
printf ("Actual U-Boot ABI version %d\n", (int)get_version());

printf ("Hello World\n");

printf ("argc = %d\n", argc);

for (i=0; i<=argc; ++i) {
    printf ("argv[%d] = \"%s\"\n",
            i,
            argv[i] ? argv[i] : "<NULL>");
}

printf ("Hit any key to exit ... ");
while (!tstc())
    ;
/* consume input */
(void) getc();

printf ("\n\n");
return (0);

My new example:

app_startup(argv);
printf ("Hello World\n");
printf ("Hit any key to exit ... ");
while (!tstc())
    ;
(void) getc();

So not really a huge change, but even small changes to the input code can make significant changes to the binary we eventually get, as we're about to see.

Running this code as I did the original example produced odd results, with the program executing as normal and then hanging the processor when it attempted to return. I then tried running my code from its base load address, and everything went fine. At this point, I began to dig into the disassembly to see what was going wrong, and I found one fundamental difference. These two pieces of code produce radically different assembly for restoring register state when returning from a function, with my small example producing the following:

4003c:	80 01 00 14 	lwz     r0,20(r1)
40040:	38 60 00 00 	li      r3,0
40044:	38 21 00 10 	addi    r1,r1,16
40048:	7c 08 03 a6 	mtlr    r0
4004c:	4e 80 00 20 	blr

This is relatively standard PowerPC assembly: we load the saved return address from the caller's stack frame into r0, set the return value in r3 to zero, restore the previous stack pointer by popping our 16-byte frame, move the return address into the link register, and then branch to the address held in the link register. (For a full reference guide to assembler instructions for PowerPC, check the Freescale instruction set documentation.)

However, u-boot's original example did something different:

400cc:	38 60 00 00 	li      r3,0
400d0:	48 00 02 3c 	b       4030c <_restgpr_27_x>

Now what's going on here? Why don't we simply branch back to the link register? Looking at the code in (and surrounding) _restgpr_27_x gives us a clue as to what's occurring:

0004030c <_restgpr_27_x>:
4030c:	83 6b ff ec 	lwz     r27,-20(r11)
00040310 <_restgpr_28_x>:
40310:	83 8b ff f0 	lwz     r28,-16(r11)
00040314 <_restgpr_29_x>:
40314:	83 ab ff f4 	lwz     r29,-12(r11)
00040318 <_restgpr_30_x>:
40318:	83 cb ff f8 	lwz     r30,-8(r11)
0004031c <_restgpr_31_x>:
4031c:	80 0b 00 04 	lwz     r0,4(r11)
40320:	83 eb ff fc 	lwz     r31,-4(r11)
40324:	7c 08 03 a6 	mtlr    r0
40328:	7d 61 5b 78 	mr      r1,r11
4032c:	4e 80 00 20 	blr

This code is interesting, in that it restores a number of general purpose registers from the stack, falling through each _restgpr_ entry point into the next, with the preceding code jumping to the relevant symbol based on how many general purpose registers it used. Once all of the registers have been restored, the routine restores the link register and stack pointer and returns as normal. This tells us why the original example is able to tolerate being run from 0x40004 where my example cannot.

The most interesting question here is: why is this happening? This is significantly harder to answer. Both of these pieces of code were built using the same compiler options, and while experimenting, I discovered that this optimisation is the result of telling GCC to optimise for code size (both builds were using -Os; switching to -O2 causes both examples to use standard return code). However, even when using -Os, not all input code causes this style of return code (known as 'out-of-line restore functions') to be generated.

As it turns out, after I posted this information to the Denx mailing list, it became clear that at some point in the past the compiler used to generate and write the examples was producing different start addresses, which led to the tutorial advocating starting from a four byte offset from the base address. My compiler didn't have this problem, and fully expected the code to execute from the base address.

There are two main points I feel we can take from this:

1. When you're working this close to the machine, tiny changes have big ramifications. Missing a single instruction in my small example made the difference between a correct return and a hanging processor.

2. Compilers often do strange things you don't expect, and what might seem like a 'small change' to you may have huge ramifications for your generated code, so be careful and assume nothing!