Even after enabling all the compiler optimization options, such as -O3 and -pm, your code may still miss its performance target. There are many possible reasons, for example poor ordering of code and data, or too little information for the compiler to reorganize the code and take advantage of the pipeline depth. The following proven methods (the list is not exhaustive) can help you achieve better performance. For more information, download the optimization guides for your processor from the TI website (http://www.ti.com/).
· Memory Alias Disambiguation
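What happens when the following function is compiled? The comments show the load, add, and store that each statement corresponds to.
void my_func (int *ptr_in, int *ptr_out)
{
    int tmp = *ptr_in++;   /* LDW *ptr_in++, A0 */
    tmp = tmp + 4;         /* ADD A0, 4, A1 */
    *ptr_out++ = tmp;      /* STW A1, *ptr_out++ */
}
The compiler cannot tell whether ptr_in and ptr_out point to the same memory location (memory aliasing). It must assume that they might, so it keeps the load and the store in their original order, which limits how freely it can reorder and schedule the code. To rule out aliasing, qualify the pointer with the restrict keyword, which promises the compiler that the memory it points to is accessed only through that pointer. Note that restrict goes after the *, because it qualifies the pointer itself. The above may be rewritten as:
void my_func (int * restrict ptr_in, int *ptr_out)
{
    int tmp = *ptr_in++;   /* LDW *ptr_in++, A0 */
    tmp = tmp + 4;         /* ADD A0, 4, A1 */
    *ptr_out++ = tmp;      /* STW A1, *ptr_out++ */
}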
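The benefit of restrict shows most clearly in a loop. The following is a minimal sketch, assuming a C99 compiler; the function name add_offset and its parameters are hypothetical and not part of the original example.

```c
#include <stddef.h>

/* Hypothetical helper: writes in[i] + 4 to out[i] for n elements.
 * The restrict qualifiers promise that in[] and out[] never overlap,
 * so the compiler is free to reorder or overlap the loads and stores. */
void add_offset(const int * restrict in, int * restrict out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[i] = in[i] + 4;
    }
}
```

Calling add_offset with overlapping buffers would break the promise, and the result would then be undefined.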
· Use Pragmas
Pragmas are compiler directives that give the compiler extra information about the code that follows. Several pragmas are available, for example:
o UNROLL (number of times to unroll)
#pragma UNROLL (2)
for(i = 0; i < Count; i++)
{
    sum += a[i] * x[i];
}
§ Tells the compiler to unroll the for() loop by a factor of 2
§ The compiler will generate extra code to handle the case where Count is odd
§ The #pragma must come right before the for() loop
§ UNROLL(1) tells the compiler not to unroll a loop
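To make the effect of unrolling by 2 concrete, here is a hand-written sketch of roughly what the transformation amounts to. It is an illustration only: the function name dot_unrolled2 is hypothetical, and the code the compiler actually generates will differ.

```c
/* Hand-unrolled (factor 2) version of the dot-product loop.
 * dot_unrolled2 is a hypothetical name used for illustration. */
int dot_unrolled2(const short *a, const short *x, int count)
{
    int sum = 0;
    int i;

    /* Main body: two iterations of the original loop per pass. */
    for (i = 0; i + 1 < count; i += 2) {
        sum += a[i] * x[i];
        sum += a[i + 1] * x[i + 1];
    }

    /* Clean-up code: handles the last element when count is odd. */
    if (i < count) {
        sum += a[i] * x[i];
    }
    return sum;
}
```

The clean-up step is the "extra code" mentioned above; it disappears when the compiler knows the trip count is always even.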
o MUST_ITERATE (min, max, %factor)
#pragma MUST_ITERATE (10, 100, 2)
for(i = 0; i < Count; i++)
{
    sum += a[i] * x[i];
}
§ Gives the compiler information about the trip (loop) count
In the code above, we are promising that:
Count >= 10, Count <= 100, and Count % 2 == 0
§ If you break your promise, you might break your code
§ Allows the compiler to remove unnecessary code, such as handling for trip counts that it now knows cannot occur
§ The modulus (%) factor allows for efficient loop unrolling
§ The #pragma must come right before the for() loop
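Because a broken MUST_ITERATE promise can silently break the code, one option is to check the same conditions in debug builds. The sketch below is one possible way to do that, not a TI-recommended pattern: the function name dot_checked is hypothetical, and the pragma is guarded with the __TI_COMPILER_VERSION__ macro on the assumption that the code may also be built with a non-TI compiler.

```c
#include <assert.h>

/* Hypothetical function: the caller guarantees that count is between
 * 10 and 100 and is a multiple of 2. */
int dot_checked(const short *a, const short *x, int count)
{
    int sum = 0;
    int i;

    /* Debug-build check of the promise made to the compiler below.
     * It is compiled out when NDEBUG is defined. */
    assert(count >= 10 && count <= 100 && count % 2 == 0);

#if defined(__TI_COMPILER_VERSION__)
#pragma MUST_ITERATE(10, 100, 2)
#endif
    for (i = 0; i < count; i++) {
        sum += a[i] * x[i];
    }
    return sum;
}
```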
o DATA_ALIGN (variable, 2^n alignment)
#pragma DATA_ALIGN (a, 8)
short a[256] = {1, 2, 3, ..., 256};
#pragma UNROLL (2)
#pragma MUST_ITERATE (10, 100, 2)
for(i = 0; i < Count; i++)
{
    sum += a[i] * x[i];
}
§ Tells the compiler to place the variable on a 2^n-byte boundary
§ Allows use of (double) word-wide optimized loads/stores
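If you want to confirm that an array really ended up on the requested boundary, a small check such as the following can help. This is a sketch under assumptions: the names coeffs and is_aligned8 are hypothetical, and the non-TI branch uses the C11 _Alignas specifier as a stand-in so the example also builds with other compilers.

```c
#include <stdint.h>

#if defined(__TI_COMPILER_VERSION__)
#pragma DATA_ALIGN(coeffs, 8)
short coeffs[256];
#else
/* Assumption: a C11 compiler, where _Alignas plays the same role. */
_Alignas(8) short coeffs[256];
#endif

/* Returns 1 when p sits on an 8-byte boundary, 0 otherwise. */
int is_aligned8(const void *p)
{
    return ((uintptr_t)p % 8u) == 0;
}
```

A call such as is_aligned8(coeffs) can then be placed in a start-up self-test or a debug assertion.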
· Adjust structure sizes to power of two
When arrays of structures are involved, the compiler multiplies the index by the structure size to perform the array indexing. If the structure size is a power of 2, the expensive multiply can be replaced by an inexpensive shift. Keeping structure sizes at a power of 2, padding them if necessary, therefore improves array-indexing performance.
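As an illustration, the sketch below pads a three-field structure with a fourth field so that its size becomes a power of 2. The type record_t and both helper functions are hypothetical names chosen for this example.

```c
#include <stddef.h>

/* Hypothetical record: three int fields of payload. */
typedef struct {
    int id;
    int value;
    int flags;
    int reserved;   /* explicit padding: makes the size four ints */
} record_t;

/* Returns 1 when n is a nonzero power of two. */
int is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Indexing records[i] needs i * sizeof(record_t); with a power-of-two
 * size the compiler can use a shift instead of a multiply. */
int record_value(const record_t *records, size_t i)
{
    return records[i].value;
}
```

The padding costs memory, so it is a trade-off worth making mainly for structures that are indexed in hot loops.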
· Place frequent case labels first
If the case label values are far apart, the compiler generates cascaded if-else-if code that compares against each case label in turn and jumps to the matching leg on a hit. By placing the frequent case labels first, you can reduce the number of comparisons performed in frequently occurring scenarios. Typically this means that cases corresponding to the success of an operation should be placed before the failure-handling cases.
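A small sketch of this ordering follows. The status codes and the function handle_status are hypothetical; the values are deliberately far apart, on the assumption that this pushes the compiler toward a compare chain rather than a jump table. Whether the compiler tests the labels in source order is not guaranteed.

```c
/* Hypothetical status codes with widely spaced values. */
enum {
    STATUS_OK      = 0,
    STATUS_RETRY   = 100,
    STATUS_TIMEOUT = 2000,
    STATUS_FATAL   = 30000
};

/* Returns 1 if processing may continue, 0 otherwise. */
int handle_status(int status)
{
    switch (status) {
    case STATUS_OK:        /* most frequent case: listed first */
        return 1;
    case STATUS_RETRY:     /* occasional */
        return 1;
    case STATUS_TIMEOUT:   /* rare failure */
        return 0;
    case STATUS_FATAL:     /* rarest failure */
        return 0;
    default:
        return 0;
    }
}
```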
· Minimize local variables
If a function has only a few local variables, the compiler can fit them all into registers and avoid frame pointer operations on local variables kept on the stack. This can improve performance considerably for two reasons:
§ All local variables are in registers so this improves performance over accessing them from memory.
§ If no local variables need to be saved on the stack, the compiler will not incur the overhead of setting up and restoring the stack pointer.
· Reduce number of function parameters
Function calls with a large number of parameters may be expensive, because many parameters must be pushed onto the stack on each call. For the same reason, avoid passing complete structures as parameters; pass pointers (or, in C++, references) instead.
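The difference between passing a structure by value and by pointer can be sketched as follows. The type filter_cfg_t and both functions are hypothetical and exist only to illustrate the point.

```c
/* Hypothetical configuration block, large enough that copying it on
 * every call is wasteful. */
typedef struct {
    int gain;
    int offset;
    int taps[32];
} filter_cfg_t;

/* By value: the whole structure is copied for every call. */
int scale_by_value(filter_cfg_t cfg, int sample)
{
    return sample * cfg.gain + cfg.offset;
}

/* By pointer: only an address is passed; const documents read-only use. */
int scale_by_pointer(const filter_cfg_t *cfg, int sample)
{
    return sample * cfg->gain + cfg->offset;
}
```

Both functions return the same result; only the cost of the call differs.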
· Inline small (1 to 5 lines of code) functions
Converting small functions (1 to 5 lines) into inline functions can noticeably improve throughput, because inlining removes the overhead of the function call and the associated parameter passing. Using this technique for bigger functions, however, can hurt performance because of the resulting code bloat. Also keep in mind that making a method inline should not increase dependencies by requiring an explicit header file inclusion where a forward reference would have sufficed for the non-inline version.
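A minimal sketch of a small function that is a good inlining candidate, assuming a C99 compiler. The names clamp_int and clamp_buffer are hypothetical.

```c
/* Hypothetical helper: clamps v to the range [lo, hi].
 * static inline keeps the definition local to the translation unit. */
static inline int clamp_int(int v, int lo, int hi)
{
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

/* Caller: once clamp_int is inlined, no call or parameter passing
 * remains inside the loop. */
void clamp_buffer(int *buf, int n, int lo, int hi)
{
    int i;
    for (i = 0; i < n; i++) {
        buf[i] = clamp_int(buf[i], lo, hi);
    }
}
```

The inline keyword is only a request; the compiler decides whether to inline based on its own heuristics and the optimization level.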
· Move loop-invariant calculations/comparisons out of the loop
In the loop below, the comparison does not depend on the loop variable, yet it is evaluated on every iteration:
for (int i = 0; i < Max; i++)
{
    if (Val > CONST_VAL)
    {
        ...
    }
    else
    {
        ...
    }
}
This can be optimized as:
if (Val > CONST_VAL)
{
    for (int i = 0; i < Max; i++)
    {
        ...
    }
}
else
{
    for (int i = 0; i < Max; i++)
    {
        ...
    }
}
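A concrete version of the same transformation, with the elided loop bodies replaced by placeholder work. Everything here is hypothetical: THRESHOLD stands in for CONST_VAL, and the assignments to out[i] are invented only to make the sketch complete.

```c
#define THRESHOLD 10   /* hypothetical stand-in for CONST_VAL */

/* Invariant test inside the loop: evaluated on every iteration. */
void fill_slow(int *out, int max, int val)
{
    int i;
    for (i = 0; i < max; i++) {
        if (val > THRESHOLD) {
            out[i] = i * 2;
        } else {
            out[i] = i;
        }
    }
}

/* Invariant test hoisted: evaluated once, and each loop body is
 * free of branches. */
void fill_fast(int *out, int max, int val)
{
    int i;
    if (val > THRESHOLD) {
        for (i = 0; i < max; i++) {
            out[i] = i * 2;
        }
    } else {
        for (i = 0; i < max; i++) {
            out[i] = i;
        }
    }
}
```

Both functions produce identical output; the second trades a little code size for a simpler loop body.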