More Performance¶
So far in all our examples we’ve been able to meet our timing goals by writing our code in the C programming language. The C compiler does a surprisingly good job of generating code most of the time. However, there are times when very precise timing is needed and the compiler doesn’t deliver it.
At these times you need to write in assembly language. This chapter introduces the PRU assembler and shows how to call assembly code from C. The details of programming in assembly are beyond the scope of this text.
The following are resources used in this chapter.
Note
Resources
Calling Assembly from C¶
Problem¶
You have some C code and you want to call an assembly language routine from it.
Solution¶
You need to do two things: write the assembler file and modify the Makefile to include it. For example, let’s write our own my_delay_cycles routine in assembly. The intrinsic __delay_cycles must be passed a compile-time constant. Our new my_delay_cycles can take a runtime delay value.
delay-test.pru0.c is much like our other C code, but on line 10 we declare my_delay_cycles, and then on lines 24 and 26 we call it with an argument of 1.
1// Shows how to call an assembly routine with one parameter
2#include <stdint.h>
3#include <pru_cfg.h>
4#include "resource_table_empty.h"
5#include "prugpio.h"
6
7// The function is defined in delay.pru0.asm in same dir
8// We just need to add a declaration here, the definition can be
9// separately linked
10extern void my_delay_cycles(uint32_t);
11
12volatile register uint32_t __R30;
13volatile register uint32_t __R31;
14
15void main(void)
16{
17 uint32_t gpio = P9_31; // Select which pin to toggle.;
18
19 /* Clear SYSCFG[STANDBY_INIT] to enable OCP master port */
20 CT_CFG.SYSCFG_bit.STANDBY_INIT = 0;
21
22 while(1) {
23 __R30 |= gpio; // Set the GPIO pin to 1
24 my_delay_cycles(1);
25 __R30 &= ~gpio; // Clear the GPIO pin
26 my_delay_cycles(1);
27 }
28}
delay.pru0.asm is the assembly code.
1; This is an example of how to call an assembly routine from C.
2; Mark A. Yoder, 9-July-2018
3 .global my_delay_cycles
4my_delay_cycles:
5delay:
6 sub r14, r14, 1 ; The first argument is passed in r14
7 qbne delay, r14, 0
8
9 jmp r3.w2 ; r3 contains the return address
The Makefile needs one addition to build both delay-test.pru0.c and delay.pru0.asm. If you look in the local Makefile you’ll see:
include /opt/source/pru-cookbook-code/common/Makefile
This Makefile includes a common Makefile at /opt/source/pru-cookbook-code/common/Makefile; this is the Makefile you need to edit. Open /opt/source/pru-cookbook-code/common/Makefile and go to line 195.
$(GEN_DIR)/%.out: $(GEN_DIR)/%.o *$(GEN_DIR)/$(TARGETasm).o*
@mkdir -p $(GEN_DIR)
@echo 'LD $^'
$(eval $(call target-to-proc,$@))
$(eval $(call proc-to-build-vars,$@))
@$(LD) $@ $^ $(LDFLAGS)
Add $(GEN_DIR)/$(TARGETasm).o to the prerequisites, as marked with asterisks above (the asterisks are only emphasis and are not part of the Makefile). You will want to remove this addition once you are done with this example, since it will break the other examples.
The following will compile and run everything.
bone$ config-pin P9_31 pruout
bone$ make TARGET=delay-test.pru0 TARGETasm=delay.pru0
/opt/source/pru-cookbook-code/common/Makefile:29: MODEL=TI_AM335x_BeagleBone_Black,TARGET=delay-test.pru0
- Stopping PRU 0
- copying firmware file /tmp/vsx-examples/delay-test.pru0.out to /lib/firmware/am335x-pru0-fw
write_init_pins.sh
- Starting PRU 0
MODEL = TI_AM335x_BeagleBone_Black
PROC = pru
PRUN = 0
PRU_DIR = /sys/class/remoteproc/remoteproc1
The resulting output is shown in Output of my_delay_cycles().
Notice the on time is about 35ns and the off time is 30ns.
Discussion¶
There is much to explain here. Let’s start with delay.pru0.asm.
Line | Explanation
---|---
3 | Declare my_delay_cycles to be global so the linker can find it.
4 | Label the starting point for my_delay_cycles.
5 | Label for our delay loop.
6 | The first argument is passed in register r14. The sub instruction decrements it by 1.
7 | qbne is a quick branch if not equal. It branches back to delay as long as r14 is not 0.
9 | Once we’ve delayed enough we drop through the quick branch and hit the jump. The upper bits of register r3 hold the return address, so we return to the C code.
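The loop described in the table can be modeled on a host machine to see how the instruction count grows with the argument. The following is a minimal sketch in ordinary C, not PRU code. The function name delay_loop_instructions is made up for illustration. It assumes one sub and one qbne per pass plus the final jmp, and it shows one corner case: because sub runs before the test, an argument of 0 wraps around to 0xFFFFFFFF and the loop makes 2^32 passes.

```c
#include <stdint.h>

/* Host-side model (not PRU code) of the loop in delay.pru0.asm.
 * Returns the number of instructions executed inside my_delay_cycles
 * for argument n: one sub and one qbne per pass, then the final jmp. */
uint64_t delay_loop_instructions(uint32_t n)
{
    /* sub runs before the test, so n == 0 wraps to 0xFFFFFFFF
     * and the loop makes 2^32 passes before r14 reaches 0 again. */
    uint64_t passes = (n == 0) ? (1ULL << 32) : (uint64_t)n;

    return 2 * passes + 1;   /* 2 instructions per pass + jmp */
}
```

With an argument of 1, as in delay-test.pru0.c, the model gives three instructions: sub, qbne, and jmp.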
Output of my_delay_cycles() shows the on time is 35ns and the off time is 30ns. With 5ns/cycle this gives 7 cycles on and 6 off. These times make sense because each instruction takes a cycle and you have: set R30, jump to my_delay_cycles, sub, qbne, jmp, plus the instruction (not seen) that initializes r14 to the passed value. That’s a total of six instructions. The extra instruction is the branch at the bottom of the while loop.
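The arithmetic above can be checked with a small host-side helper. This is a sketch only, assuming the 5ns/cycle figure quoted in the text (a 200 MHz PRU clock). The names PRU_CLOCK_HZ, ns_to_cycles, and cycles_to_ns are illustrative and do not come from the PRU support package.

```c
#include <stdint.h>

/* Assumed PRU clock: 200 MHz, which is the 5 ns/cycle used in the text. */
#define PRU_CLOCK_HZ 200000000u

/* Convert a measured time in nanoseconds to whole PRU cycles. */
uint32_t ns_to_cycles(uint32_t ns)
{
    return (uint32_t)(((uint64_t)ns * PRU_CLOCK_HZ) / 1000000000u);
}

/* Convert a PRU cycle count to nanoseconds. */
uint32_t cycles_to_ns(uint32_t cycles)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000u) / PRU_CLOCK_HZ);
}
```

For the measurements above, ns_to_cycles(35) gives 7 and ns_to_cycles(30) gives 6.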
Returning a Value from Assembly¶
Problem¶
Your assembly code needs to return a value.
Solution¶
The return value is passed back in register r14. delay-test2.pru0.c shows the C code.
1// Shows how to call an assembly routine with a return value
2#include <stdint.h>
3#include <pru_cfg.h>
4#include "resource_table_empty.h"
5#include "prugpio.h"
6
7#define TEST 100
8
9// The function is defined in delay2.pru0.asm in same dir
10// We just need to add a declaration here, the definition can be
11// separately linked
12extern uint32_t my_delay_cycles(uint32_t);
13
14uint32_t ret;
15
16volatile register uint32_t __R30;
17volatile register uint32_t __R31;
18
19void main(void)
20{
21 uint32_t gpio = P9_31; // Select which pin to toggle.;
22
23 /* Clear SYSCFG[STANDBY_INIT] to enable OCP master port */
24 CT_CFG.SYSCFG_bit.STANDBY_INIT = 0;
25
26 while(1) {
27 __R30 |= gpio; // Set the GPIO pin to 1
28 ret = my_delay_cycles(1);
29 __R30 &= ~gpio; // Clear the GPIO pin
30 ret = my_delay_cycles(1);
31 }
32}
delay2.pru0.asm is the assembly code.
1; This is an example of how to call an assembly routine from C with a return value.
2; Mark A. Yoder, 9-July-2018
3
4 .cdecls "delay-test2.pru0.c"
5
6 .global my_delay_cycles
7my_delay_cycles:
8delay:
9 sub r14, r14, 1 ; The first argument is passed in r14
10 qbne delay, r14, 0
11
12 ldi r14, TEST ; TEST is defined in delay-test2.pru0.c
13 ; r14 is the return register
14
15 jmp r3.w2 ; r3 contains the return address
An additional feature is shown in line 4 of delay2.pru0.asm. The .cdecls "delay-test2.pru0.c" directive says to include any defines from delay-test2.pru0.c. In this example, line 7 of delay-test2.pru0.c #defines TEST and line 12 of delay2.pru0.asm references it.
Using the Built-In Counter for Timing¶
Problem¶
I want to count how many cycles my routine takes.
Solution¶
Each PRU has a CYCLE register, which counts the number of cycles since the PRU was enabled. Each also has a STALL register that counts how many times the PRU stalled fetching an instruction. cycle.pru0.c shows how they are used.
1// Access the CYCLE and STALL registers
2#include <stdint.h>
3#include <pru_cfg.h>
4#include <pru_ctrl.h>
5#include "resource_table_empty.h"
6#include "prugpio.h"
7
8volatile register uint32_t __R30;
9volatile register uint32_t __R31;
10
11void main(void)
12{
13 uint32_t gpio = P9_31; // Select which pin to toggle.;
14
15 // These will be kept in registers and never written to DRAM
16 uint32_t cycle, stall;
17
18 // Clear SYSCFG[STANDBY_INIT] to enable OCP master port
19 CT_CFG.SYSCFG_bit.STANDBY_INIT = 0;
20
21 PRU0_CTRL.CTRL_bit.CTR_EN = 1; // Enable cycle counter
22
23 __R30 |= gpio; // Set the GPIO pin to 1
24 // Reset cycle counter, cycle is on the right side to force the compiler
25 // to put it in its own register
26 PRU0_CTRL.CYCLE = cycle;
27 __R30 &= ~gpio; // Clear the GPIO pin
28 cycle = PRU0_CTRL.CYCLE; // Read cycle and store in a register
29 stall = PRU0_CTRL.STALL; // Ditto for stall
30
31 __halt();
32}
Discussion¶
The code is mostly the same as other examples. cycle and stall end up in registers, which we can read using prudebug. The table below goes through cycle.pru0.c line by line.
Line | Explanation
---|---
4 | Include needed to reference CYCLE and STALL.
16 | Declaring cycle and stall. The compiler will optimize these and just keep them in registers. We’ll have to look at the cycle.pru0.lst file to see where they are stored.
21 | Enables CYCLE.
26 | Reset CYCLE. It ignores the value assigned to it and always sets it to 0. cycle is on the right hand side to make the compiler give it its own register.
28, 29 | Reads the CYCLE and STALL values into registers.
You can see where cycle and stall are stored by looking at lines 113..119 of /tmp/vsx-examples/cycle.pru0.lst.
113 102 .dwpsn file "cycle.pru0.c",line 23,column 2,is_stmt,isa 0
114 103;----------------------------------------------------------------------
115 104; 23 | PRU0_CTRL.CTRL_bit.CTR_EN = 1; // Enable cycle counter
116 105;----------------------------------------------------------------------
117 106 0000000c 200080240002C0 LDI32 r0, 0x00022000 ; [ALU_PRU] |23| $O$C1
118 107 00000014 000000F1002081 LBBO &r1, r0, 0, 4 ; [ALU_PRU] |23|
119 108 00000018 0000001F03E1E1 SET r1, r1, 0x00000003 ; [ALU_PRU] |23|
Here the LDI32 instruction loads the address 0x22000 into r0. This is the offset to the CTRL registers. Later in the file we see lines 146..152 of /tmp/vsx-examples/cycle.pru0.lst.
146 129;----------------------------------------------------------------------
147 130; 30 | cycle = PRU0_CTRL.CYCLE; // Read cycle and store in a register
148 131;----------------------------------------------------------------------
149 132 0000002c 000000F10C2081 LBBO &r1, r0, 12, 4 ; [ALU_PRU] |30| $O$C1
150 133 .dwpsn file "cycle.pru0.c",line 31,column 2,is_stmt,isa 0
151 134;----------------------------------------------------------------------
152 135; 31 | stall = PRU0_CTRL.STALL; // Ditto for stall
The first LBBO takes the contents of r0, adds the offset 12 to it, and copies 4 bytes into r1. This points to CYCLE, so r1 has the contents of CYCLE. The second LBBO does the same, but with offset 16, which points to STALL, thus STALL is now in r0.
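The offsets used by the LBBO instructions can be reproduced on a host with a struct that mirrors the register layout. This is an illustrative sketch, not the real pru_ctrl.h. The base address 0x22000 and the offsets 0, 12, and 16 come from the listing and text above. The two fields named reg_at_4 and reg_at_8 are placeholders for whatever registers sit in between, and the names pru_ctrl_model, cycle_addr, and stall_addr are made up here.

```c
#include <stddef.h>
#include <stdint.h>

/* Base address loaded into r0 by the LDI32 in the listing. */
#define PRU0_CTRL_BASE 0x00022000u

/* Illustrative mirror of the start of the control block.
 * reg_at_4 and reg_at_8 are placeholders, not real field names. */
typedef struct {
    uint32_t CTRL;      /* offset 0:  holds the CTR_EN bit */
    uint32_t reg_at_4;  /* offset 4:  placeholder          */
    uint32_t reg_at_8;  /* offset 8:  placeholder          */
    uint32_t CYCLE;     /* offset 12: cycle counter        */
    uint32_t STALL;     /* offset 16: stall counter        */
} pru_ctrl_model;

/* Address read by "LBBO &r1, r0, 12, 4". */
uint32_t cycle_addr(void)
{
    return PRU0_CTRL_BASE + (uint32_t)offsetof(pru_ctrl_model, CYCLE);
}

/* Address read by the second LBBO, which uses offset 16. */
uint32_t stall_addr(void)
{
    return PRU0_CTRL_BASE + (uint32_t)offsetof(pru_ctrl_model, STALL);
}
```

Under these assumptions CYCLE sits at 0x2200C and STALL at 0x22010.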
Now fire up prudebug and look at those registers.
bone$ sudo prudebug
PRU0> r
r
r
Register info for PRU0
Control register: 0x00000009
Reset PC:0x0000 STOPPED, FREE_RUN, COUNTER_ENABLED, NOT_SLEEPING, PROC_DISABLED
Program counter: 0x0012
Current instruction: HALT
R00: *0x00000005* R08: 0x00000200 R16: 0x000003c6 R24: 0x00110210
R01: *0x00000003* R09: 0x00000000 R17: 0x00000000 R25: 0x00000000
R02: 0x000000fc R10: 0xfff4ea57 R18: 0x000003e6 R26: 0x6e616843
R03: 0x0004272c R11: 0x5fac6373 R19: 0x30203020 R27: 0x206c656e
R04: 0xffffffff R12: 0x59bfeafc R20: 0x0000000a R28: 0x00003033
R05: 0x00000007 R13: 0xa4c19eaf R21: 0x00757270 R29: 0x02100000
R06: 0xefd30a00 R14: 0x00000005 R22: 0x0000001e R30: 0xa03f9990
R07: 0x00020024 R15: 0x00000003 R23: 0x00000000 R31: 0x00000000
So cycle is 3 and stall is 5. It must take one cycle to clear the GPIO and 2 cycles to read the CYCLE register and save it in the register. It’s interesting that there are 5 stall cycles.
If you switch the order of the two reads (lines 28 and 29 of cycle.pru0.c, which the .lst file above numbers as 30 and 31), you’ll see cycle is 7 and stall is 2. cycle now includes the time needed to read stall, and stall no longer includes the time to read cycle.
Xout and Xin - Transferring Between PRUs¶
Problem¶
I need to transfer data between PRUs quickly.
Solution¶
The __xout() and __xin() intrinsics are able to transfer up to 30 registers between PRU 0 and PRU 1 quickly. xout.pru0.c shows how __xout() running on PRU 0 transfers six registers to PRU 1.
1// From: http://git.ti.com/pru-software-support-package/pru-software-support-package/trees/master/examples/am335x/PRU_Direct_Connect0
2#include <stdint.h>
3#include <pru_intc.h>
4#include "resource_table_pru0.h"
5
6volatile register uint32_t __R30;
7volatile register uint32_t __R31;
8
9typedef struct {
10 uint32_t reg5;
11 uint32_t reg6;
12 uint32_t reg7;
13 uint32_t reg8;
14 uint32_t reg9;
15 uint32_t reg10;
16} bufferData;
17
18bufferData dmemBuf;
19
20/* PRU-to-ARM interrupt */
21#define PRU1_PRU0_INTERRUPT (18)
22#define PRU0_ARM_INTERRUPT (19+16)
23
24void main(void)
25{
26 /* Clear the status of all interrupts */
27 CT_INTC.SECR0 = 0xFFFFFFFF;
28 CT_INTC.SECR1 = 0xFFFFFFFF;
29
30 /* Load the buffer with default values to transfer */
31 dmemBuf.reg5 = 0xDEADBEEF;
32 dmemBuf.reg6 = 0xAAAAAAAA;
33 dmemBuf.reg7 = 0x12345678;
34 dmemBuf.reg8 = 0xBBBBBBBB;
35 dmemBuf.reg9 = 0x87654321;
36 dmemBuf.reg10 = 0xCCCCCCCC;
37
38 /* Poll until R31.30 (PRU0 interrupt) is set
39 * This signals PRU1 is initialized */
40 while ((__R31 & (1<<30)) == 0) {
41 }
42
43 /* XFR registers R5-R10 from PRU0 to PRU1 */
44 /* 14 is the device_id that signifies a PRU to PRU transfer */
45 __xout(14, 5, 0, dmemBuf);
46
47 /* Clear the status of the interrupt */
48 CT_INTC.SICR = PRU1_PRU0_INTERRUPT;
49
50 /* Halt the PRU core */
51 __halt();
52}
PRU 0 waits at line 40 until PRU 1 signals it. xin.pru1.c sends an interrupt to PRU 0 and waits for it to send the data.
1// From: http://git.ti.com/pru-software-support-package/pru-software-support-package/trees/master/examples/am335x/PRU_Direct_Connect1
2#include <stdint.h>
3#include "resource_table_empty.h"
4
5volatile register uint32_t __R30;
6volatile register uint32_t __R31;
7
8typedef struct {
9 uint32_t reg5;
10 uint32_t reg6;
11 uint32_t reg7;
12 uint32_t reg8;
13 uint32_t reg9;
14 uint32_t reg10;
15} bufferData;
16
17bufferData dmemBuf;
18
19/* PRU-to-ARM interrupt */
20#define PRU1_PRU0_INTERRUPT (18)
21#define PRU1_ARM_INTERRUPT (20+16)
22
23void main(void)
24{
25 /* Let PRU0 know that I am awake */
26 __R31 = PRU1_PRU0_INTERRUPT+16;
27
28 /* XFR registers R5-R10 from PRU0 to PRU1 */
29 /* 14 is the device_id that signifies a PRU to PRU transfer */
30 __xin(14, 5, 0, dmemBuf);
31
32 /* Halt the PRU core */
33 __halt();
34}
Use prudebug to see that registers R5-R10 are transferred from PRU 0 to PRU 1.
PRU0> r
Register info for PRU0
Control register: 0x00000001
Reset PC:0x0000 STOPPED, FREE_RUN, COUNTER_DISABLED, NOT_SLEEPING, PROC_DISABLED
Program counter: 0x0026
Current instruction: HALT
R00: 0x00000012 *R08: 0xbbbbbbbb* R16: 0x000003c6 R24: 0x00110210
R01: 0x00020000 *R09: 0x87654321* R17: 0x00000000 R25: 0x00000000
R02: 0x000000e4 *R10: 0xcccccccc* R18: 0x000003e6 R26: 0x6e616843
R03: 0x0004272c R11: 0x5fac6373 R19: 0x30203020 R27: 0x206c656e
R04: 0xffffffff R12: 0x59bfeafc R20: 0x0000000a R28: 0x00003033
*R05: 0xdeadbeef* R13: 0xa4c19eaf R21: 0x00757270 R29: 0x02100000
*R06: 0xaaaaaaaa* R14: 0x00000005 R22: 0x0000001e R30: 0xa03f9990
*R07: 0x12345678* R15: 0x00000003 R23: 0x00000000 R31: 0x00000000
PRU0> *pru 1*
pru 1
Active PRU is PRU1.
PRU1> *r*
r
Register info for PRU1
Control register: 0x00000001
Reset PC:0x0000 STOPPED, FREE_RUN, COUNTER_DISABLED, NOT_SLEEPING, PROC_DISABLED
Program counter: 0x000b
Current instruction: HALT
R00: 0x00000100 *R08: 0xbbbbbbbb* R16: 0xe9da228b R24: 0x28113189
R01: 0xe48cdb1f *R09: 0x87654321* R17: 0x66621777 R25: 0xddd29ab1
R02: 0x000000e4 *R10: 0xcccccccc* R18: 0x661f83ea R26: 0xcf1cd4a5
R03: 0x0004db97 R11: 0xdec387d5 R19: 0xa85adb78 R27: 0x70af2d02
R04: 0xa90e496f R12: 0xbeac3878 R20: 0x048fff22 R28: 0x7465f5f0
*R05: 0xdeadbeef* R13: 0x5777b488 R21: 0xa32977c7 R29: 0xae96b530
*R06: 0xaaaaaaaa* R14: 0xffa60550 R22: 0x99fb123e R30: 0x52c42a0d
*R07: 0x12345678* R15: 0xdeb2142d R23: 0xa353129d R31: 0x00000000
Discussion¶
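The table below goes through xout.pru0.c line by line.
Line | Explanation
---|---
4 | A different resource table so PRU 0 can receive a signal from PRU 1.
9-16 | Defines the bufferData structure and dmemBuf, which hold the six values to be sent to PRU 1.
21-22 | Define the interrupts we’re using.
27-28 | Clear the interrupts.
31-36 | Initialize dmemBuf with easy to recognize values.
40 | Wait for PRU 1 to signal.
45 | Send the data. 14 is the device ID for a PRU-to-PRU transfer, 5 says to start with register r5, and dmemBuf holds the data to send.
48 | Clear the interrupt so it can go again.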
The table below goes through xin.pru1.c line by line.
Line | Explanation
---|---
8-15 | Place to put the received data.
26 | Signal PRU 0.
30 | Receive the data. The arguments are the same as xout(), 14 says to get the data directly from PRU 0. 5 says to start with register r5. dmemBuf is where to put the data.
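To make the register mapping concrete, here is a host-side model of the transfer, written in ordinary C. It is only a sketch and does not use the real __xout() and __xin() intrinsics, which exist only in the PRU compiler. The functions model_xout and model_xin and the array standing in for the registers are assumptions made for illustration. The model shows that a 24-byte bufferData starting at register 5 occupies exactly r5 through r10.

```c
#include <stdint.h>
#include <string.h>

#define NUM_REGS 32   /* r0..r31 */

typedef struct {
    uint32_t reg5, reg6, reg7, reg8, reg9, reg10;
} bufferData;

/* Model of xout: place the object into consecutive registers,
 * starting at register `base`. Returns 0 on success, -1 if the
 * object would run past the last register. */
int model_xout(uint32_t regs[NUM_REGS], unsigned base, const bufferData *src)
{
    size_t words = sizeof(*src) / sizeof(uint32_t);

    if (base + words > NUM_REGS)
        return -1;
    memcpy(&regs[base], src, sizeof(*src));
    return 0;
}

/* Model of xin: read the same registers back into an object. */
int model_xin(const uint32_t regs[NUM_REGS], unsigned base, bufferData *dst)
{
    size_t words = sizeof(*dst) / sizeof(uint32_t);

    if (base + words > NUM_REGS)
        return -1;
    memcpy(dst, &regs[base], sizeof(*dst));
    return 0;
}
```

Calling model_xout with base 5 and the values from xout.pru0.c leaves 0xDEADBEEF in slot 5 and 0xCCCCCCCC in slot 10, matching the prudebug output above.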
If you really need speed, consider using xout and xin directly in assembly.
Copyright¶
/*
 * Copyright (C) 2015 Texas Instruments Incorporated - http://www.ti.com/
 *
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * * Redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer.
 *
 * * Redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in the
 * documentation and/or other materials provided with the
 * distribution.
 *
 * * Neither the name of Texas Instruments Incorporated nor the names of
 * its contributors may be used to endorse or promote products derived
 * from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */