Understanding the BeagleBone’s built in microcontrollers (2016) (righto.com)
154 points by codezero on Jan 5, 2019 | hide | past | favorite | 57 comments


Why do the PRUs exist? Part of the reason is that real time programming is kind of hard. It's much easier to reason about if you have dedicated processors per task!

My experience with real time systems led me to believe that RTOSes are kind of harmful and don't live up to expectations.

Using an RTOS felt basically like a regular OS but with cooperative multitasking. I could never separate the time consumption of one task from another and think about them in isolation.

What I really ended up wanting was preemptive multitasking.

But with the CPU time divided into fixed allocations for, say, 10 to 100 tasks, each with a dedicated slice of CPU time that never varies.

Then you can finally reason about tasks in isolation!

Interrupts are then disallowed.

It's basically equivalent to having a bunch of little CPUs communicating. (One should try to use ring buffers for all communication, not shared memory.)

To me, this is the only real time architecture I feel smart enough to even handle.
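A minimal sketch of that fixed-slice idea: a cyclic executive that hands each task an unconditional, never-varying slice of CPU time. The task names and the 1 ms slice here are made up for illustration; a real system would size slices from worst-case execution times.

```cpp
#include <array>
#include <chrono>
#include <thread>

// Counters just to show each task runs exactly once per frame.
int sensor_runs = 0, control_runs = 0, comms_runs = 0;

void sensor_task()  { ++sensor_runs;  /* poll inputs here */ }
void control_task() { ++control_runs; /* run control loop here */ }
void comms_task()   { ++comms_runs;   /* drain ring buffers here */ }

// Run a bounded number of schedule frames; a real system would loop forever.
void run_frames(int frames) {
    using clock = std::chrono::steady_clock;
    constexpr auto kSlice = std::chrono::milliseconds(1);
    const std::array<void (*)(), 3> schedule{sensor_task, control_task, comms_task};
    auto next = clock::now();
    for (int f = 0; f < frames; ++f) {
        for (auto task : schedule) {
            task();                               // must complete within its slice
            next += kSlice;
            std::this_thread::sleep_until(next);  // pad out the remainder of the slice
        }
    }
}
```

Because each task's slice starts at a fixed offset in the frame regardless of what the other tasks do, you can reason about each task's timing in isolation.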


I designed a product once using a Parallax Propeller. I loved it because it had 8 CPU cores that had rotating access to shared memory and no interrupts. If I remember correctly I did implement a ring buffer to handle communication between cores. I think the Propeller 2 is close to being released now that I think about it.


Totally. Now that is a nice architecture. It looks like it may have trouble attracting wide industry adoption as it is a little idiosyncratic with Spin and custom languages for everything...


Yeah Spin is a definite barrier. Right when I was wrapping up my design, Parallax was just starting to encourage people to use the C compiler in production. I would guess the Prop 2 will be similar.

It's still different enough to scare a lot of people off, but I'll give it a shot. I went poking around the forums and it looks like they just shipped out the first 100 evaluation boards in December.


LITMUS^RT, a real time extension of the Linux kernel, supported such partitioned scheduling. The last version is based on 4.9.30 and there are no plans to maintain it. Check https://www.litmus-rt.org/ for more details.


Even Litmus isn't great here AFAIU. The issue is that you kind of need to design the kernel from the ground up to not have potential long times between preemption points.


Don’t some of the commercial RTOSes have optional fixed, time slot–based scheduling like you describe? I thought I had seen that, but now I can’t find any examples with some simple searches.


Most of them will support that. It's not actually that big of a jump from cooperative to preemptive multitasking. Preemptive can give lower latency, but you can still have high priority tasks blocking lower priority ones. Workarounds exist, but they aren't a panacea.


> One should try to use ring buffers for all communication, not shared memory

Could you expound upon that?


It is similar to the concept used in Go and Erlang: don't use shared memory and locks; instead, send messages between processes.

It works great with embedded real time systems as well. It can be easier to get right than the alternative -- because without locks, the performance is more predictable.

Less chance for race conditions and deadlocks as well.

It is also basically the same thing as hooking up different processors with a bunch of serial ports. UARTs typically have hardware FIFOs for send and receive. So basically, you simulate a bunch of separate processors all communicating. You could do it in hardware or within software; both work equally well.


Would you not need to use locks for the ring buffers?


Here's a good explanation from Linux Device Drivers, Third Edition.

"When carefully implemented, a circular buffer requires no locking in the absence of multiple producers or consumers. The producer is the only thread that is allowed to modify the write index and the array location it points to. As long as the writer stores a new value into the buffer before updating the write index, the reader will always see a consistent view. The reader, in turn, is the only thread that can access the read index and the value it points to. With a bit of care to ensure that the two pointers do not overrun each other, the producer and the consumer can access the buffer concurrently with no race conditions."


This would rely on an atomic update of the write index surely - is this guaranteed?


The requirement is generally that it can be written in a single instruction.

This can fail on 32-bit processors like the x86 if you use a 64-bit index. For embedded work, it can fail on 8-bit processors if you use a 16-bit index.

You can check, in C++, whether operations on a type are a single instruction. This might be true on an x86_64 and false on an x86:

  std::atomic<int64_t>::is_lock_free()


Yes, most lock-free thread-safe patterns require the use of some atomic instructions.

In the context of embedded real time systems you are generally targeting specific hardware, so you can make guarantees about the atomicity of specific instructions.
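A quick way to probe that on a given build target, assuming C++11's std::atomic (the helper names are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// True when an index of this width can be loaded/stored atomically without
// a lock on this target, i.e. the single-writer ring buffer trick is safe.
bool index32_lock_free() {
    std::atomic<uint32_t> idx{0};
    return idx.is_lock_free();
}

bool index64_lock_free() {
    std::atomic<uint64_t> idx{0};
    return idx.is_lock_free();
}
```

On a typical x86_64 build both return true; on a 32-bit or smaller target the wider index may not be, which is exactly the failure mode described above.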


I think what you’re describing is one of the objectives of the Nerves Project (https://nerves-project.org), which happens to include the BeagleBone as a target platform.


Nerves doesn't look terribly amenable to realtime in design, however. About as amenable as running Node.js on linux. Not a great solution for strict timing requirements.


Thanks for the detailed response.


I love the BeagleBone boards. Some of my favorite SBCs, especially the Octavo OSD335x-based [1] PocketBeagle [2].

For anyone interested in more examples and explanation of PRU programming on BeagleBone boards this repo is really good: https://github.com/MarkAYoder/PRUCookbook

[1] https://octavosystems.com/octavo_products/osd335x/

[2] https://beagleboard.org/pocket


We use them at work. The NAND flash is really solid compared to what we were using previously. We also use a super cap to keep the lights on while the system shuts down.

Notably, though, the power controller for the BeagleBone has a serious issue with brownouts/slow power application. If the power ramps up too slowly it'll hang and need to be manually reset. That's a "not ready for prime time" sort of thing. We solve it with a small uP that monitors the BB and kicks it if it fails to power up or hangs.


You could add a voltage based trigger chip. Set it up to hold the reset line on the BB until it reaches a sufficient voltage.


You could, though I've found analog jungle logic to do things like this tends to end in sorrow. With a watchdog uP you can also force the host to verify it's actually functioning. Bonus is that it cures "failed to boot properly", which happens.

We do that to avoid having to send out a tech. Amusingly a friend maintains a similar system on satellite hardware.


Sending a tech out to fix a satellite would definitely be quite a trip.


I put my BBBs aside for a bit and when I came back to them I could never get the ROM flashing instructions to work again and gave up.

But I get a kick out of getting more oomph out of little devices than you’d think you could, and one of the things I’d hoped to accomplish with the PRUs was to get some streaming processing going.

If I understood correctly, if you wanted stable, high sampling rates on GPIO you were going to want to do that on the PRU, and with 2 you could send and receive in duplex. So I thought some sort of control or logging backplane to remove traffic off of the underpowered ethernet port would be pretty cool.

And then the other idea was figuring out how to cross compile zlib to do transport compression without hogging the CPU.

But now I’m off fiddling with NanoPis, and not even really as embedded devices. With 1GB of memory and GigE I’m thinking of them as tiny servers instead. I’m probably missing out, though.


Weird - I had the same problem, leaving the board untouched for a while (a year) then trying again to reflash and failing completely. I just assumed I’d messed something up accidentally.


I think more boards should have a bunch of simple microcontrollers on board (besides the main CPU) for real-time control.


NXP's i.MX7 and i.MX8 lines of application SoCs contain Cortex-M4 subprocessors for real-time work.

https://www.nxp.com/products/processors-and-microcontrollers...

https://www.nxp.com/products/processors-and-microcontrollers...


How do the M4 and Ax(x) cores communicate? From the block diagram it looks like they are very separate from each other. The neat thing about PRUs is that afaik they also have a very high performance connection to the main memory, allowing you to do high-speed IO via them.


(Author here.) One of the inconvenient things about the PRUs is there's a fair bit of latency to access anything outside the PRU. For instance, 34 cycles to access a non-PRU GPIO or 47 cycles to access main (DDR) memory. I've run into problems where I think the PRUs will have plenty of performance, and then these latencies eat up the time budget.

http://processors.wiki.ti.com/index.php/AM335x_PRU_Read_Late...


By access do you mean read or write? I am using that right now and it is just 2 cycles to write (SBBO) and 8 cycles total for it to take effect (the PRU is free to do other stuff after the 2 cycles). Better off reading from the PRU pins though, just 1 cycle. Maybe you can use the other PRU to read from DDR and signal when the data is ready, if using the non-PRU pins is the only way to go.


47 cycles to DDR is pretty great though. Big ol' x86/POWER/s390 cores can be close to an order of magnitude farther away from the actual off-chip DRAM.


The PRUs run at 200MHz. So that is something like a 250ns latency, so four times faster than an Apple ][ for RAM access. (Maybe ten times faster, I think all the memory instructions took two cycles or more.)


They communicate via AXI/AHB bus[0] according to the block diagram in the datasheet[1].

[0] https://en.m.wikipedia.org/wiki/Advanced_Microcontroller_Bus...

[1] https://www.nxp.com/docs/en/data-sheet/IMX7DCEC.pdf


Interesting, I went out and looked more into it and it seems like rpmsg from openamp is the "blessed" way of doing communication between the cores, and they seem to share the memory space with the memory controller sitting on the same bus.

Random presentation slides I found on the topic:

https://elinux.org/images/3/3b/NOVAK_CERVENKA.pdf


Some of the smaller Piccolo and Delfino microcontrollers from TI already do, called the CLA (Control Law Accelerator).

http://processors.wiki.ti.com/index.php/Control_Law_Accelera...

It even has its own (kind of) C compiler: http://processors.wiki.ti.com/index.php/C2000_CLA_C_Compiler


Just to clarify, an interesting thing about the BeagleBone is the microcontrollers are inside the main CPU chip.


The microcontrollers (PRUs) run at 1/5 the clock speed (200MHz) of the CPU (1000MHz) and can share memory with the CPU as well.


Sounds like these things are fantastic for DMA-based hacks that use bit-banged GPIO pins to output FM audio, vga video, ...


Better yet, a 14 channel 100 MHz logic analyzer...

https://hackaday.com/2015/02/19/turn-your-beagleboneblack-in...


Incredible, thanks!


Yeah, exactly.

They're designed for deterministic bit banging of random protocols.


I'd rather have a bunch of them in a package attached to a bus that connects to the CPU, so it is extensible (i.e. you can add more microcontrollers if desired). It's no problem that the bus has no real time constraints, because the CPU doesn't have them either.


What's the big advantage it gives?


My personal use case was for hard realtime requirements.

For example, burning electronic fuses in an ASIC, where timing had to be guaranteed or the chip was toast.

Beyond that, like most have said, bit banging odd protocols at reasonable speed, with full access to main memory for streaming.

For a funny optimization that resulted in lower performance: I offloaded one of our protocols to one of these coprocessors but noticed the overall throughput dropped to about 1/4 what it was before, with huge latency between transactions. I looked at the cpu and, as was the goal, main processor usage was down from 30% to about 2%. I eventually realized that, due to the low CPU usage, the main processor clock rate was scaled to the lowest frequency. This caused interrupt handling, and everything else, to slow wayyy down, including the arbiter in the kernel driver and user space code where the transactions were coming from.

The fix was to disable clock scaling completely, burning some watts more. I’m sorry dolphins :(


For running real-time control algorithms. As an example, in a digital power inverter it could be generating PWM signals controlling an H-bridge while the main CPU deals with UI tasks such as button presses and controlling the LCD screen.


From a user/developer perspective, the advantage would be low latency.

I imagine that the more interesting advantages are from the perspective of the board designer.


You can get several kinds of guarantees out of these combinations: the microcontroller gives you an environment suitable for guaranteed timings (not just latency), but also a separate domain to isolate your software in. So stuff like highly safety critical code goes into the microcontroller (e.g. tight monitoring and control loops) while the complex application fluff (UI, networking, bells and whistles) goes on the main CPU with a big, convenient OS to develop on. And the two parts can barely influence each other in unintended ways.

There are a lot of boards on the embedded systems market that use e.g. ARM's big.LITTLE architecture for that.


Sorry, I misunderstood the question to mean, "What is the advantage to having the microcontroller on the CPU die, as opposed to having separate chips on the same board."


There are software side advantages as well, since the interconnect between the two can be very different from a bus interface like SPI or I2C. Some of these systems offer shared memory between the cores, for example.


I feel like I constantly read things 2 weeks late on HN. I just got done integrating a SAM32 with a raspi. Embarrassing to say, I have a BBB on the shelf, in a box.


Did you program the little chip from the ras pi? I did it with OpenOCD and it was rather easy. It was more reliable than Atmel's own tools which are garbage :)


No both separately. But I'm going to look at OpenOCD now.


The one issue I ran into was that verification failed unless I padded the end of the image to upload with 0xFF until it finished on a complete sector.


The cool thing about the BeagleBone in general is that its design is released under a permissive open source license. Kind of like the “Apache 2.0” of hardware.

A Raspberry Pi is closed source (though it can be designed in as a component to a larger open source hardware system).

An Arduino is “share-alike”, so more restrictive like a GPL 2+ license.


One of the use cases: a 14-channel, 100 MSPS logic analyzer.

[https://hackaday.com/2015/02/19/turn-your-beagleboneblack-in...]


More recently, people have experimented with repurposing the ARISC power management core on Allwinner H3-based boards like the Orange Pi for similar realtime purposes: https://github.com/orange-cnc/h3_arisc_firmware (It's basically an outdated OpenRISC core with direct access to some GPIO pins running at up to a few hundred megahertz.)


I bought (or crowdfunded, don't remember) the https://bela.io/ cape for the BeagleBone and have it sitting on the shelf waiting for me to free up some time to dig into the existing ecosystem and try to take things further.



