Multicore use with Pure Data and Pisound?

Hey :slight_smile:

Just ordered my Pisound yesterday and can’t wait to play around with it.

I have been using PD for years and I am pretty familiar with it, but not so familiar with Linux and the RPi3.

Multi core use and PD:
In PD, if you want to use more than 1 core, you have to use the PD object called pd~. When using that object, it will boot another instance of PD, in a new thread (I think), which makes it possible to use more than 1 core. To my understanding this just makes it possible for the operating system to move the process to another core; it's still the operating system that handles the distribution of processes.
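As far as I understand, you should even be able to see this from a shell - if the binary is simply called pd, a main patch that spawns 3 more instances through pd~ should show up as 4 separate processes, each of which the OS can put on any core:

ps -C pd -o pid,psr,pcpu,comm
# the PSR column is the core each process is currently running on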

Since the RPi3 has 4 cores, but not especially fast ones, it seems pretty essential to utilise more than 1 core, at least for PD, given the way it works.

I was wondering:
How well does Patchbox/Linux/Pisound handle using more than 1 core with PD?

Does using the Pisound app to load patches close down all instances of PD, in case you have several instances of PD open for multi-core patch use?

And I'm also just curious to hear if anyone is doing this successfully :slight_smile:

Thanks!

Sorry, not a PD user - but I know JACK, and so far I have only used it single-threaded for the main audio transport. That is not a problem, since it's just handing one frame of data from one plugin to the next - and those plugins can absolutely process their samples in their own threads. So - 5 plugins could use 6 threads.


That’s nice to know :slight_smile:

My plan was to try and build a polyphonic synth, with each of the 4 voices running on its own instance of Pure Data, using the pd~ object.

I open one instance of PD and then use pd~ to open 3 more Pure Data instances. Each voice could essentially be assigned to its own core, so we could make more advanced synths.

I know there will probably be a little bit of latency sending data from one instance of PD to another.

I wanna find out how well this could work, as I don't find 1 core to be enough.

Since we have 4, we might as well use them :slight_smile:

You are almost certainly overthinking this. In my benchmark tests (Benchmarking audio performance) I get over 40 simple, but realistic synth voices at very low latency on a Raspberry Pi 3b. With tuning the system a little, you can get double that. This is in single-core SuperCollider. Your performance with Pd should be similar.

I’d say just build the 4 voice polyphonic synth directly in Pd, the normal way… Wait until you hit limitations before you introduce the complicating factor of pd~.

I don't think I am overthinking this. I hit the CPU limit all the time; I also do on my Organelle. I need the extra processing power to be able to make what I want to make. Once you start to use better quality audio processing - bandwidth-limited oscillators, oversampled filters - and have several voices, a single core's limit is reached in no time. It's just not enough. I do not want to make a simple synth, I want to make something really cool.

And I was just thinking about how to go about this in a creative way - utilising all the available processing power seems like a rational thing to do.

I don't understand why anyone would not try to make use of all 4 cores. It makes no sense not to use the 3 other cores when they are there - like 3/4 of the device is not being used. That's not very good, performance-wise.

But yeah, I think I'll have time to try it out this coming weekend, and I'll see what I can come up with. I am mostly surprised that not a lot of people have looked into this before, when there are 4 cores available and most of the time only 1 is in use.

Because it isn't a no-cost option. Splitting audio processing between cores involves synchronization to divide up the work, then synchronization to recombine it again. That synchronization takes processing time, and at low latencies it can be a significant portion of the available resources. You can mitigate this somewhat by pipelining, if your structure allows it (for example, doing all the voices on one core, and the shared effects chain on another)… but that becomes another complex constraint to work with. And the code becomes more complex to manage, and your latency goes up.

In DSP there are typically many optimizations to turn to first, which yield big performance wins, before introducing the complexities of multi-core computation. You are coding in Pd, in which it is very easy to be inefficient in structure - I'm not saying you are, but spending time making sure you have taken care here first would make it easier to then explore using pd~.

I have a reasonable background in audio processing (:smiley_cat:) - I think one can create four really cool, high quality synth voices on a single RPi core.


Have you tried:

echo performance | sudo tee /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
# the Pi has only one governor, changing one changes them all

This gets you almost double the performance on a Pi 3.

You don’t say what model Organelle. The older one was significantly less powerful.

Lastly - remember that the other 3 cores are not totally idle. Run htop to see… The operating system needs core time to run, and most subsystems (USB, WiFi, Bluetooth, display, etc.) have associated processes that need some CPU as well. You'd never want to max out all four cores doing DSP - you'd have nothing left for the OS to get the samples out of the machine! For an operating system like Linux, it isn't unreasonable to leave a whole core or two free for spikes in OS demand.
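If you want per-core numbers rather than htop's meters, mpstat shows them too (it's in the sysstat package, which isn't installed by default):

sudo apt install sysstat
mpstat -P ALL 1
# prints one report per second, with a row per core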


Ahh yes, one of the costs is latency, for example. I just want to find out how bad it is. And yes, of course the OS uses some CPU too. But in any case, spreading voices out over multiple threads at least gives the computer the option to distribute the workload to where it makes most sense.

Thanks for the tip on optimising - that is really nice to know. I will surely try it out; it seems like an easy fix :slight_smile:

For DSP I never found Pure Data to be the most efficient. Once you start doing upsampling, making bandwidth-limited oscs, or even creating your own filter using fexpr~, the efficiency is not the best.

On my MacBook I can do more complex stuff, but on the Organelle (original) and Pisound I always have to compromise, even for single instruments.

But yeah, will surely give the performance tip a go :slight_smile:

Hey again @mzero

I was thinking about trying out that tip you suggested today, for better performance.

But I'm not sure how to do it. Could you explain, if it's not too complicated?

My Pi setup:
Patchbox OS & Pisound audio interface

Thanks in advance :slight_smile:

Just run the line of code I gave in a shell. You only need to run it once per boot.
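If you'd rather have it happen automatically at every boot, one option (just a sketch - assuming your image still runs /etc/rc.local at startup) is to add the line there:

# in /etc/rc.local, before the final "exit 0" - rc.local runs as root, so no sudo/tee needed
echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor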

What are you using to measure the performance of what you are doing? How many voices you can run before getting xruns is what I used in my performance tuning work.

Cool, I will look into creating a shell script. I haven't tried that before, but I'll give it a go, thanks :slight_smile:

To be honest, I haven't done any methodical measurements. I just want to make a system where I don't have to worry too much about CPU.

I usually tweak polyphonic patches by changing voice number too.

I am kind of considering getting a "small army" of RPi3s with Pisound, but I need to know how much I can push them before investing and/or selling some of my other gear to finance it.

PS.

Was wondering if there are any “costs” to using this script?

About testing it - can I just load a PD patch and check the CPU load before and after applying the script?

The “cost” of setting the CPU scaling governor to “performance” is that it won’t drop the CPU clock speed when it is idle. This means the unit will consume more power, and generate more heat. If you are not running on battery, this isn’t much of a concern. I often leave mine in this state for hours at a time.
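You can watch the effect yourself: with the default governor the clock drops when idle, with "performance" it stays at the maximum.

vcgencmd measure_clock arm
# vcgencmd ships with Raspbian; or read it through cpufreq (value in kHz):
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq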

Be hesitant to start on the road of “premature optimizations”. If you don’t know what limitation you are running into, and at what point it occurs, then you won’t know what direction to head in to surmount it. As I suggested much earlier - if you haven’t actually run into a limitation in your four voice synth… then don’t do anything yet… just build it and use it. Once you’ve hit some wall, then we can look for what might work around it.

In other words - you already have a moderately powerful system: for many things you don't have to worry about any of this (for example, MODEP can patch and run some pretty nice pedal boards on your system just as it is). There are certainly things which will overstress the Pi - and some work you can do to alleviate that… but what to do really depends on what limits you're hitting with a particular project. As for CPUs, remember that it isn't as if 3 of the 4 CPUs are idle and you can just get 4x the power by using them somehow… It really won't work like that.

We have this tweak enabled by default in Patchbox OS images. On Patchbox OS it can be enabled using:

sudo systemctl enable cpu_performance_scaling_governor

(it should be enabled by default)


@Giedrius

Sounds great :slight_smile:

Is there any way I can check if it's enabled, just to learn a bit and see it with my own eyes?

cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
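# should print "performance"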

sudo systemctl status cpu_performance_scaling_governor


@Giedrius

Thanks :slight_smile:


Picking up that old discussion… One thing that I think would make sense is to use the Linux “isolcpus” feature with nohz_full, to essentially dedicate the audio processing core (and have the web UI only care about that one in its CPU usage metric).

This should allow us to ensure that the core doing audio processing is entirely dedicated to that and only that… no random kernel threads, GUI stuff, etc. messing with it and occasionally causing glitches.

It’s still Linux … there are things that can still go wrong, but it should help without needing the RT kernel.

What I'm not entirely sure of is whether jackd would also benefit from that treatment, and whether it's preferable to have it on the same core as mod-host or, on the contrary, on a different (dedicated?) one. We have 4 cores, and 2 are plenty to deal with networking and the web UI… Also, we probably want to route audio interrupts to where jackd is pinned, and no other interrupts to that core.
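Roughly what I have in mind, as an untested sketch (the core number and the IRQ are placeholders):

# 1) reserve core 3 at boot by appending to the single line in /boot/cmdline.txt:
#    isolcpus=3 nohz_full=3 rcu_nocbs=3
# 2) after a reboot, pin jackd (and/or mod-host) onto the isolated core:
sudo taskset -cp 3 $(pidof jackd)
# 3) steer the sound card's interrupt to that same core
#    (look up its IRQ number in /proc/interrupts first):
echo 3 | sudo tee /proc/irq/<irq-number>/smp_affinity_list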

I've been experimenting a bit with multithreading in Pd for this exact purpose, so I thought I'd share some of what I've found. It's nothing new really, but it might save you some time…

Sending audio from one process to another is very demanding, so you should probably only send data to your additional Pd instances at control rate. Also filter (a 6-8 pole biquad is really efficient - check out brickwall~ if you haven't already :slight_smile:) and downsample to 48k or similar before sending audio to your main process. Preferably just one channel of audio per thread.

This also means that you should avoid having too many threads with audio. It's probably best to have 4 or 5 threads running on an RPi: one for the GUI, one for the main audio, and 3 for additional voices. And then possibly have several voices per audio thread.

My personal plan for what I'm working on is to have 3 audio subprocesses with 2 voices each, plus the main audio process, which deals with control and additional audio effects. I currently send control data to the voices over UDP.

I think you already know this, but don't mix GUI with audio. Run threads with -nogui and -noaudio if possible, and don't use visual arrays, bangs, toggles, etc.
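If you launch the extra instances yourself rather than through pd~, that looks something like this (the patch names are made up):

pd -nogui voice1.pd &
# -noaudio for an instance that only does control-rate work:
pd -nogui -noaudio control.pd &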

Not related to multithreading, but good for reducing CPU load:

It's very good to know how to write externals! Complex processes can often be achieved with something like 1/10 of the CPU usage using one external, compared to building them from a large quantity of objects. This is especially true for musical filters where you need a block size of 1 for a feedback loop. I usually design algorithms with mostly vanilla objects and then type them out as an external when the voicing is right.
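Compiling one on the Pi is a one-liner - a sketch, assuming the puredata-dev package (which provides m_pd.h) and a made-up source file myfilter~.c:

gcc -shared -fPIC -O3 -I/usr/include/pd -o myfilter~.pd_linux myfilter~.c
# the resulting myfilter~.pd_linux just needs to be on Pd's search path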

And many bandlimited oscillators are very demanding! I found that using the transition splice method can sound amazing - bright and present in all registers - and it runs a lot faster, with less aliasing, compared to the nice-sounding externals out there. Check out the tutorial in Pd if you haven't already :slight_smile: It can take some time with voicing and anti-alias calibration, but it has been well worth it for me at least. Use the example in the tutorials: if you send a perfect triangle wave through that transition table you will have a bandlimited square, where you can add pulse width control really easily. I won't spoil all the fun by explaining too much. I've uploaded a triangle wave using this method at the Pd forum. That was a bit more involved…

Sorry, I got a bit carried away. Good luck!
