For those who are interested, I’ve now got a working prototype that consumes a Pi 4 core and achieves sub-1ms latency. I’ll be using this for creating hardware loopers, but this technique could be put into more generalized software such as Jack. The caveat is that any software using this needs to be well behaved. If you want to play around, I’ve put my code here: https://github.com/looperlative/PiUserSpaceAudio
The current settings in the code use a FIFO depth of 16 words on both input and output. With 2 words per sample time, each FIFO holds 8 sample times, so this should give a total end-to-end latency of 16 sample times. It is possible to change the thresholds in software. On the output, we can prefill to the 24 sample time marker, which would change the overall latency to 32 sample times and thus give the system a little more room to handle the chunkiness of software.
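
For anyone who wants to check the numbers, here is a rough sketch of the latency arithmetic. It is not taken from the repository: the function names are made up for illustration, the sample rates are assumptions (the figures above are only in sample times), and it reads the "24 sample time marker" as the output side's contribution on top of the 8 sample times held by the input FIFO.

```python
# Back-of-the-envelope latency arithmetic for the FIFO settings described above.
# Assumptions (not from the repository): the helper names and the sample rates.

WORDS_PER_SAMPLE_TIME = 2  # from the post: 2 words per sample time


def fifo_sample_times(depth_words, words_per_sample_time=WORDS_PER_SAMPLE_TIME):
    """Convert a FIFO depth in words to a depth in sample times."""
    return depth_words // words_per_sample_time


def latency_ms(sample_times, sample_rate_hz):
    """Convert a latency in sample times to milliseconds at a given sample rate."""
    return 1000.0 * sample_times / sample_rate_hz


# Current settings: 16 words in, 16 words out.
input_depth = fifo_sample_times(16)    # 8 sample times
output_depth = fifo_sample_times(16)   # 8 sample times
default_total = input_depth + output_depth

# Alternative: prefill the output to the 24 sample time marker.
prefilled_total = input_depth + 24

if __name__ == "__main__":
    for rate in (44100, 48000, 96000):  # assumed sample rates
        print(
            f"{rate} Hz: default {latency_ms(default_total, rate):.3f} ms, "
            f"prefilled {latency_ms(prefilled_total, rate):.3f} ms"
        )
```

Under these assumptions, both settings stay under 1 ms at 44.1 kHz and above.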