Bodging GenServers Together


What feels like forever ago, but was probably a year and a half back, I gave a talk about Lively LiveView with Membrane. Video is available for the curious. It was a stunt talk, but also a talk about creativity and how Elixir lets me plug things together and try things that feel magical. That feeling has never left me.

Some of the magic is a goddamn pain. Take the Membrane framework. I really like what it enables me to do, but it is also a bit messy. Setting up pipelines is not trivial, and writing custom elements is a bit finicky and requires some practice. More centrally, the dependencies, build requirements and underlying nastiness of video formats are what make everything harder than you want it to be. Membrane gives you a lot of capability to do live things, within Elixir, but you are playing with media, and media as a domain is terrible.
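To give a sense of what a pipeline looks like, here is a minimal sketch. It assumes Membrane 1.x, the membrane_portaudio_plugin for microphone capture, and a hypothetical Demo.Sink element on the receiving end:

```elixir
defmodule Demo.MicPipeline do
  # Minimal sketch, assuming Membrane 1.x and membrane_portaudio_plugin.
  # Demo.Sink is a hypothetical custom element that would receive the audio.
  use Membrane.Pipeline

  @impl true
  def handle_init(_ctx, _opts) do
    spec =
      child(:mic, Membrane.PortAudio.Source)
      |> child(:sink, Demo.Sink)

    {[spec: spec], %{}}
  end
end

# Started like any other process:
# {:ok, _supervisor, _pipeline} = Membrane.Pipeline.start_link(Demo.MicPipeline)
```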

If you spend a lot of time in the browser you are used to the platform papering over the difficulty of opening a microphone in a cross-platform manner or playing video.

Something I want to do in the near term is finish up my tech demo of the Seeed Studio ReTerminal DM and that includes making it do some unexpected stuff, all from Elixir. This actually involves “AI”, meaning Machine Learning. LLMs are mostly impractical for local use. They definitely don’t run well on a Raspberry Pi 4 CPU. But there are models that are perfectly feasible.

Take Voice Activity Detection (VAD): lightweight models designed to determine which parts of an audio stream are speech. Sean Moriarty has covered this as part of his experiments in voice assistants. A VAD model is relatively small and cheap to run, and it saves you a lot of performance by letting you skip processing anything not identified as speech. The device I run has microphones, so it can do this. Then we can take the voice parts and throw them at the heavier Whisper model and get text. Then we can decide how we want to map text to action. These are pretty useful applications of ML, quite practical and hands-on. Plumbing and pre-trained models.
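As a rough sketch of the Whisper side via Bumblebee and Nx.Serving (the whisper-tiny checkpoint and the EXLA compiler are my assumptions here, not requirements):

```elixir
# Sketch: a Whisper speech-to-text serving with Bumblebee.
repo = {:hf, "openai/whisper-tiny"}
{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

serving =
  Bumblebee.Audio.speech_to_text_whisper(
    model_info,
    featurizer,
    tokenizer,
    generation_config,
    defn_options: [compiler: EXLA]
  )

# Feed it the chunks the VAD flagged as speech:
# Nx.Serving.run(serving, {:file, "speech_segment.wav"})
```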

ML is also goddamn painful. Whenever you paint outside the lines a bit, things get really weird really quickly. The VAD model used in the guide has moved beyond the ONNX support in the libraries, and the input arguments have changed shape. So I have to pin down the right old version of the model and wrangle the inputs just so. It is a deep domain: when the tools pave over the hard parts everything is nice, and when the hard parts poke through you may want to scream. But I don't have to shuttle data back and forth to a Python script, so I prefer it.

I think there is a bit more I could do in terms of Machine Learning as well. If I want something closer to an LLM's grasp of language semantics for recognizing commands/actions to take, I could use an embedding model: make vectors of all available commands, make a vector of the command given, and perform a vector search that works on language semantics rather than just letter-level proximity. Pulling out arguments and such would still be utterly painful.
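A sketch of what that could look like with Bumblebee, assuming a sentence-transformers model and a brute-force in-memory comparison instead of a real vector index:

```elixir
# Sketch: semantic command matching with a text embedding model.
repo = {:hf, "sentence-transformers/all-MiniLM-L6-v2"}
{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)

serving =
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    output_pool: :mean_pooling,
    embedding_processor: :l2_norm
  )

commands = ["turn on the lights", "show the dashboard", "start the timer"]

embed = fn text ->
  %{embedding: embedding} = Nx.Serving.run(serving, text)
  embedding
end

command_vectors = Enum.map(commands, &{&1, embed.(&1)})
query = embed.("lights on please")

# Embeddings are L2-normalized, so the dot product is cosine similarity.
{best_match, _vec} =
  Enum.max_by(command_vectors, fn {_cmd, vec} ->
    Nx.to_number(Nx.dot(query, vec))
  end)
```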

LLMs are decent at pulling out arguments, but a 3B model tuned for function calling (I looked at Glaive Function Calling v1, for example) seems to require about 10GB of GPU memory, so that's a nice big enthusiast Nvidia GPU you need there. This is not feasible for an RPi. Of course I could pass the inputs off to an LLM API service, but I find that uninteresting as a demo.

Suffice it to say, I get to work within some creative constraints, which is not all that bad. I think I can get reasonably responsive voice commands, and that would be a novel and interesting demo. It also allows for a lot of fun LiveView stuff, since all of it is streaming. I can get events when the VAD detects your voice and immediately indicate that it is paying attention. As soon as the voice-to-text model starts to spit out words, I can show you what it is hearing. Then I could do various interesting things in terms of searching for the closest commands, maybe show you what you are closest to at every turn. Depends on how fast the vector search is.
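A minimal LiveView sketch of that flow, where the event names and the Demo.PubSub topic are made up for illustration:

```elixir
defmodule DemoWeb.VoiceLive do
  # Sketch only: :speech_started, {:partial_transcript, text} and the
  # "voice" topic are assumed shapes for events published elsewhere.
  use Phoenix.LiveView

  @impl true
  def mount(_params, _session, socket) do
    if connected?(socket), do: Phoenix.PubSub.subscribe(Demo.PubSub, "voice")
    {:ok, assign(socket, listening?: false, transcript: "")}
  end

  @impl true
  def handle_info(:speech_started, socket) do
    {:noreply, assign(socket, listening?: true)}
  end

  def handle_info({:partial_transcript, text}, socket) do
    {:noreply, assign(socket, transcript: text)}
  end

  @impl true
  def render(assigns) do
    ~H"""
    <div>
      <p :if={@listening?}>Listening…</p>
      <p><%= @transcript %></p>
    </div>
    """
  end
end
```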

Membrane pipelines are just GenServers. Nx Servings that underpin Bumblebee are just GenServers. LiveViews are just GenServers. Nerves was born to start GenServers for you.
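So bodging them together can be as plain as one supervision tree. A sketch, where Demo.VoicePipeline is a hypothetical GenServer wrapping the Membrane pipeline and whisper_serving is built as in the earlier snippet:

```elixir
# Sketch of an application supervision tree wiring these together.
children = [
  {Phoenix.PubSub, name: Demo.PubSub},
  # The Nx.Serving process that LiveViews and pipelines can call into
  {Nx.Serving, serving: whisper_serving, name: Demo.WhisperServing, batch_size: 1},
  # Hypothetical GenServer wrapping the Membrane pipeline + VAD
  Demo.VoicePipeline,
  # The Phoenix endpoint serving the LiveViews
  DemoWeb.Endpoint
]

Supervisor.start_link(children, strategy: :one_for_one)
```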

Do you ever bodge together GenServers? What is your place of joyful experimentation, and where does it get painful? If you want to discuss, you can reach me on the fediverse @lawik or at lars@underjord.io. I'm also on Bluesky as of recently; we'll see how long that place lasts.

Underjord is a 4 person team doing Elixir consulting and contract work. If you like the writing, you should really try the code. See our services for more information.

Note: Or try the videos on the YouTube channel.