Voice Activity Detection in Elixir and Membrane

2024-11-27

Underjord is an artisanal consultancy doing consulting in Elixir and Nerves, with an accidental speciality in marketing and outreach. If you like the writing you should really try the pro version.

I hacked on something quite useful in the last few weeks, off and on. Voice Activity Detection in Elixir with Silero VAD through ONNX. I’ll show what I did and try to give an idea of what it is and why it is useful.

It boiled down to this gist as a proof of concept. Should work on most Elixir installs. These days Membrane will even try to pull pre-compiled dependencies for the libraries it wants. This was pleasant news as it can otherwise be a hassle to pull the right shared libraries for media processing. I had completely missed that this was added. Update: Added the URL to the correct older version of the model. You can also fix it and use the newer version.

Using Silero VAD from Elixir is not something I discovered or was first to write about. I leaned on this post by Sean Moriarty for DockYard. Finding the right version of Silero VAD to match Ortex was a bit of a hassle because the model has changed significantly and is out of step with Ortex, the Elixir library based on the Rust library Ort, which provides an ONNX runtime. I got some good help in the Erlang Ecosystem Foundation’s Slack; the #machine-learning channel is where that stuff happens. Shout-out and thanks to Travis Morton and Andrés Alejos.

I’ve covered Membrane before and I’ve used it in talks. For one of those talks I did voice-controlled slides and a bunch of fun little tricks where Membrane ran the pipeline that fed my voice into the Whisper model. A VAD model would actually have been great there.

Voice Activity Detection, or VAD, is the weird art of determining whether an audio signal is voice or not. A hotdog or not for the human voice. This VAD model is a lot lighter to run than Whisper: it is smaller and faster. It also tends to do a good job rejecting keystrokes and various other non-voice noises. By running this in Elixir it becomes very doable to get near-realtime messages to your UI about whether there is voice activity. More importantly, we only need to send the chunks that seem voice-like on for processing, or only transmit those to a listener on the other side. No data is the best compression.
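
To make the "only send the voice-like chunks" idea concrete, here is a minimal sketch in plain Elixir. It assumes each chunk already has a probability attached by some VAD step. The `VoiceGate` module, its function names and the 0.5 default threshold are made up for illustration and are not part of the gist.

```elixir
defmodule VoiceGate do
  # Hypothetical helper, not from the gist: keeps only chunks whose
  # voice probability clears a threshold.

  # `chunks` is a list of {audio_binary, probability} tuples.
  def voiced(chunks, threshold \\ 0.5) do
    for {audio, prob} <- chunks, prob >= threshold, do: audio
  end

  # Fraction of the bytes that never has to be processed or transmitted.
  def saved_ratio(chunks, threshold \\ 0.5) do
    total = chunks |> Enum.map(fn {audio, _prob} -> byte_size(audio) end) |> Enum.sum()
    kept = chunks |> voiced(threshold) |> Enum.map(&byte_size/1) |> Enum.sum()
    if total == 0, do: 0.0, else: (total - kept) / total
  end
end

# Made-up probabilities, only to show the shape of the data
chunks = [
  {<<0, 0, 0, 0>>, 0.02},
  {<<1, 2, 3, 4>>, 0.91},
  {<<5, 6, 7, 8>>, 0.87},
  {<<0, 1, 0, 1>>, 0.10}
]

IO.inspect(VoiceGate.voiced(chunks), label: "kept")
IO.inspect(VoiceGate.saved_ratio(chunks), label: "saved")
```

With these example values half of the bytes are dropped before anything heavier, like Whisper, has to look at them.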

Time to unroll that gist. Here are the important dependencies, annotated with comments:

elixir

Mix.install([
  # ONNX Runtime bindings to Rust library, adapted for Elixir
  {:ortex, "== 0.1.9"},
  # Numerical Elixir (Nx) used for dealing with ML data types
  {:nx, "== 0.7.0"},
  # Membrane itself
  {:membrane_core, "~> 1.0"},
  # Membrane element for the cross-platform audio device library
  # portaudio, used for opening the microphone
  {:membrane_portaudio_plugin, "~> 0.19.2"},
  # Membrane element for resampling audio using ffmpeg
  {:membrane_ffmpeg_swresample_plugin, "~> 0.20.2"},
  # Membrane element for the LAME MP3 encoder, to produce an
  # output .mp3 file.
  {:membrane_mp3_lame_plugin, "~> 0.18.2"},
  # Used to write a file as the final output
  {:membrane_file_plugin, "~> 0.17.0"}
])

Next we can look at the Membrane pipeline. A Membrane pipeline is mostly a GenServer talking to a pile of element GenServers. There might be a Supervisor in there somewhere, I haven’t really checked. I will annotate the pipeline with comments for what I think is worth pointing out:

elixir

defmodule Membrane.Demo.SimplePipeline do
  use Membrane.Pipeline
  @impl true
  def handle_init(_ctx, _) do
    # This is a simplistic pipeline, each element only feeds into the
    # next one and it doesn't need anything else
    spec =
      # We configure the PortAudio source to open the microphone
      # at the specs that Silero VAD actually wants
      # signed 16-bit samples at 16000 Hz
      # the buffer size is specified in frames
      # a frame is according to the docs:
      # "For a stereo stream, a frame is two samples."
      # This is a mono stream.
      # So we have 16000 / 1000 = 16 frames per ms
      # If I want 100ms chunks to process, a buffer size of
      # 1600 frames will do.
      # PortAudio produces RawAudio (PCM)
      child(:source, %Membrane.PortAudio.Source{
        channels: 1,
        sample_format: :s16le,
        sample_rate: 16000,
        portaudio_buffer_size: 1600
      })
      # My custom VAD element, will cover that later
      |> child(:vad, VAD)
      # As the audio gets passed through we upsample it to suit
      # conversion to MP3
      |> child(:converter, %Membrane.FFmpeg.SWResample.Converter{
        output_stream_format: %Membrane.RawAudio{
          sample_format: :s32le,
          sample_rate: 44_100,
          channels: 2
        }
      })
      # Change the raw audio into an MP3
      |> child(:encoder, Membrane.MP3.Lame.Encoder)
      # Write it to a file on disk
      |> child(:file, %Membrane.File.Sink{location: "local.mp3"})

    {[spec: spec], %{}}
  end
end
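
The buffer-size arithmetic from the comments in the pipeline can be written out as a tiny helper. This is just a sketch: the `ChunkMath` module and its function names are hypothetical, and it only knows the two sample formats that appear in the pipeline.

```elixir
defmodule ChunkMath do
  # Hypothetical helper, not from the gist.
  # Bytes per sample for the two formats used in the pipeline.
  @bytes_per_sample %{s16le: 2, s32le: 4}

  # Frames needed to cover `ms` milliseconds at `sample_rate` Hz.
  def frames(ms, sample_rate), do: div(ms * sample_rate, 1000)

  # Bytes for that many frames. A frame holds one sample per channel.
  def bytes(ms, sample_rate, channels, format) do
    frames(ms, sample_rate) * channels * Map.fetch!(@bytes_per_sample, format)
  end
end

# 100ms of mono s16le at 16 kHz, the microphone side of the pipeline
IO.inspect(ChunkMath.frames(100, 16_000))           # 1600
IO.inspect(ChunkMath.bytes(100, 16_000, 1, :s16le)) # 3200
```

The 1600 matches `portaudio_buffer_size`, and 3200 is the number of bytes one such buffer carries at two bytes per sample.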

Let’s look at the element. I over-complicated this because I hadn’t tuned the PortAudio config when I built it. So I needed to capture the samples up until my desired chunk-size and only then process them. This could be simplified but I know this runs and does the job so you get what you get:

elixir

defmodule VAD do
  # This is a Filter, meaning it takes input AND produces output
  # otherwise it would be a Source (only output) or Sink (only input)
  use Membrane.Filter

  # An input pad defines what the expected input is
  # technically Membrane can be used to sling any data
  # and I'd be curious to do weirder things with that
  # but now we stick to raw audio in buffers
  def_input_pad :input,
    availability: :always,
    flow_control: :manual,
    demand_unit: :buffers,
    accepted_format: Membrane.RawAudio

  # This element passes on RawAudio after processing
  def_output_pad :output,
    availability: :always,
    flow_control: :manual,
    accepted_format: Membrane.RawAudio

  # As we start the pipeline it will initialize the element
  @impl true
  def handle_init(_ctx, _mod) do
    # Load our model from file
    model = Ortex.load("./silero_vad_likely.onnx")

    min_ms = 100

    sample_rate_hz = 16000
    sr = Nx.tensor(sample_rate_hz, type: :s64)
    # Integer division keeps the sample and byte counts as integers
    n_samples = min_ms * div(sample_rate_hz, 1000)
    bytes_per_chunk = n_samples * 2

    # Set up the initial state, we will reduce over this as data comes in
    init_state = %{h: Nx.broadcast(0.0, {2, 1, 64}), c: Nx.broadcast(0.0, {2, 1, 64}), n: 0, sr: sr}
    state = %{run_state: init_state, model: model, bytes: bytes_per_chunk, buffered: []}
    {[], state}
  end

  @impl true
  def handle_playing(_ctx, state) do
    # When the pipeline is active, "playing"
    # we trigger demand which will tell the preceding
    # element that we want input
    {[demand: {:input, 1}], state}
  end

  @impl true
  def handle_demand(:output, size, :buffers, _ctx, state) do
    # If later elements in the pipeline make demands of us
    # we make demands as well. The samples must flow.
    {[demand: {:input, size}], state}
  end

  # This does most of the real work
  @impl true
  def handle_buffer(:input, %Membrane.Buffer{payload: data} = buffer, _context, state) do
    %{n: n, sr: sr, c: c, h: h} = state.run_state
    buffered = [state.buffered, data]
    # This is the now-unnecessary insurance that we've built up
    # enough buffer to do meaningful processing
    if IO.iodata_length(buffered) >= state.bytes do
      data = IO.iodata_to_binary(buffered)
      # Turn the data into a shape Silero VAD expects
      input = data
        |> Nx.from_binary(:s16)
        |> Nx.as_type(:f32)
        |> List.wrap()
        |> Nx.stack()

      # Run the model on the input data and get updated data
      {output, hn, cn} = Ortex.run(state.model, {input, sr, h, c})
      # Turn the output into a probability between 0.0 and 1.0
      prob = output |> Nx.squeeze() |> Nx.to_number()

      # Our only output right now
      IO.puts("Chunk ##{n}: #{Float.round(prob,3)}")
      run_state = %{c: cn, h: hn, n: n + 1, sr: sr}
      state = %{state | run_state: run_state, buffered: []}

      # This 0.9 is not the right threshold, I had some off calculations
      # that made the probabilities start around 0.5 but fixing that
      # made it very consistent, I could probably threshold around 0.2
      if prob > 0.9 do
        # This would be where I either trigger a Membrane notification
        # or use some other Erlang messaging mechanism to notify whoever
        # cares about voice activity that it is happening
        # We also pass the buffer forward.
        {[demand: {:input, 1}, buffer: {:output, buffer}], state}
      else
        buffer_size = byte_size(buffer.payload) * 8
        # Two fun options:
        # 1. pass nothing forward, this makes for an MP3 with no gaps between speech
        # 2. pass silence forward, this would make for clean silences between speech
        {[demand: {:input, 1}], state}
        #{[demand: {:input, 1}, buffer: {:output, %{buffer | payload: <<0::size(buffer_size)>>}}], state}
      end
    else
      # Not enough data yet: keep what we have so far and ask for more
      state = %{state | buffered: buffered}
      {[demand: {:input, 1}], state}
    end
  end
end
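
The "collect bytes until there is a full chunk" part can also be pulled out as a pure function, which makes it easy to test without Membrane or a microphone. The sketch below is one possible way to do it, not what the element above does: the `Chunker` module is hypothetical, and unlike the element it carries leftover bytes over to the next call instead of processing everything it has buffered.

```elixir
defmodule Chunker do
  # Hypothetical helper, not from the gist.
  # Appends `data` to `pending` and splits off as many full chunks of
  # `chunk_bytes` as possible. Returns {chunks, leftover_binary}.
  def push(pending, data, chunk_bytes) when chunk_bytes > 0 do
    split(IO.iodata_to_binary([pending, data]), chunk_bytes, [])
  end

  # Still at least one full chunk left: cut it off and keep going
  defp split(bin, size, acc) when byte_size(bin) >= size do
    <<chunk::binary-size(size), rest::binary>> = bin
    split(rest, size, [chunk | acc])
  end

  # Less than a full chunk left: hand back the chunks in arrival order
  defp split(rest, _size, acc), do: {Enum.reverse(acc), rest}
end

# Two bytes were pending, four more arrive, chunks are four bytes long
{chunks, leftover} = Chunker.push(<<1, 2>>, <<3, 4, 5, 6>>, 4)
IO.inspect(chunks, label: "chunks")
IO.inspect(leftover, label: "leftover")
```

In an element like the one above, the leftover would live in the state in place of `buffered`, and each full chunk would go through the model in turn.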

The final part is only starting and sleeping:

elixir

Membrane.Pipeline.start_link(Membrane.Demo.SimplePipeline, [])
:timer.sleep(:infinity)

My goal with this was actually to use it in The Grand Kiosk, a Nerves demo for the Seeed Studio ReTerminal DM that I am working on. It has a mic array. Membrane works fine with Nerves, even though the precompiled stuff isn’t available for Nerves targets, so I do need to add it in my custom system. But I know how to do all that. Running this as a filter for hailing Whisper and having Whisper interpret speech that I can then use for text-based commands seems like fun. I want to see if I can do a nice job with vector embeddings for fuzzy-matching pre-defined voice commands as well.

Unfortunately Ortex won’t cross-compile because some underlying Rust library won’t cross-compile quite right. For now I am stuck. I toyed with trying to hack out the bad arg from the Rust deps but I don’t know Cargo well enough to do it right and might not have found the right dep to hack.

There will be a path forward eventually I’m sure. I was happy to see that once I got the right data shape into Silero VAD all my previous experience with Membrane elements made those parts just slot together quite nicely. And it is a pretty neat little example of an Elixir script.

Thanks for reading. If you have questions you can get a hold of me via the fediverse @lawik@hachyderm.io or email lars@underjord.io.
