Booting 5000 Erlangs on Ampere One 192-core


In the previous post on 500 virtual Linux devices on ARM64 I hinted that I expected serious improvements if we got KVM working. Well. We’re there. Let’s see what we’ve got going on.

Disclosure: I am running a conference called Goatmire Elixir which Ampere is a sponsor of. This post is not part of the sponsorship exchange as such. It is prep for my talk for the conference, which uses the hardware they lent me. So this is your transparency notice, but fundamentally I am not making comparisons on whether they are better or not. I’m learning and sharing about qemu and virtual Linux machines. Now I’d love it if they paid me to shill them a bit later and I’d be transparent about that too. But this is not that :)

To recap. We have an Ampere One 192-core machine with 1 TB of RAM. The goal is to run as many virtual Linux IoT devices as possible using the Nerves framework. We got 500 of them last time before I tried pushing any further. I also got a bit further on the same setup when I tried. Maybe 1000, I don’t recall exactly. But there have been developments, so read on!

Briefly on Nerves: the framework treats the BEAM virtual machine like the OS and essentially only uses Linux for the kernel, drivers and the like. This means we can write much, if not all, of the embedded device’s software in a productive high-level language running on a provenly robust and reliable environment with memory safety and solid recovery strategies. And it means your cloud integration developer doesn’t risk seg-faulting the entire device while mangling JSON back and forth. Nerves also brings some best-practice tooling and conventions. Your init process is erlinit, and your updates use fwup to provide A/B partitions, factory reset, auto-failback, validation of firmware viability, disk encryption, delta updates, streaming updates and a bunch more.

The most interesting development is the thing you can probably learn the most from. Frank Hunleth, who has been my co-conspirator and a massive help, saved me from fighting u-boot by ... writing another bootloader. Introducing little_loader. This adorable tractor will load up your ARM64 qemu device, consult the uboot environment that Nerves uses, find a Linux kernel from the information in it and then boot. Consequently it enables the A/B upgrade features and everything else that makes Nerves great.

Writing a boot loader is a little bit ridiculous. Frank knows his way around C and apparently ChatGPT knows a fair bit about ARM and qemu. Enough to be dangerous. And where it was wrong he could rummage around until he found the way. How he does what he does is beyond me but the result is a very small boot loader that you can probably read through and understand. So if you are curious about booting ARM64 or about how qemu starts things the code should be a worthwhile read.

We got a bit tangled up in EL1 vs EL2 when we only ever needed EL1 to work. EL2 on ARM is the hypervisor exception level, what you’d run at if you want to be able to run VMs in your VMs so you can VM while you VM. And the version of qemu + KVM I got from Ubuntu doesn’t seem to support that. We weren’t interested in it either. At some point we might explore EL3 for secure boot and whatnot. Only time will tell.

One of the weirder challenges, and something we haven’t disentangled yet, is a compilation issue where, with the toolchains I was using, the non-debug build would hang while the debug build runs fine. For now I run the debug build of the bootloader. I think it was fine from GCC 15? Anyway, hopefully we pin that down at some point. But it tripped us up a few times when the bootloader would hang due to that issue rather than any actual problems with the implementation.

KVM didn’t really require anything extra aside from making sure we didn’t go to EL2. And when we tried it on MacOS it worked great with HVF as well. Host-based ARM64 VMs are ridiculously fast and practical. As in booting to the full IEx prompt in single-digit seconds instead of double-digit. And they use about 500 MB less memory. And see, that’s important. Because we want to shove as many as we can into this server I got access to.

Accelerated on host

My very hacky project for running this stuff is available here. This code is cribbed from simple.sh:

shell simple.sh
  qemu-system-aarch64 \
	-machine virt,accel=kvm \
	-cpu host \
	-smp 1 \
	-m 150M \
	-kernel ../little_loader/little_loader.elf \
	-netdev user,id=eth0 \
	-device virtio-net-device,netdev=eth0,mac=de:ad:be:ef:00:01 \
	-global virtio-mmio.force-legacy=false \
	-drive if=none,file=/space/disks/special.img,format=raw,id=vdisk \
	-device virtio-blk-device,drive=vdisk,bus=virtio-mmio-bus.0 \
	-nographic

To go through it. We use qemu-system-aarch64 to emulate an ARM64 machine. aarch64 is the common shortname for ARM64, except sometimes on MacOS where I hear it can be arm64. We specify the machine to be virt. Previously we’d leave it there, but now we use an accelerator named KVM (Kernel-based Virtual Machine). It is the virtualization mechanism included with Linux, and qemu can integrate with it to accelerate execution. This also requires -cpu host, meaning we are no longer trying to emulate a cortex-a53 processor. We run on the host processor, whatever that is: the guest sees the host CPU directly, or at least as much of it as qemu and the KVM accelerator can expose. This is where we drop about 500 MB of memory overhead. We no longer have to keep a pretend ARM chip in memory because we have an actual ARM chip to run on. That’s my understanding at least. Would love notes on that.

We only give it 1 core via -smp 1 and we give it 150 MB of memory with -m 150M. Skipping ahead, we give it virtual Ethernet and a virtio block storage drive. You’ll see a lot of virt and virtio when doing this stuff. And with -nographic we tell it to not bother trying to pop up a GUI window, so we get our console in the terminal. I’ve done all the work over SSH so that’s definitely my preference.

The disk we provide is backed by the raw disk image file special.img, which was generated using fwup from the Nerves project amproj I mentioned earlier. If you build that project with mix firmware you can then run:

shell
fwup -a -i amproj.fw -d special.img -t complete

That’ll give you an image file that contains a full Nerves system. The stuff written to disk is:

  • A uboot env formatted chunk of data. We don’t use uboot this time but we used that format.
  • A Linux kernel, not on a filesystem. Just written to the disk. RAW!
  • An MBR and some partitions:
    • Root filesystem A (squashfs, read-only)
    • Root filesystem B (squashfs, read-only)
    • Application data partition (f2fs, read/write)

The uboot env is used to tell the bootloader important things about the A/B upgrade process, as well as where to find the kernel to load and what kernel cmdline to use, which is how we tell it what root filesystem to use.
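
If you want to poke at the resulting image, stock tools get you pretty far. A small sketch below; the uboot-env offset and size are placeholders rather than what the image actually uses, and fw_printenv comes from the u-boot-tools package.

shell
# Show the MBR and partitions written into the image
fdisk -l special.img
# Read the uboot env straight out of the image; the 0x1000 offset and 0x2000
# size are placeholder values, not the real layout
echo "special.img 0x1000 0x2000" > fw_env.config
fw_printenv -c fw_env.config nerves_fw_active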

The only config I put into the loader is the offset where it can expect the uboot-env, and I set that at build time.

Promising results

NervesCloud received 3389 simultaneously connected devices before the server hit me with the OOM killer. It was probably running a few more, but around there. So each VM is:

  • Bootloader
  • Linux
  • erlinit
  • BEAM/ERTS
  • Nerves base functionality
  • NervesHubLink for connecting to NervesHub

I have had 3000 devices running stably, and then I started to see “fun” challenges. For one thing, our NervesCloud hosts were looking a bit tight on memory because all these devices connect from the US west coast and we were only running a single node in that region. I scaled that up a smidge to make sure I didn’t bother any paying customers.

The VMs are super well-behaved, the Ampere CPU just works. The memory usage is roughly where I’d expect it, 150-250 MB per VM I think. There are probably things I can do to make it behave a little tighter. Will explore that if time allows.

Then I ran my first demo workloads. As the purveyor of the finest Over-the-Air updates for embedded devices, we here at NervesCloud... I kid. But I wanted to shove lots of updates at them and see what that did. The update process is a lot of compression, decompression and IO. Probably mostly IO-bound, but if the devices were struggling for CPU that’d be noticeable. If the memory usage exploded, that’d be noticed very quickly.

It worked. Not really any problems. I limited the concurrency of the update to 1000 and it couldn’t hand out the updates faster than they completed, so it tended to hover around 200-300 concurrent updates. Or at least that’s my understanding of what happened. Did I mention the KVM setup is pretty fast?

I logged some issues about UI behavior as I was watching things live and trying to adjust things. It seems like good guy Nate Shoemaker already has a fix in flight for this. There may be more details to come. When you get a lot of progress reports, the LiveView UI perhaps shouldn’t try to refresh all the things all the time.

Memory tuning

Frank gave me some tips about tuning Linux memory usage and tuning BEAM memory usage. When I looked into his advice I ended up doing a few things:

For the BEAM VM, we change the allocators. This should use less memory and probably trades off some raw performance. Which is fine for this purpose.

For the Erlang release, use the default code-loading mode instead of embedded. Which probably makes it boot a bit faster and use less memory, but it could lead to surprising delays and growing memory usage later as it loads code ad hoc. The use of embedded mode is helpful in making the release behave much more consistently, is my understanding. A rough sketch of both of these tweaks is below.
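
Here is roughly what those two tweaks can look like, assuming the rel/vm.args.eex file an Elixir/Nerves release normally carries; the exact flags are illustrative, not necessarily what I ended up with:

shell
# Append BEAM flags to the release's vm.args (adjust in place instead if -mode
# or allocator flags are already set elsewhere in the file)
cat >> rel/vm.args.eex <<'EOF'
## Minimal allocator configuration: lower memory use, some raw performance traded away
+Mea min
## Load modules on demand instead of preloading everything at boot
-mode interactive
EOF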

I made a bunch of adjustments to Linux memory usage. Using zram was suggested by Frank. Then I checked with (famous ML model) Claude to get hints about what knobs were available to tune on Linux, because Frank hinted that it might be caching a bit much and I never know where to start when it comes to what I can do to the Linux kernel. It had some suggestions, I looked those up, and found articles that matched the claims that this might reduce memory usage. I changed swappiness, dirty ratios and vfs_cache_pressure like I knew what I was doing and it sure seems to have improved things.
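
For reference, the knobs in question live under /proc/sys/vm and can be set with sysctl. The values below are illustrative guesses rather than the ones I landed on, and where you apply them (host, guests, or both) depends on where the caching hurts:

shell
# Swap idle pages out more eagerly (pairs well with zram as the swap device)
sysctl -w vm.swappiness=100
# Start writeback of dirty pages sooner so less sits around in the page cache
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
# Reclaim dentry/inode caches more aggressively
sysctl -w vm.vfs_cache_pressure=200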

I know I could play with different allocators, my co-founder Josh has been doing that for NervesCloud recently. I think I could also do something with memory ballooning (virtio-balloon) to reclaim memory and essentially over-provision VMs, but I haven’t got there yet.

This memory tuning led to some interesting further runs where we ran a solid 5100 devices, and I could have pushed it a bit further. I just didn’t have time and couldn’t be bothered to do more math at the time. The VMs are now started with 110 MB of RAM on the inside and they seem to run steady around 160 MB RES according to htop. Back of the envelope, 5100 x ~160 MB is already north of 800 GB, so the 1 TB ceiling isn’t far off. The people I’ve talked to at Ampere indicate that I’m probably running the most VMs anyone has ever run on their hardware. Which is fun. I’m not even running tiny VMs. I could make a Buildroot system that does nothing and run another gajillion probably. But this is much closer to a real device and workload.

The utility of it all

Honestly, getting a chance to run significant, not massive, but significant workloads against a SaaS is pretty useful. But the work we’ve put in now means we can tidy up this Nerves system and make it part of supported Nerves tooling. This would make it easy to run stuff “on device” without physical hardware. It would make running more detailed tests of Nerves functionality much more feasible as well. Essentially you’d need an ARM64 Linux box with KVM or an Apple Silicon Mac to get the blazing fast ones. Or you can absolutely get by with plain emulation from the x86 side of things, it is just more demanding. There is a lot we can do with a full-featured qemu-system for ARM devices.

While my experimentation is a bit of a stunt, mostly for the joy of experimentation, and Frank’s bootloader is mostly about learning, the end result is still that we have produced something we should get good mileage out of.

Heck, that MacOS thing. I just tried DELAY=1 COUNT=200 CHUNK=10 ./run.exs after modifying the script to use hvf instead of kvm on my M2 MacBook Air. I think I had 50 VMs running when I ran out of disk. Solvable problem, but I’m not throwing out my photo library right this minute.
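
For the curious, the change amounts to swapping the accelerator in the qemu invocation, roughly like this (the ... stands for the rest of the flags from simple.sh, which stay the same):

shell
# MacOS variant: Hypervisor.framework instead of KVM, otherwise as in simple.sh
qemu-system-aarch64 -machine virt,accel=hvf -cpu host ...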

Further work

I need to look at how KVM and NUMA interact and if/how I can pin VMs to cores and nodes. I don’t think I’ll hit problems where caches and pinning matter all that much, but it would feel better. When the VMs are at rest after booting, the overall system CPU usage is generally less than 20% running thousands of VMs. Mostly idle, yes, but there are things happening in all of them.
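
If I do get to pinning, my current guess is that it starts with numactl, binding each qemu process to a node. A sketch only, not something I have actually run on this box, and the node number is a placeholder:

shell
# Check the host NUMA topology first
numactl --hardware
# Run this VM's qemu process (vCPU threads included) on node 0's CPUs and
# allocate its memory on node 0
numactl --cpunodebind=0 --membind=0 \
	qemu-system-aarch64 -machine virt,accel=kvm -cpu host ... # rest as in simple.sh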

I should run the workload with some graphs to see what is actually happening big picture. Right now I’m mostly going “hey, it is STILL running, eh!?”, which is fine enough when figuring out if it fits in memory.

Tidying up

We are in the process of tidying up nerves_system_qemu_aarch64 and then it should get a proper release and some docs. It has a mix task for generating an appropriate qemu command for you. So this all becomes a part of Nerves. Over time we should be able to build some really nice tooling based off of this. And if you have ideas you should be able to pick it up and run with it already.

Really enjoying this deeper dive into things I’ve only been at the periphery of. Learning a lot of Linux, getting to really dig into qemu, and performance tuning across the BEAM, Linux and virtualization. It is a ton of fun to see how far you can push the hardware.

Alright, that’s enough words. Let me know what you think and if there is anything in particular you’d like me to try in and around this. Thanks for reading, hit me up on lars@underjord.io or @lawik@hachyderm.io or wherever you find me.

Underjord is an artisanal consultancy doing consulting in Elixir and Nerves, with an accidental speciality in marketing and outreach. If you like the writing you should really try the pro version.

Note: Or try the videos on the YouTube channel.