500 virtual Linux devices on ARM 64

Underjord is an artisanal consultancy doing consulting in Elixir, Nerves with an accidental speciality in marketing and outreach. If you like the writing you should really try the pro version.

This is the first part of an experimental journey as I explore how many instances of my favorite IoT framework I can run on the 192 core Ampere One.

Background

I work on the Nerves project which is an IoT framework providing best-practice underpinnings and support so that you can build your IoT hubs, smart thermostats and the like with a safe and productive high-level language on a runtime known for reliability, resilience and consistent performance. The language being Elixir and the runtime being Erlang’s BEAM virtual machine.

If you want more about Erlang, we’ve had Björn on BEAM Radio talking a lot about his work on the compiler and runtime.

As part of my frequent collaborator GleSYS sponsoring Sweden’s first Elixir conference Goatmire Elixir (I would be shilling but we essentially don’t have any tickets left *shrug*) they suggested we might connect with Ampere as they have this particularly interesting hardware with the Ampere One server CPUs (you may have seen the 192 core, 3.2 GHz one discussed) and we turned it into a joint sponsorship. Since I didn’t have a talk topic lined up we discussed me doing something with their hardware which seemed fun. I love experimenting with impressive hardware.

Disclosure: This post is not part of the sponsorship exchange. They get some posts on socials and space in my newsletter along with branded presence at the event. This is me reporting on what I’m up to and providing the background for that. But I want to be transparent, they have supported the not-for-profit that runs the conference which I am organizing.

The runtime

If you know the BEAM you know it is highly concurrent and parallel. By default it starts one scheduler thread per core available and then does work stealing across those to ensure efficient use of the cores. Based on anecdata from a friend who tests these sorts of things (he has data, he shared it anecdotally) the BEAM does not scale arbitrarily with this number of cores. I’ve speculated whether that’s due to NUMA but I don’t know the architecture of the chips well enough to say really. There would be overhead when running many schedulers of course and whatever coordination is needed. I know Meta has contributed recent updates to Erlang/OTP that should improve the many-core performance. I need to see if I can find a really good benchmark for testing the limits of a single BEAM on this thing. Anyway, this was not what I primarily wanted to do.

Going embedded

This is an ARM64 part. And I mostly deal with ARM64 on my laptop or in the shape of various embedded Linux boards for Nerves usage. Now I could put Nerves on the server and call that good but that is just one massive Linux board. And this has been done with nerves_system_x86_64 and specifically for Vultr VPSes. While it would have some interesting challenges I’m not sure it would be all that interesting in terms of a demo. And I also have to shill NervesCloud which is the hosted version of NervesHub that Josh Kalderimis and I offer. Over-the-Air updates of firmware are a nice and lively type of demo. Also, while we know NervesHub scales to hundreds of thousands of devices we don’t very often get to run tests with a lot of full-fledged concurrent devices. We can simulate a lot of connections but having them backed by “real” devices is a different matter.

You will note that this post claims 500 virtual devices and you might think “that is not a lot” and you’d be right. Each device being single-core it shakes out to every device getting about 1.2GHz to play with. They should not need that much. I very much hope that a later post about this project will have a much higher number. But we are getting ahead of ourselves, we are not there yet.
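The per-device figure is simple division. As a rough sketch, assuming the 192 cores at 3.2 GHz mentioned earlier and treating clock speed as something you can pool and split evenly (which is a simplification):

```python
# Back-of-the-envelope share of CPU clock per virtual device.
# Assumes clock speed can be pooled across cores, which is only a rough model.
CORES = 192
CLOCK_GHZ = 3.2
DEVICES = 500


def ghz_per_device(cores, clock_ghz, devices):
    """Aggregate clock across all cores, split evenly between devices."""
    return cores * clock_ghz / devices


share = ghz_per_device(CORES, CLOCK_GHZ, DEVICES)
print(f"{share:.2f} GHz per device")  # prints "1.23 GHz per device"
```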

The goal I have in mind. Running as many virtual Linux devices packed with Nerves as feasible on the hardware I’ve been handed access to.

Knowing I had a deadline of 10th of September and a lot of stuff to do until then, both putting together the conference and helping other presenters, I felt I needed insurance. So I reached out to Frank Hunleth (creator of Nerves) and asked whether he could help unstick me if I got stuck on this. He graciously obliged and when I got the server credentials I went into it with his DMs as my backstop.

Making a custom Nerves system

My approach to Nerves, embedded and IoT is top down. I come from the outside, from higher up in the stack, web and cloud stuff. The parts I know the best are the parts that face the user and the parts I’m learning are the foundations and underpinnings. Nerves made embedded more approachable to me and over time I’ve unravelled the helpful structure and come to understand more and more of it. I’ve made initramfs-booting work with dm-verity for verified boot. I’ve compiled more kernels than I’d ever expected. I’ve made minor contributions upstream to buildroot. I’m getting there.

I still tend to start a custom Nerves project from some foundation. I knew I shouldn’t start from a Raspberry Pi system because those are weird, unique and do things no other ARM systems do. config.txt, cmdline.txt, the FAT /boot partition dealio. Compared to most boards I’ve seen since, it is weird. Approachable in many ways but fundamentally weird. So I looked at the BeagleBone Black system, it is lauded as a good workhorse board. Then I realized it is ARM 32-bit and also Frank mentioned that the software around that board is kind of hairy. So it would probably need a lot of adaptation.

So I grabbed a fairly modern ARM 64 system I knew worked which was [the IOT Gateway iMX8 Plus](https://github.com/redwirelabs/nerves_system_iot_gate_imx8plus) by Redwire Labs. And I went to town. It had a lot of Compulab, NXP and iMX-specific stuff in there but at least it was ARM 64 and I am comfortable tearing up a Linux defconfig and a Buildroot defconfig.

Running virtually

My goal is running under qemu with KVM acceleration so I’m running ARM64 guests natively on ARM64 hardware. You have the command qemu-system-aarch64 which can run ARM 64 systems. And if you grab a buildroot you can use the qemu_aarch64_virt board config to get something that runs essentially.

So I could prove fairly easily that this was possible, confirming much of what my web searches had indicated about feasibility. I even managed to confirm that I could run a buildroot build under accel=kvm and -cpu host.

Now qemu can start Linux directly which is kind of cheating compared to most boards. Usually you need a bootloader for various reasons. But with qemu you can just throw a kernel at it, provide a root FS disk image and off it goes. Buuuuut. That removes some of Nerves' greatest features. A/B partitions, factory reset, safe updates. Those rely on boot loader features. So most Nerves systems use u-boot (hereafter known as uboot because that’s what buildroot calls it).
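To make the direct-kernel boot concrete, here is a minimal sketch of how such a qemu command line could be assembled. It is modelled on the kind of invocation the buildroot qemu_aarch64_virt board documents, not on the exact command used in this project. The file names `Image` and `rootfs.ext4`, the 150 MB memory size and the helper name `qemu_args` are assumptions for illustration.

```python
import shlex


def qemu_args(kernel, rootfs, use_kvm=True, memory_mb=150):
    """Build an argv list for booting one single-core ARM64 guest.

    Hypothetical helper: boots a kernel directly, with no bootloader.
    """
    # With KVM the guest uses the host CPU model; without it, emulate a cortex-a53.
    machine = "virt,accel=kvm" if use_kvm else "virt"
    cpu = "host" if use_kvm else "cortex-a53"
    return [
        "qemu-system-aarch64",
        "-machine", machine,
        "-cpu", cpu,
        "-smp", "1",
        "-m", str(memory_mb),
        "-nographic",
        "-kernel", kernel,
        "-append", "root=/dev/vda rootwait console=ttyAMA0",
        # Root filesystem image attached as a virtio block device (/dev/vda).
        "-drive", f"file={rootfs},if=none,format=raw,id=hd0",
        "-device", "virtio-blk-device,drive=hd0",
    ]


print(shlex.join(qemu_args("Image", "rootfs.ext4")))
```

Passing `use_kvm=False` gives the slower, fully emulated variant. Neither form provides A/B partitions or the other bootloader-dependent features.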

Failing at uboot

I said I know my way around a Linux defconfig and a Buildroot defconfig. Well. I don’t know my way around a uboot defconfig. Not really. I have had a few run-ins with uboot previously preparing for this experiment but essentially I don’t know it very well and there is a lot to it. It is featureful. I wonder how long it will take until I need to learn swupdate, I think that’s another defconfig. Anyway. I found the documentation for the uboot qemu ARM support and verified that it worked.

And then I spent untold hours trying to translate all these parts into the nerves system. I got pretty far with Linux and buildroot, the parts I already know decently. I got somewhere, multiple times, only to realize I was way off. Very often the only response on a build was that qemu would hang and print nothing. That is honestly still a common response.

This was an area where Frank chimed in, in the strangest way possible. I don’t want to spoil entirely what he got up to until it is public and I can do his part proper justice. But suffice to say he provided a solution that completely bypassed needing uboot. At least for now, maybe for the entire project. I still want to get better at uboot but I have to be pragmatic here and the solution Frank provided is quite interesting and will be public eventually I’m sure.

Meanwhile you can look at my work-in-progress system and the barebones project I use for it. The docs are all wrong, there is no guidance. Just sharing if you are curious or want to help fix my uboot I suppose.

Aside: approaches

Frank and I had a call where we resolved some weird errors I was seeing and discussed approaches. He is incredibly experienced in the embedded Linux realm and would have taken a very different approach to what I did. And I figure I should share it here because it is the wise way if you can do it.

He would not have used Nerves to get started. He would have started with just buildroot. I did verify things with buildroot and essentially backed my way into grabbing a working buildroot config for my target as I described above. But he would essentially make sure he got all of the desired Linux and system bits working through buildroot first. And then the A/B partition stuff with uboot or whatever mechanism the board might have. Before bringing those configs into a Nerves system to enjoy fwup and Nerves for testing all those firmware updates. He applies a bottom-up approach because he already knows the foundations. I am getting to know the foundations so I tend to approach from the parts I know best. My approach is definitely more trial and error.

I can totally see his approach working now, even for me. I couldn’t see starting there when I started this project. Now I feel like that possibility is closer. This is why it is hard to document “how to make a custom system with Nerves” because it ties together a bunch of embedded Linux know-how. How Nerves does things is a relatively small set of conventions on top of that which makes a compelling whole. Anyway, learn your buildroot if you want to make custom systems.

Running virtual Nerves

Yesterday we got the Nerves system running under qemu. Not with KVM, instead with -cpu cortex-a53. This should leave a lot of performance on the table. I hope we can get through the challenging bits of getting KVM working and running directly on the host CPU. Let’s just say currently qemu just hangs there but Frank certainly has ideas about how to fix it and arguably it should be fixable through uboot as well. Whichever way gets us there first.

But it is interesting to see how far we can push with less-than-optimal virtualization. So I finished up the Nerves firmware project to make it connect to NervesCloud and so on.

1 device. Up and running. Gotta implement generating serial numbers.

2 devices. Good, it works.

10 devices. No problem.

50 devices. No problem. Takes a bit to start.

500 devices. Well the host machine hit more than 450 load avg for a while there. Took a bunch of time to calm down. Then no issue. So I don’t think we are taxing this machine beyond the startup.
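Each virtual device needs its own identity before it can show up as a separate device, which is where the serial numbers come in. A minimal sketch of one way to generate them, assuming a simple index-based scheme (the `virt` prefix, the zero-padded width and the function name are made up for illustration, not what the project uses):

```python
def serial_for(index, prefix="virt", width=5):
    """Return a deterministic, zero-padded serial for the device at `index`."""
    if index < 0:
        raise ValueError("device index must be non-negative")
    return f"{prefix}-{index:0{width}d}"


# One serial per device; the same index always yields the same serial.
serials = [serial_for(i) for i in range(500)]
print(serials[0], serials[-1])  # prints "virt-00000 virt-00499"
```

Deterministic serials mean a restarted device comes back as the same device rather than registering as a new one.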

Limits

Why stop at 500? I ran out of evening. I did that yesterday. My last bit of time went to starting to make the devices report back to their orchestrating script when they are up and running so that I could potentially bring them up more gracefully. Then I can try to push more interesting numbers.

I’ll have some figuring out to do. I’m guessing I can bring them up at about 190 or so at a time without making it unnecessarily slow. And once each device is up and reporting in they don’t spend a lot of CPU grunt. I hope that would make bringing up a couple of thousand somewhat smooth.
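A graceful bring-up along those lines could look something like the sketch below: start a batch, wait for every device in it to report in, then move on. The batch size of 190 comes from the guess above. The `start_device` and `wait_until_up` callbacks are hypothetical stand-ins for launching qemu and for whatever reporting mechanism the orchestrating script ends up using.

```python
def batches(total, size):
    """Yield (start, stop) index ranges covering `total` devices in groups of `size`."""
    for start in range(0, total, size):
        yield start, min(start + size, total)


def bring_up(total, size, start_device, wait_until_up):
    """Start devices one batch at a time, waiting for each batch to report in."""
    started = 0
    for start, stop in batches(total, size):
        ids = list(range(start, stop))
        for device_id in ids:
            start_device(device_id)
        # Block until this whole batch is up before loading the host further.
        wait_until_up(ids)
        started += len(ids)
    return started


# Stub callbacks so the sketch runs on its own; real ones would launch qemu
# and wait for the devices to report back.
launched = []
count = bring_up(500, 190, launched.append, lambda ids: None)
print(count, list(batches(500, 190)))
# prints "500 [(0, 190), (190, 380), (380, 500)]"
```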

KVM support would almost certainly help make them boot much faster and improve things in general. My hope is to reach the point where memory is the limit. The BEAM is not particularly lean on memory, it ain’t bad, but I’m guessing I’ll need 100-150 MB per virtual device. I’ll tune that if I see the need and have the time. I have 1 TB of RAM. In the best of worlds that ends up being the limiting factor and in rough math that means almost 7000 devices. I don’t know if we can get all the way there. I’m assuming there are bottlenecks I don’t know about. I hit a small one that I believe we’ve addressed. The base disk image was 2.4 GB or so (mostly empty). I trimmed it down to a couple of hundred MB. That was required to fit the 500 on the system disk. Now I think I have a new disk to play with so that fun little limitation should be thoroughly handled.
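The rough math, sketched out. This assumes 1 TB means 1024 × 1024 MB, that every device gets the full per-device allowance, and that the host and qemu themselves need nothing, so treat the results as upper bounds:

```python
RAM_MB = 1024 * 1024  # 1 TB of RAM, counted in binary megabytes (an assumption)


def max_devices(ram_mb, per_device_mb):
    """Upper bound on device count if memory were the only limit."""
    return ram_mb // per_device_mb


def disk_needed_gb(devices, image_gb):
    """Total disk used if every device gets a full copy of the base image."""
    return devices * image_gb


print(max_devices(RAM_MB, 150))  # prints 6990
print(max_devices(RAM_MB, 100))  # prints 10485
print(f"{disk_needed_gb(500, 2.4):.0f} GB for 500 untrimmed images")
```

The last line shows why the untrimmed image was a problem: 500 copies at 2.4 GB each come to roughly 1200 GB of disk.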

Useful?

So this is actually taking us somewhere useful. Having efficient qemu builds of Nerves means we can build test harnesses and things for some fairly intricate functionality, like A/B updates and a bunch of the resiliency mechanisms around Nerves.

It would allow people to do host development against a virtual board for cases where that is desirable. This can be particularly useful in workshops and training, though I personally tend to prefer having the hardware.

It is essentially a starting point for running on any ARM virtual servers offered by cloud providers.

This should also be doable to translate into virtualized ARM 64 on top of Apple Silicon. This may mean we can drop the creaking Docker implementation from Nerves (use UTM with a Linux VM instead if you are doing Nerves on macOS) and maybe get something better. And it would mean that running a ton of virtual Nerves devices for testing things is actually feasible on my laptop.

Some of the features we develop for NervesHub are quite hard to test without proper devices and being able to spin up a small IoT startup at a moment’s notice is really helpful. So yeah, I see a few different use-cases.

Next?

Let’s push it, shall we?

So a few different things. I need to make that reporting-in on boot behavior work as intended and see how efficiently I can start and stop these things. And then I need to see how many we can actually do under the current setup. Because 500 was entirely arbitrary.

I may have nerd-sniped Frank into doing work towards the KVM support working. We’ll see :)

Big thanks to Dave Cottlehuber for early help on getting to grips with this stuff. Massive thanks to Frank Hunleth for, as always, helping me grow my understanding and enabling me to succeed when I flail. And thanks to GleSYS and Ampere for the hardware access.

This ain’t over.

If you have thoughts or questions you can reach me on email at lars@underjord.io or via the fediverse @lawik@hachyderm.io.
