Picking the OK and wishing for ideal

Getting stuff done is the important bit

My strange data client (I think he reads this. Hi!) has sent me on a wide-ranging data engineering research expedition. I've found very cool tools that really come close to what I believe to be my ideal data pipeline building blocks and I've sketched and explored some really fun stuff. Learnt a lot.

He asked me to pin down what the stack would be for our starting point. My answer was: whatever DataGuy over yonder is most comfortable with.

That's not the fun answer. I want to build my ideal pipeline of course. I want to push things forward. I want state of the art combined with novel and forward-looking. But that's not how you get a business-oriented data product prototype in front of customers. We could use literally anything that will produce an outcome without too much work. I want to say that I don't care about it. I do care. A ton. It just isn't the right starting point. We can keep things very simple initially until we know the problems we are solving. It pains me. But that's the way.

So what I expect to be doing is something like Airbyte for moving data from sources to our stores. Probably something Data Lake-ish. It should have some orchestration, and whether that's Apache (Beam | Flink | Airflow), Prefect or Dagster doesn't matter all that much because all of them should be able to model a flow. And supposedly dbt is very good once you have data in your system and want to remodel it batch-style.

This seems quite typical. It is in and around the "modern data stack", though some readings of that concept seem to imply you use exclusively hosted SaaS. So Airbyte can be swapped for Fivetran. Airflow could be Estuary. Or you could just pay for the hosted versions of all the OSS things. This would get stuff done.

The lesson is done here if you just wanted pragmatism. If you are interested in data pipelines and how they might be improved in the future, keep reading. You are allowed to peace out here though. With my blessing.

Now what would I be exploring to build something novel? Objectives:
I generally want to minimize my reliance on Java and Python projects. Java projects are often beasts to operate and I find them inscrutable to contribute to. This is a me problem to some extent. I also know people from the Java space that swear against operating anything JVM-based. They can absolutely be high performance but are rarely lightweight operationally. They seem culturally distinct from many more modern alternatives, in Go for example. Python is just not the tool for the job if you want performance. Unless you are only operating with accelerated bindings and doing special numerical compute things with dataframes. Sure. But take Airbyte: Java and Python talking to each other. It is very pragmatic. It does a ton of useful stuff. It is also absolutely a pile of overhead on top of overhead. It is not an ideal to aim for.

Estuary is interesting and what they call Estuary Flow is source available. It is BSL so I don't consider it fully open source and I can't build a business on it. It uses a bunch of Rust and has made the pragmatic choice, one I was considering separately, of simply supporting the Airbyte Connector API so that you can use those legacy connectors, with their inefficiencies, to bootstrap a new ecosystem. Why do I call them legacy? Because they don't match my future ambitions ;)

A company called Voltron Data is stitching together what seems to be a potential future of the data space. And they place a focus on standards for composability. Standardizing on the Airbyte connector would be the wrong standard in my eyes, because it leaves massive performance on the table. The Voltron Data approach centers around the Apache Arrow project and a number of related technologies. Arrow is a zero-copy columnar format that stays the same over the wire, in memory and on its way into the CPU (SIMD-friendly) and GPU. It is what Pola.rs uses. So it is what Elixir's Explorer uses.

A very pragmatic but future-oriented thing they've done, and one we already have some support for in Elixir, is ADBC. Database connectors that either use Arrow support in the database OR produce Arrow data by conversion. Meaning you can pull Arrow data from any supported database and get an in-memory representation that you can perform intense processing on (there's a sketch of what that looks like from Elixir just below). It is a good format for storing columnar data, very close to what DuckDB uses internally, and it is good for operating on.

They are also building out a standard called Arrow Flight, which is an RPC format based on gRPC that I hope we'll see in your Kafka, NATS and friends in the future. It lets you stream data, and the internal format? Arrow.

These parts all string together to reduce processing cost. Whether you leverage that for lower prices, less environmental impact or higher throughput is up to you. But I like it. If something goes from one storage, into two stages of processing and then on to new storage, it looks like:

ADBC -> Processing in Arrow format -> Arrow Flight -> Processing in Arrow format -> Arrow Flight -> ADBC

That would only produce new data and would ideally never need to make a needless copy. It would also make that data load quickly into CPUs and GPUs along the path. Readable text files are nice but I think the potential gains here are way more important. Any standard for connectors going forward should bear this in mind. Code that can take or produce the Arrow format fits into this.

This is where my grand vision starts to poke at Elixir.
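To make the ADBC piece concrete, here's roughly what pulling query results straight into an Explorer dataframe looks like with the Elixir adbc bindings. A sketch under assumptions: the version numbers, driver options and exact function signatures here are from my reading of the adbc and Explorer docs, so double-check them against the current releases.

```elixir
# mix.exs deps, versions are placeholders:
#   {:adbc, "~> 0.6"}, {:explorer, "~> 0.9"}
# The driver binary itself is fetched separately, see the adbc docs.

children = [
  # One process holds the database handle, one holds the connection.
  {Adbc.Database,
   driver: :postgresql,
   uri: "postgres://user:pass@localhost:5432/analytics",
   process_options: [name: MyApp.DB]},
  {Adbc.Connection,
   database: MyApp.DB,
   process_options: [name: MyApp.Conn]}
]

Supervisor.start_link(children, strategy: :one_for_one)

# Query results travel as Arrow and land in a Polars-backed dataframe.
{:ok, df} =
  Explorer.DataFrame.from_query(MyApp.Conn, "SELECT id, amount FROM orders", [])
```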
Now, dealing with low-level zero-copy formats such as Arrow is not very Elixir-native. It means NIFs like Explorer and keeping opaque references. That's just our version of using Python for driving a ton of C++ libraries. But Elixir is good at clustering, distributed workloads, orchestrating work and providing observability and insight into what the system is doing. As well as having some of the best frameworks for building a UI onto the system state.

I would place Elixir as a very good tool for building out an Airflow replacement. The orchestrator. It can do low latency no problem. It can do absolutely brainless concurrency for you. Defining workflows and running them as DAGs (Directed Acyclic Graphs). All of that. Broadway could do this but feels kind of batch-oriented to me and doesn't have a crystal clear pipe-together-a-DAG approach. I poked about with Flow, and the next-step abstraction on top of Flow could be the ticket (there's a rough sketch of the shape I mean at the end of this section). And Arrow Flight or ADBC at the start and on any forwarding step that benefits from it.

If someone wants to make a library to integrate the Rust implementation of Arrow Flight with Elixir through Rustler, that'd be rad. Making it work with Explorer, importantly. Arrow Flight is early days. But being able to expose Arrow Flight endpoints and talk to them would be very interesting for moving data pipelining in Elixir forward.

Meanwhile, I want to find relevant tools that are currently available. Nothing does Arrow Flight from what I can tell. Estuary apparently builds on top of Gazette. Which is kind of Kafka-ish in that it exposes a bunch of topics/streams that are backed by persistence, and you can follow them and resume them from a particular offset and so on. It is a very cloud-native Golang type thing which I think is pretty straightforward. I like the really dumb persistence (files on a bucket) and really dumb semantics (byte offsets). This makes it all very adaptable. Gazette is low-latency streaming first, with ongoing materialization. Streaming is the paradigm to aim for. You can't get to low latency from batching but you can get batching from low-latency streaming. For short-term data you can have your bucket lifecycle the data out, while with long-term data you could just keep it. There are some knobs to tune for specifying the "journals" and how data naming is done, etc. Essentially the persistence is a Data Lake and you get that automatically as it goes. Other systems can consume this separately. Out of band. Or they can consume it slowly at their leisure. From scratch or just updates. It has a very simple HTTP API as well as gRPC. It feels like it is an inch away from supporting Arrow Flight.

For cases where the business needs a Kafka I've been recommended to prefer Redpanda or NATS JetStream, just because Kafka is a bit of a beast to operate and costly performance-wise. I can't speak to it personally but I would take that recommendation and try those first. If the business already has Kafka, then that's that, probably. I think Gazette is a better approach than Kafka inside a data pipeline. But I don't think it necessarily replaces Kafka as a backbone which a data pipeline feeds off of.

So I would prefer Arrow-oriented processing where appropriate and possible. Potentially Avro is a decent match before you get to columnar-style data, if you need to be able to extract records rather than columns. Exactly how to delimit Arrow over Gazette is a bit of a question that I'd have to pin down, but it should be fully doable.

The open question is how to build data transformations currently. My Elixir vision would mean doing it with Elixir, Explorer and Nx.
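To show what I mean, here's a minimal batch-style transformation step with Explorer, assuming a chunk of data has already arrived in Arrow IPC stream format. The file name and columns are made up for illustration.

```elixir
require Explorer.DataFrame, as: DF

# Read a batch that arrived as an Arrow IPC stream...
df = DF.load_ipc_stream!(File.read!("orders.arrows"))

# ...and run a columnar transformation on it: filter, derive, aggregate.
summary =
  df
  |> DF.filter(amount > 0)
  |> DF.mutate(amount_sek: amount * 10.5)
  |> DF.group_by("customer_id")
  |> DF.summarise(total: sum(amount_sek))

# Hand it onwards in the same format, no text serialization in sight.
DF.dump_ipc_stream!(summary)
```

Everything in there stays columnar, Polars underneath, and the output is ready to be shipped onwards as Arrow.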
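And backing up to the orchestrator idea from earlier, this is the pipe-together-a-DAG shape I keep wishing for. MiniDAG here is hypothetical and deliberately naive: it runs steps in dependency order rather than concurrently. A real version would topologically sort the graph, run independent branches in parallel and layer retries and observability on top.

```elixir
defmodule MiniDAG do
  # Run named steps in dependency order, passing each step the results
  # of the steps it declares as inputs. Sequential for clarity only.
  def run(steps) do
    Enum.reduce(steps, %{}, fn {name, deps, fun}, results ->
      Map.put(results, name, fun.(Map.take(results, deps)))
    end)
  end
end

# Hypothetical pipeline, names and files made up:
MiniDAG.run([
  {:orders, [], fn _ -> Explorer.DataFrame.from_csv!("orders.csv") end},
  {:users, [], fn _ -> Explorer.DataFrame.from_csv!("users.csv") end},
  {:joined, [:orders, :users],
   fn %{orders: o, users: u} -> Explorer.DataFrame.join(o, u, on: "user_id") end},
  {:publish, [:joined],
   # In the ideal version this edge is Arrow Flight rather than a file dump.
   fn %{joined: df} -> Explorer.DataFrame.to_ipc!(df, "joined.arrow") end}
])
```

The interesting part is the edges: ideally every one of those hand-offs is Arrow, over Flight wherever a network hop is involved.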
Transformations could also mean using Plasma to share Arrow data with Erlang Ports running whatever you need, Python and friends for example. Some pragmatism is required here. But that's not here or now.

Currently, transformations on Arrow data are probably best done in custom code. If you are consuming Gazette there are nice consumer bindings already built out in Go, and Arrow has Go bindings. I guess transformation could be done in Go for the moment? I don't find that extremely compelling as I imagine it is a bit more ceremony than I'd prefer.

I think Elixir would do very well in picking up much of the user-facing side of this. The service end. You need a websocket for your user that keeps their data fresh from the stream engine? Easy enough. Need dashboards and control UIs with live data? We have perfect tools for building that. I think we could also do well to build single-developer Elixir-fullstack tools that do most of what the data landscape needs. Flow and Broadway already give much of it if you have some dev muscle to throw at the problem.

This is very long. I apologize. I must make blog posts for this that I can reference to trim down the exposition. I am exploring and trying to learn the space. I will be getting experience with it in the near term but currently I'm very much theory-crafting, sorry to say.

Did I miss your favorite and best data tool? Am I completely misunderstanding the space? Do any of your experiences contradict these ideas? You can reply to this email or poke me on the fedi @lawik@fosstodon.org, I enjoy hearing your thoughts.

Thanks for reading. I appreciate your attention.