Post-mortem: 10 years in the vertical - Part 3

Underjord is a tiny, wholesome team doing Elixir consulting and contract work. If you like the writing you should really try the code. See our services for more information.

Content warning: Contains poor technical decisions, inexperience and stories of a developer from just starting out up to roughly the present day. Be kind to who I was, he gets enough shit from me.

This continues our dive from Part 2. Part 1 can be found here.

The better system 2.0

The big embarrassment of the system we’d previously built was performance: it wasn’t scaling with a growing customer base and was slowing down. When discussions started for a proper rewrite of the system, “scaling” was probably the central word. There was another parallel product in the company that was also experiencing performance issues. This put performance front and center, but since it was attributed to growth the focus was more on scalability than on efficiency specifically.

We were tasked with finding a plan. This was when microservices were heavily on the rise, and they promised to let us scale horizontally rather than vertically. In what felt like a stroke of pragmatism and wisdom we leaned on the experience of others. We read the blog post In praise of “boring” technology by Spotify and basically followed it, making plenty of decisions along the way since it is fairly high-level.

I rather like the core of that post still. It basically reinforces the idea that you don’t need to get fancy. You can use well-known, well-proven tools and build a series of simple things that achieve something complex, and the result is likely to be more reliable than more new-fangled and fancy solutions. Reliable, maintainable and thus, potentially, sustainable.

I don’t think leaning on the general idea was a bad call. I have some reservations about leaning on the details of it, which we should get into. The basic philosophy is still pretty close to what I subscribe to.

The second priority was to create something that could be re-used when the other product, and any number of future products, came to be built. That is, re-use what makes sense and build some things to be general. Sounded good at the time. This, together with the way we divided our microservices, had an enormous impact down the line.

Meanwhile the product designer and some domain experts got busy with focus groups, exploring what the system should be in order to take what we’d learnt from the existing version and push it to the next level.

Product team life

I really liked the team we had. Good people throughout. Overall I ended up having most of the technical responsibility because I was the only one with a 100% time commitment to this product. Or mostly 100%; it varied over time, but during this greenfield development it was certainly most of my time. We were 2-3 devs off and on, plus one product designer/UX/sometimes-JS person who was learning programming as we went, a pretty hardcore journey and an interesting story on its own.

I really liked being on a team with only the product work to focus on. Agency life was always a balance between serving many masters and delivering enough stuff on time. Here we more or less embarked on a grand voyage to build The Better Thing.

In the end we also had deadlines and limitations, so the first part of the work was the thoroughly considered and enjoyable part and then there was a significant period of forced march to get it past the goalposts. Not the best but not the worst I’ve had either.

This Time We’ll Do It Right

So the technical stack was:

  • Python
    • Django
      • Django Rest Framework v2 (for our product-specific API)
      • OAuth2 toolkit (for our generalized account service)
    • Flask (for our smaller public APIs)
  • ZeroMQ (for RPC, Queue/Worker, PubSub communication)
  • Protocol Buffers (for message formats/contracts)
  • PostgreSQL (main datastore)
  • Redis (cache)
  • Elasticsearch (as needed, for some statistics)
  • Bind (not for public DNS, only internal service discovery)
  • AngularJS (web frontend, 1.x series, SPA-style app)
  • Android app, thin web wrapper
  • iOS app, thin web wrapper
  • Ionic (1.x, later limited featureset, better UX app)

I don’t really regret the specific technical building blocks. There was a lot of implementation detail that was not right and several architectural turns I wouldn’t take today. And some of these things I wouldn’t pick up now (Angular 1 and Ionic 1 were both evolutionary dead-ends, apparently). Many of these things could have been done more easily with slightly different tools. But I’m happy with the selection and would definitely consider using a bunch of these tools again. ZeroMQ specifically isn’t something I expect to use again, but I learned tons from getting to grips with it. The ZeroMQ online guide was a good read, if a bit hyperbolic at times.

So the microservices thing. Turns out how you divide your services matters a lot. We divided a lot to get that sweet, sweet scalability. We broke it apart into something like this:

The general services

  • users (user data, primarily for auth, password hashes, resetting passwords, confirming email, extra security concerns)
  • account (OAuth, SSO, public login UI, profile information and public APIs for anything concerning the user)
  • relations (hierarchies, organizational units, user roles)
  • files (public endpoints and internal, file storage, hooks for access checking, video transcoding)
  • media (image sets, embeds, extra data such as captions, files)
  • notifications (queueing, separate workers for GCM, Apple Push Notifications and email)

Product-specific services

  • public API
  • core domain functionality
  • calendar functionality
  • messaging functionality
  • export functionality
  • access (used by files API and public API for authorization)

Independent parts of the domain were separated, which was generally fine. One slow killer was that we generalized the organizational structure and user roles out to one service, user information to a second and user access checking to a third. Over time I ended up building so much caching around this to make it faster. So much caching.
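Most of that caching amounted to variations on the sketch below: a Redis cache in front of a cross-service lookup. The service client, the call and the key layout here are hypothetical stand-ins rather than anything from the actual code.

```python
import json

import redis

cache = redis.StrictRedis(host="localhost", port=6379, db=0)

ROLE_CACHE_TTL = 300  # seconds; slightly stale roles were an acceptable trade-off


def get_user_roles(relations_client, user_id, org_unit_id):
    """Fetch a user's roles in an org unit, caching the relations-service response."""
    key = f"roles:{user_id}:{org_unit_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    # Cache miss: ask the relations service (the client and its method are illustrative).
    roles = relations_client.get_roles(user_id=user_id, org_unit_id=org_unit_id)
    cache.setex(key, ROLE_CACHE_TTL, json.dumps(roles))
    return roles
```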

Managing files and media in a different service worked pretty well, and that did need to scale separately as it turned out. That said, we didn’t need to make it two services: the line between media and the underlying files never ended up mattering, and the general-case implementation added a lot of overhead that caused frustration any time anyone wanted to understand how the system uploads a picture.

ZeroMQ and Protocol Buffers were much more hardcore than we needed. The multitude of git repositories (one per service, one per frontend client) was sometimes painful, and a monorepo would probably have been easier to keep moving in good order. We should have used Python packages, or just some scripting, for the things where we used git submodules. Git submodules have never been the right choice for me.
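To give a sense of what “much more hardcore than we needed” means in practice: every cross-service call involved a message type compiled from a shared protobuf contract plus hand-rolled ZeroMQ socket handling on both ends. A rough sketch of the calling side, where the generated contracts_pb2 module, the message fields and the address are all hypothetical:

```python
import zmq

# Hypothetical generated protobuf bindings; the real contracts lived in shared
# repos pulled in via git submodules.
from contracts_pb2 import GetUserRequest, GetUserResponse

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://users-service:5555")  # illustrative address

# Serialize the request message and send it over the REQ socket.
request = GetUserRequest(user_id="1234")
socket.send(request.SerializeToString())

# Block for the reply and deserialize it into the response message type.
response = GetUserResponse()
response.ParseFromString(socket.recv())
print(response.display_name)
```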

What did it do well?

Well, it worked. And it could scale horizontally, and did so whenever that turned out to be the bottleneck. It did a good job of separating concerns between services, sometimes.

Once the very painful migration was done and most early bugs were worked through, customers generally liked the system. Overall it was an improvement. It was definitely more visually professional, which some loved and some hated.

It was built to match the government requirements placed on preschools much more closely and I believe it did that well. Including having good support for showing the curriculum and tagging documentation according to the curriculum.

It also did a fair job of separating frontend development from backend development. Unfortunately the win was limited there, as in the long term all development was done by me and other full-stackish people. It did allow releasing frontend updates without making a bunch of backend changes. That was good.

What technical flaws contributed to its end?

The weight of the architecture was crushing. It was built on ideas from a much larger company with much larger teams. I spent long periods of time alone in running and developing the system. I got to be pretty fast at it. But it was an incredibly poor choice for a small lean team to spend time updating protobufs for service communication and worrying about scaling calendar and messages separately when both were at trivial usage.

The cost of the number of hosts we spread across was so much more than should have been necessary for a product at this scale.

YAGNI. You Ain’t Gonna Need It. We ended up building one prototype product beyond the preschool product on top of this ecosystem. It was abandoned quite quickly. The idea of sharing services was abandoned, but this product bore the weight of that generalization until the very end. Especially with an unnecessarily generalized and complex system for handling the customer org structure, which affected access checks, authorization and querying for common pieces of data. This was why I needed to add so much caching: so many checks. I started efforts to bring this closer to the core and remove the generalization but never had enough time.

No circuit breakers or controls to prevent services from just murdering each other during high load. This led to some quite pathological problems whenever something hit a performance snag or reached a breaking point, and it could take a while for the system to recover. We also had very spiky usage: everyone posted things around lunch, and every parent and guardian would get notifications and check around the same time during their own lunch.
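For illustration, this is roughly the kind of circuit breaker we never had: trip after repeated failures, refuse calls while the circuit is open, then let a trial call through after a cooldown. A minimal sketch, not anything from the actual system:

```python
import time


class CircuitBreaker:
    """Trip after repeated failures, refuse calls while open, retry after a cooldown."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open, skipping call to downstream service")
            # Cooldown has passed: allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result


# Usage: wrap calls to another service so a struggling dependency fails fast
# instead of piling up requests (the client below is a hypothetical stand-in).
# breaker = CircuitBreaker()
# user = breaker.call(users_client.get_user, user_id="1234")
```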

Wrapped web apps on mobile. So much futzing with video and bad web view feature sets. This was on the old iOS web view and mostly on the old Android one as well. The later Ionic stuff worked great. Until we hadn’t updated the app in a while and our dependencies rotted and shipping an update turned into NPM dependency hell.

What did you learn from working on it?

So much, so very much. Reading up on ZeroMQ was great fun, and learning to wield it was wild. Building the request/response protocol we used, as well as the queue/worker and pub/sub patterns, was a great learning experience.
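The queue/worker part, for example, was essentially ZeroMQ’s PUSH/PULL pipeline as described in the guide. A minimal sketch under that assumption, with illustrative addresses and payload, and with the producer and worker squeezed into one process for brevity:

```python
import zmq

context = zmq.Context()

# Producer: pushes jobs onto a pipeline. With several workers connected,
# ZeroMQ round-robins the jobs between them.
sender = context.socket(zmq.PUSH)
sender.bind("tcp://127.0.0.1:5557")

# Worker: pulls jobs off the pipeline and processes them.
receiver = context.socket(zmq.PULL)
receiver.connect("tcp://127.0.0.1:5557")

sender.send(b"send-push-notification:user-1234")  # job payload is illustrative
job = receiver.recv()
print(f"worker got job: {job!r}")
```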

Lots of Ansible. So much Ansible. Our playbooks weren’t perfect, but the Ansible part of deployment was generally great. Our Vagrant dev setup was, eh, not great, but it worked.

I learned to dislike the GIL in Python as we actually tried to use threads and later green threads for a bunch of things.
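A toy illustration of the problem: pure-Python, CPU-bound work doesn’t get any faster on threads, since the GIL only lets one thread execute Python bytecode at a time. The timing comparison is the point, not the busywork itself:

```python
import threading
import time


def busy_work(n=5_000_000):
    # Pure-Python CPU work; the GIL serializes this across threads.
    total = 0
    for i in range(n):
        total += i
    return total


start = time.perf_counter()
threads = [threading.Thread(target=busy_work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"4 threads:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
for _ in range(4):
    busy_work()
print(f"sequential: {time.perf_counter() - start:.2f}s")
```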

APM can be incredibly useful. We used New Relic which was costly but did so much for letting me figure out bottlenecks and fix things.

I don’t think our choices were bad in themselves. They were wrong for the product and the team, but the individual choices were mostly fine. This particular combination, though, mostly caused resource starvation, which wasn’t particularly fun. So I think this has given me a better sense of what can be heavy and what can be light, and when one or the other is appropriate.

Invent fewer things if you can. We did too many things basically from scratch. Super fun to build. Could have done something simpler. Thrift would probably have given us all we needed rather than ZMQ. But even so, I don’t think the microservice approach was necessary at all for this product.

Don’t work counter to your tools. We really did not use Django the way it wants to be used. Our biggest Django application didn’t have a database; it just called out to services. So we lost most of the conveniences offered by Django models. It was still more featureful in some ways than Flask, but it wasn’t pretty.
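A rough sketch of what that looked like: no models or querysets, just views calling out to another service and shaping the reply by hand. The service client and its method are hypothetical stand-ins, not an actual API.

```python
from rest_framework.response import Response
from rest_framework.views import APIView


class ChildListView(APIView):
    # In the real system a service client was wired up at startup; this is a
    # hypothetical placeholder rather than an actual client.
    core_client = None

    def get(self, request):
        # No ORM, no queryset: fetch from the core domain service over RPC
        # and turn the reply into JSON by hand.
        children = self.core_client.list_children(
            org_unit_id=request.query_params["org_unit_id"]
        )
        return Response([{"id": c["id"], "name": c["name"]} for c in children])
```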

We could have just built on Django, Rails or Laravel and done all of it. We couldn’t have used Phoenix because it didn’t exist at the time. I don’t think Elixir existed when we started; I’d never heard of it, at least.

I got to work with a bunch of the tricky parts of handling large amounts of files, and with how hard performant IO can be, back before object storage was standard practice.

So I learned a lot of tech throughout this but I probably learned more about tradeoffs during the creation and the subsequent operation of this system.

In reflecting on this system I’m glad for the experience overall. Each iteration pushed me along in building my skillset and challenged me immensely. I’m continuously both proud of the product we provided and how appreciated it was, while also being very frustrated that we didn’t make better choices so we could have provided something much simpler and much better. The power of hindsight.

A big take-away is that before you’ve done something, it is hard to know what running the system is going to be like. And before you’ve lived the experience, or seen enough of something from afar, it is difficult to really determine trade-offs. My current thinking is that keeping things lean and simple takes you very far. It both scales and performs. I have so far found it easier to break things apart when required than to shove things back together after the fact.

This ends the saga. For now. I think I’ll write up an epilogue on the end of the product (it isn’t around anymore) and my attempt at creating a spiritual successor.

If you have questions or thoughts on this post feel free to contact me at lars@underjord.io or on Twitter as @lawik.

Underjord is a four-person team doing Elixir consulting and contract work. If you like the writing you should really try the code. See our services for more information.

Note: Or try the videos on the YouTube channel.