If you’ve read part one of this series, welcome back. If you haven’t, it was an action-packed exposé on start-up life and the ways in which microservices will decrease your stress and your snack budget. Pour your beverage of choice and go read it.
In chapter two, the project hits the big time. Serverless functions and asynchronous queues save the day with “just-in-time” scale.
Continuing to function with queues and functions
The main metric for success in this project was order growth, and we quickly realised that partnerships were going to be critical to our success. The brands we partnered with had marketing lists which made ours seem “fun-sized” by comparison, and orders were doubling every couple of months. Everyone was happy – until the proliferation of events and partner systems started to reveal the flaws of the legacy architecture. The following was an especially diabolical flaw:
The checkout service sent orders to a third-party inventory system, which also accepted payments. Immediately after processing the order, the checkout used a different third-party API to send notification emails to the customer. This worked well most of the time. But occasionally, the notification API went offline or was rate limited. After exceeding the number of allowed retries, the checkout would cancel the order and refund any payments!
This design failed the “take my money” principle. Adding fuel to the fire, the order system sent notifications to vendors for any new orders and cancellations. Vendors had front-row seats to the sales we were losing.
I remember the boss saying something like “you can wear a clown outfit to work from now on - I don’t care - just fix it”. I did not own a clown outfit, but I had a solution in mind. Rather than complete all of the API calls during the request, we would allow these things to happen as background tasks using asynchronous queues. Here is how I explained asynchronous queues:
...think of a queue like a kitchen in a restaurant. The waiter takes your order and immediately sends this as a message to the kitchen, where it waits in a queue with other orders. The first available chef picks up your order message and prepares your meal, which is eventually delivered to you.
Hopefully, there are enough chefs to get through the queue quickly and deliver your food. If not, more chefs could be hired to handle the load. If there is some problem in the kitchen, it is handled behind the scenes (hopefully) and the customer is unaware.
Applying this to our problem, our checkout would now always accept a valid order immediately, just like our waiter did. The notification would be queued up as a message to be processed by a backend worker program. Think of these backend worker programs as the chefs in our metaphorical kitchen, processing the orders as quickly as possible.
If the notification service was down at that time, the messages would go back into the queue to be retried later. Decoupling the notifications from the orders with a queue meant that we could continue to serve customers, even when there was a service outage.
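To make the idea concrete, here is a minimal sketch of the decoupled flow, using an in-memory queue as a stand-in for a real message broker. Everything in it is hypothetical and for illustration only: the names `Notification`, `NotificationError`, `checkout` and `process_queue`, the `send` callback that represents the third-party notification API, and the limit of five attempts. It is not the code we ran in production.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Notification:
    order_id: str
    email: str
    attempts: int = 0


class NotificationError(Exception):
    """Hypothetical error: the notification API is offline or rate limited."""


def checkout(order_id, email, queue):
    """The waiter: accept the order now and queue the email for later."""
    # Inventory and payment calls would happen here (omitted in this sketch).
    queue.append(Notification(order_id, email))
    return "accepted"


def process_queue(queue, send, dead_letters=None, max_attempts=5):
    """The chef: one worker pass over the messages currently waiting.

    `send` stands in for the notification API call. Failed messages go
    back into the queue; after `max_attempts` failures they are set aside
    in `dead_letters` instead of being retried forever.
    """
    sent = []
    for _ in range(len(queue)):
        msg = queue.popleft()
        try:
            send(msg)
            sent.append(msg.order_id)
        except NotificationError:
            msg.attempts += 1
            if msg.attempts < max_attempts:
                queue.append(msg)  # retried on a later pass
            elif dead_letters is not None:
                dead_letters.append(msg)
    return sent
```

The point of the design is that `checkout` never calls the notification API, so an outage there can delay an email but can no longer cancel an order.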
Just-in-time scale with serverless functions
Initially, the messages were processed by a scheduled task running on a single virtual machine. But when there were serious outages, we’d end up with a huge backlog of messages in the queue, leading to notification delays. A colleague had heard about serverless functions and suggested they were a good fit for the problem.
She explained that serverless is “compute-on-demand”. It scales automatically based on metrics, such as the number of messages in a queue. As the number of messages grows and shrinks, the cloud adds or removes function instances from a pool. A way to think about this is calling in more chefs to handle a backlog of orders. The hosting pool is global, meaning it is not specific to any customer. And billing is based entirely on consumption. If you had three requests that day, you would be billed for those three requests. This made our accountant very happy.
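As a rough illustration of scaling on queue length, the sketch below picks a number of function instances from the size of the backlog. The function name `desired_instances` and both thresholds are assumptions made up for this example; the real scaling logic lives inside the cloud platform and is more sophisticated than this.

```python
import math


def desired_instances(queue_length, messages_per_instance=100, max_instances=20):
    """Guess how many workers (chefs) a backlog of messages calls for.

    Both defaults are hypothetical numbers chosen for illustration.
    """
    if queue_length <= 0:
        return 0  # nothing waiting, so nothing running (and nothing billed)
    needed = math.ceil(queue_length / messages_per_instance)
    return min(needed, max_instances)
```

With these made-up defaults, an empty queue needs no instances, a backlog of 250 messages asks for 3, and a backlog of 10,000 is capped at 20.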
While “just-in-time” scaling and pay-per-use billing were awesome, there was another big drawcard. Azure Functions – and similar products like AWS (Amazon Web Services) Lambda – had built-in connectors for cloud services like Azure Service Bus – a popular solution for queues and multi-subscriber topics. There were even connectors for the cloud notification service we were using. This meant that we could spend less time writing specialised code, and more time on improving the functionality of our product.
Projects can become a victim of their own success when there is no strategy to scale out events and tolerate failure. In our case, seemingly harmless assumptions led to poor user experience when the unexpected happened.
Projects should be designed with the unexpected in mind. By embracing a new approach using asynchronous message queues and serverless functions, we were able to take control of the chaos in our kitchen. I hope the problems in this story made you cringe a little, and the solutions gave you inspiration and hope.
In the final post, we will put on our accountant/ops hats and look at minimising operational costs by hosting functions in AKS (Azure Kubernetes Service).