Everyone loves a good technology horror story, right? The gorier the better in my experience. Well, over the last weekend we had an, um, “incident”. To my mind, it’s got all the ingredients of an excellent example of the genre: a small mistake, complicated conditions, and a big financial hit (mercifully averted). Sound like your kind of thing? Read on. I hope I do it justice…
I started my Monday morning with the usual routine: having a quick nosey through the analytics and logs. Naturally, I tend to start with production, but (thankfully!) it was when I got to the stats for our development environment that I was filled with dread.
Our functions app was out of control and had scaled to 50x the usual number of servers. Huge amounts of data were flying about and the memory footprint was miles higher than normal.
Looking closely, the usage had gradually started to rise throughout Saturday and had then stayed high. The dip and subsequent fall visible on the 3rd of September are our efforts to staunch the bleeding. When we realised what had happened it was a stupidly easy fix (see below), but it did take me a while to figure out what was going on.
When the dust settled, we were left with a whopping great big bill. Honestly, it made my eyes water and (despite how nice my boss is) I even worried momentarily about job security. I won’t say how much money it was, but let me just say I hope to never see a bill that large ever again.
Why did it happen?
So that we didn’t have to wait for the timer to fire “naturally” when we were testing, we were using RunOnStartup = true in the definition of our timer-triggered Azure functions. We were doing this in the (mistaken) belief that “startup” here referred to that of the functions app (it doesn’t), which suggested the trade-off was pretty benign. So what if these functions run an additional time when we turn on the App Service? It’s not like they do anything destructive, after all.
In actual fact, RunOnStartup refers to the startup of each host and the consequence of this was pretty nasty: every time Azure scaled out our App Service, it triggered functions that we had intended to ordinarily only run on a schedule – and infrequently at that!
This in itself wouldn’t even be so bad. If some functions – that we’ve already accepted don’t do anything dangerous – get called more often than they should, what’s the big deal? And it really would have been fine, if not for the fact that two of the affected functions are basically task generators. These functions’ whole purpose is to enqueue a large number of messages whenever they run. Now you start to see the problem…
OK, let’s put it all together. As load increased, Azure was being helpful and provisioning new hosts for our functions to run on. As soon as these new hosts came online they executed our timer-triggered functions, which dumped a whole heap of messages on the queue. Since these messages trigger other functions, the net result was that every new host caused an increase in load before it started to reduce it.
Or, if you’d prefer:
1. Load increases by a moderate amount.
2. Azure scales out and provisions new hosts.
3. New hosts run functions intended for infrequent execution.
4. These functions add messages to queues.
5. Seeing a further increase in load, Azure again responds by provisioning more hosts.
6. Go to step 3.
Etc. etc. ad nauseam… Basically a great big chain reaction.
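If it helps to see the loop in numbers, here is a toy simulation of the steps above. It is only a rough sketch and not how Azure’s scale controller actually decides anything: the scaling rule (enough hosts to drain the backlog in one tick), the per-host throughput, the number of messages each generator run enqueues and the host cap are all made-up assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    """Toy model parameters. Every number here is a made-up assumption."""
    run_on_startup: bool
    messages_per_generator_run: int = 500  # enqueued each time a generator fires
    messages_per_host_per_tick: int = 100  # what one host can process per tick
    max_hosts: int = 200                   # hypothetical scale-out ceiling


def simulate(settings: Settings, initial_backlog: int, ticks: int) -> list[int]:
    """Return the number of hosts at each tick of a naive scale-out model."""
    hosts = 1
    backlog = initial_backlog
    history = []
    for _ in range(ticks):
        capacity = settings.messages_per_host_per_tick
        # Naive scaling rule: enough hosts to drain the backlog in one tick.
        wanted = min(settings.max_hosts, max(1, -(-backlog // capacity)))
        new_hosts = max(0, wanted - hosts)
        hosts = wanted
        if settings.run_on_startup:
            # Each freshly started host fires the task generator once.
            backlog += new_hosts * settings.messages_per_generator_run
        # All current hosts then work through as much of the backlog as they can.
        backlog = max(0, backlog - hosts * capacity)
        history.append(hosts)
    return history


if __name__ == "__main__":
    print(simulate(Settings(run_on_startup=False), initial_backlog=300, ticks=5))
    print(simulate(Settings(run_on_startup=True), initial_backlog=300, ticks=5))
```

With these numbers, the run without RunOnStartup scales to three hosts, clears the backlog and drops straight back to one. The run with it goes 3, 10, 35, 125 and then hits the cap at 200, because each new host adds more work (500 messages) than it can clear in a tick (100). In this model, the only thing that stops the runaway is the ceiling.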
With hindsight, this was a ticking time bomb. Our use of RunOnStartup on timer-triggered functions wasn’t new (in fact, it predates me joining the company), but had never been problematic before. It just took the right combination of conditions to set it off. The fact that it had to happen over a weekend is just an excellent example of Finagle’s law.
How we fixed it
Once we’d worked out what was happening, the fix itself was super easy. To stop the symptoms, we turned the functions app off and manually cleared the queues. To prevent it from happening again we removed the offending RunOnStartup setting from all our timer-triggered functions.
Of course, removing RunOnStartup did mean we had to rethink our local development approach. As luck would have it though, we’d been considering a new approach anyway. Our timer-triggered functions now all have a “twin” HTTP-triggered function. Both the timer- and HTTP-triggered functions point to a common private method which actually contains the functionality. This pattern gives us greater flexibility, as we can trigger this functionality whenever we need without having to do func host start again.
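To show the shape of the “twin” pattern, here is a framework-free sketch. It is plain Python standing in for the idea, not our actual functions and not the Azure Functions SDK: the function names, the list used as a queue and the dictionary used as an HTTP response are all hypothetical. In a real functions app the two entry points would carry the timer and HTTP trigger bindings.

```python
from typing import Any


def _enqueue_report_tasks(queue: list[str]) -> int:
    """Shared core: the only place the real work happens (hypothetical example)."""
    tasks = [f"report-{n}" for n in range(3)]
    queue.extend(tasks)
    return len(tasks)


def report_tasks_timer(timer_info: Any, queue: list[str]) -> None:
    """Scheduled entry point: a thin wrapper with no logic of its own."""
    _enqueue_report_tasks(queue)


def report_tasks_http(request: Any, queue: list[str]) -> dict:
    """On-demand entry point: same core, plus a small response for the caller."""
    count = _enqueue_report_tasks(queue)
    return {"status": 200, "body": f"enqueued {count} tasks"}


if __name__ == "__main__":
    queue: list[str] = []
    report_tasks_timer(None, queue)       # what the schedule would do
    print(report_tasks_http({}, queue))   # what a developer can do on demand
    print(len(queue))
```

Because both wrappers are this thin, there is nothing to drift out of sync between them: whatever the HTTP twin runs during testing is what the timer will run on schedule.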
What we learnt
First and foremost, we’ve seen how dangerous RunOnStartup can be. Secondary to this, I think we’ve learnt to challenge assumptions more and to each take responsibility to personally check documentation, which in this case does contain a clear warning.
We’ve also learnt to put more emphasis on monitoring and alerts in our development environment. Whereas previously we’d focused heavily on ensuring the availability and integrity of our production environment, I think it’s now more apparent to us that the inherently experimental nature of our development environment means it also needs careful monitoring, if for slightly different reasons.
This debacle was also a great crash course in Azure fundamentals. As well as a boost to my proficiency inside the Azure Portal, an incident like this really forces you to think carefully about how all the pieces are interacting. For instance, I learnt a lot about how the scaling algorithm works.
At the end of the day though, the biggest lesson has to be that automatic scaling is a dangerously powerful tool and should be treated with respect. The same features that make serverless computing so attractive are also the ones that, if you’re not careful, can shoot you in the foot.
Well, that’s all folks. I hope you enjoyed our tale of woe. I think it’s fair to say that we were feeling pretty sheepish for a few days, but I think enough time has passed now for us to move on.
All that’s left for me to do is express my gratitude to the nice folks in the Azure billing department at Microsoft, who very kindly took pity on us and decided to waive the anomalous expenses. Thanks guys!
Have you got a serverless horror story? Let me know in the comments.