Aggressively tuning Cosmos DB (the long way round)

How many ways are there to tune Cosmos DB? In our (eventually pretty desperate) attempts to coax it into scaling nicely, we tried 9 approaches. As they became more arcane, we saw diminishing returns. Some were flat-out counter-productive! The eventual solution in our case was a single-line fix (how embarrassing!), but the journey was really interesting. It’s this journey that I want to share today.

As regular readers will know, I work for HeadUp Labs. We make a health app that applies data science to our user’s data in order to help them understand their bodies. We use a range of Azure technologies to ingest data from wearable devices and apply data science at scale.

Since launching the app back in October, we’ve rapidly scaled to reach many tens of thousands of users. On one hand, scaling at this rate has validated a lot of the strategic decisions we took early on about the technologies to use: on the whole we’re very pleased with Azure Functions and Cosmos DB. On the other hand, our rapid intake of users has been unforgiving when it comes to the details of our implementation.

One of the main battles we’ve had is with overloading Cosmos DB. Since the early days, we’ve found that provisioning enough “Request Units” (“RUs”) for our peak times has been very expensive. On the flip side, failing to provision enough RUs means that requests get throttled and we start to see the dreaded 429 status code in our logs.

As we scaled, we found that there were periods when we were managing to completely swamp the database. During these periods, the number of requests eliciting a 429 status (RequestRateTooLarge) could dwarf the number of successful (200, 201) requests.

We eventually realised that our problems were primarily caused by a really simple misconfiguration. (Skip to the end if you want the skinny on this issue and the fix.) The solution isn’t really the interesting bit though. The reason I wanted to write this blog post was to talk about the things we did before we worked out what the big issue was. The journey we took and the techniques we tried are the interesting topics here. I hope you’re sitting comfortably…

Spreading the load

The first way we tried to tame Cosmos was a simple, but highly effective architectural approach. Many of our jobs are driven by timer-triggered Azure functions. For instance, we might run a certain data model for each of our users once a day. By staggering the execution of each task, we are able to perform the same amount of work with a much lower RU ceiling.

Instead of triggering 10,000 jobs at midnight UTC, we would instead place that many items on an Azure Storage Queue, each with a different VisibilityTimeout. This property prevents the item from being dequeued until the specified amount of time has elapsed.

Depending on the type of task, we use a variety of strategies to determine the amount of delay for each item. A common pattern is:

  • Use the user’s timezone to offset the task by a whole number of hours, such that the task will always happen at roughly the same time of day, local to the user.
  • Apply an offset within the hour based on the user’s cohort (an arbitrary designation), which might result in groups being spaced 10 minutes apart.
  • Within each cohort band, apply a random number of seconds, so that the users are distributed approximately evenly throughout the 10 minute period.

The overall effect is to smooth a massive daily spike into a smaller, more constant stream of work. Since we need to provision Cosmos to be able to handle the biggest spike, this strategy results in a big cost saving.

Implementing index documents

Another problem of our initial naive implementation was around complicated query logic. For each user, we may have a large collection of “ratings”, relating to different aspects of their health. We maintain several categories and each rating may have a different status. A common query we might make would be to find the most recent Rating document for each combination of category and status. This sort of query quickly became very expensive as both the number of users and the number of ratings per user grew.

We were able to buy some time by switching our query from using a text-based representation of a DateTime to the inbuilt _ts property, which is a numeric representation of the time the document was last changed.

Eventually, though, we needed to rethink our approach altogether. The route we ended up taking was to introduce a new type of document: a RatingsIndex. This document is basically a grouped collection of IDs of Rating documents elsewhere in Cosmos. We update this document whenever we change a rating, and query it to work out which rating to use.

This might seem counter-intuitive. After all, we’ve replaced one query with two. The trick lies in the fact that we were actually replacing one expensive query with a combination of a very cheap query and read. Read operations are more like a call to the filesystem than a database query. They are insanely efficient.

We’ve effectively shifted the logic of selecting ratings from read-time to write-time. Since we read a lot more than we write, we’ve found this to be highly beneficial. The cost of the combination of an extra write (when we update the index) and extra read (after we’ve queried the index) is more than offset by the gains we made by removing an expensive and frequently-called query.

Shielding Cosmos with Redis

Even after spreading the load, our growth meant that costs kept rising. As with most applications, we read a lot more than we write, so we decided to expand our Cosmos implementation.

I say “expand” because we were already using Redis to cache database calls made from our API project. As is commonplace, we were using a layered design for our ASP.NET-based API. Having all our database logic centralised in our repository layer made it very straightforward to weave in caching as a defensive layer.

The problem was that our API is actually pretty thin – most of the heavy lifting is performed by our (numerous) Azure Functions. The idiomatic way to access a database in a Function is by accepting a client via a “binding” parameter. The example below is lifted straight from the documentation:

We had dozens of functions written in this form. Despite having caching logic implemented in repositories sitting in our API project, we had no real way of using them here. Re-implementing the caching in the Functions would have meant a massive amount of duplicate code, as well as being error-prone and hard to debug.

Our luck changed when we discovered an unofficial implementation of AutoFac for Azure Functions: AzureFunctions.Autofac. Suddenly we could inject dependencies into our Functions. Indeed, our API project was already using AutoFac for DI, so it was extremely simple to start sharing the services and repositories and remove all the duplicated functionality. Much code was deleted that day.

By removing all direct bindings to Cosmos from our Azure Functions and replacing them with cache-aware repository calls, we were able to slash the number of requests making their way to the database.

But still, they came…

For a while, things were looking a lot better. We were able to grow our user-base without unduly cranking up the provision of RUs or succumbing to an unacceptable number of throttled requests (429s).

Then we started to see some unpleasant symptoms during peak hours. The number of requests would rise beyond capacity and stay there for an hour or two. During this time, we would see higher volumes of RequestRateTooLarge exceptions and 429 status codes.

Those purple hills are definitely downers.

These events became more frequent and more significant over time. Worse, we were beginning to see the application misbehave as a result. Jobs wouldn’t finish. Data would end up in strange states. You get the idea… Well, as you might imagine, the drive to find a solution became more and more urgent.

Tweaking retries

One idea we latched onto early on was that somehow it was the combination of using Storage Queues, Azure Functions and the .NET Cosmos DB SDK. Both queues and the database will attempt to retry a failed operation, you see.

If the database is unable to handle a query, for instance, the SDK will back off and try again. It will do this a set number of times before throwing an exception. This exception will cause the Function to fail, at which point its claim on the queue item will lapse, meaning that another instance of the Function is free to swoop in and try again.

The implication is that a single queue item may result in many times more requests to the database than might be expected. According to this helpful doc, the default value is 9 retries for the Cosmos SDK and 3 for Storage Queues. In other words, a single item on our queue could cause 27 database hits. Could this be why we were seeing such severe failure rates?

If true, this implied two things. Firstly, it suggests that raising the provisioned throughput by a little bit might have a big effect on the failure rate, as each additional success might prevent 26 further requests. Secondly, it prompted us to think about how we might want retries to behave – do all operations really need 27 chances to succeed?

Encouraged to bump up throughput a little bit, we also reduced the number of retries for our backend application. (We left the API project alone, figuring that the impact of failure there was likely to be greater.) The reasoning was something along the lines of “eventual consistency”.

As you might have guessed, this was not a great strategy. Not only were the rates of 429s barely affected, but we found that the system managed to get itself into some pretty strange states, as an increasing number of operations failed hard. We reverted this change.

Logging Cosmos DB transactions

One of the key lessons from our adventures with our failed experimentation with retries was that we had exhausted all the normal ideas and were now flying blind. Accordingly, we started to look at the available data with a newfound vigour.

The metrics built into Azure, as well as Application Insights, were able to tell us in broad strokes where the issues lay. We could tell, for instance, that we were spending a higher proportion of our RUs on upserts that we had been previously (the effect of caching, I suppose). Digging deeper isn’t really possible in the Azure Portal, however, and we found ourselves unable to derive anything actionable. We couldn’t confidently point at any particular “hotspot” in our code.

Frustrated, we borrowed a trick from our pasts, from the world of SQL Server… We decided to try profiling our Cosmos DB usage.

Thankfully, our earlier move to centralise database logic into a set of shared repositories paved the way. These repositories all used the same underlying Cosmos DB client class, so it was relatively painless to add some code to log each request. We decided to write the request type, the query, the calling method, and the cost in RUs to a new Azure Storage Table.

After leaving the profiling code enabled for a few days, we were able to hook up Power BI to the table of transactions. The data didn’t quite match the overall patterns we were seeing (the final section makes the reason for this painfully obvious), but we had some strong leads to pursue.

Can’t outsmart the SDK

A few of the things we tried after looking at the profile data yielded no fruit. I’ll put these all together in a class of bittersweet realisations. Bitter, because we didn’t see the improvements we’d been hoping for. Sweet because we realised that the .NET SDK for Cosmos was already smarter than we gave it credit for. I won’t dwell on these, but I’ll give an example.

We had left an inappropriate setting in place. Deep in the bowels of our implementation, we had the following:

So we had the thought that, perhaps, maybe each query was incurring RU costs once for each partition. Given that our database now spanned many partitions, it seemed as though this might explain the explosion is RU consumption.

It was a simple thing to try. For all the queries where we had access to the partition key (which, thankfully, is most of them), we disabled EnableCrossPartitionQuery and explicitly set the PartitionKey within the FeedOptions object.

The result? No change whatsoever. The SDK had clearly been working it’s magic away from prying eyes; optimising our queries without us even knowing. Whilst, on the one hand, it’s great to know that the SDK is so smart, on the other hand, our search for a solution would have to continue.

Beware the update!

One thing that we learnt from our profiling efforts was the seemingly disproportionate cost of an update operation. As Maxime Rouiller points out in a blog post on the subject of calculating RU costs, an update is effectively a delete followed by a create. This results in an RU cost 10 times greater than would be incurred for reading the same document.

Looking at the output of our profiling clearly showed some “hot spots” caused by updates. In the most significant case, we were frequently updating a single property in a relatively large document. Since the cost in RUs is proportional to the size of the document, we decided to split this document into two: one small document containing just the properties we often need to update and one large document with the rest.

Having to read two documents does come at a cost. Each of the two documents will both have at least a standard set of core properties (e.g. _ts, etag), which means you’ll be necessarily be reading more data after splitting your documents. But because of the 10x cost difference between reading and writing, writing to a smaller document results in a big enough saving to overcome the penalty incurred by reading two documents.

We can use the Azure Cosmos Capacity Calculator to estimate the economics of the split. I’ll leave all the general settings at their default values and just change the section relating to our documents. Imagine we have a single big document (like this one; 1.99kb) and we need to read and update 100 documents like this per second. This would require us to provision 734 RU/s – 105 for reads and 629 for writes. Now imagine we took a couple of properties from the big doc (to leave it slightly smaller; 1.94kb) and moved them to a small one (0.47kb). If we read both 100 times per second, but only wrote the smaller one at this frequency, the total throughput drops to 700 RU/s (205 for reads + 495 for writes). This is a toy example, but we can already see a 5% improvement.

NaiveSplit%
Reads105205+95%
Writes629495-79%
Total734700-5%

In our real-life example, by reducing the size of the document we update most frequently, we were able to shave off close to 15% of the RU cost of one of our most frequent operations.

Pre-emptive scaling

Whilst our efforts to this point had resulted in noticeable gains, we were still facing big peaks and troughs in consumption. The peaks were invariably seeing huge numbers of throttled requests as demand exceeded the throughput we’d provisioned. Keeping the database at a scale that would satisfy this demand looked to be prohibitively expensive.

In order to handle the peaks without paying over the odds during quieter periods, we wrote a Function to automatically scale Cosmos to anticipate demand. We were fortunate that our periods of high demand were fairly predictable, so we could scale up and down on a schedule.

It turns out that altering the number of provisioned RUs per second is possible through the DocumentClient itself, so the code to scale the database up and down is very straightforward:

As you can see, the hard part is creating a new Offer object with the new provision.

We run this from a timer-triggered function which looks at a Storage Table for its schedule. It checks the schedule at quarter-to and quarter-past the hour and scales the database up or down depending on the known patterns of demand.

Granted, this doesn’t solve the problem of higher-than-anticipated RU demand. What it does do is make it significantly cheaper whilst we find a proper solution.

Tuning the index policies

In our increasing desperation to reduce our RU spend, we started to look at Cosmos’ indexing policies. The default and recommended approach is to allow Cosmos to index every property on every document. We’d initially accepted this but were now sceptical of everything. Could it be that our index metadata was now so vast that it was harming performance?

We tried switching away from the default policy of having every property indexed. Instead, we used the data from our profiling exercise to identify the properties we were actually using in our queries and instructed Cosmos to only index these. The motivation behind this was to reduce the cost of write operations.

The documentation seems to support this as a strategy, but the results were not as impressive as we had hoped.

A one-line fix

As we started to run out of ideas – and still no closer to a performant system – we began to despair. In our despair, we started aimlessly combing over our code.

We even started looking at code that we knew worked. We knew it worked because we’d written it months ago and had used it ever since. It hadn’t been modified since long before the start of our current woes. It couldn’t possibly be at fault, but still, we leafed through the files, scrolling hopelessly through ancient methods.

One of the classes we looked at was our Cosmos client singleton. When we’d first started writing code to access Cosmos, we’d followed the performance tips and had implemented our client as a shared singleton. It all looked fine, but we wanted to double check that Autofac was properly resolving it as a singleton, so we decided to step through it.

F11, F11, F11… “Yep, it’s definitely sharing a single instance.”

“But… what is that method doing? And why did it take 750ms to run?” So we ran it again. Same behaviour. Very strange.

Ladies and gentlemen, it turns out that we had left a particular method call in the query method instead of the constructor. Not just any method call either, but a call to the method that proactively fetches all the metadata for the collection in order to speed up each query. Given the size of our database, these metadata requests were now taking hundreds of milliseconds and costing us a large number of our provisioned RU/s.

The fix was comically simple in the end. We moved OpenAsync(); out from the method performing the actions/queries and into the method that instantiated the singleton.

As soon as we deployed this change, the plateaus of huge numbers of 429 responses went away. Throughput dropped so much that we were able to reduce the number of RU/s we were provisioning to 20% of what it had been. (In other words, this fix instantly slashed our database bill by 80%.)

Everyone was delighted. Costs were way down, exceptions were non-existent, and overnight the app started behaving as it was designed to.

Reflections

It’s been a few weeks since we wrapped this up, but I’ve been thinking about writing this blog post and thinking about a whole load of related questions.

Chief among these has to be “why did it take us so long to get right?” And I think the answer to that is nuanced and interesting. Fundamentally, it breaks down into the following factors:

  1. Code consolidation. The problematic class was only used by our API at first. We started using it in our Functions project as we started injecting dependencies to our repositories. In other words, we spread the problem as part of a fix.
  2. A growing problem. OpenAsync is a bit of a black box (the docs say very little about what it’s doing under the hood), but we think it pulls a dictionary of partition keys and partitions. This would make sense, as it would allow the SDK to make most queries lightning fast. This would also explain why we didn’t notice this issue when the database was so much smaller (or on our non-prod DBs). The upshot is that the problem was literally growing in parallel with the database.
  3. Lack of documentation. In a similar vein, what is a “ReadFeed” operation? We could see these in our Azure metrics. They were clearly something to do with our problems, but our Googling didn’t answer our questions We now know that OpenAsync performs a ReadFeed operation, but we didn’t manage to make this connection until afterwards.
  4. Partial logging. Remember when I said that the data from our profiling work didn’t quite match what we were seeing in Azure? Well, we hadn’t considered that OpenAsync might be costing RUs. We were only tracking the costs of our queries, upserts etc. With our imperfect data, we were able to target some sensible incremental optimisations, but missed the fundamental reason that the system was misbehaving.

Underpinning all this is, I suppose, a good dollop of “new tech” an “inexperience”. With hindsight, we had already provisioned more throughput than should have needed – we just weren’t familiar enough with the technology for those alarm bells to ring.

On the other hand, I can’t us ever learning some of the things we’ve learnt about Cosmos DB by any other route. Sometimes necessity is the best instructor.

As well as the knowledge we’ve gained, we’ve kept a lot of the changes we made:

  • We still use a variety of techniques to spread load and smooth out spikes. In fact, this is a central part of our architectural thinking now. Regardless of any issues in our code, the availability model of Cosmos is suited more to a baseload than spikes.
  • Index documents are something we’ve not just kept, but extended. For documents that are frequently read, but infrequently written, it’s just a lot more efficient. We’ve rolled this technique out to a few other document types.
  • Injecting repositories with caching logic was still a smart move, even if it initially spread a defect in our DB client. We still use repositories throughout our application to encapsulate all the logic around caching and database access.
  • We’ve left in the code that logs calls to Cosmos, so we can do profiling of new subsystems in the future. All we need to do it flip a flag in our config and all database calls get logged, along with details of the responsible repository method.
  • We’ve retained the ability to scale the database on a schedule. Even with everything else we’ve fixed/implemented, we still have peak and off-peak times. This is mainly due to us having biases in the geographic spread of our users and the desire to have certain actions run during certain times of the day in users local timezones. So we’ve kept the code that scales our provisioned throughput up and down throughout the day.

So, in a lot of ways, the problems we’ve encountered have made us wiser and driven us to create tools with ongoing utility.

Conclusion

The last couple of weeks prior to us finding the fix were pretty unpleasant. The team were feeling the pressure and doubt was creeping in around whether Cosmos was even the right tool for the job.

Now that we’ve emerged through to the other side, the change is dramatic. The whole system now behaves as it was designed to and we’re paying much less than we were before. It’s safe to say that any doubt we had about Cosmos has now evaporated.

We tried a lot of tactical solutions for a problem that was, in the end, just far more fundamental. Along the way we learnt loads and built some cool tools, so ’twas not all for nought.

I decided to publish a list of the things we’d done when we were still deep in the woods. I didn’t guess then that the answer would be as simple as moving one method call. I imagined that I’d be writing a definitive playbook for tuning Cosmos. When we stumbled on our solution, my first thought was to dismiss all the other things we’d tried.

Ultimately though, it was the journey that was interesting to me. I hope, having read the full post, that you’ll agree.

2 Comments

  1. This is an outstanding post and we greatly appreciate you sharing it. We are in the process of going through a migration from a fully traditional relational model to a hybrid model where our most critical but also most “document-like” entities will live in Cosmos. We are likewise new to Cosmos and have a bursty workload with peaks of high volume traffic like you. This is a treasure trove of valuable learnings, stuff like this is generally so hard to find. Cannot thank you enough for taking the time to share this journey.

    Reply
  2. That’s very kind of you to say @XinuniX. It’s hard to shed the ego and post about our mistakes sometimes, but knowing that you found it helpful has made my day. Thanks for commenting!

    Reply

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.