500MB of Memory Saved Us ~60% in Our DynamoDB Bill

Boris Cherkasky
Published in Riskified Tech · Mar 26, 2020


The main goal of the development department in a fast-growing startup is to deliver business value, and deliver it fast. In many cases this speed comes with a price tag: sometimes we choose the easiest solution to implement rather than the optimal one; other times we use a SaaS/managed platform to avoid the additional effort of maintaining infrastructure in-house.

In the “cloud era”, where every action is audited, labeled, and then billed, every inefficiency can easily add up to a substantial amount at the end of the month.
This is exactly what happened to one of our high-throughput data pipelines that leverages DynamoDB: it grew expensive, fast!

But with just a few days of work and a simple in-memory cache, we were able to reduce its price tag by around 60%.

In this blog post, I’ll go over our solution and its performance.

Our data pipeline in a nutshell

A few years ago, we at Riskified needed a high-throughput, low-latency data pipeline that could collect a huge number of signals and aggregate them.

We chose to build our pipeline like so:

High-level data pipeline

We needed a simple and super-scalable solution, and we needed it fast. So we turned to DynamoDB, which gave us amazing performance and helped us deliver a truly awesome product.

That being said, as our throughput grew, our DynamoDB bill grew with it, until we reached the point where we had difficulty justifying that cost against the value of the product we were offering.

To understand how that problem came to be and what we did to solve it, let’s dive into our data modeling.

Data modeling in NoSQL datastores

Our business domain required us to query our data by a small set of different keys while maintaining low-latency reads and high-throughput writes, processing large amounts of data.
Considering all of these requirements, we chose to use a NoSQL datastore, and since we were a small, fast-growing startup, we went with AWS DynamoDB.

In most NoSQL data stores, including DynamoDB, each record has a primary key called the partition key, and fast queries are available only by this key (in practice there are a few other ways to query the tables, but querying by the partition key is the main one).

With this in mind, we divided our data into 3 sets: raw data, aggregated data, and indexes.

  • Raw data tables hold all of the data we receive, as-is, with the data’s partition key as the primary key.
  • Aggregated data tables hold our data ready for querying and processing; this is the main business value we provide.
    We used the data’s partition key as the primary key here too.
  • Index tables are used to translate different query keys into the primary key of our data.

The index tables give us the ability to query our main data by different keys. In practice, those tables map different keys to our primary data key.

Our data modeling
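To make the index-table idea more concrete, here is a rough sketch of what a read through an index could look like. This is an illustration only: the table names, attribute names, and the lookupByAlternateKey helper are made up rather than our actual schema, and it uses the AWS SDK v2 DynamoDbClient directly.

```scala
import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.{AttributeValue, GetItemRequest}

import scala.jdk.CollectionConverters._

val dynamo = DynamoDbClient.create()

// Hypothetical helper: resolve an alternate key to our primary partition key
// via an index table, then fetch the aggregated record by that partition key.
def lookupByAlternateKey(alternateKey: String): Option[Map[String, AttributeValue]] = {
  val indexRequest = GetItemRequest.builder()
    .tableName("index-by-external-id") // illustrative table name
    .key(Map("external_id" -> AttributeValue.builder().s(alternateKey).build()).asJava)
    .build()

  Option(dynamo.getItem(indexRequest).item())
    .filter(item => item.containsKey("partition_key"))
    .map(_.get("partition_key").s()) // the primary key this index entry points to
    .map { partitionKey =>
      val mainRequest = GetItemRequest.builder()
        .tableName("aggregated-data") // illustrative table name
        .key(Map("partition_key" -> AttributeValue.builder().s(partitionKey).build()).asJava)
        .build()
      dynamo.getItem(mainRequest).item().asScala.toMap
    }
}
```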

Spotting the most expensive link in the chain

Thankfully, AWS provides quite good monitoring metrics and dashboards for its services, and the ones for DynamoDB are no exception: you can see each table’s read/write throughput, latency, and so on.

Using these monitoring tools, we identified that the write throughput of the index tables was very high! This made sense, since our service is totally stateless and therefore each request was processed separately and written to the datastore. We also suspected that the data we wrote to the index tables was fairly similar, i.e., we were writing the same data over and over again.

Combining this knowledge with the AWS cost calculator, we assumed those writes were the main reason our DynamoDB bill had skyrocketed, and we were determined to reduce this cost.

Easy win with an in-memory cache

We decided to add an in-memory write-through cache in front of each index table. We didn’t need much: 250MB of memory should do the trick.

With this solution, each write to the index tables in DynamoDB is first looked up in the cache; on a cache hit, we skip the DynamoDB write.
On a cache miss, we write the data both to DynamoDB and to the cache.

If this works as intended, only the first write is sent to DynamoDB; all subsequent records arriving with the same data are found in the cache and never written.

Data model with cache

Our service is written in Scala, so we used Scaffeine, a Scala wrapper for Java’s Caffeine cache, which gave us great performance, easy usage, and cache performance monitoring. A code sample can be found here.
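To give a flavor of how this fits together, here is a minimal sketch of what the write-through deduplication could look like with Scaffeine. The key and value types, the TTL, the maximum size, and the putIndexItemToDynamo helper are all assumptions for illustration, not our actual implementation:

```scala
import com.github.blemale.scaffeine.{Cache, Scaffeine}

import scala.concurrent.duration._

// Illustrative sketch only: entry types, TTL, and sizing are assumptions.
val indexCache: Cache[String, String] =
  Scaffeine()
    .recordStats()              // exposes hit/miss counters for monitoring
    .expireAfterWrite(12.hours) // assumed TTL
    .maximumSize(1000000)       // rough stand-in for the ~250MB budget
    .build[String, String]()

// Placeholder for the actual PutItem call against the index table.
def putIndexItemToDynamo(alternateKey: String, partitionKey: String): Unit = ()

def writeIndexEntry(alternateKey: String, partitionKey: String): Unit =
  indexCache.getIfPresent(alternateKey) match {
    case Some(`partitionKey`) =>
      () // cache hit with the same mapping: skip the DynamoDB write entirely
    case _ =>
      putIndexItemToDynamo(alternateKey, partitionKey) // cache miss: write to DynamoDB...
      indexCache.put(alternateKey, partitionKey)       // ...and to the cache
  }
```

With recordStats() enabled, indexCache.stats().hitRate() can be reported to a metrics system, which is the kind of cache performance monitoring mentioned above.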

Performance and cost reduction

With just 250MB given to the cache, we reached an amazing cache hit rate of around 75%, and over time it climbed to 80%.

Cache hit rates

Those amazing hit rates are easily explained by our use of Kinesis: records with the same partition key get mapped to the same consumer, where the value is already cached.
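For illustration (the stream name and the publishSignal helper are made up, and this uses the AWS SDK v2 Kinesis client rather than whatever producer code we actually run), the affinity comes from the producer side: records published with the same partition key hash to the same shard, so a single consumer sees all of them.

```scala
import software.amazon.awssdk.core.SdkBytes
import software.amazon.awssdk.services.kinesis.KinesisClient
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest

val kinesis = KinesisClient.create()

// Records sharing a partition key land on the same shard, so the consumer
// owning that shard sees every occurrence of the key, and its cache entry
// is already warm after the first one.
def publishSignal(partitionKey: String, payload: String): Unit =
  kinesis.putRecord(
    PutRecordRequest.builder()
      .streamName("signals-stream") // illustrative stream name
      .partitionKey(partitionKey)
      .data(SdkBytes.fromUtf8String(payload))
      .build()
  )
```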

Our DynamoDB write throughput dropped by more than 80%:

Index table write throughput

And the crown jewel: after two years of climbing unstoppably, our DynamoDB bill finally started to shrink:

Our DynamoDB bill after introducing caching

Other solutions and concerns

The in-memory cache was developed as an easy-win solution. We had no guarantee it would reach the amazing performance it eventually did.

Other solutions we considered were:

  • DAX:
    DAX is a caching layer for DynamoDB, used mostly as a performance boost. It might also reduce some usage, but since it’s a managed service it comes with its own price tag, so we didn’t look into it thoroughly.
  • Redis (or another, cheaper key-value store):
    Since our index tables are, in a way, a key-value store, we considered migrating them to a “cheaper” solution such as Redis (or its AWS-managed version, ElastiCache for Redis).
    This alternative was dropped since it requires adding another infrastructure layer to our stack, which contradicts the “keep it simple” principle. We planned to keep it as a “plan B” in case the caching approach didn’t give us the value we expected.

One more thing worth discussing is the cache size:

To be honest, we picked our cache size on a hunch. We thought about what we could afford without affecting the memory footprint of our process too much. With that in mind, and with some rough calculations based on the available memory on our machines and the average item size we save to DynamoDB, we decided to “start small” with 250MB.
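For a sense of the back-of-the-envelope math (the average entry size here is an assumption, not our measured number):

```scala
// Illustrative sizing only: the per-entry size is assumed, not measured.
val cacheBudgetBytes = 250L * 1024 * 1024 // the 250MB cache budget
val avgEntryBytes    = 256L               // assumed key + value + bookkeeping overhead
val approxEntries    = cacheBudgetBytes / avgEntryBytes
// roughly 1,000,000 cached index entries, comfortably covering the hot key set
```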

We assume that adding more memory would help, but since our hit rates are already high, the extra work isn’t worth it at the moment.

Closing thoughts

Caching is usually the go-to solution for reducing latency and the load on external resources, or for memoizing resource-intensive computations.
Apparently, it can also be used to save some money ;)

I encourage you to take a close look at your services and systems and see where you can save a few coins by caching expensive data or optimizing your usage of third-party services.

As always, if you like what you’ve read, come follow me @cherkaskyb
