Redis as the primary data store? WTF?!
Courtney Couch • 2013-04-08
Redis is a key-value in memory data store typically used for caches and other such mechanisms to speed up web applications. We however store all our data in Redis as our primary database.
The web is abound with warnings and cautionary tales about going this route. There are horror stories about lost data, hitting memory limits, or people unable to effectively manage the data within Redis, so you might be wondering "What on earth were you thinking?!" So here is our story, why we decided to use Redis anyway, and how we overcame those issues.
First of all, I want to stress that most applications shouldn't even worry about the engineering hurdles involved with going this route. It was important for our use case, but we may very well be an edge case.
Redis as a data store
Redis is Fast. When I say Fast, I mean Fast with a capital F. It's essentially memcached with more elaborate data types than just string values. Even some advanced operations like set intersection, zset range requests, are blindingly fast. There's all kinds of reasons to use Redis for fast changing, heavily accessed data. It's used quite often as a cache that can be rebuilt from a backing alternative primary database for this reason. It's a compelling replacement for memcached allowing more advanced caching for the different kinds of data you store.
Like memcached, everything is held in memory. Redis does persist to disk, but it doesn't synchronously store data to disk as you write it. These are the two primary reasons Redis sucks as a primary store:
- You have to be able to fit all your data in memory, and
- If your server fails between disk syncs you lose anything that was sitting in memory.
Due to these two issues Redis has found a really solid niche as a transient cache of data you can lose, rather than a primary data store, making often accessed data fast with the ability to rebuild when necessary.
The drawback to using a more traditional data store behind Redis is that you are stuck with the performance of that store. You are trading performance to be able to trust that your data is persisted to disk. A totally worthwhile tradeoff for almost every application. You can get great read performance, and ok write performance this way. I should clarify that 'ok' by my standard might very well be insanely fast by most people's. Suffice to say that 'ok' write performance should satisfy all but the most heavily loaded applications.
I suppose you could also do something like a write queue on Redis and then persist using the relational store but then you run the same risks of Redis failing and losing that write queue.
What did we need?
Moot is offered as a completely free product. We, therefore, need to be able to push a massive load on a very small amount of hardware. If we need a bunch of large databases for a forum that's pushing a few million users a month, there's no way we can continue to be a free service. Since we want Moot to be both free and unlimited, we really had to optimize to the extreme.
We could have simply avoided this by putting some sort of cap on the free service and charge for page views or posts. I don't know about you, but I generally dislike products that are free "unless you do well." Let's say you set up your forum, then something on your site goes viral. Suddenly, you are slammed with a bill for usage overages past the free tier. Now what started out as glee for the sudden popularity of your conspiracy theory blog turns into horror about the impending bills. You get punished for your success. This is something we wanted to avoid.
We also could have decided to monetize the content through ad placement as well, allowing higher operational costs. This, however, completely violates our core values as a business. In our mind, if anyone is going to place ads on your site, it should be you, not us. Moot should be offered without any conditions, limitations, or strings attached.
Considering all this, unparalleled throughput for posting and reading needed to be accomplished regardless of engineering complexity. It's core to our ability to operate. We had an initial goal that all API calls would be processed in less than 10ms even under heavy load, and even when doing large complex listings or searches. Redis could clearly give us this performance but there's those two big problems: How the heck can we use Redis when we might have hundreds of gigs of data, and what about when a server fails?
What to do now?
Thus began our investigation into ways to design around these limitations. We had a clear idea of what Moot's goals and our values as a company would be from the very beginning so we were lucky to be able to consider these issues before writing a single line of code. I believe that these problems would be prohibitively difficult had we decided on going this route after we already had a lot of legacy code to work with.
All the data's in memory. Shit.
This is the more complex of the two problems. There's a finite amount of memory you can have on a single machine. The largest option on EC2 is a 244GB memory server. While this is still finite, it's a pretty good cap to start with. Unfortunately with this though, your 16 core machine will only use one core with Redis. Well how about adding a Redis slave for each core. Now you are down to 15GB of memory per instance. Damn again! That's not a good cap if you want to be able to saturate the server for performance. That's not a lot of data for a hosted service.
We decided to design our Redis store to be split amongst many different Redis clusters from the beginning. We hash and split data into shards that contain all the relevant structures for that segment of data. The data is heavily sharded from the beginning and we can create more shards as necessary quickly and simply.
To shard data we store tables of hashes and addresses like so:
When some data comes in, we create a hash based on our own internal need to correlate data, we then check in the shards.map to see if that has been assigned to a shard, if so, we can forward our calls to that shard.
If the hash has not been assigned to a shard, we create a list of the available shards, multiplying the number of times a shard appears to match the score. If, for example, we have this:
The list we will generate will look like this:
Then we randomly assign a shard from the list, save it in the map, and continue.
Using this strategy we can easily control how much data goes into shards, add new shards, or even remove shards from consideration as we see fit.
We actually started with hundreds of shards so there is literally no worry about saturating server cores and the memory limitations. Individual shards are kept very small. Single servers hold many shards in Redis databases, and if those shards increase in size we can simply split up the Redis databases into independent instances. Let's say we have a Redis instance with 100 shards, we are starting to see some of the shards in increase in size so we split Redis into two instances with 50 shards each. We can fine tune the weights to keep distribution even between shards in realtime as well.
The most difficult part of this is clearly defining how you segment your data. This is a very application specific problem and our solution for data segmentation is probably a topic for an entirely separate blog post.
This strategy really had to be designed into the app from the beginning. Often people are trying to shard data that wasn't designed that way, and therein lies the rub to using Redis for them. Since we clearly knew the memory limitation would be a problem we could design a solution into the core of how we manage data before we wrote a line of code.
Dealing with failures proved to be a simpler solution (almost laughably so). We have 3 different Redis roles for a cluster.
- Master role where all (almost all) writes go,
- Slave role where all (almost all) reads go, and
- Persistence role which is dedicated to persistence
If you look at our master and slave instances, they operate basically like every other Redis cluster. There's nothing interesting there. What we did differently is that we have 2 servers in each cluster using the "Persistence role." These servers:
- Do not accept any incoming connections and have absolutely no Redis query load aside from simply replicating
- Use AOF persistence every second.
- Do hourly RDB snapshots
- Sync AOF and RDB to S3
Since the performance characteristics for persistence may be quite a bit different, a single persistence server may handle a varying number of shards. We simply run a Redis instance for each shard that needs to be persisted. In other words there is not necessarily a 1:1 relationship between shards and servers with the persistence role.
We have two of these in different availability zones so even if an availability zone fails, we have a working up to date persistence server.
In order for us to lose data in this scenario, it would take a fairly catastrophic failure in EC2 and even then our exposure is still about only about a second of data.
If you are envisioning network partition scenarios where perhaps the master is isolated from the slaves, this is minimized by replication checks to slaves (set an arbitrary key to an arbitrary value and check if the slaves update). If a master is isolated we block writes: Consistent and Partition tolerant at the cost of Availability. Redis Sentinel is also available that could help with a lot of this piece, but Sentinel was out after we had already built much of this. We haven't explored how Sentinel could fit into the equation.
The End Result
In the end we were able to build a system that can fulfill API requests under load at around 2ms.
The 2ms value is using one of our heaviest API calls, the initialization API call. Many of our methods return in far less (likes for example often take 0.6-0.7ms). We're able to sustain a throughput of 1,000 API calls per second per API server and it takes a single API call to render a page view. This measurement includes all our data validation, shard management, authentication, session management, connection handling, JSON serialization, and so on.
The API servers that we are able to push this load with cost a mere $90/month so we're able to support and scale horizontally to support pretty massive loads at a very low cost. The other side benefit of heavily sharded data is that horizontal scaling is trivial.
Much of this isn't ONLY due to these solutions for Redis. There are quite a few tricks we use to keep the system performing under heavily concurrent loads. Another of those tricks has to do with the fact that nearly half of our code is also written in Lua running directly on Redis. This is another thing that people generally say never to do. As far as why and how we have thousands of lines of Lua, that will have to wait until the next post on our Redis usage.
Looking at our real world performance, we launched a couple days ago and had a decent spike in traffic initially. We were servicing some 50 API calls per second and the CPU of our main API server (we're sending all traffic to one for now) sat completely idle. Here's the charts starting from our launch until now (when I'm writing this).
There are a lot of reasons not to use Redis as your primary disk store, but if, for some reason, your use case requires it, you need to prepare from the beginning. You need to design your data around sharding and keep in mind the extra cost of dedicated persistence servers.
I just noticed I am using individual server cost off our internal charts and forgot to add in the amortized one time cost for reserved instances. That number should be $213.