Saturday, 19 May 2012

Adventures in Redis. Part 1

While looking for cool buzz-wordy tech to play around with I investigated an in-memory key/value(or more accurately a key/data structure) store called Redis.
Redis was immediately appealing due to how easy it was to get started, I was able to test the commands from within my web browser due to the fantastic documentation pages. e.g: http://redis.io/commands/hincrby

See the last line in the example that reads 'redis>' ? That's an interactive command prompt!
And installation was a synch:
$ wget http://redis.googlecode.com/files/redis-2.4.13.tar.gz
$ tar xzf redis-2.4.13.tar.gz
$ cd redis-2.4.13
$ make
It didn't take long for me to realize this would be a great fit for a problem a team at work where tackling (and there went my cool side project).
The guys where generating inverted indices over some travel data, sharding it by origin country and storing it in process in a .NET program to be queried via http endpoints. This data would also be updated tens of thousands of times an hour.

Their queries took the shape of a set membership and/or set intersection tests (e.g Where id is in index: 'Origin country: Spain' AND is is in index 'Day of departure: 20121202'). With the goal of finding the cheapest ticket that met all the criteria.

To map this problem to Redis I would take the quote data and also build inverted indices over it using the Redis sorted set, obviously sorting on price.

To check the existence of a ticket to Spain on 2012-12-02 I would ZINTERSTORE (intersect and store) the Spain index and the 20121202 index. This was going to be slow, I knew this as the Redis documentation also gives you the time complexity for all of it's commands.

But the plan was to shard the data into hundreds of shards as we had quite a lot of CPU's at our disposal, my idea was that small data sets intersected in parallel would overcome the speed issue.

Another one of the goals here was to take something out the box that had persistence and replication and apply it as simply as possible.

As a nice side effect of Redis storing the intersection allowed us to build facets against the now reduced set where as the existing solution would build each facet a new.
So we spiked this in just one day, my go to data dude, a python-wizard and my self. During the spike we discovered a couple interesting things about our data:

1) The distribution of our data was horrible, 95%+ of our origin and destinations where in the UK, North America or Russia.

2) The requirements of the project where a bit too optimistic (but I won't go into this)

(chart by Charlie G.)
Point one meant that a query who's origin and destination fell in that 95% head (e.g London to New York) was dealing with almost all of our data.

At the end of the day we where 6-20 times faster than the existing solution (depending on the query) but still not fast enough for production.

Since our spike it has come to light that 50%-90% of our data can be culled and still provide 80%-90% of the functionality. Also the requirements have been loosened up.

With the removal of this excess data, should we revisit the Redis solution we would be able to pre-intersect a LOT of our data (e.g store pre intersected city-to-city, or country-to-country departing-april, etc).
It was a very rewarding experience, working with a couple of really smart guys, learning a lot about our data and a great tool in the form of Redis.

No comments:

Post a Comment