Saturday, 19 May 2012

Initial thoughts: Graph Databases

We've recently had some very interesting tech talks given by our talented developers and guest speakers. The first one was about graph databases (e.g. Neo4j) and how we might use such a tool to store our route data.

After the talk I jotted down some of the ideas discussed, to help get my head around graph databases.

I really liked the thought of storing not only route data but also pricing and user search metrics within the graph.
For example, if a user were to choose Edinburgh as their point of origin, a lot of relevant information about their potential requirements would be within only a couple of hops of the graph's nodes.

In the image (not at all based on real data) we can see a relationship from Edinburgh to Spain representing the fact that a large percentage of searches with Edinburgh as the origin have Spain as the destination (we can also see the price and distance/journey time stored on these relationships).

There is also a relationship from Edinburgh to Glasgow showing that in a large percentage of searches that start with Edinburgh as the origin, the user ends up switching to Glasgow airport as their origin (perhaps due to ticket price or the availability of flights to a given destination). In conjunction with this we can see that Italy is the main destination searched for when Glasgow is the origin.


Lastly, there is a link from Edinburgh to Amsterdam telling us that Amsterdam is the most popular "via" location in a multi-hop journey, and that Dubai is the most popular destination from Amsterdam.

I haven't had a chance to see how such a database would work in reality... but I do like the idea of being able to take a user's origin as soon as it's selected, e.g. Edinburgh, and immediately come back with information like the following (see the sketch after this list):

Flights from Edinburgh to Spain from £30 (journey times around 2 hours)
Flights from Edinburgh to Dubai from £432 (journey times around 10 hours)
Flights from Glasgow to Italy from £51 (journey times around 4 hours)
Train from Edinburgh to Glasgow from £18 (journey time < 1 hour)
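
To make that lookup a little more concrete, here is a minimal sketch in plain Python (not Neo4j); the node names, prices and journey times simply mirror the made-up example above, and the relationship types are my own invention:

# Hypothetical graph: node -> outgoing relationships, each with properties.
# None of these figures are real data.
graph = {
    "Edinburgh": [
        {"to": "Spain",   "type": "POPULAR_DESTINATION", "price": 30,  "hours": 2},
        {"to": "Dubai",   "type": "POPULAR_DESTINATION", "price": 432, "hours": 10, "via": "Amsterdam"},
        {"to": "Glasgow", "type": "ORIGIN_SWITCH",       "price": 18,  "hours": 1,  "mode": "Train"},
    ],
    "Glasgow": [
        {"to": "Italy",   "type": "POPULAR_DESTINATION", "price": 51,  "hours": 4},
    ],
}

def suggestions(origin, hops=2):
    # Walk a couple of hops out from the chosen origin, following
    # origin-switch relationships (e.g. Edinburgh -> Glasgow) into the next hop.
    frontier = [origin]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for rel in graph.get(node, []):
                yield "%s from %s to %s from £%d (journey time ~%d hours)" % (
                    rel.get("mode", "Flights"), node, rel["to"], rel["price"], rel["hours"])
                if rel["type"] == "ORIGIN_SWITCH":
                    next_frontier.append(rel["to"])
        frontier = next_frontier

for line in suggestions("Edinburgh"):
    print(line)

In a real graph database the same traversal would be expressed as a short query rather than hand-rolled dictionary walking, but the shape of the data is the same.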

Adventures in Redis. Part 1

While looking for cool, buzz-wordy tech to play around with, I investigated an in-memory key/value (or, more accurately, key/data-structure) store called Redis.
Redis was immediately appealing because of how easy it was to get started; I was able to test commands from within my web browser thanks to the fantastic documentation pages, e.g. http://redis.io/commands/hincrby

See the last line in the example that reads 'redis>'? That's an interactive command prompt!
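
To give a flavour of that command outside the browser, here's the same kind of thing from redis-py (the key and field names are just made up, in the spirit of the search-metrics idea above):

import redis

r = redis.Redis(host="localhost", port=6379)

# HINCRBY increments an integer field inside a hash, creating both if needed.
# Here we bump a hypothetical per-origin search counter.
r.hincrby("searches:Edinburgh", "destination:Spain", 1)
print(r.hgetall("searches:Edinburgh"))
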
And installation was a cinch:
$ wget http://redis.googlecode.com/files/redis-2.4.13.tar.gz
$ tar xzf redis-2.4.13.tar.gz
$ cd redis-2.4.13
$ make
It didn't take long for me to realize this would be a great fit for a problem a team at work was tackling (and there went my cool side project).
The guys were generating inverted indices over some travel data, sharding it by origin country and storing it in-process in a .NET program to be queried via HTTP endpoints. This data would also be updated tens of thousands of times an hour.

Their queries took the shape of set membership and/or set intersection tests (e.g. where id is in index 'Origin country: Spain' AND id is in index 'Day of departure: 20121202'), with the goal of finding the cheapest ticket that met all the criteria.

To map this problem to Redis I would take the quote data and build inverted indices over it using Redis sorted sets, sorting on price.

To check the existence of a ticket to Spain on 2012-12-02 I would ZINTERSTORE (intersect and store) the Spain index and the 20121202 index. This was going to be slow; I knew this because the Redis documentation also gives you the time complexity of all of its commands.
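
Roughly, the mapping looks like this (a redis-py sketch; the key names, quote ids and prices are made up, and ZINTERSTORE is told to use AGGREGATE MIN so the result stays scored by price rather than by the sum of the two scores):

import redis

r = redis.Redis(host="localhost", port=6379)

# Index each quote id under every attribute it matches, scored by price, so
# every inverted index is itself sorted cheapest-first.
# (redis-py 3.x style: zadd takes a mapping of member -> score.)
def index_quote(quote_id, price, origin_country, departure_day):
    r.zadd("idx:origin:%s" % origin_country, {quote_id: price})
    r.zadd("idx:depart:%s" % departure_day, {quote_id: price})

index_quote("quote:1", 89.0, "Spain", "20121202")
index_quote("quote:2", 45.0, "Spain", "20121202")
index_quote("quote:3", 45.0, "Spain", "20121203")

# Intersect the two indices; AGGREGATE MIN keeps the price as the score
# (the default would sum the scores), so the result is still sorted on price.
r.zinterstore("tmp:spain:20121202",
              ["idx:origin:Spain", "idx:depart:20121202"],
              aggregate="MIN")

# The cheapest ticket meeting all the criteria, if one exists.
print(r.zrange("tmp:spain:20121202", 0, 0, withscores=True))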

But the plan was to shard the data into hundreds of shards, as we had quite a lot of CPUs at our disposal; my idea was that small data sets intersected in parallel would overcome the speed issue.
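
The sharding-plus-parallelism idea might look something like this sketch, assuming (hypothetically) one Redis instance per shard on consecutive ports; each shard intersects its own small indices and we simply take the minimum across the per-shard winners:

from concurrent.futures import ThreadPoolExecutor

import redis

# Hypothetical layout: one Redis instance per shard on consecutive ports; the
# real layout (many databases, many processes, key prefixes) doesn't change the idea.
SHARDS = [redis.Redis(host="localhost", port=6379 + n) for n in range(4)]

def cheapest_in_shard(shard, criteria_keys):
    # Each shard holds a small slice of the indices, so its ZINTERSTORE is cheap.
    shard.zinterstore("tmp:query", criteria_keys, aggregate="MIN")
    hit = shard.zrange("tmp:query", 0, 0, withscores=True)
    return hit[0] if hit else None

def cheapest_overall(criteria_keys):
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        results = pool.map(cheapest_in_shard, SHARDS, [criteria_keys] * len(SHARDS))
    hits = [hit for hit in results if hit is not None]
    return min(hits, key=lambda pair: pair[1]) if hits else None

print(cheapest_overall(["idx:origin:Spain", "idx:depart:20121202"]))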

Another one of the goals here was to take something out of the box that had persistence and replication and apply it as simply as possible.

As a nice side effect, Redis storing the intersection result allowed us to build facets against the now-reduced set, whereas the existing solution would build each facet anew.
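
For example (again just a sketch, with made-up facet key names): once the intersection is sitting in its own key, each facet count is an intersection against that small set followed by a ZCARD, rather than a pass over the full indices.

import redis

r = redis.Redis(host="localhost", port=6379)

# "tmp:spain:20121202" is assumed to hold the stored intersection from the
# earlier step; each facet is intersected against that reduced set only.
def facet_counts(result_key, facet_keys):
    counts = {}
    for facet_key in facet_keys:
        r.zinterstore("tmp:facet", [result_key, facet_key], aggregate="MIN")
        counts[facet_key] = r.zcard("tmp:facet")
    return counts

print(facet_counts("tmp:spain:20121202",
                   ["idx:stops:0", "idx:stops:1", "idx:airline:BA"]))
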
So we spiked this in just one day: my go-to data dude, a Python wizard and myself. During the spike we discovered a couple of interesting things about our data:

1) The distribution of our data was horrible: 95%+ of our origins and destinations were in the UK, North America or Russia.

2) The requirements of the project were a bit too optimistic (but I won't go into this).

(chart by Charlie G.)
Point one meant that a query whose origin and destination fell in that 95% head (e.g. London to New York) was dealing with almost all of our data.

At the end of the day we were 6-20 times faster than the existing solution (depending on the query), but still not fast enough for production.

Since our spike it has come to light that 50%-90% of our data can be culled while still providing 80%-90% of the functionality. The requirements have also been loosened up.

With this excess data removed, should we revisit the Redis solution we would be able to pre-intersect a LOT of our data (e.g. store pre-intersected city-to-city, or country-to-country departing-April indices, etc.).
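
That pre-intersection step could be as simple as the sketch below (key names are, as before, made up): intersections for the hot origin/destination pairs are materialised up front, so a query becomes a single ZRANGE against a ready-made sorted set.

import redis

r = redis.Redis(host="localhost", port=6379)

# Materialise the intersections we expect to be hot ahead of time (hypothetical
# key names), so serving a query needs no intersection work at all.
PRECOMPUTED = [
    ("pre:city:EDI:BCN",      ["idx:origin-city:EDI", "idx:dest-city:BCN"]),
    ("pre:GB:ES:departs-apr", ["idx:origin:GB", "idx:dest:ES", "idx:month:2012-04"]),
]

for dest_key, source_keys in PRECOMPUTED:
    r.zinterstore(dest_key, source_keys, aggregate="MIN")

# Query time: cheapest pre-intersected quote straight from the stored set.
print(r.zrange("pre:city:EDI:BCN", 0, 0, withscores=True))
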
It was a very rewarding experience: working with a couple of really smart guys, learning a lot about our data, and getting to know a great tool in the form of Redis.