Tuesday, 11 September 2012

Experimenting with ZooKeeper


I've recently been reading up on ZooKeeper, with a view to giving a lightning talk on it.

At first glance I found it hard to get a firm grasp on what it is that ZooKeeper actually is/does. When I started telling people I was planning a talk on ZooKeeper it turned out that I was not alone; I got a lot of responses along the lines of "great! I heard Zookeeper was awesome but I have no idea why or what it is".

After having spent some time playing with ZooKeeper in Python and Clojure I'm both very impressed with it and puzzled as to why I didn't get it in the first place...
It's such a simple concept and it is, pretty much, as described on the official web site. I guess I could attribute my and others' confusion to the fact that ZooKeeper is kinda "low level", inasmuch as the site has patterns you can implement to build such things as distributed locks and leader election, but as it stands ZooKeeper is just a tree-based file system for metadata... It just so happens this is the exact foundation you want for these high-level constructs.
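
To make the "tree of metadata" point concrete, here's a minimal sketch using the kazoo Python client against a local ZooKeeper (the paths and values are invented for illustration):

# Minimal sketch: ZooKeeper as a tree-based metadata store.
# Assumes the kazoo client library and a ZooKeeper server on 127.0.0.1:2181.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Build a small tree of znodes, each holding a blob of metadata.
zk.ensure_path("/environments/staging")
zk.create("/environments/staging/db", b"host=db01;port=5432")
zk.create("/environments/staging/cache", b"host=cache01;port=6379")

# Read a node back and list its siblings, just like a tiny file system.
data, stat = zk.get("/environments/staging/db")
print(data.decode(), stat.version)
print(zk.get_children("/environments/staging"))

zk.stop()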

There are some great client libraries for working with ZooKeeper; I've used a handful of them so far, in both Python and Clojure.

The first thing I wanted to try with ZooKeeper was to tackle the messy issue of configuring applications in an environment-specific way.
To this end I built Enclosure, a Python command-line tool that I can point at ZooKeeper and an on-disk directory; it models znodes on the structure of that directory and loads the data from files into those nodes.
I also built a script in Python (Clojure version coming soon) that would allow an application to join an environment, download its configuration and subscribe to updates to that configuration file.
You can find Enclosure here: https://github.com/jdoig/Enclosure
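
The subscribe-to-updates part of such a script comes down to a single data watch. Here is a minimal sketch of the idea, assuming the kazoo client and made-up paths and file names (this is illustrative, not Enclosure's actual code):

# Rough sketch: an app joins an environment and watches its config znode.
from kazoo.client import KazooClient

CONFIG_ZNODE = "/environments/staging/myapp"   # made-up path

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

@zk.DataWatch(CONFIG_ZNODE)
def on_config_change(data, stat):
    # Fired once with the current config and again on every update.
    if data is not None:
        with open("myapp.conf", "wb") as f:    # made-up local file name
            f.write(data)
        print("config updated to version", stat.version)

# The application carries on with its work; the watch keeps the
# local configuration file up to date in the background.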

I was also mulling over a system for caching bulk API calls that involved a two-stage, distributed caching mechanism that would need to communicate amongst the other instances & processes as to whether a call was:
a) not yet started
b) started and streaming into level 2 cache or
c) finished, sorted, cleaned and stored into level 1 cache.
...ZooKeeper's barrier recipe (a distributed lock) fit the bill just perfectly.
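
The coordination piece of an idea like that is only a few lines with a ZooKeeper client. Here's a minimal sketch (kazoo again, with invented znode paths and state names; this is not the actual system) of guarding those three states with a distributed lock:

# Sketch: coordinating bulk-call cache states behind a distributed lock.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

call_id = "bulk-call-42"                       # made-up identifier
state_node = "/cache/state/" + call_id
lock = zk.Lock("/cache/locks/" + call_id, "worker-1")

with lock:                                     # blocks until no other process holds it
    if zk.exists(state_node):
        state, _ = zk.get(state_node)
    else:
        state = b"not-started"
    if state == b"not-started":
        zk.ensure_path(state_node)
        zk.set(state_node, b"streaming-to-l2")
        # ...kick off the API call and stream results into the level 2 cache...
        zk.set(state_node, b"finished-in-l1")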

I'll put my lightning talk video up over the next few weeks.

Sunday, 12 August 2012

Trello Rocks!

Almost a year ago now we started doing daily "stand-ups" and tracking progress on story cards with a view to becoming more agile.

I like the idea of getting everyone together and talking about the tasks at hand and having some way to track that work that is simple and tactile.

The problem is the paraphernalia... You can't move the board easily if you want to relocate, it's not easy to attach documents to the wall, and a lot of people's handwriting is illegible.

I'd spotted Trello on Hacker News some time back and thought it looked like an amazing tool, but attempts to push it at work had met with: "But then we have to track tasks in two places"
(That wall was going nowhere; it was the only remaining artefact of a very expensive agile adoption attempt).

Just recently a handful of our most awesome rockstar developers and I landed a pretty cool "Friday project" that we would be self-managing.

So we went with Trello and it has been nothing short of a joy to work with.

The bulk of that joy comes from the fact that rather than standing round a bunch of cards in a stuffy office we can sit in a coffee shop, scoffing croissants and flicking tasks around on Charlie's iPad (trying not to get jam all over it).

We also have access to it anywhere... If all of a sudden we remember something that was missed, or have a question, we can just bring up Trello on our phones, tablets or laptops and throw stuff in there.

We've also given the client access to our board so, at any time, they can log on and see how things are going, follow the links to our showcase environment, watch videos of walkthroughs, etc.

What are you waiting for? Get you some Trello!: https://trello.com/

Using Java GitHub repos from Clojure

This week, after watching Mike's lightning talk on Bloom and Count-Min Sketch algorithms, I found myself hacking away at a Clojure implementation of a Bloom filter and needing to pull in a Java project from GitHub (a MurmurHash implementation).



Here's how it's done:
  1. Add the following line to ~/.lein/profiles.clj under :user :plugins: [lein-git-deps "0.0.1-SNAPSHOT"] (see the first snippet in the sketch after this list).
  2. From the command line, run: lein deps
  3. In your project's project.clj file add :git-dependencies and :java-source-paths (second snippet below), where :git-dependencies is the GitHub URL of the repo you wish to use and :java-source-paths is the path where Leiningen will find the source code to build (note: it will be downloaded from GitHub into a directory called .lein-git-deps/ by default).
  4. From the command line, run: lein git-deps
  5. Add an import statement to your Clojure files (last snippet below).
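
Roughly, those pieces look like this; the project name, GitHub URL, checkout path, package and class names are placeholders, so substitute the murmur-hash repo you are actually pulling in:

;; Step 1: ~/.lein/profiles.clj (the plugin goes under :user :plugins)
{:user {:plugins [[lein-git-deps "0.0.1-SNAPSHOT"]]}}

;; Step 3: project.clj (the URL and source path here are placeholders)
(defproject bloom-experiment "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.4.0"]]
  ;; the git repo(s) to clone from GitHub
  :git-dependencies [["https://github.com/someuser/murmur-hash-java.git"]]
  ;; where Leiningen should look for the checked-out Java sources
  :java-source-paths [".lein-git-deps/murmur-hash-java/src/main/java"])

;; Step 5: import the Java class in your Clojure namespace
(ns bloom-experiment.core
  (:import [some.package MurmurHash]))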


Saturday, 11 August 2012

My first lightning talk

The tech talks at work have fallen by the wayside since they no longer have an "owner", so during the downtime I've been arranging lightning talks, as they're far more informal and take a lot less time to arrange.

My first talk in front of an audience at work was during this first run of lightning talks. I was joined by four colleagues: Daniel, Mike, Mark & Charlie, who talked about Bundler, Sketch algorithms, Fictionary, and CAP theorem & Riak.

Deciding to be a bit cute, I chose the subject of "Speed" for my first lightning talk and aimed to address the issue of choosing a language for a given project based on the ability to learn it, write it, execute it, run it & fix it.


Speed from James Doig on Vimeo.

Five minutes is not a long time and that is quite an in-depth subject. Well, you live and learn, and hopefully it will serve as a good warm-up for my full-length Clojure talk in a couple of weeks.


Saturday, 19 May 2012

Initial thoughts: Graph Databases

We've recently had some very interesting tech talks given by our talented developers and guest speakers. The first one was regarding graph databases (e.g. Neo4j) and how we might use such a tool to store our route data.

After the talk I was jotting down some of the ideas talked about to help get my head around graph databases.

I really liked the thought of storing not only route data but also pricing and user search metrics within the graph.
For example, if a user were to choose Edinburgh as their point of origin, a lot of relevant information about their potential requirements would be within only a couple of hops of the graph's nodes.

In the image (not at all based on real data) we can see a relationship going from Edinburgh to Spain representing the fact that a large percentage of searches with Edinburgh as an origin have Spain as their destination (we can also see the price and distance/journey time on these relationships).

There is also a relationship going from Edinburgh to Glasgow showing that a large percentage of searches that start with Edinburgh as the origin often end with the user switching to Glasgow airport as their origin (perhaps due to ticket price or availability of flights to a given destination). In conjunction with this, we can see that Italy is the main destination searched for when Glasgow is the origin.


Lastly there is a link from Edinburgh to Amsterdam telling us that Amsterdam is the most popular "via" location in a multi-hop journey and that Dubai is the most popular destination from Amsterdam.

I haven't had a chance to see how such a database would work in reality... But I do like the idea of being able to take a user's origin as soon as it's selected, e.g. Edinburgh, and immediately come back with information like:

Flights from Edinburgh to Spain from £30 (journey times around 2 hours)
Flights from Edinburgh to Dubai from £432 (journey times around 10 hours)
Flights from Glasgow to Italy from £51 (journey times around 4 hours)
Train from Edinburgh to Glasgow from £18 (journey time < 1 hour)
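
To make the "couple of hops" idea concrete without any database at all, here's a toy Python sketch of that graph; the structure mirrors the description above, and the numbers are as made-up as the ones in the image:

# Toy sketch of the graph described above: city/country nodes plus
# attributed edges. No graph database involved, and none of the numbers are real.
graph = {
    "Edinburgh": [
        ("Spain",     "top destination", 30,  "~2h"),
        ("Glasgow",   "origin switch",   18,  "<1h"),
        ("Amsterdam", "popular via",     90,  "~1.5h"),   # price invented
    ],
    "Glasgow":   [("Italy", "top destination", 51,  "~4h")],
    "Amsterdam": [("Dubai", "top destination", 432, "~10h")],
}

def within_two_hops(origin):
    """Everything reachable within a couple of hops of the chosen origin."""
    results = []
    for dest, rel, price, time in graph.get(origin, []):
        results.append((origin, dest, rel, price, time))
        for dest2, rel2, price2, time2 in graph.get(dest, []):
            results.append((dest, dest2, rel2, price2, time2))
    return results

for hop in within_two_hops("Edinburgh"):
    print("%s -> %s (%s) from £%d, journey %s" % hop)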

Adventures in Redis. Part 1

While looking for cool, buzz-wordy tech to play around with I investigated an in-memory key/value (or, more accurately, key/data-structure) store called Redis.
Redis was immediately appealing due to how easy it was to get started: I was able to test the commands from within my web browser thanks to the fantastic documentation pages, e.g. http://redis.io/commands/hincrby

See the last line in the example that reads 'redis>'? That's an interactive command prompt!
And installation was a cinch:
$ wget http://redis.googlecode.com/files/redis-2.4.13.tar.gz
$ tar xzf redis-2.4.13.tar.gz
$ cd redis-2.4.13
$ make
It didn't take long for me to realize this would be a great fit for a problem a team at work were tackling (and there went my cool side project).
The guys were generating inverted indices over some travel data, sharding it by origin country and storing it in-process in a .NET program to be queried via HTTP endpoints. This data would also be updated tens of thousands of times an hour.

Their queries took the shape of set-membership and/or set-intersection tests (e.g. where id is in index 'Origin country: Spain' AND id is in index 'Day of departure: 20121202'), with the goal of finding the cheapest ticket that met all the criteria.

To map this problem to Redis I would take the quote data and build inverted indices over it using Redis sorted sets, obviously sorting on price.

To check for the existence of a ticket to Spain on 2012-12-02 I would ZINTERSTORE (intersect and store) the Spain index and the 20121202 index. This was going to be slow; I knew this because the Redis documentation also gives you the time complexity of all of its commands.
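
Here's a rough sketch of that mapping using the redis-py client (key names, ticket IDs and prices are invented, and the modern dict-style zadd signature is assumed):

# Sketch of the sorted-set inverted indices and the ZINTERSTORE query.
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

# Inverted indices as sorted sets, scored by price (made-up data).
r.zadd("idx:origin:spain",    {"ticket:1": 89.0, "ticket:2": 42.5})
r.zadd("idx:depart:20121202", {"ticket:2": 42.5, "ticket:3": 120.0})

# Intersect the two indices; MIN keeps the actual price as the score.
r.zinterstore("result:spain:20121202",
              ["idx:origin:spain", "idx:depart:20121202"],
              aggregate="MIN")

# Cheapest ticket satisfying both criteria (an empty list means no match).
print(r.zrange("result:spain:20121202", 0, 0, withscores=True))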

But the plan was to shard the data into hundreds of shards, as we had quite a lot of CPUs at our disposal; my idea was that small data sets intersected in parallel would overcome the speed issue.

Another one of the goals here was to take something out of the box that had persistence and replication and apply it as simply as possible.

As a nice side effect, Redis storing the intersection allowed us to build facets against the now-reduced set, whereas the existing solution would build each facet anew.
So we spiked this in just one day: my go-to data dude, a Python wizard and myself. During the spike we discovered a couple of interesting things about our data:

1) The distribution of our data was horrible: 95%+ of our origins and destinations were in the UK, North America or Russia.

2) The requirements of the project were a bit too optimistic (but I won't go into that here).

(chart by Charlie G.)
Point one meant that a query whose origin and destination fell within that 95% head (e.g. London to New York) was dealing with almost all of our data.

At the end of the day we were 6-20 times faster than the existing solution (depending on the query) but still not fast enough for production.

Since our spike, it has come to light that 50%-90% of our data can be culled while still providing 80%-90% of the functionality. The requirements have also been loosened up.

With that excess data removed, should we revisit the Redis solution we would be able to pre-intersect a LOT of our data (e.g. store pre-intersected city-to-city or country-to-country-departing-April sets, etc.).
It was a very rewarding experience: working with a couple of really smart guys, learning a lot about our data, and getting to grips with a great tool in the form of Redis.