Around 2 months ago I started writing code for the Exotel Tech Challenge. The objective was to implement a Redis clone with a small subset of its commands. Redis is a hugely popular in memory data store. The code I have written is nowhere close to matching Redis, but it serves as a conceptual proof the underlying data structures and I/O logic. And not to forget, it passes the test-cases ( which I wrote myself ). I call my in memory data store ( sort of ) VigRedis. VigRedis implements a few commands that Redis supports : GET, SET, GETBIT, SETBIT, ZADD, ZRANGE, ZCARD, ZCOUNT, SAVE.
To start of I would like to mention the limitations of VigRedis.
- No Pipelining – Requests/commands to VigRedis cannot be pipelined. The client has to send a command and wait for the response before it can safely begin transmitting the next one. I avoided this due to the complexities involved in buffer management and parsing required to support this.
- ZADD – Supports only a single score-member pair.
- ZRANGE – Duplicate keys are not sorted in lexicographical order.
- ZCOUNT – Bracket notation is not supported.
- No Eventloop – VigRedis does all its tasks in a predetermined sequence, so occasionally there might be a client whose response will be delayed because some data structure maintenance is being done ( eg : increasing hash table size ).
None whatsoever. VigRedis compiles on any standard Linux distribution with the gcc compiler. No external libraries are required. The test-cases require python-nose to be installed.
Why C ?
In the beginning I had to make a choice as to what language to use to implement VigRedis. Probably if I had chosen Python, I would have written 10 times less code. Python had its powerful built-in data types which would significantly reduce my work. But while starting out I did not have the complete picture of the set of operations that need to be done to support. I knew that the moment I started doing any computationally intensive task using Python, it would imply a serious slow down. With C there is always an assurance that your program will be reasonably fast provided that it is algorithmically sound. More than that I wanted to know if I could build the required data structures from the ground up. Being a Computer Science student, I have been studying about hash tables, linked lists, dynamic arrays for a while now. Being able to implement them in a practical way ( space and time complexity wise ) using nothing but C primitives is an achievement ( in my own little world 🙂 ). And I can claim that I can fully understand what’s going on behinds the scenes.
A few Comparisons
In the beginning I tried comparing my dictionary data structure with Python’s. When Python creates an empty dictionary it allocates 8 slots. I ensured my code did the same ( see this line ). I tried inserting and retrieving a large number of keys. I noticed that my dictionary was faster than Pythons, but only upto a million keys. After 10,00,000 keys, Pythons dictionary kept getting faster than mine. But I wasn’t surprised ( Me v/s Guido van Rossum, Guido van Rossum wins )
Then I ran a few tests over the local network by executing speed.py. The results are shown below. The first execution was with VigRedis listening on the port and the second was with Redis.
Although the test favors me I wouldn’t get too enthusiastic because if the requests were pipelined Redis would surely be the winner. And artificial benchmarks like these rarely reflect real world scenarios.
VigRedis uses two important data structures.
See skip_list.c.These are a slightly modified version of this paper. I came across them while going through Redis‘ source code. A good explanation can be found in this video. SkipLists are ordered by score and hold strings as values.
See set.c. Ordered sets store strings sorted by the assigned score. An Ordered Set internally uses a dictionary and a skiplist. The skiplist orders strings by their scores and the dictionary handles duplicate score-member pairs.
Behind the Scenes
A short summary of how I have handled some important tasks
Simultaneous I/O with multiple clients is implemented using select(). Each client after connections is dynamically allocated a buffer which is scanned for a predefined characters to detect the end of a command. You can find the implementation on this line.
The parsing of commands is done using strtok(). See handler.c.
These commands are handled by a single instance of the dictionary object. See kv_dict.
These commands are handled using set_dict, which is a dictionary mapping strings to ordered set objects.
Saving and loading of data is managed using a naive file format. See dump.c. A line starting with one space is a key-value entry.A line staring with two spaces is a set-score-member entry.
Running unit tests requires python-nose
sudo apt-get install python-nose cd tests/ nosetests -v
The tests cases cover the most basic command usage. The same test cases produce identical results when used with Redis as well.
VigRedis was a tremendous experience. Too bad that it will mostly just lie around github, most likely never to be touched again. It still contains many bugs which might be just lurking around and might have escaped my testing. No matter how buggy or limited it is in terms of its functionality, this is exactly how I feel right now.