So for the past two weeks I have been experimenting with different graph data structures. Me and my mentors are trying to zero down on a structure which is efficient in terms of time as well as space. Although this phase is taking a little longer than I anticipated, the correct design choices here will save us a lot of memory and CPU time in the future. Below I will list out some of our requirements and a look at the plausible solutions
The graph data structure should be able to hold close to 10^5 nodes which is not uncommon for 3D images. This rules out the adjacency matrix representation which would require 10^10 slots. A plausible solution is to use Scipy’s sparse matrices. The degree of each node is the number of nodes adjacent to it, for 2D images it will be close to 4, where as for 3D it will be close to 8. So we can safely assume that the number of edges is comparable to the number of nodes. I am testing 4 different approaches to store a graph, 2 of which are inspired from scipy’s sparse matrix classes. Another thing that had to keep in mind was to allow quick contraction of edges. This is useful for approaches like Hierarchical Clustering.In the testing code I am constructing a RAG for the image and randomly collapsing edges till a fixed number of nodes are left, while monitoring memory usage and speed.
The source code and benchmarking results can be found here. The file being used ( watershed.npy ) is a labeled 3d image using 3d watershed for a 500 x 500 x 500 volume . The 4 approaches I am testing are
LIL (List of Lists)
This is similar to Scipy’s LIL matrix . In a graph of N nodes each node i is assigned two lists. One list holds all the nodes adjacent to i and the other holds the corresponding weights. Instead of using Python list we chose to implement it using Numpy arrays, and used Cython to speed up some of the graph operations.
This approach uses the least storage, but it is also the slowest among what I have tested so far, because construction involves moving around a lot of memory.(See graph_lil.py and rag_lil.pyx)
Custom Graph Class
In this case for a graph of N, for each node i , a dictionary is maintained which maps its neighbors to the corresponding weights. This has higher memory usage that LIL for CSR approach , because dictionaries store keys as well as values and due to the small load factor od Python dictionaries. However, graph construction is fast, since it does not involve moving around a lot of memory.The overall time taken is fastest in this case (See graph_custom.py and rag_custom.pyx )
Networkx Graph Class
This class inherits from networkx.Graph. The only extra code that I had to write in this case was to contract an edge. Although the memory usage is the highest, the time taken for randomly merging nodes is about 20 times faster. ( See graph_nx.py and rag_nx.py )
CSR Graph Class
This is currently work in progress. It will hold it’s data in the same manner as scipy’s csr_matrix. However to handle edge contraction a new node will be created, and it’s information will be appended towards the end. To accommodate for this all the internal arrays will be dynamically resized, doubling their size when required.