Storage

One primary performance concern for a graph is that typical graph processing are IO intensive and require random data access. Approaches that scale must partition and use RAM.

Graph processing has two general classes: online query processing and offline analytics. The performance requirements for each class differ greatly and are typically treated differently in implementation (Neo4J vs Giraph, for example).

A graph can be thought of as a collection of entities (AKA cells) that can be considered edges or vertices. A typical graph implementation requires that each cell describe its topology and properties to achieve “index free adjacency.” Index free adjacency can be generally understood to mean "locality of relevant information." This allows relations between vertices to be matched in O(1) time.

The desired characteristics for a backing store for graph cells are as follows:
  • Ability to blob binary data that represents a highly compact cell with embedded relation information
  • Ability to use RAM as the primary storage for cell information
  • Ability to handle cells of varying size
  • Ability to scale across multiple nodes and appear as a (mostly) unified store

Relational databases resolve relations using global indices which typically execute in O(log n) time. This doesn't eliminate an RDBMS as a backing store for a graph however, if the RDBMS has the desired characteristics for graph processing.

SQL Server 2014 Hekaton offers these characteristics, and provides many out of box capabilities for availability and data preservation that are desirable and hard to implement. Redis is another interesting potential backing system that appears to be getting traction on the Windows platform.

The horizontal scaling capabilities required of the graph engine means application level sharding of cell data must be accounted for in design regardless of backing implementation.

Design Principles for Storage

  • Abstract the underlying storage using a repository abstraction (via class inheritance due to interface indirection costs).
  • A non-COTS product implementation will be provided to maintain OSS purity.
  • Processing will follow the model of sending instruction to local cell storage to execute, instead of retrieving data from remote partitions for centralized processing (Map-reduce model).

Organization of Storage

The following image shows a 2 level partition scheme where one or more RAM Trunks are mapped to a node: Partitioned Graph Design.png
The idea of a RAM Trunk is to define a manageable unit of recovery and replication. Additionally, heuristics that match new cells to related cells to optimize OLTP processing are scoped to the RAM Trunk. There are no expected guarantees that RAM Trunks will be bound to specific nodes. To resolve the location of a cell, the trunk must be resolved via global index, consistent hashing, or rendezvous hashing.

raw design notes
Global Shared Constructs
CAP
Cell Format
Query Processing

Last edited Mar 2, 2015 at 10:29 AM by bkjuice, version 11