Cassandra System at Facebook

Cassandra is a storage system developed at Facebook to provide scalability and high availability for the inbox storage problem. It had to handle more than a billion write operations per day and continue to scale with the number of users, whose requests are served by data centers distributed across the globe.


Figure 1 Cassandra Symbol

In order to keep search latencies down, the system has to replicate data across these data centers. Facebook runs Cassandra as the back-end storage system for multiple services.

Distributed file systems have hierarchical namespaces. Existing systems support disconnected operation and are resilient to common issues such as outages and network partitions. Conflict resolution differs from system to system: Coda and Ficus perform system-level conflict resolution.

Bayou, by contrast, allows application-level conflict resolution. Traditional relational databases aim to guarantee the consistency of replicated data. Amazon uses the Dynamo storage system for storing and retrieving user data; it maintains membership information with a gossip-based protocol, detects conflicts with a vector clock scheme, and prefers client-side conflict resolution. In systems that must handle a very high write throughput, Dynamo can be at a disadvantage because a read is required to manage the vector timestamps.

Cassandra is a non-relational database. A table is a distributed multi-dimensional map indexed by a key, and the value the key points to is a highly structured object. The row key is a string with no size restrictions, although it is typically 16 to 36 bytes long.

As in the Bigtable system, columns are grouped together into sets called column families. Column families are of two types:


1) Simple column families

These are ordinary column families.

2) Super column families

A super column family is a column family within a column family. The sort order of columns can be specified, either by name or by time. An inbox usually displays messages in time-sorted order, and Cassandra supports this directly because it can sort columns by time; results for inbox searches can therefore be returned in a time-sorted manner. Both layouts are sketched below.
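To make the two layouts concrete, here is a minimal sketch (not Cassandra code) that models a simple and a super column family as nested, sorted Java maps; the row keys, column names and values are made up for illustration.

    import java.util.Map;
    import java.util.TreeMap;

    public class DataModelSketch {
        public static void main(String[] args) {
            // Simple column family: row key -> (column name -> value).
            // A TreeMap keeps the columns sorted (here, by column name).
            Map<String, Map<String, String>> simpleFamily = new TreeMap<>();
            Map<String, String> profileColumns = new TreeMap<>();
            profileColumns.put("displayName", "Alice");
            profileColumns.put("location", "Palo Alto");
            simpleFamily.put("user:42", profileColumns);

            // Super column family: row key -> (super column -> (column -> value)).
            // For an inbox, the super column could be a conversation and the inner
            // columns could be messages keyed by timestamp, i.e. time sorted.
            Map<String, Map<String, Map<String, String>>> superFamily = new TreeMap<>();
            Map<String, String> messagesByTime = new TreeMap<>();
            messagesByTime.put("1618033988", "msg-1001");
            messagesByTime.put("1618034100", "msg-1002");
            Map<String, Map<String, String>> conversations = new TreeMap<>();
            conversations.put("conversation:7", messagesByTime);
            superFamily.put("user:42", conversations);

            System.out.println(simpleFamily);
            System.out.println(superFamily);
        }
    }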

The syntax used to access a column within a column family is column_family:column.

For a column within a super column family it is column_family:super_column:column.

Typically a Cassandra cluster is used as part of an application and managed as part of a service. Most deployments have just one table in their schema, although Cassandra does support the notion of multiple tables.

The Cassandra API consists of the following three basic operations:

  1. insert (table, key, rowMutation)
  2. get (table, key, columnName)
  3. delete (table, key, columnName)

Here columnName can refer to a specific column within a column family, a column family, a super column family, or a column within a super column.
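As a rough illustration of these three calls, here is a hypothetical Java interface sketch; it mirrors the description above rather than Cassandra's actual Thrift API, and the ColumnPath and RowMutation types are assumptions introduced only for this example.

    import java.util.List;

    /**
     * Hypothetical shape of the three-call API described above.
     * Not the real Thrift interface; all types here are illustrative.
     */
    public interface CassandraApi {

        /** A column path such as "column_family:column" or
         *  "column_family:super_column:column". */
        record ColumnPath(String columnFamily, String superColumn, String column) {
            @Override
            public String toString() {
                return superColumn == null
                        ? columnFamily + ":" + column
                        : columnFamily + ":" + superColumn + ":" + column;
            }
        }

        /** A batch of column changes applied to a single row. */
        record RowMutation(List<ColumnPath> paths, List<byte[]> values) { }

        void insert(String table, String key, RowMutation rowMutation);

        byte[] get(String table, String key, ColumnPath columnName);

        void delete(String table, String key, ColumnPath columnName);
    }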

The architecture of a production storage system involves many complicated concerns, including configuration management, robustness and scalability. This document concentrates on Cassandra's primary features: partitioning, replication, membership, failure handling and scaling.

A read or write request for a key can be routed to any node in the cluster. For writes, the system routes the request to the replicas of that key and waits for a quorum of them to acknowledge the completion of the write.

Reads are handled differently: depending on the consistency guarantee required, the system either routes the request to the nearest replica or routes it to all replicas and waits for a quorum of responses.
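The following is a minimal, self-contained sketch of the quorum idea for writes, assuming N replicas and a majority quorum of N/2 + 1; it is not Cassandra's implementation, and the names, timeout and fake replicas are illustrative only.

    import java.util.List;
    import java.util.concurrent.*;

    /** Minimal sketch of quorum-based write acknowledgement (illustrative only). */
    public class QuorumWriteSketch {

        /** Sends the write to every replica and returns once a majority has acked. */
        static boolean writeWithQuorum(List<Callable<Boolean>> replicaWrites)
                throws InterruptedException {
            int n = replicaWrites.size();
            int quorum = n / 2 + 1;                        // majority of N replicas
            ExecutorService pool = Executors.newFixedThreadPool(n);
            CountDownLatch acks = new CountDownLatch(quorum);
            try {
                for (Callable<Boolean> write : replicaWrites) {
                    pool.submit(() -> {
                        try {
                            if (Boolean.TRUE.equals(write.call())) {
                                acks.countDown();          // one replica confirmed the write
                            }
                        } catch (Exception ignored) {
                            // a failed replica simply never acknowledges
                        }
                    });
                }
                // Wait (with a timeout) for a quorum of acknowledgements.
                return acks.await(2, TimeUnit.SECONDS);
            } finally {
                pool.shutdownNow();
            }
        }

        public static void main(String[] args) throws InterruptedException {
            // Three fake replicas: two succeed, one fails; a quorum of 2 is still reached.
            List<Callable<Boolean>> replicas = List.of(
                    () -> true,
                    () -> true,
                    () -> { throw new RuntimeException("replica down"); });
            System.out.println("quorum reached = " + writeWithQuorum(replicas));
        }
    }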

Partitioning

The ability to scale incrementally is a critical feature of Cassandra, and it is provided dynamically: data is partitioned across the storage hosts in the cluster. Partitioning is done with consistent hashing, using an order preserving hash function.


Consider the consistent hashing approach. The output range of the hash function is treated as a circular ring: the largest hash value wraps around to the smallest. Each node is assigned a random value in this space, which represents its position on the ring. The key provided by the application is hashed to a position on the ring, and Cassandra uses that position to route the request: the first node encountered when walking the ring clockwise from the key's position handles it, so each node is responsible for the ring region between itself and its predecessor.

The main benefit of this approach is that the departure or arrival of a node affects only its immediate neighbours; all other nodes are unaffected.
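A minimal sketch of this ring lookup is shown below, using a sorted map to hold node positions; the hash function and node names are placeholders, and Cassandra's actual order preserving hash function differs.

    import java.util.SortedMap;
    import java.util.TreeMap;

    /** Minimal consistent-hashing sketch: nodes sit at positions on a ring and
     *  a key is served by the first node clockwise from the key's hash. */
    public class ConsistentHashRing {

        private final TreeMap<Long, String> ring = new TreeMap<>();

        /** Illustrative hash; Cassandra's actual hash function differs. */
        private static long hash(String s) {
            return Integer.toUnsignedLong(s.hashCode());
        }

        void addNode(String node) {
            ring.put(hash(node), node);
        }

        void removeNode(String node) {
            ring.remove(hash(node));
        }

        /** First node at or after the key's position, wrapping around the ring. */
        String nodeFor(String key) {
            SortedMap<Long, String> tail = ring.tailMap(hash(key));
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }

        public static void main(String[] args) {
            ConsistentHashRing ring = new ConsistentHashRing();
            ring.addNode("nodeA");
            ring.addNode("nodeB");
            ring.addNode("nodeC");
            System.out.println("inbox:alice -> " + ring.nodeFor("inbox:alice"));
            // Removing a node only moves the keys that node owned to its neighbour.
            ring.removeNode("nodeB");
            System.out.println("inbox:alice -> " + ring.nodeFor("inbox:alice"));
        }
    }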

This basic approach has some difficulties, however.

The random assignment of node positions around the ring leads to non-uniform data and load distribution, and the basic algorithm ignores differences in node performance.

Replication

To achieve high availability and durability, Cassandra replicates data: each data item is replicated at N hosts. Every node is aware of every other node in the system, and therefore of the ranges each is responsible for.

Each row is replicated across multiple data centers, which are connected through high-speed network links.
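Building on the ring picture above, the following sketch shows one plausible "rack unaware" placement: the coordinator node for a key plus the next N-1 distinct nodes walking the ring clockwise hold the replicas. The token values and node names are invented for the example.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    /** Sketch of rack-unaware replica placement: the coordinator node for a key
     *  plus the next N-1 distinct nodes walking the ring clockwise. */
    public class ReplicaPlacementSketch {

        static List<String> replicasFor(TreeMap<Long, String> ring, long keyPosition, int n) {
            List<String> replicas = new ArrayList<>();
            // Start at the first node at or after the key's position.
            Long pos = ring.ceilingKey(keyPosition);
            if (pos == null) pos = ring.firstKey();        // wrap around the ring
            while (replicas.size() < n && replicas.size() < ring.size()) {
                String node = ring.get(pos);
                if (!replicas.contains(node)) replicas.add(node);
                pos = ring.higherKey(pos);                 // next position clockwise
                if (pos == null) pos = ring.firstKey();    // wrap around again
            }
            return replicas;
        }

        public static void main(String[] args) {
            TreeMap<Long, String> ring = new TreeMap<>();
            ring.put(10L, "nodeA");
            ring.put(40L, "nodeB");
            ring.put(70L, "nodeC");
            ring.put(90L, "nodeD");
            // A key hashing to position 55 is coordinated by nodeC; with N = 3 the
            // replicas are nodeC, nodeD and (wrapping around) nodeA.
            System.out.println(replicasFor(ring, 55L, 3));
        }
    }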

Bootstrapping

When a node joins the cluster for the first time, it reads its configuration file to find the initial contact points within the cluster; these are known as seeds. The seeds can also be provided by a configuration service such as Zookeeper.

Scaling the Cluster

When a new node is added to the system, it is assigned a token whose purpose is to relieve a heavily loaded node: the new node takes over part of the range that the heavily loaded node was previously responsible for. These operations can be performed from a web dashboard or through a command line utility.
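As a toy illustration of the idea, the sketch below picks a token for the new node at the midpoint of the range owned by the most heavily loaded node, so that the newcomer inherits roughly half of that node's keys; the token values are arbitrary and the heuristic is deliberately simplified.

    import java.util.TreeMap;

    /** Sketch of choosing a token for a new node so it takes over roughly half
     *  of the range owned by the most heavily loaded node (illustrative values). */
    public class TokenAssignmentSketch {

        public static void main(String[] args) {
            // Existing ring: token -> node.
            TreeMap<Long, String> ring = new TreeMap<>();
            ring.put(100L, "nodeA");
            ring.put(400L, "nodeB");
            ring.put(900L, "nodeC");

            // Suppose nodeC (range 400..900) is the most heavily loaded node.
            long loadedToken = 900L;
            Long predecessor = ring.lowerKey(loadedToken);
            if (predecessor == null) predecessor = ring.lastKey();  // wrap-around case, not exercised here

            // Give the new node the midpoint of that range so it inherits about
            // half of nodeC's keys; the other nodes are untouched.
            long newToken = predecessor + (loadedToken - predecessor) / 2;
            ring.put(newToken, "nodeD");

            System.out.println("new token for nodeD = " + newToken);
            System.out.println("ring = " + ring);
        }
    }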


Local Persistence

Cassandra relies on the local file system for data persistence, and the data is represented on disk in a format that allows efficient retrieval. A typical write operation involves two steps: a write into a commit log, for durability and recoverability, and an update of an in-memory data structure. The write to the in-memory structure is performed only after the write to the commit log has succeeded.
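The sketch below captures this ordering in miniature: append to a commit log on disk first, then update an in-memory sorted structure standing in for the memtable. File names and classes are assumptions for illustration, not Cassandra's actual storage engine.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.*;
    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;

    /** Minimal sketch of the local write path: commit log first, memtable second. */
    public class LocalWriteSketch {

        private final Path commitLog;
        // In-memory, sorted structure (a stand-in for the memtable).
        private final Map<String, String> memtable = new ConcurrentSkipListMap<>();

        LocalWriteSketch(Path commitLog) {
            this.commitLog = commitLog;
        }

        void write(String key, String value) throws IOException {
            // 1. Append to the commit log for durability and crash recovery.
            //    (Reopening the log per write keeps the sketch short; a real
            //    engine would keep the log open and batch its writes.)
            try (BufferedWriter log = Files.newBufferedWriter(commitLog,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                log.write(key + "=" + value);
                log.newLine();
            }
            // 2. Only after the log write succeeds, update the in-memory structure.
            memtable.put(key, value);
        }

        public static void main(String[] args) throws IOException {
            LocalWriteSketch store = new LocalWriteSketch(Paths.get("commit.log"));
            store.write("user:42:displayName", "Alice");
            System.out.println("memtable = " + store.memtable);
        }
    }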

Implementation Details

The Cassandra process on a single machine primarily consists of the partitioning module, the cluster membership and failure detection module, and the storage engine module. These modules rest on an event-driven substrate in which the message processing pipeline and the task pipeline are split into multiple stages, along the lines of the SEDA architecture. All modules are implemented from scratch in Java, and the cluster membership and failure detection module is built on top of a network layer that uses non-blocking I/O.

A few lessons were learned while maintaining Cassandra. The main one is that a new feature should not be added without understanding its implications for the rest of the system. A few illustrative scenarios are listed below:

  1. 7 TB of inbox data needed to be indexed for over 100M users. It was extracted, transformed and loaded into Cassandra using MapReduce jobs. The Cassandra instance became a bottleneck on network bandwidth, since the jobs were sending serialized data to it over the network.
  2. Most applications require only atomic operations per key per replica.

This document has described the features, architecture and implementation of a storage system, including partitioning, replication, bootstrapping, scaling, persistence and durability, from the perspective of Cassandra [1], which provides all of these.

[1] Avinash Lakshman and Prashant Malik (Facebook), "Cassandra – A Decentralized Structured Storage System".
