NoSQL databases: Voldemort
NoSQL
NoSQL databases refer to database management systems (DBMS) that differ from traditional relational database management systems (RDBMS) in some significant way. NoSQL databases avoid join operations and normally scale out horizontally.
Here are some of the advantages of NoSQL databases.
Advantages
- Elastic scaling: NoSQL databases scale out horizontally. Usually we can scale a NoSQL database simply by adding low-cost commodity servers.
- Big data: NoSQL databases are normally designed to handle huge amounts of unstructured data. Social networking sites such as Facebook and LinkedIn use NoSQL systems for their operations.
- Less DBA requirement: NoSQL databases require less intervention from database administrators. With designs such as read-repair, automatic data distribution and replication, fewer administrative tasks are left for the DBA.
- Flexible data models: NoSQL systems such as MongoDB are document-based databases. They do not have any inherent schema requirement, so the schema of the data can change on the fly.
However, NoSQL comes with some challenges which should be kept in mind.
Challenges
- Maturity: NoSQL databases have only recently started gaining momentum. There are not many experts who know these systems inside and out, and most of the databases are not yet mature.
- Support: NoSQL databases are mostly open-source and built through community effort. Therefore they may lack quick support, customer care, etc.
- Administration: Installing NoSQL databases requires good technical skills, and maintaining these systems can be tedious.
- Expertise: There are few experienced NoSQL developers, and most of them are still in the learning phase.
Categorization
NoSQL implementations can be categorized by their manner of implementation.
Here are the main categories, each with an example implementation.
- Document store : MongoDB
- Graph : Neo4j
- Key-value store : Voldemort
- Wide-column : Cassandra
Voldemort Introduction
Voldemort is a distributed key-value storage system.
Its main characteristics are:
- Automatic replication of data over multiple servers.
- Automatic partitioning of data, so each server contains only a subset of the total data.
- Transparent handling of server failure.
- Pluggable serialization support.
- Versioning of data items to maximize data integrity.
- No central point of failure, as each node is independent of the others.
- Pluggable data placement strategies, to support things like distribution across geographically distant data centers.
Voldemort is used at LinkedIn.
Design
For availability and high-performance reasons, only trivial key-value data access is permitted. The supported queries are:
* value = storeClient.get(key)
* storeClient.put(key, value)
* storeClient.delete(key)
Both keys and values can be simple or complex objects such as lists or maps. The keys and values are serialized before being stored.
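Here is a minimal sketch of these three operations using the Java client API (a sketch only; it assumes a server with a store named 'test' is running locally, matching the configuration shown later in this article):

// Minimal sketch of the supported operations (assumes a local server
// with a store named "test"; see the configuration section below).
StoreClientFactory factory = new SocketStoreClientFactory(
        new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
StoreClient<String, String> storeClient = factory.getStoreClient("test");

storeClient.put("some-key", "some-value");              // create or update
Versioned<String> value = storeClient.get("some-key");  // read
storeClient.delete("some-key");                          // delete
factory.close();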
Though this query format means that there can be no complex query filters, no foreign key constraints, no triggers, and that joins must be done in code, there are several advantages:
- Easy distribution across cluster
- Ready-made caching layer
- Predictable performance as only queries listed above are possible
- To keep services loosely coupled, we have to do database joins in code anyway.
- A serialized format is required for performance anyway.
- Clean separation of logic and storage.
Architecture of the System
Logical Architecture of the System
As we can see, the logical architecture of the system is a layered architecture. Each layer performs its own singular task, such as serialization, failover, or interacting with the underlying storage engine. For instance, the serialization layer handles the work of translating objects to byte arrays. Suppose I have a key-value pair to be stored in a Voldemort ‘store’, where the key is a string and the value is a complex Java object. Then I have to declare, in the stores.xml file in the config directory of Voldemort, that the serialization for the key is string and the serialization for the value is java-serialization. The appropriate classes are then used to perform the serialization.
The beauty of the layered architecture is that we can mix and match different layers and adapt the architecture to a specific use case. For instance, we can compress the serialized data before transferring it over the network, by adding a ‘compression layer’ after serialization. Likewise, the reverse path can be adapted for decompression and deserialization.
Also data can be intelligently routed to its appropriate partition. This can be achieved by following three approaches:
* 3-Tier, Server Routed: Partition-aware routing is done on the server side.
* 3-Tier, Client Routed: Partition-aware routing is done on the client side.
* 2-Tier, Front-end Routed: The client needs to be very smart and handles the routing of data to its appropriate partition itself. Typically the client needs to be closely tied to the implementation details of the database (e.g. written in Java, using the Voldemort libraries).
The following is a pictorial representation of these options:
As we can see from the figure, fewer hops are required as we move the routing intelligence up the stack. Performance is normally degraded by network hops and disk access. As mentioned above, we can use the flexibility of the architecture to eliminate network hops. Disk access can be avoided by partitioning the data and, wherever possible, caching it.
Where can we find this in the source code?
In the voldemort.store.Store interface we can see that it exposes basic operations such as get, put and delete. The voldemort.store package has several sub-packages, in which we find the implementation of a different piece of functionality for each store; this is what provides the functional layering (a simplified sketch of the pattern follows the list below). Some of the packages here are:
* voldemort.store.routed : This implementation handles routing to nodes.
* voldemort.store.serialized : This implementation handles converting objects to byte array.
* voldemort.store.memory : Implementation of in-memory storage engine.
* voldemort.store.versioned : When a value is ‘put’ more than once for a key, its version is incremented. The implementation lies in this package.
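The layering follows a decorator style: each layer implements the same store interface, wraps an inner store, and adds one responsibility before delegating. The sketch below is a simplified illustration of the pattern; the SimpleStore interface here is a hypothetical, reduced version of voldemort.store.Store, which in reality also carries versioning information.

// A reduced, hypothetical store interface, shown only to convey the pattern.
interface SimpleStore {
    byte[] get(byte[] key);
    void put(byte[] key, byte[] value);
}

// Each "layer" wraps an inner store and adds one responsibility.
class LoggingStore implements SimpleStore {
    private final SimpleStore inner;

    LoggingStore(SimpleStore inner) {
        this.inner = inner;
    }

    public byte[] get(byte[] key) {
        System.out.println("get: " + key.length + "-byte key");
        return inner.get(key); // delegate the real work downwards
    }

    public void put(byte[] key, byte[] value) {
        System.out.println("put: " + value.length + "-byte value");
        inner.put(key, value);
    }
}

Stacking such wrappers (serialization over routing over storage) is exactly the mix-and-match flexibility described above.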
The main server-side code can be found in the package voldemort.server, where the main class is VoldemortServer. It is responsible for bootstrapping all the needed services, like the storage service, HTTP service, etc. The following are the different service types in Voldemort:
public enum ServiceType {
    HTTP("http-service"),
    SOCKET("socket-service"),
    ADMIN("admin-service"),
    JMX("jmx-service"),
    SCHEDULER("scheduler-service"),
    STORAGE("storage-service"),
    VOLDEMORT("voldemort-server"),
    ASYNC_SCHEDULER("async-scheduler"),
    GOSSIP("gossip-service"),
    REBALANCE("rebalance-service");

    private final String display;

    private ServiceType(String display) {
        this.display = display;
    }

    public String getDisplayName() {
        return this.display;
    }
}
The client-side code resides in the voldemort.client package. The StoreClient class is the main interface that the user deals with.
Partitioning of data and Replication
The first question we need to ask is: why do we need to partition the data? Why can’t we have all the data on one disk? The answer is that if we had the data on a single disk or a single server, it would be a single point of failure: if that server goes down, all our data is lost. Nowadays data is extremely valuable, so it is very important to keep multiple copies of it rather than keeping all our eggs in one basket, or all the data in one location.
Partitioning also helps improve performance. We can understand this as follows. Suppose one node contains the entire data-set, that is, it is a “hot” node. If multiple concurrent queries hit that node, performance suffers and responses are slow. On the other hand, if we split the data into multiple locations/partitions and we know which partition the requested data belongs to, each partition will mostly carry a fair share of the load.
So let’s say we partition the data into S partitions (one partition per server) and we store a redundant copy of each key K on R servers. We can associate the key K with its R servers using modulo hashing:
a = K mod S
and then the values can be stored on servers a, a+1, …, a+(R-1).
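In Java, this placement rule is just a few lines (an illustrative sketch):

// Naive modulo placement: with S servers and replication factor R,
// copies of key K land on servers a, a+1, ..., a+(R-1), wrapping mod S.
static int[] serversFor(int K, int S, int R) {
    int a = Math.floorMod(K, S);
    int[] servers = new int[R];
    for (int i = 0; i < R; i++) {
        servers[i] = (a + i) % S;
    }
    return servers;
}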
This scheme is useful because we can determine the location of data given its key K; this question can be answered by any of the peers, and there is no need for a central server which stores the mapping of keys to server locations. However, its downside appears when a server is added or removed. Let’s understand this with an example:
Problem with hashing in a cluster:
Modulo hashing: K mod n, with n = 10 nodes (numbered 0..9).
Let’s put 100 documents on them: 18 mod 10 = 8, so document 18 lands on node 8, etc.
Say one node dies, so n = 9.
Now, if you want document 18: 18 mod 9 = 0, but node 0 does not have it; it is on node 8!
Solution: you need to re-hash all the values.
That is expensive when you have 100 petabytes of data.
Consistent hashing helps here.
Let’s visualize consistent hashing with the help of the diagram above. In consistent hashing, the nodes reside on an imaginary ring which is split into, say, 2^31 partitions. A hash function maps each key onto a point on the ring, and we pick the R unique nodes responsible for that key by traversing the ring in the clockwise direction.
Thus, in consistent hashing, when a server is removed or added, only the keys adjacent to it on the ring move, and the load is automatically rebalanced between the servers. A minimal sketch of the ring lookup follows the list below.
Important points in Consistent Hashing
- There are 2^m keys in an m-bit key space.
- Keys ordered in a ring topology
- In case of server failure, it minimizes re-balancing or rehashing of keys between nodes.
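Here is a minimal sketch of the ring lookup (illustrative only; this is not Voldemort’s actual routing code):

import java.util.SortedMap;
import java.util.TreeMap;

// Nodes are hashed onto a ring; a key is served by the first node found
// clockwise from the key's hash (the next R-1 distinct nodes would hold
// the replicas).
class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<Integer, String>();

    void addNode(String node) {
        ring.put(node.hashCode(), node); // toy hash; real systems use e.g. MD5
    }

    String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(key.hashCode());
        // Wrap around to the start of the ring if we run off the end.
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }
}

Removing a node only moves the keys that hashed between it and its predecessor; all other keys stay put. Real implementations also place several ‘virtual nodes’ per server on the ring to even out the load.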
Important points of Replication
- Ensures durability and high availability of data.
- Replication strategy: data is persisted not only on node A, but also on the next N nodes in clockwise order, where N is the replication factor.
- So, when node A is down, the request is handed to the next node in the ring.
Where can I find this in the source code?
The configuration file cluster.xml holds all information regarding the nodes present in the cluster and the partitions each maps to. The server.properties file contains the individual node id which is used in the cluster.xml file.
The package voldemort.client.rebalance includes code related to rebalancing the cluster. The class RebalanceController is the main class and acts as the central controller for making decisions regarding rebalancing the cluster.
Data Model and Serialization
As we have seen, Voldemort supports simple operations such as put, get and delete. Both keys and values can be simple or complex objects. Serialization means translating an object into a byte array for transmission over the network. Pluggable serialization is a very good feature of Voldemort: it allows one to implement one’s own serializer.
The following types of serialization are supported:
* json
* string
* java-serialization
* protobuf
* thrift
* avro
* identity
Let’s look at the json type in more detail.
JSON Serialization
The data should be easily translated between the following different states:
Object ↔ Network transmission ↔ Text Representation
JSON is widely used in the industry as it supports common data types across various programming languages. It does not have an inherent schema; however, we can specify a schema by defining the ‘type’ of each field. The ‘type’ can be ‘int32’, ‘string’, etc.
For a store containing a simple ‘User’ object, we could define the type as:
{"firstname":"string", "lastname":"string", "id":"int32"}
In this case the Java code will return a Map<String, Object> containing each of the keys which were specified, along with the corresponding values.
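For example, this schema could be declared for a store in stores.xml like so (a sketch; the surrounding store definition is omitted):

<value-serializer>
  <type>json</type>
  <schema-info>{"firstname":"string", "lastname":"string", "id":"int32"}</schema-info>
</value-serializer>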
Where can I find this in the source code?
The package voldemort.serialization contains implementation for the various serializations supported.
Some of the sub-packages here are:
* voldemort.serialization.avro
* voldemort.serialization.json
* voldemort.serialization.protobuf
* voldemort.serialization.thrift
Consistency and Versioning
In a read-only database, fetched data is always consistent, since there are no updates. In a traditional RDBMS, consistency is maintained by using transactions: at any given time, only one process is allowed to change a given row.
In a distributed world, the data can reside on many servers and there can be multiple replicas of the data. When an update is made, all the copies of the data should end up with the same value. This is possible using a distributed transaction, but it is very slow.
Another approach is to tolerate a little inconsistency. According to the CAP theorem, to achieve both availability and partition tolerance we must relax consistency. Such AP systems are known as eventually consistent. The following figure gives an overview of NoSQL databases positioned according to the CAP theorem.
What is eventual consistency? Eventual consistency means that, over a period of time, an update to the data will reach all the nodes in the cluster.
Approaches such as two-phase commit and Paxos-style consensus prevent permanent inconsistency. Voldemort uses read repair for reaching consistency. Let’s understand read repair:
The read-repair strategy means that all the nodes in the cluster will eventually have the latest version of the data. In this approach, all versions of a value are written on a write request. At read time, the system checks whether the version of the data read from a node is stale; if so, a conflict is detected and the nodes are synchronized so that all of them share the same value.
Weak consistency for read operations means the performance of a read is optimized by returning the requested data before all the nodes have been synchronized. That is, a read operation returns immediately and triggers an asynchronous process which takes care of synchronizing the data across all nodes. This performs faster than strong consistency, but has the drawback that the data returned is not always consistent.
Strong consistency for read operations means that consistent data is always returned to the user. Here, when stale data is detected at read time, all the nodes are synchronized first and only then is the response sent. This performs slower than weak consistency, but guarantees that the data is always consistent.
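Here is a toy sketch of the repair step using plain integer versions (Voldemort compares vector clocks instead; its real logic lives in voldemort.store.routed.ReadRepairer, described below):

import java.util.List;

class ReadRepairSketch {

    static class Replica {
        int version;  // version of the value this replica currently holds
        String value;

        Replica(int version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    // Read from all replicas, return the newest value, and push it back
    // to any replica that returned stale data.
    static String readWithRepair(List<Replica> replicas) {
        Replica newest = replicas.get(0);
        for (Replica r : replicas) {
            if (r.version > newest.version) {
                newest = r;
            }
        }
        for (Replica r : replicas) {
            if (r.version < newest.version) { // stale replica: repair it
                r.value = newest.value;
                r.version = newest.version;
            }
        }
        return newest.value;
    }
}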
Versioning
Versioning can be achieved in a centralized database by optimistic locking: we store a counter for each row, and when the row is updated we increment the counter. Updates are only allowed when the ‘counter’ value presented is the current one. This way we always know the latest version of the data; a toy sketch follows.
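// Illustrative sketch of optimistic locking: the write is accepted only
// if the caller presents the current counter value. (In a real RDBMS
// this is typically a WHERE clause on the version column.)
class OptimisticRow {
    private int counter = 0;
    private String data = "";

    // Returns false if the caller's counter was stale, i.e. someone else
    // updated the row first; the caller must then re-read and retry.
    synchronized boolean update(int expectedCounter, String newData) {
        if (expectedCounter != counter) {
            return false;
        }
        data = newData;
        counter++; // bump the version on every successful write
        return true;
    }
}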
In a distributed system, versioning is difficult: servers can fail, servers can be added, and replication of data can take time. We need to know, for each server, what the latest data value was, and we need enough information to determine whether a given value is obsolete.
# Two clients fetch the same value simultaneously
[client 1] get(12345) => {"name":"tushar", "email":"[email protected]"}
[client 2] get(12345) => {"name":"tushar", "email":"[email protected]"}
# client 1 updates the 'name'
[client 1] put(12345, {"name":"tushar sjsu", "email":"[email protected]"})
# client 2 updates the 'email'
[client 2] put(12345, {"name":"tushar", "email":"[email protected]"})
# We now have the following conflicting versions:
{"name":"tushar", "email":"[email protected]"}
{"name":"tushar sjsu", "email":"[email protected]"}
{"name":"tushar", "email":"[email protected]"}
Thus the original value is overwritten by both clients, but neither client knows what the other wrote, and we lack the information to determine which of the versions is obsolete.
This can be solved with a vector clock. A vector clock maintains a counter per server, updates it on each write, lets us know when two versions are in conflict, and tells us which version is the latest.
A vector clock can be thought of as a list of server:version pairs, where the version is a counter indicating the number of writes for which that server was the master:
[1:50, 2:3, 5:66]
“A version v1 succeeds a version v2 if, for all i, v1[i] > v2[i]. If neither v1 > v2 nor v1 < v2, then v1 and v2 co-occur, and are in conflict.” Here is an example of two conflicting versions:
[1:2,2:1]
[1:1,2:2]
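The succession test can be sketched in a few lines of Java (illustrative; Voldemort’s real implementation is the VectorClock class mentioned below):

import java.util.HashMap;
import java.util.Map;

// A clock is modelled here as a map from server id to version counter.
class VectorClockSketch {

    enum Occurred { BEFORE, AFTER, CONCURRENT }

    static Occurred compare(Map<Integer, Integer> v1, Map<Integer, Integer> v2) {
        boolean v1Bigger = false;
        boolean v2Bigger = false;
        Map<Integer, Integer> union = new HashMap<Integer, Integer>(v1);
        union.putAll(v2);
        for (Integer server : union.keySet()) {
            int c1 = v1.containsKey(server) ? v1.get(server) : 0;
            int c2 = v2.containsKey(server) ? v2.get(server) : 0;
            if (c1 > c2) v1Bigger = true;
            if (c2 > c1) v2Bigger = true;
        }
        if (v1Bigger && v2Bigger) return Occurred.CONCURRENT; // conflict
        return v1Bigger ? Occurred.AFTER : Occurred.BEFORE;
    }
}

For the two clocks above, each is larger in one component, so compare returns CONCURRENT: a conflict that the application (or read repair) must resolve.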
The following diagram shows the working of a vector clock.
Vector Clock
As shown in the diagram, Alice, Ben, Cathy and Dave are friends. They want to decide on a day to meet which works for everyone. The following sequence of steps occurs:
* Alice writes ‘Wed’ on the clock and sends it to the others.
* ‘Wed’ does not work for Ben, so he updates the clock with ‘Tue’. However, this update is missed by Cathy.
* Cathy is comfortable with ‘Thu’, so she updates the clock to ‘Thu’.
* Dave now receives two updates, ‘Tue’ and ‘Thu’. Dave is smart and realizes from the clocks that Cathy was not in the loop for ‘Tue’, so he sends ‘Thu’ as final to Cathy and Alice.
* Alice sends a final confirmation with the clock value ‘Thu’ to all.
Where can I find it in the code?
The package voldemort.store.routed contains implementation for routing data across all nodes in the cluster. The class ReadRepairer in this package is responsible for performing read-repair.
The class description says:
“Repair out-dated reads, by sending an up-to-date value back to the offending clients”
In the package voldemort.versioning you can find the implementation of the vector clock in the class VectorClock.
Exploration
In one of our database class projects, we had to explore NoSQL databases. We had to choose one NoSQL database from each category, namely key-value, wide-column and document-based. We chose the following:
- Key-value: Voldemort
- Wide-column: Cassandra
- Document-based: MongoDB
The goal of the project was to examine these NoSQL databases and get a feel for how data is stored in them and how they perform for CRUD operations. Wikipedia was used as the source of data, as it provides a rich collection of interconnected documents, which was necessary for us to analyze the performance of the three databases.
We used a page scraper to collect data from the Wikipedia site. We fetched data from Wikipedia, stored it in a local file, and then performed our operations with the file as the source. The page scraper passes the downloaded document through a preset list of stop-words; these stop words are filtered out, and the remaining data consists of keywords, links, URL and title. This data is then stored in each of the NoSQL databases, and CRUD performance is measured.
The following JUnit test cases were written for each database; each test file had five test cases.
* Case 1: inserts all nine pages into the respective database
* Case 2: removes a particular page
* Case 3: retrieves the pages whose title contains the word ‘table’
* Case 4: searches for a page and modifies its title
* Case 5: retrieves a single page given the key
The above tests were performed on a single-node cluster. All three databases were installed on the same machine and were running simultaneously while the test cases were executed. The performance numbers may differ if replication and sharding were used on a multi-node cluster. We set timers around the insert, select, delete and update operations in each of the NoSQL implementations so we could compare the time taken for each operation.
Let’s see how Voldemort performs in CRUD operations:
In Voldemort, data is stored in a “store” as simple key-value data. Both keys and values can be as complex as lists or maps. Each key is unique to a store, and each key can have at most one value. In a multi-node cluster, the data is automatically sharded across multiple machines, making it highly available; each server contains only a subset of the total data.
Voldemort Configuration
In Voldemort, server operations are primarily controlled by three configuration files:
* cluster.xml: This config file contains information about all the servers in the cluster, such as their hostnames and the ports they use. It is identical for all Voldemort nodes. It does not contain information specific to a particular node, such as tuning parameters or data folders, but rather information which is public to the cluster.
* stores.xml: This includes information about all the stores in the cluster: the required-reads and required-writes needed to maintain consistency, and also how the serialization of keys and values is done. It is identical on all nodes in the cluster.
* server.properties: This contains tunable arguments that control a particular server (node): the local node id (which corresponds to an entry in cluster.xml), the threadpool size, and local persistence engine parameters. This file varies from node to node.
Here is cluster.xml for our project:
<cluster>
  <name>mycluster</name>
  <server>
    <id>0</id>
    <host>localhost</host>
    <http-port>8081</http-port>
    <socket-port>6666</socket-port>
    <partitions>0, 1</partitions>
  </server>
</cluster>
Here it is important to note that partitions are not static partitions of nodes, but a way of partitioning the key-space such that each key maps to a specific data partition. This means that a given cluster may support multiple stores, each with a different replication factor. This is significant, since some data is more crucial than other data, and the trade-off between consistency and performance for one store may not be the same as for another. Note that the number of data partitions is fixed and cannot be changed.
It is important that the config files cluster.xml and stores.xml be the same on each node, and that partition and node ids remain the same, for consistency across nodes.
Here is server.properties for our project:
# The ID of *this* particular cluster node
node.id=0
max.threads=100
############### DB options ######################
http.enable=true
socket.enable=true
# BDB
bdb.write.transactions=false
bdb.flush.transactions=false
bdb.cache.size=1G
#NIO connector settings.
enable.nio.connector=false
storage.configs=voldemort.store.bdb.BdbStorageConfiguration, voldemort.store.readonly.ReadOnlyStorageConfiguration
Here is stores.xml for our project:
<store>
  <name>test</name>
  <persistence>bdb</persistence>
  <routing>client</routing>
  <replication-factor>1</replication-factor>
  <required-reads>1</required-reads>
  <required-writes>1</required-writes>
  <key-serializer>
    <type>string</type>
  </key-serializer>
  <value-serializer>
    <type>java-serialization</type>
  </value-serializer>
</store>
Here name is the name of the store, and we declare that we will use bdb as the local persistence engine. With the routing parameter we declare that the client will perform the routing. Now let’s look at the parameters N (replication factor), R (required reads) and W (required writes). The replication factor says how many replicas we want, the R parameter is the minimum number of reads that must succeed, and the W parameter is the minimum number of writes that must succeed in our cluster.
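On a multi-node cluster one might instead configure, say (an illustrative sketch, not the values from our single-node project):

<replication-factor>3</replication-factor>
<required-reads>2</required-reads>
<required-writes>2</required-writes>

Since R + W > N here, every read quorum overlaps every write quorum, so a successful read always sees the latest successful write.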
The next important thing is serialization. We declare that we will use the ‘url’, which is a plain string, as the key for each of our records; therefore the key serialization is string. The value, however, is not a simple string: it is a Map<String, Object>, which in turn contains keys such as ‘title’, ‘keywords’ and ‘links’ with their respective values. That is why we need to use java-serialization for the value.
Get Voldemort up and running:
- Install Voldemort
- Make sure all config files (cluster.xml, stores.xml and server.properties) match those given above.
- From the Voldemort folder, run the command:
- bin/voldemort-server.sh config/single_node_cluster > /tmp/voldemort.log &
- Via shell: connect to the ‘test’ store using the following command and then perform store operations.
- bin/voldemort-shell.sh test tcp://localhost:6666
Voldemort CRUD
Connect to store:
String bootstrapUrl = "tcp://localhost:6666";
StoreClientFactory factory = new SocketStoreClientFactory(
        new ClientConfig().setBootstrapUrls(bootstrapUrl));
StoreClient<String, Map<String, Object>> client = factory.getStoreClient("test");
Disconnect from the store:
factory.close();
Inserting data:
// Get the file (wiki page) stored on local disk.
URL pageUrl = new URL(“file://”
+ historyfiles[i].getCanonicalPath());
// Scrape the page with following url
Page pg = s.scrape(pageUrl);
// Get meta-data
String title = pg.getTitle();
String url = pg.getUrl();
// The url acts as a unique key
String key = url;
// We get a Versioned object for a known key from the store.
// Get operation of the store
Versioned<Map<String, Object>> currentValue = client.get(key);
// Fill the hashmap, the ‘value’ for the key in our store.
Map<String, Object> pageMap = new HashMap<String, Object>();
if (url != null) {
urls.add(url);
pageMap.put(“url”, url);
}
if (title != null) {
pageMap.put(“title”, title);
}
// Similar code for filling keywords, links in map…
// …
if (currentValue == null) {
// There was no existing key-value pair for the given key
// So create new value
currentValue = new Versioned<Map<String, Object>>(pageMap);
} else {
// Update existing value.
currentValue.setObject(pageMap);
}
// Put operation of the store
client.put(key, pageMap);
Retrieving data:
Versioned<Map<String, Object>> currentValue = client.get(url);
Deleting data:
client.delete(url);
Updating data:
Versioned<Map<String, Object>> currentValue = client.get(url);
Map<String, Object> data = currentValue.getValue();
data.put("url", url);
currentValue.setObject(data);
client.put(url, currentValue);
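For such read-modify-write sequences the client also offers an update helper that re-runs the whole action if another writer races us (a sketch; see voldemort.client.UpdateAction):

client.applyUpdate(new UpdateAction<String, Map<String, Object>>() {
    @Override
    public void update(StoreClient<String, Map<String, Object>> storeClient) {
        Versioned<Map<String, Object>> v = storeClient.get(url);
        v.getValue().put("url", url);
        storeClient.put(url, v);
    }
});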
Searching data:
The criterion used for searching across all the pages stored in the databases was: “Find all pages which have the word ‘table’ in their title”.
String searchInTitle = "table";
ArrayList<String> result = new ArrayList<String>();
Versioned<Map<String, Object>> currentValue = null;
for (int i = 0; i < urls.size(); i++) {
    currentValue = client.get(urls.get(i));
    Map<String, Object> data = currentValue.getValue();
    String title = (String) data.get("title");
    if (title.contains(searchInTitle))
        result.add(urls.get(i));
}
Please note that we had inserted nine pages in the Voldemort store.
Here is the output when the test was executed:
Test Output
Voldemort’s performance compared to other databases
We found that writes are faster in Cassandra, while reads are faster in Voldemort and MongoDB than in Cassandra. Note that these results were for a single-node cluster.