NoSQL Databases | Research Paper
In the world of enterprise computing, we have seen many changes in platforms, languages, processes, and architectures. But throughout that time one thing has remained unchanged: relational databases. For almost as long as we have been in the software profession, relational databases have been the default choice for serious data storage, especially in the world of enterprise applications. There have been times when a database technology threatened to take a piece of the action, such as object databases in the 1990s, but these alternatives never got anywhere.
In this research paper, a new challenger on the block is explored under the name of NoSQL. It came into existence because of the need to handle large volumes of data, which forced a shift to building bigger hardware platforms out of large numbers of commodity servers. The term “NoSQL” applies to a number of recent non-relational databases such as Cassandra, MongoDB, Neo4j, and Azure Table storage. NoSQL databases offer the advantage of building systems that perform better, scale much better, and are easier to program with.
The paper takes the view that we are now in a world of polyglot persistence, where enterprises use different technologies to manage data. For this reason, architects should know what these technologies are and be able to decide which ones to use for which purposes. The paper provides information for deciding whether NoSQL databases can be seriously considered for future projects. The aim is to give enough background on how NoSQL databases work and what advantages they bring to the table.
Table of Contents
Introduction
Literature
Technical Aspects
Document Oriented
Merits
Demerits
Case Study – MongoDB
Key Value
Merits
Demerits
Case Study – Azure Table Storage
Column Stores
Merits
Demerits
Case Study – Cassandra
Graphs
Merits
Demerits
Case Study – Neo4j
Conclusion
References
Introduction
NoSQL is commonly interpreted as “not only SQL”. It is a class of database management systems that does not adhere to the traditional RDBMS model. NoSQL databases handle a large variety of data, including structured, unstructured and semi-structured data. NoSQL database systems are highly optimized for retrieval and append operations and offer little functionality beyond record storage. Run-time flexibility is reduced compared to full SQL systems, but there are gains in scalability and performance for some data models [3].
NoSQL databases prove beneficial when a huge quantity of data has to be processed and a relational model does not fit the nature of the data. What truly matters is the ability to store and retrieve huge amounts of data, not the relationships between the items. This is especially useful for real-time or statistical analysis of growing amounts of data.
The NoSQL community is experiencing rapid change. It is transitioning from community-driven platform development to an application-driven market. Facebook, Digg and Twitter have been successful in using NoSQL to scale up their web infrastructure. Many successful attempts have been made to develop NoSQL applications in the fields of image/signal processing, biotechnology, and defense. Vendors of traditional relational database systems are also assessing the strategy of developing NoSQL solutions and integrating them into existing offerings.
Literature
In recent years, with the expansion of cloud computing, the problems of data-intensive services have become prominent. Cloud computing seems to be the future architecture for supporting large-scale, data-intensive applications, although there are certain application requirements that cloud computing does not fulfill sufficiently [7]. For years, the development of information systems has relied on vertical scaling, but this approach requires a higher level of skill and is not reliable in some cases. Horizontal scaling, or scaling out, partitions the database across multiple cheap machines that are added dynamically, and can ensure scalability in a more effective and cheaper way. Today’s NoSQL databases, designed for cheap hardware and using the shared-nothing architecture, can be a better solution.
The term NoSQL was coined by Carlo Strozzi in 1998 for his open-source, lightweight database, which had no SQL interface. Later, in 2009, Eric Evans, a Rackspace employee, reused the term for databases that are non-relational, distributed, and do not conform to atomicity, consistency, isolation and durability. In the same year, NoSQL was widely discussed at the “no:sql(east)” conference held in Atlanta, USA. NoSQL eventually saw unprecedented growth [1].
Scalable and distributed data management has been the vision of the database research community for more than three decades. Much research has focused on designing scalable systems for both update-intensive workloads and ad-hoc analysis workloads [5]. Initial designs included distributed databases for update-intensive workloads and parallel database systems for analytical workloads. Parallel databases grew to become large commercial systems, but distributed database systems were not very successful. Changes in the data access patterns of applications and the need to scale out to thousands of commodity machines led to the birth of a new class of systems referred to as NoSQL databases, which are now being widely adopted by various enterprises.
Data processing has been viewed as a “constant battle between parallelism and concurrency” [4]. A database acts as a data store with an additional protective software layer that is constantly being bombarded by transactions. To handle all the transactions, databases have two choices at each stage of computation: parallelism, where two transactions are processed at the same time; and concurrency, where a processor switches rapidly between the two transactions while they are in progress. Parallelism is faster, but avoiding inconsistencies in the results requires coordinating software, which is hard to run in parallel because it involves frequent communication between the parallel threads of the two transactions. At a global level, it becomes a choice between “distributed” and “scale-up” single-system processing.
In certain instances, relational databases designed for scale-up systems and structured data did not work well. For indexing and serving massive amounts of rich text, for semi-structured or unstructured data, and for streaming media, a relational database would require consistency between data copies in a distributed environment and would not be able to process the transactions in parallel. And so, to minimize costs and maximize the parallelism of these types of transactions, we turned to NoSQL and other non-relational approaches.
These efforts combined open-source software, large numbers of small servers and loose consistency constraints on the distributed transactions (eventual consistency). The basic idea was to minimize coordination by identifying types of transactions where it didn’t matter if some users got “old data” rather than the latest data, or if some users got an answer while others didn’t.
Technical Aspects
NoSQL is a non-relational database management system which is different from the traditional relational database management systems in significant ways. NoSQL systems are designed for distributed data stores which require large scale data storage, are schema-less and scale horizontally. Relational databases rely upon very structured rules to govern transactions. These rules are encoded in the ACID model which requires that the database must always preserve atomicity, consistency, isolation and durability in each database transaction. The NoSQL databases follow the BASE model which provides three loose guidelines: basic availability, soft state and eventual consistency.
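As a rough illustration of the “eventual consistency” guideline in the BASE model, the following minimal Python sketch acknowledges a write after updating one replica and copies it to the others later. The class and method names (EventuallyConsistentStore, propagate and so on) are hypothetical and are not taken from any particular NoSQL product; real systems replicate over a network and handle failures, which this sketch ignores.

```python
class Replica:
    """One copy of the data; holds whatever writes it has received so far."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: writes land on one replica first and reach the rest later."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the other replicas

    def write(self, key, value):
        # Acknowledge after updating a single replica (availability first).
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # A read may hit a replica that has not seen the latest write yet.
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        # Stand-in for background replication: apply queued writes everywhere.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("profile:42", "v1")
stale = store.read("profile:42", replica_index=2)  # replica 2 is still behind
store.propagate()
fresh = store.read("profile:42", replica_index=2)  # replicas have converged
```

Between the write and the call to propagate, a reader of replica 2 sees old data; afterwards all replicas agree. This is the trade-off the BASE guidelines accept in exchange for availability.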
There are two primary reasons to consider NoSQL: to handle data access at sizes and performance levels that demand a cluster, and to improve the productivity of application development by using a more convenient data interaction style [6]. The common characteristics of NoSQL are:
Not using the relational model
Running well on clusters
Open-source
Built for 21st century web estates
Schema less
Each NoSQL solution uses a different data model which can be put in four widely used categories in the NoSQL Ecosystem: key-value, document, column-family and graph. Of these the first three share a common characteristic of their data models called aggregate orientation. Next we briefly describe each of these data models.
3.1 Document Oriented
The main concept of a document oriented database is the notion of a “document” [3]. The database stores and retrieves documents which encapsulate and encode data in some standard formats or encodings like XML, JSON, BSON, and so on. These documents are self-describing, hierarchical tree data structures and can offer different ways of organizing and grouping documents:
Collections
Tags
Non-visible Metadata
Directory Hierarchies
Documents are addressed with a unique key which represents the document. Also, beyond a simple key-document lookup, the database offers an API or query language that allows retrieval of documents based on their content.
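The two access paths described above, lookup by key and retrieval by content, can be sketched with a small in-memory model. This is only an illustration under simplifying assumptions: the DocumentStore class and its insert, get and find methods are hypothetical names, documents are encoded as JSON strings, and find matches only on equality of top-level fields.

```python
import json


class DocumentStore:
    """Toy document database: collections of JSON-encoded documents."""

    def __init__(self):
        self._collections = {}  # collection name -> {key -> JSON string}

    def insert(self, collection, key, document):
        # Store an encoded copy, so the stored document is self-contained.
        self._collections.setdefault(collection, {})[key] = json.dumps(document)

    def get(self, collection, key):
        # Simple key-document lookup.
        raw = self._collections.get(collection, {}).get(key)
        return None if raw is None else json.loads(raw)

    def find(self, collection, **criteria):
        # Content-based retrieval: return documents whose top-level fields
        # equal every given criterion.
        matches = []
        for raw in self._collections.get(collection, {}).values():
            document = json.loads(raw)
            if all(document.get(field) == value for field, value in criteria.items()):
                matches.append(document)
        return matches


store = DocumentStore()
store.insert("orders", "o1", {"customer": "ann", "total": 30, "items": [{"sku": "A1"}]})
store.insert("orders", "o2", {"customer": "bob", "total": 12})

by_key = store.get("orders", "o1")                  # lookup by unique key
by_content = store.find("orders", customer="bob")   # lookup by document content
```

Note that the two documents in the same collection have different fields, which reflects the self-describing, schema-less nature of documents.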
Fig 1: Comparison of terminology between Oracle and MongoDB
3.1.1 Merits
Intuitive data structure.
Simple “natural” modeling of requests with flexible query functions [2].
Can act as a central data store for event storage, especially when the data captured by the events keeps changing.
With no predefined schemas, they work well in content management systems or blogging platforms.
Can store data for real-time analytics; since parts of the document can be updated, it is easy to store page views and new metrics can be added without schema changes.
Gives e-commerce applications a flexible schema and the ability to evolve data models without expensive database refactoring or data migration [6].
3.1.2 Demerits
Higher hardware demands, because more dynamic queries run partly without data preparation.
Redundant storage of data (denormalization) in favor of higher performance [2].
Not suitable for atomic cross-document operations.
Since the data is saved as an aggregate, if the design of an aggregate is constantly changing, aggregates have to be saved at the lowest level of granularity. In this case, document databases may not work [6].
3.1.3 Case Study – MongoDB
MongoDB is an open-source document-oriented database system developed by 10gen. It stores structured data as JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. The language support includes Java, JavaScript, Python, PHP, Ruby and it also supports sharding via configurable data fields. Each MongoDB instance has multiple databases, and each database can have multiple collections [2,6]. When a document is stored, we have to choose which database and collection this document belongs in.
Consistency in a MongoDB database is configured by using replica sets and choosing to wait for writes to be replicated to a given number of slaves. Transactions at the single-document level are atomic: a write either succeeds or fails. Transactions involving more than one operation are not possible, although there are a few exceptions. MongoDB implements replication, providing high availability through replica sets. In a replica set, two or more nodes participate in asynchronous master-slave replication. MongoDB has a query language that is expressed via JSON and has a variety of constructs that can be combined to create a MongoDB query. With MongoDB, we can query the data inside a document without having to retrieve the whole document by its key and then inspect it. Scaling in MongoDB is achieved through sharding. In sharding, the data is split by a certain field and then moved to different Mongo nodes. The data is dynamically moved between nodes to ensure that shards are always balanced. We can add more nodes to the cluster and increase the number of writable nodes, enabling horizontal scaling for writes [6, 9].
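The idea of splitting data by a certain field can be sketched in a few lines of Python. This is a simplified illustration of hash-based partitioning, not MongoDB's actual sharding mechanism: the helper names shard_for and distribute are hypothetical, the shards are plain lists, and the rebalancing of data between nodes is left out.

```python
import hashlib


def shard_for(value, shard_count):
    """Map a shard-key value to a shard index using a stable hash."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def distribute(documents, shard_key, shard_count):
    """Place each document on the shard chosen by its shard-key field."""
    shards = [[] for _ in range(shard_count)]
    for document in documents:
        shards[shard_for(document[shard_key], shard_count)].append(document)
    return shards


documents = [
    {"_id": 1, "customer": "ann", "total": 30},
    {"_id": 2, "customer": "bob", "total": 12},
    {"_id": 3, "customer": "ann", "total": 7},
]
shards = distribute(documents, shard_key="customer", shard_count=2)
```

Because the shard is derived only from the chosen field, all documents with the same customer value end up on the same node, so a query for one customer needs to contact only one shard.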
3.2 Key-value
A key-value store is a simple hash table, primarily used when all access to the database is via primary key. Key-value stores give an application schema-less storage of data. The data can be stored as a data type of a programming language or as an object. The following types exist: hierarchical key-value stores, eventually-consistent key-value stores, hosted services, key-value caches in RAM, ordered key-value stores, multivalue databases, tuple stores and so on.
Key-value stores are the simplest NoSQL data stores to use from an API perspective. The client can get or put the value for a key, or delete a key from the data store. The value is a blob that is just stored without knowing what is inside; it is the responsibility of the application to understand what is stored [3, 6].
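The get, put and delete operations described above can be sketched as follows. The KeyValueStore class is a hypothetical in-memory stand-in rather than the API of a real product; it accepts only bytes as values to stress that the store never looks inside the blob, and the application does its own encoding and decoding (JSON is assumed here).

```python
import json


class KeyValueStore:
    """Toy key-value store: an opaque blob per key, nothing more."""

    def __init__(self):
        self._table = {}

    def put(self, key, blob):
        # The store does not interpret the value, so it only accepts raw bytes.
        if not isinstance(blob, bytes):
            raise TypeError("value must be bytes; the store does not interpret it")
        self._table[key] = blob

    def get(self, key):
        return self._table.get(key)

    def delete(self, key):
        self._table.pop(key, None)


store = KeyValueStore()
session = {"user": "ann", "cart": ["A1", "B2"]}

# The application, not the store, decides how the value is encoded.
store.put("session:42", json.dumps(session).encode("utf-8"))
restored = json.loads(store.get("session:42").decode("utf-8"))

store.delete("session:42")
after_delete = store.get("session:42")
```

Since every operation goes through a single key, there is no way to ask the store for all sessions belonging to a given user; that kind of query has to be built in the application.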
3.2.1 Merits
Performance high and predictable.
Simple data model.
Clear separation of saving from application logic (because of lacking query language).
Suitable for storing session information.
User profiles, product profiles, preferences can be easily stored.
Best suited for shopping cart data and other E-commerce applications.
Can be scaled easily since they always use primary-key access.
3.2.2 Demerits
Limited range of functions
High development effort for more complex applications
Not the best solution when relationships between different sets of data are required.
Not suited for multi operation transactions.
There is no way to inspect the value on the database side.
Since operations are limited to one key at a time, there is no way to operate upon multiple keys at the same time.
3.2.3 Case Study – Azure Table Storage
For structured forms of storage, Windows Azure provides structured key-value pairs stored in entities known as Tables. The table storage uses a NoSQL model based on key-value pairs for querying structured data that is not in a typical database. A table is a bag of typed properties that represents an entity in the application domain. Data stored in Azure tables is partitioned horizontally and distributed across storage nodes for optimized access.
Every table has a property called the Partition Key, which defines how data in the table is partitioned across storage nodes – rows that have the same partition key are stored in a partition. In addition, tables can also define Row Keys which are unique within a partition and optimize access to a row within a partition. When present, the pair {partition key, row key} uniquely identifies a row in a table. The access to the Table service is through REST APIs [6].
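The addressing scheme described above, where a partition key groups rows and the pair of partition key and row key identifies a single row, can be modeled in plain Python. This sketch does not use the Azure REST API or any Azure SDK; the Table class and its insert, get and query_partition methods are hypothetical names chosen for illustration.

```python
class Table:
    """Toy model of a table addressed by (partition key, row key)."""

    def __init__(self):
        self._partitions = {}  # partition key -> {row key -> properties}

    def insert(self, partition_key, row_key, **properties):
        partition = self._partitions.setdefault(partition_key, {})
        if row_key in partition:
            # The pair (partition key, row key) must be unique within the table.
            raise KeyError("duplicate (partition key, row key)")
        partition[row_key] = dict(properties)

    def get(self, partition_key, row_key):
        # Point lookup: both keys together identify exactly one row.
        return self._partitions.get(partition_key, {}).get(row_key)

    def query_partition(self, partition_key):
        # All rows that share a partition key are stored together.
        return dict(self._partitions.get(partition_key, {}))


table = Table()
table.insert("NO", "ann", city="Oslo", age=31)
table.insert("NO", "bob", city="Bergen")
table.insert("IT", "ann", city="Rome")

norway_ann = table.get("NO", "ann")
norway_rows = table.query_partition("NO")
```

The row key "ann" appears in two partitions without conflict, because uniqueness is only required within a partition.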
3.3 Column Store
Column-family databases store data in column families as rows that have many columns associated with a row key. These stores map keys to values and group the values into multiple column families, each column family being a map of data. Column families are groups of related data that are often accessed together.
The column-family model is a two-level aggregate structure. As with key-value stores, the first key is often described as a row identifier, picking up the aggregate of interest. The difference with column-family structures is that this row aggregate is itself formed of a map of more detailed values. These second-level values are referred to as columns. The row can be accessed as a whole, and operations also allow picking out a particular column [6].
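The two-level structure described above can be pictured as a map of maps. The sketch below uses plain Python dictionaries; the names column_family, get_row and get_column are hypothetical and the sample data is made up for illustration.

```python
# First level: row key -> row aggregate. Second level: column name -> value.
column_family = {
    "user:1": {"name": "Ann", "city": "Oslo"},
    "user:2": {"name": "Bob", "email": "bob@example.com", "city": "Rome"},
}


def get_row(family, row_key):
    """Access the row aggregate as a whole."""
    return family.get(row_key, {})


def get_column(family, row_key, column):
    """Pick out one particular column of one row."""
    return family.get(row_key, {}).get(column)


whole_row = get_row(column_family, "user:1")
one_column = get_column(column_family, "user:2", "email")
missing_column = get_column(column_family, "user:1", "email")  # row 1 has no email
```

Rows in the same column family do not have to carry the same columns, which is why the lookup for a missing column simply returns nothing.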
3.3.1 Merits
Designed for performance.
Native support for persistent views towards key-value store.
Sharding: Distribution of data to various servers through hashing.
More efficient than row-oriented systems during aggregation of a few columns from many rows.
Column-family databases with their ability to store any data structures are great for storing event information.
Allows storing blog entries with tags, categories, links, and trackbacks in different columns.
Can be used to count and categorize visitors of a page in a web application to calculate analytics.
Provides a functionality of expiring columns: columns which, after a given time, are deleted automatically. This can be useful in providing demo access to users or showing ad banners on a website for a specific time.
3.3.2 Demerits
Limited query options for data
High maintenance effort when changing existing data, because all lists have to be updated.
Less efficient than row-oriented systems when accessing many columns of a row.
Not suitable for systems that require ACID transactions for reads and writes.
Not good for early prototypes or initial tech spikes as the schema change required is very expensive.
3.3.3 Case Study – Cassandra
A column is the basic unit of storage in Cassandra. A Cassandra column consists of a name-value pair where the name behaves as the key. Each of these key-value pairs is a single column and is stored with a timestamp value, which is used to expire data, resolve write conflicts, and deal with stale data, among other things. A row is a collection of columns attached or linked to a key; a collection of similar rows makes a column family. Each column family can be compared to a container of rows in an RDBMS table, where the key identifies the row and the row consists of multiple columns. The difference is that the various rows do not need to have the same columns, and columns can be added to any row at any time without having to add them to other rows.
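How a per-column timestamp can resolve write conflicts is sketched below as a last-write-wins merge of two copies of the same row. This is an illustrative model, not Cassandra's code: the Column dataclass and the merge_rows function are hypothetical names, timestamps are plain integers, and on equal timestamps the sketch simply keeps the column from the first copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """A name-value pair stored together with the time it was written."""
    name: str
    value: str
    timestamp: int


def merge_rows(row_a, row_b):
    """Merge two copies of a row; for each column the newer write wins."""
    merged = dict(row_a)
    for name, column in row_b.items():
        current = merged.get(name)
        if current is None or column.timestamp > current.timestamp:
            merged[name] = column
    return merged


# Two replicas hold different versions of the same row.
replica_a = {
    "city": Column("city", "Oslo", 100),
    "name": Column("name", "Ann", 100),
}
replica_b = {
    "city": Column("city", "Rome", 250),  # written later than replica_a's city
    "email": Column("email", "ann@example.com", 120),
}

merged = merge_rows(replica_a, replica_b)
```

The merged row keeps the newer city, and it also keeps the columns that exist on only one replica, since rows do not need to share the same set of columns.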
By design Cassandra is highly available, since there is no master in the cluster and every node is a peer in the cluster. A write operation in Cassandra is considered successful once it’s written to the commit log and an in-memory structure known as memtable. While a node is down, the data that was supposed to be stored by that node is handed off to other nodes. As the node comes back online, the changes made to the data are handed back to the node. This technique, known as hinted handoff, allows for faster restore of failed nodes. In Cassandra, a write is atomic at the row level, which means inserting or updating columns for a given row key will be treated as a single write and will either succeed or fail. Cassandra has a query language that supports SQL-like commands, known as Cassandra Query Language (CQL) [2, 6]. We can use the CQL commands to create a column family. Scaling in Cassandra is done by adding more nodes. As no single node is a master, when we add nodes to the cluster we are improving the capacity of the cluster to support more writes and reads. This allows for maximum uptime as the cluster keeps serving requests from the clients while new nodes are being added to the cluster.
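The hinted handoff technique described above can be sketched with a toy cluster. This is a heavily simplified model and not Cassandra's implementation: the Node and Cluster classes are hypothetical, each key has a single owner chosen by the caller, and a hint is simply parked on the first peer that is up.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target node name, key, value) kept for nodes that are down


class Cluster:
    """Toy peer-to-peer cluster with hinted handoff."""

    def __init__(self, names):
        self.nodes = {name: Node(name) for name in names}

    def write(self, owner_name, key, value):
        owner = self.nodes[owner_name]
        if owner.up:
            owner.data[key] = value
            return
        # The owner is down: hand the write to the first available peer as a hint.
        for peer in self.nodes.values():
            if peer.up:
                peer.hints.append((owner_name, key, value))
                return
        raise RuntimeError("no node available to accept the write")

    def recover(self, name):
        # The node is back online: peers hand the stored changes back to it.
        node = self.nodes[name]
        node.up = True
        for peer in self.nodes.values():
            remaining = []
            for target, key, value in peer.hints:
                if target == name:
                    node.data[key] = value
                else:
                    remaining.append((target, key, value))
            peer.hints = remaining


cluster = Cluster(["a", "b", "c"])
cluster.nodes["a"].up = False
cluster.write("a", "user:1", "Ann")   # accepted even though node a is down
hints_while_down = list(cluster.nodes["b"].hints)
cluster.recover("a")                  # node a receives the missed write
```

The write succeeds while the owner is offline, and the owner catches up from the hint instead of needing a full repair.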
3.4 Graph
Graph databases allow storing entities and the relationships between these entities. Entities are also known as nodes, which have properties. Relationships are known as edges, which can also have properties. Edges have directional significance; nodes are organized by relationships, which allows finding interesting patterns between the nodes. The organization of the graph lets the data be stored once and then interpreted in different ways based on relationships.
Relationships are first-class citizens in graph databases; most of the value of graph databases is derived from the relationships. Relationships don’t only have a type, a start node, and an end node, but can have properties of their own. Using these properties on the relationships, we can add intelligence to the relationship – for example, since when did they become friends, what is the distance between the nodes, or what aspects are shared between the nodes. These properties on the relationships can be used to query the graph [2, 6].
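The idea that relationships carry their own properties, and that these properties can be used in queries, can be sketched with a small in-memory graph. The Graph class, its method names and the sample data are hypothetical; a real graph database would use indexes and a dedicated query language instead of a list scan.

```python
class Graph:
    """Toy property graph: nodes and directed relationships, both with properties."""

    def __init__(self):
        self.nodes = {}          # node id -> properties
        self.relationships = []  # (start, type, end, properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def relate(self, start, rel_type, end, **properties):
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both the start node and the end node must exist")
        self.relationships.append((start, rel_type, end, properties))

    def outgoing(self, start, rel_type, predicate=lambda props: True):
        # Follow relationships of one type from a node, filtered by a test
        # on the relationship's own properties.
        return [
            end
            for source, kind, end, props in self.relationships
            if source == start and kind == rel_type and predicate(props)
        ]


graph = Graph()
graph.add_node("ann", city="Oslo")
graph.add_node("bob")
graph.add_node("cat")
graph.relate("ann", "FRIEND", "bob", since=2005)
graph.relate("ann", "FRIEND", "cat", since=2012)

all_friends = graph.outgoing("ann", "FRIEND")
old_friends = graph.outgoing("ann", "FRIEND", lambda props: props["since"] < 2010)
```

The second query answers the “since when did they become friends” question using only the property stored on the relationship, without touching the nodes themselves.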
3.4.1 Merits
Very compact modeling of networked data.
High performance efficiency.
Can be deployed and used very effectively in social networking.
Excellent choice for routing, dispatch and location-based services.
As nodes and relationships are created in the system, they can be used to make recommendation engines.
They can be used to search for patterns in relationships to detect fraud in transactions.
3.4.2 Demerits
Not appropriate when an update is required on all or a subset of entities.
Some databases may be unable to handle lots of data, especially in global graph operations (those involving the whole graph).
Sharding is difficult as graph databases are not aggregate-oriented.
3.4.3 Case Study – Neo4j
Neo4j is an open-source graph database, implemented in Java. It is described as an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. Neo4j is ACID compliant and easily embedded in individual applications.
In Neo4j, a graph is created by making two nodes and then establishing a relationship. Graph databases ensure consistency through transactions. They do not allow dangling relationships: the start node and end node always have to exist, and nodes can only be deleted if they don’t have any relationships attached to them. Neo4j achieves high availability by providing for replicated slaves. Neo4j is supported by query languages such as Gremlin (a Groovy-based traversal language) and Cypher (a declarative graph query language) [6]. There are three ways to scale graph databases:
Adding enough RAM to the server so that the working set of nodes and relationships is held entirely in memory.
Improving the read scaling of the database by adding more slaves with read-only access to the data, with all the writes going to the master.
Sharding the data from the application side using domain-specific knowledge.
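Returning to the consistency rule stated earlier in this case study, that dangling relationships are not allowed, the following sketch shows how such a rule could be enforced. It is a plain Python model and does not use the Neo4j API; the TinyGraph class and its method names are hypothetical.

```python
class TinyGraph:
    """Toy graph that refuses to leave a relationship without both end nodes."""

    def __init__(self):
        self.nodes = set()
        self.relationships = set()  # (start, type, end)

    def create_node(self, node_id):
        self.nodes.add(node_id)

    def create_relationship(self, start, rel_type, end):
        # The start node and the end node always have to exist.
        if start not in self.nodes or end not in self.nodes:
            raise ValueError("start and end nodes must exist")
        self.relationships.add((start, rel_type, end))

    def delete_relationship(self, start, rel_type, end):
        self.relationships.discard((start, rel_type, end))

    def delete_node(self, node_id):
        # A node can only be deleted once nothing is attached to it.
        if any(node_id in (start, end) for start, _, end in self.relationships):
            raise ValueError("node still has relationships attached")
        self.nodes.discard(node_id)


graph = TinyGraph()
graph.create_node("ann")
graph.create_node("bob")
graph.create_relationship("ann", "FRIEND", "bob")

try:
    graph.delete_node("ann")
    blocked = False
except ValueError:
    blocked = True  # deletion is refused while the relationship exists

graph.delete_relationship("ann", "FRIEND", "bob")
graph.delete_node("ann")  # now allowed
```

In a transactional database, removing the relationship and the node would normally happen inside one transaction, so that other readers never see a half-finished state.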
Conclusions
NoSQL databases are still evolving, and a growing number of enterprises are moving from traditional relational database technology to non-relational databases. But given their limitations, they will never completely replace relational databases. The future of NoSQL lies in using various database tools in an application-oriented way and in their broader adoption in specialized projects involving large, unstructured, distributed data with high scaling requirements. On the other hand, NoSQL data stores will hardly compete with relational databases, which represent reliability and mature technology.
NoSQL databases leave a lot of work to the application designer. Application design is an important part of working with non-relational databases, since it is what enables database designers to provide certain functionality to users. Hence a good understanding of the architecture of NoSQL systems is required. The need of the hour is to take advantage of the new trends emerging in the world of databases: the non-relational databases. An effective solution would be to combine the power of different database technologies to meet the requirements and maximize performance.