A Study On Distributed Database Systems Information Technology Essay

Cloud computing is a technology that involves delivering hosted services which uses internet and remote server machines to maintain applications and data. It has gained much popularity in the recent years and many companies offer a variety of services like Amazon EC2, Google “Gov Cloud” and so on. Many startups are adopting the cloud as their sole viable solution to achieve scale. While predictions regarding cloud computing vary, most of the community agrees that public clouds will continue to grow in the number and importance of their tenants [4]. Having said that, there is an opportunity for rich data-sharing among independent web services that are co-located within the same cloud. We also expect that a small number of giant-scale shared clouds (such as Amazon AWS) will result in an unprecedented environment where thousands of independent and mutually distrustful web services share the same runtime environment, storage system, and cloud infrastructure. With the development of distributed system and cloud computing, more and more applications might be migrated to the cloud to exploit its computing power and scalability. However most cloud platforms don’t support relational data storage for the consideration of scalability. To meet the reliability and scaling needs, several storage technologies have been developed to replace relational database for structured and semi-structured data, e.g. BigTable [3] for Google App Engine, SimpleDB [10] for Amazon Web Service (AWS) and NOSQL solutions such as Cassandra [8] and HBase [9]. As a result, relational database of existing applications have started transforming to cloud-based databases so that they can efficiently scale on such platforms. In this paper, we will look in detail how this alternative database architecture can truly scale in the cloud while giving a single logical view across all data.

1 Introduction

In the recent days, a lot of new non-relational databases have cropped up in public and private clouds. We could infer one key message from this trend: “Go for a non-relational database if you want on-demand scalability”. Is this a sign that relational databases have had their day and will decline over time? Relational databases have been around for over thirty years. During this time, several so-called revolutions flared up all of which were supposed to spell the end of the relational database. All of those revolutions fizzled out and none even made a dent in the dominance of relational databases. In the coming sections, we will see how this situation changed, the current trend of moving away from relational databases and what this means for the future of relational databases. We will also look at a couple of distributed database models that sets the trend.

Before we can take a look at the way this paper is structured, let me give a brief overview on the list of references used in this paper in the order it is listed in the reference section. “Distributed Database management Systems” [1], a text book written by Saeed and Frank addresses the distributed database theory in general. The chapters in this book explore various issues and aspects of a distributed database system and further discusses various techniques and mechanisms that are available to address these issues. Reference [2] is a URL reference that talks about the next generation distributed databases. It mostly addresses the points of the distributed databases as being non-relational, distributed, open-source and horizontal scalable and so on. The web page also has links (URLs) to each and every distributed database model it has discussed. “Bigtable: a distributed storage system for structured data” [3] is a paper reference that discusses the most popular distributed storage system ‘BigTable’ on which many distributed databases like HBase and Cassandra have based their data models on. Reference 4 is an URL reference that predicts ten cloud computing developments expected to see next year for both cloud service providers and enterprise users. Reference [5] is an URL reference that talks about the history of SQL and the recent rise of alternatives to relational databases, the so called “NoSQL” data stores. “Automatic Configuration of a Distributed Storage System” [6] is a paper reference that presents an approach of automatically configuring a distributed database by proposing a design that supports an administrator to correctly configure such a system in order to improve application performance. They intend to design a software control loop that automatically decides how to configure the storage system to optimize the performance. “The Hadoop distributed File system (HDFS)” [7] is a paper reference that presents a distributed file system with a framework for analysis and transformation of very large data sets using its MapReduce paradigm. HBase (a distributed database model which we would be discussing later in this paper) uses HDFS as its data storage engine. “Cassandra-A Decentralized Structured Storage system” [8] is a paper reference that presents the Cassandra distributed database model. “HBase and Hypertable for large scale distributed storage systems” [9] is a paper reference that provides a view on the capabilities of each of these implementations of BigTable, which helps those trying to understand their technical similarities, differences, and capabilities. “SimpleDB: a simple Java-based multiuser system for teaching database internals” [10] is a paper reference that presents the architecture of SimpleDB. References [11] and [12] are URL references that point to the definitions of CAP theorem and MapReduce. “MapReduce: Simplified Data Processing on Large Clusters” [13] is a paper reference that presents a programming model for processing and generating large datasets. It uses the ‘map’ and the ‘reduce’ functions to partition a job into sub-jobs and later combine their results in some way to get the output. Many real world tasks are expressible in this model. Let us now take a look at the paper structure.

This paper is structured as follows. Section 2 talks about various features of relational databases and its drawbacks that led to a distributed model. Section 3 presents the distributed database model, its types, with benefits and drawbacks. In section 4, we present the importance of big data in the future of Information technology. Section 5 compares the data model and features of Hadoop HBase and Cassandra in detail. Finally, section 6 concludes with my opinion on distributed databases and the future of cloud computing.

2 What is a Relational Database?

A relational database is essentially a group of tables or entities that are made up of rows (also called as record or tuple) and columns. Those tables have constraints and one can define relationships between them. Relational databases are queried using Structured Query Language (SQL), and result sets are produced from the queries that access data from one or more tables. Multiple tables being accessed in a single query are “joined” together, typically by a criterion defined in the table relationship columns. One of the data-structuring models used with relational databases that removes data duplication and ensures data consistency is ‘Normalization’.

Relational databases are facilitated through Relational Database Management Systems (RDBMS) which is a DBMS that is based on the relational model. Almost all database management systems we use today are relational, including those of SQL Server, SQLite, MySQL, Oracle, Sybase, TeraData, DB2 and so on. The reasons for the dominance of relational databases are not trivial. They have continually offered the best mix of simplicity, flexibility, robustness, performance, compatibility and scalability in managing generic data. Relational databases have to be incredibly complex internally to offer all of this. For example, a relatively simple SELECT statement could have hundreds of potential query execution paths, which the optimizer would evaluate at run time. All of this is hidden to us as users, but under the cover, RDBMS determines the “execution plan” that best answers our requests by using things like cost-based algorithms.

2.1 Drawbacks of Relational Databases

Though Relational Database Management Systems (RDBMS) have provided users with the best mix of simplicity, flexibility, robustness, performance, compatibility and scalability, their performance and rate of acceptance in each of these areas is not necessarily better than that of an alternate solution pursuing one of these benefits in isolation. It has not been much of a problem so far because the universal dominance of the Relational DBMS has outweighed the need to push any of these boundaries. Nonetheless, if we really had a need that couldn’t be answered by any of these generic relational databases, alternatives have always been around to fill those niches.

We are in a slightly different situation today. One of these benefits mentioned above is becoming more and more critical for an increasing number of applications. While still considered a niche, it is rapidly becoming a mainstream. That benefit is ‘Scalability’. As more and more DB applications are launched in environments that have massive workloads, (such as web services) their scalability requirements can ‘change’ very quickly and ‘grow’ very large. The former scenario can be difficult to manage if you have a relational database sitting on a single in-house data server. For example, if your server load doubles or triples overnight, how quickly do you think you can upgrade your hardware? The later scenario in general can be too difficult to manage with a relational database.

When run on a single server node, relational databases scale well. But when the capacity of that single node is reached, the system must be able to scale out and distribute that load across multiple server nodes. This is when the complexity of relational databases starts to rub against their potential to scale. And the complexities become overwhelming when we try scaling to hundreds or thousands of nodes. The characteristics that make relational database systems so appealing drastically reduce their viability as platforms for large distributed systems.

2.2 Towards Distributed Database

Addressing this limitation in private and public clouds became necessary as cloud platforms demand high levels of scalability. A cloud platform without a scalable data store is not much of a platform at all. So, to provide real-time customers with a scalable place to store data, cloud providers had only one real option, which was to implement a new type of database system that focuses on scalability, at the expense of the other benefits that come with relational databases. These efforts have led to the rise of a new breed of database management system termed as ‘Distributed’ DBMS. In the upcoming sections, we will take a deep look at these databases.

3 Distributed Databases – The New Breed

In the book ‘Distributed Database Systems’ [1] authors Saeed and Frank define Distributed database as follows – “A Distributed DBMS maintains and manages data across multiple computers. A DDBMS can be thought of as a collection of multiple, separate DBMSs, each running on a separate computer, and utilizing some communication facility to coordinate their activities in providing shared access to the enterprise data. The fact that data is dispersed across different computers and controlled by different DBMS products is completely hidden from the users. Users of such a system will access and use data as if the data were locally available and controlled by a single DBMS.”

As opposed to Distributed DBMS, a centralized DBMS is a software that allows an enterprise to access and control its data on a single machine or server. It has become common for the organizations to own more than one DBMS which did not come from the same database vendor. For example, an enterprise could own a mainframe database that is controlled by IBM DB2 and a few other databases controlled by Oracle, SQL Server or other vendors. Most of the time, users access their own workgroup database. Yet, sometimes, users access data in some other division’s database or even in the larger enterprise-level database. As we can see now, the need to share data that is scattered across an enterprise cannot be satisfied by centralized Database management system software leading to a distributed management system.

3.1 Homogeneous Vs Heterogeneous databases

In recent years, the distributed database management system (DDBMS) has been emerging as an important area in data storage and its popularity is increasing day by day. A distributed database is a database that is under the control of a central DBMS in which not all hardware storage devices are attached to a single server with common CPU. It may be stored on multiple machines located in the same

physical location, or may be dispersed over a network of interconnected machines. Collections of data (e.g., database table rows/columns) can be distributed across multiple machines in different physical locations. In short, a distributed database is a logically interrelated collection of shared data, and a description of this data is physically distributed over the network (or cloud).

A distributed database management system (DDBMS) may be classified as homogeneous or heterogeneous. In an ideal DDBMS, the sites would share a common global schema (although some relations may be stored only at some local sites), all sites would run the same database management software or model and each site is aware of the existence of other sites in that network. In a distributed

system, if all sites use the same DBMS model, it is called a homogenous distributed database system. However, in reality, a distributed database system has to be constructed by linking multiple machines running already-existing database systems together, each with its own schema and possibly running different database management software. Such systems are called heterogeneous distributed database

systems. In a heterogeneous distributed database system, sites may run different DBMS software that need not be based on the same underlying data model, and thus, the system may be composed of relational, hierarchical, network and object-oriented DBMSs (these are various types of database models).

Homogeneous distributed DBMS provides several advantages such as ease of designing, simplicity and incremental growth. It is much easier to design and manage than a heterogeneous system. In a homogeneous distributed DBMS, making the addition of a new site to the system is much easier, thereby providing incremental growth in the site. On the other hand, heterogeneous DDBMS are flexible

in the way that it allows sites to run different DBMS softwares with its own model. The communications between different DBMSs are required for translations which might add an additional overhead to the system.

3.2 Relational Vs Non-relational databases

3.2.1 Scalability

Relational databases usually reside on one server or a machine. Scalability here can be achieved by adding more processors (maybe with advanced capabilities), adding hardware memory and storage. Relational databases do reside on multiple servers and synchronization in this case is achieved by using replication techniques. On the other hand, ‘NoSQL’ databases (term used to refer distributed/non-relational databases) also can reside on a single server or a machine but more often are designed to work across a cloud or a network of servers.

3.2.2 Data formats

Relational databases are comprised of a collection of rows and columns in a table and/or view which includes a fixed schema and join operations. ‘NoSQL’ databases often store data in a combination of key and value pairs also called as Tuples. They are schema free and have an ordered list of elements.

3.2.3 Dataset Storage

Relational databases almost always reside on a secondary disk drive or a storage area. When required, a collection of database rows or records are brought into main memory with stored procedure operations and SQL select commands. In contrast, most of the NoSQL databases are designed to exist in main memory for speed and can be persisted to disk.

3.3 Components of Distributed Database

A Distributed DBMS controls the storage and efficient retrieval of logically interrelated data that are physically distributed among several sites. Therefore, a distributed DBMS includes the following components.

Workstation machines (nodes in a cluster) – A distributed DBMS consists of a number of computer workstation machines that form the network system. The distributed database system must be independent of the computer system hardware.

Network components (hardware/software) – Each workstation in a distributed system contains a number of network hardware/software components. These components allow each site to interact and exchange data with each other site.

Communication media – In a distributed system, any type of communication or information

exchange among nodes is carried out through the communication media. This is a very important component of a distributed DBMS. It is desirable that a distributed DBMS be communication media independent. (i.e.) it must be able to support several types of communication media.

Transaction processor – A Transaction processor is a software component that resides in each workstation (computer) connected with the distributed database system and is responsible for receiving and processing local/remote applications’ data requests. This component is also known as the transaction manager or the application manager.

Data processor – A data processor is also a software component that resides in each workstation (computer) connected with the distributed system and stores/retrieves data located at that site. This is also known as the data manager. In a distributed DBMS, a centralized DBMS may act as a Data Processor.

Now, a number of questions might arise while we research how distributed DBMS could be deployed in a company. What disadvantages and advantages will come with this new kind of architecture? Can we reuse the existing hardware component efficiently with this new setup? What vendors can one choose to go with? What kind of code changes will this introduce to the database developers in the company and what kind of cultural shocks will they see? Let’s take a look at its drawbacks and benefits.

3.4 Drawbacks of Distributed DBMS

The biggest disadvantage in deploying distributed DBMS in most of the scenarios would be the code and the culture change. Developers are used to writing complex code in SQL that connects to a database and executes a stored procedure that lives in the database. Introducing this new distributed architecture would completely change their routine and environment. The stored procedures are usually written in Java or another JIT language. This essentially means a rewrite of much of the Java code infrastructure. Building on that, they also introduce additional complexities into the existing architecture. This potentially means more cost in work labor.

Companies have also invested a lot of time and money into the hardware components and architecture they have today. Introducing the concept of commodity hardware into the existing system could cause both political and economical backlashes. There are also solution-specific problems that are introduced on top of this. The concept of Distributed DBMS itself is rather new and the lack of options and experience among the developers and customers raises concern. This also means there’s a lack of standards to follow. I would also be concerned with how the vendor treats concurrency control among the nodes. This is one of the major considerations in distributed environments.

3.5 Benefits of Distributed DBMS

The first benefit that I see in this approach is that it is extremely scalable. With distributed database expansion will be easier while not sacrificing reliability or availability. If one needs to add more hardware, they should be able to pause a node’s operations, copy it over and bring up the new hardware at a very high-level without worrying much about the rest of the system.

Secondly, this setup doesn’t require high-end massive machines that are currently being used in a non-distributed environment or datacenters. One can use a cheap commodity hardware providing a decent RAM and good CPUs. The hardware one would consider for this would have 4 to 8 GB of RAM and dual Core i7s. One can provide as many cores as they can to each node in a cluster for higher throughput or performance. Now, the complexity of the network becomes one of the bottlenecks. These nodes need to be able to communicate more often in the cluster. To attain this, the network connection speed needs to be increased.

3.6 Automatic configuration of Distributed DBMS

Although a network-based storage system has several benefits, managing the data across several nodes in a network is complex. Moreover, it brings several issues (few of which have been discussed in section 3.2) that are inexistent in centralized solutions. For example, the administrator has to decide whether data needs to be replicated and if yes, then at what rate should it be replicated, how to speed up several concurrent distributed accesses in the network (kept in memory – good for temporal short files; verified for similarity – good to save storage space and bandwidth when there is high similarity between write operations), etc. To avoid such complexities, the database systems typically fix these decisions making it easier to manage and yet keep the system simple.

Few versatile storage systems were proposed to solve these problems. The versatility is the ability of providing a predefined set of configurable optimizations that can be configured by the administrator or the running application based on their needs thus improving application performance and/or the reliability. Although this kind of versatility allows applications to obtain improved performance from the storage system, it adds more responsibilities to the administrator who now has to manually tune the storage system after learning the network and applications. Such a manual configuration of the storage system might not be a desired task to the administrator for the following reasons: lack of complete knowledge about the application using this network, the network workload, the workload change, and performance tuning which could be time-consuming.

In the paper ‘Automatic Configuration of a Distributed Storage System’ [6], the authors had proposed a design to support the administrator to correctly configure such a distributed file system in order to improve the application performance. They intended to design an architecture that allows automatic configuration of the parameters in such a versatile storage system. Their intended solution had the following requirements:

â€¢ Easy to configure: The main goal is to make the administrator’s job easier. Thus, they intended to design a solution that requires minimal human intervention to turn the optimizations on. The ideal case is no human intervention.

â€¢ Choose a satisfactory configuration: The automatic configuration should detect a set of parameters that can make the performance of the application closer to the administrator’s intention.

â€¢ Reasonable configuration cost: The cost required to detect the optimizations and set the new configuration should be paid off according to the utility to the system. For example, considering as a main goal having the smallest possible application execution time, if the cost to learn the correct parameters (the time in this case) is greater than the time saved by the optimizations, automatic configuration solution

is useless.

Figure 1 – Software Control loop

Courtesy: ‘Automatic Configuration of a Distributed Storage System’ [6]

To configure the system, they intended to design a software control loop (Figure 1) that automatically decides how to configure the storage system to optimize the performance. Thus, the system would have a closed loop consisting of a monitor which constantly verifies the behavior of the distributed storage system operations and extracts a set of measurements at given moment. The actual measurements may change depending on the target metric and on the configuration. Examples of measurements include: amount of data sent through network, amount of data stored, operation response time, system processing time, and communication latency. The monitor informs the system’s behavior to the controller by sending the measurements. The controller, then, analyzes the metrics, the impacts of its last decisions and may dispatch new actions to change the configuration of the storage system. Now, let us chip into the Big Data!

4 Big Dataset

While when it comes to cloud computing, not many of us know exactly how it will be used by the enterprise. But, something that is becoming increasingly clear is that Big Data is the future of Information Technology. So, tackling Big Data is the biggest challenge we would face in the next wave of cloud computing innovation. Data is growing exponentially everywhere from users, applications, machines and no vertical or industry is being spared. And the result is that organizations everywhere are being forced to grapple with storing, managing and extracting value from every piece of it as cheaply as possible. Yes, the race has begun!

IT architectures have been revisited/reinvented so many times in the past in order to remain competitive. The shift from the mainframes to distributed microprocessing environments, their subsequent shifts to web services through Internet are few of those to mention. While cloud computing will leverage these technological advantages in computing and networking to tackle Big Data, it will also embrace deep innovations in the areas of storage/data management.

4.1 Big Data stack

Before cloud computing will be broadly embraced by the customers, a new protocol stack will need to emerge. Infrastructure capabilities such as virtualization, security, systems management, provisioning, access control, availability, etc. will have to be standard before IT organizations are able to deploy the cloud infrastructure completely. In particular, the new framework needs the ability to process data real-time and in greater orders of magnitude. They could leverage commodity servers for computing and storage.

The cloud stack discussed above has already been implemented in many forms at large-scale data center networks. At the same pace, as the volume of data exploded, these networks also quickly encountered the scaling limitations of traditional SQL databases. To solve this, distributed and object-orientated data stores are being developed and implemented at scale. Many solved this problem through techniques like MySQL instance sharding, in essence using SQL more as data stores than true relational databases (i.e. no table joins, etc.). However, as internet data centers scaled, this approach didn’t help much.

4.2 The rise of Distributed Non-relational DBMS

Large web properties have been building their own so-called ‘NoSQL’ databases. But while it can seem like a different version of DDBMS sprouts up every day [2], they can largely be categorized into two flavors: One, distributed key value/tuple stores, such as Azure (Windows), CouchDB, MongoDB and Voldemort (LinkedIn); and two, wide column stores/column families such as HBase (Yahoo/Hadoop), Cassandra (Facebook) and Hypertable(Zvents).

These projects are in various stages of deployment and adoption but promise to deliver a scalable data layer on which applications can be built elastically. One facet that is common across these myriad of ‘NoSQL’ databases is a data cache layer, which is essentially a high-performance, distributed memory caching system that can accelerate applications in web by avoiding continual database hits.

5 Hadoop HBase Vs Cassandra

Managing distributed, non-transactional data has become more challenging. Internet data center networks are collecting massive volumes of data every second that somehow need to be processed cheaply. One solution that was been deployed by some of the largest web giants like Facebook, Yahoo, LinkedIn, etc. for distributed file systems and computing in a cloud is Hadoop [7]. In many cases, Hadoop essentially provides a primary compute and storage layer for the NoSQL databases. Although the framework has roots in data centers, Hadoop is quickly penetrating broader enterprise use cases.

Another popular distributed database that gathered the most attention is Cassandra [8]. Cassandra purports to being a hybrid of BigTable and Dynamo, whereas HBase is a near-clone of Google’s BigTable [3]. The split between Hbase and Cassandra can be categorized into the features that it supports and the architecture itself.

As far as I am concerned, customers seem to consider HBase as being more suitable for data warehousing, large scale data analysis (such as that involved when indexing the Web) and processing and Cassandra as being more suitable for real-time transaction processing and serving of interactive data. Before moving on to the comparisons, let us take a look at a powerful theorem.

5.1 Consistency, Availability and Partition tolerance (CAP)

CAP is a powerful theorem that applies to the development and deployment of distributed systems (in our case distributed databases). This theorem was developed by Prof. Brewer, Co-founder of Inktomi. According to Wikipedia CAP theorem is stated as follows: “The CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

* Consistency (all nodes see the same data at the same time)

* Availability (node failures do not prevent survivors from continuing to operate)

* Partition tolerance (the system continues to operate despite arbitrary message loss)” [11].

The theorem states that a distributed system (“shared data” in our context) design, can offer at most two out of three desirable properties such as Consistency, Availability and Partition tolerance to network. Very basically from database context, “Consistency” means that if one writes a data value to a database, other users will immediately be able to read the same data value back. “Availability” means that if some number of nodes fail in a cluster (means network of nodes) the distributed DBMS can remain operational. And, “Partition tolerance” means that if the nodes or machines in a cluster are divided into two or more groups that can no longer communicate (maybe because of a network failure), the system again remains functional.

Many developers (and in fact customers), including many in the HBase community have taken it to heart that their database systems can only support two of these three CAP properties and have accordingly worked to this design principle. Indeed, after reading many posts in online forums, I regularly find the HBase community explaining that they have chosen Consistency and Partition (CP), while Cassandra has chosen Availability and Partition tolerance (AP). It is a fact that most developers would need consistency (C) in their network at some level.

However, the CAP theorem explained above only applies to a single distributed algorithm. And there is no reason why one cannot design a single database system where for any given operation or input, an underlying algorithm is selectable. Thus while it is true that a distributed database system may only offer two of these properties per operation, what has been widely missed is that a system can be designed that allows a user (caller) to choose which properties they want when any given operation is performed. Not only that, a system should also be possible to offer differing degrees of balance between consistency, availability and tolerance to partition. This is one place where Cassandra stands out. The beauty of Cassandra is that you can choose the operation and hence the trade-offs you want on a case by case basis such that they best match the requirements of the particular operation you are performing. Cassandra proves you can go beyond the popular interpretation of the CAP Theorem and keep the clock ticking.

5.2 Modularity

Another important distinction between Cassandra and HBase is that while Cassandra comes as a single Java process to be run per node. In contrast, a Hadoop HBase solution is comprised of several parts – we have the database process itself, which may run in several modes, a properly configured and operational HDFS [7] (hadoop distributed file system) setup, and a Zookeeper system to coordinate the different HBase processes. Both the approaches for a distributed DBMS have its own pros and cons.

5.3 MapReduce

Here is the definition of MapReduce from Wikipedia for those not versed in this technology – “MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers. The framework is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as their original forms. MapReduce libraries have been written in C++, C#, Erlang, Java, Ocaml, Perl, Python, Ruby, F#, R and other programming languages.” [12].

From distributed databases perspective, MapReduce[13] is a system for the parallel processing of vast amounts of data, such as the extraction of statistics from millions of pages (from DB tables) from the web etc. And one thing Cassandra can’t do well yet is MapReduce! MapReduce and related systems such as Pig and Hive work well with HBase architecture because it uses Hadoop Distributed File System (HDFS) to store its data, which is the platform these systems were primarily designed to work with, which is not the case with Cassandra due to the absence of HDFS. If one need to do that kind of data crunching and analysis, HBase stands out and may be the best option for deployment in the cloud.

5.4 Ease of Use and Installation

In this section we will see how distributed database architecture installations and usability differ from each other. Talking about Cassandra, it is only a Ruby gem install away. It’s pretty impressive. However, we still have to do quite a bit of manual configuration. Whereas HBase is a TAR package that you need to install and setup on your own. HBase has a pretty decent documentation, making the installation process a little more straightforward than it could’ve been. Now let’s take a look at its user interfaces.

HBase installations comes with a very nice Ruby shell that makes life easy to perform most of the basic database operations like create/modify, set/retrieve data, and so on. People use it constantly to test their code. Cassandra did not have a shell. It just had a basic API. Recently, they had introduced a shell which I think is not popular yet. HBase also has a nice web-based UI that you can use to view cluster status, determine which nodes store various data, and do some other basic database operations. Cassandra, again initially lacked this web UI, making it harder to operate but they introduced one later. Their web page looks much more professional and nicer than HBase. It’s very easy to get it up and running. The website is well documented. Literally, it takes only few minutes to set it up and get it running. Overall, Cassandra wins on installation, Hbase on usability.

5.5 A Comparison study on the Architecture

The real challenge in deploying a database model in a cloud is to understand its data model and try to implement it based on the user requirements. Even though Cassandra and HBase inherits the same data model from BigTable, there are some fundamental differences between them.

Hadoop [7] is a framework that provides a distributed file system. It provides analysis and transformation of very large data sets using its MapReduce [13] paradigm. An important characteristic of Hadoop is the partitioning of data and computation across thousands of nodes (or hosts), and executing application computations in parallel close to their data. HDFS stores metadata on a dedicated server, called the NameNode. Application data are stored on other servers called DataNodes.

HBase is a distributed database that is based on BigTable which uses the Hadoop Filesystem (HDFS) as its data storage engine. The advantage of this approach is that, HBase doesn’t need to worry about data replication, data consistency and resiliency because HDFS has handled it already. In the HBase architecture, data is stored in Region Servers which are nodes located as part of the Hadoop cluster. The key-to-server mapping is needed to locate the corresponding region server and this mapping is stored as a table similar to other user data table. The client contacts a predefined Master server to learn the Root region and META region servers which holds a mapping from user table to actual region servers (where the actual table data resides). Master also monitors and coordinates the activities of all region servers.

Apache Cassandra [8] is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. While in many ways Cassandra resembles a database and shares many design and implementation strategies therewith, Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. Cassandra system was designed to run on cheap commodity hardware and handle high write throughput while not sacrificing read efficiency. Created at Facebook, it is now used at some of the most popular sites on the Web.

The other key differences between these two systems are highlighted below:

5.5.1 Concept of Table

Cassandra lacks concept of a Table. All the documentation will point us that it’s not common to have multiple key spaces in a cluster. It means that one has to share a key space in a cluster. Furthermore adding a key space requires a cluster restart which is costly. In Hbase, concept of Table exists. Each table has its own key space. This was a big win for HBase over Cassandra. You can add and remove table as easily as a RDBMS table.

5.5.2 Universally Unique Identifier (UUID)

Universally Unique Identifier (UUID) is a standardized unique identifier (in the form of a 128 bit number). In its canonical form UUIDs are represented by a 32 digit hexadecimal number. UUIDs are used in column comparisons in Cassandra. Cassandra uses lexical UUID for non-time based comparisons. It is compared lexically by byte value. One can also use TimeUUID if they want their data to be sorted by time. HBase uses binary keys. It’s common to combine three different items together to form a key. This means that, in HBase, we can search by more than one key in a give table.

5.5.3 Hot Spotting

Hot spotting is a situation that occurs in a distributed database system when all the client requests go to one server in a cluster of nodes. Hot spotting problem won’t occur in Cassandra even if we use TimeUUID, as Cassandra naturally load balances client requests to different servers. In HBase, if your key’s first component is time or a sequential number, then hot spotting occurs. All of the new keys will be inserted to one region until it fills up (hence by causing a hot spotting problem). Automatic hot spot detection and resharding is on the roadmap for HBase.

5.5.4 Column Sorting and Supercolumn

Cassandra offers sorting of columns whereas Hbase does not. And the concept of ‘Supercolumn’ (Supercolumn is a tuple with a binary name and value which is a map containing an unbounded number of columns keyed by the column’s name) in Cassandra allows you to design very flexible, complex schemas. Hbase does not have supercolumns. But we can design a super column like structure as column names.

Moreover, columns and supercolumns are both a tuples with a name and value. The key difference is that a standard column’s value is a “string” and in a supercolumn the value is a Map of columns. Another minor difference is that supercolumns don’t have a timestamp component associated with them.

5.5.5 Eventual consistency

Cassandra does not have any convenient method to increment a column value. In fact, the very nature of eventual consistency makes it difficult to update/write a record and read it instantly after the update. HBase by design is consistent. It offers a nice convenient method to increment columns which makes it much suitable for data aggregation.

5.5.6 MapReduce Support

Let us recall the definition of MapReduce from section 5.3 “MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers”. MapReduce support is new in Cassandra and not straight forward as implemented in HBase. We will need a hadoop cluster to run it. Data will be transferred from Cassandra cluster to the hadoop cluster. However, Cassandra is not suitable for running large data map reduce jobs. Map Reduce support in Hbase is native. HBase is built on top of Hadoop distributed file system. Data in HBase does not get transferred as it is in the case of Cassandra.

5.5.7 Complexity

Cassandra is comparatively simpler to maintain since we don’t have to have hadoop DFS setup and running. HBase on the other hand is comparatively complicated as we have many moving pieces such as Zookeeper, Hadoop (like NameNode, Region servers etc.) and HBase (Master Server, Region Server, etc.) itself.

5.5.8 Client interface

Cassandra does not have a native Java API to interface with clients as of now. Even though written in Java, one has to use a third party framework like Thrift for the clients to communicate with a cluster. Hbase has a nice native Java API. Feel much more like a Java system than Cassandra. HBase also supports thrift interface for other languages (clients) like C++, PHP or Python to interface with the cluster nodes.

5.5.9 Single Point of Failure

Cassandra has no master server, and hence no single point of failure. In HBase, although there exists a concept of a master server, the system itself does not depend on it heavily. HBase cluster can keep serving data even if the master goes down. However, Hadoop’s Namenode (which is part of HDFS) is a single point of failure.

In my opinion, if consistency is what a customer is looking for, HBase is a clear choice. Furthermore, the concepts of table, native MapReduce and simpler schema that can be modified without a cluster restart are a big plus one cannot ignore. HBase supports both Java API and thrift, and is a much more matured platform. In forums, when people site Facebook and Twitter using Cassandra, they forget that the same organizations are using HBase too. In fact, Facebook hired one of the code committers of HBase which clearly showed their interest in HBase.

At the same time, I should also point out that Cassandra and HBase should not necessarily be viewed as competitors. While it is true that they may often be used for the same purpose inside a cloud (in much the same way as PostgreSQL and MySQL), what I believe will likely emerge in the future is that they will become preferred solutions for different cloud applications. As an example, ‘StumbleUpon’ has been using HBase with the associated Hadoop MapReduce technologies to crunch the vast amounts of data added to its service. ‘Twitter’ and many more companies are now using Cassandra for real time interactive community posts in social networking.

6 Conclusion

We presented an introduction to the clouds and the need for a distributed database environment. In section 2, we saw various features of relational databases, drawbacks that worried customers in the clouds and the need for a transition to distributed database models in order to address the scalability limitations. In the section following that, we learned the distributed database model, components of distributed database system, homogeneous and heterogeneous distributed DBMS types, did a comparative study with the relational/non-relational databases on scalability, datasets and storage issues with their benefits and drawbacks. We then saw how automatic configuration of distributed systems could help administrators to easily manage their networks. In section 4, we saw the importance of big data in the future of Information technology and how companies have tackled this biggest challenge so far. Further, after comparing the data models and features (in Section 5), we noticed how HBase stands out in most of the features and aspects.

One of the main challenges of cloud computing and management today is to provide ease of programming, scalability, consistency and elasticity over cloud data at the same time. Though current solutions have been quite successful, they were developed with specific, relatively simple applications in mind. In particular, they have sacrificed a bit of consistency and ease of programming for the sake of scalability. This has resulted in a pervasive approach relying on data partitioning and forcing applications to access data partitions individually, with a loss of consistency guarantees across data partitions. As the need to support tighter consistency requirements, e.g., for updating multiple tuples in one or more tables, increases, cloud application developers will be faced with a very difficult problem in providing isolation and atomicity across data partitions through careful engineering. I believe that new solutions are needed that capitalize on the principles of distributed and parallel database systems to raise the level of consistency and abstraction, while retaining the scalability and simplicity advantages of current solutions.

Order Now