Introduction to Big Data Technologies

Research and development in the area of database technology during the past decade has been characterized by the pressing need for better application support beyond the traditional world, where primarily high volumes of simply structured data had to be processed efficiently. Advanced database technology provides new and much-needed solutions in many important areas; these same solutions often require thorough consideration in order to avoid introducing new problems. There have been times when an alternative database technology threatened to take a piece of the action, such as object databases in the 1990s, but these alternatives never gained much ground. After such a long period of relational dominance, the current excitement about non-relational databases comes as a surprise.

The RASP Pvt. Ltd. firm is a startup company experiencing a huge burst in the amount of data to be handled along with a rapidly growing customer base. This exponential growth in the business needs to be provisioned with mechanisms that handle an enormous amount and variety of data, provide scalability and availability, and increase performance. Besides this, reliability has to be ensured by enabling automatic failover and recovery. We aim to provide solutions that can help overcome these hurdles.

Our approach is to explore new technologies besides the mature and prevalent traditional relational database systems. We researched various non-relational database technologies and the features and functionalities they offer. The most important aspect of these new technologies is polyglot persistence, that is, using different databases for different needs within an organization. Our attempt was to provide a few solutions by combining the powerful features of these technologies and to offer an integrated approach to handle the problem at hand.

Outline

Introduction

History

Current Trends

Merits

Demerits

In Depth Problem Review

Storage

Computation

Performance

Scalability

Availability

Introduction to Big Data

What Is Big Data

Hadoop

MapReduce

HDFS

NoSQL Ecosystem

Document Oriented

Merits

Demerits

Case Study – MongoDB

Key Value

Merits

Demerits

Case Study – Azure Table Storage

Column Store

Merits

Demerits

Case Study – Cassandra

Graph

Merits

Demerits

Case Study – Neo4j

Solution Approach

NoSQL Methods to MySQL

Problem Addressed

Challenges

MongoDB & Hadoop

Problem Addressed

Challenges

Cassandra & Hadoop

Problem Addressed

Challenges

Azure Table Storage & Hadoop

Problem Addressed

Challenges

Neo4j & Hadoop

Problem Addressed

Challenges

Conclusion

References

1. Introduction

1.1 History

Database systems have been in use since the 1960s and have evolved considerably over the last five decades. Relational database concepts were introduced in the 1970s, and the RDBMS was born with such strong advantages and usability that it has sustained for almost 40 years now. In the 1980s, structured query languages were introduced, which further enriched the use of traditional database systems by making it possible to retrieve useful data in seconds with a two-line query. More recently, the internet has been used to empower databases, giving rise to distributed database systems.

1.2 Current Trends

The database has become an inevitable part of the IT industry and has its own significance in every application model. Almost all applications have a separate data layer that defines how data is stored and retrieved, and almost every language provides a mechanism to access the database. The scope of the IT industry is expanding with new technologies like mobile computing, and new types of databases are being introduced very frequently. Storage capacity, which was an issue until recently, has largely been solved with cloud technologies. This new trend also introduces new challenges for traditional database systems, such as large amounts of data, dynamically created data, storage issues and retrieval problems.

1.2.1 Merits

The main advantage of a database system is that it provides ACID properties and allows concurrency. The database designer takes care of redundancy control and data integrity by applying normalization techniques. Data sharing and transaction control are added advantages of an RDBMS. Data security is also provided to some extent, with built-in encryption facilities to protect data. Backup and recovery subsystems provided by the DBMS help to recover from data loss caused by hardware failure. Structured query languages provide easy retrieval and easy management of the database, and multiple views can be defined for different users.

1.2.2 Demerits

Database design is the most important part of the system, and it is difficult to design a database that delivers all the advantages mentioned above; the process is complex and hard to get right. Sometimes the cost of retrieval increases after normalization. Security is still limited, and managing database servers is costly. A single server failure can badly affect the entire business. The large amount of data generated on a regular basis is difficult to manage through a traditional system, and there is still little support for some types of data, such as media files.

1.3 In Depth Problem Review

1.3.1 Storage

Ever-increasing data has always been a challenge for the IT industry, and this data is often unstructured. Traditional database systems are not capable of storing such large amounts of unstructured data: as volume increases it becomes difficult to structure, design, index and retrieve the data. Traditional database systems also use physical servers to store data, which can lead to a single point of failure, and maintaining physical database servers is costly. Recovery is likewise complicated and time consuming for a traditional database system.

1.3.2 Performance

Normalization often affects performance. A highly normalized database contains a large number of tables, with many primary and foreign keys created to relate these tables to each other. Multiple joins are then needed to retrieve a record and its related data, and queries containing multiple joins deteriorate performance. Updates and deletes can also require many reads and writes. The designer of the database should consider all of these factors while designing it.

1.3.3 Scalability

In the traditional database model, data structures are defined when the table is created. For some data, especially text, it is hard to predict the length: if you allocate more length than needed, the space goes to waste, and if you allocate less, some systems will silently store only the part of the data that fits. You also have to be very specific with the data type; if you try to store a float value in an integer column, and another field is calculated from that column, all dependent data can be affected. Also, traditional databases focus more on performance than on scaling out.

1.3.4 Availability

As mentioned before, data is stored on database servers. Large companies have their own data stores located worldwide, and to increase performance data is split and stored in different locations. Routine tasks such as daily backups are run to protect the data. If the data is lost for any reason (natural calamities, fire, flood, etc.), the application will be down while the data is restored, which can take considerable time.

2. Introduction to Big Data

2.1 What is Big Data?

Big data is a term used to describe the exponential growth, availability, reliability and use of structured, semi-structured and unstructured data. There are four dimensions to Big Data:

Volume: Data is generated from various sources and collected in massive amounts. Social websites, forums and transactional data retained for later use produce terabytes and petabytes of data. This data needs to be stored in a meaningful way so that value can be extracted from it.

Velocity: Velocity is not just about producing data faster; it also means processing data faster to meet requirements. RFID tags, for example, generate loads of data and therefore demand systems that can ingest and process huge volumes quickly. Dealing with that much data fast enough is difficult for many organizations.

(Source: http://www.sas.com/big-data/index.html)

Variety: Today, organizations collect or generate data from many sources, ranging from traditional hierarchical databases created by OLAP systems and users to unstructured and semi-structured data such as email, videos, audio, transactional data, forums, text documents and meter-collected data. Most of this data is not numeric, yet it is still used in making decisions.

(Source: http://www.sas.com/big-data/index.html)

Veracity: As the variety and number of sources grow, it becomes difficult for decision makers to trust the information they use for analysis. Ensuring trust in Big Data is therefore a challenge.

2.2 Hadoop

Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. It was derived from Google's MapReduce and Google File System (GFS) papers. It is written in the Java programming language and supports applications that run on large clusters, providing them with reliability. Hadoop implements MapReduce and uses the Hadoop Distributed File System (HDFS). It is considered reliable and available because both MapReduce and HDFS are designed to handle node failures, keeping data available at all times.

Some of the advantages of Hadoop are as follows:

cheap and fast

scales to large amounts of storage and computation

flexible with any type of data

flexible with programming languages.

Figure 1: Multi-node cluster (source: http://en.wikipedia.org/wiki/File:Hadoop_1.png)

2.2.1 MapReduce:

MapReduce was developed for processing large amounts of raw data, for example crawled documents or web request logs. The data is distributed across thousands of machines so that it can be processed faster; this distribution implies parallel processing, with the same computation run on each machine against a different data set. MapReduce is an abstraction that allows engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance.

Figure 2: MapReduce Implementation (Source: http://code.google.com/edu/parallel/mapreduce-tutorial.html)

The MapReduce library in the user program first shards the input files into M pieces, each typically 16 MB to 64 MB, and then starts up many copies of the program on the cluster.

One of the copies of the program is special: the master. The master has M map tasks and R reduce tasks to assign to the worker nodes; it picks idle workers and assigns each one a map task or a reduce task.

A worker assigned a map task reads the contents of the corresponding input shard, parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate pairs produced by the Map function are buffered in memory.


Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these partitions on local disk are passed back to the master, which forwards them to the workers that perform the reduce function.

A worker assigned reduce work uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all the intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.

The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns to the user code.

After successful completion, the output of the MapReduce execution is available in the R output files.
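To make the flow above concrete, here is a minimal word-count job written against the Hadoop Java MapReduce API; the class name and the input/output paths are illustrative placeholders rather than anything from the original text. The map function emits a (word, 1) pair for every token, and the reduce function sums the counts per word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum the counts collected for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/input (placeholder)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /data/output (placeholder)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Such a job would typically be packaged as a JAR and submitted with the hadoop jar command, with the framework handling sharding, scheduling and fault tolerance as described above.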

2.2.2 Hadoop Distributed File System (HDFS):

The Hadoop Distributed File System is a portable, scalable, distributed file system written in Java for the Hadoop framework. An HDFS cluster consists of a single NameNode, which manages the file system metadata, and a cluster of DataNodes, which hold the blocks of data and serve them over the network using a block protocol specific to HDFS; not every node in the cluster needs to run a DataNode. Since the file system operates over the network, it uses the TCP/IP layer for communication, and clients use RPC to communicate with the nodes. HDFS stores large files, typically in blocks of 64 MB, across multiple machines. It achieves reliability by replicating the data across multiple hosts, and therefore does not require RAID on the host servers. With the default replication value of 3, data is stored on three nodes: two on the same rack and one on a different rack. The nodes communicate with each other to rebalance data, move copies around and keep the replication of data high.

HDFS has high-availability capabilities: the main metadata server can be failed over, manually or automatically, to a backup in the event of failure. The file system includes a Secondary NameNode, which connects to the Primary NameNode to build snapshots of the Primary NameNode's directory information. These snapshots are stored in local or remote directories and are used to restart a failed primary NameNode without replaying the entire journal of file system actions. Because the NameNode is the single point for storage and management of metadata, it can become a bottleneck when accessing a huge number of small files. HDFS Federation addresses this by serving multiple namespaces with separate NameNodes.

A further advantage of HDFS is the communication between the JobTracker and TaskTrackers about data location. Knowing where the data resides, the JobTracker assigns map or reduce jobs to TaskTrackers accordingly: if node P holds data (l, m, n) and node Q holds data (a, b, c), the JobTracker assigns node Q the map or reduce tasks for a, b, c and node P the tasks for l, m, n. This helps reduce unwanted traffic on the network.

Figure 3: HDFS Architecture (Source: http://hadoop.apache.org/docs/r0.20.2/images/hdfsarchitecture.gif)
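As a small illustration of how a client interacts with HDFS, the sketch below uses the Hadoop FileSystem Java API to write a file and read it back; the NameNode URI and file path are placeholders.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        // fs.defaultFS must point at the NameNode; the URI here is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Create a file; HDFS splits it into blocks and replicates each block
        // (three copies by default) across DataNodes.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back through the same FileSystem API.
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
          System.out.println(in.readLine());
        }
        fs.close();
      }
    }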

2.3 NoSQL Ecosystem

NoSQL refers to non-relational database management systems that differ from traditional relational database management systems in significant ways. NoSQL systems are designed for distributed data stores that require large-scale data storage; they are schema-less and scale horizontally. Relational databases rely on hard-and-fast, structured rules to govern transactions, encoded in the ACID model, which requires that the database always preserve atomicity, consistency, isolation and durability for each transaction. NoSQL databases instead follow the BASE model, which offers three looser guarantees: basic availability, soft state and eventual consistency.

The term NoSQL was coined by Carlo Strozzi in 1998 for his open-source, lightweight database that had no SQL interface. Later, in 2009, Eric Evans, a Rackspace employee, reused the term for databases that are non-relational, distributed and do not guarantee atomicity, consistency, isolation and durability. In the same year, NoSQL was discussed extensively at the "no:sql(east)" conference in Atlanta, USA, and it has since seen unprecedented growth.

Two primary reasons to consider NoSQL are: to handle data access with sizes and performance that demand a cluster, and to improve the productivity of application development by using a more convenient data interaction style. The common characteristics of NoSQL are:

Not using the relational model

Running well on clusters

Open-source

Built for 21st century web estates

Schema less

Each NoSQL solution uses a different data model, which can be placed in one of four widely used categories in the NoSQL ecosystem: key-value, document, column-family and graph. Of these, the first three share a common characteristic of their data models called aggregate orientation. Below we briefly describe each of these data models.

2.3.1 Document Oriented

The main concept of a document-oriented database is the notion of a "document". The database stores and retrieves documents, which encapsulate and encode data in standard formats or encodings such as XML, JSON or BSON. These documents are self-describing, hierarchical tree data structures, and the database can offer different ways of organizing and grouping them:

Collections

Tags

Non-visible Metadata

Directory Hierarchies

Documents are addressed in the database via a unique key that represents the document. Beyond a simple key-document lookup, the database also offers an API or query language that allows retrieval of documents based on their content.

2.3.1.1 Merits

Intuitive data structure

Simple “natural” modeling of requests with flexible query functions

Can act as a central data store for event storage, especially when the data captured by the events keeps changing.

With no predefined schemas, they work well in content management systems or blogging platforms.

Can store data for real-time analytics; since parts of the document can be updated, it is easy to store page views and new metrics can be added without schema changes.

Provides a flexible schema and the ability to evolve data models without expensive database refactoring or data migration, which suits e-commerce applications.

2.3.1.2 Demerits

Higher hardware demands because of more dynamic database queries, partly executed without prior data preparation.

Redundant storage of data (denormalization) in favor of higher performance.

Not suitable for atomic cross-document operations.

Since the data is saved as an aggregate, if the design of an aggregate is constantly changing, aggregates have to be saved at the lowest level of granularity. In this case, document databases may not work.

2.3.1.3 Case Study – MongoDB

MongoDB is an open-source document-oriented database system developed by 10gen. It stores structured data as JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. Language support includes Java, JavaScript, Python, PHP and Ruby, and sharding is supported via configurable data fields. Each MongoDB instance has multiple databases, and each database can have multiple collections; when a document is stored, we choose which database and collection it belongs in.

Consistency in a MongoDB database is configured by using replica sets and choosing to wait for writes to be replicated to a given number of slaves. Transactions at the single-document level are atomic: a write either succeeds or fails. Transactions involving more than one operation are not possible, with only a few exceptions. MongoDB implements replication, providing high availability using replica sets, in which two or more nodes participate in asynchronous master-slave replication. MongoDB has a query language expressed via JSON, with a variety of constructs that can be combined to create queries; data inside a document can be queried without having to retrieve the whole document by its key and then introspect it. Scaling in MongoDB is achieved through sharding: the data is split by a certain field and then moved to different Mongo nodes, and data is dynamically moved between nodes to ensure that shards are always balanced. We can add more nodes to the cluster and increase the number of writable nodes, enabling horizontal scaling for writes.
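The following sketch shows these basics using the MongoDB Java driver; the database, collection and field names are hypothetical. It inserts a schema-less document and then queries on the document's content rather than only by its key.

    import org.bson.Document;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;

    public class MongoExample {
      public static void main(String[] args) {
        // Connection string is a placeholder for a local mongod instance.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
          MongoCollection<Document> orders =
              client.getDatabase("shop").getCollection("orders");

          // Documents are schema-less BSON; fields can differ from document to document.
          orders.insertOne(new Document("customer", "alice")
              .append("total", 42.50)
              .append("items", java.util.Arrays.asList("book", "pen")));

          // Query on document content rather than only by key.
          Document first = orders.find(Filters.eq("customer", "alice")).first();
          System.out.println(first == null ? "not found" : first.toJson());
        }
      }
    }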

2.3.2 Key-value

A key-value store is a simple hash table, primarily used when all access to the database is via primary key. Key-value stores allow the application to store its data in a schema-less way; the data could be stored as a datatype of a programming language or as an object. The following variants exist: eventually-consistent key-value stores, hierarchical key-value stores, hosted services, key-value caches in RAM, ordered key-value stores, multivalue databases, tuple stores and so on.

Key-value stores are the simplest NoSQL data stores to use from an API perspective. The client can get or put the value for a key, or delete a key from the data store. The value is a blob that is stored without the database knowing what is inside; it is the responsibility of the application to understand what is stored.
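The entire client-facing contract can be captured in a few lines. The sketch below is a conceptual illustration only, not the API of any particular product: a get/put/delete interface over opaque values, backed here by an in-memory map.

    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;

    // The whole client-facing contract of a key-value store:
    // get, put and delete by primary key; the value is an opaque blob.
    interface KeyValueStore {
      void put(String key, byte[] value);
      Optional<byte[]> get(String key);
      void delete(String key);
    }

    // In-memory stand-in for illustration; a real store would distribute
    // the keys across a cluster rather than hold them in one process.
    class InMemoryKeyValueStore implements KeyValueStore {
      private final ConcurrentHashMap<String, byte[]> data = new ConcurrentHashMap<>();
      public void put(String key, byte[] value) { data.put(key, value); }
      public Optional<byte[]> get(String key) { return Optional.ofNullable(data.get(key)); }
      public void delete(String key) { data.remove(key); }
    }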

2.3.2.1 Merits

High and predictable performance.

Simple data model.

Clear separation of storage from application logic (because there is no query language).

Suitable for storing session information.

User profiles, product profiles, preferences can be easily stored.

Best suited for shopping cart data and other E-commerce applications.

Can be scaled easily since they always use primary-key access.

2.3.2.2 Demerits

Limited range of functions

High development effort for more complex applications

Not the best solution when relationships between different sets of data are required.

Not suited for multi operation transactions.

There is no way to inspect the value on the database side.

Since operations are limited to one key at a time, there is no way to operate upon multiple keys at the same time.

2.3.2.3 Case Study – Azure Table Storage

For structured forms of storage, Windows Azure provides structured key-value pairs stored in entities known as tables. Table storage uses a NoSQL model based on key-value pairs for querying structured data that does not live in a typical database. A table is a bag of typed properties that represents an entity in the application domain. Data stored in Azure tables is partitioned horizontally and distributed across storage nodes for optimized access.


Every table has a property called the partition key, which defines how data in the table is partitioned across storage nodes: rows that have the same partition key are stored in the same partition. In addition, tables can define row keys, which are unique within a partition and optimize access to a row within that partition. When present, the pair {partition key, row key} uniquely identifies a row in a table. Access to the Table service is through REST APIs.
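A minimal sketch of this model, assuming the classic Azure Storage SDK for Java (the table name, keys and connection string are placeholders): an entity is addressed by its {partition key, row key} pair, inserted, and then fetched with a point lookup.

    import com.microsoft.azure.storage.CloudStorageAccount;
    import com.microsoft.azure.storage.table.CloudTable;
    import com.microsoft.azure.storage.table.TableOperation;
    import com.microsoft.azure.storage.table.TableServiceEntity;

    public class AzureTableExample {

      // An entity is a bag of typed properties addressed by {PartitionKey, RowKey}.
      public static class CustomerEntity extends TableServiceEntity {
        private String email;
        public CustomerEntity() { }                 // required by the SDK
        public CustomerEntity(String partitionKey, String rowKey) {
          this.partitionKey = partitionKey;         // e.g. customer region (placeholder)
          this.rowKey = rowKey;                     // unique within the partition
        }
        public String getEmail() { return email; }
        public void setEmail(String email) { this.email = email; }
      }

      public static void main(String[] args) throws Exception {
        // Connection string is a placeholder for a real storage account.
        CloudStorageAccount account = CloudStorageAccount.parse(
            "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey");
        CloudTable table = account.createCloudTableClient().getTableReference("customers");
        table.createIfNotExists();

        CustomerEntity alice = new CustomerEntity("europe", "alice");
        alice.setEmail("alice@example.com");
        table.execute(TableOperation.insertOrReplace(alice));

        // Point lookup by {PartitionKey, RowKey}.
        CustomerEntity fetched = table.execute(
            TableOperation.retrieve("europe", "alice", CustomerEntity.class))
            .getResultAsType();
        System.out.println(fetched.getEmail());
      }
    }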

2.3.3 Column Store

Column-family databases store data in column families as rows that have many columns associated with a row key. These stores allow data to be stored with a key mapped to values, and the values grouped into multiple column families, each column family being a map of data. Column families are groups of related data that are often accessed together.

The column-family model is a two-level aggregate structure. As with key-value stores, the first key is often described as a row identifier, picking out the aggregate of interest. The difference with column-family structures is that this row aggregate is itself formed of a map of more detailed values; these second-level values are referred to as columns. The model allows accessing the row as a whole, while operations also allow picking out a particular column.
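Conceptually, this two-level aggregate structure behaves like a map of maps. The sketch below is only an illustration of the model in plain Java, not the API of any column-family database; the row key, family and column names are made up.

    import java.util.HashMap;
    import java.util.Map;

    // Conceptual shape of a column-family row: a row key maps to column
    // families, and each column family is itself a map of columns to values.
    public class ColumnFamilyModel {
      public static void main(String[] args) {
        Map<String, Map<String, Map<String, String>>> store = new HashMap<>();

        Map<String, String> profile = new HashMap<>();   // column family "profile"
        profile.put("name", "Alice");
        profile.put("city", "Pune");

        Map<String, String> orders = new HashMap<>();    // column family "orders"
        orders.put("order-1001", "book");

        Map<String, Map<String, String>> row = new HashMap<>();
        row.put("profile", profile);
        row.put("orders", orders);
        store.put("customer:alice", row);                // first-level key: the row identifier

        // Read the whole row, or pick out a single column.
        System.out.println(store.get("customer:alice"));
        System.out.println(store.get("customer:alice").get("profile").get("city"));
      }
    }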

2.3.3.1 Merits

Designed for performance.

Native support for persistent views on top of a key-value store.

Sharding: Distribution of data to various servers through hashing.

More efficient than row-oriented systems during aggregation of a few columns from many rows.

Column-family databases with their ability to store any data structures are great for storing event information.

Allows storing blog entries with tags, categories, links, and trackbacks in different columns.

Can be used to count and categorize visitors of a page in a web application to calculate analytics.

Provides a functionality of expiring columns: columns which, after a given time, are deleted automatically. This can be useful in providing demo access to users or showing ad banners on a website for a specific time.

2.3.3.2 Demerits

Limited query options for data

High maintenance effort when changing existing data, since all lists have to be updated.

Less efficient than row-oriented systems when accessing many columns of a single row.

Not suitable for systems that require ACID transactions for reads and writes.

Not good for early prototypes or initial tech spikes as the schema change required is very expensive.

2.3.3.3 Case Study – Cassandra

A column is the basic unit of storage in Cassandra. A Cassandra column consists of a name-value pair where the name behaves as the key. Each of these key-value pairs is a single column and is stored with a timestamp value, which is used to expire data, resolve write conflicts, deal with stale data, and so on. A row is a collection of columns attached or linked to a key, and a collection of similar rows makes a column family. A column family can be compared to a container of rows in an RDBMS table, where the key identifies the row and the row consists of multiple columns. The difference is that different rows do not need to have the same columns, and columns can be added to any row at any time without having to add them to other rows.

By design Cassandra is highly available, since there is no master in the cluster and every node is a peer. A write operation in Cassandra is considered successful once it is written to the commit log and to an in-memory structure known as a memtable. While a node is down, the data that was supposed to be stored by that node is handed off to other nodes; as the node comes back online, the changes made to the data are handed back to it. This technique, known as hinted handoff, allows for faster restoration of failed nodes. In Cassandra, a write is atomic at the row level, which means inserting or updating columns for a given row key is treated as a single write and will either succeed or fail. Cassandra has a query language that supports SQL-like commands, known as the Cassandra Query Language (CQL); CQL commands can be used, for example, to create a column family. Scaling in Cassandra is done by adding more nodes. As no single node is a master, adding nodes to the cluster improves its capacity to support more writes and reads, and the cluster keeps serving requests from clients while new nodes are being added, allowing for maximum uptime.
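A short sketch of these ideas, assuming the DataStax Java driver and a hypothetical keyspace and table: CQL statements create the schema, a single-row write is issued (atomic at the row level), and the read is served by whichever peer receives the request.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class CassandraExample {
      public static void main(String[] args) {
        // Contact point is a placeholder for one node; any peer can serve requests.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

          session.execute("CREATE KEYSPACE IF NOT EXISTS shop "
              + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
          session.execute("CREATE TABLE IF NOT EXISTS shop.orders ("
              + "customer text, order_id text, total double, "
              + "PRIMARY KEY (customer, order_id))");

          // A single-row write is atomic: it either succeeds or fails as a whole.
          session.execute("INSERT INTO shop.orders (customer, order_id, total) "
              + "VALUES ('alice', 'order-1001', 42.50)");

          Row row = session.execute(
              "SELECT total FROM shop.orders WHERE customer = 'alice'").one();
          System.out.println(row == null ? "not found" : row.getDouble("total"));
        }
      }
    }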

2.3.4 Graph

Graph databases allow storing entities and the relationships between them. Entities, also known as nodes, have properties; relationships, known as edges, can also have properties. Edges have directional significance, and nodes are organized by relationships, which allows finding interesting patterns between them. The organization of the graph lets the data be stored once and then interpreted in different ways based on relationships.

Relationships are first-class citizens in graph databases; most of the value of graph databases is derived from the relationships. Relationships don't just have a type, a start node and an end node; they can have properties of their own. Using these properties, we can add intelligence to the relationship, for example since when two people became friends, what the distance between the nodes is, or what aspects are shared between the nodes. These properties on the relationships can also be used to query the graph.

2.3.4.1 Merits

Very compact modeling of networked data.

High performance efficiency.

Can be deployed and used very effectively in social networking.

Excellent choice for routing, dispatch and location-based services.

As nodes and relationships are created in the system, they can be used to make recommendation engines.

They can be used to search for patterns in relationships to detect fraud in transactions.

2.3.4.2 Demerits

Not appropriate when an update is required on all or a subset of entities.

Some databases may be unable to handle lots of data, especially in global graph operations (those involving the whole graph).

Sharding is difficult as graph databases are not aggregate-oriented.

2.3.4.3 Case Study – Neo4j

Neo4j is an open-source graph database implemented in Java. It is described as an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. Neo4j is ACID compliant and can easily be embedded in individual applications.
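As an illustration of this embedded, transactional usage, here is a minimal sketch assuming the Neo4j 3.x embedded Java API; the store path, property names and relationship type are placeholders. It creates two nodes inside a transaction and connects them with a relationship that carries its own property.

    import java.io.File;

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.factory.GraphDatabaseFactory;

    public class Neo4jExample {
      enum RelTypes implements RelationshipType { KNOWS }

      public static void main(String[] args) {
        // Path to the embedded store is a placeholder.
        GraphDatabaseService graphDb =
            new GraphDatabaseFactory().newEmbeddedDatabase(new File("target/graph.db"));

        // All writes happen inside a transaction; dangling relationships are impossible.
        try (Transaction tx = graphDb.beginTx()) {
          Node alice = graphDb.createNode();
          alice.setProperty("name", "Alice");
          Node bob = graphDb.createNode();
          bob.setProperty("name", "Bob");

          // The relationship itself carries a property, e.g. since when they know each other.
          Relationship knows = alice.createRelationshipTo(bob, RelTypes.KNOWS);
          knows.setProperty("since", 2010);
          tx.success();
        }
        graphDb.shutdown();
      }
    }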

In Neo4j, a graph is created by making two nodes and then establishing a relationship between them. Graph databases ensure consistency through transactions and do not allow dangling relationships: the start node and end node always have to exist, and nodes can only be deleted if they don't have any relationships attached to them. Neo4j achieves high availability by providing for replicated slaves, and it is supported by query languages such as Gremlin (a Groovy-based traversal language) and Cypher (a declarative graph query language). There are three ways to scale graph databases:

Adding enough RAM to the server so that the working set of nodes and relationships is held entirely in memory.

Improve the read scaling of the database by adding more slaves with read-only access to the data, with all the writes going to the master. 

Sharding the data from the application side using domain-specific knowledge.

3. Solution Approach

3.1 NoSQL Methods to MySQL

3.1.1 Problem Addressed

The ever-increasing performance demands of web-based services have generated significant interest in providing NoSQL access methods to MySQL, enabling users to maintain all of the advantages of their existing relational database infrastructure while gaining fast performance for simple queries through an API that complements regular SQL access to their data.

Many features of MySQL Cluster make it ideal for the kinds of applications that are considering NoSQL data stores: scaling out, performance on commodity hardware, in-memory real-time performance and flexible schemas, among others. In addition, MySQL Cluster adds transactional consistency and durability. Various NoSQL APIs can also be combined simultaneously with full-featured SQL, all working on the same data set.

The MySQL Java APIs have the following features:

– Persistent classes

– Relationships

– Joins in queries

– Lazy loading

– Table and index creation from object model

By eliminating data transformations via SQL, users get lower data access latency and higher throughput. In addition, Java developers have a more natural programming method to directly manage their data, with a complete, feature-rich solution for object/relational mapping. As a result, the development of Java applications is simplified, with faster development cycles resulting in accelerated time to market for new services.
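As one concrete example of such an API, the sketch below assumes MySQL Cluster's ClusterJ interface; the table, database name and connect string are placeholders, and the mapped table is assumed to already exist in the cluster. A persistent interface is annotated with the table mapping, and rows are written and read by primary key without going through the SQL layer.

    import java.util.Properties;

    import com.mysql.clusterj.ClusterJHelper;
    import com.mysql.clusterj.Session;
    import com.mysql.clusterj.SessionFactory;
    import com.mysql.clusterj.annotation.PersistenceCapable;
    import com.mysql.clusterj.annotation.PrimaryKey;

    public class ClusterJExample {

      // Persistent class mapped to an existing NDB table (table name is hypothetical).
      @PersistenceCapable(table = "customer")
      public interface Customer {
        @PrimaryKey
        int getId();
        void setId(int id);

        String getName();
        void setName(String name);
      }

      public static void main(String[] args) {
        // Connect string points at the cluster management node (placeholder).
        Properties props = new Properties();
        props.put("com.mysql.clusterj.connectstring", "mgmhost:1186");
        props.put("com.mysql.clusterj.database", "shop");

        SessionFactory factory = ClusterJHelper.getSessionFactory(props);
        Session session = factory.getSession();

        // Primary-key write and read without going through the SQL layer.
        Customer c = session.newInstance(Customer.class);
        c.setId(1);
        c.setName("Alice");
        session.persist(c);

        Customer found = session.find(Customer.class, 1);
        System.out.println(found.getName());

        session.close();
        factory.close();
      }
    }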

MySQL Cluster offers multiple NoSQL APIs alongside Java:

Memcached for a persistent, high performance, write-scalable Key/Value store,

HTTP/REST via an Apache module

C++ via the NDB API for the lowest absolute latency.

Developers can use SQL as well as NoSQL APIs for access to the same data set via multiple query patterns, from simple primary key lookups or inserts to complex cross-shard JOINs using Adaptive Query Localization.

MySQL Cluster’s distributed, shared-nothing architecture with auto-sharding and real time performance makes it a great fit for workloads requiring high volume OLTP. Users also get the added flexibility of being able to run real-time analytics across the same OLTP data set for real-time business insight.

3.1.2 Challenges

NoSQL solutions are usually more cluster oriented, which is an advantage in speed and availability but a disadvantage in security. The problem is that the clustering aspect of NoSQL databases is not yet as robust or mature as it should be.

NoSQL databases are in general less complex than their traditional RDBMS counterparts, and this lack of complexity is a benefit when it comes to security. Most RDBMSs come with a huge number of features and extensions that an attacker could use to elevate privilege or further compromise the host. Two examples of this relate to stored procedures:

1) Extended stored procedures – these provide functionality that allows interaction with the host file system or network; buffer overflows are among the security problems encountered here.


2) Stored procedures that run as definer – RDBMSs such as Oracle and SQL Server allow standard SQL stored procedures to run under a different (typically higher) user privilege. There have been many privilege escalation vulnerabilities in stored procedures due to SQL injection.

One disadvantage of NoSQL solutions is their maturity compared with established RDBMSs such as Oracle, SQL Server, MySQL and DB2. With an RDBMS, the various types of attack vector are well understood and have been for several years. NoSQL databases are still emerging, and it is possible that whole new classes of security issue will be discovered.

3.2 MongoDB & Hadoop

MongoDB and Hadoop are a powerful combination and can be used together to deliver complex analytics and data processing for data stored in MongoDB. 

3.2.1 Problem Addressed:

We can perform analytics and ETL on large datasets by using tools like MapReduce, Pig and Streaming, with the ability to load and save data against MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for using MapReduce to process their data in MongoDB. Programmers of all kinds will find a new way to work with ETL, using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby programmers can likewise write native Mongo MapReduce jobs using the Hadoop Streaming interfaces.

MongoDB map-reduce performs parallel processing.

Aggregation is a primary use of the MongoDB-MapReduce combination.

The aggregation framework is optimized for aggregate queries.

It provides real-time aggregation similar to SQL's GROUP BY (see the sketch below).
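As a sketch of the aggregation side (independent of Hadoop), the example below uses the MongoDB Java driver's aggregation framework with a hypothetical collection of page views: a match stage filters by a cutoff timestamp and a group stage totals views per page, roughly the GROUP BY style query mentioned above.

    import static com.mongodb.client.model.Accumulators.sum;
    import static com.mongodb.client.model.Aggregates.group;
    import static com.mongodb.client.model.Aggregates.match;
    import static com.mongodb.client.model.Filters.gte;

    import java.util.Arrays;

    import org.bson.Document;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;

    public class MongoAggregationExample {
      public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
          MongoCollection<Document> views =
              client.getDatabase("analytics").getCollection("pageviews");

          // Group-by in the aggregation framework: total views per page since a
          // cutoff, roughly what SQL's GROUP BY would express.
          views.aggregate(Arrays.asList(
              match(gte("timestamp", 1356998400L)),   // placeholder cutoff
              group("$page", sum("views", 1))
          )).forEach(doc -> System.out.println(doc.toJson()));
        }
      }
    }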

3.2.2 Challenges

JavaScript is not the best language for processing MapReduce.

It is limited in external data-processing libraries.

MongoDB adds load to data stores.

Auto-sharding is not reliable.

3.3 Cassandra & Hadoop

Cassandra has been traditionally used by Web 2.0 companies that require a fast and scalable way to store simple data sets, while Hadoop has been used for analyzing vast amounts of data across many servers.

3.3.1 Problem Addressed

Running heavy analytics against production databases has not been successful, because it can slow the responsiveness of the database. For this distribution, DataStax takes advantage of Cassandra's ability to be distributed across multiple nodes.

In the DataStax setup, the data is replicated: one copy is kept on the transactional servers and another copy is placed on servers that perform the analytic processing.

We can implement Hadoop and Cassandra on the same cluster. This means that we can have real-time applications running under Cassandra while batch-based analytics and queries that do not require a timestamp can run on Hadoop.

Here, Cassandra replaces HDFS under the covers but this is invisible to the developer.

We can reassign nodes between the Cassandra and Hadoop environments as per the requirement.

The other positive factor is that using Cassandra removes the single points of failure that are associated with HDFS, namely the NameNode and JobTracker.

Performant OLTP + powerful OLAP.

Less need to shuffle data between storage systems.

Data locality for processing.

Scales with cluster.

Can separate analytics load into virtual DC.

3.3.2 Challenges

Cassandra replication settings are configured at the node level through configuration files.

In particular, the combination of more RAM and more effective caching strategies could yield improved performance. For interactive applications, we expect that Cassandra's support for multi-threaded queries could also help deliver speed and scalability.

Cassandra tends to be more sensitive to network performance than Hadoop, even with physically local storage, since Cassandra replicas do not have the ability to execute computing tasks locally as Hadoop does; tasks requiring a large amount of data may therefore need to transfer this data over the network in order to operate on it. We believe that a commercially successful cloud computing service must be robust and flexible enough to deliver high performance under a variety of provisioning scenarios and application loads.

3.4 Azure Table Storage & Hadoop

3.4.1 Problem Addressed

Broader access to Hadoop through simplified deployment and programmability. Microsoft has simplified the setup and deployment of Hadoop, making it possible to set up and configure Hadoop on Windows Azure in a few hours instead of days. Since the service is hosted on Windows Azure, customers only download a package that includes the Hive Add-in and the Hive ODBC Driver. In addition, Microsoft has introduced new JavaScript libraries to make JavaScript a first-class programming language in Hadoop; through this library, JavaScript programmers can easily write MapReduce programs in JavaScript and run these jobs from simple web browsers. These improvements reduce the barrier to entry by enabling customers to easily deploy and explore Hadoop on Windows.

Breakthrough insights through integration with Microsoft Excel and BI tools.

This preview ships with a new Hive Add-in for Excel that enables users to interact with data in Hadoop from Excel. With the Hive Add-in, customers can issue Hive queries to pull and analyze unstructured data from Hadoop in the familiar Excel environment. Second, the preview includes a Hive ODBC Driver that integrates Hadoop with Microsoft BI tools. This driver enables customers to integrate and analyze unstructured data from Hadoop using award-winning Microsoft BI tools such as PowerPivot and PowerView. As a result, customers can gain insight into all their data, including unstructured data stored in Hadoop.

Elasticity, thanks to Windows Azure. This preview of the Hadoop based service runs on Windows Azure, offering an elastic and scalable platform for distributed storage and compute.

The Hadoop on Windows Azure beta has several positive factors, including:

Setup is easy using the intuitive Metro-style Web portal.

Flexible language choices for running MapReduce jobs, and queries can be executed using Hive (HiveQL).

There are various connectivity options, like an ODBC driver (SQL Server/Excel), RDP and other clients, as well as connectivity to other cloud data stores from Microsoft (Windows Azure Blobs, the Windows Azure Data Market) and others (Amazon Web Services S3 buckets).

3.4.2 Challenges

HDFS is well suited for cases where data is appended at the end of a file, but not for cases where data needs to be located and/or updated in the middle of a file. With indexing technologies like HBase or Impala, data access becomes somewhat easier because keys can be indexed, but the inability to index into values (secondary indexes) allows only primitive query execution. There are, however, many unknowns in the version of Hadoop on Windows Azure that will be publicly released:

The recent release is a private beta only, and there is little information on the roadmap and planned release features.

Pricing hasn’t been announced.

During the beta, there is a limit to the size of files that can be uploaded, and Microsoft included a disclaimer that "the beta is for testing features, not for testing production-level data loads." It is therefore unclear what release-version performance will be like.

3.5 Neo4j & Hadoop

Neo4j is a graph database, and it is used with Hadoop to improve the visualization and processing of networked data stored in a Neo4j data store.

3.5.1 Problem Addressed

The basic point regarding graph databases, with reference to analytics, is that the more nodes you have in your graph, the richer the environment becomes and the more information you can get out of it. Hadoop is good for data crunching, but end results in flat files do not present well to the customer, and it is hard to visualize network data in Excel.

Neo4j is well suited to working with our networked data, and we use it a lot when visualizing our different sets of data. We prepare our dataset with Hadoop and import it into Neo4j, the graph database, to be able to query and visualize the data. Since we have many different ways in which we want to look at our dataset, we tend to create a new extract of the data with some new properties to look at every few days.

The use of a graph database allows for ad hoc querying and visualization, which has proven very valuable when working with domain experts to identify interesting patterns and paths. Using Hadoop again for the heavy lifting, we can do traversals against the graph without having to limit the number of features (attributes) of each node or edge used for traversal. The combination of both can be a very productive workflow for network analysis.

Neo4j, for example, supports ACID-compliant transactions and XA-compliant two-phase commit. So Neo4j might be better equated with a NoSQL database, except that it can also handle significant query processing.

3.5.2 Challenges

Hadoop hash-partitions data across nodes: the data for each vertex in the graph is randomly distributed across the cluster (dependent on the result of a hash function applied to the vertex identifier). Therefore, data that is close together in the graph can end up very far apart in the cluster, spread out across many different physical machines. Since hash partitioning creates no connection between graph locality and physical locality, a large amount of network traffic is required for each hop in the query pattern being matched (on the order of one MapReduce job per graph hop), which results in severe inefficiency.

Hadoop also has a very simple replication algorithm, in which all data is generally replicated a fixed number of times across the cluster. Treating all data equally when it comes to replication is quite inefficient. If data is graph-partitioned across a cluster, the data on the border of any particular partition is far more important to replicate than the data that is internal to a partition and already has all of its neighbors stored locally, because vertices on the border of a partition might have several of their neighbors stored on different physical machines.

Hadoop stores data in a distributed file system (HDFS) or a sparse NoSQL store (HBase). Neither of these data stores is optimized for graph data: HDFS is optimized for unstructured data and HBase for semi-structured data. However, there has been significant research in the database community on creating optimized stores for graph-structured data, and using a suboptimal store for graph data is another source of tremendous inefficiency.
