The Hadoop Distributed File System Information Technology Essay

Hadoop Distributed File System is a distributed file system available as an open source software under the Apache Software Foundation. Its salient features can be enlisted as reliability, availability and scalability that helps it effectively manage peta bytes of data. It makes use of the Map/Reduce algorithms that provide enhanced indexing features, which in turn accelerate the parallel processing of the data blocks within the distributed file system. HDFS is being used extensively in renowned organizations such as Google, Yahoo and Facebook as a means to accommodate tons of information securely and more importantly at lower production costs.

The Hadoop distributed file system is an integral component of the Hadoop architecture along with the Map/Reduce process. Its basic principles were derived from the Google Big Table file system. A major contribution to the evolution of HDFS was that of an open source developer, Douglas Cutting who was then developing a search engine called ‘Nutch’.[1] HDFS uses the Java platform as its programming language and is suitable to heterogeneous operating systems or hardware. It is composed of clusters comprising of several data nodes and name nodes which in turn make enormous amounts of data in terms of peta bytes available rapidly as a matter of enhanced parallel processing. The Map/Reduce process is responsible for the data storage format and retrieval from the distributed file system. HDFS offers a high level of scalability and reliability which are essential while dealing with large data chunks. Hence the use of this file system has been incorporated by the software giants such as Google, Facebook and Amazon to make information available on the fly to millions of users. More importantly, HDFS is cost effective and additional nodes can be comfortably plugged into the existing architecture.

However issues such as portability, performance and latency can hinder the overall functionality of the file system. Hence techniques such as combining a relational database together with HDFS are being deployed to have faster throughput and support for enhanced querying practices.

Chapter 2. Hadoop Architecture

The Hadoop architecture comprises of the backend as the Hadoop distributed file system. HDFS confines to the file format specified by a column oriented table service known as HBASE. This layer is preceded by a protocol such as Map/Reduce which is a distributed computation framework. To enable enhanced querying options, Hadoop makes use of Hive, a data warehouse infrastructure or Pig, which is a dataflow language and parallel execution framework. Lastly the entire configuration is supervised under a distributed coordination framework such as Zookeeper. [1]

2.1 Hadoop Cluster

HDFS in particular consists of several clusters made up of name nodes and data nodes. It follows a master-slave design pattern. The name node has a failure backup in the form of a secondary node. All transactions to and from the distributed file system are processed by the HDFS Client. The detailed description of each component is as follows:

C:UsersSHAJIDesktopcluster.gif[2]

2.1.1 NameNode:

The NameNode contains a namespace tree which maps the set of files to their directories in the form of hash-maps. This storage is handled by inodes which also manipulate the access rights and permissions associated with each block in the DataNode. The namespace forms a part of the system RAM as the metadata to be processed in terms of peta bytes. There is a single NameNode per cluster. Other constituents of the NameNode are image, checkpoint and journal.[3] The image comprises of the location of blocks and their content. The checkpoint creates a record of the timestamp at which modifications are performed on the image and lastly, the journal stores the transactions associated with the DataNodes and it plays a vital role during NameNode recovery. Thus the NameNode acts as a master who keeps track of each block and its replicas within multiple DataNodes.

DataNode:

Data in the HDFS is stored in the form of blocks with sizes ranging from 0-128 megabytes. Whenever a block of data is inserted into the system, it is simultaneously stored on two different DataNodes. Two set of files are used to display each block of code. One file contains the data itself whereas the other contains metadata and its corresponding checksum. The file size is proportional to the block length. Each DataNode has a namespace ID and version number associated with it and the same is verified when it performs a handshake with the NameNode on start up. After the handshake, the DataNode registers a storage id which becomes permanently linked with it and can be used to locate it later. The transactions such as creation, retrieval, update and deletion of blocks between the NameNode and DataNode takes place in the form of heartbeats. The heartbeats convey information to the NameNode about the availability of the blocks in the DataNode and vice versa send instructions to replicate or erase

existing block data. Thus the DataNode acts as a slave who executes the commands of the NameNode as well as performs the vital task of data storage.

2.1.3 Secondary Node:

The SecondaryNode or the BackupNode acts as a hot standby for the NameNode. It uses the concept of check pointing images of the file directory structure of the NameNode at regular intervals. Thus if the NameNode were to crash, then on its recovery it can easily synchronize with the current namespace by incorporating the last updated image. The SecondaryNode is unable to access the block locations within the DataNodes and hence has a limited functionality. However it can operate as a temporary NameNode and perform read operations on the blocks listed in the namespace in cases of the former’s sudden failure.

HDFS Client:

The HDFS Client is an interface to the internal distributed file system. The file input and output operations are performed via interactions between the both of them. The most common functions called are to read, write, update and delete block entries from the namespace. The client does not necessarily require information about the location of the other block replicas, the data from each block is streamed into a pipeline based on the Hadoop protocols. Let us consider the processing of the file I/O operations as follows:

Read operation:

In order to read data from a block, the client acquires information about the location of that block and its replicas from the namespace. This data is sorted according to the proximity of the client to each block. Once the location is retrieved, the client directly contacts the DataNode. Before reading the data, the client verifies the block with its corresponding checksum from the metadata file to ensure the data is not corrupt. The

client then reads data from the blocks in a parallel order. When reading from a file that is being written, the client requests the replicas to provide it with the latest block length so that it does not avoid reading any data reflected later.

Write operation:

Only a single client can write to a file at a time, the data to be written is deployed in the form of heartbeats to the NameNode as discussed before. A new block is generated by assigning it a unique id and simultaneously linking it to its replicas on other DataNodes. The set of nodes are arranged in a pipeline beginning with the one which is closest to the client. The data is written in the form of packets of bytes which are streamed through the pipeline. Each block of data has five packets associated with it. While writing each block, a checksum is calculated which is transferred along with it. Data can be forwarded from one block to the next without any acknowledgement from the prior DataNodes. Lastly, when the client issues a close function call, the data that has been written gets permanently stored and becomes visible to the client.

Map/Reduce Framework:

C:UsersSHAJIDesktopmapreduce.gif[4]

The Hadoop framework offers a parallel processing application i.e. the Map/Reduce framework. It follows a poll model in which the Job Tracker is polled by multiple Task Trackers to have either map or reduce jobs assigned to them. The framework offers a platform for clients to define their map/reduce functionalities. Whenever a job is submitted by the client, the Job Tracker consults the NameNode to provide it with the location of the blocks required to perform the task. It then checks the number of available slots in each Task Tracker and allocates them jobs based on factors such as proximity, performance and amount of pending tasks. The entire block is split into smaller chunks and which are submitted over to the user defined map function. This process maps the input to a common key such that a record containing a key value pair is formed. Eventually the values in each chunk get mapped onto a specific key within the distributed file system. On finishing the mapping tasks, the Task Trackers now poll for reducing jobs. The reduce function aims at combining the values pertaining to a unique key into a single value. It is preceded by read and sort functions which retrieve mapped data from the database and sort them according to the user defined methods. The output of the reduce process is a file which is stored permanently into the file system. Thus the smooth functioning of the Map/Reduce framework depends on the coordination of the Job and the Task Tracker and the internal functioning of the map/reduce functions.

Chapter 3. Advantages and Disadvantages of Hadoop

Advantages:

3.1.1 Fault tolerance

Every write operation to a block results in its replication at two independent locations within the cluster. Thus even if a single replica gets corrupt, data can be obtained from the other two. Besides, writes are reflected only upon the issue of a commit command. The SecondaryNode maintains check point images of the namespace tree to act as a BackupNode in case of NameNode failure. Thus Hadoop offers a fault tolerant system to maintain the continuous flow of data even during cases of sudden crashes.

Open file format

Hadoop is an open source project under Apache and hence is available to the public without any imposition of licensing or management costs. The benefit of being an open file format is that multiple developers can express their views on the map/reduce implementations designed by themselves as well as resolve bugs associated with the system files.

Flexible schema

Hadoop follows basically a NoSql database structure which is different from relational databases in the fact that structured tables are not maintained in the former. Each block acts as a row in a file. The data fields of a single record can be updated or altered without affecting the schema of other records. This flexible schema allows faster querying as there is no rigid protocol to follow while dealing with the data entries.

Pluggable and cost effective architecture

Additional servers can be plugged into the cluster to enhance the data processing bandwidth. The server gets assigned a unique id and eventually begins its set of actions. Since data is replicated on the different nodes, the need of a superior data storage sever gets eliminated and moreover Hadoop being open source, additional costs for support or licensing become insignificant.

Disadvantages:

3.2.1 Cluster unavailability during NameNode failure

The NameNode is the single point of failure of the Hadoop distributed file system. It contains the namespace tree which links the blocks of data to their respective DataNodes. Hence if it crashes, the location of the blocks cannot be determined and hence no file I/O operations can be scheduled. In such a situation, if the block replicas are within the same cluster, then all corresponding transactions come to a halt. The Secondary node cannot replace the NameNode but can only provide check pointed information to it on recovery.

Scalability of NameNode

The NameNode maintains a list of the DataNodes in the cluster mapped with their corresponding locations. However the size of the namespace tree is limited to that of the NameNode capacity. Although we can add servers to the cluster comfortably, the probability of its location being mapped becomes lesser as the numbers of nodes keep increasing. Hence the scalability of the NameNode is a matter to be addressed while including more servers into the Hadoop cluster.

Built for Batch jobs and not interactive work

Hadoop deals with the processing of peta bytes of data in terms of storing profiles of millions of users, videos, images or events. Interactive work such as chat do not require transfer of large data sets and hence the use of Hadoop becomes unnecessary to the system. The amount of data exchanged being hardly few kilobytes in size, querying to the file system can result in longer waiting period. Hence batch jobs containing sets of static inputs can be efficiently processed using this system.

Latency and low throughput

Latency in Hadoop occurs as a matter of the multiple stages of the file system structure. Writing into a block requires block replication, followed by mapping values into a key value pair format. This pair is then subjected to a reduce mechanism in which map entries are sorted and then merged into a single key and value pair. During read, the client has to obtain block locations from the NameNode and then proceed to the DataNode. These intermediary processes increase the output latency and hence lower the overall system throughput.

Chapter 4. Alternatives to the Hadoop file system

The primary disadvantage of using Hadoop is its single point of failure which is the NameNode. Also as the NameNode cannot be scaled extensively there is a limitation to the number of DataNodes it can manage. In such a situation, Ceph, an object based parallel file system having a C++ programming platform comes into picture. The Ceph metadata server is spread across hundreds of nodes thus providing a huge namespace capacity to locate system files. Unlike the explicit map/reduce framework in Hadoop, Ceph allows each file to indicate a striping pattern and object size preferred by it. Automatic failure recovery and data balancing during modification of the cluster size makes Ceph more feasible to be used for large scale integration. It achieves these key features through three structural elements which are a separate metadata sever (MDS) cluster to maintain a dynamic sub-tree partitioning for the management of distributed metadata, calculated random placement of data i.e. controlled replication under scalable hashing or CRUSH and lastly, a container for the monitor service and the object storage devices (OSDs) known as the reliable autonomic distributed object store or RADOS. [5]

Databases such as Cassandra (wide column store system), Voldemort (consistent key value store system) , MongoDB (document store system), U2 (multivalued database), Versant (object database) and MemcacheDB (tuple store system) are similar NoSQL databases that relate to Hadoop. When connected across a common hosting platform, each of their unique properties can be merged to support huge data volumes maintaining overall integrity, atomicity, consistency and durability of the stored data. [6]

Chapter 5. Case Study of Hadoop at Facebook

The basic infrastructure for Hadoop was set by Yahoo Inc. which was then used as a prototype for various corporations. Today, Facebook hosts one of the largest Hadoop clusters. The typical configuration of such a cluster consists of 2000 nodes which are categorized as 1200 nodes with 8 cores and 800 nodes with 16 cores. Each node has a RAM of 32 GB and storage of 12 TB. On the whole, the entire cluster offers a massive storage capacity of 21 PB. In terms of user date generated through the web interface we have 200 and more active users out of which million upload videos, images, update statuses, post comments and involve in varied applications each day. Hence the amount of information available for data warehousing grows exponentially. To handle such situations, Facebook has a special Data warehouse cluster which has 12 TB of compressed data added along with 800 TB of compressed data scanned daily. It deploys about 25,000 map/reduce jobs on its collection of 60 million files. The overall synchronization strategy is well maintained to service 30,000 client requests to the NameNode.

The entry of client information into the Hadoop is facilitated by Scribe which is an open source log aggregator directly linked to the HDFS. It is used as a filter to stream the web services invoked by the clients via Facebook. The inflow of data is first processed using Hive which is a warehouse itself that enables interactive querying and effective analysis of bulk data in the file system. It provides a pluggable implementation for the map/reduce framework as well as hosts a querying language called HiveQL. The Hive warehouse has a storage capacity of 5.5 PB being processed constantly by 4800 cores. It follows a two level network topology in which the data transfer rate is 1 Gbit/sec from node to rack switch and 4 Gbit/sec from rack to top level switch. The Hive warehouse plays a vital role in the processing of data since it accounts for 95% of the total tasks performed at Facebook.[7]

C:UsersSHAJIDesktophh.png[8]

Today, Facebook has moved onto the Cloud computing platform to integrate clusters globally offering enhanced support for data storage. It sole purpose was to address issues such as

demand for resources, poor job isolation techniques and its related service level agreements. The last component of the Facebook Hadoop architecture is the Fair share scheduler such as Apache Jira, which follows resource aware scheduling. It distributes the memory usage among jobs involving the real time collection of the CPU capacity through slot-less scheduling of jobs.

Thus the use of Hadoop at Facebook becomes the primary source of data storage and analysis available to the multitude of users worldwide. Several architects from Linux, Sun and GPU are attempting to integrate the Hadoop framework with their respective technologies to promote its use in a heterogeneous environment and facilitate processing of large PB data sets.

Chapter 6. State of the art

The software giants of the industry have been switching onto the Hadoop infrastructure to enhance their data storage and retrieval policies. These firms use Hadoop in varied sectors for advancement. Yahoo and Amazon use it for the purpose of search engine mechanisms to accelerate fetching of client requested information. Facebook uses Hadoop for processing log data as well as recommendation systems to publicize about upcoming events or gadgets. AOL along with Facebook uses it for storage and online analytical processing. Hadoop supports a NoSql feature that enables column oriented data storage and hence large data sets can be saved easily. Besides, organizations like the New York Times and Joost make use of character large objects and binary large objects to provide analysis of videos and images. As mentioned earlier, the integration of Ceph with Hadoop is also under progress so that their dual abilities can be coupled together to process bulk distributed data. Through use of algorithms which route smaller tasks to relational databases and heavier tasks to HDFS, we can reduce the latency factor and achieve higher throughput. The latest trend of companies such as HP, IBM and Intel is to host the Hadoop framework onto the cloud computing platform so that web scaling becomes available to tons of users worldwide. [7]

Chapter 7. Conclusion

Hadoop is a powerful system developed as a platform to support an enormous quantity of varied data applications. It provides an interface to process both structured and complex data thus facilitating heterogeneous data consolidation. The open source feature of Hadoop makes its commodity servers cheaper to use accompanied with improved performance ratio. The Map/Reduce framework offers users the flexibility to manipulate the methods of analysing their data. Although the NameNode failure is critical to the file system, use of the BackupNode or a coordination service such as Zookeeper helps maintain the availability of information throughout the service cycles. Issues such as scalability of the NameNode can be addressed by generation of multiple Namenodes so as to distribute the namespace across several other nodes.

In the world today where millions of users interact with each other through social networking media or bulky statistical records, the heavy data inflow requires systems such as Hadoop to analyse and persist them. By hosting Hadoop onto the cloud framework, we can successfully take the current level of data management into a wider seamless and global dimension.

Order Now