Fragment Allocation In Distributed Database Design

A distributed database is a database that consists of two or more data files located at different sites on a computer network. Because the database is distributed, different users can access it without interfering with one another. However, the DBMS must periodically synchronize the scattered databases to make sure that they all have consistent data. Put another way, a distributed database is a database under the control of a central database management system (DBMS) in which the storage devices are not all attached to a common CPU. It may be stored on multiple computers located in the same physical location, or it may be dispersed over a network of interconnected computers.

Collections of data (e.g. in a database) can be distributed across multiple physical locations. A distributed database can reside on network servers on the Internet, on corporate intranets or extranets, or on other company networks. Replication and distribution of databases improve database performance at end-user worksites.

Two processes can keep the distributed databases up to date:

Replication.

Duplication.

Replication uses specialized software that looks for changes in the distributed database. Once the changes have been identified, the replication process makes all the databases look the same. Depending on the size and number of the distributed databases, replication can be complex and can consume considerable time and computing resources.

Duplication, on the other hand, is less complicated. It identifies one database as the master and then copies that database to every other location. Duplication is normally done at a set time after hours, so that each distributed location ends up with the same data. Only changes to the master database are allowed, which ensures that local data will not be overwritten. Either process can keep the data current in all distributed locations.
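The difference between the two processes can be sketched in a few lines of Python. This is only an illustration, not how any particular DBMS works: each site is modelled as a dictionary, the function names `replicate` and `duplicate` are hypothetical, and the rule that the copy with the highest version number wins is an assumption, since the description above does not say how conflicting changes are reconciled.

```python
from typing import Dict, List, Tuple

# Each site maps a record key to (version, value). The version counter is an
# assumption used to decide which copy of a record is newer.
Site = Dict[str, Tuple[int, str]]


def replicate(sites: List[Site]) -> None:
    """Look for changes at every site, then make all sites look the same.

    Assumed rule: for each key, the copy with the highest version wins.
    """
    merged: Site = {}
    for site in sites:
        for key, (version, value) in site.items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    for site in sites:
        site.clear()
        site.update(merged)


def duplicate(master: Site, copies: List[Site]) -> None:
    """Overwrite every copy with the master; changes made at a copy are lost."""
    for copy in copies:
        copy.clear()
        copy.update(master)


# Replication: changes made at either site survive.
paris = {"cust1": (1, "old address"), "cust2": (1, "Smith")}
tokyo = {"cust1": (2, "new address")}
replicate([paris, tokyo])

# Duplication: only the master's content survives.
master = {"cust1": (3, "master address")}
branch = {"cust1": (2, "new address"), "cust9": (1, "local-only row")}
duplicate(master, [branch])
```

Replication has to inspect every site, which is why it tends to be the more expensive of the two; duplication simply copies in one direction, which is why changes are only allowed at the master.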

Besides replication and fragmentation, there are many other distributed database design technologies, for example local autonomy and synchronous and asynchronous distributed database technologies. How these technologies are implemented depends on the needs of the business and on the sensitivity and confidentiality of the data to be stored, and hence on the price the business is willing to pay to ensure data security, consistency and integrity.

Basic architecture

A database user accesses the distributed database through:

Local applications: Applications that do not require data from other sites.

Global applications: Applications that do require data from other sites.

The sites of a distributed database do not share main memory or disks.
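The distinction between local and global applications can be expressed as a simple check. The sketch below is illustrative only: the function name `classify_application`, the site names and the fragment-to-site catalogue are all assumptions made for the example.

```python
from typing import Dict, Set


def classify_application(home_site: str, fragments_needed: Set[str],
                         fragment_site: Dict[str, str]) -> str:
    """Return "local" if every needed fragment is stored at the home site,
    otherwise "global"."""
    remote = {f for f in fragments_needed if fragment_site[f] != home_site}
    return "global" if remote else "local"


# Hypothetical catalogue: which site stores which fragment.
fragment_site = {"orders_eu": "paris", "orders_us": "boston", "staff": "paris"}

payroll = classify_application("paris", {"staff"}, fragment_site)
sales_report = classify_application("paris", {"orders_eu", "orders_us"}, fragment_site)
```

Here `payroll` only touches data held in Paris and is a local application, while `sales_report` needs a fragment stored in Boston and is therefore global.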

Main Features and Benefits of a Distributed System

A common misconception is that a distributed system is just another name for a network of computers. However, this overlooks an important distinction. A distributed system is built on top of a network and tries to hide the existence of multiple autonomous computers. It appears as a single entity providing the user with whatever services are required. A network, by contrast, is a medium for interconnecting entities (such as computers and devices) that enables these entities, which are explicitly addressable (using an IP address, for example), to exchange messages based on well-known protocols.

There are various types of distributed systems, such as Clusters [3], Grids [4], P2P (Peer-to-Peer) networks, distributed storage systems and so on. A cluster is a dedicated group of interconnected computers that appears as a single super-computer, generally used in high-performance scientific, engineering and business applications. A grid is a type of distributed system that enables coordinated sharing and aggregation of distributed, autonomous, heterogeneous resources based on users' QoS (Quality of Service) requirements. Grids are commonly used to support applications emerging in the areas of e-Science and e-Business, which commonly involve geographically distributed communities of people who engage in collaborative activities to solve large-scale problems and require sharing of various resources such as computers, data, applications and scientific instruments. P2P networks are decentralized distributed systems, which enable applications such as file-sharing, instant messaging, online multiuser gaming and content distribution over public networks. Distributed storage systems such as NFS (Network File System) provide users with a unified view of data stored on different file systems and computers, which may be on the same or different networks.

The main features of a distributed system include:

Functional Separation: Based on the functionality/services provided, capability and purpose of each entity in the system.

Inherent distribution: Entities such as information, people, and systems are inherently distributed. For example, different information is created and maintained by different people. This information could be generated, stored, analyzed and used by different systems or applications which may or may not be aware of the existence of the other entities in the system.

Reliability: Long term data preservation and backup (replication) at different locations.

Scalability: Addition of more resources to increase performance or availability.

Economy: Sharing of resources by many entities to help reduce the cost of ownership.

As a consequence of these features, the various entities in a distributed system can operate concurrently and possibly autonomously. Tasks are carried out independently, and actions are coordinated at well-defined stages by exchanging messages. Also, entities are heterogeneous, and failures are independent. Generally, there is no single process or entity that has knowledge of the entire state of the system.

Various kinds of distributed systems operate today, each aimed at solving different kinds of problems. The challenges faced in building a distributed system vary depending on the requirements of the system. In general, however, most systems will need to handle the following issues:

Heterogeneity: Various entities in the system must be able to interoperate with one another, despite differences in hardware architectures, operating systems, communication protocols, programming languages, software interfaces, security models, and data formats.

Transparency: The entire system should appear as a single unit, and the complexity of and interactions between the components should typically be hidden from the end user.

Fault tolerance and failure management: Failure of one or more components should not bring down the entire system, and should be isolated.

Scalability: The system should work efficiently with an increasing number of users, and the addition of a resource should enhance the performance of the system.

Concurrency: Shared access to resources should be made possible.

Openness and Extensibility: Interfaces should be cleanly separated and publicly available to enable easy extension of existing components and the addition of new components.

Migration and load balancing: Allow the movement of tasks within a system without affecting the operation of users or applications, and distribute load among available resources for improving performance.

Security: Access to resources should be secured to ensure only known users are able to perform allowed operations.

Several software companies and research institutions have developed distributed computing technologies that support some or all of the features described above.

Fragment Allocation in Distributed Database Design

On a Wide Area Network (WAN), fragment allocation is a major issue in distributed database design because it affects the overall performance of distributed database systems. Here we propose a simple and comprehensive model that reflects transaction behavior in distributed databases. Based on the model and transaction information, two heuristic algorithms are developed to find a near-optimal allocation such that the total communication cost is minimized as far as possible. The results show that the fragment allocation found by the algorithms is close to optimal. Some experiments were also conducted to verify that the cost formulas truly reflect the communication cost in the real world.
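To make the idea of a near-optimal heuristic concrete, the following Python sketch places the copies of a single fragment greedily and compares the result with an exhaustive search. It is not one of the two algorithms referred to above, which are not described here. The greedy rule, the cost function and all names (`cost`, `greedy_allocation`, `optimal_allocation`) are assumptions chosen for illustration: a read is served by the cheapest copy, a write is sent to every copy, and `dist[a][b]` is the cost of one transfer between sites a and b, with a zero diagonal.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

Dist = List[List[float]]


def cost(replicas: FrozenSet[int], reads: Dict[int, float],
         writes: Dict[int, float], dist: Dist) -> float:
    """Assumed cost: reads go to the nearest copy, writes update every copy."""
    read_part = sum(f * min(dist[s][r] for r in replicas) for s, f in reads.items())
    write_part = sum(f * sum(dist[s][r] for r in replicas) for s, f in writes.items())
    return read_part + write_part


def greedy_allocation(n_sites: int, reads: Dict[int, float],
                      writes: Dict[int, float], dist: Dist) -> FrozenSet[int]:
    """Start from the best single site, then add the best extra copy
    for as long as doing so lowers the total cost."""
    def c(a: FrozenSet[int]) -> float:
        return cost(a, reads, writes, dist)

    best = min((frozenset([s]) for s in range(n_sites)), key=c)
    while True:
        candidates = [best | {s} for s in range(n_sites) if s not in best]
        if not candidates:
            return best
        nxt = min(candidates, key=c)
        if c(nxt) >= c(best):
            return best
        best = nxt


def optimal_allocation(n_sites: int, reads: Dict[int, float],
                       writes: Dict[int, float], dist: Dist) -> FrozenSet[int]:
    """Exhaustive search over every non-empty set of sites (small n only)."""
    subsets = (frozenset(combo) for k in range(1, n_sites + 1)
               for combo in combinations(range(n_sites), k))
    return min(subsets, key=lambda a: cost(a, reads, writes, dist))


# Three sites; all numbers are made up for illustration.
dist = [[0, 4, 10],
        [4, 0, 6],
        [10, 6, 0]]
reads = {0: 20, 1: 5, 2: 10}    # read frequency per site
writes = {0: 5, 1: 1, 2: 1}     # write frequency per site

greedy = greedy_allocation(3, reads, writes, dist)
optimal = optimal_allocation(3, reads, writes, dist)
```

On this small instance both searches place copies at sites 0 and 2, at a cost of 90. Exhaustive search grows exponentially with the number of sites, which is why a heuristic is attractive, but a greedy rule like this one is not guaranteed to match the optimum in general.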

Introduction

Distributed database design involves the following interrelated issues:

(1) How a global relation should be fragmented,

(2) How many copies of a fragment should be replicated,

(3) How fragments should be allocated to the sites of the communication network,

(4) What information is necessary for fragmentation and allocation.

These issues complicate distributed database design. Even if each issue is considered individually, it is still an intractable problem. To simplify the overall problem, we address the fragment allocation issue only, assuming that all global relations have already been fragmented. Thus, the problem investigated here is determining the number of replicas of each fragment and then finding a near-optimal allocation of all fragments, including the replicated ones, in a Wide Area Network (WAN) such that the total communication cost is minimized.

For a read request issued by a transaction, it may be simple just to load the target fragment at the issuing site, or it may be a little more complicated to load the target fragment from a remote site. A write request could be the most complicated, since a write propagation must be executed to maintain consistency among all the fragment copies if multiple copies are spread throughout the network. The frequency of each request issued at the sites must also be considered in the allocation model. Since the behaviors of different transactions may result in different optimal fragment allocations, cost formulas should be derived to minimize the transaction cost according to the transaction information.
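The three cases described above (local read, remote read, and write propagation) can be written down as a small cost model. The Python sketch below is one plausible reading of that description, not the cost formulas of the model itself, which are not given here. The names `read_cost`, `write_cost` and `total_cost`, the distance matrix and the frequencies are all assumptions.

```python
from typing import Dict, List, Set

# Assumed inputs (none of these names come from the text):
#   dist[a][b] - cost of shipping the fragment, or an update to it, from site a to site b
#   replicas   - set of sites currently holding a copy of the fragment
Dist = List[List[float]]


def read_cost(site: int, replicas: Set[int], dist: Dist) -> float:
    """A read is free if a copy is local; otherwise it fetches from the cheapest remote copy."""
    if site in replicas:
        return 0.0
    return min(dist[site][r] for r in replicas)


def write_cost(site: int, replicas: Set[int], dist: Dist) -> float:
    """A write is propagated to every copy held at another site."""
    return sum(dist[site][r] for r in replicas if r != site)


def total_cost(replicas: Set[int], reads: Dict[int, float],
               writes: Dict[int, float], dist: Dist) -> float:
    """Frequency-weighted communication cost of one fragment."""
    cost = sum(f * read_cost(s, replicas, dist) for s, f in reads.items())
    cost += sum(f * write_cost(s, replicas, dist) for s, f in writes.items())
    return cost


# Three sites; all numbers are made up for illustration.
dist = [[0, 4, 10],
        [4, 0, 6],
        [10, 6, 0]]
reads = {0: 20, 1: 5, 2: 10}   # site -> read frequency
writes = {0: 2, 1: 0, 2: 1}    # site -> write frequency

single = total_cost({0}, reads, writes, dist)     # one copy at site 0
double = total_cost({0, 2}, reads, writes, dist)  # copies at sites 0 and 2
```

With these numbers, adding a second copy at site 2 lowers the total cost from 130 to 50: the remote reads saved at site 2 outweigh the extra write propagation. With heavier write traffic the balance would tip the other way, which is why the request frequencies have to be part of the model.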

Alchemi: An example distributed system

In a typical corporate or academic environment there are many resources which are generally under-utilized for long periods of time. A “resource” in this context means any entity that could be used to fulfill any user requirement; this includes compute power (CPU), data storage, applications, and services. An enterprise grid is a distributed system that dynamically aggregates and coordinates various resources within an organization and improves their utilization such that there is an overall increase in productivity for the users and processes. These benefits ultimately result in substantial cost savings for the business, since it will not need to purchase expensive equipment to run its high-performance applications.

The desirable features of an enterprise grid system are:

Enabling efficient and optimal resource usage.

Sharing of inter-organizational resources.

Secure authentication and authorization of users.

Security of stored data and programs.

Secure communication.

Centralized / semi-centralized control.

Auditing.

Enforcement of Quality of Service (QoS) and Service Level Agreements (SLA).

Interoperability of different grids (and hence a basis in open standards).

Support for transactional processes.

Alchemi is an Enterprise Grid computing framework developed by researchers at the GRIDS Lab, in the Computer Science and Software Engineering Department at the University of Melbourne, Australia. It allows the user to aggregate the computing power of networked machines into a virtual supercomputer and develop applications to run on the Grid with no additional investment and no discernible impact on users. The main features offered by the Alchemi framework are:

Virtualization of compute resources across the LAN / Internet.

Ease of deployment and management.

Object-oriented “Grid thread” programming model for grid application development.

File-based “Grid job” model for grid-enabling legacy applications.

Web services interface for interoperability with other grid middleware.

Open-source and .NET-based, with simple installation using Windows installers.

Alchemi grids follow the master-slave architecture, with the additional capability of connecting multiple masters in a hierarchical or peer-to-peer fashion to provide scalability of the system. An Alchemi grid has three types of components, namely the Manager, the Executor, and the User Application itself. The Manager node is the master / controller whose main function is to service user requests for workload distribution. It receives a user request, authenticates the user, and distributes the workload across the various Executors that are connected to it. The Executor node is the one that actually performs the computation. Alchemi uses role-based security to authenticate users and authorize execution. A simple grid is created by installing Executors on each machine that is to be part of the grid and linking them to a central Manager component.
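The division of labour between Manager and Executor can be illustrated with a toy model. The Python sketch below is not Alchemi's API (Alchemi itself is a .NET framework); the class and method names, the `"submitter"` role and the round-robin scheduling are all assumptions made to show the flow of a request: authenticate, distribute, execute.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Executor:
    """Worker node: performs the computation it is handed."""
    name: str
    completed: List[str] = field(default_factory=list)

    def run(self, task_id: str, work: Callable[[], int]) -> int:
        result = work()
        self.completed.append(task_id)
        return result


@dataclass
class Manager:
    """Master node: checks the user's role, then spreads tasks over executors."""
    executors: List[Executor]
    roles: Dict[str, Set[str]]  # user -> roles; a stand-in for role-based security

    def submit(self, user: str, tasks: Dict[str, Callable[[], int]]) -> Dict[str, int]:
        if "submitter" not in self.roles.get(user, set()):
            raise PermissionError(f"{user} may not run grid applications")
        results = {}
        # Round-robin distribution is an assumption; the text does not say
        # how Alchemi schedules work across Executors.
        for i, (task_id, work) in enumerate(tasks.items()):
            executor = self.executors[i % len(self.executors)]
            results[task_id] = executor.run(task_id, work)
        return results


# The "user application": four independent tasks submitted to a two-node grid.
manager = Manager([Executor("e1"), Executor("e2")], roles={"alice": {"submitter"}})
squares = manager.submit("alice", {f"t{n}": (lambda n=n: n * n) for n in range(4)})
```

In this sketch the tasks run one after another inside a single process; in a real grid each Executor would be a separate machine and the tasks would run in parallel.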

Advantages of distributed databases

Management of distributed data with different levels of transparency.

Increased reliability and availability.

Easier expansion.

Reflects organizational structure: database fragments are located in the departments they relate to.

Local autonomy: a department can control the data about itself, as its staff are the ones familiar with it.

Protection of valuable data: if there were ever a catastrophic event such as a fire, all of the data would not be in one place, but distributed in multiple locations.

Improved performance: data is located near the site of greatest demand, and the database systems themselves are parallelized, allowing load on the databases to be balanced among servers. (A high load on one module of the database won’t affect other modules of the database in a distributed database.)

Economics: it costs less to create a network of smaller computers with the power of a single large computer.

Modularity: systems can be modified, added and removed from the distributed database without affecting other modules (systems).

Reliable transactions: due to replication of the database.

Hardware, Operating System, Network, Fragmentation, DBMS, Replication and Location Independence.

Continuous operation.

Distributed Query processing.

Distributed Transaction management.

Disadvantages of distributed databases

Complexity: extra work must be done by the DBAs to ensure that the distributed nature of the system is transparent. Extra work must also be done to maintain multiple disparate systems instead of one big one. Extra database design work must also be done to account for the disconnected nature of the database; for example, joins become prohibitively expensive when performed across multiple systems.

Economics: increased complexity and a more extensive infrastructure mean extra labour costs.

Security: remote database fragments must be secured, and because they are not centralized, the remote sites must be secured as well. The infrastructure must also be secured (e.g., by encrypting the network links between remote sites).

Difficult to maintain integrity: in a distributed database, enforcing integrity over a network may require too much of the network’s resources to be feasible.

Inexperience: distributed databases are difficult to work with, and as this is a young field there is not much readily available experience on proper practice.

Lack of standards: there are no tools or methodologies yet to help users convert a centralized DBMS into a distributed DBMS.

Database design more complex: besides the normal difficulties, the design of a distributed database has to consider fragmentation of data, allocation of fragments to specific sites and data replication.

Additional software is required.

The operating system must support a distributed environment.

Concurrency control: this is a major issue, typically addressed by locking and timestamping.
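Of the two techniques, timestamping is the easier to show in a few lines. The Python sketch below applies the basic timestamp-ordering rule to a single data item: every transaction carries a timestamp, and an operation that arrives "too late" relative to what younger transactions have already done is rejected, so its transaction must restart. The names `Item`, `Abort`, `read` and `write` are hypothetical, and the sketch ignores distribution, recovery and locking entirely.

```python
from dataclasses import dataclass


@dataclass
class Item:
    value: int = 0
    read_ts: int = 0   # largest timestamp of any transaction that read the item
    write_ts: int = 0  # largest timestamp of any transaction that wrote the item


class Abort(Exception):
    """Raised when an operation arrives too late and its transaction must restart."""


def read(item: Item, ts: int) -> int:
    if ts < item.write_ts:  # a younger transaction has already overwritten the item
        raise Abort(f"read at ts={ts} rejected")
    item.read_ts = max(item.read_ts, ts)
    return item.value


def write(item: Item, ts: int, value: int) -> None:
    if ts < item.read_ts or ts < item.write_ts:  # a younger transaction got there first
        raise Abort(f"write at ts={ts} rejected")
    item.value, item.write_ts = value, ts


x = Item()
write(x, ts=2, value=10)  # accepted
seen = read(x, ts=3)      # accepted; sees the value written at ts=2
try:
    write(x, ts=1, value=99)  # older than the read at ts=3
    late_write_accepted = True
except Abort:
    late_write_accepted = False
```

The late write is rejected because a transaction with a newer timestamp has already read the item. Locking reaches the same goal differently: a transaction waits for access to the item instead of being restarted.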
