SMP And MPP Databases Analysis
Implementing Data Warehouses and Decision Support systems has become a necessity for almost every major organization. Organizations of almost every kind are investing heavily in building warehouses across their business functions. Data Warehouses, with their large volumes of integrated, consistent and conformed data, provide a competitive edge by enabling businesses to analyze past and current trends, monitor current patterns and shortcomings, and make informed decisions about the future.
The size of the average Data Warehouse is growing rapidly each year, as organizations increasingly look to gather every possible bit of information into the warehouse. Modern ETL tools provide excellent support for integrating data from varied and disparate sources such as mainframes, relational databases, XML files, and unstructured documents such as PDFs, emails and web pages.
It is not just the size of the Data Warehouse that is increasing; the utility and functionality expected of it are growing many times over as well. A large number of advanced, high-performance Business Intelligence applications (reporting, dashboards, scorecards, data mining and predictive modeling) now run over the Data Warehouse, and these applications execute highly complex queries that access large volumes of data. These two requirements, the ever-growing size of the Data Warehouse and the increasing complexity of the queries executed against it, have made it necessary to look for alternative relational database architectures and implementations that can scale to support efficient querying across large volumes of data with shorter response times. This in turn has raised the debate over choosing MPP (Massively Parallel Processing) databases over SMP (Symmetrical Multiprocessor) databases.
II. SMP (Symmetrical multiprocessor)
Symmetrical multiprocessor systems are single systems containing multiple processors (2 to 64, or even more) that share a common pool of memory and disk I/O resources equally. These systems are controlled by a single, centralized operating system. Because the processors share the system resources, those resources can be managed more effectively. Very high speed interconnects are used within SMP systems so that all processors have effective, equal access to memory and other resources.
Apart from high bandwidth, low communication latency is another property that SMP systems must have to scale well. This is because operations frequently used in data warehouses, such as index lookups and joins, involve the exchange of small data packets. When each message carries only a small amount of data, latency rather than bandwidth dominates the cost of communication, so low latency is paramount.
In an SMP system, multiple CPUs share the same memory, board, I/O and operating system, yet each CPU acts independently. While one CPU handles a database lookup, other CPUs can perform database updates or other tasks. As a result, the system can handle today's highly complex networking tasks with relative ease. SMP systems therefore also offer a degree of parallelism, in that multiple processors can perform independent operations in parallel.
SMP systems are relatively cheap compared to MPP systems. Upgrading also costs less, because scaling up the number of processors requires only an additional processor board. Processing power can thus be increased easily and seamlessly by adding extra processors.
However, SMP systems can only scale so far. Because all CPUs on the same board share a single memory bus, that bus can become a bottleneck that slows down processing. Instead of placing too many CPUs on the same SMP board, designers of high-end network elements can distribute applications across a networked cluster of SMP boards, each with its own memory array, I/O and operating system. However, this approach complicates upgrades. Network managers have to add network-specific code to applications. Also, because drivers are tightly bound to the kernel, moving them involves creating a new kernel image for each board.
III. MPP (Massively parallel processor)
Massively parallel systems are composed of many nodes. Each node is a separate computer with at least one CPU and its own local memory, and an interconnect links all the nodes together. Systems of this type have separate ALUs that run in parallel. Nodes communicate by message passing, typically using a standard such as MPI.
Each node in a massively parallel processor system is reached through an interconnect, which supports data transfer rates of 13 to 38 MB/sec. Every node contains its own CPU, memory and disk subsystem, so the nodes are self-sufficient. The system can be considered a "shared nothing" system: each node has its own memory, operating system and I/O subsystem, and nothing is shared. These systems are designed to scale well, and in principle allow any number of processors to be added.
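The shared-nothing idea can be illustrated with a minimal Python sketch. It is a single-process simulation, not a real MPP engine: the `Node` class, the `load` and `total_for_region` helpers and the sample sales rows are all hypothetical names and data introduced for illustration. Each node keeps its own rows, computes a partial result locally, and only that small result is passed back to a coordinator.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One shared-nothing node: its rows live only in its own local storage."""
    node_id: int
    rows: list = field(default_factory=list)  # (region, amount) tuples

    def partial_sum(self, region):
        # Runs locally and touches only this node's data.
        return sum(amount for r, amount in self.rows if r == region)


def load(nodes, rows):
    """Distribute rows round-robin so each node owns a disjoint slice."""
    for i, row in enumerate(rows):
        nodes[i % len(nodes)].rows.append(row)


def total_for_region(nodes, region):
    """Coordinator: ask every node for a partial result and combine the replies."""
    return sum(node.partial_sum(region) for node in nodes)


# Hypothetical sample data for the illustration.
sales = [("east", 10), ("west", 5), ("east", 7), ("west", 1), ("east", 3)]
nodes = [Node(i) for i in range(3)]
load(nodes, sales)
print(total_for_region(nodes, "east"))  # 20
```

In a real system the call to `partial_sum` would be a message sent over the interconnect, which is why the size and number of such messages matter for performance.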
MPP systems perform well when a problem can be partitioned so that no communication among nodes is needed and all nodes work in parallel. Such clean partitioning is rare, however, for the ad-hoc queries that are typical of data warehouses, so the performance that MPP systems promise is reduced in practice. The high scalability that MPP systems offer is also limited by data skew and by any heavy need for communication between nodes.
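To make the effect of data skew concrete, here is a small Python sketch under simplifying assumptions: rows are placed on nodes by `key mod number_of_nodes`, and the busiest node determines how long a parallel step takes. The function names and the key distributions are hypothetical and chosen only for illustration.

```python
from collections import Counter


def partition_counts(keys, n_nodes):
    """Count rows per node when each row goes to node (key mod n_nodes)."""
    counts = Counter(key % n_nodes for key in keys)
    return [counts.get(node, 0) for node in range(n_nodes)]


def skew_factor(counts):
    """Busiest node's load divided by the average load; 1.0 means perfectly even."""
    return max(counts) / (sum(counts) / len(counts))


even_keys = list(range(1000))               # every key appears once
skewed_keys = [7] * 700 + list(range(300))  # one hot key dominates

print(partition_counts(even_keys, 4))    # [250, 250, 250, 250]
print(partition_counts(skewed_keys, 4))  # [75, 75, 75, 775]
print(skew_factor(partition_counts(skewed_keys, 4)))  # 3.1
```

With the skewed keys, one node holds more than three times the average load, so the other three nodes sit idle for most of the step no matter how many nodes the system has.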
The failure of a single node not only reduces the available processing power but also makes the data located on that node inaccessible. In practice, single-processor nodes, termed "thin" nodes, are often augmented with multiprocessor nodes, termed "fat" nodes, which contain several processors in an SMP configuration. Such MPP systems have more processors per node and fewer nodes. In either case, the MPP architecture is a group of independent, shared-nothing nodes, each with its own CPU, local disks and memory, connected by a message-based interconnect.
IV. DEPLOYING DATA WAREHOUSE
Now that we have briefly discussed the inherent differences between SMP and MPP, this section details the considerations to take into account when deploying a Data Warehouse.
The main consideration when deploying a data warehouse is that it should be able to extract meaningful, non-obvious information from large amounts of data. Techniques such as relational intra-query parallelization, on-line analytical processing (OLAP), data mining and multidimensional databases can be used for this extraction.
To perform these analyses, powerful systems need access to many times the amount of data stored in any one of a company's operational systems. Organizations populate data warehouses by periodically transferring data from on-line transaction processing (OLTP) databases. These transfers are implemented as ETL routines that run on a fixed schedule at pre-defined intervals during the day. ETL routines may also run weekly, monthly or quarterly for sources that provide information at that frequency. Because the databases used in data warehouses differ from the operational OLTP source systems, the ETL from the source systems into the Data Warehouse can be a resource-intensive operation involving data extraction, data cleansing and conforming of the data. The amount of storage needed is staggering as well, since the entire operations of the company (sales, orders, operations, finance, etc.) are integrated within the Data Warehouse.
Because the usefulness of this data cannot be predicted at the outset, all of the company's data is usually stored in the data warehouse. Data warehouses also pose the constant challenge of rapid application deployment. In OLTP systems the workload is predictable and can be managed with careful tuning, whereas the workload of a data warehouse changes whenever new applications are created. Because of this constantly changing nature, every data warehouse requires custom configuration.
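The three ETL stages named above (extraction, cleansing and conforming) can be sketched in a few lines of Python. This is only an illustration under assumed data: the record layout, the `country_codes` mapping and the function names are hypothetical, and a real ETL tool would read from actual source systems rather than an in-memory list.

```python
def extract(source_rows):
    """Extract: read raw records from an operational source."""
    return list(source_rows)


def cleanse(rows):
    """Cleanse: drop records with no customer name and trim stray whitespace."""
    return [
        {**r, "customer": r["customer"].strip()}
        for r in rows
        if r.get("customer") and r["customer"].strip()
    ]


def conform(rows, country_codes):
    """Conform: map source-specific country names to one shared code."""
    return [
        {**r, "country": country_codes.get(r["country"].lower(), "UNKNOWN")}
        for r in rows
    ]


# Hypothetical raw records, as two source systems might deliver them.
raw = [
    {"customer": " Acme ", "country": "United States", "amount": 120},
    {"customer": "", "country": "USA", "amount": 40},
    {"customer": "Globex", "country": "usa", "amount": 75},
]
country_codes = {"united states": "US", "usa": "US"}

warehouse_rows = conform(cleanse(extract(raw)), country_codes)
print(warehouse_rows)
```

The record with an empty customer name is dropped during cleansing, and the two remaining records end up with the same conformed country code even though their sources spelled the country differently.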
Factors to consider when deploying data warehouse
1) Complexity of Query: Query complexity ranges from simple canned queries to data mining that uses artificial intelligence techniques. Canned queries use optimized, pre-compiled SQL to answer simple questions that are asked frequently. More complex data analysis is done with ad-hoc queries written in SQL. The queries that support data mining operations are the most complicated: they are not written in SQL, they are difficult to optimize, and they use computationally intensive methods such as neural networks and genetic programming.
2) Workload in Database: Decision support workloads range from interactive operation to batch operation. Data visualization packages access the data warehouse interactively, extracting data trends by executing pre-compiled queries.
3) System Architecture: Decision support systems (DSS) make use of parallel processing. Parallel computing architectures vary in the extent to which memory is hierarchical.
Symmetric multiprocessors access memory uniformly through high-speed buses or crossbar switches, which support point-to-point interconnection between processors. Clustered approaches use groups of SMP systems linked by slower interconnects. MPP systems use nodes whose local memory is accessed through a local high-speed bus, while communication among nodes is carried out over lower-speed, message-based interconnects.
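The difference between a canned query and an ad-hoc query, described under the first factor above, can be shown with Python's standard `sqlite3` module. This is a sketch only: the `sales` table, its columns and its rows are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "widget", 10.0), ("east", "gadget", 4.0), ("west", "widget", 6.0)],
)

# Canned query: the SQL text is fixed; only the parameter changes between runs.
CANNED_REGION_TOTAL = "SELECT SUM(amount) FROM sales WHERE region = ?"


def region_total(region):
    return conn.execute(CANNED_REGION_TOTAL, (region,)).fetchone()[0]


# Ad-hoc query: written by an analyst for a one-off question.
ad_hoc = conn.execute(
    "SELECT product, SUM(amount) AS total FROM sales "
    "GROUP BY product ORDER BY total DESC"
).fetchall()

print(region_total("east"))  # 14.0
print(ad_hoc)                # [('widget', 16.0), ('gadget', 4.0)]
```

Because the canned query's text never changes, the database can optimize it once and reuse the plan, whereas each new ad-hoc query has to be planned from scratch.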
VI. NEED FOR SCALABLE DATA WAREHOUSES
A Data Warehouse grows rapidly in size, and its growth cannot easily be anticipated accurately. Data warehouse implementations often start small and grow as the volume of data and the demands on it increase. They are often deployed with only a few processors at first and must later be able to support many times the initial processing capability.
Properties
When more processors are added to an SMP system, or more nodes to an MPP system, it is important that the system scales. Ideally, a Data Warehouse system should exhibit two properties to show good scalability: speed-up and scale-up.
1) Speed-up: If a job needs one time unit to complete with one processor, then with N processors it should need 1/N of that time. For example, if a job that needs five hours to complete with one processor needs only one hour with five processors, we say that the system scales well.
2) Scale-up: A system with excellent scale-up provides the same level of performance as the data warehouse grows, provided that processors or nodes are added in proportion. For example, if a batch job takes five hours to run when the database size is one terabyte, it will still take five hours when the size is two terabytes and the processing resources have been doubled.
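The two properties above reduce to simple ratios, sketched here in Python. The function names are illustrative, and the `efficiency` helper is an addition not named in the text; it simply expresses measured speed-up as a fraction of the ideal N-fold value.

```python
def speed_up(time_one_processor, time_n_processors):
    """How many times faster the same job runs after adding processors."""
    return time_one_processor / time_n_processors


def efficiency(time_one_processor, time_n_processors, n_processors):
    """Fraction of the ideal N-fold speed-up actually achieved (1.0 is ideal)."""
    return speed_up(time_one_processor, time_n_processors) / n_processors


def scale_up(time_small_system, time_large_system):
    """Ratio of run times when data and resources grow together (1.0 is ideal)."""
    return time_small_system / time_large_system


# The examples from the text: 5 hours on 1 processor, 1 hour on 5 processors.
print(speed_up(5.0, 1.0))        # 5.0
print(efficiency(5.0, 1.0, 5))   # 1.0
# 1 TB in 5 hours, then 2 TB in 5 hours on a proportionally larger system.
print(scale_up(5.0, 5.0))        # 1.0
# A less ideal case: the job takes 1.25 hours on 5 processors.
print(efficiency(5.0, 1.25, 5))  # 0.8
```

Values below 1.0 for `efficiency` or `scale_up` indicate that some of the added hardware is being lost to contention, communication or skew.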
To maintain scalability on an MPP system, the data must be re-partitioned across the nodes when nodes are added. This is a time-consuming and risky process because the databases are terabyte-sized. This step is not required on an SMP system.
Database administrators evaluate scalability by checking whether the system's behavior remains predictable as workload intensity increases. If it does, the system scales well.
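A rough sense of why re-partitioning is expensive can be had from the following Python sketch. It assumes the simplest placement rule, `key mod number_of_nodes`, which is an assumption made for illustration; real MPP databases may use other schemes that move less data. The function name and the key range are likewise hypothetical.

```python
def rows_moved(keys, old_nodes, new_nodes):
    """Count rows whose home node changes when the node count changes."""
    return sum(1 for k in keys if k % old_nodes != k % new_nodes)


keys = range(1000)
moved = rows_moved(keys, 4, 5)
print(moved)              # 800
print(moved / len(keys))  # 0.8
```

Under this placement rule, growing from four nodes to five relocates 80% of the rows, which on a terabyte-sized warehouse means shipping most of the database across the interconnect.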
VIII. CONCLUSIONS
Both SMP and MPP database servers can be used for Data Warehouse implementations, and each suits different situations. The choice between the two depends on several factors:
1) Volume of data expected to be stored in the database.
2) Expected number of concurrent users.
3) Complexity of queries to be executed (number of joins, aggregations, etc.).
4) Average volume of data accessed by each query.
5) Anticipated growth volumes.
When the number of concurrent users is small and data volumes are low, SMP systems are preferred. In fact, SMP systems are preferred for more OLTP-like environments. In contrast, when data volumes are large and a large number of complex queries are executed, MPP database servers are preferred. Because of their parallel processing capabilities, these databases can execute complex queries more efficiently and are therefore a natural choice for typical Data Warehouse implementations.