NUMA, UMA, and Shared-Memory Multiprocessors

The main difference between the NUMA and UMA memory architectures is the location of the memory. In a UMA architecture, each node couples only the first- and second-level caches with the processor; the remaining levels of the memory hierarchy sit on the other side of the interconnection network.

The NUMA architecture defines the node as a processing element together with its cache lines and a portion of the main memory; the nodes are then connected to one another by the network. In a NUMA architecture, therefore, both the memory and the caches are distributed across the nodes, whereas in a UMA architecture only the caches are distributed.

Parallel computing is a form of computation in which many calculations are carried out simultaneously, on the principle that large problems can often be divided into smaller ones, which are then solved concurrently. There are several forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has been employed for many years, mainly in high-performance computing, but interest in it has grown lately because physical constraints now prevent further frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multicore processors.

Parallel computers can be roughly classified according to the level at which the hardware supports parallelism: multi-core and multi-processor computers have multiple processing elements within a single machine, while clusters, MPPs, and grids use multiple computers to work on the same task. Specialized parallel computer architectures are sometimes used alongside traditional processors to accelerate specific tasks.

Parallel computer programs are more difficult to write than sequential ones because concurrency introduces several new classes of potential software bugs, of which race conditions are the most common. Communication and synchronization between the different subtasks are typically among the greatest obstacles to good parallel program performance.

Non-Uniform Memory Access (NUMA) Machines

Architectural Background

NUMA machines provide a linear address space, allowing all processors to directly address all memory. This feature exploits the 64-bit addressing available in modern scientific computers. The advantages over distributed memory machines include faster movement of data, less replication of data and easier programming. The disadvantages include the cost of hardware routers and the lack of programming standards for large configurations.

Node of a NUMA machine

The fundamental building block of a NUMA machine is a Uniform Memory Access (UMA) region that we will call a “node”. Within this region, the processors share a common memory. This local memory provides the fastest memory access for each of the processors on the node. The number of processors on a node is limited by the speed of the switch that couples the processors with their local memory. Typical of current systems are two to eight processors per node.

For larger configurations, multiple nodes are combined to form a NUMA machine. When a processor on one node references data that is stored on another node, hardware routers automatically send the data from the node where it is stored to the node where it is being requested. This extra step in memory access results in delays, which can degrade performance.

Small to medium NUMA machines have only one level of memory hierarchy; data is either local or remote. Larger NUMA machines use a routing topology, where delays are greater for nodes further away.

One design goal of a NUMA machine is to make the routers as fast as possible to minimize the difference between local and remote memory references.

The performance of an individual application depends on the number of nodes used. If only two nodes are used and the memory is placed randomly, there is a 50% chance that any given memory reference will be local. As the number of nodes increases, this probability decreases (with N nodes and random placement it falls to roughly 1/N). The FMS Programming Tools described in the next section overcome the scaling issues associated with large NUMA architectures.

Programming

The goal for optimal programming of NUMA machines is to maximize references to local memory on the node while minimizing references to remote memory. FMS contains the following unique “hooks” into the operating system that provide the control necessary to achieve this goal:

Thread Binding.

The compute and I/O threads managed by FMS may be physically bound to specific processors or nodes. This is the first step necessary in establishing the affinity between executing threads and the physical memory they reference.
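As a hedged illustration of the idea only (the FMS binding mechanism itself is internal to the library), the following C sketch binds the calling thread to the processors of NUMA node 0 on Linux using libnuma; the node number and the error handling are deliberately simplified.

    /* Minimal sketch: bind the calling thread to NUMA node 0 on Linux.
     * Assumes libnuma is installed (link with -lnuma); illustrates the
     * general idea of thread binding, not the FMS implementation. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not supported on this system\n");
            return EXIT_FAILURE;
        }

        /* Restrict the calling thread to the CPUs of node 0. */
        if (numa_run_on_node(0) != 0) {
            fprintf(stderr, "numa_run_on_node failed\n");
            return EXIT_FAILURE;
        }

        /* Prefer allocations from node 0 for this thread as well. */
        numa_set_preferred(0);

        printf("Thread bound to node 0 of %d nodes\n", numa_max_node() + 1);
        return EXIT_SUCCESS;
    }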

Memory Placement.

When FMS allocates memory, it may be explicitly placed on the processor’s local node.

FMS automatically distributes each matrix and vector record uniformly among the nodes. Each processor is then assigned the portion of the work that corresponds to the data on its local node. The computational sequences are ordered to minimize the references to remote data.
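The following C sketch illustrates the general placement idea, assuming libnuma is available: one slice of a large vector is allocated on each node with numa_alloc_onnode(), so that a worker bound to that node touches mostly local memory. The slice-per-node layout here is hypothetical and only mimics the kind of distribution described above; it is not the FMS record format.

    /* Sketch: place one slice of a large vector on each NUMA node so
     * that a worker bound to node n accesses local memory through
     * part[n].  Hypothetical layout; link with -lnuma. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0)
            return EXIT_FAILURE;

        int nodes = numa_max_node() + 1;
        size_t total = 1 << 24;              /* vector length (example) */
        size_t slice = total / nodes;        /* elements per node       */
        double **part = malloc(nodes * sizeof *part);

        for (int n = 0; n < nodes; n++) {
            /* Each slice is physically allocated on node n. */
            part[n] = numa_alloc_onnode(slice * sizeof(double), n);
            if (!part[n])
                return EXIT_FAILURE;
        }

        /* ... workers bound to node n would now operate on part[n] ... */

        for (int n = 0; n < nodes; n++)
            numa_free(part[n], slice * sizeof(double));
        free(part);
        return EXIT_SUCCESS;
    }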

The Parallel Programming Tools available with FMS provide these same hooks for you to achieve optimal NUMA performance on the non-FMS part of your application. These tools provide a common portable interface across all NUMA machines.


Linux Support for NUMA Hardware

Multiprocessors with large processor counts are being built with non-uniform memory access (NUMA) times: access times that depend on where within the machine a piece of memory physically resides. For optimal performance, the kernel needs to be aware of where memory is located and keep memory as close as possible to the user of that memory. Examples of NUMA machines include the NEC Azusa, the IBM x440 and the IBM NUMA-Q.

The 2.5 Linux kernel includes many enhancements in support of NUMA machines.  Data structures and macros are provided within the kernel for determining the layout of the memory and processors on the system.  These enable the VM subsystem to make decisions on the optimal placement of memory for processes.  This topology information is also exported to user-space.

In addition to the items that have been incorporated into the 2.5 Linux kernel, there are NUMA features that have been developed and continue to be supported as patch sets. These include NUMA enhancements to the scheduler, multipath I/O, and a user-level API that gives users control over the allocation of resources with respect to NUMA nodes.

On NUMA systems, optimal performance is obtained by locating processes as close to the memory they access as possible.  For most processes, optimal performance is obtained by allocating all memory for the process from the same node, and dispatching the process on processors on that node.
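The topology information exported to user space can be read directly from sysfs. The sketch below assumes a Linux system with the standard /sys/devices/system/node layout and simply lists the CPUs belonging to each node; error handling is kept minimal.

    /* Sketch: read the NUMA topology that the Linux kernel exports to
     * user space under /sys/devices/system/node.  Assumes sysfs is
     * mounted at the standard location. */
    #include <stdio.h>

    int main(void)
    {
        char path[128], cpulist[256];

        for (int node = 0; node < 64; node++) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/node/node%d/cpulist", node);
            FILE *f = fopen(path, "r");
            if (!f)
                break;                      /* no more nodes */
            if (fgets(cpulist, sizeof cpulist, f))
                printf("node %d: cpus %s", node, cpulist);
            fclose(f);
        }
        return 0;
    }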

Uniform Memory Access

UMA is a shared memory architecture used in parallel computers.

All the processors in the UMA model share the physical memory uniformly. In a UMA architecture, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data.

Uniform Memory Access computer architectures are often contrasted with Non-Uniform Memory Access (NUMA) architectures.

In the UMA architecture, each processor may use a private cache, and peripherals are also shared in some fashion. The UMA model is suitable for general-purpose and time-sharing applications with multiple users, and it can be used to speed up the execution of a single large program in time-critical applications. Note that the same abbreviation is also used for Unified Memory Architecture, a design in which the graphics chip is built into the motherboard and part of the computer's main memory is used as video memory.

Types of UMA architectures

UMA using bus-based Symmetric Multi-Processing (SMP) architectures

UMA using crossbar switches

UMA using multistage switching networks

Symmetric Multi-Processing:

In computing, symmetric multiprocessing (SMP) involves a multiprocessor computer hardware architecture in which two or more identical processors are connected to a single shared main memory and are controlled by a single OS instance. Most common multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors. Processors may be interconnected using buses, crossbar switches or on-chip mesh networks. The bottleneck in the scalability of SMP using buses or crossbar switches is the bandwidth and power consumption of the interconnect among the various processors, the memory, and the disk arrays. Mesh architectures avoid these bottlenecks and provide nearly linear scalability to much higher processor counts, at the cost of programmability.

SMP systems allow any processor to work on any task, no matter where the data for that task are located in memory, provided that no task is executed on two or more processors at the same time; with proper operating system support, SMP systems can easily move tasks between processors to balance the workload efficiently.
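A minimal sketch of this model: the program below queries the number of online processors (via _SC_NPROCESSORS_ONLN, a common POSIX extension on Linux and BSD) and starts one worker thread per processor, leaving placement and load balancing entirely to the operating system's SMP scheduler.

    /* Sketch: start one worker thread per online processor and let the
     * operating system's SMP scheduler balance them across CPUs. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("worker %ld running\n", id);   /* real work would go here */
        return NULL;
    }

    int main(void)
    {
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
        pthread_t tid[64];

        if (ncpu < 1 || ncpu > 64)
            ncpu = 1;

        for (long i = 0; i < ncpu; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (long i = 0; i < ncpu; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }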


Serious programming challenges remain with mesh-based architectures, because they require two distinct modes of programming: one for the CPUs themselves and one for the interconnect between the CPUs. A single programming language would have to be able not only to partition the workload but also to comprehend memory locality, which is severe in a mesh-based architecture.



Alternatives

SMP represents one of the earliest styles of multiprocessor machine architectures, typically used for building smaller computers with up to 8 processors. Larger computer systems might use newer architectures such as NUMA (Non-Uniform Memory Access), which dedicates different memory banks to different processors. In a NUMA architecture, processors may access local memory quickly and remote memory more slowly. This can dramatically improve memory throughput as long as the data is localized to specific processes. On the downside, NUMA makes the cost of moving data from one processor to another, as in workload balancing, more expensive. The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated strongly with certain tasks or users.
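On Linux, the relative cost of local versus remote access can be inspected with libnuma's numa_distance(), which reports the firmware (ACPI SLIT) distance table; by convention local access is 10 and remote access is proportionally larger. A small sketch, assuming libnuma is installed:

    /* Sketch: print the node-to-node distance matrix reported by the
     * firmware via libnuma.  Link with -lnuma. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0)
            return EXIT_FAILURE;

        int nodes = numa_max_node() + 1;

        for (int from = 0; from < nodes; from++) {
            for (int to = 0; to < nodes; to++)
                printf("%4d", numa_distance(from, to));
            printf("\n");
        }
        return EXIT_SUCCESS;
    }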

Other systems include asymmetric multiprocessing (ASMP), which uses separate specialized processors for specific tasks (which increases complexity), and computer clustered multiprocessing (such as Beowulf), in which not all memory is available to all processors.

Examples of ASMP include many media-processor chips that pair a relatively slow base processor with a number of hardware accelerator cores. High-powered 3D chipsets in modern video cards could be considered a form of asymmetric multiprocessing. Clustering techniques are used fairly extensively to build very large supercomputers. In this discussion, a single-processor machine is denoted a uniprocessor.

Advantages and disadvantages:

Advantages:

SMP has many uses in science, industry, and business which often use custom-programmed software for multitasked processing.

Disadvantages:

Most consumer products such as word processors and computer games are written in such a manner that they cannot gain large benefits from concurrent systems. For games this is usually because writing a program to increase performance on SMP systems can produce a performance loss on uniprocessor systems.

Multi-core chips are becoming more common in new computers, and the balance between installed uni- and multi-core computers may change in the coming years.

Uniprocessor and SMP systems require different programming methods to achieve maximum performance. Therefore, two separate versions of the same program may have to be maintained, one for each. Programs running on SMP systems may experience a performance increase even when they have been written for uniprocessor systems, because hardware interrupts that would usually suspend program execution while the kernel handles them can execute on an idle processor instead. In most applications the effect is not so much a performance increase as the appearance that the program runs much more smoothly. In some applications, particularly compilers and some distributed computing projects, one will see an improvement by roughly a factor of the number of additional processors.

In situations where more than one program executes at the same time, an SMP system will have considerably better performance than a uniprocessor, because different programs can run on different CPUs simultaneously.

Systems programmers must build support for SMP into the operating system; otherwise, the additional processors remain idle and the system functions as a uniprocessor system.

In cases where an SMP environment processes many jobs, administrators often experience a loss of hardware efficiency. Software programs have been developed to schedule jobs so that the processor utilization reaches its maximum potential. Good software packages can achieve this maximum potential by scheduling each CPU separately, as well as being able to integrate multiple SMP machines and clusters.

Access to RAM is serialized; this and cache coherency issues cause performance to lag slightly behind the number of additional processors in the system.
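One visible form of this coherency overhead is false sharing. In the hedged sketch below, two threads increment separate counters; when the counters sit in the same cache line, the line ping-pongs between processor caches, and when they are padded onto separate lines the coherency traffic disappears. Timing the two phases (for example with the shell's time command) makes the difference apparent; the 64-byte pad assumes a common cache-line size.

    /* Sketch: false sharing, one visible form of cache-coherency
     * overhead on SMP machines.  Illustrative only. */
    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 100000000UL

    struct {
        volatile unsigned long a;          /* same cache line as b     */
        volatile unsigned long b;
    } shared;

    struct {
        volatile unsigned long a;
        char pad[64];                      /* pushes b to another line */
        volatile unsigned long b;
    } padded;

    static void *bump(void *p)
    {
        volatile unsigned long *c = p;
        for (unsigned long i = 0; i < ITERS; i++)
            (*c)++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        /* Phase 1: counters share a cache line (slow on SMP). */
        pthread_create(&t1, NULL, bump, (void *)&shared.a);
        pthread_create(&t2, NULL, bump, (void *)&shared.b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* Phase 2: counters on separate cache lines (fast). */
        pthread_create(&t1, NULL, bump, (void *)&padded.a);
        pthread_create(&t2, NULL, bump, (void *)&padded.b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("%lu %lu %lu %lu\n", shared.a, shared.b, padded.a, padded.b);
        return 0;
    }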

Entry-level systems:

Before about 2006, entry-level servers and workstations with two processors dominated the SMP market. With the introduction of dual-core devices, SMP is found in most new desktop machines and in many laptop machines. The most popular entry-level SMP systems use the x86 instruction set architecture and are based on Intel's Xeon, Pentium D, Core Duo and Core 2 Duo processors, or AMD's Athlon 64 X2 and Opteron 2000-series processors. Servers use those processors and other readily available non-x86 processor choices, including the Sun Microsystems UltraSPARC, Fujitsu SPARC64, SGI MIPS, Intel Itanium, Hewlett-Packard PA-RISC, IBM POWER and Apple PowerPC processors. In all cases, these systems are available in uniprocessor versions as well.

Earlier SMP systems used motherboards with two or more CPU sockets. More recently, microprocessor manufacturers have introduced CPU devices with two or more processors in one package; the POWER, UltraSPARC, Opteron, Athlon, Core 2 and Xeon lines all have multi-core variants. Athlon and Core 2 Duo multiprocessors are socket-compatible with their uniprocessor variants, so an expensive dual-socket motherboard is no longer needed to implement an entry-level SMP machine. Note that dual-socket Opteron designs are technically NUMA designs, though they can be programmed as SMP with a slight loss in performance.

Mid-level systems

The Burroughs B5500 first implemented SMP in 1961. It was implemented later on other mainframes. Mid-level servers, using between four and eight processors, can be found using the Intel Xeon MP, AMD Opteron 800 and 8000 series and the above-mentioned UltraSPARC, SPARC64, MIPS, Itanium, PA-RISC, Alpha and POWER processors. High-end systems, with sixteen or more processors, are also available with all of the above processors.


Sequent Computer Systems built large SMP machines using Intel 80386 processors. Some smaller 80486 systems existed, but the major x86 SMP market began with the Intel Pentium technology supporting up to two processors. The Intel Pentium Pro expanded SMP support with up to four processors natively. Later, the Intel Pentium II, and Intel Pentium III processors allowed dual CPU systems, except for the respective Celerons. This was followed by the Intel Pentium II Xeon and Intel Pentium III Xeon processors which could be used with up to four processors in a system natively. In 2001 AMD released their Athlon MP, or Multi Processor CPU, together with the 760MP motherboard chipset as their first offering in the dual processor marketplace. Although several much larger systems were built, they were all limited by the physical memory addressing limitation of 64 GB. With the introduction of 64-bit memory addressing on the AMD64 Opteron in 2003 and Intel 64  Xeon in 2005, systems are able to address much larger amounts of memory; their addressable limitation of 16 EB is not expected to be reached in the foreseeable future.

Crossbar switches:

One example of a crossbar implementation is a free-space optical fiber cross-connect that uses a pair of micromirror arrays to redirect optical beams from an input fiber array to an output array. This confocal switch architecture is well suited to simultaneous switching of multiple wavelength channels. Confocal switches with low insertion loss, low crosstalk, and large port counts can be implemented with surface-micromachined mirror arrays; a 2×2 single-mode (1550 nm) switch configuration has been demonstrated with an insertion loss of -4.2 dB and crosstalk of -50.5 dB, and the micromirror design has sufficient size and angular deflection for scaling to 32×32 ports.

Multistage interconnection networks

Multistage interconnection networks (MINs) are a class of high-speed computer networks, usually composed of processing elements on one end of the network and memory elements on the other, connected together by switching elements. The switching elements themselves are usually connected to each other in stages, hence the name.

Such networks include omega networks, delta networks and many other types. MINs are typically used in high-performance or parallel computing as a low-latency interconnect, though they could also be implemented on top of a packet-switching network. Although the network is typically used for routing, it can also serve as a co-processor to the actual processors for tasks such as sorting, cyclic shifting (as in a perfect-shuffle network), and bitonic sorting.

One proposed multistage interconnection network achieves highly reliable communication with less hardware. In this network, which interconnects a number of nodes, the first and final stages each have twice as many switches as an intermediate stage. The two output ports of each node are connected to the input ports of different first-stage switches, and its two input ports are connected to the output ports of different final-stage switches. The input ports of the intermediate-stage switches are connected to the output ports of different first-stage switches, and their output ports are connected to the input ports of different final-stage switches. In addition, at least one output port of each first-stage switch is connected directly to at least one input port of some final-stage switch.

An 8×8 Omega network is a multistage interconnection network, meaning that processing elements are connected using multiple stages of switches, with inputs and outputs assigned numeric addresses. The outputs of each stage are connected to the inputs of the next stage using a perfect-shuffle connection: the wiring at each stage corresponds to cutting a deck of cards into two equal halves and riffling them together, with each card from one half alternating with the corresponding card from the other. In terms of the binary addresses of the processing elements, each perfect shuffle is a cyclic left shift: every bit of the address moves one position to the left, and the most significant bit becomes the least significant bit.
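Because the shuffle is just a one-bit left rotation of the address, it is easy to compute. The sketch below, written for the 8×8 (3-bit) case, prints where each input line is wired at the next stage.

    /* Sketch: the perfect-shuffle wiring of an Omega network is a
     * cyclic left shift of the processing-element address.  For an
     * 8x8 network the addresses are 3 bits wide. */
    #include <stdio.h>

    /* Rotate the low `bits` bits of addr left by one position. */
    static unsigned shuffle(unsigned addr, unsigned bits)
    {
        unsigned msb = (addr >> (bits - 1)) & 1u;
        return ((addr << 1) | msb) & ((1u << bits) - 1u);
    }

    int main(void)
    {
        for (unsigned a = 0; a < 8; a++)
            printf("input %u -> output %u\n", a, shuffle(a, 3));
        return 0;
    }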

At each stage, adjacent pairs of inputs are connected to a simple exchange element, which can be set either straight or crossed. For N processing elements, an Omega network contains N/2 switches at each stage and log2(N) stages. The manner in which these switches are set determines the connection paths available in the network at any given time. Two common methods for setting them are destination-tag routing and XOR-tag routing.
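A short sketch of destination-tag routing for the 8×8 case: at stage s, the switch examines bit s of the destination address, counted from the most significant bit; a 0 selects the upper switch output and a 1 selects the lower one. The trace below follows a message from a source line to its destination; after the last stage the line number equals the destination.

    /* Sketch: destination-tag routing through an 8x8 Omega network
     * (3 stages).  At stage s the switch output is chosen by bit s of
     * the destination address, most significant bit first. */
    #include <stdio.h>

    #define BITS 3                        /* log2(N) stages for N = 8 */

    static void route(unsigned src, unsigned dst)
    {
        unsigned line = src;              /* current network line */

        printf("routing %u -> %u:", src, dst);
        for (unsigned s = 0; s < BITS; s++) {
            /* Perfect shuffle between stages: rotate address left. */
            unsigned msb = (line >> (BITS - 1)) & 1u;
            line = ((line << 1) | msb) & ((1u << BITS) - 1u);

            /* The destination bit for this stage sets the switch output. */
            unsigned bit = (dst >> (BITS - 1 - s)) & 1u;
            line = (line & ~1u) | bit;
            printf("  stage %u: line %u", s, line);
        }
        printf("\n");                     /* line now equals dst */
    }

    int main(void)
    {
        route(2, 5);
        route(7, 0);
        return 0;
    }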

The Omega network is a blocking network: some combinations of connections cannot be made simultaneously, though a path can always be established from any input to any output when the network is otherwise free.
