1999
Doctoral thesis  Unknown

Integrated models for SIMD, MIMD and hybrid parallel architectures

Gennaro C.

Single program multiple data (SPMD)  Multiple instruction multiple (MIMD)  Performance model  Queuing network model  Mean value analysis (MVA) 

The purpose of massively parallel machines is to solve problems in various scientific/engineering domains that would have unacceptable processing times if solved with a traditional single processor machine. The performance analysis has emerged as a key area of research in the field of parallel architectures, ever since the first high performance machines appeared on the market. This appeared ever more true as the trend of the high performance computer moved from using vector and array processors to using multiprocessor architectures. Many factors influence the performance of parallel applications. Hardware or software components, or both can be involved. Examples are: processor speed, processor communication network, I/O architecture, sequential code, communication network contention and synchronization protocols, data partitioning, etc. The first effort to model the performance of parallel programs is due to Amdahl and the well known homonymous law that relates the number of processors to the bound of the speedup which may be expected by parallel processing~cite{AM91}. This bound has proved useful in shaping our understanding of parallel system because it strikes a useful balance between simplicity and precision. Several amendments and extensions to Amdahl's law have been proposed, each appropriate for different purposes. Gustafson et al.~cite{GU88ACM,GU88,GMB88} introduce the concept of scaled speedup as a measure of how well an algorithm scales when the number of processors is multiplied by $k$. Flat et al.~cite{FK89} investigate the impact of synchronization and communication overhead on the performance of parallel processors. Eager et al.~cite{ELZ89} studied speedup versus efficiency. Wu et al.~cite{WL98} propose a formal definition of scalability and discuss scalability of cluster systems. Finally, Rosti et al.~cite{RSSS98} extends Amdahl's law to three--dimensional space to include processors and disks. However, these studies are limited by the fact that the speedup formulations do not include any explicit parameters that reference the communication and I/O overhead or the type of hardware used for executing the parallel application. Other approaches use task graph models in order to represent the intertask precedence relations within a program, the task execution times, and the times necessary to transfer data or other information from one task to other during execution. In the case of program control structures that can be represented by ``series--parallel'' task graphs, such as the fork--join structure, methodology to predict the best speedup which can be obtained with such program has been developed. In particular, when the task execution times are deterministic, Coffman et al.~cite{CD73} used critical path analysis to find the program completion time. If the task execution times are stochastic and the number of processors is infinite, the probability distribution time can be determined by a straightforward but in general very costly computation~cite{HH79,BA74}. General acyclic random graph models are presented in~cite{FKM83,MN84,GNPT86}. Gelenbe~cite{Gelenbe89} generalizes the task graph models so that it is possible to take into account the effect of communication times between the tasks. For more realistic cases where the number of processors is smaller than the number of tasks in a program, a queueing network approach can be used. Approximate models are presented in~cite{CL87,GL88,GMSW86} and an exact model is proposed by Baccelli et al.~cite{BZ90}. Task graphs, as general acyclic graphs, allow the description of sequential or parallel execution, synchronization, and spawning of task. Nevertheless, the multiprocessor systems considered by this kind of models have a generic architecture with a finite number of processors sharing a central memory. This work proposes a new model for parallel systems with distributed computational and I/O resources when executing parallel applications characterized by cyclic computation bursts and intensive I/O bursts. By means of queuing network techniques, the analysis of the model leads to the definition of a generic model that allows simulation of the behaviour of several parallel architectures. Traditionally, speedup is defined as the ratio of the elapsed time when executing a program on a single processor to the execution time when $p$ processors are available. We extend this definition to include the number $d$ of {it I/O nodes}. Because parallel program execution can be partitioned into two distinct phases, denoted as computation burst and I/O burst, we consider the $p$ processors devoted to the execution of the computation burst and the $d$ I/O nodes, to the corresponding I/O burst. An I/O node can represent a disk or a SIMD processor according to the parallel machine modeled. For instance consider a hybrid architecture in which a moderate number of processors (10-20) are arranged in a MIMD way and all of them are ``boosted'' by massively parallel SIMD arrays, used for the the execution of tasks that are particularly suitable for SIMD architectures. The clusters of SIMD processors is connected to the MIMD node through a high speed I/O channel. In this scenario, several questions may arise, such as: how does the program speedup scale with the number of SIMD processors? How much does the I/O channel impact to performance of the program? Is there any advantage in using the SIMD processors or it is better to use only the MIMD processors to perform the SIMD tasks? The approach we propose in this thesis is, to the best author's knowledge, the first attempt to address these kinds of issues. The purpose of this thesis is to develop analytical performance models that captures the behavior of numerous applications running on a different parallel architectures. The idea of our approach is to use queueing network models. This type of model has the advantage of providing a ``white--box'' view of the system to study. Because the parameters of our model correspond to measurable program characteristics, we can use the model in order to estimate the execution times for different number of processors and/or I/O nodes. nition of a generic model that allows simulation of the behaviour of several parallel architectures.



Back to previous page
BibTeX entry
@phdthesis{oai:it.cnr:prodotti:447219,
	title = {Integrated models for SIMD, MIMD and hybrid parallel architectures},
	author = {Gennaro C.},
	year = {1999}
}