Database Partitioning in a Cluster of Processors

Sacca', Domenico; Wiederhold, Gio

In a distributed database system the partitioning and allocation of the database over the processor nodes of the network is a critical aspect of database design effort. A poor distribution can lead to higher loads and hence higher costs in the nodes or in the communication network, so that the system cannot handle the required set of transactions. We consider here a system where multiple processors are clustered at one location in order to increase the system`s processing capability. The local network used in such a system has typically a high communication bandwith but any single processor will have inadequate processing and input-output capacity to deal with the transaction load. We develop and evaluate algorithms which perform in a computationally feasible manner the design steps of partitioning and allocation of a database to the processors. More precisely, we investigate the optimal non-redundant partitioning and allocation of a database to a number of processors nodes. Given is a set of source relations of the database and their attributes and a set of transactions which are to be executed during some period of interest. A transaction performs operations on some subset of the database. The initial network node, associated with every transaction, is not prespecified, but to be signed as part of the design. This problem differs from the problem in destributed systems where the processors are remote from each other and, presumably, close to their users. In those systems a transaction enters a known, local processor node, and the final response is issued from that node as well. The unclustered model with local transactions has been treated previously in the literature, for instance in [Ap82] and [CNW81]. We model the content of the database as a collection of relations. The given conceptual relations may be too large to be effectively assigned to single processors. We will consider initially how the relations can be fragmented, and will then allicate those fragments to the processor nodes. The possibility that files may be fragmented is not considered in treatments of the file allocation problem, as surveyed for instance in [DoFo82]. Once the database is put into operation each transaction will access some subset of the tuples and the attributes of each original relation. In order to complete transactions which do not find their data on the same processor, a scheduling algorithm is invoked which optimizes the processing over the netword. We do not investigate this scheduling algorithm itself, but simply assume that it exists, and that we can use it in order to obtain the costs for the execution of a given transaction over a proposed database allocation for some specified but fictitious processor network. In this presentation the following section will formulate our problem precisely. In Sec. 3 we show the intrinsic complexity of the problem, present the heuristic greedy with first-fit algorithms we propose, and prove statements about their behavior. The conclusions we draw from this work are presented in Sec. 4. Further background material, proofs, conjectures on the feasibility of the solutions, design algorithms, and evaluations can be found in [SW83].