A Data-aware Scheduling Strategy for Workflow Execution in Clouds

Fabrizio Marozzo; Domenico Talia; Paolo Trunfio
2017

Abstract

As data-intensive scientific computing systems become more widespread, there is a growing need to simplify the development, deployment, and execution of complex data analysis applications for scientific discovery. The scientific workflow model is the leading approach for designing and executing data-intensive applications on high-performance computing infrastructures. Scientific workflows are commonly built as a set of connected tasks arranged in a directed acyclic graph, which communicate through storage abstractions. The Data Mining Cloud Framework (DMCF) is a system that allows users to design and execute data analysis workflows on cloud platforms, relying on cloud storage services for every I/O operation. Hercules is an in-memory I/O solution that can be used in DMCF as an alternative to cloud storage services, providing additional performance and flexibility features. This work improves the integration between DMCF and Hercules by using a data-aware scheduling strategy to exploit data locality in data-intensive workflows. This paper presents experimental results demonstrating the performance improvements achieved with the proposed data-aware scheduling strategy on the Microsoft Azure cloud platform. In particular, our scheduling strategy reduces the I/O overhead by 55% with respect to Azure storage, leading to a 20% reduction in total execution time.
data-aware scheduling; DMCF; Hercules; in-memory storage; Microsoft Azure; workflows; Software; Theoretical Computer Science; Computer Science Applications; Computer Vision and Pattern Recognition; Computer Networks and Communications; Computational Theory and Mathematics

Use this identifier to cite or link to this document: http://hdl.handle.net/20.500.11770/132535