Normalized to: Mehta, G.
[1]
oai:arXiv.org:1010.4822 - 955511
Data Sharing Options for Scientific Workflows on Amazon EC2
Submitted: 2010-10-22
Efficient data management is a key component in achieving good performance
for scientific workflows in distributed environments. Workflow applications
typically communicate data between tasks using files. When tasks are
distributed, these files are either transferred from one computational node to
another, or accessed through a shared storage system. In grids and clusters,
workflow data is often stored on network and parallel file systems. In this
paper, we investigate several ways in which data can be managed for
workflows in the cloud. We ran experiments using three typical workflow
applications on Amazon's EC2. We discuss the various storage and file systems
we used, describe the issues and problems we encountered deploying them on EC2,
and analyze the resulting performance and cost of the workflows.
[2]
oai:arXiv.org:1005.4457 - 170900
Pipeline-Centric Provenance Model
Submitted: 2010-05-24
In this paper, we propose a new provenance model that is tailored to a class
of workflow-based applications. We motivate the approach with use cases from
the astronomy community. We generalize the class of applications to which the
approach is relevant and propose a pipeline-centric provenance model. Finally, we
evaluate the benefits of the approach, in terms of storage needed, when applied
to an astronomy application.
[3]
oai:arXiv.org:1005.2718 - 1513445
Scientific Workflow Applications on Amazon EC2
Submitted: 2010-05-15
The proliferation of commercial cloud computing providers has generated
significant interest in the scientific computing community. Much recent
research has attempted to determine the benefits and drawbacks of cloud
computing for scientific applications. Although clouds have many attractive
features, such as virtualization, on-demand provisioning, and "pay as you go"
usage-based pricing, it is not clear whether they are able to deliver the
performance required for scientific applications at a reasonable price. In this
paper we examine the performance and cost of clouds from the perspective of
scientific workflow applications. We use three characteristic workflows to
compare the performance of a commercial cloud with that of a typical HPC
system, and we analyze the various costs associated with running those
workflows in the cloud. We find that the performance of clouds is not
unreasonable given the hardware resources provided, and that performance
comparable to HPC systems can be achieved given similar resources. We also find
that the cost of running workflows on a commercial cloud can be reduced by
storing data in the cloud rather than transferring it from outside.