
Details of Grant 

EPSRC Reference: EP/P031617/1
Title: Pin the Tail: Understanding Straggler Manifestation in Internet-based Distributed Systems
Principal Investigator: Garraghan, Dr P
Other Investigators:
Researcher Co-Investigators:
Project Partners:
CIATEQ
Microsoft
STFC Laboratories (Grouped)
Department: Computing & Communications
Organisation: Lancaster University
Scheme: First Grant - Revised 2009
Starts: 01 September 2017
Ends: 31 August 2019
Value (£): 96,599
EPSRC Research Topic Classifications:
Fundamentals of Computing
Networks & Distributed Systems
EPSRC Industrial Sector Classifications:
Information Technologies
Related Grants:
Panel History:
Panel Date: 19 Apr 2017
Panel Name: EPSRC ICT Prioritisation Panel April 2017
Outcome: Announced
Summary on Grant Application Form
Distributed systems form the foundation of Internet infrastructure and are critical for fulfilling the technological and societal needs of the digital age. Comprising Cloud datacenters, compute clusters, and the Internet of Things, these systems are responsible for the effective provisioning and execution of a multitude of parallelizable applications. The increased complexity and scale of these systems has resulted in the manifestation of an emergent phenomenon that substantially degrades overall system performance and cannot be solved by simply increasing the number of compute nodes. This phenomenon is known as The Long Tail Problem, whereby task stragglers - a small subset of tasks that execute abnormally slowly - impede overall job completion time; it is systemic to all distributed systems that operate at sufficient scale. While existing work in this area attempts to address the problem through straggler detection or mitigation, the effectiveness of these approaches depends on understanding the precise underlying causes of straggler manifestation and, importantly, on determining which system conditions influence their occurrence. However, achieving this understanding is incredibly challenging given the multitude of possible straggler root causes, all of which can stem from diverse sub-system operational characteristics and their interactions with other sub-systems. As current understanding of straggler manifestation is restricted to qualitative, high-level detail, it is presently impossible to determine which system operational conditions (e.g. cluster resource contention, temperature, failures) are highly likely to create a "perfect storm" for straggler occurrence. Determining the system conditions that influence the probability of straggler occurrence in different operational scenarios is vital to achieving predictable and rapid parallel application execution, given the continued increase in system size and complexity.

The vision of this proposed research is to address our limited understanding of straggler manifestation by conducting in-depth analysis and modelling of Internet-based distributed systems to quantify the precise relationship between straggler occurrence and system behaviour. This study will involve analysing and modelling stragglers within real systems, performed through comprehensive experimentation to identify and extract key system parameters from virtual and physical sub-system operation across the entire distributed system architecture. A framework capable of automated analysis will be constructed to determine straggler root cause within production systems; it will interface with an event-based simulation engine for determining the optimal system conditions for avoiding stragglers.

Conducted in collaboration with leading international industrialists in massive-scale distributed systems, this work represents a significant step change towards solving The Long Tail Problem by providing much sought-after knowledge to truly understand straggler manifestation. As this problem is systemic across every type of large-scale distributed system, this work will have far-reaching implications for both academia and industry, and will directly benefit the competitiveness of the UK's digital economy in both the short and long term. This grant represents the first step towards realizing the research ambition of scientifically understanding the operation of massive-scale Internet infrastructure, enabling the design of fault-tolerant techniques for future systems at unprecedented scale - a crucial objective towards realizing key emergent technologies for the future.
Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Impacts
Description: This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: http://www.lancs.ac.uk