held in conjunction with ISC’23: The International Conference on High Performance Computing
- 14:00-14:02, Welcome, WOCC’23 Organizers
- 14:02-14:47, Keynote, Daniel Milroy (Lawrence Livermore National Laboratory)
Title: Minimizing the difference between HPC and cloud: convergence of communities and technologies
Computing is undergoing fundamental changes. The end of Dennard scaling and tapering of Moore’s law has led to economic conditions that favor large-scale demand aggregators like public cloud providers. Gartner projects cloud to be the largest sector of computing by 2025, reaching nearly $1T USD in revenue with 20% annual growth. The tremendous growth is translating into substantial investment in research and development to manage the complexity of systems with post-Dennard and post-Moore architectures.
Current HPC tools and techniques are unsuitable for managing the vastly increased complexity of the resource heterogeneity and dynamism of distributed systems used by emerging scientific workflows. Adoption of cloud technologies will enable support for new workflows and avoid HPC becoming a technology island. However, direct implementation of current cloud technologies will not satisfy scientific software demands for performance. Converged computing, or an environment that seamlessly combines the performance of HPC with the portability, reproducibility, and automation of the cloud, is an active area of research and development that seeks to benefit both worlds. Successful realization of converged computing requires a clear-eyed appraisal of each world’s strengths and weaknesses.
In this talk I will provide a brief background on the economic and technical trends that led to the dominance of cloud computing, and the related increase of scientific software complexity. I will identify aspects of cloud computing and HPC that are immediately compatible such as coupling of simulation and services and those which require research for deeper integration such as MPI and elasticity. There are many academic research groups, laboratories, and companies working to bring cloud and HPC together, which translates to a large space for collaboration. While there is ample opportunity for widely beneficial research and development, I will also consider community differences that may hamper collaboration.
I will provide an overview of R&D themes and present highlights of my team’s work on converged computing at the Lawrence Livermore National Laboratory. I will conclude by suggesting potentially fruitful R&D directions and identifying collaborative work to enable cloud and HPC to integrate more completely and benefit both communities.
- 14:47-15:17, Invited Talk, Achilleas Tzenetopoulos, Dimosthenis Masouros, Sotirios Xydis and Dimitrios Soudris (National Technical University of Athens)
Title: Guaranteeing performance and efficiency on modern Cloud infrastructures
Over the past few years, cloud computing and High-performance computing (HPC) have emerged as two of the most important computing paradigms of our time. The convergence of HPC into the Cloud has emerged as a new trend, that enables HPC workloads to have access to more computational resources in a scalable and flexible manner. However, Cloud infrastructures pose a plethora of performance pitfalls, including among others the existence of resource interference due to the colocation of applications on the same physical servers. On top of that, hardware disaggregation has been introduced as a novel computing paradigm, which further complicates this problem.
In this talk, we provide insights regarding the challenges of converging the two worlds from the Cloud perspective. We present the challenges of resource interference and hardware heterogeneity provoked by workload co-location and our custom solution to mitigate them over the Kubernetes orchestrator. Additionally, we will present our exploration regarding the impact of resource interference on disaggregated memory systems, from our recent work called Adrias.
- 15:17-15:35, Technical Paper, Antony Chazapis, Fotis Nikolaidis, Manolis Marazakis and Angelos Bilas
Title: Running Kubernetes Workloads on HPC
Cloud and HPC increasingly converge in hardware platform capabilities and specifications, nevertheless still largely differ in the software stack and how it manages available resources. The HPC world typically favors Slurm for job scheduling, whereas Cloud deployments rely on Kubernetes to orchestrate container instances across nodes. Running hybrid workloads is possible by using bridging mechanisms that submit jobs from one environment to the other. However, such solutions require costly data movements, while operating within the constraints set by each setup’s network and access policies. In this work, we explore a design that enables running unmodified Kubernetes workloads directly on HPC. With High-Performance Kubernetes (HPK), users deploy their own private Kubernetes “mini Clouds”, which internally convert container lifecycle management commands to use the system-level Slurm installation for scheduling and Singularity/Apptainer as the container runtime. We consider this approach to be practical for deployment in HPC centers, as it requires minimal pre-configuration and retains existing resource management and accounting policies. HPK provides users with an effective way to utilize resources by a combination of well-known tools, APIs, and more interactive and user-friendly interfaces as is common practice in the Cloud domain, as well as seamlessly combine Cloud-native tools with HPC jobs in converged, containerized workflows.
- 15:35-15:53, Technical Paper, Saeedeh Baneshi, Ana Lucia Varbanescu, Anuj Pathania, Benny Akesson and Andy D Pimentel
Title: Estimating the Energy Consumption of Applications in the Computing Continuum with iFogSim
Digital services have become an essential part of our daily lives, but they come with a significant energy cost, raising sustainability concerns. Digital chains, composed of multiple layers (edge, fog, and cloud), have specific computing infrastructures, and scheduling decisions at each layer impact the overall quality of service and energy consumption of digital services. Measuring the energy consumption of applications spanning the entire computing continuum from edge to cloud is challenging due to the distributed nature of the system and the application. Simulation techniques are promising solutions to estimate energy consumption, and various simulators are available for modeling the cloud and fog computing environment.
This paper presents a process for defining and modeling the application and network topology in iFogSim and investigates its effectiveness in analyzing the end-to-end energy consumption of applications in digital chains. We design four scenarios to map application modules to devices along the chain, including Edge-Cloud collaboration architecture, and compare them with two placement policies of iFogSim: Cloud-only and Edge-ward policies. We analyzed energy consumption from an application perspective using iFogSim to enhance the understanding of how this tool can be used to evaluate the end-to-end energy consumption of applications in digital chains.
- 15:53-16:11, Technical Paper, Daniel Araújo de Medeiros, Gabin Schieffer, Jacob Wahlgren and Ivy Peng
Title: Exploring A Molecular Docking Workflow on Cloud with Kubernetes and Apache Airflow
Complex workflows are a strong driver for leveraging the convergence of HPC and cloud computing. In many scientific domains, efficient workflow management can lead to faster scientific output and reach wider user groups. This work investigated transitioning and deploying HPC workflows onto cloud-native workflow infrastructure, represented by Apache Airflow. We perform a case study based on GPU-accelerated molecular docking software for drug discovery. We described the design of a four-stage workflow in Apache Airflow and the technical changes required for the deployment. Our preliminary results show that transitioning an existing HPC workflow to Apache Airflow in a cloud setting is feasible and only requires minimal deployment changes.
- 16:11-16:29, Technical Paper, Tingkai Liu, Marquita Ellis, Carlos Costa, Claudia Misale, Sara Kokkila-Schumacher, Jinwook Jung, Gi-Joon Nam and Volodymyr Kindratenko
Title: Cloud-Bursting and Autoscaling for Python-Native Scientific Workflows Using Ray
We have extended the Ray framework to enable automatic scaling of workloads on high-performance computing (HPC) clusters managed by SLURM and bursting to Cloud managed by Kubernetes. Comparing to existing HPC-Cloud convergence solutions, our framework demonstrates advantages in several aspects: users can provide their own Cloud resource, framework provides the Python-level abstraction that does not require users to interact with job submission systems, and allows a single Python-based parallel workload to be run concurrently across an HPC cluster and a Cloud. Applications in Electronic Design Automation are used to demonstrate the functionality of this solution in scaling the workload on an on-premises HPC system and automatically bursting to a public Cloud when running out of allocated HPC resources. The paper focuses on describing the initial implementation and demonstrating novel functionality of the proposed framework as well as identifying practical considerations and limitations for using Cloud bursting mode. The code of our framework has been open-sourced.
- 16:29-16:47, Technical Paper, Luanzheng Guo, Jay Lofstead, Jie Ren, Ignacio Laguna, Gokcen Kestor, Line Pouchard, Dossay Oryspayev and Hyeran Jeon
Title: Understanding System Resilience for Converged Computing of Cloud, Edge, and HPC
The emergence of multiple resource management systems, such as SLURM and Kubernetes, for different computational purposes has led to a desire to support a single workflow that spans multiple resource management domains, which can include multiple HPCs, edges, and cloud, over different network domains. Best-of-class tools developed in one domain often do not run well or at all in a different resource management regime demanding these hybrid environments. Understanding the resilience properties and concerns for cross-resource management system workflows is an unexplored area. Further, we lack tools and techniques to test this resilience and to understand how well systems and systems of systems work in the face of faults and failures. We are proposing a Fault Tolerance 500 (FT500) and a related set of benchmarks that test from the hardware layer through the software layers to create resilience scenarios. By making this a scored benchmark set, we offer a public ranking of systems and software and motivation for facilities to allow benchmarking. We also discuss potential approaches to enable fault-tolerant converged computing.
16:50-17:45 Panel Discussion
Converged computing continuum on the horizon across HPC, Cloud, and Edge: Opportunities and Challenges
Moderator: Stefano Markidis (KTH Royal Institute of Technology, Sweden)
- Simon Smart (ECMWF, UK)
- Claudia Misale (IBM, US)
- Utz-Uwe Haus (HPE, CH)
- Daniel Milory (LLNL, US)
- Craig Prunty (Sipearl, FR)
- Benjamin Czaja (SURF, NL)
The landscape of scientific computing is changing rapidly as complex, multi-stage pipelined workflows that combine traditional HPC computations with large-scale data analytics and AI are becoming increasingly common. These next-generation workflows not only seek to improve the efficiency and scale of traditional HPC simulations, but additionally aim to apply large-scale and distributed computing to domains with high societal impact such as autonomous vehicles, precision agriculture, or smart cities. Such complex workflows are expected to require the coordinated use of supercomputers and cloud data centers as well as edge-processing devices, leading to an era of Converged Computing that combines the best of these worlds.
Cloud computing technologies are gaining prevalence in HPC due to their benefits of resource dynamism, automation, reproducibility, and resilience. Similarly, HPC technologies for application performance optimization and sophisticated scheduling of complex resources are being integrated into modern cloud infrastructures. However, the convergence of HPC and cloud also raises a series of new challenges in areas of resource management, data transfers, storage and throughput. Modern cloud and HPC frameworks provide heterogeneous resources, including processors and accelerators, diverse types of memories and storage, and network links, to match the diversity in workloads. Similarly, cloud technologies for elasticity, resilience, and multi-tenancy need to be adopted in HPC while ensuring high performance and throughput. Converged software stacks will need to provide middleware and resource management to facilitate the use of heterogeneous hardware components, improve the system utilization, and provide seamless interfaces for users and application developers.
The First Workshop on Converged Computing (WOCC’23) will provide the edge, HPC and cloud communities a dedicated venue for discussing challenges and research opportunities, deployment efforts, and best practices in supporting complex workflows on coordinated use of supercomputers and cloud data centers as well as edge-processing devices. The workshop encourages interaction between participants who are developing applications, algorithms, middleware and infrastructure for converged environments. The workshop will be an ideal place for the community to define the current state-of-the-art, identify fundamental challenges and feasible future technologies and techniques. The workshop aims to start discussion on questions, including:
- what changes to architecture, hardware, and middleware designs (including hardware monitoring, the operating systems, system software, resource management) are needed?
- How to monitor and collect system level metrics for utilization to identify bottlenecks to meet the different targets in performance, cost, power budget?
- How to support different coupling patterns (e.g., loose or tight) between traditional scientific and big-data/AI components?
- What complex workflows and workloads leverage heterogeneity, elasticity, dynamic resources provisioning?
- Paper Submission – Mar 25, 2023
- Author Notifications –
Apr 10, 2023Apr 14, 2023
- Camera Ready Papers – June 10, 2023, including Copyright Form
- Workshop day – May 25, 2023
- Ivy Peng (KTH Royal Institute of Technology, Sweden)
- Tapasya Patki (Lawrence Livermore National Laboratory, USA)
- Daniel Milroy (Lawrence Livermore National Laboratory, USA)
- Jae-Seung Yeom, Lawrence Livermore National Laboratory, USA
- Andrew Younge, Sandia National Laboratory, USA
- Jeff Vetter, Oak Ridge National Laboratory, USA
- Nathan Tallent, Pacific Northwest National Laboratory, USA
- Bill Magro, Google, USA
- Martin Schulz, Technische Universität München, Germany
- Jesus Carretero, Universidad Carlos III de Madrid, Spain
- Stefano Marikidis, KTH Royal Institute of Technology, Sweden
- Domenico Talia, University of Calabria, Italy
- Estela Suarez, Julich Supercomputing Centre, Germany
- Erwin Laure, Max Planck Computing and Data Facility, Germany
- Jakob Luettgau, University of Tennessee Knoxville, USA
- Kento Sato, Riken, Japan
- Claudia Misale, IBM, USA
- Yoonho Park, IBM, USA
We invite both short paper (8 pages single column) or full paper (12 pages single column) of original works. Accepted papers will be included in the post-conference proceedings in Springer’s Lecture Notes in Computer Science (LNCS). We encourage works that are related to the following topics and accept other related topics.
- Experience and best practice of converged computing of cloud, edge, and HPC
- Resource and job management software
- Architectures, networks, and storage
- Software stack, middleware, and infrastructure
- Advanced models for expressing system resources
- Scheduling, resource allocation, and adaptation
- Complex workflows including HPC, Machine-Learning, and Data Analytics components
- Early results and evaluation of porting HPC applications to clouds
- Elasticity and scalability amid resource heterogeneity
- Virtualization, containers, container runtimes, and container orchestration frameworks
- Scalable and elastic storage and I/O data management services and architectures
- Resilience, fault tolerance, and reliability in converged computing environments
- Early results of leveraging cloud techniques (e.g., Kubernetes, cloud databases) for HPC applications
- System- and application-level resource monitoring and optimization techniques and tools
Use single-column format Text can be short paper (8 pages single column) or full paper (12 pages single column) of original works (including figures and references) with 2 possible extra pages after the review to address the reviewer’s comments. Use Springer’s LaTeX document class or Word template (see Springer’s Proceedings Guidelines) Papers must be suitable for double-blind review (see ISC High Performance Double-Blind Review Guidelines) Papers submitted to be included in the proceedings should not have been previously published or under review for a different venue Note that a membership on a program committee does not inherently create a conflict of interest to submit a paper.
Each paper is expected to receive a minimum of 3 reviews Double-blind peer-review will be used Papers will be evaluated based on novelty, technical soundness, clarity of presentation, and impact The Research Paper Committee reserves the right to reject incorrectly formatted papers.