Resource and Job Management in HPC clusters with Slurm: Administration, Usage and Performance Evaluation

Tutorial presented at IEEE Cluster 2016

Slides can be found here

Bull - Atos Technologies

Abstract

High Performance Computing is characterized by the continuous evolution of computing architectures, the proliferation of computing resources and the increasing complexity of the applications that users wish to execute. One of the most important components of the HPC software stack, dealing with both the evolution of hardware resources and the needs of applications, is the Resource and Job Management System (RJMS). This system software, which stands between the user workloads and the infrastructure, provides functions to configure and manage the pool of resources, along with features for building, submitting, scheduling and monitoring user jobs in a dynamic computing environment.

This tutorial covers the Slurm Resource and Job Management System. Slurm is an open source RJMS, specifically designed for the scalability requirements of state-of-the-art HPC clusters. As of the November 2015 Top500 supercomputer list, Slurm is used on five of the ten most powerful computers in the world, including the number one system, Tianhe-2, with 3,120,000 computing cores. Over the years, it has evolved from a simple resource manager into a complex but very powerful resource and job manager, workload scheduler and research tool.

The tutorial will give an overview of the concepts and underlying architecture of Slurm, and it will focus on aspects related to both administrator configuration and user execution.

Outline of the Tutorial

The tutorial will be divided into three parts: Administration, Usage and Performance Evaluation.

The administration part will provide a detailed description and hands-on exercises for features such as job prioritization, resource selection, GPGPUs and generic resources, advanced reservations, accounting (associations, QOS, etc.), scheduling (backfill, preemption), high availability, power management, topology-aware placement, license management, burst buffers and scalability tuning, with a particular focus on the configuration of the newly developed power adaptive scheduling technique.

The usage part will provide in-depth analysis and hands-on exercises for CPU usage parameters, options for multi-core and multi-threaded architectures, prolog and epilog scripts, job arrays, tight MPI integration, CPU frequency scaling, and accounting, reporting and profiling of user jobs. There will also be a particular focus on new functionalities within Slurm, such as the use of power adaptive scheduling (introduced in version 15.08) and support for a job specification language for heterogeneous resources, along with multiple program multiple data (MPMD) MPI support (to appear in version 17.02).

Finally, the performance evaluation part will present techniques, with hands-on exercises, for experimenting with Slurm at large scale using simulation and emulation, which will be valuable for researchers and developers. For the hands-on exercises, dedicated VM and/or container environments will be made available, along with a pre-installed testbed cluster, so that participants can experiment with the different functionalities.

Goals, demonstrations and exercises

The goal of the tutorial is to provide a detailed view of the Slurm resource and job management system for administrators, users, developers and researchers.

- The hands-on user-level exercises will take place on a pre-installed cluster, where each participant will obtain a guest account and will be able to explore the different possibilities of Slurm through detailed step-by-step exercises.

- The hands-on administrator-level exercises will take place in virtualized or containerized environments. Specifically constructed VM and/or Docker environments will be made available to each interested participant, who will be able to configure virtual clusters and experiment with the different available functionalities and configuration shortcuts.

- The performance evaluation exercises will make use of open-source tools and techniques that have been developed by the instructors to experiment with Slurm at large scale. VM and Docker environments will be used to facilitate the use of these tools.

Target audience

The tutorial is intended for scientists from different domains of computer science who use HPC clusters in their everyday research. It is also intended for site administrators who wish to adopt Slurm as their RJMS or who need to deepen their knowledge and skills in advanced configuration and tuning. Finally, it is intended for researchers and developers who use or wish to use Slurm as a research tool for experimentation, performance evaluation and comparisons in the areas of resource management and scheduling.

Prerequisite background and content level: Basic knowledge of cluster administration tools, MySQL databases, HPC principles and MPI will be needed for simple configuration and basic usage. Knowledge of scripting languages and an understanding of C programming will be needed for the advanced configuration, usage and performance evaluation parts.

Instructors' brief biographies

The instructors are Slurm code contributors, have configured and tuned various production HPC clusters and are active in both the research and development of Slurm.

Yiannis Georgiou (PhD) is a systems software architect at BULL/ATOS R&D in the HPC and Big Data group, with expertise in resource management and scheduling. He is the lead architect of the developments carried out within BULL on the open-source workload manager Slurm, and he participates in various research projects in the area. He is an active Slurm developer and a Slurm User Group conference committee member, and he participates actively in defining Slurm's roadmap. He has published various articles in the field and has given numerous talks and tutorials on resource management and scheduling. His research interests are centered around job scheduling, energy efficiency, workload modeling, scalability, power management and high throughput computing. He holds an M.Sc. degree in Computer Science (2006) and a PhD degree (2010) in resource management and scheduling in HPC, both obtained from Joseph Fourier University, Grenoble, France.

David Glesser is a PhD student working on fast and multi-objective scheduling algorithms for High Performance Computing. His PhD is a collaboration between the French leader in supercomputers, Bull/Atos, and the DataMove team of the University Grenoble-Alpes, France. He has published papers on energy-efficient scheduling and machine learning scheduling in prestigious international conferences such as CCGrid and Supercomputing. He first worked for Bull as an engineer on the open-source software Slurm before starting his PhD. His past and future developments focus on experimenting with Slurm, adapting Slurm to a wide range of different usages, improving energy management and developing new scheduling algorithms within Slurm. He has given talks at many Slurm User Group meetings as well as at SC Birds of a Feather sessions.

Instructors' related publications

[1] Karim Djemame, Django Armstrong, Richard E. Kavanagh, Jean-Christophe Deprez, Ana Juan Ferrer, David Garcia Perez, Rosa M. Badia, Raúl Sirvent, Jorge Ejarque, Yiannis Georgiou: TANGO: Transparent heterogeneous hardware Architecture deployment for eNergy Gain in Operation. CoRR abs/1603.01407 (2016)

[2] Yiannis Georgiou, David Glesser, Krzysztof Rzadca, Denis Trystram: A Scheduler-Level Incentive Mechanism for Energy Efficiency in HPC. CCGRID 2015: 617-626

[3] Yiannis Georgiou, David Glesser, Denis Trystram: Adaptive Resource and Job Management for Limited Power Consumption. IPDPS Workshops 2015: 863-870

[4] Éric Gaussier, David Glesser, Valentin Reis, Denis Trystram: Improving backfilling by using machine learning to predict running times. SC 2015: 64:1-64:10

[5] Daniel Hackenberg, Thomas Ilsche, Joseph Schuchart, Robert Schöne, Wolfgang E. Nagel, Marc Simon, Yiannis Georgiou: HDEEM: high definition energy efficiency monitoring. E2SC@SC 2014: 1-10

[6] Yiannis Georgiou, Thomas Cadeau, David Glesser, Danny Auble, Morris Jette, Matthieu Hautreux: Energy Accounting and Control with SLURM Resource and Job Management System. ICDCN 2014: 96-118

[7] Yiannis Georgiou, Matthieu Hautreux: Evaluating Scalability and Efficiency of the Resource and Job Management System on Large HPC Clusters. JSSPP 2012: 134-156