Carme Combines Machine Learning With HPC Back-End

Open Source Multi-User Software Stack Carme for Interactive Machine Learning

Machine learning (ML) has an increasingly high priority in both scientific and industrial enterprises. This is evident from the investment in new, above all GPU-based, hardware, ranging from simple desktop computers to high-performance computing (HPC) clusters. Such clusters process and analyze large amounts of data and use machine learning methods to simulate highly complex systems – for example, the human brain.

Running ML workloads on HPC clusters presents challenges of its own: procuring the individual hardware components is only the first step.

These challenges are addressed by our open-source software stack Carme. The name Carme refers not only to a moon of Jupiter – and to the cluster of moons named after it – but also to a software stack with which multiple users manage the available resources of a computing cluster. The basic concept is to combine the world of machine learning and data analysis with the world of HPC systems. For this purpose, we combine established machine learning and data analysis tools with proven HPC back-ends.

Carme Connects the World of Machine Learning and of HPC Clusters

Machine learning is a constantly and rapidly growing field. This agility confronts data centers with the challenge of providing very different applications to individual users. The demand for ever-new software and libraries has never been as high on HPC systems as it is today. Moreover, these libraries usually have to be installed via Python environments rather than through the Linux system's package manager, and it is not uncommon for different deep learning frameworks to have mutually exclusive dependencies. In Carme, we rely on software containers to solve these problems, which simplifies maintenance for administrators and usage for users. In addition, interactive development environments are common in deep learning and data analytics. By integrating interactive cluster usage, users can work with tools they already know on a complex HPC cluster, which eases both the transition to and the use of a cluster.

It is therefore not enough to provide user interfaces and libraries; they must also integrate smoothly into existing and emerging clusters. An intuitive software environment on the cluster increases usability for all users.

Carme Dashboard
© Fraunhofer ITWM
The Carme dashboard with: status bar (including history, help, messages, the user menu, and special tools for admins), system messages, cluster utilization with prediction (all GPUs or a single GPU type), information about the selectable GPU types, the job configuration with »job start button«, list of running jobs (with different entry points, job information, a visual runtime bar, and the »job start button«), and links to the documentation and the local wiki.
Illustration of the concept of Carme
© Fraunhofer ITWM
Illustration of the concept of Carme: By combining various AI & Data Science tools with proven HPC tools, Carme enables existing HPC systems to be easily and quickly used for both AI application development and training and teaching in the AI environment.

The Tools We Use

Container Images

By using container images, we provide the software required for a wide variety of applications quickly and easily, meeting users' needs without overloading the operating system of the compute nodes. Since the software resides in the image, it can also be managed and updated there.
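As an illustration, such an image could be described in a container definition file in the style of Singularity/Apptainer, which is commonly used on HPC systems; the base image and the added packages below are placeholders, not Carme defaults:

```
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:24.01-py3

%post
    # add project-specific libraries on top of the base image
    pip install --no-cache-dir scikit-learn pandas

%runscript
    exec python "$@"
```

In a setup like this, administrators build and update the images centrally, while users simply select a suitable image when starting a job.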

Interactive Tools

For most users, working with a graphical user interface is far more familiar than working on the command line of a Linux system. With web-based front-ends such as Jupyter notebooks or Theia, users do not have to install extra software on their own machines to access the cluster – a web browser is enough.

Batchsystem (SLURM)

The batch system SLURM (Simple Linux Utility for Resource Management) shares and allocates resources among users effectively and easily. We have simplified the process to the point that users only need to specify the number of GPUs and compute nodes they require; Carme takes care of the rest.
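For comparison, submitting an equivalent job directly through SLURM requires a batch script along the following lines; the resource counts and the training script name are placeholders:

```
#!/bin/bash
#SBATCH --nodes=2            # number of compute nodes
#SBATCH --gres=gpu:4         # GPUs per node
#SBATCH --ntasks-per-node=4  # one task per GPU
#SBATCH --time=04:00:00      # wall-clock time limit

# launch the (hypothetical) training script on all allocated nodes
srun python train.py
```

Carme hides these details from the user, who only provides the GPU and node counts via the dashboard.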

Distributed File System (BeeGFS)

Thanks to the parallel file system BeeGFS, developed in-house, data can be made available quickly and effectively while a job is running.

Maintenance and Monitoring Tools

With the help of monitoring tools such as Zabbix, cluster administrators can track GPU, CPU, memory, and network utilization and share this information with users through charts.