Carme – The AI Suite for HPC Clusters

Turn Your HPC Cluster into a Full-fledged AI Cluster.

Artificial Intelligence (AI) algorithms are becoming an ever higher priority in both science and industry. This is evident from the investment in new, predominantly GPU-based hardware, from simple desktop computers to high performance computing (HPC) clusters. Computing clusters are used in Data Analysis and in highly complex Machine Learning (ML) systems to process and simulate very large amounts of data, up to and including the human brain.

AI on HPC clusters presents certain challenges we are solving with Carme. The procurement of the individual hardware components is the least of these challenges.

Most important system components of Carme and their interconnections.
© Fraunhofer ITWM
To make HPC systems attractive and usable, our open-source framework Carme combines the best of two worlds: solid, proven HPC tools (such as a batch system and a parallel file system) on the one hand, and newer tools (such as container solutions and web IDEs) on the other.

Achieving such a combination raises several questions:

  • How to manage existing resources?
  • How to make an application scalable to several GPUs?
  • How to store data and feed it continuously to the running program?
  • How to train users to effectively use the hardware?
The answers to these questions begin with our open-source software stack Carme. The name Carme refers not only to a moon of Jupiter or a cluster of Jupiter moons, but also to a software structure that lets several users manage the available resources of a computing cluster. The basic concept is to combine the world of machine learning and data analysis with the world of HPC systems. We achieve this by using established ML and DA tools with HPC back ends.
The Carme dashboard with detailed cluster information and the job management interface.
© Fraunhofer ITWM

Carme Connects the Two Worlds of AI and HPC

Artificial Intelligence is a constantly and rapidly growing field. This pace confronts data centers with the challenge of providing very different applications for individual users. Never before have HPC systems faced such a steady demand for new software and libraries. On top of that, these libraries usually have to be installed not through the Linux system itself but through Python environments. It is also not uncommon for different Deep Learning algorithms to have mutually exclusive dependencies. In Carme, we rely on software containers to solve these problems, which eases maintenance for administrators and improves usability for users. Interactive development interfaces are also common in Deep Learning and Data Analysis. By integrating an interactive cluster system, Carme gives users the chance to work with tools they already know on a complex HPC cluster, which eases both the changeover to a cluster and its use.

Thus, it is not enough to provide user interfaces and libraries. There must also be a smooth integration of these into existing and emerging clusters. An intuitive software environment on the clusters increases the usability for all users.

The Tools We Use

Container Images

By using container images, we quickly and easily provide the software required for a wide variety of applications and meet the needs of users without overloading the operating system of the compute nodes. Because the software lives in the image, it can be managed and updated through the image.
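
As a rough illustration of how a containerized application might be started on a compute node, the sketch below builds a command line for a container runtime. It assumes Singularity/Apptainer as the runtime, which the text above does not specify; the image name, the training script, and the bind paths are hypothetical, and this is not Carme's actual launch mechanism.

```python
def build_container_command(image, command, use_gpu=True, binds=None):
    """Build an argument list that runs `command` inside a container image.

    Assumption: the runtime is Singularity/Apptainer. `--nv` exposes the
    host's NVIDIA GPUs inside the container, and `--bind` mounts a host
    directory (for example a parallel file system) into the container.
    """
    argv = ["singularity", "exec"]
    if use_gpu:
        argv.append("--nv")
    for host_path, container_path in (binds or {}).items():
        argv += ["--bind", f"{host_path}:{container_path}"]
    argv.append(image)
    argv += list(command)
    return argv


# Hypothetical image and script names, for illustration only.
example = build_container_command(
    "pytorch.sif", ["python", "train.py"], binds={"/data": "/data"}
)
print(" ".join(example))
```

Because every dependency sits inside the image file, two users can run images with mutually exclusive library versions on the same node without touching the node's operating system.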

Interactive Tools

For most users, working with a graphical user interface is much more familiar than working on the command line of a Linux system. With web-based front ends like Jupyter Notebooks or Theia, users do not have to install extra software on their own operating system to access the cluster.

Batch System (SLURM)

The batch system SLURM (Simple Linux Utility for Resource Management) shares and allocates resources among users effectively and easily. We have simplified the process to such an extent that users only need to specify the number of GPUs and compute nodes they require; Carme takes care of the rest.
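
To give a feel for what "the rest" can involve, the following minimal sketch translates a GPU and node count into a SLURM submission command. It is an illustration under stated assumptions, not Carme's implementation: the function name, the job name, and the script name are hypothetical. Note that SLURM's `--gres=gpu:N` option counts GPUs per node.

```python
import shlex


def build_sbatch_command(gpus_per_node, num_nodes,
                         script="train.sh", job_name="carme-job"):
    """Build an `sbatch` argument list from a GPU and node count.

    `script` and `job_name` are hypothetical defaults for illustration.
    """
    if gpus_per_node < 1 or num_nodes < 1:
        raise ValueError("gpus_per_node and num_nodes must be at least 1")
    return [
        "sbatch",
        f"--job-name={job_name}",
        f"--nodes={num_nodes}",
        f"--gres=gpu:{gpus_per_node}",  # generic resources are requested per node
        script,
    ]


# Show the command a user would otherwise have to write by hand.
print(shlex.join(build_sbatch_command(gpus_per_node=2, num_nodes=1)))
```

Wrapping the submission this way keeps scheduler details such as partitions, accounting, and environment setup out of the user's sight, while the underlying batch system still enforces fair sharing.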

Distributed File System (BeeGFS)

Thanks to the parallel file system BeeGFS, which was developed in-house, data can be made available quickly and efficiently while a simulation is running.

Maintenance and Monitoring Tools

With the help of monitoring tools such as Zabbix, the cluster administrator can see GPU, CPU, memory, and network utilization, and can share this information with users through diagrams.
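
The kind of GPU figures shown in such diagrams can be sketched as follows. The example parses the CSV text that `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits` prints, one line per GPU. This is an assumption for illustration only: it does not describe how Zabbix or Carme actually collect their metrics, and the sample values are made up.

```python
import csv
import io


def parse_gpu_csv(text):
    """Parse per-GPU lines of the form 'utilization, memory used, memory total'.

    Expects the output of nvidia-smi with --format=csv,noheader,nounits,
    where utilization is in percent and memory values are in MiB.
    """
    gpus = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        util, used, total = (float(value) for value in record)
        gpus.append({
            "gpu_util_percent": util,
            "memory_used_mib": used,
            "memory_total_mib": total,
            "memory_percent": round(100 * used / total, 1),
        })
    return gpus


# Made-up sample output for two GPUs.
sample = "87, 10240, 16384\n12, 2048, 16384\n"
for index, gpu in enumerate(parse_gpu_csv(sample)):
    print(index, gpu["gpu_util_percent"], gpu["memory_percent"])
```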