Computational marathon matches the efficiency of the AiiDA platform with the power of Switzerland’s Alps supercomputer

This was published on September 16, 2024

A group of MARVEL researchers from the Paul Scherrer Institute has conducted a “hero run” on the new Swiss supercomputer, occupying it entirely for about 20 hours with calculations managed remotely by the AiiDA software tools. The run demonstrated the efficiency and stability of AiiDA, which could seamlessly fill the entire capacity of a near-exascale machine, as well as the performance of the Alps supercomputer, which has just been inaugurated. All the results will soon be published on the Materials Cloud.

By Nicola Nosengo - NCCR MARVEL

It took about 20 hours and a lot of coffee for a team of scientists from the Swiss National Centre of Competence in Research (NCCR) MARVEL to complete a computational marathon that showcased both the power of Switzerland’s main supercomputing facility and the level of maturity achieved by Swiss-made software tools for computational materials science.

The Alps supercomputer, which became operational with its official inauguration on September 14th, 2024, is one of the world’s most powerful supercomputers. It is managed by the Swiss National Supercomputing Center (CSCS) and consists of a geo-distributed infrastructure mainly located in the Lugano data centre.

During the acceptance phase, CSCS granted access to Alps to selected research groups. Among the first to get this opportunity were members of NCCR MARVEL, specifically Giovanni Pizzi’s group. The group is part of the Laboratory for Materials Simulation (LMS), headed by Nicola Marzari at the Paul Scherrer Institute (PSI), and uses computational methods to look for new materials for many applications.

Over the course of one day and one night on July 17th and 18th, a team composed of Marnik Bercx, Michail Minotakis and Timo Reents, all from Pizzi’s group, embarked on what computational specialists call a “hero run” – a time slot in which a supercomputer is entirely reserved for a single user, who can use the full power of the machine to advance their own research and demonstrate that they can efficiently exploit the immense computational power of the full system.

Machine utilization of the CSCS Alps system during our AiiDA “hero run”. After a ramp-up phase of 2.5 hours (with the machine still shared with other users), AiiDA workflows filled all available nodes, with a sustained 99.96% machine utilization for over 18 hours. In total, 14,945,009 SCF iterations were executed over 944,428 ionic steps, as part of 99,225 DFT code runs (using SIRIUS-enabled Quantum ESPRESSO), resulting in the crystal-structure relaxation of 19,829 compounds.
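To put the figures in the caption in proportion, the short sketch below derives a few average ratios from them. It uses only the four totals quoted above; the variable names are illustrative, and the ratios are plain averages that say nothing about how the work was distributed across individual compounds.

```python
# Totals quoted in the figure caption.
scf_iterations = 14_945_009
ionic_steps = 944_428
dft_runs = 99_225
compounds = 19_829

# Plain averages (illustrative only; individual runs will differ widely).
scf_per_ionic_step = scf_iterations / ionic_steps
ionic_steps_per_run = ionic_steps / dft_runs
runs_per_compound = dft_runs / compounds

print(f"SCF iterations per ionic step: {scf_per_ionic_step:.1f}")
print(f"Ionic steps per DFT run:       {ionic_steps_per_run:.1f}")
print(f"DFT runs per compound:         {runs_per_compound:.1f}")
```

On average, this works out to roughly 16 SCF iterations per ionic step, between 9 and 10 ionic steps per code run, and about 5 code runs per relaxed compound.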

The PSI group wanted to match the power of the Alps supercomputer with AiiDA, an open-source tool that helps materials scientists automate the long and complex calculations required to simulate the properties of materials – either existing ones or those still waiting to be discovered. In particular, they interfaced AiiDA and Alps to run high-throughput calculations, where thousands of different materials structures stored in a database are calculated in parallel. It is the kind of computational experiment that makes it possible, for example, to select potential new battery materials out of thousands of known chemical compounds, helping experimentalists to focus their efforts on the most promising ones.

“We wanted to show that AiiDA can fill up all the nodes of a supercomputer with near-exascale performance for many hours and fully exploit the power of the machine while handling, running and maintaining many separate workflows simultaneously, which is necessary for high-throughput computations,” explains Bercx.

The run was managed remotely: the AiiDA software, installed on a PSI server, was used to prepare all input files of the calculations to be performed. The actual computations were executed using an enhanced version of the widely used Quantum ESPRESSO computer code for materials simulations, powered by the SIRIUS library – developed within NCCR MARVEL at CSCS – which allows for the optimal exploitation of the great computing power provided by the graphics processing units (GPUs) of Alps, and implements novel algorithms to significantly improve the simulation success rate.

When the scientists got the green light from the CSCS staff around noon on the chosen date, they started sending input files to the Alps machine. There, the jobs were submitted to scheduling software that queued them and distributed them among the 2,033 NVIDIA Grace Hopper nodes (comprising 8,132 GPUs and 585,504 CPU cores) that were granted for the hero run. On the other side of the connection, AiiDA monitored each job so that, once it had finished, the files could be retrieved, parsed, and stored in AiiDA, and new calculations could then be submitted.
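The cycle described above (submit, monitor, retrieve, then submit again) can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions and not AiiDA code: the function `run_campaign`, the one-node-per-job model, and the random job durations are all hypothetical, and the real system talks to a remote scheduler rather than counting ticks.

```python
import random
from collections import deque


def run_campaign(structures, total_nodes, seed=0):
    """Toy model of a submit/monitor/retrieve loop (hypothetical, not AiiDA).

    Each job occupies one node for a random number of time steps ("ticks").
    Whenever a node is free and work is pending, a new job is submitted.
    """
    rng = random.Random(seed)
    pending = deque(structures)   # structures still waiting to be submitted
    running = {}                  # structure -> remaining ticks
    results = {}                  # structure -> record of the retrieved job
    tick = 0

    while pending or running:
        # Submit new jobs until every node is busy or nothing is pending.
        while pending and len(running) < total_nodes:
            running[pending.popleft()] = rng.randint(1, 3)

        # Let one time step pass, then check every running job.
        tick += 1
        for structure in list(running):
            running[structure] -= 1
            if running[structure] == 0:
                # "Retrieve, parse and store" the finished job.
                del running[structure]
                results[structure] = {"status": "finished", "tick": tick}

    return results, tick


structures = [f"structure-{i}" for i in range(20)]
results, ticks = run_campaign(structures, total_nodes=4)
print(f"{len(results)} jobs finished in {ticks} ticks")
```

The point of the sketch is that nodes stay busy only if new work is submitted as soon as earlier jobs finish, which is the part of the process that was automated during the hero run.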

Very quickly after starting the run, AiiDA could fill the whole Alps supercomputer with jobs, fully exploiting its outstanding computational capabilities. Around 3 AM the team understandably needed a short nap, and relied on AiiDA to continue preparing and submitting new jobs in their absence. The run successfully ended around 9 AM on the second day. “All went smoothly, and the number of available nodes was remarkably stable during the entire hero run, which speaks to the quality of the infrastructure,” says Bercx. The 99.96% utilization of a near-exascale machine is remarkable and quite unprecedented, and very much achieves the goals of NCCR MARVEL, which is dedicated to computational materials discovery enabled by such capabilities and infrastructure.
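As a rough illustration of what these numbers imply, the sketch below works through the hardware counts and the utilization figure. It assumes that utilization means the fraction of node-hours occupied by jobs, and it takes exactly 18 hours as the sustained window (the figure caption says “over 18 hours”), so the idle-time estimate is indicative only.

```python
# Figures quoted in the article.
nodes = 2033
gpus = 8132
cpu_cores = 585_504
utilization = 0.9996

# Assumption: exactly 18 hours of sustained full-machine use.
sustained_hours = 18

gpus_per_node = gpus / nodes
cores_per_gpu = cpu_cores / gpus

# Assumption: utilization = occupied node-hours / available node-hours.
node_hours = nodes * sustained_hours
idle_node_hours = node_hours * (1 - utilization)

print(f"GPUs per node:           {gpus_per_node:.0f}")
print(f"CPU cores per GPU:       {cores_per_gpu:.0f}")
print(f"Available node-hours:    {node_hours}")
print(f"Idle node-hours (est.):  {idle_node_hours:.1f}")
```

Under these assumptions, each node carries 4 GPUs with 72 CPU cores per GPU, and out of about 36,600 available node-hours fewer than 15 were left idle.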

In the end, the team managed to complete almost 100,000 calculations, corresponding to single runs of Quantum ESPRESSO, in just about 16 hours. More specifically, the calculations were about the properties of around 20,000 crystal structures taken from the AiiDA database. “We chose medium-sized structures, because Alps is so powerful that small structures would not use the computational power efficiently,” explains Minotakis. “We started with structures made of 40 atoms, and then in subsequent submissions added slightly smaller and slightly larger structures.” The computations were meant to calculate the electronic properties of the materials in their ground state, determine whether they were magnetic, and calculate their ground-state geometric configuration. “We also had new pseudopotentials that we wanted to test, so we updated the calculations for a large fraction of the structures in the database and checked the differences with previous calculations,” says Reents.

All the results will soon be published as FAIR and open data, and uploaded to the Materials Cloud, the online data sharing platform of NCCR MARVEL, to expand the MC3D database of inorganic 3D crystal structures.

In addition to the great scientific value of these simulations, the run demonstrated the efficiency and stability of AiiDA, which could seamlessly fill the entire capacity of a near-exascale machine. “The performance of the new Alps machine is outstanding, even more so when combined with the high-throughput capabilities of AiiDA. It is impressive that we could compress into less than a day the equivalent computing power granted for one full year to large supercomputing projects at CSCS, equivalent to approximately 800,000 GPU hours of computation on the previous-generation CSCS supercomputer Daint,” says Pizzi.

