It took about 20 hours and a lot of coffee for a team of scientists from the Swiss National Center of Competence in Research NCCR MARVEL to complete a computational marathon that showcased both the power of Switzerland’s main supercomputing facility and the maturity of Swiss-made software tools for computational materials science.
The Alps supercomputer, which just became operational with its official inauguration on September 14, 2024, is one of the world’s most powerful supercomputers. It is managed by the Swiss National Supercomputing Center (CSCS) and it consists of a geo-distributed infrastructure mainly located in the Lugano data center.
During the acceptance phase, CSCS gave selected research groups access to Alps. Among the first with this opportunity were members of NCCR MARVEL, specifically Giovanni Pizzi’s group, part of the Laboratory for Materials Simulation (LMS) at PSI headed by Nicola Marzari. The group uses computational methods to look for new materials for many applications.
Over the course of one day and one night on July 17 and 18, a team including Marnik Bercx, Michail Minotakis and Timo Reents, all from Pizzi’s group, embarked on what computational specialists call a “hero run”: a time slot when a supercomputer is entirely reserved for a single user, who can use the full power of the machine to advance their own research and demonstrate that they can efficiently exploit the immense computational power of the full system.
The PSI group wanted to match the power of the Alps supercomputer with AiiDA, an open-source tool that helps materials scientists automate the long and complex calculations required to simulate the properties of materials, either existing ones or those still waiting to be discovered.
In particular, they interfaced AiiDA and Alps to run high-throughput calculations, where thousands of different materials structures stored in a database are calculated in parallel. It is the kind of computational experiment that allows, for example, the selection of potential new battery materials out of thousands of known chemical compounds, helping experimentalists to focus their efforts on the most promising ones.
“We wanted to show that AiiDA can fill up all the nodes of a supercomputer with near-exascale performance for many hours and fully exploit the power of the machine while handling, running and maintaining many separate workflows simultaneously, which is necessary for high-throughput computations,” explains Bercx.
The run was managed remotely, with the AiiDA software installed on a PSI server and used to prepare all input files of the calculations to be performed. The actual computations were executed using an enhanced version of the widely used Quantum ESPRESSO computer code for materials simulations, powered by the Sirius library (developed within NCCR MARVEL at CSCS), which allows for the optimal exploitation of the great computing power provided by the graphics processing units (GPUs) of Alps and implements novel algorithms to significantly improve the simulation success rate.
When the scientists got the green light from the CSCS staff around noon on the chosen date, they started sending input files to the Alps machine. There, the jobs were submitted to scheduling software that queued them and distributed them among the 2,033 NVIDIA Grace Hopper nodes (including 8,132 GPUs and 585,504 CPU cores) that were granted for the hero run. On the other side of the connection, AiiDA monitored each job so that, once it finished, the files could be retrieved, parsed, and stored in AiiDA, and new calculations could then be submitted.
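The submit, monitor, retrieve and resubmit cycle described above can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not AiiDA's API or the team's actual setup: the function name `run_high_throughput`, the notion of fixed "slots", and the job durations are all hypothetical and chosen purely for illustration.

```python
import heapq
import itertools


def run_high_throughput(durations, n_slots):
    """Toy model of a high-throughput driver (hypothetical, not AiiDA).

    Each entry of `durations` is the run time of one job. The driver keeps
    up to `n_slots` jobs running and submits a new job as soon as one ends.
    Returns (completed_jobs, makespan, utilization).
    """
    pending = iter(durations)
    running = []  # min-heap of finish times of the jobs currently running
    now = 0.0
    busy_time = 0.0
    completed = 0

    # Initial fill: occupy every slot with a job.
    for duration in itertools.islice(pending, n_slots):
        heapq.heappush(running, now + duration)
        busy_time += duration

    # Monitor loop: when a job ends, "retrieve" it and submit the next one.
    while running:
        now = heapq.heappop(running)
        completed += 1
        next_duration = next(pending, None)
        if next_duration is not None:
            heapq.heappush(running, now + next_duration)
            busy_time += next_duration

    makespan = now
    utilization = busy_time / (n_slots * makespan) if makespan else 0.0
    return completed, makespan, utilization


if __name__ == "__main__":
    # Eight equal-length jobs on four slots keep every slot busy throughout.
    print(run_high_throughput([1.0] * 8, n_slots=4))
```

In this simplified picture, utilization drops below 1 only when slots sit idle, for example near the end of the run when no pending jobs remain to refill them.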
Very quickly after the run started, AiiDA filled the whole Alps supercomputer with jobs, fully exploiting its outstanding computational capabilities. Around 3 AM, the team understandably needed a short nap, and relied on AiiDA to continue preparing and submitting new jobs in their absence. The run successfully ended around 9 AM on the second day.
“All went smoothly, and the number of available nodes was remarkably stable during the entire hero run, which speaks to the quality of the infrastructure,” says Bercx. The 99.96% utilization of a near-exascale machine is utterly remarkable and quite unprecedented, and it very much achieves the goals of NCCR MARVEL, which is dedicated to computational materials discovery enabled by such capabilities and infrastructure.
In the end, the team managed to complete almost 100,000 calculations, each corresponding to a single run of Quantum ESPRESSO, in just about 16 hours. More specifically, the calculations covered the properties of around 20,000 crystal structures taken from the AiiDA database.
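As a rough back-of-the-envelope check, the figures quoted in the article can be combined into a few derived quantities. The sketch below rounds "almost 100,000" calculations, "around 20,000" structures and "about 16 hours" to exact values, which is an assumption, so the results are order-of-magnitude estimates rather than reported measurements.

```python
# Figures quoted in the article (the last three are rounded approximations).
NODES = 2033
GPUS = 8132
CPU_CORES = 585_504
CALCULATIONS = 100_000  # assumption: "almost 100,000" rounded up
STRUCTURES = 20_000     # assumption: "around 20,000"
HOURS = 16              # assumption: "just about 16 hours"

gpus_per_node = GPUS / NODES                 # GPUs on each node
cores_per_gpu = CPU_CORES / GPUS             # CPU cores paired with each GPU
calcs_per_hour = CALCULATIONS / HOURS        # average throughput
calcs_per_structure = CALCULATIONS / STRUCTURES
# Average node time per calculation if every node were busy the whole time.
node_minutes_per_calc = NODES * HOURS * 60 / CALCULATIONS

print(f"GPUs per node:            {gpus_per_node:.0f}")
print(f"CPU cores per GPU:        {cores_per_gpu:.0f}")
print(f"Calculations per hour:    {calcs_per_hour:.0f}")
print(f"Calculations/structure:   {calcs_per_structure:.0f}")
print(f"Node-minutes/calculation: {node_minutes_per_calc:.1f}")
```

Under these rounded inputs, each node carries four GPUs with 72 CPU cores apiece, and each structure accounts for about five calculations on average. The node-minutes figure is only an average of node time per calculation and says nothing about how many nodes an individual job used.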
“We chose medium-sized structures, because Alps is so powerful that small structures would not use the computational power efficiently,” explains Minotakis. “We started with structures made out of 40 atoms, and then in subsequent submissions added slightly smaller and slightly larger structures.”
The computations were meant to calculate the electronic properties of the materials in their ground state, find whether they were magnetic or not, and calculate their ground-state geometric configuration.
“We also had new pseudopotentials that we wanted to test, so we updated the calculations for a large fraction of the structures in the database and checked the differences with previous calculations,” says Reents. All the results will soon be published as FAIR and open data, and uploaded to the Materials Cloud, the online data sharing platform of NCCR MARVEL, to expand the MC3D database of inorganic 3D crystal structures.
In addition to the great scientific value of these simulations, the run demonstrated the efficiency and stability of AiiDA, which could seamlessly fill the entire capacity of a near-exascale machine.
“The performance of the new Alps machine is outstanding, even more so when combined with the high-throughput capabilities of AiiDA. It is impressive that we could compress in less than a day the equivalent computing power granted for one full year to large supercomputing projects at CSCS, equivalent to approximately 800,000 GPU hours of computation on the previous-generation CSCS supercomputer Daint,” says Pizzi.
Provided by
National Centre of Competence in Research (NCCR) MARVEL