Meeting of the Minds: Ada Lovelace Award winner discusses the interface between HPC systems and users
Computer scientist Sarah Neuwirth, recently appointed professor at Johannes Gutenberg University Mainz and head of the “High Performance Computing and its Applications” Group, received the Ada Lovelace Award for HPC at the PASC23 conference. In her keynote address, Neuwirth presented her work, goals, and aspirations. Coming from classical computer science, she now specialises in performance engineering and aspires to close the gap between the expected peak performance of modern, highly complex supercomputers and the performance actually achieved. Through reproducible benchmarks, optimisations, and analyses of large HPC workloads, Neuwirth pursues a holistic approach and seeks closer collaboration with users. In this interview, she explains her vision in more detail.
Interview: Simone Ulmer
What drives you as a performance engineer?
Sarah Neuwirth: My expertise is in basic computer science, where you deal with how hardware or an operating system has to function in order for computers to work. During my PhD, I was fortunate enough to complete two research stays at Oak Ridge National Laboratory in the USA, where I worked closely with users for the first time. The fascinating thing I discovered was that most scientists who are deeply immersed in their subject matter completely lack perspective on the systems they use.
And you want to change that.
SN: To really run large-scale problems on the big HPC systems, you need a certain foundation and should know how to parallelise your code. At the same time, you don’t want to overload the researchers with this information. Filling this gap is exactly what excites me: how do I get the systems to operate in an energy-efficient and optimal way without burdening the users? That’s why I ended up in system optimisation and performance engineering, exactly at the interface between these two worlds. The topic of reproducible benchmarking is also important to me, because benchmarking is often used as “benchmark marketing” instead of really providing information about systems. I originally came from a purely optimisation background, but at some point I realised that it doesn’t make much sense to optimise only file access when, on large systems, that access runs over the network, which in turn is optimised by another research community with completely different parameters. That’s why I ended up with holistic performance engineering. I am convinced that there are synergy effects.
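To illustrate what reproducibility means in practice, here is a minimal, purely hypothetical benchmark harness; the kernel() under test and every field name are illustrative assumptions, not Neuwirth’s actual tooling. The point is recording environment metadata alongside the timings, so a result can be interpreted on another system instead of serving as mere marketing.

```python
# A minimal sketch of a reproducible micro-benchmark harness; kernel() is a
# hypothetical stand-in for the code under test.
import json
import platform
import statistics
import time

def kernel(n: int = 1_000_000) -> float:
    """Hypothetical stand-in for the code being benchmarked."""
    return sum(i * i for i in range(n))

def run_benchmark(repeats: int = 10) -> dict:
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        kernel()
        timings.append(time.perf_counter() - start)
    return {
        # Environment metadata: without it, a timing is "benchmark marketing"
        # rather than a reproducible, comparable measurement.
        "system": {
            "hostname": platform.node(),
            "machine": platform.machine(),
            "os": platform.platform(),
            "python": platform.python_version(),
        },
        "repeats": repeats,
        "median_s": statistics.median(timings),
        "stdev_s": statistics.stdev(timings),
        "raw_s": timings,
    }

if __name__ == "__main__":
    print(json.dumps(run_benchmark(), indent=2))
```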
For you, performance engineering is a tool for science that makes life easier. What do you get out of it?
SN: I stand at the bridge to performance engineering, but also to the design of future HPC systems. This means that I am currently involved in several EU projects within a European consortium, where we are working on new architecture paradigms for which no performance engineering options exist yet. For me, the motivation at this point is, on the one hand, user support and, on the other, how to build, analyse, and evaluate new systems in the future, and in turn operate them in an energy-efficient and sustainable way. This is an exciting question that also opens up a new dimension for performance engineering.
In a way, performance engineering is also a tool for me to advance my own research, be it in the field of HPC architecture or system modelling and simulation. In the future, a concept long established in other sciences, so-called digital twinning or digital twins, will also play a role for HPC systems. This means that we will try to map exascale HPC systems, and later zettascale systems, through modelling and miniature digital twins in order to make sound decisions about resource allocation, job distribution, and even the execution of the actual workloads. In my opinion, this is an important key point in performance engineering that is often neglected.
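As a rough illustration of the digital-twin idea, the toy model below simulates placing jobs on a handful of nodes before committing resources on the real machine; the node sizes, job demands, and greedy policy are all invented for illustration and far simpler than any real twin of an exascale system.

```python
# A toy sketch of a "miniature digital twin": a tiny model of a system used to
# try out resource-allocation decisions. All numbers and the policy are
# illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cores: int

def place_jobs(nodes: list[Node], jobs: list[int]) -> dict[str, list[int]]:
    """Greedily place each job (a core demand) on the node with most free cores."""
    placement: dict[str, list[int]] = {n.name: [] for n in nodes}
    for demand in sorted(jobs, reverse=True):
        best = max(nodes, key=lambda n: n.free_cores)
        if best.free_cores >= demand:
            best.free_cores -= demand
            placement[best.name].append(demand)
        # In a real twin, rejected jobs would be queued and retried over time.
    return placement

if __name__ == "__main__":
    twin = [Node(f"node{i}", free_cores=128) for i in range(4)]
    print(place_jobs(twin, jobs=[96, 64, 64, 48, 32, 32, 16]))
```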
Your keynote was an invitation to all the researchers and users in the room for closer collaboration.
SN: Yes, exactly, because I sometimes just lack the user perspective and don’t know exactly what would help them. We often do something in the belief that it is what the researchers need, only to realise later that it is not.
I would like to encourage users to seek contact with performance engineers and system researchers, because the interaction and their view of things help us a lot. The exciting question here, which I discussed with colleagues a few weeks ago, is how much detail you really want to give users. Where is that sweet spot, the balance between too much detail on the one hand and oversimplifying on the other?
In your keynote, you mentioned this discussion and that a colleague once insisted that “users have to work for their performance!” What does that mean in concrete terms?
SN: Originally, I had thought that users would be grateful if they didn’t have to put in a lot of effort. But in the discussion, I thought, “Wow, that’s a really good point,” because we see this well before people become HPC users: already at university, students expect to be able to use ChatGPT to write their thesis. We should consider how much simplification makes sense, so that users do not simply press buttons and execute scripts, and so that we instead promote a culture of healthy scepticism and critical reflection. The question of how we operate systems in the future almost takes on an ethical dimension.
How could performance engineering be made more understandable and user-friendly without cutting back on system requirements?
SN: My group and cooperating partners are currently working on a tool infrastructure that is not meant only for a specific community. We are working on a wider European infrastructure that helps users both identify where and understand how their application could be improved. One example is visualisation in a so-called web dashboard, where the user would receive specific information about optimisation possibilities. However, this is not at a level of detail where we hand you a mathematical model that would first require a 100-page proof or three months of data collection, but something that is tangible for you. That’s why we also work with data centres that integrate these dashboards on site to provide access to system information. On the user side, the dashboard would provide job-specific information, abstracted to the level of help the user actually needs. At the same time, we want to design this framework in such a way that administrators, engineers, and network and system architects can also work with it. They should use it to gain knowledge about the actual system, be it for optimisation or to identify configuration, software, or hardware problems. That way they can see, for example, that a problem is not with the application but is caused by an unnoticed hardware failure or a faulty cable. That is my intention when I say that I want to make processes simpler.
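As a toy example of what such a tangible, job-specific hint could look like (the metric names and the 64 KiB threshold are invented for illustration, not the project’s actual rules):

```python
# A hypothetical sketch of how a dashboard might turn raw job metrics into a
# concrete, human-readable optimisation hint.
def io_hint(metrics: dict) -> str:
    avg_write = metrics["bytes_written"] / metrics["write_calls"]
    if avg_write < 64 * 1024:  # illustrative threshold
        return (f"Average write size is {avg_write / 1024:.0f} KiB; "
                "consider aggregating writes into larger requests.")
    return "I/O request sizes look reasonable."

# Example: 4 MiB written across 4096 calls, i.e. tiny 1 KiB writes.
print(io_hint({"bytes_written": 4 * 2**20, "write_calls": 4096}))
```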
You also appealed at PASC23 for more efficiency in the long term, for example by categorising and storing information from different scientific fields to avoid scientists from different disciplines working on the same problem over and over again. Is this project aiming in that direction?
SN: Indeed. For this framework, we would like to have directly tangible results for the HPC system that the users are currently using, so that they can make adjustments for this system. At the same time, an output file is to be generated, which can then in turn be fed into a community database. I am in contact with the computer scientist Florina Ciorba from the University of Basel, as well as with the supercomputing centres in Barcelona and Jülich, with the aim of creating a web archive with a central website where you can simply upload this file without revealing confidential information from the application code. The objective is to document how the system was configured, what software and hardware were used, which communication patterns the application exhibited, and which data access patterns. Once such a database is established in the long term, it could suggest optimisation possibilities by scientific domain, system configuration, and access pattern, so that we have a starting point in the future. Additionally, this community database could eventually perhaps be used to develop AI tools, in the style of ChatGPT, for automated system optimisation.
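To make this concrete, one such uploaded record might look like the sketch below; every field name and value is a hypothetical illustration rather than the project’s actual schema, and nothing in it exposes confidential application code.

```python
# A hypothetical sketch of one record in such a community database.
import json

record = {
    "domain": "climate-modelling",  # scientific field
    "system": {
        "site": "example-centre",
        "cpu": "64-core x86_64",
        "interconnect": "fat-tree, 200 Gb/s",
        "filesystem": "parallel, 4 metadata servers",
    },
    "software": {"mpi": "4.1", "compiler": "gcc 13.2"},
    "communication_pattern": "nearest-neighbour halo exchange",
    "io_pattern": "N-to-1 shared file, 1 MiB strided writes",
    "observed_issue": "write bandwidth collapses beyond 512 processes",
    "applied_optimisation": "switched to file-per-process output",
}

print(json.dumps(record, indent=2))
```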
As a computer scientist and performance engineer, where do you see the opportunities in the field of artificial intelligence (AI) and machine learning (ML)?
SN: We have been using ML in performance engineering for five to seven years in the area of system optimisation, but not for the whole package, only for partial aspects, such as mapping communication patterns better onto the given hardware. We use ML to better tune the configuration of a job run. Here, we have an unbelievable number of variables and can use ML to approximate small, partial problems. With ML, we achieve much better results than if we tried everything by hand. I see ML and AI as a great opportunity to optimise the entire system, i.e. all aspects of the interaction.
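As a flavour of why hand-tuning does not scale, the sketch below runs a simple random search over a job-configuration space. The parameters, the search space, and run_job() are hypothetical stand-ins (a real tuner would launch instrumented runs and often use smarter methods such as Bayesian optimisation), but even this toy search covers more combinations than anyone would try by hand.

```python
# A minimal sketch of automated job-configuration tuning via random search;
# all parameters and the cost model are illustrative assumptions.
import random

SEARCH_SPACE = {
    "ranks_per_node": [16, 32, 64, 128],
    "io_stripe_count": [1, 4, 8, 16],
    "message_aggregation_kib": [64, 256, 1024],
}

def run_job(config: dict) -> float:
    """Hypothetical stand-in: in practice this would launch an instrumented
    run and return its measured runtime in seconds."""
    return (
        1000 / config["ranks_per_node"]
        + 50 / config["io_stripe_count"]
        + config["message_aggregation_kib"] / 512
    )

def random_search(trials: int = 20) -> tuple[dict, float]:
    best_config, best_time = None, float("inf")
    for _ in range(trials):
        config = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
        runtime = run_job(config)
        if runtime < best_time:
            best_config, best_time = config, runtime
    return best_config, best_time

if __name__ == "__main__":
    config, runtime = random_search()
    print(f"best config {config} -> {runtime:.1f} s (simulated)")
```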
What is still missing to be able to seize this opportunity?
SN: The problem is that we do not yet fully understand how to generate the data sets required to train the corresponding models. Optimising a computer system, whether petascale or exascale, depends on an extremely large number of parameters, from the actual hardware architecture to the type of software and how it is configured. Understanding all this is highly complex and requires experts who have been working on it for decades. Currently, we are not able to map this knowledge directly onto AI. That’s why AI is not yet applicable for overall optimisation, but my colleagues and I are convinced that if we manage to generate the data needed to train these models, it will potentially become possible. That’s where the idea of knowledge databases comes in, in which optimisations for certain systems and scientific domains are recorded. We can use these as a starting point to train such models.
What are the challenges for you personally in the realm of HPC for the coming years?
Keywords: emerging workloads, changes in HPC architectures…
SN: The EU is working on a new approach known as the Modular Supercomputing Architecture and wants to create systems that consist of smaller sub-clusters whose different characteristics cover different needs. For example, a CPU cluster with high single-thread performance is very well suited to applications that cannot be parallelised well, while a GPU cluster is better suited to highly parallel applications. Work is also being done on various other modules, for example for visualisation, data analysis, and, of course, quantum computing. Whether the first exascale computer in Europe based on such heterogeneous architectures will actually work remains an exciting question.
And what about challenges concerning hardware development?
SN: Of course, you can also see changes there, and there could be yet another shift in the technology that is used. The manufacturers are breaking new ground, and we can expect a lot of changes in terms of architectures, so we will have to rethink the software in particular. We are currently using standards that are already very old. The classic example is the so-called POSIX standard, whose origins date back to the 1980s. The big computing systems still use this programming interface between application software and operating system today, because no user wants to touch these codes and the libraries behind them, which have been used and proven for years. So the big discussion in the community is: POSIX or no POSIX? Can we get rid of POSIX, and if so, how? This will remain one of the fundamental questions and is the biggest hurdle. There will also be improvements in cooling technology, but we expect the biggest shift to be in the way the systems are operated.
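To make the interface concrete, here is what the decades-old POSIX file I/O calls look like; the sketch uses Python’s os module, which wraps them almost one-to-one, and the file name and payload are of course just illustrations.

```python
# The classic POSIX file I/O interface (open, write, fsync, close), shown via
# Python's os module, which exposes the underlying system calls directly.
import os

fd = os.open("checkpoint.dat", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
try:
    os.write(fd, b"simulation state ...")  # POSIX write(2)
    os.fsync(fd)                           # force data to stable storage
finally:
    os.close(fd)                           # POSIX close(2)
```

Part of the burden is not these few calls themselves but the strict semantics behind them, for example that a completed write must be visible to all subsequent reads, which is hard for parallel file systems on large machines to honour at scale.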
And then, of course, there are emerging workloads, a very big buzzword. They are indeed a highly complex problem, because the mix of reading and writing data is much more heterogeneous than before, whether in terms of data access patterns, communication patterns, or demands on computing power.
How important is sustainability in terms of carbon footprint in your work?
SN: In my work, this is an aspect I have discovered only recently. But I think there is still a long way to go, because at the moment, the biggest keyword that everyone is holding on to is still energy efficiency. The questions that arise are: is energy efficiency really synonymous with sustainability, and what does energy efficiency actually mean? In Germany, there are metrics for evaluating the efficiency of data centres in which the ratio of consumption before a change to consumption after it is calculated. Overall, it may look good, but perhaps the system is still not energy efficient, just more efficient than before. A lot of work will be needed here, and fortunately there are many colleagues who are committed to efficiency in a broader sense. I would like to make a contribution here, so that performance engineering grows to the point where, when we look at systems as a whole, we also think about how to make them more sustainable.

One example is interval operation, where the systems run at full load for twelve hours overnight and shut down during the day. This would reduce the need for cooling, especially in summer. I recently took a tour of a data centre that cools with hot-water heat exchange. Unfortunately, it is far too warm there in summer, so they have to cool actively; the heat exchange really only works in winter, and yet this is considered green IT.
What impressions and inspirations did you take away from PASC23? Will you attend again?
SN: In fact, this was the second time I had the opportunity to attend the PASC conference. For me, PASC is so interesting because it is an international and interdisciplinary platform that discusses the latest techniques and trends in HPC, especially from a user perspective. I had the opportunity to talk to HPC users and gather new insights for system optimisation. These impressions and comments are very valuable for holistic performance engineering. I sincerely hope that I will have many more opportunities to participate in the PASC conference series in the future.