BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260522T162631Z
LOCATION:Plenary Room (Bldg. 6 - 001)
DTSTART;TZID=Europe/Stockholm:20260630T170000
DTEND;TZID=Europe/Stockholm:20260630T173000
UID:submissions.pasc-conference.org_PASC26_sess133_pap132@linklings.com
SUMMARY:Combining Domain and Tensor Parallelism to Train Multi-Billion-Par
 ameter AI Weather Models
DESCRIPTION:Deifilia Kieckhefen (Karlsruhe Institute of Technology (KIT));
  Markus Götz (Karlsruhe Institute of Technology, Helmholtz AI); and Lars H
 elge Heyen, Achim Streit, and Charlotte Debus (Karlsruhe Institute of Tech
 nology (KIT))\n\nAI-based methods have recently revolutionized atmospheric
  modeling. Successes in medium-range forecasting have led to rapid develop
 ments towards AI-based models. However, accurate modeling of complex atmos
 pheric dynamics at high spatial resolutions requires billions of neural ne
 twork parameters and gigabyte-sized data samples, making accelerator memor
 y and I/O bandwidth the bottlenecks for model training. To overcome these 
 limitations, we introduce Jigsaw, a distributed training and inference sch
 eme that leverages domain and tensor parallelism to eliminate memory redun
 dancy across model-parallel processes and reduce I/O demands. We apply the
  Jigsaw parallelization scheme to an MLP-Mixer architecture, WeatherMixe
 r, a multi-layer-perceptron-based model with global vision that is well-su
 ited for learning weather phenomena. Using Jigsaw, we train WeatherMixer 
 with up to 3.2B parameters, achieving predictive performance competitive 
 with numerical weather prediction and state-of-the-art AI models. To highl
 ight the computational performance, we perform scaling experiments on glob
 al 0.25° (≈ 30 km resolution) ERA5 data across two HPC systems. Anticipati
 ng that future reanalysis datasets will include even higher resolutions, 
 we demonstrate, for the first time, training on 0.125° data. Scaling exper
 iments show that the data-loading bottlenecks arising from high-reso
 lution input data samples are reduced through domain parallelism, subseque
 ntly improving per-GPU computational throughput.\n\nIn compute–communicati
 on–limited regimes, Jigsaw achieves state-of-the-art performance in distri
 buted model training, with 97% of theoretical peak performance on 4 GPUs 
 and a strong scaling speedup of 6.4 when training across 8 GPUs. By combin
 ing domain, tensor, and data parallelism at larger scales, training on 25
 6 GPUs reaches 11 PFLOPs with a scaling efficiency of 72% compared to 51% 
 without Jigsaw.\n\n
END:VEVENT
END:VCALENDAR
