BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260625T133339Z
LOCATION:Bldg. 6 - 001 - Plenary Room
DTSTART;TZID=Europe/Stockholm:20260630T121800
DTEND;TZID=Europe/Stockholm:20260630T121900
UID:submissions.pasc-conference.org_PASC26_sess129_pos119@linklings.com
SUMMARY:P32 - Profile-Guided-Optimisation of Lattice QCD Contractions on C
 PU and GPU
DESCRIPTION:JingJing Li, Urs Wenger, and Roman Gruber (University of Bern)
 \n\nWe present a performance optimisation study of the 2+2 disconnected co
 mponent bottleneck of Lattice QCD computation of the hadronic light-by-lig
 ht contribution to the muon's anomalous magnetic moment. Optimisations o
 n CPU and GPU architectures were guided by the popular profiling tools per
 f, valgrind, nsys, and ncu.\n\nThe 2+2 contraction is first decomposed int
 o small
 er modules for easier profiling and optimisation. The CPU optimisations in
 cluded loop re-arrangement to improve data locality, the replacement of mu
 lti-level nested memory allocations with contiguous 1D pointers, and the r
 esolution of thread-synchronisation bottlenecks in parallel regions. L1 ca
 che miss rates see multi-factor reductions, and parallelised regions sho
 w near-perfect strong scaling across the modules. On GPUs, we rearranged d
 ata access patterns, improved parallel work distribution, and identified l
 atency issues. Finally, kernels are tied together with asynchronous stream
 ing to overlap workloads at small problem sizes and hide communication an
 d data copy latencies. The optimised code achieved an overall 30% runti
 me reducti
 on in production environments (GPU) on CSCS Daint. This project highlights
  how systematic profiling and targeted optimisations can yield significant
  resource savings in computationally intensive legacy code.\n\nSession Cha
 ir: Tobias Hodel (University of Bern, Switzerland)\n\n
END:VEVENT
END:VCALENDAR
