BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260724T151410Z
LOCATION:Bldg. 8 - B 101
DTSTART;TZID=Europe/Stockholm:20260701T100000
DTEND;TZID=Europe/Stockholm:20260701T103000
UID:submissions.pasc-conference.org_PASC26_sess155_msa249@linklings.com
SUMMARY:Hierarchical Precision and Recursion for Accelerating Symmetric Li
 near Solves on MXUs
DESCRIPTION:Vicky Carrica (Massachusetts Institute of Technology)\n\nSymme
 tric positive-definite system solvers based on Cholesky factorization are 
 a critical performance bottleneck in large scientific workflows, such as c
 limate modeling. Addressing this demands co-design across algorithms, soft
 ware abstractions, and emerging hardware, particularly as modern AI accele
 rators increasingly rely on lower-precision Matrix Processing Units (MXUs)
 . We present a portable, nested recursive mixed-precision solver designed 
 for MXUs, including NVIDIA Tensor Cores (H200) and AMD Matrix Cores (MI300
 X).\nImplemented in Julia, our solver provides a high-level, hardware-agno
 stic interface that enables a single codebase to target multiple architect
 ures. By replacing standard updates with recursive counterparts, we extend
  recursion across the POTRF, TRSM, and SYRK phases. We integrate this into
  a tree-structured precision hierarchy that assigns low-precision FP16 to 
 large off-diagonal blocks and higher precision to diagonal blocks, preserv
 ing numerical stability. To mitigate FP16’s limited dynamic range, we appl
 y lightweight per-block quantization.\nOn NVIDIA H200 GPUs, our solver ach
 ieves a 5.32x speedup over double-precision cuSOLVER, with 100x better acc
 uracy than pure half-precision. Comparable results are observed on the AMD
  MI300X, highlighting performance portability across architectures and dem
 onstrating how Julia facilitates the co-design of algorithms, software, an
 d hardware features to deliver robust, cross-platform acceleration for sci
 entific workflows.\n\nDomain: Climate, Weather, and Earth Sciences, Physic
 s, Computational Methods and Applied Mathematics\n\nSession Chairs: Ludovi
 c Raess (University of Lausanne, ETH Zurich) and Samuel Omlin (ETH Zurich 
 / CSCS)\n\n
END:VEVENT
END:VCALENDAR