BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260421T090513Z
LOCATION:Plenary Room (Bldg. 6 - 001)
DTSTART;TZID=Europe/Stockholm:20260629T193600
DTEND;TZID=Europe/Stockholm:20260629T193700
UID:submissions.pasc-conference.org_PASC26_sess124_pos128@linklings.com
SUMMARY:FPGA-Specific Optimizations for Multi-Device Shallow Water Simulat
 ions with SYCL
DESCRIPTION:Christoph Alt (Paderborn University, Friedrich-Alexander-Unive
 rsität Erlangen-Nürnberg); Markus Büttner (University of Bayreuth); Tobias
  Kenter (Paderborn University); Harald Köstler (Friedrich-Alexander-Univer
 sität Erlangen-Nürnberg); Christian Plessl (Paderborn University); and Vad
 ym Aizinger (University of Bayreuth)\n\nThe shallow water equations are an
  essential tool for modeling tides, tsunamis, and storm surges. At PASC 24
 , we presented an implementation of the shallow water equations running on
  CPUs, GPUs and FPGAs. While the numerical code is shared across the diffe
 rent architectures, the implementation uses SYCL as a portability layer to
  support architecture-specific memory layouts and communication routines. 
 This poster provides a detailed overview of the FPGA-specific optimisation
 s of this portable codebase. Unlike CPUs and GPUs, FPGAs do not provide a 
 cache-based memory hierarchy that mitigates the cost of accessing slow off
 -chip memory. To reduce the bandwidth bottleneck, the shallow water solver
  makes use of RAM blocks on the FPGA device, which act as static, array-sp
 ecific caches storing necessary data in fast on-chip memory. When the enti
 re mesh fits into the on-chip caches, the FPGA designs nearly achieve the
  ideal throughput of one element per clock cycle. Along with an MPI-based 
 communication scheme for CPUs and GPUs, the implementation also supports d
 irect streaming communication for FPGAs. In combination with the on-chip c
 aches, this achieves super-linear scaling in a strong scaling scenario, as
  the optimal performance per FPGA is reached when the complete partition f
 its into the on-chip caches.\n\n
END:VEVENT
END:VCALENDAR
