Malas T., King Abdullah University of Science and Technology | Hager G., Erlangen Regional Computing Center | Ltaief H., King Abdullah University of Science and Technology | Keyes D., King Abdullah University of Science and Technology
2nd EAGE Workshop on High Performance Computing for Upstream | Year: 2015

Today's high-end multicore systems are characterized by a deep memory hierarchy, i.e., several levels of local and shared caches, with limited size and bandwidth per core. The ever-increasing gap between processor and memory speed will further exacerbate the problem and has led the scientific community to revisit numerical software implementations to better suit the underlying memory subsystem for performance (data reuse) as well as energy efficiency (data locality). The authors propose a novel multithreaded wavefront diamond blocking (MWD) implementation in the context of stencil computations, which represent the core operation for seismic imaging in the oil industry. The stencil diamond formulation introduces temporal blocking for high data reuse in the upper cache levels. The wavefront optimization technique ensures data locality by allowing multiple threads to share common adjacent stencil points. MWD thus addresses the aforementioned challenges by alleviating the cache size limitation and relieving pressure on the memory bandwidth. Performance comparisons against the optimized 25-point stencil standard seismic imaging scheme using spatial and temporal blocking demonstrate the effectiveness of MWD. Copyright © 2015 by the European Association of Geoscientists & Engineers (EAGE). All rights reserved.
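
The abstract does not include code, but the kind of kernel it refers to is easy to picture. The following is a minimal C sketch, under assumed array names, sizes, and coefficients, of a spatially blocked 3D 25-point star stencil sweep of the sort MWD is compared against; the actual seismic kernel and the diamond/wavefront temporal tiling (which updates each block over several time steps while it stays resident in cache) are not shown.

```c
/* Minimal sketch of a spatially blocked 3D 25-point star stencil sweep
 * (radius 4 in each direction). Array names, sizes, and coefficients are
 * hypothetical placeholders; MWD additionally tiles over the time loop. */
#include <stdlib.h>

#define N  512          /* hypothetical grid size per dimension  */
#define R  4            /* stencil radius (25-point star)        */
#define BJ 32           /* spatial block size in the j direction */

static inline size_t idx(int i, int j, int k) {
    return ((size_t)i * N + j) * N + k;
}

void sweep(double *restrict out, const double *restrict in,
           const double c[R + 1]) {
    for (int jb = R; jb < N - R; jb += BJ)             /* spatial blocking */
        for (int i = R; i < N - R; ++i)
            for (int j = jb; j < jb + BJ && j < N - R; ++j)
                for (int k = R; k < N - R; ++k) {
                    double s = c[0] * in[idx(i, j, k)];
                    for (int r = 1; r <= R; ++r)       /* 24 neighbors     */
                        s += c[r] * (in[idx(i + r, j, k)] + in[idx(i - r, j, k)]
                                   + in[idx(i, j + r, k)] + in[idx(i, j - r, k)]
                                   + in[idx(i, j, k + r)] + in[idx(i, j, k - r)]);
                    out[idx(i, j, k)] = s;
                }
}
```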

Kronawitter S., University of Passau | Stengel H., Erlangen Regional Computing Center | Hager G., Erlangen Regional Computing Center | Lengauer C., University of Passau
Parallel Processing Letters | Year: 2014

Our aim is to apply program transformations to stencil codes in order to yield the highest possible performance. We recognize memory bandwidth as a major limitation in stencil code performance. We conducted a study in which we applied optimizing transformations to two Jacobi smoother kernels: one 3D 1st-order 7-point stencil and one 3D 3rd-order 19-point stencil. To obtain high performance, the optimizations have to be customized for the execution platform at hand. We illustrate this by experiments on two consumer and two server architectures. We also verified the need for complex optimizations with the help of the Execution-Cache-Memory performance model. A code generator with knowledge about stencil codes and execution platforms should be able to apply our transformations automatically. We are working towards such a generator in project ExaStencils. © 2014 World Scientific Publishing Company.
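
For reference, the untransformed baseline of a 3D first-order 7-point Jacobi smoother has the structure sketched below in C. Array and function names and the grid size are placeholders, not the paper's code; the tuned variants studied in the paper add blocking and other platform-specific transformations on top of such a loop nest.

```c
/* Minimal, unoptimized sketch of one sweep of a 3D 7-point Jacobi smoother.
 * N and the array names are hypothetical; the paper's optimized versions
 * apply blocking and other platform-specific transformations to this pattern. */
#define N 200   /* hypothetical inner grid size per dimension */

void jacobi_sweep(double y[N + 2][N + 2][N + 2],
                  const double x[N + 2][N + 2][N + 2])
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 1; i <= N; ++i)
        for (int j = 1; j <= N; ++j)
            for (int k = 1; k <= N; ++k)
                y[i][j][k] = (1.0 / 6.0) *
                    (x[i - 1][j][k] + x[i + 1][j][k] +
                     x[i][j - 1][k] + x[i][j + 1][k] +
                     x[i][j][k - 1] + x[i][j][k + 1]);
}
```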

Kreutzer M., Erlangen Regional Computing Center | Hager G., Erlangen Regional Computing Center | Wellein G., Erlangen Regional Computing Center | Fehske H., University of Greifswald | And 2 more authors.
Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012 | Year: 2012

Sparse matrix-vector multiplication (spMVM) is the dominant operation in many sparse solvers. We investigate performance properties of spMVM with matrices of various sparsity patterns on the nVidia "Fermi" class of GPGPUs. A new "padded jagged diagonals storage" (pJDS) format is proposed which may substantially reduce the memory overhead intrinsic to the widespread ELLPACK-R scheme while making no assumptions about the matrix structure. In our test scenarios the pJDS format cuts the overall spMVM memory footprint on the GPGPU by up to 70%, and achieves 91% to 130% of the ELLPACK-R performance. Using a suitable performance model we identify performance bottlenecks on the node level that invalidate some types of matrix structures for efficient multi-GPGPU parallelization. For appropriate sparsity patterns we extend previous work on distributed-memory parallel spMVM to demonstrate a scalable hybrid MPI-GPGPU code, achieving efficient overlap of communication and computation. © 2012 IEEE.
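
To make the storage-format discussion concrete, here is an illustrative CPU-side C sketch of spMVM over a column-major, padded "ELLPACK-like" layout, the kind of data structure pJDS refines. The struct fields and names are placeholders of ours, not the paper's API, and the actual kernels in the paper run as CUDA code on the GPU; the key idea of pJDS, sorting rows by length and padding only within small row blocks (e.g., of warp size) instead of to the global maximum row length, is only noted in the comments.

```c
/* Illustrative sketch of spMVM over a column-major, zero-padded
 * ELLPACK-like layout. In ELLPACK-R every row is padded to the longest
 * row of the matrix; pJDS first sorts rows by length and pads only within
 * small row blocks, which is what cuts the memory footprint. Names here
 * are placeholders; the real kernels are CUDA code on the GPU. */
#include <stddef.h>

typedef struct {
    int     nrows;    /* number of rows (padded to a block multiple)        */
    int     max_nnz;  /* padded entries per row                             */
    double *val;      /* nrows * max_nnz values, column-major; padding is 0 */
    int    *col;      /* matching column indices (padding points to col 0)  */
} ell_matrix;

void spmv_ell(const ell_matrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->nrows; ++i) {
        double sum = 0.0;
        for (int j = 0; j < A->max_nnz; ++j) {
            /* column-major access: entry j of row i */
            size_t e = (size_t)j * A->nrows + i;
            sum += A->val[e] * x[A->col[e]];
        }
        y[i] = sum;
    }
}
```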

Treibig J., Erlangen Regional Computing Center | Hager G., Erlangen Regional Computing Center | Hofmann H.G., Friedrich-Alexander University Erlangen-Nuremberg | Hornegger J., Friedrich-Alexander University Erlangen-Nuremberg | Wellein G., Erlangen Regional Computing Center
International Journal of High Performance Computing Applications | Year: 2013

Volume reconstruction by backprojection is the computational bottleneck in many interventional clinical computed tomography (CT) applications. Today, vendors in this field are replacing special-purpose hardware accelerators with standard hardware such as multicore chips and GPGPUs. Medical imaging algorithms are on the verge of employing high-performance computing (HPC) technology, and are therefore an interesting new candidate for optimization. This paper presents low-level optimizations for the backprojection algorithm, guided by a thorough performance analysis on four generations of Intel multicore processors (Harpertown, Westmere, Westmere EX, and Sandy Bridge). We choose the RabbitCT benchmark, a standardized test case well supported in industry, to ensure transparent and comparable results. Our aim is not only to provide the fastest possible implementation but also to compare with performance models and hardware counter data in order to fully understand the results. We separate the influence of algorithmic optimizations, parallelization, SIMD vectorization, and microarchitectural issues, and pinpoint problems with current SIMD instruction set extensions on standard CPUs (SSE, AVX). The use of assembly language is mandatory for best performance. Finally, we compare our results to the best GPGPU implementations available for this open competition benchmark. © The Author(s) 2012.
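
As a rough illustration of the operation being optimized, the following C sketch shows a voxel-driven backprojection of one projection image in the spirit of the RabbitCT kernel: each voxel is projected onto the detector through a 3x4 projection matrix, the detector image is interpolated bilinearly, and the value is accumulated with a distance weight. All names, the exact weighting, and the geometry handling are assumptions for illustration; the paper's SIMD, blocking, and clipping optimizations are not shown.

```c
/* Simplified sketch of voxel-driven backprojection for one projection,
 * in the spirit of RabbitCT but with placeholder names and without the
 * SIMD, blocking, and clipping optimizations discussed in the paper. */
#include <math.h>
#include <stddef.h>

void backproject(float *restrict vol, int L,              /* L^3 voxel cube    */
                 const float *restrict img, int W, int H, /* projection image  */
                 const double A[12],                      /* 3x4 proj. matrix  */
                 double voxel_size, double offset)
{
    for (int z = 0; z < L; ++z)
        for (int y = 0; y < L; ++y)
            for (int x = 0; x < L; ++x) {
                double wx = offset + x * voxel_size;      /* world coordinates */
                double wy = offset + y * voxel_size;
                double wz = offset + z * voxel_size;
                /* homogeneous projection onto the detector plane */
                double u = A[0]*wx + A[1]*wy + A[2]*wz  + A[3];
                double v = A[4]*wx + A[5]*wy + A[6]*wz  + A[7];
                double w = A[8]*wx + A[9]*wy + A[10]*wz + A[11];
                double iu = u / w, iv = v / w;
                int u0 = (int)floor(iu), v0 = (int)floor(iv);
                if (u0 < 0 || v0 < 0 || u0 >= W - 1 || v0 >= H - 1)
                    continue;                              /* outside detector */
                double fu = iu - u0, fv = iv - v0;
                /* bilinear interpolation on the detector image */
                double p = (1 - fv) * ((1 - fu) * img[v0*W + u0]       + fu * img[v0*W + u0 + 1])
                         +      fv  * ((1 - fu) * img[(v0 + 1)*W + u0] + fu * img[(v0 + 1)*W + u0 + 1]);
                /* distance-weighted accumulation into the volume */
                vol[((size_t)z * L + y) * L + x] += (float)(p / (w * w));
            }
}
```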

Hager G., Erlangen Regional Computing Center | Treibig J., Erlangen Regional Computing Center | Habich J., Erlangen Regional Computing Center | Wellein G., Erlangen Regional Computing Center
Concurrency Computation | Year: 2016

Modern multi-core chips show complex behavior with respect to performance and power. Starting with the Intel Sandy Bridge processor, it has become possible to directly measure the power dissipation of a CPU chip and correlate this data with the performance properties of the running code. Going beyond a simple bottleneck analysis, we employ the recently published Execution-Cache-Memory (ECM) model to describe the single-core and multi-core performance of streaming kernels. The model refines the well-known roofline model because it can predict the scaling and the saturation behavior of bandwidth-limited loop kernels on a multi-core chip. The saturation point is especially relevant for considerations of energy consumption. From power dissipation measurements of benchmark programs with vastly different requirements on the hardware, we derive a simple, phenomenological power model for the Sandy Bridge processor. Together with the ECM model, we are able to explain many peculiarities in the performance and power behavior of multi-core processors and derive guidelines for energy-efficient execution of parallel programs. Finally, we show that the ECM and power models can be successfully used to describe the scaling and power behavior of a lattice Boltzmann flow solver code. Copyright © 2013 John Wiley & Sons, Ltd.
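
The scaling and saturation prediction mentioned here can be illustrated with a toy calculation: in the ECM picture, a bandwidth-limited kernel scales linearly with the number of cores until the memory bandwidth limit is hit, roughly at n_s = ceil(T_ECM / T_MEM) cores, where T_MEM is the memory-transfer share of the single-core runtime. The C program below sketches that relation with made-up placeholder numbers, not measured Sandy Bridge data, and omits the model's decomposition into in-core and per-cache-level contributions.

```c
/* Toy illustration of ECM-style multicore scaling: linear scaling of a
 * bandwidth-limited kernel until the memory bandwidth limit is reached.
 * All numbers are hypothetical placeholders, not measured data. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double t_ecm = 30.0;  /* single-core cycles per unit of work (hypothetical)   */
    double t_mem = 10.0;  /* memory-transfer share of those cycles (hypothetical) */
    double work  = 8.0;   /* e.g., loop iterations per unit of work               */

    int n_sat = (int)ceil(t_ecm / t_mem);   /* cores needed to saturate bandwidth */
    printf("predicted saturation point: %d cores\n", n_sat);

    for (int n = 1; n <= 8; ++n) {
        /* linear scaling, capped at the bandwidth-bound limit work / t_mem */
        double perf = fmin(n * work / t_ecm, work / t_mem);
        printf("n=%d cores: %.3f work units per cycle\n", n, perf);
    }
    return 0;
}
```

With these placeholder numbers the prediction saturates at three cores; the paper uses the measured saturation point to argue for energy-efficient operating points of parallel programs.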
