2025, issue 4, p. 55-64

Received 20.10.2025; Revised 11.11.2025; Accepted 18.11.2025

Published 08.12.2025; First Online 15.12.2025

https://doi.org/10.34229/2707-451X.25.4.6


 

UDC 519.6

Multi-GPU Two-Dimensional Block-Cyclic Algorithm for Factorization of Dense Matrices

Oleksandr Khimich,   Volodymyr Sydoruk *,   Anton Pavliuk

V.M. Glushkov Institute of Cybernetics of the NAS of Ukraine, Kyiv

* Correspondence: (email address protected)

 

This article presents an efficient parallel algorithm for LU factorization of large dense matrices based on a two-dimensional block-cyclic data distribution designed for multi-GPU computing environments. The proposed methodology addresses key challenges of large-scale linear algebra computations, including load balancing, minimizing communication overhead, and maximizing computational locality. By distributing matrix blocks cyclically across a GPU process grid, the algorithm ensures uniform workload distribution even for very large problem sizes, thereby avoiding idle time and reducing synchronization delays.
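
As an aside for readers implementing such a scheme, the cyclic block-to-GPU mapping can be sketched in a few lines of Python. All names, grid shapes, and block sizes below are illustrative assumptions, not taken from the paper:

```python
# Sketch of 2D block-cyclic ownership over a pr x pc GPU grid.
# All names and sizes are illustrative, not taken from the paper.

def block_owner(i, j, pr, pc):
    """Grid coordinates (row, col) of the GPU owning block (i, j)."""
    return (i % pr, j % pc)

def blocks_of(grid_row, grid_col, nb_rows, nb_cols, pr, pc):
    """All blocks assigned to the GPU at (grid_row, grid_col)."""
    return [(i, j)
            for i in range(grid_row, nb_rows, pr)
            for j in range(grid_col, nb_cols, pc)]

# Example: a 20000 x 20000 matrix with 500 x 500 blocks gives a
# 40 x 40 block grid; on a 2 x 4 grid of 8 GPUs every GPU owns
# exactly 1600 / 8 = 200 blocks, i.e. a perfectly balanced load.
print(len(blocks_of(0, 0, 40, 40, 2, 4)))  # 200
```

Because ownership depends only on (i mod pr, j mod pc), every GPU keeps a share of the active trailing submatrix at every factorization step, which is what prevents the idle time mentioned above.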

The implementation leverages state-of-the-art GPU computing technologies, including CUDA streams, cuBLAS and cuSolver libraries for local numerical kernels, and NCCL for high-performance collective communications using NVLink interconnects. A look-ahead scheduling strategy and overlapping of communication with computation further increase the degree of parallelism, enabling sustained high utilization of GPU resources throughout the factorization pipeline.
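
The look-ahead idea can be illustrated, independently of CUDA streams and NCCL, by the task ordering it produces. The Python sketch below is a simplified, sequential rendering of a 1-step look-ahead schedule; the task names are invented for illustration, and in the real pipeline each task would run asynchronously on CUDA streams with NCCL broadcasts in between:

```python
# Simplified task order produced by a look-ahead LU schedule.
# Task names are invented for illustration only.

def lookahead_schedule(nblocks):
    """Ordered task list for a right-looking LU with 1-step look-ahead."""
    tasks = []
    for k in range(nblocks):
        tasks.append(("factor_panel", k))
        tasks.append(("broadcast_panel", k))
        if k + 1 < nblocks:
            # Look-ahead: update the next panel's column first, so its
            # factorization can overlap the bulk trailing-matrix update.
            tasks.append(("update_column", k + 1))
        for j in range(k + 2, nblocks):
            tasks.append(("update_column", j))
    return tasks

for task in lookahead_schedule(3):
    print(task)
```

The key property is that the update of column k + 1 is scheduled immediately after panel k is broadcast, ahead of the remaining trailing updates, so the critical-path panel factorization is never starved.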

A detailed theoretical performance model is developed to analyze speedup, scalability, communication costs, and the impact of block size on total execution time. Numerical experiments conducted on an 8-GPU node with NVIDIA RTX 2080 Ti GPUs demonstrate excellent strong scaling, achieving up to 95 % parallel efficiency for matrices of size up to N = 20,000. The results confirm that the proposed multi-GPU LU factorization comes close to the theoretical performance limits and significantly outperforms traditional CPU-based and hybrid CPU–GPU schemes.
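
A toy version of such a performance model can be written down directly. The formulas and hardware constants below are illustrative assumptions on my part, not the paper's model or measured values:

```python
# Toy performance model for multi-GPU LU: compute term (2N^3/3 flops
# shared by p GPUs) plus a panel-broadcast communication term.
# Formulas and hardware constants are illustrative assumptions only.

def lu_time(n, p, flops_per_gpu, bandwidth, nb):
    """Rough execution-time estimate (seconds) for LU on p GPUs."""
    t_comp = (2.0 * n**3 / 3.0) / (p * flops_per_gpu)
    if p == 1:
        return t_comp                      # no communication on one GPU
    panels = n // nb
    bytes_per_bcast = n * nb * 8           # one N x nb panel of doubles
    return t_comp + panels * bytes_per_bcast / bandwidth

def efficiency(n, p, flops_per_gpu, bandwidth, nb):
    """Parallel efficiency relative to the single-GPU compute time."""
    return lu_time(n, 1, flops_per_gpu, bandwidth, nb) / (
        p * lu_time(n, p, flops_per_gpu, bandwidth, nb))

# With ~0.4 TFLOP/s FP64 per GPU and ~100 GB/s links, the O(N^2)
# communication term is dwarfed by the O(N^3) compute term as N grows,
# which is why efficiency improves with problem size.
print(round(efficiency(20000, 8, 4e11, 1e11, 500), 3))
```

Even this crude model reproduces the qualitative behavior reported above: for large N the compute term dominates and efficiency approaches 1, while small problems pay a proportionally larger communication cost.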

The method is highly suitable for large-scale scientific and engineering applications requiring fast and robust solution of linear systems, including computational fluid dynamics, structural mechanics, numerical simulation of physical processes, and machine learning workloads. Future research directions include extensions to sparse matrices with adaptive load balancing, mixed-precision acceleration with error correction, and generalization to other matrix factorizations such as Cholesky, QR, and LDLᵀ.

 

Keywords: LU factorization, multi-GPU systems, block-cyclic data distribution, parallel computing, CUDA, cuBLAS, cuSolver, NCCL, high-performance computing, scalability.

 

Cite as: Khimich O., Sydoruk V., Pavliuk A. Multi-GPU Two-Dimensional Block-Cyclic Algorithm for Factorization of Dense Matrices. Cybernetics and Computer Technologies. 2025. 4. P. 55–64. (in Ukrainian) https://doi.org/10.34229/2707-451X.25.4.6

 

References

1. Dongarra J., Whaley R.C., Petitet A. A ScaLAPACK User's Guide: Solving Linear Algebra Problems on Distributed Memory Computers. Philadelphia: SIAM, 1998. https://www.netlib.org/scalapack/slug/scalapack_slug.html

2. Khimich A.N., Molchanov I.N., Mova V.I. et al. Numerical software of the Inparcom MIMD computer. Kyiv: Nauk. dumka, 2007. 222 p. (in Russian)

3. Khymych A.N., Molchanov I.N., Popov A.V., Chistyakova T.V., Yakovlev M.F. Parallel algorithms for solving problems of computational mathematics. Kyiv: Nauk. dumka, 2008. 247 p. (in Russian)

4. Volkov V., Demmel J. LU, QR and Cholesky Factorizations Using Vector Capabilities of GPUs. LAPACK Working Note 202, University of California, Berkeley, 2008. https://www.netlib.org/lapack/lawnspdf/lawn202.pdf

5. Dongarra J., Tomov S., Haidar A. MAGMA: Matrix Algebra on GPU and Multicore Architectures. Innovative Computing Laboratory, University of Tennessee, 2012.

6. Khimich A., Polyanko V., Chistyakova T. Parallel Algorithms for Solving Linear Systems on Hybrid Computers. Cybernetics and Computer Technologies. 2020. 2. P. 53–66. (in Ukrainian) https://doi.org/10.34229/2707-451X.20.2.6

7. Khimich O.M., Popov O.V., Chistyakov O.V. et al. A Parallel Algorithm for Solving a Partial Eigenvalue Problem for Block-Diagonal Bordered Matrices. Cybern Syst Anal. 2020. Vol. 56. P. 913–923. https://doi.org/10.1007/s10559-020-00311-z

8. Baranov A.Y., Popov A.V., Slobodyan Y.E., Khimich A.N. Mathematical modeling of building constructions using hybrid computing systems. Journal of Automation and Information Sciences. 2017. Vol. 49, Iss. 7. P. 18–32. https://doi.org/10.1615/JAutomatInfScien.v49.i7.20

9. Haidar A., Tomov S., Dongarra J. Towards Batched Dense Linear Algebra Routines on GPUs: Reduction and Factorization. Concurrency and Computation: Practice and Experience. 2019. Vol. 31, Iss. 6. e4965.

10. NVIDIA Corporation. cuSolver and cuBLAS Libraries Documentation. NVIDIA Developer Portal, 2023.

11. Mikhalevich V.S., Byk N.A., Brusnykin B.N., ..., Khymych A.N. et al. Numerical methods for the multiprocessor computer complex EC. Edited by I.N. Molchanov. Moscow: Izdanie VVIA, Zhukovsky, 1986. 401 p. (in Russian)

12. NVIDIA Corporation. NCCL 2.15: Multi-GPU Collective Communication Library. NVIDIA Developer Documentation, 2022.

13. SKIT computing complex of the Institute of Cybernetics of the NAS of Ukraine. (in Ukrainian) http://icybcluster.org.ua/index.php?lang_id=2&menu_id=5 (accessed: 20.10.2025)

 

 

ISSN 2707-451X (Online)

ISSN 2707-4501 (Print)


 

 
