The FAI computing cluster has received another upgrade as part of grant and program-targeted funding. This is the second major update to the cluster in 2024. Under project AP23487846 and two programs, BR24992807 and BR24992759, 10 NVIDIA GeForce RTX 4090 graphics accelerators in dual-slot server configurations with turbine cooling systems were procured and installed in the existing GPU server, SuperMicro A+ Server AS-4125GS-TNRT2.
The purchased cards are OEM modifications of NVIDIA RTX 4090 gaming cards, designed for use in data centers. The dual-slot form factor allows up to 10 cards to be installed in a single GPU server. Installing the cards in a single server (on one motherboard) enables non-blocking communication between the cards at speeds of up to 64 GB/s (the bidirectional bandwidth of a PCIe 4.0 x16 link) without the need for 400/800 Gbps RDMA cards.
The SuperMicro A+ Server AS-4125GS-TNRT2 of the computing cluster with 10 NVIDIA RTX 4090 GPU cards installed.
The communication speed between the cards is crucial for many computational tasks, particularly in N-body simulations, where data synchronization between distributed computing devices is required after each integration step. If such communication takes longer than the computation, the distributed computing devices become idle. As a result, their performance does not add up when used together. Thanks to the high speed (the maximum possible for the PCIe 4.0 interface) and low latency, GPU communication via the motherboard (through PLX switches) resolves this issue, enabling all GPU cards to function as a unified device with aggregated performance. For instance, the FP32 performance of a single RTX 4090 card is 82.6 teraflops, which means that combining 10 devices yields a theoretical total performance of 826 teraflops.
In practice, this number is smaller, as achieving peak performance in simulation tasks is very challenging. For example, on a single card the nbody utility included with the CUDA Samples package achieves an actual computation speed of 40–45 teraflops (performance varies with the GPU's dynamic boost clocks), which corresponds to 48–54% of the theoretically possible value.
The performance of individual cards (shown by lines of different colors) as a function of the number of simulated particles.
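The single-card percentages quoted above follow directly from the stated figures. As a minimal sketch (the variable and function names are illustrative and not taken from the nbody utility), the arithmetic can be reproduced as follows:

```python
# Figures quoted in the text; all names here are illustrative only.
PEAK_FP32_TFLOPS_PER_CARD = 82.6            # theoretical FP32 peak of one RTX 4090
NUM_CARDS = 10
MEASURED_SINGLE_CARD_TFLOPS = (40.0, 45.0)  # nbody range measured on one card


def fraction_of_peak(measured_tflops: float, peak_tflops: float) -> float:
    """Return measured performance as a fraction of the theoretical peak."""
    return measured_tflops / peak_tflops


# Theoretical aggregate peak if all cards contributed fully.
aggregate_peak = PEAK_FP32_TFLOPS_PER_CARD * NUM_CARDS

# Efficiency at the lower and upper ends of the measured range.
low, high = (
    fraction_of_peak(m, PEAK_FP32_TFLOPS_PER_CARD)
    for m in MEASURED_SINGLE_CARD_TFLOPS
)

print(f"Aggregate theoretical peak: {aggregate_peak:.0f} TFLOPS")
print(f"Single-card efficiency: {low:.1%} to {high:.1%}")
```

The two printed efficiencies bracket the 48–54% range given in the text.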
Combined performance improves when two or more cards are used simultaneously and as the number of simulated particles increases. For instance, when simulating 8 million particles with 10 cards, an actual performance exceeding 260 teraflops was achieved, a record for the *nbody* utility. This result indicates that, when working together, the combined GPU performance reaches 31.5% of the theoretical maximum of 826 teraflops. At the same time, individual cards operating in a collaborative mode achieve 65% of their actual performance in standalone mode, meaning that 35% is spent on communication (data synchronization between the cards).
The combined performance of the cards as a function of the number of simulated particles (top figure) and the number of cards (bottom figure).
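The multi-card percentages can be checked in the same way. The sketch below assumes that the 65% figure is measured against the lower end of the single-card range (40 teraflops); the text does not state which baseline was used, so that choice is an assumption, and all names are illustrative.

```python
# Figures quoted in the text; all names here are illustrative only.
NUM_CARDS = 10
PEAK_FP32_TFLOPS_PER_CARD = 82.6
COMBINED_MEASURED_TFLOPS = 260.0    # 10 cards, 8 million particles
STANDALONE_MEASURED_TFLOPS = 40.0   # ASSUMPTION: lower end of the 40-45 range

# Share of the 10-card theoretical peak that was actually reached.
combined_fraction_of_peak = COMBINED_MEASURED_TFLOPS / (
    NUM_CARDS * PEAK_FP32_TFLOPS_PER_CARD
)

# Average throughput of one card while it works as part of the group.
per_card_in_group = COMBINED_MEASURED_TFLOPS / NUM_CARDS

# Parallel efficiency: group throughput per card relative to a card alone.
parallel_efficiency = per_card_in_group / STANDALONE_MEASURED_TFLOPS
communication_share = 1.0 - parallel_efficiency

print(f"Fraction of theoretical peak: {combined_fraction_of_peak:.1%}")
print(f"Per-card throughput in group: {per_card_in_group:.0f} TFLOPS")
print(f"Parallel efficiency: {parallel_efficiency:.0%}")
print(f"Share lost to synchronization: {communication_share:.0%}")
```

With a 45-teraflop baseline instead, the parallel efficiency would come out lower (about 58%), so the 65% figure is best read as an upper estimate under this assumption.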
It is worth noting that with such a dense arrangement of cards, where the distance between them does not exceed 6 mm and the power consumption of each card reaches 450 watts, ensuring adequate cooling (below 80°C) becomes a concern. However, our testing revealed that even under full load, the temperature of the cards does not exceed 80°C. One drawback is that the GPU server A+ Server AS-4125GS-TNRT2 increases the internal fan speed to 9,000 RPM, which generates significant noise, but this is not a problem for us since the server operates in a dedicated server room with good noise isolation.
Thus, preliminary testing confirmed the feasibility and practicality of using 10 GPU cards in a single server. The addition of 10 GPU cards increased the GPU performance of the FAI cluster by 56%, bringing it to 2,287 teraflops for FP32 operations and up to 35,609 teraflops for FP8 operations.
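As a rough consistency check of the FP32 totals, the sketch below assumes that the quoted increase is computed from the theoretical peak of the new cards (10 × 82.6 teraflops). The text does not say how the percentage was derived, so this is an assumption, and the names are illustrative.

```python
# Figures quoted in the text; all names here are illustrative only.
NUM_NEW_CARDS = 10
PEAK_FP32_TFLOPS_PER_CARD = 82.6
TOTAL_FP32_TFLOPS_AFTER = 2287.0   # cluster FP32 total after the upgrade

# ASSUMPTION: the new cards are counted at their theoretical FP32 peak.
added_tflops = NUM_NEW_CARDS * PEAK_FP32_TFLOPS_PER_CARD

# Implied cluster performance before the upgrade.
previous_tflops = TOTAL_FP32_TFLOPS_AFTER - added_tflops

# Relative increase contributed by the new cards.
relative_increase = added_tflops / previous_tflops

print(f"Added by new cards: {added_tflops:.0f} TFLOPS")
print(f"Implied previous total: {previous_tflops:.0f} TFLOPS")
print(f"Relative increase: {relative_increase:.1%}")
```

Under this assumption the increase comes out at roughly 56 to 57%, in line with the 56% stated above.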
The primary tasks for the 10-card cluster include numerical simulations of direct N-body dynamics (4–8M particles), accelerating the creation of equilibrium initial conditions for galaxy simulations, solving problems related to recognizing spiral structures in galaxies, and identifying spectral lines in spectra using neural networks and artificial intelligence technologies.