Optimizing Performance using SIMD

Many processors are equipped with vector units, e.g. SSE, AVX and the upcoming AVX-512 extensions on x86 processors or the NEON extensions for ARM processors. Employing these units efficiently is not trivial. While in some cases the compiler is able to perform vectorization automatically when appropriate flags are set, it might be necessary to support the compiler to improve the performance using the vector units.

Compiler Flags

Choosing the right compiler options is very important for a fast program in general. The manual of the compiler should document available optimization options. The following flags apply to the GNU C Compiler (GCC), other compilers should have similar options. Intel has a list of recommended flags to compile code for their x86 processors:

  • -O3 to enable most available optimizations (Intel recommends -Ofast which enables -ffast-math, see below)
  • -march=native to tune the code for the current (!) processor. For recent x86 processors, this enables SSE and AVX, if available, and optimizes the assembly instructions for the current processor architecture. In general, this can significantly improve the performance. However, in some (rare) cases, it turned out that using this option leads to slightly slower code.
  • -flto when using multiple source files to enable the optimization at link time across multiple files.
  • -funroll-loops to unroll loops when possible and appropriate.
  • -ffast-math enables unsafe math optimization for floating-point arithmetics, which break the IEEE 754 complience. Although recommended by Intel, in some cases it might slow down the program and/or lead to slightly different results.

Exploiting SIMD capabilities manually

Many processors are equipped with SIMD (Single-Instruction, Multiple-Data) units. These units allow to perform the same instruction on multiple elements at the same time, leading to a higher performance. For example, recent x86 processors support the AVX instruction set with 256 bit registers, effectively allowing four double-precision or eight single-precision floating-point operations at the same time.

In some cases, the compiler will make use of the SIMD capabilities automatically. However, for more complex codes, it is possible to improve the performance by using the SIMD intrinsics of the compiler, which are in many cases directly lowered to a corresponding assembly instruction and therefore allow to write code on a lower level within a programming language like C. For the x86 architecture, Intel offers an Intrinsics Guide which provides a reference for supported intrinsics of the Intel C Compiler (icc). Many of these are also supported by GCC.

Clang also supports these intrinsics, but lowers most of them only to the corresponding LLVM IR instructions. Consequently, as optimizations in Clang are performed on the level of the LLVM IR, the produced assembly code diverges from the original code. This implies that the intrinsics are not as useful as for compilers like GCC, and the programs produced by Clang are in many cases slower than GCC-compiled programs if intrinsics are employed.

If employing intrinsics still leads to inefficient assembly code, it is possible to use assembly directly for performance critical functions. This can be done either using inline assembly or using external assembly files. In the second case, it should be noted that a the overhead of a function call is introduced.

One important aspect of efficient SIMD code on the x86 architecture is the alignment of memory locations. If the locations where data has to be read or written is aligned to the size of the vector registers, the faster move aligned instructions can be used which write the data at once. In case the alignement is not known or the memory location is known to be not aligned, however, the unaligned move instructions have to be used, with the drawback that up to twice as much cycles are spent for the memory access due to accesses to two cache lines.

The Intel 64 and IA-32 Architectures Optimization Reference Manual suggests that some penalties can be avoided by replacing one unaligned loads with two aligned loads and a shuffle instruction (palignr) to avoid cache line splits. In some cases this does not lead to a performance improvement, though. The manual also notes that for mis-aligned memory access, code written using the AVX extensions may be slower than code that only employs the older SSE extensions. The manual also recommends to prefer aligned stores over aligned loads as stores across multiple cache-lines have a higher penalty.

Example: Partial Differential Equation Solver

The effect of the SIMD capabilities of x86 processors was evaluated using the partdiff-par program, which is a solver of partial differential equations. It was originally developed by Thomas Ludwig and Andreas Schmidt. The program
operates on a 9 × 9 matrix with interlines and implements the Jacobi and the Gauss-
Seidel method. As the Gauss-Seidel method is not suitable for straight-forward SIMD optimizations due to data dependencies, we focused on the optimization of the Jacobi method. We compared the performance of a standard GCC compile code with -O3, GCC compiled code with the additional flags described above and a manual assembly rewrite of the inner kernel employing SSE and AVX. It should be noted that the assembly kernel is a separate function which is called once per iteration. Consequently, there are still possibilities for optimizations.

The evaluation was performed with a different matrix sizes and 1000 iterations on two different nodes:

  • Node 1: 2 Intel Westmere Xeon X5650 processors (do not support AVX) clocked at 2.67 GHz, 12GB main memory
  • Node 2: 1 Intel Sandy Bridge Xeon E31275 processor (support AVX) clocked at 3.40 GHz, 16GB main memory

For the performance of the three variants (GCC, GCC tuned for target architecture, assembly with SSE) on the node with the Intel Westmere processors, we can observe that the assembly kernel achieves a significantly higher performance for matrices which fit in the L2 cache. Also for larger matrices, the assembly implementation is still faster than code produced by GCC even though the overhead of a function call exists.

Evaluation of partdiff-par on an Intel Westmere processor

On the node with an Intel Sandy Bridge processor, the SSE assembly kernel has also a better performance for smaller matrices than the tuned GCC code. However, the assembly kernel which employs AVX is significantly slower, likely caused by two unaligned load operations per matrix element. The performance degrades if the matrix does not fit completely in the L2 cache, the boundary for the matrix size is at 90 interlines. For the L3 cache, we cannot observe a similar effect, the matrix has to have less than 900 interlines to fit in that cache.

Evaluation of partdiff-par on an Intel Sandy Bridge processor

Summary

Depending on the underlying problem, the employing SIMD capabilities of modern processors can significantly improve the performance. While tuning the compiler with appropriate options can already improve the performance, there are still cases where the compiler needs supported in producing vectorized code. This can be done either within existing code using compiler intrinsics or by rewriting the kernel of the application directly in assembly. When employing larger vectors of the AVX or the upcoming AVX-512 capabilities on the x86 architecture, memory alignment is an important aspect in the context performance optimization.

Further References


This evaluation was conducted during an internship at the German Climate Compute Center (DKRZ) in May 2015. The author would like to thank Dr. Julian Kunkel and Prof. Dr. Thomas Ludwig for enabling the internship and their helpful feedback.