Rendering pipelines, spatial audio, physics solvers. In these areas the CPU is chewing through millions of matrix mults and quaternion rotations. Every single frame. Hardware is monstrously fast today. But somehow, math routines still manage to bottleneck the whole application.

Actually the bottleneck is almost never the math itself. It’s the memory layout. Wrap geometry primitives in heavy object-oriented abstractions, and you basically throw sand in the gears. You stop the CPU from doing the one thing it is actually built for. Blasting instructions over flat, contiguous memory.

The OOP Penalty

The standard textbook way to write a 3D math lib is all about encapsulation. You hide data to keep state safe. So you get classes with private members, custom constructors, getters, setters. Maybe even a virtual destructor if someone wanted to build a polymorphic hierarchy. Looks correct on a UML diagram. But the hardware penalty is brutal.

Give an object a v-table pointer, or members and padding the math never touches, and you dilute the data locality the CPU depends on. CPUs operate in cache lines. Usually that’s 64 bytes fetched from RAM into L1 cache. Let’s say a 16-byte quaternion grows to 24 bytes just to hold a virtual table pointer, which is typical on 64-bit targets. Now only two whole quaternions fit in a cache line instead of four. Cache line utilization drops. You burn memory bandwidth loading structural garbage. Stuff that has absolutely zero to do with the actual math. And worse. A non-trivial constructor or destructor doesn’t change the layout, but together with the rest of the OOP boilerplate it can keep the compiler from holding values in SIMD registers.
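To make the size arithmetic concrete, here is a minimal sketch. FlatQuat, PolyQuat, kCacheLine and per_cache_line are made-up names for illustration, and the 64-byte line size is an assumption that holds on most current x86-64 parts but not everywhere.

```cpp
#include <cstddef>
#include <type_traits>

// Flat quaternion: four floats and nothing else.
struct FlatQuat {
    float w, x, y, z;
};

// Hypothetical "textbook OOP" quaternion: the same four floats,
// but hidden behind a polymorphic interface.
class PolyQuat {
public:
    virtual ~PolyQuat() = default;
    float length_squared() const { return w_ * w_ + x_ * x_ + y_ * y_ + z_ * z_; }

private:
    float w_ = 1.0f, x_ = 0.0f, y_ = 0.0f, z_ = 0.0f;
};

inline constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Number of whole objects of type T that fit in one cache line.
template <typename T>
constexpr std::size_t per_cache_line() {
    return kCacheLine / sizeof(T);
}

static_assert(sizeof(FlatQuat) == 16);
static_assert(per_cache_line<FlatQuat>() == 4);

// The v-table pointer (plus any padding) makes the object strictly larger,
// and the virtual destructor makes it non-trivially-copyable.
static_assert(sizeof(PolyQuat) > sizeof(FlatQuat));
static_assert(!std::is_trivially_copyable_v<PolyQuat>);
```

On a typical 64-bit ABI sizeof(PolyQuat) comes out at 24, so per_cache_line<PolyQuat>() is 2. The exact number is ABI-dependent, which is why the sketch only asserts "larger than 16".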

Compiler Paranoia

Clang, GCC and MSVC are highly aggressive at auto-vectorizing loops nowadays. But they are deeply paranoid. They operate strictly inside the ABI bounds and static analysis limits. For the auto-vectorizer to safely replace scalar float ops with vectorized instructions (like the FMA instruction vfmadd231ps), the compiler needs proof of two things. First, contiguous memory. A flat layout with no hidden padding. Second, type transparency. Meaning it can verify that the memory ranges it reads and writes don't overlap. That is alias analysis, and the strict aliasing rules are part of what it leans on.

If a C++ class is not trivially copyable (std::is_trivially_copyable_v<T> is false), the compiler has to get defensive. Under the common ABIs it passes the object through a hidden pointer instead of placing it directly in CPU registers like XMM/YMM. Iterate over a big array of matrices, and these memory indirection chains can defeat the hardware prefetcher. The CPU just sits there. Waiting for RAM fetches. Pipeline starvation.

To get maximum throughput, math primitives must map directly to raw memory blocks. No exceptions.

DOD in C++23

Let’s look at how hardware-sympathetic geometry works in practice. If you inspect the core headers of modern C++23 math libs like Dichotomia (quat.hpp, mat4.hpp), you see strict Data-Oriented Design. No heavy classes. Primitives are just flat standard-layout structs constrained by C++ concepts. Roughly looks like this:

#include <concepts>
#include <type_traits>
#include <cstddef>

namespace dich {

// Constraining the primitive to floats
template <typename T>
concept floating_point = std::is_floating_point_v<T>;

// Flat, data-oriented Quaternion
template <floating_point T>
struct quat {
    T w, x, y, z;

    constexpr quat() noexcept = default;
    constexpr quat(T _w, T _x, T _y, T _z) noexcept 
        : w(_w), x(_x), y(_y), z(_z) {}
};

// Flat Matrix 4x4
template <floating_point T>
struct alignas(alignof(T) * 4) mat4 {
    T data[16];

    constexpr mat4() noexcept = default;
};

// Forcing compiler layout guarantees
static_assert(std::is_standard_layout_v<quat<float>>);
static_assert(std::is_trivially_copyable_v<quat<float>>);
static_assert(sizeof(quat<float>) == 16); // fits cleanly in a 128-bit register

static_assert(std::is_standard_layout_v<mat4<float>>);
static_assert(std::is_trivially_copyable_v<mat4<float>>);
static_assert(sizeof(mat4<float>) == 64); // same size as one 64-byte cache line (alignment here is only 16 bytes)

} // namespace dich

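Notice what is missing here. No private members, zero virtual functions, no user-defined destructors. The static_asserts pin a mat4<float> at exactly 64 bytes, the size of one cache line. Whether an instance actually sits inside a single line depends on its address: with the 16-byte alignment above it can straddle two, and only 64-byte alignment rules that out. And because the types are trivially copyable, the ABI is free to pass small ones in registers. On x86-64 System V, a 16-byte quat<float> travels in two XMM registers. A 64-byte mat4 is still passed in memory, but as a plain block of bytes with no hidden copy-constructor calls.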
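Size and alignment are separate guarantees, and the difference is easy to check. A minimal sketch, assuming 64-byte cache lines; Mat4A16, Mat4A64, fits_one_line and in_one_cache_line are hypothetical names used only here.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Two 64-byte matrices that differ only in their alignment requirement.
struct alignas(16) Mat4A16 { float data[16]; };
struct alignas(64) Mat4A64 { float data[16]; };

static_assert(sizeof(Mat4A16) == 64 && alignof(Mat4A16) == 16);
static_assert(sizeof(Mat4A64) == 64 && alignof(Mat4A64) == 64);
static_assert(std::is_trivially_copyable_v<Mat4A64>);

// True if the byte range [addr, addr + size) stays inside one 64-byte line.
constexpr bool fits_one_line(std::uintptr_t addr, std::size_t size) {
    return (addr / 64) == ((addr + size - 1) / 64);
}

// Same check for a live object.
inline bool in_one_cache_line(const void* p, std::size_t size) {
    return fits_one_line(reinterpret_cast<std::uintptr_t>(p), size);
}

static_assert(fits_one_line(0, 64));    // 64-byte aligned: exactly one line
static_assert(!fits_one_line(16, 64));  // 16-byte aligned: may span two lines
```

The trade-off with alignas(64) is that every array element and every stack slot must honor it, so it is worth measuring before applying it across a whole library.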

Write a matrix multiplication over these structs, and the compiler easily sees the independent arithmetic ops and the strict alignment.

template<floating_point T>
[[nodiscard]] constexpr mat4<T> multiply(const mat4<T>& a, const mat4<T>& b) noexcept {
    mat4<T> result{};
    // Because both 'a' and 'b' are contiguous arrays with fixed trip counts,
    // Clang/GCC can fully unroll these loops and typically map them to SIMD instructions.
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            T sum = 0;
            for (std::size_t k = 0; k < 4; ++k) {
                sum += a.data[i * 4 + k] * b.data[k * 4 + j];
            }
            result.data[i * 4 + j] = sum;
        }
    }
    return result;
}

Compile this with -O3 -march=native on an FMA-capable CPU and Clang typically emits vectorized FMA instructions. The concepts and templates (C++20 features that carry over into C++23) cost zero cycles at runtime. Those static_assert statements? They act as a hard compile-time regression test. If a future developer accidentally adds a virtual method, the build just fails instantly. Performance baseline protected.
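Since everything here is constexpr, the same trick works for correctness: the multiply can be checked at compile time. In the sketch below mat4 and multiply are repeated in simplified form so the snippet stands alone, while identity, sample and equal are illustrative helpers, not Dichotomia functions.

```cpp
#include <cstddef>

// Simplified stand-in for the flat matrix above (row-major storage).
template <typename T>
struct mat4 {
    T data[16];
};

template <typename T>
constexpr mat4<T> multiply(const mat4<T>& a, const mat4<T>& b) noexcept {
    mat4<T> result{};
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            T sum = 0;
            for (std::size_t k = 0; k < 4; ++k) {
                sum += a.data[i * 4 + k] * b.data[k * 4 + j];
            }
            result.data[i * 4 + j] = sum;
        }
    }
    return result;
}

// Identity matrix: ones on the diagonal, zeros elsewhere.
template <typename T>
constexpr mat4<T> identity() noexcept {
    mat4<T> m{};
    for (std::size_t i = 0; i < 4; ++i) m.data[i * 4 + i] = T{1};
    return m;
}

// Test matrix holding the values 1..16.
constexpr mat4<float> sample() noexcept {
    mat4<float> m{};
    for (std::size_t i = 0; i < 16; ++i) m.data[i] = static_cast<float>(i + 1);
    return m;
}

// Exact element-wise comparison (safe here: all values are small integers).
template <typename T>
constexpr bool equal(const mat4<T>& a, const mat4<T>& b) noexcept {
    for (std::size_t i = 0; i < 16; ++i) {
        if (a.data[i] != b.data[i]) return false;
    }
    return true;
}

// Multiplying by the identity on either side must return the input unchanged.
static_assert(equal(multiply(identity<float>(), sample()), sample()));
static_assert(equal(multiply(sample(), identity<float>()), sample()));
```

If someone swaps an index in the inner loop, the build breaks the same way it does for a layout regression.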

The Hardware Reality

Dropping OOP for a flat DOD layout gives very predictable hardware-level returns. Run bulk operations, say applying transforms to a massive array of entities. With no hidden pointers to chase, every byte of each fetched cache line is payload. The hardware prefetcher predicts the linear memory access pattern like it’s supposed to.

In benchmarks against standard OOP wrappers, instruction cache misses drop because the virtual dispatch, constructor calls and destructor teardown are just gone. Throughput scales up hard. And if you check the generated assembly, you should see the arithmetic mapped onto vfmadd231ps instructions. Basically intrinsic-level performance out of pure standard C++.

To give you an idea of the raw throughput difference on a modern CPU (e.g., AMD Ryzen 7 5800X, compiled with gcc 14 -O3):

Standard OOP Matrix (Scalar): 18.5 ms per 1,000,000 multiplications.
Dichotomia DOD Matrix (Auto-Vectorized): ~4.2 ms per 1,000,000 multiplications.
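If you want to collect numbers of this kind on your own machine, a timing loop along these lines is one option. This is not the harness behind the figures above, just a minimal sketch: BenchResult and time_multiplies are hypothetical names, and a serious benchmark would add warm-up runs and repeated trials.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

struct mat4 { float data[16]; };  // row-major, flat

inline mat4 multiply(const mat4& a, const mat4& b) noexcept {
    mat4 result{};
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < 4; ++k) {
                sum += a.data[i * 4 + k] * b.data[k * 4 + j];
            }
            result.data[i * 4 + j] = sum;
        }
    }
    return result;
}

struct BenchResult {
    double milliseconds;  // wall-clock time of the multiply loop only
    float checksum;       // consumed output, so the work cannot be optimized away
};

// Times `count` independent 4x4 multiplications over contiguous arrays.
inline BenchResult time_multiplies(std::size_t count) {
    std::vector<mat4> a(count), b(count), out(count);
    for (mat4& m : a) for (float& v : m.data) v = 1.0f;
    for (mat4& m : b) for (float& v : m.data) v = 0.5f;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = multiply(a[i], b[i]);
    }
    const auto stop = std::chrono::steady_clock::now();

    float checksum = 0.0f;
    for (const mat4& m : out) checksum += m.data[0];  // each element is 4 * 0.5 = 2

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return BenchResult{elapsed.count(), checksum};
}
```

Call time_multiplies(1'000'000) to match the batch size quoted above. The result depends heavily on compiler flags, so record them next to the timing.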

Compilers are smart, but they are deeply conservative. Give them opaque or fragmented memory layouts and the optimizer tends to fall back to the safe, slow scalar path. Performance here is just about structuring data so the hardware reads it without friction. High-level developer ergonomics don’t actually need runtime overhead. Using standard layouts and concept constraints, you can build robust math tools. But under the hood, they just act as transparent data pipes for the CPU.

If you are interested in examining the complete data-oriented implementation of these primitives, including the Python bindings for zero-copy FFI, you can inspect the architecture in the Dichotomia repository on GitHub.

zem-invictus / dichotomia

A high-performance, header-only C++23 math library for 3D game engines. Built from scratch focusing on modern C++ features (Value Semantics, Deducing this), strict angle typing, and performance.

