SIMD AVX2 vs SSE4.2 GEMM Matrix Multiplication Performance: Vector Operations and MLP Inference on Intel Alder Lake

Tags: benchmarking, simd, research, opensource

Summary

This dataset characterizes the performance of SIMD-accelerated compute kernels in the Kazkade zero-copy columnar analytics engine. We compare AVX2, SSE4.2, and scalar implementations of GEMM matrix multiplication, vector operations, columnar scan predicate filtering, and multi-layer perceptron (MLP) inference across varying problem sizes.

Methodology

Hardware: Intel Core i7-1260P (AVX2, 8 cores, 12 MB L3 cache). Each configuration was run for 100 timed iterations after a warmup phase. Reported distributions include min, median, mean, max, and standard deviation.
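
The measurement procedure described above can be sketched roughly as follows. This is a minimal illustration, not Kazkade's actual harness: the function name `benchmark`, the `warmup=10` default, and the use of wall-clock time via `time.perf_counter` are all assumptions, since the source states only that a warmup precedes 100 measured iterations.

```python
import statistics
import time


def benchmark(fn, iterations=100, warmup=10):
    """Time fn() and summarize the per-iteration wall-clock distribution.

    warmup=10 is an assumed value; the source only says that a warmup
    happens before measurement.
    """
    for _ in range(warmup):
        fn()  # untimed runs, e.g. to warm caches before measuring

    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    return {
        "n": len(samples),
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.mean(samples),
        "max": max(samples),
        "stdev": statistics.stdev(samples),  # sample standard deviation
    }


if __name__ == "__main__":
    data = list(range(100_000))
    stats = benchmark(lambda: sum(data))
    for name, value in stats.items():
        print(f"{name}: {value}")
```

Reporting the median alongside the mean helps here because a few slow iterations (for example, from OS scheduling) can pull the mean upward without reflecting typical kernel performance.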

Key Results

AVX2 GEMM at 1024x1024 achieves 33.8 GFLOPS, about 9x faster than scalar (3.7 GFLOPS)
SSE4.2 achieves 16.8 GFLOPS at the same size, about half the AVX2 rate
Vector dot product: 10.9 GB/s with AVX2 vs 2.5 GB/s scalar
MLP inference (3-layer, 128 neurons): 55,652 ops/sec at batch size 64
Columnar scan on 10M rows: 78M rows/sec vs 21M rows/sec row-wise
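
To relate the GFLOPS figures above to wall-clock time, the sketch below assumes the conventional count of 2·N³ floating-point operations for an N×N by N×N multiply. The source does not say how its GFLOPS values were derived, so that formula and the helper names (`gemm_flops`, `gflops`, `implied_seconds`) are assumptions for illustration. The rates are the ones reported above.

```python
def gemm_flops(n):
    """Floating-point operations for an n x n by n x n multiply.

    Assumes the conventional 2 * n^3 count (one multiply and one add
    per inner-product term); the source does not state its formula.
    """
    return 2 * n ** 3


def gflops(n, seconds):
    """Throughput in GFLOPS for one n x n GEMM that took `seconds`."""
    return gemm_flops(n) / seconds / 1e9


def implied_seconds(n, rate_gflops):
    """Time per n x n GEMM implied by a throughput in GFLOPS."""
    return gemm_flops(n) / (rate_gflops * 1e9)


# Reported GEMM throughput at 1024x1024, in GFLOPS.
reported = {"avx2": 33.8, "sse4.2": 16.8, "scalar": 3.7}

for name, rate in reported.items():
    ms = implied_seconds(1024, rate) * 1e3
    print(f"{name}: {ms:.1f} ms per multiply")

# Speedup ratios computed directly from the reported figures.
print(f"AVX2 vs scalar GEMM: {reported['avx2'] / reported['scalar']:.2f}x")
print(f"AVX2 vs SSE4.2 GEMM: {reported['avx2'] / reported['sse4.2']:.2f}x")
print(f"AVX2 vs scalar dot product: {10.9 / 2.5:.2f}x")
print(f"Columnar vs row-wise scan: {78 / 21:.2f}x")
```

Under that assumption, 33.8 GFLOPS at 1024x1024 corresponds to roughly 64 ms per multiply. The same ratio arithmetic gives about a 4.4x gain for the AVX2 dot product and about 3.7x for the columnar scan.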