SIMD AVX2 vs SSE4.2 GEMM Matrix Multiplication Performance: Vector Operations and MLP Inference on Intel Alder Lake
Tags: benchmarking, simd, research, opensource
Summary
This dataset characterizes the performance of SIMD-accelerated compute kernels in the Kazkade zero-copy columnar analytics engine. We compare AVX2, SSE4.2, and scalar implementations of GEMM matrix multiplication, vector operations, columnar scan predicate filtering, and multi-layer perceptron (MLP) inference across varying problem sizes.
Methodology
Hardware: Intel Core i7-1260P (AVX2, 8 cores, 12 MB L3 cache). Each configuration was run for 100 measured iterations after a warmup phase. The reported statistics for each configuration are min, median, mean, max, and standard deviation.
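The measurement procedure described above can be sketched roughly as follows. This is an illustrative harness, not the Kazkade benchmark code: the function name `benchmark`, the warmup count of 10, and the toy workload are assumptions, since the source states only that 100 iterations follow a warmup.

```python
import statistics
import time

def benchmark(fn, iterations=100, warmup=10):
    """Time fn() after a warmup phase and return summary statistics in seconds."""
    # Warmup runs are executed but not recorded (count is an assumption).
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.fmean(samples),
        "max": max(samples),
        "stddev": statistics.stdev(samples),
    }

# Toy workload standing in for a real kernel call.
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
for name, value in stats.items():
    print(f"{name}: {value * 1e6:.1f} us")
```

Reporting the median alongside the mean helps when a few iterations are slowed by scheduling noise, which is common on hybrid-core laptop parts.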
Key Results
- AVX2 GEMM at 1024x1024 achieves 33.8 GFLOPS, about 9x faster than scalar (3.7 GFLOPS)
- SSE4.2 achieves 16.8 GFLOPS at the same size, roughly half the AVX2 throughput
- Vector dot product: 10.9 GB/s with AVX2 vs 2.5 GB/s scalar
- MLP inference (3-layer, 128 neurons): 55,652 ops/sec at batch size 64
- Columnar scan on 10M rows: 78M rows/sec vs 21M rows/sec row-wise
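The ratios behind these figures can be checked with a few lines of arithmetic. The sketch below assumes the usual convention of counting 2*n^3 floating-point operations for an n x n GEMM; the source does not state which operation count was used, so the derived time per multiplication is an estimate under that assumption. The helper names are illustrative.

```python
def gemm_gflops(n, seconds):
    """GFLOPS for an n x n GEMM, assuming 2*n^3 floating-point operations."""
    return 2 * n**3 / seconds / 1e9

def gemm_seconds(n, gflops):
    """Inverse of gemm_gflops: time per multiplication at a given rate."""
    return 2 * n**3 / (gflops * 1e9)

# Figures quoted in Key Results.
avx2, sse42, scalar = 33.8, 16.8, 3.7   # GFLOPS at 1024x1024
dot_avx2, dot_scalar = 10.9, 2.5        # GB/s
scan_columnar, scan_row = 78, 21        # million rows/sec

print(f"GEMM AVX2 vs scalar:  {avx2 / scalar:.1f}x")             # 9.1x
print(f"GEMM AVX2 vs SSE4.2:  {avx2 / sse42:.1f}x")              # 2.0x
print(f"Dot product speedup:  {dot_avx2 / dot_scalar:.1f}x")     # 4.4x
print(f"Columnar vs row-wise: {scan_columnar / scan_row:.1f}x")  # 3.7x
print(f"AVX2 time per GEMM:   {gemm_seconds(1024, avx2) * 1e3:.1f} ms")
```

The roughly 2x gap between AVX2 and SSE4.2 matches the doubling of vector width from 128 to 256 bits.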
Data Files
- stat_gemm.csv: GEMM performance across 3 implementations and 5 matrix sizes
- stat_vector_ops.csv: Vector operation throughput
- stat_mlp_inference.csv: MLP inference latency
- kazkade_cli_bench.txt: Raw CLI benchmark output
Dataset: https://doi.org/10.7910/DVN/YMJKOG
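A quick way to get oriented after downloading the CSV files is to print their column names and row counts. The sketch below assumes the files sit in the current directory and have a header row; the column names themselves are not documented here, so the code reads them from each file rather than assuming them. The helper name `describe_csv` is illustrative.

```python
import csv
from pathlib import Path

def describe_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        rows = sum(1 for _ in reader)  # iterating also populates fieldnames
        return reader.fieldnames, rows

for name in ("stat_gemm.csv", "stat_vector_ops.csv", "stat_mlp_inference.csv"):
    if Path(name).exists():
        columns, rows = describe_csv(name)
        print(f"{name}: {rows} rows, columns = {columns}")
    else:
        print(f"{name}: not found in the current directory")
```

If stat_gemm.csv holds one row per configuration, 3 implementations times 5 matrix sizes would give 15 data rows, but that layout is a guess until the file is inspected.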












