Designing a practical sorting benchmark across Python, Rust, and C

I recently built a sorting playground, and one question kept coming up:

How do you compare and evaluate sorting algorithms?

Not just theoretically, but in practice.

The problem

Benchmarking sounds simple:

run the same algorithm
measure time
compare

But it quickly becomes complicated:

Python vs Rust vs C behave very differently
large inputs can break CI pipelines
some algorithms are not even benchmarkable

The approach

Instead of chasing perfect accuracy, I focused on something else:

consistency and reproducibility

Key decisions

run benchmarks in CI (GitHub Actions)
use fixed datasets
run multiple iterations
average results
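To make these decisions concrete, here is a minimal Python sketch of a single-language harness. It assumes a fixed seed generates the dataset and that the reported number is the mean of several timed iterations. `SEED`, `make_dataset`, `benchmark`, and the default of 5 iterations are hypothetical choices, not taken from the project. In a cross-language setup, the dataset would more likely be generated once and written to a file that the Rust and C runners also read, so every language sorts exactly the same input.

```python
import random
import statistics
import time
from typing import Callable, List

# Hypothetical seed: any constant works, as long as it never changes between runs.
SEED = 42


def make_dataset(size: int, seed: int = SEED) -> List[int]:
    """Return the same pseudo-random list every time for a given (size, seed)."""
    rng = random.Random(seed)
    return [rng.randint(0, 1_000_000) for _ in range(size)]


def insertion_sort(values: List[int]) -> List[int]:
    """Example algorithm under test: sorts the list in place and returns it."""
    for i in range(1, len(values)):
        key = values[i]
        j = i - 1
        while j >= 0 and values[j] > key:
            values[j + 1] = values[j]
            j -= 1
        values[j + 1] = key
    return values


def benchmark(
    sort_fn: Callable[[List[int]], List[int]], data: List[int], iterations: int = 5
) -> float:
    """Time sort_fn over several iterations and return the mean in seconds."""
    expected = sorted(data)
    timings = []
    for _ in range(iterations):
        work = list(data)  # fresh copy: every iteration sorts the same unsorted input
        start = time.perf_counter()
        result = sort_fn(work)
        timings.append(time.perf_counter() - start)
        # A fast but wrong sort should fail loudly instead of producing a number.
        if result != expected:
            raise ValueError(f"{sort_fn.__name__} did not sort its input")
    return statistics.mean(timings)


if __name__ == "__main__":
    data = make_dataset(2_000)
    print(f"insertion_sort, n={len(data)}: {benchmark(insertion_sort, data):.6f} s")
```

Copying the input before each iteration matters: without it, the second iteration would sort already-sorted data, which flatters algorithms like insertion sort.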

Input sizes

small
medium
large

But not every language runs all three sizes (some runs simply take too long).

The reality of cross-language benchmarking

One important thing I learned:

Not all languages should run the same workload.

For example:

Rust / C / C++ can handle large datasets easily
Python can become extremely slow on large inputs
running everything “fairly” is not practical

Practical constraints

So I introduced constraints:

Python skips large datasets
heavy algorithms are limited
some algorithms opt out entirely

This makes the system:

fast enough to run in CI
stable
still useful for comparison
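One way to express these constraints is as a small run plan computed before anything executes. This sketch assumes "heavy" means "capped at the small size" and uses bogosort as an example of an opt-out. The element counts, algorithm names, `plan_runs`, and the policy tables are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical element counts; the post does not publish the real sizes.
SIZES: Dict[str, int] = {"small": 1_000, "medium": 10_000, "large": 100_000}
SIZE_ORDER = list(SIZES)  # ["small", "medium", "large"]

# Hypothetical policy tables, one per constraint.
LANGUAGE_MAX_SIZE = {"python": "medium", "rust": "large", "c": "large"}  # Python skips large
HEAVY_ALGORITHMS = {"bubble_sort", "insertion_sort"}  # heavy: capped at the small size
OPT_OUT = {"bogosort"}  # never benchmarked at all


def plan_runs(algorithms: List[str], languages: List[str]) -> List[Tuple[str, str, str]]:
    """Return every (algorithm, language, size) combination the policy allows."""
    runs = []
    for algo in algorithms:
        if algo in OPT_OUT:
            continue
        for lang in languages:
            max_index = SIZE_ORDER.index(LANGUAGE_MAX_SIZE[lang])
            if algo in HEAVY_ALGORITHMS:
                max_index = 0  # heavy algorithms only get the smallest input
            for size in SIZE_ORDER[: max_index + 1]:
                runs.append((algo, lang, size))
    return runs


if __name__ == "__main__":
    for algo, lang, size in plan_runs(["merge_sort", "bubble_sort", "bogosort"], ["python", "rust", "c"]):
        print(f"{algo:12} {lang:7} {size} (n={SIZES[size]})")
```

Keeping the policy in data rather than in scattered `if` statements makes it easy to see, in one place, exactly what CI will run.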

Incremental benchmarking

Another key idea:

don’t re-run everything

new algorithms → benchmarked
existing algorithms → previous results reused

This keeps CI time under control.
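A minimal sketch of the incremental idea, assuming results live in a JSON file keyed by algorithm name: only algorithms without a stored entry get benchmarked, and everything else is reused as-is. `load_results`, `save_results`, `incremental_update`, and the result shape are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical result shape: {algorithm: {language: {size: mean_seconds}}}
Results = Dict[str, Dict[str, Dict[str, float]]]


def load_results(path: Path) -> Results:
    """Load previously published results, or start empty on the first run."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_results(path: Path, results: Results) -> None:
    """Write results back with stable key order so diffs stay small."""
    path.write_text(json.dumps(results, indent=2, sort_keys=True), encoding="utf-8")


def incremental_update(
    results: Results,
    algorithms: List[str],
    run_benchmark: Callable[[str], Dict[str, Dict[str, float]]],
) -> List[str]:
    """Benchmark only algorithms with no stored entry; existing entries stay untouched."""
    new_algorithms = [a for a in algorithms if a not in results]
    for algo in new_algorithms:
        results[algo] = run_benchmark(algo)
    return new_algorithms
```

In CI, this would be called with the full algorithm list and a function that runs every language for one algorithm, and the file would be rewritten only when the returned list is non-empty. One caveat of keying on the name alone: editing an existing implementation does not trigger a re-run unless its stored entry is deleted (or something like a source hash is added to the key).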

What the system produces

Each algorithm gets:

per-language results
per-size measurements
aggregated data

The output is stored as static JSON and rendered in the UI.
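The post doesn't spell out the JSON schema, so the following is only one plausible shape: per-language, per-size mean timings plus an `aggregated` section. Here "aggregated" is assumed to mean, for each size, which language was fastest and how many languages were measured. `build_record` and every field name are hypothetical, and the numbers in the example are placeholders, not measurements.

```python
import json
from typing import Dict

# Hypothetical input shape: timings[language][size] = mean seconds for one algorithm.
Timings = Dict[str, Dict[str, float]]


def build_record(algorithm: str, timings: Timings) -> dict:
    """Bundle raw per-language/per-size timings with a small aggregated summary."""
    sizes = sorted({size for per_size in timings.values() for size in per_size})
    aggregated = {}
    for size in sizes:
        # Only languages that actually ran this size take part in the comparison.
        measured = {lang: per_size[size] for lang, per_size in timings.items() if size in per_size}
        aggregated[size] = {
            "fastest_language": min(measured, key=measured.__getitem__),
            "languages_measured": len(measured),
        }
    return {"algorithm": algorithm, "timings": timings, "aggregated": aggregated}


if __name__ == "__main__":
    # Placeholder values, not real measurements.
    example = build_record("merge_sort", {"python": {"small": 3.0}, "rust": {"small": 1.0, "large": 2.0}})
    print(json.dumps(example, indent=2))
```

Because the file is static, the UI can load it without a backend, and a size a language skipped (such as Python on large inputs) simply shows up as a missing key.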

Why build this?

Because combining:

visualization
comparison
and reproducible benchmarks

makes algorithms much easier to understand.

Future ideas

more languages
better scoring models
workload-specific comparisons

All while keeping it simple enough to run in CI.

If you’re interested, you can explore it here (open source under the MIT license):

https://sorting.1234567890.dev/benchmark
https://github.com/T-1234567890/sort-playground