SIMD in Zig: The Fine Art of Punching Substring Searches in the Face
Hello everyone. Gather ’round, because it’s time we talk about something that’s been lurking in the dank corners of software performance: substring searching. Yes, that seemingly harmless function that can turn into a CPU-chewing monster given a large enough haystack. And here comes our brave Zig programmer, armed with SIMD and just enough low-level masochism to make Boyer-Moore-Horspool look like Pong compared to Elden Ring.
The Baseline: Slow and Steady… and Still Slow
The baseline here is Zig’s standard std.mem.indexOf, which uses the Boyer-Moore-Horspool algorithm. On paper, it’s fine – O(n) average case, skip tables, predictable behaviour. But, like an outdated MMO engine, it’s stuck in scalar-land, calling std.mem.eql over and over. The compiler shrugs at auto-vectorization and moves on to more exciting optimizations, like staring at its own backend. Bottom line: it works, but it’s about as thrilling as diagnosing hypochondria in a patient with a sniffle.
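For reference, the baseline on the caller’s side is a one-liner – nothing exotic, just the standard library:

```zig
const std = @import("std");

test "baseline std.mem.indexOf" {
    const haystack = "the quick brown fox jumps over the lazy dog";
    // Returns the index of the first occurrence, or null if absent.
    try std.testing.expectEqual(@as(?usize, 16), std.mem.indexOf(u8, haystack, "fox"));
    try std.testing.expectEqual(@as(?usize, null), std.mem.indexOf(u8, haystack, "cat"));
}
```

Convenient, correct, and – as the benchmarks below show – leaving a lot of silicon idle.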
Enter SIMD: Vectorized Wrath
Our intrepid coder steals – err, is inspired by – Wojciech Muła’s approach: grab the first and last characters of the needle, expand them into AVX2 256-bit registers, and sweep through the haystack in glorious 32-byte gulps. Compare blocks, AND the results, and boom – you’ve spotted candidates for your substring faster than Steam spots an excuse for another sale.
There’s an undeniable elegance in slicing through data this way. Fewer memory accesses, fewer comparisons, just raw bitmask glory. Of course, like every overclock, the devil is in the pipeline: branch misses skyrocket. From 48 to nearly 3,000. Congratulations, you turned your CPU into an indecisive RPG NPC who keeps forgetting its quest dialogue.
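To make the idea concrete, here is a minimal reconstruction of the first/last-byte technique in Zig – my sketch, not the author’s exact code. The function name indexOfSimd and the scalar tail fallback via std.mem.indexOfPos are my own choices; a fixed 32-byte @Vector stands in for the AVX2 registers:

```zig
const std = @import("std");

// One AVX2 register holds 32 bytes; @Vector lets the compiler emit the SIMD.
const block_len = 32;
const Block = @Vector(block_len, u8);

/// Muła-style search: match the needle's first and last bytes against two
/// shifted 32-byte windows, then verify each candidate with a scalar compare.
fn indexOfSimd(haystack: []const u8, needle: []const u8) ?usize {
    if (needle.len == 0) return 0;
    if (haystack.len < needle.len) return null;

    const first: Block = @splat(needle[0]);
    const last: Block = @splat(needle[needle.len - 1]);

    var i: usize = 0;
    while (i + block_len + needle.len - 1 <= haystack.len) : (i += block_len) {
        const block_first: Block = haystack[i..][0..block_len].*;
        const block_last: Block = haystack[i + needle.len - 1 ..][0..block_len].*;

        // Bit n of `mask` is set where both the first and last byte match.
        const eq_first: u32 = @bitCast(block_first == first);
        const eq_last: u32 = @bitCast(block_last == last);
        var mask = eq_first & eq_last;

        while (mask != 0) {
            const offset = @ctz(mask);
            if (std.mem.eql(u8, haystack[i + offset ..][0..needle.len], needle))
                return i + offset;
            mask &= mask - 1; // clear the lowest set bit, try the next candidate
        }
    }
    // Scalar fallback for the tail that doesn't fill a full block.
    return std.mem.indexOfPos(u8, haystack, i, needle);
}
```

The inner while over the bitmask is where those branch misses breed: every false-positive candidate costs a scalar verification, which is exactly what the rarest-character tweak below attacks.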
Optimizing the Optimizations
Solution? Don’t always compare the first and last letters – pick rarer characters in the needle, the ones less likely to spam false positives. It’s like in gaming: don’t camp the respawn point, go for that sneaky flanking route. In practice, the tweak chopped branch misses down by 60% and even squeezed out another ~9% speed increase. That’s free DLC right there.
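One way to pick those comparison bytes – an illustration of the idea, not the author’s exact heuristic – is to rank the needle’s bytes against a frequency table (say, byte counts over a sample of the corpus) and take the two rarest offsets. The name rarestOffsets is mine:

```zig
const std = @import("std");

/// Pick the offsets of the two rarest needle bytes according to a
/// caller-supplied frequency table (e.g. byte counts over sample text).
/// These offsets would replace the hard-coded first/last positions.
fn rarestOffsets(needle: []const u8, freq: *const [256]u32) [2]usize {
    std.debug.assert(needle.len != 0);
    var best = [2]usize{ 0, needle.len - 1 };
    var best_freq = [2]u32{ std.math.maxInt(u32), std.math.maxInt(u32) };
    for (needle, 0..) |byte, i| {
        const f = freq[byte];
        if (f < best_freq[0]) {
            // New rarest byte; the old rarest becomes the runner-up.
            best[1] = best[0];
            best_freq[1] = best_freq[0];
            best[0] = i;
            best_freq[0] = f;
        } else if (f < best_freq[1]) {
            best[1] = i;
            best_freq[1] = f;
        }
    }
    return best;
}
```

Rare bytes produce sparse candidate masks, so the scalar verification loop fires far less often – that is where the 60% drop in branch misses comes from.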
The Benchmark Bloodbath
- Baseline: ~1.22 milliseconds for huge input.
- First SIMD version: ~500 microseconds – a 59% improvement with 80% fewer CPU cycles.
- Rarest-character SIMD: Down to ~429 microseconds, and fewer branch prediction disasters.
These aren’t subtle differences – this is significant, measurable, and deliciously satisfying. Imagine reading an entire 200,000-word novel in half a millisecond. Your brain would fry before page two, but the CPU? Barely warmed up.
AVX-512: The Theoretical Overlord
And then there’s AVX-512. Twice the width, twice the bite. Our hero writes a version ready for 64-byte gulps, deferring block size detection to std.simd.suggestVectorLength(). Sadly, they don’t own a CPU with AVX-512, so performance remains speculative – like those ‘RTX ON’ YouTube videos people watch while gaming on integrated graphics. But given the results so far, it’s safe to say AVX-512 would stomp the previous scores into the silicon dust.
Small Inputs: Surprise Victory
The obvious assumption: SIMD shines with big data, so small inputs should shrink the gains. Wrong. Even for sub-100-character haystacks, SIMD is a smidge faster: 4 microseconds down to 1.5 microseconds. Completely irrelevant in user-facing scenarios, but to performance purists, that’s as good as finding a mint-condition collector’s cartridge at a yard sale.
Final Diagnosis
Here’s the prognosis: If you’re writing substring searches in Zig and still using the standard library for large data, you need an intervention – ideally involving SIMD. The documented approach not only speeds things up massively but also demonstrates how a little algorithmic poking can yield extra gains. Yes, branch misprediction gremlins will haunt you unless you choose good comparison characters, but the payoff is significant.
From a gaming perspective, this is the equivalent of swapping your rusty sword for a legendary, enchanted, processor-friendly greatsword. From a doctor’s perspective: the patient – your CPU – is now healthier, happier, and far less inflamed by pointless memory accesses.
Overall verdict: extremely good work, with room for further exotic optimizations if you’re willing to bleed at the altar of low-level wizardry. If you’re not using SIMD for text scanning in Zig at scale, you’re doing it wrong. And if you enjoy reading benchmark tables as much as I do? Well, you just got dessert.
And that, ladies and gentlemen, is entirely my opinion.
Article source: Faster substring search with SIMD in Zig, https://aarol.dev/posts/zig-simd-substr/