Advanced LLVM Optimization Techniques for RISC-V
Tuning LLVM vectorization and instruction selection to unlock 3× performance on real-world scientific workloads.
Instrumenting the Baseline
Before rewriting a single pass, we profiled the entire toolchain to understand where CPU time was actually going. Using LLVM’s opt-bisect tooling and our internal benchmarking harness, we captured per-pass instruction counts, vector utilization ratios, and cache miss profiles. Runs across the NAS Parallel Benchmarks and the Polybench kernels showed the workloads were nominally CPU bound, but the memory behavior of SP and BT made vector utilization the real bottleneck.
We extended the existing LLVM MIR printer with custom metadata so that each hot loop emitted metrics about lane occupancy and mask efficiency. This gave us a way to compare transformations across commits without leaning on fragile log scraping. The baseline runs showed an average vector lane utilization of 38%, leaving plenty of room for improvement.
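For readers who want to reproduce this kind of instrumentation, the sketch below is a statistics-only MachineFunctionPass that tallies how many machine instructions look like RVV operations. The pass name, the `PseudoV` opcode-prefix heuristic, and the plain `errs()` reporting are illustrative simplifications rather than the metadata printer described above; real lane-occupancy accounting also has to track VL/VTYPE state.

```cpp
// Sketch of a statistics-only MachineFunctionPass (legacy pass manager).
// Pass registration boilerplate is omitted; the "PseudoV" prefix check is a
// crude stand-in for real RVV occupancy tracking.
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

namespace {
struct VectorOpStats : MachineFunctionPass {
  static char ID;
  VectorOpStats() : MachineFunctionPass(ID) {}

  bool runOnMachineFunction(MachineFunction &MF) override {
    const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
    unsigned Total = 0, Vector = 0;
    for (const MachineBasicBlock &MBB : MF) {
      for (const MachineInstr &MI : MBB) {
        if (MI.isDebugInstr() || MI.isCFIInstruction())
          continue;
        ++Total;
        // Heuristic: RVV operations show up as PseudoV* opcodes before final
        // lowering. Good enough for a trend line, not exact lane accounting.
        if (TII->getName(MI.getOpcode()).starts_with("PseudoV"))
          ++Vector;
      }
    }
    errs() << MF.getName() << ": " << Vector << "/" << Total
           << " vector machine instructions\n";
    return false; // purely observational, nothing modified
  }
};
} // namespace

char VectorOpStats::ID = 0;
```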
Refining Cost Models and Pass Order
The RVV backend shipped with conservative heuristics that bailed out whenever loop bounds were not compile-time constants. Our first win came from teaching the cost model about the stride-one gather/scatter patterns the NAS kernels used. We introduced a lightweight symbolic range analysis pass ahead of the vectorizer so that more of those loops could be proven safe to vectorize.
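To give a flavour of the range-analysis step, here is a minimal new-pass-manager sketch built on ScalarEvolution. The pass name, the trip-count threshold of 16, and the bare `llvm.loop.vectorize.enable` hint are assumptions for illustration; the production pass also has to merge with existing loop metadata and establish the legality facts it is hinting at.

```cpp
// Sketch of a pre-vectorizer hint pass (new pass manager). Names and the
// trip-count threshold are illustrative; pass registration is omitted.
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/IR/ConstantRange.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/PassManager.h"
using namespace llvm;

struct SymbolicRangeHintPass : PassInfoMixin<SymbolicRangeHintPass> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM) {
    auto &SE = AM.getResult<ScalarEvolutionAnalysis>(F);
    auto &LI = AM.getResult<LoopAnalysis>(F);
    bool Changed = false;

    for (Loop *L : LI.getLoopsInPreorder()) {
      const SCEV *TripCount = SE.getBackedgeTakenCount(L);
      if (isa<SCEVCouldNotCompute>(TripCount))
        continue;

      // Even a symbolic trip count has a conservative unsigned range; if its
      // minimum is large enough, hint the vectorizer to take the loop.
      ConstantRange Range = SE.getUnsignedRange(TripCount);
      if (Range.getUnsignedMin().ult(16)) // assumed profitability threshold
        continue;

      LLVMContext &Ctx = F.getContext();
      Metadata *Hint[] = {
          MDString::get(Ctx, "llvm.loop.vectorize.enable"),
          ConstantAsMetadata::get(ConstantInt::getTrue(Ctx))};
      MDNode *HintMD = MDNode::get(Ctx, Hint);
      // Loop IDs are distinct nodes whose first operand is the node itself;
      // real code must also carry over any pre-existing loop metadata.
      Metadata *LoopIDOps[] = {nullptr, HintMD};
      MDNode *LoopID = MDNode::getDistinct(Ctx, LoopIDOps);
      LoopID->replaceOperandWith(0, LoopID);
      L->setLoopID(LoopID);
      Changed = true;
    }
    return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
  }
};
```

Scheduled immediately before loop-vectorize, a hint like this lets the vectorizer consider loops whose bounds are symbolic but provably large.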
Pass ordering mattered just as much as the heuristics. We moved LoopStrengthReduce later in the pipeline so it would not undo profitable induction-variable rewrites, and we taught SLPVectorizer to cooperate with the RISC-V VLS instructions by emitting masked loads instead of scalar expansion. The result: vector lane utilization climbed to 81%, and the inner loops shed half their instructions.
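The ordering is easiest to keep stable when it is spelled out as an explicit pipeline rather than a patched default. The sketch below parses a deliberately abbreviated pipeline string with the new pass manager so the vectorizers run before loop-reduce; the exact string is an assumption for illustration, and a real pipeline wraps these in the usual simplification and cleanup passes.

```cpp
// Sketch: driving an explicit pass order with the new pass manager, so the
// vectorizers run before LoopStrengthReduce instead of relying on default
// placement. The pipeline string is heavily abbreviated for illustration.
#include "llvm/Analysis/CGSCCPassManager.h"
#include "llvm/Analysis/LoopAnalysisManager.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Support/Error.h"
using namespace llvm;

static void runTunedPipeline(Module &M) {
  LoopAnalysisManager LAM;
  FunctionAnalysisManager FAM;
  CGSCCAnalysisManager CGAM;
  ModuleAnalysisManager MAM;

  PassBuilder PB;
  PB.registerModuleAnalyses(MAM);
  PB.registerCGSCCAnalyses(CGAM);
  PB.registerFunctionAnalyses(FAM);
  PB.registerLoopAnalyses(LAM);
  PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

  ModulePassManager MPM;
  // Vectorize first, strength-reduce afterwards so profitable induction
  // variable rewrites are not undone before vectorization.
  if (Error Err = PB.parsePassPipeline(
          MPM, "function(loop-vectorize,slp-vectorizer,loop(loop-reduce))"))
    report_fatal_error(std::move(Err));

  MPM.run(M, MAM);
}
```

The same string can be handed to opt via -passes= for quick experiments before baking it into the build.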
Guardrails Against Regressions
Performance wins mean little if they are fragile. We captured the tuned configuration as an opt pipeline definition and codified the expected instruction counts in a set of MIR-based lit tests. Any change that shifted lane utilization or instruction count by more than 3% failed CI automatically.
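The drift gate itself needs nothing fancy once the metrics land in a machine-readable file. A hypothetical checker along these lines (the two-column name/value format and the tool name are assumptions for illustration) is enough to fail the CI job:

```cpp
// Hypothetical drift checker: compares current metrics against a baseline
// file and exits non-zero if any metric moves by more than 3%.
// The "name value" per-line format is an assumption for illustration.
#include <cmath>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

static std::map<std::string, double> readMetrics(const char *Path) {
  std::map<std::string, double> Metrics;
  std::ifstream In(Path);
  std::string Name;
  double Value;
  while (In >> Name >> Value)
    Metrics[Name] = Value;
  return Metrics;
}

int main(int argc, char **argv) {
  if (argc != 3) {
    std::cerr << "usage: drift-check <baseline> <current>\n";
    return 2;
  }
  const double Tolerance = 0.03; // 3% drift budget
  auto Baseline = readMetrics(argv[1]);
  auto Current = readMetrics(argv[2]);
  int Failures = 0;
  for (const auto &[Name, Expected] : Baseline) {
    auto It = Current.find(Name);
    if (It == Current.end()) {
      std::cerr << Name << ": missing from current run\n";
      ++Failures;
      continue;
    }
    double Drift = std::fabs(It->second - Expected) / Expected;
    if (Drift > Tolerance) {
      std::cerr << Name << ": drifted " << Drift * 100 << "% (baseline "
                << Expected << ", current " << It->second << ")\n";
      ++Failures;
    }
  }
  return Failures == 0 ? 0 : 1;
}
```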
Finally, we wired the benchmarks into a nightly perf dashboard backed by InfluxDB so compiler engineers and architecture teams could see the impact of every merge. The visibility prevented surprise regressions and gave the silicon team confidence that the compiler was extracting the hardware’s full capability.
Key takeaways
- Vector lane utilization jumped from 38% to 81% by combining symbolic range analysis with tuned cost models.
- Reordering passes to preserve induction-variable rewrites was just as impactful as updating heuristics.
- Perf regression tests based on MIR snapshots keep the pipeline honest after every merge.
Need help implementing this?
I work with teams to turn these practices into production workflows.