How to avoid bounds checks in Rust (without unsafe!)

Sergey "Shnatsel" Davidoff
22 min read · Jan 17, 2023

You can often hear online that indexing into a slice, such as my_slice[i], is slow in Rust, and that you should do something else instead for performance.

The details, however, are murky. There’s little in the way of benchmarks, and hardly any documentation on removing this overhead without resorting to unsafe code.

So after optimizing a bunch of high-profile crates by removing bounds checks (and also removing unsafe code from some of them), I figured I should write down the results, and the techniques I discovered.

In this article I’m going to cover:

  1. What is the typical runtime cost of bounds checks on indexing
  2. How to avoid bounds checks without unsafe code
  3. How to verify that bounds checks have been eliminated
  4. How to benchmark and profile Rust code
  5. When a bounds check is absolutely necessary, how to construct the cheapest possible bounds check

What are bounds checks?

Consider the following code:
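
Something along these lines (a minimal sketch of the kind of code meant here, not necessarily the exact snippet):

fn get_seventh(data: &[u64]) -> u64 {
    data[6] // the 7th element, counting from zero
}

fn main() {
    let my_array: [u64; 4] = [1, 2, 3, 4];
    // only 4 elements long, so there is no 7th element:
    println!("{}", get_seventh(&my_array));
}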

Accessing the 7th element of the array is incorrect, because the array is only 4 elements long, so there is no 7th element.

In C this would lead to accessing some random memory outside the intended range, which is called a buffer overflow.

The notorious Heartbleed is a buffer overflow, but most vulnerabilities of this type don’t get flashy names because there’s just so many of them. Yet writing outside the intended range is a very common way for an attacker to execute their code on your machine and do literally anything they want — steal credit card info, mine cryptocurrency, spy on you, etc. This is why buffer overflows are considered the most dangerous software vulnerability.

To avoid that, Rust inserts so-called bounds checks on every access to ensure that a buffer overflow never happens — something like this:
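
Conceptually it boils down to this (a sketch, not the actual standard library code):

// roughly what the compiler inserts for every `my_slice[i]`
fn checked_index(my_slice: &[u64], i: usize) -> u64 {
    if i < my_slice.len() {
        my_slice[i] // the actual memory access
    } else {
        panic!("index out of bounds: the len is {} but the index is {}", my_slice.len(), i)
    }
}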

If you try accessing an invalid index, a Rust program will panic instead of creating a security vulnerability.

While this is great for security, it comes at a small cost in runtime performance, because now there is more code for the CPU to execute.

Do bounds checks actually slow you down?

The real-world performance impact of bounds checks is surprisingly low.

The greatest impact I’ve ever seen on real-world code from removing bounds checks alone was 15%, but the typical gains are in the 1% to 3% range, and even that only happens in code that does a lot of number crunching.

You can occasionally see greater impact (as we’ll see soon!) if removing bounds checks allows the compiler to perform other optimizations.

Still, performance of code that’s not doing large amounts of number crunching will probably not be impacted by bounds checks at all.

Try it yourself!

While I will be posting the results I got, there’s nothing quite like trying things for yourself. So I’ve prepared a repository with all the code and will be providing all the necessary commands so you can follow along.

If you have already installed Rust, run this to get the code and all the tools:

cargo install cargo-show-asm hyperfine
git clone https://github.com/Shnatsel/bounds-check-cookbook
cd bounds-check-cookbook
cargo build --release

Let’s see some bounds checks

To have a simple example to experiment with, let’s write a function that calculates the Fibonacci numbers and writes them to a Vec:
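
A sketch of the shape of that function (the element type and exact signature are my reconstruction; the full program, including the tricky main(), lives in the repository):

#[inline(never)] // keep it as a separate function in the assembly so we can find it
fn fibonacci_vec(length: usize) -> Vec<u64> {
    let mut fib = vec![0; length];
    if length > 1 {
        fib[1] = 1;
    }
    for i in 2..length {
        // in release mode integer overflow silently wraps, which is fine for a benchmark
        fib[i] = fib[i-1] + fib[i-2];
    }
    fib
}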

The compiler is really good at removing any code that’s not being called, and at precomputing everything it can in advance. I had to add a main() with a lot of tricks in it to make sure that doesn’t happen, so that we actually get to see the bounds checks at runtime.

Let’s look at the assembly and see what the bounds checks look like:

cargo asm --rust --bin fibvec_naive_indexing fibonacci_vec

This will print the optimized assembly of the fibonacci_vec function, i.e. the instructions the CPU will actually execute, along with the Rust code that produced them.

You can do this even if you know nothing about assembly! Just eyeballing the amount of assembly produced and looking for function names is sufficient.

Let’s look at the hot inner loop first, the fib[i] = fib[i-1] + fib[i-2]; part simply by searching for it in the output of cargo asm:

That’s it? That’s just two instructions!

That can’t be it! Where’s the rest of it?

And indeed, if we scroll down a bit, we’ll see more code attributed to this line - it’s not all in one place:

What happened here is the compiler outlining the code path that’s taken when the bounds check fails. That code path leads to a panic, and panics are rare. So the compiler shuffled the code around so that the instructions leading up to a panic aren’t even loaded until we actually need them. Clever!

Anyway, back to the assembly! core::panicking::panic_bounds_check appears to be the panic on bounds check failure, happening in assembly attributed to our line of code. So this is what they look like!

Let’s see if the if length > 1 { fib[1] = 1; } bit outside the hot loop also has a bounds check on it…

No bounds checks here! The compiler was smart enough to realize that when length is strictly greater than 1, it’s impossible for the bounds check to fail: our Vec fib is created with length elements, so its length is also strictly greater than 1, and fib[1] is always in bounds.

However, it didn’t seem to realize that the same holds for the loop, specifically the fib[i] = fib[i-1] + fib[i-2]; line.

Perhaps we can help it?

Help the optimizer

We’re going to make two changes to the code to make it easier for the optimizer to prove that the bounds checks never fail:

  1. Instead of indexing up to length, which is just some integer, we’ll index up to fib.len(), to make it obvious that the index is always in bounds.
  2. Instead of using a Vec, we’ll make a slice of it once and index into the slice. This makes it more clear that the length doesn’t change.

This gets us the following code:
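
A sketch of the updated function with both changes applied (again, the exact code is in the repository):

// cargo asm --rust --bin fibvec_clever_indexing fibonacci_vec
#[inline(never)]
fn fibonacci_vec(length: usize) -> Vec<u64> {
    let mut fib = vec![0; length];
    if length > 1 {
        fib[1] = 1;
    }
    {
        // take a slice of the Vec once: its length can no longer change
        let fib = &mut fib[..];
        for i in 2..fib.len() { // index up to the slice's own length, not `length`
            fib[i] = fib[i-1] + fib[i-2];
        }
    }
    fib
}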

main() is unchanged, see here for the full code

And let’s verify it with cargo asm — the command is in the code above:

It’s again split in two parts for some reason, but the bounds check is gone!

But is it any faster? Let’s find out!

$ hyperfine 'target/release/fibvec_naive_indexing 1000000000' 'target/release/fibvec_clever_indexing 1000000000'

Benchmark 1: target/release/fibvec_naive_indexing 1000000000
Time (mean ± σ): 3.612 s ± 0.040 s [User: 1.435 s, System: 2.132 s]
Range (min … max): 3.546 s … 3.693 s 10 runs

Benchmark 2: target/release/fibvec_clever_indexing 1000000000
Time (mean ± σ): 3.133 s ± 0.019 s [User: 0.995 s, System: 2.103 s]
Range (min … max): 3.106 s … 3.163 s 10 runs

Summary
'target/release/fibvec_clever_indexing 1000000000' ran
1.15 ± 0.01 times faster than 'target/release/fibvec_naive_indexing 1000000000'

If you’re on Windows, you may have to add .exe to those paths.

It is faster, by a whopping 15%! On its own, a 15% gain is often no cause for celebration, but it’s the greatest boost from eliminating bounds checks that I’ve ever seen, so this is just about the best we could have hoped for!

And while this example was somewhat contrived, I used these techniques to speed up the fastblur crate by 15%. (Although I’ve shaved off 6x as much execution time through other means first).

Update: the fastblur example was also helped by indexing into a slice instead of a &mut Vec, more info here. So the gain from removing bounds checks is actually less than 15%.

Aside: Compiler Optimizations

Now let’s also try this on a 64-bit ARM CPU, just to confirm…

$ hyperfine 'target/release/fibvec_naive_indexing 1000000000' 'target/release/fibvec_clever_indexing 1000000000' 
Benchmark 1: target/release/fibvec_naive_indexing 1000000000
Time (mean ± σ): 3.320 s ± 0.024 s [User: 1.131 s, System: 2.179 s]
Range (min … max): 3.263 s … 3.346 s 10 runs

Benchmark 2: target/release/fibvec_clever_indexing 1000000000
Time (mean ± σ): 3.226 s ± 0.019 s [User: 1.092 s, System: 2.127 s]
Range (min … max): 3.209 s … 3.263 s 10 runs

Summary
'target/release/fibvec_clever_indexing 1000000000' ran
1.03 ± 0.01 times faster than 'target/release/fibvec_naive_indexing 1000000000'

Aaand it’s back in the expected 3% range. No huge 15% uplift here.

But the assembly on ARM is really short:

No bounds checks in sight! And that’s just 3 instructions, which means very little work to do, so it should be very fast!

What’s going on?

Recall that removal of bounds checks by themselves doesn’t matter much. You can only see a big uplift if removing bounds checks allowed the compiler to perform other optimizations.

If you go back and squint at the x86 assembly of fibonacci_vec without bounds checks, it’s almost the same lines repeated over and over, which looks suspiciously like loop unrolling.

Why is it performed on x86 and not on ARM? I have no idea! It should be — loop unrolling is a basic optimization that shouldn’t depend on the CPU architecture at all.

For comparison I tried this on a POWER9 CPU, and the compiler seems to unroll the loop even more and hyperfine reports a massive speedup of 1.78 ± 0.04 times, so I’m just going to file a bug for rustc and let people who know how compilers work deal with it.

The important takeaway for us is this: optimizing compilers are solving an NP-hard problem in a really short time, and there are always some cases they don’t handle exactly right.

Worse, the exact details change between versions. Even if the compiler improves on average, it may regress your specific code! Automatic vectorization for example is notoriously fickle, which is frustrating because it can get you much better performance when it works.

I’ve found the optimizations that remove bounds checks to be very reliable — once you get them working, they tend to keep working. So you can use these techniques and generally expect them not to break in the future. But that’s only the part responsible for the 3% uplift!

Since the loop unrolling responsible for the 15% uplift works on x86 but not on ARM, I wouldn’t bet on it working reliably in the future. Such is the sad reality of having something solve an NP-hard problem in a very short time.

Fortunately, in real programs that don’t spend all of the execution time in a single hot loop the differences are nowhere near this pronounced — regressions in one place are counterbalanced by improvements in another.

Aside: Benchmarking

So you may be wondering, why am I using hyperfine and going through all this trouble of writing a non-trivial main()?

Why don’t I just use cargo bench or criterion or something else specifically designed for benchmarking?

That once again has to do with the compiler’s tendency to precompute everything it can, and eliminate all code that doesn’t result in any changes to the output.

If the return value of a function is never used, and the function doesn’t panic, the compiler will simply remove it!

This is great for production code and terrible for benchmarks.

You can try to combat this by wrapping inputs and outputs in std::hint::black_box(), but it’s difficult to wrap all the things correctly, and it’s not clear which optimizations it inhibits, exactly.

I am sidestepping all this by making a real binary that reads the inputs, and the inputs are supplied only when the program is actually run, so there’s no way for the compiler to precompute anything. It also prints the result, so the compiler cannot remove the fibonacci_vec function as dead code.

And having standalone binaries also makes inspecting the assembly and profiling easier, as you will see shortly!

On the flip side, I have to crank the Vec lengths way up to get easily measurable execution times, and this may not be representative of how these functions perform on small Vecs due to CPU cache effects.

Now, back to experimenting with bounds checks…

Just Use Iterators

Experienced Rust users have probably been screaming at me about this the entire time they’ve been reading this article - up to this point, anyway!

Rust provides convenient iterators that let you work with collections without worrying about bounds checks, off-by-one errors and the like.

They also let you express what you want to accomplish more clearly — as long as you can remember the names of all the iterators or always use an IDE that shows documentation, which is a pretty big catch if you ask me!

So let’s go ahead and rewrite our code using iterators. We need sliding windows over our Vec, and there’s a handy iterator for that called windows, so let’s give it a shot:

Uh oh! That doesn’t compile! The issue is that windows() only gives us read-only slices; we cannot write through them. For most other iterators there is a corresponding _mut() version that yields mutable slices, but there is no windows_mut()! What gives?

Turns out windows_mut() cannot be implemented as a regular iterator in Rust, because if you were to .collect() it into say a Vec, you would end up with many overlapping and mutable slices — but Rust requires every part of memory to be mutable from only one place at a time!

What we’re looking for is a so-called streaming iterator, which cannot be .collect()ed, but there doesn’t seem to be one in the standard library yet. So we’ll have to change our code a bit:
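
One way to restructure it is to carry the two previous values in local variables and only write through a mutable iterator (a sketch; the repository version may differ in the details):

#[inline(never)]
fn fibonacci_vec(length: usize) -> Vec<u64> {
    let mut fib = vec![0; length];
    if length > 1 {
        fib[1] = 1;
    }
    let mut prev: u64 = 0;
    let mut curr: u64 = 1;
    // skip the first two elements, which are already initialized
    for elem in fib.iter_mut().skip(2) {
        let next = prev + curr;
        *elem = next;
        prev = curr;
        curr = next;
    }
    fib
}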

main() is unchanged, see here for the full code

This works, and you can use cargo asm to confirm that there are no bounds checks going on here. On Rust 1.65 it benchmarks slightly faster than our earlier attempt on my machine, by about 2%. On 1.66 it’s another big boost:

$ hyperfine 'target/release/fibvec_naive_indexing 1000000000' 'target/release/fibvec_clever_indexing 1000000000' 'target/release/fibvec_iterator 1000000000'
Benchmark 1: target/release/fibvec_naive_indexing 1000000000
Time (mean ± σ): 3.530 s ± 0.056 s [User: 1.452 s, System: 2.027 s]
Range (min … max): 3.450 s … 3.616 s 10 runs

Benchmark 2: target/release/fibvec_clever_indexing 1000000000
Time (mean ± σ): 3.111 s ± 0.058 s [User: 1.037 s, System: 2.038 s]
Range (min … max): 3.039 s … 3.207 s 10 runs

Benchmark 3: target/release/fibvec_iterator 1000000000
Time (mean ± σ): 2.763 s ± 0.057 s [User: 0.666 s, System: 2.078 s]
Range (min … max): 2.686 s … 2.847 s 10 runs

Summary
'target/release/fibvec_iterator 1000000000' ran
1.13 ± 0.03 times faster than 'target/release/fibvec_clever_indexing 1000000000'
1.28 ± 0.03 times faster than 'target/release/fibvec_naive_indexing 1000000000'

And it does provide a nice uplift on ARM as well:

$ hyperfine 'target/release/fibvec_naive_indexing 1000000000' 'target/release/fibvec_clever_indexing 1000000000' 'target/release/fibvec_iterator 1000000000'
Benchmark 1: target/release/fibvec_naive_indexing 1000000000
Time (mean ± σ): 3.324 s ± 0.024 s [User: 1.160 s, System: 2.154 s]
Range (min … max): 3.285 s … 3.354 s 10 runs

Benchmark 2: target/release/fibvec_clever_indexing 1000000000
Time (mean ± σ): 3.257 s ± 0.022 s [User: 1.112 s, System: 2.136 s]
Range (min … max): 3.232 s … 3.297 s 10 runs

Benchmark 3: target/release/fibvec_iterator 1000000000
Time (mean ± σ): 2.968 s ± 0.025 s [User: 0.782 s, System: 2.175 s]
Range (min … max): 2.929 s … 3.011 s 10 runs

Summary
'target/release/fibvec_iterator 1000000000' ran
1.10 ± 0.01 times faster than 'target/release/fibvec_clever_indexing 1000000000'
1.12 ± 0.01 times faster than 'target/release/fibvec_naive_indexing 1000000000'

Fortunately this isn’t some sort of iterator secret sauce — a program with the same structure but using indexing with optimizer hints is just as fast.

But notice that we had to significantly change how we implement the computation! Iterators are very handy if you are writing code from scratch, and you totally should use them — but they can be a pain to retrofit into existing code. And some patterns cannot be expressed with iterators at all!

Update: after the release of this article the standard library documentation has been updated with the instructions for emulating windows_mut(). An example calculating Fibonacci numbers can be found here.

And finally, I’ve used a for loop with an iterator here, but in this case the compiler can miss some “obvious” optimizations. If you instead call .for_each() on the iterator, the compiler should optimize the code better. This is especially relevant if you have a chain of iterator adapters, something like .skip().filter().take() or even longer.
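
For illustration, here is what driving such a chain with .for_each() looks like (a generic sketch, not code from the cookbook repository):

fn sum_of_some_numbers(data: &[u64]) -> u64 {
    let mut sum: u64 = 0;
    data.iter()
        .skip(2)
        .filter(|&&x| x % 2 == 0)
        .take(100)
        .for_each(|&x| sum += x); // internal iteration often optimizes better than a `for` loop
    sum
}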

So if you find yourself writing long iterator chains, it might be worth benchmarking it against an index-based implementation with optimizer hints, like the one I’ve described earlier. Or like the following…

Put an assert in front of the hot loop

Let’s make use of this function we’ve written.

Now that we have a Vec full of Fibonacci numbers, we can write a function that checks if another Vec also has Fibonacci numbers simply by comparing the two. If our Vec is small enough to fit into the CPU cache, this could be faster than doing the math over and over!

A naive implementation could look like this:
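
A sketch of the shape of that function (reference holds the known-good Fibonacci numbers; the exact code is in the repository):

#[inline(never)]
fn is_fibonacci(candidate: &[u64], reference: &[u64]) -> bool {
    for i in 0..candidate.len() {
        // `reference[i]` needs a bounds check on every iteration:
        // nothing here guarantees the two slices have the same length
        if candidate[i] != reference[i] {
            return false;
        }
    }
    true
}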

see here for the full code

Let’s check the assembly:

Oh no — the bounds checks are back! The is_fibonacci() function has them in the hot loop!

We have to check bounds here, because we don’t know the lengths of either of these slices in advance. It’s required for correctness! But what we can do is perform the bounds check only once outside the loop, instead of for every element, which will make the cost negligible.

Let’s make sure the sizes are the same before we enter the loop, and use the trick of iterating only up to .len() from earlier:
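
A sketch of that version, slicing both inputs to a provably equal length before the loop:

#[inline(never)]
fn is_fibonacci(candidate: &[u64], reference: &[u64]) -> bool {
    if candidate.len() != reference.len() {
        return false;
    }
    let len = candidate.len();
    // the only bounds checks happen here, once, outside the loop
    let candidate = &candidate[..len];
    let reference = &reference[..len];
    for i in 0..candidate.len() {
        if candidate[i] != reference[i] {
            return false;
        }
    }
    true
}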

see here for the full code

Et voila, no more bounds checks inside the hot loop:

That’s it, that’s all the assembly attributed to the indexing line now!

This can also be achieved with an iterator, but realistically you’ll just use the == operator to compare the slices. The code is contrived - showing off the optimization technique is what’s important here.

I’ve sped up the jpeg-decoder crate using this approach, so check that out if you want to see applications to real-world code. There I used assert!s instead of slicing, but the principle is the same.

The great thing about this approach is that you cannot have too many assert!s — the redundant ones will be removed by the compiler, just like the bounds checks!

Inlining propagates constraints across functions

So you’ve used your shiny new is_fibonacci function for a while, and decided to split comparing elements into its own function for reuse:
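
A sketch of that refactoring (the function names follow the article, the bodies are my reconstruction):

#[inline(never)] // so that we can inspect its assembly separately
fn elements_are_equal(a: &[u64], b: &[u64], index: usize) -> bool {
    a[index] == b[index]
}

#[inline(never)]
fn is_fibonacci(candidate: &[u64], reference: &[u64]) -> bool {
    if candidate.len() != reference.len() {
        return false;
    }
    for i in 0..candidate.len() {
        if !elements_are_equal(candidate, reference, i) {
            return false;
        }
    }
    true
}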

see here for the full code

And now the bounds checks are back! The elements_are_equal function is a separate entity and cannot make any assumptions about the way it is called (or at least when it has #[inline(never)] on it).

Inlining is when the compiler copies the contents of the function to the place where it’s being called, instead of just putting a function call there. (Inlining is its own rabbit hole that goes pretty deep, just like CPU cache.)

We use #[inline(never)] on functions whose assembly we want to view, so that they don’t get inlined and become part of another function in the generated code. (While it is theoretically possible to attribute the code to inlined functions, cargo asm doesn’t do that yet.)

Instead we’re going to use #[inline(always)] on elements_are_equal() to make sure its contents are copied into is_fibonacci() and the two are optimized together. That gets us the benefits of them being separate in the source code while remaining a single entity for the optimizer, so that knowledge of index constraints propagates across functions.
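
The only change is the attribute on the helper (same sketch as above):

#[inline(always)] // was #[inline(never)]
fn elements_are_equal(a: &[u64], b: &[u64], index: usize) -> bool {
    a[index] == b[index]
}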

We’ve swapped #[inline(never)] for #[inline(always)] and the bounds checks should be gone! Let’s verify:

$ cargo asm --rust --bin comparison_split_inline elements_are_equal
Error: No matching functions, try relaxing your search request

Right, we can’t view the assembly of elements_are_equal() because it no longer exists in the assembly as a separate function.

But we can still check the assembly of is_fibonacci and verify that it worked! The bounds checks are gone again!

Out in the real world I’ve sped up the rand crate by 7% with a few assert!s and an #[inline(always)] — the same techniques as shown here.

Let’s see how much of a difference this optimization actually made here:

$ hyperfine --warmup 3 --min-runs 20 'target/release/comparison_realistic 100000000 100000000' 'target/release/comparison_naive 100000000 100000000' 'target/release/comparison_clever 100000000 100000000' 'target/release/comparison_iterator 100000000 100000000'
Benchmark 1: target/release/comparison_realistic 100000000 100000000
Time (mean ± σ): 729.8 ms ± 13.5 ms [User: 193.8 ms, System: 532.5 ms]
Range (min … max): 711.9 ms … 748.4 ms 20 runs

Benchmark 2: target/release/comparison_naive 100000000 100000000
Time (mean ± σ): 739.8 ms ± 12.8 ms [User: 206.5 ms, System: 529.1 ms]
Range (min … max): 725.9 ms … 761.7 ms 20 runs

Benchmark 3: target/release/comparison_clever 100000000 100000000
Time (mean ± σ): 736.2 ms ± 13.0 ms [User: 210.0 ms, System: 521.7 ms]
Range (min … max): 719.1 ms … 761.6 ms 20 runs

Benchmark 4: target/release/comparison_iterator 100000000 100000000
Time (mean ± σ): 734.6 ms ± 10.9 ms [User: 201.8 ms, System: 528.3 ms]
Range (min … max): 724.3 ms … 760.7 ms 20 runs

Summary
'target/release/comparison_realistic 100000000 100000000' ran
1.01 ± 0.02 times faster than 'target/release/comparison_iterator 100000000 100000000'
1.01 ± 0.03 times faster than 'target/release/comparison_clever 100000000 100000000'
1.01 ± 0.03 times faster than 'target/release/comparison_naive 100000000 100000000'

Hm. Hardly any difference; this is below 1% and might as well be noise. I had to add a warmup and crank up the number of runs just to get past the noise.

Okay, we probably should have answered an important question about this function before we started optimizing it:

Does this function even account for a large enough portion of the execution time to be worth optimizing?

Aside: Profiling

If we make a function twice faster, but it only accounted for 2% of the execution time of the program, we’ve only sped up the program by 1%!

We can use a profiler to find where time is spent in the program, and so which function would be a good target for optimization.

Profiling languages that compile to native code, such as Rust, is remarkably poorly documented. There’s a multitude of tools in various states of disrepair, most of which only work on a single OS, so the landscape can be difficult to navigate and is filled with gotchas. So here’s a guide for doing this with modern tools!

As a teaser, here’s the result we’re going to get:

[Image: the Firefox Profiler UI]

Yes, I know it’s not very readable. Click it!

See? It’s a whole interactive UI for viewing profiles, right in the browser! What you’re looking at is Firefox Profiler — which actually works in any browser, and it’s one of the best profiler UIs I’ve ever used.

The killer feature is sharing the results in two clicks.

I cannot overstate how awesome the sharing feature is, and how much easier it makes communicating performance results. If you include a profile in a bug report about performance, it saves so much time for both you and whoever is going to end up working on fixing it!

Interpreting profiles

What we’re looking at is a flame graph. Each horizontal bar represents a function, and it’s as wide as the function’s share of execution time.

The bars are stacked on top of each other; the bar on top is called by the function represented by a bar directly below it.

The yellow bars are in userspace, and the orange bars are in the kernel.

For example, I can tell that main calls fibonacci_vec which in turn calls into the kernel, which does something with “pages” — that must be referring to memory pages. We’re creating a Vec there, so this must be memory allocation.

So apparently 51% of the time of this program is spent allocating memory, and another 9% is spent deallocating it! Actually, hyperfine was already telling us earlier that we spend a lot of time in the kernel — it reports user and sys times, with user being our program and sys being the kernel.

Note that the order of the bars in a flame graph is meaningless, it only shows the aggregate time spent in a given function. If you want to see how the program execution actually went instead of viewing aggregates, switch to the “Stack Chart” tab:

This shows that the calls into the kernel from fibonacci_vec are spread evenly across the execution time. Apparently the kernel gradually provisions memory instead of serving us a big allocation up front when we request it. The deallocation at the end, however, happens in a single chunk.

Modern operating systems provision memory only when it’s actually being written to, which is what we’re seeing here. That’s also why you can try to allocate 100 TB of RAM and the initial allocation call will succeed — but your process will get killed if you try to actually write to all of that.

Capturing execution profiles

Here’s how to create one of those beautiful graphs for your own code.

First off, the profiler needs debug symbols to attribute chunks of machine code to functions. To enable debug symbols, add this to your Cargo.toml:

[profile.release]
debug = true

If you don’t do this, the compiler will generate very limited debug symbols, which will in turn provide very little visibility — although you’ll still get some info even if you haven’t done this.

The other steps are unfortunately platform-specific:

Profiling on macOS

There’s a convenient profiler that shows results in Firefox Profiler:

cargo install samply
samply record target/release/comparison_naive 100000000 100000000

This will record the profile and open the web browser with the results.

Profiling on Linux

Linux has a very mature and flexible profiler. The downside is that it’s a bit too flexible and it can be difficult to figure out the right flags.

Here’s how to record a profile:

# --call-graph=dwarf tells the profiler to use the debug symbols we added
sudo perf record --call-graph=dwarf target/release/comparison_naive 100000000 100000000

Now that you’ve recorded a profile, we can use samply to view it:

cargo install samply
sudo chown "$USER" perf.data # because we recorded the profile with `sudo`
samply load perf.data

This will open Firefox Profiler with the results in your web browser.

Seeing into the OS kernel

There is one thing Samply cannot do yet, and that’s reporting on where the time is spent inside the kernel. This is only needed if Samply is showing wide orange bars on the flame graph, and you want to understand what they stand for.

In this case use perf script to visualize the profile. It is slower and less accurate, but lets you see into the kernel as well:

sudo perf script -F +pid > profile.perf

Now go to profiler.firefox.com and upload the resulting profile.perf file, and you should see the same profiling UI showing the results.

The profiles I showed earlier were captured using this method.

Profiling on Windows

Here’s the list of good free profiling tools available on Windows:

(This section was intentionally left blank by Microsoft).

Intel VTune and AMD uProf are free of charge, but are not particularly great — the UIs are clunky, and it may be difficult to get them to work in the first place (e.g. you may have to change some settings in the BIOS).

Fortunately, you can just use the Linux instructions in WSL2 or any other VM with Linux.

The results

So the question we wanted to answer was:

Does is_fibonacci() even account for a large enough portion of the execution time to be worth optimizing?

If you open my profile, you can see in the “Flame Graph” or “Call Tree” views that the program spent 13% of the time in is_fibonacci, and if you subtract all the kernel time from fibonacci_vec it accounts for 23% of the time.

Since the execution times of these two functions are roughly comparable, yet the overall benchmark barely moved, eliminating bounds checks must indeed have sped up is_fibonacci itself very little.

To reiterate, seeing only a slight boost from eliminating bounds checks is the typical outcome. It’s the 15% improvement that’s the anomaly!

With that out of the way, let’s look at one final technique for dealing with bounds checks.

What if I know literally nothing about the index?

So far we’ve relied on some knowledge we had about the constraints of our index, or had a loop where we could check bounds once before it instead of doing it on every iteration. But what if…

What if you know absolutely nothing about the index?

Let’s say the index comes from untrusted user input, and you can assume absolutely nothing about it. You have to perform a bounds check. There is no other way. But the function is rather hot… what if you could speed it up? It might not gain you much, but then again it might, and it’s just one bounds check and how inescapable can it really be?!

You end up thinking about your conundrum at night. This bounds check haunts your dreams. Taunting you. You wake up in cold sweat, and resolve to do something about it. You try panic = abort in Cargo.toml, and maybe it helps or maybe not, it could be just noise and benchmarks are a lie and oh god why did you decide to go into programming?!

Fear not, for I shall deliver you from this predicament.

We’re going to create. The. Cheapest. Possible. Bounds. Check.

Observe.

The. Cheapest. Possible. Bounds. Check.

So you have your lookup table for Fibonacci numbers, and a function nth_fibonacci() that simply performs a lookup by index.

Easy peasy, except the function must perform a bounds check because its inputs are completely unpredictable:
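
A sketch of such a function:

#[inline(never)]
fn nth_fibonacci(lookup_table: &[u64], index: usize) -> u64 {
    // `index` is untrusted input, so the bounds check
    // (and the panic path behind it) cannot be optimized away
    lookup_table[index]
}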

Look at the assembly! Just, ugh. A single line, creating so much of it:

Since we know nothing about the index, we have to perform a bounds check. But we can do it on our own terms.

The branching instructions created by ifs (and by bounds checks) can be quite expensive, as far as CPU instructions go; we’ve seen their impact on benchmarks already. What if there is a cheaper alternative?

Let’s recall why we need bounds checks in the first place: we want to make sure all lookups are confined to the Vec or slice we’re working with - because if they aren’t, we get Heartbleed or worse. But the panic on an invalid access that regular bounds checks provide is not strictly mandatory; as long as all accesses are confined to the slice we’re working with, we’re good.

So technically we could use the modulo operator, written as %, to confine all accesses to the slice, like this:

fibonacci[i % fibonacci.len()]

This won’t return an error on an invalid access like a regular bounds check would, it will just silently return an incorrect result… but is it cheaper?

Unfortunately the % operator is also an expensive instruction, but there is one very special case we can take advantage of. If the divisor is a constant and is known to be a power of two, any compiler worth its bytes will optimize it into a bitwise AND, and bitwise operations are very cheap.

Our lookup table only holds 100 numbers, which is not a power of two, but we can extend it to the nearest power of two with dummy values to make it all work out — we’ve already accepted wrong results on invalid accesses, so we might as well go all the way!

And so our code becomes…
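
A sketch of the final version, with the table padded from 100 entries up to 128, the nearest power of two:

const TABLE_LEN: usize = 128; // 100 real entries, padded with dummy values up to a power of two

#[inline(never)]
fn nth_fibonacci(lookup_table: &[u64; TABLE_LEN], index: usize) -> u64 {
    // the length is a power-of-two constant, so `%` compiles down to a bitwise AND,
    // and the result is always in bounds: no branch, no panic path
    lookup_table[index % TABLE_LEN]
}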

see here for the full code

Let’s check the assembly…

Yes!

We’ve done it!

This is it! The cheapest possible bounds check!

We have inflated the memory usage slightly and no longer report errors on invalid accesses, but we’ve achieved the goal of speeding up our function, even when we could assume nothing about the index.

Whether it’s worth the trade-offs… that’s for you and your benchmarks to decide.

Anti-patterns

Finally I want to take a look at some patterns I keep seeing people write over and over, even though they hurt them instead of helping them.

debug_assert! instead of assert!
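
The pattern looks roughly like this (a hypothetical checksum function for illustration):

fn checksum(data: &[u8]) -> u8 {
    debug_assert!(data.len() >= 3); // compiled out in release mode!
    data[0] ^ data[1] ^ data[2]
}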

debug_assert! is removed in release mode. Ironically, this makes the code slower: the compiler now cannot assume anything about the length of the input, and has to insert all three bounds checks in release mode!

Do this instead:
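
Here the same hypothetical function uses a real assert! that survives into release builds:

fn checksum(data: &[u8]) -> u8 {
    assert!(data.len() >= 3); // kept in release mode: the three accesses below need no checks
    data[0] ^ data[1] ^ data[2]
}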

As a rule of thumb, assert!s checking lengths help performance rather than hinder it. They are also very difficult to overdo — any redundant ones will be removed by the optimizer, just like bounds checks!

If in doubt, use assert! instead of debug_assert!.

Or, if you want to be super-duper extra sure you’re not doing extra work:
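
With the same hypothetical function, slice the input down to exactly the length you need up front:

fn checksum(data: &[u8]) -> u8 {
    let data = &data[..3]; // a single bounds check here...
    data[0] ^ data[1] ^ data[2] // ...and none on these constant indices
}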

This also only performs one bounds check instead of three.

Unsafe indexing by a constant value
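
The pattern in question looks roughly like this (a hypothetical function; the names are illustrative):

fn read_u32_le(data: &[u8]) -> u32 {
    // undefined behavior if `data` has fewer than 4 elements!
    unsafe {
        u32::from_le_bytes([
            *data.get_unchecked(0),
            *data.get_unchecked(1),
            *data.get_unchecked(2),
            *data.get_unchecked(3),
        ])
    }
}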

This unsafe doesn’t make your code any faster. In fact, in debug mode this is slower than the safe version because get_unchecked creates function call overhead, and optimizations that would remove it are disabled.

Do this instead:
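
The safe version of the same hypothetical function slices to a known length first, then indexes by constants:

fn read_u32_le(data: &[u8]) -> u32 {
    let data = &data[..4]; // one bounds check; the constant indices below compile to plain loads
    u32::from_le_bytes([data[0], data[1], data[2], data[3]])
}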

The compiler always optimizes indexing by a constant into a slice of known length in release mode. And in debug mode this is even faster than unsafe.

Parting thoughts

Hopefully this menagerie of techniques for dealing with bounds checks will serve you well, and you will never have to resort to get_unchecked and risk introducing code execution vulnerabilities into your code.

If you would like to practice these techniques, you can search GitHub for unsafe indexing and see if you can convert it into safe code without regressing performance.

Happy hacking!
