We’ve taken a look at some basic theory for doing division in binary. In this post we implement the algorithm in Rust, benchmark how it performs and consequently apply a small optimization.

- Part 1 - Division using Long Division
- Part 2 - Division using Two’s Complement
- Part 3 - A simple implementation of division in Rust

Let’s jump straight into it! For our first implementation we’ll use the two’s complement method we reviewed in the last post:

```
pub fn div(dividend: u32, divisor: u32) -> (u32, u32) {
// Get the two's complement of the number
let ones_complement = divisor ^ 0xFFFF_FFFF;
let twos_complement = ones_complement + 1;
// Keep subtracting from the dividend until we can't anymore
let mut working = dividend;
let mut quotient = 0;
while divisor <= working {
working = working.wrapping_add(twos_complement);
quotient += 1;
}
let remainder = working; // Leftover
(quotient, remainder)
}
```

As expected, we get the two’s complement by reversing the bits (using an exclusive or) and adding one. We then continue looping and incrementing the quotient until we cannot subtract from the dividend without overflow.

This is a pretty naive implementation, however it works:

```
#[cfg(test)]
mod tests {
use super::div;
#[test]
fn naive_div() {
assert_eq!(div(100, 4), (25, 0));
assert_eq!(div(100, 3), (33, 1));
}
}
```

Understandably it performs worse as the final quotient gets bigger:

```
#![feature(test)]
extern crate division;
extern crate test;
use division::div;
#[bench]
fn div_hundred(b: &mut ::test::Bencher) {
b.iter(|| {
let result = div(100, 3);
::test::black_box(result);
});
}
#[bench]
fn div_million(b: &mut ::test::Bencher) {
b.iter(|| {
let result = div(1_000_000, 3);
::test::black_box(result);
});
}
#[bench]
fn div_billion(b: &mut ::test::Bencher) {
b.iter(|| {
let result = div(1_000_000_000, 3);
::test::black_box(result);
});
}
```

On my machine, this is incredibly slow. Once it finally finishes it gives the following stats:

```
running 3 tests
test div_hundred ... bench: 22 ns/iter (+/- 3)
test div_million ... bench: 96,739 ns/iter (+/- 8,381)
test div_billion ... bench: 96,974,170 ns/iter (+/- 2,048,390)
test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out
```

That doesn’t look too great! Let’s see if we can squeeze some better performance from this…

We can get some much better performance based upon how binary numbers behave within a fixed sized type. With some simple optimizations I get the following benchmarked results:

```
running 3 tests
test div_billion ... bench: 55 ns/iter (+/- 3)
test div_hundred ... bench: 55 ns/iter (+/- 4)
test div_million ... bench: 55 ns/iter (+/- 4)
test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out
```

That is some impressive improvement. It’s slightly slower for `div_hundered`

however the rest are significantly faster. Let’s take a look at the code and then evaluate the results we’re seeing:

```
pub fn div2(dividend: u32, divisor: u32) -> (u32, u32) {
// Get the two's complement of the number
let ones_complement = divisor ^ 0xFFFF_FFFF;
let twos_complement = ones_complement + 1;
// Start off the quotient with the dividend
let mut quotient = dividend;
let mut remainder = 0u32;
// Loop through the number of bits in the number (32)
for _ in 0..32 {
let carry = quotient >> 31;
quotient = quotient << 1;
remainder = remainder << 1;
remainder = remainder | carry;
let test = remainder.wrapping_add(twos_complement);
if (test & 0x8000_0000) == 0 {
remainder = test;
quotient |= 1;
}
}
(quotient, remainder)
}
```

As you can see, one of the major wins we get with this implementation is that we only loop 32 times as opposed to looping $q$ times. Of course, there are further optimizations we can achieve by early exiting on 1, when divisor = dividend etc but rather than jump into those… how is this even working?

This optimization is exploiting how binary numbers are represented in a fixed type system. I think that the easiest way to understand this is to go through a side by side example. To simplify, we’ll use a 4 bit number example: $\dfrac{9}{2} = 4, r = 1$ or $\dfrac{1001}{0010} = 0100, r = 0001$

As you can see from the code, we start the algorithm off with the quotient being set to the dividend, and the remainder set to zero. We then start an $n$ sized loop whereby $n$ is the number of bits of the number we’re dividing - in this example: 4.

Let’s go through each iteration one by one and see what is happening…

$q = 9 (1001), r = 0 (0000)$

The first thing we’re doing is bit shifting the quotient and the remainder to the left. If you recall, bit shifting to the left is equivalent to *doubling* the number. Of course, in an $n$ bit number representation we may end up overflowing. With this algorithm, when we overflow the quotient we add one to the remainder, and when we overflow the remainder - we throw it away.

So if we double both the quotient and the remainder we get $q = 18, r = 0$. $18$ can’t be represented in a 4 bit system so we subtract $16$ so that $q = 2$ and add one to the remainder so that $r = 1$.

In binary, this is a bit easier to comprehend: we shift all the bits of the quotient and the remainder to the left and if a bit “overflows” from the quotient, we transfer it to the remainder: i.e. $q = 0010, r = 0001$

The next part of this algorithm is a test… we check if we can subtract the divisor from the remainder. In this case you cannot subtract $2$ from $1$ without going into negative therefore this test fails.

The algorithm of course is doing this check via adding the two’s complement and checking if the left most digit is $0$, thus indicating a positive number.

$q = 2 (0010), r = 1 (0001)$

We start off with the bit shift left / doubling of the number. For the quotient, doubling $2$ gives us $4$ which is perfectly fine. For the remainder $1$ doubled is $2$ which is also acceptable. In binary we of course have $q = 0100, r = 0010$.

We then perform the test again. In this case $2$ subtracts from $2$ perfectly! Because of this, we add $1$ to the quotient and set the remainder to the result of the test (i.e. the leftover). In this case we set $r$ to $0$.

Again, if we think about this from a binary perspective: once we’ve added the two’s complement to the remainder we detect that the result is positive. From this, we can assume that the “subtraction” was successful and consequently add one to the quotient (via bitwise OR) and set the remainder to the leftover.

$q = 5 (0101), r = 0 (0000)$

If we double/bit shift $5 (0101)$ we get $10 (1010)$. The remainder is $0$ so we don’t do anything.

Performing the divisor test: we can’t subtract $2$ from $0$ so we continue on.

$q = 10 (1010), r = 0 (0000)$

Doubling/bit shifting $10 (1010)$ results in $20 (10100)$ which of course has overflowed. So we subtract $16$ and move the bit (or add one) to the remainder making $q = 4 (0100)$ and $r = 1$.

If we perform the final test: we can see that the divisor does not subtract from the remainder cleanly therefore we stop our algorithm.

Consequently, we’re left with the result of our division which is: $q = 4, r = 1$.

When utilizing binary numbers using a fixed size type system we’re able to achieve some optimizations for division that are difficult to achieve in other bases. Of course, this comes at a price - namely the numeric range that can be represented within that sized type.

We achieved a major performance boost from the naive implementation by simply rejigging the algorithm to exploit our type system. In the current form this isn’t all that useful, therefore in the next session we’ll take a look at how to logically represent decimal numbers with fractions and how we can consequently perform division on these.

For reference, all code examples above can be found on Github.