Optimize Fixed-point Division #1

Open
opened 2024-10-15 18:18:46 +00:00 by Ghost · 3 comments

Fixed-point division currently uses the GBA's Div bios call. Unfortunately, this implementation is rather slow. An alternative implementation could use a lookup table to approximate the value. This article by Segger details an implementation that uses a 256 byte lookup table to compute the quotient of two unsigned 16-bit values. The same method could be applied to 24:8 fixed point numbers.

Currently, a division operation takes around ~281 cycles.

Fixed-point division currently uses the GBA's Div bios call. Unfortunately, this implementation is rather slow. An alternative implementation could use a lookup table to approximate the value. [This article by Segger](https://blog.segger.com/algorithms-for-division-part-3-using-multiplication/?mtm_kwd=Algorithms-3) details an implementation that uses a 256 byte lookup table to compute the quotient of two unsigned 16-bit values. The same method could be applied to 24:8 fixed point numbers. Currently, a division operation takes around ~281 cycles.
Ghost added the
Status
Abandoned
Kind/Performance
labels 2024-10-15 18:18:46 +00:00
Ghost removed the
Status
Abandoned
label 2024-10-15 19:06:47 +00:00
Author

Notation

Xa:b represents the integer X in fixed-point format with a integer bits, and b fraction bits, ie. X · 2-b

124:8 = 2-8
4213:19 = 42 · 2-19

Algorithm

Let U24:8 be the numerator and V24:8 > 0 be the denominator, both signed. Let z be the number of leading zero bits in V. Then, because V > 0, at least one bit is set and z < 32. Additionally, because V is positive, the highest bit of V must not be set. It follows that z > 0.

Let 1:7 be the bit normalized value of V to 8 bits (ie. the most significant 8 bits of V). Then:

1:7 = V · 10:24 / 10:z = (V << z) >> 24

We interpret as a 1:7 fixed point number. Because it was normalized, the highest bit (ones place) will always be set. As a result, 1:7 will be in the range [1,2). It follows that the reciprocal, 1 / ṽ1:7 , will be in the range [0.5, 1)

A modified version of this reciprocal will be stored in the lookup table. Let L0:16 be the lookup table entry at index 1:7 - 128. We subtract 128 to remove the highest bit from 1:7 because it is always set, so we can avoid storing 128 elements.

L0:16 = (2160:16 · 271:7) / ṽ1:7 = 0x10000 · 0x80 / ṽ1:7

Note that the element at index 0 will have a value 0x10000 which does not fit in 16 bits. To ensure every table entry fits into 16 bits, this value will be replaced with 0xFFFF, a close enough approximation. Using this table we can compute the normalized denominator to look up its reciprocal. The quotient then becomes a simple multiplication and shift (to remove coefficient stored in the table and undo normalization).

Q24:24 = ( U24:8 · L24:8 ) · ( 10:16 · 11:7 ) / 10:z = ( U24:8 · L24:8 ) >> (23 - z)

Calculating the quotient as Q24:8 requires another shift, that can be combined into the previous equation.

Q24:8 = Q24:24 · 10:16 = Q24:24 >> 16

Error Correction

This algorithm does not account for error correction. Further testing is required to determine if error correction is necessary. It may be worth the loss of precision for the increase in speed, or possibly having a separate fast divide function without error correction.

## Notation _X<sub>a:b</sub>_ represents the integer _X_ in fixed-point format with _a_ integer bits, and _b_ fraction bits, ie. _X &middot; 2<sup>-b</sup>_ _1<sub>24:8</sub> = 2<sup>-8</sup>_ _42<sub>13:19</sub> = 42 &middot; 2<sup>-19</sup>_ ## Algorithm Let _U<sub>24:8</sub>_ be the numerator and _V<sub>24:8</sub> \> 0_ be the denominator, both signed. Let _z_ be the number of leading zero bits in _V_. Then, because _V \> 0_, at least one bit is set and _z \< 32_. Additionally, because _V_ is positive, the highest bit of _V_ must not be set. It follows that _z \> 0_. Let _v&#771;<sub>1:7</sub>_ be the bit normalized value of _V_ to 8 bits (ie. the most significant 8 bits of _V_). Then: _v&#771;<sub>1:7</sub>_ = _V &middot; 1<sub>0:24</sub> / 1<sub>0:z</sub>_ = _(V << z) >> 24_ We interpret _v&#771;_ as a _1:7_ fixed point number. Because it was normalized, the highest bit (ones place) will always be set. As a result, _v&#771;<sub>1:7</sub>_ will be in the range [1,2). It follows that the reciprocal, _1 / v&#771;<sub>1:7</sub>_ , will be in the range [0.5, 1) A modified version of this reciprocal will be stored in the lookup table. Let _L<sub>0:16</sub>_ be the lookup table entry at index _v&#771;<sub>1:7</sub> - 128_. We subtract 128 to remove the highest bit from _v&#771;<sub>1:7</sub>_ because it is always set, so we can avoid storing 128 elements. _L<sub>0:16</sub>_ = _(2<sup>16</sup><sub>0:16</sub> &middot; 2<sup>7</sup><sub>1:7</sub>) / v&#771;<sub>1:7</sub> = 0x10000 &middot; 0x80 / v&#771;<sub>1:7</sub>_ Note that the element at index 0 will have a value 0x10000 which does not fit in 16 bits. To ensure every table entry fits into 16 bits, this value will be replaced with 0xFFFF, a close enough approximation. Using this table we can compute the normalized denominator to look up its reciprocal. The quotient then becomes a simple multiplication and shift (to remove coefficient stored in the table and undo normalization). _Q<sub>24:24</sub> = ( U<sub>24:8</sub> &middot; L<sub>24:8</sub> ) &middot; ( 1<sub>0:16</sub> &middot; 1<sub>1:7</sub> ) / 1<sub>0:z</sub> = ( U<sub>24:8</sub> &middot; L<sub>24:8</sub> ) >> (23 - z)_ Calculating the quotient as _Q<sub>24:8</sub>_ requires another shift, that can be combined into the previous equation. _Q<sub>24:8</sub> = Q<sub>24:24</sub> &middot; 1<sub>0:16</sub> = Q<sub>24:24</sub> >> 16_ ## Error Correction This algorithm does _not_ account for error correction. Further testing is required to determine if error correction is necessary. It may be worth the loss of precision for the increase in speed, or possibly having a separate fast divide function without error correction.
Author

The armv4t architecture does not have an instruction to compute the number of leading zeroes in an integer, so the method to calculate it is fairly slow. Maybe this could be avoided with multiple divisions using the lookup table?

The armv4t architecture does not have an instruction to compute the number of leading zeroes in an integer, so the method to calculate it is fairly slow. Maybe this could be avoided with multiple divisions using the lookup table?
Author

The armv4t architecture does not have an instruction to compute the number of leading zeroes in an integer, so the method to calculate it is fairly slow. Maybe this could be avoided with multiple divisions using the lookup table?

No, most likely not. The leading zeroes are needed when computing the normalized denominator. Replacing this with the addition of the individual bytes multiplied by a coefficient would result in an expression of the form:

U / (B_0 + B_1 ...)

Which would be difficult to compute. Maybe there are more efficient solutions, but this is acceptable for now.

> The armv4t architecture does not have an instruction to compute the number of leading zeroes in an integer, so the method to calculate it is fairly slow. Maybe this could be avoided with multiple divisions using the lookup table? No, most likely not. The leading zeroes are needed when computing the normalized denominator. Replacing this with the addition of the individual bytes multiplied by a coefficient would result in an expression of the form: U / (B_0 + B_1 ...) Which would be difficult to compute. Maybe there are more efficient solutions, but this is acceptable for now.
Sign in to join this conversation.
No description provided.