Optimize Fixed-point Division #1
Loading…
x
Reference in New Issue
Block a user
No description provided.
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Fixed-point division currently uses the GBA's Div bios call. Unfortunately, this implementation is rather slow. An alternative implementation could use a lookup table to approximate the value. This article by Segger details an implementation that uses a 256 byte lookup table to compute the quotient of two unsigned 16-bit values. The same method could be applied to 24:8 fixed point numbers.
Currently, a division operation takes around ~281 cycles.
Notation
Xa:b represents the integer X in fixed-point format with a integer bits, and b fraction bits, ie. X · 2-b
124:8 = 2-8
4213:19 = 42 · 2-19
Algorithm
Let U24:8 be the numerator and V24:8 > 0 be the denominator, both signed. Let z be the number of leading zero bits in V. Then, because V > 0, at least one bit is set and z < 32. Additionally, because V is positive, the highest bit of V must not be set. It follows that z > 0.
Let ṽ1:7 be the bit normalized value of V to 8 bits (ie. the most significant 8 bits of V). Then:
ṽ1:7 = V · 10:24 / 10:z = (V << z) >> 24
We interpret ṽ as a 1:7 fixed point number. Because it was normalized, the highest bit (ones place) will always be set. As a result, ṽ1:7 will be in the range [1,2). It follows that the reciprocal, 1 / ṽ1:7 , will be in the range [0.5, 1)
A modified version of this reciprocal will be stored in the lookup table. Let L0:16 be the lookup table entry at index ṽ1:7 - 128. We subtract 128 to remove the highest bit from ṽ1:7 because it is always set, so we can avoid storing 128 elements.
L0:16 = (2160:16 · 271:7) / ṽ1:7 = 0x10000 · 0x80 / ṽ1:7
Note that the element at index 0 will have a value 0x10000 which does not fit in 16 bits. To ensure every table entry fits into 16 bits, this value will be replaced with 0xFFFF, a close enough approximation. Using this table we can compute the normalized denominator to look up its reciprocal. The quotient then becomes a simple multiplication and shift (to remove coefficient stored in the table and undo normalization).
Q24:24 = ( U24:8 · L24:8 ) · ( 10:16 · 11:7 ) / 10:z = ( U24:8 · L24:8 ) >> (23 - z)
Calculating the quotient as Q24:8 requires another shift, that can be combined into the previous equation.
Q24:8 = Q24:24 · 10:16 = Q24:24 >> 16
Error Correction
This algorithm does not account for error correction. Further testing is required to determine if error correction is necessary. It may be worth the loss of precision for the increase in speed, or possibly having a separate fast divide function without error correction.
The armv4t architecture does not have an instruction to compute the number of leading zeroes in an integer, so the method to calculate it is fairly slow. Maybe this could be avoided with multiple divisions using the lookup table?
No, most likely not. The leading zeroes are needed when computing the normalized denominator. Replacing this with the addition of the individual bytes multiplied by a coefficient would result in an expression of the form:
U / (B_0 + B_1 ...)
Which would be difficult to compute. Maybe there are more efficient solutions, but this is acceptable for now.