Fixed-point multiplication used an ARM inline assembly routine. This was
fast, but it caused odd inlining problems when used from Thumb-mode
code. This commit replaces the assembly routine with a C++
implementation that performs as well as or better than the assembly
routine in most cases. The C++ implementation is slightly
slower when called from Thumb-mode code because GCC inlines the
operation instead of calling a standalone ARM-mode routine placed in
IWRAM. That tradeoff is acceptable given the inlining fixes, improved
portability, and ARM-mode performance gains this change provides.
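
For reference, a portable fixed-point multiply can be sketched roughly
as below. This is only an illustration, not necessarily the code in
this commit: the Q16.16 layout and the names kFracBits and fixed_mul
are assumptions made for the example.

```cpp
#include <cstdint>

// Hypothetical Q16.16 layout (16 fractional bits); the real format used
// by the project may differ.
constexpr int kFracBits = 16;

// Multiply two fixed-point values. The operands are widened to 64 bits so
// the full product is kept, then shifted back down to the fixed-point
// scale. Right-shifting a negative value is arithmetic on GCC (and is
// guaranteed to be from C++20 onward).
constexpr std::int32_t fixed_mul(std::int32_t a, std::int32_t b) {
    const std::int64_t product = static_cast<std::int64_t>(a) * b;
    return static_cast<std::int32_t>(product >> kFracBits);
}
```

With this layout, fixed_mul(0x18000, 0x20000) (1.5 * 2.0) yields
0x30000 (3.0). Because the function is plain C++, the compiler is free
to inline it at each call site in either ARM or Thumb mode.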