Initial Thought on Vectors and SIMD


Prelude

Recently, after learning about ray tracers and vector maths, and hearing about SIMD everywhere, I became interested in implementing a small linear algebra library as a holiday project. Currently, the library only contains a basic 2-dimensional vector, implemented in both scalar and SIMD formats (likely not very optimal at all).

Here’s the repository if you want to check it out.

Scalar Implementation

Before implementing the SIMD version, I wrote a scalar version to act as a baseline for measuring the SIMD version's improvements. The i and j components are two f64s, chosen for their higher precision, and the struct is laid out as follows:

#[derive(Debug, Copy, Clone)]
struct Vec2 {
    i: f64,
    j: f64,
}

Even implementing this scalar version allowed me to learn new things, such as the possibility of a fake cross product for 2-dimensional vectors (the determinant of the 2×2 matrix they form), the non-deterministic nature of trigonometric functions, and how fused multiply-add (FMA, or num_traits::ops::mul_add() in Rust) is faster than an unfused multiply-then-add on CPUs with an FMA instruction.
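For instance, the fake cross product pairs nicely with FMA. Here is a minimal sketch against the scalar Vec2 above, using the standard library's f64::mul_add; it is illustrative, not the repository's actual code:

impl Vec2 {
    // 2D "cross product": the z-component of the equivalent 3D cross
    // product, i.e. the determinant of the 2×2 matrix formed by the
    // two vectors. (Sketch only; the name is illustrative.)
    fn cross(self, rhs: Vec2) -> f64 {
        // mul_add computes self.i * rhs.j - self.j * rhs.i with a
        // single rounding, letting the compiler emit an FMA
        // instruction where the CPU supports one.
        self.i.mul_add(rhs.j, -(self.j * rhs.i))
    }
}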

SIMD Implementation

Confidence

At the start, I was quite naïve about implementing SIMD, as I walked in the door with basically no prior knowledge. What I had in mind was that SIMD is just faster: I simply had to use SIMD types such as f64x2, and I would get an automatic performance boost. After I moved everything from the scalar implementation over to SIMD, I realised how wrong I was.

Realisation

The struct for the SIMD implementation looked like this:

#[repr(transparent)]
#[derive(Debug, Copy, Clone)]
pub struct Vec2(f64x2);

#[repr(transparent)] is a new attribute I learnt: it guarantees that the struct has exactly the same layout and ABI as its single f64x2 field, so the compiler can treat a Vec2 just like a bare f64x2. The Rust compiler just keeps amazing me…
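A quick way to convince yourself of that guarantee is to check the layout directly; a small sketch on nightly Rust, assuming the portable_simd feature the library already uses:

#![feature(portable_simd)]
use std::simd::f64x2;

#[repr(transparent)]
#[derive(Debug, Copy, Clone)]
pub struct Vec2(f64x2);

fn main() {
    // The wrapper is guaranteed to have the same size and alignment
    // as the f64x2 it wraps.
    assert_eq!(std::mem::size_of::<Vec2>(), std::mem::size_of::<f64x2>());
    assert_eq!(std::mem::align_of::<Vec2>(), std::mem::align_of::<f64x2>());
}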

So I thought it would be easy: I basically changed every .i and .j access in the scalar version to .0[0] and .0[1] in this version. After all this, I went to benchmark the two with criterion, and the results surprised me.
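For context, a criterion benchmark looks roughly like this (a hypothetical sketch; the bench_dot name and the dot method are illustrative, not the repository's actual benchmark code):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Hypothetical: assumes the library's Vec2 with a `new` constructor
// and a `dot` method.
fn bench_dot(c: &mut Criterion) {
    let a = Vec2::new(1.0, 2.0);
    let b = Vec2::new(3.0, 4.0);
    // black_box keeps the compiler from constant-folding the inputs away.
    c.bench_function("vec2 dot", |bench| {
        bench.iter(|| black_box(a).dot(black_box(b)))
    });
}

criterion_group!(benches, bench_dot);
criterion_main!(benches);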

For quite a few functions there was indeed a decrease in execution time, but what intrigued me was that most functions requiring field accesses were much slower, up to 50% slower than the scalar implementation, in functions such as this:

#[repr(transparent)]
#[derive(Debug, Copy, Clone)]
pub struct Transform2((f64x2, f64x2));

impl Mul<Vec2> for Transform2 {
    type Output = Vec2;

    fn mul(self, rhs: Vec2) -> Self::Output {
        // Naïve approach: extract every lane and multiply as scalars.
        // self.0.0 is the first row [a, b], self.0.1 the second [c, d].
        Vec2::new(
            self.0.0[0] * rhs.0[0] + self.0.0[1] * rhs.0[1],
            self.0.1[0] * rhs.0[0] + self.0.1[1] * rhs.0[1],
        )
    }
}

I used this formula because, well, this is what I was taught at school:

$$
\begin{bmatrix} x' \\ y' \end{bmatrix}
= \begin{bmatrix} a & b \\ c & d \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
= \begin{bmatrix} ax + by \\ cx + dy \end{bmatrix}
$$

At first, I thought it had to do with my scalar implementation having tons of impl const and const fn, as I was messing around with a lot of Rust's nightly features. It was only afterwards that I realised that accessing a single lane of a SIMD vector is quite costly: it requires some sort of masking and shuffling to obtain the desired result, and it is especially slow on ARM64 machines (I'm running a MacBook Pro with an M4) because of the gap between the integer and vector registers, which likely causes stalls in execution.

Therefore, after some consultation with ChatGPT, I found a more optimised solution to this kind of problem, avoiding the lane extractions entirely.

impl Mul<Vec2> for Transform2 {
    type Output = Vec2;

    fn mul(self, rhs: Vec2) -> Self::Output {
        // Multiply each row by rhs lane-wise in the vector unit,
        // then horizontally sum each product with reduce_sum.
        Vec2(Simd::from_array([
            (self.0.0 * rhs.0).reduce_sum(),
            (self.0.1 * rhs.0).reduce_sum(),
        ]))
    }
}
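As far as I can tell, the difference is that the per-lane multiply now stays entirely in the vector unit, and reduce_sum can lower to a single horizontal add (faddp on ARM64), so nothing has to round-trip through the general-purpose registers mid-computation.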

This brings a 10% decrease in execution time.

Graph 1: Performance difference in naïve and proper SIMD approach for Transform2 * Vec2

For the more extreme case of Transform2 * Transform2, which had 8 field extractions in the naïve approach, the same change brought a 48% decrease in execution time, which is quite a big leap.

Graph 2: Performance difference in naïve and proper SIMD approach for Transform2 * Transform2
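For reference, Transform2 * Transform2 can also be written without any lane extractions by using simd_swizzle!, which broadcasts a lane while staying in the vector registers. This is an illustrative sketch of that idea, not necessarily the exact code in the repository:

use std::simd::simd_swizzle;

impl Mul<Transform2> for Transform2 {
    type Output = Transform2;

    fn mul(self, rhs: Transform2) -> Self::Output {
        let (a, b) = self.0; // rows of the left matrix
        let (c, d) = rhs.0;  // rows of the right matrix
        // Each output row is a linear combination of rhs's rows;
        // simd_swizzle! broadcasts a lane in-register instead of
        // extracting it to a scalar and splatting it back.
        Transform2((
            simd_swizzle!(a, [0, 0]) * c + simd_swizzle!(a, [1, 1]) * d,
            simd_swizzle!(b, [0, 0]) * c + simd_swizzle!(b, [1, 1]) * d,
        ))
    }
}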

Conclusion

From these experiments, it is quite amazing how such large performance gains can come from a simple tweak to which data type is used and how the operations are carried out. I'm only just starting to touch on what SIMD is, and I hope I can learn more about this amazing field in the future.

  • Title: Initial Thought on Vectors and SIMD
  • Author: KVZ
  • Created at: 2025-12-14 23:11:43
  • Updated at: 2025-12-15 16:30:16
  • Link: https://kvznmx.com/2025/12/14/Initial-Thought-on-Writing-Vectors-and-SIMD/
  • License: This work is licensed under CC BY-NC-SA 4.0.