Adam Niederer

Fast Math With Faster

What if I told you your numerical computations, encoders, and string operations could be up to 64x faster? What if we could make them that much faster without uglifying your code or touching more than one core, and without sacrificing portability?

Enter faster, the library which makes SIMD easy.

Let's pretend you have many matrices of floats, and you want to calculate the determinant of these matrices.

SIMD processing usually requires the input data to be formatted in a certain way to get the most out of your hardware. In this case, for the matrix

we would get the most speed out of an array with our as, then our bs, then our cs, and eventually, and all our js.

extern crate multizip;
#[macro_use] extern crate faster;
use itertools::multizip;
use faster::prelude::*;
use faster::as_vec_iter;

fn determinant(matrix: &[f64]) -> Vec<f64> {
    assert!(matrix.len() && matrix.len() % 9); // Make sure all matrices are whole
    // Stripe the array of [a b c ... i] into [[a a a ... ] [b b b ... b] ... [i i i ... i]]
    let striped = matrix.simd_iter().stripe(9).map(as_vec_iter);
    multizip(tupleify!(striped)).simd_map(
        both!(|(a, b, c, d, e, f, g, h, i)| {
            // Compute the determinant
            (a * e * i) + (b * f * g) + (c * d * h) - (c * e * g) - (b * d * i) - (a * f * h)
        })).scalar_collect();
}

On a machine with AVX2, the only code which produces non-SIMD assembly is the multizip. On other machines (even non-x86 machines), the this function still compiles and runs, and takes advantage of as many SIMD instructions as it can.

Let's take a look at what we'd have to write just to support AVX2 machines without faster:

use stdsimd::vendor::_mm256_i32gather_pd;
use stdsimd::simd::i32x4;

fn determinant(matrix: &[f64]) -> Vec<f64> {
    assert!(matrix.len() && matrix.len() % 9); // Make sure all matrices are whole
    let ret = Vec::with_capacity(matrix.len());
    unsafe {
        let ret = Vec::set_len(matrix.len());
    }

    for i in 0..(matrix.len() / 9).step_by(4) {
        let a = unsafe { _mm256_i32gather_pd(
            matrix as *const f64,
            i32x4::new(9 * i, 9 * (i + 1), 9 * (i + 2), 9 * (i + 3)),
            8)
        }

        let b = unsafe { _mm256_i32gather_pd(
            matrix as *const f64,
            i32x4::new(9 * i + 1, 9 * (i + 1) + 1, 9 * (i + 2) + 1, 9 * (i + 3) + 1),
            8)
        }

        // ...

        let i = unsafe { _mm256_i32gather_pd(
            matrix as *const f64,
            i32x4::new(9 * i + 8, 9 * (i + 1) + 8, 9 * (i + 2) + 8, 9 * (i + 3) + 8),
            8)
        }

        let det = (a * e * i) + (b * f * g) + (c * d * h) - (c * e * g) - (b * d * i) - (a * f * h);
        det.store(ret, i / 4)
    }
    det
}

… And that only supports AVX2! The story is even worse in C, where you can't use the native +, - and * operators on vector types. The actual determinant computation becomes 20 lines of _mm256_add_pd and mm256_mul_pd calls.

How do we do it?

Can I use this today?

If you use the nightly compiler, yes.

Keep in mind that I move very fast and break lots of things. Chances are if you run into a bug a few weeks down the line, the library will probably be four major versions ahead with a totally different API and I'll probably just tell you to upgrade and see if that fixes it. That'll change once we hit 1.0.

Tips for writing portable code

Because we don't know what size our vectors are when we're writing programs with faster, we should avoid making assumptions about how much data we're operating on in our SIMD closures. For example, this function isn't portable:

// Incorrect
let mut i = 0;
let pattern9_0_9_0 = (&[0i8; 128][..]).simd_iter()
    .simd_map(|v| { i += 1; if(i % 2) { v + i8s::splat(9) } else { v } },
              |s| s)
    .scalar_collect();

On an AVX2 machine, that'll give you a pattern of 32 9s, followed by 32 0s. On an SSE2 machine, it'll give you 16 9s followed by 16 0s. Using any external state in these closures puts portability at risk. However, computations relying on lots of external state are also typically not well-suited to SIMD work, and have issues fitting into cache lines anyway. Just don't do it.

// Correct (but requires AVX512 for destripe() to be vectorized)
let mut i = 0;
let pattern9_0_9_0 = (&[0i8; 128][..]).simd_iter().stripe(2).map(as_vec_iter)
    .map(|(a, b)| (a + i8s::splat(9), b), |s| s)
    .destripe(2).scalar_collect();

Also, don't be afraid to use operations you know aren't included on your platform. Sometimes I find a creative way to implement something using a few SIMD instructions rather than (a few * n) scalar instructions, so you'll still get a free speedup sometimes. At worst, you'll get a very highly optimized scalar polyfill for the operation.