Adam Niederer

Faster Progress Report 2

It's raining code! 0.4.0 is faster's biggest, most useful release yet.

faster began as a yak shave, created to aid base💯 in its quest to become the fastest meme on Github. Writing an explicit AVX2-accelerated version of base💯's encoder and decoder, then realizing I'd have to do the same thing again to see the speedups on my Ivy Bridge desktop, pushed me to make this library. Months later, it has blossomed into its own project, and has eclipsed base💯 in both popularity and promise.

Faster 0.4.0 contains numerous improvements, including no_std support, vector patterns, striping, zipping, swizzling, vector merging, and a host of quality-of-life and performance improvements.

In the following sections, we'll explore some of these improvements in more detail.

no_std Support

We can compile on core now! All functions which return a Vec are disabled when compiling for core, but you get the same great API and the same speedups. Faster performs no allocations except in functions which return owned types, so switching from std to core doesn't affect any of the remaining functionality.
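
Here's a minimal sketch of what that looks like in practice. The Cargo feature setup is an assumption, but the iterator calls are the same ones used throughout this post:

// In Cargo.toml (assumed): faster = { version = "0.4", default-features = false }
#![no_std]
use faster::*;

// No Vec in sight: scalar_fill writes its results into a caller-provided
// buffer, so nothing here allocates.
fn double_into(input: &[u32], out: &mut [u32]) {
    input.simd_iter().simd_map(u32s(0), |v| v * u32s(2)).scalar_fill(out);
}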

Vector Patterns

Thanks to the leaking of a top-secret implementation of procedural macros, faster has formalized its vector pattern constructors. This means halfs, interleave, and the new partition are an official part of the API! These patterns are fully vectorized and polyfilled, and are being used right now in the iterator and vector merging code.

The new function, partition, fills a vector with the first argument until the specified index, and then fills the remainder with the second argument. Thanks to the kind soul on the #rust IRC who helped me divine a vectorized algorithm for this!

(Yes, I know it's "halves". We're still unstable, don't worry.)
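
To make these concrete, here's roughly what each pattern produces. This sketch assumes four-lane vectors, and the halfs and partition constructors are written by analogy with the interleave and partitioned examples elsewhere in this post, so treat the exact signatures as approximate:

let a = i32s::interleave(1, 2);     // (1, 2, 1, 2)
let b = i32s::halfs(1, 2);          // (1, 1, 2, 2)
let c = i32s::partitioned(1, 2, 3); // (1, 1, 1, 2): the first argument up to index 3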

Striping

A non-vectorized implementation of gathers has landed. This lets you take a collection which normally wouldn't load into a vector and perform vector operations on it. A good example would be a collection where we wish to group every nth element into a vector, or a collection with some garbage elements in it.

There was a vectorized implementation, but it was too limited and not fast enough. In the future, I'd rather approach this with swizzling than with Intel's gather intrinsics.

While striping a collection does mean your loads will be scalar, it is still possible to outperform non-SIMD implementations, as seen in base💯's new SIMD-accelerated decoder. Although this feature is still a work in progress, it has already yielded serious improvements in expressive power.

// From base💯's decoder: each emoji is four bytes, and only the third and
// fourth bytes carry data, so we stripe the input into four iterators and
// ignore the first two.
buf.simd_iter().stripe_four().zip().simd_map(tuplify!(4, u8s(0)), |(_, _, c, d)| {
    ((c - u8s(143)) * u8s(64)) + d - u8s(128) - u8s(55)
}).scalar_fill(out)

Our determinant function sketch from the last progress report actually compiles and runs now!

fn determinant(matrices: &[f64]) -> Vec<f64> {
    // Input: Many 3x3 matrices of the form [a b c ... i a b c ... i ...]
    matrices.simd_iter().stripe_nine().zip().simd_map(tuplify!(9, f64s(0.0)), |(a, b, c, d, e, f, g, h, i)| {
        (a * e * i) + (b * f * g) + (c * d * h) - (c * e * g) - (b * d * i) - (a * f * h)
    }).scalar_collect()
}

Zipping

Faster can now add two vectors together! Two PackedIterators can now be operated on in lockstep, using an API similar to the one exposed by Rust's standard library. Just call zip on a tuple of PackedIterators to get a structure which returns a tuple containing each iterator's next vector. We can see its use in our determinant function:

fn determinant(matrices: &[f64]) -> Vec<f64> {
    // Input: Many 3x3 matrices of the form [a b c ... i a b c ... i ...]
    matrices.simd_iter().stripe_nine().zip().simd_map(tuplify!(9, f64s(0.0)), |(a, b, c, d, e, f, g, h, i)| {
        (a * e * i) + (b * f * g) + (c * d * h) - (c * e * g) - (b * d * i) - (a * f * h)
    }).scalar_collect()
}

stripe_nine returns a tuple of nine PackedIterators, which are then zipped. We can destructure the packed zipped iterator just like a scalar zipped iterator.

Swizzling

Faster now has two built-in ways to shuffle a vector: PackedSwizzle::flip() and PackedReendianize::swap_bytes(). flip will flip even and odd elements of your vector, starting at element 0:

assert_eq!(u8s::interleave(2, 1).flip(), u8s::interleave(1, 2));
assert_eq!(f64s::interleave(2.0, 1.0).flip(), f64s::interleave(1.0, 2.0));

swap_bytes, on the other hand, performs a vectorized endianness swap:

assert_eq!(u16s(2).swap_bytes(), u16s(2u16.swap_bytes()));
assert_eq!(i32s::interleave(30, 90).swap_bytes(),
           i32s::interleave(30i32.swap_bytes(), 90i32.swap_bytes()));

Helper functions present in Rust's standard library such as to_be are also supported:

let checksum = socket_read_buf.simd_iter()
    .simd_reduce(u8s(0), u8s(0), |acc, v| {
        acc + v.be_u32s().from_be()
    }).sum();

Vector Merging

Combining two vectors is now possible with faster. On x86, this uses the single-cycle blend intrinsics, which are much faster than masking and blitting vectors onto each other.

assert_eq!(u8s(20).merge_interleaved(u8s(5)), u8s::interleave(20, 5));
assert_eq!(i16s(-3).merge_partitioned(i16s(1), 2), i16s::partitioned(-3, 1, 2));

QoL Improvements

Mutable Closures

When designing faster, I initially thought it would be possible to completely abstract vector sizes from the user by relying on pure and referentially transparent computation. While this model is sufficiently powerful and usable most of the time, there are real use cases for mutable state in SIMD computations. Because of this, mutable closures are now usable in simd_map and simd_reduce.
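
For instance, a closure can now carry mutable state across vectors. A minimal sketch, reusing the calls shown elsewhere in this post (the counter is purely illustrative):

let mut vectors_seen = 0;
let doubled = (&[1, 2, 3, 4, 5u32][..]).simd_iter()
    .simd_map(u32s(0), |v| { vectors_seen += 1; v * u32s(2) })
    .scalar_collect();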

Shorthand Splat Syntax

Who knew you could export a function with the same name as a type? Now, instead of typing u8s::splat(0), you can simply call u8s(0) for the same effect. This works for all vector types!
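
For example, these pairs should be equivalent:

assert_eq!(u8s(0), u8s::splat(0));
assert_eq!(f64s(2.0), f64s::splat(2.0));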

Scalar Reduction

You can now reduce a vector into a scalar in the same way you can reduce an iterator of vectors into a vector. It's very possible to write nonportable code with this function (such as the included example), but it offers a ton of expressive power and should reduce cognitive load when the need to write such a thing arises.

// Sums every lane; with u8s(0x01) the result is the vector's lane count,
// which differs across platforms, hence nonportable.
u8s(0x01).scalar_reduce(0, |acc, v| acc + v)

Documentation

Tons of documentation and examples were added between 0.3.0 and 0.4.0. Check it out here!

Performance Improvements

Vectorized Upcast & Downcast & Min & Max

Thanks to continued improvements in stdsimd, upcasting, downcasting, elementwise minimums, and elementwise maximums are now vectorized on x86. Yay!
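
These are the same casting calls used by the examples later in this post; here's a rough sketch, with the widened element types inferred from that usage:

// upcast splits a vector into two vectors of the next-wider element type;
// downcast packs two of them back into a single narrower vector.
let v = u8s(200);
let (a, b) = v.upcast();         // two u16 vectors covering all of v's lanes
let round_trip = a.downcast(b);  // back to a single u8 vector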

Partial Vectors

Faster's iterators can now vectorize loads/stores of an entire collection, regardless of its size. Previously, each scalar which didn't fit into a vector was broadcast into a vector, sent through the SIMD closure, and coalesced back into a scalar. While this worked well with faster 0.3.0's more limited API, it was slow, inefficient, and didn't work with swizzling and striping.

Now, simd_map and simd_reduce take an additional parameter, default. On unevenly-sized collections, faster loads every full vector normally until it hits the end of the collection. Then, it loads the last Vector::width elements of the collection into a vector and writes default over any elements which were already loaded. It also returns the number of elements which hadn't already been loaded, so the defaulted positions can be ignored by mapping operations.

The only quirk of this system is that the last vector is "shifted right", so the last element of the collection is always at the highest index, and the vector may contain the default at index 0.

// Assuming 128-bit vectors:
let x = (&[1, 2, 3, 4, 5u32][..]).simd_iter()
    .simd_map(u32s(9), |v| { println!("{:?}", v); v })
    .scalar_collect();

// Output:
// u32x4(1, 2, 3, 4)
// u32x4(9, 9, 9, 5)

This means faster is now just as fast on 1023-element collections as it is on 1025-element collections. I'm also looking into ways to make this quirk less surprising for the end user, although I expect its impact to be minimal.

World Domination

Faster has reached a point where it is ready to be deployed in real code. To test faster in the real world, I have integrated it into foundational crates such as byteorder and bytecount, with very impressive results compared to the existing scalar or explicit simd implementations. I hope to see these improvements merged in shortly. Furthermore, I intend to continue sending improvements to libraries which would benefit from SIMD.

Here are some benchmarks from the bytecount pull request:

# `faster`
test bench_count_big_1000000_hyper ... bench:      30,959 ns/iter (+/- 27)

# `simd` crate (-C target-cpu=native; upstream)
test bench_count_big_1000000_hyper ... bench:      61,536 ns/iter (+/- 2,047)

# scalar implementation (-C target-cpu=native; upstream)
test bench_count_big_1000000_hyper ... bench:      53,060 ns/iter (+/- 49)

And some benchmarks from byteorder:

# faster
test slice_u16::write_big_endian    ... bench:      23,344 ns/iter (+/- 122) = 8567 MB/s
test slice_u32::write_big_endian    ... bench:      46,681 ns/iter (+/- 160) = 8568 MB/s
test slice_u64::write_big_endian    ... bench:     105,206 ns/iter (+/- 369) = 7604 MB/s

# scalar implementation (-C target-cpu=native; upstream)
test slice_u16::write_big_endian    ... bench:     147,829 ns/iter (+/- 269) = 1352 MB/s
test slice_u32::write_big_endian    ... bench:     112,241 ns/iter (+/- 652) = 3563 MB/s
test slice_u64::write_big_endian    ... bench:     108,404 ns/iter (+/- 571) = 7379 MB/s

Not bad, eh?

What's next?

Faster is scarily fast, and is often much faster than competing solutions which do not offer automatic SIMD iteration. However, its iteration algorithm is lazy, and every call to next must check whether the iterator has any more items in it. This is typically incongruous with how SIMD code is written, where all operations are performed in an unrolled loop.

Adding an eager implementation of simd_map should totally alleviate this, and should encourage LLVM to unroll the iteration, allowing faster to truly brush up against the runtimes of hand-tuned assembly.

Although simd_reduce is already eager, it still relies on the aforementioned iteration protocol. Unfortunately, Rust's trait system makes it quite difficult to add an optimized implementation for non-lazy iterators, so I may have to hold off on this until specialization lands.

Changing up the iteration algorithm also presents an excellent opportunity to add a "hard mode" API to faster, which allows a user to manually pack and unpack streams of types as they please. Sneak peek:

pub fn char_to_emoji<'a, 'b>(buf: &'a [u8], out: &'b mut [u8]) -> &'b [u8] {
    for (i, v) in buf.simd_iter().pack().enumerate() {
        let (a, b) = v.upcast();
        let third = ((a + u16s(55)) / u16s(64) + u16s(143))
            .downcast((b + u16s(55)) / u16s(64) + u16s(143));
        let fourth = ((v + u8s(55)) & u8s(0x3f)) + u8s(128);

        // Make some room for interleaving
        let (ta, tb) = third.upcast();
        let (fa, fb) = fourth.upcast();

        // Interleave third and fourth bytes
        let third_fourth_a = ta.swap_bytes().merge_interleaved(fa);
        let third_fourth_b = tb.swap_bytes().merge_interleaved(fb);

        // Make some more room for another interleaving
        let (tfa, tfb) = third_fourth_a.upcast();
        let (tfc, tfd) = third_fourth_b.upcast();

        // Interleave a constant 0xf09f with the third and fourth bytes,
        // and store into out buffer
        u32s(0xf09f0000).merge_interleaved(tfa).store(out, 128 * i);
        u32s(0xf09f0000).merge_interleaved(tfb).store(out, 128 * i + tfa.width());
        u32s(0xf09f0000).merge_interleaved(tfc).store(out, 128 * i + 2 * tfa.width());
        u32s(0xf09f0000).merge_interleaved(tfd).store(out, 128 * i + 3 * tfa.width());
    }
    out
}

That may look complicated, but it's a quarter of the size of this 100-line monstrosity (which only supports AVX2!).

This should be exactly as fast as traditional SIMD assembly, because the user is responsible for all of the iterative work, while faster will only provide a basic load/store interface and its full selection of vector operations.

I'd like to thank everybody for following the development of faster, filing issues, lending algorithm ideas, and maintaining a high quality of discourse about the library. If you'd like to join the community, check out the Github page.