Faster Progress Report 1
It's been a busy few weeks with faster! Faster has a test suite, "just works" on (almost) all architectures and CPUs, and has many more intrinsics and functions available.
New License
Faster has been relicensed under the MPL 2.0. This lets you use faster in any of your projects, including proprietary ones, provided you keep faster free.
Breaking Changes
Faster, being a bleeding-edge library tackling a novel concept, is bound to have tons of breaking changes before it stabilizes. Here are the recent ones:
- `PackedIterator` is now an iterator over scalars instead of vectors. Use `PackedIterator::simd_map` to map over vectors and scalars, and `PackedIterator::next_vector` to manually extract the next vector from the iterator.
- All collections are now assumed to be uneven, because assuming your data fits into your system's vector is a really bad idea.
- `PackedTransmute::be_f32s` and `PackedTransmute::be_f64s` are now unsafe, and have been renamed to `be_f32s_unchecked` and `be_f64s_unchecked`, respectively, because of their potential to generate signaling NaNs. Huge thanks to BurntSushi for pointing out this issue!
- `[Saturating]Packed{Hadd, Hsub}` now return interleaved results, rather than in the Intel-defined order. This makes them compatible with all architectures and vector sizes.
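To see what the new ordering buys, here's a scalar sketch using plain arrays; `hadd_intel` and `hadd_interleaved` are illustrative names for this post, not faster's API. Intel's PHADD packs all sums from the first operand before the second, so the layout depends on the vector width; interleaving the sums keeps the layout width-agnostic.

```rust
// Scalar sketch of horizontal add orderings on two 4-lane vectors.
// Intel's order: all pair-sums from `a`, then all pair-sums from `b`.
fn hadd_intel(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]
}

// Interleaved order: alternate pair-sums from `a` and `b`, so the
// result layout doesn't depend on how wide the platform's vectors are.
fn hadd_interleaved(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    [a[0] + a[1], b[0] + b[1], a[2] + a[3], b[2] + b[3]]
}
```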
New Intrinsics
Many intrinsics have gained implementations for more vector sizes and types. However, a few more intrinsics were added outright:
As{u8s, u16s, u32s, u64s, i8s, i16s, i32s, i64s, f32s, f64s}
Casting has been implemented. You can now quickly and efficiently convert between floats and integers of the same width, and between signed/unsigned integers of the same width.
```rust
let floaty_threes = u32s::splat(3).as_f32s();
```
Upcast
Upcasting has been implemented for almost all vector types! It takes a vector and returns a 2-tuple of vectors containing the upcast numbers. For example:
```rust
let (threes, more_threes) = u32s::splat(3).upcast();
```
Here, `threes` contains the first half of the upcast vector, and `more_threes` contains the second half. Both are of type `u64s`.
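For intuition, here's a scalar sketch of the same semantics on a hypothetical 4-lane `u32` vector (plain std Rust, not faster's API): each half of the input widens into its own output vector.

```rust
// Scalar sketch of upcast semantics: a 4-lane u32 vector becomes a
// 2-tuple of 2-lane u64 vectors, first half then second half.
fn upcast_u32x4(v: [u32; 4]) -> ([u64; 2], [u64; 2]) {
    ([v[0] as u64, v[1] as u64], [v[2] as u64, v[3] as u64])
}
```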
Downcast
Downcasting is implemented for many of the integer vectors, but not for float vectors. In the same way that upcasting doubles the width of your elements and therefore returns two vectors, downcasting takes two vectors and returns one. Continuing from the previous example:
```rust
let (threes, fours) = u32s::halfs(3, 4).upcast();
let downcast = threes.saturating_downcast(fours);
assert_eq!(downcast, u32s::halfs(3, 4));
```
Downcasting for floats will be implemented soon. It requires a few additional instructions on Intel, sadly.
I'm unsure about whether I want to call it `downcast` or `downcast_with` in the final version. Hit up the issue tracker with your best bikeshedding.
I also plan to implement a checked downcast which returns a `Result<T>`. PRs are welcome, because I have tons of things ahead of this on my list.
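A scalar sketch of what such a checked downcast might look like, using std's `TryInto` on each lane; `checked_downcast` is a hypothetical name, not part of faster's current API. Unlike the saturating version, which clamps out-of-range lanes, this fails if any lane doesn't fit.

```rust
use std::convert::TryInto;
use std::num::TryFromIntError;

// Hypothetical scalar sketch: downcast two u64 "vectors" into one u32
// "vector", failing if any lane is out of range for u32.
fn checked_downcast(a: [u64; 2], b: [u64; 2]) -> Result<[u32; 4], TryFromIntError> {
    Ok([
        a[0].try_into()?,
        a[1].try_into()?,
        b[0].try_into()?,
        b[1].try_into()?,
    ])
}
```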
SaturatingAdd
This works in exactly the same way as `i32::saturating_add`, but can operate on an entire vector at once. Neat!
SaturatingSub
This works in exactly the same way as `i32::saturating_sub`, but can operate on an entire vector at once. Also neat!
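To make the semantics concrete, here's a scalar sketch (plain std Rust, not faster's API) of lane-wise saturating arithmetic; faster performs the same operation on the whole vector at once.

```rust
// Scalar sketch of lane-wise saturating arithmetic on a 4-lane vector.
fn saturating_add_lanes(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for i in 0..4 {
        out[i] = a[i].saturating_add(b[i]); // clamps at i32::MAX / i32::MIN
    }
    out
}

fn saturating_sub_lanes(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for i in 0..4 {
        out[i] = a[i].saturating_sub(b[i]);
    }
    out
}
```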
SaturatingHadd
Exactly like `hadd`, except the addition is saturating.
SaturatingHsub
Exactly like `hsub`, except the subtraction is saturating.
All of these intrinsics have tests and polyfills, so go play around with them!
Uneven Collections
Faster is supposed to work on any platform, regardless of the size of its vectors. This means we can't make any assumptions about whether your data can be operated on using only SIMD operations. When we last spoke, we actually didn't have any support for collections which didn't exactly fit into your system's vectors. Now, we do!
```rust
let thirteen_fives = [3.0f32; 13].simd_iter().simd_map(
    |vector| vector + f32s::splat(2.0),
    |scalar| scalar + 2.0).scalar_collect();
```
Faster still has a few things it could do better on this front, which I plan to implement in the future:
- For operations which are 1:1, do the operation on a partially-full vector instead of on all the remaining scalars
- Allow users to elide the scalar function when it does exactly the same thing as the vector function (like in the above example)
Tests
I have been tirelessly adding tests to the intrinsic modules and ensuring both the intrinsics and their polyfills behave correctly and identically. More than half of the intrinsics have comprehensive tests now!
Polyfills
You can now use almost any intrinsic on any platform - even ones which don't support SIMD at all! I've added highly optimized scalar alternatives to almost every intrinsic, and I will have a polyfill for every intrinsic I add going forward.
Currently, AddSub and Rsqrt are missing polyfills. For those interested in getting their hands dirty with the project, check out the polyfills for Hadd and Recip and maybe try beating me to implementing them!
Vector Initialization
All vector types have two new functions, `halfs` and `interleave`. They allow one to initialize a vector to a pattern, like {1, 1, 2, 2} or {1, 2, 1, 2}, respectively. Neither is vectorized, both are very non-portable, and they will probably be made private soon, but they are very useful for testing. The tests for Upcast and Downcast are written with `halfs` and `interleave`.
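A scalar sketch of the two patterns on a hypothetical 4-lane vector (illustrative only; the real functions construct faster's vector types):

```rust
// halfs(a, b): first half of the lanes get `a`, second half get `b`.
fn halfs(a: i32, b: i32) -> [i32; 4] {
    [a, a, b, b]
}

// interleave(a, b): lanes alternate between `a` and `b`.
fn interleave(a: i32, b: i32) -> [i32; 4] {
    [a, b, a, b]
}
```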
`core` Support
Faster is committed to supporting all architectures and platforms - even those without a memory allocator. Now, you can use `IntoScalar::scalar_fill(&mut [T])` to fill a stack array with the results of your computations.
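The shape of the idea, as an allocation-free scalar sketch: results go into a caller-provided stack buffer instead of being collected into a `Vec`. (`fill_doubled` is a hypothetical function for this post; the real API hangs `scalar_fill` off faster's iterators.)

```rust
// Allocation-free sketch: write computed results into a stack buffer
// supplied by the caller, rather than collecting into a Vec.
fn fill_doubled(input: &[f32], out: &mut [f32]) {
    for (o, x) in out.iter_mut().zip(input.iter()) {
        *o = x * 2.0;
    }
}
```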
Upcoming Features
Gathers will be released as soon as stdsimd 0.0.4 comes out; I added the intrinsics to stdsimd a few days ago and have a portable interface and polyfills ready to go. This should make many computations like matrix determinants extremely easy to implement. Sneak peek:
```rust
fn determinant(matrices: &[f64]) -> Vec<f64> {
    // Input: Many 3x3 matrices of the form [a b c ... i a b c ... i ...]
    matrices.simd_iter().stripe(9).multizip().simd_map(|(a, b, c, d, e, f, g, h, i)| {
        (a * e * i) + (b * f * g) + (c * d * h) - (c * e * g) - (b * d * i) - (a * f * h)
    }).scalar_collect()
}
```
Scatters, the opposite of gathers, aren't available on consumer chips yet, so I will be polyfilling those after I add gathers. Once I have some AVX512-capable silicon in my hands, I will vectorize it.
With the new iterator overhaul, we can add more ways to iterate, like `simd_reduce`. This should let us implement functions like `strcmp` with ease.
Faster currently assumes that you have some kind of vector available on your system. However, that seriously cuts into our ability to support the Intel 8086, Apple A4, and PDP-11. Therefore, I'll be adding support for non-SIMD architectures very soon.
I'll write docs and tutorials as soon as I finish all of the above features and have 100% polyfill and test coverage. After that, runtime feature detection?
With my current rate of work, I should have all of the above done in a few months. If you'd like to see that sped up, consider contributing to faster itself or stdsimd. Happy hacking!