Faster Progress Report 1
It's been a busy few weeks with faster! Faster has a test suite, "just works" on (almost) all architectures and CPUs, and has many more intrinsics and functions available.
New License
Faster has been relicensed under the MPL 2.0. This lets you use faster in any of your projects, including proprietary ones, provided you keep faster free.
Breaking Changes
Faster, being a bleeding-edge library tackling a novel concept, is bound to have tons of breaking changes before it stabilizes. Here are the recent ones:
- `PackedIterator` is now an iterator over scalars instead of vectors. Use `PackedIterator::simd_map` to map over vectors and scalars, and `PackedIterator::next_vector` to manually extract the next vector from the iterator.
- All collections are now assumed to be uneven, because assuming your data fits into your system's vector is a really bad idea.
- `PackedTransmute::be_f32s` and `PackedTransmute::be_f64s` are now unsafe, and have been renamed to `be_f32s_unchecked` and `be_f64s_unchecked`, respectively, because of their potential to generate signaling NaNs. Huge thanks to BurntSushi for pointing out this issue!
- `[Saturating]Packed{Hadd, Hsub}` now return interleaved results, rather than in the Intel-defined order. This makes them compatible with all architectures and vector sizes.
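To see what the new ordering buys, here's a scalar sketch using plain arrays; `hadd_intel` and `hadd_interleaved` are illustrative names for this post, not faster's API. Intel's PHADD packs all sums from the first operand before the second, so the layout depends on the vector width; interleaving the sums keeps the layout width-agnostic.

```rust
// Scalar sketch of horizontal add orderings on two 4-lane vectors.
// Intel's order: all pair-sums from `a`, then all pair-sums from `b`.
fn hadd_intel(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]
}

// Interleaved order: alternate pair-sums from `a` and `b`, so the
// result layout doesn't depend on how wide the platform's vectors are.
fn hadd_interleaved(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    [a[0] + a[1], b[0] + b[1], a[2] + a[3], b[2] + b[3]]
}
```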
New Intrinsics
Many intrinsics have gained implementations for more vector sizes and types. However, a few more intrinsics were added outright:
As{u8s, u16s, u32s, u64s, i8s, i16s, i32s, i64s, f32s, f64s}
Casting has been implemented. You can now quickly and efficiently convert between floats and integers of the same width, and between signed/unsigned integers of the same width.
```rust
let floaty_threes = u32s::splat(3).as_f32s();
```
Upcast
Upcasting has been implemented for almost all vector types! It takes a vector and returns a 2-tuple of vectors containing the upcast numbers. For example:
```rust
let (threes, more_threes) = u32s::splat(3).upcast();
```
Here, `threes` contains the first half of the upcast vector, and `more_threes` contains the second half. Both are of type `u64s`.
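For intuition, here's a scalar sketch of the same semantics on a hypothetical 4-lane `u32` vector (plain std Rust, not faster's API): each half of the input widens into its own output vector.

```rust
// Scalar sketch of upcast semantics: a 4-lane u32 vector becomes a
// 2-tuple of 2-lane u64 vectors, first half then second half.
fn upcast_u32x4(v: [u32; 4]) -> ([u64; 2], [u64; 2]) {
    ([v[0] as u64, v[1] as u64], [v[2] as u64, v[3] as u64])
}
```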
Downcast
Downcasting is implemented for many of the integer vectors, but not for float vectors. In the same way that upcasting doubles the width of your elements and therefore returns two vectors, downcasting takes two vectors and returns one. Continuing from the previous example:
```rust
let (threes, fours) = u32s::halfs(3, 4).upcast();
let downcast = threes.saturating_downcast(fours);
assert_eq!(downcast, u32s::halfs(3, 4));
```
Downcasting for floats will be implemented soon. It requires a few additional instructions on Intel, sadly.
I'm unsure about whether I want to call it `downcast` or `downcast_with` in the final version. Hit up the issue tracker with your best bikeshedding.
I also plan to implement a checked downcast which returns a `Result<T>`. PRs are welcome, because I have tons of things ahead of this on my list.
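A scalar sketch of what such a checked downcast might look like, using std's `TryInto` on each lane; `checked_downcast` is a hypothetical name, not part of faster's current API. Unlike the saturating version, which clamps out-of-range lanes, this fails if any lane doesn't fit.

```rust
use std::convert::TryInto;
use std::num::TryFromIntError;

// Hypothetical scalar sketch: downcast two u64 "vectors" into one u32
// "vector", failing if any lane is out of range for u32.
fn checked_downcast(a: [u64; 2], b: [u64; 2]) -> Result<[u32; 4], TryFromIntError> {
    Ok([
        a[0].try_into()?,
        a[1].try_into()?,
        b[0].try_into()?,
        b[1].try_into()?,
    ])
}
```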
SaturatingAdd
This works in exactly the same way as `i32::saturating_add`, but can operate on an entire vector at once. Neat!
SaturatingSub
This works in exactly the same way as `i32::saturating_sub`, but can operate on an entire vector at once. Also neat!
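To make the semantics concrete, here's a scalar sketch (plain std Rust, not faster's API) of lane-wise saturating arithmetic; faster performs the same operation on the whole vector at once.

```rust
// Scalar sketch of lane-wise saturating arithmetic on a 4-lane vector.
fn saturating_add_lanes(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for i in 0..4 {
        out[i] = a[i].saturating_add(b[i]); // clamps at i32::MAX / i32::MIN
    }
    out
}

fn saturating_sub_lanes(a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for i in 0..4 {
        out[i] = a[i].saturating_sub(b[i]);
    }
    out
}
```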
SaturatingHadd
Exactly like `hadd`, except the addition is saturating.
SaturatingHsub
Exactly like `hsub`, except the subtraction is saturating.
All of these intrinsics have tests and polyfills, so go play around with them!
Uneven Collections
Faster is supposed to work on any platform, regardless of the size of its vectors. This means we can't make any assumptions about whether your data can be operated on using only SIMD operations. When we last spoke, we actually didn't have any support for collections which didn't exactly fit into your system's vectors. Now, we do!
```rust
let thirteen_fives = [3.0f32; 13].simd_iter().simd_map(
    |vector| vector + f32s::splat(2.0),
    |scalar| scalar + 2.0).scalar_collect();
```
Faster still has a few things it could do better on this front, which I plan to implement in the future:
- For operations which are 1:1, do the operation on a partially-full vector instead of on all the remaining scalars
- Allow users to elide the scalar function when it does exactly the same thing as the vector function (like in the above example)
Tests
I have been tirelessly adding tests to the intrinsic modules and ensuring both the intrinsics and their polyfills behave correctly and identically. More than half of the intrinsics have comprehensive tests now!
Polyfills
You can now use almost any intrinsic on any platform - even ones which don't support SIMD at all! I've added highly optimized scalar alternatives to almost every intrinsic, and I will have a polyfill for every intrinsic I add going forward.
Currently, AddSub and Rsqrt are missing polyfills. For those interested in getting their hands dirty with the project, check out the polyfills for Hadd and Recip and maybe try beating me to implementing them!
Vector Initialization
All vector types have two new functions, `halfs` and `interleave`. They allow one to initialize a vector to a pattern, like {1, 1, 2, 2} or {1, 2, 1, 2}, respectively. Neither is vectorized, both are very non-portable, and they will probably be made private soon, but they are very useful for testing. The tests for Upcast and Downcast are written with `halfs` and `interleave`.
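A scalar sketch of the two patterns on a hypothetical 4-lane vector (illustrative only; the real functions construct faster's vector types):

```rust
// halfs(a, b): first half of the lanes get `a`, second half get `b`.
fn halfs(a: i32, b: i32) -> [i32; 4] {
    [a, a, b, b]
}

// interleave(a, b): lanes alternate between `a` and `b`.
fn interleave(a: i32, b: i32) -> [i32; 4] {
    [a, b, a, b]
}
```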
`core` Support
Faster is committed to supporting all architectures and platforms - even those without a memory allocator. Now, you can use `IntoScalar::scalar_fill(&mut [T])` to fill a stack array with the results of your computations.
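The shape of the idea, as an allocation-free scalar sketch: results go into a caller-provided stack buffer instead of being collected into a `Vec`. (`fill_doubled` is a hypothetical function for this post; the real API hangs `scalar_fill` off faster's iterators.)

```rust
// Allocation-free sketch: write computed results into a stack buffer
// supplied by the caller, rather than collecting into a Vec.
fn fill_doubled(input: &[f32], out: &mut [f32]) {
    for (o, x) in out.iter_mut().zip(input.iter()) {
        *o = x * 2.0;
    }
}
```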
Upcoming Features
Gathers will be released as soon as stdsimd 0.0.4 comes out; I added the intrinsics to stdsimd a few days ago and have a portable interface and polyfills ready to go. This should make many computations like matrix determinants extremely easy to implement. Sneak peek:
```rust
fn determinant(matrices: &[f64]) -> Vec<f64> {
    // Input: Many 3x3 matrices of the form [a b c ... i a b c ... i ...]
    matrices.simd_iter().stripe(9).multizip().simd_map(|(a, b, c, d, e, f, g, h, i)| {
        (a * e * i) + (b * f * g) + (c * d * h) - (c * e * g) - (b * d * i) - (a * f * h)
    }).scalar_collect()
}
```
Scatters, the opposite of gathers, aren't available on consumer chips yet, so I will be polyfilling those after I add gathers. Once I have some AVX512-capable silicon in my hands, I will vectorize it.
With the new iterator overhaul, we can add more ways to iterate, like `simd_reduce`. This should let us implement functions like `strcmp` with ease.
Faster currently assumes that you have some kind of vector available on your system. However, that seriously cuts into our ability to support the Intel 8086, Apple A4, and PDP-11. Therefore, I'll be adding support for non-SIMD architectures very soon.
I'll write docs and tutorials as soon as I finish all of the above features and have 100% polyfill and test coverage. After that, runtime feature detection?
With my current rate of work, I should have all of the above done in a few months. If you'd like to see that sped up, consider contributing to faster itself or stdsimd. Happy hacking!