Faster Progress Report 1
It's raining code! 0.4.0 is faster's biggest, most useful release yet.
faster began as a yak shave, created to aid base💯 in its quest to become the fastest meme on GitHub. Writing an explicit AVX2-accelerated version of base💯's encoder and decoder, then realizing I'd have to do the same thing again to see the speedups on my Ivy Bridge desktop, pushed me to make this library. Months later, it has blossomed into its own project, and has eclipsed base💯 in both popularity and promise.
Faster 0.4.0 contains numerous improvements, including

- #![no_std] support
- Collection striping (Gathers)
- Vector merging
- Vectorized endianness operations
- Vector swizzling
- Lockstep Packed Iterators (Zipping)
- Vector patterns
- Vectorized iteration on uneven collections
- Vectorized min/max on SSE4.1
- Vectorized Upcast on SSE4.1
- FnMut closures in simd_map and simd_reduce
- Compound assignment operators for architectures without hardware SIMD
- Large vectors for architectures without hardware SIMD
- Downcast polyfills on more vector types
- Polyfills of [saturating_]{hadd,hsub} on more vector types
- A changelog
- Tons of documentation
- Bug fixes
In the following sections, we'll explore some of these improvements in more detail.
no_std Support
We can compile on core now! All functions which return a Vec are disabled when compiling for core, but you get the same great API and the same speedup. Faster performs no allocations, except for functions which return owned types, so switching from std to core doesn't affect any other library functionality.
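As a minimal sketch (my own example, not taken from faster's docs) of a no_std-friendly pattern: skip the Vec-returning scalar_collect and fill a caller-provided buffer with scalar_fill, as the decoder snippet later in this post also does.

use faster::*;

// Double every element of `input` into `out` without allocating.
fn double_all(input: &[u32], out: &mut [u32]) {
    input.simd_iter()
        .simd_map(u32s(0), |v| v + v)
        .scalar_fill(out);
}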
Vector Patterns
Thanks to the leaking of a top-secret implementation of procedural macros, faster has formalized its vector pattern constructors. This means halfs, interleave, and the new partition are an official part of the API! These patterns are fully vectorized and polyfilled, and are being used right now in the iterator and vector merging code.
The new function, partition, fills a vector with the first argument until the specified index, and then fills the remainder with the second argument. Thanks to the kind soul on the #rust IRC who helped me divine a vectorized algorithm for this!
(Yes, I know it's "halves". We're still unstable, don't worry.)
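Here's a rough sketch of what each pattern produces; the constructor spellings follow the asserts elsewhere in this post, and the lane values in the comments assume an 8-lane vector for brevity.

let a = u8s::halfs(1, 2);          // 1, 1, 1, 1, 2, 2, 2, 2
let b = u8s::interleave(1, 2);     // 1, 2, 1, 2, 1, 2, 1, 2
let c = u8s::partitioned(1, 2, 3); // 1, 1, 1, 2, 2, 2, 2, 2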
Striping
A non-vectorized implementation of gathers has landed. This lets you take a collection which normally wouldn't load into a vector and perform vector operations on it. A good example would be a collection where we wish to group every nth element into a vector, or a collection with some garbage elements in it.
There was a vectorized implementation, but it was too limited and not fast enough. I'd rather approach the solution with swizzling in the future, rather than using Intel's gather intrinsics.
While striping a collection does mean your loads will be scalar, it is still possible to outperform non-SIMD implementations, as seen in base💯's new SIMD-accelerated decoder. Although this feature is still a work in progress, it has already yielded serious improvements in expressive power.
buf.simd_iter().stripe_four().zip().simd_map(tuplify!(4, u8s(0)), |(_, _, c, d)| {
    ((c - u8s(143)) * u8s(64)) + d - u8s(128) - u8s(55)
}).scalar_fill(out)
Our determinant function sketch from the last progress report actually compiles and runs now!
fn determinant(matrices: &[f64]) -> Vec<f64> {
    // Input: Many 3x3 matrices of the form [a b c ... i a b c ... i ...]
    matrices.simd_iter().stripe_nine().zip().simd_map(tuplify!(9, f64s(0.0)), |(a, b, c, d, e, f, g, h, i)| {
        (a * e * i) + (b * f * g) + (c * d * h) - (c * e * g) - (b * d * i) - (a * f * h)
    }).scalar_collect()
}
Zipping
Faster can now add two vectors together! Two PackedIterators can now be operated on in lockstep, using a similar API to the one exposed by Rust's standard library. Just call zip on a tuple of PackedIterators to get a structure which returns a tuple containing each iterator's next vector. We can see its use in our determinant function:
fn determinant(matrices: &[f64]) -> Vec<f64> {
    // Input: Many 3x3 matrices of the form [a b c ... i a b c ... i ...]
    matrices.simd_iter().stripe_nine().zip().simd_map(tuplify!(9, f64s(0.0)), |(a, b, c, d, e, f, g, h, i)| {
        (a * e * i) + (b * f * g) + (c * d * h) - (c * e * g) - (b * d * i) - (a * f * h)
    }).scalar_collect()
}
stripe_nine returns a tuple of nine PackedIterators, which are then zipped. We can destructure the packed zipped iterator just like a scalar zipped iterator.
Swizzling
Faster now has two built-in ways to shuffle a vector: PackedSwizzle::flip() and PackedReendianize::swap_bytes(). flip will flip even and odd elements of your vector, starting at element 0:
assert_eq!(u8s::interleave(2, 1).flip(), u8s::interleave(1, 2));
assert_eq!(f64s::interleave(2.0, 1.0).flip(), f64s::interleave(1.0, 2.0));
Whereas swap_bytes will perform a vectorized endianness swap:
assert_eq!(u16s(2).swap_bytes(), u16s(2.swap_bytes()));
assert_eq!(i32s::interleave(30, 90).swap_bytes(),
           i32s::interleave(30.swap_bytes(), 90.swap_bytes()));
Helper functions present in Rust's standard library such as to_be are also supported:
let checksum = socket_read_buf.simd_iter()
    .simd_reduce(u8s(0), u8s(0), |acc, v| {
        acc + v.be_u32s().from_be()
    }).sum();
Vector Merging
Combining two vectors is now possible with faster. On x86, this uses the single-cycle blend intrinsics, which are much faster than masking and blitting vectors onto each other.
assert_eq!(u8s(20).merge_interleaved(u8s(5)), u8s::interleave(20, 5));
assert_eq!(i16s(-3).merge_partitioned(i16s(1), 2), i16s::partitioned(-3, 1, 2));
QoL Improvements
Mutable Closures
When designing faster, I initially thought it would be possible to completely abstract vector sizes from the user by relying on pure and referentially transparent computation. While this model is sufficiently powerful and usable most of the time, there are real use cases for mutable state in SIMD computations. Because of this, mutable closures are now usable in simd_map and simd_reduce.
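As a quick sketch (my own example, not from the docs), a closure passed to simd_map can now capture and update local state between vectors:

use faster::*;

let mut offset = u32s(0);
let shifted = (&[1u32, 2, 3, 4, 5, 6, 7, 8][..]).simd_iter()
    .simd_map(u32s(0), |v| {
        let out = v + offset;
        offset = offset + u32s(1); // mutating captured state requires FnMut
        out
    })
    .scalar_collect();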
Shorthand Splat Syntax
Who knew you could export a function with the same name as a type? Now, instead of typing u8s::splat(0), you can simply call u8s(0) for the same effect. This works for all vector types!
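Both forms construct the same splatted vector, so a quick sanity check along these lines should hold:

assert_eq!(u8s(0), u8s::splat(0));
assert_eq!(f64s(1.0), f64s::splat(1.0));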
Scalar Reduction
You can now reduce a vector into a scalar in the same way you can reduce an iterator of vectors into a vector. It's very possible to write nonportable code with this function (such as the included example), but it offers a ton of expressive power and should reduce cognitive load when the need to write such a thing arises.
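// Every lane is 1, so this evaluates to the vector's lane count, which
// differs across targets (the nonportable part).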
u8s(0x01).scalar_reduce(0, |acc, v| acc + v)
Documentation
Tons of documentation and examples were added between 0.3.0 and 0.4.0. Check it out here!
Performance Improvements
Vectorized Upcast & Downcast & Min & Max
Thanks to continued improvements in stdsimd, upcasting, downcasting, elementwise minimums, and elementwise maximums are now vectorized on x86. Yay!
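As a rough sketch of the casting half of this (based on the upcast and downcast calls that appear in the code later in this post), upcasting a u8 vector yields two u16 vectors covering its halves, and downcast packs them back together:

let (a, b) = u8s(1).upcast(); // two u16s vectors, every lane == 1
let packed = a.downcast(b);   // pack both halves back into one u8s vector
assert_eq!(packed, u8s(1));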
Partial Vectors
Faster's iterators can now vectorize loads/stores of an entire collection, regardless of its size. Previously, each scalar which didn't fit into a vector was broadcast into a vector, sent through the SIMD closure, and coalesced back into a scalar. While this worked well with faster 0.3.0's more limited API, it was slow, inefficient, and didn't work with swizzling and striping.
Now, simd_map and simd_reduce take an additional parameter, default. On uneven collections, faster will load every vector in the collection normally until it hits the end of the collection. Then, it loads the last Vector::width elements from the collection into the vector, and writes default over any elements which were already loaded. It also returns the number of elements which hadn't already been loaded, so the defaulted elements can be ignored by mapping operations.
The only quirk of this system is that the last vector is "shifted right", so the last element of the collection is always at the highest index, and the vector may contain the default at index 0.
// Assuming 128-bit vectors:
let x = (&[1, 2, 3, 4, 5u32][..]).simd_iter()
    .simd_map(u32s(9), |v| { println!("{:?}", v); v })
    .scalar_collect();
// Output:
// u32x4(1, 2, 3, 4)
// u32x4(9, 9, 9, 5)
This means faster is now just as fast on 1023-element collections as it is on 1025-element collections. I'm also looking into ways to make this less quirky for the end user, although I expect its possible impact to be minimal.
World Domination
Faster has reached a point where it is ready to be deployed in real code. To test faster in the real world, I have integrated it into foundational crates such as byteorder and bytecount, with very impressive results compared to the existing scalar or explicit SIMD implementations. I hope to see these improvements merged shortly. Furthermore, I intend to continue sending improvements to libraries which would benefit from SIMD.
Here are some benchmarks from the bytecount pull request:
# `faster`
test bench_count_big_1000000_hyper ... bench: 30,959 ns/iter (+/- 27)
# `simd` crate (-C target-cpu=native; upstream)
test bench_count_big_1000000_hyper ... bench: 61,536 ns/iter (+/- 2,047)
# scalar implementation (-C target-cpu=native; upstream)
test bench_count_big_1000000_hyper ... bench: 53,060 ns/iter (+/- 49)
And some benchmarks from byteorder:
# faster
test slice_u16::write_big_endian ... bench: 23,344 ns/iter (+/- 122) = 8567 MB/s
test slice_u32::write_big_endian ... bench: 46,681 ns/iter (+/- 160) = 8568 MB/s
test slice_u64::write_big_endian ... bench: 105,206 ns/iter (+/- 369) = 7604 MB/s
# scalar implementation (-C target-cpu=native; upstream)
test slice_u16::write_big_endian ... bench: 147,829 ns/iter (+/- 269) = 1352 MB/s
test slice_u32::write_big_endian ... bench: 112,241 ns/iter (+/- 652) = 3563 MB/s
test slice_u64::write_big_endian ... bench: 108,404 ns/iter (+/- 571) = 7379 MB/s
Not bad, eh?
What's next?
Faster is scarily fast, and is often much faster than competing solutions which do not offer automatic SIMD iteration. However, its iteration algorithm is lazy, and every call to next must check whether the iterator has any more items in it. This is typically incongruous with how SIMD code is written, where all operations are performed in an unrolled loop.

Adding an eager implementation of simd_map should totally alleviate this, and should encourage LLVM to unroll the iteration, allowing faster to truly brush up against the runtimes of hand-tuned assembly.

Although simd_reduce is already eager, it still relies on the aforementioned iteration protocol. Unfortunately, Rust's trait system makes it quite difficult to add an optimized implementation for non-lazy iterators, so I may have to hold off on this until specialization lands.
Changing up the iteration algorithm also presents an excellent opportunity to add a "hard mode" API to faster, which allows a user to manually pack and unpack streams of types as they please. Sneak peek:
pub fn char_to_emoji<'a, 'b>(buf: &'a [u8], out: &'b mut [u8]) -> &'b [u8] {
    for (i, v) in buf.simd_iter().pack().enumerate() {
        let (a, b) = v.upcast();
        let third = ((a + u16s(55)) / u16s(64) + u16s(143))
            .downcast((b + u16s(55)) / u16s(64) + u16s(143));
        let fourth = ((v + u8s(55)) & u8s(0x3f)) + u8s(128);
        // Make some room for interleaving
        let (ta, tb) = third.upcast();
        let (fa, fb) = fourth.upcast();
        // Interleave third and fourth bytes
        let third_fourth_a = ta.swap_bytes().merge_interleaved(fa);
        let third_fourth_b = tb.swap_bytes().merge_interleaved(fb);
        // Make some more room for another interleaving
        let (tfa, tfb) = third_fourth_a.upcast();
        let (tfc, tfd) = third_fourth_b.upcast();
        // Interleave a constant 0xf09f with the third and fourth bytes,
        // and store into the out buffer
        u32s(0xf09f0000).merge_interleaved(tfa).store(out, 128 * i);
        u32s(0xf09f0000).merge_interleaved(tfb).store(out, 128 * i + tfa.width());
        u32s(0xf09f0000).merge_interleaved(tfc).store(out, 128 * i + 2 * tfa.width());
        u32s(0xf09f0000).merge_interleaved(tfd).store(out, 128 * i + 3 * tfa.width());
    }
    out
}
That may look complicated, but it's a quarter of the size of this 100-line monstrosity (which only supports AVX2!).
This should be exactly as fast as traditional SIMD assembly, because the user is responsible for all of the iterative work, while faster will only provide a basic load/store interface and its full selection of vector operations.
I'd like to thank everybody for following the development of faster, filing issues, lending algorithm ideas, and maintaining a high quality of discourse about the library. If you'd like to join the community, check out the GitHub page.