Adam Niederer

Fast Rust in 2018

Rust is already a fast language, but there are still many opportunities to make it the fast language. While many people will rightfully focus on usability, tooling, and community goals for 2018, I will focus on some exciting changes which will make Rust the star of HPC.

Rust has been blessed with excellent runtime speed since its inception, and I hope that its world-class static analysis, code generation capabilities, and closeness to the metal can be applied to make Rust the fastest language ever.

Furthermore, Rust is a very expressive language. Traits, macros, and libraries "for humans" allow us to express what may take hundreds of lines of C in a digestible amount of Rust. Because of this, Rust is poised to make advanced optimization techniques available to everybody.

Those without advanced knowledge of computer architecture or algorithmic complexity should be able to leverage Rust's fantastic documentation, expressiveness, and library ecosystem to get their code within range of a professionally optimized solution. Rust is nearly there, and many fast things already "just work" for everybody, such as:

However, a few more features, along with "for humans" implementations of each, should completely tip the scales:

In the next few thousand words, I'll survey in-progress features or bizarre ideas which will allow the above bullet points to join the Rust story.

Accepted Features

First, let's take a look at some features which are coming down the pipeline, but still need a little specification or implementation work to arrive on nightly. All of these features should help make Rust code faster, as well as increase usability of the language.

Specifically, many of these features should make zero-allocation code easier to write and branches easier to elide.

In this segment, I will offer fewer original ideas in favor of cheering for some already well-defined features which I believe can either make Rust faster or lower the barrier to entry for manual optimization.

Specialization

Current Status: On nightly; awaiting features.

Certainly one of the fastest things coming to Rust is specialization, which lets one implement specifically optimized versions of traits for a given struct or type parameter.

use std::iter::Repeat;

default fn vectorized_load<T : Iterator>(mut iter: T) -> Vector {
    // Compiles to many branches, calls, and insertps instructions
    let mut ret = Vector::default();
    for i in 0..Vector::length() {
        if let Some(scl) = iter.next() {
            ret = ret.replace(i, scl);
        }
    }
    ret
}

fn vectorized_load(mut iter: Repeat<f32>) -> Vector {
    // Compiles to a single call, branch, and movaps
    iter.next().map_or(Vector::default(), Vector::splat)
}

Specialization is especially exciting because the internal structure of a struct may reveal an asymptotically faster method of computing something. For example, knowing something is randomly accessible can allow one to chop an O(n) operation down to O(1) quite handily, or squeeze an O(n²) implementation into O(n log n).
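
As a minimal sketch of that idea (the FastCount trait and both impls are hypothetical, and this requires the nightly specialization feature): the blanket implementation walks the whole iterator in O(n), while the specialized one answers in O(1), because a range already knows its own length.

#![feature(specialization)]

trait FastCount {
    fn fast_count(self) -> usize;
}

// Default: walk every element - O(n)
impl<I: Iterator> FastCount for I {
    default fn fast_count(self) -> usize {
        self.count()
    }
}

// Specialized: a range already knows its length - O(1)
impl FastCount for std::ops::Range<usize> {
    fn fast_count(self) -> usize {
        self.end.saturating_sub(self.start)
    }
}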

Smaller improvements are also possible - there are many ways to optimize something which takes a std::iter::Repeat instead of an std::iter::Iterator, for example.

I expect to see a lot of cycles freed up by this addition.

Const Generics

Current Status: RFC; awaiting impl.

Rust has this strange aversion to arrays and tuples. This is unfortunate, because arrays and tuples are the fastest ways to store homogeneous and heterogeneous data, respectively. Arrays over size 32 and tuples over size 12 don't have many of their typical traits implemented, and implementing your own trait for an array or tuple is a nightmare that invariably requires macros.
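
Today the workaround looks roughly like the sketch below (Zeroed is a hypothetical trait): every length needs its own impl, so a macro stamps them out one size at a time, and any length not on the list is simply unsupported.

trait Zeroed {
    fn zeroed() -> Self;
}

// One macro stamps out an impl per array length; each size still needs its
// own entry, and [u8; 33] gets nothing until someone adds it.
macro_rules! impl_zeroed_for_arrays {
    ($($len:expr),+) => {
        $(
            impl Zeroed for [u8; $len] {
                fn zeroed() -> Self {
                    [0u8; $len]
                }
            }
        )+
    };
}

impl_zeroed_for_arrays!(0, 1, 2, 3, 4, 5, 6, 7, 8, 16, 32);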

This can cause people to use vecs and boxed structs in places which don't require them, leading to unnecessary allocations. Const generics alleviate the issue with arrays entirely, which should reduce the pressure on developers to heap-allocate everything.

impl<T, const N: usize> Trait for [T; N] {
    // Tears of joy
}

Const generics also allow monomorphization of functions across values as well as types. Eliding branches can be very helpful in hot code, and const generics should make jump tables and vanity branches less common once they land. It also gives Rust another opportunity to eat C++'s lunch - they've had const generics since '98.
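
Here's a sketch of what that could look like, assuming the syntax proposed in the RFC (the get helper is hypothetical): because CHECKED is a compile-time value, each monomorphized copy keeps exactly one of the two paths, and the if itself never survives codegen.

fn get<const CHECKED: bool>(data: &[u8], idx: usize) -> u8 {
    if CHECKED {
        data[idx] // bounds-checked path
    } else {
        unsafe { *data.get_unchecked(idx) } // hot-loop path, no bounds check
    }
}

// get::<true>(&buf, i) and get::<false>(&buf, i) are monomorphized into
// separate functions; the `if CHECKED` test never appears in either.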

Macros 2.0

Current Status: RFC; awaiting more RFCs.

While many languages can get away without macro support because of their dynamism or data model, macros in Rust are a force multiplier: they give an otherwise stuffy static language real expressive power, letting one elide branches, dynamic dispatch, and all sorts of other slowness without giving up the flexibility those constructs normally provide. They're an excellent complement to the trait system, and fill many gaps left by it.

Macros also allow code to be monomorphized in a similar way to const generics, and allow semi-generic trait implementations over tuples. Procedural macros and improved declarative macros have the potential to significantly improve on Rust's current macro system, which, despite its Turing-completeness, often feels like using a pushdown automaton.
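
Even today's declarative macros can fill that gap arity by arity. Below is a minimal sketch with a hypothetical TupleSum trait; each invocation stamps out a statically dispatched impl for one tuple size, so no boxing or runtime branching is needed.

trait TupleSum {
    fn tuple_sum(&self) -> u64;
}

// One invocation per arity replaces a page of hand-written impls.
macro_rules! impl_tuple_sum {
    ($($idx:tt : $ty:ident),+) => {
        impl<$($ty: Copy + Into<u64>),+> TupleSum for ($($ty,)+) {
            fn tuple_sum(&self) -> u64 {
                let mut total: u64 = 0;
                $( total += <$ty as Into<u64>>::into(self.$idx); )+
                total
            }
        }
    };
}

impl_tuple_sum!(0: A);
impl_tuple_sum!(0: A, 1: B);
impl_tuple_sum!(0: A, 1: B, 2: C);

// (1u8, 2u16, 3u32).tuple_sum() == 6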

Looking at some popular crates on crates.io, it looks like macros are very heavily used. In byteorder, half of the code either uses or defines macros. I'm quite excited about procedural macros, as putting more effort into both declarative and procedural macros should pay off in spades for both usability and speed.

One of the most exciting possibilities of procedural macros, however, is the possibility of making traditionally verbose code extremely simple. See my thoughts on a SPIR-V Target where procedural macros really shine.

Pre-RFC

These ideas are either being worked on without any existing RFC, like SIMD, or exist solely in the back of my mind. I think all of these ideas would have an enormous impact on Rust's position as a serious language for HPC, and assist both new and seasoned programmers.

Specifically, Explicit SIMD and Hardware acceleration will give Rust the upper hand against similarly performant languages, and will open the door to libraries which use Rust's expressiveness to make SIMD and SIMT writeable by anybody.

Furthermore, Standardized performance documentation can ensure going fast is not left to those with formal education or experience in computer science. It may also make Rust an attractive teaching language, or give new programmers some background knowledge and interest in the theory.

SIMD

Current Status: pre-RFC; on nightly via crate.

While heterogeneous compute is undoubtedly the way forward for big data, many tasks can see an enormous benefit from vectorization, but do not provide enough data or compute time to feed a distant accelerator. For example, stream encoding will often only have 2¹⁶ bytes to work with, and many byte-to-byte encodings do not have enough computational complexity to warrant offloading the computation.

A fantastic example of such an algorithm is base💯, which only performs six arithmetic operations per four bytes encoded, and five per four bytes decoded. Below is an implementation of a base💯 decoder:

pub fn emoji_to_char<'a, 'b>(buf: &'a [u8], out: &'b mut [u8]) -> &'b [u8] {
    // Each four-byte emoji chunk decodes to a single output byte
    for (i, chunk) in buf.chunks(4).enumerate() {
        out[i] = ((chunk[2].wrapping_sub(143)).wrapping_mul(64))
            .wrapping_add(chunk[3].wrapping_sub(128))
            .wrapping_sub(55);
    }
    out
}

Despite this code's linearity and simplicity, LLVM has a very difficult time vectorizing it. Compilers are finicky with vectorization, and it's not necessarily the fault of LLVM. Another equivalent implementation may vectorize nicely today, but it may not vectorize tomorrow, or at all on a similar CPU.

However, this decoder may be represented succinctly using explicit SIMD, which provides strong guarantees about the runtime and the vectorized algorithm used. Unfortunately, explicit SIMD isn't usable in stable Rust at the moment. Below is a 4-8x faster version of the above decoder, via explicit SIMD:

extern crate faster;
use faster::*;

pub fn emoji_to_char<'a, 'b>(buf: &'a [u8], out: &'b mut [u8]) -> &'b [u8] {
    let u = u8s::splat;
    buf.simd_iter().stripe_four().zip().simd_map(u8s::splat(0), |(_, _, c, d)| {
        ((c - u(143)) * u(64)) + d - u(128) - u(55)
    }).scalar_fill(out)
}

This will hopefully change soon, thanks to the diligent work of stdsimd's contributors. While it's pre-RFC at the moment, it has the backing of many core Rust developers and will hopefully land as soon as it's ready. Those interested in seeing explicit SIMD in Rust may want to follow the stdsimd repository and its tracking issues.

SPIR-V Target

Current Status: pre-RFC; unimplemented.

Currently, there are two realistic options for writing cross-platform heterogeneous algorithms: OpenCL C or GLSL. Both languages can do the job, with the former being sufficient for most integer HPC cases and the latter being more suited to floating point. However, both languages are often difficult to read, and debugging kernels and shaders is laborious.

A target for SPIR-V relying on libcore may let Rust corner the HPC market as it takes off. C++ via OpenCL or CUDA still suffers from many of the issues found in standard C++, and Rust's technically sophisticated static analysis would be refreshing to use on this often-finicky platform.

Not only is Rust well-suited for such low-level, fast computations - it also has a very, very strong productivity story for these targets. While dependencies in OpenCL and CUDA are often limited to special vectorized headers and cuBLAS, we could potentially get the entirety of crates.io's embedded ecosystem on accelerators.

Imagine the possibilities! It's possible to make kernels much more like real functions with procedural macros, as well. Reductive operations could use a macro to "return" a value, whereas mapping operations could use a classical C-style API.

#[macro_use]
extern crate rust_kernel;
use rust_kernel::{work_id, work_size};

#[kernel]
#[shared(arr)]
fn threesum(arr: &[i64]) -> Option<(i64, i64, i64)> {
    for i in work_id()..work_size() {
        let mut j = i + 1;
        let mut k = arr.len() - 1;

        while k >= j {
            if arr[i] + arr[j] + arr[k] == 0 {
                kernel_return!(Some((arr[i], arr[j], arr[k])));
            } else if arr[i] + arr[j] + arr[k] > 0 {
                k = k - 1;
            } else {
                j = j + 1;
            }
        }
    }
    // No triple found by this work item
    None
}

fn threesum_driver() {
    let arr = [0i64; 65536];
    let context = boring_cl_or_vulkan_setup();
    println!("{:?}", context.enqueue(threesum).arg(arr).await());
}

That isn't very attractive, though. Can we push this further, perhaps with Macros 2.0?

fn holy_cow() {
    let context = boring_cl_or_vulkan_setup();
    let simt_can_be_this_easy = [0f32; 65536].as_slice()
        .simt_map(context, kernelify!(|v| v.sqrt() + 2.0))
        .host_collect::<Vec<f32>>();
}

Yes we can! Imagine a macro which takes a closure, wraps standard kernel boilerplate around it based on its arguments, compiles it to SPIR-V, and links it with the program automatically. This should be possible, and the biggest step to making it happen is finding a way to get Rust to compile to SPIR-V.

Those interested in this topic may also want to take a look at SPIRV-LLVM.

Performance Documentation

Current Status: pre-RFC; unimplemented.

Some packages may contain functions of an asymptotic complexity which a casual observer would not be able to predict. I've written some quadratic functions before, both out of necessity and laziness, and have often wished I could mark a function as either necessarily complex, or in need of performance improvements.

Furthermore, people without a formal education in computer science may be less consistently aware of topics such as asymptotic complexity or the cache hierarchy, which can limit their ability to write performant code. A formal system of documenting these things should not only allow more people to avoid these pitfalls, but may also spark interest in the theory, as well as provide a significant amount of background knowledge for those who encounter the theory after learning Rust.

Documenting Allocations

It's often worth considering the performance implications of a data structure before using it. Many of Rust's types have heap allocations underpinning them; common types like Vec or HashMap are widely understood to be heap-allocated, but what about glium::glutin::WindowAttributes?

The ability to mark a structure as heap-allocated for documentation should give users more power when attempting to cut down on allocations in hot code. Any structures containing an annotated structure can be automatically marked as requiring allocations by the documentation generator. In std, structures such as Box, Vec, String, and HashMap would be marked appropriately.

It may even be possible to document exactly how many allocations are performed when constructing most structures, and whether they may incur reallocations at runtime.

// Structures can contain heap-allocated components
/// !allocates(1)
struct Foo {
    x: *const u8
}

impl Foo { /* use alloc; do heapy things */ }

// This is automatically marked as requiring two heap allocations to
// construct. No annotation is needed.
struct FooAndString {
    x: String,
    y: Foo,
}

Documenting Complexity

Marking a function as slow in a doc-comment, perhaps with a complexity attached, may allow users to avoid some performance pitfalls. Furthermore, attaching a theoretical best-case to this annotation may encourage contributions by those wishing to brush up on their interview skills.

/// !slow(current = "n^2", target = "nlogn")
/// Sorts a slice
fn sort<T>(arr: &mut [T]) where T : PartialOrd {
    // Bubble sort
}

/// !slow(current = "2^n")
/// Find all Hamiltonian cycles in the graph
fn cycles(graph: DirectedGraph) {
    // Bellman-Held-Karp solver
}

// Asymptotic complexity markings cannot be recursively applied to a function
// without analyzing its body. This function is actually O(1), so marking it
// would be incorrect.
/// Sort some of arr
fn sort_some<T>(arr: &mut [T]) where T : PartialOrd {
    sort(&mut arr[0..8]);
}

Marking functions which are slow for miscellaneous reasons may also be worthwhile. Heavy I/O, intentional cache flushing, and blocking may also be good targets for such an annotation system. Some of these may be easy to apply recursively, as well.

/// !slow(cache)
fn kernel_context_switch() {
    // Pull a PCB off the run queue and longjmp into it
}

/// !slow(io)
fn transfer_to_accelerator_sync<T>(data: &[T]) {
    // Synchronously send a bunch of data to your hardware accelerator
}

Wrap-up

With these features, I believe Rust will democratize speed, as well as emerge as the king of the hill for HPC experts and enthusiasts. We've already made amazing strides in this direction with libraries like vulkano, ocl, rayon, faster, stdsimd, crossbeam, and std. With an entire year ahead of us to make it better, my hopes are high.