Specialized Importance Matrix Computation
Quantization is practically a necessity to get an LLM of any truly useful size running on consumer hardware, and frequently it has to be very aggressive. For example, to run a 32B-parameter model on my 20GB graphics card with even a short context, it must be squished down to four bits per weight or fewer, and long-context use cases frequently require somewhere between three and four.
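As rough napkin math: 32 billion weights at 16 bits per weight is about 64GB, while the same weights at 4 bits is about 16GB, which is just small enough to leave a 20GB card a few gigabytes for the KV cache and activations.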
A semi-recent innovation in this field is importance matrix quantization, in which the model is run over a calibration corpus that then guides how it is quantized: weights deemed "important" for predicting that corpus get more bits, and weights deemed less important get fewer. Models used to be calibrated primarily on Wikitext, but that practice has drawn questions about data leakage and lack of generality.
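For the curious, here is a minimal sketch of what that process looks like with llama.cpp's tools; the model file names are placeholders, and exact binary names and flags may differ between versions:

```bash
# Run the full-precision model over a calibration corpus to compute an importance matrix
./llama-imatrix -m model-f16.gguf -f calibration_datav3.txt -o model.imatrix

# Quantize aggressively, letting the importance matrix decide where the bits go
./llama-quantize --imatrix model.imatrix model-f16.gguf model-IQ3_XXS.gguf IQ3_XXS
```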
I commonly use quantized models from Bartowski due to their availability, wide array of quantization levels, and reliable updates. I use models almost exclusively for coding, and less often for English-Chinese work, so I was quite surprised to see that the dataset used to generate Bartowski's imatrix quants, calibration_datav3.txt, contains no Chinese and little code. Unsloth, another provider of quantized models, uses both calibration_datav3.txt and calibration_datav5.txt, according to their documentation. v5 contains more code than v3, but it inherits a lot of v3's quirks, which I'll get into later.
To be clear, the point of this article is absolutely not to attack Bartowski, Unsloth, or the creators of these corpuses. Bartowski and the Unsloth team are clearly incredibly dedicated to the craft, and are the premier providers of quantized GGUFs. Their quants are available quickly and they both frequently reupload fixed versions of quants when it is discovered that the originals had issues. I also appreciate both parties for making their imatrix files and corpuses available, without which this analysis wouldn't be possible in the first place. I hope that through rigorous benchmarking, experimentation, and constructive feedback like in this article, we can all benefit from better quantized models.
Bartowski's Importance Matrix Generation Corpus - calibration_datav3.txt
Quickly, let's breeze through some of the text in calibration_datav3.txt. The text is predominantly English, but there appear to be some small snippets in Russian, Arabic (?), German, Spanish, Swedish, and Italian (?) in the dataset. I can't read what they say, but the inclusion of other languages seems reasonable for the sake of balance. In my reading, I didn't find any instances of some very widely-spoken languages that may be useful to include, such as Chinese, Hindi, and French, to name a few.
As for code, the most represented language is PHP, but there are also very small amounts of Python, Pandas, Java, C#, SQLAlchemy, Shell, LaTeX, CSS, C++, C, HTML, jQuery, Android UI XML, and JavaScript - a probably nonexhaustive list. However, most of the representation of these languages takes the form of interface definitions, short REPL sessions, and errors or questions about errors. I'm sure some of this lights up some of the right weights, but I do wonder whether it also exercises the ones responsible for actually writing function bodies.
For example, the C representation appears to be from this question on StackOverflow. There are also some things that I found to be interesting in this dataset:
- Riddles
- 10 sentences ending in "bread"
- Some very noisy math
- Why was this article picked?
- A list of drugs?
- What appears to be a copypaste of this study on PubMed
At this point I feel compelled to say that linking to any of the content of this dataset on this site does not constitute endorsement of it. I also wonder whether these news articles, snippets from StackOverflow, etc., are being used in a way that is allowed - I have no idea whether they are, but one advantage of the corpus I will present is that it consists entirely of my own work.
Unsloth's Imatrix Generation Corpus - calibration_datav5.txt
calibration_datav5.txt appears to include many more languages, too many to list here. The inclusion of both simplified and traditional Chinese was a nice touch. In addition, many chunkier code samples were added, including total newcomers such as Tcl, Emacs Lisp, x86 ASM, Rust, and SQL. As much as I love Emacs Lisp, I'm not sure a Quail package was the right thing to include given its dissimilarity to the majority of elisp that somebody would actually ask an LLM about, but I digress.
After the new data, the rest appears to be calibration_datav3.txt pasted verbatim, so all of the quirks noted there should apply here as well. I also noticed that the study mentioned above is pasted twice into this corpus: here and here.
Creating a Strawman Corpus
I'm going to posit that these corpora can be improved upon somewhat for English-speaking users who primarily use models to code. To do so, I propose a very simple corpus containing only the concatenated source of my articles Versioning Code at Scale, Part 2, and Efficient Status Bars with Rust, which I include below in full. At approximately 4,000 tokens, it's microscopic compared to the ones just analyzed.
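Preparing it is nothing more than concatenating the two org-mode sources into a single file; the file names here are hypothetical:

```bash
# Concatenate both article sources into one calibration file
cat versioning-code-at-scale-2.org efficient-status-bars-rust.org > my_imatrix_corpus.txt
```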
The point of benchmarking this corpus is not to show that it's the best corpus, but to give a conservative estimate of what an importance matrix more tailored to an individual use case stands to gain.
I also created quants with the corpus from this discussion on GitHub, which contains 8,000 random tokens from Mistral-7b's tokenizer, to get a baseline for what a truly incoherent quantization corpus might yield.
Methodology
To begin, I created aggressive quants of the most recently released model at the time of writing - specifically, IQ3_XXS quants of Qwen 3 4B - for both corpora. I then computed the mean perplexity and Kullback-Leibler divergence (KLD) over a medium-sized evaluation set consisting of code (which produced approximately 100GB of logits), English text (approximately 30GB), and multilingual text (approximately 5GB). KLD should be "the gold standard for reporting quantization errors" per Unsloth, so we should be measuring roughly the same thing, although I'm sure they have more resources and better measurements than I do.
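For reference, this is roughly the llama.cpp workflow I'm describing; file names are placeholders and flags may vary by version. The full-precision model's logits are saved once per evaluation set, and each quant is then scored against them:

```bash
# Save the full-precision model's logits over the evaluation text
./llama-perplexity -m qwen3-4b-f16.gguf -f code_eval.txt --kl-divergence-base code_logits.bin

# Score a quant: reports perplexity plus KL divergence against the saved logits
./llama-perplexity -m qwen3-4b-iq3_xxs.gguf -f code_eval.txt --kl-divergence-base code_logits.bin --kl-divergence
```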
Results
First, the coding corpus, where I expect some degree of improvement over the models quantized with calibration_datav3.txt:
| Model | Perplexity | % Increase | KL Divergence |
|---|---|---|---|
| Bartowski | 3.71 | 17.56% | 0.1927 |
| Unsloth | 3.69 | 16.81% | 0.2013 |
| Random | 3.60 | 14.09% | 0.2028 |
| Mine | 3.50 | 11.03% | 0.1780 |
My corpus turns in a 60% improvement in perplexity delta over Bartowski's quantization, and 52% over Unsloth's, so it looks like my hypothesis that specialized corpuses do well for specialized tasks holds water. KL divergence is also improved, with my corpus yielding 8% lower divergence than Bartowski's quant and 13% lower than Unsloth's. That's interesting to me, because the Unsloth corpus appears to contain more code.
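(Concretely: Bartowski's 17.56% perplexity increase is roughly 1.6 times my 11.03%, which is where the 60% figure comes from, and 16.81% versus 11.03% gives the 52%.)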
Next, English, which I expect to be flat or slightly regressed:
| Model | Perplexity | % Increase | KL Divergence |
|---|---|---|---|
| Unsloth | 23.00 | 28.09% | 0.3127 |
| Random | 21.87 | 21.81% | 0.3285 |
| Bartowski | 20.55 | 14.48% | 0.2702 |
| Mine | 20.30 | 13.07% | 0.2826 |
My quant performs competitively, with an 11% improvement in perplexity delta over Bartowski's quant and a 115% improvement over Unsloth's. KL divergence is also reasonable: mine is 11% better than Unsloth's, though Bartowski's is about 5% better than mine.
Finally, results across the very small multilingual corpus, which I expect to show regressions:
| Model | Perplexity | % Increase | KL Divergence |
|---|---|---|---|
| Unsloth | 18.03 | 27.83% | 0.3824 |
| Mine | 17.80 | 26.18% | 0.3950 |
| Random | 17.67 | 25.26% | 0.4052 |
| Bartowski | 17.62 | 24.91% | 0.3694 |
As expected, my quant's KL divergence is worse than Unsloth's. What I would not have expected is for Bartowski's quant to beat Unsloth's in this scenario, given the greater linguistic diversity of Unsloth's imatrix corpus, but that's what the numbers say.
Appendix: My Importance Matrix Generation Corpus
In the last section, we discussed growing and breaking of software. Although
these types of changes may encompass some plurality of the work done on any
piece of software, we have left uncovered a majority of software work:
/changing/.
* What is a Change?
Changing cannot be as strictly defined as our previous types of work; bug fixes,
performance improvements, UX improvements, accessibility improvements, and
refactoring all fall under changing. Often, growing can also involve
changing. For example, consider this component from the previous part:
#+begin_src typescript
/*
* Renders data to an unordered list
* Props:
* @required `title: string`: The title of the list
* @required `data: T[]`: Elements to render to their own items, mapped to strings by...
* Events:
* `userInitiatedEvent({detail: void})`: Fires when a user...
*/
class UnorderedListComponent<T> extends HTMLElement {
private _title: string;
private _data: T[];
set title(prop: string) { ...; render(); }
set data(prop: T[]) { ...; render(); }
connectedCallback() { ... }
disconnectedCallback() { ... }
render() { ... }
userInitiatedCallback() {
this.dispatchEvent(new CustomEvent("userInitiatedEvent"));
}
}
#+end_src
Although removing the ~title~ prop from this component is, from a typing
perspective, a compatible change, such a change may cause a change in behavior
which some may consider breaking. Let us consider a component which does not
require a title be provided, and instead is able to omnisciently guess what
the user wants.
#+begin_src typescript
/*
* Renders data to an unordered list
* Props:
* @required `data: T[]`: Elements to render to their own items, mapped to strings by...
* Events:
* `userInitiatedEvent({detail: void})`: Fires when a user...
*/
class UnorderedListComponent<T> extends HTMLElement {
private get _title(): string { return omniscientlyGuessTitle(); }
private _data: T[];
set data(prop: T[]) { ...; render(); }
connectedCallback() { ... }
disconnectedCallback() { ... }
render() { ... }
userInitiatedCallback() {
this.dispatchEvent(new CustomEvent("userInitiatedEvent"));
}
}
#+end_src
Although the component's title may still be exactly what the user wants, there
is a subtle difference in its behavior. Now, setting the ~title~ property will
no longer cause the component to rerender. Such a change is noticed by somebody
who decided to mutate ~data~ and then set ~title~ to have the changes reflected;
your omniscience technology goes unappreciated, and everybody has a bad day.
These subtle changes are the grey zone in which library maintainers must
oftentimes operate, and the lines of what can be considered breaking are often
not clear, because you cannot possibly know the use case of every user.
* Hyrum's Law Probably Doesn't Apply to You
Some say that [[https://www.hyrumslaw.com/][Hyrum's Law]], which states that every observable behavior of a
system /de facto/ becomes its interface, implies that any change can be
considered breaking. I also believe this law holds in many cases: I'm quite sure
that every single bug and quirk in Win32, IE11, and the Direct3D9 Runtime is
relied upon by somebody.
However, the "sufficient number of users" over which Hyrum's Law applies is
higher than most systems will ever attain, and can be made higher with good
support. There are an infinite number of strawman programs which can cause even
the most universally-compatible change to be considered breaking: if a library
function took less than 2ms wall time to run, crash. If the number of keys in a
map where key count is not actually germane to anything is odd, crash. If the
artifact has a "2" anywhere in its version or name, crash. If your API is not as
popular as Win32, you probably shouldn't worry about them.
What we actually care about are the non-hostile use cases which can cause
innocuous changes to break compatibility: race conditions made manifest by
performance improvements, bugfixes incompatible with their workarounds, and
unintentional breakage caused by "weird" things like stacking contexts or
floating point precision loss, as well as breakage from new bugs.
* Acceptable Use
Most of the APIs to which Hyrum's Law applies are external, public APIs where
the creators have very limited influence over their users. Although this is
common in many fields, there is still some degree of influence over users which
should be leveraged to lessen users' reliance on noninterface characteristics.
Documentation is critically important to this. The documentation of an interface
is the contract between a library's maintainers and its users. As in any good
contract, a library maintainer should seek to limit their liability, and
establish some "ground rules" which globally apply to everything in the
contract. Some rules I've used for frontend projects include:
- The type of provided data in a contract may be a supertype of the actually-provided data.
- Runtime and memory characteristics, within reason, are not part of the contract
- Many aspects of a component's look and feel are not part of the contract (to be covered more in Part 3)
In addition, minimum coding standards are also important to improve
compatibility with changes, such as disallowing wall-time waits.
* Fixing Breakage
Even after taking all of these steps to prevent breakage, it is still bound to
happen.
Rushing ahead to fix the break can lead to half-solutions or cause other things
to break in haste, so it's important to take a step back and follow a process
for each incident. I personally like the following:
- Assess the break
- Revert the change
- Gather cases where breakage occurred
- Root-cause the source of the break
- Fix the breakage
** Assessing the Break
First, it's important not to let the urgency get the better of oneself; take a
minute to step back, assess where the break is happening and its severity, and
lightly hypothesize why the break happened. Sometimes it's really obvious, like
a forgotten margin causing somebody's page to look weird, and sometimes you get
a totally broken-looking component with no hint as to why. If a rough idea of
the cause can be ascertained on the spot, it can be used to give rough estimates
on time to resolve. Additionally, if there's extremely high confidence that it's
truly a trivial fix, many of the following steps can be shortcut.
It's also important to do some customer service at the outset: let the key
stakeholders affected by the break know your appraisal of the situation, and
when they can expect the break to be fixed. Usually, if one follows these steps,
that is either very shortly in the future or very shortly in the past. Making
developers and business stakeholders feel cared for can go a long way to
reducing annoyance at breakage when it does happen. Finally, if the change was
for a specific stakeholder, it is important to inform them about how the
incident will be addressed as well, and when they can expect the fixed version
of their requested changes to be available.
** Reverting the Change
Now that stakeholders are taken care of, in most cases one will want to revert
the change before doing anything else. The impact of the break may not be
completely understood, and although it's only in a development environment, it
has a chance of impacting the productivity of other developers, be it through
actually impeding them or simply through annoyance at the break. In addition,
reverting takes off the majority of the time pressure to put up a fix, as people
are no longer actively using a degraded version of the library.
One of the advantages of hot-pushing dependencies is the enormous amount of
control the library has over which versions of itself are being used. Reverting
a change in a hot-deployed library unilaterally guarantees that apps will no
longer be running the version with the break, reducing "aftershocks" caused by
those upgrading to a version with a bug and sticking to said version until they
run into it. For packages which do not use this technique, most package
repositories allow certain versions of packages to be yanked or deprecated. This
is commonly used for security vulnerabilities, but can also apply to breakage.
** Gather Test Cases
After reverting, one should gather all known cases in which the change caused
breakage. Standard issue-reporting is the name of the game here: where to go,
steps to reproduce, expected behavior, actual behavior. This step is usually not
a quick one, as the whole reason we're here is that somebody was broken by the
change. Usually the one initial report is enough to root-cause the issue, but
additional cases can never hurt.
One can take advantage of hot-deployed code again in this step, if the case is
extreme enough to warrant it. At a non-critical time, for example the start of a
sprint, one can communicate that the change with a break will be temporarily
hot-deployed for a period of time, and ask consumers to report issues they
see. This has some of the negative impacts towards productivity listed above,
but it can sometimes be worth it if reproduction cases are scant in number and
root-causing has failed.
Needless to say, analogues of these cases should almost always go in the
library's test suite in some form to ensure that such a break does not happen in
the same way again.
** Root-Cause the Break
In order to fix the break, one must fully understand why it happened so as to
ensure that the break is fixed in all cases. The most efficient process for
root-causing is the scientific method, discussion of the application of which I
will leave to [[https://www.youtube.com/watch?v=FihU5JxmnBg][Stuart Halloway]]'s excellent talk at Clojure/conj 2015. As with
many conj talks, I highly recommend it to users of any programming language or
framework, including those without Clojure exposure.
** Fixing the Breakage
Finally, it's time to fix the break and then push out a new version of the
library. This is a great opportunity to beef up the test suite in that area, and
ensure that all behaviors are well-documented and tested to ensure that there
will be no further surprises. It is especially important to ensure that this is
well-tested, as repeated breakage can reduce developer confidence in the
library. Generally, you want people to feel that the number of bugs is
on a downwards trend, and fixes leading to further bugs can complicate that
narrative.
* Conclusion
Unfortunately I don't have silver bullets to offer on ensuring compatible
changes in all cases, in the same way that one can ensure growing software is
compatible. However, having a process for narrowing the possibility space of
breakage which can occur, gracefully fixing breakage when it occurs, and doing
the necessary people management to ensure continued confidence in the projects
can be applied to any project, and will go a long way to make people feel
comfortable using the library, especially a hot-deployed one, and minimize the
impact of breakage when it does occur.
Although to my knowledge there's no better mathematical way to divide up the
grey zone of changing effectful software in the general case, there are some
tips I have that specifically apply to the frontend, and component libraries,
which will be covered in the next section.
#+title: Efficient Status Bars with Rust
#+author: Adam Niederer
AKA "How I sped up my desktop by 100x using Rust 🔥 🚀"
AKA "How I learned to stop context switching and love the (fork) bomb"
* My Desktop
My desktop is pretty heavily customized. I don't run a desktop environment, so
much of it is a mishmash of à la carte programs. My status bar, in particular,
is a custom script. It gathers and displays information by interfacing with a
lot of different components of my system, namely:
- The system clock
- System usage information
- Hardware voltage and temperature sensors
- Any available Bluetooth interfaces
- Ethernet and WLAN interfaces
- PulseAudio sinks
- Internet weather APIs
- Battery sensors
- Music players
All of that information is dumped to stdout, and painted to the screen with
[[https://github.com/krypt-n/bar][lemonbar]], which also handles mouse events.
With it, I can play/pause/skip music, check the weather, date, and time, connect
to Bluetooth devices and wifi access points, check my memory and swap usage,
adjust and mute my volume, and make sure my hardware is happy.
* Making it Work with Bash
The plethora of pre-packaged programs for these interfaces persuaded me to
produce this program with bash. Despite its linguistic quirkiness, bash remains
a leading choice for throwing together scripts.
I'll spare you the details of the script, for it contains many hard-coded API
keys and MAC addresses. Here's a representative sample of the code, though:
#+BEGIN_SRC bash
# Display hardware sensors
TEMP=$(sensors 2>/dev/null | grep -E "Package|Physical" | cut -d ' ' -f 5 | cut -c 2-7)
VCORE=$(sensors 2>/dev/null | grep "Vcore" | cut -d ' ' -f 11 | cut -c 2-5)
if [ -n "$TEMP" ]; then
echo -ne "\uf2c8 $TEMP "
fi
if [ -n "$VCORE" ]; then
echo -ne "⚡ ${VCORE}V "
fi
#+END_SRC
It's workable, if a little brittle. Because programs like ~sensors~, ~free~, and
~pamixer~ rarely receive significant updates, this script only broke once or
twice over its years of use, and it was always a quick fix.
It had a more serious problem though, which showed up in my system's idle CPU
usage.
It's wildly inefficient. Retrieving my CPU temperature and voltage spawns and
subsequently terminates 8 processes. The entire script creates and kills at
least 200 processes per second. Yikes!
My desktop often runs on machines with finite energy, so I made efforts to
optimize it. By decimating the update frequency and ensuring power-hungry
programs like ~nmcli~ run less frequently, I was able to run it on my laptop
without noticeable effect on battery life.
* Getting it Right with Rust
Optimizing my status bar sat on my back burner for quite a while, as many of the
interfaces used were only accessible with C. C APIs are Satan's curse upon man,
so I held out for a better solution.
In the winter of 2017, I got really into Rust. Consequently, I noticed a
Cambrian explosion of wrapper crates for things like ~pulseaudio~, ~bluez~,
~lm-sensors~, and ~dbus~. A few months later, Rust had native bindings to
everything I needed. Work quickly began on a replacement.
350 lines of Rust code later, I had a working copy of my status bar. Rust's
amazing and highly standardized documentation system allowed me to integrate
tens of crates in an evening, while its explicit error handling ensured graceful
removal of irrelevant features on sparsely-equipped machines.
#+BEGIN_SRC rust
let thermal_sensors = Sensors::new().into_iter()
.filter(|c| c.prefix() == "coretemp")
.flat_map(|c| {
c.into_iter().flat_map(|feat| {
feat.into_iter().filter(|sub| {
sub.subfeature_type() == &SubfeatureType::SENSORS_SUBFEATURE_TEMP_INPUT
}).collect::<Vec<_>>()
}).collect::<Vec<_>>()
}).collect::<Vec<_>>();
let voltage_sensors = Sensors::new().into_iter()
.filter(|c| c.prefix() == "nct6776")
.flat_map(|c| {
c.into_iter().flat_map(|feat| {
feat.into_iter().filter(|sub| {
sub.subfeature_type() == &SubfeatureType::SENSORS_SUBFEATURE_IN_INPUT
}).collect::<Vec<_>>()
}).collect::<Vec<_>>()
}).collect::<Vec<_>>();
// Snip!
loop {
// Snip!
if let Ok(thermals) = thermal_sensors.first()?.get_value() {
print!("\u{f2c8} {}°C ", thermals);
}
if let Ok(voltage) = voltage_sensors.first()?.get_value() {
print!("⚡ {:.2}V ", voltage);
}
}
#+END_SRC
#+BEGIN_HTML
<div style="text-align: center">Imagine how ugly that would be in C.</div>
#+END_HTML
The difference was night and day.
* Measuring Success
Let's take a quick look at both programs with ~perf stat~ to ensure our
qualitative results have some empirical backing.
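The invocation is nothing fancy; the script names here are illustrative, and both were run over the same wall-clock window:

#+BEGIN_SRC bash
# Count instructions and cycles for each implementation over the same 60-second window
perf stat -e instructions,cycles timeout 60 ./status-bar.sh > /dev/null
perf stat -e instructions,cycles timeout 60 ./status-bar-rs > /dev/null
#+END_SRC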
We can see the Bash script executing 200x as many instructions, and requiring
around 100x as many cycles as the Rust script.
Those aren't the only metrics we can test, however. Both scripts are throttled
by a call to ~sleep~ with a constant argument. This means scripts which take
longer to run will produce fewer updates per second.
Both scripts print one line per update, so I was able to measure their relative
update frequency by running both scripts for the same amount of time and
measuring the length of their outputs.
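In shell terms, the measurement was roughly the following, with the script names again being illustrative:

#+BEGIN_SRC bash
# Run each bar for the same interval and compare how many update lines each printed
timeout 60 ./status-bar.sh | wc -l
timeout 60 ./status-bar-rs | wc -l
#+END_SRC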
The Rust script runs at a 27% higher frequency than the bash script, furthering
its efficiency lead.
* On Ecosystems
Although 10^2 speedups are more likely to draw clicks, my most salient takeaway
from this endeavor is related to Rust's ecosystem. Many crates with underpinning
C libraries offer much more thoughtful APIs, and are more accessible to casual
users.
I'm repeatedly delighted by each crate's continued maintenance and development -
especially [[https://github.com/jnqnfe/pulse-binding-rust][pulse-binding-rust]], which has added a ton of features while
profoundly transforming its API for the better. This code from February:
#+BEGIN_SRC rust
extern "C" fn pulse_cb(_: *mut ContextInternal, info: *const ServerInfoInternal, ret: *mut c_void) {
if !info.is_null() && !ret.is_null() {
unsafe {
let name = CStr::from_ptr((*info).default_sink_name).to_owned().into_string().unwrap();
*(ret as *mut String) = name;
}
}
}
let mut pulse_sink_name: String = String::new();
pulsectx.introspect().get_server_info(
(pulse_cb, &mut pulse_sink_name as *mut _ as *mut c_void));
#+END_SRC
is equivalent to this code from August:
#+BEGIN_SRC rust
pulsectx.introspect().get_server_info(|info| {
pulse_channel.sender.send(info.default_sink_name.clone().unwrap().into());
});
let pulse_sink_name = pulse_channel.receiver.recv().unwrap();
#+END_SRC
#+BEGIN_HTML
<div style="text-align: center">Yielding to PulseAudio is omitted from both samples.</div>
#+END_HTML
I can only hope the velocity carried by Rust's ecosystem is kept strong.