Versioning Code at Scale, Part 2: Changing

Published 2024-06-15 by Adam Niederer

In the last section, we discussed growing and breaking software. Although these types of changes may account for a large share of the work done on any piece of software, we have left uncovered the majority of software work: changing.

What is a Change?

Changing cannot be as strictly defined as our previous types of work; bug fixes, performance improvements, UX improvements, accessibility improvements, and refactoring all fall under changing. Often, growing can also involve changing. For example, consider this component from the previous part:

/*
 * Renders data to an unordered list
 * Props:
 * @required `title: string`: The title of the list
 * @required `data: T[]`: Elements to render to their own items, mapped to strings by...
 * Events:
 * `userInitiatedEvent({detail: void})`: Fires when a user...
 */
class UnorderedListComponent<T> extends HTMLElement {
  private _title: string;
  private _data: T[];

  set title(prop: string) { ...; this.render(); }
  set data(prop: T[]) { ...; this.render(); }

  connectedCallback() { ... }
  disconnectedCallback() { ... }
  render() { ... }

  userInitiatedCallback() {
    this.dispatchEvent(new CustomEvent("userInitiatedEvent", { detail: Math.random() }));
  }
}

Although removing the title prop from this component is, from a typing perspective, a compatible change, it may alter the component's behavior in a way that some would consider breaking. Let us consider a component which does not require that a title be provided, and instead is able to omnisciently guess what the user wants.

/*
 * Renders data to an unordered list
 * Props:
 * @required `data: T[]`: Elements to render to their own items, mapped to strings by...
 * Events:
 * `userInitiatedEvent({detail: void})`: Fires when a user...
 */
class UnorderedListComponent<T> extends HTMLElement {
  private get _title(): string { return omniscientlyGuessTitle(); }
  private _data: T[];

  set data(prop: T[]) { ...; this.render(); }

  connectedCallback() { ... }
  disconnectedCallback() { ... }
  render() { ... }

  userInitiatedCallback() {
    this.dispatchEvent(new CustomEvent("userInitiatedEvent", { detail: Math.random() }));
  }
}

Although the component's title may still be exactly what the user wants, there is a subtle difference in its behavior: setting the title property no longer causes the component to rerender. Somebody who mutated data in place and then set title to have the changes reflected notices the difference, your omniscience technology goes unappreciated, and everybody has a bad day.
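
To make the scenario concrete, here is a minimal sketch of it. ListV1, ListV2, and addItem are hypothetical, DOM-free stand-ins for the two component versions and for a consumer's code, and the guessed title is a fixed string purely for illustration.

```typescript
// Hypothetical, DOM-free stand-ins for the two versions of the component.
class ListV1 {
  rendered: string[] = [];
  private _title = "";
  private _data: string[] = [];

  set title(prop: string) { this._title = prop; this.render(); }
  set data(prop: string[]) { this._data = prop; this.render(); }

  render(): void { this.rendered = [this._title, ...this._data]; }
}

class ListV2 {
  rendered: string[] = [];
  // Still assignable (assumption for illustration), but writing to it
  // no longer triggers a render.
  title = "";
  private _data: string[] = [];

  private get guessedTitle(): string { return "Guessed title"; }
  set data(prop: string[]) { this._data = prop; this.render(); }

  render(): void { this.rendered = [this.guessedTitle, ...this._data]; }
}

// Consumer code that leaned on the title setter as a "rerender now" hook.
function addItem(list: ListV1 | ListV2, items: string[], item: string): void {
  items.push(item);      // in-place mutation: the data setter never runs
  list.title = "Fruits"; // V1 rerenders here, V2 does not
}

const itemsV1 = ["apple"];
const v1 = new ListV1();
v1.data = itemsV1;
addItem(v1, itemsV1, "pear");

const itemsV2 = ["apple"];
const v2 = new ListV2();
v2.data = itemsV2;
addItem(v2, itemsV2, "pear");

console.log(v1.rendered); // title, "apple", and "pear"
console.log(v2.rendered); // guessed title and "apple" only; "pear" never shows up
```

Nothing in the consumer's code is unreasonable on its face, which is what makes this kind of break hard to predict from the library side.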

These subtle changes are the grey zone in which library maintainers must oftentimes operate, and the lines of what can be considered breaking are often not clear, because you cannot possibly know the use case of every user.

Hyrum's Law Probably Doesn't Apply to You

Some say that Hyrum's Law, which states that every observable behavior of a system de facto becomes its interface, implies that any change can be considered breaking. I also believe this law holds in many cases: I'm quite sure that every single bug and quirk in Win32, IE11, and the Direct3D9 Runtime is relied upon by somebody.

However, the "sufficient number of users" at which Hyrum's Law applies is higher than most systems will ever attain, and can be made higher with good support. There are an infinite number of strawman programs which can cause even the most universally-compatible change to be considered breaking: if a library function takes less than 2ms wall time to run, crash. If the number of keys in a map, where key count is not actually germane to anything, is odd, crash. If the artifact has a "2" anywhere in its version or name, crash. If your API is not as popular as Win32, you probably shouldn't worry about such programs.

What we actually care about are the non-hostile use cases in which innocuous changes break compatibility: race conditions made manifest by performance improvements, bugfixes incompatible with their workarounds, and unintentional breakage caused by "weird" things like stacking contexts or floating point precision loss, as well as breakage from new bugs.
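
As an illustration of a bugfix that is incompatible with its workaround, consider the sketch below. The pagination functions and the consumer are hypothetical and exist only to show the shape of the problem.

```typescript
// Hypothetical library function. Documented contract: `page` is 1-based.
function pageV1<T>(items: T[], page: number, size: number): T[] {
  // Bug: treats `page` as 0-based.
  return items.slice(page * size, (page + 1) * size);
}

// The bugfix: `page` is now 1-based, as documented.
function pageV2<T>(items: T[], page: number, size: number): T[] {
  return items.slice((page - 1) * size, page * size);
}

// Consumer workaround written against V1: subtract one to dodge the bug.
function firstPage(paginate: typeof pageV1, items: string[]): string[] {
  return paginate(items, 1 - 1, 2);
}

const letters = ["a", "b", "c", "d"];
console.log(firstPage(pageV1, letters)); // the first two items, "a" and "b"
console.log(firstPage(pageV2, letters)); // an empty array
```

The fix is correct by the documented contract, and the consumer's workaround was a sensible response to the bug, yet the combination of the two breaks.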

Acceptable Use

Most of the APIs to which Hyrum's Law applies are external, public APIs where the creators have very limited influence over their users. Although this is common in many fields, there is still some degree of influence over users which should be leveraged to lessen users' reliance on noninterface characteristics.

Documentation is critically important to this. The documentation of an interface is the contract between a library's maintainers and its users. As in any good contract, a library maintainer should seek to limit their liability, and establish some "ground rules" which globally apply to everything in the contract. Some rules I've used for frontend projects include:

The type of provided data in a contract may be a supertype of the actually-provided data.
Runtime and memory characteristics, within reason, are not part of the contract.
Many aspects of a component's look and feel are not part of the contract (to be covered more in Part 3).
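
The first rule can be sketched as follows. The ItemId type and every function here are hypothetical; the point is only that the documented type is wider than what the library happens to provide today.

```typescript
// Documented contract: an id is `string | number`, deliberately wider than
// the numbers this (hypothetical) library hands out today.
type ItemId = string | number;

function itemIdV1(index: number): ItemId { return index; }
function itemIdV2(index: number): ItemId { return `item-${index}`; }

// Within the contract: works for every value the documented type allows.
function keyFor(id: ItemId): string { return String(id); }

// Outside the contract: assumes the id is always a number.
function nextIdUnsafe(id: ItemId): number { return (id as number) + 1; }

console.log(keyFor(itemIdV1(1)), keyFor(itemIdV2(1))); // "1" and "item-1"
console.log(nextIdUnsafe(itemIdV1(1)));                // 2
console.log(nextIdUnsafe(itemIdV2(1)));                // the string "item-11", not a number
```

A consumer who stays within the documented type is unaffected by the move from V1 to V2, and the documentation gives the maintainer grounds to call that move a compatible change.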

In addition, minimum coding standards are also important to improve compatibility with changes, such as disallowing wall-time waits.
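
A wall-time wait and its alternative might look like the sketch below. SlowList, its whenRendered promise, and both reader functions are assumptions made for illustration, and the timer inside setData merely simulates rendering work of unknown duration.

```typescript
// Hypothetical component that finishes rendering after an unspecified delay.
class SlowList {
  rendered: string[] = [];
  // Resolves once the render triggered by the latest setData call completes.
  whenRendered: Promise<void> = Promise.resolve();

  setData(data: string[], delayMs: number): void {
    this.whenRendered = new Promise<void>((resolve) => {
      // The timer stands in for real rendering work.
      setTimeout(() => {
        this.rendered = [...data];
        resolve();
      }, delayMs);
    });
  }
}

// Disallowed: guesses that rendering always takes less than 50 ms.
function readAfterGuess(list: SlowList): Promise<string[]> {
  return new Promise((resolve) => setTimeout(() => resolve(list.rendered), 50));
}

// Preferred: waits on the completion signal, however long rendering takes.
async function readWhenRendered(list: SlowList): Promise<string[]> {
  await list.whenRendered;
  return list.rendered;
}

const list = new SlowList();
list.setData(["a", "b"], 120); // rendering got slower than the guessed 50 ms
readAfterGuess(list).then((items) => console.log("guess:", items.length));    // guess: 0
readWhenRendered(list).then((items) => console.log("signal:", items.length)); // signal: 2
```

Code written in the second style keeps working when the library's performance characteristics change in either direction, which is what lets runtime stay outside the contract.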

Fixing Breakage

Even after taking all of these steps to prevent breakage, it is still bound to happen.

Rushing ahead to fix the break can lead to half-solutions or cause other things to break in haste, so it's important to take a step back and follow a process for each incident. I personally like the following:

Assess the break
Revert the change
Gather cases where breakage occurred
Root-cause the source of the break
Fix the breakage

Assessing the Break

First, it's important not to let the urgency get the better of oneself. Take a minute to step back, assess where the break is happening and how severe it is, and lightly hypothesize why it happened. Sometimes it's really obvious, like a forgotten margin causing somebody's page to look weird; other times you get a totally broken-looking component with no hint as to why. If a rough idea of the cause can be ascertained on the spot, it can be used to give rough estimates on time to resolve. Additionally, if there's extremely high confidence that it's truly a trivial fix, many of the following steps can be skipped.

It's also important to do some customer service at the outset: let the key stakeholders affected by the break know your appraisal of the situation, and when they can expect the break to be fixed. Usually, if one follows these steps, that is either very shortly in the future or very shortly in the past. Making developers and business stakeholders feel cared for can go a long way to reducing annoyance at breakage when it does happen. Finally, if the change was for a specific stakeholder, it is important to inform them about how the incident will be addressed as well, and when they can expect the fixed version of their requested changes to be available.

Reverting the Change

Now that stakeholders are taken care of, in most cases one will want to revert the change before doing anything else. The impact of the break may not be completely understood, and although it's only in a development environment, it has a chance of impacting the productivity of other developers, be it through actually impeding them or simply through annoyance at the break. In addition, reverting takes off the majority of the time pressure to put up a fix, as people are no longer actively using a degraded version of the library.

One of the advantages of hot-pushing dependencies is the enormous amount of control the library has over which versions of itself are being used. Reverting a change in a hot-deployed library unilaterally guarantees that apps will no longer be running the version with the break, reducing "aftershocks" caused by those upgrading to a version with a bug and sticking to said version until they run into it. For packages which do not use this technique, most package repositories allow certain versions of packages to be yanked or deprecated. This is commonly used for security vulnerabilities, but can also apply to breakage.
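
As a rough sketch of why reverting is so effective for a hot-deployed library, assume that apps ask a release channel which version to load each time they start. ReleaseChannel and the version numbers below are hypothetical and do not describe any particular deployment system.

```typescript
// Hypothetical release channel: apps ask it which version to load at startup.
class ReleaseChannel {
  private history: string[] = [];

  publish(version: string): void {
    this.history.push(version);
  }

  // The version every app resolves on its next load.
  current(): string | undefined {
    return this.history[this.history.length - 1];
  }

  // Drops the latest version so that apps resolve the previous one again.
  // If only one version was ever published, it stays in place.
  revert(): string | undefined {
    if (this.history.length > 1) {
      this.history.pop();
    }
    return this.current();
  }
}

const channel = new ReleaseChannel();
channel.publish("1.4.0");
channel.publish("1.5.0"); // the version with the break
channel.revert();
console.log(channel.current()); // 1.4.0
```

No consumer has to change a lockfile or redeploy for the revert to take effect, which is the control the paragraph above describes.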

Gather Test Cases

After reverting, one should gather all known cases in which the change caused breakage. Standard issue-reporting is the name of the game here: where to go, steps to reproduce, expected behavior, actual behavior. This step is usually a quick one, as the whole reason we're here is that somebody was broken by the change. Usually the one initial report is enough to root-cause the issue, but additional cases can never hurt.

One can take advantage of hot-deployed code again in this step, if the case is extreme enough to warrant it. At a non-critical time, for example the start of a sprint, one can communicate that the change with a break will be temporarily hot-deployed for a period of time, and ask consumers to report issues they see. This has some of the negative impacts on productivity listed above, but it can sometimes be worth it if reproduction cases are scant in number and root-causing has failed.

Needless to say, analogues of these cases should almost always go in the library's test suite in some form to ensure that such a break does not happen in the same way again.
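
One possible way to turn reports into regression cases is a small table of executable reports, sketched below. renderList, the BreakReport shape, and the sample report are all hypothetical.

```typescript
// Hypothetical library function under test.
function renderList(title: string, data: string[]): string[] {
  return [title, ...data];
}

// One gathered report, in the usual issue-report shape.
interface BreakReport {
  where: string;       // where the break was observed
  steps: string;       // steps to reproduce, for humans
  run: () => string[]; // the same steps in executable form (actual behavior)
  expected: string[];  // expected behavior
}

const reports: BreakReport[] = [
  {
    where: "cart page (hypothetical)",
    steps: "render the list with a title and no data",
    run: () => renderList("Cart", []),
    expected: ["Cart"],
  },
];

// Returns the locations of every report that still reproduces.
function failingReports(cases: BreakReport[]): string[] {
  return cases
    .filter((c) => JSON.stringify(c.run()) !== JSON.stringify(c.expected))
    .map((c) => c.where);
}

console.log(failingReports(reports)); // an empty array once the break is fixed
```

Keeping the human-readable steps next to the executable form makes it easier to tell, months later, why a given case is in the suite.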

Root-Cause the Break

In order to fix the break, one must fully understand why it happened so as to ensure that the break is fixed in all cases. The most efficient process for root-causing is the scientific method, discussion of the application of which I will leave to Stuart Halloway's excellent talk at Clojure/conj 2015. As with many conj talks, I highly recommend it to users of any programming language or framework, including those without Clojure exposure.

Fixing the Breakage

Finally, it's time to fix the break and then push out a new version of the library. This is a great opportunity to beef up the test suite in that area, and to make sure all behaviors are well-documented and tested so that there are no further surprises. Thorough testing is especially important here, as repeated breakage can reduce developer confidence in the library. You want people to feel that the number of bugs is on a downwards trend, and fixes leading to further bugs can complicate that narrative.

Conclusion

Unfortunately, I don't have silver bullets to offer for ensuring compatible changes in all cases, in the same way that one can ensure growing software is compatible. However, a process for narrowing the possibility space of breakage, gracefully fixing breakage when it occurs, and doing the necessary people management to ensure continued confidence in the project can be applied to any project. It will go a long way toward making people feel comfortable using the library, especially a hot-deployed one, and toward minimizing the impact of breakage when it does occur.

Although to my knowledge there's no better mathematical way to divide up the grey zone of changing effectful software in the general case, there are some tips I have that specifically apply to the frontend, and component libraries, which will be covered in the next section.