
My friend Aras recently wrote the same ray tracer in various languages, including C++, C# and the upcoming Unity Burst compiler. While it is natural to expect C# to be slower than C++, what was interesting to me was that Mono was so much slower than .NET Core.

The numbers that he posted did not look good:

  • C# (.NET Core): Mac 17.5 Mray/s
  • C# (Unity, Mono): Mac 4.6 Mray/s
  • C# (Unity, IL2CPP): Mac 17.1 Mray/s

I decided to look at what was going on, and document possible areas for improvement.

As a result of this benchmark, and of looking into this problem, we identified three areas of improvement:

  • First, we need better defaults in Mono, as users will not tune their parameters.
  • Second, we need to do a better job of publicizing Mono's LLVM code-optimizing backend.
  • Third, we tuned some of Mono's parameters.

The baseline for this test was to run Aras's ray tracer on my machine; since we have different hardware, I could not compare against his numbers directly. The results on my iMac at home were as follows for Mono and .NET Core:

Runtime                                                  MRay/sec
.NET Core 2.1.4, debug build, dotnet run                      3.6
.NET Core 2.1.4, release build, dotnet run -c Release        21.7
Vanilla Mono, mono Maths.exe                                  6.6
Vanilla Mono, with LLVM and float32                          15.5

During the process of researching this problem, we found a couple of problems which, once fixed, produced the following results:

Runtime                                                  MRay/sec
Mono with LLVM and float32                                   15.5
Improved Mono with LLVM, float32 and fixed inlining          29.6


Chart visualizing the results of the table above

Just using LLVM and float32, your floating point code can get almost a 2.3x performance improvement. And with the tuning that we added to Mono as a result of this exercise, you can get a 4.4x improvement over running plain Mono - these will be the defaults in future versions of Mono.

This blog post explains our findings.

32 and 64 bit Floats

Aras is using 32-bit floats for most of his math (the float type in C#, or System.Single in .NET terms). In Mono, decades ago, we made the mistake of performing all 32-bit float computations as 64-bit floats while still storing the data in 32-bit locations.

My memory at this point is not as good as it used to be, and I do not quite recall why we made this decision.

My best guess is that it was a decision rooted in the trends and ideas of the time.

Around this time there was a positive aura around extended precision computations for floats. For example the Intel x87 processors use 80-bit precision for their floating point computations, even when the operands are doubles, giving users better results.

Another theme around that time was that the Gnumeric spreadsheet, one of my previous projects, had implemented better statistical functions than Excel had, and this was well received in many communities that could use the more correct and higher precision results.

In the early days of Mono, most mathematical operations available across all platforms only took doubles as inputs. C99, Posix and ISO had all introduced 32-bit versions, but they were not generally available across the industry in those early days (for example, sinf is the float version of sin, fabsf of fabs and so on).

In short, the early 2000’s were a time of optimism.

Applications did pay a heavier price for the extra computation time, but Mono was mostly used for Linux desktop applications, serving HTTP pages and running some server processes, so floating point performance was never an issue we faced day to day. It was only noticeable in some scientific benchmarks, and those were rarely the use case for .NET in the 2003 era.

Nowadays games, 3D applications, image processing, VR, AR and machine learning have made floating point a much more common data type in modern applications. When it rains, it pours, and this is no exception. Floats are no longer a friendly data type that you sprinkle in a few places in your code, here and there. They come in an avalanche and there is no place to hide. There are so many of them, and they won't stop coming at you.

The “float32” runtime flag

So a couple of years ago we decided to add support for performing 32-bit float operations with 32-bit operations, just like everyone else. We call this runtime feature "float32"; in Mono, you enable it by passing the -O=float32 option to the runtime, and for Xamarin applications, you change this setting in the project preferences.

This new flag has been well received by our mobile users, as the majority of mobile devices are still not very powerful and users would rather have their data processed faster than with the extra precision. Our guidance for mobile users has been to turn on both the LLVM optimizing compiler and the float32 flag at the same time.

While we have had the flag for some years, we have not made it the default, to reduce surprises for our users. But we now find ourselves facing scenarios where the current 64-bit behavior is already a surprise to users; for example, see this bug report filed by a Unity user.

We are now going to change the default in Mono to be float32, you can track the progress here:

In the meantime, I went back to my friend Aras's project. He has been using some new APIs that were introduced in .NET Core. While .NET Core always performed 32-bit float operations as 32-bit floats, the System.Math API still forced some conversions from float to double in the course of doing business. For example, if you wanted to compute the sine of a float, your only choice was to call Math.Sin (double) and pay the price of the float-to-double conversion.

To address this, .NET Core has introduced a new System.MathF type, which contains single-precision floating point math operations, and we have just brought System.MathF to Mono.
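The conversion chain can be illustrated with a small Python sketch (illustrative only, not the .NET implementation): widening a float to double is exact, but the double-precision result must be rounded back down when stored into a float, so the extra precision that Math.Sin computed is paid for and then thrown away.

```python
import math
import struct

def f32(x: float) -> float:
    """Round to the nearest 32-bit float (a stored System.Single)."""
    return struct.unpack('f', struct.pack('f', x))[0]

x = f32(0.3)          # a float variable holding 0.3f
assert float(x) == x  # float -> double widening is exact: no information lost

y = math.sin(x)       # the Math.Sin path: sine computed at double precision
assert f32(y) != y    # double -> float narrowing discards the extra precision
```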

Moving from 64-bit floats to 32-bit floats certainly improves performance, as you can see in the table below:

Runtime and Options                              Mrays/second
Mono with System.Math                                     6.6
Mono with System.Math, using -O=float32                   8.1
Mono with System.MathF                                    6.5
Mono with System.MathF, using -O=float32                  8.2

So using float32 really improves things for this test, while MathF had only a small effect.

Tuning LLVM

During the course of this research, we discovered that while Mono’s Fast JIT compiler had support for float32, we had not added this support to the LLVM backend. This meant that Mono with LLVM was still performing the expensive float to double conversions.

So Zoltan added support for float32 to our LLVM code generation engine.

Then he noticed that our inliner was using the same heuristics for LLVM as it was using for the Fast JIT. With the Fast JIT, you want to strike a balance between JIT speed and execution speed, so we limit how much we inline to reduce the work of the JIT engine.

But when you opt into using LLVM with Mono, you want the fastest code possible, so we adjusted the setting accordingly. Today you can change this setting via the environment variable MONO_INLINELIMIT, but it really should be baked into the defaults.
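Until the new defaults land, you can experiment with the knob yourself. The invocation below is a sketch: the inline-limit value shown is just an example, not the value Mono will ultimately ship as the default.

```shell
# Raise the inliner's size limit when using the LLVM backend.
# "100" here is an illustrative value, not an official default.
MONO_INLINELIMIT=100 mono --llvm -O=float32 Maths.exe
```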

With the tuned LLVM setting, these are the results:

Runtime and Options                                         Mrays/second
Mono with System.Math, --llvm -O=float32                            16.0
Mono with System.Math, --llvm -O=float32, fixed heuristics          29.1
Mono with System.MathF, --llvm -O=float32, fixed heuristics         29.6

Next Steps

The effort required to bring some of these improvements was relatively small. We had some on-and-off discussions on Slack which led to these improvements. I even managed to spend a few hours one evening bringing System.MathF to Mono.

Aras's RayTracer code was an ideal subject to study, as it is self-contained and a real application, not a synthetic benchmark. We want to find more software like this that we can use to review the kind of bitcode we generate, and to make sure we are giving LLVM the best data we can so that LLVM can do its job.

We are also considering upgrading the LLVM version that we use, to leverage any new optimizations that have been added.


The extra precision has some nice side effects. For example, recently, while reading the pull requests for the Godot engine, I saw that they were busily discussing making floating point precision for the engine configurable at compile time.

I asked Juan why anyone would want to do this, I thought that games were just content with 32-bit floating point operations.

Juan explained to me that while floats work great in general, once you "move away" from the center - say, in a game, you navigate 100 kilometers out from the center of your world - the math errors start to accumulate and you end up with some interesting visual glitches. There are various mitigation strategies that can be used, and higher precision is just one possibility, one that comes with a performance cost.

Shortly after our conversation, this blog showed up on my Twitter timeline showing this problem:

A few images show the problem. First, we have a sports car model from the pbrt-v3-scenes distribution. Both the camera and the scene are near the origin and everything looks good.

(Cool sports car model courtesy Yasutoshi Mori.) Next, we've translated both the camera and the scene 200,000 units from the origin in x, y, and z. We can see that the car model is getting fairly chunky; this is entirely due to insufficient floating-point precision.

(Thanks again to Yasutoshi Mori.) If we move 5× farther away, to 1 million units from the origin, things really fall apart; the car has become an extremely coarse voxelized approximation of itself - both cool and horrifying at the same time. (Keanu wonders: is Minecraft chunky purely because everything's rendered really far from the origin?)

(Apologies to Yasutoshi Mori for what has been done to his nice model.)
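The chunkiness has a simple explanation: 32-bit floats carry a fixed number of mantissa bits, so the spacing between representable values grows with magnitude. A quick Python check makes this concrete (it uses struct to step between adjacent 32-bit floats; the numbers are properties of IEEE 754 single precision, not of any particular renderer):

```python
import struct

def f32_ulp(x: float) -> float:
    """Distance from the 32-bit float nearest x (x > 0) to the next one up."""
    bits = struct.unpack('I', struct.pack('f', x))[0]
    here = struct.unpack('f', struct.pack('I', bits))[0]
    nxt = struct.unpack('f', struct.pack('I', bits + 1))[0]
    return nxt - here

print(f32_ulp(1.0))          # ~1.19e-07: plenty of precision near the origin
print(f32_ulp(200_000.0))    # 0.015625: positions snap to a coarse grid
print(f32_ulp(1_000_000.0))  # 0.0625: coarse enough to see as "voxelized" geometry
```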

Posted Wed Apr 11 21:12:37 2018 Tags:

Ed Felten tweeted a few days ago: “Often hear that the reason today’s Internet is not more secure is that the early designers failed to imagine that security could ever matter. That is a myth.”

This is indeed a myth.  Much of the current morass can be laid at the feet of the United States government, due to its export regulations around cryptography.

I will testify against the myth.  Bob Scheifler and I started the X Window System in 1984 at MIT, which is a network transparent window system: that is, applications can reside on computers anywhere in the network and use the X display server. As keyboard events may be transmitted over the network, it was clear to us from the get-go that it was a security issue. It is in use to this day on Linux systems all over the world (remote X11 access is no longer allowed: the ssh protocol is used to tunnel the X protocol securely for remote use). By sometime in 1985 or 1986 we were distributing X under the MIT License, which was developed originally for use of the MIT X Window System distribution (I’d have to go dig into my records to get the exact date).

I shared an office with Steve Miller at MIT Project Athena, who was (the first?) programmer working on Kerberos authentication service, which is used by Microsoft’s Active Directory service. Needless to say, we at MIT were concerned about security from the advent of TCP/IP.

We asked MIT whether we could incorporate Kerberos (and other encryption) into the X Window System. According to the advice at the time (and MIT’s lawyers were expert in export control, and later involved in PGP), if we had incorporated strong crypto for authentication into our sources, that would have put the distribution under export control, and that would have defeated X’s easy distribution. The best we could do was to leave enough hooks in the wire protocol that Kerberos support could be added as a source-level “patch” (even calls to functions providing strong authentication/encryption via an external library would have made the distribution subject to export control). Such a patch for X existed, but could never be redistributed: by the time export controls were relaxed, the patch had become mostly moot, as ssh had become available, which, along with the advent of the World Wide Web, was “good enough”, though far from an ideal solution.

Long before the term Open Source software was invented, free and open source software was essential to the Internet's basic services. The choice for all of us working on that software was stark: we could either distribute the product of our work, or enter a legal morass where getting it wrong could end up in court, as Phil Zimmermann did somewhat later with PGP.

Anyone claiming security was a “failure of imagination” does not know the people or the history and should not be taken seriously. Security mattered not just to us, but everyone working on the Internet. There are three software legacies from Project Athena: Kerberos, the X Window System, and instant messaging. We certainly paid much more than lip service to Internet security!

Government export controls crippled Internet security and the design of Internet protocols from the very beginning: we continue to pay the price to this day. Getting security right is really, really hard, and current efforts towards “back doors” or other special access are misguided. We haven’t even recovered from the previous rounds of government regulation, which caused excessive complexity in an already difficult problem and many serious security problems. Let us not repeat this mistake…



Posted Mon Apr 9 22:02:05 2018 Tags:

To reduce the number of bugs filed against libinput consider this a PSA: as of GNOME 3.28, the default click method on touchpads is the 'clickfinger' method (see the libinput documentation, it even has pictures). In short, rather than having a separate left/right button area on the bottom edge of the touchpad, right or middle clicks are now triggered by clicking with 2 or 3 fingers on the touchpad. This is the method macOS has been using for a decade or so.

Prior to 3.28, GNOME used the libinput defaults, which vary depending on the hardware (e.g. Mac touchpads default to clickfinger, most other touchpads to button areas). So if you notice that the right button area disappeared after the 3.28 update, either start using clickfinger or reset the setting using gnome-tweak-tool. There are gsettings commands that achieve the same thing if gnome-tweak-tool is not an option:

$ gsettings range org.gnome.desktop.peripherals.touchpad click-method
$ gsettings get org.gnome.desktop.peripherals.touchpad click-method
$ gsettings set org.gnome.desktop.peripherals.touchpad click-method 'areas'

For reference, the upstream commit is in gsettings-desktop-schemas.

Note that this only affects so-called ClickPads, touchpads where the entire touchpad is a button. Touchpads with separate physical buttons in front of the touchpad are not affected by any of this.

Posted Sun Apr 8 22:14:00 2018 Tags:

This was driving me insane. For years, I have been using Command-Shift-4 to take screenshots on my Mac. When you hit that keypress, you get to select a region of the screen, and the result gets placed on your ~/Desktop directory.

Recently, the feature stopped working.

I first blamed Dropbox settings, but that was not it.

I read every article on the internet about changing the default location and restarting SystemUIServer.

The screencapture command line tool worked, but not the hotkey.

Many reboots later, I disabled System Integrity Protection so I could use iosnoop and dtruss to figure out why screencapture was not logging. I was looking at the logs right there, and saw where things were different, but could not figure out what was wrong.

Then another one of my Macs developed the same problem. So now I had two Mac laptops that could not take screenshots.

And then I realized what was going on.

When you trigger Command-Shift-4, the TouchBar lights up and lets you customize how you take the screenshot, like this:

And if you scroll it, you get these other options:

And I had recently used these settings.

If you change your default here, it is preserved: even though Command-Shift-4 is the shortcut for take-screenshot-and-save-to-file, if you use the TouchBar to pick a different destination, that choice overrides all future uses of the shortcut.

Posted Wed Apr 4 17:40:08 2018 Tags:

As many people know, last week there was a court hearing in the Geniatech vs. McHardy case. This was a case brought claiming a license violation of the Linux kernel in Geniatech devices in the German court of OLG Cologne.

Harald Welte has written up a wonderful summary of the hearing, I strongly recommend that everyone go read that first.

In Harald’s summary, he refers to an affidavit that I provided to the court. Because the case was withdrawn by McHardy, my affidavit was not entered into the public record. I had always assumed that my affidavit would be made public, and since I have had a number of people ask me about what it contained, I figured it was good to just publish it for everyone to be able to see it.

There are some minor edits from what was submitted to the court, such as removing the side-by-side German translation of the English text, and some reformatting around footnotes, because I don’t know how to do that directly here, and they were not all that relevant for anyone who reads this blog. Exhibit A is also not reproduced, as it is just a huge list of all of the kernel releases for which I found no evidence of any contribution by Patrick McHardy.


I, the undersigned, Greg Kroah-Hartman,
declare in lieu of an oath and in the
knowledge that a wrong declaration in
lieu of an oath is punishable, to be
submitted before the Court:

I. With regard to me personally:

1. I have been an active contributor to
   the Linux Kernel since 1999.

2. Since February 1, 2012 I have been a
   Linux Foundation Fellow.  I am currently
   one of five Linux Foundation Fellows
   devoted to full time maintenance and
   advancement of Linux. In particular, I am
   the current Linux stable Kernel maintainer
   and manage the stable Kernel releases. I
   am also the maintainer for a variety of
   different subsystems that include USB,
   staging, driver core, tty, and sysfs,
   among others.

3. I have been a member of the Linux
   Technical Advisory Board since 2005.

4. I have authored two books on Linux Kernel
   development including Linux Kernel in a
   Nutshell (2006) and Linux Device Drivers
   (co-authored Third Edition in 2009.)

5. I have been a contributing editor to Linux
   Journal from 2003 - 2006.

6. I am a co-author of every Linux Kernel
   Development Report. The first report was
   based on my Ottawa Linux Symposium keynote
   in 2006, and the report has been published
   every few years since then. I have been
   one of the co-author on all of them. This
   report includes a periodic in-depth
   analysis of who is currently contributing
   to Linux. Because of this work, I have an
   in-depth knowledge of the various records
   of contributions that have been maintained
   over the course of the Linux Kernel
   project.

   For many years, Linus Torvalds compiled a
   list of contributors to the Linux kernel
   with each release. There are also usenet
   and email records of contributions made
   prior to 2005. In April of 2005, Linus
   Torvalds created a program now known as
   “Git” which is a version control system
   for tracking changes in computer files and
   coordinating work on those files among
   multiple people. Every Git directory on
   every computer contains an accurate
   repository with complete history and full
   version tracking abilities.  Every Git
   directory captures the identity of
   contributors.  Development of the Linux
   kernel has been tracked and managed using
   Git since April of 2005.

   One of the findings in the report is that
   since the 2.6.11 release in 2005, a total
   of 15,637 developers have contributed to
   the Linux Kernel.

7. I have been an advisor on the Cregit
   project and compared its results to other
   methods that have been used to identify
   contributors and contributions to the
   Linux Kernel, such as a tool known as “git
   blame” that is used by developers to
   identify contributions to a git repository
   such as the repositories used by the Linux
   Kernel project.

8. I have been shown documents related to
   court actions by Patrick McHardy to
   enforce copyright claims regarding the
   Linux Kernel. I have heard many people
   familiar with the court actions discuss
   the cases and the threats of injunction
   McHardy leverages to obtain financial
   settlements. I have not otherwise been
   involved in any of the previous court
   actions.

II. With regard to the facts:

1. The Linux Kernel project started in 1991
   with a release of code authored entirely
   by Linus Torvalds (who is also currently a
   Linux Foundation Fellow).  Since that time
   there have been a variety of ways in which
   contributions and contributors to the
   Linux Kernel have been tracked and
   identified. I am familiar with these
   records.

2. The first record of any contribution
   explicitly attributed to Patrick McHardy
   to the Linux kernel is April 23, 2002.
   McHardy’s last contribution to the Linux
   Kernel was made on November 24, 2015.

3. The Linux Kernel 2.5.12 was released by
   Linus Torvalds on April 30, 2002.

4. After review of the relevant records, I
   conclude that there is no evidence in the
   records that the Kernel community relies
   upon to identify contributions and
   contributors that Patrick McHardy made any
   code contributions to versions of the
   Linux Kernel earlier than 2.4.18 and
   2.5.12. Attached as Exhibit A is a list of
   Kernel releases which have no evidence in
   the relevant records of any contribution
   by Patrick McHardy.
Posted Sun Mar 11 02:55:05 2018 Tags:

Mono has a pure C# implementation of the Windows.Forms stack which works on Mac, Linux and Windows. It emulates some of the core of the Win32 API to achieve this.

While Mono's Windows.Forms is not an actively developed UI stack, it is required by a number of third-party libraries, and some of its data types are consumed by other Mono libraries (part of the original design contract), so we have kept it around.

On Mac, Mono's Windows.Forms was built on top of Carbon, an old C-based API that was available on MacOS. This backend was written by Geoff Norton for Mono many years ago.

As Mono switched to 64-bit by default, this meant that Windows.Forms could no longer be used. We had a couple of options: try to upgrade the 32-bit Carbon code to 64-bit Carbon code, or build a new backend on top of Cocoa (using Xamarin.Mac).

For years, I had assumed that Carbon on 64-bit was not really supported, but a recent trip to the console showed that Apple had added a 64-bit port. Either my memory is failing me (common at this age), or Apple at some point changed their mind. I spent all of 20 minutes trying to do an upgrade, but the header files and documentation for the APIs we rely on are just not available, so at best I would be doing guesswork as to which APIs have been upgraded to 64 bits and which APIs are available (rumors from Google searches indicate that while Carbon is 64-bit, not all APIs might be there).

I figured that I could try to build a Cocoa backend with Xamarin.Mac, so I sent this pull request to let me do the work outside of the Mono tree in my copious spare time, and this weekend I did some work on the Cocoa driver.

But this morning, on twitter, Filip Navarra noticed the above, and contacted me:

He has been kind enough to upload this Cocoa-based backend to GitHub.

Going Native

There are a couple of interesting things about this Windows.Forms backend for Cocoa.

The first one is that it is using sysdrawing-coregraphics, a custom version of System.Drawing that we had originally developed for iOS users, which implements the API in terms of CoreGraphics instead of using Cairo, FontConfig, FreeType and Pango.

The second one is that some controls are backed by native AppKit controls - those that implement the IMacNativeControl interface. Among these you can find Button, ComboBox, ProgressBar, ScrollBar and the UpDownStepper.

I will now abandon my weekend hack, and instead get this code drop integrated as the 64-bit Cocoa backend.

Stay tuned!

Posted Tue Feb 20 20:09:04 2018 Tags:

While I have promised my friend Lucas for what seems like a decade that I would build a game in Unity, I still have not managed to build one.

Recently Aras shared his excitement for Unity in 2018. There is a ton on that blog post to unpack.

What I am personally excited about is that Unity now ships an up-to-date Mono in the core.

Jonathan Chambers and his team of amazing low-level VM hackers have been hard at work upgrading Unity's VM and libraries to bring you the latest and greatest Mono runtime in Unity. We have had the privilege of assisting in this migration and providing them with technical support along the way.

The work that the Unity team has done lays the foundation for ongoing updates to their .NET capabilities, so future innovation in the platform can be quickly adopted, bringing new and more joyful capabilities to developers on the platform.

With this new runtime, Unity developers will be able to access and consume a large portion of third party .NET libraries, including all those shiny .NET Standard Libraries - the new universal way of sharing code in the .NET world.

C# 7

The Unity team has also provided very valuable guidance to the C# team, which has directly influenced features in C# 7 such as ref locals and returns.

  • In our own tests using C# for an AR application, we doubled the speed of managed-code AR processing by using these new features.

When users use the new Mono support in Unity, they default to C# 6, as this is the version that Mono's C# compiler fully supports. One of the challenges is that Mono's C# compiler has not fully implemented support for C# 7, as Mono itself moved to Roslyn.

The team at Unity is now working with the Roslyn team to adopt the Roslyn C# compiler in Unity. Because Roslyn is a larger compiler, it is a slower compiler to startup, and Unity does many small incremental compilations. So the team is working towards adopting the server compilation mode of Roslyn. This runs the Roslyn C# compiler as a reusable service which can compile code very quickly, without having to pay the price for startup every time.

Visual Studio

If you install the Unity beta today, you will also see that on Mac, it now defaults to Visual Studio for Mac as its default editor.

JB Evain leads our Unity support for Visual Studio, and he has brought the magic of his Unity plugin to Visual Studio for Mac.

As Unity upgrades its Mono runtime, they also benefit from the extended debugger protocol support in Mono, which bring years of improvements to the debugging experience.

Posted Tue Feb 20 20:09:04 2018 Tags:


Bufferbloat is responsible for much of the poor performance seen in the Internet today and causes latency (called “lag” by gamers), triggered even by your own routine web browsing and video playing.

But bufferbloat’s causes and solutions remind me of the old parable of the Blind Men and the Elephant, updated for the modern Internet:

It was six men of Interstan,
 To learning much inclined,
Who sought to fix the Internet,
 (With market share in mind),
That evil of congestion,
 Might soon be left behind.

Definitely, the pipe-men say,
A wider link we laud,
For congestion occurs only
When bandwidth exceeds baud.
If only the reverse were true,
We would all live like lords.

But count the cost, if bandwidth had
No end, money-men say,
Perpetual upgrade cycle;
True madness lies that way.

From DiffServ-land, their tried and true
Solution now departs,
Some packets are more equal than
Their lesser counterparts.
But choosing which is which presents
New problems in all arts.

“Every packet is sacred!” cries
The data-centre clan,
Demanding bigger buffers for
Their ever-faster LAN.
To them, a millisecond is
Eternity to plan.

The end-to-end principle guides
A certain band of men,
Detecting when congestion strikes;
Inferring bandwidth then.
Alas, old methods outcompete their
Algorithmic maven.

The Isolationists prefer
Explicit signalling,
Ensuring each and every flow
Is equally a king.
If only all the other men
Would listen to them sing!

And so these men of industry,
Disputed loud and long,
Each in his own opinion,
Exceeding stiff and strong,
Though each was partly in the right,
Yet mostly in the wrong!

So, oft in technologic wars,
The disputants, I ween,
Rail on in utter ignorance,
Of what each other mean,
And prate about an Internet,
Not one of them has seen!

Jonathan Morton, with apologies to John Godfrey Saxe

Most technologists are not truly wise: we are usually like the blind men of Interstan. The TCP experts, network operators, telecom operators, router vendors, Internet service operators and users have each had a grip *only* on their own piece of the elephant.

The TCP experts look at TCP and think “if only TCP were changed” in their favorite way, all latency problems would be solved, forgetting that there are many other causes of saturating an Internet link, and that changing every TCP implementation on the planet will take time measured in decades. With the speed of today’s processors, almost everything can potentially transmit at gigabits per second. Saturated (“congested”) links are *normal* operation in the Internet, not abnormal. And indeed, improved congestion avoidance algorithms such as BBR and mark/drop algorithms such as CoDel and PIE are part of the solution to bufferbloat.

And since TCP is innately “unfair” to flows with different RTTs, and we use both nearby (~10 ms) and distant intercontinental (~75-200 ms) services, no TCP-only solution can possibly provide good service. This is a particularly acute problem for people living in remote parts of the world, who see great inherent latency differences between local and non-local services. TCP-only solutions also cannot fix other causes of unnecessary latency, and can never achieve really good latency, as the queues they build depend on the round trip time (RTT) of the flows.

Network operators often think “if only I can increase the network bandwidth” they can banish bufferbloat: but at best, they can only move the bottleneck’s location, and that bottleneck’s buffering may even be worse! When my ISP increased the bandwidth to my house several years ago (at no cost, and without warning me), they made my service much worse: the usual bottleneck moved from the broadband link (where I had bufferbloat controlled using SQM) to the WiFi in my home router. Suddenly, my typical lag became worse by more than a factor of ten, without my having touched anything I owned, and with double the bandwidth!

The logical conclusion of solving bufferbloat this way would be an ultimately uneconomic, impossible-to-build Internet in which each hop is at least as fast as the previous link under all circumstances, a goal that ultimately collides with the limits of wireless bandwidth available at the edge of the Internet. Here lies madness. Today, we often pay for much more broadband bandwidth than we need just to reduce bufferbloat’s effects; most applications are more sensitive to latency than bandwidth, and we see serious bufferbloat issues at peering points as well as in the last mile, at home, and on cellular systems. Unnecessary latency just hurts everyone.

Internet service operators optimize the experience of their own applications for users, but seldom stop to check whether their service damages that of other Internet services and applications. Having a great TV experience at the cost of good phone or video conversations with others is not a good trade-off.

Telecom operators have often tried to limit bufferbloat damage by hacking the congestion window of TCP, which neither provides low latency nor prevents severe bufferbloat in their systems when they are loaded or the station is remote.

Some packets are much more important to deliver quickly, so as to enable timely behavior of applications. These include ACKs, TCP opens, TLS handshakes, and many other packet types such as VOIP, DHCP and DNS lookups. Applications cannot make progress until those responses return. Web browsing and other interactive applications suffer greatly if you ignore this reality. Some network operation experts have developed complex packet classification algorithms to try to send these packets on their way quickly.

Router manufacturers often ship extremely complex packet classification rules and user interfaces that no sane person can possibly understand. How many of you have successfully configured your home router’s QOS page without losing hair from your head?

Lastly, packet networks are inherently bursty, and these packet bursts cause “jitter.” With only First In First Out (FIFO) queuing, bursts of tens or hundreds of packets happen frequently. You don’t want your VOIP packet or other time-sensitive traffic to be arbitrarily delayed by “head of line” blocking behind bursts of packets in other flows. Attacking the right problem is essential.

We must look to solve all of these problems at once, not just individual aspects of the bufferbloat elephant. Flow Queuing CoDel (FQ_CoDel) is the first, but will not be the last such algorithm. And we must attack bufferbloat however it appears.

Flow Queuing Codel Algorithm: FQ_CoDel

Kathleen Nichols and Van Jacobson invented the CoDel algorithm (now formally described in RFC 8289), which represents a fundamental advance over the Active Queue Management algorithms that preceded it, which required careful tuning and could hurt you if not tuned properly. Their recent blog entry helps explain its importance further. CoDel is based on the notion of “sojourn time”, the time that a packet spends in the queue, and drops packets at the head of the queue (rather than tail or random drop). Since the CoDel algorithm is independent of queue length, is self-tuning, and depends solely on sojourn time, combined algorithms become possible that were not possible with predecessors such as RED.
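The core of CoDel’s drop decision can be sketched in a few lines. This is a simplified, illustrative Python model, not the RFC 8289 control law (real CoDel also shortens the drop interval by the square root of the drop count, which is omitted here): packets are timestamped on enqueue, and the head packet is dropped once the sojourn time has stayed above a target for at least one interval.

```python
import collections
import time

TARGET = 0.005    # 5 ms: acceptable standing queue delay
INTERVAL = 0.100  # 100 ms: on the order of a worst-case RTT

class CoDelQueue:
    """Toy CoDel sketch: head-drop when sojourn time stays above
    TARGET for at least INTERVAL. Illustration only."""

    def __init__(self):
        self.q = collections.deque()
        self.first_above_time = None  # when sojourn first exceeded TARGET

    def enqueue(self, pkt, now=None):
        # Timestamp each packet as it enters the queue.
        self.q.append((now if now is not None else time.monotonic(), pkt))

    def dequeue(self, now=None):
        now = now if now is not None else time.monotonic()
        while self.q:
            ts, pkt = self.q[0]
            sojourn = now - ts
            if sojourn < TARGET:
                self.first_above_time = None      # queue is draining fine
            elif self.first_above_time is None:
                self.first_above_time = now       # start the clock
            elif now - self.first_above_time >= INTERVAL:
                self.q.popleft()                  # head drop: signal the sender
                self.first_above_time = now       # restart (real CoDel shortens it)
                continue
            self.q.popleft()
            return pkt
        return None
```

Note that the decision depends only on time in queue, never on queue length, which is exactly the property that makes it self-tuning across bandwidths.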

Dave Täht had been experimenting with Paul McKenney’s Stochastic Fair Queuing algorithm (SFQ) as part of his work on bufferbloat, confirming that many of the issues caused by head of line blocking (and by the unfairness between competing flows with differing RTTs) were well handled by that algorithm.

CoDel and Dave’s and Paul’s work inspired Eric Dumazet to invent the FQ_CoDel algorithm almost immediately after the CoDel algorithm became available. Note that we refer to it as “Flow Queuing” rather than “Fair Queuing”: it is definitely “unfair” in certain respects, rather than a true “Fair Queuing” algorithm.

Synopsis of the FQ_Codel Algorithm Itself

FQ_CoDel, documented in RFC 8290, uses a hash (typically, but not limited to, the usual 5-tuple identifying a flow), and establishes a queue for each, as does the Stochastic Fair Queuing (SFQ) algorithm. The hash bounds the memory usage of the algorithm, enabling use even on small embedded routers or in hardware. The number of hash buckets (flows) and what constitutes a flow can be adjusted as needed; in practice the flows are independent TCP flows.

There are two sets of flow queues: those for flows which have built a queue, and those for new flows.

The packet scheduler preferentially schedules packets from flows that have not built a queue over those which have built a queue (with provisions to ensure that flows that have built queues cannot be starved, and continue to make progress).

If a flow’s queue empties, and packets again appear later, that flow is considered a new flow again.

Since CoDel only depends on the time in queue, it is easily applied in FQ_CoDel to ensure that TCP flows are properly policed to keep their behavior from building large queues.

CoDel does not require tuning and works over a very wide range of bandwidths and traffic. Combining flow queuing with a mark/drop algorithm in this way is impossible with older AQM algorithms such as Random Early Detection (RED), which depend on queue length rather than time in queue and which require tuning. Without some mark/drop algorithm such as CoDel, individual flows might fill their FQ_CoDel queues; while they would not affect other flows, those flows would still suffer bufferbloat.
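The flow-queuing half of the algorithm described above can be sketched as follows. This is a toy Python model under simplifying assumptions: no per-queue CoDel, no byte-based deficit accounting as in RFC 8290, and a bare flow key standing in for the hashed 5-tuple.

```python
import collections

NUM_BUCKETS = 1024  # bounds memory use, like fq_codel's default flow count

class FQSketch:
    """Toy flow-queuing scheduler: hash each packet's flow key into a
    bucket, keep separate lists of new and old flows, and serve new
    flows first. Illustration only; per-queue CoDel is omitted."""

    def __init__(self):
        self.queues = collections.defaultdict(collections.deque)
        self.new_flows = collections.deque()
        self.old_flows = collections.deque()

    def enqueue(self, flow_key, pkt):
        bucket = hash(flow_key) % NUM_BUCKETS
        q = self.queues[bucket]
        if not q and bucket not in self.new_flows and bucket not in self.old_flows:
            self.new_flows.append(bucket)   # an empty queue means a new flow
        q.append(pkt)

    def dequeue(self):
        # Serve new flows ahead of old ones; a served new flow is
        # demoted to the old list so it cannot starve the others.
        while self.new_flows or self.old_flows:
            if self.new_flows:
                bucket = self.new_flows.popleft()
                demote = True
            else:
                bucket = self.old_flows.popleft()
                demote = False
            q = self.queues[bucket]
            if q:
                pkt = q.popleft()
                if q or demote:
                    self.old_flows.append(bucket)  # still active, keep scheduling
                return pkt
        return None
```

Even in this stripped-down form, a sparse flow’s first packet jumps ahead of queued bulk traffic, which is the source of most of the latency benefits listed below.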

What Effects Does the FQ_CoDel Algorithm Have?

FQ_Codel has a number of wonderful properties for such a simple algorithm:

  • Simple “request/response” or time based protocols are preferentially scheduled relative to bulk data transport. This means that your VOIP packets, your TCP handshakes, cryptographic associations, your button press in your game, your DHCP or other basic network protocols all get preferential service without the complexity of extensive packet classification, even under very heavy load of other ongoing flows. Your phone call can work well despite large downloads or video use.
  • Dramatically improved service for time sensitive protocols, since ACKs, TCP opens, security associations, DNS lookups, etc., fly through the bottleneck link.
  • Idle applications get immediate service as they return from their idle to active state. Such transitions are often due to an interactive user’s actions, and enabling rapid restart of such flows is preferable to serving ongoing, time insensitive bulk data transfer.
  • Many/most applications put their most essential information in the first data packets after a connection opens: FQ_CoDel sees that these first packets are quickly forwarded. For example, web browsers need the size information included in an image before being able to complete page layout; this information is typically contained in the first packet of the image. Most later packets in a large flow have less critical time requirements than the first packet(s).
  • Your web browsing is significantly faster, particularly when you share your home connection with others, since head of line blocking is no longer such an issue.
  • Flows that are “well behaved” and do not use more than their share of the available bandwidth at the bottleneck are rewarded for their good citizenship. This provides a positive incentive for good behavior (which the packet pacing work at Google has been exploiting with BBR, for example, which depends on pacing packets). Unpaced packet bursts are poison for “jitter”, that is, the variance of latency, which sets how close to real time other applications (such as VOIP) can operate.
  • If there are bulk data flows of differing RTT’s, the algorithm ensures (in the limit) fairness between the flows, solving an issue that cannot itself be solved by TCP.
  • User simplicity: without complex packet classification schemes, FQ_CoDel schedules packets far better than previous rigid classification systems.
  • FQ_CoDel is extremely small, simple, and fast, and it scales both up and down in bandwidth and number of flows, and down onto very constrained systems. It is safe for FQ_CoDel to be “always on”.

Having an algorithm that “just works” for 99% of cases and attacks the real problem is a huge advantage and simplification. Packet classification need only be used to provide guarantees when essential, rather than to work around bufferbloat. No (sane) human can possibly successfully configure the QOS page found on today’s home router. Once bufferbloat is removed and flow queuing is present, any classification rule set can be tremendously simplified and understood.

Limitations and Warnings

If a flow queuing AQM is not present at the bottleneck in your path, nothing you do can save you under load. At most you can move the bottleneck by inserting an artificial one.

The FQ_CoDel algorithm [RFC 8290], particularly as a queue discipline, is not a panacea for bufferbloat, as buffers hide all over today’s systems. There have been about a dozen significant improvements to reduce bufferbloat in Linux alone, starting with Tom Herbert’s BQL and including Eric Dumazet’s TCP Small Queues, TSO sizing, and FQ scheduler, to name just a few. Our thanks to everyone in the Linux community for their efforts. FQ_CoDel, implemented in its generic form as the Linux queue discipline fq_codel, only solves part of the bufferbloat problem: that found in the qdisc layer. The FQ_CoDel algorithm itself is applicable to a wide variety of circumstances, not just in networking gear, but sometimes even in applications.

WiFi and Wireless

Good examples of the buffering problems occur in wireless and other technologies such as DOCSIS and DSL. WiFi is particularly hard: rapid bandwidth variation and packet aggregation mean that WiFi drivers must have significant internal buffering. Cellular systems have similar (and different) challenges.

Toke Høiland-Jørgensen, Michał Kazior, Dave Täht, Per Hurtig and Anna Brunstrom reported amazing improvements for WiFi at the USENIX Annual Technical Conference. Here, the FQ_CoDel algorithm is but part of a more complex algorithm that handles aggregation and transmission air-time fairness (which Linux has heretofore lacked). The results are stunning, and we hope to see similar improvements in other technologies by applying FQ_CoDel.

This chart shows the latency under load, in milliseconds, of the new WiFi algorithm now being adopted in Linux; a log scale is used so that both results can be plotted on the same graph. The green curve shows the new driver’s latency under load (the cumulative probability of the latency observed by packets), while the orange curve shows the previous Linux driver implementation. You seldom see one to two orders of magnitude improvement in anything. Note that latency can exceed one second under load with existing drivers. The implementation of air-time fairness in this algorithm also significantly increases the total bandwidth available in a busy WiFi network, by using the spectrum more efficiently, while delivering this result! See the above paper for more details and data.

This new driver structure and algorithm is in the process of adoption in Linux, though most drivers do not yet take advantage of what it offers. You should demand that your device vendors using Linux update their drivers to the new framework.

While the details may vary for different technologies, we hope the WiFi work will help illuminate techniques appropriate for applying FQ_CoDel (or other flow queuing algorithms, as appropriate) to other technologies such as cellular networks.

Performance and Comparison to Other Work

In the implementations available in Linux, FQ_CoDel has seriously outperformed all comers, as shown in “The Good, the Bad and the WiFi: Modern AQMs in a residential setting” by Toke Høiland-Jørgensen, Per Hurtig and Anna Brunstrom. This paper is unique(?) in that it compares the algorithms consistently, running identical tests on all algorithms on the same up-to-date test hardware and software versions, and is an apples-to-apples comparison of running code.

Our favorite benchmark for “latency under load” is the “rrul” test from Toke’s flent test tool. It demonstrates when flows may be damaging the latency of other flows, as well as goodput and latency. Low-bandwidth performance is more difficult than high-bandwidth: at 1 Mbps, a single 1500-byte packet represents 12 milliseconds! The results show fq_codel outperforming all comers in latency, while preserving very high goodput for the available bandwidth. The combination of an effective self-adjusting mark/drop algorithm with flow queuing is impossible to beat. The following graph measures goodput versus latency for a number of queue disciplines, at bandwidths of 1 and 10 Mbps (not atypical of the low end of WiFi or DSL performance). See the paper for more details and results.


FQ_CoDel radically outperforms CoDel, while solving problems that no conventional TCP mark/drop algorithm can solve by itself. The combination of flow queuing with CoDel is a killer improvement, and improves on the behavior of CoDel (or PIE) alone, since ACKs are not delayed.

Pfifo_fast has previously been the default queue discipline in Linux (and similar queuing is used on other operating systems); for the purposes of this test, it is a simple FIFO queue. Note that the default length of the FIFO queue in most of today’s operating systems is usually ten times longer than that used here. The FIFO queue length here was chosen so that the differences between algorithms would remain easily visible on the same linear plot!

Other self-tuning mark/drop algorithms such as PIE [RFC 8033] are becoming available, and some of those are beginning to be combined with flow queuing to good effect. As you see above, PIE without flow queuing is not competitive (nor is CoDel). An FQ-PIE algorithm, recently implemented for FreeBSD by Rasool Al-Saadi and Grenville Armitage, is more interesting and promising. Again, flow queuing combined with an auto-adjusting mark/drop algorithm is the key to FQ-PIE’s behavior.

Availability of Modern AQM Algorithms

Implementations of both CoDel and FQ_CoDel were released in Linux 3.5 in July 2012, and in FreeBSD 11 in October 2016 (and backported to FreeBSD 10.3); FQ-PIE is only available in FreeBSD (though preliminary patches for Linux exist).

FQ_CoDel has seen extensive testing, deployment and use in Linux since its initial release. There is never a reason to use plain CoDel, which is present in Linux solely to enable convenient performance comparisons, as in the above results. FQ_CoDel first saw wide use and deployment as part of OpenWrt, where it became the default queue discipline and was used heavily as part of the SQM (Smart Queue Management) system to mitigate “last mile” bufferbloat. FQ_CoDel is now the default queue discipline in many circumstances on most Linux distributions, and is being used in an increasing number of commercial products.

The FQ_CoDel based WiFi improvements shown here are available in the latest LEDE/OpenWrt release for a few chips. WiFi chip vendors and router vendors take note: your competitors may be about to pummel you, as this radical improvement is now becoming available. The honor of being the first commercial product to incorporate this new WiFi code goes to the Evenroute IQrouter.

PIE, but not FQ-PIE, was mandated in DOCSIS 3.1; it is much less effective than FQ_CoDel (or FQ-PIE).

Continuing Work: CAKE

Research into other AQMs continues apace. FQ_CoDel and FQ-PIE are far from the last words on the topic.

Rate limiting is necessary to mitigate bufferbloat in equipment that has no effective queue management (such as most existing DSL or most Cable modems). While we have used fq_codel in concert with Hierarchical Token Bucket (HTB) rate limiting for years as part of the Smart Queue Management (SQM) scripts (in OpenWrt and elsewhere), one can go further in an integrated queue discipline to address problems difficult to solve piecemeal, and make configuration radically simpler. The CAKE queue discipline, available today in OpenWrt, additionally adds integrated framing compensation, diffserv support and CoDel improvements. We hope to see CAKE upstream in Linux soon. Please come help in its testing and development.

As you see above, the Make WiFi Fast project is making headway, and makes WiFi much preferable to cellular service when it is available; but much, much more is left to be done. Investigating fixing bufferbloat in WiFi made us aware of just how much additional improvement in WiFi is possible. Please come help!


Ye men of Interstan, to learning much inclined, see the elephant, by opening your eyes: 

ISPs, insist that your equipment manufacturers support modern flow queuing automatic queue management. Chip manufacturers: ensure your chips implement best practices in this area, or see your competitors win business.

Gamers, stock traders, musicians, and anyone wanting decent telephone or video calls take note. Insist your ISP fix bufferbloat in its network. Linux and RFC 8290 prove you can have your cake and eat it too. Stop suffering.

Insist on up to date flow queuing queue management in devices you buy and use, and insist that bufferbloat be fixed everywhere.

Posted Mon Feb 12 00:39:15 2018 Tags:


This post is based on a whitepaper I wrote at the beginning of 2016 to be used to help many different companies understand the Linux kernel release model and encourage them to start taking the LTS stable updates more often. I then used it as the basis of a presentation I gave at the Kernel Recipes conference in September 2017, which can be seen here.

With the recent craziness of Meltdown and Spectre, I’ve seen lots of totally incorrect things written about how Linux is released and how we handle security patches, so I figured it is time to dust off the text, update it in a few places, and publish this here for everyone to benefit from.

I would like to thank the reviewers who helped shape the original whitepaper, which has helped many companies understand that they need to stop “cherry picking” random patches into their device kernels. Without their help, this post would be a total mess. All problems and mistakes in here are, of course, all mine. If you notice any, or have any questions about this, please let me know.


This post describes how the Linux kernel development model works, what a long term supported kernel is, how the kernel developers approach security bugs, and why all systems that use Linux should be using all of the stable releases and not attempting to pick and choose random patches.

Linux Kernel development model

The Linux kernel is the largest collaborative software project ever. In 2017, over 4,300 different developers from over 530 different companies contributed to the project. There were 5 different releases in 2017, with each release containing between 12,000 and 14,500 different changes. On average, 8.5 changes are accepted into the Linux kernel every hour of every day. A non-scientific study (i.e. Greg’s mailbox) shows that each change needs to be submitted 2-3 times before it is accepted into the kernel source tree due to the rigorous review and testing process that all kernel changes are put through, so the engineering effort happening is much larger than the 8.5 changes per hour that are accepted.

At the end of 2017 the size of the Linux kernel was just over 61 thousand files consisting of 25 million lines of code, build scripts, and documentation (kernel release 4.14). The Linux kernel contains the code for all of the different chip architectures and hardware drivers that it supports. Because of this, an individual system only runs a fraction of the whole codebase. An average laptop uses around 2 million lines of kernel code from 5 thousand files to function properly, while the Pixel phone uses 3.2 million lines of kernel code from 6 thousand files due to the increased complexity of a SoC.

Kernel release model

With the release of the 2.6 kernel in December of 2003, the kernel developer community switched from the previous model of having a separate development and stable kernel branch, and moved to a “stable only” branch model. A new release happened every 2 to 3 months, and that release was declared “stable” and recommended for all users to run. This change in development model was due to the very long release cycle prior to the 2.6 kernel (almost 3 years), and the struggle to maintain two different branches of the codebase at the same time.

The numbering of the kernel releases started out being 2.6.x, where x was an incrementing number that changed on every release. The value of the number has no meaning, other than that it is newer than the previous kernel release. In July 2011, Linus Torvalds changed the version number to 3.x after the 2.6.39 kernel was released. This was done because the higher numbers were starting to cause confusion among users, and because Greg Kroah-Hartman, the stable kernel maintainer, was getting tired of the large numbers and bribed Linus with a fine bottle of Japanese whisky.

The change to the 3.x numbering series did not mean anything other than a change of the major release number, and this happened again in April 2015 with the movement from the 3.19 release to the 4.0 release number. It is not remembered if any whisky exchanged hands when this happened. At the current kernel release rate, the number will change to 5.x sometime in 2018.

Stable kernel releases

The Linux kernel stable release model started in 2005, when the existing development model of the kernel (a new release every 2-3 months) was determined to not be meeting the needs of most users. Users wanted bugfixes that were made during those 2-3 months, and the Linux distributions were getting tired of trying to keep their kernels up to date without any feedback from the kernel community. Trying to keep individual kernels secure and with the latest bugfixes was a large and confusing effort by lots of different individuals.

Because of this, the stable kernel releases were started. These releases are based directly on Linus’s releases, and are released every week or so, depending on various external factors (time of year, available patches, maintainer workload, etc.)

The numbering of the stable releases starts with the number of the kernel release, and an additional number is added to the end of it.

For example, the 4.9 kernel is released by Linus, and then the stable kernel releases based on this kernel are numbered 4.9.1, 4.9.2, 4.9.3, and so on. This sequence is usually shortened with the number “4.9.y” when referring to a stable kernel release tree. Each stable kernel release tree is maintained by a single kernel developer, who is responsible for picking the needed patches for the release, and doing the review/release process. Where these changes are found is described below.
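The naming convention above is mechanical enough to express in a couple of lines. The following is a hypothetical helper, purely for illustration, that maps a stable release number to its tree name:

```python
def stable_tree(version):
    """Map a stable release like '4.9.3' to its tree name '4.9.y';
    a mainline release like '4.9' is returned unchanged, since it is
    the base of the tree rather than a stable update."""
    parts = version.split(".")
    if len(parts) == 3:
        return ".".join(parts[:2]) + ".y"
    return version
```

So 4.9.1, 4.9.2, and 4.9.3 all belong to the 4.9.y tree, while 4.9 itself is Linus’s release that the tree is based on.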

Stable kernels are maintained for as long as the current development cycle is happening. After Linus releases a new kernel, the previous stable kernel release tree is stopped and users must move to the newer released kernel.

Long-Term Stable kernels

After a year of this new stable release process, it was determined that many different users of Linux wanted a kernel to be supported for longer than just a few months. Because of this, the Long Term Supported (LTS) kernel release came about. The first LTS kernel was 2.6.16, released in 2006. Since then, a new LTS kernel has been picked once a year. That kernel will be maintained by the kernel community for at least 2 years. See the next section for how a kernel is chosen to be an LTS release.

Currently the LTS kernels are the 4.4.y, 4.9.y, and 4.14.y releases, and a new kernel is released on average, once a week. Along with these three kernel releases, a few older kernels are still being maintained by some kernel developers at a slower release cycle due to the needs of some users and distributions.

Information about all long-term stable kernels, who is in charge of them, and how long they will be maintained, can be found on the release page.

LTS kernel releases average 9-10 accepted patches per day, while the normal stable kernel releases contain 10-15 patches per day. The number of patches fluctuates per release given the current state of the corresponding development kernel release, and other external variables. The older an LTS kernel is, the fewer patches are applicable to it, because many recent bugfixes are not relevant to older kernels. However, the older a kernel is, the harder it is to backport the changes that need to be applied, due to the changes in the codebase. So while there might be a lower number of overall patches being applied, the effort involved in maintaining an LTS kernel is greater than maintaining the normal stable kernel.

Choosing the LTS kernel

The method of picking which kernel will become the LTS release, and who will maintain it, has changed over the years from a semi-random method to something that is hopefully more reliable.

Originally it was merely based on what kernel the stable maintainer’s employer was using for their product (2.6.16.y and 2.6.27.y) in order to make the effort of maintaining that kernel easier. Other distribution maintainers saw the benefit of this model and got together and colluded to get their companies to all release a product based on the same kernel version without realizing it (2.6.32.y). After that was very successful, and allowed developers to share work across companies, those companies decided to not do that anymore, so future LTS kernels were picked on an individual distribution’s needs and maintained by different developers (3.0.y, 3.2.y, 3.12.y, 3.16.y, and 3.18.y) creating more work and confusion for everyone involved.

This ad-hoc method of catering to only specific Linux distributions was not beneficial to the millions of devices that used Linux in an embedded system and were not based on a traditional Linux distribution. Because of this, Greg Kroah-Hartman decided that the choice of the LTS kernel needed to change to a method in which companies can plan on using the LTS kernel in their products. The rule became “one kernel will be picked each year, and will be maintained for two years.” With that rule, the 3.4.y, 3.10.y, and 3.14.y kernels were picked.

Due to a large number of different LTS kernels being released all in the same year, causing lots of confusion for vendors and users, the rule of no new LTS kernels being based on an individual distribution’s needs was created. This was agreed upon at the annual Linux kernel summit and started with the 4.1.y LTS choice.

During this process, the LTS kernel would only be announced after the release happened, making it hard for companies to plan ahead of time what to use in their new product, causing lots of guessing and misinformation to be spread around. This was done on purpose as previously, when companies and kernel developers knew ahead of time what the next LTS kernel was going to be, they relaxed their normal stringent review process and allowed lots of untested code to be merged (2.6.32.y). The fallout of that mess took many months to unwind and stabilize the kernel to a proper level.

The kernel community discussed this issue at its annual meeting and decided to mark the 4.4.y kernel as a LTS kernel release, much to the surprise of everyone involved, with the goal that the next LTS kernel would be planned ahead of time to be based on the last kernel release of 2016 in order to provide enough time for companies to release products based on it in the next holiday season (2017). This is how the 4.9.y and 4.14.y kernels were picked as the LTS kernel releases.

This process seems to have worked out well, without many problems being reported against the 4.9.y tree, despite it containing over 16,000 changes, making it the largest kernel to ever be released.

Future LTS kernels should be planned based on this release cycle (the last kernel of the year). This should allow SoC vendors to plan ahead on their development cycle to not release new chipsets based on older, and soon to be obsolete, LTS kernel versions.

Stable kernel patch rules

The rules for what can be added to a stable kernel release have remained almost identical for the past 12 years. The full list of the rules for patches to be accepted into a stable kernel release can be found in the Documentation/process/stable_kernel_rules.rst kernel file and are summarized here. A stable kernel change:

  • must be obviously correct and tested;
  • must not be bigger than 100 lines;
  • must fix only one thing;
  • must fix something that has been reported to be an issue;
  • can be a new device id or quirk for hardware, but must not add major new functionality;
  • must already be merged into Linus’s tree.

The last rule, “a change must already be in Linus’s tree”, prevents the kernel community from losing fixes. The community never wants a fix to go into a stable kernel release that is not already in Linus’s tree, so that anyone who upgrades should never see a regression. This prevents many of the problems that other projects which maintain separate stable and development branches can have.
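As a toy illustration only (the real process is human review by the stable maintainer, not a program), the summarized rules could be written as a checklist. All the key names here are made up for the example:

```python
def acceptable_for_stable(patch):
    """Toy checklist mirroring the summarized stable rules.
    `patch` is a dict with hypothetical keys, purely illustrative."""
    checks = [
        patch.get("tested", False),            # obviously correct and tested
        patch.get("lines_changed", 0) <= 100,  # not bigger than 100 lines
        patch.get("fixes_one_thing", False),   # fixes only one thing
        patch.get("reported_issue", False),    # fixes a reported issue
        patch.get("in_linus_tree", False),     # already merged upstream
    ]
    return all(checks)
```

The point the checklist makes concrete is that every rule must hold at once: a perfect one-line fix still fails if it has not first landed in Linus’s tree.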

Kernel Updates

The Linux kernel community has promised its userbase that no upgrade will ever break anything that is currently working in a previous release. That promise was made in 2007 at the annual Kernel developer summit in Cambridge, England, and still holds true today. Regressions do happen, but those are the highest priority bugs and are either quickly fixed, or the change that caused the regression is quickly reverted from the Linux kernel tree.

This promise holds true for both the incremental stable kernel updates, as well as the larger “major” updates that happen every three months.

The kernel community can only make this promise for the code that is merged into the Linux kernel tree. Any code that is merged into a device’s kernel that is not in the releases is unknown and interactions with it can never be planned for, or even considered. Devices based on Linux that have large patchsets can have major issues when updating to newer kernels, because of the huge number of changes between each release. SoC patchsets are especially known to have issues with updating to newer kernels due to their large size and heavy modification of architecture specific, and sometimes core, kernel code.

Most SoC vendors do want to get their code merged upstream before their chips are released, but the reality of project-planning cycles and ultimately the business priorities of these companies prevent them from dedicating sufficient resources to the task. This, combined with the historical difficulty of pushing updates to embedded devices, results in almost all of them being stuck on a specific kernel release for the entire lifespan of the device.

Because of the large out-of-tree patchsets, most SoC vendors are starting to standardize on using the LTS releases for their devices. This allows devices to receive bug and security updates directly from the Linux kernel community, without having to rely on the SoC vendor’s backporting efforts, which traditionally are very slow to respond to problems.

It is encouraging to see that the Android project has standardized on the LTS kernels as a “minimum kernel version requirement”. Hopefully that will allow the SoC vendors to continue to update their device kernels in order to provide more secure devices for their users.


When doing kernel releases, the Linux kernel community almost never declares specific changes as “security fixes”. This is due to the basic problem of the difficulty in determining if a bugfix is a security fix or not at the time of creation. Also, many bugfixes are only determined to be security related after much time has passed, so to keep users from getting a false sense of security by not taking patches, the kernel community strongly recommends always taking all bugfixes that are released.

Linus summarized the reasoning behind this behavior in an email to the Linux Kernel mailing list in 2008:

On Wed, 16 Jul 2008, wrote:
> you should check out the last few -stable releases then and see how
> the announcement doesn't ever mention the word 'security' while fixing
> security bugs

Umm. What part of "they are just normal bugs" did you have issues with?

I expressly told you that security bugs should not be marked as such,
because bugs are bugs.

> in other words, it's all the more reason to have the commit say it's
> fixing a security issue.


> > I'm just saying that why mark things, when the marking have no meaning?
> > People who believe in them are just _wrong_.
> what is wrong in particular?

You have two cases:

 - people think the marking is somehow trustworthy.

   People are WRONG, and are misled by the partial markings, thinking that
   unmarked bugfixes are "less important". They aren't.

 - People don't think it matters

   People are right, and the marking is pointless.

In either case it's just stupid to mark them. I don't want to do it,
because I don't want to perpetuate the myth of "security fixes" as a
separate thing from "plain regular bug fixes".

They're all fixes. They're all important. As are new features, for that

> when you know that you're about to commit a patch that fixes a security
> bug, why is it wrong to say so in the commit?

It's pointless and wrong because it makes people think that other bugs
aren't potential security fixes.

What was unclear about that?


This email can be found here, and the whole thread is recommended reading for anyone who is curious about this topic.

When security problems are reported to the kernel community, they are fixed as soon as possible and pushed out publicly to the development tree and the stable releases. As described above, the changes are almost never described as a “security fix”, but rather look like any other bugfix for the kernel. This is done to allow affected parties the ability to update their systems before the reporter of the problem announces it.

Linus describes this method of development in the same email thread:

On Wed, 16 Jul 2008, wrote:
> we went through this and you yourself said that security bugs are *not*
> treated as normal bugs because you do omit relevant information from such
> commits

Actually, we disagree on one fundamental thing. We disagree on
that single word: "relevant".

I do not think it's helpful _or_ relevant to explicitly point out how to
tigger a bug. It's very helpful and relevant when we're trying to chase
the bug down, but once it is fixed, it becomes irrelevant.

You think that explicitly pointing something out as a security issue is
really important, so you think it's always "relevant". And I take mostly
the opposite view. I think pointing it out is actually likely to be

For example, the way I prefer to work is to have people send me and the
kernel list a patch for a fix, and then in the very next email send (in
private) an example exploit of the problem to the security mailing list
(and that one goes to the private security list just because we don't want
all the people at universities rushing in to test it). THAT is how things
should work.

Should I document the exploit in the commit message? Hell no. It's
private for a reason, even if it's real information. It was real
information for the developers to explain why a patch is needed, but once
explained, it shouldn't be spread around unnecessarily.


Full details of how security bugs can be reported to the kernel community in order to get them resolved and fixed as soon as possible can be found in the kernel file Documentation/admin-guide/security-bugs.rst.

Because security bugs are not announced to the public by the kernel team, CVE numbers for Linux kernel-related issues are usually released weeks, months, and sometimes years after the fix was merged into the stable and development branches, if at all.

Keeping a secure system

When deploying a device that uses Linux, it is strongly recommended that all LTS kernel updates be taken by the manufacturer and pushed out to their users after proper testing shows the update works well. As was described above, it is not wise to try to pick and choose various patches from the LTS releases because:

  • The releases have been reviewed by the kernel developers as a whole, not in individual parts
  • It is hard, if not impossible, to determine which patches fix “security” issues and which do not. Almost every LTS release contains at least one known security fix, and many more that are as yet “unknown”.
  • If testing shows a problem, the kernel developer community will react quickly to resolve the issue. If you wait months or years to do an update, the kernel developer community will not be able to even remember what the updates were, given the long delay.
  • Changes to parts of the kernel that you do not build/run are fine and can cause no problems to your system. Trying to filter out only the changes you run will result in a kernel tree that is impossible to merge correctly with future upstream releases.

Note, this author has audited many SoC kernel trees that attempt to cherry-pick random patches from the upstream LTS releases. In every case, severe security fixes have been ignored and not applied.

As proof of this, I demoed at the Kernel Recipes talk referenced above how trivial it was to crash all of the latest flagship Android phones on the market with a tiny userspace program. The fix for this issue was released 6 months prior in the LTS kernel that the devices were based on, yet none of the devices had upgraded or fixed their kernels for this problem. As of this writing (5 months later) only two devices have fixed their kernel and are now not vulnerable to that specific bug.

Posted Mon Feb 5 17:40:11 2018 Tags:

For the last few weeks, Benjamin Tissoires and I have been working on a new project: Tuhi [1], a daemon to connect to and download data from Wacom SmartPad devices like the Bamboo Spark, Bamboo Slate and, eventually, the Bamboo Folio and the Intuos Pro Paper devices. These devices are not traditional graphics tablets plugged into a computer but rather smart notepads where the user's offline drawing is saved as stroke data in vector format and later synchronised with the host computer over Bluetooth. There it can be converted to SVG, integrated into the applications, etc. Wacom's application for this is Inkspace.

There is no official Linux support for these devices. Benjamin and I started looking at the protocol dumps last year and, luckily, they're not completely indecipherable and reverse-engineering them was relatively straightforward. Now it is a few weeks later and we have something that is usable (if a bit rough) and provides the foundation for supporting these devices properly on the Linux desktop. The repository is available on github at

The main core is a DBus session daemon written in Python. That daemon connects to the devices and exposes them over a custom DBus API. That API is relatively simple: it supports methods to search for devices, pair devices, listen for data from devices and, finally, to fetch the data. It has some basic extras built in, like temporary storage of the drawing data so it survives daemon restarts. But otherwise it's a three-way mapper between the Bluez device, the serial controller we talk to on the device, and the Tuhi DBus API presented to the clients. One such client is the little commandline tool that comes with tuhi: tuhi-kete [2]. Here's a short example:

$> ./tools/
Tuhi shell control
tuhi> search on
INFO: Pairable device: E2:43:03:67:0E:01 - Bamboo Spark
tuhi> pair E2:43:03:67:0E:01
INFO: E2:43:03:67:0E:01 - Bamboo Spark: Press button on device now
INFO: E2:43:03:67:0E:01 - Bamboo Spark: Pairing successful
tuhi> listen E2:43:03:67:0E:01
INFO: E2:43:03:67:0E:01 - Bamboo Spark: drawings available: 1516853586, 1516859506, [...]
tuhi> list
E2:43:03:67:0E:01 - Bamboo Spark
tuhi> info E2:43:03:67:0E:01
E2:43:03:67:0E:01 - Bamboo Spark
Available drawings:
* 1516853586: drawn on the 2018-01-25 at 14:13
* 1516859506: drawn on the 2018-01-25 at 15:51
* 1516860008: drawn on the 2018-01-25 at 16:00
* 1517189792: drawn on the 2018-01-29 at 11:36
tuhi> fetch E2:43:03:67:0E:01 1516853586
INFO: Bamboo Spark: saved file "Bamboo Spark-2018-01-25-14-13.svg"
I won't go into the details because most should be obvious and this is purely a debugging client, not a client we expect real users to use. Plus, everything is still changing quite quickly at this point.

The next step is to get a proper GUI application working. As usual with any GUI-related matter, we'd really appreciate some help :)

The project is young and relying on reverse-engineered protocols means there are still a few rough edges. Right now, the Bamboo Spark and Slate are supported because we have access to those. The Folio should work, it looks like it's a re-packaged Slate. Intuos Pro Paper support is still pending, we don't have access to a device at this point. If you're interested in testing or helping out, come on over to the github site and get started!

[1] tuhi: Maori for "writing, script"
[2] kete: Maori for "kit"

Posted Tue Jan 30 08:59:00 2018 Tags:

I keep getting a lot of private emails about my previous post about the latest status of the Linux kernel patches to resolve both the Meltdown and Spectre issues.

These questions all seem to break down into two different categories, “What is the state of the Spectre kernel patches?”, and “Is my machine vulnerable?”

State of the kernel patches

As always, covers the technical details about the latest state of the kernel patches to resolve the Spectre issues, so please go read that to find out that type of information.

And yes, it is behind a paywall for a few more weeks. You should be buying a subscription to get this type of thing!

Is my machine vulnerable?

For this question, it’s now a very simple answer, you can check it yourself.

Just run the following command at a terminal window to determine what the state of your machine is:

$ grep . /sys/devices/system/cpu/vulnerabilities/*

On my laptop, right now, this shows:

$ grep . /sys/devices/system/cpu/vulnerabilities/*
/sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI
/sys/devices/system/cpu/vulnerabilities/spectre_v1:Vulnerable
/sys/devices/system/cpu/vulnerabilities/spectre_v2:Vulnerable: Minimal generic ASM retpoline

This shows that my kernel is properly mitigating the Meltdown problem by implementing PTI (Page Table Isolation), and that my system is still vulnerable to the Spectre variant 1, but is trying really hard to resolve the variant 2, but is not quite there (because I did not build my kernel with a compiler to properly support the retpoline feature).

If your kernel does not have that sysfs directory or files, then obviously there is a problem and you need to upgrade your kernel!

Some “enterprise” distributions did not backport the changes for this reporting, so if you are running one of those types of kernels, go bug the vendor to fix that, you really want a unified way of knowing the state of your system.

Note that right now, these files are only valid for the x86-64 based kernels, all other processor types will show something like “Not affected”. As everyone knows, that is not true for the Spectre issues, except for very old CPUs, so it’s a huge hint that your kernel really is not up to date yet. Give it a few months for all other processor types to catch up and implement the correct kernel hooks to properly report this.
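If you want to script this check across many machines, the same sysfs files are easy to parse programmatically. Here is a minimal Python sketch (the helper names are my own, not from any kernel tooling); it treats “Not affected” and “Mitigation: …” as good and anything else as needing attention:

```python
import os

def read_vulnerabilities(sysdir="/sys/devices/system/cpu/vulnerabilities"):
    """Return {vulnerability: status} read from the sysfs directory.

    An empty dict means the directory is missing, i.e. the kernel is too
    old to report anything and needs upgrading.
    """
    status = {}
    if not os.path.isdir(sysdir):
        return status
    for name in sorted(os.listdir(sysdir)):
        with open(os.path.join(sysdir, name)) as f:
            status[name] = f.read().strip()
    return status

def is_mitigated(line):
    # The kernel reports "Not affected", "Mitigation: ...", or "Vulnerable..."
    return line.startswith(("Not affected", "Mitigation"))

if __name__ == "__main__":
    for name, line in read_vulnerabilities().items():
        flag = "ok" if is_mitigated(line) else "NEEDS ATTENTION"
        print("%s: %s [%s]" % (name, line, flag))
```

On a kernel without the reporting backported, the function simply returns an empty dict, which is itself the signal to go upgrade.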

And yes, I need to go find a working microcode update to fix my laptop’s CPU so that it is not vulnerable against Spectre…

Posted Fri Jan 19 11:10:09 2018 Tags:

Even these days, I still spend too much time on the command line. My friends still make fun of my MacOS desktop when they see that I run a full screen terminal, and the main program that I am running there is the Midnight Commander:

Every once in a while I write an interactive application, and I want to have full bash-like command line editing, history and search. The Unix world used to have GNU readline as a C library, but I wanted something that worked on both Unix and Windows with minimal dependencies.

Almost 10 years ago I wrote myself a C# library to do this, it works on both Unix and Windows and it was the library that has been used by Mono's interactive C# shell for the last decade or so.

This library used to be called getline.cs, and it was part of a series of single source file libraries that we distributed with Mono.

The idea of distributing libraries that were made up of a single source file did not really catch on. So we have modernized our own ways and now we publish these single-file libraries as NuGet packages that you can use.

You can now add an interactive command line shell with NuGet by installing the Mono.Terminal NuGet package into your application.

We also moved the single library from being part of the gigantic Mono repository into its own repository.

The GitHub page has more information on the key bindings available, how to use the history and how to add code-completion (even including a cute popup).

The library is built entirely on top of System.Console, and is distributed as a .NET Standard library which can run on your choice of .NET runtime (.NET Core, .NET Desktop or Mono).

Check the GitHub project page for more information.

Posted Fri Jan 12 15:10:07 2018 Tags:
Edit 2018-02-26: renamed from libevdev-python to python-libevdev. That seems to be a more generic name and easier to package.

Last year, just before the holidays Benjamin Tissoires and I worked on a 'new' project - python-libevdev. This is, unsurprisingly, a Python wrapper to libevdev. It's not exactly new since we took the git tree from 2016 when I was working on it the first time round but this time we whipped it into a better shape. Now it's at the point where I think it has the API it should have, pythonic and very easy to use but still with libevdev as the actual workhorse in the background. It's available via pip3 and should be packaged for your favourite distributions soonish.

Who is this for? Basically anyone who needs to work with the evdev protocol. While C is still a thing, there are many use-cases where Python is a much more sensible choice. The python-libevdev documentation on ReadTheDocs provides a few examples which I'll copy here, just so you get a quick overview. The first example shows how to open a device and then continuously loop through all events, searching for button events:

import libevdev

fd = open('/dev/input/event0', 'rb')
d = libevdev.Device(fd)
if not d.has(libevdev.EV_KEY.BTN_LEFT):
    print('This does not look like a mouse device')

# Loop indefinitely while pulling the currently available events off
# the file descriptor
while True:
    for e in d.events():
        if not e.matches(libevdev.EV_KEY):
            continue

        if e.matches(libevdev.EV_KEY.BTN_LEFT):
            print('Left button event')
        elif e.matches(libevdev.EV_KEY.BTN_RIGHT):
            print('Right button event')
The second example shows how to create a virtual uinput device and send events through that device:

import libevdev

d = libevdev.Device()
d.name = 'some test device'
d.enable(libevdev.EV_REL.REL_X)
d.enable(libevdev.EV_REL.REL_Y)
d.enable(libevdev.EV_KEY.BTN_LEFT)

uinput = d.create_uinput_device()
print('new uinput test device at {}'.format(uinput.devnode))
events = [libevdev.InputEvent(libevdev.EV_REL.REL_X, 1),
          libevdev.InputEvent(libevdev.EV_REL.REL_Y, 1),
          libevdev.InputEvent(libevdev.EV_SYN.SYN_REPORT, 0)]
uinput.send_events(events)
And finally, if you have a textual or binary representation of events, the evbit function helps to convert it to something useful:

>>> import libevdev
>>> print(libevdev.evbit(0))
>>> print(libevdev.evbit(2))
>>> print(libevdev.evbit(3, 4))
>>> print(libevdev.evbit('EV_ABS'))
>>> print(libevdev.evbit('EV_ABS', 'ABS_X'))
>>> print(libevdev.evbit('ABS_X'))
The latter is particularly helpful if you have a script that needs to analyse event sequences and look for protocol bugs (or hw/fw issues).
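As a rough illustration of that kind of analysis, here is a stdlib-only Python sketch (independent of python-libevdev, with only a few event types named by hand) that decodes raw `struct input_event` records, as read from an event node on a 64-bit machine, and counts events per type:

```python
import struct
from collections import Counter

# struct input_event on 64-bit: tv_sec, tv_usec, type, code, value
EVENT_FORMAT = "llHHi"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)

# A hand-picked subset of types from linux/input-event-codes.h
TYPE_NAMES = {0: "EV_SYN", 1: "EV_KEY", 2: "EV_REL", 3: "EV_ABS"}

def count_event_types(data):
    """Count decoded events per type name in a raw evdev byte stream."""
    counts = Counter()
    for off in range(0, len(data) - EVENT_SIZE + 1, EVENT_SIZE):
        _sec, _usec, etype, _code, _value = struct.unpack_from(
            EVENT_FORMAT, data, off)
        counts[TYPE_NAMES.get(etype, "EV_%d" % etype)] += 1
    return counts
```

With python-libevdev you would use evbit() instead of the hand-written TYPE_NAMES table; the sketch only shows the shape of the problem.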

More explanations and details are available in the python-libevdev documentation. That doc also answers the question why python-libevdev exists when there's already a python-evdev package. The code is up on github.

Posted Sun Jan 7 23:45:00 2018 Tags:

By now, everyone knows that something “big” just got announced regarding computer security. Heck, when the Daily Mail does a report on it, you know something is bad…

Anyway, I’m not going to go into the details about the problems being reported, other than to point you at the wonderfully written Project Zero paper on the issues involved here. They should just give out the 2018 Pwnie award right now, it’s that amazingly good.

If you do want technical details for how we are resolving those issues in the kernel, see the always awesome writeup for the details.

Also, here’s a good summary of lots of other postings that includes announcements from various vendors.

As for how this was all handled by the companies involved, well this could be described as a textbook example of how NOT to interact with the Linux kernel community properly. The people and companies involved know what happened, and I’m sure it will all come out eventually, but right now we need to focus on fixing the issues involved, and not pointing blame, no matter how much we want to.

What you can do right now

If your Linux systems are running a normal Linux distribution, go update your kernel. They should all have the updates in them already. And then keep updating them over the next few weeks, we are still working out lots of corner case bugs given that the testing involved here is complex given the huge variety of systems and workloads this affects. If your distro does not have kernel updates, then I strongly suggest changing distros right now.

However there are lots of systems out there that are not running “normal” Linux distributions for various reasons (rumor has it that it is way more than the “traditional” corporate distros). They rely on the LTS kernel updates, or the normal stable kernel updates, or they are in-house franken-kernels. For those people here’s the status of what is going on regarding all of this mess in the upstream kernels you can use.

Meltdown – x86

Right now, Linus’s kernel tree contains all of the fixes we currently know about to handle the Meltdown vulnerability for the x86 architecture. Go enable the CONFIG_PAGE_TABLE_ISOLATION kernel build option, and rebuild and reboot and all should be fine.

However, Linus’s tree is currently at 4.15-rc6 + some outstanding patches. 4.15-rc7 should be out tomorrow, with those outstanding patches to resolve some issues, but most people do not run a -rc kernel in a “normal” environment.

Because of this, the x86 kernel developers have done a wonderful job in their development of the page table isolation code, so much so that the backport to the latest stable kernel, 4.14, has been almost trivial for me to do. This means that the latest 4.14 release (4.14.12 at this moment in time), is what you should be running. 4.14.13 will be out in a few more days, with some additional fixes in it that are needed for some systems that have boot-time problems with 4.14.12 (it’s an obvious problem: if it does not boot, just add the patches now queued up).

I would personally like to thank Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, Peter Zijlstra, Josh Poimboeuf, Juergen Gross, and Linus Torvalds for all of the work they have done in getting these fixes developed and merged upstream in a form that was so easy for me to consume to allow the stable releases to work properly. Without that effort, I don’t even want to think about what would have happened.

For the older long term stable (LTS) kernels, I have leaned heavily on the wonderful work of Hugh Dickins, Dave Hansen, Jiri Kosina and Borislav Petkov to bring the same functionality to the 4.4 and 4.9 stable kernel trees. I had also had immense help from Guenter Roeck, Kees Cook, Jamie Iles, and many others in tracking down nasty bugs and missing patches. I want to also call out David Woodhouse, Eduardo Valentin, Laura Abbott, and Rik van Riel for their help with the backporting and integration as well, their help was essential in numerous tricky places.

These LTS kernels also have the CONFIG_PAGE_TABLE_ISOLATION build option that should be enabled to get complete protection.

As this backport is very different from the mainline version that is in 4.14 and 4.15, there are different bugs happening, right now we know of some VDSO issues that are getting worked on, and some odd virtual machine setups are reporting strange errors, but those are the minority at the moment, and should not stop you from upgrading at all right now. If you do run into problems with these releases, please let us know on the stable kernel mailing list.

If you rely on any other kernel tree other than 4.4, 4.9, or 4.14 right now, and you do not have a distribution supporting you, you are out of luck. The lack of patches to resolve the Meltdown problem is so minor compared to the hundreds of other known exploits and bugs that your kernel version currently contains. You need to worry about that more than anything else at this moment, and get your systems up to date first.

Also, go yell at the people who forced you to run an obsoleted and insecure kernel version, they are the ones that need to learn that doing so is a totally reckless act.

Meltdown – ARM64

Right now the ARM64 set of patches for the Meltdown issue are not merged into Linus’s tree. They are staged and ready to be merged into 4.16-rc1 once 4.15 is released in a few weeks. Because these patches are not in a released kernel from Linus yet, I can not backport them into the stable kernel releases (hey, we have rules for a reason…)

Due to them not being in a released kernel, if you rely on ARM64 for your systems (i.e. Android), I point you at the Android Common Kernel tree. All of the ARM64 fixes have been merged into the 3.18, 4.4, and 4.9 branches as of this point in time.

I would strongly recommend just tracking those branches, as more fixes get added over time due to testing and as things catch up with what gets merged into the upstream kernel releases, especially since I do not know when these patches will land in the stable and LTS kernel releases.

For the 4.4 and 4.9 LTS kernels, odds are these patches will never get merged into them, due to the large number of prerequisite patches required. All of those prerequisite patches have been long merged and tested in the android-common kernels, so I think it is a better idea to just rely on those kernel branches instead of the LTS release for ARM systems at this point in time.

Also note, I merge all of the LTS kernel updates into those branches usually within a day or so of being released, so you should be following those branches no matter what, to ensure your ARM systems are up to date and secure.


Now things get “interesting”…

Again, if you are running a distro kernel, you might be covered, as some of the distros have merged various patches into them that they claim mitigate most of the problems here. I suggest updating and testing for yourself to see if you are worried about this attack vector.

For upstream, well, the status is that there are no fixes merged into any upstream tree for these types of issues yet. There are numerous patches floating around on the different mailing lists that are proposing solutions for how to resolve them, but they are under heavy development, some of the patch series do not even build or apply to any known trees, the series conflict with each other, and it’s a general mess.

This is due to the fact that the Spectre issues were the last to be addressed by the kernel developers. All of us were working on the Meltdown issue, and we had no real information on exactly what the Spectre problem was at all, and the patches floating around were in even worse shape than what has been publicly posted.

Because of all of this, it is going to take us in the kernel community a few weeks to resolve these issues and get them merged upstream. The fixes are coming in to various subsystems all over the kernel, and will be collected and released in the stable kernel updates as they are merged, so again, you are best off just staying up to date with either your distribution’s kernel releases, or the LTS and stable kernel releases.

It’s not the best news, I know, but it’s reality. If it’s any consolation, it does not seem that any other operating system has full solutions for these issues either, the whole industry is in the same boat right now, and we just need to wait and let the developers solve the problem as quickly as they can.

The proposed solutions are not trivial, but some of them are amazingly good. The Retpoline post from Paul Turner is an example of some of the new concepts being created to help resolve these issues. This is going to be an area of lots of research over the next years to come up with ways to mitigate the potential problems involved in hardware that wants to try to predict the future before it happens.

Other arches

Right now, I have not seen patches for any other architectures than x86 and arm64. There are rumors of patches floating around in some of the enterprise distributions for some of the other processor types, and hopefully they will surface in the weeks to come to get merged properly upstream. I have no idea when that will happen. If you are dependent on a specific architecture, I suggest asking on the arch-specific mailing list about this to get a straight answer.


Again, update your kernels, don’t delay, and don’t stop. The updates to resolve these problems will continue to come for a long period of time. Also, there are still lots of other bugs and security issues being resolved in the stable and LTS kernel releases that are totally independent of these types of issues, so keeping up to date is always a good idea.

Right now, there are a lot of very overworked, grumpy, sleepless, and just generally pissed off kernel developers working as hard as they can to resolve these issues that they themselves did not cause at all. Please be considerate of their situation right now. They need all the love and support and free supply of their favorite beverage that we can provide them to ensure that we all end up with fixed systems as soon as possible.

Posted Sat Jan 6 15:20:26 2018 Tags:
Cover of the book.
No worries, this is just about my Groovy Cheminformatics book. Seven years ago I started a project that was very educational to me: self-publishing a book. With some help I managed to get a book out that sold over 100 copies and that was regularly updated. But therein lies the problem: supply creates demand. So, I had a system that supplied me with an automated setup that reran scripts, recreated text output and even figures for the book (2D chemical diagrams). I wanted to make an edition for every CDK release. All in all, I got quite far with that: eleven editions.

But the current research setting, or at least in academia, does not provide me with the means to keep this going. Sad thing is, the hardest part is actually updating the graphics for the cover, which needs to resize each time the book gets thicker. But John Mayfield introduced so many API changes, I just did not have the time to update the book. I tried, and I have a twelfth edition on my desk. But where my automated setup scales quite nicely, I don't.

It may be worth reiterating why I started the book. We have had several places where information was given, and questions were answered: the mailing list, wiki pages, JavaDoc, the Chemistry Toolkit Rosetta Wiki, and more. Nothing in the book was not already answered somewhere else. The book was just a boon for me to answer those questions and provide an easy way for people to get many answers.

Now, because I could not keep up with the recent API changes, I am no longer feeling comfortable with releasing the book. As such, I have "retired" the book.

I am now working out how to move on from here. An earlier edition is already online under a Creative Commons license, and it's tempting to release the latest version like this too. That said, I have also been talking with the other CDK project leaders about alternatives. More on this soon, I guess.

Here's an overview of posts about the book:
Posted Sat Jan 6 00:14:00 2018 Tags:

Earlier this year, in February, I wrote a blog post encouraging people to donate to where I work, Software Freedom Conservancy. I've not otherwise blogged too much this year. It's been a rough year for many reasons, and while I personally and Conservancy in general have accomplished some very important work this year, I'm reminded as always that more resources do make things easier.

I understand the urge, given how bad the larger political crises have gotten, to want to give to charities other than those related to software freedom. There are important causes out there that have become more urgent this year. Here's three issues which have become shockingly more acute this year:

  • making sure the USA keeps its commitment to immigrants to allow them to make a new life here just like my own ancestors did,
  • assuring that the great national nature reserves are maintained and left pristine for generations to come,
  • assuring that we have zero tolerance for abusive behavior — particularly by those in power against people who come to them for help and job opportunities.
These are just three of the many issues this year that I've seen get worse, not better. I am glad that I know and support people who work on these issues, and I urge everyone to work on these issues, too.

Nevertheless, as I plan my primary donations this year, I'm again, as I always do, giving to the FSF and my own employer, Software Freedom Conservancy. The reason is simple: software freedom is still an essential cause and it is frankly one that most people don't understand (yet). I wrote almost two years ago about the phenomenon I dubbed Kuhn's Paradox. Simply put: it keeps getting more and more difficult to avoid proprietary software in a normal day's tasks, even while the number of lines of code licensed freely gets larger every day.

As long as that paradox remains true, I see software freedom as urgent. I know that we're losing ground on so many other causes, too. But those of you who read my blog are some of the few people in the world that understand that software freedom is under threat and needs the urgent work that the very few software-freedom-related organizations, like the FSF and Software Freedom Conservancy, are doing. I hope you'll donate now to both of them. For my part, I gave $120 myself to FSF as part of the monthly Associate Membership program, and in a few minutes, I'm going to give $400 to Conservancy. I'll be frank: if you work in technology in an industrialized country, I'm quite sure you can afford that level of money, and I suspect those amounts are less than most of you spent on technology equipment and/or network connectivity charges this year. Make a difference for us and give to the cause of software freedom at least as much as you're giving to large technology companies.

Finally, a good reason to give to smaller charities like FSF and Conservancy is that your donation makes a bigger difference. I do think bigger organizations, such as (to pick an example of an organization I used to give to) my local NPR station, do important work. However, I was listening this week to my local NPR station, and they said their goal for that day was to raise $50,000. For Conservancy, that's closer to a goal we have for our entire fundraising season, which for this year was $75,000. The thing is: NPR is an important part of USA society, but it's one that nearly everyone understands. So few people understand the threats looming from proprietary software, and they may not understand at all until it's too late — when all their devices are locked down, DRM is fully ubiquitous, and no one is allowed to tinker with the software on their devices and learn the wonderful art of computer programming. We are at real risk of reaching that dystopia before 90% of the world's population understands the threat!

Thus, giving to organizations in the area of software freedom is just going to have a bigger and more immediate impact than more general causes that more easily connect with people. You're giving to prevent a future that not everyone understands yet, and making an impact on our work to help explain the dangers to the larger population.

Posted Sun Dec 31 11:50:00 2017 Tags:
Ten alkanes in Wikidata. The ones without CAS registry
number previously did not have InChIKey or
PubChem CID. But no more; I added those.
While working on the 'chemical class' aspect for Scholia yesterday I noted that the page for alkanes was quite large: it lists more than 50 long-chain alkanes that have pages in the Japanese Wikipedia but no SMILES, InChI, InChIKey, etc.

So, I dug up my Bioclipse scripts to add chemicals to Wikidata starting with a SMILES (btw, the script has significantly evolved since) and extended the query of that Scholia aspect to list just the Wikidata Q-code and name. This script starts with one or more SMILES strings and generates QuickStatements (a must-learn).

Because the Wikidata entries also had the English IUPAC name, I can use that to autogenerate SMILES. Enter the OPSIN (doi:10.1021/ci100384d) plugin for Bioclipse which in combination with the CDK allowed me to create the matching SMILES, InChI, InChIKey, and use the latter to look up the PubChem compound identifier (CID). This is the script I ended up with:

inputFile = "/Wikidata/Alkanes/alkanes.tsv"
new File(bioclipse.fullPath(inputFile)).eachLine { line ->
  fields = line.split("\t")
  // the Wikidata entity URI prefix was eaten by HTML escaping in the post;
  // restored here from the standard entity URI scheme
  if (fields[0].startsWith("http://www.wikidata.org/entity/")) {
    wdid = fields[0].substring("http://www.wikidata.org/entity/".length())
    name = fields[1]
    if (fields.length > 2) { // skip entities that already have an InChIKey
      inchikey = fields[2]
      // println "Skipping: $wdid $inchikey"
    } else { // ok, consider adding it
      // println "Considering $wdid $name"
      try {
        mol = opsin.parseIUPACName(name)
        smiles = cdk.calculateSMILES(mol)
        // println "  SMILES: $smiles"
        println "${smiles}\t${wdid}"
      } catch (Exception error) {
        // println "Could not parse $name with OPSIN: ${error.message}"
      }
    }
  }
}
That way, I ended up with changes like this:

Posted Sat Dec 30 09:59:00 2017 Tags:

I did a talk at work about the various project management troubles I've seen through the years. People seemed to enjoy it a lot more than I expected, and it went a bit viral among co-workers. It turns out that most of it was not particularly specific to just our company, so I've taken the slides, annotated them heavily, and removed the proprietary bits. Maybe it'll help you too.

Sorry it got... incredibly long. I guess I had a lot to say!

[Obligatory side note: everything I post to this site is my personal opinion, not necessarily the opinion of my or probably any other employer.]

Scaling in and out of control

Way back in 2016, we were having some trouble getting our team's releases out the door. We also had out-of-control bug/task queues and no idea when we'd be done. It was kind of a mess, so I did what I do best, which is plotting it in a spreadsheet. (If what I do best was actually solving the problem, I assume they would be paying me the Bigger Bucks. Let's just say they got what they paid for and leave it at that.)

Before 2016 though, we had been doing really quite well. As you can see, our average time between releases was about two months, probably due primarily to my manager's iron fist. You might notice that we had some periodic slower releases, but when I looked into them, it turned out that they crossed over the December holiday seasons when nobody did any work. So they were a little longer in wall clock time, but maybe not CPU time. No big deal. Pretty low standard deviation. Deming would be proud. Or as my old manager would say, "Cadence trumps mechanism."

What went wrong in 2016? Well, I decided to find out. And now we're finally getting to the point of this presentation.

But before we introspect, let's extrospect!

Here's a trend from a randomly selected company. I wanted to see how other companies scale up, so I put together a little analysis of Tesla's production (from entirely public sources, by the way; beware that I might have done it completely wrong). The especially interesting part of this graph (other than how long they went while producing only a tiny number of Roadsters and nothing else) is when they released the Model X. Before that, they roughly scaled employees at almost the same rate as they scaled quarterly vehicle production. With the introduction of the Model X, units shipped suddenly went up rapidly compared to number of employees, and sustained at that level.

I don't know for sure why that's the case, but it says something about improved automation, which they claim to have further improved for the Model 3 (not shown because I made this graph earlier in the year, and anyway, they probably haven't gotten the Model 3 assembly straightened out yet).

Tesla is a neat example of working incrementally. When they had a small number of people, they built a car that was economical to produce in small volumes. Then they scaled up slowly, progressively fixing their scaling problems as they went toward slightly-less-luxury car models. They didn't try to build a Model 3 on day 1, because they would have failed. They didn't scale any faster than they legitimately could scale.

So naturally I showed the Tesla plot to my new manager, who looked at it and asked the above question.

Many people groan when they hear this question, but I think we should give credit where credit is due. Many tech companies are based around the idea that if you hire the smartest people, and remove all the obstacles you can, and get rid of red tape, and just let them do what they feel they need to do, then you'll get better results than using normal methods with normal people. The idea has obviously paid off hugely at least for some companies. So the idea of further boosting performance isn't completely crazy. And I did this talk for an Engineering Productivity team, whose job is, I assume, literally to make the engineers produce more.

But the reason people groan is that they take this suggestion in the most basic way: maybe we can just have the engineers work weekends or something? Well, I think most of us know that's a losing battle. First of all, even if we could get engineers to work, say, on Saturdays without any other losses (eg. burnout), that would only be a 20% improvement. 20%? Give me a break. If I said I was going to do a talk about how to improve your team's efficiency by 20%, you wouldn't even come. You can produce 20% more by hiring one more person for your 5-person team, and nobody even has to work overtime. Forget it.

The above is more or less what I answered when I was asked this question. But on the other hand, Tesla is charging thousands of dollars for their "optional" self-driving feature (almost everybody opts for it), which they've been working on for less time with fewer people than many of their competitors. They must be doing something right, efficiency wise, right?

Yes. But the good news is I think we can fix it without working overtime. The bad news is we can't just keep doing what we've been doing.

Cursed by goals

Here's one thing we can't keep doing. Regardless of what you think of Psychology Today (which is where I get almost all my dubiously sourced pop psychology claims; well, that and Wikipedia), there's ample indication from Real Researchers, such as my hero W. Edwards Deming, that "setting goals" doesn't actually work and can make things worse.

Let's clarify that a bit. It's good to have some kind of target, some idea of which direction you're going. What doesn't work is deciding when you're going to get there. Or telling salespeople they need to sell 10% more next quarter. Or telling school teachers they need to improve their standardized test scores. Or telling engineers they need to launch their project at Whatever Conference (and not a day sooner or a day later).

What all these bad goals have in common is that they're arbitrary. Some supposedly-in-charge executive or product manager just throws out a number and everyone is supposed to scramble to make it come true, and we're supposed to call that leadership. But anybody can throw out a number. Real leaders have some kind of complicated thought process that leads to a number. And the thought process, once in place, actually gives them the ability to really lead, and really improve productivity, and predict when things really will be done (and make the necessary changes to help them get done sooner). Once they have all that, they don't really need to tell you an arbitrary deadline, because they already know when you'll finish, even if you still don't.

When it works, it all feels like magic. Do you suspect that your competitors are doing magic when they outdo you with fewer people and less money? It's not magic.

I just talked about why goals don't really help. This is a quote about how they can actually hurt, because of (among other things) a famous phenomenon called the "Student Syndrome." (By the way, the book this is taken from, Critical Chain, is a great book on project management.)

We all know how it works. The professor says you have an assignment due in a week. Everyone complains and says that's impossible, so the prof gives them an extension of an extra week. Then everyone starts work the night before anyway, so all that extra time is wasted.

That's what happens when you give engineers deadlines, like launching at Whatever Conference. They're sure they can get done in that amount of time, so they take it easy for the first half. Then they get progressively more panicked as the deadline approaches, and they're late anyway, and/or you ship on time and you get what Whatever Conference always gets, which is shoddy product launches. This is true regardless of how much extra time you give people. The problem isn't how you chose the deadline, it's that you had a deadline at all.

Stop and think about that a bit. This is really profound stuff. (By the way, Deming wrote things like this about car manufacturers in the 1950s, and it was profound then too.) We're not just doing deadlines wrong, we're doing it meta-wrong. The whole system is the wrong system. And there are people who have known that for 50+ years, and we've ignored them.

I wanted to throw this in too. It's not as negative as it sounds at first. SREs (Site Reliability Engineers) love SLOs (Service Level Objectives). We're supposed to love SLOs, and hate SLAs (Service Level Agreements). (Yes, they're different.) Why is that?

Well, SREs are onto something. I think someone over there read some of Deming's work. An SLA requires you to absolutely commit to hitting a particular specification for how good something should be and when, and you get punished if you miss it. That's a goal, a deadline. It doesn't work.

An SLO is basically just a documented measurement of how well the system typically works. There are no penalties for missing it, and everyone knows there are no penalties, so they can take it a bit easier. That's essential. The job of SRE is, essentially, to reduce the standard deviation and improve the average of whatever service metric they're measuring. When they do, they can update the SLO accordingly.

And, to the point of my quote above, when things suddenly get worse - when the standard deviation increases or the average goes in the wrong direction - the SLO is there so that they can monitor it and know something has gone wrong.

(An SLI - Service Level Indicator - is the thing you measure. But it's not useful by itself. You have to monitor for changes in the average and standard deviation, and that's essentially the SLO.)
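The "SLO as a monitored mean and standard deviation of an SLI" idea can be sketched in a few lines. This is a toy model, not any real SRE tooling: the class name, the window size, and the 1.2x/1.5x drift thresholds are all invented for illustration:

```python
from collections import deque
from statistics import mean, stdev

# Toy sketch: an SLO documents the typical mean/stddev of an SLI, and we
# alert when the recent window looks worse than that documented baseline.
class SLOMonitor:
    def __init__(self, slo_mean, slo_stdev, window=30):
        self.slo_mean = slo_mean      # documented typical value, e.g. latency ms
        self.slo_stdev = slo_stdev    # documented typical variability
        self.samples = deque(maxlen=window)

    def record(self, sli_value):
        self.samples.append(sli_value)

    def drifted(self):
        """True when the recent window is worse than the documented SLO."""
        if len(self.samples) < 2:
            return False
        return (mean(self.samples) > self.slo_mean * 1.2 or
                stdev(self.samples) > self.slo_stdev * 1.5)

m = SLOMonitor(slo_mean=100, slo_stdev=10)
for v in [98, 102, 105, 99, 101]:
    m.record(v)
print(m.drifted())  # steady traffic near the documented baseline -> False
```

When `drifted()` fires, something changed; that, rather than punishing a missed number, is what the SLO is for.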

I think it's pretty clear that SRE is a thing that works. SLOs give us some clues about why:

  1. SLOs are loose predictions, not strict deadlines.

  2. SLOs are based on measured history, not arbitrary targets thrown out by some executive.

We need SLOs for software development processes, essentially.

Schedule prediction is a psychological game

That was a bunch of philosophy. Let's switch to psychology. My favourite software psychologist is Joel Spolsky, of the famous Joel on Software blog. Way back in the early 2000s, he wrote a very important post called "Painless Software Schedules," which is the one I've linked to above. If you follow the link, you'll see a note at the top that he no longer thinks you should take his advice and you should use his newfangled thing instead. I'm sure his newfangled thing is very nice, but it's not as simple and elegant as his old thing. Read his old thing first, and worry about the new thing later, if you have time.

One of my favourite quotes from that article is linked above. More than a decade ago, at my first company, I was leading our eng team and I infamously (for us) provided the second above quote in response.

What Joel meant by "psychological games" was managers trying to negotiate with engineers about how long a project would take. These conversations always go the same way. The manager has all the power (the power to fire you or cancel your bonus, for example), and the engineer has none. The manager is a professional negotiator (that's basically what management is) and the engineer is not. The manager also knows what the manager wants (which is the software to be done sooner). The engineer has made an estimate off the top of their head, and is probably in the right ballpark, but not at all sure they have the right number. So the manager says, "Scotty, we need more power!" and the engineer says, "She canna take much more o' this, captain!" but somehow pulls it off anyway. Ha ha, just kidding! That was a TV show. In real life, Scotty agrees to try, but then can't pull it off after all and the ship blows up. That manager's negotiation skills have very much not paid off here. But they sure psyched out the engineer!

In my silly quote, the psychological games I'm talking about are the ones you want to replace that with. Motivation is psychological. Accurate project estimation is psychological. Feature prioritization is psychological. We will all fail at these things unless we know how our minds work and how to make them do what we want. So yes, we definitely need to play some psychological games. Which brings us to Agile.

I know you probably think Agile is mostly cheesy. I do too. But the problem is, the techniques really work. They are almost all psychological games. But the games aren't there to make you work harder, they're there, to use an unfortunate cliche, to make you work smarter. What are we doing that needs smartening? If we can answer that, do we need all the cheesy parts? I say, no. Engineers are pretty smart. I think we can tell engineers how the psychological games work, and they can find a way to take the good parts.

But let's go through all the parts just to be clear.

  • Physical index cards. There are reasons these are introduced into Agile: they help people feel like a feature is a tangible thing. They are especially good for communicating project processes with technophobes. (You can get your technophobic customers to write down things they want on index cards more easily than you can get them to use a bug tracking system.) Nowadays, most tech companies don't have too many technophobic employees. They also often have many employees in remote offices, and physical cards are a pain to share between offices. The fedex bills would be crazy. Some people try to use a tool to turn the physical cards into pictures of virtual physical cards, which I guess is okay for what it is, but I don't think it's necessary. We just need a text string that represents the feature.

  • Stories & story points. These turn out to be way more useful than you think. They are an incredibly innovative and effective psychological game, even if you don't technically write them as "stories." More on that in a bit.

  • Pair programming. Sometimes useful, especially when your programmers are kinda unreliable. They're like a real-time code review, and thus reduce latency. But mostly not actually something most people do, and that's fine.

  • Daily standup meetings. Those are just overhead. Agile, surprisingly enough, is not good because it makes you do stupid management fluff every day. It does, but that's not why it's good. We can actually leave out 95% of the stupid management fluff. The amazing thing about Agile is actually that you get such huge gains despite the extra overhead gunk it adds.

  • Rugby analogies (ie. SCRUM). Not needed. I don't know who decided sports analogies were a reliable way to explain things to computer geeks.

  • Strict prioritization. This is a huge one that we'll get to next - and so is flexible prioritization. Since everyone always knows what your priorities are (and in Agile, you physically post index cards on the wall to make sure they all know, but there are other ways), people are more likely to work on tasks in the "right" order, which gets features finished before you have a chance to change your mind. Which means when you do change your mind, it'll be less expensive. That's one of the main effects of Agile. Basically, if you can manage to get everyone to prioritize effectively, you don't need Agile at all. It just turns out to be really hard.

  • Tedious progress tracking: also not needed. See daily standups, above. Agile accrues a lot of cruft because people take a course in it and there's inevitably a TPM (Technical Programme Manager, aka Project Manager) or something who wants to be useful by following all the steps and keeping progress spreadsheets. Those aren't the good parts. If you do this right, you don't even need a TPM, because the progress reports write themselves. Sorry, TPMs.

  • Burndown charts. Speaking of progress reporting, I don't think I'd heard of burndown charts before Agile. They're a simple idea: show the number of bugs (or stories, or whatever) still needing to be done before you hit your milestone. As stories/bugs/tasks/whatever get done, the line goes down, a process which we call "burning" because it makes us feel like awesome Viking warlords instead of computer geeks. A simple concept, right? But when you do it, the results are surprising, which we'll get to in a moment. Burndown charts are our fundamental unit of progress measurement, and the great thing is they draw themselves so you don't need the tedious spreadsheet.

  • A series of sprints, adding up to a marathon. This is phrased maybe too cynically (just try to find a marathon runner that treats a marathon as a series of sprints), but that's actually what Agile proposes. Sprints are very anti-Deming. They encourage Student Syndrome: putting off stuff until each sprint is almost over, then rushing at the end, then missing the deadline and working overtime, then you're burnt out so you rest for the start of the next sprint, and repeat. Sprints are also like salespeople with quarterly goals that they have to hit for their bonus. Oh, did I just say goals? Right, sprints are goals, and goals don't work. Let's instead write software like marathon runners run marathons: at a nice consistent pace that we can continue for a very long time. Or as Deming would say, minimize the standard deviation. Sprints maximize the standard deviation.

Phew! Okay, that was wordy. Let's draw a picture.

I made a SWE simulator that sends a bunch of autonomous SWE (software engineer) drones to work on a batch of simulated bugs (or subtasks). What we see are five simulated burndown charts. In each of the five cases, the same group of SWEs works on the exact same bug/task database, which starts with a particular set of bug/tasks and then adds new ones slowly over time. Each task has a particular random chance of being part of a given feature. On average, a task is included in ~3 different features. Then the job of the product managers (PMs) is to decide what features are targeted for our first milestone. For the simulation, we assume the first milestone will always target exactly two features (and we always start with the same two).

The release will be ready when the number of open tasks for the two selected features hits zero.

Interestingly, look at how straight those downward-sloping lines are. Those are really easy to extrapolate. You don't even need a real simulator; the simulator is just a way of proving that the simulator is not needed. If the simulator were an AI, it would be having an existential crisis right now. Anyway, those easy-to-extrapolate lines are going to turn out to be kind of important later.
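Extrapolating a straight burndown line is just a least-squares fit: the predicted release date is where the fitted line crosses zero open tasks. A minimal sketch (the function name and the sample history are mine, not from the simulator):

```python
# Sketch: fit a line to (day, open_task_count) burndown history and
# predict the day the line reaches zero tasks.
def predict_zero_day(days, open_tasks):
    """Least-squares line fit; return the day the fit hits zero, or None."""
    n = len(days)
    mx = sum(days) / n
    my = sum(open_tasks) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(days, open_tasks))
             / sum((x - mx) ** 2 for x in days))
    intercept = my - slope * mx
    if slope >= 0:
        return None  # not burning down; no sensible prediction
    return -intercept / slope

# Made-up history: ~2 tasks burned per day starting from 100 open.
history = [(0, 100), (10, 80), (20, 60), (30, 40), (40, 20)]
days, tasks = zip(*history)
print(round(predict_zero_day(days, tasks)))  # -> 50
```

The straighter the line (the lower the standard deviation of the team's pace), the more trustworthy the crossing point, which is exactly why the charts above are so easy to read.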

For now, note the effect on release date of the various patterns of PMs changing their minds. Each jump upward on the chart is caused by a PM dropping one feature from the release, and swapping in a new one instead. It's always drop one, add one. The ideal case is the top one: make up your mind at the start and let people finish. That finishes the soonest. The second one is unfortunate but not too bad: they change their mind once near the start. The third chart shows how dramatically more expensive it is to change your mind late in the game: we drop the same feature and add the same other feature, but now the cost is much higher. In fact, the cost is much higher than changing their minds multiple times near the beginning, shown in the fourth chart, where we make little progress at all during the bickering, but then can go pretty fast. And of course, all too common, we have the fifth chart, where the PMs change their minds every couple of months (again, dropping and adding one feature each time). When that happens, we literally never finish. Or worse, we get frustrated and lower the quality, or we just ship whatever we have when Whatever Conference comes.

The purported quote near the top, "When the facts change, I change my mind," I included because of its implied inverse: "When the facts don't change, I don't change my mind." One of the great product management diseases is that we change our minds when the facts don't. There are so many opinions, and so much debate, and everyone feels that they can question everyone else (note: which is usually healthy), that we sometimes don't stick to decisions even when there is no new information. And this is absolutely paralyzing.

If you haven't even launched - in the picture above, you haven't even hit a new release milestone - then how much could conditions possibly have changed? Sometimes a competitor will release something unexpected, or the political environment will change, or something. But usually that's not the case. Usually, as they don't say in warfare, your strategy does survive until first contact with the enemy (er, the customer).

It's when customers get to try it that everything falls out from under you.

But when customers get to try it, we've already shipped! And if we've been careful only to work on shipping features until then, when we inevitably must change our minds based on customer feedback, we won't have wasted much time building the wrong things.

If you want to know what Tesla does right and most of us do wrong, it's this: they ship something small, as fast as they can. Then they listen. Then they make a decision. Then they stick to it. And repeat.

They don't make decisions any better than we do. That's key. It's not the quality of the decisions that matters. Well, I mean, all else being equal, higher quality decisions are better. But even if your decisions aren't optimal, sticking to them, unless you're completely wrong, usually works better than changing them.

If you only take one thing away from reading this talk, it should be that. Make decisions and stick to them.

Now, about decision quality. It's good to make high quality decisions. Sometimes that means you need to get a lot of people involved, and they will argue, and that will take some time. Okay, do that. But do it like chart #4, where the PMs bicker for literally 50 days straight and we run in circles, but the launch delay was maybe 30 days, less than the 50 you spent. Get all the data up front, involve all the stakeholders, make the best decision you can, and stick to it. That's a lot better than chart #3 (just a single change in direction!) or #5 (blowing in the wind).

Sorry for the rant in that last slide. Moving on, we have more ideas to steal, this time from Kanban.

For people who don't know, Kanban was another thing computer people copied from Japanese car manufacturers of the 1950s. They, too, used actual paper index cards, although they used them in a really interesting way that you can read about in wikipedia if you care. You don't have to, though, because Kanban for software engineers doesn't use the index cards the same way, and as with Agile, I'm going to tell you to skip the paper cards anyway and do something digital, for the sake of your remote officemates. But Kanban for software uses the paper cards differently than Agile for software, and that's relevant, which we'll address in the next slide.

Kanban (for software) also uses stories and story points, just like Agile. Nothing much new there. (The car manufacturers in the 1950s didn't use stories. That's a new thing.)

Kanban, like Agile, gives the huge benefit of strict prioritization. It does it differently from Agile though. Where Agile tries to convince you to do things in a particular order, Kanban emphasizes doing fewer things at a time. Kanban is the basis of Just-in-Time (JIT) factory inventory management. Their insight is that you don't bother producing something until right before you need it, because inventory is expensive: not just for storage (which in our software analogy, is pretty cheap) but in wasted time when you inevitably turn out to have produced the wrong thing. And worse, in wasted time when you produced the right thing, but it didn't get used for another 6 months, and it broke in the meantime so you had to fix it over and over.

Going back to Critical Chain, that book will tell you the big secret of Kanban for software: unreleased software is inventory. It's very expensive, it slowly rusts in the warehouse, and worst of all, it means you produced work in the wrong order. You shouldn't have been doing something that wouldn't be needed for six months. You should have either been a) helping with the thing that's needed right now, or b) lounging around in the microkitchen reading books and eating snacks. Your buildup of low-priority inventory is slowing down the people working on the high-priority things, and that's unacceptable.

The classic example of this in Kanban for software is that we break software down into phases, say, requirements, UX design, software design, implementation, testing, etc. Let's say the TL (Tech Lead) is responsible for designing stuff, and the rest of the engineers are responsible for implementing it (the classic story of a TL who used to code but now mostly has meetings). The engineering team has a lot of work to do, so after the TL designs the current thing, they go off and start designing the next thing, and the next one, and eventually you have a buildup of design docs that haven't been implemented yet. Kanban says, this is useless; the design docs are inventory. Of course those designs will be at least partly obsolete by the time the eng team has time to implement them. Maybe goals will have changed by then and you won't implement that design at all. Meanwhile, the TL has sent around the doc for review, and there are meetings to talk about it, all of which distracts from actually finishing the work on the current feature which needs to get done.

Kanban, and JIT manufacturing, says the TL should go take a long lunch and recharge for when they're needed. Or better still, help out on the engineering phase, to push it along so that it will finish sooner. And that is, of course, the magic of Kanban. It plays some psychological games, using index cards, to convince TLs to write code and (further down the chain) engineers to help with testing, and reduce useless meetings, and stop maintaining code you don't yet need. It forces you to multitask less, which is a way of tricking you into prioritizing better.

We're engineers, so we're pretty smart. If we want, we could just, you know, dispense with the psychological games, and decide we're going to strictly prioritize and strictly limit multitasking. It takes some willpower, but it can be done. I happen to be terrible at it. I am the worst about multitasking and producing inventory. Way back at the beginning of this talk, when I showed a slide with our release goals slipping in 2016, yeah, that was us multitasking too much, which was partly my fault, and facilitated by having too much headcount.

What that means is, I know this is a lot harder than it sounds. It's actually really, really hard. It's one of the hardest things in all of engineering. Most people are very bad at it, and need constant reminders to stop multitasking and to work on the most important things. That's why these psychological games, like sprints (artificial deadlines) and index cards and kanban boards were invented. But if we want to become the best engineers we can be, we have to move beyond tricking ourselves and instead understand the underlying factors that make our processes work or not work.

That chart of the slipping release times was what reminded me to follow the rules I already knew: Prioritize. Limit multitasking.

Here's a sample kanban-for-software board, from wikipedia. I just wanted to point out visually how the psychological game works. Basically, there is only a limited amount of space on the board. To advance a story from one phase (horizontally) to the next, there needs to be an empty space in the destination phase. If there isn't, then you must first move a job from that phase (by finishing it) to the phase after that. And if there are no holes there, then something needs to get pushed forward (by finishing it) even further down the line, and so on.

If your job is to work on a phase and there are no holes left for you to fill, then you literally have to stop work, or (if you're flexible), go help with one of the downstream phases. And we enforce that by just not having enough physical space to paste your index cards. It's kinda neat, as a mental trick, because the most primordial parts of your brain understand the idea that physical things take physical space and there's no physical space left to put them in.
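The "no physical space left" rule is easy to model: a board with per-column WIP limits, where a card may advance only if the next column has a free slot. This is a toy sketch with invented column names and limits, not any real kanban tool:

```python
# Toy kanban board: advancing a card requires a free slot in the next
# column, so a full downstream column blocks everything upstream of it.
class Board:
    def __init__(self, limits):
        self.limits = dict(limits)            # column -> WIP limit
        self.order = list(limits)             # left-to-right phase order
        self.columns = {name: [] for name in limits}

    def add(self, col, card):
        if len(self.columns[col]) >= self.limits[col]:
            return False  # no space left to paste the index card
        self.columns[col].append(card)
        return True

    def advance(self, card):
        for i, col in enumerate(self.order[:-1]):
            if card in self.columns[col]:
                nxt = self.order[i + 1]
                if len(self.columns[nxt]) >= self.limits[nxt]:
                    return False  # downstream is full: finish something there first
                self.columns[col].remove(card)
                self.columns[nxt].append(card)
                return True
        return False

b = Board({"design": 2, "implement": 2, "test": 1})
b.add("design", "story-A"); b.add("implement", "story-B")
b.add("implement", "story-C"); b.add("test", "story-D")
print(b.advance("story-B"))  # -> False (test column is full)
```

Until `story-D` finishes testing and leaves the board, nothing can move into test, and since implement is also full, design is blocked too: the blockage propagates upstream, which is the whole trick.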

[Random side note: Kanban-for-factories actually does things the other way around. They send an index card upstream to indicate that they need more stuff in order to continue work. This avoids the need for an actual "kanban board" that everyone all over the factory, plus your suppliers' factories, would need to all congregate at. Instead, you have these index cards flying all around, indicating when work needs to be done. (To limit multitasking, there's a carefully limited number of cards of each type.) On the software kanban board, you can think of the "holes" moving upstream and doing the job that the "cards" do when moving upstream in a factory. That, in turn, reminds me of how doped semiconductors carry signals around: you can have excess electrons moving around (n-type) or "holes" moving around (p-type). A hole is a missing electron, and it can carry information just as well as an electron can, although slightly slower, for reasons I don't remember. Many people are surprised to hear that p-type semiconductors don't involve moving protons around, just holes.]

Time for another simulation, this time about multitasking. Our simulated SWE team is back, working just as hard as ever (and no harder), but this time we're going to extend the simulation past the first milestone. Furthermore, let's assume (perhaps too optimistically) that we have convinced our PMs to follow the advice in the previous slides, and they no longer change their minds in between milestones. The deal is this: once they set milestone goals, those are locked in, and the engineers will keep working until they're done. (This is like a sprint in Agile, except there is no strict deadline. We work until it's done, and different milestones might have different lengths.) However, PMs don't have to tell us what the next milestone contains until this one is done. They can change their minds day after day, and argue all they like, but only until the milestone starts. Then they have to start arguing about the next one. Got it? Great.

Another concept we need to cover is the idea of "aggregate value delivered." Remember that our simulation involves several features, and each feature contains a bunch of bugs/tasks, and each task can be in one or more features, and new tasks are being filed all the time. Because it's a simulation, we don't know how good these "features" really are, so let's assume they are all worth one unit of awesomeness. When we ship a feature, users don't get all the value from that feature right away; it accrues over the time they use the product. So the total value delivered to customers is a function of the number of awesomeness units that have launched, and how long ago they launched. In the charts, that's the green area under the curve. Of course, this is all very approximate, because it ignores features losing or gaining customers over time, the fact that some features have more value (and are harder to implement) than others, and so on. But the fundamental truths are the same.

Note also that new bugs/tasks are coming in all the time, across all features, whether launched or not. Once you've launched a feature, you still have to keep maintaining it forever. And no, not launching doesn't avoid that problem; the market keeps moving, and if you launch your feature a year later, the customers will have quietly become more demanding during that time. That's why in the simulation, new bugs keep getting filed in non-launched features, even if you're not working on those features yet. In the real world, those bugs wouldn't exist in any bug tracking system, but you'd sure find out about them when you go try to launch. So the simulation is assuming perfect knowledge of the future, which is not very realistic, but it's okay because you're not really acting on it. (Features you're not working on are accruing virtual bugs, but... you're not working on them, so they don't affect the simulation results until later, when you do work on them, in which case you would know about them in real life.)

So anyway, there are four simulations here. The only difference between them is simply a variation in how many features we try to squeeze into each milestone. In other words, a measure of the amount of multitasking. In the first graph, we work on, and then launch, each feature one at a time. In the second, we do two at a time. Then five at a time, and eight at a time, respectively. The simulator then helpfully sums up the total green area in each case.

Notice that in, say, the top two charts, it takes about the same time to launch the first two features and then the next two features. (The slight variations are partly due to simulator jiggliness and partly due to some second-order effects we don't need to go into, but they're small enough to ignore.) In other words, the fourth feature ships at the same time in both cases. But because you ship the first and third and fifth features earlier in the first chart, the total accrued value is higher. That's kind of intuitive, right? If you ship a simple thing sooner, some users can start enjoying it right away. They won't enjoy it as much as the bigger, better thing you ship later, and some users won't enjoy it at all, but a few will still be accruing enjoyment, which hopefully translates into revenue for you, which is great.

Notice how each subsequent release takes longer. That's because you have to keep fixing bugs in the already-launched code even while you add the new stuff. That's pretty realistic, right? And once you've launched a bunch of things, you slow down so much that you're pretty much just maintaining the code forever. You can see that in the first chart after feature #5 ships: the burndown chart is basically just flat. At that point, you either need to fundamentally change something - can you make your code somehow require less maintenance? - or get a bigger team. That goes back to our headcount scalability slides from the beginning. Is your code paying for itself yet? That is, is more value being accrued than the cost of maintenance? If so, you can afford to invest more SWEs. If not, you have to cancel the project or figure out how to do it cheaper. Scaling up a money loser is the wrong choice.

The third chart is interesting. We launch five features before we start accruing any value at all. Remember, that's the same five features as in the first two charts, and we have perfect foreknowledge of all bugs/tasks that will be filed, so it's the exact same work. But look how much less aggregate value is delivered! About 4x less than chart #2, or about 5x less than chart #1. And the only thing we changed is how much we're multitasking.

The mathematicians among you might be thinking, okay, Avery, you're cheating here by cutting off the graph at 1200 days. What if we doubled it to 2400 days? Then the majority of the chart would have 5 features launched (recall that the team isn't able to scale beyond 5 features anyhow), and the relative difference between scores would be much less. So it's artificial, right? As time goes to infinity, there is no difference.

Mathematically, that's true. But it's like big-O notation in evaluating algorithms. When n goes to infinity, every O(n) algorithm is indistinguishable from any other. But when n is a small finite number, different O(n) algorithms are completely different, and sometimes an O(n) algorithm completely beats an O(log n) algorithm for small n. The schedule is like that. Your software won't be accruing all that value forever. Competitors will release something, customers will move on. You have a limited window to extract value from your existing features, usually only a few years. By releasing features later, you are throwing away part of that window.

And of course there's chart #4, where we simply never launch at all, because there are too many features with too many integration points and the team just can't handle it; every time they fix one thing, another thing breaks. I don't want to name particular projects, but I have certainly seen this happen lots of times.

So, can you really get a 5x revenue improvement just by reducing multitasking? Well, I don't know. Maybe that's way too high in the real world. But how about 2x? I think I'm willing to claim 2x. And I know I can do infinitely better than chart #4. By the way, none of these simulated SWEs were doing any kind of simulated Kanban with simulated index cards and simulated boards with holes. We just had an informal agreement with PMs about how many features they were going to try to launch at a time (where a feature is, say, a story or a set of stories; we'll get to that soon). Kanban is just a psychological game to enforce that agreement. But if the engineers and PMs trust each other, you don't need all that fluff.

Now that we've seen the simulations, we can go back to Painless Software Schedules to emphasize one of the points Joel makes a big deal about: you can't negotiate about how fast people will work. Engineers work at different rates at different times, but it really isn't very variable. Remember those surprisingly straight downward slopes in the simulations? It's just as surprising in real life projects - I'll get to that in a bit. But that straight line makes a clear statement: that yes, I really can extrapolate to predict how long it will take to finish a particular set of tasks, and no, I really cannot do it faster and I probably will not do it slower, either. End of story, conversation over.

Once we know that - and burndown charts are quite good at convincing people that it's true - then we can have the real conversation, which is, what things actually need to go into the release and which things don't? Once we know the work rate - the slope of that line - we can move the requirements around to hit whatever date we want. We want to do this as early as possible of course, because late requirements changes are expensive, as we saw in the simulator. But since we do know the work rate, we can predict the completion date immediately, which means it's possible to have those discussions really early, rather than as we approach the deadline.

Suddenly, the manager and the engineers are on equal footing. The manager is a trained negotiator and can fire you, but the engineers have empirical truth on their side: this is not a matter of opinion, and the facts are not going to change. Better still, the engineers mostly don't care too much what order they do the features in (within reason). So the manager or PM, who is now making the scheduling decisions, doesn't need to negotiate with engineers. They can negotiate with sales or marketing or bizdev teams or whoever, who have actual feature requirements and date requirements, and agree on a combination of features and dates that is physically possible in the real world.

And then, crucially, they will not tell those dates to the engineering team. The engineering team doesn't need to know. That would be setting a goal, and goals are bad. The engineers just do the work in priority order, and don't multitask too much, and let statistics handle things. No student syndrome, no Whatever Conference rush, no working weekends, no burnout, and no lazy days needed to recover from burnout.

It sounds too good to be true, but this actually works. It's math. Coming up, we have a bunch of slides where I continue to try to convince you that it's just math and real life matches the simulation.

Actually putting it into practice

...but first, it's time to finally start talking about stories and story points.

(Most ideas from this section are blatantly stolen from Estimating Like an Adult.)

I'm going to be honest, I find the way "stories" are done in Agile to be kinda juvenile and faintly embarrassing. You make up these fake "personas" that have names, and pets, and jobs, and feelings, and you describe how they're going to go about their day until they bump into some problem, which your software is then going to solve. But it doesn't yet, of course, because the story isn't true yet, it's fiction, that's what makes it a story, and your job as an engineer is to make the persona's dreams come true.

I'm sorry, I tried to remove the snark from that paragraph, but I just couldn't do it. I think this format, with realistic people and feelings and such, is extremely useful while PMs discuss designs with UX designers (who think about character archetypes) and UX researchers (who tell literal stories about literal people who tried to use the product or mocks of the product). No problem there. That's not juvenile, that's empathy, and it's essential to good UX. I don't know exactly how they do things at Apple, but there's no way they aren't writing excruciatingly detailed stories about their users' product interactions.

But eventually the stories need to get explained to stereotypically un-empathetic software engineers, who often got into this business because they wanted to have a job that mostly did not involve empathizing. Not all of them, but many of them. And let's be honest, the engineer working on a new feature for Gmail's backend probably just doesn't need to know all those personal details. "User will be able to search for emails by keyword, and the results will be returned in no more than 2000ms, and results will be ranked by relevance" is fine, thanks, and gets to the point a lot faster.

But here's the reason Agile pushes those stories so hard. Stories are, by necessity, about a person who does something. That person is always the customer. We don't write stories about what the backend engineer does (which is how we often write bugs/tasks, and that's fine). We write stories about what the customer does. That's absolutely essential. At the level of abstraction we're talking about - features that will deliver value, like in our simulation - only things that affect the customer are allowed, because things that don't affect the customer do not deliver value to the customer.

There are lots of things engineers have to do, in order to deliver value, that do not themselves deliver value to the customer. You might have to do 10 of those to deliver value. Well, in that case, you have 10 bugs (or tasks), making up one story (feature). That's fine; that's what we did in the simulator, too. Engineers were fixing bugs for days and days before a feature was done enough to launch.

For the following discussion, that's what you need to know. Stories don't have to be written like a story, but they have to truly reflect something the user will benefit from. None of this works otherwise.

Agile doesn't just talk about stories, though; it talks about story points. Story points are an absolute breakthrough in project estimation; I might make fun of any other part of Agile (and sometimes I do), but this part is genius. You can tell it's genius because it sounds so trivial, and yet fixes numerous problems that you probably didn't even know you had. But don't worry, of course I'm going to tell you all about those problems.

In my talk, I popped up this slide on the big screen and asked people to do an experiment with me. I didn't know for sure if it would work, but it ended up working out even better than I had hoped. I asked a simple question: "How tall is the third rectangle from the left?"

One person immediately piped up, "Three!" "Ok, three what?" I asked. "Units." "What units? Tell me the height in, say, centimeters." And now people had a problem. "Can we just measure it?" someone asked. "The size depends on what screen you display it on," said another. On my laptop screen, it was maybe 4cm tall; on the projector, maybe 2 feet. "How tall would it be for, say, a typical L5 screen that Meets Expectations?"

It was surprising how quickly people understood this point for estimating rectangles, although it's somehow harder to understand for estimating tasks. Estimating absolute sizes is really tough. Estimating relative sizes is almost magically easy. Now, there was someone in the audience who thought it looked more like 4 units than 3, a disagreement which was actually excellent for multiple reasons, as we'll see in the next slide.

But before we go to the next slide, I want to say more about the problems of estimating tasks using absolute sizes. Don't experienced engineers get pretty good at estimating how long tasks will take? Let's say I'm going to write a basic Android app, and I've done a couple of basic Android apps before. Is this one easier or harder? How much harder? 2x, 3x? Okay, let's say it took 3 months to launch the last one, and this is 2x harder. Then say 6 months, right?

Well. Maybe. First of all, from when did we start counting? Conception, design, first day writing code? And hold on, are you going to hold me to that 6 month estimate? Maybe I should add a safety factor just in case. Let's say 8 or 9 months. But even worse, now we've set a goal, another one of those evil, productivity destroying goals. I have a project that I'm pretty sure will take 6 months, but maybe it's not as hard as I thought, and anyway I said 8 months, so I can relax for the first two at least. Maybe more if I'm willing to rush near the end, because I kinda remember rushing near the end of the last one. Student syndrome returns. And now we can guarantee it won't take less than the 8 month estimate, but it will almost certainly take longer, even though it was only a 6 month project. And that's why the more experienced a person gets, the longer their estimates get.

What happened here? We let the engineer know the due date. Never do that.

People have all these weird hangups about goals and about time. Story points have the astonishing ability to just bypass all that. Nobody sets a "goal" that the project will take 5 points to complete. What does that even mean? It's a five point story, it will always take 5 points to complete, no matter how long that turns out to be. It is 5 points, by definition. What are we arguing about?

With story points, there is no late, and there is no early. There's just how long it takes. Which we can measure surprisingly accurately, because story point estimates turn out to be surprisingly accurate. We just have to cheat by not telling the engineers the date we estimated, because when we do, their brains go haywire.

In the talk, we had this useful disagreement about whether the rectangle was really 3 or 4 units high, which was excellent for showing the next step: the actual process of coming to an estimate. For this, we steal a concept called Planning Poker, taught in an excellent corporate project planning class (not sure if it's still taught). The instructor, in turn, apparently stole it from some variants of Agile or Scrum.

Now, the method he teaches involves a physical deck of cards with numbers printed on them in a pseudo-fibonacci sequence. (You know this must be at least a fairly common game if you can buy these cards, which you can.) But playing cards are a pain for people in remote offices, because you have to do silly things like trying to squint at cards over a videoconference, etc. So my co-worker, Dr. Held, created a little spreadsheet version of the game which works pretty well.

You start by loading and making a copy of the spreadsheet for your team. Then you delete all the per-user tabs, and add one for each person on your team. Then you add those users' names in the first column of the Results tab.

For each round of voting, you set the Vote Code field to a new number. (You can just increment it, or use whatever formula you want. The only important thing is that it isn't the same as the last round of voting.) Then, each user opens their own tab in the spreadsheet on their own laptop, and enters the vote. Only once all users have voted (with the right vote code) are the votes revealed on the Results page.
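The reveal rule is the only clever part of the spreadsheet, and it's easy to express in code. This is a sketch of the logic, not Dr. Held's actual formulas (names are mine):

```python
def reveal(votes, current_code):
    """votes: {player: (vote_code, points)}. Return everyone's points
    only once all players have voted with the current round's code;
    until then, keep the votes hidden."""
    if votes and all(code == current_code for code, _ in votes.values()):
        return {player: pts for player, (code, pts) in votes.items()}
    return None  # still hidden

# A stale vote code from the last round keeps the results hidden:
hidden = reveal({"ann": (7, 3), "bob": (6, 5)}, current_code=7)  # None
shown = reveal({"ann": (7, 3), "bob": (7, 5)}, current_code=7)   # {'ann': 3, 'bob': 5}
```

Changing the vote code each round is what invalidates last round's entries, so nobody's old vote leaks into the new reveal.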

Now, it's actually very important that you don't bias people by discussing the number of points before the vote; you want their honest answers. We screwed that up in this exercise, because we'd already talked a lot about whether the box was size 3 or 4. But oh well, we also had troublemakers who intentionally voted differently, so it worked out this time for the demo. (We still saw the results of bias: as soon as the votes were revealed, a bunch of people rushed to change their answers to make them agree with the consensus. Don't do that!)

Note that the "3 vs 4" discussion was moot anyway, because of the "pseudo-fibonacci sequence" rule. You can't vote for 4, because it's too close to 3. You have to choose 3 or 5. The lucky disagreement about 3 or 4 was a really convenient way to show why the fibonacci sequence is useful for cutting down on needless precision.

So anyway, people all vote and then you see their results. If not everyone agrees, then the tradition is that you ask one of the people with the highest vote (the one who has spoken least recently) what they think is necessary in order to do the task. Then you ask one of the people with the lowest vote. Then, you discuss more if needed, or else go immediately to a revote, and you repeat until you all agree. As a shortcut, if the votes are split between two consecutive values, you can just pick the higher one of the two. (Mathematically it turns out not to matter what the rule is for dealing with split votes, as long as you're consistent. If you bias on the high side or the low side, the formula for converting points to time, i.e. the velocity, will just compensate. If you bias inconsistently, you'll get increased variability and thus an increased error margin.)

The really interesting thing about this method is that nobody has any particularly vested interest in a particular story being a particular number of points. After all, nobody knows the size of a point anyway, and nobody is setting a deadline or making promises of a completion date. So there's no reason to argue. Instead of fighting to be right about the exact size, people can instead focus on why two people have such a widely varying (at least two fibonacci slots) difference of opinion. When that happens, usually it's because there's missing information about the scope of the story, and that kind of missing information is what really screws up your estimates if you don't resolve it. The ensuing discussion often uncovers some very important misunderstanding (or unstated assumptions) in the story itself, which you can fix before voting again.

Another neat feature of the voting game is that you don't spend time discussing stories where everyone agrees on the size. Maybe not everyone understands every detail of the story right now, but they apparently understand well enough to estimate the size (and maybe they disagree on various details, but apparently those details don't affect the size much). So just accept the story size, and move on.

This combination of features makes it a great way to spend your weekly or biweekly team meeting:

  • You talk about things that definitely matter to customers

  • The person who has spoken least recently gets to talk, which reduces various social biases

  • You don't talk about stuff that everyone already knows about

  • You end the meeting early if you've gone through all the stories.

There are lots of ways to estimate story points, but you might want to give this one a try.

Okay, so now you have a bunch of stories and their sizes, in points. How do you track those stories and their progress? This is where I sell you a tool, right?

Nope! We still don't need a special tool for this. The thing about stories is that they're pretty big - it takes quite a while to deliver a unit of useful functionality to a user. At least days, maybe weeks or even months. The size of a story depends on how your team wants to do things. But don't make them unnecessarily small; it doesn't help much with accuracy, and it makes the estimation a lot more work. You want estimation to be so quick and easy that you end up estimating a lot of tasks that never get scheduled - because when the PM realizes how much work they are, they realize there's something more effective to be working on.

Note that this is a departure from the Joel on Software method, where he recommends estimating line items of no more than 4 hours each. Those aren't stories, though (his method mostly pre-dates Agile and stories), those are bugs (tasks). We'll get to those later. Stories are way bigger than 4 hours, and that's perfectly fine.

Since stories are so big, that means we don't have very many of them. Since we don't have very many of them, they are easy to keep track of using any method you want. Agile people use paper index cards, and they don't get overwhelmed. The course I took recommends just keeping a spreadsheet. That works for me.

The spreadsheet is just a bunch of rows that list the stories and their estimates, but most importantly, the sequence you're planning to do them in. I suggest working on no more than one at a time, if at all possible. Since each one, when implemented, is made up of a bunch of individual tasks (bugs), it is probably possible to share the work across several engineers. That's how you limit multitasking, like kanban says to do. If your team is really big or your stories are small, you'll have to work on several stories at a time. But try not to.

Note also what the spreadsheet does not need to contain:

  • Estimates of how much time has already been spent working on each story

  • Estimates of how much time is remaining

  • Pointers to which bugs are in the story

  • Expected completion date of the story

You don't need that stuff. This spreadsheet is for the PMs and executives to write down what they want, and in what order. The engineers estimate the relative effort each of those things will take. Then the PMs re-sequence those things to their heart's delight, except for the first unfinished one, because it's too inefficient to change the goals mid-milestone (as we saw in the simulation slides earlier).

But you still need status updates, right? Everyone loves status updates. How will we track status updates in a mere spreadsheet, without endless tedium and overhead?

We'll get to that. Hang on.

Bathroom break

That was a lot of serious story estimation business, and this was the spot where we had a designated bathroom break, so I included this meditation slide: a chart correlating rainfall at various sites in Peru, from some random analytics I did for a team I was working with. I thought it was neat.

Also, the NOAA keeps data sets for the rainfall amount in every 8km x 8km region, between 60N and 60S latitude, every 30 minutes, going back decades. Impressive!

Better bug (or task) tracking

We've spent a lot of time talking about Agile, and Kanban, and Joel, and Stories, but now we have to diverge from all of them. We're going to steal another trick from that project management course I took: we're going to fundamentally break our planning into two completely different layers of abstraction.

The course called these, I think, Business Tasks (or maybe Business Goals) and Engineering Tasks, but those are too hard to pronounce, so let's call them Stories and Bugs.

Agile mostly deals with stories (big). Joel mostly deals with bugs (small). The big innovation in the project management course I took is that we want to do both at once. We'll make stories that are a bit bigger than the stories usually used in Agile (but maybe not as big as an "Epic"), and then we'll split each of them into numerous bugs, on about the scale Joel uses for his individual tasks.

The nicest thing about this split is it allows for a clean division of labour. PMs come up with stories; engineers estimate them; PMs sequence them; engineers then work on them in order, ideally only one at a time. On the other hand, PMs don't care at all about bugs. Only engineers care about bugs, and do them in whatever order they want, with whatever level of multitasking they think is appropriate. Engineers can just avoid attending the story sequencing meetings, and PMs can avoid attending the bug processing meetings. It really cuts coordination overhead a lot.

However, one place where we'll depart from the course I took is in handling bugs. They recommended using a spreadsheet and Planning Poker (with different point scales) for both stories and bugs. I'm going to recommend against estimating bugs at all (see next slide). I also think for bugs, you can skip using a spreadsheet and file them straight in the bug tracker. In applying the method from the course I took, I found there was a lot of tedium in keeping things synced with a bug tracker, which comes down to a common software engineering problem: if you have a bug database and a spreadsheet, then you have two sources of truth, and synchronization between them is a lot of work and tends to break down. Just use the bug tracker. For that we're going to have to use bug trackers effectively, which we'll get to a bit later.

I really like this simplification, because it makes bug fixing a first class activity. That's different from most Agile methods, which tend to treat bugs as overhead (you need to do it, but it's not accomplishing a new story, so we don't really track it and thus don't reward it). Lots of teams use bug trackers a lot, so we need a method that works with what they already do.

To explain why estimating bugs is unnecessary, I first need to convince you that all bugs are effectively the same size. For that, I made yet another simulation. This one is very simple. First (in the top chart), I created 5 stories, and gave them sizes between 1 and 13 points. If we finish them in sequence with no multitasking, we burn down from 5 stories to 4, 3, 2, 1, 0 over time. As you can see, the top line is very erratic. If you tried to estimate story completion dates by drawing a straight line from (0, 5) to (350, 0), you'd be pretty far off, most of the time.

On the other hand, the second chart has a bunch of bugs, distributed between 0.1 days and 10 days in length (with shorter times being more common; essentially a Poisson distribution). That's a range of 2 orders of magnitude, or 100x duration difference from the shortest bug to the longest. Pretty huge, right? The top chart was only one order of magnitude; much less variable.

And yet, the bottom chart is more or less a straight line. Like the top chart, it assumes no multitasking; that is, there is only a single engineer working on exactly one bug at a time, and sometimes it takes them 10 days, but sometimes they get lucky with a stream of 0.1-day bugs. But it doesn't matter! There are so many bugs that it just all averages out. If you draw a straight line from beginning to end, you're only off by a few days at any given point in the sequence. That's way too small to matter. If you add in multiple engineers on a team, the variations get even less, because even while one poor sucker works on the 10-day bug, the rest of the team keeps making progress on 0.5, 1, and 2 day bugs. [Random side note: this is also why restaurants with at least two bathroom stalls are a good idea.]

Why does that happen? The underlying reason is the Central Limit Theorem (linked above), which says that if you sum enough samples from nearly any random distribution, it ends up converging on the normal (Gaussian) distribution. A Gaussian distribution has a mean and a standard deviation. The more samples you sum together, the smaller the standard deviation is as a fraction of the average. Which is a long-winded way of saying if you have a lot of bugs, they are all about the same size. Or close enough not to matter.
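You can check that claim yourself in a few lines: draw bug durations from a skewed distribution (I use an exponential here as a stand-in for the simulator's distribution), sum them, and watch the relative spread collapse as the count grows.

```python
import random

random.seed(1)

def relative_spread(n_bugs, trials=500):
    """Std deviation of the total duration of n_bugs bugs, as a
    fraction of its mean, estimated over many trials."""
    totals = [sum(random.expovariate(1.0) for _ in range(n_bugs))
              for _ in range(trials)]
    mean = sum(totals) / trials
    var = sum((t - mean) ** 2 for t in totals) / trials
    return (var ** 0.5) / mean

# One bug: spread is ~100% of the mean. A hundred bugs: ~10%.
```

That shrinking fraction (roughly 1/sqrt(n) for this distribution) is exactly why the bug-level burndown line looks straight even though individual bugs vary by 100x.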

You might not believe me, because this is a simulation. Good. Let's (finally!) look at some real patterns from some real projects.

Here's one from my old team's project bug tracker. Note that we're not just looking at one or two stories here; this is literally the whole component. I'm showing the burnup chart here instead of the burndown chart because for this particular team, it makes some characteristics a little more obvious. (In a burnup chart, the red dots correspond to new bugs being added, and green dots correspond to bugs being fixed. The difference between them would be a burndown chart.)

First of all, notice the distinct phases. We launched in late 2012; leading up to that, we were in heavy dogfood, so bugs were getting furiously filed and fixed. After launch, interestingly, bugs started getting filed a little more slowly (perhaps the most serious flaws had finally been resolved, which is why we launched) and fixed more slowly still, resulting in an increasing deviation between the two lines. (This corresponds to more and more bugs outstanding. In and of itself, that's pretty normal; most launched projects find that their user base is finding bugs faster than the engineers can fix them.)

In late 2015 there was a declared bug bankruptcy (long story). I have some choice words about bug bankruptcies but I'll save them for another slide. The main thing to notice is that the bankruptcy improved neither the filing nor the fixing rate. It caused a one-time bump in fixed bugs (a lying bump, of course, since they weren't really fixed, just forgotten). But the lines' slopes stayed the same.

Finally, we reorged the project in early 2017 and moved things elsewhere, which meant a lot fewer bugs in this component got fixed (nobody to fix them) but also a lot fewer got filed (nobody to file them). By mid-2017, a new team took over, and you can see bugs getting filed and fixed again, but at a much lower rate than before.

Cool story, right? But here's what you need to notice. Look how straight those lines are. There are defined segments corresponding to major events (launch; reorg; recovery). But for literally years at a time, the bug filing/fixing rates are remarkably stable. Just like in the simulations.

Here's another team I've been working with, that shall remain nameless. This chart is a little different: instead of a whole bug tracker project, this tracks only the particular milestone (hotlist) they were working on, starting in late July 2017. As you can see from the y axis labels, we're now talking about an order of magnitude fewer bugs, so you'd expect more variability.

The milestone was being worked on in July and finalized around August 4th. It includes all bugs that were still open as of August 4th, as long as they were relevant to the particular set of stories included in this milestone. The bugs could have been filed at any time back in the past (in this case, as far back as late 2014). Somewhat confusingly, bugs which were relevant to this milestone, but which were already fixed by triage time, were probably not added to the milestone. That's probably why you see this weird behaviour where the creation rate increases at first, before it stabilizes into a straight line. Notice that they explicitly stopped adding bugs to the milestone (ie. froze the requirements) around mid-to-late August. That's why the "Created" dots stop at that time, while the Resolved line continues to increase. However, it's pretty clear from the slopes of the two lines that even if they had continued to add new bugs at the same rate, they would have converged eventually. Story-relevant bugs were being found slower than fixed, which is what you need if you ever intend to launch.

Once again, look how straight those lines are. When the milestone was under construction during July, the work rate was already higher than it had been, but once it was fully committed in August, the engineers got right to work and productivity increased dramatically. You can clearly see the before, during, and after segments of the chart.

The most important observation is this: the engineers were, most likely, all working at about the same rate before the milestone goals were decided. (They were fixing a lot of bugs that don't show up in this filter: creating inventory.) But once everyone focused on the agreed-upon priority, performance on the highest priority items more than doubled. Just eyeballing it, maybe even 3x or 4x.

Well-defined strict priorities, communicated to everyone, can get you to launch 2x-4x faster than just letting engineers work on whatever seemed important at the time.

One more very artificial query: the entire internal bug tracker for a huge corporate project.

Again, even though we don't even really know what the project is, aren't the patterns interesting? It's pretty clear that the lines are, again, surprisingly straight, except for some major transitions. I'm guessing something launched around late 2012, and something else changed in mid-to-late 2014. Then something disastrous happened in early 2017, because the number of remaining open bugs (bottom chart) started shooting up out of control, with no end in sight. Observe also, though, that at the same time, the bug creation rate slowed a little (and even more over time), while the bug fixing rate slowed much more. To me, this sounds like a team being destaffed while a project is still in heavy use - a dangerous strategy at the best of times, and a recipe for angry customers.

But oh well. Even here, the key observation is that for very long stretches, other than transitions at major milestones (launch, another launch, destaffing), the lines are again surprisingly straight. I wouldn't feel too guilty taking a ruler and extrapolating a straight line to make a rough time estimate for how long something might take to finish, at any point in that chart. In fact, even after destaffing, the Remaining line is still straight. It'll never finish, because it's upward instead of downward, but it's straight and therefore predictable.

[Random side note: check out the scale on the top graph. They were handling about 15000 internally-facing bugs per year! Handling dogfood feedback for a popular project can be a nightmare. But it also gives us a clue that the reliability is probably not high enough.]

One more chart, for another project that shall remain nameless. This team did at least two bug bankruptcies (possibly another in 2012, though maybe it was just a manual re-triage since the slope isn't as steep). The problem is, once again, that the slopes of the lines didn't change.

Teams that find themselves under a seemingly endless deluge of bugs certainly have a problem: the ingress rate of their bug queue is faster than their egress rate. That will certainly cause an ever-increasing pile to build up. If your project is popular and understaffed, that's unavoidable. (Recall our simulation earlier of trying to continually launch features with the same team: after a while, you just can't make progress anymore because you're stuck maintaining all the old stuff.)

Unfortunately, some of those teams think they can improve their situation by correcting the absolute value of the number of open bugs. This doesn't work; if you can't improve the slope of the line, then a bug bankruptcy is at best a temporary fix. You might as well not be filing the bugs in the first place.
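The arithmetic behind this is easy to sketch. Here's a toy model (the weekly rates are made-up numbers for illustration, not from any real tracker) showing that zeroing the open-bug count leaves the slope of the line completely untouched:

```python
# Toy model of a bug backlog, to illustrate why a "bug bankruptcy"
# (zeroing the open-bug count) can't help when ingress > egress.
# The rates here are invented numbers, purely for illustration.

def backlog_over_time(weeks, ingress=15, egress=12, bankruptcy_at=None):
    """Open-bug count per week; optionally reset to 0 at one week."""
    open_bugs = 0
    history = []
    for week in range(weeks):
        if week == bankruptcy_at:
            open_bugs = 0  # mass-close everything as obsolete
        open_bugs += ingress - egress  # the slope is unchanged by the reset
        history.append(open_bugs)
    return history

no_reset = backlog_over_time(20)
with_reset = backlog_over_time(20, bankruptcy_at=10)

# Ten weeks after the bankruptcy, the pile is growing at exactly the
# same +3/week rate as before; only the absolute number moved.
print(no_reset[-1], with_reset[-1])  # 60 30
```

Only a change to the ingress or egress rate - i.e. the slope - changes the long-term picture.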

Look again at the previous slide. That team was handling 15000 bugs per year, and the number of bugs remaining open, while it fluctuated a bit, was neither increasing nor decreasing. Impressively, they had the deluge under control. So, we have evidence that it is, in fact, possible to keep your head above water. The team on this slide (and my own old team, three slides ago) was constantly falling behind, unfortunately. They're pretty normal; the more popular your project is, the more likely you are to experience it. But bankruptcy helped not at all. They need a real solution.

Let's use this as the motivator for the next section of my talk: bug triage. In other words, dealing with the inevitability that you can't fix bugs as fast as they get found.

But first, here's my simulation again. Remember chart #4? When they had all 8 features going at once, the open bug count just kept going up instead of down. I bet that team felt doomed. They could have declared bankruptcy, but that wouldn't have helped either.

The first three charts give us a clue. The same exact team, working on the same product, was able to not get into the ever-rising-tide situation. How? By strictly prioritizing, and by controlling the level of multitasking. The trick is that the first three plots don't show the count of all bugs outstanding: they just show the count of bugs related to the project's current priorities.

That's how we deal with the deluge situation: we only count the bugs that really matter. Which leaves only the question of how we decide which ones matter. That's triage.

Triage: sorting bugs

Before I try to give some tips for bug triage, I want to go back to psychology and talk a bit about what does and doesn't work well in a bug tracker.

First of all, in my corner of the world, priority levels have been made almost completely meaningless by well-intentioned teams. Depending on the project, P0 often ends up paging someone, which is rarely what you want, so you usually don't use P0. P1 often has a stupid bot attached, which pings the bug if you stop responding, trying to enforce some kind of short SLO, so the pattern is often that you file a bug, then someone looks at it for a while, makes little progress, gets annoyed by the bot, and lowers the priority to P2, only to be interrupted by the next round of new P1 bugs. It maximizes distraction without necessarily getting more bugs resolved. Oops! P2 is the default, which is fine, but of course, almost all bugs are P2 so your bug pile is undifferentiated. And at this point everyone knows that P3 and P4 both mean "will never implement." I don't know why we need two separate levels for that, but okay, that's how it is.

Some people complain about the tyranny of bug priority schemes chosen by other teams, and how they shouldn't affect my team because we want to use something less silly, but that doesn't work. Everyone has been trained that when they file a bug, those are the levels. Trying to be different will just confuse your users, because they can't keep track of every team's special snowflake policy.

"Severity," a field available in some bug tracking tools, is even weirder. I've never found anyone who can explain what it's actually for. I know what it supposedly means: there are definitions about how many end users it's affecting, and how badly. Like, if there's a typo in the menu bar of a major app, that's definitely S0 because it affects a lot of people and it's super embarrassing. Whereas a crash that affects 10 users out of a billion is maybe S4. Okay, sure. But nobody ever sorts by severity; they sort by priority. Those could just as easily have been P0 and P4. What is a P4 bug with an S0 severity? A placebo. It made someone feel good because there is a field for "yes, this bug is actually important!" and another person decided "but not important enough to work on." Maybe this method works, but I don't know for whom. Anyway, I advise just ignoring severity fields.

More psychology: resolving as "Working as Intended" (the infamous "WAI") always just makes your customer angry. So does mass-resolving bugs as obsolete without looking at them. But pro tip: setting them to P3 or P4 does not create this effect. It's a useful mystery. It means the same thing, but customers are happier. Might as well do it. Instead of a bug bankruptcy, let's call it a bug refinancing.

Of course, that advice about leaving more bugs open, and all of them at P2, and ignoring the severity, leaves you with a giant undifferentiated ever-growing mass of bugs. That's no good. On the other hand, that's nothing new; if you're like one of those popular teams on the other slides, you were going to end up with that anyway, because your ingress rate is simply greater than your egress rate, no matter what. We need a system to deal with this.

Step one is understanding the triage step. It's totally different from the fixing step, for all the reasons listed in the slide.

Many teams try to take every incoming bug and make sure it's assigned to someone on the team, under the assumption that the person will reassign it or fix it eventually. Of course, that doesn't work when ingress > egress; each person's queue will be ever-growing, not ever-shrinking, and eventually the Assignee field also loses all meaning, and you have an even bigger problem. That's not what triage is. That's just giving everything high priority, which as we all know, means nothing is high priority. (Incidentally, that's the mistake we made on my old team: skipping triage and just assigning every bug to someone. I had to ask around a lot to find out that successful teams were all using a triage method that actually works!)

Okay Avery, so now you're saying don't assign the bugs to people, don't close them, set them all to average priority, and ignore severity, all while accepting that the pile will be ever-growing? This is not sustainable.

[This slide is a visual representation of unsustainable clutter. I started with an image search for "overrun by bugs" and quickly learned my lesson. Don't do that.]

Moving on to the real advice.

Step one: never just look at an entire project-wide bug tracker. It's like looking at every file on your filesystem, or every web site indexed by a search engine. Beyond a certain size, it just doesn't work anymore. Your project is almost certainly at that size. Bug trackers have queries. It's time to learn how to use them to solve your problem.

Second, a huge surprise that I didn't realize until I started asking around among the teams that were using bug trackers effectively. Bug tracker "components" (aka dividing by "project" or "area") are almost useless! They are only good for one thing: helping your end users point your bug at the right triage team. Each triage team triages one or more components. Each component is triaged by exactly one triage team. If you only have one triage team, there's no real reason to have a dozen sub-components, no matter how much sense that seems to make, because it will just confuse your end users (they won't know which component to use when filing a new bug).

Instead, use hotlists (aka "milestones" or "labels" or "tags" depending what bug tracker you use). Your triage team can figure out what a bug is about and how important it is. Their main job is to assign it to the right label(s), then move on to the next one. Once it's in a milestone or label, you no longer care what component it's in, because the eng teams working on features are working on a strictly-prioritized set of stories (remember!) and the stories are defined by one of those milestones. And the high priority labels are not depressingly ever-increasing in size.

[To give some concrete examples, I'm going to use Google Issue Tracker, which is not very popular, but happens to be something I use a lot. In Issue Tracker, you can create and view "hotlists" (aka "labels", specific lists of bugs that you have tagged) and "saved searches" (search queries that you have named and saved for later use, which update their contents as bugs change). Every bug tracker has something like these two features, if you explore around. But the syntax of these specific example queries is from Google Issue Tracker.]

Now all you need is a way to tell the triage team which bugs they haven't yet triaged. That's easy enough: make a bug query for "everything in my interesting set of components" but subtract "everything in my hotlist called ‘Triaged'". I call this query "Needs Triage."
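As a sketch in Google Issue Tracker syntax, that query might look like this (the component and hotlist IDs here are placeholders I made up - substitute your own):

```
componentid:12345 -hotlistid:67890
```

Here 12345 stands for a component your triage team owns, and 67890 stands for your "Triaged" hotlist: everything in the component, minus everything already triaged.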

Then the triage team, for each bug, will:

  • Assign the bug to at least one hotlist (either a release milestone, aka story, or a backlog of not-yet-scheduled features; maybe it's even relevant to more than one feature hotlist)
  • Assign the bug to the Triaged hotlist to make it disappear from their queue.

That's it! No need to assign it to anyone - the eng team pulling from a given hotlist can do that.

If you have a giant bug tracker that already has a lot of bugs, and you want to get started using this method, add a "modified>DATE" query filter. That will make it so that only bugs edited after the given date are subject to triage. But any old bug, if someone pings it, will immediately make it appear, so that you can triage it right away. Much better than declaring bankruptcy!
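With the same placeholder IDs as before, and a hypothetical cutoff date, that starting-point query might look like:

```
componentid:12345 -hotlistid:67890 modified>2017-12-01
```

Only bugs touched after the cutoff show up; everything older stays dormant until someone pings it.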

If you have a huge backlog and are thinking of declaring bankruptcy, another trick is to make a separate query that again looks at all the bugs in all the components you're interested in, and subtracts out everything in the Triaged list. But this one, instead of having a modified>DATE absolute cutoff, uses a relative cutoff date. In this example I picked 730 days, or two years. What this means is that any bug created more than two years ago, but untriaged, will show up in the query, inviting you to triage it now. Over the course of the next two years, this moving window will slowly pass over all your remaining bugs. (If you have more than two years of backlog, increase the date accordingly. It'll take longer to go over the whole set, but then again, you already had super old bugs, so this won't make it any worse!)
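The re-triage variant of the query might look something like this (placeholder IDs again, and check your tracker's exact relative-date syntax - the 730d form here is an assumption based on the filter style described above):

```
componentid:12345 -hotlistid:67890 created<730d
```

That is: untriaged bugs in your components, created more than 730 days ago. Shrink the 730 over time and the window sweeps through the whole backlog.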

Most people faced with triaging a huge bug backlog just give up. Argh, thousands of bugs? It's just too frustrating. Kill them all with fire (bug bankruptcy). This re-triage query is disproportionately much more pleasant than trying to triage them all in a burst or an ad-hoc series of bursts. If you can triage all your new incoming bugs - and I assure you that you can, using this method, because triaging bugs is way easier than fixing bugs - then you can also re-triage your old bugs. After all, the rate of ingress of bugs two years ago was almost certainly <= the rate of ingress today. Plus, you've already closed a lot of the old bugs from that time, and you'll only see the remaining ones. Plus, many old bugs will be obviously obsolete and can be politely closed right away. (My own experience, and the experience of at least one other person I talked to, is that roughly half the bugs turn out to be obsolete and half are still important. I've heard some people estimate 90% obsolete, but I haven't seen that be true in practice. Even if it were, closing bugs as obsolete is easy, and losing the remaining 10% of real bugs is pretty bad.)

If you're like me, you might find that the re-triage window actually moves a little too slowly, and you have extra capacity to re-triage more bugs on some days, between meetings or whatever. If so, you can just edit the query to reduce the 730d to a smaller number, a bit at a time. Then it'll take even less than two years to go through them all.

One more tip: sometimes when triaging, you need more information. People often make one of two mistakes when this happens:

  1. They resolve the bug as "not reproducible", which makes the bug filer angry (even if you ask them to reopen it when they've added better information) or
  2. They send the bug back to the reporter for more information, without triaging it. If you have no triage hotlist, and the original reporter never responds, or worse, they respond but forget to send it back to you, you now have an undead bug floating around in your bug tracker with no way to discover and rescue it. Or, if you do use the Triage hotlist and Needs Triage queries, you have the annoying behaviour that it will keep showing up in your triage queue until the customer responds - which may be never.

The solution is the "Needs Discussion" hotlist. When an untriaged bug needs more discussion before it can be adequately triaged, add it to this hotlist and assign it back to the sender. Change your "Needs Triage" query to exclude everything in the "Needs Discussion" hotlist. But now, add a "Needs Discussion (stalled)" query with the magic "-modified:6d" filter shown above. That makes the bug disappear from your queries for 6 days, or for as long as people are still communicating on the bug. Once it goes idle for 6 days, it reappears, at which time you can a) remember to continue the discussion; or b) close the bug Not Reproducible, because the person who filed it is non-responsive, and they have lost the right to be angry; or c) if you have all the information you need, finally triage it.
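Concretely, with one more placeholder hotlist ID (55555 standing in for "Needs Discussion"), the pair of queries might look like:

```
componentid:12345 -hotlistid:67890 -hotlistid:55555    Needs Triage
hotlistid:55555 -modified:6d                           Needs Discussion (stalled)
```

The first hides bugs under discussion from the triage queue; the second resurfaces them only once the conversation has been quiet for 6 days.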

You can use a shorter time than 6 days if you prefer. Longer is probably a bad idea.

[Side note: it would be really nice if bug trackers had some kind of "discussion" flow feature that could track whose job it is to respond. Ideally we could just wake up right away when it's our team's turn again, rather than using the hardcoded delay.]

In the last few slides, we've made all these hotlists and queries. Now we have to put them all together. For this, you want to use your bug tracker's "dashboard" or "bookmark groups" features, which are generally just ordered sequences of hotlists and queries. I recommend the overall order shown above.

"...Release Milestones..." are hotlists representing stories you have scheduled for upcoming releases. "...Backlogs..." are hotlists representing stories or bugs you haven't scheduled for any release yet (and in some cases, probably never will).

Congratulations, you made it to the end! If you thought it took a long time to read all this, just imagine how long it took me to write. Phew.

Posted Thu Dec 14 14:19:52 2017 Tags:

Recently a user filed a bug where the same RGB color, converted first into a UIColor and then into a CGColor, comes out different from creating a CGColor directly from the RGB values on recent versions of iOS.

You can see the difference here:

What is happening here is that CGColors created directly from the RGB values are placed in the kCGColorSpaceGenericRGB colorspace. Starting with iOS 10, UIColor objects are created with a device-specific color space; in my current simulator this is kCGColorSpaceExtendedSRGB.

You can see the differences in this workbook

Posted Thu Dec 7 22:10:05 2017 Tags:

Just wanted to close the chapter on Mono's TLS 1.2 support which I blogged about more than a year ago.

At the time, I shared the plans that we had for upgrading the support for TLS 1.2.

We released that code in Mono 4.8.0 in February of 2017 which used the BoringSSL stack on Linux and Apple's TLS stack on Xamarin.{Mac,iOS,tvOS,watchOS}.

In Mono 5.0.0 we extracted the TLS support from the Xamarin codebase into the general Mono codebase and it became available as part of the Mono.framework distribution as well as becoming the default.

Posted Tue Nov 21 03:40:34 2017 Tags: