The lower-post-volume people behind the software in Debian. (List of feeds.)

The Internet has gotten too big.

Growing up, I, like many computery people of my generation, was an idealist. I believed that better, faster communication would be an unmitigated improvement to society. "World peace through better communication," I said to an older co-worker, once, as the millenium was coming to an end. "If people could just understand each others' points of view, there would be no reason for them to fight. Government propaganda will never work if citizens of two warring countries can just talk to each other and realize that the other side is human, just like them, and teach each other what's really true."

[ has an excellent article about this sort of belief system.]

"You have a lot to learn about the world," he said.

Or maybe he said, "That's the most naive thing I've ever heard in my entire life." I can't remember exactly. Either or both would have been appropriate, as it turns out.

What actually happened

There's a pattern that I don't see talked about much, but which seems to apply in all sorts of systems, big and small, theoretical and practical, mathematical and physical.

The pattern is: the cheaper interactions become, the more intensely a system is corrupted. The faster interactions become, the faster the corruption spreads.

What is "corruption?" I don't know the right technical definition, but you know it when you see it. In a living system, it's a virus, or a cancer, or a predator. In a computer operating system, it's malware. In password authentication, it's phishing attacks. In politics, it's lobbyists and grift. In finance, it's high-frequency trading and credit default swaps. In Twitter, it's propaganda bots. In earth's orbit, it's space debris and Kessler syndrome.

On the Internet, it's botnets and DDoS attacks.

What do all these things have in common? That they didn't happen when the system was small and expensive. The system was just as vulnerable back then - usually even more so - but it wasn't worth it. The corruption didn't corrupt. The attacks didn't evolve.

What do I mean by evolve? Again, I don't know what technical definition to use. The evolutionary process of bacteria isn't the same as the evolutionary process of a spam or phishing attack, or malware, or foreign-sponsored Twitter propaganda, or space junk in low-earth orbit. Biological infections evolve, we think, by random mutation and imperfect natural selection. But malware evolves when people redesign it. And space junk can turn violent because of accidental collisions; not natural, and yet the exact opposite of careful human design.

Whatever the cause, the corruption happens in all those cases. Intelligent human attackers are only one way to create a new corruption, but they're a fast, persistent one. The more humans you connect, the more kinds of corruption they will generate.

Most humans aren't trying to create corruption. But it doesn't matter, if a rare corruption can spread quickly. A larger, faster, standardized network lets the same attack offer bigger benefits to the attacker, without increasing cost.


One of the natural defenses against corruption is diversity. There's this problem right now where supposedly the most common strain of bananas is dying out because they are all genetically identical, so the wrong fungus at the right time can kill them all. One way to limit the damage would be to grow, say, 10 kinds of bananas; then when there's a banana plague, it'll only kill, say, 10% of your crop, which you can replace over the next few years.

That might work okay for bananas, but for human diseases, you wouldn't want to be one of the unlucky 10%. For computer viruses, maybe we can have 10 operating systems, but you still don't want to be the unlucky one, and you also don't want to be stuck with the 10th best operating system or the 10th best browser. Diversity is how nature defends against corruption, but not how human engineers do.

In fact, a major goal of modern engineering is to destroy diversity. As Deming would say, reduce variation. Find the "best" solution, then deploy it consistently everywhere, and keep improving it.

When we read about adversarial attacks on computer vision, why are they worse than Magic Eye drawings or other human optical illusions? Because they can be perfectly targeted. An optical illusion street sign would only fool a subset of humans, only some of the time, because each of our neural nets is configured differently from everyone else's. But every neural net in every car of a particular brand and model will be susceptible to exactly the same illusion. You can take a perfect copy of the AI, bring it into your lab, and design a perfect illusion that fools it. Subtracting natural diversity has turned a boring visual attack into a disproportionately effective one.

The same attacks work against a search engine or an email spam filter. If you get a copy of the algorithm, or even query it quickly enough and transparently enough in a black box test, you can design a message to defeat it. That's the SEO industry and email newsletter industry, in a nutshell. It's why companies don't want to tell you which clause of their unevenly-enforced terms of service you violated; because if you knew, you'd fine tune your behaviour to be evil, but not quite evil enough to trip over the line.

It's why human moderators still work better than computer moderators: because humans make unpredictable mistakes. It's harder to optimize an attack against rules that won't stay constant.

...but back to the Internet

I hope you didn't think I was going to tell you how to fix Twitter and Facebook and U.S. politics. The truth is, I have no idea at all. I just see that the patterns of corruption are the same. Nobody bothered to corrupt Twitter and Facebook until they got influential enough to matter, and then everybody bothered to corrupt them, and we have no defense. Start filtering out bots - which of course you must do - and people will build better bots, just like they did with email spam and auto-generated web content and CAPTCHA solvers. You're not fighting against AI, you're fighting against I, and the I is highly incentivized by lots and lots of money and power.

But, ok, wait. I don't know how to fix giant social networks. But I do know a general workaround to this whole class of problem: slow things down. Choose carefully who you interact with. Interact with fewer people. Make sure you are certain which people they are.

If that sounds like some religions' advice about sex, it's probably not a coincidence. It's also why you shouldn't allow foreigners to buy political ads in your country. And why local newspapers are better than national ones. And why "free trade" goes so bad, so often, even though it's also often good. And why doctors need to wash their hands a lot. (Hospital staff are like the Internet of Bacteria.)

Unfortunately, this general workaround translates into "smash Facebook" or "stop letting strangers interact on Twitter," which is not very effective because a) it's not gonna happen, and b) it would destroy lots of useful interactions. So like I said, I've got nothing for you there. Sorry. Big networks are hard.

But Avery, the Internet, you said

Oh right. Me and my cofounders at have been thinking about a particular formulation of this problem. Let's forget about Internet Scale problems (like giant social networks) for a moment. The thing is, only very few problems are Internet Scale. That's what makes them newsworthy. I hate to be the bearer of bad news, but chances are, your problems are not Internet Scale.

Why is it so hard to launch a simple web service for, say, just your customers or employees? Why did the Equifax breach happen, when obviously no outsiders at all were supposed to have access to Equifax's data? How did the Capital One + AWS hack happen, when Capital One clearly has decades of experience with not leaking your data all over the place?

I'll claim it again... because the Internet is too big.

Equifax's data was reachable from the Internet even though it should have only been accessible to a few support employees. Capital One's data surely used to be behind layers and layers of misconfigured firewalls, unhelpful proxy servers, and maybe even pre-TCP/IP legacy mainframe protocols, but then they moved it to AWS, eliminating that diversity and those ad-hoc layers of protection. Nobody can say modernizing their systems was the wrong choice, and yet the result was the same result we always get when we lose diversity.

AWS is bananas, and AWS permission bug exploits are banana fungus.

Attackers perfect their attack once, try it everywhere, scale it like crazy.

Back in the 1990s, I worked with super dumb database apps running on LANs. They stored their passwords in plaintext, in files readable by everyone. But there was never a (digital) attack on the scale of 2019 Capital One. Why?

Because... there was no Internet. Well, there was, but we weren't on it. Employees, with a bit of tech skill, could easily attack the database, and surely some got into some white collar crime. And you occasionally heard stories of kids "hacking into" the school's grading system and giving themselves an A. I even made fake accounts on a BBS or two. But random people in foreign countries didn't hack into your database. And the kids didn't give A's to millions of other kids in all the other schools. It wasn't a thing. Each corruption was contained.

Here's what we've lost sight of, in a world where everything is Internet scale: most interactions should not be Internet scale. Most instances of most programs should be restricted to a small set of obviously trusted people. All those people, in all those foreign countries, should not be invited to read Equifax's PII database in Argentina, no matter how stupid the password was. They shouldn't even be able to connect to the database. They shouldn't be able to see that it exists.

It shouldn't, in short, be on the Internet.

On the other hand, properly authorized users, who are on the Internet, would like to be able to reach it from anywhere. Because requiring all the employees to come to an office location to do their jobs ("physical security") seems kinda obsolete.

That leaves us with a conundrum, doesn't it?

Wouldn't it be nice though? If you could have servers, like you did in the 1990s, with the same simple architectures as you used in the 1990s, and the same sloppy security policies developer freedom as you had in the 1990s, but somehow reach them from anywhere? Like... a network, but not the Internet. One that isn't reachable from the Internet, or even addressable on the Internet. One that uses the Internet as a substrate, but not as a banana.

That's what we're working on.

Literary Afterthoughts

I'm certainly not the first to bring up all this. Various sci-fi addresses the problem of system corruption due to excess connectivity. I liked A Fire Upon the Deep by Vernor Vinge, where some parts of the universe have much better connectivity than others and it doesn't go well at all. There's also the Rifters Trilogy by Peter Watts, in which the Internet of their time is nearly unusable because it's cluttered with machine-generated garbage.

Still, I'd be interested in hearing about any "real science" on the general topic of systems corruption at large scales with higher connectivity. Is there math for this? Can we predict the point at which it all falls apart? Does this define an upper limit on the potential for hyperintelligence? Will this prevent the technological singularity?

Logistical note

I'm normally an East Coast person, but I'll be visiting the San Francisco Bay area from August 26-30 to catch up with friends and talk to people about Tailscale, the horrors of IPv6, etc. Feel free to contact me if you're around and would like to meet up.

Posted Mon Aug 19 07:39:48 2019 Tags:

Given that the main development workflow for most kernel maintainers is with email, I spend a lot of time in my email client. For the past few decades I have used (mutt), but every once in a while I look around to see if there is anything else out there that might work better.

One project that looks promising is (aerc) which was started by (Drew DeVault). It is a terminal-based email client written in Go, and relies on a lot of other go libraries to handle a lot of the “grungy” work in dealing with imap clients, email parsing, and other fun things when it comes to free-flow text parsing that emails require.

aerc isn’t in a usable state for me just yet, but Drew asked if I could document exactly how I use an email client for my day-to-day workflow to see what needs to be done to aerc to have me consider switching.

Note, this isn’t a criticism of mutt at all. I love the tool, and spend more time using that userspace program than any other. But as anyone who knows email clients, they all suck, it’s just that mutt sucks less than everything else (that’s literally their motto)

I did a (basic overview of how I apply patches to the stable kernel trees quite a few years ago) but my workflow has evolved over time, so instead of just writing a private email to Drew, I figured it was time to post something showing others just how the sausage really is made.

Anyway, my email workflow can be divided up into 3 different primary things that I do:

  • basic email reading, management, and sorting
  • reviewing new development patches and applying them to a source repository.
  • reviewing potential stable kernel patches and applying them to a source repository.

Given that all stable kernel patches need to already be in Linus’s kernel tree first, the workflow of the how to work with the stable tree is much different from the new patch workflow.

Basic email reading

All of my email ends up in either two “inboxes” on my local machine. One for everything that is sent directly to me (either with To: or Cc:) as well as a number of mailing lists that I ensure I read all messages that are sent to it because I am a maintainer of those subsystems (like (USB), or (stable)). The second inbox consists of other mailing lists that I do not read all messages of, but review as needed, and can be referenced when I need to look something up. Those mailing lists are the “big” linux-kernel mailing list to ensure I have a local copy to search from when I am offline (due to traveling), as well as other “minor” development mailing lists that I like to keep a copy locally like linux-pci, linux-fsdevel, and a few other smaller vger lists.

I get these maildir folders synced with the mail server using (mbsync) which works really well and is much faster than using (offlineimap), which I used for many many years ends up being really slow for when you do not live on the same continent as the mail server. (Luis’s) recent post of switching to mbsync finally pushed me to take the time to configure it all properly and I am glad that I did.

Let’s ignore my “lists” inbox, as that should be able to be read by any email client by just pointing it at it. I do this with a simple alias:

alias muttl='mutt -f ~/mail_linux/'

which allows me to type muttl at any command line to instantly bring it up:

What I spend most of the time in is my “main” mailbox, and that is in a local maildir that gets synced when needed in ~/mail/INBOX/. A simple mutt on the command line brings this up:

Yes, everything just ends up in one place, in handling my mail, I prune relentlessly. Everything ends up in one of 3 states for what I need to do next:

  • not read yet
  • read and left in INBOX as I need to do something “soon” about it
  • read and it is a patch to do something with

Everything that does not require a response, or I’ve already responded to it, gets deleted from the main INBOX at that point in time, or saved into an archive in case I need to refer back to it again (like mailing list messages).

That last state makes me save the message into one of two local maildirs, todo and stable. Everything in todo is a new patch that I need to review, comment on, or apply to a development tree. Everything in stable is something that has to do with patches that need to get applied to the stable kernel tree.

Side note, I have scripts that run frequently that email me any patches that need to be applied to the stable kernel trees, when they hit Linus’s tree. That way I can just live in my email client and have everything that needs to be applied to a stable release in one place.

I sweep my main INBOX ever few hours, and sort things out either quickly responding, deleting, archiving, or saving into the todo or stable directory. I don’t achieve a constant “inbox zero”, but if I only have 40 or so emails in there, I am doing well.

So, for this main workflow, I need an easy way to:

  • filter the INBOX by a pattern so that I only see one “type” of message at a time (more below)
  • read an email
  • write an email
  • respond to existing email, and use vim as an editor as my hands have those key bindings burned into them.
  • delete an email
  • save an email to one of two mboxes with a press of a few keystrokes
  • bulk delete/archive/save emails all at once

These are all tasks that I bet almost everyone needs to do all the time, so a tool like aerc should be able to do that easily.

A note about filtering. As everything comes into one inbox, it is easier to filter that mbox based on things so I can process everything at once.

As an example, I want to read all of the messages sent to the linux-usb mailing list right now, and not see anything else. To do that, in mutt, I press l (limit) which brings up a prompt for a filter to apply to the mbox. This ability to limit messages to one type of thing is really powerful and I use it in many different ways within mutt.

Here’s an example of me just viewing all of the messages that are sent to the linux-usb mailing list, and saving them off after I have read them:

This isn’t that complex, but it has to work quickly and well on mailboxes that are really really big. As an example, here’s me opening my “all lists” mbox and filtering on the linux-api mailing list messages that I have not read yet. It’s really fast as mutt caches lots of information about the mailbox and does not require reading all of the messages each time it starts up to generate its internal structures.

All messages that I want to save to the todo directory I can do with a two keystroke sequence, .t which saves the message there automatically

Again, that’s a binding I set up years ago, , jumps to the specific mbox, and . copies the message to that location.

Now you see why using mutt is not exactly obvious, those bindings are not part of the default configuration and everyone ends up creating their own custom key bindings for whatever they want to do. It takes a good amount of time to figure this out and set things up how you want, but once you are over that learning curve, you can do very complex things easily. Much like an editor (emacs, vim), you can configure them to do complex things easily, but getting to that level can take a lot of time and knowledge. It’s a tool, and if you are going to rely on it, you should spend the time to learn how to use your tools really well.

Hopefully aerc can get to this level of functionality soon. Odds are everyone else does something much like this, as my use-case is not unusual.

Now let’s get to the unusual use cases, the fun things:

Development Patch review and apply

When I decide it’s time to review and apply patches, I do so by subsystem (as I maintain a number of different ones). As all pending patches are in one big maildir, I filter the messages by the subsystem I care about at the moment, and save all of the messages out to a local mbox file that I call s (hey, naming is hard, it gets worse, just wait…)

So, in my linux/work/ local directory, I keep the development trees for different subsystems like usb, char-misc, driver-core, tty, and staging.

Let’s look at how I handle some staging patches.

First, I go into my ~/linux/work/staging/ directory, which I will stay in while doing all of this work. I open the todo mbox with a quick ,t pressed within mutt (a macro I picked from somewhere long ago, I don’t remember where…), and then filter all staging messages, and save them to a local mbox with the following keystrokes:

l staging
s ../s

Yes, I could skip the l staging step, and just do T staging instead of T, but it’s nice to see what I’m going to save off first before doing so:

Now all of those messages are in a local mbox file that I can open with a single keystroke, ’s’ on the command line. That is an alias:

alias s='mutt -f ../s'

I then dig around in that mbox, sort patches by driver type to see everything for that driver at once by filtering on the name and then save those messages to another mbox called ‘s1’ (see, I told you the names got worse.)

l erofs
s ../s1

I have lots of local mbox files all “intuitively” named ‘s1’, ‘s2’, and ‘s3’. Of course I have aliases to open those files quickly:

alias s1='mutt -f ../s1'
alias s2='mutt -f ../s2'
alias s3='mutt -f ../s3'

I have a number of these mbox files as sometimes I need to filter even further by patch set, or other things, and saving them all to different mboxes makes things go faster.

So, all the erofs patches are in one mbox, let’s open it up and review them, and save the patches that look good enough to apply to another mbox:

Turns out that not all patches need to be dealt with right now (moving erofs out of the staging directory requires other people to review it, so I just save those messages back to the todo mbox:

Now I have a single patch that I want to apply, but I need to add some acks from the maintainers of erofs provided. I do this by editing the “raw” message directly from within mutt. I open the individual messages from the maintainers, cut their reviewed-by line, and then edit the original patch and add those lines to the patch:

Some kernel maintainers right now are screaming something like “Automate this!”, “Patchwork does this for you!”, “Are you crazy?” Yeah, this is one place that I need to work on, but the time involved to do this is not that much and it’s not common that others actually review patches for subsystems I maintain, unfortunately.

The ability to edit a single message directly within my email client is essential. I end up having to fix up changelog text, editing the subject line to be correct, fixing the mail headers to not do foolish things with text formats, and in some cases, editing the patch itself for when it is corrupted or needs to be fixed (I want a Linkedin skill badge for “can edit diff files by hand and have them still work”)

So one hard requirement I have is “editing a raw message from within the email client.” If an email client can not do this, it’s a non-starter for me, sorry.

So we now have a single patch that needs to be applied to the tree. I am already in the ~/linux/work/staging/ directory, and on the correct git branch for where this patch needs to go (how I handle branches and how patches move between them deserve a totally different blog post…)

I can apply this patch in one of two different ways, using git am -s ../s1 on the command line, piping the whole mbox into git and applying the patches directly, or I can apply them within mutt individually by using a macro.

When I have a lot of patches to apply, I just pipe the mbox file to git am -s as I’m comfortable with that, and it goes quick for multiple patches. It also works well as I have lots of different terminal windows open in the same directory when doing this and I can quickly toggle between them.

But we are talking about email clients at the moment, so here’s me applying a single patch to the local git tree:

All it took was hitting the L key. That key is set up as a macro in my mutt configuration file with a single line:

macro index L '| git am -s'\n

This macro pipes the output of the current message to git am -s.

The ability of mutt to pipe the current message (or messages) to external scripts is essential for my workflow in a number of different places. Not having to leave the email client but being able to run something else with that message, is a very powerful functionality, and again, a hard requirement for me.

So that’s it for applying development patches. It’s a bunch of the same tasks over and over:

  • collect patches by a common theme
  • filter the patches by a smaller subset
  • review them manually and responding if there are problems
  • saving “good” patches off to apply
  • applying the good patches
  • jump back to the first step

Doing that all within the email program and being able to quickly get in, and out of the program, as well as do work directly from the email program, is key.

Of course I do a “test build and sometimes test boot and then push git trees and notify author that the patch is applied” set of steps when applying patches too, but those are outside of my email client workflow and happen in a separate terminal window.

Stable patch review and apply

The process of reviewing patches for the stable tree is much like the development patch process, but it differs in that I never use ‘git am’ for applying anything.

The stable kernel tree, while under development, is kept as a series of patches that need to be applied to the previous release. This series of patches is maintained by using a tool called (quilt). Quilt is very powerful and handles sets of patches that need to be applied on top of a moving base very easily. The tool was based on a crazy set of shell scripts written by Andrew Morton a long time ago, and is currently maintained by Jean Delvare and has been rewritten in perl to make them more maintainable. It handles thousands of patches easily and quickly and is used by many developers to handle kernel patches for distributions as well as other projects.

I highly recommend it as it allows you to reorder, drop, add in the middle of the series, and manipulate patches in all sorts of ways, as well as create new patches directly. I do this for the stable tree as lots of times we end up dropping patches from the middle of the series when reviewers say they should not be applied, adding new patches where needed as prerequisites of existing patches, and other changes that with git, would require lots of rebasing.

Rebasing a git does not work for when you have developers working “down” from your tree. We usually have the rule with kernel development that if you have a public tree, it never gets rebased otherwise no one can use it for development.

Anyway, the stable patches are kept in a quilt series in a repository that is kept under version control in git (complex, yeah, sorry.) That queue can always be found (here).

I do create a linux-stable-rc git tree that is constantly rebased based on the stable queue for those who run test systems that can not handle quilt patches. That tree is found (here) and should not ever be used by anyone for anything other than automated testing. See (this email for a bit more explanation of how these git trees should, and should not, be used.

With all that background information behind us, let’s look at how I take patches that are in Linus’s tree, and apply them to the current stable kernel queues:

First I open the stable mbox. Then I filter by everything that has upstream in the subject line. Then I filter again by alsa to only look at the alsa patches. I look at the individual patches, looking at the patch to verify that it really is something that should be applied to the stable tree and determine what order to apply the patches in based on the date of the original commit.

I then hit F to pipe the message to a script that looks up the Fixes: tag in the message to determine what stable tree, if any, the commit that this fix was contained in.

In this example, the patch only should go back to the 4.19 kernel tree, so when I apply it, I know to stop at that place and not go further.

To apply the patch, I hit A which is another macro that I define in my mutt configuration

macro index A |'~/linux/stable/apply_it_from_email'\n
macro pager A |'~/linux/stable/apply_it_from_email'\n

It is defined “twice” as you can have different key bindings when you are looking at mailbox’s index of all messages from when you are looking at the contents of a single message.

In both cases, I pipe the whole email message to my apply_it_from_email script.

That script digs through the message, finds the git commit id of the patch in Linus’s tree, then runs a different script that takes the commit id, exports the patch associated with that id, edits the message to add my signed-off-by to the patch as well as dropping me into my editor to make any needed tweaks that might be needed (sometimes files get renamed so I have to do that by hand, and it gives me one final change to review the patch in my editor which is usually easier than in the email client directly as I have better syntax highlighting and can search and review the text better.

If all goes well, I save the file and the script continues and applies the patch to a bunch of stable kernel trees, one after another, adding the patch to the quilt series for that specific kernel version. To do all of this I had to spawn a separate terminal window as mutt does fun things to standard input/output when piping messages to a script, and I couldn’t ever figure out how to do this all without doing the extra spawn process.

Here it is in action, as a video as (asciinema) can’t show multiple windows at the same time.

Once I have applied the patch, I save it away as I might need to refer to it again, and I move on to the next one.

This sounds like a lot of different steps, but I can process a lot of these relatively quickly. The patch review step is the slowest one here, as that of course can not be automated.

I later take those new patches that have been applied and run kernel build tests and other things before sending out emails saying they have been applied to the tree. But like with development patches, that happens outside of my email client workflow.

Bonus, sending email from the command line

In writing this up, I remembered that I do have some scripts that use mutt to send email out. I don’t normally use mutt for this for patch reviews, as I use other scripts for that (ones that eventually got turned into git send-email), so it’s not a hard requirement, but it is nice to be able to do a simple:

mutt -s "${subject}" "${address}" <  ${msg} >> error.log 2>&1

from within a script when needed.

Thunderbird also can do this, I have used:

thunderbird --compose "to='${address}',subject='${subject}',message=${msg}"

at times in the past when dealing with email servers that mutt can not connect to easily (i.e. gmail when using oauth tokens).

Summary of what I need from an email client

So, to summarize it all for Drew, here’s my list of requirements for me to be able to use an email client for kernel maintainership roles:

  • work with local mbox and maildir folders easily
  • open huge mbox and maildir folders quickly.
  • custom key bindings for any command. Defaults that are sane is always good, but everyone is used to a previous program and training fingers can be hard.
  • create new key bindings for common tasks (like save a message to a specific mbox)
  • easily filter messages based on various things. Full regexes are not needed, see the PATTERNS section of ‘man muttrc’ for examples of what people have come up with over the years as being needed by an email client.
  • when sending/responding to an email, bring it up in the editor of my choice, with full headers. I know aerc already uses vim for this, which is great as that makes it easy to send patches or include other files directly in an email body
  • edit a message directly from the email client and then save it back to the local mbox it came from
  • pipe the current message to an external program

That’s what I use for kernel development.

Oh, I forgot:

  • handle gpg encrypted email. Some mailing lists I am on send everything encrypted with a mailing list key which is needed to both decrypt the message and to encrypt messages sent to the list. SMIME could be used if GPG can’t work, but both of them are probably equally horrid to code support for, so I recommend GPG as that’s probably used more often.

Bonus things that I have grown to rely on when using mutt is:

  • handle displaying html email by piping to an external tool like w3m
  • send a message from the command line with a specific subject and a specific attchment if needed.
  • specify the configuration file to use as a command line option. It is usually easier to have one configuration file for a “work” account, and another one for a “personal” one, with different email servers and settings provided for both.
  • configuration files that can include other configuration files. Mutt allows me to keep all of my “core” keybindings in one config file and just specific email server options in separate config files, allowing me to make configuration management easier.

If you have made it this far, and you aren’t writing an email client, that’s amazing, it must be a slow news day, which is a good thing. I hope this writeup helps others to show them how mutt can be used to handle developer workflows easily, and what is required of a good email client in order to be able to do all of this.

Hopefully other email clients can get to state where they too can do all of this. Competition is good and maybe aerc can get there someday.

Posted Wed Aug 14 11:00:44 2019 Tags:

The average user has approximately one thumb per hand. That thumb comes in handy for a number of touchpad interactions. For example, moving the cursor with the index finger and clicking a button with the thumb. On so-called Clickpads we don't have separate buttons though. The touchpad itself acts as a button and software decides whether it's a left, right, or middle click by counting fingers and/or finger locations. Hence the need for thumb detection, because you may have two fingers on the touchpad (usually right click) but if those are the index and thumb, then really, it's just a single finger click.

libinput has had some thumb detection since the early days when we were still hand-carving bits with stone tools. But it was quite simplistic, as the old documentation illustrates: two zones on the touchpad, a touch started in the lower zone was always a thumb. Where a touch started in the upper thumb area, a timeout and movement thresholds would decide whether it was a thumb. Internally, the thumb states were, Schrödinger-esque, "NO", "YES", and "MAYBE". On top of that, we also had speed-based thumb detection - where a finger was moving fast enough, a new touch would always default to being a thumb. On the grounds that you have no business dropping fingers in the middle of a fast interaction. Such a simplistic approach worked well enough for a bunch of use-cases but failed gloriously in other cases.

Thanks to Matt Mayfields' work, we now have a much more sophisticated thumb detection algorithm. The speed detection is still there but it better accounts for pinch gestures and two-finger scrolling. The exclusion zones are still there but less final about the state of the touch, a thumb can escape that "jail" and contribute to pointer motion where necessary. The new documentation has a bit of a general overview. A requirement for well-working thumb detection however is that your device has the required (device-specific) thresholds set up. So go over to the debugging thumb thresholds documentation and start figuring out your device's thresholds.

As usual, if you notice any issues with the new code please let us know, ideally before the 1.14 release.

Posted Wed Jul 17 10:09:00 2019 Tags:
Source. Wikiepdia. Public Domain.
Because more and more cheminformatics I do is with Bioclipse scripts (see doi:10.1186/1471-2105-10-397) and that Bioclipse is currently unmaintained and has become hard to install, I decided to take the plunge and rewrite some stuff so that I could run the scripts from the command line. I wrote up the first release back in April.

Today, I release Bacting 0.0.5 (doi:10.5281/zenodo.3252486) which is the first release you can download from one of the main Maven repositories. I'm still far from a Maven or Grapes expert, but at least you can use Bacting now like this without actually having to download and compile the source code locally first:


workspaceRoot = "."
cdk = new net.bioclipse.managers.CDKManager(workspaceRoot);

println cdk.fromSMILES("CCO")

If you have been using Bacting before, then please note the change in groupId. If you want to check out all functionality, have a look at the changelogs of the releases.

If you want to cite Bacting, please cite the Bioclipse 2 paper and for the version release, follow the instructions on Zenodo. Pending an article. The Journal of Open Source Software? Sounds like a good idea!
Posted Sat Jun 22 14:50:00 2019 Tags:

This is merely an update on the current status quo, if you read this post in a year's time some of the details may have changed

libinput provides an API to handle graphics tablets, i.e. the tablets that are used by artists. The interface is based around tools, each of which can be in proximity at any time. "Proximity" simply means "in detectable range". libinput promises that any interaction is framed by a proximity in and proximity out event pair, but getting to this turned out to be complicated. libinput has seen a few changes recently here, so let's dig into those. Remember that proverb about seeing what goes into a sausage? Yeah, that.

In the kernel API, the proximity events for pens are the BTN_TOOL_PEN bit. If it's 1, we're in proximity, if it's 0, we're out of proximity. That's the theory.

Wacom tablets (or rather the kernel driver) always reset all axes on proximity out. So libinput needs to take care not to send a 0 value to the caller, lest you want a jump to the top left corner every time you move the pen away from the tablet. Some Wacom pens have serial numbers and we use those to uniquely identify a tool. But some devices start sending proximity and axis events before we get the serial numbers which means we can't identify the tool until several ms later. In that case we simply discard the serial. This means we cannot uniquely identify those pens but so far no-one has complained.

A bunch of tablets (HUION) don't have proximity at all. For those, we start getting events and then stop getting events, without any other information. So libinput has a timer - if we don't get events for a given time, we force a proximity out. Of course, this means we also need to force a proximity in when the next event comes in. These tablets are common enough that recently we just enabled the proximity timeout for all tablets. Easier than playing whack-a-mole, doubly so because HUION re-uses USD ids so you can't easily identify them anyway.

Some tablets (HP Spectre 13) have proximity but never send it. So they advertise the capability, just don't generate events for it. Same handling as the ones that don't have proximity at all.

Some tablets (HUION) have proximity, but only send it once per plug-in, after that it's always in proximity. Since libinput may start after the first pen interaction, this means we have to a) query the initial state of the device and b) force proximity in/out based on the timer, just like above.

Some tablets (Lenovo Flex 5) sometimes send proximity out events, but sometimes do not. So for those we have a timer and forced proximity events, but only when our last interaction didn't trigger a proximity event.

The Dell Active Pen always sends a proximity out event, but with a delay of ~200ms. That timeout is longer than the libinput timeout so we'll get a proximity out event, but only after we've already forced proximity out. We can just discard that event.

The Dell Canvas pen (identifies as "Wacom HID 4831 Pen") can have random delays of up to ~800ms in its event reporting. Which would trigger forced proximity out events in libinput. Luckily it always sends proximity out events, so we could quirk out to specifically disable the timer.

The HP Envy x360 sends a proximity in for the pen, followed by a proximity in from the eraser in the next event. This is still an unresolved issue at the time of writing.

That's the current state of things, I'm sure it'll change in a few months time again as more devices decide to be creative. They are artist's tools after all.

The lesson to take away here: all of the above are special cases that need to be implemented but this can only be done on demand. There's no way any one person can test every single device out there and testing by vendors is often nonexistent. So if you want your device to work, don't complain on some random forum, file a bug and help with debugging and testing instead.
Posted Wed Jun 19 00:34:00 2019 Tags:

We're on the road to he^libinput 1.14 and last week I merged the Dell Canvas Totem support. "Wait, what?" I hear you ask, and "What is that?". Good question - but do pay attention to random press releases more. The Totem ( is a round knob that can be placed on the Dell Canvas. Which itself is a pen and touch device, not unlike the Wacom Cintiq range if you're familiar with those (if not, there's always lmgtfy).

The totem's intended use is as secondary device - you place it on the screen while you're using the pen and up pops a radial menu. You can rotate the totem to select items, click it to select something and bang, you're smiling like a stock photo model eating lettuce. The radial menu is just an example UI, there are plenty others. I remember reading papers about bimanual interaction with similar interfaces that dated back to the 80s, so there's a plethora to choose from. I'm sure someone at Dell has written Totem-Pong and if they have not, I really question their job priorities. The technical side is quite simple, the totem triggers a set of touches in a specific configuration, when the firmware detects that arrangement it knows this isn't a finger but the totem.

Pen and touch we already handle well, but the totem required kernel changes and a few new interfaces in libinput. And that was the easy part, the actual UI bits will be nasty.

The kernel changes went into 4.19 and as usual you can throw noises of gratitude at Benjamin Tissoires. The new kernel API basically boils down to the ABS_MT_TOOL_TYPE axis sending MT_TOOL_DIAL whenever the totem is detected. That axis is (like others of the ABS_MT range) an odd one out. It doesn't work as an axis but rather an enum that specifies the tool within the current slot. We already had finger, pen and palm, adding another enum value means, well, now we have a "dial". And that's largely it in terms of API - handle the MT_TOOL_DIAL and you're good to go.

libinput's API is only slightly more complicated. The tablet interface has a new tool type called the LIBINPUT_TABLET_TOOL_TYPE_TOTEM and a new pair of axes for the tool, the size of the touch ellipse. With that you can get the position of the totem and the size (so you know how big the radial menu needs to be). And that's basically it in regards to the API. The actual implementation was a bit more involved, especially because we needed to implement location-based touch arbitration first.

I haven't started on the Wayland protocol additions yet but I suspect they'll look the same as the libinput API (the Wayland tablet protocol is itself virtually identical to the libinput API). The really big changes will of course be in the toolkits and the applications themselves. The totem is not a device that slots into existing UI paradigms, it requires dedicated support. Whether this will be available in your favourite application is likely going to be up to you. Anyway, christmas in July [1] is coming up so now you know what to put on your wishlist.

[1] yes, that's a thing. Apparently christmas with summery temperature, nice weather, sandy beaches is so unbearable that you have to re-create it in the misery of winter. Explains everything you need to know about humans, really.

Posted Tue Jun 18 23:37:00 2019 Tags:

As everyone seems to like to put kernel trees up on github for random projects (based on the crazy notifications I get all the time), I figured it was time to put up a semi-official mirror of all of the stable kernel releases on

It can be found at: and I will try to keep it up to date with the real source of all kernel stable releases at

It differs from Linus’s tree at: in that it contains all of the different stable tree branches and stable releases and tags, which many devices end up building on top of.

So, mirror away!

Also note, this is a read-only mirror, any pull requests created on it will be gleefully ignored, just like happens on Linus’s github mirror.

If people think this is needed on any other git hosting site, just let me know and I will be glad to push to other places as well.

This notification was also cross-posted on the new site, go follow that for more kernel developer stuff.

Posted Sat Jun 15 20:40:54 2019 Tags:

We create, develop, document and collaborate as users of Free and Open Source Software (FOSS) from around the globe, usually by working remotely on the Internet. However, human beings have many millennia of evolution that makes us predisposed to communicate most effectively via in-person interaction. We don't just rely on the content of communication, but its manner of expression, the body language of the communicator, and thousands of different non-verbal cues and subtle communication mechanisms. In fact, I believe something that's quite radical for a software freedom activist to believe: meeting in person to discuss something is always better than some form of online communication. And this belief is why I attend so many FOSS events, and encourage (and work in my day job to support) programs and policies that financially assist others in FOSS to attend such events.

When I travel, Delta Airlines often works out to be the best option for my travel: they have many international flights from my home airport (PDX), including a daily one to AMS in Europe — and since many FOSS events are in Europe, this has worked out well.

Admittedly, most for-profit companies that I patronize regularly engage in some activity that I find abhorrent. One of the biggest challenges of modern middle-class life in an industrialized soceity is figuring out (absent becoming a Thoreau-inspired recluse) how to navigate one's comfort level with patronizing companies that engage in bad behaviors. We all have to pick our own boycotts and what vendors we're going to avoid.

I realize that all the commercial airlines are some of the worst environmental polluters in the world. I realize that they all hire union-busting law firms to help them mistreat their workers. But, Delta Airlines recent PR campaign to frighten their workers about unions was one dirty trick too far.

I know unions can be inconvenient for organizational leadership; I actually have been a manager of a workforce who unionized while I was an executive. I personally negotiated that union contract with staff. The process is admittedly annoying and complicated. But I fundamentally believe it's deeply necessary, because workers' rights to collectively organize and negotiate with their employers is a cornerstone of equality — not just in the USA but around the entire world.

Furthermore, the Delta posters are particularly offensive because they reach into the basest problematic instinct in humans that often becomes our downfall: the belief that one's own short-term personal convenience and comfort should be valued higher than the long-term good of our larger communityf. It's that instinct that causes us to litter, or to shun public transit and favor driving a car and/or calling a ride service.

We won't be perfect in our efforts to serve the greater good, and sometimes we're going to selfishly (say) buy a video game system with money that could go to a better cause. What's truly offensive, and downright nefarious here, is that Delta Airlines — surely in full knowledge of the worst parts of some human instincts — attempted to exploit that for their own profit and future ability to oppress their workforce.

As a regular Delta customer (both personally, and through my employer when they reimburse my travel), I had to decide how to respond to this act that's beyond the pale. I've decided on the following steps:

  • I've written the following statement via Delta's complaint form:

    I am a Diamond Medallion (since 2016) on Delta, and I've flown more than 975,000 miles on Delta since 2000. I am also a (admittedly small) shareholder in Delta myself (via my retirement savings accounts).

    I realize that it is common practice for your company (and indeed likely every other airline) to negotiate hard with unions to get the best deal for your company and its shareholders. However, taking the step to launch what appears to be a well-funded and planned PR campaign to convince your workers to reject the union and instead spend union dues funds on frivolous purchases instead is a despicable, nefarious strategy. Your fiduciary duty to your shareholders does not mandate the use of unethical and immoral strategies with your unionizing labor force — only that you negotiate in good faith to get the best deal with them for the company.

    I demand that Delta issue a public apology for the posters. Ideally, such an apology should include a statement by Delta indicating that you believe your workers have the right to unionize and should take seriously the counter-arguments put forward by the union in favor of union dues and each employee should decide for themselves what is right.

    I've already booked my primary travel through the rest of the year, so I cannot easily pivot away from Delta quickly. This gives you some time to do the right thing. If Delta does not apologize publicly for this incident by November 1st, 2019, I plan to begin avoiding Delta as a carrier and will seek a status match on another airline.

    I realize that this complaint email will likely primarily be read by labor, not by management. I thus also encourage you to do two things: (a) I hope you'll share this message, to the extent you are permitted under your employment agreement, with your coworkers. Know that there are Diamond Medallions out here in the Delta system who support your right to unionize. (b) I hope you escalate this matter up to management decision-makers so they know that regular customers are unhappy at their actions.

  • Given that I'm already booked on many non-refundable Delta flights in the coming months, I would like to make business-card-sized flyers that say something like: I'm a Delta frequent flyer & I support a unionizing workforce. and maybe on the other side: Delta should apologize for the posters. It would be great if these had some good graphics or otherwise be eye-catching in some way. The idea would be to give them out to travelers and leave them in seat pockets on flights for others to find. If anyone is interested in this project and would like to help, email me — I have no graphic design skills and would appreciate help.
  • I'm encouraging everyone to visit Delta's complaint form and complain about this. If you've flown Delta before with a frequent flyer account, make sure you're logged into that account when you fill out the form — I know from experience their system prioritizes how seriously they take the complaint based on your past travel.
  • I plan to keep my DAL stock shares until the next annual meeting, and (schedule-permitting), I plan to attend the annual meeting and attempt to speak about the issue (or at least give out the aforementioned business cards there). I'll also look in to whether shareholders can attend earnings calls to ask questions, so maybe I can do something of this nature before the next annual meeting.

Overall, there is one positive outcome of this for me personally: I am renewed in my appreciation for having spent most of my career working for charities. Charities in the software freedom community have our problems, but nearly everyone I've worked with at software freedom charities (including management) have always been staunchly pro-union. Workers have a right to negotiate on equal terms with their employers and be treated as equals to come to equitable arrangements about working conditions and workplace issues. Unions aren't perfect, but they are the only way to effectively do that when a workforce is larger than a few people.

Posted Fri May 10 13:45:00 2019 Tags:
Understanding one of the most important changes in the high-speed-software ecosystem. #vectorization #sse #avx #avx512 #antivectors
Posted Tue Apr 30 14:45:32 2019 Tags:

Blog: Revisting the gui.cs framework

12 years ago, I wrote a small UI Library to build console applications in Unix using C#. I enjoyed writing a blog post that hyped this tiny library as a platform for Rich Internet Applications (“RIA”). The young among you might not know this, but back in 2010, “RIA” platforms were all the rage, like Bitcoin was two years ago.

The blog post was written in a tongue-in-cheek style, but linked to actual screenshots of this toy library, which revealed the joke:

First gui.cs application - a MonoTorrent client

This was the day that I realized that some folks did not read the whole blog post, nor clicked on the screenshot links, as I received three pieces of email about it.

The first was from an executive at Adobe asking why we were competing, rather than partnering on this RIA framework. Back in 2010, Adobe was famous for building the Flash and Flex platforms, two of the leading RIA systems in the industry. The second was from a journalist trying to find out more details about this new web framework, he was interested in getting on the phone to discuss the details of the announcement, and the third piece was from an industry analyst that wanted to understand what this announcement did for the strategic placement of my employer in their five-dimensional industry tracking mega-deltoid.

This tiny library was part of my curses binding for Mono in a time where I dreamed of writing and bringing a complete terminal stack to .NET in my copious spare time. Little did I know, that I was about to run out of time, as in little less than a month, I would start Moonlight - the open source clone of Microsoft Silverlight and that would consume my time for a couple of years.

Back to the Future

While Silverlight might have died, my desire to have a UI toolkit for console applications with .NET did not. Some fourteen months ago, I decided to work again on gui.cs, this is a screenshot of the result:

Sample app

In many ways the world had changed. You can now expect a fairly modern version of curses to be available across all Unices and Unix systems have proper terminfo databases installed.

Because I am a hopeless romantic, I called this new incarnation of the UI toolkit, gui.cs. This time around, I have updated it to modern .NET idioms, modern .NET build systems, and embraced the UIKit design for some of the internals of the framework and Azure DevOps to run my continuous builds and manage my releases to NuGet.

In addition, the toolkit is no longer tied to Unix, but contains drivers for the Windows console, the .NET System.Console (a less powerful version of the Windows console) and the ncurses library.

You can find the result in GitHub and you can install it on your favorite operating system by installing the Terminal.Gui NuGet package.

I have published both conceptual and API documentation for folks to get started with. Hopefully I will beat my previous record of two users.

The original layout system for gui.cs was based on absolute positioning - not bad for a quick hack. But this time around I wanted something simpler to use. Sadly, UIKit is not a good source of inspiration for simple to use layout systems, so I came up with a novel system for widget layout, one that I am quite fond of. This new system introduces two data types Pos for specifying positions and Dim for specifying dimensions.

As a developer, you assign Pos values to X, Y and Dim values to Width and Height. The system comes with a range of ways of specifying positions and dimensions, including referencing properties from other views. So you can specify the layout in a way similar to specifying formulas in a spreadsheet.

There is a one hour long presentation introducing various tools for console programming with .NET. The section dealing just with gui.cs starts at minute 29:28, and you can also get a copy of the slides.

Posted Tue Apr 23 03:20:25 2019 Tags:

First Election of the .NET Foundation

Last year, I wrote about structural changes that we made to the .NET Foundation.

Out of 715 applications to become members of the foundation, 477 have been accepted.

Jon has posted the results of our first election. From Microsoft, neither Scott Hunter or myself ran for the board of directors, and only Beth Massi remains. So we went from having a majority of Microsoft employees on the board to only having Beth Massi, with six fresh directors joining: Iris Classon, Ben Adams, Jon Skeet, Phil Haack, Sara Chipps and Oren Novotny

I am stepping down very happy knowing that I achieved my main goal, to turn the .NET Foundation into a more diverse and member-driven foundation.

Congratulations and good luck .NET Board of 2019!

Posted Fri Mar 29 15:01:07 2019 Tags:
This is a repost of something I wrote on Google Plus long ago, in October 2013. Thanks to for reminding me of it before G+ goes down for good!

I was asked on Twitter why Python uses 0-based indexing, with a link to a new (fascinating) post on the subject ( I recall thinking about it a lot; ABC, one of Python's predecessors, used 1-based indexing, while C, the other big influence, used 0-based. My first few programming languages (Algol, Fortran, Pascal) used 1-based or variable-based. I think that one of the issues that helped me decide was slice notation.

Let's first look at use cases. Probably the most common use cases for slicing are "get the first n items" and "get the next n items starting at i" (the first is a special case of that for i == the first index). It would be nice if both of these could be expressed as without awkward +1 or -1 compensations.

Using 0-based indexing, half-open intervals, and suitable defaults (as Python ended up having), they are beautiful: a[:n] and a[i:i+n]; the former is long for a[0:n].

Using 1-based indexing, if you want a[:n] to mean the first n elements, you either have to use closed intervals or you can use a slice notation that uses start and length as the slice parameters. Using half-open intervals just isn't very elegant when combined with 1-based indexing. Using closed intervals, you'd have to write a[i:i+n-1] for the n items starting at i. So perhaps using the slice length would be more elegant with 1-based indexing? Then you could write a[i:n]. And this is in fact what ABC did -- it used a different notation so you could write a@i|n.(See

But how does the index:length convention work out for other use cases? TBH this is where my memory gets fuzzy, but I think I was swayed by the elegance of half-open intervals. Especially the invariant that when two slices are adjacent, the first slice's end index is the second slice's start index is just too beautiful to ignore. For example, suppose you split a string into three parts at indices i and j -- the parts would be a[:i], a[i:j], and a[j:].

So that's why Python uses 0-based indexing.
Posted Sat Mar 23 05:03:00 2019 Tags:

I had to work on an image yesterday where I couldn't install anything and the amount of pre-installed tools was quite limited. And I needed to debug an input device, usually done with libinput record. So eventually I found that hexdump supports formatting of the input bytes but it took me a while to figure out the right combination. The various resources online only got me partway there. So here's an explanation which should get you to your results quickly.

By default, hexdump prints identical input lines as a single line with an asterisk ('*'). To avoid this, use the -v flag as in the examples below.

hexdump's format string is single-quote-enclosed string that contains the count, element size and double-quote-enclosed printf-like format string. So a simple example is this:

$ hexdump -v -e '1/2 "%d\n"'
This prints 1 element ('iteration') of 2 bytes as integer, followed by a linebreak. Or in other words: it takes two bytes, converts it to int and prints it. If you want to print the same input value in multiple formats, use multiple -e invocations.

$ hexdump -v -e '1/2 "%d "' -e '1/2 "%x\n"'
-11568 d2d0
23698 5c92
0 0
0 0
6355 18d3
1 1
0 0
This prints the same 2-byte input value, once as decimal signed integer, once as lowercase hex. If we have multiple identical things to print, we can do this:

$ hexdump -v -e '2/2 "%6d "' -e '" hex:"' -e '4/1 " %x"' -e '"\n"'
-10922 23698 hex: 56 d5 92 5c
0 0 hex: 0 0 0 0
14879 1 hex: 1f 3a 1 0
0 0 hex: 0 0 0 0
0 0 hex: 0 0 0 0
0 0 hex: 0 0 0 0
Which prints two elements, each size 2 as integers, then the same elements as four 1-byte hex values, followed by a linebreak. %6d is a standard printf instruction and documented in the manual.

Let's go and print our protocol. The struct representing the protocol is this one:

struct input_event {
#if (__BITS_PER_LONG != 32 || !defined(__USE_TIME_BITS64)) && !defined(__KERNEL__)
struct timeval time;
#define input_event_sec time.tv_sec
#define input_event_usec time.tv_usec
__kernel_ulong_t __sec;
#if defined(__sparc__) && defined(__arch64__)
unsigned int __usec;
__kernel_ulong_t __usec;
#define input_event_sec __sec
#define input_event_usec __usec
__u16 type;
__u16 code;
__s32 value;
So we have two longs for sec and usec, two shorts for type and code and one signed 32-bit int. Let's print it:

$ hexdump -v -e '"E: " 1/8 "%u." 1/8 "%06u" 2/2 " %04x" 1/4 "%5d\n"' /dev/input/event22
E: 1553127085.097503 0002 0000 1
E: 1553127085.097503 0002 0001 -1
E: 1553127085.097503 0000 0000 0
E: 1553127085.097542 0002 0001 -2
E: 1553127085.097542 0000 0000 0
E: 1553127085.108741 0002 0001 -4
E: 1553127085.108741 0000 0000 0
E: 1553127085.118211 0002 0000 2
E: 1553127085.118211 0002 0001 -10
E: 1553127085.118211 0000 0000 0
E: 1553127085.128245 0002 0000 1
And voila, we have our structs printed in the same format evemu-record prints out. So with nothing but hexdump, I can generate output I can then parse with my existing scripts on another box.
Posted Thu Mar 21 00:30:00 2019 Tags:

I made a little flow chart of mainstream programming languages and how programmers seem to move from one to another.

There's a more common kind of chart, which shows how the languages themselves evolved. I didn't want to show the point of view of language inventors, but rather language users, and see what came out. It looks similar, but not quite the same.

If you started out in language A, this shows which language(s) you most likely jumped to next. According to me. Which is not very scientific, but if you wanted science, you wouldn't be here, right? Perhaps this flow chart says more about me than it says about you.

Disclaimers: Yes, I forgot your favourite language. Yes, people can jump from any language to any other language. Yes, you can learn multiple languages and use the right one for the job. Yes, I have biases everywhere.

With that out of the way, I'll write the rest of this post with fewer disclaimers and state my opinions as if they were facts, for better readability.

I highlighted what are currently the most common "terminal nodes" - where people stop because they can't find anything better, in the dimensions they're looking for. The terminal nodes are: Rust, Java, Go, Python 3, Javascript, and (honourable mention because it's kinda just Javascript) node.js.

A few years ago, I also would have highlighted C as a terminal node. Maybe I still should, because there are plenty of major projects (eg. OS kernels) that still use it and which don't see themselves as having any realistic alternative. But the cracks are definitely showing. My favourite example is Fun with NULL pointers, in which the Linux kernel had a vulnerability caused by the compiler helpfully stripping out NULL pointer checks from a function because NULL values were "impossible." C is a mess, and the spec makes several critical mistakes that newer languages don't make. Maybe they'll fix the spec someday, though.

But let's go back a few steps. If we start at the top, you can see four main branches, corresponding to specializations where people seem to get into programming:

  1. "Low level" programming, including asm and C.
  2. "Business" or "learning" programming, starting with BASIC.
  3. Numerical/scientific programming, such as Fortran, MATLAB, and R.
  4. Scripting/glue programming, like shell (sh) and perl.

(We could maybe also talk about "database query languages" like SQL, except there's really only SQL, to my great dismay. Every attempt to replace it has failed. Database languages are stuck in the 1960s. They even still CAPITALIZE their KEYWORDS because (they THINK) that MAKES it EASIER to UNDERSTAND the CODE.)

(I also left out HTML and CSS. Sorry. They are real languages, but everybody has to learn them now, so there was nowhere to put the arrows. I also omitted the Lisp family, because it never really got popular, although a certain subgroup of people always wished it would. And I would have had to add a fifth category of programmer specialization, "configuring emacs.")

(And I skipped Haskell, because... well, I considered just depicting it as a box floating off to the side, with no arrows into or out of the box, but I figured that would be extraneous. It was a great meta-joke though, because Haskell precludes the concept of I/O unless you involve Monads.)

Anyway, let's go back to the 1990s, and pretend that the world was simple, and (1) low level programmers used C or asm or Turbo Pascal, (2) business programmers used VB, (3) Numerical programmers used Fortran or R or MATLAB, and (4) Glue programmers used sh or perl.

Back then, programming languages were kind of regimented like that. I hadn't really thought about it until I drew the chart. But you pretty obviously didn't write operating system kernels in perl, or glue in MATLAB, or giant matrix multiplications in visual basic.

How things have changed! Because you still don't, but now it's not obvious.

Language migration is mostly about style

Let's look at the section of the tree starting with asm (assembly language). Asm is an incredibly painful way to write programs, although to this day, it is still the best way to write certain things (the first few instructions after your computer boots, for example, or the entry code for an interrupt handler). Every compiled language compiles down to assembly, or machine language, eventually, one way or another, even if that happens somewhere inside the App Store or in a JIT running on your phone.

The first thing that happened, when we abstracted beyond asm, was a fork into two branches: the C-like branch and the Pascal-like branch. (Yes, Algol came before these, but let's skip it. Not many people would identify themselves as Algol programmers. It mostly influenced other languages.)

You can tell the Pascal-like branch because it has "begin...end". You can tell the C-like branch because it uses braces. C, of course, influenced the design of many languages in ways not shown in my chart. Because we're talking about programmers, not language designers.

Let's look at C first. Oddly enough, once people got started in C, they started using it for all kinds of stuff: it was one of the few languages where you could, whether or not it was a good idea, legitimately implement all four categories of programming problem. All of them were a bit painful (except low-level programming, which is what C is actually good at), but it was all possible, and it all ran at a decent speed.

But if you're a C programmer, where do you go next? It depends what you were using it for.

C++ was the obvious choice, but C++, despite its name and syntax, is philosophically not very C-like. Unless you're BeOS, you don't write operating system kernels in C++. The operating systems people stuck with C, at least until Rust arrived, which looks like it has some real potential.

But the business ("large programs") and numerical ("fast programs") people liked C++. Okay, for many, "liked" is not the right word, but they stuck with it, while there was nothing better.

For glue, many people jumped straight from C (or maybe C++) to python 2. I certainly did. Python 2, unlike the weirdness of perl, is a familiar, C-like language, with even simpler syntax. It's easy for a C programmer to understand how python C modules work (and to write a new python module). Calling a C function from python is cheaper than in other languages, such as Java, where you have to fight with a non-refcounting garbage collector. The python "os" module just gives you C system calls, the way C system calls work. You can get access to C's errno and install signal handlers. The only problem is python is, well, slow. But if you treat it as a glue language, you don't care about python's slowness; you write C modules or call C libraries or subprocesses when it's slow.

Separately, when Java came out, many C and C++ "business software" programmers were quick to jump to it. Java ran really slow (although unlike python, it was advertised as "theoretically fast"), but people happily paid the slowness price to get rid of C++'s long compile times, header file madness, portability inconveniences, and use-after-free errors.

I recall reading somewhere that the inventors of Go originally thought that Go would be a competitor for Java or C++, but that didn't really work out. Java is like that famous hotel, also probably from Menlo Park, where once you check in, you never check out. Meanwhile, people who still hadn't jumped from C++ to Java were not likely to jump to another language that a) also ran somewhat slower than C++ and b) also had garbage collection, a religious issue.

Where Go did become popular was with all those glue coders who had previously jumped to python 2. It turns out python's slowness was kind of a pain after all. And as computers get more and more insanely complicated, python glue programs tend to get big, and then the dynamic typing starts to bring more trouble than value, and pre-compiling your binaries starts to pay off. And python 2 uses plenty of memory, so Go gives a RAM improvement, not a detriment like when you move from C++. Go isn't much harder to write than python, but it runs faster and with (usually, somewhat) less RAM.

Nowadays we call Go a "systems" language because "glue" languages remind us too much of perl and ruby, but it's all the same job. (Try telling a kernel developer who uses C that Go is a "systems" language and see what they say.) It's glue. You glue together components to make a system.

The Hejlsberg factor

Let's look next at the Visual Basic and Pascal branches, because there's a weird alternate reality that you either find obviously right ("Why would I ever use something as painful as C or Java?") or obviously wrong ("Visual... Basic? Are you serious?")

Back in the 1980s and 1990s, some people still believed that programming should be approachable to new programmers, so personal computers arrived with a pre-installed programming language for free, almost always BASIC.

In contrast, when universities taught programming, they shunned BASIC ("It is practically impossible to teach good programming to students that have had a prior exposure to BASIC"), but also shunned C. They favoured Pascal, which was considered reasonably easy to learn, looked like all those historical Algol academic papers, and whose syntax could be used to teach a class about parsers without having to fail most of your students. So you had the academic branch and the personal computing branch, but what they had in common is that neither of them liked C.

BASIC on PCs (on DOS) eventually became Visual Basic on Windows, which until javascript came along was probably the most-used and most-loved programming language ever. (It is still the "macro" language used in Excel. There are a lot of Excel programmers, although most of them don't think they're programmers.)

Meanwhile, Pascal managed to migrate to PCs and get popular, mainly thanks to Turbo Pascal, which was probably the fastest compiler ever, by a large margin. They weren't kidding about the Turbo. They even got some C programmers to use it despite preferring C's syntax, just because it was so fast. (Turbo C was okay, but not nearly as Turbo. Faster than everyone else's C compiler, though.)

(Pascal in universities got more and more academic and later evolved into Modula and Ada. That branch would have probably died out if it weren't for the US military adopting Ada for high-reliability systems. Let's ignore Ada for today.)

At that point in history, we had two main branches of "business" developers: the BASIC branch and the Pascal branch. And now Windows was released, and Visual Basic. Turbo Pascal for DOS was looking a bit old. Turbo Pascal for Windows was not super compelling. In order to compete, the inventor of Turbo Pascal, Anders Hejlsberg, created Delphi, a visual environment like Visual Basic, but based on the Turbo Pascal language instead, and with fewer execrable always-missing-or-incompatible-dammit runtime DLLs.

It was really good, but it wasn't Microsoft, so business wise, things got tough. In an unexpected turn of events, eventually Hejlsberg ended up working at Microsoft, where he proceeded to invent the C# language, which launched the Microsoft .NET platform, which also had a Visual Basic .NET variant (which was terrible). This unified the two branches. Supposedly.

Unfortunately, as mentioned, VB.NET was terrible. It was almost nothing like Visual Basic; it was more like a slower version of C++, but with a skin of not-quite-Basic syntax on top, and a much worse UI design tool. C# also wasn't Delphi. But all those things were dead, and Microsoft pushed really hard to make sure they stayed that way. (Except Microsoft Office, which to this day still uses the original Visual Basic syntax, which they call "Visual Basic for Applications," or VBA. It might be more commonly used, and certainly more loved by its users, than all of .NET ever was.)

I actually don't know what became of Visual Basic programmers. Microsoft shoved them pretty hard to get them onto VB.NET, but most of them didn't go along. I wanted to draw the "where they really went" arrow in my diagram, but I honestly don't know. Perhaps they became web developers? Or maybe they write Excel macros.

I think it's interesting that nowadays, if you write software for Windows using Microsoft's preferred .NET-based platforms, you are probably using a language that was heavily influenced by Hejlsberg, whose languages were killed by Microsoft and Visual Basic before he killed them back.

Then he went on to write Typescript, but let's not get ahead of ourselves.

A brief history of glue languages

The original glue language was the Unix shell, famous because it introduced the concept of "pipelines" that interconnect small, simple tools to do something complicated.

Ah, those were the days.

Those days are dead and gone
and the eulogy was delivered by Perl.

   -- Rob Pike

It turns out to be hard to design small, simple tools, and mostly we don't have enough time for that. So languages which let you skip the small simple tools and instead write a twisted, gluey mess have become much more popular. (It doesn't help that sh syntax is also very flawed, especially around quoting and wildcard expansion rules.)

First came awk, which was a C-syntax-looking parser language that you could use in a shell pipeline. It was a little weird (at the time) to use a mini-language (awk) inside another language (sh) all in one "line" of text, but we got over it, which is a good thing because that's how the web works all the time now. (Let's skip over csh, which was yet another incompatible C-syntax-looking language, with different fatal flaws, that could be used instead of sh.)

Perl came next, "inspired" by awk, because awk didn't have enough punctuation marks. (Okay, just kidding. Kind of.)

Perl made it all the way to perl 5 with ever-growing popularity, then completely dropped the ball when they decided to stop improving the syntax in order to throw it all away and start from scratch with perl 6. (Perl 6 is not shown in my diagram because nobody ever migrated to it.)

This left room for the job of "glue" to fracture in several directions. If you thought perl syntax was ugly, you probably switched to python. If you thought perl syntax was amazing and powerful and just needed some tweaks, you probably switched to ruby. If you were using perl to run web CGI scripts, well, maybe you kept doing that, or maybe you gave up and switched to this new PHP thing.

It didn't take long for ruby to also grow web server support (and then Ruby on Rails). Python evolved that way too.

It's kind of interesting what happened here: a whole generation of programmers abandoned the command line - the place where glue programs used to run - and wanted to do everything on the web instead. In some ways, it's better, because for example you can hyperlink from one glue program to the next. In other ways it's worse, because all these modern web programs are slow and unscriptable and take 500MB of RAM because you have to install yet another copy of Electron and... well, I guess that brings us to the web.

Web languages

You will probably be unsurprised to see that my chart has pretty much everything in the whole "glue" branch converging on javascript. Javascript was originally considered a frontend-only language, but when node.js appeared, that changed forever. Now you can learn just one language and write frontends and backends and command-line tools. Javascript was designed to be the ultimate glue language, somehow tying together HTML, CSS, object-orientation, functional programming, dynamic languages, JITs, and every other thing you could make it talk to through an HTTP request.

But it's ugly. The emphasis on backward compatibility, which has been essential to the success of the web, also prevents people from fixing its worst flaws. Javascript was famously thrown together in 10 days in 1995. It's really excellent for 10 days of work, but there were also some mistakes, and we can't fix them.

This brings us to the only bi-directional arrow in my chart: from javascript to python 3, and back again. Let's call it the yin-yang of scripting languages.

Most of the other historical glue+web languages are fading away, but not python. At least not yet. I think that's because... it's sane. If you program long enough in javascript, the insanity just starts to get to you after a while. Maybe you need a pressure release valve and you switch to python.

Meanwhile, if you program in python long enough, eventually you're going to need to write a web app, and then it's super annoying that your frontend code is in a completely different language than the backend, with completely different quirks, where in one of them you say ['a','b','c'].join(',') and in the other you say ','.join(['a','b','c']) and you can never quite remember which is which.

One of them has a JIT that makes it run fast once it's started, but one of them starts fast and runs slow.

One of them has a sane namespace system, and the other one... well. Doesn't.

I don't think python 3 can possibly beat javascript in the long run, but it's not obvious it'll lose, either.

Meanwhile, Hejlsberg, never quite satisfied with his alternate reality branch of programming, saw the many problems with javascript and introduced TypeScript. Also meanwhile, Microsoft has suddenly stopped being so pushy about native Windows apps and started endorsing the web and open source in a big way. This means that for the first time, Microsoft is shoving its own developers toward web languages, which means javascript. They have their TypeScript spin on it (which is a very nice language, in my opinion), but that branch, an alternate reality for decades now, is finally converging. It probably won't be long before it ends.

Will TypeScript actually win out over pure javascript? Interesting question. I don't know. It's pretty great. But I've bet on Hejlsberg languages before, and I always lose.

Epilogue: Python 2 vs Python 3

With all that said, now I can finally make a point about python 2 vs 3. They are very similar languages, yet somehow not the same. In my opinion, that's because they occupy totally different spots in this whole programmer migration chart.

Python 2 developers came from a world of C and perl, and wanted to write glue code. Web servers were an afterthought, added later. I mean, the web got popular after python 2 came out, so that's hardly a surprise. And a lot of python 2 developers end up switching to Go, because the kind of "systems glue" code they want to write is something Go is suited for.

Python 3 developers come from a different place. It turns out that python usage has grown a lot since python 3 started, but the new people are different from the old people. A surprisingly large fraction of the new people come from the scientific and numerical processing world, because of modules like SciPy and then Tensorflow. Python is honestly a pretty weird choice for high-throughput numerical processing, but whatever, those libraries exist, so that's where we go. Another triumph of python's easy integration with C modules, I guess. And python 3 is also made with the web in mind, of course.

To understand the difference in audience between python 2 and 3, you only need to look at the different string types. In python 2, strings were a series of bytes, because operating systems deal in bytes. Unix pipelines deal in bytes. Network sockets deal in bytes. It was a glue language for systems programs, and glue languages deal in bytes.

In python 3, strings are a series of unicode characters, because people kept screwing up the unicode conversions... when interacting with the web, where everything is unicode. People doing scientific numerical calculations don't care much about strings, and people doing web programming care a lot about unicode, so it uses unicode. Try to write systems programs in python 3, though, and you'll find yourself constantly screwing up the unicode conversions, even in simple things like filenames. What goes around, comes around.

Posted Mon Mar 18 17:21:18 2019 Tags:
This is something I posted on python-ideas, but I think it's interesting to a wider audience.

There's been a lot of discussion recently about an operator to merge two dicts.

It prompted me to think about the reason (some) people like operators, and a discussion I had with my mentor Lambert Meertens over 30 years ago came to mind.

For mathematicians, operators are essential to how they think. Take a simple operation like adding two numbers, and try exploring some of its behavior.

    add(x, y) == add(y, x)    (1)

Equation (1) expresses the law that addition is commutative. It's usually written using an operator, which makes it more concise:

    x + y == y + x    (1a)

That feels like a minor gain.

Now consider the associative law:

    add(x, add(y, z)) == add(add(x, y), z)    (2)

Equation (2) can be rewritten using operators:

    x + (y + z) == (x + y) + z    (2a)

This is much less confusing than (2), and leads to the observation that the parentheses are redundant, so now we can write

    x + y + z    (3)

without ambiguity (it doesn't matter whether the + operator binds tighter to the left or to the right).

Many other laws are also written more easily using operators.  Here's one more example, about the identity element of addition:

    add(x, 0) == add(0, x) == x    (4)

compare to

    x + 0 == 0 + x == x    (4a)

The general idea here is that once you've learned this simple notation, equations written using them are easier to *manipulate* than equations written using functional notation -- it is as if our brains grasp the operators using different brain machinery, and this is more efficient.

I think that the fact that formulas written using operators are more easily processed *visually* has something to do with it: they engage the brain's visual processing machinery, which operates largely subconsciously, and tells the conscious part what it sees (e.g. "chair" rather than "pieces of wood joined together"). The functional notation must take a different path through our brain, which is less subconscious (it's related to reading and understanding what you read, which is learned/trained at a much later age than visual processing).

The power of visual processing really becomes apparent when you combine multiple operators. For example, consider the distributive law:

    mul(n, add(x, y)) == add(mul(n, x), mul(n, y))  (5)

That was painful to write, and I believe that at first you won't see the pattern (or at least you wouldn't have immediately seen it if I hadn't mentioned this was the distributive law).

Compare to:

    n * (x + y) == n * x + n * y    (5a)

Notice how this also uses relative operator priorities. Often mathematicians write this even more compact:

    n(x+y) == nx + ny    (5b)

but alas, that currently goes beyond the capacities of Python's parser.

Another very powerful aspect of operator notation is that it is convenient to apply them to objects of different types. For example, laws (1) through (5) also work when x, y and z are same-size vectors and n is a scalar (substituting a vector of zeros for the literal "0"), and also if they are matrices (again, n has to be a scalar).

And you can do this with objects in many different domains. For example, the above laws (1) through (5) apply to functions too (n being a scalar again).

By choosing the operators wisely, mathematicians can employ their visual brain to help them do math better: they'll discover new interesting laws sooner because sometimes the symbols on the blackboard just jump at you and suggest a path to an elusive proof.

Now, programming isn't exactly the same activity as math, but we all know that Readability Counts, and this is where operator overloading in Python comes in. Once you've internalized the simple properties which operators tend to have, using + for string or list concatenation becomes more readable than a pure OO notation, and (2) and (3) above explain (in part) why that is.

Of course, it's definitely possible to overdo this -- then you get Perl. But I think that the folks who point out "there is already a way to do this" are missing the point that it really is easier to grasp the meaning of this:

    d = d1 + d2

compared to this:

    d = d1.copy()
    d.update(d2)    # CORRECTED: This line was previously wrong

and it is not just a matter of fewer lines of code: the first form allows us to use our visual processing to help us see the meaning quicker -- and without distracting other parts of our brain (which might already be occupied by keeping track of the meaning of d1 and d2, for example).

Of course, everything comes at a price. You have to learn the operators, and you have to learn their properties when applied to different object types. (This is true in math too -- for numbers, x*y == y*x, but this property does not apply to functions or matrices; OTOH x+y == y+x applies to all, as does the associative law.)

"But what about performance?" I hear you ask. Good question. IMO, readability comes first, performance second. And in the basic example (d = d1 + d2) there is no performance loss compared to the two-line version using update, and a clear win in readability. I can think of many situations where performance difference is irrelevant but readability is of utmost importance, and for me this is the default assumption (even at Dropbox -- our most performance critical code has already been rewritten in ugly Python or in Go). For the few cases where performance concerns are paramount, it's easy to transform the operator version to something else -- *once you've confirmed it's needed* (probably by profiling).
Posted Fri Mar 15 17:58:00 2019 Tags:

Ho ho ho, let's write libinput. No, of course I'm not serious, because no-one in their right mind would utter "ho ho ho" without a sufficient backdrop of reindeers to keep them sane. So what this post is instead is me writing a nonworking fake libinput in Python, for the sole purpose of explaining roughly how libinput's architecture looks like. It'll be to the libinput what a Duplo car is to a Maserati. Four wheels and something to entertain the kids with but the queue outside the nightclub won't be impressed.

The target audience are those that need to hack on libinput and where the balance of understanding vs total confusion is still shifted towards the latter. So in order to make it easier to associate various bits, here's a description of the main building blocks.

libinput uses something resembling OOP except that in C you can't have nice things unless what you want is a buffer overflow\n\80xb1001af81a2b1101. Instead, we use opaque structs, each with accessor methods and an unhealthy amount of verbosity. Because Python does have classes, those structs are represented as classes below. This all won't be actual working Python code, I'm just using the syntax.

Let's get started. First of all, let's create our library interface.

class Libinput:
def path_create_context(cls):
return _LibinputPathContext()

def udev_create_context(cls):
return _LibinputUdevContext()

# dispatch() means: read from all our internal fds and
# call the dispatch method on anything that has changed
def dispatch(self):
for fd in self.epoll_fd.get_changed_fds():

# return whatever the next event is
def get_event(self):
return self._events.pop(0)

# the various _notify functions are internal API
# to pass things up to the context
def _notify_device_added(self, device):

def _notify_device_removed(self, device):

def _notify_pointer_motion(self, x, y):
self._events.append(LibinputEventPointer(x, y))

class _LibinputPathContext(Libinput):
def add_device(self, device_node):
device = LibinputDevice(device_node)

def remove_device(self, device_node):

class _LibinputUdevContext(Libinput):
def __init__(self):
self.udev = udev.context()

def udev_assign_seat(self, seat_id):
self.seat_id =

for udev_device in self.udev.devices():
device = LibinputDevice(udev_device.device_node)

We have two different modes of initialisation, udev and path. The udev interface is used by Wayland compositors and adds all devices on the given udev seat. The path interface is used by the X.Org driver and adds only one specific device at a time. Both interfaces have the dispatch() and get_events() methods which is how every caller gets events out of libinput.

In both cases we create a libinput device from the data and create an event about the new device that bubbles up into the event queue.

But what really are events? Are they real or just a fidget spinner of our imagination? Well, they're just another object in libinput.

class LibinputEvent:
def type(self):
return self._type

def context(self):
return self._libinput

def device(self):
return self._device

def get_pointer_event(self):
if instanceof(self, LibinputEventPointer):
return self # This makes more sense in C where it's a typecast
return None

def get_keyboard_event(self):
if instanceof(self, LibinputEventKeyboard):
return self # This makes more sense in C where it's a typecast
return None

class LibinputEventPointer(LibinputEvent):
def time(self)
return self._time/1000

def time_usec(self)
return self._time

def dx(self)
return self._dx

def absolute_x(self):
return self._x * self._x_units_per_mm

def absolute_x_transformed(self, width):
return self._x * width/ self._x_max_value
You get the gist. Each event is actually an event of a subtype with a few common shared fields and a bunch of type-specific ones. The events often contain some internal value that is calculated on request. For example, the API for the absolute x/y values returns mm, but we store the value in device units instead and convert to mm on request.

So, what's a device then? Well, just another I-cant-believe-this-is-not-a-class with relatively few surprises:

class LibinputDevice:
class Capability(Enum):

def __init__(self, device_node):
pass # no-one instantiates this directly

def name(self):
return self._name

def context(self):
return self._libinput_context

def udev_device(self):
return self._udev_device

def has_capability(self, cap):
return cap in self._capabilities

Now we have most of the frontend API in place and you start to see a pattern. This is how all of libinput's API works, you get some opaque read-only objects with a few getters and accessor functions.

Now let's figure out how to work on the backend. For that, we need something that handles events:

class EvdevDevice(LibinputDevice):
def __init__(self, device_node):
fd = open(device_node)
super().context.add_fd_to_epoll(fd, self.dispatch)

def has_quirk(self, quirk):
return quirk in self.quirks

def dispatch(self):
while True:
data =
if not data:


def _configure(self):
# some devices are adjusted for quirks before we
# do anything with them
if self.has_quirk(SOME_QUIRK_NAME):

self.interface = EvdevTouchpad()
self.interface = EvdevSwitch()
self.interface = EvdevFalback()

class EvdevInterface:
def dispatch_one_event(self, event):

class EvdevTouchpad(EvdevInterface):
def dispatch_one_event(self, event):

class EvdevTablet(EvdevInterface):
def dispatch_one_event(self, event):

class EvdevSwitch(EvdevInterface):
def dispatch_one_event(self, event):

class EvdevFallback(EvdevInterface):
def dispatch_one_event(self, event):
Our evdev device is actually a subclass (well, C, *handwave*) of the public device and its main function is "read things off the device node". And it passes that on to a magical interface. Other than that, it's a collection of generic functions that apply to all devices. The interfaces is where most of the real work is done.

The interface is decided on by the udev type and is where the device-specifics happen. The touchpad interface deals with touchpads, the tablet and switch interface with those devices and the fallback interface is that for mice, keyboards and touch devices (i.e. the simple devices).

Each interface has very device-specific event processing and can be compared to the Xorg synaptics vs wacom vs evdev drivers. If you are fixing a touchpad bug, chances are you only need to care about the touchpad interface.

The device quirks used above are another simple block:

class Quirks:
def __init__(self):

def has_quirk(device, quirk):
for file in self.quirks:
if quirk.has_match( or
quirk.has_match(device.usbid) or
return True
return False

def get_quirk_value(device, quirk):
if not self.has_quirk(device, quirk):
return None

quirk = self.lookup_quirk(device, quirk)
if quirk.type == "boolean":
return bool(quirk.value)
if quirk.type == "string":
return str(quirk.value)
A system that reads a bunch of .ini files, caches them and returns their value on demand. Those quirks are then used to adjust device behaviour at runtime.

The next building block is the "filter" code, which is the word we use for pointer acceleration. Here too we have a two-layer abstraction with an interface.

class Filter:
def dispatch(self, x, y):
# converts device-unit x/y into normalized units
return self.interface.dispatch(x, y)

# the 'accel speed' configuration value
def set_speed(self, speed):
return self.interface.set_speed(speed)

# the 'accel speed' configuration value
def get_speed(self):
return self.speed


class FilterInterface:
def dispatch(self, x, y):

class FilterInterfaceTouchpad:
def dispatch(self, x, y):

class FilterInterfaceTrackpoint:
def dispatch(self, x, y):

class FilterInterfaceMouse:
def dispatch(self, x, y):
self.history.push((x, y))
v = self.calculate_velocity()
f = self.calculate_factor(v)
return (x * f, y * f)

def calculate_velocity(self)
for delta in self.history:
total += delta
velocity = total/timestamp # as illustration only

def calculate_factor(self, v):
# this is where the interesting bit happens,
# let's assume we have some magic function
f = v * 1234/5678
return f
So libinput calls filter_dispatch on whatever filter is configured and passes the result on to the caller. The setup of those filters is handled in the respective evdev interface, similar to this:

class EvdevFallback:
def init_accel(self):
if self.udev_type == 'ID_INPUT_TRACKPOINT':
self.filter = FilterInterfaceTrackpoint()
elif self.udev_type == 'ID_INPUT_TOUCHPAD':
self.filter = FilterInterfaceTouchpad()
The advantage of this system is twofold. First, the main libinput code only needs one place where we really care about which acceleration method we have. And second, the acceleration code can be compiled separately for analysis and to generate pretty graphs. See the pointer acceleration docs. Oh, and it also allows us to easily have per-device pointer acceleration methods.

Finally, we have one more building block - configuration options. They're a bit different in that they're all similar-ish but only to make switching from one to the next a bit easier.

class DeviceConfigTap:
def set_enabled(self, enabled):
self._enabled = enabled

def get_enabled(self):
return self._enabled

def get_default(self):
return False

class DeviceConfigCalibration:
def set_matrix(self, matrix):
self._matrix = matrix

def get_matrix(self):
return self._matrix

def get_default(self):
return [1, 0, 0, 0, 1, 0, 0, 0, 1]
And then the devices that need one of those slot them into the right pointer in their structs:

class EvdevFallback:
def init_calibration(self):
self.config_calibration = DeviceConfigCalibration()

def handle_touch(self, x, y):
if self.config_calibration is not None:
matrix = self.config_calibration.get_matrix

x, y = matrix.multiply(x, y)
self.context._notify_pointer_abs(x, y)

And that's basically it, those are the building blocks libinput has. The rest is detail. Lots of it, but if you understand the architecture outline above, you're most of the way there in diving into the details.

Posted Fri Mar 15 06:15:00 2019 Tags:

One of the features in the soon-to-be-released libinput 1.13 is location-based touch arbitration. Touch arbitration is the process of discarding touch input on a tablet device while a pen is in proximity. Historically, this was provided by the kernel wacom driver but libinput has had userspace touch arbitration for quite a while now, allowing for touch arbitration where the tablet and the touchscreen part are handled by different kernel drivers.

Basic touch arbitratin is relatively simple: when a pen goes into proximity, all touches are ignored. When the pen goes out of proximity, new touches are handled again. There are some extra details (esp. where the kernel handles arbitration too) but let's ignore those for now.

With libinput 1.13 and in preparation for the Dell Canvas Dial Totem, the touch arbitration can now be limited to a portion of the screen only. On the totem (future patches, not yet merged) that portion is a square slightly larger than the tool itself. On normal tablets, that portion is a rectangle, sized so that it should encompass the users's hand and area around the pen, but not much more. This enables users to use both the pen and touch input at the same time, providing for bimanual interaction (where the GUI itself supports it of course). We use the tilt information of the pen (where available) to guess where the user's hand will be to adjust the rectangle position.

There are some heuristics involved and I'm not sure we got all of them right so I encourage you to give it a try and file an issue where it doesn't behave as expected.

Posted Fri Mar 15 05:58:00 2019 Tags:

[ This blog post was co-written by me and Karen M. Sandler, with input from Deb Nicholson, for our Conservancy blog, and that its canonical location. I'm reposting here just for the convenience of those who are subscribed to my RSS feed but not get Conservancy's feed. ]

Yesterday, the Linux Foundation (LF) launched a new service, called “Community Bridge” — an ambitious platform that promises a self-service system to handle finances, address security issues, manage CLAs and license compliance, and also bring mentorship to projects. These tasks are difficult work that typically require human intervention, so we understand the allure of automating them; we and our peer organizations have long welcomed newcomers to this field and have together sought collaborative assistance for these issues. Indeed, Community Bridge's offerings bear some similarity to the work of organizations like Apache Software Foundation, the Free Software Foundation (FSF), the GNOME Foundation (GF), Open Source Initiative (OSI), Software in the Public Interest (SPI) and Conservancy. People have already begun to ask us to compare this initiative to our work and the work of our peer organizations. This blog post hopefully answers those questions and anticipated similar questions.

The first huge difference (and the biggest disappointment for the entire FOSS community) is that LF's Community Bridge is a proprietary software system. §4.2 of their Platform Use Agreement requires those who sign up for this platform to agree to a proprietary software license, and LF has remained silent about the proprietary nature of the platform in its explanatory materials. The LF, as an organization dedicated to Open Source, should release the source for Community Bridge. At Conservancy, we've worked since 2012 on a Non-Profit Accounting Software system, including creating a tagging system for transparently documenting ledger transactions, and various support software around that. We and SPI both now use these methods daily. We also funded the creation of a system to manage mentorship programs, which we now runs the Outreachy mentorship program. We believe fundamentally that the infrastructure we provide for FOSS fiscal sponsorship (including accounting, mentorship and license compliance) must itself be FOSS, and developed in public as a FOSS project. LF's own research already shows that transparency is impossible for systems that are not FOSS. More importantly, LF's new software could directly benefit so many organizations in our community, including not only Conservancy but also the many others (listed above) who do some form of fiscal sponsorship. LF shouldn't behave like a proprietary software company like Patreon or Kickstarter, but instead support FOSS development. Generally speaking, all Conservancy's peer organizations (listed above) have been fully dedicated to the idea that any infrastructure developed for fiscal sponsorship should itself be FOSS. LF has deviated here from this community norm by unnecessarily requiring FOSS developers to use proprietary software to receive these services, and also failing to collaborate over a FOSS codebase with the existing community of organizations. LF Executive Director Jim Zemlin has said that he “wants more participation in open source … to advance its sustainability and … wants organizations to share their code for the benefit of their fellow [hu]mankind”; we ask him to apply these principles to his own organization now.

The second difference is that LF is not a charity, but a trade association — designed to serve the common business interest of its paid members, who control its Board of Directors. This means that donations made to projects through their system will not be tax-deductible in the USA, and that the money can be used in ways that do not necessarily benefit the public good. For some projects, this may well be an advantage: not all FOSS projects operate in the public good. We believe charitable commitment remains a huge benefit of joining a fiscal sponsor like Conservancy, FSF, GF, or SPI. While charitable affiliation means there are more constraints on how projects can spend their funds, as the projects must show that their spending serves the public benefit, we believe that such constraints are most valuable. Legal requirements that assure behavior of the organization always benefits the general public are a good thing. However, some projects may indeed prefer to serve the common business interest of LF's member companies rather than the public good, but projects should note such benefit to the common business interest is mandatory on this platform — it's explicitly unauthorized to use LF's platform to engage in activities in conflict with LF’s trade association status). Furthermore, (per the FAQ) only one maintainer can administer a project's account, so the platform currently only supports the “BDFL” FOSS governance model, which has already been widely discredited. No governance check exists to ensure that the project's interests align with spending, or to verify that the maintainer acts with consent of a larger group to implement group decisions. Even worse, (per §2.3 of the Usage Agreement) terminating the relationship means ceasing use of the account; no provision allows transfer of the money somewhere else when projects' needs change.

Finally, the LF offers services that are mainly orthogonal and/or a subset of the services provided by a typical fiscal sponsor. Conservancy, for example, does work to negotiate contracts, assist in active fundraising, deal with legal and licensing issues, and various other hands-on work. LF's system is similar to Patreon and other platforms in that it is a hands-off system that takes a cut of the money and provides minimal financial services. Participants will still need to worry about forming their own organization if they want to sign contracts, have an entity that can engage with lawyers and receive legal advice for the project, work through governance issues, or the many other things that projects often want from a fiscal sponsor.

Historically, fiscal sponsors in FOSS have not treated each other as competitors. Conservancy collaborates often with SPI, FSF, and GF in particular. We refer applicant projects to other entities, including explaining to applicants that a trade association may be a better fit for their project. In some cases, we have even referred such trade-association-appropriate applicants to the LF itself, and the LF then helped them form their own sub-organizations and/or became LF Collaborative Projects. The launch of this platform, as proprietary software, without coordination with the rest of the FOSS organization community, is unnecessarily uncollaborative with our community and we therefore encourage some skepticism here. That said, this new LF system is probably just right for FOSS projects that (a) prefer to use single-point-of-failure, proprietary software rather than FOSS for their infrastructure, (b) do not want to operate in a way that is dedicated to the public good, and (c) have very minimal fiscal sponsorship needs, such as occasional reimbursements of project expenses.

Posted Wed Mar 13 10:24:00 2019 Tags:

Let me tell you about the still-not-defunct real-time log processing pipeline we built at my now-defunct last job. It handled logs from a large number of embedded devices that our ISP operated on behalf of residential customers. (I wrote and presented previously about some of the cool wifi diagnostics that were possible with this data set.)

Lately, I've had a surprisingly large number of conversations about logs processing pipelines. I can find probably 10+ already-funded, seemingly successful startups processing logs, and the Big Name Cloud providers all have some kind of logs thingy, but still, people are not satisfied. It's expensive and slow. And if you complain, you mostly get told that you shouldn't be using unstructured logs anyway, you should be using event streams.

That advice is not wrong, but it's incomplete.

Instead of doing a survey of the whole unhappy landscape, let's just ignore what other people suffer with and talk about what does work. You can probably find, somewhere, something similar to each of the components I'm going to talk about, but you probably can't find a single solution that combines it all with good performance and super-low latency for a reasonable price. At least, I haven't found it. I was a little surprised by this, because I didn't think we were doing anything all that innovative. Apparently I was incorrect.

The big picture

Let's get started. Here's a handy diagram of all the parts we're going to talk about:

The ISP where I worked has a bunch of embedded Linux devices (routers, firewalls, wifi access points, and so on) that we wanted to monitor. The number increased rapidly over time, but let's talk about a nice round number, like 100,000 of them. Initially there were zero, then maybe 10 in our development lab, and eventually we hit 100,000, and later there were many more than that. Whatever. Let's work with 100,000. But keep in mind that this architecture works pretty much the same with any number of devices.

(It's a "distributed system" in the sense of scalability, but it's also the simplest thing that really works for any number of devices more than a handful, which makes it different from many "distributed systems" where you could have solved the problem much more simply if you didn't care about scaling. Since our logs are coming from multiple sources, we can't make it non-distributed, but we can try to minimize the number of parts that have to deal with the extra complexity.)

Now, these are devices we were monitoring, not apps or services or containers or whatever. That means two things: we had to deal with lots of weird problems (like compiler/kernel bugs and hardware failures), and most of the software was off-the-shelf OS stuff we couldn't easily control (or didn't want to rewrite).

(Here's the good news: because embedded devices have all the problems from top to bottom, any solution that works for my masses of embedded devices will work for any other log-pipeline problem you might have. If you're lucky, you can leave out some parts.)

That means the debate about "events" vs "logs" was kind of moot. We didn't control all the parts in our system, so telling us to forget logs and use only structured events doesn't help. udhcpd produces messages the way it wants to produce messages, and that's life. Sometimes the kernel panics and prints whatever it wants to print, and that's life. Move on.

Of course, we also had our own apps, which means we could also produce our own structured events when it was relevant to our own apps. Our team had whole never-ending debates about which is better, logs or events, structured or unstructured. In fact, in a move only overfunded megacorporations can afford, we actually implemented both and ran them both for a long time.

Thus, I can now tell you the final true answer, once and for all: you want structured events in your database.

...but you need to be able to produce them from unstructured logs. And once you can do that, exactly how those structured events are produced (either from logs or directly from structured trace output) turns out to be unimportant.

But we're getting ahead of ourselves a bit. Let's take our flow diagram, one part at a time, from left to right.

Userspace and kernel messages, in a single stream

Some people who have been hacking on Linux for a while may know about /proc/kmsg: that's the file good old (pre-systemd) klogd reads kernel messages from, and pumps them to syslogd, which saves them to a file. Nowadays systemd does roughly the same thing but with more d-bus and more corrupted binary log files. Ahem. Anyway. When you run the dmesg command, it reads the same kernel messages (in a slightly different way).

What you might not know is that you can go the other direction. There's a file called /dev/kmsg (note: /dev and not /proc) which, if you write to it, produces messages into the kernel's buffer. Let's do that! For all our messages!

Wait, what? Am I crazy? Why do that?

Because we want strict sequencing of log messages between programs. And we want that even if your kernel panics.

Imagine you have, say, a TV DVR running on an embedded Linux system, and whenever you go to play a particular recorded video, the kernel panics because your chipset vendor hates you. Hypothetically. (The feeling is, hypothetically, mutual.) Ideally, you would like your logs to contain a note that the user requested the video, the video is about to start playing, we've opened the file, we're about to start streaming the file to the proprietary and very buggy (hypothetical) video decoder... boom. Panic.

What now? Well, if you're writing the log messages to disk, the joke's on you, because I bet you didn't fsync() after each one. (Once upon a time, syslogd actually did fsync() after each one. It was insanely disk-grindy and had very low throughput. Those days are gone.) Moreover, a kernel panic kills the disk driver, so you have no chance to fsync() it after the panic, unless you engage one of the more terrifying hacks like, after a panic, booting into a secondary kernel whose only job is to stream the message buffer into a file, hoping desperately that the disk driver isn't the thing that panicked, that the disk itself hasn't fried, and that even if you do manage to write to some disk blocks, they are the right ones because your filesystem data structure is reasonably intact.

(I suddenly feel a lot of pity for myself after reading that paragraph. I think I am more scars than person at this point.)


The kernel log buffer is in a fixed-size memory buffer in RAM. It defaults to being kinda small (tens or hundreds of kBytes), but you can make it bigger if you want. I suggest you do so.

By itself, this won't solve your kernel panic problems, because RAM is even more volatile than disk, and you have to reboot after a kernel panic. So the RAM is gone, right?

Well, no. Sort of. Not exactly.

Once upon a time, your PC BIOS would go through all your RAM at boot time and run a memory test. I remember my ancient 386DX PC used to do this with my amazingly robust and life-changing 4MB of RAM. It took quite a while. You could press ESC to skip it if you were a valiant risk-taking rebel like myself.

Now, memory is a lot faster than it used to be, but unfortunately it has gotten bigger more quickly than it has gotten faster, especially if you disable memory caching, which you certainly must do at boot time in order to write the very specific patterns needed to see if there are any bit errors.

So... we don't do the boot-time memory test. That ended years ago. If you reboot your system, the memory mostly will contain the stuff it contained before you rebooted. The OS kernel has to know that and zero out pages as they get used. (Sometimes the kernel gets fancy and pre-zeroes some extra pages when it's not busy, so it can hand out zero pages more quickly on demand. But it always has to zero them.)

So, the pages are still around when the system reboots. What we want to happen is:

  1. The system reboots automatically after a kernel panic. You can do this by giving your kernel a boot parameter like "panic=1", which reboots it after one second. (This is not nearly enough time for an end user to read and contemplate the panic message. That's fine, because a) on a desktop PC, X11 will have crashed in graphics mode so you can't see the panic message anyway, and b) on an embedded system there is usually no display to put the message on. End users don't care about panic messages. Our job is to reboot, ASAP, so they don't try to "help" by power cycling the device, which really does lose your memory.) (Advanced users will make it reboot after zero seconds. I think panic=0 disables the reboot feature rather than doing that, so you might have to patch the kernel. I forget. We did it, whatever it was.)

  2. The kernel always initializes the dmesg buffer in the same spot in RAM.

  3. The kernel notices that a previous dmesg buffer is already in that spot in RAM (because of a valid signature or checksum or whatever) and decides to append to that buffer instead of starting fresh.

  4. In userspace, we pick up log processing where we left off. We can capture the log messages starting before (and therefore including) the panic!

  5. And because we redirected userspace logs to the kernel message buffer, we have also preserved the exact sequence of events that led up to the panic.

If you want all this to happen, I have good news and bad news. The good news is we open sourced all our code; the bad news is it didn't get upstreamed anywhere so there are no batteries included and no documentation and it probably doesn't quite work for your use case. Sorry.

Open source code:

  • logos tool for sending userspace logs to /dev/klogd. (It's logs... for the OS.. and it's logical... and it brings your logs back from the dead after a reboot... get it? No? Oh well.) This includes two per-app token buckets (burst and long-term) so that an out-of-control app won't overfill the limited amount of dmesg space.

  • PRINTK_PERSIST patch to make Linux reuse the dmesg buffer across reboots.

Even if you don't do any of the rest of this, everybody should use PRINTK_PERSIST on every computer, virtual or physical. Seriously. It's so good.

(Note: room for improvement: it would be better if we could just redirect app stdout/stderr directly to /dev/kmsg, but that doesn't work as well as we want. First, it doesn't auto-prefix incoming messages with the app name. Second, libc functions like printf() actually write a few bytes at a time, not one message per write() call, so they would end up producing more than one dmesg entry per line. Third, /dev/kmsg doesn't support the token bucket rate control that logos does, which turns out to be essential, because sometimes apps go crazy. So we'd have to further extend the kernel API to make it work. It would be worthwhile, though, because the extra userspace process causes an unavoidable delay between when a userspace program prints something and when it actually gets into the kernel log. That delay is enough time for a kernel to panic, and the userspace message gets lost. Writing directly to /dev/kmsg would take less CPU, leave userspace latency unchanged, and ensure the message is safely written before continuing. Someday!)

(In related news, this makes all of syslogd kinda extraneous. Similarly for whatever systemd does. Why do we make everything so complicated? Just write directly to files or the kernel log buffer. It's cheap and easy.)

Uploading the logs

Next, we need to get the messages out of the kernel log buffer and into our log processing server, wherever that might be.

(Note: if we do the above trick - writing userspace messages to the kernel buffer - then we can't also use klogd to read them back into syslogd. That would create an infinite loop, and would end badly. Ask me how I know.)

So, no klogd -> syslogd -> file. Instead, we have something like syslogd -> kmsg -> uploader or app -> kmsg -> uploader.

What is a log uploader? Well, it's a thing that reads messages from the kernel kmsg buffer as they arrive, and uploads them to a server, perhaps over https. It might be almost as simple as "dmesg | curl", like my original prototype, but we can get a bit fancier:

  • Figure out which messages we've already uploaded (eg. from the persistent buffer before we rebooted) and don't upload those again.

  • Log the current wall-clock time before uploading, giving us sync points between monotonic time (/dev/kmsg logs "microseconds since boot" by default, which is very useful, but we also want to be able to correlate that with "real" time so we can match messages between related machines).

  • Compress the file on the way out.

  • Somehow authenticate with the log server.

  • Bonus: if the log server is unavailable because of a network partition, try to keep around the last few messages from before the partition, as well as the recent messages once the partition is restored. If the network partition was caused by the client - not too rare if you, like us, were in the business of making routers and wifi access points - you really would like to see the messages from right before the connectivity loss.

Luckily for you, we also open sourced our code for this. It's in C so it's very small and low-overhead. We never quite got the code for the "bonus" feature working quite right, though; we kinda got interrupted at the last minute.

Open source code:

  • loguploader C client, including an rsyslog plugin for Debian in case you don't want to use the /dev/kmsg trick.

  • devcert, a tool (and Debian package) which auto-generates a self signed "device certificate" wherever it's installed. The device certificate is used by a device (or VM, container, whatever) to identify itself to the log server, which can then decide how to classify and store (or reject) its logs.

One thing we unfortunately didn't get around to doing was modifying the logupload client to stream logs to the server. This is possible using HTTP POST and Chunked encoding, but our server at the time was unable to accept streaming POST requests due to (I think now fixed) infrastructure limitations.

(Note: if you write load balancing proxy servers or HTTP server frameworks, make sure they can start processing a POST request as soon as all the headers have arrived, rather than waiting for the entire blob to be complete! Then a log upload server can just stream the bytes straight to the next stage even before the whole request has finished.)

Because we lacked streaming in the client, we had to upload chunks of log periodically, which leads to a tradeoff about what makes a good upload period. We eventually settled on about 60 seconds, which ended up accounting for almost all the end-to-end latency from message generation to our monitoring console.

Most people probably think 60 seconds is not too bad. But some of the awesome people on our team managed to squeeze all the other pipeline phases down to tens of milliseconds in total. So the remaining 60 seconds (technically: anywhere from 0 to 60 seconds after a message was produced) was kinda embarrassing. Streaming live from device to server would be better.

The log receiver

So okay, we're uploading the logs from client to some kind of server. What does the server do?

This part is both the easiest and the most reliability-critical. The job is this: receive an HTTP POST request, write the POST data to a file, and return HTTP 200 OK. Anybody who has any server-side experience at all can write this in their preferred language in about 10 minutes.

We intentionally want to make this phase as absolutely simplistic as possible. This is the phase that accepts logs from the limited-size kmsg buffer on the client and puts them somewhere persistent. It's nice to have real-time alerts, but if I have to choose between somewhat delayed alerts or randomly losing log messages when things get ugly, I'll have to accept the delayed alerts. Don't lose log messages! You'll regret it.

The best way to not lose messages is to minimize the work done by your log receiver. So we did. It receives the uploaded log file chunk and appends it to a file, and that's it. The "file" is actually in a cloud storage system that's more-or-less like S3. When I explained this to someone, they asked why we didn't put it in a Bigtable-like thing or some other database, because isn't a filesystem kinda cheesy? No, it's not cheesy, it's simple. Simple things don't break. Our friends on the "let's use structured events to make metrics" team streamed those events straight into a database, and it broke all the time, because databases have configuration options and you inevitably set those options wrong, and it'll fall over under heavy load, and you won't find out until you're right in the middle of an emergency and you really want to see those logs. Or events.

Of course, the file storage service we used was encrypted-at-rest, heavily audited, and auto-deleted files after N days. When you're a megacorporation, you have whole teams of people dedicated to making sure you don't screw this up. They will find you. Best not to annoy them.

We had to add one extra feature, which was authentication. It's not okay for random people on the Internet to be able to impersonate your devices and spam your logs - at least without putting some work into it. For device authentication, we used the rarely-used HTTP client-side certificates option and the devcert program (linked above) so that the client and server could mutually authenticate each other. The server didn't check the certificates against a certification authority (CA), like web clients usually do; instead, it had a database with a whitelist of exactly which certs we're allowing today. So in case someone stole a device cert and started screwing around, we could remove their cert from the whitelist and not worry about CRL bugs and latencies and whatnot.

Unfortunately, because our log receiver was an internal app relying on internal infrastructure, it wasn't open sourced. But there really wasn't much there, honest. The first one was written in maybe 150 lines of python, and the replacement was rewritten in slightly more lines of Go. No problem.

Retries and floods

Of course, things don't always go smoothly. If you're an ISP, the least easy thing is dealing with cases where a whole neighbourhood gets disconnected, either because of a power loss or because someone cut the fiber Internet feed to the neighbourhood.

Now, disconnections are not such a big deal for logs processing - you don't have any. But reconnection is a really big deal. Now you have tens or hundreds of thousands of your devices coming back online at once, and a) they have accumulated a lot more log messages than they usually do, since they couldn't upload them, and b) they all want to talk to your server at the same time. Uh oh.

Luckily, our system was designed carefully (uh... eventually it was), so it could handle these situations pretty smoothly:

  1. The log uploader uses a backoff timer so that if it's been trying to upload for a while, it uploads less often. (However, the backoff timer was limited to no more than the usual inter-upload interval. I don't know why more people don't do this. It's rather silly for your system to wait longer between uploads in a failure situation than it would in a success situation. This is especially true with logs, where when things come back online, you want a status update now. And clearly your servers have enough capacity to handle uploads at the usual rate, because they usually don't crash. Sorry if I sound defensive here, but I had to have this argument a few times with a few SREs. I understand why limiting the backoff period isn't always the right move. It's the right move here.)

  2. Less obviously, even under normal conditions, the log uploader uses a randomized interval between uploads. This avoids traffic spikes where, after the Internet comes back online, everybody uploads again exactly 60 seconds later, and so on.

  3. The log upload client understands the idea that the server can't accept its request right now. It has to, anyway, because if the Internet goes down, there's no server. So it treats server errors exactly like it treats lack of connectivity. And luckily, log uploading is not really an "interactive" priority task, so it's okay to sacrifice latency when things get bad. Users won't notice. And apparently our network is down, so the admins already noticed.

  4. The /dev/kmsg buffer was configured for the longest reasonable outage we could expect, so that it wouldn't overflow during "typical" downtime. Of course, there's a judgement call here. But the truth is, if you're having system-wide downtime, what the individual devices were doing during that downtime is not usually what you care about. So you only need to handle, say, the 90th percentile of downtime. Safely ignore the black swans for once.

  5. The log receiver aggressively rejects requests that come faster than its ability to write files to disk. Since the clients know how to retry with a delay, this allows us to smooth out bursty traffic without needing to either over-provision the servers or lose log messages.

    (Pro tip, learned the hard way: if you're writing a log receiver in Go, don't do the obvious thing and fire off a goroutine for every incoming request. You'll run out of memory. Define a maximum number of threads you're willing to handle at once, and limit your request handling to that. It's okay to set this value low, just to be safe: remember, the uploader clients will come back later.)

Okay! Now our (unstructured) logs from all our 100,000 devices are sitting safely in a big distributed filesystem. We have a little load-balanced, multi-homed cluster of log receivers accepting the uploads, and they're so simple that they should pretty much never die, and even if they do because we did something dumb (treacherous, treacherous goroutines!), the clients will try again.

What might not be obvious is this: our reliability, persistence, and scaling problems are solved. Or rather, as long as we have enough log receiver instances to handle all our devices, and enough disk quota to store all our logs, we will never again lose a log message.

That means the rest of our pipeline can be best-effort, complicated, and frequently exploding. And that's a good thing, because we're going to start using more off-the-shelf stuff, we're going to let random developers reconfigure the filtering rules, and we're not going to bother to configure it with any redundancy.

Grinding the logs

The next step is to take our unstructured logs and try to understand them. In other words, we want to add some structure. Basically we want to look for lines that are "interesting" and parse out the "interesting" data and produce a stream of events, each with a set of labels describing what categories they apply to.

Note that, other than this phase, there is little difference between how you'd design a structured event reporting pipeline and a log pipeline. You still need to collect the events. You still (if you're like me) need to persist your events across kernel panics. You still need to retry uploading them if your network gets partitioned. You still need the receivers to handle overloading, burstiness, and retries. You still would like to stream them (if your infrastructure can handle it) rather than uploading every 60 seconds. You still want to be able to handle a high volume of them. You're just uploading a structured blob instead of an unstructured blob.

Okay. Fine. If you want to upload structured blobs, go for it. It's just an HTTP POST that appends to a file. Nobody's stopping you. Just please try to follow my advice when designing the parts of the pipeline before and after this phase, because otherwise I guarantee you'll be sad eventually.

Anyway, if you're staying with me, now we have to parse our unstructured logs. What's really cool - what makes this a killer design compared to starting with structured events in the first place - is that we can, at any time, change our minds about how to parse the logs, without redeploying all the software that produces them.

This turns out to be amazingly handy. It's so amazingly handy that nobody believes me. Even I didn't believe me until I experienced it; I was sure, in the beginning, that the unstructured logs were only temporary and we'd initially use them to figure out what structured events we wanted to record, and then modify the software to send those, then phase out the logs over time. This never happened. We never settled down. Every week, or at least every month, there was some new problem which the existing "structured" events weren't configured to catch, but which, upon investigating, we realized we could diagnose and measure from the existing log message stream. And so we did!

Now, I have to put this in perspective. Someone probably told you that log messages are too slow, or too big, or too hard to read, or too hard to use, or you should use them while debugging and then delete them. All those people were living in the past and they didn't have a fancy log pipeline. Computers are really, really fast now. Storage is really, really cheap.

So we let it all out. Our devices produced an average of 50 MB of (uncompressed) logs per day, each. For the baseline 100,000 devices that we discussed above, that's about 5TB of logs per day. Ignoring compression, how much does it cost to store, say, 60 days of logs in S3 at 5TB per day? "Who cares," that's how much. You're amortizing it over 100,000 devices. Heck, a lot of those devices were DVRs, each with 2TB of storage. With 100,000 DVRs, that's 200,000 TB of storage. Another 300 is literally a rounding error (like, smaller than if I can't remember if it's really 2TB or 2TiB or what).

Our systems barfed up logs vigorously and continuously, like a non-drunken non-sailor with seasickness. And it was beautiful.

(By the way, now would be a good time to mention some things we didn't log: personally identifiable information or information about people's Internet usage habits. These were diagnostic logs for running the network and detecting hardware/software failures. We didn't track what you did with the network. That was an intentional decision from day 1.)

(Also, this is why I think all those log processing services are so badly overpriced. I wanna store 50 MB per device, for lots of devices. I need to pay S3 rates for that, not a million dollars a gigabyte. If I have to overpay for storage, I'll have to start writing fewer logs. I love my logs. I need my logs. I know you're just storing it in S3 anyway. You probably get a volume discount! Let's be realistic.)

But the grinding, though

Oh right. So the big box labeled "Grinder" in my diagram was, in fact, just one single virtual machine, for a long time. It lasted like that for much longer than we expected.

Whoa, how is that possible, you ask?

Well, at 5TB per day per 100,000 devices, that's an average of 57 MBytes per second. And remember, burstiness has already been absorbed by our carefully written log receivers and clients, so we'll just grind these logs as fast as they arrive or as fast as we can, and if there are fluctuations, they'll average out. Admittedly, some parts of the day are busier than others. Let's say 80 MBytes per second at peak.

80 MBytes per second? My laptop can do that on its spinning disk. I don't even need an SSD! 80 MBytes per second is a toy.

And of course, it's not just one spinning disk. The data itself is stored on some fancy heavily-engineered distributed filesystem that I didn't have to design. Assuming there are no, er, collossal, failures in provisioning (no comment), there's no reason we shouldn't be able to read files at a rate that saturates the network interface available to our machine. Surely that's at least 10 Gbps (~1 GByte/sec) nowadays, which is 12.5 of those. 1.25 million devices, all processed by a single grinder.

Of course you'll probably need to use a few CPU cores. And the more work you do per log entry, the slower it'll get. But these estimates aren't too far off what we could handle.

And yeah, sometimes that VM gets randomly killed by the cluster's Star Trek-esque hive mind for no reason. It doesn't matter, because the input data was already persisted by the log receivers. Just start a new grinder and pick up where you left off. You'll have to be able to handle process restarts no matter what. And that's a lot easier than trying to make a distributed system you didn't need.

As for what the grinder actually does? Anything you want. But it's basically the "map" phase in a mapreduce. It reads the data in one side, does some stuff to it, and writes out postprocessed stuff on the other side. Use your imagination. And if you want to write more kinds of mappers, you can run them, either alongside the original Grinder or downstream from it.

Our Grinder mostly just ran regexes and put out structures (technically protobufs) that were basically sets of key-value pairs.

(For some reason, when I search the Internet for "streaming mapreduce," I don't get programs that do this real-time processing of lots of files as they get written. Instead, I seem to get batch-oriented mapreduce clones that happen to read from stdin, which is a stream. I guess. But... well, now you've wasted some perfectly good words that could have meant something. So okay, too bad, it's a Grinder. Sue me.)

Reducers and Indexers

Once you have a bunch of structured events... well, I'm not going to explain that in a lot of detail, because it's been written about a lot.

You probably want to aggregate them a bit - eg. to count up reboots across multiple devices, rather than storing each event for each device separately - and dump them into a time-series database. Perhaps you want to save and postprocess the results in a monitoring system named after Queen Elizabeth or her pet butterfly. Whatever. Plug in your favourite.

What you probably think you want to do, but it turns out you rarely need, is full-text indexing. People just don't grep the logs across 100,000 devices all that often. I mean, it's kinda nice to have. But it doesn't have to be instantaneous. You can plug in your favourite full text indexer if you like. But most of the time, just an occasional big parallel grep (perhaps using your favourite mapreduce clone or something more modern... or possibly just using grep) of a subset of the logs is sufficient.

(If you don't have too many devices, even a serial grep can be fine. Remember, a decent cloud computer should be able to read through ~1 GByte/sec, no problem. How much are you paying for someone to run some bloaty full-text indexer on all your logs, to save a few milliseconds per grep?)

I mean, run a full text indexer if you want. The files are right there. Don't let me stop you.

On the other hand, being able to retrieve the exact series of logs - let's call it the "narrative" - from a particular time period across a subset of devices turns out to be super useful. A mini-indexer that just remembers which logs from which devices ended up in which files at which offsets is nice to have. Someone else on our team built one of those eventually (once we grew so much that our parallel grep started taking minutes instead of seconds), and it was very nice.

And then you can build your dashboards

Once you've reduced, aggregated, and indexed your events into your favourite output files and databases, you can read those databases to build very fast-running dashboards. They're fast because the data has been preprocessed in mostly-real time.

As I mentioned above, we had our pipeline reading the input files as fast as they could come in, so the receive+grind+reduce+index phase only took a few tens of milliseconds. If your pipeline isn't that fast, ask somebody why. I bet their program is written in java and/or has a lot of sleep() statements or batch cron jobs with intervals measured in minutes.

Again here, I'm not going to recommend a dashboard tool. There are millions of articles and blog posts about that. Pick one, or many.

In conclusion

Please, please, steal these ideas. Make your log and event processing as stable as our small team made our log processing. Don't fight over structured vs unstructured; if you can't agree, just log them both.

Don't put up with weird lags and limits in your infrastructure. We made 50MB/day/device work for a lot of devices, and real-time mapreduced them all on a single VM. If we can do that, then you can make it work for a few hundreds, or a few thousands, of container instances. Don't let anyone tell you you can't. Do the math: of course you can.


Eventually our team's log processing system evolved to become the primary monitoring and alerting infrastructure for our ISP. Rather than alerting on behaviour of individual core routers, it turned out that the end-to-end behaviour observed by devices in the field were a better way to detect virtually any problem. Alert on symptoms, not causes, as the SREs like to say. Who has the symptoms? End users.

We had our devices ping different internal servers periodically and log the round trip times; in aggregate, we had an amazing view of overloading, packet loss, bufferbloat, and poor backbone routing decisions, across the entire fleet, across every port of every switch. We could tell which was better, IPv4 or IPv6. (It's always IPv4. Almost everyone spends more time optimizing their IPv4 routes and peering. Sorry, but it's true.)

We detected some weird configuration problems with the DNS servers in one city by comparing the 90th percentile latency of DNS lookups across all the devices in every city.

We diagnosed a manufacturing defect in a particular batch of devices, just based on their CPU temperature curves and fan speeds.

We worked with our CPU vendor to find and work around a bug in their cache coherency, because we spotted a kernel panic that would happen randomly every 10,000 CPU-hours, but for every 100,000 devices, that's still 10 times per hour of potential clues.

...and it sure was good for detecting power failures.

Anyway. Log more stuff. Collect those logs. Let it flow. Trust me.

Update 2019-04-26: So, uh, I might have lied in the title when I said you can't have this logs pipeline. Based on a lot of positive feedback from people who read this blog post, I ended up starting a company that might be able to help you with your logs problems. We're building pipelines that are very similar to what's described here. If you're interested in being an early user and helping us shape the product direction, email me!

Posted Sat Feb 16 08:11:48 2019 Tags:

In this blog post, I'll explain how to update systemd's hwdb for a new device-specific entry. I'll focus on input devices, as usual.

What is the hwdb and why do I need to update it?

The hwdb is a binary database sitting at /etc/udev/hwdb.bin and /usr/lib/udev/hwdb.d. It is usually used to apply udev properties to specific devices, those properties are then picked up by other processes (udev builtins, libinput, ...) to apply device-specific behaviours. So you'll need to update the hwdb if you need a specific behaviour from the device.

One of the use-cases I commonly deal with is that some touchpad announces wrong axis ranges or resolutions. With the correct hwdb entry (see the example later) udev can correct these at device initialisation time and every process sees the right axis ranges.

The database is compiled from the various .hwdb files you have sitting on your system, usually in /etc/udev/hwdb.d and /usr/lib/hwdb.d. The terser format of the hwdb files makes them easier to update than, say, writing a udev rule to set those properties.

The full process goes something like this:

  • The various .hwdb files are installed or modified
  • The hwdb.bin file is generated from the .hwdb files
  • A udev rule triggers the udev hwdb builtin. If a match occurs, the builtin prints the to-be properties, and udev captures the output and applies it as udev properties to the device
  • Some other process (often a different udev builtin) reads the udev property value and does something.
On its own, the hwdb is merely a lookup tool though, it does not modify devices. Think of it as a virtual filing cabinet, something will need to look at it, otherwise it's just dead weight.

An example for such a udev rule from 60-evdev.rules contains:

IMPORT{builtin}="hwdb --subsystem=input --lookup-prefix=evdev:", \
RUN{builtin}+="keyboard", GOTO="evdev_end"
The IMPORT statement translates as "look up the hwdb, import the properties". The RUN statement runs the "keyboard" builtin which may change the device based on the various udev properties now set. The GOTO statement goes to skip the rest of the file.

So again, on its own the hwdb doesn't do anything, it merely prints to-be udev properties to stdout, udev captures those and applies them to the device. And then those properties need to be processed by some other process to actually apply changes.

hwdb file format

The basic format of each hwdb file contains two types of entries, match lines and property assignments (indented by one space). The match line defines which device it is applied to.

For example, take this entry from 60-evdev.hwdb:

# Lenovo X230 series
evdev:name:SynPS/2 Synaptics TouchPad:dmi:*svnLENOVO*:pn*ThinkPad*X230*
The match line is the one starting with "evdev", the other two lines are property assignments. Property values are strings, any interpretation to numeric values or others is to be done in the process that requires those properties. Noteworthy here: the hwdb can overwrite previously set properties, but it cannot unset them.

The match line is not specified by the hwdb beyond "it's a glob". The format to use is defined by the udev rule that invokes the hwdb builtin. Usually the format is:

someprefix:search criteria:
For example, the udev rule that applies for the match above is this one in 60-evdev.rules:

KERNELS=="input*", \
IMPORT{builtin}="hwdb 'evdev:name:$attr{name}:$attr{[dmi/id]modalias}'", \
RUN{builtin}+="keyboard", GOTO="evdev_end"
What does this rule do? $attr entries get filled in by udev with the sysfs attributes. So on your local device, the actual lookup key will end up looking roughly like this:

evdev:name:Some Device Name:dmi:bvnWhatever:bvR112355:bd02/01/2018:...
If that string matches the glob from the hwdb, you have a match.

Attentive readers will have noticed that the two entries from 60-evdev.rules I posted here differ. You can have multiple match formats in the same hwdb file. The hwdb doesn't care, it's just a matching system.

We keep the hwdb files matching the udev rules names for ease of maintenance so 60-evdev.rules keeps the hwdb files in 60-evdev.hwdb and so on. But this is just for us puny humans, the hwdb will parse all files it finds into one database. If you have a hwdb entry in my-local-overrides.hwdb it will be matched. The file-specific prefixes are just there to not accidentally match against an unrelated entry.

Applying hwdb updates

The hwdb is a compiled format, so the first thing to do after any changes is to run

$ systemd-hwdb update
This command compiles the files down to the binary hwdb that is actually used by udev. Without that update, none of your changes will take effect.

The second thing is: you need to trigger the udev rules for the device you want to modify. Either you do this by physically unplugging and re-plugging the device or by running

$ udevadm trigger
or, better, trigger only the device you care about to avoid accidental side-effects:

$ udevadm trigger /sys/class/input/eventXYZ
In case you also modified the udev rules you should re-load those too. So the full quartet of commands after a hwdb update is:

$ systemd-hwdb update
$ udevadm control --reload-rules
$ udevadm trigger
$ udevadm info /sys/class/input/eventXYZ
That udevadm info command lists all assigned properties, these should now include the modified entries.

Adding new entries

Now let's get down to what you actually want to do, adding a new entry to the hwdb. And this is where it also get's tricky to have a generic guide because every hwdb file has its own custom match rules.

The best approach is to open the .hwdb files and the matching .rules file and figure out what the match formats are and which one is best. For USB devices there's usually a match format that uses the vendor and product ID. For built-in devices like touchpads and keyboards there's usually a dmi-based match format (see /sys/class/dmi/id/modalias). In most cases, you can just take an existing entry and copy and modify it.

My recommendation is: add an extra property that makes it easy to verify the new entry is applied. For example do this:

# Lenovo X230 series
evdev:name:SynPS/2 Synaptics TouchPad:dmi:*svnLENOVO*:pn*ThinkPad*X230*
Now run the update commands from above. If FOO=1 doesn't show up, then you know it's the hwdb entry that's not yet correct. If FOO=1 does show up in the udevadm info output, then you know the hwdb matches correctly and any issues will be in the next layer.

Increase the value with every change so you can tell whether the most recent change is applied. And before your submit a pull request, remove the FOOentry.

Oh, and once it applies correctly, I recommend restarting the system to make sure everything is in order on a freshly booted system.


The reason for adding hwdb entries is always because we want the system to handle a device in a custom way. But it's hard to figure out what's wrong when something doesn't work (though 90% of the time it's a typo in the hwdb match).

In almost all cases, the debugging sequence is the following:

  • does the FOO property show up?
  • did you run systemd-hwdb update?
  • did you run udevadm trigger?
  • did you restart the process that requires the new udev property?
  • is that process new enough to have support for that property?
If the answer to all these is "yes" and it still doesn't work, you may have found a bug. But 99% of the time, at least one of those is a sound "no. oops.".

Your hwdb match may run into issues with some 'special' characters. If your device has e.g. an ® in its device name (some Microsoft devices have this), a bug in systemd caused the match to fail. That bug is fixed now but until it's available in your distribution, replace with an asterisk ('*') in your match line.

Greybeards who have been around since before 2014 (systemd v219) may remember a different tool to update the hwdb: udevadm hwdb --update. This tool still exists, but it does not have the exact same behaviour as systemd-hwdb update. I won't go into details but the hwdb generated by the udevadm tool can provide unexpected matches if you have multiple matches with globs for the same device. A more generic glob can take precedence over a specific glob and so on. It's a rare and niche case and fixed since systemd v233 but the udevadm behaviour remained the same for backwards-compatibility.

Happy updating and don't forget to add Database Administrator to your CV when your PR gets merged.

Posted Thu Feb 14 01:40:00 2019 Tags: