
Multi-factor authentication remains hard-to-use, hard-to-secure, and error-prone. I've been studying authentication lately to see if it might be possible to adapt some security practices, especially phishing prevention, from big companies to small companies and consumers.

Here's what I have so far.

(Sorry: this is another long one. Security is complicated!)

The three factors aren't independent

The traditional definition of multi-factor authentication (although I couldn't find a reference to where it originated) is that you should have two or more of:

  1. Something you know (eg. a password)
  2. Something you have (eg. a card)
  3. Something you are (eg. a fingerprint or retinal scan).

This definition is actually not bad, and has done its job for years. But it tends to lead people into some dead ends that, years later, we now know are wrong.

First of all, let's talk about #3: something you are. The obvious implementation of this is to scan your fingerprint or retina at a door or computer screen. The computer looks it up in a central database somewhere and allows or denies entry. Right?

Well, it turns out, not so much. First of all, there are huge privacy implications to having your biometric data in a central server. I have a Nexus card, which is a programme created jointly by the Canadian and U.S. governments to allow theoretically slightly faster border crossing between the two countries. In reality, it is almost certainly a ploy to collect more biometric data. For some reason, Canada insisted on collecting my retinal scans, and the U.S. insisted on collecting my fingerprints, and it seems a bit unlikely to me that the information from one of those will never leak to the other. So now two of my biometrics are in two different databases, someday to be (if not already) merged into one. Whether or not this is legal is probably of only academic interest to the spy agencies doing it.

Now that's for government use, which is bad enough. But I don't need every bank, employer, or apartment building to have my biometrics, for all sorts of reasons. It's a privacy invasion. It can be used to track me well beyond the intended use case. It can be copied, by a sufficiently skilled attacker, and used to masquerade as me. And when that happens, it's not something I can replace like I can replace a stolen password. The dark joke among security engineers is at least I have ten fingers, so I get ten tries at keeping it safe.

Yeah, yeah, this is old news, right? But actually, biometrics have an even bigger problem that people don't talk about enough: false positives. If you have a giant database of fingerprints or retinas or DNA samples or voiceprints or faces, and you have an algorithm for searching the database, and you have one sample you want to look up in the database to find the right match... it turns out to be a hard problem. ML/AI people hide this problem from you all the time. There's a fun toy that tells you which celebrity you look most similar to, and it's eerily accurate, right? Sure. But the toy works because of false positives. It's easy, relatively, to find matches that look similar to you. It's tough to figure out which of those matches, if any, really are you. Which is fine for a toy, but bad for authentication.

(iPhoto and Facebook's face tagging are both susceptible to the same thing. It's pretty easy to see if a given photo matches someone already on your friend list, but very hard, maybe impossible, to match accurately against a single worldwide global database. That's why it only ever suggests tagging a face as one of your friends, even if it's wrong. This whole thing also creates serious obstacles to those dystopian "we'll assign a social score to every citizen and track them all everywhere through public security cameras" plans.)

The fix for false positives is surprisingly easy, and lets you fix a bunch of the other problems with biometrics at the same time: combine it with factor #2, ie. something you have, like a card or a phone.

Instead of storing your fingerprints or retinal scan or facial structure in a big central database, just put it on a card or a phone, in a secure element. Then, when you want to authenticate, the scanner makes a connection to your device, sends the current biometric scan, and the card compares it to its single target (you), and chooses whether or not to unlock itself in response. If it does, it tells the scanner (usually with a cryptographic signature) that all's well.
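
In code, the device-side logic has roughly this shape (a toy sketch only: the similarity function, the threshold, and the Ed25519 key standing in for the secure element's key are all illustrative, not a real matcher or a real protocol):

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

UNLOCK_THRESHOLD = 0.8   # can afford to be sloppy: there's only one enrolled user to match against

def similarity(a: bytes, b: bytes) -> float:
    # Placeholder: real matchers compare feature vectors, not raw bytes.
    same = sum(x == y for x, y in zip(a, b))
    return same / max(len(a), len(b), 1)

class SecureElement:
    def __init__(self, enrolled_template: bytes):
        self._template = enrolled_template        # stored at enrollment, never leaves the device
        self._key = Ed25519PrivateKey.generate()  # previously enrolled with the verifier

    def try_unlock(self, scan: bytes, challenge: bytes):
        if similarity(self._template, scan) < UNLOCK_THRESHOLD:
            return None                           # no match, no signature
        return self._key.sign(challenge)          # signed proof that the enrolled person is present

The scanner only ever sees a signature or a refusal; the template and the key never leave the device.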

This is much better: your search algorithm can be much worse. Even a 1 in 1000 false positive rate will be undetectable by random end users, even if it would be utterly hopeless in a million-person database. You don't need to put your biometrics in a central database, which is a big improvement when that database inevitably gets stolen. And best of all, a skilled attacker can't just clone your biometrics, because they won't have the key from your device. No more showing up at the secret facility with some synthetic rubber fingers and waltzing right in. That other factor (something you have... and will notice when it's stolen) goes a long way.

But what exactly is that second factor?

"Something you have" is not quite right

So much for biometrics, factor #3. But what about #2? "Something you have" is pretty straightforward, right? A card, a phone, an OTP fob, a U2F key, whatever. If you have it, that's an authentication factor.

Well, almost. We're going to need to be a little more specific in order to explain exactly why things work the way they do, and why some attacks are possible when you might think they're impossible.

First, let's be clear. "Something you have" is always an encryption key, a long sequence of bits, a large number. It's not a screwdriver, or an 1800s-style physical iron key (physical analogies for two-factor authentication all eventually turn out badly), or an ear infection. It's an encryption key. Let's call it that.

Secondly, it's not just any encryption key. Anybody can generate an encryption key if they have a big enough random number generator. Your encryption key is special. It's a key that you have previously agreed upon ("enrolled") with an authentication provider.

So let's change the definition. Factor #2 isn't something you have; it's just a particular very large number. Maybe you keep it on a device. Maybe you memorize it (if you can memorize 256 digits of pi, I guess you can memorize a 2048-bit RSA key, but I don't envy your key rotation strategy), and then it's something you "know." Maybe you get a QR code tattoo, and then it's something you "are." But it's a big number, and it's valuable only because the other end already knows, by prior agreement, that they will trust someone who has that number.

(Side note: instead of saving each enrolled key in a database, the authentication provider might sign your (public) key using a Certificate Authority (CA). The net result is the same: in order to get a signature from the CA, you had to enroll with it at some point in the past. The difference only affects the format of the backend authentication database on the server, which is opaque to us.)

From now on, factor #2 is "your previously-enrolled private key."
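
In code, the whole arrangement is small enough to sketch (using Ed25519 from the Python cryptography package; the registered_keys dict stands in for the provider's backend database or CA):

import os
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

registered_keys = {}                      # stand-in for the provider's backend (or a CA)

def enroll(user_id: str) -> Ed25519PrivateKey:
    key = Ed25519PrivateKey.generate()    # factor #2: a big random number
    registered_keys[user_id] = key.public_key()   # the "prior agreement"
    return key                            # lives on your device from now on

def authenticate(user_id: str, key: Ed25519PrivateKey) -> bool:
    challenge = os.urandom(32)            # fresh each time, so old signatures can't be replayed
    signature = key.sign(challenge)
    try:
        registered_keys[user_id].verify(signature, challenge)
        return True
    except InvalidSignature:
        return False

device_key = enroll("alice")
assert authenticate("alice", device_key)

All the interesting questions hide inside enroll(): how did the provider decide this key belongs to "alice"? Hold that thought.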

All factors are not created equal

An incorrect conclusion from the "something you know, something you have, something you are" definition is that you can pick any two of these and have a system that's about as secure as any other combination.

Nope.

In fact, factor #2 - the previously-enrolled private key - is by far the safest of the three. This is why banks give you bank cards but are willing to "authenticate" those cards with (in the U.S.) a useless pen-and-paper signature, or (elsewhere) an easily-stolen-or-guessed 4-digit PIN.

Your factor #2 key can be easily replaced (and re-enrolled). You can have as many as you want, corresponding to as many pseudonymous identities as you want. It's a truly random stream of bits, not a weak one you composed because it would be easy to remember. It's easily and perfectly distinguished from everyone else's key. It can't be used to track you between use cases (unless your device is designed to leak it intentionally, sigh). With software configured correctly, it can't be phished. It can be stolen - but (if the hardware is designed right) your attacker has to steal it in the physical world, which eliminates Internet-based attackers, which is most of them. And if they steal your physical key card, you'll probably notice pretty fast.

Even ancient magstripe credit cards share most of these advantages. Their main weakness is that a physical "attacker" can clone rather than steal your card, which is much harder to detect. That's the only real advantage of "chip" cards: they can't be trivially copied. (The switch from signatures to PINs is mostly a distraction and adds little.) We make fun of old-fashioned banks for "still" accepting magstripe cards and being trapped in the past, but those cards are vastly more secure than almost anything else your non-bank service providers use.

(Let's skip over Internet shopping, in which you just type in the not-so-secret magic number from the back of the credit card and combine it with your not-so-secret postal code. Oh well. Sometimes convenience trumps security, even for a bank. But they still keep issuing those physical cards.)

Your phone is probably doing this right

By the way, this whole post is just repeating what experts already know. Your phone, especially if it's an iPhone, already benefits from people who know all this stuff.

Your iPhone contains a previously-enrolled private key, in a secure element, that can uniquely identify it to Apple. You can generate more of them, like the key used to encrypt your local storage. Actions on the key can be restricted to require a fingerprint first, and even the fingerprint reader uses its own previously-enrolled private key to talk to the secure enclave, preventing attackers from tearing apart your phone and feeding it digital copies of your fingerprint, bypassing the sensor. Plus, after a power cycle, the iPhone secure element refuses to unlock at all without first seeing your PIN, which an attacker can't lift from your hotel room or jail cell like they can lift your fingerprints. And it only lets an attacker try a few guesses at the PIN before it locks itself permanently.

FaceID is similar. It doesn't really matter if the fingerprint or FaceID algorithms have a fairly high false positive rate, because of all the other protections, especially that previously-enrolled private key. To get in, an attacker either needs your phone, or needs to hack the software on your phone and wait for you to scan a biometric.

Awesome! Let's use our phone to authenticate everything!

Yeah, hold on, we're getting there.

That pesky "previous" enrollment

So here's the catch. The whole multi-factor authentication thing is almost completely solved at this point. Virtually everybody has a phone already (anyway, more people have phones than computers), and any phone can store a secret key - it's just a number, after all - even if it doesn't have secure element hardware. (The secure element helps against certain kinds of malware attacks, but factor #2 authentication is still a huge benefit even with no secure element.)

The secret key on your phone can be protected with a PIN, or biometric, or both, so even if someone steals your phone, they can't immediately pretend to be you.

And, assuming your phone was not a victim of a supply chain attack, you have a safe and reliable way to tell your phone not to authorize anybody unless they have your PIN or biometric: you just need to be the person who initially configures the phone. Nice! Passwords are obsolete! Your phone is all three authentication factors in one!

All true!

But... how does a random Internet service know your phone's key is the key that identifies you? Who are you, anyway?

The thing about a previously-enrolled private key is you have to... previously... enroll it... of course. Which is a really effective way of triggering Inception memes. Just log into the web site, and tell it to trust... oh, rats.

Authentication is easy. Enrollment is hard.

Which leads us to a surprising conclusion: security research has actually come really far. We now have good technology to prevent phishing while decreasing the chance of forgotten passwords (PINs are easier than passwords!) and avoiding all the privacy problems with biometrics. Cool! Why isn't it deployed everywhere?

Because enrollment remains unsolved.

Here are some ways we do enrollment today:

  1. Pick a password at account creation time and hope you weren't being phished at that time. (Phishable. Incidentally, "password manager" apps are a variant of this.)

  2. Get a security token (OTP, U2F, whatever) issued by your employer in person, after some kind of identity check. (Very secure, but tedious, which is why only corporations bother with it.)

  3. Mail a token to your physical mailbox and hope nobody else gets there first. Banks do this with your credit card. (Subject to (eventually detectable) mail theft by criminals and roommates, but very effective, albeit slow.)

  4. Log in once with a password, then associate a security token for subsequent logins. (Password logins are phishable.)

  5. Your physical device (eg. iPhone) arrives with a key that the vendor (eg. Apple) already trusts, so at least it knows it's talking to a real iPhone. (Great, but it doesn't prove who you are.)

  6. Send a confirmation message to your email address or phone number, thus ensuring you own it. (Inception again: this just verifies that you can log in somewhere else, with an unknown level of security. Email addresses and phone numbers are stolen all the time.)

  7. Ask for personal information, eg. "security questions", your government SIN, your physical address, your credit card number. (These answers aren't as secret as you'd like to think.)

All the above methods have different sorts of flaws or limitations.

That said, we should draw attention to some special kinds of "enrollment" that are actually nearly perfect:

  1. Setting a PIN on my iPhone, assuming the iPhone's supply chain was secure, is reliable and easy because of a physical link between it and me.

  2. Setting fingerprint or face unlock on my iPhone is reliable and easy, again because of a physical (sensor) link between it and me.

  3. My corporate-issued U2F token can issue signatures to my web browser because they are physically linked, and it can assume I plugged it into whatever computer on purpose.

  4. Bluetooth devices make me do an explicit enrollment ("pairing") phase on both devices, which prevents attacks other than by people who are physically nearby at the time of initial pairing. (When available, a pairing PIN adds a small amount of security on top, but is usually more work than it's worth unless the NSA is parked outside.)

  5. Registering a new iCloud account on my iPhone can be safe because the iPhone's built-in certificate can be used to mutually validate between Apple's auth server and the phone. (This is only safe if you create your account from a suitably secure iPhone in the first place. Securing a non-secured account later is subject to the usual enrollment problems.)

  6. Registering additional Apple devices to your existing iCloud account, where Apple pops up a confirmation on your already-registered Apple devices and then makes you type an OTP into the new device. (The OTP in this case might seem extraneous, but it ensures that someone can't just trigger a request from their own device at the same moment as you trigger a request from yours, and have them both go through because you click confirm twice. It's also cool that they show the GPS location of the requester in the confirmation dialog, although that seems forgeable.)

...okay, I didn't expect this article to turn out as an iPhone ad, but you know what? They've really thought through this consumer authentication stuff. (To their credit, although Google is behind on consumer authentication, their work on U2F for corporate authentication is really great. It relies on the tedious-but-highly-effective human enrollment process.)

It seems to me that the above successful enrollment patterns all use one or more of the following techniques:

  • A human authenticates you and issues you a token (usually in person).

  • A short-distance, physical link (proximity-based authentication) like a biometric sensor, or USB or bluetooth connection.

  • Delegation to an existing authenticator (like a popular web service, eg. via oauth2, or by charging a credit card) to confirm that it's you. Once you have even a single chicken, you no longer have a chicken-or-egg problem. After delegating once, you can enroll a new token and avoid future delegations to that service if you want. (That might or might not be desirable.)

That's all I've got. In this article, I'm just writing about what other people do. I don't have any magic solutions to enrollment. So let's leave enrollment for now as an only-partially-solved, rather awkward problem.

(By the way, the reason gpg is so complicated and unusable is that the "web of trust" is all enrollment, all the time.)

But now that we understand about enrollment, let's talk about...

Why is U2F so effective?

Other than enrollment, using U2F (Universal Second Factor) is pretty simple. You get a physical key, plug it into a USB port (or nowadays sometimes a bluetooth link), and you press it when prompted.

When you press it, the device signs a request from any requesting web site, via the web browser, thus proving that you are you. (Whoever "you" are was determined during enrollment.)

The non-obvious best feature is that the signature only applies to a particular web domain that made the request, and the web browser enforces that the domain name is "real" by using HTTPS certificates. (Inception again! How do we know we can trust the web site? Because their HTTPS cert was signed by a global CA. How do we know which CAs to trust? A list of them came with our browser or our OS. How can we trust the browser or OS? Uh... supply chain?)

Anyway, put it all together, and this domain validation is like magic; it completely defeats phishing.

A good mental model for phishing is to imagine some person sitting there with their own web browser, but also operating a web site that proxies your requests straight through to the "real" web site, while capturing the traffic. So you accidentally visit gmail.example.com, which proxies to gmail.com. Somehow you are fooled into thinking you need to log in as if it's gmail. You type your username and password, which are logged by the proxy and sent along to gmail.com. Our hacker friend reads the username and password from their trace, and logs into gmail.com on their web browser. You've been phished!

Now let's add U2F. After you type your password, the site asks you to tap your U2F. You tap it, and generate a valid signature directed to... gmail.example.com, the site your browser confirmed as having made the request. They proxy that valid signature along to gmail.com, but it's useless: gmail.com doesn't trust signatures directed at gmail.example.com. The phishing has been blocked completely.
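
Here's the same story as a sketch (just the idea, not the actual U2F/WebAuthn wire format; the way the origin and challenge get concatenated here is made up for illustration):

import os
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

token_key = Ed25519PrivateKey.generate()       # enrolled with gmail.com at some earlier point
enrolled_public_key = token_key.public_key()

def token_sign(challenge: bytes, origin: str) -> bytes:
    # The *browser* fills in origin from the page's verified HTTPS identity;
    # the user never gets a chance to type it wrong.
    return token_key.sign(origin.encode() + b"|" + challenge)

def site_verify(challenge: bytes, signature: bytes) -> bool:
    try:
        enrolled_public_key.verify(signature, b"gmail.com|" + challenge)
        return True
    except InvalidSignature:
        return False

challenge = os.urandom(32)
print(site_verify(challenge, token_sign(challenge, "gmail.com")))          # True
print(site_verify(challenge, token_sign(challenge, "gmail.example.com")))  # False: the proxied signature names the wrong origin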

(By the way, OTP - any device that generates a number that you have to type in - doesn't prevent phishing at all. At least, not if the attacker can do man-in-the-middle attacks, which they probably can. In our example, you'd happily type the OTP into gmail.example.com, for the same reason you typed your password, and it would proxy it to gmail.com, and bam, you lose.)

(I mentioned Apple's authenticator above, which requires you to type an OTP. It doesn't have this weakness, because they are also authenticating using a U2F-like mechanism behind the scenes. Forcing you to also type the OTP is an extra layer of security, part of a very subtle enrollment process that defends against even rarer types of attack.)

Note also that you can unplug your U2F token and plug it into any other computer and web browser instance. Without any special pairing process, it can trust the new web browser and the web browser can trust the U2F. Why? Because of that physical link again. Somebody plugged this U2F key into this browser. That means they intended for them to trust each other. It's axiomatic; that's the design.

For this reason, pure U2F doesn't work well without a second factor, such as a password. Otherwise, if someone steals your U2F key, they could impersonate you by plugging it into their own computer, unless there's something else proving it's you.

These two factors are so effective because of the threat model. Basically, there are physical attackers (eg. purse snatchers) and there are Internet attackers (eg. spammers). Physical attackers can steal your U2F device, but they also need to see you type your password, and the U2F is traceable when used, and the attack is unscalable (you need a physical human to be near you to steal it). On the other hand, Internet attackers can easily phish your password, but can't steal your U2F.

Skilled, well-funded attackers with physical presence can do both. Sorry. But most of us aren't important enough to worry about that.

Can I use my phone as a U2F?

Learning about the above led me to ask a question many people have asked: why do I need the stupid physical security token, which is only plugged into one computer at a time and which I'm guaranteed to misplace? Can't my phone be the security token?

This is a good question. Aside from enrollment being generally hard, the stupid, expensive physical security token is, almost certainly, the thing preventing widespread adoption of U2F. I doubt it'll ever catch on, other than for employees of corporations (who pay you for the inconvenience and have good ways to enroll a new token when you inevitably lose yours).

It ought to be easy, right? A U2F is really just a secure element containing a previously-enrolled private key. As established earlier, you can put a private key anywhere you want. Heck, your iPhone has one already, protected by a PIN or fingerprint or face. We're almost there!

Almost. The thing is, when I'm browsing on my computer with my phone in my pocket, it's not my phone that needs to authenticate with the web site: it's my computer. So my phone and computer are going to have to talk to each other, just like a USB U2F and a computer talk to each other.

Both my web browser and my phone are connected to the Internet, so that's the most obvious way for them to talk. (If my computer weren't on the Internet, it presumably wouldn't need to authenticate with a web site. And my phone probably has a cellular data plan, so situations where it's on the Internet are "mostly" a superset of situations where my computer is.)

So all I need to do is generate an encrypted push request from my browser, through the Internet, to my phone, saying "please sign an authorization for gmail.com." Then I tap my phone to approve, and it sends the answer back, and the web browser proceeds, exactly as it would with a USB-connected U2F token. Right?

Almost. There's just one catch. How does my phone know the request is coming from a trusted web browser, and not the browser that the phishing attacker, up above, was operating, or from some other random browser on the Internet? How can my phone trust my browser?

Inception.

You have to... previously enroll... the browser's secret key... with your phone's U2F app.

Okay, okay, we can do this. Recall that there are three ways to do a safe enrollment: 1) a physical person gives you a key; 2) a direct physical link; 3) delegation from an existing authenticator. Traditional USB U2F uses method #2. When you physically plug a U2F into a new web browser, it starts trusting that web browser. Easy and safe.

But your computer and your phone aren't linked (if they are, then problem solved, but they usually aren't). Enrollment between your browser and your phone can be done in various different ways: you could make a bluetooth link, or generate a key on one and type it into the other, or scan a QR code from one to the other. Let's assume you picked some way to do this, and your browser is enrolled with your phone.

Now what? Okay, now when I want to log in, I must know my password and must physically possess my computer (containing the enrolled browser), and my phone with its U2F app. I type my password, which triggers a U2F request from the web site to the browser, which signs and pushes a signing request to my phone, which pops up an acknowledgement, which I tap, and then it signs and sends an answer back to my browser, which sends it back to the web site, and success!

Except... something's fishy there. That was more complicated than it needs to be. Notice that:

  • The browser is trusted by the phone U2F app because of a previous enrollment.

  • The phone U2F app is trusted by the web site because of a (different) previous enrollment.

  • My screen tap (or fingerprint scan) is trusted by the U2F app because of my physical presence (or another previous enrollment).

Why do I even need my phone in this sequence? I've already enrolled my browser; it's trusted. During enrollment, we could have given it a key that's just as useful as a U2F device, and we could have done that precisely as safely as enrolling a new U2F device. (It's nice if my computer has a secure element in which to store the key, but as established above, that only matters for security if my computer gets hit by malware.)

We could just do this instead: I type my password. Web site generates U2F request. Web browser signs the request using its previously-enrolled private key, enrolled earlier using my phone. Login completes, and phishing is prevented! Plus we don't rely on my phone or its flakey Internet connection.

Great, right?

(Note that while this does indeed prevent phishing - the biggest threat for most organizations - it doesn't prevent your roommate, spouse, or 6-year old child from watching you type your password, then sometime later unlocking your computer and using its already-enrolled web browser to pretend to be you. By leaving your 30-pound desktop computer just lying around, you've let your second factor get stolen. In that way, phones are better second factors than computers, because we're so addicted to them that we rarely let them out of our sight. That doesn't stop your children from borrowing your fingerprints while you nap, however.)

There is, of course, still a catch. If you do this, you have to enroll every browser you might use, separately, using your phone, with one of these tedious methods (bluetooth, typing a key, or QR code scan). That's kind of a hassle. With a USB U2F key, you can just carry it around and plug it into any computer. Even better, when you stop trusting that computer, you just unplug the token, and it cancels the enrollment. With a physical device, it's super easy to understand exactly what you've enrolled and what you haven't. (With a software key, the server needs a good UI to let you un-enroll computers you don't need anymore, and users will surely screw that up.)

What people tend to miss, though, is that this enrollment is necessary whether or not you send a push notification to the phone during login. The push notification is only secure if this specific browser instance is enrolled; but if this browser instance is enrolled, and my computer is not easier to steal than my phone, then the push notification adds no extra security. The enrollment was the security.

And that's where I'm stuck. I've more or less convinced myself that phone-based OTP (prone to phishing) or phone-push-based U2F (not useful after initial enrollment) add no interesting security but do make things harder for end users. I guess they call that "security theatre." Meanwhile, physical U2F tokens are unlikely to become popular with consumers because they're inconvenient.

At least that conclusion is consistent with the state of the world as it exists today, and what the market leaders are currently doing. Which means maybe I've at least explained it correctly.

(Special thanks to Tavis Ormandy, Reilly Grant, and Perry Lorier for answering some of my questions about this on Twitter. But any mistakes in this article are my fault, not theirs.)

Posted Mon Jan 14 03:50:47 2019 Tags:

I've been working on a series of tutorials using redo for various use cases. (One of the most common user requests is more examples of how to solve real-world problems with redo. The problem with extremely flexible tools is that it can be hard for people to figure out how to start.)

The most recent in the series is a tutorial on building docker and kvm containers from scratch using redo. I think it turned out pretty well. Maybe the tutorial deserves a disclaimer, though: is this really something you should do?

Good question. I don't know.

You see, there are a few "standard" ways to build containers nowadays, most commonly using Dockerfiles and the docker build command. It's not at all obvious that we need another way to do it. Then again, it's not obvious that we don't.

Wait, let's back up a bit. Where are we and how did we get here?

The idea of isolated chroot-based containers has been around for a very long time. In my first startup, we even had a commercial version of the concept as early as 2005, first implemented by Patrick Patterson, called Nitix Virtual Server (NVS). The original idea was to take our very-locked-down Linux-based server appliance and let you install apps on it, without removing the (very useful for security and reliability) locked-down-ness of the operating system. Nitix was a very stripped down, minimal Linux install, but NVS was a full install of a CentOS-based system in userspace, so you could do basically anything that Linux could do. (You might recognize the same concepts in ChromeOS's Crostini.) We could start, stop, and snapshot the "inner" NVS operating system and make backups. And most importantly, if you were an app developer, you could take one of these pre-installed and customized snapshots, package it up, and sell it to your customers. Hopefully with one of our appliances!

Eventually one of these packaged apps became appealing enough that the app maker, who was much larger than us, decided to acquire our company, then (as often happens) accidentally killed it with love, and that was sadly the end of that branch of the container evolutionary tree.

Our containers were installed on physical appliance hardware - another branch of evolution that seems to have died. Nobody believes anymore that you can have a zero-maintenance, self-configuring, virus-free appliance server that runs on your office network without needing a professional sysadmin. Come to think of it, most people didn't believe us back then either, at least not until after seeing a demo. But whatever, the product doesn't exist anymore, so nowadays they're right.

In any case, the modern solution to this is for everybody to host everything in the Cloud. The Cloud has its own problems, but at least those problems are fairly well understood, and most importantly, you can pay by the minute for a crack team of experts, the ones who own the servers, to fix problems for you. For most people, this works pretty well.

But back to containers. The way we made them, long ago, was a bit ad-hoc: a person installed a fresh NVS, then installed the app, then wrote a few scripts to take system configuration data (user accounts, remember when we used those? and hostnames, IP addresses, and so on) and put them in the right app-specific config files. Then they'd grab a snapshot of the whole NVS and distribute it. Making new versions of an app container involved either making additional tweaks to the live image (a little risky) and re-snapshotting, or having a human start over from scratch and re-run all the manual steps.

Those were simpler times.

Nowadays, people care a lot more about automated builds and automated testing than they did back in 2005, and this is a big improvement. They also collaborate a lot more. Docker containers share almost the same basic concepts: take a base filesystem, do some stuff to it, take another snapshot, share the snapshot. But it's more automated, and there are better ways to automate the "do some stuff" part. And each step is a "layer", and you can share your layers, so that one person can publish a base OS install, another person can install the Go compiler, another person can build and install their app, and another person can customize the app configuration, and all those people can work at different companies or live in different countries.

As a sign of how out of touch I am with the young'uns, I would never have thought you could trust a random unauthenticated person on the Internet to provide a big binary image for the OS platform your company uses to distribute its app. And maybe you can't. But people do, and surprisingly it almost never results in a horrible, widespread security exploit. I guess most people are surprisingly non-evil.

Anyway, one thing that bothered me a lot, both in the old NVS days and with today's Dockerfiles, was all the extra crud that ends up in your image when you install it this way. It's a whole operating system! For example, I looked at what might be the official Dockerfile for building a MySQL server image (although I'm not sure how one defines "official") and it involves installing a whole C++ compiler toolchain, then copying in the source code and building and installing the binary. The resulting end-user container still has all that stuff in it, soaking up disk space and download time and potentially adding security holes and definitely adding a complete lack of auditability.

I realize nobody cares. I care, though, because I'm weird and I care about boring things, and then I write about them.

Anyway, there are at least two other ways to do it. One way endorsed by Dockerfiles is to "skip" intermediate layers: after building your package, uninstall the compiler and extra crap, and then don't package the "install compiler" and "install source code" layers at all. Just package one layer for the basic operating system, and one more for all the diffs between the operating system and your final product. I call this the "blacklist" approach: you're explicitly excluding things you don't want from your image. Dockerfiles make this approach relatively easy, once you get the hang of it.

A more obsessive approach is a "whitelist": only include the exact files you want to include. The trick here is to first construct the things you want in your final container, and then at the end, to copy only the interesting things into a new, fresh, empty container. Docker doesn't really make this harder than anything else, but Docker doesn't really help here, either. The problem is that Docker fundamentally runs on the concept of "dive into the container, execute some commands, make a snapshot" and we don't even have a container to start with.

So that's the direction I went with my redo tutorial; I built some scripts that actually construct a complete, multi-layered container image without using Docker at all. (To Docker's credit, this is pretty easy, because their container format is simple and pretty well-defined.) Using those scripts, it's easy to just copy some files into a subdirectory, poke around to add in the right supporting files (like libc), and then package it up into a file that can be loaded into docker and executed. As a bonus, we can do all this without being the 'root' user, having any docker permissions, or worrying about cluttering the local docker container cache.
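
To give a flavour of the whitelist approach, here's a sketch that is emphatically not the tutorial's code: the ldd parsing is naive, the busybox path is just an example, and it produces a crude single-layer image via docker import rather than the multi-layered image the tutorial builds.

import os, re, shutil, subprocess, tarfile

def needed_libs(binary: str) -> list:
    # Ask ldd which shared libraries the binary needs; a static binary yields nothing.
    out = subprocess.run(["ldd", binary], capture_output=True, text=True).stdout
    return re.findall(r"=> (\S+) \(0x", out) + re.findall(r"^\s*(/\S*ld-linux\S*)", out, re.M)

def build_rootfs(binary: str, rootfs: str = "rootfs"):
    # Copy exactly the binary and its libraries into an empty tree, then tar it up.
    for path in [binary] + needed_libs(binary):
        dest = os.path.join(rootfs, path.lstrip("/"))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy2(path, dest)
    with tarfile.open("rootfs.tar", "w") as tar:
        tar.add(rootfs, arcname=".")

build_rootfs("/bin/busybox")
# Then:  docker import rootfs.tar minimal/busybox
#        docker run --rm minimal/busybox /bin/busybox echo hello

For anything real you'd want proper layers (which is what the tutorial's scripts produce), but even this crude version gives you an image containing exactly the files you chose and nothing else.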

I like it, because I'm that kind of person. And it was a fun exercise. But I'm probably living in the past; nobody cares anymore if it takes a few gigabytes to distribute an app that should be a few megabytes at most. Gigabytes are cheap.

Side note: incremental image downloads

While we're here, I would like to complain about how people distribute incremental changes to containers. Basically, the idea is to build your containers in layers, so that most of the time, you're only replacing the topmost layers (ie. your app binaries) and not the bottommost layers (ie. the OS). And you can share, say, the OS layer across multiple containers, so that if you're deploying many containers to a single machine, it only has to download the OS once.

This is generally okay, but I'm a bit offended[1] that if I rebuild the OS with only a few changes - eg. a couple of Debian packages updated to fix a security hole - then it has to re-download the whole container. First of all, appropriate use of rsync could make this go a lot more smoothly.

But secondly, I already invented a solution, eight years ago, and open sourced it, and then promptly failed to document or advertise it so that (of course) nobody knew it exists. Oops.

The solution is something I call bupdate (a mix of "bup" and "update"), a little-known branch of my bup incremental backup software, which I've written about previously.

Unlike bup, the bupdate client works with any dumb (static files only) http server. bupdate takes any group of files - in this case, tarballs, .iso images, or VM disk images - runs the bupsplit algorithm to divide them into chunks, and writes their file offsets to files ending in .fidx (file index, similar to git's .idx packfile indexes and bup's .midx multi-pack indexes), which you then publish along with the original files. The client downloads the .fidx files, generates its own index of all the local files it already has lying around (eg. old containers, in this case), and constructs exact replicas of the new files out of the old chunks and any necessary newly-downloaded chunks. It requests the new chunks using a series of simple HTTP byterange requests from the image files sitting on the server.
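
If the mechanism isn't obvious, the client side has roughly this shape (a simplification, not bupdate's actual code or .fidx format; real bupdate uses bupsplit's content-defined chunk boundaries, while this sketch cheats with fixed-size chunks):

import hashlib
import urllib.request

def fetch_range(url: str, offset: int, length: int) -> bytes:
    # Plain HTTP byte-range request; any dumb static file server can answer this.
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={offset}-{offset + length - 1}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def chunk_index(data: bytes, size: int = 64 * 1024):
    # Fixed-size chunks for brevity; bupsplit derives boundaries from the content,
    # which is what keeps chunks stable across insertions and deletions.
    for offset in range(0, len(data), size):
        chunk = data[offset:offset + size]
        yield hashlib.sha1(chunk).hexdigest(), offset, len(chunk)

def reconstruct(remote_index, url: str, local_files) -> bytes:
    local = {}                                   # chunk hash -> bytes, from old files lying around
    for path in local_files:
        with open(path, "rb") as f:
            data = f.read()
        for digest, offset, length in chunk_index(data):
            local[digest] = data[offset:offset + length]
    out = bytearray()
    for digest, offset, length in remote_index:  # parsed from the server's index file
        out += local.get(digest) or fetch_range(url, offset, length)
    return bytes(out)

The server stays completely dumb: it publishes the index alongside the original file, and all the cleverness lives in the client.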

It's pretty neat. There's even an NSIS plugin so that you can have NSIS do the download and reassembly for you when installing a big blob on Windows (which I implemented for one of our clients at one point), like for updating big video game WAD files.

(By the way, this same technique would help a lot with, say, apt-get update's process for retrieving its Packages files. All we'd need to do is upload a Packages.fidx alongside the Packages file itself, and a new client which understood bupdate could use that to retrieve only the parts of the Packages file that have changed since last time. This could reduce incremental Packages downloads from several megabytes to tens of kilobytes. Old clients would ignore the .fidx and just download the whole Packages file as before.)

bupdate is pretty old (8 years now!), and relies on an even older C++ library that probably doesn't work with modern compilers, but it wouldn't be too hard to rejuvenate. Somebody really ought to start using it for updating container layers or frequently-updated large lists. Contact me if you think this might be useful to you, and maybe I'll find time to bring bupdate back to life.

gzip --rsyncable

On that note, if you haven't heard of it already, you really should know about gzip --rsyncable.

It's widely known that gzip'd files don't work well with rsync, because if even one byte changes near the beginning of the file, that'll change the compression for the entire rest of the file, so you have to re-download the whole thing. And bupdate, which is, at its core, really just a one-sided rsync, suffers from the same problem.

But gzip with --rsyncable is different. It carefully changes the compression rules so the dictionary is flushed periodically, such that if you change a few bytes early on, it'll only disrupt the next few kilobytes rather than the entire rest of the file. If you compress your files with --rsyncable, then bupdate will work a lot better.

Alternatively, if you're using a web server that supports on-the-fly compression, you can serve an uncompressed file and let the web server compress the blocks you're requesting. This will be more byte-efficient than gzip --rsyncable (since you don't have to download the entire block, up to the next resync point), but costs more CPU time on the server. Nowadays, CPU time is pretty cheap and gzip is pretty fast, so that might be a good tradeoff.

Footnotes

[1] When I say I'm offended by the process used to update containers, it's not so much that I'm offended by people failing to adopt my idea - which, to be fair, I neglected to tell anyone about. Mostly I'm offended that nobody else managed to invent a better idea than bupdate, or even a comparably good idea[2][3], in the intervening 8 years. Truly, there is no point worrying about people stealing my ideas. Rather the opposite.

[2] Edit 2019-01-13: Eric Anderson pointed me to casync, which is something like bup and bupdate, and references bup as one of its influences. So I guess someone did invent at least a "comparably good idea." I think the bupdate file format is slightly cuter, since it sits alongside and reuses the original static files, which allows for cheap backward compatibility with plain downloads or rsyncs. But I'm biased.

[3] Edit 2019-01-13: JNRowe points out a program called zsync, which sounds very similar to bupdate and shares the same goal of not disturbing your original file set. In fact, it's even more clever, because you can publish a zsync index on one web site that refers to chunks on another web site, allowing you to encourage zsync use even if the upstream maintainer doesn't play along. And it can look inside .gz files even if you don't use gzip --rsyncable! Maybe use that instead of bupdate. (Disclaimer: I haven't tried it yet.)

Posted Fri Jan 11 23:42:23 2019 Tags:
Cover of the last print version of the book.
Like a phoenix (Phenix aegyptus), my Groovy Cheminformatics rises from the ashes. About a year ago I blogged that I could no longer maintain my book, at least not in print form. The hardest part was actually resizing the cover each time the book got thicker. I actually started the book about 10 years ago, but the wish to make it Open Access grew bigger with the years.

So, here we go. It's based on CDK 2.0, but somewhere in the coming weeks I'll migrate to the latest version. It will take some weeks to migrate all content, and you can leave your chapter priority requests here.

The making of...
Over the past months I have been playing with some ideas on how to make the transition. I wanted to preserve the core concept of the book that all scripts are compiled and executed with each release and that all output of scripts is autogenerated (including many of the diagrams). I wanted to publish the next iteration of the book as Markdown, but also toyed with the idea of still being able to generate a PDF with LaTeX. That means I have a lot of stuff to upgrade.

I ended up somewhere in between. Its source is Markdown, but not entirely: it's source code that looks like Markdown with snippets of XML. This makes sure the source looks formatted when viewed on GitHub:
But you can see that this is not processed yet. CreateAtom1 and CreateAtom2 refer to code examples, and the above screenshot shows the source of a source code inclusion (for CreateAtom1 and CreateAtom2) and an output inclusion (for CreateAtom2). After processing, the actual page looks like this:


That looks pretty close to what the print book had. An extra here is that you can click the link to the code (hard to do in a print book). That is something I improved along the way, and it leads to a new Markdown page that shows the full sources and the output (should I add the @Grab instructions, or is that too obvious?):


If you check the first online version (🎶 On the first day of xmas, #openscience got from me ... 🎶), you'll see I have quite some content left to migrate. First up: doing the reference sections properly again, as if I was still working with BibLaTeX.

Happy holidays!
Posted Wed Dec 26 07:27:00 2018 Tags:

I woke up early this morning, and those of you who live above the 45th parallel north or so are used to the “I'm wide awake but it's still dark as night” feeling in the winter. I usually don't turn on the lights; I wander into my office and just bring my computer out of hibernate. That takes a bit, as my 100% Free-Software-only computer is old and slow, so I usually go to make coffee while it happens.

As I came back into my office this morning, I was a bit struck by both displays showing the huge Debian screen lock image, and it got me thinking about how Debian has been my companion for so many years. I spoke about this at DebConf 15 a bit, and wrote about a similar concept years before. I realize that it's been almost nine years that I've been thinking rather deeply about my personal relationship with Debian and why it matters.

This morning, I was inspired to post this because, echoing back to my thoughts at my DebConf 15 talk, I can't actually do the work I do without Debian. I thought about a few simple things that Debian gets done for me that are essential:

  • Licensing assurance. I really can trust that Debian will not put something in main that fails to respect my software freedom. Given my lifelong work on Free Software licensing, yes, I can vet a codebase to search for hidden proprietary software among the Free, but it's so convenient to have another group of people gladly do that job for me and other users.
  • Curated and configured software, with connection to the expert. Some days it seems none of the new generation of developers are fans of software packaging anymore. Anytime you want to run something new these days, someone is trying to convince you to download some docker image or something like that. It's not that I don't see the value in that, but what I usually want is the software I just read about installed on my machine as quickly as possible. Debian's repository is huge, and the setup of Debian as a project allows each package maintainer to work in relative independence to make the software of their interest run correctly as part of the whole. For the user, that means when I hear about some interesting software, Debian immediately connects me, via apt, with the individual expert who knows about that software and my operating system / distribution both. Apt, Debian's Bug Tracker, etc. are actually a rudimentary but very usable form of social networking that allows me to find the person who did the job to get this software actually working on my system. That's a professional community that's amazing.
  • Stability. It's rather amusing: all the Debian developers I know run testing on their laptops and stable only on their servers. I run stable on my laptop. I have a hectic schedule and always lots of work to do that, sadly, does not usually include “making my personal infrastructure setup do new things”. While I enjoy that sort of work, it's a rabbit hole that I rarely have the luxury to enter. Running Debian stable on my laptop means I am (almost) never surprised by any behavior of my equipment. In the last nine years, if my computer does something weird, it's basically always a hardware problem.

Sure, maybe you can get the last two mostly with other distributions, but I don't think you can get the first one anywhere better. Anyway, I've gotta get to work for the day, but those of you out there that make Debian happen, perhaps you'll see a bit of a thank you from me today. While I've thanked you all before, I think that no one does it enough.

Posted Sat Dec 15 06:24:00 2018 Tags:

Disclaimer: this is pending for v4.21 and thus not yet in any kernel release.

Most wheel mice have a physical feature to stop the wheel from spinning freely. That feature is called detents, notches, wheel clicks, stops, or something like that. On your average mouse that is 24 wheel clicks per full rotation, resulting in the wheel rotating by 15 degrees before its motion is arrested. On some other mice that angle is 18 degrees, so you get 20 clicks per full rotation.

Of course, the world wouldn't be complete without fancy hardware features. Over the last 10 or so years devices have added free-wheeling scroll wheels or scroll wheels without distinct stops. In many cases wheel behaviour can be configured on the device, e.g. with Logitech's HID++ protocol. A few weeks back, Harry Cutts from the chromium team sent patches to enable Logitech high-resolution wheel scrolling in the kernel. Succinctly, these patches added another axis next to the existing REL_WHEEL named REL_WHEEL_HI_RES. Where available, the latter axis would provide finer-grained scroll information than the click-by-click REL_WHEEL. At the same time I accidentally stumbled across the documentation for the HID Resolution Multiplier Feature. A few patch revisions later and we now have everything queued up for v4.21. Below is a summary of the new behaviour.

The kernel will continue to provide REL_WHEEL as axis for "wheel clicks", just as before. This axis provides the logical wheel clicks, (almost) nothing changes here. In addition, a REL_WHEEL_HI_RES axis is available which allows for finer-grained resolution. On this axis, the magic value 120 represents one logical traditional wheel click but a device may send a fraction of 120 for a smaller motion. Userspace can either accumulate the values until it hits a full 120 for one wheel click or it can scroll by a few pixels on each event for a smoother experience. The same principle is applied to REL_HWHEEL and REL_HWHEEL_HI_RES for horizontal scroll wheels (which these days is just tilting the wheel). The REL_WHEEL axis is now emulated by the kernel and simply sent out whenever we have accumulated 120.
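
For userspace folks, consuming the new axis amounts to something like this (a sketch of the accumulation logic only, not code from libinput or any other specific library):

CLICK = 120   # one logical wheel click, as defined above

class WheelAccumulator:
    def __init__(self):
        self.residual = 0

    def feed(self, hi_res_value: int) -> int:
        # Scroll smoothly by hi_res_value/CLICK of a click here if you like,
        # and/or accumulate until we have a whole click in either direction.
        self.residual += hi_res_value
        clicks = int(self.residual / CLICK)   # truncate toward zero, works for both directions
        self.residual -= clicks * CLICK
        return clicks

acc = WheelAccumulator()
for value in 8 * [15]:                        # e.g. a mouse with a multiplier of 8
    if acc.feed(value):
        print("one wheel click")              # fires once, on the eighth event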

Important to note: REL_WHEEL and REL_HWHEEL are now legacy axes and should be ignored by code handling the respective high-resolution version.

The magic value of 120 is taken directly from Windows. That value was chosen because it has a good number of integer factors, so dividing 120 by whatever multiplier the mouse uses gives you an integer fraction of 120. And because HW manufacturers want it to work on Windows, we can rely on them doing it right, provided we use the same approach.

There are two implementations that matter. Harry's patches enable the high-resolution scrolling on Logitech mice which seem to mostly have a multiplier of 8 (i.e. REL_WHEEL_HI_RES will send eight events with a value of 15 before REL_WHEEL sends 1 click). There are some interesting side-effects with e.g. the MX Anywhere 2S. In high-resolution mode with a multiplier of 8, a single wheel movement does not always give us 8 events, the firmware does its own magic here. So we have some emulation code in place with the goal of making the REL_WHEEL event happen on the mid-point of a wheel click motion. The exact point can shift a bit when the device sends 7 events instead of 8 so we have a few extra bits in place to reset after timeouts and direction changes to make sure the wheel behaviour is as consistent as possible.

The second implementation is for the generic HID protocol. This was all added for Windows Vista, so we're only about a decade behind here. Microsoft got the Resolution Multiplier feature into the official HID documentation (possibly in the hope that other HW manufacturers implement it which afaict didn't happen). This feature effectively provides a fixed value multiplier that the device applies in hardware when enabled. It's basically the same as the Logitech one except it's set through a HID feature instead of a vendor-specific protocol. On the devices tested so far (all Microsoft mice because no-one else seems to implement this) the multipliers vary a bit, ranging from 4 to 12. And the exact behaviour varies too. One mouse behaves correctly (Microsoft Comfort Optical Mouse 3000) and sends more events than before. Other mice just send the multiplied value instead of the normal value, so nothing really changes. And at least one mouse (Microsoft Sculpt Ergonomic) sends the tilt-wheel values more frequently and with a higher value. So instead of one event with value 1 every X ms, we now get an event with value 3 every X/4 ms. The mice tested do not drop events like the Logitech mice do, so we don't need fancy emulation code here. Either way, we map this into the 120 range correctly now, so userspace gets to benefit.

As mentioned above, the Resolution Multiplier HID feature was introduced for Windows Vista which is... not the most recent release. I have a strong suspicion that Microsoft dumped this feature as well, the most recent set of mice I have access to don't provide the feature anymore (they have vendor-private protocols that we don't know about instead). So the takeaway for all this is: if you have a Logitech mouse, you'll get higher-resolution scrolling on v4.21. If you have a Microsoft mouse a few years old, you may get high-resolution wheel scrolling if the device supports it. Any other vendor or a new Microsoft mouse, you don't get it.

Coincidentally, if you know anyone at Microsoft who can provide me with the specs for their custom protocol, I'd appreciate it. We'd love to have support for it both in libratbag and in the kernel. Or any other vendor, come to think of it.

Posted Wed Dec 12 04:27:00 2018 Tags:

This time we're digging into HID - Human Interface Devices and more specifically the protocol your mouse, touchpad, joystick, keyboard, etc. use to talk to your computer.

Remember the good old days when you had to install a custom driver for every input device? Remember when PS/2 (the protocol) had to be extended to accommodate mouse wheels, and then again for five-button mice? And you had to select the right protocol to make it work. Yeah, me neither, I tend to suppress those memories because the world is awful enough as it is.

As users we generally like devices to work out of the box. Hardware manufacturers generally like to add bits and bobs because otherwise who would buy that new device when last year's device looks identical. This difference in needs can only be solved by one superhero: Committee-man, with the superpower to survive endless meetings and get RFCs approved.

Many many moons ago, when USB itself was in its infancy, Committee man and his sidekick Caffeine boy got the USB consortium to agree on a standard for input devices that is so self-descriptive that operating systems (Win95!) can write one driver that can handle this year's device, and next year's, and so on. No need to install extra drivers, your device will just work out of the box. And so HID was born. This may only be an approximate summary of history.

Originally HID was designed to work over USB. But just like Shrek the technology world is obsessed with layers so these days HID works over different transport layers. HID over USB is what your mouse uses, HID over i2c may be what your touchpad uses. HID works over Bluetooth and its celebrity-diet version, BLE. Somewhere, someone out there is very slowly moving a mouse pointer by sending HID over carrier pigeons just to prove a point. Because there's always that one guy.

HID is incredibly simple in that the static description of the device can just be bytes burnt into the ROM like the Australian sun into unprepared English backpackers. And the event frames are often an identical series of bytes where every bit is filled in by the firmware according to the axis/buttons/etc.

HID is incredibly complicated because parsing it is a stack-based mental overload. Each individual protocol item is simple but getting it right and all into your head is tricky. Luckily, I'm here for you to make this simpler to understand or, failing that, at least more entertaining.

As said above, the purpose of HID is to make devices describe themselves in a generic manner so that you can have a single driver handle any input device. The idea is that the host parses that standard protocol and knows exactly how the device will behave. This has worked out great, we only have around 200 files dealing with vendor- and hardware-specific HID quirks as of v4.20.

HID messages are Reports. And to know what a Report means and how to interpret it, you need a Report Descriptor. That Report Descriptor is static and contains a series of bytes detailing "what" and "where", i.e. what a sequence of bits represents and where to find those bits in the Report. So let's try and parse one of these Report Descriptors, say the one for a fictional mouse with a few buttons. How exciting, we're at the forefront of innovation here.

The Report Descriptor consists of a bunch of Items. A parser reads the next Item, processes the information within and moves on. Items are small (1 byte header, 0-4 bytes payload) and generally carry exactly one tiny little bit of information. You need to accumulate several items to build up enough information to actually know what's happening.

The "what" question of the Report Descriptor is answered with the so-called Usage. This could be something simple like X or Y (0x30 and 0x31) or something more esoteric like System Menu Exit (0x88). A Usage is 16 bits but all Usages are grouped into so-called Usage Pages. A Usage Page too is a 16 bit value and together they form the 32-bit value that tells us what the device can do. Examples:


0001 0031 # Generic Desktop, Y
0001 0088 # Generic Desktop, System Menu Exit
0003 0005 # VR Controls, Head Tracker
0003 0006 # VR Controls, Head Mounted Display
0004 0031 # Keyboard, Keyboard \ and |
Note how the Usage in the last item is the same as in the first one; without the Usage Page you will mix things up. It helps if you always think of the Usage as a 32-bit number. For your kids' bed-time story time, here are the HID Usage Tables from 2004 and the approved HID Usage Table Review Requests of the last decade. Because nothing puts them to sleep quicker than droning on about hex numbers associated with remote control buttons.
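If it helps, here's what "think of it as a 32-bit number" boils down to. This is just a throwaway Python snippet of mine, not anything from the spec:

usage_page = 0x0001                        # Generic Desktop
usage = 0x0031                             # Y
print(hex((usage_page << 16) | usage))     # 0x10031, i.e. the first line in the list above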

To successfully interpret a Report from the device, you need to know which bits have which Usage associated with them. So let's go back to our innovative mouse. We would want a report descriptor with 6 items like this:


Usage Page (Generic Desktop)
Usage (X)
Report Size (16)
Usage Page (Generic Desktop)
Usage (Y)
Report Size (16)
This basically tells the host: X and Y both have 16 bits. So if we get a 4-byte Report from the device, we know two bytes are for X, two for Y.

HID was invented at a time when bits were more expensive than printer ink, so we can't afford to waste any bits (still the case because who would want to spend an extra penny on more ROM). HID makes use of so-called Global items; once those are set, their value applies to all following items until changed. Usage Page and Report Size are such Global items, so the above report descriptor is really implemented like this:


Usage Page (Generic Desktop)
Usage (X)
Usage (Y)
Report Count (2)
Report Size (16)
Input (Data,Var,Rel)
The Report Count just tells us that 2 fields of the current Report Size are coming up. We have two usages, two fields, and 16 bits each so we know what to do. The Input item is sort-of the marker for the end of the stack, it basically tells us "process what you've seen so far", together with a few flags. Rel in this case means that the Usages are relative. Oh, and Input means that this is data from device to host. Output would be data from host to device, e.g. to set LEDs on a keyboard. There's also Feature which indicates configurable items.
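To make the Global/Local bookkeeping a bit more concrete, here's a toy Python sketch of the accumulation logic. It is not the kernel's parser and it skips the byte-level item decoding entirely; the items are pre-decoded into (name, value) pairs purely for illustration:

items = [
    ("Usage Page", 0x01),     # Generic Desktop
    ("Usage", 0x30),          # X
    ("Usage", 0x31),          # Y
    ("Report Count", 2),
    ("Report Size", 16),
    ("Input", "Data,Var,Rel"),
]

globals_state = {}   # Global items keep their value until changed
usages = []          # Local items (Usage) reset after each Input
fields = []          # (32-bit usage, bit offset, bit size)
bit_offset = 0

for name, value in items:
    if name in ("Usage Page", "Report Size", "Report Count"):
        globals_state[name] = value
    elif name == "Usage":
        usages.append((globals_state["Usage Page"] << 16) | value)
    elif name == "Input":
        size = globals_state["Report Size"]
        for i in range(globals_state["Report Count"]):
            usage = usages[i] if i < len(usages) else None
            fields.append((usage, bit_offset, size))
            bit_offset += size
        usages = []   # Locals don't survive past the Input

for usage, offset, size in fields:
    print(hex(usage), offset, size)   # 0x10030 0 16, then 0x10031 16 16

Run against the descriptor above, it tells us that X lives in bits 0-15 of the report and Y in bits 16-31, which is all a parser really needs to know.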

Buttons on a device are generally just numbered, so it'd be a monumental, 16-bits-at-a-time waste to have HID send Usage (Button1), Usage (Button2), etc. for every button on the device. HID instead provides a Usage Minimum and Usage Maximum to order them sequentially. That looks like this:


Usage Page (Button)
Usage Minimum (1)
Usage Maximum (5)
Report Count (5)
Report Size (1)
Input (Data,Var,Abs)
So we have 5 buttons here and each button has one bit. Note how the buttons are Abs because a button state is not a relative value, it's either down or up. HID is quite intolerant to Schrödinger's thought experiments.

Let's put the two things together and we have an almost-correct Report descriptor:


Usage Page (Button)
Usage Minimum (1)
Usage Maximum (5)
Report Count (5)
Report Size (1)
Input (Data,Var,Abs)

Report Size (3)
Report Count (1)
Input (Cnst,Arr,Abs)

Usage Page (Generic Desktop)
Usage (X)
Usage (Y)
Report Count (2)
Report Size (16)
Input (Data,Var,Rel)
New here is Cnst. This signals that the bits have a constant value, thus don't need a Usage and basically don't matter (haha. yeah, right. in theory). Linux does indeed ignore those. Cnst is used for padding to align on byte boundaries - 5 bits for buttons plus 3 bits padding make 8 bits. Which makes one byte as everyone agrees except for granddad over there in the corner. I don't know how he got in.

Were we to get a 5-byte Report from the device, we'd parse it approximately like this:


button_state = byte[0] & 0x1f
x = byte[1] | (byte[2] << 8)
y = byte[3] | (byte[4] << 8)
Hooray, we're almost ready. Except not. We may need more info to correctly interpret the data within those reports.

The Logical Minimum and Logical Maximum specify the value range of the actual data. We need this to tell us whether the data is signed and what the allowable range is. Together with the Physical Minimum and the Physical Maximum they specify what the values really mean. In the simple case:


Usage Page (Generic Desktop)
Usage (X)
Usage (Y)
Report Count (2)
Report Size (16)
Logical Minimum (-32767)
Logical Maximum (32767)
Input (Data,Var,Rel)
This just means our x/y data is signed. Easy. But consider this combination:

...
Logical Minimum (0)
Logical Maximum (1)
Physical Minimum (1)
Physical Maximum (12)
This means that if the bit is 0, the effective value is 1. If the bit is 1, the effective value is 12.
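In other words, Logical and Physical ranges describe a linear mapping from the bits on the wire to real-world values. A rough sketch of that mapping (mine; it ignores the Unit and Unit Exponent items and the special cases the spec allows, like Physical Minimum and Maximum both being zero):

def logical_to_physical(value, log_min, log_max, phys_min, phys_max):
    # straight linear interpolation between the two ranges
    return phys_min + (value - log_min) * (phys_max - phys_min) / (log_max - log_min)

print(logical_to_physical(0, 0, 1, 1, 12))   # 1.0
print(logical_to_physical(1, 0, 1, 1, 12))   # 12.0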

Note that the above is one report only. Devices may have multiple Reports, indicated by the Report ID. So our Report Descriptor may look like this:


Report ID (01)
Usage Page (Button)
Usage Minimum (1)
Usage Maximum (5)
Report Count (5)
Report Size (1)
Input (Data,Var,Abs)
Report Size (3)
Report Count (1)
Input (Cnst,Arr,Abs)

Report ID (02)
Usage Page (Generic Desktop)
Usage (X)
Usage (Y)
Report Count (2)
Report Size (16)
Input (Data,Var,Rel)
If we were to get a Report now, we need to check byte 0 for the Report ID so we know what we're dealing with, i.e. our single-use hard-coded parser would look like this:

if byte[0] == 0x01:
    button_state = byte[1] & 0x1f
elif byte[0] == 0x02:
    x = byte[2] | (byte[3] << 8)
    y = byte[4] | (byte[5] << 8)
A device may use multiple Reports if the hardware doesn't gather all data within the same hardware bits. Now, you may ask: if I get fifteen reports, how should I know what belongs together? Good question, and lucky for you the HID designers are miles ahead of you. Report IDs are grouped into Collections.

Collections can have multiple types. An Application Collection describes a set of inputs that make sense as a whole. Every Report Descriptor must define at least one Application Collection, but you may have two or more. For example, a keyboard with integrated trackpoint should and/or would use two. This is how the kernel knows it needs to create two separate event nodes for the device. Application Collections have a few reserved Usages that indicate to the host what type of device this is. These are e.g. Mouse, Joystick, Consumer Control. If you ever wondered why you have a device named like "Logitech G500s Laser Gaming Mouse Consumer Control", this is the kernel simply appending the Application Collection's Usage to the device name.

A Physical Collection indicates that the data is collected at one physical point, though what a point is is a bit blurry. Theoretical physicists will disagree, but a point can be "a mouse". So it's quite common for all reports on a mouse to be wrapped in one Physical Collection. If you have a device with two sets of sensors, you'd have two collections to illustrate which ones go together. Physical Collections also have reserved Usages like Pointer or Head Tracker.

Finally, a Logical Collection just indicates that some bits of data belong together, whatever that means. The HID spec uses the example of buffer length field and buffer data but it's also common for all inputs from a mouse to be grouped together. A quick check of my mice here shows that Logitech doesn't wrap the data into a Logical Collection but Microsoft's firmware does. Because where would we be if we all did the same thing...

Anyway. Now that we know about collections, let's look at a whole report descriptor as seen in the wild:


Usage Page (Generic Desktop)
Usage (Mouse)
Collection (Application)
Usage Page (Generic Desktop)
Usage (Mouse)
Collection (Logical)
Report ID (26)
Usage (Pointer)
Collection (Physical)
Usage Page (Button)
Usage Minimum (1)
Usage Maximum (5)
Report Count (5)
Report Size (1)
Logical Minimum (0)
Logical Maximum (1)
Input (Data,Var,Abs)
Report Size (3)
Report Count (1)
Input (Cnst,Arr,Abs)
Usage Page (Generic Desktop)
Usage (X)
Usage (Y)
Report Count (2)
Report Size (16)
Logical Minimum (-32767)
Logical Maximum (32767)
Input (Data,Var,Rel)
Usage (Wheel)
Physical Minimum (0)
Physical Maximum (0)
Report Count (1)
Report Size (16)
Logical Minimum (-32767)
Logical Maximum (32767)
Input (Data,Var,Rel)
End Collection
End Collection
End Collection
We have one Application Collection (Generic Desktop, Mouse) that contains one Logical Collection (Generic Desktop, Mouse). That contains one Physical Collection (Generic Desktop, Pointer). Our actual Report (and we have only one but it has the decimal ID 26) has 5 buttons, two 16-bit axes (x and y) and finally another 16 bit axis for the Wheel. This device will thus send 8-byte reports and our parser will do:

if byte[0] != 0x1a: # it's decimal in the above descriptor
    error, should be 26
button_state = byte[1] & 0x1f
x = byte[2] | (byte[3] << 8)
y = byte[4] | (byte[5] << 8)
wheel = byte[6] | (byte[7] << 8)
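For completeness, here's the same hard-coded parse as a small runnable Python sketch. The only thing it adds over the pseudocode is sign handling: the Logical Minimum of -32767 tells us the axes are signed, and HID multi-byte fields are little-endian, so struct can do the byte fiddling for us. The function and variable names are mine:

import struct

def parse_mouse_report(report):
    # <BBhhh: report ID, buttons + padding, then three signed 16-bit little-endian axes
    report_id, buttons, x, y, wheel = struct.unpack("<BBhhh", report)
    assert report_id == 26
    return buttons & 0x1f, x, y, wheel

# buttons 1 and 3 down, x = -2, y = 5, wheel = 0
print(parse_mouse_report(bytes([26, 0b00101, 0xfe, 0xff, 5, 0, 0, 0])))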
That's it. Now, obviously, you can't write a parser for every HID descriptor out there so your actual parsing code needs to be generic. The Linux kernel does exactly that and so does everything else that needs to parse HID. There's a huge variety in devices out there, all with HID descriptors that may or may not be correct. As with so much in life, correct HID implementations are often defined by "whatever Windows accepts" so if you like playing catch-up, Linux development is for you.

Oh, in case you just got a bit too optimistic about the state of the world: HID allows for vendor-defined usages. Which does exactly what you'd think it does, it hides vendor-specific protocol inside what should be a generic protocol. There are devices with hidden report IDs that you can only unlock by sending the right magic sequence to the report and/or by defeating the boss on Level 4. Usually those devices present themselves as basic/normal devices over HID but if you know the magic sequence you get to use *gasp* all buttons. Or access the device-specific configuration features. Logitech's HID++ is just one example here but at least that's one where we have most of the specs available.

The above describes how to parse the HID report descriptor and interpret the reports. But what happens once you have a HID report correctly parsed? In the case of the Linux kernel, once the report descriptor is parsed evdev nodes are created (one per Application Collection, more or less). As the Reports come in, they are mapped to evdev codes and the data appears on the evdev node. That's where userspace like libinput can pick it up. That bit is actually quite simple (mostly anyway).

The above output was generated with the tools from the hid-tools repository. Go forth and hid-record.

Posted Tue Dec 11 05:56:00 2018 Tags:

From a book I've been reading:

    Power is certainly important, particularly in dictatorships, in places where constitutions, laws, unwritten rules, traditions and understandings don't count. But in a healthy democracy, power is a surprisingly limited element. And the unwritten conventions, understandings, forms of respect for how things are done, for how citizens relate to government and to each other, are surprisingly important. Why? Because if democracy is only power, then what we are left with is a system of deep distrust. Why? Because if only power matters - even if it is the result of an election - then the government feels that it has a mandate to do whatever it wants, that the law is there principally to serve power. If democracy is only about winning power and using it, then it has been deformed into a denial of society and of the idea of responsible citizenship.

    And that is the increasingly common characteristic of government, even in democracies. Only power matters. This is partly the outcome of government being de-intellectualized. Elections are now thought to be unsuitable moments for real debates over ideas. In between these elections the focus is on administrative problems - legalistic, managerial undertakings. In this case, real debates over ideas are unnecessary because the decision about power was taken on election night. There is nothing, therefore, to debate. Worse still, the efficient putting in place of programs to be administered can only be made inefficient by debate.

    And so we are witnessing a growth in the Napoleonic or Mussolinian corporatist idea that when citizens vote in an election, it is actually an all-purpose referendum or plebiscite. It is then the winner's job to get on with running things. First win an election, then administer as you wish. Omnibus bills are one of the ways you can speed things up; they are a great way to convert the deeply inefficient process of democracy, with all its thinking, debating and complex differences of opinion, into a sort of shop-floor system of utilitarian efficiency. Therefore, once elected, a government has broad, unlimited permission.

    [...]

    In any case, today's plebiscitary approach is both populist and anti-democratic. What it amounts to is this: We won the election. We have power. It is now a matter of administrative efficiency. We can do what we want.

        - John Ralston Saul, The Comeback, 2014

This finally put into words a concept I've been struggling with for a long time: the idea that ethics and responsibility come with (should come with) political power, and how they seem to be on the decline lately.

For example, I recently learned, to my horror, about the concept of a Frankenstein Veto, where in some U.S. states, the governor can just delete words (eg. "not") from a bill passed by the legislature, producing and then approving a bill with a completely different meaning than was intended.

The exact legal machinations of this are beside the point. The point is, politicians increasingly feel that they "first win an election, then administer as you wish." That almost sounds logical, so why does it feel so wrong?

Because it is wrong. The job of an elected official isn't to do whatever they want. It's to figure out what the people want (or need), and to deliver that, in accordance with principles and ethics. This is a surprisingly selfless expectation: sometimes the right thing to do is the opposite of what you want to do. And it can be hard to figure out what's right, which is why we have debates, and why we listen to our opponents in those debates, even when we have a majority and they're "merely" the opposition.

Even direct polling on issues doesn't always work, because sometimes issues are too subtle for the population to simply vote yes or no; that's why we elect representatives who will presumably do their homework, consult the right people, and figure out the right, complex answer. At least one thing is clear: if the legislature vigorously debates an issue and decides one thing, a governor passing a law that does the opposite - or even rejecting the one that was so carefully produced - is just not right.

I've been struck many times by the way the U.S. system of political "checks and balances" treats politicians like children: none of them can be trusted, so we must restrict them at every opportunity. The underlying assumption is that they will all operate without honour. As most of us eventually learn in our human relationships, if you treat people as children - or worse, as actively malicious adults - then they will tend to meet your expectations. The great irony is that in countries with less of a focus on checking "power" (such as Canada), politicians seem to abuse their power less often and less dramatically.

The above quote made me think that maybe the secret recipe is some combination of traditions, conventions, understanding, and respect. Maybe those are more effective balances to power than any system designed to simply neutralize it. And maybe this was obvious to the people who designed it, and we've forgotten, long ago.

Posted Sun Dec 9 12:14:41 2018 Tags:

.NET Foundation Changes

Today we announced a major change to the .NET Foundation, in which we fundamentally changed the way that the foundation operates. The new foundation draws inspiration from the Gnome Foundation and the F# Foundation.

We are making the following changes:

  • The Board of Directors of the Foundation will now be elected by the .NET Foundation membership, and they will be in charge of steering the direction of the foundation. The Board of Directors will be elected annually via direct vote from the members of the Foundation, with just one permanent member from Microsoft.

  • Anyone contributing to projects in the .NET Foundation can become a voting member of the Foundation. The main benefit is that you get to vote for who should represent you on the board of directors. Membership is based on contributions to the projects in the foundation, which can be code contributions, documentation, evangelism or other activities that advance .NET and its ecosystem.

  • Membership fee: we are adding a membership fee that will give the .NET Foundation independence from Microsoft when it comes to how it chooses to promote .NET and the ecosystem around it. We realize that not everyone can pay this fee, so this fee can be waived. But those that contribute to the Foundation will help us fund activities that will expand .NET.

  • We intend to have elections every year, so individuals will campaign on what they intend to bring to the board.

  • There is a limit on the number of board members representing a single company, which prevents the board from being stacked with contributors from a single company and encourages our community to vote for board members with diverse backgrounds, strengthening the views of the board.

  • Companies do not vote. Only individual contributors to the .NET ecosystem vote; they may be affiliated with a company, but the companies themselves have no vote. Our corporate sponsors are sponsors because they care as much as we do about the growth and health of our ecosystem.

These changes are very close to my heart; it took a lot of work to make them happen and to make sure that Microsoft, the company, was comfortable with giving up control over the .NET Foundation.

I want to thank Jon Galloway, the Executive Director of the current .NET Foundation, for helping make this a reality.

Going from the idea to the execution took a long time. Martin Woodward did some of the early foot work to get various people at Microsoft comfortable with the idea. Then Jon took over, and had to continue this process to get everyone on board and get everyone to accept that our little baby was ready to graduate, go to college and start its own independent life.

I want to thank my peers in the board of directors that supported this move, Scott Hunter, Oren Novotny, Rachel Reese as well as the entire supporting crew that helped us make this happen, Beth Massi, Jay Schmelzer and the various heroes in the Microsoft legal department that crossed all the t’s and dotted all the i’s.

See you on the campaign trail!

Posted Tue Dec 4 17:25:08 2018 Tags:

Last week while reviewing a patch I read that some gaming keyboards have two modes - keyboard mode and gaming mode. When in gaming mode, the keys send out pre-recorded macros when pressed. Presumably (I am not a gamer) this is to record keyboard shortcuts for quicker access to various functionality. The macros are stored in the hardware and are thus relatively independent of the host system, provided you have access to the custom protocol, which you probably don't when you're on Linux. But I digress.

I reckoned this could be done in software and work with any 5 dollar USB keyboard. A few hours later, I have this working now: ggkbdd. It sits directly above the kernel and waits for key events. Once the 'mode key' is hit, the keyboard will send pre-configured key sequences for the respective keys. Hitting the mode key again (or ESC) switches back to normal mode.
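For the curious, the general shape of the thing looks roughly like the sketch below. This is not ggkbdd's actual code, just an illustration of the approach using the python-evdev bindings; the device path, mode key and macro table are all made up, it needs the right permissions on /dev/input and /dev/uinput, and a real implementation would also grab the device so the original key presses don't leak through.

from evdev import InputDevice, UInput, ecodes

MODE_KEY = ecodes.KEY_F12                                # hypothetical mode key
MACROS = {ecodes.KEY_A: [ecodes.KEY_H, ecodes.KEY_I]}    # 'a' types "hi"

dev = InputDevice("/dev/input/event3")    # your keyboard's event node
ui = UInput()                             # virtual device to inject macro keys
gaming_mode = False

for event in dev.read_loop():
    if event.type != ecodes.EV_KEY or event.value != 1:  # key presses only
        continue
    if event.code == MODE_KEY:
        gaming_mode = not gaming_mode
    elif gaming_mode and event.code in MACROS:
        for key in MACROS[event.code]:
            ui.write(ecodes.EV_KEY, key, 1)   # press
            ui.write(ecodes.EV_KEY, key, 0)   # release
        ui.syn()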

There's a lot of functionality that is missing, such as integration with the desktop (probably via DBus), better security (dropping privs, masking the fd to avoid accidental key logging), better system integration (request fds from logind, possibly through the compositor). And error handling, etc. I think the total time spent on this is somewhere between 3 and 4 hours, and that includes the time to write this blog post and debug the systemd unit autostartup. There are likely other projects that solve it the same way, or at least in a similar manner. I didn't check.

This was done as proof-of-concept and

  • I don't know if it's useful and if so, what the use-cases are
  • I don't know if I will have any time to fix things on this
  • I don't know if other (better developed) projects already occupy that space
In the grand glorious future and provided this is indeed something generally useful, this would need compositor integration. Not sure we'll ever get to that point. Meanwhile, consider this a code drop for a proof-of-concept and expect that you'll have to fix any bugs yourself.
Posted Mon Dec 3 05:43:00 2018 Tags:

A while ago I was discussing hiring plans with a co-founder of a recently-funded startup. Their story was something like this:

"The company is growing very fast and we now have about a dozen people. We have tight deadlines, and none of the founders has much experience in project management or people management. So, we're trying to hire an experienced project manager and an experienced engineering lead / people manager."

This course of action is pretty common. It also always makes me nervous, but I haven't thought much about it until now. This discussion finally forced me to clarify my thoughts on the subject.

First of all, I asked the co-founder what their role is. The answer was a good one, albeit vague, as startup roles always are: "I don't really know. It's frustrating that I never know. But I do know my job is whatever it takes to make the company successful."

Well, what are the most important problems that the other co-founders aren't solving? The answer came quickly: "We don't have a clear idea of our project schedule. And nobody is managing interpersonal, cultural, social issues. There are lots of other problems, but people are handling those." So those were the top two. Nobody was managing those, and it was getting serious.

This reminded me of some advice I must have heard somewhere, but no longer remember where: Don't delegate the most important thing. Or maybe it was a 1980's management book: Don't outsource your core competency.1

This advice is counterintuitive. At first glance it sounds like it should be intuitive, but once you think about it, it's not. The most important thing - whatever it is! - needs to be done well. Shouldn't we hire the best person possible to handle it? And isn't it vanishingly unlikely that the person already works here?

You're right! In an ideal world, you would get the best, most motivated, most passionate project manager in the world to manage your project. And the best, most motivated, most empathetic, most wise people manager to manage your people. But the real world is rarely ideal. You are unlikely to find the best person in the world on short notice. You probably can't afford to pay them what they deserve. In a wide world where they have their choice of exciting projects, they (statistically) probably don't care enough about your project anyway. And because they're so good, they're probably accustomed to leading huge teams with huge problems that need huge solutions.

In short, you aren't likely to hire the ideal person. You can probably hire a pretty good person. Or you might get unlucky and hire a terribly mismatched person.

That variability is a huge problem. Even if the average person you could hire might be better than you, the standard deviation is very high. It's essentially a random chance, a lottery ticket. Do you want to buy a lottery ticket for the (self-selected) most important thing in your entire domain, whatever it is?

The alternative is to find a person with less specific expertise, but who can understand the company vision, connect with everyone on the team personally, be highly invested in the shared outcome (not just personal gain), and be so passionate about the project that they'd be willing to step aside if they can someday find someone who is a better fit. As a bonus, this person already works at your company: it's you.

Relatively speaking, it's actually pretty easy to learn project management and scheduling or people management. It takes time and effort, but not much more, to become passably decent at either one. Whereas you can't buy shared vision, personal knowledge, shared motivation, or humility, for any amount of money.

Once the schedule or the management of small teams is no longer your most critical problem, then delegate those. Better still, by then you will know how to do the jobs, so you will be very good at interviewing your replacement.

Footnotes

1 "Core competencies" is one of those once-useful terms that has been badly diluted through overuse. It mostly just means "things that uniquely make you special compared to your competitors." If you're a software company, that's probably your software.2 So buy some commodity package (or service) to do your accounting, which doesn't differentiate you, and hire people to write your software, which does.

2 Even at a software company, sometimes your software is crappy and your customer service or sales organizations are your real advantage. Confusion about this has led to horrible strategic blunders. IBM's core competency, for example, clearly is not software.

Posted Fri Nov 30 21:20:48 2018 Tags:
I regularly receive questions from students in the field of computer science looking for career advice.

Here's an answer I wrote to one of them. It's not comprehensive or anything, but I thought people might find it interesting.

[A question about whether to choose a 9-5 job or be an entrepreneur]

The question about "9-5" vs. "entrepreneur" is a complex one -- not everybody can be a successful entrepreneur (who would do the work? :-) and not everybody has the temperament for it. For me personally it was never an option -- there are vast parts of management and entrepreneurship that I wouldn't enjoy doing, such as hiring (I hate interviewing and am bad at it) and firing (too emotionally draining -- even just giving negative feedback is hard for me). Pitching ideas to investors is another thing that I'd rather do without.

If any of that resonates with you, you may be better off not opting for entrepreneurship -- the kind of 9-5 software development jobs I have had are actually (mostly) very rewarding: I get to write software that gets used by hundreds or thousands of other developers (or millions in the case of Python), and those other developers in turn use my software to produce products that get used by hundreds of thousands or, indeed, hundreds of millions of users. Not every 9-5 job is the same! For me personally, I don't like the product stuff (since usually that means it's products I have no interest in using myself), but "your mileage may vary" (as they say in the US). Just try to do better than an entry-level web development job; that particular field (editing HTML and CSS) is likely to be automated away, and would feel repetitive to me.

[A question about whether AI would make human software developers redundant (not about what I think of the field of AI as a career choice)]

Regarding AI, I'm not worried at all. The field is focused on automating boring, repetitive tasks like driving a car or recognizing faces, which humans can learn to do easily but find boring if they have to do it all the time. The field of software engineering (which includes the field of AI) is never boring, since as soon as a task is repetitive, you automate it, and you start solving new problems.
Posted Mon Nov 26 17:13:00 2018 Tags:

I recently decided to switch my laptop from a Macbook to a Chromebook, partly because Apple's keyboards are so terrible lately, and partly because ChromeOS is suddenly useful now that they invented Crostini.

(Some people ask why I, a person who actually knows how to use Linux and has debugged wifi drivers and XF86Config files, would want to use a locked-down desktop Linux variant instead of just installing Debian or something. And I do install Debian, on desktop hardware. But on a laptop, hardware support is paramount: external monitors (eg. for presentations), bluetooth audio (for music while travelling), long battery life, and rapid, non-crashy suspend/resume, are all really important to me. ChromeOS actually does all that stuff reliably nowadays, because they design the OS and the hardware at the same time. Debian can't compete with that.)

One showstopper for me when I'm trying to do software development, however, is having a proper window manager, which is to say, one that I can run without resorting to a mouse or touchpad. Because I am old and crusty and unreasonably opinionated, the one I want to run is ion1. I had it working on MacOS, but I wanted it on ChromeOS.

Now, modern ChromeOS uses Wayland as its display manager, not X11, which is of some concern because ion1 stopped evolving more than a decade ago (which is how I like it) and therefore only understands X11. Also, ChromeOS provides a Wayland compositor and doesn't let you replace it from Crostini, even though they do let you securely launch Wayland and X11 windows from Crostini (which is pretty cool).

Do we give up? No! The "obvious" "solution" is Xnest, an "X proxy" from before the dawn of time, which puts all your windows inside one big window. So I created one big full-screen Xnest window (managed by Wayland), and then ran ion1 and a bunch of rxvt terminals inside.

This actually worked almost right, except: ion1 doesn't support these fancypants client-rendered fonts. No. It's old and crusty, like me. It expects the X server to render its fonts. And unfortunately, ChromeOS contains only about four fonts in its X server, all of which are hopelessly microscopic on the 200dpi screen in my Chromebook. Oops.

Luckily, about 11 years ago, slightly after the dawn of time, someone else didn't like some Xnest limitations and made Xephyr, which apparently is more of a framebuffer and less of a proxy, the upshot of which is that it renders its own "server side" fonts (which from Wayland's point of view are on the client side, but from ion1's point of view are definitely on the server side). As a bonus, thanks to xrandr, it understands the idea of having its window resized, so I can use the handy ChromeOS "full screen" key and have it do something nice. Or drag it to an external monitor and get decent results.
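In case you want to try something similar, the setup boils down to something like this. A rough sketch only: the display number and geometry are arbitrary, ion1's binary name may differ on your system, and you'd normally start these from a session script rather than Python.

import os, subprocess, time

# Nested X server on display :1; the Wayland compositor manages its window.
xephyr = subprocess.Popen(["Xephyr", ":1", "-screen", "1920x1080"])
time.sleep(1)    # crude: give the server a moment to come up

env = dict(os.environ, DISPLAY=":1")
subprocess.Popen(["ion"], env=env)     # the window manager
subprocess.Popen(["rxvt"], env=env)    # something to actually type into

xephyr.wait()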

I didn't try to do anything with 3D, but nominally Xephyr can do that too. But Crostini supposedly can't. I don't know and I don't really care, I'm just trying to run some terminals here.

Xephyr and its DPI calculation

One weird problem I've had with Xephyr is that whatever it's doing to calculate "dots per inch" (as reported by xdpyinfo) is completely insane. It starts off with a value that is definitely not the same as its host display, and then if you resize the window, it just changes the value and gets more and more confused. This is bad for people who want to specify their font size in points so that fonts will be roughly the same size no matter what size the display is.

As far as I can tell, this is just a bug in Xephyr. But if anybody knows what's going on or (especially) how to fix it, I'd love to know. Meanwhile, I learned to specify my font sizes in pixels instead of points.

Cut and paste

An even more annoying problem, which deserves its own section, is the question of how to deal with cut-and-paste between my Xephyr+ion1+rxvt session and the toplevel Wayland session. This is needed for two reasons:

  1. I want to copy URLs and text between my web browser and my terminals.

  2. I want to be able to run multiple Xephyr sessions and share text between them.

By default, Xephyr appears to have no clipboard sync at all between its internal clipboard and its host server's clipboard. That's no fun. (To their credit, ChromeOS does seem to manage to sync the clipboard between Wayland and its toplevel XWayland session, which is essential if we want anything to work. It's just Xnest and Xephyr that break the chain.)

Now, there are a few things you should know about X11 clipboards. The canonical explanation is jwz's X Selections, Cut Buffers, and Kill Rings, which is quite excellent and gives some background on how it's not the X11 clipboard that's crazy, it's all the apps using it.

So anyway, with that background in mind, all we need to do is magically keep the clipboard in sync between :0 (the toplevel XWayland server, which is synced with Wayland and Chrome), and :1 (inside my Xephyr server), and ideally :2 .. :n (inside other nested Xephyr servers). How hard can it be?

Well, apparently it can be hard. The best answers I could find on the Internet (which I won't link to, because they suck) are:

  1. Run a script that uses xclip to periodically grab the clipboard content from each server. If one server has different clipboard content than the currently-expected content, then copy it to the other server. This method has a few problems: first, you have to choose a periodicity for the sync process, which is inevitably either annoyingly long or battery-killingly short. Second, the "content based" sync decision is rather error prone and results in potentially unstable race conditions, especially with multi-way sync. And third, typical implementations are a little too pushy about copying the clipboard data to all screens: jwz's article talks about this problem in reference to the "X cut buffer" support before the new-style support was added. If you highlight/copy text frequently or in large volumes, it's pretty wasteful to copy it to other screens before it's needed for pasting.

  2. Run synergy, a tool that lets you seamlessly extend your mouse/keyboard/clipboard across multiple displays on multiple computers. This was very tempting, despite being severe overkill (I don't want to extend my mouse and keyboard, just my clipboard). Unfortunately, it didn't work. It almost worked. But it didn't.

Luckily(?) for you, I spent quite some time diagnosing why synergy didn't work for me in my use case. The symptom was that it would sync the clipboard in only one direction (say A->B), and only the first time I copied something. If I copied another thing on A, the clipboard on B would not be updated. To make it update, I had to copy something on B (which always fails to sync to A), and then copy something on A (which would work).

How hard can it be? I thought to myself, again, foolishly, and decided to read the source code.

Now, the synergy source code is actually pretty good. It has a nice abstraction layer for the various clipboard types in X, MacOS, and Windows. It's pretty easy to follow. It has a bit too few debug trace messages, but okay, those are easy enough to add as we go.

Unfortunately, synergy's clipboard support has two fatal design flaws:

  1. Like the periodic xclip case above, it grabs a copy of clipboard data right away when the clipboard ownership changes. It's better than a naive xclip script, because it actually gets a notification when the clipboard ownership changes, rather than polling periodically. Unfortunately, those notifications are also its downfall. See, in X11, there is only a clipboard notification when the clipboard owner changes, not when the content changes. If I copy text from rxvt, it will grab the clipboard. Synergy will notice this and read the clipboard. But if I then copy different text in rxvt, the owner doesn't change, so there is no notification, so Synergy doesn't re-copy it. That explains why it only worked the first time. (It also explains why copying on B and then on A causes it to work again exactly once: the clipboard ownership changes.) (This bug may not be visible on all terminals. If rxvt would give up the clipboard, then take it back, every time I made a copy, it would work around this problem.) (I think VNC's clipboard sync has/had the same problem.)

  2. Synergy, because it's mostly a keyboard/mouse sharing app, maintains the concept of a "current screen." That is, it watches which screen currently has the mouse pointer, and only replicates the clipboard to that screen. This is a performance optimization: since it (like the poorly designed "x cut buffer" mentioned by jwz) takes a copy every time the clipboard changes, it doesn't want to replicate this to screens where you're not using it. Unfortunately, since my screens are nested, I had to disable the keyboard/mouse sharing feature, which also leaves the "current screen" incorrect exactly half the time, which is why the clipboard fails to replicate from B->A and only works from A->B.

I was willing to try to fix some minor clipboard bugs in synergy, but I gave up when I realized this design (grab and replicate the content as soon as clipboard owner changes) was never going to work well with rxvt. That's when I gave up and decided to write my own trivial clipboard syncing tool, based on all the otherwise-useless trivia I had acquired while investigating the above.

The result is xclipsync, and, other than omitting non-text clipboard formats, I think I did it right.

  1. It starts up by taking ownership of the clipboard on display A.
  2. When it loses clipboard ownership on A, it takes ownership of the clipboard on display B.
  3. When it loses clipboard ownership on B, it goes back to step 1.
  4. When it receives a request for clipboard contents (which should be someone requesting a paste), it then reads the clipboard content from the display that it doesn't currently own, and forwards it along.

And that's it!

This avoids the problem of a single owner changing their clipboard content (since it grabs content only on demand). It doesn't do extra work if you copy content without pasting. It actually does nothing at all if you do a lot of work on one display: it loses the clipboard content on that display, which means it does nothing at all until you use the clipboard on the other display. It doesn't ever poll anything, so there are no arbitrary delays or race conditions.

And best of all, this algorithm works even for multi-way sync. You can run parallel instances of xclipsync between any two displays, and as long as you don't create any bridging loops, it will do the right thing across all of them. That is, exactly one display will "own" the clipboard, and all the other ones will copy from it. This works because every time you copy something from a new display, exactly one xclipsync instance will lose ownership, which causes it to assert ownership on exactly one display. If another xclipsync is syncing with that display, it will then lose ownership, and assert ownership on exactly one other display, and so on. As long as there are no loops, this process will terminate, and it'll do so very efficiently.

Things that could be better

There's no particular reason xclipsync can't support non-plaintext clip formats. I just didn't implement it, because I didn't need anything but text, since my Xephyr session is just terminals anyway.

xclipsync currently uses tcl/tk to take clipboard ownership (yes!). This would have been unnecessary if xclip had just one more feature: the ability to run a command at paste time, rather than always reading clipboard content from stdin at startup time. Then xclipsync would have been just a couple of (foregrounded) alternating xclip calls in a loop.

Xephyr probably should just implement this exact clipboard syncing protocol internally.

Note that it appears ChromeOS+Wayland is actually implementing some other kind of clipboard sync between Wayland and the toplevel XWayland server. When xclipsync tries to take ownership of the XWayland clipboard, it immediately experiences one "paste" operation and then loses ownership. This might be related to Wayland's inter-process security isolation features. In any case, xclipsync reacts as usual (giving clipboard ownership to XWayland and proxying requests to XWayland from other displays that want to paste) and all is well.

I really wish Alt-Tab would work even when the Xephyr instance is fullscreened in ChromeOS. I understand why they want to let me capture Alt-Tab in my full-screen X apps, but also... I don't want to.

The marketing-driven "Assistant key" on the Pixelbook is a user-hostile disaster in its current form. On the other hand, if they would let me remap it to, say, Meta, it would instantly redeem itself.

Posted Sat Nov 24 13:56:30 2018 Tags:

I have until now avoided making a public statement about my views on the various interrelated issues regarding the GNU Kind Communication Guidelines that came up over the last month. However, given increasing interest in our community on these issues, and the repeated inquiries that I received privately from major contributors in our community, I now must state my views publicly. I don't have much desire to debate these topics in public, nor do I think such is particularly useful, but I've been asked frequently about these GNU policy statements. I feel, if for no other reason than efficiency, that I should share them in one place publicly for easy reference:

  • I think the GNU Kind Communication Guidelines, as a stand-alone document, are useful suggestions and helpful to the GNU project and would be helpful, if adopted, for any software freedom project.
  • However, I think that the GNU Kind Communication Guidelines standing alone are inadequate for a project of GNU's size and number of contributors to address the stated problems. Traditional Codes of Conduct, particularly those that offer mechanisms for complaint resolution when bad behavior occurs, are necessary in Free Software projects of GNU's size. Codes of Conduct are the best mechanism known today in our community to ensure welcoming environments for those who might be targeted by inappropriate and unprofessional behavior.
  • I therefore disagree with the meta-material stated in the announcement of these Communication Guidelines. First, I disagree with the decision to reject any Code of Conduct for the GNU project. Second, I believe that diversity is an important goal for advancing software freedom and human equality generally. I am a supporter of Outreachy and work hard to help it succeed as part of my day job. I have publicly supported affirmative action since the early 1990s, and continue to support it. I agree with “making diversity a goal”; Richard Stallman (RMS), speaking on behalf of GNU, states that perse disagrees with “making diversity a goal”.
  • I also disagree with encouraging GNU project contributors to ignore the request of non-binary-gender individuals who ask for the pronouns they/them, as stated in RMS' personal essay linked to from the GNU Kind Communication Guidelines. My position is that refusing to use the pronouns people ask for is the same unkindness as refusing to call transgender people by a name that is not their legal name when they request it. I don't think the grammatical argument that “pronouns are different from proper nouns” is compelling enough to warrant unwelcoming behavior toward these individuals. The words people use matter. RMS has insisted for years that people make a clear distinction between open source and free software — for good reason. I believe that how we say things makes a political statement in itself.
  • Related to the last point, I am concerned with the conflating of GNU project views with RMS' personal views. RMS seems to have decided unilaterally that GNU would take a position that requests for use of they/them pronouns need not be honored. I think it is essential that RMS keeps per personal views separate from official GNU policy; I have said so many times to the FSF Board of Directors in various contexts. It was a surprise to me that RMS' personal view on this issue was referenced as part of GNU project guidelines.
  • I think the GNU Kind Communication Guidelines should apply to all communication from the project, including GNU manuals themselves, and I also believe the glibc abort() joke should be removed. I don't believe anyone's free speech is impacted if a Free Software project forbids certain types of off-topic communication in its official channels. Everyone can have their own website and blog to express their personal views; they don't need to do so through project channels.

I have been encouraged many times this year by various prominent community members to resign from the FSF's Board of Directors (sometimes over these issues, and sometimes over other, similar issues). I have also received many private communications from other prominent community members (including some GNU contributors) expressing similar concerns to the above, but these individuals noted that they feel much better about the FSF and its shepherding of the GNU project because I'm on the FSF Board of Directors, even though I clearly pointed out to them that my views on these matters will not necessarily become GNU and/or FSF policy. The argument that many have made to me is that it's valuable to have dissenting opinions in the leadership on these issues, even if those dissenting opinions do not become FSF and/or GNU policy.

I am swayed by the latter argument, and I have decided to continue as an FSF Director indefinitely (assuming the other Directors wish me to continue). However, these recent public positions are far enough out of alignment with my own views that I feel it necessary to exercise my own free speech rights here on my personal blog and state my disagreement with them. I will continue to urge the FSF and GNU to change and/or clarify these positions. (I also sent this blog post privately to the FSF Directors 8 days before I posted it, and had also discussed these concerns in detail with RMS for a month before posting this.)

Governing well means working (and finding common ground) with those with whom you disagree. We oscillate a bit too much in software freedom communities: either we air every last disagreement no matter how minor, or (perhaps as an over-correction to the former) we seek to represent a seemingly perfect consensus even when one isn't present. I try to avoid both extremes, so this is the first time in my many years on the FSF Board of Directors where I've publicly disagreed with an FSF or GNU project policy. FSF and GNU primarily fight for one principle: equal software freedom for all users and developers. On other topics, there can easily exist disagreement, and working through those disagreements together, in my opinion, usually makes the community stronger.

As always, this is my personal blog, and nothing here necessarily reflects the official views of any organization with which I am affiliated, including not only the Free Software Foundation and GNU, but also Software Freedom Conservancy.

Posted Thu Nov 22 08:09:00 2018 Tags:

I'm working on (statistically speaking) my least interesting work in years.

Previously I spent time doing wifi drivers, boot scripts, logfile processors, payment systems, and project estimation, so one might reasonably have assumed I can't get much more boring. But hah! We have standards to exceed.

Here's what's weird though: I'm enjoying it.

There's this Feeling I get, very rarely, when I'm sure I'm on the right track. Over the last several years, I've almost had the Feeling occasionally, but not very often. It's been so long since I felt it that I actually forgot what it feels like. Talking to software people, I strongly suspect some have never felt it at all.

A common question, when recent graduates ask me for advice after landing their dream job, is of the form, "Is it... always like this? When I started to learn coding, I really liked it. I kind of assumed work would be... better." What they mean isn't that coding sucks, but that their project is unfulfilling. They suspect the Feeling exists. Maybe they remember having it as a kid. They thought they were almost there again. Life is all lined up: great school, great grades, great employer at a great salary, great co-workers. And then... nothing. Is it always like... this?

No, it's not. Not always.

But a lot.

...

Yesterday, someone on a mailing list told me (politely)1 that my documentation sucked. A few days before, someone called my code spaghetti. And the thing is... YES. Yes, I am a pretty good writer, but that documentation is not my best work. That code is some of the best I've written, but it's also the most incomprehensible, and I've known for years how to make it better, but I haven't gotten around to it. It's refreshing to have a project with no stakes, with volunteers who care about elegance, where I can hear stuff like that. I live on that. It's what makes me go, and when I finally do get it right, it means the positive feedback is real.

It's been years since someone told me my work sucked. And that was one thing, I can live with that, it's West Coast People, but in the last year or so, the compliments have been egregious. People's acceptance of my work and my opinions had a lot more to do with my reputation (among a certain small group) than about my quality. It's easy to get addicted to that, to let it take the place of the Feeling, but in the end it's just candy. The sugar high fades, and too soon, you need more sugar.

That's why I had to change gears. The withdrawal process has been a bit painful, but it's worth it. Maybe my project won't amount to anything. But if it dies, at least it'll die beautiful.

Footnote

1 Just to be clear, constructive negative feedback is important, but being a jerk is not. There's no need to go on a swearing/ranting angry rampage when things go wrong. It doesn't help anyway. But pretending bad things are good things doesn't help either.

Posted Sat Nov 17 04:36:05 2018 Tags:

tl;dr: Rebuilding a target because its mtime is older than the mtimes of its dependencies, like make does, is very error prone. redo does it better, and so can you.

A recent twitter discussion (pro tip: never do those) led me to realize that I have studied the problem of mtime comparison in considerably more depth than most people. I don't know whether to be proud of this or very concerned, but nevertheless, here we are. Soon, you'll know everything I do about the topic. I expect you will regret that as much as I have.

What is an mtime, anyway?

mtime is the "modified time" for the content associated with a given file. Generally, if anyone writes bytes anywhere in a file, the mtime will be updated. If a file has more than one name (ie. it's hardlinked to more than one place), all the names share the same inode and content, and thus all share the same mtime.

Annoyingly, when you update the content of a file, the mtime of its containing directory is not changed. All sorts of very convenient tree traversals would be possible if the directory mtime were updated (recursively to the root) when contained files changed, but no. This is probably because of hardlinks: since the kernel doesn't, in general, know all the filenames of an open file, it literally cannot update all the containing directories because it doesn't know what they are either. And anyway, purists might argue that the "content" of a directory doesn't change when the files it points to change; the content is merely a list of filenames and inode numbers, after all, and those stay the same, no matter what happens inside those inodes. Purists make me sad.
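A quick way to convince yourself of the directory part (the file names are made up, and the sleep is there because of the timestamp granularity we'll get to below):

import os, time

os.makedirs("d", exist_ok=True)
with open("d/f", "w") as f:
    f.write("one")
dir_mtime = os.stat("d").st_mtime

time.sleep(0.1)                  # get past the filesystem's timestamp granularity
with open("d/f", "w") as f:      # rewrite the existing file's content
    f.write("two")

print(os.stat("d/f").st_mtime > dir_mtime)   # True: the file's mtime moved on
print(os.stat("d").st_mtime == dir_mtime)    # True: the directory's did not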

(Random side note: on MacOS, the kernel does know all the filenames of a hardlink, because hardlinks are secretly implemented as fancy symlink-like data structures. You normally don't see any symptoms of this except that hardlinks are suspiciously slow on MacOS. But in exchange for the slowness, the kernel actually can look up all filenames of a hardlink if it wants. I think this has something to do with Aliases and finding .app files even if they move around, or something.)

Related to mtime is the ctime, which most people would guess means "create time," but it absolutely does not. It means "attribute change time," which is different from the "modified time" because it updates whenever various inode fields change, not just the file contents. mtime is one of the inode fields, so whenever mtime changes, ctime also changes, but not vice versa. Among other things, ctime changes when file ownership, size, or link count change.

Link count is especially interesting: if you create or delete a hardlink to a given file, its ctime changes. Renaming is defined as creating a new hardlink and then removing another one, which means it updates the ctime (not the mtime), even though when it finishes, the link count is back to normal so the inode looks unchanged (other than the ctime). (Whether rename's create and unlink are supposed to be a single atomic transaction is a subject of much debate.)

So anyway, ctime changes much more sensitively than mtime. It turns out that mostly you don't care about the changes ctime measures, so it causes false positives, especially because of that pesky link count, but if you're paranoid, this can be helpful. Let's mostly talk about mtimes for now.
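The link-count sensitivity is easy to demonstrate: creating a hardlink bumps the ctime but leaves the mtime alone. A small sketch (run it in a scratch directory; the file names are made up):

import os, time

with open("a", "w") as f:
    f.write("hello")
before = os.stat("a")

time.sleep(0.1)          # again, stay clear of timestamp granularity
os.link("a", "b")        # second name for the same inode: link count changes

after = os.stat("a")
print(after.st_mtime == before.st_mtime)   # True: content untouched
print(after.st_ctime > before.st_ctime)    # True: the ctime noticed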

For completeness, there is also the atime, which means "access time." Originally, this would update whenever anyone "accessed" a file, usually defined as reading bytes from it. But this is unhelpful for two reasons: first, it means reading a filesystem causes writes to that filesystem, which greatly increases disk load (some people estimate by ~30%). Secondly, the definition of "access time" does not match what end users mean, which means various programs (especially backup software and search engines) try to avoid updating it. This workaround is so common that Linux added an O_NOATIME flag to open(2) to prevent updating atime. The default atime performance hit is so bad that many filesystems now have a relatime mount flag which decreases the precision of atime, thus reducing disk load. (Trivia: the Debian popularity-contest, which I started long ago, uses atime to figure out which installed packages you actually use.) (More trivia: if you mount your filesystem readonly, it is technically not POSIX compliant anymore because the atimes won't be updated.)

Popular misconceptions about mtime

  • How precise is it? It depends on the OS and filesystem. Originally, mtime had a precision of one second, which is all you can safely rely on. Nowadays most OSes have a stat(2) syscall that returns a struct timespec, which contains nanoseconds, but almost no filesystems provide that level of precision, and it depends on your kernel and disk format. For example, on my system (Debian Linux 4.9.0-7 with ext4), I get about 0.01s granularity. Stackoverflow has an explanation.

  • Is mtime monotonically increasing? No, it can go backwards. For example, the utimes(2) syscall, used by the touch command, can set the mtime to any value at all. (And tar might do this when extracting a tarball, for example.) If your system clock jumps from one time to another, it will set subsequent mtimes to match the new clock, even if the jump was backwards. And so on.

  • Does mtime get set to >= the current time? No, this depends on clock granularity. For example, gettimeofday() can return times in microseconds on my system, but ext4 rounds timestamps down to the previous ~10ms (but not exactly 10ms) increment, with the surprising result that a newly-created file is almost always created in the past:

      $ python -c "
      import os, time
      t0 = time.time()
      open('testfile', 'w').close()
      print os.stat('testfile').st_mtime - t0
      "
    
      -0.00234484672546
    
  • Does mtime get set to <= the current time? No, it might be set to a future time. For example, imagine you have an NFS server whose clock is set 5 seconds in the future relative to your client. The mtime is assigned by the server, so when you create the file, its mtime will be 5 seconds in the future. (Changing the standard so that mtime is set by the client doesn't really help: then programs running on the server will see a file 5 seconds in the past. And relying on ntpd isn't perfect either: it can only reduce clock skew between machines, not eliminate it.) For extra inconsistency, if a client uses utimes(2) to force the time to a particular value, this gets passed through to the server unchanged.

  • Is mtime always nonzero? No. Various cheaply-written virtual filesystems, like many fuse-based ones, don't bother setting mtime.

  • Does a changed mtime guarantee that a file has different content? No. Perhaps you wrote a block that happened to be identical to the block that already existed at that point in the file; the mtime changes anyway. Perhaps you wrote a block and then changed it back; the mtime changes twice.

  • Does changed content guarantee a changed mtime? No. Clock skew, low precision, or utimes(2) can cause an mtime to be the same as last time you checked. (This is also true for ctime, etc.)

  • Do version control systems like git save the mtime? No, not really. The tree and blob objects stored by git contain no timestamp information at all. (This is very good for deduplication.) commit objects contain various timestamps (commit time, author time, etc), and you could use that to reverse-engineer a guess for the mtime of a given file: the commit time of the most recent commit that changed that file's content, for example. But that's not what people do, mostly because it creates problems with make, which we'll get to shortly. (Git doesn't have the dangerous mtime-setting feature built in, but it does seem to exist in svn. You probably still shouldn't do it.)

    (This all creates interesting philosophical questions. Is the "last modified" time of a file the time when the new content itself was created, or when this particular instance of it was written to disk? If you had a sci-fi device that could make a perfect scan of my physical being and run me in a simulation, what would be the mtime of the input file? And so on.)

    (The bup project I started, which uses a git-formatted repo to back up your filesystem, does need to save mtime and other metadata. It stores metadata in separate hidden files in the git tree and reapplies it at restore time.)

  • Does switching branches in git screw up the mtime? No, not more than anything else. git just rewrites the changed files and lets the kernel update the mtime, so they look as if someone edited them with a text editor.

  • Does writing to a file via mmap() update the mtime? Hah. Well, maybe. See, POSIX guarantees that the mtime "will be marked for update at some point in the interval between a write reference to the mapped region and the next call to msync() ... If there is no such call, these fields may be marked for update at any time after a write reference." This definition actually leaves a lot of leeway for weirdness.

    I wrote a little test program (mmap_test.c) to check how this works nowadays, and, of course, it varies between OSes. On Linux (4.9.0, ext4), the mtime is updated at the first page dirty after an mmap() or msync(). On FreeBSD (11.2, ufs), it's updated at msync() or munmap() time. On MacOS (10.11.6), it updates only at msync() time, not at munmap() time. I even tried with the "WSL personality" (4.4.0-17134-Microsoft) on Windows 10, which had especially terrible results: mmaped writes never updated the mtime at all.

    I think the MacOS behaviour is allowed because the spec says "may" instead of "will" in that second sentence, but it's a stretch. The Linux behaviour may be illegal depending how you define "a write reference"; Linux seems to interpret it as "the first" or "a randomly selected" write reference, while I would expect to interpret it as "each" write reference (with the result that mtime must be updated at least once between the last reference and the msync(), which would be fine).

    Of all these, the only useful behaviour seems to be FreeBSD's; at minimum, we surely want mtime to be updated at least once after all changes to a file have been done. MacOS and Linux don't always do so, and WSL never does so. This lends credence to the claim that the .git/index file, which uses mmap, is synced incorrectly by file sync tools relying on mtime. Ironically, the faster and better the file sync tool, the more likely it is to hit the race condition. An easy fix would be to have git always write() a useless byte before closing the index file. But I'd prefer if the kernel were less dumb. (If you want to try this on your own machine, a rough sketch of such a test appears below.)
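
Here's that sketch: a rough Python approximation of this kind of mmap test (this is not mmap_test.c itself; the file name and sleep interval are my own choices, and the results will vary by OS and filesystem as described above):

  import mmap, os, time

  path = 'mmap_mtime_test.tmp'
  with open(path, 'wb') as f:
      f.write(b'\0' * 4096)
  before = os.stat(path).st_mtime

  time.sleep(1.1)  # so even 1-second mtime granularity can notice a change

  with open(path, 'r+b') as f:
      m = mmap.mmap(f.fileno(), 4096)
      m[0:5] = b'hello'                      # dirty a page via the mapping
      after_write = os.stat(path).st_mtime
      m.flush()                              # msync()
      after_msync = os.stat(path).st_mtime
      m.close()                              # munmap()
  after_munmap = os.stat(path).st_mtime
  os.unlink(path)

  print('updated after write? ', after_write > before)
  print('updated after msync? ', after_msync > before)
  print('updated after munmap?', after_munmap > before)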

Okay! That's the introduction. Now let's move on to application.

mtimes and make

I've kinda ruined the surprise by listing the caveats above. But let's look at what all that means when we try to use mtime for something.

make dependencies work in a very simple way, one that we, with decades of industry experience learning all the above caveats, might now describe as "naive." But when make was first invented, nobody had heard of these problems, so it would be unfair to expect its author to design around them. In the world where make was first written:

  • there was no NFS;
  • there was no mmap;
  • there was no version control;
  • there were no fuse filesystems;
  • computers and compilers were so slow that a one-second timestamp granularity was never a problem.

In that world, they made the seemingly obvious decision to rebuild any target if the mtime of any of its dependencies was > the mtime of the target. (If you want to be extra safe in the presence of granularity problems, rebuild if >= rather than >.) This was an exciting innovation at the time.
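
To make that concrete, here's a minimal Python sketch of the comparison make effectively performs (the function and variable names are mine, not make's):

  import os

  def needs_rebuild(target, deps):
      try:
          target_mtime = os.stat(target).st_mtime
      except FileNotFoundError:
          return True  # missing target: always build
      # '>=' is the "extra safe" variant; classic make effectively uses '>'.
      return any(os.stat(dep).st_mtime >= target_mtime for dep in deps)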

Unfortunately, we now know that this can lead to numerous mistakes:

  • With NFS and clock skew, if a source file is edited on one machine and you run make on another, the input file might have mtime < target mtime, so nothing will happen. Or, you might rebuild the target and its mtime will still be < source mtime, so it'll be rebuilt again later.

  • If you accidentally set your system clock ahead by a day and build some stuff, then set your clock back to the present, all the stuff you built during that time will show up as "in the future" and therefore newer than source files you edit today, preventing all rebuilds. (Eventually GNU make started detecting future-dated files and printing a warning.)

  • If you have files modified through mmap, the mtime might not be up to date. (Luckily mmap is a rarity when editing source files or building software. Usually you aren't directly using live databases as your source files.)

  • If you replace one source file with another, eg.

      mv foo.c foo.c.bak
      mv foo.c.new foo.c
    

    then the mtime is not updated, and make will see the old mtime of foo.c.new. That might be older than your foo binary, even though the binary does not yet contain the new foo.c. It won't be rebuilt.

  • If you have a dependency like

      foo.a: $(patsubst %.c,%.o,$(wildcard *.c))
    

    (ie. produce foo.a from all the .o files built from all the C source files), then if one of the source files is deleted, it will no longer be one of the dependencies at all. But all the remaining dependencies are still older than foo.a, so foo.a will not be rebuilt.

  • If you put automake/autoconf-generated files (like ./configure and Makefile) in version control, you can get surprising results. Let's say automake has a Makefile rule to regenerate Makefile whenever the automake input files (eg. Makefile.am) change. In a tarball, which preserves mtimes, this will work, because Makefile will be newer than Makefile.am. But in a version control system, which uses the default kernel-assigned mtime when writing the files, it's undefined whether Makefile or Makefile.am is written first. If your timestamps are high precision (or they're low precision and you get unlucky), then Makefile could be "older" than Makefile.am, and make will re-run automake even though nothing really changed. Or if not, then it won't. So different people checking out the same source code will get different results based on random luck.

  • Computers are now so fast that you can save foo.c in your editor, and then produce foo.o, and then compile foo, all in the same one-second time period. If you do this and, say, save foo.c twice in the same second (and you have one-second granularity mtimes), then make can't tell if foo.o and foo are up to date or not. (As above, make can work around this by assuming if source mtime == target mtime, the target still needs to be rebuilt. This could cause spurious rebuilds, but is less dangerous than missing rebuilds.)

    (This often happens if you're using one of those fancy new inotify-based tools that fires off a compile immediately, every time you hit save in your editor. Typescript does something like this, for example, as do auto-reloaders for various modern web languages. Symptom: needing to save your source file twice before the autocompiler catches it. And it happens more on MacOS, which has 1-second mtime granularity, than on Linux, which has 0.01-second mtimes.)

  • If your source files are in a virtual filesystem where mtime is always 0, then make will always think your source files have not changed and the target will never rebuild.

While we're here, there are some other common problems that aren't really the fault of mtime, but are common dependency problems with make:

  • If you upgrade your toolchain (eg. your C compiler), make doesn't know to rebuild your source files, unless you declare an explicit dependency on the toolchain files, which nobody does because it's hard to write that system-dependent stuff as a Makefile dependency rule. (This is one reason autoconf needs to be a ./configure script that generates a Makefile, instead of just a dependency executed by your Makefile.)

    For that matter, when you update your toolchain, it's often from a distro-provided package (basically a tarball) with timestamps helpfully in the past, which are probably older than all your output files. So make won't see it as updated anyway!

  • If you pass variables on the make command line, like CFLAGS=-O2, they will usually not be part of a dependency and so won't cause a rebuild, and you'll end up with programs built halfway with the old flags, and halfway with the new ones. You can fix this by writing CFLAGS to a file, atomically replacing it only if the content differs, and depending on that file. But nobody does. (A minimal sketch of this trick appears after this list.)

  • If you modify the Makefile, make will not by default rebuild any targets. You can fix this by adding an explicit dependency on Makefile, but this is a giant pain during development, because Makefile contains all your build rules; you don't want to recompile every source file just because you changed the linker command line, for example. (Some nowadays-rare versions of make actually tried to track Makefile changes, per rule, and cause rebuilds for these cases.)
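
Here's the minimal sketch of the CFLAGS-to-a-file trick promised above (the file and function names are mine): write the flags to a dependency file, but leave it untouched when the content hasn't changed, so mtime-based tools only see a change when the flags really change.

  import os

  def update_if_changed(path, content):
      try:
          with open(path) as f:
              if f.read() == content:
                  return  # unchanged: don't touch the mtime
      except FileNotFoundError:
          pass
      tmp = path + '.tmp'
      with open(tmp, 'w') as f:
          f.write(content)
      os.replace(tmp, path)  # atomic rename over the old file

  update_if_changed('cflags.dep', os.environ.get('CFLAGS', ''))

Targets that care about the flags then depend on cflags.dep, whose mtime only moves when the flags actually change.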

make is not the only program that is affected by naive use of mtime. It's fairly common. For example, Go had so much trouble that they recently changed the Go compiler to just read and hash all the input files every time it runs. (Thanks to bradfitz for this link.)

redo: mtime dependencies done right

I happened to be aware of all these problems (well, not the mmap() madness; bleah!) when I set out to write redo so many years ago. I was also influenced by djb's design for redo, in which he writes, "When redo is asked to create a file that it hasn't heard of before, it presumes that the file is a source file if it exists, or a target file otherwise. In the second case (new target), redo immediately saves this decision to disk."

In other words, redo's design fundamentally depends on keeping a database of targets, if only to remember which files were produced by redo and which were not. From there, it's easy enough to extend that database to include mtime information about sources. And from there, we can add a bit more metadata to make the timestamp even more reliable.

My implementation of redo remembers the following information about each source and target (there's a minimal sketch of collecting such a stamp right after the list):

  • mtime
  • size
  • inode number
  • file mode
  • owner uid and gid
  • (targets only) the sequence number of the last time it was built
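
Here's the promised sketch of collecting such a stamp (the field choice follows the list above; the function names are mine, and the real thing stores all this in redo's database along with the build sequence number):

  import os

  def stamp(path):
      st = os.stat(path)
      return (st.st_mtime, st.st_size, st.st_ino,
              st.st_mode, st.st_uid, st.st_gid)

  def is_dirty(path, old_stamp):
      try:
          return stamp(path) != old_stamp
      except FileNotFoundError:
          return True  # a vanished dependency is definitely a change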

redo considers a dependency dirty if any of those attributes changed since the last time a target was built. Notice how this dodges the various problems of mtime skew:

  • NFS client/server time skew doesn't matter; as long as the mtime changes in any direction, it's fine.

  • mmap() weirdness is reduced, because we notice changes in file size, as well as source mtimes that changed but are still older than the target.

  • If you mv a file to replace another, it will have a different inode number, which we notice. It also probably has a different size and (even if not newer than the target) mtime, any of which are sufficient.

  • Because redo has a database of all the dependencies used to produce a given target, if one of those inputs disappears, the target needs to be rebuilt. make doesn't remember the dependencies used last time, it only remembers the dependencies declared this time, so it can miss important changes in the list of dependencies.

    (More generally, it's an interesting mathematical phenomenon that to correctly build software, we need to know not only the dependencies as they are now, but as they were before. Those two lists are used very differently. I don't think most build systems are designed with this realization in mind, and that omission leads to subtle failures.)

  • If you put autoconf/automake generated files in your source repo, redo will "presume that the file is a source file," make a note of that, and not rebuild it. (It's still probably not a great idea to check those into version control. But at least now your build system won't go crazy.) If you then delete them, redo will consider them targets to be built.

  • redo has special treatment of source files whose mtime == the target mtime, so it can correct for overlaps even when your filesystem has very coarse timestamp granularity. Also, if you continue editing a source file, it will usually end up with a changed size, which also marks it as changed.

  • If your source files are in a braindead fuse filesystem, redo can use inode number and size to detect changes (although it still sucks and you should fix your fuse filesystem).

We can also fix the non-mtime-related missing dependencies:

  • It's easy to declare dependencies on your toolchain, because the rule for each target can track which parts of the toolchain were used while building, then retroactively declare a dependency on those. And we still notice a change if the new mtimes are in the past.

  • redo doesn't allow you to set variables on the command line; you have to write them to a file instead. This lets you easily declare dependencies on the file.

  • Since rules are written in separate .do files instead of one big Makefile, it's reasonable for redo to auto-declare a dependency on the .do file it used for a given target. When you edit a rule, the affected targets are automatically rebuilt.

I mentioned above that the Go compiler had problems with naive mtime-based dependency checking. I don't expect Go to switch to redo, but they could solve their problems in a similar way: generate a "database" (which might just be a text file) at build time. In the database, list the source files and their stamp information (mtime, inode, etc). Also list the toolchain version and relevant command line flags. Next time, read the database and compare against the new list of source files, the new stamps, and the new flags. If any are different, run the build. (Of course, all this is just a performance optimization that allows the compiler to avoid opening and reading files unnecessarily. The Go developers might reasonably continue to opt for the slower choice with fewer edge cases.)
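
As a sketch of what such a stamp database could look like (the file name, format, and helper names here are my own invention, not Go's or redo's):

  import json, os

  DB = '.build-stamps.json'   # hypothetical location for the stamp database

  def snapshot(sources, toolchain_version, flags):
      files = {}
      for s in sources:
          st = os.stat(s)
          files[s] = [st.st_mtime, st.st_size, st.st_ino]
      return {'toolchain': toolchain_version, 'flags': flags, 'files': files}

  def needs_build(sources, toolchain_version, flags):
      try:
          with open(DB) as f:
              old = json.load(f)
      except (FileNotFoundError, ValueError):
          return True  # no usable database yet, so build
      return old != snapshot(sources, toolchain_version, flags)

  def record_build(sources, toolchain_version, flags):
      with open(DB, 'w') as f:
          json.dump(snapshot(sources, toolchain_version, flags), f)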

Why not use checksums instead of mtimes?

Inevitably when the discussion of build dependencies comes up, someone who has heard part of the above story (usually some of the make problems caused by mtime comparisons) suggests throwing away mtimes entirely and always doing dependencies based on file checksums.

This can work, sometimes. And wow, I love checksums a lot (I wrote bup after all). But it isn't perfect for every situation.

As a clue to how complicated this can get: most people talking about this option suggest checksums as a way to avoid false negatives, ie., failing to rebuild when a source file has changed. But inode attributes change, in theory, at least as often as the content hash changes. Checksums are more useful for reducing false positives (ie. to avoid rebuilding in situations where we know the output will be identical). If someone is talking to you about rebuilding based on checksums, ask if they have thought about that difference.

Anyway, here are some specific problems with checksum-based dependencies:

  • Sometimes building a target has side effects. For example, imagine you have a redo rule for deploying a container to AWS. This does not really produce a "file" locally that you can checksum; it usually produces just log messages, or blank output, and the checksum of that will usually not change. Now, imagine you have a second container that you want to deploy only if the first container gets deployed correctly. If the checksum of the first container deployment is unchanged, the second one will think all its dependencies are unchanged, and not run, which might be incorrect. There are numerous other examples of side effects where this always-use-checksums behaviour is undesirable.

    (On the other hand, some systems out there, like blaze/bazel, specialize in build systems without side effects. In that case a pure-checksum system is more appropriate. But then you have to escape from such systems if you want to do fun stuff like deploying containers. You end up punting the dependency problem elsewhere.)

  • Checksumming every output after building it is somewhat slow. This requires the build system to read the whole content of the file and do some math on it. Mostly this is not too serious: the file is probably already in disk cache (since you just wrote it a moment ago!) and calculating a checksum is almost always much faster than generating the file in the first place. And it only happens when a build was needed, which is expensive anyway. But it does add time to every build step.

  • Checksumming every input file before building is very slow. If you're considering whether to rebuild foo.a, and foo.a depends on *.o, and each .o depends on each .c, then you have to checksum the full content of every .c file every time you consider making an incremental build of foo. In large projects, this could be thousands, or tens of thousands of files, each of which we have to open(), read(), checksum, and close(), possibly over a network filesystem. For small projects this is fine, but for large projects, this sucks a lot.

    blaze/bazel come from a world where source files are stored in a virtual filesystem, which happens to have the ability to tell you a precalculated checksum for every source file (except the ones you've changed locally). If you only have to checksum your locally-changed files on each build, that'll be very fast. But you need filesystem support to make this possible, and we can't assume that everywhere.

redo does support checksum-based dependencies, but it avoids the above problems as much as possible:

  • If you do nothing, redo uses database-mtime-based dependency checking, which is extremely fast on all operating systems. It's even reasonably fast on NFS.

  • redo-stamp lets you provide, after building a target, the data used to calculate that target's checksum (which might differ from the target itself, if you want).

  • redo-stamp records the checksum in its database after building a target. Any downstream target remembers that checksum in its list of dependencies; if it changes later, then the downstream target needs to be rebuilt. There is no need to actually recalculate any checksums when checking dependencies in the future. No special filesystem support is needed.

So you can use redo-stamp, in appropriate places, to reduce false positives in a way that causes overhead only at build time (not for checking dependencies later), and only for targets that need it.
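
As a tiny sketch of that record-once idea (the names are mine, and a real implementation keeps the database on disk rather than in memory): the checksum is computed once, when a target is built; later dirty checks only compare stored values and never reread file contents.

  import hashlib

  stamp_db = {}  # target name -> checksum recorded at build time

  def record_stamp(target, stamp_data):
      # run once, at build time, on whatever bytes the rule chooses to stamp
      stamp_db[target] = hashlib.sha1(stamp_data).hexdigest()

  def dep_changed(dep, remembered_checksum):
      # a downstream target compares the checksum it remembered against the
      # one in the database; no file gets reread here
      return stamp_db.get(dep) != remembered_checksum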

That mmap() behaviour though. Seriously.

Posted Mon Nov 12 04:51:34 2018 Tags:

Lately I've been getting back to hacking on my djb redo implementation. We've fixed some problems with file attribute handling on NFS, obscure locking, and MacOS/FreeBSD portability. If you haven't tried redo in a while, you might want to give it another shot.

In case you haven't heard of redo before, here's the overview: it's like make, but with no special syntax (just sh scripts). The first time you "do" a build, it runs a set of recursive sh scripts, once per target. Those scripts can run a command called redo-ifchange, which declares dependencies on the given targets, checks if they are up to date, and if not, recurses into more scripts in order to build them. And that's it!

redo combines the best parts of imperative systems with the best parts of functional systems. The build scripts are all imperative - it just runs commands, and declaring dependencies happens as a side effect of some of those commands (redo-ifchange). When it's done, you have a purely-functional data structure that you can use for extremely fast dependency calculation. (Theoretically as fast as ninja, but my code isn't as optimized.)

[Credit: redo was invented by Daniel J. Bernstein. I merely implemented it.]

Parallelism

Things get a little more complex on modern multicore computers, where you almost always want parallel builds, which means producing different parts of the tree all at once. In principle, that means a sequential-imperative tree of sh scripts is no longer the perfect model. Luckily, redo can handle it: if your script does redo-ifchange on more than one target at a time, it'll try to build all those in parallel. Then, if more than one parallel target tries to build a given dependency, it uses inter-process file locking to make sure the dependency only builds once.

But what's this about serializing logs?

I'm sure you already have your favourite build system and it builds things, and it almost certainly handles parallelism for whatever your use case. Even make does parallelism.

Where things tend to fall down is in rendering the output of a parallel build. When we're running a lot of jobs all at once, and blasting them all to stdout/stderr, and one step deep in the tree gets an error, then you might get pages and pages of successful output from other tasks interspersed with your error, making it very hard to figure out what went wrong and why.

There are various approaches to solving that. Some people would argue that the Unix Way is for programs that didn't fail to just print nothing at all; that's how the Go compiler works, for example. If you like that philosophy but you're using tools that don't agree (such as make itself, which prints all kinds of stuff while it works), you could wrap every command in a script that withholds its output, printing it only if the command returns a nonzero exit code.

That's all nice until you have to debug what went wrong. It's not a coincidence that make, which is made by Unix people, does not follow the Unix Way. Makefiles are just too complicated and hard to debug if you can't see what they're doing; and if step 10 goes wrong, you might be very curious about step 9, even though (nominally) it worked. It's not okay to throw away the successful log messages from step 9.

Fine. Parallel make output is flawed and gross. But everyone knows make is flawed and gross, so they switch to other systems. Most other popular build systems are tool-specific. Someone did a lot of work in cmake, for example, to make it print pretty messages during parallel builds of C/C++ programs. That works well. But if you're not building C/C++ programs, it can't help.

redo is a general purpose dependency system like make, so by definition it's going to run scripts which produce a lot of clutter, possibly including instances of make itself, and someone is going to have to debug them. What can we do to sanitize the logs?

Digression from 2012: loglinear

I've actually been thinking about this problem for more than six years already. Back in 2012, I added a log sanitizer script called loglinear (please pause for a moment to admire the pun) to our project's buildroot. (buildroot is a really handy all-in-one tool for building embedded Linux systems from scratch.)

loglinear worked like this: every time we ran a sub-make command like make path/to/project, we'd instead replace it with loglinear make path/to/project. loglinear then prefixed each line of the output with a job name, say path/to/project, and buffered it. When one of the loglinear processes exited, the top-level loglinear process took that instance's buffer and dumped it to the top-level stdout.

Let's do an example. Imagine we're parallel building A, which depends on J, which depends on all of X, Y, and Z. We launch loglinear make A, which starts loglinear make J, which itself starts (all in parallel) loglinear make X, loglinear make Y, and loglinear make Z. J cannot continue until X, Y, and Z are done, but those three might finish in any order, and loglinear will print the output of each one as soon as it's done. So, the output will look something like this:

  Z: ...some stuff...
  Z: exited with code 0

  X: ...
  X: exited with code 0

  Y: ...
  Y: exited with code 0

  J: ...stuff...
  J: make X
  J: make Y
  J: make Z
  J: ...
  J: exited with code 0

  A: make J
  A: exited with code 0

loglinear also had some magic in case one of the processes returned nonzero: in that case, we'd print the successful processes first and the unsuccessful processes last, in the hope that the "most interesting" messages would end up at the bottom.

This made debugging a lot easier, because build messages from entire packages (like, say, the Linux kernel and busybox) came out separated instead of interspersed, but it had some flaws. Most importantly, the output was very bursty: it waited until a given job was completely done before it printed anything. When busybox finished, you saw all the busybox logs; when the kernel finished, you saw all the kernel logs. Although useful, this was, frankly, not as fun as watching your 16-core workstation live-blast several screenfuls of compiler log messages per second. It feels slow.

There was also a secondary problem, which is that the messages, although linearized, were in the wrong order. Notice that, in the above, 'make J' (in A) happens after all the messages from J. This is because we print jobs in the order that they finish, and by definition, J must finish before the job that started J can finish. If we tried to print it in a more reasonable order (topmost job first, dependencies next, etc), then we couldn't print any incremental logs at all: A is guaranteed to finish last, but we want to print it first. This is all very logical when you think deeply about it, but trust me, it gets tedious explaining it to every new developer on your team, who just wants to know why the time is flowing backwards.

So we used loglinear, and it was a necessary evil, especially for viewing autobuilder logs, but nobody liked it. I dreamed of a better way.

Back to 2018: redo-log

I've had many years to contemplate my 2012 sins, and I have good news: I finally figured out how to do it right. Not only that, but instead of introducing a weird tool that you have to hack into your makefiles (and hack I did, oh boy did I ever, to make buildroot parallelize things the way I wanted), I've helpfully integrated the magic directly into redo. And not only that, but I've updated buildroot to use redo so that not only can you get linearized logs, but you can get faster buildroot startup time and faster/better buildroot dependencies too.

(Dear buildroot team: If you're reading this, I was going to send this patch to your mailing list, but it's not ready for prime time, or even code review, yet. I'd love to hear your feedback if you have any.)

redo-log takes a totally different approach from loglinear:

  • It saves the log for each target persistently to its own file, so you can look at it again later.
  • Rather than a flat list of log files, it tracks their tree order.
  • It prints log messages "depth first" instead of "breadth first," for less burstiness.
  • It prints output in the order dependencies were launched, instead of the order in which they were finished.
  • It can helpfully indent log messages based on their recursion level.
  • Since we persist logs anyway, we reserve the right to simply not print messages from some irrelevant targets when an error happens. You can always pull up the logs later if you care.

In other words, the logs from our earlier build now look like this:

  A: redo J
  J:   ...J stuff...
  J:   redo X
  X:     ...X stuff...
  X:     exit 0
  J:   redo Y
  Y:     ...Y stuff...
  Y:     exit 0
  J:   redo Z
  Z:     ...Z stuff...
  Z:     exit 0
  J:   ...more J stuff...
  J:   exit 0
  A: exit 0

The important realization - which is maybe obvious to you, but it wasn't obvious to me - is that, if you decide to do a depth-first traversal of log messages, the "deepest" one that is still running will continue producing incremental messages until it finishes. There's no need to buffer them!

During that time, other parallel branches of the tree will also be producing messages, which we do buffer until later. So Z might finish before X, but we just print the messages from X as they come out, until X is done. Then we go back to J, which sends us to Y, which we follow until it's done. When we get to Z, which is done already, we just print all its enqueued messages in one big blast, then move on.

An interesting invariant here is that it doesn't matter whether X, Y, or Z finishes first. If they each print their own messages (including launching their own subtasks) in a reproducible order, then no matter how the CPU schedules them, the total output will be in a reproducible order. This has the almost-impossible-sounding property that a set of "reproducible build" steps will produce a byte-for-byte reproducible log, even in the presence of unlimited parallelism.

The tricks go a little deeper. Let's say X, Y, and Z all depend on Q. Because of how .do scripts work, they will each run redo-ifchange Q at some undefined time in their respective build scripts. We only need to build Q once, but we don't know which of X, Y, or Z will be the one to do it. This is where the persistent logs come in; we don't actually care! Effectively the log is a DAG (directed acyclic graph, the same kind of structure used in git) with multiple links to Q. Its structure is like this:

  A: redo J
  J:   ...J stuff...
  J:   redo X
  X:     redo Q
  Q:       ...build Q...
  X:     ...X stuff...
  J:   redo Y
  Y:     redo Q
  Q:       ...build Q...
  Y:     ...Y stuff...
  J:   redo Z
  Z:     redo Q
  Q:       ...build Q...
  Z:     ...Z stuff...
  J:   ...more J stuff...
  A: exit 0

Of course we only ran Q once, so it's silly to print its output more than once. Let's trim it:

  A: redo J
  J:   ...stuff...
  J:   redo X
  X:     redo Q
  Q:       ...build Q...
  X:     ...X stuff...
  J:   redo Y
  Y:     redo Q
  Y:     ...Y stuff...
  J:   redo Z
  Z:     redo Q
  Z:     ...Z stuff...
  J:   ...stuff...
  A: exit 0

Because of our depth-first traversal rule, the log will always look exactly like that - even if job Q was "actually" launched by job Y and not X. redo-log prints logs in dependency order.
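
To make the traversal rule concrete, here's a toy Python sketch of the replay ordering (all names are mine; it ignores the live-streaming and buffering machinery and shows only the launch-order traversal and the print-each-job-once rule):

  def replay(job, logs, indent=0, seen=None):
      # logs[job] is the list of lines that job printed, in launch order;
      # a line like 'redo X' marks the point where job X was launched.
      seen = set() if seen is None else seen
      for line in logs[job]:
          print('%s: %s%s' % (job, '  ' * indent, line))
          if line.startswith('redo '):
              child = line.split(' ', 1)[1]
              if child not in seen:        # print each job's log only once
                  seen.add(child)
                  replay(child, logs, indent + 1, seen)

  logs = {
      'A': ['redo J', 'exit 0'],
      'J': ['redo X', 'redo Y', 'redo Z', 'exit 0'],
      'X': ['redo Q', '...X stuff...', 'exit 0'],
      'Y': ['redo Q', '...Y stuff...', 'exit 0'],
      'Z': ['redo Q', '...Z stuff...', 'exit 0'],
      'Q': ['...build Q...', 'exit 0'],
  }
  replay('A', logs)

Because the replay order depends only on the recorded logs, not on which process happened to finish first, the output is reproducible.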

After the build finishes, though, you might want to investigate exactly how Z got built. To do that, you run redo-log Z, which prints this:

  Z: redo Q
  Q:   ...build Q...
  Z: ...Z stuff...
  Z: exit 0

In this case, we can show the steps for job Q as a subtree of Z, even though Q was actually built by Y, because it's not redundant when we're not printing Y.

One more complication arises if one of Z's dependencies changes and we need to rebuild Z, but Q has not changed. If we do that, then the "honest" redo log for the incremental rebuild of Z looks like this:

  Z: redo-ifchange Q  [nothing happens]
  Z: ...Z stuff...

But depending what you're doing - for example, if you want to see if the "reproducible log" for an incremental build of your whole reproducible build project matches a from-scratch build - it might make sense to show where Q came from. This is redo-log's -u option ("recurse into unchanged targets"), which then prints this:

  Z: redo Q
  Q:   ...build Q...
  Z: ...Z stuff...
  Z: exit 0

...in other words, the exact same log as you got when you built Z the first time.

Conclusion

I'm sure almost everyone reading this thinks I'm hopelessly pedantic to care so much about the sequence of lines of output in my build logs. You're right! But you're getting off easy, because you didn't have to live through my obsessing over LED blink synchronization across a lab full of wifi routers. (Useless trivia: I found at least three bugs in openntpd by noticing the LEDs in our lab were not all blinking uniformly.)

And that, my friends, is why tree traversal algorithms are fair game in job interviews.

...

Uh, also, you should try redo. You may also want to see how I redo-ized buildroot. If you're interested, you can join the discussions on the redo-list mailing list.

Posted Tue Nov 6 04:45:13 2018 Tags:

Lately I've been getting back to hacking on my djb redo implementation. We've fixed some problems with file attribute handling on NFS, obscure locking, and MacOS/FreeBSD portability. If you haven't tried redo in a while, you might want to give it another shot.

In case you haven't heard of redo before, here's the overview: it's like make, but with no special syntax (just sh scripts). The first time you "do" a build, it runs a set of recursive sh scripts, once per target. Those scripts can run a command called redo-ifchange, which declares dependencies on the given targets, checks if they are up to date, and if not, recurses into more scripts in order to build them. And that's it!

redo combines the best parts of imperative systems with the best parts of functional systems. The build scripts are all imperative - it just runs commands, and declaring dependencies happens as a side effect of some of those commands (redo-ifchange). When it's done, you have a purely-functional data structure that you can use for extremely fast dependency calculation. (Theoretically as fast as ninja, but my code isn't as optimized.)

[Credit: redo was invented by Daniel J. Bernstein. I merely implemented it.]

Parallelism

Things get a little more complex on modern multicore computers, where you almost always want parallel builds, which means producing different parts of the tree all at once, so in principle, a sequential-imperative tree of sh scripts is no longer the perfect model. Luckily, redo can handle it: if your script does redo-ifchange on more than one target at a time, it'll try to build all those in parallel. Then, if more than one parallel target tries to build a given dependency, it uses inter-process file locking to make sure the dependency only builds once.

But what's this about serializing logs?

I'm sure you already have your favourite build system and it builds things, and it almost certainly handles parallelism for whatever your use case. Even make does parallelism.

Where things tend to fall down is in rendering the output of a parallel build. When we're running a lot of jobs all at once, and blasting them all to stdout/stderr, and one step deep in the tree gets an error, then you might get pages and pages of successful output from other tasks interspersed with your error, making it very hard to figure out what went wrong and why.

There are various approaches to solving that. Some people would argue that the Unix Way is for programs that didn't fail to just print nothing at all; that's how the Go compiler works, for example. If you like that philosophy but you're using tools that don't agree (such as make itself, which prints all kinds of stuff while it works), you could wrap every command in a script that withholds its output, printing it only if the command returns a nonzero exit code.

That's all nice until you have to debug what went wrong. It's not a coincidence that make, which is made by Unix people, does not follow the Unix Way. Makefiles are just too complicated and hard to debug if you can't see what they're doing; and if step 10 goes wrong, you might be very curious about step 9, even though (nominally) it worked. It's not okay to throw away the successful log messages from step 9.

Fine. Parallel make output is flawed and gross. But everyone knows make is flawed and gross, so they switch to other systems. Most other popular build systems are tool-specific. Someone did a lot of work in cmake, for example, to make it print pretty messages during parallel builds of C/C++ programs. That works well. But if you're not building C/C++ programs, it can't help.

redo is a general purpose dependency system like make, so by definition it's going to run scripts which produce a lot of clutter, possibly including instances of make itself, and someone is going to have to debug them. What can we do to sanitize the logs?

Digression from 2012: loglinear

I've actually been thinking about this problem for more than six years already. Back in 2012, I added a log sanitizer script called loglinear (please pause for a moment to admire the pun) to our project's buildroot. (buildroot is a really handy all-in-one tool for building embedded Linux systems from scratch.)

loglinear worked like this: every time we ran a sub-make command like make path/to/project, we'd instead replace it with loglinear make path/to/project. loglinear then prefixes each line of the output with a job name, say path/to/project, and buffers it. When one of the loglinear processes exits, the top-level loglinear process then takes the buffer from that instance and dumps it to the top-level stdout.

Let's do an example. Imagine we're parallel building A, which depends on J, which depends on all of X, Y, and Z. We launch loglinear make A, which starts loglinear make J, which itself starts (all in parallel) loglinear make X, loglinear make Y, and loglinear make Z. J cannot continue until X, Y, and Z are done, but those three might finish in any order, and loglinear will print the output of each one as soon as it's done. So, the output will look something like this:

  Z: ...some stuff...
  Z: exited with code 0

  X: ...
  X: exited with code 0

  Y: ...
  Y: exited with code 0

  J: ...stuff...
  J: make X
  J: make Y
  J: make Z
  J: ...
  J: exited with code 0

  A: make J
  A: exited with code 0

loglinear also had some magic in case one of the processes returned nonzero: in that case, we'd print the successful processes first and the unsuccessful processes last, in the hope that the "most interesting" messages would end up at the bottom.

This made debugging a lot easier, because build messages from entire packages (like, say, the Linux kernel and busybox) separated out instead of interspersed, but it had some flaws. Most importantly, the output was very bursty: it waited until a given job was completely done before it printed anything. When busybox finished, you saw all the busybox logs; when the kernel finished, you saw all the kernel logs. Although useful, this is, frankly, not as fun as watching your 16-core workstation live-blast several screenfuls of compiler log messages per second. It feels slow.

There was also a secondary problem, which is that the messages, although linearized, were in the wrong order. Notice that, in the above, 'make J' (in A) happens after all the messages from J. This is because we print jobs in the order that they finish, and by definition, J must finish before the job that started J can finish. If we tried to print it in a more reasonable order (topmost job first, dependencies next, etc), then we couldn't print any incremental logs at all: A is guaranteed to finish last, but we want to print it first. This is all very logical when you think deeply about it, but trust me, it gets tedious explaining it to every new developer on your team, who just wants to know why the time is flowing backwards.

So we used loglinear, and it was a necessary evil, especially for viewing autobuilder logs, but nobody liked it. I dreamed of a better way.

Back to 2018: redo-log

I've had many years to contemplate my 2012 sins, and I have good news: I finally figured out how to do it right. Not only that, but instead of introducing a weird tool that you have to hack into your makefiles (and hack I did, oh boy did I ever, to make buildroot parallelize things the way I wanted), I've helpfully integrated the magic directly into redo. And not only that, but I've updated buildroot to use redo so that not only can you get linearized logs, but you can get faster buildroot startup time and faster/better buildroot dependencies too.

(Dear buildroot team: If you're reading this, I was going to send this patch to your mailing list, but it's not ready for prime time, or even code review, yet. I'd love to hear your feedback if you have any.)

redo-log takes a totally different approach from loglinear:

  • It saves the log for each target persistently to its own file, so you can look at it again later.
  • Rather than a flat list of log files, it tracks their tree order.
  • It prints log messages "depth first" instead of "breadth first," for less burstiness.
  • It prints output in the order dependencies were launched, instead of the order in which they were finished.
  • It can helpfully indent log messages based on their recursion level.
  • Since we persist logs anyway, we reserve the right to simply not print messages from some irrelevant targets when an error happens. You can always pull up the logs later if you care.

In other words, the logs from our earlier build now look like this:

  A: redo J
  J:   ...J stuff...
  J:   redo X
  X:     ...X stuff...
  X:     exit 0
  J:   redo Y
  Y:     ...Y stuff...
  Y:     exit 0
  J:   redo Z
  Z:     ...Z stuff...
  Z:     exit 0
  J:   ...more J stuff...
  J:   exit 0
  A: exit 0

The important realization - which is maybe obvious to you, but it wasn't obvious to me - is that, if you decide to do a depth-first traversal of log messages, the "deepest" one that is still running will continue producing incremental messages until it finishes. There's no need to buffer them!

During that time, other parallel branches of the tree will also be producing messages, which we do buffer until later. So Z might finish before X, but we just print the messages from X as they come out, until X is done. Then we go back to J, which sends us to Y, which we follow until it's done. When we get to Z, which is done already, we just print all its enqueued messages in one big blast, then move on.

An interesting invariant here is that it doesn't matter whether X, Y, or Z finishes first. If they each print their own messages (including launching their own subtasks) in a reproducible order, then no matter how the CPU schedules them, the total output will be in a reproducible order. This has the almost-impossible-sounding property that a set of "reproducible build" steps will produce a byte-for-byte reproducible log, even in the presence of unlimited parallelism.

The tricks go a little deeper. Let's say X, Y, and Z all depend on Q. Because of how .do scripts work, they will each run redo-ifchange Q at some undefined time in their respective build scripts. We only need to build Q once, but we don't know which of X, Y, or Z will be the one to do it. This is where the persistent logs come in; we don't actually care! Effectively the log is a DAG (directed acyclic graph, the same kind of structure used in git) with multiple links to Q. Its structure is like this:

  A: redo J
  J:   ...J stuff...
  J:   redo X
  X:     redo Q
  Q:       ...build Q...
  X:     ...X stuff...
  J:   redo Y
  Y:     redo Q
  Q:       ...build Q...
  Y:     ...Y stuff...
  J:   redo Z
  Z:     redo Q
  Q:       ...build Q...
  Z:     ...Z stuff...
  J:   ...more J stuff...
  A: exit 0

Of course we only ran Q once, so it's silly to print its output more than once. Let's trim it:

  A: redo J
  J:   ...stuff...
  J:   redo X
  X:     redo Q
  Q:       ...build Q...
  X:     ...X stuff...
  J:   redo Y
  Y:     redo Q
  Y:     ...Y stuff...
  J:   redo Z
  Z:     redo Q
  Z:     ...Z stuff...
  J:   ...stuff...
  A: exit 0

Because of our depth-first traversal rule, the log will always look exactly like that - even if job Q was "actually" launched by job Y and not X. redo-log prints logs in dependency order.

After the build finishes, though, you might want to investigate exactly how Z got built. To do that, you run redo-log Z, which prints this:

  Z: redo Q
  Q:   ...build Q...
  Z: ...Z stuff...
  Z: exit 0

In this case, we can show the steps for job Q as a subtree of Z, even though Q was actually built by Y, because it's not redundant when we're not printing Y.

One more complication arises if one of Z's dependencies changes and we need to rebuild Z, but Q has not changed. If we do that, then the "honest" redo log for the incremental rebuild of Z looks like this:

  Z: redo-ifchange Q  [nothing happens]
  Z: ...Z stuff...

But depending what you're doing - for example, if you want to see if the "reproducible log" for an incremental build of your whole reproducible build project matches a from-scratch build - it might make sense to show where Q came from. This is redo-log's -u option ("recurse into unchanged targets"), which then prints this:

  Z: redo Q
  Q:   ...build Q...
  Z: ...Z stuff...
  Z: exit 0

...in other words, the exact same log as you got when you built Z the first time.

Conclusion

I'm sure almost everyone reading this thinks I'm hopelessly pedantic to care so much about the sequence of lines of output in my build logs. You're right! But you're getting off easy, because you didn't have to live through my obsessing over LED blink synchronization across a lab full of wifi routers. (Useless trivia: I found at least three bugs in openntpd by noticing the LEDs in our lab were not all blinking uniformly.)

And that, my friends, is why tree traversal algorithms are fair game in job interviews.

...

Uh, also, you should try redo. You may also want to see how I redo-ized buildroot. If you're interested, you can join the discussions on the redo-list mailing list.

Posted Tue Nov 6 04:45:13 2018 Tags:

Lately I've been getting back to hacking on my djb redo implementation. We've fixed some problems with file attribute handling on NFS, obscure locking, and MacOS/FreeBSD portability. If you haven't tried redo in a while, you might want to give it another shot.

In case you haven't heard of redo before, here's the overview: it's like make, but with no special syntax (just sh scripts). The first time you "do" a build, it runs a set of recursive sh scripts, once per target. Those scripts can run a command called redo-ifchange, which declares dependencies on the given targets, checks if they are up to date, and if not, recurses into more scripts in order to build them. And that's it!

redo combines the best parts of imperative systems with the best parts of functional systems. The build scripts are all imperative - it just runs commands, and declaring dependencies happens as a side effect of some of those commands (redo-ifchange). When it's done, you have a purely-functional data structure that you can use for extremely fast dependency calculation. (Theoretically as fast as ninja, but my code isn't as optimized.)

[Credit: redo was invented by Daniel J. Bernstein. I merely implemented it.]

Parallelism

Things get a little more complex on modern multicore computers, where you almost always want parallel builds, which means producing different parts of the tree all at once, so in principle, a sequential-imperative tree of sh scripts is no longer the perfect model. Luckily, redo can handle it: if your script does redo-ifchange on more than one target at a time, it'll try to build all those in parallel. Then, if more than one parallel target tries to build a given dependency, it uses inter-process file locking to make sure the dependency only builds once.

But what's this about serializing logs?

I'm sure you already have your favourite build system and it builds things, and it almost certainly handles parallelism for whatever your use case. Even make does parallelism.

Where things tend to fall down is in rendering the output of a parallel build. When we're running a lot of jobs all at once, and blasting them all to stdout/stderr, and one step deep in the tree gets an error, then you might get pages and pages of successful output from other tasks interspersed with your error, making it very hard to figure out what went wrong and why.

There are various approaches to solving that. Some people would argue that the Unix Way is for programs that didn't fail to just print nothing at all; that's how the Go compiler works, for example. If you like that philosophy but you're using tools that don't agree (such as make itself, which prints all kinds of stuff while it works), you could wrap every command in a script that withholds its output, printing it only if the command returns a nonzero exit code.

That's all nice until you have to debug what went wrong. It's not a coincidence that make, which is made by Unix people, does not follow the Unix Way. Makefiles are just too complicated and hard to debug if you can't see what they're doing; and if step 10 goes wrong, you might be very curious about step 9, even though (nominally) it worked. It's not okay to throw away the successful log messages from step 9.

Fine. Parallel make output is flawed and gross. But everyone knows make is flawed and gross, so they switch to other systems. Most other popular build systems are tool-specific. Someone did a lot of work in cmake, for example, to make it print pretty messages during parallel builds of C/C++ programs. That works well. But if you're not building C/C++ programs, it can't help.

redo is a general purpose dependency system like make, so by definition it's going to run scripts which produce a lot of clutter, possibly including instances of make itself, and someone is going to have to debug them. What can we do to sanitize the logs?

Digression from 2012: loglinear

I've actually been thinking about this problem for more than six years already. Back in 2012, I added a log sanitizer script called loglinear (please pause for a moment to admire the pun) to our project's buildroot. (buildroot is a really handy all-in-one tool for building embedded Linux systems from scratch.)

loglinear worked like this: every time we ran a sub-make command like make path/to/project, we'd instead replace it with loglinear make path/to/project. loglinear then prefixes each line of the output with a job name, say path/to/project, and buffers it. When one of the loglinear processes exits, the top-level loglinear process then takes the buffer from that instance and dumps it to the top-level stdout.

Let's do an example. Imagine we're parallel building A, which depends on J, which depends on all of X, Y, and Z. We launch loglinear make A, which starts loglinear make J, which itself starts (all in parallel) loglinear make X, loglinear make Y, and loglinear make Z. J cannot continue until X, Y, and Z are done, but those three might finish in any order, and loglinear will print the output of each one as soon as it's done. So, the output will look something like this:

  Z: ...some stuff...
  Z: exited with code 0

  X: ...
  X: exited with code 0

  Y: ...
  Y: exited with code 0

  J: ...stuff...
  J: make X
  J: make Y
  J: make Z
  J: ...
  J: exited with code 0

  A: make J
  A: exited with code 0

loglinear also had some magic in case one of the processes returned nonzero: in that case, we'd print the successful processes first and the unsuccessful processes last, in the hope that the "most interesting" messages would end up at the bottom.

This made debugging a lot easier, because build messages from entire packages (like, say, the Linux kernel and busybox) separated out instead of interspersed, but it had some flaws. Most importantly, the output was very bursty: it waited until a given job was completely done before it printed anything. When busybox finished, you saw all the busybox logs; when the kernel finished, you saw all the kernel logs. Although useful, this is, frankly, not as fun as watching your 16-core workstation live-blast several screenfuls of compiler log messages per second. It feels slow.

There was also a secondary problem, which is that the messages, although linearized, were in the wrong order. Notice that, in the above, 'make J' (in A) happens after all the messages from J. This is because we print jobs in the order that they finish, and by definition, J must finish before the job that started J can finish. If we tried to print it in a more reasonable order (topmost job first, dependencies next, etc), then we couldn't print any incremental logs at all: A is guaranteed to finish last, but we want to print it first. This is all very logical when you think deeply about it, but trust me, it gets tedious explaining it to every new developer on your team, who just wants to know why the time is flowing backwards.

So we used loglinear, and it was a necessary evil, especially for viewing autobuilder logs, but nobody liked it. I dreamed of a better way.

Back to 2018: redo-log

I've had many years to contemplate my 2012 sins, and I have good news: I finally figured out how to do it right. Not only that, but instead of introducing a weird tool that you have to hack into your makefiles (and hack I did, oh boy did I ever, to make buildroot parallelize things the way I wanted), I've helpfully integrated the magic directly into redo. And not only that, but I've updated buildroot to use redo so that not only can you get linearized logs, but you can get faster buildroot startup time and faster/better buildroot dependencies too.

(Dear buildroot team: If you're reading this, I was going to send this patch to your mailing list, but it's not ready for prime time, or even code review, yet. I'd love to hear your feedback if you have any.)

redo-log takes a totally different approach from loglinear:

  • It saves the log for each target persistently to its own file, so you can look at it again later.
  • Rather than a flat list of log files, it tracks their tree order.
  • It prints log messages "depth first" instead of "breadth first," for less burstiness.
  • It prints output in the order dependencies were launched, instead of the order in which they were finished.
  • It can helpfully indent log messages based on their recursion level.
  • Since we persist logs anyway, we reserve the right to simply not print messages from some irrelevant targets when an error happens. You can always pull up the logs later if you care.

In other words, the logs from our earlier build now look like this:

  A: redo J
  J:   ...J stuff...
  J:   redo X
  X:     ...X stuff...
  X:     exit 0
  J:   redo Y
  Y:     ...Y stuff...
  Y:     exit 0
  J:   redo Z
  Z:     ...Z stuff...
  Z:     exit 0
  J:   ...more J stuff...
  J:   exit 0
  A: exit 0

The important realization - which is maybe obvious to you, but it wasn't obvious to me - is that, if you decide to do a depth-first traversal of log messages, the "deepest" one that is still running will continue producing incremental messages until it finishes. There's no need to buffer them!

During that time, other parallel branches of the tree will also be producing messages, which we do buffer until later. So Z might finish before X, but we just print the messages from X as they come out, until X is done. Then we go back to J, which sends us to Y, which we follow until it's done. When we get to Z, which is done already, we just print all its enqueued messages in one big blast, then move on.

An interesting invariant here is that it doesn't matter whether X, Y, or Z finishes first. If they each print their own messages (including launching their own subtasks) in a reproducible order, then no matter how the CPU schedules them, the total output will be in a reproducible order. This has the almost-impossible-sounding property that a set of "reproducible build" steps will produce a byte-for-byte reproducible log, even in the presence of unlimited parallelism.

The tricks go a little deeper. Let's say X, Y, and Z all depend on Q. Because of how .do scripts work, they will each run redo-ifchange Q at some undefined time in their respective build scripts. We only need to build Q once, but we don't know which of X, Y, or Z will be the one to do it. This is where the persistent logs come in; we don't actually care! Effectively the log is a DAG (directed acyclic graph, the same kind of structure used in git) with multiple links to Q. Its structure is like this:

  A: redo J
  J:   ...J stuff...
  J:   redo X
  X:     redo Q
  Q:       ...build Q...
  X:     ...X stuff...
  J:   redo Y
  Y:     redo Q
  Q:       ...build Q...
  Y:     ...Y stuff...
  J:   redo Z
  Z:     redo Q
  Q:       ...build Q...
  Z:     ...Z stuff...
  J:   ...more J stuff...
  A: exit 0

Of course we only ran Q once, so it's silly to print its output more than once. Let's trim it:

  A: redo J
  J:   ...stuff...
  J:   redo X
  X:     redo Q
  Q:       ...build Q...
  X:     ...X stuff...
  J:   redo Y
  Y:     redo Q
  Y:     ...Y stuff...
  J:   redo Z
  Z:     redo Q
  Z:     ...Z stuff...
  J:   ...stuff...
  A: exit 0

Because of our depth-first traversal rule, the log will always look exactly like that - even if job Q was "actually" launched by job Y and not X. redo-log prints logs in dependency order.
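
To make the trimming rule concrete, here's another hypothetical sketch, this time replaying persisted per-target logs from a made-up in-memory layout (a dict standing in for redo-log's saved files): walk the logs recursively in dependency order, and keep a set of targets already printed so a shared dependency like Q only gets expanded under the first parent that reaches it.

  # A toy replay of persisted per-target logs; the dict below is an invented
  # in-memory stand-in for redo-log's saved files.
  LOGS = {
      "J": ["...J stuff...", ("redo", "X"), ("redo", "Y"), ("redo", "Z"),
            "...more J stuff..."],
      "X": [("redo", "Q"), "...X stuff..."],
      "Y": [("redo", "Q"), "...Y stuff..."],
      "Z": [("redo", "Q"), "...Z stuff..."],
      "Q": ["...build Q..."],
  }

  def replay(target, depth=0, seen=None):
      seen = set() if seen is None else seen
      seen.add(target)
      indent = "  " * depth
      for entry in LOGS[target]:
          if isinstance(entry, tuple):          # ("redo", dep): a sub-build
              dep = entry[1]
              print("%s: %sredo %s" % (target, indent, dep))
              if dep not in seen:               # only expand it the first time
                  replay(dep, depth + 1, seen)
          else:
              print("%s: %s%s" % (target, indent, entry))
      print("%s: %sexit 0" % (target, indent))

  replay("J")   # Q's log appears under X only, like the trimmed log above
  replay("Z")   # like "redo-log Z" below: Q is expanded, since Y isn't printed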

After the build finishes, though, you might want to investigate exactly how Z got built. To do that, you run redo-log Z, which prints this:

  Z: redo Q
  Q:   ...build Q...
  Z: ...Z stuff...
  Z: exit 0

In this case, we can show the steps for job Q as a subtree of Z, even though Q was actually built by Y, because it's not redundant when we're not printing Y.

One more complication arises if one of Z's dependencies changes and we need to rebuild Z, but Q has not changed. If we do that, then the "honest" redo log for the incremental rebuild of Z looks like this:

  Z: redo-ifchange Q  [nothing happens]
  Z: ...Z stuff...

But depending on what you're doing - for example, if you want to see whether the "reproducible log" for an incremental build of your whole reproducible build project matches a from-scratch build - it might make sense to show where Q came from. This is redo-log's -u option ("recurse into unchanged targets"), which then prints this:

  Z: redo Q
  Q:   ...build Q...
  Z: ...Z stuff...
  Z: exit 0

...in other words, the exact same log as you got when you built Z the first time.
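
In the same toy model (again an assumed data layout, not redo-log's real on-disk format), the -u behavior might look like this: the incremental log for Z only records that Q was checked and found unchanged, and -u splices in Q's previously saved log at that spot, reusing LOGS and replay() from the earlier sketch.

  # A hypothetical sketch of the -u idea, reusing LOGS and replay() from the
  # previous snippet.
  INCREMENTAL_Z = [("ifchange", "Q"), "...Z stuff..."]

  def replay_incremental(target, entries, recurse_unchanged=False):
      for entry in entries:
          if isinstance(entry, tuple) and entry[0] == "ifchange":
              dep = entry[1]
              if recurse_unchanged:             # like redo-log -u: expand unchanged deps
                  print("%s: redo %s" % (target, dep))
                  replay(dep, depth=1)          # pull the saved log from last time
              else:
                  print("%s: redo-ifchange %s  [nothing happens]" % (target, dep))
          else:
              print("%s: %s" % (target, entry))
      print("%s: exit 0" % target)

  replay_incremental("Z", INCREMENTAL_Z)                          # plain incremental log
  replay_incremental("Z", INCREMENTAL_Z, recurse_unchanged=True)  # like "redo-log -u Z"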

Conclusion

I'm sure almost everyone reading this thinks I'm hopelessly pedantic to care so much about the sequence of lines of output in my build logs. You're right! But you're getting off easy, because you didn't have to live through my obsessing over LED blink synchronization across a lab full of wifi routers. (Useless trivia: I found at least three bugs in openntpd by noticing the LEDs in our lab were not all blinking uniformly.)

And that, my friends, is why tree traversal algorithms are fair game in job interviews.

...

Uh, also, you should try redo. You may also want to see how I redo-ized buildroot. If you're interested, you can join the discussions on the redo-list mailing list.

Posted Tue Nov 6 04:45:13 2018 Tags:

[ A similar version was crossposted on Conservancy's blog. ]

More than 15 years ago, Free, Libre, and Open Source Software (FLOSS) community activists successfully argued that licensing proliferation was a serious threat to the viability of FLOSS. We convinced companies to end the era of “vanity” licenses. Different charities — from the Open Source Initiative (OSI) to the Free Software Foundation (FSF) to the Apache Software Foundation — all agreed we were better off with fewer FLOSS licenses. We de-facto instituted what my colleague Richard Fontana once called the “Rule of Three” — assuring that any potential FLOSS license should be met with suspicion unless (a) the OSI declares that it meets their Open Source Definition, (b) the FSF declares that it meets their Free Software Definition, and (c) the Debian Project declares that it meets their Debian Free Software Guidelines. The work of those organizations reduced license proliferation from a radioactive threat to safe background noise. Everyone thought the problem was solved. Pointless license drafting had become a rare practice, and updated versions of established licenses were handled with public engagement and close discussion with the OSI and other license evaluation experts.

Sadly, the age of license proliferation has returned. It's harder to stop this time, because this isn't merely about corporate vanity licenses. Companies now have complex FLOSS policy agendas, and those agendas are not to guarantee software freedom for all. While it is annoying that our community must again confront an old threat, we are fortunate the problem is not hidden: companies proposing their own licenses are now straightforward about their new FLOSS licenses' purposes: to maximize profits.

Open-in-name-only licenses are now common, but seem like FLOSS licenses only to the most casual of readers. We've succeeded in convincing everyone to “check the OSI license list before you buy”. We can therefore easily dismiss licenses like the Commons Clause merely by stating they are non-free/non-open-source and urging the community to avoid them. But the next stage of tactics has begun, and it is harder to combat. What happens when for-profit companies promulgate their own hyper-aggressive (quasi-)copyleft licenses that seek to pursue the key policy goal of “selling proprietary licenses” over “defending software freedom”? We're about to find out, because, yesterday, MongoDB declared themselves the arbiter of what “strong copyleft” means.

Understanding MongoDB's Business Model

To understand the policy threat inherent in MongoDB's so-called “Server Side Public License, Version 1”, one must first understand the fundamental business model for MongoDB and companies like them. These companies use copyleft for profit-making rather than freedom-protecting. First, they require full control (either via ©AA or CLA) of all copyrights in the work, and second, they offer two independent lines of licensing. Publicly, they provide the software under the strongest copyleft license available. Privately, the same (or secretly improved) versions of the software are available under fully proprietary terms. In theory, this could be merely selling exceptions: a benign manner of funding more Free Software code — giving the proprietary option only to those who request it. In practice — in all examples that have been even mildly successful (such as MongoDB and MySQL) — this mechanism serves as a warped proprietary licensing shake-down: “Gee, it looks like you're violating the copyleft license. That's a shame. I guess you just need to abandon the copyleft version and buy a proprietary license from us to get yourself out of this jam, since we don't plan to reinstate any lost rights and permissions under the copyleft license.” In other words, this structure grants exclusive and dictatorial power to a for-profit company as the arbiter of copyleft compliance. Indeed, we have never seen any of these companies follow or endorse the Principles of Community-Oriented GPL Enforcement. While it has made me unpopular with some, I still make no apologies that I have since 2004 consistently criticized this “proprietary relicensing” business model as “nefarious”, once I started hearing regular reports that MySQL AB (now Oracle) asserts GPL violations against compliant uses merely to scare users into becoming “customers”. Other companies, including MongoDB, have since emulated this activity.

Why Seek Even Stronger Copyleft?

The GNU Affero General Public License (AGPL) has done a wonderful job defending the software freedom of community-developed projects like Mastodon and Mediagoblin. So, we should answer with skepticism a solitary for-profit company coming forward to claim that “Affero GPL has not resulted in sufficient legal incentives for some of the largest users of infrastructure software … to participate in the community. Many open source developers are struggling with a similar reality”. If the last sentence were on Wikipedia, I'd edit it to add a Citation Needed tag, as I know of no multi-copyright-held or charity-based AGPL'd project that has “struggled with this reality”. In fact, it's only a “reality” for those that engage in proprietary relicensing. Eliot Horowitz, co-founder of MongoDB and promulgator of their new license, neglects to mention that.

The most glaring problem with this license, which Horowitz admits in his OSI license-review list post, is that there was no community drafting process. Instead, a for-profit company, whose primary goal is to use copyleft as a weapon against the software-sharing community for the purpose of converting that “community” into paying customers, published this license as a fait accompli without prior public discussion of the license text.

If this action were an isolated incident by one company, ignoring it is surely the best response. Indeed, I urged everyone to simply ignore the Commons Clause. Now, we see a repackaging of the Commons Clause into a copyleft-like box (with reuse of Commons Clause's text such as “whose value derives, entirely or substantially, from the functionality of the Software”). Since both licenses were drafted in secret, we cannot know if the reuse of text was simply because the same lawyer was employed to write both, or if MongoDB has joined a broader and more significant industry-wide strategy to replace existing FLOSS licensing with alternatives that favor businesses over individuals.

The Community Creation Process Matters

Admittedly, the history of copyleft has been one of slowly evolving community-orientation. GPLv1 and GPLv2 were drafted in private, too, by Richard Stallman and FSF's (then) law firm lawyer, Jerry Cohen. However, from the start, the license steward was not Stallman himself, nor the law firm, but the FSF, a 501(c)(3) charity dedicated to serving the public good. As such, the FSF made substantial efforts in the GPLv3 process to reorient the drafting of copyleft licenses as a public policy and legislative process. Like all legislative processes, GPLv3 was not ideal — and I was even personally miffed to be relegated to the oft-ignored “GPLv3 Discussion Committee D” — but the GPLv3 process was undoubtedly a step forward in FLOSS community license drafting. Mozilla Corporation made efforts for community collaboration in redrafting the MPL, and specifically included the OSI and the FSF (arbiters of the Open Source Definition and Free Software Definition, respectively) in MPL's drafting deliberations. The modern acceptable standard is a leap rather than a step forward: a fully public, transparent drafting process with a fully public draft repository, as the copyleft-next project has done. I think we should now meet with utmost suspicion any license that does not use copyleft-next's approach of “running licensing drafting as a Free Software project”.

I was admittedly skeptical of that approach at first. What I have seen in the six years since Richard Fontana started copyleft-next is that, simply put, the key people who are impacted most fundamentally by a software license are most likely to be aware of, and engage in, a process if it is fully public, community-oriented, and uses community tools, like Git.

Like legislation, the policies outlined in copyleft licenses impact the general public, so the general public should be welcomed to the drafting. At Conservancy, we don't draft our own licenses0, so our contracts with software developers and agreements with member projects state that the licenses be both “OSI-approved Open Source” and “FSF-approved GPL-compatible Free Software”. However, you can imagine that Conservancy has a serious vested interest in what licenses are ultimately approved by the OSI and the FSF. Indeed, with so much money flowing to software developers bound by those licenses, our very charitable mission could be at stake if OSI and the FSF began approving proprietary licenses as Open, Free, and/or GPL-compatible. I want to therefore see license stewards work, as Mozilla did, to make the vetting process easier, not harder, for these organizations.

A community drafting process allows everyone to vet the license text early and often, to investigate the community and industry impact of the license, and to probe the license drafter's intent through the acceptance and rejection of proposed modified text (ideally through a DVCS). With for-profit actors seeking to gain policy control of fundamental questions such as “what is strong copyleft?”, we must demand full drafting transparency and frank public discourse.

The Challenge Licensing Arbiters Face

OSI, FSF, and Debian have a huge challenge before them. Historically, the FSF was the only organization who sought to push the boundary of strong copyleft. (Full disclosure: I created the Affero clause while working for the FSF in 2002, inspired by Henry Poole's useful and timely demands for a true network services copyleft.) Yet, the Affero clause was itself controversial. Many complained that it changed the fundamental rules of copyleft. While “triggered only on distribution, not modification” was a fundamental rule of the regular GPL, we as a community — over time and after much public debate — decided the Affero clause is a legitimate copyleft, and AGPL was declared Open Source by OSI and DFSG-free by Debian.

That debate was obviously framed by the FSF. The FSF, due to public pressure, compromised by leaving the AGPL as an indefinite fork of the GPL (i.e., the FSF did not include the Affero clause in the plain GPL). While I personally lobbied (from GPLv3 Discussion Committee D and elsewhere) for the merger of AGPL and GPL during the GPLv3 drafting process, I respect the decision of the FSF, which was informed not by my one voice, but by the voices of the entire community.

Furthermore, the FSF is a charity, chartered to serve the public good and the advancement of software freedom for users and developers. MongoDB is a for-profit company, chartered to serve the wallets of its owners. While MongoDB employees1 (like those of any other company) should be welcomed, on equal footing with unaffiliated individuals and with representatives of other companies, charities, and trade associations, to the debate about the future of copyleft, we should not accept their active framing of that debate. Submitting this license to OSI for approval without any public community discussion, and without any discussion whatsoever with the key charities in the community, is unacceptable. The OSI should now adopt a new requirement for license approval — namely, that licenses without a community-oriented drafting process should be rejected for the meta-reason of “non-transparent drafting”, regardless of their actual text. This will have the added benefit of forcing future license drafters to come to OSI, on their public mailing lists, before the license is finalized. That will save OSI the painstaking work of walking back bad license drafts, which has in recent years consumed much expert time by OSI's volunteers.

Welcoming All To Public Discussion

Earlier this year, Conservancy announced plans to host and organize the first annual CopyleftConf. Conservancy decided to do this because it seeks to create a truly neutral, open, friendly, and welcoming forum for discussion about the past and future of copyleft as a strategy for defending software freedom. We had no idea when Karen and I first mentioned the possibility of running CopyleftConf (during the Organizers' Panel at the end of the Legal and Policy DevRoom at FOSDEM in February 2018) that multiple companies would come forward and seek to control the microphone on the future of copyleft. Now that MongoDB has done so, I'm very glad that the conference was already organized and on the calendar before they acted.

Despite my criticisms of MongoDB, I welcome Eliot Horowitz, Heather Meeker (the law firm lawyer who drafted MongoDB's new license and the Commons Clause), or anyone else who was involved in the creation of MongoDB's new license to submit a talk. Conservancy will be announcing soon the independent group of copyleft experts (and critics!) who will make up the Program Committee and will independently evaluate the submissions. Even if a talk is rejected, I welcome rejected proposers to attend and speak about their views in the hallway track and the breakout sessions.

One of the most important principles in copyleft policy that our community has learned is that commercial, non-commercial, and hobbyist activity3 should have equal footing with regard to rights assured by the copyleft licenses themselves. There is no debate about that; we all agree that copyleft codebases become meeting places for hobbyists, companies, charities, and trade associations to work together toward common goals and in harmony and software freedom. With this blog post, I call on everyone to continue on the long road to applying that same principle to the meta-level of how these licenses are drafted and how they are enforced. While we have done some work recently on the latter, not enough has been done on the former. MongoDB's actions today give us an opportunity to begin that work anew.


0 While Conservancy does not draft any main FLOSS license texts, Conservancy does help with the drafting of additional permissions upon the request of our member projects. Note that additional permissions (sometimes called license exceptions) grant permission to engage in activities that the main license would otherwise prohibit. As such, by default, additional permissions can only make a copyleft license weaker, never stronger.

1, 3 I originally had “individual actors” here instead of “hobbyist activity”, and had additionally expressed poorly the idea of welcoming individuals representing all types of entities to the discussion. The miscommunication in my earlier text gave one person the wrong impression that I believe the rights of companies should be equal to the rights of individuals. I fundamentally believe that companies and organizations should not have rights of personhood, and I've updated the text in an effort to avoid such confusion.

Posted Tue Oct 16 22:44:00 2018 Tags:

[ A similar version was crossposted on Conservancy's blog. ]

Folks today lauded the news that Microsoft has joined the Open Invention Network (OIN)'s limited patent non-aggression pact, suggesting that perhaps it will bring peace in our time regarding Microsoft's historical patent aggression. While today's announcement is a step forward, we call on Microsoft to make this just the beginning of their efforts to end patent aggression against the software freedom community.

The OIN patent non-aggression pact is governed by something called the Linux System Definition. This is the most important component of the OIN non-aggression pact, because it's often surprising what is not included in that Definition, especially when compared with Microsoft's patent aggression activities. Most importantly, the non-aggression pact only applies to the upstream versions of software, including Linux itself.

We know that Microsoft has done patent troll shakedowns in the past on Linux products related to the exfat filesystem. While we at Conservancy were successful in getting the code that implements exfat for Linux released under GPL (by Samsung), that code has not been upstreamed into Linux. So, Microsoft has not included any patents they might hold on exfat into the patent non-aggression pact.

We now ask Microsoft, as a sign of good faith and to confirm its intention to end all patent aggression against Linux and its users, to submit the exfat code upstream itself under GPLv2-or-later. This would provide two important protections to Linux users regarding exfat: (a) it would include any patents that read on exfat as part of OIN's non-aggression pact while Microsoft participates in OIN, and (b) it would provide the various benefits that GPLv2-or-later provides regarding patents, including an implied patent license and those protections provided by GPLv2§7 (and possibly other GPL protections and assurances as well).

Posted Wed Oct 10 12:41:00 2018 Tags: