The lower-post-volume people behind the software in Debian.

Here's a post I've been thinking about for a while. I shared it privately with a few people last year, but wanted to find a way to present it that wouldn't be wildly misconstrued. No luck so far.

Summary: Inspired by some conversations at work, I made a variant of my previously popular SimSWE (software engineer simulator) that has our wily engineers trying to buy houses and commute to work. The real estate marketplace is modeled on Silicon Valley, a region where I don't live but my friends do, and which I define as the low-density Northern Californian suburban area including Sunnyvale, Mountain View, Cupertino, etc., but excluding San Francisco. (San Francisco is obviously relevant, since lots of engineers live in SF and commute to Silicon Valley, or vice versa, but it's a bit different so I left it out for simplicity.)

Even more than with the earlier versions of SimSWE (where I knew the mechanics in advance and just wanted to visualize them in a cool way), I learned a lot by making this simulation. As with all software projects, the tangible output was what I expected, because I kept debugging it until I got what I expected, and then I stopped. But there were more than the usual surprises along the way.

Is simulation "real science"?

Let's be clear: maybe simulation is real science sometimes, but... not when I do it.

Some of my friends were really into Zen and the Art of Motorcycle Maintenance back in high school. The book gets less profound as I get older, but one of my favourite parts is its commentary on the scientific method:

    A man conducting a gee-whiz science show with fifty thousand dollars’ worth of Frankenstein equipment is not doing anything scientific if he knows beforehand what the results of his efforts are going to be. A motorcycle mechanic, on the other hand, who honks the horn to see if the battery works is informally conducting a true scientific experiment. He is testing a hypothesis by putting the question to nature. [...]

    The formation of hypotheses is the most mysterious of all the categories of scientific method. Where they come from, no one knows. A person is sitting somewhere, minding his own business, and suddenly - flash! - he understands something he didn’t understand before. Until it’s tested the hypothesis isn’t truth. For the tests aren’t its source. Its source is somewhere else.

Here's the process I followed. I started by observing that prices in Silicon Valley are unexpectedly high, considering how much it sucks to live there, and rising quickly. (Maybe you like the weather in California and are willing to pay a premium; but if so, that premium has been rising surprisingly quickly over the last 15 years or so, even as the weather stays mostly the same.)

Then I said, I have a hypothesis about those high prices: I think they're caused by price inelasticity. Specifically, I think software engineers can make so much more money living in California, compared to anywhere else, that it would be rational to move there and dramatically overpay for housing. The increase in revenue will exceed the increase in costs.

I also hypothesized that there's a discontinuity in the market: unlike, say, New York City, where prices are high but tend to gently fluctuate, prices in Silicon Valley historically seem to have two states: spiking (eg. dotcom bubble and today's strong market) or collapsing (eg. dotcom crash).

Then I tried to generate a simulator that would demonstrate those effects.

This is cheating: I didn't make a simulator from first principles to see what would happen. What I did is I made a series of buggy simulators, and discarded all the ones that didn't show the behaviour I was looking for. That's not science. It looks similar. It probably has a lot in common with p-hacking. But I do think it's useful, if you use the results wisely.

If it's not science, then what is it?

It's part of science. This approach is a method for improved hypothesis formulation - the "most mysterious" process described in the quote above.

I started with "I think there's a discontinuity," which is too vague. Now that I made a simulator, my hypothesis is "there's a discontinuity at the point where demand exceeds supply, and the market pricing patterns should look something like this..." which is much more appropriate for real-life testing. Maybe this is something like theoretical physics versus experimental physics, where you spend some time trying to fit a formula to data you have, and some time trying to design experiments to get specific new data to see if you guessed right. Except worse, because I didn't use real data or do experiments.

Real science in this area, by the way, does get done. Here's a paper that simulated a particular 2008 housing market (not California) and compared it to the actual market data. Cool! But it doesn't help us explain what's going on in Silicon Valley.

The simulator

Okay, with all those disclaimers out of the way, let's talk about what I did. You can find the source code here, if you're into that sort of thing, but I don't really recommend it, because you'll probably find bugs. Since it's impossible for this simulation to be correct in the first place, finding bugs is rather pointless.

Anyway. Imagine a 2-dimensional region with a set of SWEs (software engineers), corporate employers, and various homes, all scattered around randomly.

Luckily, we're simulating suburban Northern California, so there's no public transit to speak of, traffic congestion is uniformly bad, and because of zoning restrictions, essentially no new housing ever gets built. Even the 2-dimensional assumption is accurate, because all the buildings are short and flat. So we can just set all those elements at boot time and leave them static.

What does change is the number of people working, the amount companies are willing to pay them, the relative sizes of different companies, and exactly which company employs a given SWE at a given time. Over a period of years, this causes gravity to shift around in the region; if an engineer buys a home to be near Silicon Graphics (RIP), their commute might get worse when they jump over to Facebook, and they may or may not decide it's time to move homes.

So we have an array of autonomous agents, their income, their employer (which has a location), their commute cost, their property value (and accumulated net worth), and their property tax.
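
For concreteness, the per-agent state can be pictured roughly as the C struct below. This is an illustrative sketch; the field names are mine, not the simulator's actual data layout.

struct swe {
        double income;          /* salary + stock while employed in the Valley */
        int    employer;        /* index of the current employer (which has a location) */
        double commute_cost;    /* pain and lost time of the daily commute, in dollars */
        double property_value;  /* market value of the currently owned home, if any */
        double net_worth;       /* accumulated savings plus home equity */
        double property_tax;    /* locked in at purchase time, California-style */
};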

(I also simulated the idiotic California "property tax payments don't change until property changes owners" behaviour. That has some effect, mainly to discourage people from exchanging equally-priced homes to move a bit closer to work, because they don't want to pay higher taxes. As a result, the market-distorting law ironically serves to increase commute times, thus also congestion, and make citizens less happy. Nice work, California.)

The hardest part of the simulator was producing a working real estate bidding system that acted even halfway believably. My simulated SWEs are real jerks; they repeatedly exploited every flaw in my market clearing mechanics, leading to all kinds of completely unnatural looking results.

Perversely, the fact that the results in this version finally seem sensible gives me confidence that the current iteration of my bidding system is not totally wrong. A trained logician could likely prove that my increased confidence is precisely wrong, but I'm not a logician, I'm a human, and here we are today.

The results

Let's look at the simulator's output plot.

The x axis is time, let's say months since start. The top chart shows one dot for every home that gets sold on the open market during the month. The red line corresponds to the 1.0 crossover point of the Demand to Supply Ratio (DSR) - the number of people wanting a home vs the number of homes available.

The second plot shows DSR directly; when DSR transitions from <1.0 to >1.0, we draw a vertical red line on all three plots. For clarity there's also a horizontal line at 1.0 on the second plot.

The third plot, liquidity, shows the number of simulated homes on the market (but not yet sold) at any given moment. "On the market" means someone has decided they're willing to sell, but the price is still being bid up, or nobody has made a good enough offer yet. (Like I said, this part of the simulator was really hard to get right. In the source it just looks like a few lines of code, but you should see how many lines of code had to die to produce those few. Pricing wise, it turns out to be quite essential that you (mostly) can't buy a house which isn't on the market, and that bidding doesn't always complete instantaneously.)

So, what's the deal with that transition at DSR=1.0?

To answer that question, we have to talk about the rational price to pay for a house. One flaw in this simulation is that our simulated agents are indeed rational: they will pay whatever it takes as long as they can still make net profit. Real people aren't like that. If a house sold for $500k last month, and you're asking $1.2 million today, they will often refuse to pay that price, just out of spite, even though the whole market has moved and there are no more $500k houses. (You could argue that it's rational to wait and see if the market drops back down. Okay, fine. I had enough trouble simulating the present. Simulating my agents' unrealistic opinions of what my simulator was going to do next seemed kinda unwieldy.)

Another convenient aspect of Silicon Valley is that almost all our agents are engineers, who are a) so numerous and b) so rich that they outnumber and overwhelm almost all other participants in the market. You can find lots of news articles about how service industry workers have insane commutes because they're completely priced out of our region of interest.

(Actually there are also a lot of long-term residents in the area who simply refuse to move out and, while complaining about the obnoxious techie infestation, now see their home as an amazing investment vehicle that keeps going up each year by economy-beating percentages. In our simulator, we can ignore these people because they're effectively not participating in the market.)

To make a long story short, our agents assume that if they can increase their income by X dollars by moving to Silicon Valley vs living elsewhere, then it is okay to pay mortgage costs up to R*X (where R is between 0 and 100%) in order to land that high-paying job. We then subtract some amount for the pain and suffering and lost work hours of the daily commute, proportionally to the length of the commute.
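
As code, the bid cap works out to something like the sketch below. The function name, the assumption that X is an annual figure, and the commute-penalty scaling are my own illustration, not the simulator's actual source.

/* Maximum monthly mortgage payment an agent will bid, per the rule above:
 * R*X (treated here as an annual figure, so divided by 12) minus a penalty
 * proportional to the length of the commute. */
double max_monthly_bid(double extra_income_x, double r,
                       double commute_minutes, double pain_per_minute)
{
        double budget = r * extra_income_x / 12.0;
        double penalty = commute_minutes * pain_per_minute;
        return budget > penalty ? budget - penalty : 0.0;
}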

As a result of all this, housing near big employers is more expensive than housing farther away. Good.

But the bidding process depends on whether DSR is less than one (fewer SWEs than houses) or more than one (more SWEs than houses). When it's less than one, people bid based on, for lack of a better word, the "value" of the land and the home. People won't overpay for a home if they can buy another one down the street for less. So prices move, slowly and smoothly, as demand changes slowly and smoothly. There's also some random variation based on luck, like occasional employer-related events (layoffs, etc). Market liquidity is pretty high: there are homes on the market that are ready to buy, if someone will pay the right price. It's a buyer's market.

Now let's look at DSR > 1.0, when (inelastic) demand exceeds supply. Under those conditions, there are a lot of people who need to move in, as soon as possible, to start profiting from their huge wages. But they can't: there aren't enough homes. So they get desperate. Every month they don't have a house, they forfeit at least (1-R)*X in net worth, and that makes them very angry, so they move fast. Liquidity goes essentially to zero. People pay more than the asking price. Bidding wars. Don't stop and think before you make an offer: someone else will buy it first, at a premium. It's a seller's market.

When this happens, prices settle at, basically, R*X. (Okay, R*X is the mortgage payment, so convert the annuity back to a selling price. The simulator also throws in some variably sized down payments depending on the net worth you've acquired through employment and previous real estate flipping. SWEs gonna SWE.)
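
Converting the monthly payment back into a selling price is just the standard fixed-rate annuity formula. A generic sketch follows; the interest rate, term, and down payment are whatever you plug in, not values taken from the simulator.

#include <math.h>

/* Present value of a fixed monthly payment over a given term,
 * plus whatever down payment the buyer brings. */
double price_from_payment(double monthly_payment, double annual_rate,
                          int years, double down_payment)
{
        double i = annual_rate / 12.0;   /* monthly interest rate */
        int n = years * 12;              /* number of payments */
        return monthly_payment * (1.0 - pow(1.0 + i, -n)) / i + down_payment;
}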

Why R*X? Because in our simulator - which isn't too unlike reality - most of our engineers make roughly the same amount of income. I mean, we all know there's some variation, but it's not that much; certainly less than an order of magnitude, right? And while there are a few very overpaid and very underpaid people, the majority will be closer to the median income. (Note that this is quite different from other housing markets, where there are many kinds of jobs, the income distribution is much wider, and most people's price sensitivity is much greater.)

So as a simplification, we can assume R and X are the same for "all" our engineers. That means they simply cannot, no matter how much they try, pay more than R*X for a home. On the other hand, it is completely rational to pay all the way up to R*X. And demand exceeds supply. So if they don't pay R*X, someone else will, and prices peak at that level.

When DSR dips back below 1.0: liquidity goes up and prices go back down. Interestingly, the simulated prices drop a lot slower than they shot up in the first place. One reason is that most people are not as desperate to sell as they were to buy. On the other hand, the people who do decide to sell might have a popular location, so people who were forced to buy before - any home at any price - might still bid up that property to improve their commute. The result is increasing price variability as people sell off not-so-great locations in exchange for still-rare great locations.

What does all this mean?

First of all, unlike healthier markets (say, New York City) where an increase in demand translates to higher prices, and demand can increase or decrease smoothly, and you can improve a property to increase its resale price, Silicon Valley is special. It has these three unusual characteristics:

  1. Demand is strictly greater than supply
  2. Most buyers share a similar upper limit on how much they can pay
  3. Other than that limit, buyers are highly price insensitive

That means, for example, that improving your home is unlikely to increase its resale value. People are already paying as much as they can. Hence the phenomenon of run-down homes worth $1.5 million in ugly neighbourhoods with no services, no culture, and no public transit, where that money could buy you a huge mansion elsewhere, or a nice condo in an interesting neighbourhood of a big city.

It means raising engineer salaries to match the higher cost of living ("cost of living adjustment") is pointless: it translates directly to higher housing prices (X goes up for everyone, so R*X goes up proportionally), which eats the benefit.

Of course, salaries do continue to rise in Silicon Valley, mostly due to continually increasing competition for employees - after all, there's no more housing so it's hard to import more of them - which is why we continue to see a rise in property values at all. But we should expect it to be proportional to wages and stock grants, not housing value or the demand/supply ratio.

In turn, this means that a slight increase in housing supply should have effectively no impact on housing prices. (This is unusual.) As long as demand exceeds supply, engineers will continue to max out the prices.

As a result though, the market price provides little indication of how much more supply is needed. If DSR > 1.0, this simulation suggests that prices will remain about flat (ignoring wage increases), regardless of changes in the housing supply. This makes it hard to decide how much housing to build. Where the market is more healthy, you can see prices drop a bit (or rise slower) when new housing comes on the market, and you can extrapolate to see how much more housing is appropriate.

At this point we can assume "much more" housing is needed. But how much? Are we at DSR=2.5 or DSR=1.001? If the latter, a small amount of added housing could drop us down to DSR=0.999, and then the market dynamics would change discontinuously. According to the simulation - which, recall, we can't necessarily trust - the prices would drop slowly, but they would still drop, by a lot. It would pop the bubble. And unlike my simulation, where all the engineers are rational, popping the bubble could cause all kinds of market panic and adjacent effects, way beyond my area of expertise.

In turn, what this means is that the NIMBYs are not all crazy. If you try to improve your home, the neighbourhood, or the region, you will not improve the property values, so don't waste your money (or municipal funds); the property values are already at maximum. But if you build more housing, you run the risk of putting DSR below 1.0 and sending property values into free fall, as they return to "normal" "healthy" market conditions.

Of course, it would be best globally if we could get the market back to normal. Big tech companies could hire more people. Service industry workers could live closer to work, enjoy better lives, and be less grumpy. With more market liquidity, engineers could buy a home they want, closer to work, instead of just whatever was available. That means they could switch employers more easily. People would spend money to improve their property and their neighbourhood, thus improving the resale value and making life more enjoyable for themselves and the next buyer.

But global optimization isn't what individuals do. They do local optimization. And for NIMBYs, that popping bubble could be a legitimate personal financial disaster. Moreover, the NIMBYs are the people who get to vote on zoning, construction rules, and improvement projects. What do you think they'll vote for? As little housing as possible, obviously. It's just common sense.

I would love to be able to give advice on what to do. It's most certainly a housing bubble. All bubbles pop eventually. Ideally you want to pop the bubble gently. But what does that mean? I don't know; an asset that deteriorates to 30% of its current price, slowly, leaves the owner just as poor as if it happened fast. And I don't know if it's possible to hold prices up to, say, 70% instead of 30%, because of that pesky discontinuity at DSR=1.0. The prices are either hyperinflated, or they aren't, and there seems to be no middle.

Uh, assuming my simulator isn't broken.

That's my hypothesis.

Posted Mon Sep 17 21:44:13 2018

Back in the early 2000s, XML was all the rage. An unusual evolution from HTML, which itself was an evolution (devolution?) from SGML, XML was supposed to be a backlash against complexity.

SGML originally grew from the publishing industry (for example, the original DocBook was an SGML language) and had flexible parser features so not-too-technical writers could use it without really understanding how tags worked. It provided some interesting shortcuts: for example, there's no reason to close the last <chapter> when opening a new <chapter>, because obviously you can't have a chapter inside a chapter, and so on. SGML was an organically-evolved mess, but it was a mess intended for humans. You can see that legacy in HTML, which was arguably just a variant of SGML for online publishing, minus a few features.

All that supposedly-human-friendly implicit behaviour became a problem, especially for making interoperable implementations (like web browsers). Now, don't get me wrong, the whole parsability issue was pretty overblown. Is browser compatibility really about what I mean when I write some overlapping tags like <b>hello <u>cruel</b> world</u>? I mean, yes. But more important are semantics, like which methods of javascript DOM objects take which sorts of parameters, or which exist at all, and what CSS even means.

But we didn't know that then. Let's say all our compatibility problems were caused by how hard it is to parse HTML.

Given that, some brave souls set out to solve the problem Once and For All. That was XML: a simplification of HTML/SGML with parsing inconsistencies removed, so that given any XML document, if nothing else, you always knew exactly what the parse tree should be. That made it a bit less human friendly (now you always had to close your tags), but most humans can figure out how to close tags, eventually, right?

Because strictness was the goal, Postel's Law didn't apply, and there was a profusion of XML validators, each stricter than the last, including fun features like silently downloading DTDs from the Internet on every run, and fun bugs like arbitrary code execution on your local machine or data leakage if that remote DTD got hacked.

(Side note about DTDs: those existed in SGML too. Interestingly, because of the implicit tag closing, it was impossible to parse SGML without knowing the DTD, because only then could you know which tags to nest and which to auto-close. In XML, since all tags need to be closed explicitly, you can happily parse any document without even having the DTD: a welcome simplification. So DTDs are vestigial, syntactically, and could have been omitted. (You can still ignore them whenever you use XML.) DTDs still mean something - they prevent syntactically legal parse trees from being accepted if they contain certain semantic errors - but that turns out to be less important. Oh well.)

Unfortunately, XML was invented by a [series of] standards committees with very little self control, so after simplifying it they couldn't stop themselves from complexifying it again. But you could mostly ignore the added bits, except for the resulting security holes, and people mostly did, and they were mostly happy.

There was a short-lived attempt to convince every person on the Internet to switch from easy-to-write HTML to easy-to-parse XHTML (HTML-over-XML), but that predictably failed, because HTML gets written a few billion times a day and HTML parsers get written once or twice a decade, so writability beats parsability every time. But that's an inconsequential historical footnote, best forgotten.

What actually matters is this:

XML is the solution to every problem

Why do we still hear about XML today? Because despite failing at its primary goal - a less hacky basis for HTML - it was massively successful at the related job of encoding other structured data. You could grab an XML parser, write a DTD, and auto-generate code for parsing pretty much anything. Using XSL, you could also auto-generate output files from your auto-parsed XML input files. If you wanted, your output could even be more XML, and the cycle could continue forever!

What all this meant is that, if you adopted XML, you never needed to write another parser or another output generator. You never needed to learn any new syntax (except, ironically, XSL and DTD) because all syntax was XML. It was the LISP of the 2000s, only with angle brackets instead of round ones, and not Turing complete, and we didn't call it programming.

Most importantly, you never needed to argue with your vendor about whether their data file was valid, because XML's standards-compliant validator tools would tell you. And since your vendor would obviously run the validator before sending you the file, you'd never receive an invalid file in the first place. Life would be perfect.

Now we're getting to the real story. XML was created to solve the interoperability problem. In enterprises, interoperability is huge: maybe the biggest problem of all. Heck, even humans at big companies have trouble cooperating, long before they have to exchange any data files. Companies will spend virtually any amount of money to fix interoperability, if they believe it'll work.

Money attracts consultants, and consultants attract methodologies, and methodologies attract megacorporations with methodology-driven products. XML was the catalyst. Money got invested, deployments got deployed, and business has never been the same since.

Right?

Okay, from your vantage point, situated comfortably with me here in the future, you might observe that it didn't all work out exactly as we'd hoped. JSON came along and wiped out XML for web apps (but did you ever wonder why we fetch JSON using an XMLHttpRequest?). SOAP and XML-RPC were pretty unbearable. XML didn't turn out to be a great language for defining your build system configs, and "XML databases" were discovered to be an astonishingly abysmal idea. Nowadays you mostly see XML in aging industries that haven't quite gotten with the programme and switched to JSON and REST and whatever.

But what's interesting is, if you ask the enterprisey executive types whether they feel like they got their money's worth from the giant deployments they did while going Full XML, the feedback will be largely positive. XML didn't live up to expectations, but spending a lot of money on interoperability kinda did. Supply chains are a lot more integrated than they used to be. Financial systems actually do send financial data back and forth. RPCs really do get Remotely Called. All that stuff got built during the XML craze.

XML, the data format, didn't have much to do with it. We could have just as easily exchanged data with JSON (if it had existed) or CSV or protobufs or whatever. But XML, the dream, was a fad everyone could get behind. Nobody ever got fired for choosing XML. That dream moved the industry forward, fitfully, chaotically, but forward.

Blockchains

So here we are back in the present. Interoperability remains a problem, because it always will. Aging financial systems are even more aged now than they were 15 or 20 years ago, and they exchange data only a little better than before. Many of us still write cheques and make "wire" transfers, so named because they were invented for the telegraph. Manufacturing supply chains are a lot better, but much of that improvement came from everybody just running the same one or two software megapackages. Legal contracts are really time consuming and essentially non-automated. Big companies are a little aggravated at having to clear their transactions through central authorities, not because they have anything against centralization and paying a few fees, but because those central authorities (whether banks, exchanges, or the court system) are really slow and inefficient.

We need a new generation of investment. And we need everyone to care about it all at once, because interoperability doesn't get fixed unless everybody fixes it.

That brings us to blockchains. Like XML, they are kinda fundamentally misguided; they don't solve a problem that is actually important. XML solved syntax, which turned out not to be the problem. Blockchains [purport to] solve centralization, which will turn out not to be the problem. But they do create the incentive to slash and burn and invest a lot of money hiring consultants. They give us an excuse to forget everything we thought we knew about contracts and interoperability and payment clearing, much of which was already irrelevant.

It's the forgetting that will allow progress.

Disclaimers

  1. Bitcoin is like the XHTML of blockchains.

  2. No, I don't think cryptocurrency investing is a good idea.

  3. Blockchain math is actually rather useful, to the extent that it is a (digitally signed) "chain of blocks," which was revolutionary long ago, when it was first conceived. As one example, git is a chain of blocks and many of its magical properties come directly from that. Chains of blocks are great.

But the other parts are all rather dumb. We can do consensus in many (much cheaper) ways. Most people don't want their transactions or legal agreements published to the world. Consumers actually like transactions to be reversible, within reason; markets work better that way. Companies even like to be able to safely unwind legal agreements sometimes when it turns out those contracts weren't the best idea. And they rarely want the public to know about their contracts, let alone their inventory details.

I predict that in 20 years, we're going to have a lot of "blockchain" stuff in production, but it won't be like how people imagine it today. It'll have vestigial bits that we wonder about, and it'll all be faintly embarrassing, like when someone sends you their old XML-RPC API doc and tells you to use that.

"Yeah, I know," they'll say. "But it was state of the art back then."

Updates

2018-09-15: Some people are saying that JSON is "schemaless." It isn't, not any more than XML, but schema enforcement is optional, like in XML, and there's more than one way to do it, like in XML. A really elegant JSON schema mechanism is Go's reflection-based one, where you declare a struct and the standard library knows how to convert it to/from JSON. This is harder to do with XML, because generic XML doesn't map directly onto typical language data structures. (The downside of declaring your schemas in Go is that the enforcement doesn't work in any other language, of course.)

Posted Fri Sep 14 21:04:13 2018

libinput 1.12 was a massive development effort (over 300 patchsets) with a bunch of new features being merged. It'll be released next week or so, so it's worth taking a step back and looking at what actually changed.

The device quirks files replace the previously used hwdb-based udev properties. I've written about this in more detail here, but the gist is: we have our own .ini-style file format that can match on devices and apply the various quirks devices need. This simplifies debugging a lot: we can now reliably tell users why a quirks file does or doesn't apply, historically a problem with the hwdb.

The sphinx-based documentation was merged, fixed and added to. We switched to sphinx for the docs and the result is much more user-friendly. Which was the point: it was a switch from developer-oriented documentation to user-oriented documentation. Not that documentation is ever finished.

The usual set of touchpad improvements went in, e.g. the slight motion on finger up is now ignored. We have size-based thumb detection now (useful for Apple touchpads!). And of course various quirks for better pressure ranges, etc. Tripletap on some synaptics touchpads had a tendency to cause multiple taps because of some weird event sequence. Movement in the software button now generates events, the buttons are not just a dead zone anymore. Pointer jump detection is more adaptive now and catches and discards smaller jumps that previously slipped through the cracks. A particularly quirky behaviour was seen on Dell XPS i2c touchpads that exhibit a huge pointer jump, courtesy of the trackpoint controller going to sleep and taking its time to wake up. The delay is still there but the pointer at least lands in the correct location.

We now have improved direction-locking for two-finger scrolling on touchpads. Scrolling up/down should not generate horizontal scroll events anymore as long as the movement is close enough to vertical. This feature is transparent: a diagonal or horizontal movement will immediately disable the direction lock and produce horizontal scroll events as expected.

The trackpoint acceleration has been re-done, see this post for more details and links to the background articles. I've only received one bug report for the new acceleration so it seems to work quite well now. Trackpoints that send events in bursts (e.g. bluetooth ones) are smoothed now to avoid jerky movement.

Velocity averaging was dropped to increase pointer accuracy. Previously we averaged the velocity across multiple events, which made the motion smoother on jittery devices but less accurate on good ones.

We build on FreeBSD now. Presumably this also means it works on FreeBSD :)

libinput now supports palm detection on touchscreens, at least where the ABS_MT_TOOL_TYPE evdev bit is provided.

I think that's about it. Busy days...

Posted Tue Sep 4 03:34:00 2018

[ A similar version of this blog post was cross-posted on Software Freedom Conservancy's blog. ]

In recent weeks, I've been involved with a complex internal discussion by a major software freedom project about a desire to take a stance on social justice issues other than software freedom. In the discussion, many different people came forward with various issues that matter to them, including vegetarianism, diversity, and speech censorship, wondering how that software freedom project should handle other social justice causes that are not software freedom. This week, another (separate and fully unrelated) project, called Lerna, publicly had a similar debate. The issues involved are challenging, and they deserve careful consideration regardless of how they are raised.

One of the first licensing discussions I was ever involved in, back in the mid-1990s, was with a developer, a lifelong global peace activist, who objected to the GPL because it allowed the USA Department of Defense and the wider military industrial complex to incorporate software into their destructive killing machines. As a lifelong pacifist myself, I sympathized with his objection, and since then, I have regularly considered the question of “do those who perpetrate other social injustices deserve software freedom?”

I ultimately drew much of my conclusion about this from activists for free speech, who have a longer history and have therefore had more time to consider the philosophical question. I remember in the late 1980s when I first learned of the ACLU, and hearing that they had assisted the Ku Klux Klan in defending their right to march. I was flabbergasted; the Klan is historically well-documented as an organization that was party to horrific murder. Why would the ACLU defend their free speech rights? Recently, many people had a similar reaction when, in defense of the freedom of association and free speech of the National Rifle Association (NRA), the ACLU filed an amicus brief in a case involving the NRA, an organization that I and many others oppose politically. Again, we're left wondering: why should we act to defend the free speech and association rights of political causes we oppose — particularly for those like the NRA and big software companies who have adequate resources to defend themselves?

A few weeks ago, I heard a good explanation of this in an interview with ACLU's Executive Director, whom I'll directly quote, as he stated succinctly the reason why ACLU has a long history of defending everyone's free speech and free association rights:

[Our decision] to give legal representation to Nazis [was controversial].… It is not for the government's role to decide who gets a permit to march based on the content of their speech. We got lots of criticism, both internally and externally. … We believe these rights are for everyone, and we truly mean it — even for people we hate and whose ideology is loathsome, disgusting, and hurtful. [The ACLU can't be] just a liberal/left advocacy group; no liberal/left advocacy group would take on these kinds of cases. … It is important for us to forge a path that talks about this being about the rights of everyone.

Ultimately, fighting for software freedom is a social justice cause similar to that of fighting for free speech and other causes that require equal rights for all. We will always find groups exploiting those freedoms for ill rather than good. We, as software freedom activists, will have to sometimes grit our teeth and defend the rights to modify and improve software for those we otherwise oppose. Indeed, they may even utilize that software for those objectionable activities. It's particularly annoying to do that for companies that otherwise produce proprietary software: after all, in another realm, they are actively working against our cause. Nevertheless, either we believe the Four Software Freedoms are universal, or we don't. If we do, even our active political opponents deserve them, too.

I think we can take a good example from the ACLU on this matter. The ACLU, by standing firm on its core principles, now has, after two generations of work, developed the power to make an impact on related causes. The ACLU is the primary organization defending immigrants who have been forcibly separated from their children by the USA government. I'd posit that only an organization with a long history of principled activity can have both the gravitas and adequate resources to take on that issue.

Fortunately, software freedom is already successful enough that we can do at least a little bit of that now. For example, Conservancy (where I work) already took a public position, early, in opposition to Trump's immigration policy because of its negative impact on software freedom, whose advancement depends on the free flow of movement by technologists around the world. Speaking out from our microphone built from our principled stand on software freedom, we can make an impact that denying software freedom to others never could. Specifically, rather than proprietarizing the license of projects to fight USA's Immigration and Customs Enforcement (ICE) and its software providers, I'd encourage us to figure out a specific FOSS package that we can prove is deployed for use at ICE, and use that fact as a rhetorical lever to criticize their bad behavior. For example, has anyone investigated if ICE uses Linux-based servers to host their otherwise proprietary software systems? If so, the Linux community is already large and powerful enough that if a group of Linux contributors made a public statement in political opposition to the use of Linux in ICE's activities, it would get national news attention here in the USA. We could even ally with the ACLU to assure the message is heard. No license change is needed to do that, and it will surely be more effective.

Again, this is how software freedom is so much like free speech. We give software freedom to all, which allows them to freely use and deploy the software for any purpose, just like hate groups can use the free speech microphone to share their ideas. However, like the ACLU, software freedom activists, who simultaneously defend all users' equal rights in copying, sharing and modifying the software, can use their platform — already standing on the moral high ground that was generated by that long-time principled support of equal rights — to speak out against those who bring harm to society in other ways.

Finally, note that the Four Software Freedoms obviously should never be the only laws and/or rules of conduct of our society. Just like you should be prevented from (proverbially) falsely yelling Fire! in a crowded movie theater, you still should be stopped when you deploy Free Software in a manner that violates some other law, or commits human rights violations. However, taking away software freedom from bad actors, while it seems like a panacea to other societal ills, will simply backfire. The simplicity and beauty of copyleft is that it takes away someone's software freedom only at the moment when they take away someone else's software freedom; copyleft ensures that is the only reason your software freedom should be lost. Simple tools work best when your social justice cause is an underdog, and we risk obscurity of our software if we seek to change the fundamental simple design of copyleft licensing to include licensing penalties for other social justice grievances (even if we could agree on which other non-FOSS causes warrant “copyleft protection”). It means we have a big tent for software freedom, and we sometimes stand under it with people whose behavior we despise. The value we have is our ability to stand with them under the tent, and tell them: “while I respect your right to share and improve that software, I find the task you're doing with the software deplorable.” That's the message I deliver to any ICE agent who used Free Software while forcibly separating parents from their children.

Posted Thu Aug 30 09:10:00 2018

I get a lot of questions from people asking me what stable kernel they should be using for their product/device/laptop/server/etc. Especially given the now-extended length of time that some kernels are being supported by me and others, this isn’t always a very obvious thing to determine. So this post is an attempt to write down my opinions on the matter. Of course, you are free to use whatever kernel version you want, but here’s what I recommend.

As always, the opinions written here are my own, I speak for no one but myself.

What kernel to pick

Here’s my short list of what kernel you should use, ranked from best to worst options. I’ll go into the details of all of these below, but if you just want the summary, here it is:

Hierarchy of what kernel to use, from best solution to worst:

  • Supported kernel from your favorite Linux distribution
  • Latest stable release
  • Latest LTS release
  • Older LTS release that is still being maintained

What kernel to never use:

  • Unmaintained kernel release

To give numbers to the above, as of August 24, 2018, the front page of kernel.org showed the following:

  • 4.18.5 is the latest stable release
  • 4.14.67 is the latest LTS release
  • 4.9.124, 4.4.152, and 3.16.57 are the older LTS releases that are still being maintained
  • 4.17.19 and 3.18.119 are “End of Life” kernels that have had a release in the past 60 days, and as such stick around on the kernel.org site for those who still might want to use them.

Quite easy, right?

Ok, now for some justification for all of this:

Distribution kernels

The best solution for almost all Linux users is to just use the kernel from your favorite Linux distribution. Personally, I prefer the community-based Linux distributions that constantly roll along with the latest updated kernel and are supported by that developer community. Distributions in this category are Fedora, openSUSE, Arch, Gentoo, CoreOS, and others.

All of these distributions use the latest stable upstream kernel release and make sure that any needed bugfixes are applied on a regular basis. That makes them some of the most solid and best kernels you can use when it comes to having the latest fixes (remember, all fixes are security fixes).

There are some community distributions that take a bit longer to move to a new kernel release, but eventually get there and support the kernel they currently have quite well. Those are also great to use, and examples of these are Debian and Ubuntu.

Just because I did not list your favorite distro here does not mean its kernel is not good. Look on the web site for the distro and make sure that the kernel package is constantly updated with the latest security patches, and all should be well.

Lots of people seem to like the old, “traditional” model of a distribution and use RHEL, SLES, CentOS or the “LTS” Ubuntu release. Those distros pick a specific kernel version and then camp out on it for years, if not decades. They do loads of work backporting the latest bugfixes and sometimes new features to these kernels, all in a quixotic quest to keep the version number from ever changing, despite having many thousands of changes on top of that older kernel version. This work is a truly thankless job, and the developers assigned to these tasks do some wonderful work in order to achieve these goals. If you like never seeing your kernel version number change, then use these distributions. They usually cost some money to use, but the support you get from these companies is worth it when something goes wrong.

So again, the best kernel you can use is one that someone else supports, and you can turn to for help. Use that support, usually you are already paying for it (for the enterprise distributions), and those companies know what they are doing.

But, if you do not want to trust someone else to manage your kernel for you, or you have hardware that a distribution does not support, then you want to run the Latest stable release:

Latest stable release

This kernel is the latest one from the Linux kernel developer community that they declare as “stable”. About every three months, the community releases a new stable kernel that contains all of the newest hardware support, the latest performance improvements, as well as the latest bugfixes for all parts of the kernel. Over the next 3 months, bugfixes that go into the next kernel release to be made are backported into this stable release, so that any users of this kernel are sure to get them as soon as possible.

This is usually the kernel that most community distributions use as well, so you can be sure it is tested and has a large audience of users. Also, the kernel community (all 4000+ developers) are willing to help support users of this release, as it is the latest one that they made.

After 3 months, a new kernel is released and you should move to it to ensure that you stay up to date, as support for this kernel is usually dropped a few weeks after the newer release happens.

If you have new hardware that was purchased after the last LTS release came out, you are almost guaranteed to have to run this kernel in order to have it supported. So for desktops or new servers, this is usually the recommended kernel to be running.

Latest LTS release

If your hardware relies on a vendor's out-of-tree patch in order to make it work properly (like almost all embedded devices these days), then the next best kernel to be using is the latest LTS release. That release gets all of the latest kernel fixes that go into the stable releases where applicable, and lots of users test and use it.

Note, no new features and almost no new hardware support is ever added to these kernels, so if you need to use a new device, it is better to use the latest stable release, not this release.

Also this release is common for users that do not like to worry about “major” upgrades happening on them every 3 months. So they stick to this release and upgrade every year instead, which is a fine practice to follow.

The downside of using this release is that you do not get the performance improvements that happen in newer kernels, except when you update to the next LTS kernel, potentially a year in the future. That could be significant for some workloads, so be very aware of this.

Also, if you have problems with this kernel release, the first thing any developer you report the issue to is going to ask is, “does the latest stable release have this problem?” So you will need to be aware that support might not be as easy to get as with the latest stable releases.

Now if you are stuck with a large patchset and can not update to a new LTS kernel once a year, perhaps you want the older LTS releases:

Older LTS release

These releases have traditionally been supported by the community for 2 years, sometimes longer when a major distribution relies on them (like Debian or SLES). However, in the past year, thanks to a lot of support and investment in testing and infrastructure from Google, Linaro, Linaro member companies, kernelci.org, and others, these kernels are starting to be supported for much longer.

The latest LTS releases, and how long they will be supported for, are listed at kernel.org/category/releases.html (as of August 24, 2018).

The reason that Google and other companies want to have these kernels live longer is the crazy (some will say broken) development model of almost all SoC chips these days. Those devices start their development lifecycle a few years before the chip is released, but that code is never merged upstream, resulting in a brand new chip being released based on a 2 year old kernel. These SoC trees usually have over 2 million lines added to them, making them something that I have started calling “Linux-like” kernels.

If the LTS releases stop happening after 2 years, then support from the community instantly stops, and no one ends up doing bugfixes for them. This results in millions of very insecure devices floating around in the world, not something that is good for any ecosystem.

Because of this dependency, these companies now require new devices to constantly update to the latest LTS releases as they happen for their specific release version (i.e. every 4.9.y release that happens). An example of this is the Android kernel requirements: new devices shipping the “O” and now “P” releases must run at least a specified minimum kernel version, and Android security releases might start to require those “.y” releases to happen more frequently on devices.

I will note that some manufacturers are already doing this today. Sony is one great example of this, updating to the latest 4.4.y release on many of their new phones for their quarterly security release. Another good example is the small company Essential which has been tracking the 4.4.y releases faster than anyone that I know of.

There is one huge caveat when using a kernel like this. The number of security fixes that get backported is not as great as with the latest LTS release, because the traditional usage model for devices running these older LTS kernels is much more restricted. These kernels are not to be used in any type of “general computing” model where you have untrusted users or virtual machines, as the ability to do some of the recent Spectre-type fixes for older releases is greatly reduced, if present at all in some branches.

So again, only use older LTS releases in a device that you fully control, or lock down with a very strong security model (like Android enforces using SELinux and application isolation). Never use these releases on a server with untrusted users, programs, or virtual machines.

Also, support from the community for these older LTS releases is greatly reduced even from the normal LTS releases, if available at all. If you use these kernels, you really are on your own, and need to be able to support the kernel yourself, or rely on your SoC vendor to provide that support for you (note that almost none of them do provide that support, so beware…)

Unmaintained kernel release

Surprisingly, many companies do just grab a random kernel release, slap it into their product and proceed to ship it in hundreds of thousands of units without a second thought. One crazy example of this would be the Lego Mindstorm systems that shipped a random -rc release of a kernel in their device for some unknown reason. A -rc release is a development release that not even the Linux kernel developers feel is ready for everyone to use just yet, let alone millions of users.

You are of course free to do this if you want, but note that you really are on your own here. The community can not support you as no one is watching all kernel versions for specific issues, so you will have to rely on in-house support for everything that could go wrong. Which for some companies and systems, could be just fine, but be aware of the “hidden” cost this might cause if you do not plan for this up front.

Summary

So, here’s a short list of different types of devices, and what I would recommend for their kernels:

  • Laptop / Desktop: Latest stable release
  • Server: Latest stable release or latest LTS release
  • Embedded device: Latest LTS release or older LTS release if the security model used is very strong and tight.

And as for me, what do I run on my machines? My laptops run the latest development kernel (i.e. Linus’s development tree) plus whatever kernel changes I am currently working on and my servers run the latest stable release. So despite being in charge of the LTS releases, I don’t run them myself, except in testing systems. I rely on the development and latest stable releases to ensure that my machines are running the fastest and most secure releases that we know how to create at this point in time.

Posted Fri Aug 24 16:04:51 2018
At devconf.us 2018 in Boston I was involved in five different talks. They should have been recorded and the result uploaded to Youtube. One event was actually a panel discussion but slides for the others should be available soon-ish.
Posted Thu Aug 23 06:16:59 2018

[ A similar version was crossposted on Conservancy's blog. ]

Proprietary software has always been about a power relationship. Copyright and other legal systems give authors the power to decide what license to choose, and usually, they choose a license that favors themselves and takes rights and permissions away from others.

The so-called “Commons Clause” purposely confuses and conflates many issues. The initiative is backed by FOSSA, a company that sells materiel in the proprietary compliance industrial complex. This clause recently made news again since other parties have now adopted this same license.

This proprietary software license, which is not Open Source and does not respect the four freedoms of Free Software, seeks to hide a power imbalance, ironically, behind the guise of “Open Source sustainability”. Their argument, once you look past their assertion that the only way to save Open Source is to not do open source, is quite plain: If we can't make money as quickly and as easily as we'd like with this software, then we have to make sure no one else can as well.

These observations are not new. Software freedom advocates have always admitted that if your primary goal is to make money, proprietary software is a better option. It's not that you can't earn a living writing only Free Software; it's that proprietary software makes it easier because you have monopolistic power, granted to you by a legal system ill-equipped to deal with modern technology. In my view, it's a power which you don't deserve — that allows you to restrict others.

Of course, we all want software freedom to exist and survive sustainably. But the environmental movement has already taught us that unbridled commerce and conspicuous consumption is not sustainable. Yet, companies still adopt strategies like this Commons Clause to prioritize rapid growth and revenue that the proprietary software industry expects, claiming these strategies bolster the Commons (even if it is a “partial commons in name only”). The two goals are often just incompatible.

At Software Freedom Conservancy (where I work), we ask our projects to be realistic about revenue. We don't typically see Conservancy projects grow at rapid rates. They grow at slow and steady rates, but they grow better, stronger, and more diverse because they take the time to invite everyone to get involved. The software takes longer to mature, but when it does it's more robust and survives longer.

I'll take a bet with anyone who'd like. Let's pick five projects under the Affero GPL and five projects under the Commons Clause, and then let's see which ones survive longer as vibrant communities with active codebases and diverse contributors.

Finally, it's not surprising that the authors chose the name “Commons”. Sadly, “commons” has for many years been a compromised term, often used by those who want to promote licenses or organizational models that do not guarantee all four freedoms inherent in software freedom. Proprietary software is the ultimate tragedy of the software commons, and while it's clever rhetoric for our opposition to claim that they can make FLOSS sustainable by proprietarizing it, such an argument is also sophistry.

Posted Wed Aug 22 09:13:00 2018

This is mostly a request for testing, because I've received zero feedback on the patches that I merged a month ago and libinput 1.12 is due to be out. No comments so far on the RC1 and RC2 either, so... well, maybe this gets a bit broader attention so we can address some things before the release. One can hope.

Required reading for this article: Observations on trackpoint input data and X server pointer acceleration analysis - part 5.

As the blog posts linked above explain, the trackpoint input data is difficult and largely arbitrary between different devices. The previous pointer acceleration in libinput relied on a fixed reporting rate, which doesn't hold at low speeds, so the new acceleration method switches back to velocity-based acceleration: we convert the input deltas to a speed, then apply the acceleration curve to that. It's not speed, it's pressure, but it doesn't really matter unless you're a stickler for technicalities.

Because basically every trackpoint has different random data ranges not linked to anything easily measurable, libinput's device quirks now support a magic multiplier to scale the trackpoint range into something resembling a sane range. This is basically what we did before with the systemd POINTINGSTICK_CONST_ACCEL property except that we're handling this in libinput now (which is where acceleration is handled, so it kinda makes sense to move it here). There is no good conversion from the previous trackpoint range property to the new multiplier because the range didn't really have any relation to the physical input users expected.

So what does this mean for you? Test the libinput RCs or, better, libinput from master (because it's stable anyway), or from the Fedora COPR and check if the trackpoint works. If not, check the Trackpoint Configuration page and follow the instructions there.
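
If you want to poke at this on your own machine before filing anything, something along these lines should work with the 1.12 RCs and their new quirks system; the event node is a placeholder for whatever your trackpoint device is:

# list devices and find the trackpoint's event node
sudo libinput list-devices

# show which quirks (if any) libinput applies to that device
sudo libinput quirks list /dev/input/eventX

# watch the pointer deltas libinput produces while you nudge the trackpoint
sudo libinput debug-events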

Posted Thu Aug 16 04:47:00 2018 Tags:

libinput made a design decision early on to use physical reference points wherever possible. So your virtual buttons are X mm high/across, the pointer movement is calculated in mm, etc. Unfortunately this exposed us to a large range of devices that don't bother to provide that information or just give us the wrong information to begin with. Patching the kernel for every device is not feasible so in 2015 the 60-evdev.hwdb was born and it has seen steady updates since. Many a libinput bug was fixed by just correcting the device's axis ranges or resolution. To take the magic out of the 60-evdev.hwdb, here's a blog post for your perusal, appreciation or, failing that, shaking a fist at. Note that the below is caller-agnostic: it doesn't matter what userspace stack you use to process your input events.

There are four parts that come together to fix devices: a kernel ioctl and a trifecta of udev rules, hwdb entries and a udev builtin.

The kernel's EVIOCSABS ioctl

It all starts with the kernel's struct input_absinfo.


struct input_absinfo {
        __s32 value;
        __s32 minimum;
        __s32 maximum;
        __s32 fuzz;
        __s32 flat;
        __s32 resolution;
};
The three values that matter right now: minimum, maximum and resolution. The "value" is just the most recent value on this axis; ignore fuzz/flat for now. The min/max values simply specify the range of values the device will give you, the resolution how many values per mm you get. Simple example: an x axis given as min 0, max 1000 at a resolution of 10 means your device is 100mm wide. There is no requirement for min to be 0, btw, and there's no clipping in the kernel so you may get values outside min/max. Anyway, your average touchpad looks like this in evemu-record:

# Event type 3 (EV_ABS)
# Event code 0 (ABS_X)
# Value 2572
# Min 1024
# Max 5112
# Fuzz 0
# Flat 0
# Resolution 41
# Event code 1 (ABS_Y)
# Value 4697
# Min 2024
# Max 4832
# Fuzz 0
# Flat 0
# Resolution 37
This is the information returned by the EVIOCGABS ioctl (EVdev IOCtl Get ABS). It is usually run once on device init by any process handling evdev device nodes.

Because plenty of devices don't announce the correct ranges or resolution, the kernel provides the EVIOCSABS ioctl (EVdev IOCtl Set ABS). This allows overwriting the in-kernel struct with new values for min/max/fuzz/flat/resolution, so processes that query the device later will get the updated ranges.

udev rules, hwdb and builtins

The kernel has no notification mechanism for updated axis ranges so the ioctl must be applied before any process opens the device. This effectively means it must be applied by a udev rule. udev rules are a bit limited in what they can do, so if we need to call an ioctl, we need to run a program. And while udev rules can do matching, the hwdb is easier to edit and maintain. So the pieces we have are: a hwdb that knows which devices to change (and the values to apply), a udev program to apply the values and a udev rule to tie those two together.

In our case the rule is 60-evdev.rules. It checks the 60-evdev.hwdb for matching entries [1], then invokes the udev-builtin-keyboard if any matching entries are found. That builtin parses the udev properties assigned by the hwdb and converts them into EVIOCSABS ioctl calls. These three pieces need to agree on each other's formats - the udev rule and hwdb agree on the matches and the hwdb and the builtin agree on the property names and value format.
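
Once a device has been matched, you can see which properties the hwdb assigned to it with udevadm; event5 below is just an example node:

# show the udev properties for an input device; hwdb matches show up as
# E: EVDEV_ABS_xx=... entries in the output
udevadm info /dev/input/event5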

The hwdb itself has no specific format beyond this (note that the property lines must start with a space):


some-match-that-identifies-a-device
 PROPERTY_NAME=value
 OTHER_NAME=othervalue
But since we want to match for specific use-cases, our udev rule assembles several specific match lines. Have a look at 60-evdev.rules again: the last rule in there assembles a string in the form of "evdev:name:the device name:content of /sys/class/dmi/id/modalias". So your hwdb entry could look like this:

evdev:name:My Touchpad Name:dmi:*svnDellInc*
 EVDEV_ABS_00=0:1:3
If the name matches and you're on a Dell system, the device gets the EVDEV_ABS_00 property assigned. The "evdev:" prefix in the match line is merely to distinguish it from other match rules and avoid false positives. It can be anything; libinput unsurprisingly used "libinput:" for its properties.

The last part now is understanding what EVDEV_ABS_00 means. It's a fixed string with the axis number as hex number - 0x00 is ABS_X. And the values afterwards are simply min, max, resolution, fuzz, flat, in that order. So the above example would set min/max to 0:1 and resolution to 3 (not very useful, I admit).

Trailing bits can be skipped altogether and bits that don't need overriding can be skipped as well provided the colons are in place. So the common use-case of overriding a touchpad's x/y resolution looks like this:


evdev:name:My Touchpad Name:dmi:*svnDellInc*
 EVDEV_ABS_00=::30
 EVDEV_ABS_01=::20
 EVDEV_ABS_35=::30
 EVDEV_ABS_36=::20
0x00 and 0x01 are ABS_X and ABS_Y, so we're setting those to 30 units/mm and 20 units/mm, respectively. And if the device is multitouch capable we also need to set ABS_MT_POSITION_X and ABS_MT_POSITION_Y to the same resolution values. The min/max ranges for all axes are left as-is.

The most confusing part is usually: the hwdb uses a binary database that needs updating whenever the hwdb entries change. A call to systemd-hwdb update does that job.
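
In practice, after editing a hwdb entry the update-and-reapply dance looks something like this; the event node is a placeholder for whichever device you're fixing:

# rebuild the binary hwdb from the text files in /etc/udev/hwdb.d and
# /usr/lib/udev/hwdb.d
sudo systemd-hwdb update

# re-run the udev rules for the device so the new properties (and the
# resulting EVIOCSABS calls) are applied without replugging or rebooting
sudo udevadm trigger /sys/class/input/eventX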

So with all the pieces in place, let's see what happens when the kernel tells udev about the device:

  • The udev rule assembles a match and calls out to the hwdb
  • The hwdb applies udev properties where applicable and returns success
  • The udev rule calls the udev keyboard-builtin
  • The keyboard builtin parses the EVDEV_ABS_xx properties and issues an EVIOCSABS ioctl for each axis
  • The kernel updates the in-kernel description of the device accordingly
  • The udev rule finishes and udev sends out the "device added" notification
  • The userspace process sees the "device added" and opens the device, which now has corrected values
  • Celebratory champagne corks are popping everywhere, hands are shaken, shoulders are patted in congratulations of another device saved from the tyranny of wrong axis ranges/resolutions

Once you understand how the various bits fit together, it should be quite easy to follow what happens. Then the remainder is just adding hwdb entries where necessary, and the touchpad-edge-detector tool is useful for figuring those out.

[1] Not technically correct: the udev rule merely calls the hwdb builtin, which searches through all hwdb entries. It doesn't matter which file the entries are in.

Posted Thu Aug 9 02:17:00 2018 Tags:

My parents live in a rural area, where the usual monopolist Internet service provider provides the usual monopolist Internet service: DSL, really far from the exchange point, very very asymmetric, and with insanely oversized buffers (ie. bufferbloat), especially in the upstream direction. The result is that, basically, if you tried to browse the web while uploading anything, it pretty much didn't work at all.

I wrote about the causes of these problems (software, of course) in my bufferbloat rant from 2011. For some reason, there's been a recent resurgence of interest in that article. Upon rereading it, I (re-)discovered that it's very... uh... stream-of-consciousness. I find it interesting that some people like it so much. Even I barely understand what I wrote anymore. Also, it's now obsolete, because there are much better solutions to the problems than there used to be, so even people who understand it are not going to get the best possible results. Time for an update!

The Challenge

I don't live in the same city as my parents, and I won't be back for a few months, but I did find myself with some spare time and a desire to pre-emptively make their Internet connection more usable for the next time I visit. So, I wanted to build a device (a "bump in the wire") that:

  • Needs zero configuration at install time
  • Does not interfere with the existing network (no DHCP, firewall, double NAT, etc)
  • Doesn't reduce security (no new admin ports in the data path)
  • Doesn't need periodic reboots
  • Actually solves their bufferbloat problem

Let me ruin the surprise: it works. Although we'll have to clarify "works" a bit.

If you don't care about all that, skip down to the actual setup below.

This is an improvement, I promise!

Here's the fast.com test result before we installed the Bump.

(Side note: there are a lot of speedtests out there. I like fast.com for two reasons. First, they have an easy-to-understand bufferbloat test. Second, their owner has strong incentives to test actual Internet speeds including peering, and to bypass various monopolistic ISPs' various speedtest-cheating traffic shaping techniques.)

And here's what it looked like after we added the Bump:

...okay, so you're probably thinking, hey, that big number is lower now! It got worse! Yes. In a very narrow sense, it did get worse. But in most senses (including all the numbers in smaller print), it got better. And even the big number is not as much worse as it appears at first.

It would take a really long time and a lot of words to try to explain how these numbers interact and why it matters. But unluckily for you, I'm on vacation!

Download speed is the wrong measurement

In my wifi data presentation from 2016, I spent a lot of time exploring what makes an Internet connection feel "fast." In particular, I showed a slide from an FCC report from 2015 (back when the FCC was temporarily anti-monopolist):

What's that slide saying? Basically, that beyond 20 Mbps or so, typical web page load times stop improving.1 Sure, if you're downloading large files, a faster connection will make it finish sooner.2 But most people spend most of their time just browsing, not downloading.

Web page load times are limited by things other than bandwidth, including javascript parsing time, rendering time, and (most relevant to us here) round trip times to the server. (Most people use "lag", "latency", and "round trip time" to mean about the same thing, so we'll do that here too.) Loading a typical web page requires several round trips: to one or more DNS servers, then the TCP three-way handshake, then SSL negotiation, then grabbing the HTML, then grabbing the javascript it points to, then grabbing whatever other files are requested by the HTML and javascript. If that's, say, 10 round trips, at 100ms each, you can see how a typical page would take at least a second to load, even with no bandwidth constraints. (Maybe there are fewer round trips needed, each with lower latencies; same idea.)

So that's the first secret: if your page load times are limited by round trip time, and round trip time goes from 80ms (or 190ms) to 31ms (or 42ms), then you could see a 2x (or 4.5x) improvement in page load speed, just from cutting latency. Our Bump achieved that - which I'll explain in a moment.

It also managed to improve the measured uplink speed in this test. How is that possible? Well, probably several interconnected reasons, but a major one is: TCP takes longer to get up to speed when the round trip time is longer. (One algorithm for this is called TCP slow start.) And it has even more trouble converging if the round trip time is variable, like it was in the first test above. The Bump makes round trip time lower, but also more consistent, so it improves TCP performance in both ways.

But how does it work?

Alert readers will have noticed that by adding a Bump in the wire, that is, by adding an extra box and thus extra overhead, I have managed to make latency less. Alert readers will hate this, as they should, because it's called "negative latency," and alert readers know that there is no such thing. (I tried to find a good explanation of it on the web, but all the pages I could find sucked. I guess that's fair, for a concept that does not exist. Shameless self-plug then: I did write a fun article involving this topic back in 2009 about work we did back in 2003. Apparently I've been obsessing over this for a long time.)

So, right, the impossible. As usual, the impossible is a magic trick. Our Bump doesn't subtract latency; it just tricks another device - in this case the misconfigured DSL router provided by the monopolistic ISP - into adding less latency, by adding a precisely controlled bit of its own. The net result is less total latency than the DSL router produces on its own.

Bufferbloat (and chocolate)

Stop me if you've heard this one before. Most DSL routers and cable modems have buffers that were sized to achieve the maximum steady-state throughput on a very fast connection - the one that the monopolistic ISP benchmarks on, for its highest priced plan. To max out the speed in such a case, you need a buffer that is some multiple of the "bandwidth delay product" (BDP), which is an easier concept than it sounds like: just multiply the bandwidth by the round trip time (delay). So if you have 100ms round trip time and your upstream is about 25 Mbps = ~2.5 MBytes/sec (using a lazy ~0.1 bytes/bit conversion to account for overhead), then your BDP is 2.5 MBytes/sec * 0.1 sec = 0.25 MBytes. If you think about it, the BDP is "how much data fits in the wire," the same way a pipe's capacity is how much water fits in the pipe. For example, if a pipe spits out 1L of water per second, and it takes 10 seconds for water to traverse the pipe, then the pipe contains 1L/sec x 10 seconds = 10L.

Anyway, the pipe is the Internet3, and we can't control the bandwidth-delay product of the Internet from our end. People spend a lot of time trying to optimize that, but they get paid a lot, and I'm on vacation, and they don't let me fiddle with their million-dollar equipment, so too bad. What I can control is the equipment that feeds into the pipe: my router, or, in our plumbing analogy, the drain.

Duck in a bathtub drain vortex,
via pinterest

You know how when you drain the bathtub, near the end it starts making that sqlrshplshhh sucking noise? That's the sound of a pipe that's not completely full. Now, after a nice bath that sound is a key part of the experience and honestly makes me feel disproportionately gleeful, but it also means your drain is underutilized. Err, which I guess is a good thing for the environment. Uh.

Okay, new analogy: oil pipelines! Wait, those are unfashionable now too. Uh... beer taps... no, apparently beer is bad for diversity or something... chocolate fountains!

Chocolate fountain via indiamart

Okay! Let's say you rented one of those super fun chocolate fountain machines for a party: the ones where a pool of delicious liquid chocolate goes down a drain at the bottom, and then gets pumped back up to the top, only to trickle gloriously down a chocolate waterfall (under which you can bathe various fruits or whatever) and back into the pool, forever, until the party is over and the lucky, lucky party consultants get to take it home every night to their now-very-diabetic children.

Mmmm, tasty, tasty chocolate. What were we talking about again?

Oh right. The drain+pump is the Internet. The pool at the bottom is the buffer in your DSL router. And the party consultant is, uh, me, increasingly sure that I've ended up on the wrong side of this analogy, because you can't eat bits, and now I'm hungry.

Aaaaanyway, a little known fact about these chocolate fountain machines is that they stop dripping chocolate before the pool completely empties. In order to keep the pump running at capacity, there needs to be enough chocolate in the pool to keep it fully fed. In an ideal world, the chocolate would drip into the pool and then the pump at a perfectly constant rate, so you could adjust the total amount of chocolate in the system to keep the pool+pump content at the absolute minimum, which is the bandwidth-delay product (FINALLY HE IS BACK ON TOPIC). But that would require your chocolate to be far too watery; thicker chocolate is more delicious (SIGH), but has the annoying habit of dripping in clumps (as shown in the picture) and not running smoothly into the drain unless the pool has extra chocolate to push it along. So what we do is to make the chocolate thicker and clumpier (not negotiable) and so, to keep the machine running smoothly, we have to add extra chocolate so that the pool stays filled, and our children thus become more diabetic than would otherwise be necessary.

Getting back to the math of the situation, if you could guarantee perfectly smooth chocolate (packet) flow, the capacity of the system could be the bandwidth-delay product, which is the minimum you need in order to keep the chocolate (bits) dripping at the maximum speed. If you make a taller chocolate tower (increase the delay), you need more chocolate, because the BDP increases. If you supersize your choco-pump (get a faster link), it moves the chocolate faster, so you need more chocolate, because the BDP increases. And if your chocolate is more gloppy (bursty traffic), you need more chocolate (bits of buffer) to make sure the pump is always running smoothly.

Moving back into pure networking (FINALLY), we have very little control over the burstiness of traffic. We generally assume it follows some statistical distribution, but in any case, while there's an average flow rate, the flow rate will always fluctuate, and sometimes it fluctuates by a lot. That means you might receive very little traffic for a while (draining your buffer aka chocolate pool) or you might get a big burst of traffic all at once (flooding your buffer aka chocolate pool). Because of a phenomenon called self-similarity, you will often get the big bursts near the droughts, which means your pool will tend to fill up and empty out, or vice versa.

(Another common analogy for network traffic is road traffic. When a road is really busy, car traffic naturally arranges itself into bursts, just like network traffic does.)

Okay! So your router is going to receive bursts of traffic, and the amount of data in transit will fluctuate. To keep your uplink fully utilized, there must always be 1 BDP of traffic in the Internet link (the round trip from your router to whatever server and back). To fill the Internet uplink, you need to keep a transmit queue in the router stocked with packets. Because the packets arrive in bursts, you need to keep that transmit queue nonempty: there's an ideal fill level so that it (almost) never empties out, but stays short enough that our children don't get unnecessarily diabetic, um, I mean, so that our traffic is not unnecessarily delayed.

An empty queue isn't our only concern: the router has limited memory. If the queue memory fills up because of a really large incoming burst, then the only thing we can do is throw away packets, either the newly-arrived ones ("tail drop") or some of the older ones ("head drop" or more generally, "active queue management").

When we throw away packets, TCP slows down. When TCP slows down, you get slower speedtest results. When you get slower speedtest results, and you're a DSL modem salesperson, you sell fewer DSL modems. So what do we do? We add more RAM to DSL modems so hopefully the queue never fills up.4 The DSL vendors who don't do this get a few percent slower speeds in the benchmarks, so nobody buys their DSL modem. Survival of the fittest!

...except, as we established earlier, that's the wrong benchmark. If customers timed page loads instead of raw download speeds, shorter buffers would be better. But they don't, unless they're the FCC in 2015, and we pay the price. (By the way, if you're an ISP, use better benchmarks! Seriously.)

So okay, that's the (very long) story of what went wrong. That's "bufferbloat." How do we fix it?

"Active" queue management

Imagine for a moment that we're making DSL routers, and we want the best of both worlds: an "unlimited" queue so it never gets so full we have to drop packets, and the shortest possible latency. (Now we're in the realm of pure fiction, because real DSL router makers clearly don't care about the latter, but bear with me for now. We'll get back to reality later.)

What we want is to have lots of space in the queue - so that when a really big burst happens, we don't have to drop packets - but for the steady state length of the queue to be really short.

But that raises a question. Where does the steady state length of the queue come from? We know why a queue can't be mainly empty - because we wouldn't have enough packets to keep the pipe full - and we know that the ideal queue utilization has something to do with the BDP and the burstiness. But who controls the rate of incoming traffic into the router?

The answer: nobody, directly. The Internet uses a very weird distributed algorithm (or family of algorithms) called "TCP congestion control." The most common TCP congestion controls (Reno and CUBIC) will basically just keep sending faster and faster until packets start getting dropped. Dropped packets, the thinking goes, mean that there isn't enough capacity so we'd better slow down. (This is why, as I mentioned above, TCP slows down when packets get dropped. It's designed that way.)

Unfortunately, a side effect of this behaviour is that the obvious dumb queue implementation - FIFO - will always be full. That's because the obvious dumb router doesn't drop packets until the queue is full. TCP doesn't slow down until packets are dropped,5 so it doesn't slow down until the queue is full. If the queue is not full, TCP will speed up until packets get dropped.6

So, all these TCP streams are sending as fast as they can until packets get dropped, and that means our queue fills up. What can we do? Well, perversely... we can drop packets before our queue fills up. As far as I know, the first proposal of this idea was Random Early Detection (RED), by Sally Floyd and Van Jacobson. The idea here is that we calculate the ideal queue utilization (based on throughput and burstiness), then drop more packets if we exceed that length, and fewer packets if we're below that length.

The only catch is that it's super hard to calculate the ideal queue utilization. RED works great if you know that value, but nobody ever does. I think I heard that Van Jacobson later proved that it's impossible to know that value, which explains a lot. Anyway, this led to the development of Controlled Delay (CoDel), by Kathleen Nichols and Van Jacobson. Instead of trying to figure out the ideal queue size in packets, CoDel just sees how long it takes for packets to traverse the queue. If it consistently takes "too long," then it starts dropping packets, which signals TCP to slow down, which shortens the average queue length, which means a shorter delay. The cool thing about this design is it's nearly configuration-free: "too long," in milliseconds, is pretty well defined no matter how fast your link is. (Note: CoDel has a lot of details I'm skipping over here. Read the research paper if you care.)

Anyway, sure enough, CoDel really works, and you don't need to configure it. It produces the best of both worlds: typically short queues that can absorb bursts. Which is why it's so annoying that DSL routers still don't use it. Jerks. Seriously.

Flow queueing (FQ)

A discussion on queue management wouldn't be complete without a discussion about flow queueing (FQ), the second half of the now very popular (except among DSL router vendors) fq_codel magic combination.

CoDel is a very exciting invention that should be in approximately every router, because it can be implemented in hardware, requires almost no extra memory, and is very fast. But it does have some limitations: it takes a while to converge, and it's not really "fair"7. Burstiness in one stream (or ten streams) can increase latency for another, which kinda sucks.

Imagine, for example, that I have an ssh session running. It uses almost no bandwidth: most of the time it just goes as fast as I can type, and no faster. But I'm also running some big file transfers, both upload and download, and that results in an upload queue that has something to do with the BDP and burstiness of the traffic, which could build up to hundreds of extra milliseconds. If the big file transfers weren't happening, my queue would be completely empty, which means my ssh traffic would get through right away, which would be optimal (just the round trip time, with no queue delay).

A naive way to work around this is prioritization: whenever an ssh packet arrives, put it at the front of the queue, and whenever a "bulk data" packet arrives, put it at the end of the queue. That way, ssh never has to wait. There are a few problems with that method though. For example, if I use scp to copy a large file over my ssh session, then that file transfer takes precedence over everything else. Oops. If I use ssh on a different port, there's no way to tag it. And so on. It's very brittle.

FQ tries to give you (nearly) the same low latency, even on a busy link, with no special configuration. To make a long story short, it keeps a separate queue for every active flow (or stream), then alternates "fairly"7 between them. Simple round-robin would work pretty well, but they take it one step further, detecting so-called "fat" flows (which send as fast as they can) and "thin" flows (which send slower than they can) and giving higher priority to the thin ones. An interactive ssh session is a thin flow; an scp-over-ssh file transfer is a fat flow.

And then you put CoDel on each of the separate FQ queues, and you get Linux's fq_codel, which works really well.
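
On a stock Linux machine you can try this with a single command; eth0 here is just a stand-in for your outgoing interface, and many distributions already default to fq_codel anyway:

# replace the root queueing discipline on the interface with fq_codel
sudo tc qdisc replace dev eth0 root fq_codel

# show per-qdisc statistics: drops, backlog, number of flows, etc.
tc -s qdisc show dev eth0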

Incidentally, it turns out that FQ alone - forget about CoDel or any other active queue management - gets you most of the benefits of CoDel, plus more. You have really long queues for your fat flows, but the thin flows don't care. The CoDel part still helps (for example, if you're doing a videoconference, you really want the latency inside that one video stream to be as low as possible; and TCP always works better with lower latency), and it's cheap, so we include it. But FQ has very straightforward benefits that are hard to resist, as long as you and FQ agree on what "fairness"7 means.

FQ is a lot more expensive than CoDel: it requires you to maintain more queues - which costs more memory and CPU time and thrashes the cache more - and you have to screw around with hash table algorithms, and so on. As far as I know, nobody knows how to implement FQ in hardware, so it's not really appropriate for routers running at the limit of their hardware capacity. This includes super-cheap home routers running gigabit ports, or backbone routers pushing terabits. On the other hand, if you're limited mainly by wifi (typically much less than a gigabit) or a super slow DSL link, the benefits of FQ outweigh its costs.8

Back to the Bump

Ok, after all that discussion about CoDel and FQ and fq_codel, you might have forgotten that this whole exercise hinged on the idea that we were making DSL routers, which we aren't, but if we were, we could really cut down that latency. Yay! Except that's not us, it's some hypothetical competent DSL router manufacturer.

I bet you're starting to guess what the Bump is, though, right? You insert it between your DSL modem and your LAN, and it runs fq_codel, and it fixes all the queuing, and life is grand, right?

Well, almost. The problem is, the Bump has two ethernet ports, the LAN side and the WAN side, and they're both really fast (in my case, 100 Mbps ethernet, but they could be gigabit ethernet, or whatever). So the data comes in at 100 Mbps, gets enqueued, then gets dequeued at 100 Mbps. If you think about it for a while, you'll see this means the queue length is always 0 or 1, which is... really short. No bufferbloat there, which means CoDel won't work, and no queue at all, which means there's nothing for FQ to prioritize either.

What went wrong? Well, we're missing one trick. We have to release the packets out the WAN port (toward the DSL modem) more slowly. Ideally, we want to let them out perfectly smoothly at exactly the rate that the DSL modem can transmit them over the DSL link. This will allow the packets to enqueue in the Bump instead, where we can fq_codel them, and will leave the DSL modem's dumb queue nearly empty. (Why can that queue be empty without sacrificing DSL link utilization? Because the burstiness going into the DSL modem is near zero, thanks to our smooth release of packets from the Bump. Remember our chocolate fountain: if the chocolate were perfectly smooth, we wouldn't need a pool of chocolate at the bottom. There would always be exactly the right amount of chocolate to keep the pump going.)

Slowing down the packet outflow from the Bump is pretty easy using something called a token bucket filter (tbf). But it turns out that nowadays there's a new thing called "cake" which is basically fq_codel+tbf combined. Combining them has some advantages that I don't really understand, but one of them is that it's really easy to set up. You just load the cake qdisc, tell it the upload and download speeds, and it does the magic. Apparently it's also less bursty and takes less CPU. So use that.
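
If you're curious what that looks like outside of a GUI, the core of it is one tc command; the interface name and rate below are made up, and shaping the download side additionally needs an ifb device, which the SQM scripts described later set up for you:

# shape the upload side: tell cake the real-world uplink rate so packets
# queue here (where they get the fq_codel-style treatment) instead of
# piling up in the DSL modem's dumb buffer
sudo tc qdisc replace dev eth0 root cake bandwidth 800kbit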

The only catch is... what upload/download speeds should we give to cake? Okay, I cheated for that one. I just asked my dad what speed his DSL link goes in real life, and plugged those in. (Someday I want to build a system that can calculate this automatically, but... it's tricky.)

But what about the downstream?

Oh, you caught me! All that stuff was talking about the upstream direction. Admittedly, on DSL links, the upstream direction is usually the worst, because it's typically about 10x slower than the downstream, which means upstream bufferbloat problems are about 10x worse than downstream. But of course, not to be left out, the people making the big heavy multi-port DSL equipment at the ISP added plenty of bufferbloat too. Can we fix that?

Kind of. I mean, ideally they'd get a Bump over on their end, between the ISP and their DSL megarouter, which would manage the uplink's queue. Or, if we're dreaming anyway, the surprisingly competent vendor of the DSL megarouter would just include fq_codel, or at least CoDel, and they wouldn't need an extra Bump. Fat chance.

It turns out, though, that if you're crazy enough, you can almost make it work in the downstream direction. There are two catches: first, FQ is pretty much impossible (the downstream queue is just one queue, not multiple queues, so tough). And second, it's a pretty blunt instrument. What you can do is throw away packets after they've traversed the downstream queue, a process called "policing" (as in, we punish your stream for breaking the rules, rather than just "shaping" all streams so that they follow the rules). With policing, the best you can do is detect that data is coming in too fast, and start dropping packets to slow it down. Unfortunately, the CoDel trick - dropping traffic only if the queue is persistently too long - doesn't work, because on the receiving side, you don't know how big the queue is. When you get a packet from the WAN side, you just send it to the LAN side, and there's no bottleneck, so your queue is always empty. You have to resort to just throwing away packets whenever the incoming rate is even close to the maximum. That is, you have to police to a rate somewhat slower than the DSL modem's downlink speed.

Whereas in the upload direction you could use, say, 99.9% of the upload rate and still have an empty queue on the DSL router, in the download direction you don't have the precise measurements needed for that. In my experience you have to use 80-90%.
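
For reference, the classic hand-rolled version of downstream policing is a tc ingress filter like the sketch below; the interface and rate are placeholders, and the SQM packages described later do something smarter than this for you:

# attach the special ingress qdisc to the interface facing the modem
sudo tc qdisc add dev eth0 handle ffff: ingress

# drop anything arriving faster than ~80-90% of the real downlink rate, so
# the queue on the ISP side never gets a chance to fill up
sudo tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 \
    police rate 4mbit burst 40k drop flowid :1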

That's why the download speed in the second fast.com test at the top of this article was reduced from the first test: I set the shaping rate pretty low. (I think I set it too low, because I wanted to ensure it would cut the latency. I had to pick some guaranteed-to-work number before shipping the Bump cross-country to my parents, and I only got one chance. More tuning would help.)

Phew!

I know, right? But, assuming you read all that, now you know how the Bump works. All that's left is learning how to build one.


BYOB (Build Your Own Bump)

Modern Linux already contains cake, which is almost all you need. So any Linux box will do, but the obvious choice is a router where you install openwrt. I used a D-Link DIR-825 because I didn't need it to go more than 100 Mbps (that's a lot faster than a 5 Mbps DSL link) and I liked the idea of a device with 0% proprietary firmware. But basically any openwrt hardware will work, as long as it has at least two ethernet ports.

You need a sufficiently new version of openwrt. I used 18.06.0. From there, install the SQM packages, as described in the openwrt wiki.
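
On 18.06 that boils down to two packages, the scripts themselves plus the web UI for them (package names as of this writing; check the wiki if they've moved):

# install the SQM scripts and the LuCI page that configures them
opkg install sqm-scripts luci-app-sqm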

Setting up the cake queue manager

This part is really easy: once the SQM packages are installed in openwrt, you just activate them in the web console. First, enable SQM like this:

In the Queue Discipline tab, make sure you're using cake instead of whatever the overcomplicated and mostly-obsolete default is:

(You could mess with the Link Layer Adaptation tab, but that's mostly for benchmark twiddlers. You're unlikely to notice if you just set your download speed to about 80%, and upload speed to about 90%, of the available bandwidth. You should probably also avoid the "advanced" checkboxes. I tried them and consistently made things worse.)
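
If you'd rather skip the web console, the same settings can be applied over ssh with uci; the interface name and rates are examples, and the option names below follow the sqm-scripts defaults as I remember them, so double-check them against your /etc/config/sqm:

# point SQM at the interface facing the modem and set the shaping rates
# (in kbit/s), using cake with its all-in-one piece_of_cake.qos script
uci set sqm.@queue[0].interface='eth1'
uci set sqm.@queue[0].download='4000'
uci set sqm.@queue[0].upload='800'
uci set sqm.@queue[0].qdisc='cake'
uci set sqm.@queue[0].script='piece_of_cake.qos'
uci set sqm.@queue[0].enabled='1'
uci commit sqm
/etc/init.d/sqm restart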

If you're boring, you now have a perfectly good wifi/ethernet/NAT router that happens to have awesome queue management. Who needs a Bump? Just throw away your old wifi/router/firewall and use this instead, attached to your DSL modem.

Fancy bridge mode

...On the other hand, if, like me, you're not boring, you'll want to configure it as a bridge, so that nothing else about the destination network needs to be reconfigured when you install it. This approach just feels more magical, because you'll have a physical box that produces negative latency. It's not as cool if the negative and positive latencies are added together all in one box; that's just latency.

What I did was to configure the port marked "4" on the DIR-825 to talk to its internal network (with a DHCP server), and configure the port marked "1" to bridge directly to the WAN port. I disabled ports 2 and 3 to prevent bridging loops during installation.

To do this, I needed two VLANs, like this:

(Note: the DIR-825 labels have the ports in the opposite order from openwrt. In this screenshot, port LAN4 is on VLAN1, but that's labelled "1" on the physical hardware. I wanted to be able to say "use ports 1 and WAN" when installing, and reserve port 4 only for configuration purposes, so I chose to go by the hardware labels.)

Next, make sure VLAN2 (aka eth0.2) is not bridged to the wan port (it's the management network, only for configuring openwrt):

And finally, bridge VLAN1 (aka eth0.1) with the wan port:

You may need to reboot to activate the new settings.

Footnotes

1 Before and since that paper in 2015, many many people have been working on cutting the number of round trips, not just the time per round trip. Some of the recent improvements include TCP fast open, TLS session resumption, and QUIC (which opens encrypted connections in "zero" round trips). And of course, javascript and rendering engines have both gotten faster, cutting the other major sources of page load times. (Meanwhile, pages have continued getting larger, sigh.) It would be interesting to see an updated version of the FCC's 2015 paper to see if the curve has changed.

2 Also, if you're watching videos, a faster connection will improve video quality (peaking at about 5 Mbps/stream for a 1080p stream or 25 Mbps/stream for 4K, in Netflix's case). But even a 20 Mbps Internet connection will let you stream four HD videos at once, which is more than most people usually need to do.

3 We like to make fun of politicians, but it's actually very accurate to describe the Internet as a "series of tubes," albeit virtual ones.

4 A more generous interpretation is that DSL modems end up with a queue size calculated using a reasonable formula, but for one particular use case, and fixed to a number of bytes. For example, a 100ms x 100 Mbps link might need 0.1s x 100 Mbit/sec x ~0.1 bytes/bit = 1 Mbyte of buffer. But on a 5 Mbit/sec link, that same 1 Mbyte would take 10 Mbits / 5 Mbit/sec = 2 seconds to empty out, which is way too long. Unfortunately, until a few years ago, nobody understood that too-large buffers could be just as destructive as too-small ones. They just figured that maxing out the buffer would max out the benchmark, and that was that.

5 Various TCP implementations try to avoid this situation. My favourite is the rather new TCP BBR, which does an almost magically good job of using all available bandwidth without filling queues. If everyone used something like BBR, we mostly wouldn't need any of the stuff in this article.

6 To be more precise, in a chain of routers, only the "bottleneck" router's queue will be full. The others all have excess capacity because the link attached to the bottleneck is overloaded. For a home Internet connection, the bottleneck is almost always the home router, so this technicality doesn't matter to our analysis.

7 Some people say that "fair" is a stupid goal in a queue. They probably say this because fairness is so hard to define: there is no queue that can be fair by all possible definitions, and no definition of fair will be the "best" thing to do in all situations. For example, let's say I'm doing a videoconference call that takes 95% of my bandwidth and my roommate wants to visit a web site. Should we now each get 50% of the bandwidth? Probably not: video calls are much more sensitive to bandwidth fluctuations, whereas when loading a web page, it mostly doesn't matter if it takes 3 seconds instead of 1 second right now, as long as it loads. I'm not going to try to take sides in this debate, except to point out that if you use FQ, the latency for most streams is much lower than if you don't, and I really like low latency.

8 Random side note: FQ is also really annoying because it makes your pings look fast even when you're building up big queues. That's because pings are "thin" and so they end up prioritized in front of your fat flows. Weirdly, this means that most benchmarks of FQ vs fq_codel show exactly the same latencies; FQ hides the CoDel improvements unless you very carefully code your benchmarks.

Posted Wed Aug 8 10:32:47 2018 Tags:

To make testing libinput git master easier, I set up a whot/libinput-git Fedora COPR yesterday. This repo gets the push triggers directly from GitLab so it will rebuild with whatever is currently on git master.

To use the COPR, simply run:


sudo dnf copr enable whot/libinput-git
sudo dnf upgrade libinput
This will give you the libinput package from git. It'll have a date/time/git sha based NVR, e.g. libinput-1.11.901-201807310551git22faa97.fc28.x86_64. Easy to spot at least.

To revert back to the regular Fedora package run:


sudo dnf copr disable whot/libinput-git
sudo dnf distro-sync "libinput-*"

Disclaimer: This is an automated build so not every package is tested. I'm running git master exclusively (from a ninja install) and I don't push to master unless the test suite succeeds. So the risk for ending up with a broken system is low.

On that note: if you are maintaining a similar repo for other distributions and would like me to add a push trigger in GitLab for automatic rebuilds, let me know.

Posted Wed Aug 1 01:07:00 2018 Tags:

libinput's documentation started out as doxygen of the developer API - that was the main target 4 years ago. Over time, more and more extra documentation was added and now most of it is aimed at users (for self-debugging and troubleshooting or just to explain concepts and features). Unfortunately, with doxygen this all ends up in the "Related Pages". The developer API documentation itself became a less important part; by now all the major compositors have libinput support and it doesn't change much. So while it needs to be there, most of the traffic goes to the user documentation (I think; it's not like I'm running stats).

Something more suited for prose-style docs was needed. I prefer the RTD look so last week I converted most of the libinput documentation into RST format and it's now built with sphinx and the RTD theme. Same URL as before: http://wayland.freedesktop.org/libinput/doc/latest/.

The biggest difference is that the Developer API Documentation (still doxygen) is now at http://wayland.freedesktop.org/libinput/doc/latest/api/ (i.e. add /api/ to the link). If you're programming against libinput's API (e.g. because you're writing a compositor), that's where you need to go.

It's still basically the same content as before; I'll be tidying things up and adding to it over the next few weeks, hopefully without breaking existing links. There is probably detritus from the doxygen → rst change floating around, and I'll be fixing that too. If you want to help out please don't hesitate, I'll do my best to be quick to review any merge requests.

Posted Mon Jul 30 04:16:00 2018 Tags:

Yesterday, we lost an important member of the FLOSS community. Gervase Markham finally succumbed to his battle with cancer (specifically, metastatic adenoid cystic carcinoma).

I met Gerv in the early 2000s, after he'd already been diagnosed. He was always very public about his illness. He was frank with all who knew him that his life expectancy was sadly well below average due to that illness. So, this outcome isn't a surprise nor a shock, but it is nevertheless sad and unfortunate for all who knew him.

I really liked Gerv. I found him insightful and thoughtful. His insatiable curiosity for my primary field — FLOSS licensing — was a source of enjoyment for me in our many conversations on the subject. Gerv was always Socratic in his approach: he asked questions, rather than make statements, even when it was pretty obvious he had an answer of his own; he liked to spark debate and seek conversation. He thoughtfully considered the opinions of others and I many times saw his positions change based on new information. I considered him open-minded and an important contributor to FLOSS licensing thought.

I bring up Gerv's open-mindedness because I know that many people didn't find him so, but, frankly, I think those folks were mistaken. It is well documented publicly that Gerv held what most would consider particularly “conservative values”. And, I'll continue with more frankness: I found a few of Gerv's views offensive and morally wrong. But Gerv was also someone who could respectfully communicate his views. I never felt the need to avoid speaking with him or otherwise distance myself. Even if a particular position offended me, it was nevertheless clear to me that Gerv had come to his conclusions by starting from his (a priori) care and concern for all of humanity. Also, I could simply say to Gerv: I really disagree with that so much, and if it became clear our views were just too far apart to productively discuss the matter further, he'd happily and collaboratively find another subject for us to discuss. Gerv was a reasonable man. He could set aside fundamental disagreements and find common ground to talk with, collaborate with, and befriend those who disagreed with him. That level of kindness and openness is rarely seen in our current times.

In fact, Gerv gave me a huge gift without even knowing it: he really helped me understand myself better. Specifically, I have for decades publicly stated my belief that the creation and promulgation of proprietary software is an immoral and harmful act. I am aware that many people (e.g., proprietary software developers) consider that view offensive. I learned much from Gerv about how to productively live in a world where the majority are offended by my deeply held, morally-founded and well-considered beliefs. Gerv taught me how to work positively, productively and in a friendly way alongside others who are offended by my most deeply-held convictions. While I mourn the loss of Gerv today, I am so glad that I had that opportunity to learn from him. I am grateful for the life he had and his work.

Gerv's time with us was too short. In response, I suggest that we look at his life and work and learn from his example. Gerv set aside his illness for as long as possible to continue good work in FLOSS. If he can do that, we can all be inspired by him to set aside virtually any problem to work hard, together, for important outcomes that are bigger than us all.

[Finally, I should note that the text above was vetted and approved by Gerv, a few months ago, before his death. I am also very impressed that he planned so carefully for his own death that he contacted Conservancy to seek to assign his copyrights for safe keeping and took the time to review and comment on the text above. ]

Posted Sun Jul 29 04:40:21 2018 Tags:
The Sun sets on iptables (image by fdecomite, CC BY 2.0)

iptables is the default Linux firewall and packet manipulation tool. If you’ve ever been responsible for a Linux machine (aside from an Android phone perhaps) then you’ve had to touch iptables. It works, but that’s about the best thing anyone can say about it.

At Red Hat we’ve been working hard to replace iptables with its successor: nftables. Which has actually been around for years but for various reasons was unable to completely replace iptables.  Until now.

What’s Wrong With iptables?

iptables is slow. It processes rules linearly, which was fine in the days of 10/100Mbit ethernet. But we can do better, and nftables does; it uses maps and concatenations to touch packets as little as possible for a given action.

Most of nftables’ intelligence is in the userland tools rather than the kernel, reducing the possibility for downtime due to kernel bugs. iptables puts most of its logic in the kernel and you can guess where that leads.

When adding or updating even a single rule, iptables must read the entire existing table from the kernel, make the change, and send the whole thing back. iptables also requires locking workarounds to prevent parallel processes from stomping on each other or returning errors. Updating an entire table requires some synchronization across all CPUs meaning the more CPUs you have, the longer it takes. These issues cause problems in container orchestration systems (like OpenShift and Kubernetes) where 100,000 rules and 15 second iptables-restore runs are not uncommon. nftables can update one or many rules without touching any of the others.

iptables requires duplicate rules for IPv4 and IPv6 packets and for multiple actions, which just makes the performance and maintenance problems worse. nftables allows the same rule to apply to both IPv4 and IPv6 and supports multiple actions in the same rule, keeping your ruleset small and simple.
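
As a rough illustration (the table and chain names here are made up), the inet family lets one rule cover both address families, and matching, counting and accepting can all happen in the same rule:

# one table and chain that handle IPv4 and IPv6 alike
nft add table inet filter
nft add chain inet filter input '{ type filter hook input priority 0; policy accept; }'

# one rule, both address families, several things at once: match, count, accept
nft add rule inet filter input tcp dport '{ 22, 80, 443 }' ct state new counter accept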

If you’ve every had to log or debug iptables, you know how awful that can be. nftables allows logging and other actions in the same rule, saving you time, effort, and cirrhosis of the liver. It also provides the “nft monitor trace” command to watch how rules apply to live packets.

nftables also uses the same netlink API infrastructure as other modern kernel systems like /sbin/ip, the Wi-Fi stack, and others, so it’s easier to use in other programs without resorting to command-line parsing and execing random binaries.

Finally, nftables has integrated set support with consistent syntax rather than requiring a separate tool like ipset.
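
Named sets live right in the ruleset with the same syntax; a small sketch, with the set name and addresses invented, building on the table from the previous example:

# a named set of IPv4 addresses, referenced directly from a rule
nft add set inet filter badhosts '{ type ipv4_addr; }'
nft add element inet filter badhosts '{ 192.0.2.1, 198.51.100.7 }'
nft add rule inet filter input ip saddr @badhosts drop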

What about eBPF?

You might have heard that eBPF will replace everything and give everyone a unicorn. It might, if/when it gets enhancements for accountability, traceability, debuggability, auditability, and broad driver support for XDP. But nftables has been around for years and has most (all?) of these things today.

nftables Everywhere

I’d like to highlight the great work by members of my team to bring nftables over the finish line:

  • Phil Sutter is almost done with compat versions of arptables and ebtables and has been adding testcases everywhere. He also added a JSON interface to libnftables (much like /sbin/ip) for easier programmatic use which firewalld will use in the near future.
  • Eric Garver updated firewalld (the default firewall manager on Fedora, RHEL, and other distros) to use nftables by default. This change alone will seamlessly flip the nftables switch for countless users. It’s a huge deal.
  • Florian Westphal figured out how to make nftables and iptables NAT coexist in the kernel. He also fixed up the iptables compat commands and handles the upstream releases to make sure we can actually use this stuff.
  • And of course the upstream netfilter community!

Thanks iptables; it’s been a nice ride. But nftables is better.


Posted Fri Jul 27 19:20:43 2018 Tags:

Gather round children, it's story time. Especially for you children who lurk on /r/linux and think you may learn something there. Today, I'll tell you a horror story. The one where we convert kernel input events into touchpad events, with the subtle subtitle of "friends don't let friends handle evdev events".

The question put forward is "why do we need libinput at all", when, as frequently suggested on the usual websites, it's sufficient to just read evdev data and there's really no need for libinput. That is of course true. You can use evdev events from the kernel directly. Did you know that the events the kernel gives you are absolute coordinates? And that not all touchpads have buttons? Or that some touchpads have specific event sequences that need to be filtered? No? Well, boy, are you in for a few surprises! Anyway, let's go and handle evdev events ourselves and write our own libmyinput.

How do we know something is a touchpad? Well, we look at the exposed evdev bits. We need ABS_X, ABS_Y and BTN_TOOL_FINGER but don't want INPUT_PROP_DIRECT. If the latter bit is set then we have a touchscreen (probably). We don't actually care about buttons here, that comes later. ABS_X and ABS_Y give us device-absolute coordinates. On touch down you get the evdev frame of "a finger is down at x/y device units from the top-left". As you move around, you get the x/y coordinate updates. The data itself is exactly the same as you would get from a touchscreen, but we know it's a touchpad because we queried the other bits at startup. So your first job is to convert the absolute x/y coordinates to deltas by subtracting the previous position.
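
(If you want to play along at home, you can dump exactly those bits for any device with the evemu tools; the event node below is a placeholder:)

# dump the evdev capability bits (ABS_X/ABS_Y, BTN_TOOL_FINGER,
# INPUT_PROP_DIRECT, ...) that tell you what kind of device this is
sudo evemu-describe /dev/input/eventX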

Touchpads have different resolutions for x and y so a delta of 10/10 does not mean it's a 45-degree movement. Better check with the resolution to convert this to physical distances to be on the safe side. Oh, btw, the axes aren't reliable. The min/max ranges and the resolutions are wrong on a large number of touchpads. Luckily systemd fixes this for you with the 60-evdev.hwdb. But I should probably note that hwdb only exists because of libinput... Either way, you don't have to care about it because the road's already paved. You're welcome.

Oh wait, you do have to care a little because there are touchpads (e.g. HP Stream 11, ZBook Studio G3, ...) where bits are missing or wrong. So you better write a device database that tells you when you have to correct the evdev bits. You could implement this as a config option but that's just saying "I know what's wrong here, I know how to fix it but I'm still going to make you google for it and edit a local configuration file to make it work". You could treat your users this way, but you really shouldn't.

As you're happily processing your deltas, you notice that on some touchpads you get motion before you touch the touchpad. Ooops, we need a way to tell whether a finger is down. Luckily the kernel gives you BTN_TOUCH for that event, so you switch your implementation to only calculate deltas when BTN_TOUCH is set. But then you realise that BTN_TOUCH is effectively a hardcoded threshold in the kernel and does not match a lot of devices. Some devices require too-hard finger pressure to trigger BTN_TOUCH, others send it on super-light pressure or even while hovering. After grinding some enamel away you find that many touchpads give you ABS_PRESSURE. Awesome, let's make touches pressure-based instead. Let's use a threshold, no, I mean a device-specific threshold (because if two touchpads were the same the universe would stop doing whatever a universe does, I clearly haven't thought this through). Luckily we already have the device database so we just add the thresholds there.

Oh, if you want this to run on an Apple touchpad, better implement touch size handling (ABS_MT_TOUCH_MAJOR/ABS_MT_TOUCH_MINOR). These axes give you the size of the touching ellipse, which is great. Except that the value is just an arbitrary number range that has no relation to physical properties, so better update your database so you can add those thresholds.

Ok, now we have single-finger handling in our libnotinput. Let's add some sophisticated touchpad features like button clicks. Buttons are easy, the kernel gives us BTN_LEFT and BTN_RIGHT and, if you're lucky, BTN_MIDDLE. Unless you have a clickpad of course, in which case you only ever get BTN_LEFT because the whole touchpad can be depressed (much like you, if you continue writing your own evdev handling). Those clickpads are in the majority of laptops these days, so we have to deal with them. The two approaches we have are "software button areas" and "clickfinger". The former detects where your finger is when you push the touchpad down - if it's in the bottom right corner we convert the kernel's BTN_LEFT to a BTN_RIGHT and pass that on. Decide how big the buttons will be (note: some touchpads that need software buttons are only 50mm high, others exceed 100mm height). Whatever size you choose, it's an invisible line on the touchpad. Do you know yet how you will handle a finger that moves from outside the button area into the button area before the click? Or the other way round? Maybe add this to your todo list for fixing later.

Maybe "clickfinger" is easier? It counts how many fingers are on the touchpad when clicking (1 finger == left click, 2 fingers == right click, 3 fingers == middle click). Much easier, except that so far we only handle one finger. The easy fix is to use BTN_TOOL_DOUBLETAP and BTN_TOOL_TRIPLETAP which are bitflags that tell you when a second/third finger are down. Add that to your libthisisnotlibinput. Coincidentally, users often click with their thumb while moving. So you have one finger moving the pointer, then a thumb click. Two fingers down but the user doesn't perceive it as such, this should be a left click. Oops, we don't actually know where the second finger is.

Let's switch our libstillnotlibinput to use ABS_MT_POSITION_X and ABS_MT_POSITION_Y because that gives us per-finger position information (once you understand how the kernel's MT protocol slots work). And when I say "switch" of course I meant "add" because there are still touchpads in use that don't support multitouch so you get to keep both implementations. There are also a bunch of touchpads that can give you the position of two fingers but not of the third. Wipe that tear away and pencil that into your todo list. I haven't mentioned semi-mt devices yet that will give you multitouch position data for two fingers but it won't track them correctly - the first touch position is always the top/left of the bounding box, the second touch is always the bottom/right of the bounding box. Do the right thing for our libwhathaveidone and just pretend semi-mt devices are single-touch touchpads. libinput (the real one) does the same because my sanity is something to be cherished.

Oh, on another note, some touchpads don't have any buttons (some Wacom tablets are large touchpads). Add that to your todo list. You wanted middle buttons to work? Few touchpads have a middle button (clickpads never do anyway). Better write a middle button emulation system that generates BTN_MIDDLE when both buttons are pressed. Or when a finger is on the left and another finger is on the right software button. Or when a finger is in a virtual middle button area. All these need to be present because if not, you get dissed by users for not implementing their favourite interaction method.

So we're several paragraphs in and so far we have: finger tracking and some button handling. And a bunch of things on the todo list. We haven't even started with other fancy features like edge scrolling, two-finger scrolling, pinch/swipe gestures or thumb and palm detection. Oh, and you're not yet handling any other devices like graphics tablets which are a world of their own. If you think all the other features and devices are any less of a mess... well, an Austrian comedian once said (paraphrased): "optimism is just a fancy word for ignorance".

All this is just handling features that users have come to expect. Examples of non-features that you'll have to implement anyway: on some Lenovo series (*50 and newer) you will get a pointer jump after a series of events that only have pressure information. You'll have to detect and discard that jump. The HP Pavilion DM4 touchpad has random jumps in the slot data. Synaptics PS/2 touchpads may 'randomly' end touches and restart them on the next event frame 10ms later; if you don't handle that you'll get ghost taps. And so on and so forth.

So as you, happily or less so, continue writing your libthisismoreworkthanexpected you'll eventually come to realise that you're just reimplementing libinput. Congratulations or condolences, whichever applies.

libinput's raison d'etre is that it deals with all the mess above so that compositor authors can be blissfully unaware of all this. That's the reason why all the major/general-purpose compositors have switched to libinput. That's the reason most distributions now use libinput with the X server (through the xf86-input-libinput driver). libinput has made some design decisions that you may disagree with but honestly, that's life. Deal with it. It doesn't even do all I want and I wrote >90% of it. Suggesting that you can just handle evdev directly is like suggesting you can use GPS coordinates directly to navigate. Sure you can, but there's a reason why people instead use a TomTom or Google Maps.

Posted Wed Jul 25 02:34:00 2018 Tags:

I spend a lot of time answering various people's tech-business questions. Occasionally, I will say something that sounds brilliant, and an unlucky listener will exclaim, "Avery, how do you know so much stuff?"

Answer: I read it in a book. Now, I don't actually read very many books. But I've been lucky enough to receive some great recommendations. Here are mine. Now you, too, can sound like a genius.

To make it more interesting, I'll start with the questions, then tell you the book that answers them.

Q: Why is your multi-year epic project still not profitable?

Because you don't understand how the "technology adoption life cycle" (the one with visionaries, early adopters, mainstream, laggards, etc) really works. The ultimate book on this topic is Crossing the Chasm, by Geoffrey Moore in 1991.

(You're going to notice a pattern to my book suggestions: they're old. It turns out really good advice stays relevant.)

Crossing the Chasm is very dear to my heart: it was recommended to me by one of the VCs that invested in my first startup - unfortunately a bit late. Chapter by chapter, it was like reading the post-mortem for my startup, except it was all written before we ever started.

Don't do this, because that will happen. Oh, crap, that's what we did, and that's what happened. Whatever you do, avoid this common mistake, because it always results in xyz. Yup, xyz everywhere. And so on. When we finally turned things around, I gave all the credit directly to this book. It is that good.

My favourite part is the book's definition of a "market segment." Most people don't know what they're talking about when they talk about a market segment. You think you do, but you don't. I thought I did, but I didn't. You can't run a startup without understanding what a market segment is. You will fail. Read the book instead.

One of their bits of advice: don't try to capture 10% of a big market. Capture 100% of a small market. Anyone who says "if we can just get 1% of this 10 billion dollar market..." is admitting defeat. Nobody will buy your product if it resolves only 90% of their obstacles, and for every market segment, it's a different 90%. You need to find a market segment, even a small one, for which you can solve 100%.

Q: Why are all these Cloud service providers losing money and offering terrible products?

Because the "Cloud" market has already Crossed the Chasm and has moved into the mainstream ("hypergrowth") phase, where the rules are nothing like what you're used to.

After the chasm has been crossed and you've made a product that serves several adjacent market niches, it gets easier to win even more market segments, partly because your company is actually making money and growing. Each new segment makes you more profitable, so you grow more and expand more, in a positive feedback loop.

The only problem is that, once you start taking off, you attract competition. Crossing the Chasm is largely about staying orthogonal to your competitors, but as you enter the mainstream, that stops working.

That's Cloud services right now. And ride sharing.

Inside the Tornado, also by Geoffrey Moore, a few years later in 1995, talks about this situation and how to deal with it.

To be honest, I read this book because I liked Crossing the Chasm so much, but I recommend it a lot less often. Almost nobody is lucky enough to get into the middle of a hypergrowth market, so the book is mostly irrelevant to almost everyone. (One exception: by reading the book and learning what a hypergrowth market is really like, you might realize you aren't in one after all, and save yourself a lot of mistakes. Or you might realize you don't have the capital needed to compete, and bail out.)

An interesting observation in the book is that hypergrowth phases are temporary (eventually the entire mainstream is saturated), and the market share of various competitors, after hypergrowth ends, stays mostly fixed. Customers sign up with a particular product, and are reluctant to change. That's why, say, desktop PCs have run about the same set of OSes for a long time, the same word processors and spreadsheets have been popular in the same proportion for a long time, and the relative market shares between Coke and Pepsi or Colgate and Crest rarely change.

(After saturation, customers still switch from one supplier to another, but there's a dynamic equilibrium: everybody keeps spending money on sales and features, but none of them can do it any better than the others, so you might lose one customer and gain another.)

So, to oversimplify, the book's recommended strategy during hypergrowth is to invest like crazy to acquire market share, because the big customers you acquire now might stick around for decades. It's worth losing money in the short term. You have to move impossibly fast and land those customers at any cost, sacrificing short-term profitability, product elegance, and so on.

The tornado is ugly. It results in ugly products and weird business models (at least temporarily). There'll be time to fix it all later, when the market is saturated.

Q: Won't Big Company X just clone your product and steal all your customers?

Maybe! But maybe not. It depends on some surprising factors.

Nowadays in the tech industry we use the words "disruption" and "innovation" interchangeably. We like to talk about "disruptive innovation," but by now we've forgotten why we insert the adjective instead of just saying "innovation." Is it because disruption sounds cooler?

We can trace the term "disruptive innovation" back to the book that invented it in the first place: The Innovator's Dilemma, by Clayton M. Christensen, in 1997. That book is about the difference between two kinds of innovation: sustaining innovation, and disruptive innovation.

As techies, what we've forgotten is that sustaining innovation is the common one. It's so common we forget to name it. When Intel makes a newer, faster chipset (sadly less often lately) or there's a new model of laptop, that's sustaining innovation. It happens all the time. And notably, big companies are really great at it. If you're a big company and you make laptops, then if you make an even better laptop, you'll probably make customers happier, sell more of them, and make more money. Everybody wins. Companies pour money into that.

If your startup is making a product that's the same thing only a bit better, then yeah, Big Company X is going to clone it and eat your lunch. You're doing sustaining innovation, and all that takes is money. They have more money than you. The end.

Very different is disruptive innovation. A disruptive innovation generally has some fatal flaw that makes it laughable to incumbents. A smart phone... without a keyboard? Come on. A taxi service... where the drivers are untrained, unprofessional, and unlicensed? An ad algorithm that tries to show you fewer ads? A tiny little 1" hard disk that's way more expensive per gigabyte? Who wants that?

The Innovator's Dilemma explores what actually makes innovation "disruptive," across several different product areas, from digging machines, to printers, to hard disks, and shows how the same trend repeats over and over. As the incumbent, you keep making your products better (sustaining innovation). Customer requirements increase, but not as fast as your products improve. Meanwhile, some competitors are working on an obviously inferior technology, selling it to the lowest-value cheapskate customers you're happy to have off your books anyway, while you sell to ever-higher-end customers and make ever-larger profits.

The competing technology is far from meeting your customers' demands, and its trajectory clearly shows it will never catch up to your product line. Then, one day, the inferior technology - still much worse than yours - finally gets good enough to meet your biggest customers' needs. Boom. You're dead. It's still not as good as your product, but that doesn't matter. It's good enough.

What makes this a "dilemma" is that, right up until that crossover point, your company would lose money by investing in the new thing. The old thing is far more profitable, and your customers don't even want the new thing; the new thing can't yet do what they want! Any profitable company has been highly optimized to deliver what customers want, and reject what customers don't want, so of course you won't invest in the inferior product. And then, one day, all your customers change their minds all at once, and you're too late.

You can plot the whole thing on a graph. You can even plot it before it happens, knowing it will happen, and still not be able to stop it. Math is beautiful. (The book suggests a few tricks to try to save yourself, but the tricks are very non-obvious.)

If you want to know why IBM missed the PC revolution, and why Microsoft was caught off guard by the Internet, eventually exited the smartphone market, still runs apps with tiny mouse-optimized toolbars on touchscreen tablets, and still can't run your Access databases on the web, this book is the one.

It'll also help you answer that very important question: what's really stopping Big Company X from cloning my product?

Q: I have lots of money. How do I clone a competitor's product, only better?

Don't do that.

Perhaps it's too obvious when I phrase it that way, but let's be honest, people try this all the time. Imagine, say, I don't know, instant messaging apps. There's an existing product getting popular, it looks pretty trivial, I could clone that in a week! Then I'll just spend more on marketing, or bundle some product integrations, or my better brand name, and rake in the customers.

Maybe. I mean, you're not automatically doomed if you try this approach. It's worked before. Like the U.S. invasion of Afghanistan... oh wait. Well, okay, the U.S. didn't lose, right? They did spend like 10,000x as much money as their opponent though. Perhaps there was a better way.

This is where I recommend The Art of War, by the famous Sun Tzu, in the 5th century BC.

I'll admit it, the book is a bit of a cliché at this point. The advice all sounds obvious, in its obtusely-phrased way, perhaps because people have been studying it for millennia and then quoting it in Hollywood movies. Problem is, most people still aren't getting it.

The book starts with "The supreme art of war is to subdue the enemy without fighting" and moves on from there. Why are you fighting? Why are you fighting on the opponent's turf? Why is your strategy so transparent? Do you even have a strategy?

Strategy is a thing that can be learned. You won't learn it all from reading this book, which is pretty short and a bit outdated (although it does have some parts about supply chain management). But it's a start.

If you want to know how Microsoft really did get "a computer on every desk and in every home, running Microsoft software," in a world where they had an inferior product and lots of competitors, this is a good place to start.

Q: I have the Best Architecture and the Fanciest Developers. Why do people hate my product?

Lots of reasons, of course! But an interesting reason is that designers and architects have a tendency to look at their products from a bird's-eye, high-level, super-abstract view. They want the abstraction to look beautiful, and they don't spend enough time on the ground, in the dirt, fine-tuning things to make them work right.

It's virtually impossible to explain this, in terms of programming, to the programmers who are doing it. And we do it, all of us. But sometimes we can learn from analogy. That's one reason I really liked The Death and Life of Great American Cities, by Jane Jacobs, in 1961. (It was recommended to me long ago by a one-time roommate of mine, and it really changed how I think about cities and about design.)

It has absolutely nothing to do with programming, but like The Innovator's Dilemma talking about bankrupt digging machine companies, we can learn from Jane Jacobs's excellent rant about terrible-brilliant urban planners, with huge budgets, following Best Practices and optimizing literal aerial views, making most of the cities of North America into awful messes, and also severely screwing up large parts of even the cities that fared better.

Like the people who said Dependency Injection was a bad idea, Jacobs was ridiculed for a long time, but history has proven her right.

It's fascinating to read why flattening neighbourhoods and rebuilding them from scratch doesn't work; results are vastly worse than simply refactoring (although the term "refactoring" hadn't been invented in 1961). It turns out that a lot of subtle stuff was built up in that neighbourhood over time, and when you start from scratch, you lose it all. You might have heard this somewhere before.

I love this book because not only can you learn how Car Culture happened (while assuming people were doing their best, rather than resorting to a conspiracy theory about car manufacturers and public transit), but also what causes crime, what motivates people to care (or not care) for their neighbourhood buildings, and the mental traps that confound anyone in search of elegance.

Q: Why is software development so unpredictable?

Actually it isn't, you're just predicting it wrong. Sorry. I already wrote way too extensively about that so I'll resist going over it all again.

Out-of-control software development processes turn out to be a special case of general out-of-control processes, and who's the expert on process control? W. Edwards Deming, of course, in his several books, including The New Economics from the year 2000. I wrote about some of his work about a year and a half ago.

Since then, I've had more time to contemplate what I read, and I'm now pretty sure I know the very most important thing I learned: the difference between random variation and outliers.

When we're modeling any kind of statistics, we have a tendency to assume our data fits some kind of standard statistical distribution - for example, Gaussian.

What Deming was teaching was that real-life human processes often don't cleanly fit a continuous model. The reality, in manufacturing and in software, is only mostly Gaussian. Every now and then something wacky will happen that is not just part of the distribution. Imagine the difference between drilling a hole slightly off center, vs breaking the drill bit. How many standard deviations out is a broken drill bit? Wrong question.

Imagine doing a least-squares curve fit while including those outliers. Because errors are squared (hence the name), a few big errors will pull the fitted curve as much as a large number of small ones. But the outliers are by definition not predictable or reproducible, so including them in the curve fit only stops you from predicting the remaining, predictable parts.
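A toy illustration of that point, with the simplifying assumption that our "curve fit" is just the least-squares constant, i.e. the mean: ten ordinary tasks plus one broken-drill-bit outlier. The numbers are invented.

#include <stdio.h>

static double mean(const double *v, int n)
{
    double sum = 0;
    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum / n;   /* the mean is the least-squares constant fit */
}

int main(void)
{
    double normal[]       = {4, 5, 6, 5, 4, 6, 5, 5, 4, 6};
    double with_outlier[] = {4, 5, 6, 5, 4, 6, 5, 5, 4, 50};

    printf("typical estimate:      %.1f days\n", mean(normal, 10));        /* 5.0 */
    printf("estimate with outlier: %.1f days\n", mean(with_outlier, 10));  /* 9.4 */
    /* One broken-drill-bit task drags the prediction from 5 to 9.4 days
     * without making any individual future task more likely to take 9
     * days - which is why it belongs outside the fit, not inside it. */
    return 0;
}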

Deming's greatest lesson, in my opinion, is that you have to treat outliers and common errors completely differently. When you overdrive a software development process, then things don't just stretch, they break. Teammates getting tired and needing recovery (worse standard deviation) is one thing; teammates quitting (outlier) is another thing entirely. You want to reduce standard deviation, but you usually do that by adjusting a continuous variable (eg. to make recovery time more predictable, work people less hard). To prevent outliers, you need a more discrete change (eg. to keep people from quitting, figure out what is making them quit, and get rid of it).

Deming also provides actual techniques - albeit very handwavy techniques that seem to make "real" statisticians cringe - for detecting outliers vs continuous variables. Something about the standard deviation and the interquartile range. This margin is too small, etc.

...

Well, that got long. Maybe I should have just listed the book titles.

Posted Tue Jul 24 06:20:21 2018 Tags:

Someone linked me to this blog by a boutique proprietary software company complaining about porting to GNU/Linux systems, in which David Power, co-founder of Hiri, says:

Unfortunately, the fundamentalist FOSS mentality we encountered on Reddit is still alive and well. Some Linux blogs and Podcasts simply won’t give us the time of day.

I just want to quickly share a few analogous quotes that show why that statement is an unwarranted and unfair characterization of people's reasonably held beliefs. First, imagine if Hiri were not a proprietary software company, but a butcher. Here's how the quote would sound:

Unfortunately, the fundamentalist vegan mentality we encountered on Reddit is still alive and well. Some vegetarian blogs and Podcasts simply won’t give us the time of day.

Should a butcher really expect vegetarian blogs and podcasts to talk about their great new cuts of meat available? Should a butcher be surprised that vegans disagree with them?

How about if Hiri sold non-recycled card stock paper?:

Unfortunately, the fundamentalist recycling mentality we encountered on Reddit is still alive and well. Some environmentalist blogs and Podcasts simply won’t give us the time of day.

If you make a product to which a large part of the potential customer population has a moral objection, you should expect that objection, and it's reasonable for people to voice it. To admonish those people because they don't want to promote your product really is akin to a butcher annoyed that vegans won't promote their prime cuts of meat.

Posted Mon Jul 23 20:21:00 2018 Tags:

If you're one of the people in the software freedom community who is attending O'Reilly's Open Source Software Convention (OSCON) next week here in Portland, you may have seen debate about O'Reilly and Associates (ORA)'s surreptitious Code of Conduct change (and quick revocation thereof) to name “political affiliation” as a protected class. If you're going to OSCON or plan to go to an OSCON or ORA event in the future, I suggest that you familiarize yourself with this issue and the political historical context in which these events of the last few days take place.

First, OSCON has always been political: software freedom is inherently a political struggle for the rights of computer users, so any conference including that topic is necessarily political. Additionally, O'Reilly himself had stated his political positions many times at OSCON, so it's strange that, in his response this morning, O'Reilly admits that he and his staff tried to require via agreements that speakers … refrain from all political speech. OSCON can't possibly be a software freedom community event if ORA's intent … [is] to make sure that conferences put on for the exchange of technical information aren't politicized (as O'Reilly stated today). OTOH, I'm not surprised by this tack, because O'Reilly, in large part via OSCON, often pushes forward political views that O'Reilly likes, and marginalizes those he doesn't.

Second, I must strongly disagree with ORA's new (as of this morning) position that Codes of Conduct should only include “protected classes” that the laws of a particular country currently recognize. Codes of Conduct exist in our community not only as a mechanism to assure the rights of protected classes, but also to assure that everyone feels safe and free of harassment and hate speech. In fact, most Codes of Conduct in our community have “including but not limited to” language alongside any list of protected classes, and IMO all of them should.

More than that, ORA has missed a key opportunity to delineate hate speech and political speech in a manner that is sorely needed here in the USA and in the software freedom community. We live in a political climate where our Politician-in-Chief governs via Twitter and smoothly co-mingles political positioning with statements that would violate the Code of Conduct at most conferences. In other words, in a political climate where the party-ticket-headline candidate is exposed for celebrating his own sexual harassing behavior and gets elected anyway, we are culturally going to have trouble nationwide distinguishing between political speech and hate speech. Furthermore, political manipulators now use that confusion to their own ends, and we must be ever-vigilant in efforts to assure that political speech is free, but that it is delineated from hate speech, and, most importantly, that our policy on the latter is zero-tolerance.

In this climate, I'm disturbed to see that O'Reilly, who is certainly politically savvy enough to fully understand these delineations, is ignoring them completely. The rancor in our current politics — which is not just at the national level but has also trickled down into the software freedom community — is fueled by bad actors who will gladly conflate their own hate speech and political speech, and (in the irony that only post-fact politics can bring), those same people will also accuse the other side of hate speech, primarily by accusing intolerance of the original “political speech” (which was, of course, from the start, a mix of hate speech and political speech). (Examples of this abound, but one example that comes to mind is Donald Trump's public back-and-forth with San Juan Mayor Carmen Yulín Cruz.) None of ORA's policy proposals, nor O'Reilly's public response, addresses this nuance. ORA's detractors are legitimately concerned, because blanketly adding “political affiliation” to a protected class, married with an outright ban on political speech, creates an environment where selective enforcement favors the powerful, and furthermore allows the Code of Conduct to more easily become a political weapon by those who engage in the conflation practice I described.

However, it's no surprise that O'Reilly is taking this tack, either. OSCON (in particular) has a long history — on political issues of software freedom — of promoting (and even facilitating) certain political speech, even while squelching other political speech. Given that history (examples of which I include below), O'Reilly shouldn't be surprised that many in our community are legitimately skeptical about why ORA made these two changes without community discussion, only to quickly backpedal when exposed. I too am left wondering what political game O'Reilly is up to, since I recall well that Morozov documented O'Reilly's track record of political manipulation in his article, The Meme Hustler. I thus encourage everyone who attends ORA events to follow this political game with a careful eye and a good sense of OSCON history to figure out what's really going on. I've been watching for years, and OSCON is often a master class in achieving what Chomsky critically called “manufacturing consent” in politics.

For example, back in 2001, when OSCON was already in its third year, Microsoft executives went on the political attack against copyleft (calling it unAmerican and a “cancer”). O'Reilly, long unfriendly to copyleft himself, personally invited Craig Mundie of Microsoft to have a “Great Debate” keynote at the next OSCON — where Mundie would “debate” with “Open Source leaders” about the value of Open Source. In reality, O'Reilly put on stage lots of Open Source people with Mundie, but among them was no one who supported the strategy of copyleft, the primary component of Microsoft's political attacks. The “debate” was artfully framed to have only one “logical” conclusion: “we all love Open Source — even Microsoft (!) — it's just copyleft that can be problematic and which we should avoid”. It was no debate at all; only carefully crafted messaging that left out much of the picture.

That wasn't an isolated incident; both subtle and overt examples of crafted political messaging at OSCON became annual events after that. As another example, ten years later, O'Reilly did almost the same playbook again: he invited the GitHub CEO to give a very political and completely anti-copyleft keynote. After years of watching how O'Reilly carefully framed the political issue of copyleft at OSCON, I am definitely concerned about how other political issues might be framed.

And, not all political issues are equal. I follow copyleft politics because it's been my day job for two decades. But, I admit there are stakes even higher with other political topics, and having watched how ORA has handled the politics of copyleft for decades, I'm fearful that ORA is (at best) ill-equipped to handle political issues that can cause real harm — such as the current political climate that permits hate speech, and even racist speech (think of Trump calling Elizabeth Warren “Pocahontas”), as standard political fare. The stakes of contemporary politics now leave people feeling unsafe. Since OSCON is a political event, ORA should face this directly rather than pretending OSCON is merely a series of technical lectures.

The most insidious part of ORA's response to this issue is that, until the issue was called out, it seems that all political speech (particularly that in opposition to the status quo) violated OSCON's policies by default. We've successfully gotten ORA to back down from that position, but not without a fight. My biggest concern is that ORA nearly ran OSCON this year with the problematic combination of banning political speech in the speaker agreement, while treating “political affiliation” as a protected class in the Code of Conduct. Regardless of intent, confusing and unclear rules like that are gamed primarily by bad actors, and O'Reilly knows that. Indeed, just days later, O'Reilly admits that both items were serious errors, yet still asks for voluntary compliance with the “spirit” of those confusing rules.

How could it be that an organization that's been running the same event for two decades only just began to realize that these are complex issues? Paradoxically, I'm both baffled and not surprised that ORA has handled this issue so poorly. They still have no improved solution for the original problem that O'Reilly states they wanted to address (i.e., preventing hate speech). Meanwhile, they've cycled through a series of failed (and alarming) solutions without community input. Would it really have been that hard for them to publicly ask first: “We want to welcome all political views at OSCON, but we also detest hate speech that is sometimes joined with political speech. Does anyone want to join a committee to work on improvements to our policies to address this issue?” I think if they'd handled this issue in that (Open Source) way, the outcome would not have been the fiasco it's become.

Posted Thu Jul 12 09:40:00 2018 Tags:

A common error when building from source is something like the error below:


meson.build:50:0: ERROR: Native dependency 'foo' not found
or a similar error:

meson.build:63:0: ERROR: Invalid version of dependency, need 'foo' ['>= 1.1.0'] found '1.0.0'.
Seeing that can be quite discouraging, but luckily, in many cases it's not too difficult to fix. As usual, there are many ways to get to a successful result; I'll describe what I consider the simplest.

What does it mean? Dependencies are simply libraries or tools that meson needs to build the project. Usually these are declared like this in meson.build:


dep_foo = dependency('foo', version: '>= 1.1.0')
In human words: "we need the development headers for library foo (or 'libfoo') of version 1.1.0 or later". meson uses the pkg-config tool in the background to resolve that request. If we require package foo, pkg-config searches for a file foo.pc in the following directories:
  • /usr/lib/pkgconfig,
  • /usr/lib64/pkgconfig,
  • /usr/share/pkgconfig,
  • /usr/local/lib/pkgconfig,
  • /usr/local/share/pkgconfig
The error message simply means pkg-config couldn't find the file and you need to install the matching package from your distribution or from source.

An important note here: in most cases, we need the development headers of said library; installing just the library itself is not sufficient. After all, we're trying to build against it, not merely run against it.

What package provides the foo.pc file?

In many cases that package is the development variant of the library's package name. Try foo-devel (Fedora, RHEL, SuSE, ...) or foo-dev (Debian, Ubuntu, ...). yum and dnf provide a great shortcut to install any pkg-config dependency:


$> dnf install "pkgconfig(foo)"

$> yum install "pkgconfig(foo)"
will automatically search and install the right package, including its dependencies.
apt-get requires a bit more effort:

$> apt-get install apt-file
$> apt-file update
$> apt-file search --package-only foo.pc
foo-dev
$> apt-get install foo-dev
For those running Arch and pacman, the sequence is:

$> pacman -S pkgfile
$> pkgfile -u
$> pkgfile foo.pc
extra/foo
$> pacman -S extra/foo
Once that's done you can re-run meson and see if all dependencies have been met. If more packages are missing, follow the same process for the next file.

Any users of other distributions - let me know how to do this on yours and I'll update the post.

My version is wrong!

It's not uncommon to see the following error after installing the right package:


meson.build:63:0: ERROR: Invalid version of dependency, need 'foo' ['>= 1.1.0'] found '1.0.0'.
Now you're stuck and you have a problem. What this means is that the package version your distribution provides is not new enough to build your software. This is where the simple solutions end and it all gets a bit more complicated - with more potential errors. Unless you are willing to go into the deep end, I recommend moving on and accepting that you can't have the newest bits on an older distribution. Because now you have to build the dependencies from source, and that may then require building their dependencies from source, and before you know it you've built 30 packages. If you're willing, read on; otherwise - sorry, you won't be able to run your software today.

Manually installing dependencies

Now you're in the deep end, so be aware that you may see more complicated errors in the process. First of all you need to figure out where to get the source from. I'll now use cairo as an example instead of foo so you see actual data. On rpm-based distributions like Fedora, run dnf or yum:


$> dnf info cairo-devel # or yum info cairo-devel
Loaded plugins: auto-update-debuginfo, langpacks
Installed Packages
Name : cairo-devel
Arch : x86_64
Version : 1.13.1
Release : 0.1.git337ab1f.fc20
Size : 2.4 M
Repo : installed
From repo : fedora
Summary : Development files for cairo
URL : http://cairographics.org
License : LGPLv2 or MPLv1.1
Description : Cairo is a 2D graphics library designed to provide high-quality
: display and print output.
:
: This package contains libraries, header files and developer
: documentation needed for developing software which uses the cairo
: graphics library.
The important field here is the URL line - go to that URL and you'll find the source tarballs. That should be true for most projects, but you may need to google for the package name and hope. Search for the tarball with the right version number and download it. On Debian and related distributions, cairo is provided by the libcairo2-dev package. Run apt-cache show on that package:

$> apt-cache show libcairo2-dev
Package: libcairo2-dev
Source: cairo
Version: 1.12.2-3
Installed-Size: 2766
Maintainer: Dave Beckett
Architecture: amd64
Provides: libcairo-dev
Depends: libcairo2 (= 1.12.2-3), libcairo-gobject2 (= 1.12.2-3),[...]
Suggests: libcairo2-doc
Description-en: Development files for the Cairo 2D graphics library
Cairo is a multi-platform library providing anti-aliased
vector-based rendering for multiple target backends.
.
This package contains the development libraries, header files needed by
programs that want to compile with Cairo.
Homepage: http://cairographics.org/
Description-md5: 07fe86d11452aa2efc887db335b46f58
Tag: devel::library, role::devel-lib, uitoolkit::gtk
Section: libdevel
Priority: optional
Filename: pool/main/c/cairo/libcairo2-dev_1.12.2-3_amd64.deb
Size: 1160286
MD5sum: e29852ae8e8e5510b00b13dbc201ce66
SHA1: 2ed3534d02c01b8d10b13748c3a02820d10962cf
SHA256: a6099cfbcc6bd891e347dd9abc57b7f137e0fd619deaff39606fd58f0cc60d27
In this case it's the Homepage line that matters, but the process of downloading tarballs is the same as above. For Arch users, the interesting line is URL as well:

$> pacman -Si cairo | grep URL
Repository : extra
Name : cairo
Version : 1.12.16-1
Description : Cairo vector graphics library
Architecture : x86_64
URL : http://cairographics.org/
Licenses : LGPL MPL
....

Now to the complicated bit: In most cases, you shouldn't install the new version over the system version because you may break other things. You're better off installing the dependency into a custom folder (a "prefix") and pointing pkg-config to it. So let's say you downloaded the cairo tarball; now you need to run:


$> mkdir $HOME/dependencies/
$> tar xf cairo-someversion.tar.xz
$> cd cairo-someversion
$> autoreconf -ivf
$> ./configure --prefix=$HOME/dependencies
$> make && make install
$> export PKG_CONFIG_PATH=$HOME/dependencies/lib/pkgconfig:$HOME/dependencies/share/pkgconfig
# now go back to original project and run meson again
So you create a directory called dependencies and install cairo there. This will install cairo.pc as $HOME/dependencies/lib/pkgconfig/cairo.pc. Now all you need to do is tell pkg-config that you want it to look there as well - so you set PKG_CONFIG_PATH. If you re-run meson in the original project, pkg-config will find the new version and meson should succeed. If you have multiple packages that all require a newer version, install them into the same prefix and you only need to set PKG_CONFIG_PATH once. Remember that you need to set PKG_CONFIG_PATH in the same shell as you run meson from.

In the case of dependencies that use meson, you replace autotools and make with meson and ninja:


$> mkdir $HOME/dependencies/
$> tar xf foo-someversion.tar.xz
$> cd foo-someversion
$> meson builddir -Dprefix=$HOME/dependencies
$> ninja -C builddir install
$> export PKG_CONFIG_PATH=$HOME/dependencies/lib/pkgconfig:$HOME/dependencies/share/pkgconfig
# now go back to original project and run meson again

If you keep seeing the version error the most common problem is that PKG_CONFIG_PATH isn't set in your shell, or doesn't point to the new cairo.pc file. A simple way to check is:


$> pkg-config --modversion cairo
1.13.1
Is the version number the one you installed or the system one? If it is the system one, you have a typo in PKG_CONFIG_PATH; just re-set it. If it still doesn't work, do this:

$> cat $HOME/dependencies/lib/pkgconfig/cairo.pc
prefix=/usr
exec_prefix=/usr
libdir=/usr/lib64
includedir=/usr/include

Name: cairo
Description: Multi-platform 2D graphics library
Version: 1.13.1

Requires.private: gobject-2.0 glib-2.0 >= 2.14 [...]
Libs: -L${libdir} -lcairo
Libs.private: -lz -lz -lGL
Cflags: -I${includedir}/cairo
If the Version field matches what pkg-config returns, then you're set. If not, keep adjusting PKG_CONFIG_PATH until it works. There is a rare case where the Version field in the installed library doesn't match what the tarball said. That's a defective tarball and you should report this to the project, but don't worry, this hardly ever happens. In almost all cases, the cause is simply PKG_CONFIG_PATH not being set correctly. Keep trying :)

Let's assume you've managed to build the dependencies and want to run the newly built project. The only problem is: because you built against a newer library than the one on your system, you need to point the runtime linker at the new libraries.


$> export LD_LIBRARY_PATH=$HOME/dependencies/lib
and now you can, in the same shell, run your project.

Good luck!

Posted Sun Jul 8 23:46:00 2018 Tags:

This post is part of a series: Part 1, Part 2, Part 3, Part 4, Part 5.

In this post I'll describe the X server pointer acceleration for trackpoints. You will need to read Observations on trackpoint input data first to make sense of this post.

As described in that linked post, trackpoint input data varies wildly. Combined with the options the server exposes to configure everything, this makes this post a bit pointless, as almost every single behaviour can be changed.

The linked post also describes the three subjective pressure ranges: no real physical pressure, some physical pressure, and serious pressure. The line between the first two ranges is roughly where the trackpoint sends deltas at the maximum reporting rate (100Hz) but with a value of 1. Below that pressure, the intervals increase but the delta remains at 1. Above that pressure, the interval remains constant at 10ms but the deltas increase. I've used the default kernel trackpoint sensitivity of 128 for any data listed here. Here is the visualisation of how deltas and intervals change again.

The default pointer acceleration profile in the X server is the simple profile. We know this from the earlier posts, it has a double-plateau shape. On a trackpoint mm/s doesn't make sense here so let's look at it in units/ms instead. A unit is simply a device-specific measurement of distance/pressure/tilt/whatever - it all depends on the device. On trackpoints that is (mostly) sideways pressure or tilt. On mice and touchpads we can convert units to mm based on their resolution. On trackpoints, we don't have a physical reference and we thus have to deal with it in units. The obvious problem here is that 1 unit on one device does not equal 1 unit on another device. And for configurable trackpoints, the definition of a unit changes as the sensitivity changes. And that's after the kernel already mangles it (if it does, it doesn't for all devices). So here's a box of asterisks, please sprinkle it liberally.

The smallest delta the kernel can send is 1. At a hardware report rate of 100Hz, continuous pressure to the smallest detected threshold thus generates 1 unit every 10 milliseconds or 0.1 units/ms. If I push uncomfortably hard, I can get deltas of around 10 units every 10ms or 1 unit/ms. In other words, we better zoom in here. Let's look at the meaningful range of this curve.

On my trackpoint, below 0.1 units/ms means virtually no pressure (pressure range one). Pressure range two is 0.1 to 0.4, approximately. Beyond that is pressure range three but that is also the range that becomes pointless quickly - I simply wouldn't want to press this hard in normal operation. 1 unit per ms (10 units per report) is very high pressure. This means the pointer acceleration curve is actually defined for the usable range with only outliers hitting the maximum acceleration. For mice this curve was effectively a constant acceleration for all but slow movements (see here). However, any configuration can change this curve to a point where none of the above applies.

Back to the minimum constant movement of 0.1 units/ms. That one effectively matches the start of the 'no accel' plateau. Anything below that will be decelerated, i.e. a delta of 1 unit will result in a pointer delta of less than 1 pixel. In other words, anything up to where you have to apply real pressure is decelerated.

The constant factor plateau goes all the way to 0.4 units/ms. Then there's the buggy jump to a factor of ~1.5, followed by a smooth curve to 0.8 units/ms where the factor maxes out. A bit of testing here suggests that 0.4 units/ms is in the upper limits of the second pressure range mentioned above. Going past 0.6 or 0.7 is definitely well within the third pressure range where things get uncomfortable quickly. This means that the acceleration bug is actually sitting right in the highest interesting range. Apparently no-one has noticed for 10 years.

But what does it matter? Well, probably not even that much. The only interesting bit I can see here is that we have deceleration for most low-pressure movements and a constant acceleration of 1 for most realistic movements. I very much doubt that the range above 0.4 really matters.

But hey, this is just the default configuration. It is affected when someone changes the speed slider in GNOME, or when someone changes the sensitivity at the sysfs level. Other trackpoints won't have the exact same behaviour. Any analysis is thrown out of the window as soon as someone changes the sysfs sensitivity or increases the acceleration threshold.

Let's talk sysfs - if we increase my trackpoint sensitivity to 200, the deltas coming from the trackpoint change. First, the pressure required to give me a constant stream of events often gives me deltas of size 2 or 3. So we're half-way into the no acceleration plateau here. Higher pressures easily give me deltas of size 10 or 1 unit per ms, the edge of the image above.

I wish I could analyse this any further but realistically, the only takeaway here is that any change in configuration options results in some version of trial-and-error by the user until the trackpoint moves as they want to. But without knowing all those options, we just cannot know what exactly is happening.

However, what this is useful for is comparing it to libinput. libinput got a custom trackpoint acceleration function in 1.8, designed around the hardware delta range. The idea was that you (or someone) measures the trackpoint device's range once; if it's outside of the assumed default ranges we add a hwdb entry and voila, it scales back to the right ranges and that device is fixed for good.

Except - this doesn't work. libinput scales into the delta range and calculates the factor from that, but it doesn't take the timestamps into account. It works on the assumption that trackpoint deltas arrive at a constant frequency with only the delta varying. That is simply not the case, and the dynamic range of the trackpoint is so small that any acceleration of the deltas results in jerky movement.

This is of course fixable: we can convert the deltas into a speed and then apply the acceleration curve based on that. That's the next task; if you're interested, subscribe yourself to this issue.
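A hand-wavy sketch of what that fix could look like: derive a speed in units/ms from the event timestamps and feed that into the acceleration function. The curve below is a made-up two-plateau placeholder loosely shaped like the numbers above, not libinput's actual implementation, and all names are invented.

#include <math.h>
#include <stdint.h>

struct tp_accel {
    uint64_t last_time_ms;
};

/* Placeholder curve: decelerate light pressure, factor 1.0 for the
 * realistic range, a gentle ramp for very hard pushes. */
static double accel_factor_for_speed(double units_per_ms)
{
    if (units_per_ms < 0.1)
        return units_per_ms / 0.1;
    if (units_per_ms < 0.4)
        return 1.0;
    return 1.0 + (units_per_ms - 0.4);
}

static double accelerate_delta(struct tp_accel *a, double delta,
                               uint64_t time_ms)
{
    uint64_t dt = time_ms - a->last_time_ms;
    a->last_time_ms = time_ms;

    if (dt == 0 || dt > 100)          /* first event or a long pause */
        return delta;

    double speed = fabs(delta) / (double)dt;   /* units per ms */
    return delta * accel_factor_for_speed(speed);
}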

Posted Tue Jun 26 23:15:00 2018 Tags: