The lower-post-volume people behind the software in Debian.

After applying a patch that moves a chunk of code that was placed in the wrong file to its correct place, a quick way to sanity-check that the patch does not introduce anything unexpected is to run "git blame -C -M" between HEAD^ and HEAD, like this:

  $ git blame -C -M HEAD^..HEAD -- new-location.c

The output should show the lines that moved from the old location as coming from there; lines blamed on the new commit (i.e. not coming from the old location) can then be inspected more carefully to see if they make sense.

One problem I had while doing exactly that today was that most of the screen real-estate on my 92-column wide terminal was taken by the author name and the timestamp, and I found myself pressing right and left arrow in my pager to scroll horizontally a lot, which was both frustrating and suboptimal.

  $ git blame -h

told me that there is "git blame -s" to omit that information. I thought I simply hadn't known about the option. Running "git blame" on its own source revealed that the option was added by me 8 years ago; it wasn't that I didn't know about it, I had simply forgotten ;-)
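With that, the sanity check from above fits comfortably in a narrow terminal; this just combines the options already described:

  $ git blame -s -C -M HEAD^..HEAD -- new-location.c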
Posted Mon Jun 29 16:23:00 2015 Tags:
One of the questions I had in the hackathon today is about how to use the CDK to convert SMILES strings into InChIs and InChIKeys (see doi:10.1186/1758-2946-5-14). So, here goes. This is the Groovy variant, though you can access the CDK just as well in other programming languages (Python, Java, JavaScript). We'll use the binary jar for CDK 1.5.10. We can then run code, say test.groovy, using the CDK with:

groovy -cp cdk-1.5.10.jar test.groovy

With that out of the way, let's look at the code. Let's assume we start with a text file with one SMILES string on each line, say test.smi, then we parse this file with:

new File("test.smi").eachLine { line ->
  mol = parser.parseSmiles(line)
}

This already parses the SMILES string into a chemical graph. If we pass this to the generator to create an InChIKey, we may get an error, so we do an extra check:

gen = factory.getInChIGenerator(mol)
if (gen.getReturnStatus() == INCHI_RET.OKAY) {
  println gen.inchiKey;
} else {
  println "# error: " + gen.message
}

If we combine these two bits, we get a full test.groovy program:

import org.openscience.cdk.silent.*
import org.openscience.cdk.smiles.*
import org.openscience.cdk.inchi.*
import net.sf.jniinchi.INCHI_RET

parser = new SmilesParser(
  SilentChemObjectBuilder.instance
)
factory = InChIGeneratorFactory.instance

new File("test.smi").eachLine { line ->
  mol = parser.parseSmiles(line)
  gen = factory.getInChIGenerator(mol)
  if (gen.getReturnStatus() == INCHI_RET.OKAY) {
    println gen.inchiKey;
  } else {
    println "# error: " + gen.message
  }
}

Update: John May suggested an update, which I quite like. If the result is not 100% okay, but the InChI library gave a warning, it still yields an InChIKey which we can output, along with the warning message. For this, replace the if-else statement with this code:

if (gen.returnStatus == INCHI_RET.OKAY) {
  println gen.inchiKey;
} else if (gen.returnStatus == INCHI_RET.WARNING) {
  println gen.inchiKey + " # warning: " + gen.message;
} else {
  println "# error: " + gen.message
}

Posted Sun Jun 28 22:36:00 2015 Tags:

I've been otherwise impressed with John Oliver and his ability on Last Week Tonight to find key issues that don't have enough attention and give reasonably good information about them in an entertaining way — I even lauded Oliver's discussion of non-profit organizational corruption last year. I suppose that's why I'm particularly sad (as I caught up last weekend on an old episode) to find that John Oliver basically fell for the large patent holders' pro-software-patent rhetoric on so-called “software patents”.

In short, Oliver mimics the trade association and for-profit software industry rhetoric of software patent reform rather than abolition — because trolls, supposedly, are the only problem. I hope the world's largest software patent holders send Oliver's writing staff a nice gift basket, as such might be the only thing that would signal to them that they fell into this PR trap. Although, it's admittedly slightly unfair to blame Oliver and his writers; the situation is subtle.

Indeed, someone not particularly versed in the situation can easily fall for this manipulation. It's just so easy to criticize non-practicing entities. Plus, the idea that the sole inventor might get funded on Shark Tank has a certain appeal, and fits a USAmerican sensibility of personal capitalistic success. Thus, the first-order conclusion is often, as Oliver's piece concludes, maybe if we got rid of trolls, things wouldn't be so bad.

And then there's also the focus on the patent quality issue; it's easy to convince the public that higher quality patents will make it ok to restrict software sharing and improvement with patents. It's great rhetoric for pro-patent entities to generate outrage among the technology-using public by pointing to, say, an example of a patent that reads on every Android application and telling a few jokes about patent quality. In fact, at nearly every FLOSS conference I've gone to in the last year, OIN has sponsored a speaker to talk about that very issue. The jokes at such talks aren't as good as John Oliver's, but they still get laughs and get technologists upset about patent quality and trolls — but through careful cultural engineering, not about software patents themselves.

In fact, I don't think I've seen a for-profit industry and its trade associations do so well at public outrage distraction since the “tort reform” battles of the 1980s and 1990s, which were produced in part by George W. Bush's beloved M.C. Rove himself. I really encourage those who want to understand how the anti-troll messaging manipulation works to study how and why the tort reform issue played out the way it did. (As I mentioned on the Free as in Freedom audcast, Episode 0x13, the documentary film Hot Coffee is a good resource for that.)

I've literally been laughed at publicly by OIN representatives when I point out that IBM, Microsoft, and other practicing entities do software patent shake-downs, too — just like the trolls. They're part of a well-trained and well-funded (by trade associations and companies) PR machine out there in our community to convince us that trolls and so-called “poor patent quality” are the only problems. Yet, nary a year has gone by in my adult life when I haven't seen some incident where a so-called legitimate, non-obvious software patent causes serious trouble for a Free Software project. From RSA, to the codec patents, to Microsoft FAT patent shakedowns, to IBM's shakedown of the Hercules open source project, to exfat — and that's just a few choice examples from the public tip of the practicing-entity shakedown iceberg. IMO, the practicing entities are just trolls with more expensive suits and proprietary software licenses for sale. We should politically oppose the companies and trade associations that bolster them — and call for an end to software patents.

Posted Fri Jun 26 19:25:00 2015 Tags:
The latest maintenance release for Git v2.4.x series has been tagged.
  • The setup code used to die when core.bare and core.worktree are set inconsistently, even for commands that do not need working tree.
  • There was dead code that used to handle git pull --tags and show a special-cased error message; it was made irrelevant when the semantics of the option changed back in Git 1.9 days.
  • color.diff.plain was a misnomer; give it color.diff.context as a more logical synonym.
  • The configuration reader/writer uses the mmap(2) interface to access the files; when it found a directory, it barfed with "Out of memory?".
  • Recent git prune traverses young unreachable objects to safekeep old objects in the reachability chain from them; this sometimes showed unnecessary and alarming error messages.
  • git rebase -i fired post-rewrite hook when it shouldn't (namely, when it was told to stop sequencing with exec insn).
It also contains typofixes, documentation updates and trivial code clean-ups.

Enjoy.
Posted Thu Jun 25 20:16:00 2015 Tags:
An early preview of the upcoming Git 2.5 has been tagged as v2.5.0-rc0. It is comprised of 492 non-merge commits since v2.4.0, contributed by 54 people, 17 of whom are new faces.

Among notable new features, some of my favourites are:

  • A new short-hand <branch>@{push} denotes the remote-tracking branch that tracks the branch at the remote the <branch> would be pushed to.
  • A heuristic we use to catch mistyped paths on the command line git cmd revs pathspec is to make sure that all the non-rev parameters in the later part of the command line are names of files in the working tree. But that means git grep string \*.c must always be disambiguated with --, because nobody sane will create a file whose name is literally asterisk-dot-see. We loosened the heuristic to declare that with a wildcard string the user likely meant to give us a pathspec, so you can now simply say git grep string \*.c without --. (Both this and the @{push} shorthand are shown in the examples after this list.)
  • Filter scripts were run with SIGPIPE disabled on the Git side, expecting that they may not read everything Git feeds them to filter. We however treated a filter that exits without reading its input fully as an error. We no longer do so, and ignore EPIPE when writing to feed the filter scripts.
    This changes semantics, but arguably in a good way.  If a filter can produce its output without fully consuming its input using whatever magic, we now let it do so, instead of diagnosing it as a programming error.
  • Whitespace breakages in deleted and context lines can also be painted in the output of git diff and friends with the new --ws-error-highlight option.
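To make the first two items concrete, here is how they look at the command line (the commands follow the descriptions above; the log example is illustrative):

  $ git log @{push}..       # commits not yet at the branch's push destination
  $ git grep string \*.c    # no -- needed with a wildcard pathspec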
There are a few "experimental" new features, too. They are still incomplete and/or buggy around the edges and likely to change in the future, but nevertheless interesting.
  • git cat-file --batch learned the --follow-symlinks option that follows an in-tree symbolic link when asked about an object via extended SHA-1 syntax.  For example, HEAD:RelNotes may be a symbolic link that points at Documentation/RelNotes/2.5.0.txt.  With the new option, the command behaves as if HEAD:Documentation/RelNotes/2.5.0.txt was given as input instead.
    This is incomplete in a few ways.
    (1) A symbolic link in the index, e.g. :RelNotes, should also be treated the same way, but isn't. (2) Non-batch mode, e.g. git cat-file --follow-symlinks blob HEAD:RelNotes, may also want to behave the same way, but it doesn't.
  • A replacement mechanism for contrib/workdir/git-new-workdir that does not rely on symbolic links and makes sharing of objects and refs safer by making the borrowee and borrowers aware of each other has been introduced, accessible via git checkout --to. This is accumulating more and more known bugs but may prove useful once they are fixed.
The draft release notes are also available.

Posted Thu Jun 25 20:12:00 2015 Tags:

So I did some IBLT research (as posted to bitcoin-dev) and I lazily used SHA256 to create both the temporary 48-bit txids, and from them a 16-bit index offset. Each node has to produce these for every bitcoin transaction ID it knows about (ie. its entire mempool), which is normally less than 10,000 transactions, but we’d better plan for 1M given the coming blockpocalypse.

For txid48, we hash an 8-byte seed with the 32-byte txid; I ignored the 8-byte seed for the moment, and measured various implementations of SHA256 hashing 32 bytes on my Intel Core i3-5010U CPU @ 2.10GHz laptop (though note we’d be hashing 8 extra bytes for IBLT; the implementations are in CCAN):

  1. Bitcoin’s SHA256: 527.7 +/- 0.9 nsec
  2. Optimizing the block ending on bitcoin’s SHA256: 500.4 +/- 0.66 nsec
  3. Intel’s asm rorx: 314.1 +/- 0.3 nsec
  4. Intel’s asm SSE4: 337.5 +/- 0.5 nsec
  5. Intel’s asm RORx-x8ms: 458.6 +/- 2.2 nsec
  6. Intel’s asm AVX: 336.1 +/- 0.3 nsec

So, if you have 1M transactions in your mempool, expect it to take about 0.62 seconds of hashing to calculate the IBLT (two hashes per transaction with the fastest implementation above). This is too slow (though it’s fairly trivially parallelizable). However, we just need a universal hash, not a cryptographic one, so I benchmarked murmur3_x64_128:

  1. Murmur3-128: 23 nsec

That’s more like 0.046 seconds of hashing (again two hashes per transaction), which seems like enough of a win to add a new hash to the mix.
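For concreteness, here is a minimal C sketch of deriving the 48-bit txid and the 16-bit index offset from murmur3_x64_128 digests. It assumes the reference smhasher-style prototype for MurmurHash3_x64_128; the seeding and bit-selection details are illustrative, not necessarily what the IBLT patches do:

#include <stdint.h>
#include <string.h>

/* Reference smhasher-style prototype; link against any MurmurHash3
 * implementation that provides it. */
void MurmurHash3_x64_128(const void *key, int len, uint32_t seed, void *out);

/* Illustrative: derive a 48-bit short txid from the 8-byte seed plus
 * the 32-byte txid, then a 16-bit index offset from the short txid.
 * Two murmur calls per transaction, matching the 2 x 23 nsec figure. */
static uint64_t txid48(uint64_t seed, const uint8_t txid[32])
{
	uint8_t buf[40];
	uint64_t hash[2];

	memcpy(buf, &seed, 8);
	memcpy(buf + 8, txid, 32);
	MurmurHash3_x64_128(buf, sizeof(buf), 0, hash);
	return hash[0] & 0xFFFFFFFFFFFFULL;	/* low 48 bits */
}

static uint16_t index_offset(uint64_t id48)
{
	uint64_t hash[2];

	MurmurHash3_x64_128(&id48, sizeof(id48), 0, hash);
	return (uint16_t)hash[0];		/* low 16 bits */
}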

Posted Thu Jun 25 07:51:06 2015 Tags:

One of the bits we are currently finalising in libinput is touchpad gestures. Gestures on a normal touchscreen are left to the compositor and, by extension, to the client applications. Touchpad gestures are notably different though: they are bound to the location of the pointer or the keyboard focus (depending on the context), and they are less context-sensitive. Two fingers moving together on a touchscreen may be two windows being moved at the same time. On a touchpad, however, this is always a pinch.

Touchpad gestures are also a lot more hardware-sensitive than touchscreen gestures, where we can just forward the touch points directly. On a touchpad we may have to consider software buttons or other hardware limitations of the touchpad. This prevents the implementation of touchpad gestures at a higher level - only libinput is aware of the location, size, etc. of software buttons.

Hence - touchpad gestures in libinput. The tree is currently sitting here and is being rebased as we go along, but we're expecting to merge this into master soon.

The interface itself is fairly simple: any device that may send gestures will have the LIBINPUT_DEVICE_CAP_GESTURE capability set. This is currently only implemented for touchpads but there is the potential to support this on other devices too. Two gestures are supported: swipe and pinch (+rotate). Both come with a finger count and follow a Start/Update/End cycle. The finger count remains the same for the duration of a gesture, so if you switch from a two-finger pinch to a three-finger pinch you will see one gesture end and the next one start. Note that how to deal with this is up to the caller - it may very well consider this the same gesture semantically.

Swipe gestures have delta coordinates (horizontal and vertical) of the logical center of the gesture, compared to the previous event. A pinch gesture has the delta coordinates too, plus a delta angle (clockwise, in degrees). A pinch gesture also has the notion of an absolute scale: the Begin event always has a scale of 1.0, and that changes as the fingers move towards each other or further apart. A scale of 2.0 means they're now twice as far apart as originally.
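As a rough sketch of what consuming these events looks like (the API was still being finalized when this was written, so treat the names as provisional; they follow the gesture interface described above):

#include <libinput.h>
#include <stdio.h>

/* Sketch: drain pending events from an already-initialized libinput
 * context and print swipe/pinch updates. */
static void handle_gestures(struct libinput *li)
{
	struct libinput_event *ev;
	struct libinput_event_gesture *g;

	libinput_dispatch(li);
	while ((ev = libinput_get_event(li))) {
		switch (libinput_event_get_type(ev)) {
		case LIBINPUT_EVENT_GESTURE_SWIPE_UPDATE:
			g = libinput_event_get_gesture_event(ev);
			printf("swipe: %d fingers, dx %.2f dy %.2f\n",
			       libinput_event_gesture_get_finger_count(g),
			       libinput_event_gesture_get_dx(g),
			       libinput_event_gesture_get_dy(g));
			break;
		case LIBINPUT_EVENT_GESTURE_PINCH_UPDATE:
			g = libinput_event_get_gesture_event(ev);
			printf("pinch: scale %.2f, angle delta %.2f\n",
			       libinput_event_gesture_get_scale(g),
			       libinput_event_gesture_get_angle_delta(g));
			break;
		default:
			break;
		}
		libinput_event_destroy(ev);
	}
}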

Nothing overly exciting really; it's a simple API that provides a couple of basic elements of data. Once integrated into the desktop properly, it should provide for some improved navigation. OS X has had this for a long time now, and it's about time we caught up.

Posted Thu Jun 25 00:50:00 2015 Tags:

In my last post, I inveighed against using git-svn to do whole-repository conversions from Subversion to git (as opposed to its intended use, which is working a Subversion repository live through a git remote).

Now comes word that hundreds of projects a week seem to be fleeing SourceForge because of its evil we’ll-hijack-your-repo-and-crapwarify-your-installer policy. And they’re moving to GitHub via its automatic importer. Which, sigh, uses git-svn.

I wouldn’t trust that automatic importer (or any other conversion path that uses git-svn) with anything I write, so I don’t know how badly it messes things up.

But as a public service, I follow with a description of how a really well-done repository conversion – the kind I would deliver using reposurgeon – differs from a crappy one.

In evaluating quality, we need to keep in mind why people spelunk into code histories. Typically they’re doing it to trace bugs, understand the history of a feature, or grasp the thinking behind prior design decisions.

These kinds of analyses are hard work demanding attention and cognitive exertion. The last thing anyone doing them needs is to have his or her attention jerked to the fact that back past a certain point of conversion everything was different – commit references in alien and unusable formats, comments in a different style, user IDs missing, ignore patterns not working, etc.

Thus, as a repository translator my goal is for the experience of diving into the past to be as frictionless as possible. Ideally, the converted repository should look as though modern DVCS-like practices had been in use from the beginning of time.

Some of the kinds of glitches I’m going to describe may seem like they ought to be ignorable nits. And individually they often are. But the cumulative effect of all of them is distracting. Unnecessarily distracting.

These are some key things that distinguish a really good conversion, one that’s near-frictionless to use, from a poor one.


1. Subversion/CVS/BitKeeper user IDs are properly mapped to Git-style human-name-plus-email identifications.

Sometimes this is a lot of work – for one conversion I did recently I spent many hours Googling to identify hundreds of contributors going back to 1999.

The immediate reason this is valuable is so we know who was
responsible for individual commits, which can be important in bug forensics.

A more social reason is that otherwise OpenHub and sites like it in the future won’t be able to do reputation tracking properly. Contributors deserve their kudos and should have it.
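The mapping itself usually lives in a simple one-contributor-per-line file; the git-svn-style authors file looks like this (the names and addresses here are made up):

  jrandom = J. Random Hacker <jrandom@example.com>
  anon173 = Ann O. Nymous <ann@example.com>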

2. Commit references are mapped to some reasonably VCS-independent way to identify the commits they point at; I generally use either unique prefixes of commit comments or committer/date pairs.

Because ‘r1234’ is useless when you’re not in Subversion-land anymore, Toto. And appending a fossil Subversion ID to every commit comment is heavyweight, ugly, and distracting.

3. Comments are reformatted to be in DVCS form – that is, standalone summary line plus (if there’s more) a spacer line plus following paragraphs.

Yes, this means that to do it right you need to eyeball the entire comment history and edit it into what it would have looked like if the committers had been using those conventions from the beginning. Yes, this is a lot of work. Yes, I do it, and so should you.

The reason this step is really important is that without it tools like gitk and git log can’t do their job properly. That makes it far more difficult for people reading the history to zero in efficiently on what they need to know to get real work done.

4. Ignore patterns and files should be lifted from the syntax and wildcarding conventions of the old system to the syntax and wildcarding conventions of the new one.

This is one of the many things git-svn simply fluffs. Other batch-mode converters could in theory do a better job, but generally don’t.

5. The converted repository should not lose valuable metadata – like release tags.

Yes, I’m actually looking at a GitHub conversion that was that bad.

When the tags are missing, users will be unable to identify or do code diffs against historical release points. It’s a usability crash landing.


Posted Tue Jun 23 12:20:25 2015 Tags:

I like data.  So when Patrick Strateman handed me a hacky patch for a new testnet with a 100MB block limit, I went to get some.  I added 7 digital ocean nodes, another hacky patch to prevent sendrawtransaction from broadcasting, and a quick utility to create massive chains of transactions.

My home DSL connection is 11Mbit down, and 1Mbit up; that’s the fastest I can get here.  I was CPU mining on my laptop for this test, while running tcpdump to capture network traffic for analysis.  I didn’t measure the time taken to process the blocks on the receiving nodes, just the first propagation step.

1 Megabyte Block

Naively, it should take about 10 seconds to send a 1MB block up my DSL line from first packet to last: 1MB is 8 megabits, or 8 seconds at 1Mbit up, plus overhead.  Here’s what actually happens, in seconds for each node:

  1. 66.8
  2. 70.4
  3. 71.8
  4. 71.9
  5. 73.8
  6. 75.1
  7. 75.9
  8. 76.4

The packet dump shows they’re all pretty much sprayed out simultaneously (bitcoind may do the writes in order, but the network stack interleaves them pretty well), so each of the eight peers gets roughly one-eighth of my uplink.  That’s why it’s 67 seconds at best before the first node receives my block (a bit longer, since that’s when the packet left my laptop).

8 Megabyte Block

I increased my block size, and one node dropped out, so this isn’t quite the same, but the times to send to each node are about 8 times worse, as expected:

  1. 501.7
  2. 524.1
  3. 536.9
  4. 537.6
  5. 538.6
  6. 544.4
  7. 546.7

Conclusion

Using the rough formula of 1-exp(-t/600) (the chance that someone else finds a block during the t seconds mine takes to propagate, with blocks arriving on average every 600 seconds), I would expect orphan rates of 10.5% generating 1MB blocks, and 56.6% with 8MB blocks; that’s a huge cut in expected profits.
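A quick check of that formula against the measured first-arrival times above (nothing here beyond the numbers already given):

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* First-peer propagation times measured above: 1MB and 8MB blocks. */
	double t[] = { 66.8, 501.7 };

	for (int i = 0; i < 2; i++)
		printf("%.1fs propagation -> %.1f%% expected orphan rate\n",
		       t[i], 100.0 * (1.0 - exp(-t[i] / 600.0)));
	return 0;	/* prints roughly 10.5% and 56.7% */
}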

Workarounds

  • Get a faster DSL connection.  Though even an uplink 10 times faster would mean 1.1% orphan rate with 1MB blocks, or 8% with 8MB blocks.
  • Only connect to a single well-connected peer (-maxconnections=1), and hope they propagate your block.
  • Refuse to mine any transactions, and just collect the block reward.  Doesn’t help the bitcoin network at all though.
  • Join a large pool.  This is what happens in practice, but raises a significant centralization problem.

Fixes

  • We need bitcoind to be smarter about ratelimiting in these situations, and stream serially.  Done correctly (which is hard), it could also help bufferbloat which makes running a full node at home so painful when it propagates blocks.
  • Some kind of block compression, along the lines of Gavin’s IBLT idea. I’ve done some preliminary work on this, and it’s promising, but far from trivial.


Posted Fri Jun 19 02:37:04 2015 Tags:

Apple announced last week that its Swift programming language — a currently fully proprietary software successor to Objective C — will probably be partially released under an OSI-approved license eventually. Apple explicitly stated though that such released software will not be copylefted. (Apple's pathological hatred of copyleft is reasonably well documented.) Apple's announcement remained completely silent on patents, and we should expect the chosen non-copyleft license will not contain a patent grant. (I've explained at great length in the past why software patents are a particularly dangerous threat to programming language infrastructure.)

Apple's dogged pursuit of non-copyleft replacements for copylefted software is far from new. For example, Apple has worked to create replacements for Samba so they need not ship Samba in OSX. But, their anti-copyleft witch hunt goes back much further. It began when Richard Stallman himself famously led the world's first GPL enforcement effort against NeXT, and Objective-C was liberated. For a time, NeXT and Apple worked upstream with GCC to make Objective-C better for the community. But, that whole time, Apple was carefully plotting its escape from the copyleft world. Fortuitously, Apple eventually discovered a technically brilliant (but sadly non-copylefted) research programming language and compiler system called LLVM. Since then, Apple has sunk millions of dollars into making LLVM better. On the surface, that seems like a win for software freedom, until you look at the bigger picture: their goal is to end copyleft compilers. Their goal is to pick and choose when and how programming language software is liberated. Swift is not a shining example of Apple joining us in software freedom; rather, it's a recent example of Apple's long-term strategy to manipulate open source — giving our community occasional software freedom on Apple's own terms. Apple gives us no bread but says let them eat cake instead.

Apple's got PR talent. They understand that merely announcing the possibility of liberating proprietary software gets press. They know that few people will follow through and determine how it went. Meanwhile, the standing story becomes: Wait, didn't Apple open source Swift anyway? Already, that false soundbite's grip strengthens, even though the answer remains a resounding No! However, I suspect that Apple will probably meet most of their public pledges. We'll likely see pieces of Swift 2.0 thrown over the wall. But the best stuff will be kept proprietary. That's already happening with LLVM, anyway; Apple already ships a no-source-available fork of LLVM.

Thus, Apple's announcement hasn't happened in a void. Apple didn't just discover open source after years of neutrality on the topic. Apple's move is calculated, which led various industry pundits like O'Grady and Weinberg to ask hard questions (some of which are similar to mine). Yet Apple's hype is so good that it did convince one trade association leader.

To me, Apple's not-yet-executed move to liberate some of the Swift 2.0 code seems a tactical stunt to win over developers who currently prefer the relatively more open nature of the Android/Linux platform. While nearly all the Android userspace applications are proprietary, and GPL violations on Android devices abound, at least the copyleft license of Linux itself provides the opportunity to keep the core operating system of Android liberated. No matter how much Swift code is released, such will never be true with Apple.

I'm often pointing out in my recent talks how complex and treacherous the Open Source and Free Software political climate became in the last decade. Here's a great example: Apple is a wily opponent, utilizing Open Source (the cooption of Free Software) to manipulate the press and hoodwink the would-be spokespeople for Linux into supporting them. Many of us software freedom advocates have predicted for years that Free-Software-unfriendly companies like Apple would liberate more and more code under non-copyleft licenses in an effort to create walled gardens of seeming software freedom. I don't revel in the accuracy of those predictions; rather, I feel simply the hefty weight of Cassandra's curse.

Posted Mon Jun 15 16:32:39 2015 Tags:
[Image: CC-BY-SA from WikiMedia.]
Despite evidence that it does not make sense to set such goals, I did it again: I aimed at discussing some five CDK-citing papers each week. That was three weeks ago, and I don't really have time today either. But let me cover a few, so that I do not fall even further behind.

Subset selection in QSAR modeling
We (intuitively) know that negative data is important for statistical pattern recognition and modelling. We also know that the literature is not exactly riddled with such data. This paper, however, studies the effect of using sets of inactive compounds in modelling, and particularly the question of selecting which compounds should go into the training set. As with the positive compounds, the results of this paper show that the selection method matters. The CDK is used to calculate fingerprints.

Smusz, S., Kurczab, R., Bojarski, A. J., Apr. 2013. The influence of the inactives subset generation on the performance of machine learning methods. Journal of Cheminformatics 5 (1), 17+. URL http://dx.doi.org/10.1186/1758-2946-5-17

Using fingerprints to create clustering trees

I need to read this paper by Palacios-Bejarano et al. in more detail, because it seems quite interesting. They use fingerprints to make clustering trees, which, if I understand it correctly, can be used to calculate similarities between molecules. That is used in QSAR modelling of the CLogP, and the results suggest that while MCS works better, this approach is more robust. This paper too uses the CDK for fingerprint calculation.

Palacios-Bejarano, B., Cerruela Garcia, G., Luque Ruiz, I., Gómez-Nieto, M., Jun. 2013. An algorithm for pattern extraction in fingerprints. Chemometrics and Intelligent Laboratory Systems 125, 87-100. URL http://dx.doi.org/10.1016/j.chemolab.2013.04.003
Posted Sun Jun 7 16:17:00 2015 Tags:
Last Friday I attended the PhD defense of, now, Dr. Jonathan Alvarsson (Dept. Pharmaceutical Biosciences, Uppsala University), who defended his thesis Ligand-based Methods for Data Management and Modelling (PDF). Key papers resulting from his work include (see the list below) one about Bioclipse 2, particularly covering his work on pluggable managers that enrich scripting languages (JavaScript, Python, Groovy) with domain-specific functionality, which I make frequent use of (doi:10.1186/1471-2105-10-397); a paper about Brunn, a LIMS system for microplates, which is based on Bioclipse 2 (doi:10.1186/1471-2105-12-179); and a set of chemometrics papers looking at scaling up pattern recognition via QSAR model building (e.g. doi:10.1021/ci500361u). He is also an author on several other papers, and we collaborated on a number of them, so you will find his name in more papers still. Check his Google Scholar profile.

In Sweden there is one key opponent, though further questions can be asked by a few other committee members. John Overington (formerly of ChEMBL) was the opponent and he asked Jonathan questions for at least an hour, going through the thesis. Of course, I don't remember most of it, but there were a few that I remember and want to bring up. One issue was about the uptake of Bioclipse by the community, and, for example, how large the community is. The answer is that this is hard to answer; there are download statistics and there is actual use.

[Figure: Download statistics of the Bioclipse 2.6.1 release.]
Like citation statistics (the Bioclipse 1 paper was cited close to 100 times; Bioclipse 2 is approaching 40 citations), download statistics reflect this uptake but are hardly direct measurements. When I first learned about Bioclipse, I realized that it could be a game changer. But it did not become one, and I still don't quite understand why not. It looks good, is very powerful, very easy to extend (which I still make a lot of use of), it is fairly easy to install (download 2.6.1 or the 2.6.2 beta), etc. And it resulted in a large set of applications; just check the list of papers.

One argument could be that it is yet another tool to install, and developers are turning to web-based solutions. Moreover, the cheminformatics community has many alternatives, and users seem to prefer smaller, more dedicated tools: a file format converter like Open Babel, or a dedicated descriptor calculator like PaDEL. Simpler messages seem more successful; this is expected for politics, but I guess science is more like politics than we like to believe.

A second question I remember was about what Jonathan would like to see changed in ChEMBL, the database Overington has worked so long on. As a data analyst you are in a different mind set: rather than thinking about single molecules, you think about classes of compounds, and rather than thinking about the specific interaction of a drug with a protein, you think about the general underlying chemical phenomenon. A question like this one requires a different kind of thinking: it needs one to think like an analytical chemist, who worries about the underlying experiments. Obvious, but easy to lose sight of once you are thinking at a higher (different) level. That experimental error information in ChEMBL can actually support modelling is something we showed using Bayesian statistics (well, Martin Eklund particularly) in Linking the Resource Description Framework to cheminformatics and proteochemometrics (doi:10.1186/2041-1480-2-S1-S6), by including the assay confidence assigned by the ChEMBL curation team. If John had asked me, I would have said I wanted ChEMBL to capture as much of the experimental detail as possible.

[Figure: Integration of RDF technologies in Bioclipse. Alvarsson worked on the integration of the RDF editor in Bioclipse. The screenshot shows that if you click an RDF resource representing a molecule, it will show the structure (if there is a predicate providing the SMILES) and information from predicates in general.]
The last question I want to discuss was about the number of rotatable bonds in paracetamol. If you look at this structure, you would identify four purely σ bonds (BTW, can you have π bonds without σ bonds?). So, four could be the expected answer. You can argue that the peptide bond should not be considered rotatable and should be excluded, and thus the answer would be two. Now, the CDK answers two, as shown in an example descriptor calculation in the thesis. I raised my eyebrows and thought: "I surely hope this is not a bug!" (Well, my thoughts used some more words, which I will not repeat here.)

But thinking about that, I valued the idea of Open Source: I could just check. I took my tablet from my bag, opened up a browser, went to GitHub, and looked up the source code. It turned out it was not a bug! Phew. No, in fact, it turned out that the default parameters of this descriptor exclude the terminal rotatable bonds.


So, problem solved. Two follow-up questions, though: 1. can you look up source code during a thesis defense? Jonathan had his laptop right in front of him. I only thought of that yesterday, when I was back home, having dinner with the family. 2. I wonder if I should discuss the idea of parameterizable descriptors more; what do you think? There is a lot of confusion about this design decision in the CDK. For example, it is not uncommon that the CDK only calculates some two hundred descriptor values, whereas tool X calculates more than a thousand. Mmmm, that always makes me question the quality of that paper in general, but, oh well...

There also was a nice discussion about chemometrics. Jonathan argues in his thesis that a fast modelling method may be a better way forward at this moment than more powerful statistical methods. He presented results with LIBLINEAR and signature fingerprints, comparing them to other approaches. The latter were compared with industry standards, like ECFP (which Clark and Ekins implemented for the CDK and have been using in Bayesian statistics on the mobile phone), and for the former Jonathan showed that LIBLINEAR can handle more data than regular SVM libraries, and that using more training data still improves the model more than a "better" statistical method does (which quite matches my own experiences). And with SVMs, finding the right parameters typically is an issue. Using an RBF kernel only adds one, and since Jonathan also indicated that the Tanimoto distance measure for fingerprints is still a more than sufficient approach, I wonder if the chemometrics models should not be using a Tanimoto kernel instead of an RBF kernel (though doi:10.1021/ci800441c suggests RBF may really do better for some tasks, at the expense of more parameter optimization).

To wrap up, I really enjoyed working with Jonathan and I think he did excellent multidisciplinary work. I am also happy that I was able to attend his defense and the events around it. In no way does this post do justice to or fully reflect the defense; it merely reflects how relevant his research is in my opinion, and highlights some of my thoughts during (and after) the defense.

Jonathan, congratulations!

Spjuth, O., Alvarsson, J., Berg, A., Eklund, M., Kuhn, S., Mäsak, C., Torrance, G., Wagener, J., Willighagen, E. L., Steinbeck, C., Wikberg, J. E., Dec. 2009. Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics 10 (1), 397+.
Alvarsson, J., Andersson, C., Spjuth, O., Larsson, R., Wikberg, J. E. S., May 2011. Brunn: An open source laboratory information system for microplates with a graphical plate layout design process. BMC Bioinformatics 12 (1), 179+.
Alvarsson, J., Eklund, M., Engkvist, O., Spjuth, O., Carlsson, L., Wikberg, J. E. S., Noeske, T., Oct. 2014. Ligand-Based target prediction with signature fingerprints. J. Chem. Inf. Model. 54 (10), 2647-2653.
Posted Sun Jun 7 12:41:00 2015 Tags:

libinput provides a number of different out-of-the-box configurations, based on capabilities. For example: middle mouse button emulation is enabled by default if a device only has left and right buttons. On devices with a physical middle button it is available but disabled by default. Likewise, whether tapping is enabled and/or available depends on hardware capabilities. But some requirements cannot be gathered purely by looking at the hardware capabilities.

libinput uses a couple of udev properties, assigned through udev's hwdb, to detect device types. We use the same mechanism to provide us with specific tags to adjust libinput-internal behaviour. The udev properties named LIBINPUT_MODEL_... tag devices based on a set of udev rules combined with hwdb matches. For example, we tag Chromebooks with LIBINPUT_MODEL_CHROMEBOOK.

Inside libinput, we parse those tags and use them for model-specific configuration. At the time of writing, we use the chromebook tag to automatically enable clickfinger behaviour on those touchpads (which matches the Google defaults on Chromebooks). We tag the Lenovo X230 touchpad to give it its own acceleration method; this touchpad is buggy, and the data it sends has a very bad resolution.

In the future these tags will likely expand and encompass more devices that need customised tweaks. But the goal is always that libinput works well out of the box, even if the hardware is quirky. Introducing these tags instead of a slew of configuration options has short-term drawbacks: it increases the workload on us maintainers, and it may require software updates to get a device to work exactly the way it should. The long-term benefits are maintainability and testability, though, as well as us being more aware of what hardware is out there and how it needs to be fixed. Plus the relief of not having to deal with configuration snippets that are years out of date, do all the wrong things, but still spread across forums like an STD.

Note: the tags are considered private API and may change at any time, depending what we want or need to do with them. Do not use them for home-made configuration.

Posted Thu Jun 4 23:28:00 2015 Tags:

I watched the most recent Silicon Valley episode last night. I laughed at some parts (not as much as a usual episode) and then there was a completely unbelievable tech-related plot twist — quite out of character for that show. I was surprised.

When the credits played, my jaw dropped when I saw the episode's author was Dan Lyons. Lyons (whose work has been promoted by the Linux Foundation) once compared me to a communist and a member of organized crime (in Forbes, a prominent publication for the wealthy) because of my work enforcing the GPL.

In the years since Lyons' first anti-software freedom article (yes, there were more), I've watched many who once helped me enforce the GPL change positions and oppose GPL enforcement (including allies who once received criticism alongside me). Many such allies went even further — publicly denouncing my work and regularly undermining GPL enforcement politically.

Attacks by people like Dan Lyons — journalists well connected with industry trade associations and companies — are one reason so many people are too afraid to enforce the GPL. I've wondered for years why the technology press has such a pro-corporate agenda, but it eventually became obvious to me in early 2005 when listening to yet another David Pogue Apple product review: nearly the entire tech press is bought and paid for by the very companies on which they report! The cartoonish level of Orwellian fear across our industry of GPL enforcement is but one example of many for-profit corporate agendas that people like Lyons have helped promulgate through their pro-company reporting.

Meanwhile, I had taken Silicon Valley (until this week) as pretty good satire on the pathetic state of the technology industry today. Perhaps Alec Berg and Mike Judge just liked Lyons' script — not even knowing that he is a small part of the problem they seek to criticize. Regardless of why his script was produced, the line between satirist and satirized is clearly thinner than I imagined; it seems just as thin as the line between technology journalist and corporate PR employee.

I still hope that Berg and Judge seek, just as Judge did in Office Space, to pierce the veil of for-profit corporate manipulation of employees and users alike. However, for me, the luster of their achievement fades when I realize at least some of their creative collaborators participate in the very problem they criticize.

Shall we start a letter writing campaign to convince them to donate some of Silicon Valley's proceeds to Free Software charities? Or, at the very least, to convince Berg to write one of his usually excellent episodes about how the technology press is completely corrupted by the companies on which they report?

Posted Thu Jun 4 00:15:00 2015 Tags:

libinput uses udev tags to determine what a device is. This is a significant difference to the X.Org stack, which determines how to deal with a device based on an elaborate set of rules: rules grown over time, matured, but with a slight layer of mould on top by now. In evdev's case that is understandable; it stems from a design where you could just point it at a device in your xorg.conf and it'd automagically work, well before we even had input hotplugging in X. What it leads to now, though, is that the server uses slightly different rules to decide what a device is (to implement MatchIsTouchscreen, for example) than evdev does. So you may, in theory, have a device that responds to MatchIsTouchscreen only to set itself up as a keyboard.

libinput does away with this in two ways: it punts most of the decisions on what a device is to udev and its ID_INPUT_... properties. A device marked as ID_INPUT_KEYBOARD will initialize a keyboard interface, an ID_INPUT_TOUCHPAD device will initialize a touchpad backend. The obvious advantage of this is that we only have one place where we have generic device type rules. The second advantage is that where this one place isn't precise enough, it's simple to override with custom rule sets. For example, Wacom tablets are hard to categorise just by looking at the device alone. libwacom generates a udev rule containing the VID/PID of each known device with the right ID_INPUT_TABLET etc. properties.

This is a libinput-internal behaviour. Externally, we are a lot more vague. In fact, we don't tell you at all what a device is, other than what events it will send (pointer, keyboard, or touch). We have thought about implementing some sort of device identifier and the conclusion is that we won't implement this as part of libinput's API because it will simply be wrong some of the time. And if it's wrong, it requires the caller to re-implement something on top of it. At which point the caller may as well implement all of it instead. Why do we expect it to be wrong? Because libinput doesn't know the exact context that requires a device to be labelled as a specific type.

Take a keyboard for example. There are a great many devices that send key events. To the client a keyboard may be any device that can get an XKB layout and is used for typing. But to the compositor, a keyboard may be anything that can send a few specific keycodes. A device with nothing but KEY_POWER? That's enough for the compositor to decide to shut down but that device may not otherwise work as a keyboard. libinput can't know this context. But what libinput provides is the API to query information. libinput_device_pointer_has_button() and libinput_device_keyboard_has_key() are the two candidates here to query about a specific set of buttons and/or keys.
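As a sketch of that query-first approach: the capability and key checks below are the libinput calls named above, while the "power button only" policy is a made-up example of context the caller must supply itself:

#include <libinput.h>
#include <linux/input.h>	/* KEY_POWER, KEY_A */

/* Hypothetical compositor-side policy: treat a device as a bare power
 * button if it claims the keyboard capability and has KEY_POWER but
 * lacks ordinary typing keys. */
static int is_power_button_only(struct libinput_device *dev)
{
	return libinput_device_has_capability(dev, LIBINPUT_DEVICE_CAP_KEYBOARD) &&
	       libinput_device_keyboard_has_key(dev, KEY_POWER) &&
	       !libinput_device_keyboard_has_key(dev, KEY_A);
}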

Touchpads, trackpoints and mice all send pointer events, and there is no flag that tells you the device type; that is intentional. libinput doesn't have any intrinsic knowledge about what is a touchpad; we go by the ID_INPUT_TOUCHPAD tag. At best, we refuse some devices that were clearly mislabelled, but we don't init devices as touchpads that aren't labelled as such. Any device type identification would likely be wrong - for example, some Wacom tablets are touchpads internally but would be considered tablets in other contexts.

So in summary, callers are encouraged to rely on the udev information and other bits they can pull from the device to group it into the semantically correct device type. libinput_device_get_udev_device() provides a udev handle for a libinput device and all configurable features are query-able (e.g. "does this device support tapping?"). libinput will not provide a device type because it would likely be wrong in the current context anyway.

Posted Wed Jun 3 05:59:00 2015 Tags:

What happens if bitcoin blocks fill?  Miners choose transactions with the highest fees, so low fee transactions get left behind.  Let’s look at what makes up blocks today, to try to figure out which transactions will get “crowded out” at various thresholds.

Some assumptions need to be made here: we can’t automatically tell the difference between me taking a $1000 output and paying you 1c, and me paying you $999.99 and sending myself the 1c change.  So my first attempt was very conservative: only look at transactions with two or more outputs which were under the given thresholds (I used a nice round $200 / BTC price throughout, for simplicity).

(Note: I used bitcoin-iterate to pull out transaction data, and rebuild blocks without certain transactions; you can reproduce the csv files in the blocksize-stats directory if you want).

Paying More Than 1 Person Under $1 (< 500000 Satoshi)

Here’s the result (against the current blocksize):

[Graph: Sending 2 or more sub-$1 outputs]

Let’s zoom in on the interesting part first, since there’s very little difference before block 220,000 (February 2013).  You can see that only about 18% of transactions are sending less than $1 and getting less than $1 in change:

[Graph: since March 2013]

Paying Anyone Under 1c, 10c, $1

The above graph doesn’t capture the case where I have $100 and send you 1c.   If we eliminate any transaction which has any output less than various thresholds, we’ll catch that. The downside is that we capture the “sending myself tiny change” case, but I’d expect that to be rarer:

[Graph: Blocksizes without small output transactions]

This eliminates far more transactions.  We can see that only 2.5% of the block size is taken by transactions with 1c outputs (the dark red line following the “current blocks” line), but the green line shows about 20% of the block used for 10c transactions.  And about 45% of the block is transactions moving $1 or less.

Interpretation: Hard Landing Unlikely, But Microtransactions Lose

If the block size doesn’t increase (or doesn’t increase in time): we’ll see transactions get slower, and fees become the significant factor in whether your transaction gets processed quickly.  People will change behaviour: I’m not going to spend 20c to send you 50c!

Because block finding is highly variable and many miners are capping blocks at 750k, we see backlogs at times already; these bursts will happen with increasing frequency from now on.  This will put pressure on SatoshiDice and similar services, who will be highly incentivized to use StrawPay or roll their own channel mechanism for off-blockchain microtransactions.

I’d like to know what timescale this happens on, but the graph shows that we grow (and occasionally shrink) in bursts.  A logarithmic graph prepared by Peter R of bitcointalk.org suggests that we hit 1M mid-2016 or so; expect fee pressure to bend that graph downwards soon.

The bad news is that even if fees hit (say) 25c and that prevents all the sub-$1 transactions, we only double our capacity, giving us perhaps another 18 months. (At that point miners are earning $1000 from transaction fees as well as $5000 (@ $200/BTC) from block reward, which is nice for them I guess.)

My Best Guess: Larger Blocks Desirable Within 2 Years, Needed by 3

Personally I think 5c is a reasonable transaction fee, but I’d prefer not to see it until we have decentralized off-chain alternatives.  I’d be pretty uncomfortable with a 25c fee unless the Lightning Network was so ubiquitous that I only needed to pay it twice a year.  Higher than that would have me reaching for my credit card to charge my Lightning Network account :)

Disclaimer: I Work For BlockStream, on Lightning Networks

Lightning Networks are a marathon, not a sprint.  The development timeframes in my head are even vaguer than the guesses above.  I hope it’s part of the eventual answer, but it’s not the band-aid we’re looking for.  I wish it were different, but we’re going to need other things in the meantime.

I hope this provided useful facts, whatever your opinions.

Posted Wed Jun 3 03:57:29 2015 Tags:

I used bitcoin-iterate and gnumeric to render the current bitcoin blocksizes, and here are the results.

My First Graph: A Moment of Panic

This is block sizes up to yesterday; I’ve asked gnumeric to derive an exponential trend line from the data (in black; the red one is linear)

[Graph: Woah! We hit 1M blocks in a month! PAAAANIC!]

That trend line hits 1000000 at block 363845.5, which we’d expect in about 32 days’ time!  This is what is freaking out so many denizens of the Bitcoin Subreddit. I also just saw a similar inaccurate [correction: misleading] graph reshared by Mike Hearn on G+ :(

But Wait A Minute

That trend line says we’re on 800k blocks today, and we’re clearly not.  Let’s add a 6 hour moving average:

[Graph: Oh, we’re only halfway there…]

In fact, if we cluster into 36 blocks (ie. 6 hours worth), we can see how misleading the terrible exponential fit is:

[Graph: What! We’re already over 1M blocks?? Maths, you lied to me!]

Clearer Graphs: 1 week Moving Average

[Graph: Actual weekly running average blocksize]

So, not time to panic just yet, though we’re clearly growing, and in unpredictable bursts.

Posted Wed Jun 3 02:34:55 2015 Tags:

Gallus: Bitcoin needs an increase in the maximum block size to avert disaster. If the block size limit isn’t increased soon, the limit will be hit and disaster will ensue.

Simo: Current blocks are nowhere near the size limit.

Gallus: At the current rate of growth of transactions, we’ll get there soon.

Simo: Lightning Network can handle it!

Gallus: Lightning Network isn’t working yet, forces big transactions to happen when its timeouts sunset and requires a lot of complexity and endpoint diligence.

Simo: Just wait for the new paper! With a relative timelock opcode everything nets out until the settlement between two counterparties exceeds the amount they initially deposited no matter how long their relationship has gone on, and the diligence can be outsourced to third parties.

Besides, most of the current transactions might be garbage anyway, and the right way to handle everything is with transaction fee increases.

Gallus: You’re just speculating about how much is garbage, and transaction fees destroy zeroconf.

Simo: If you’re more conservative about what sort of malleability you accept then zeroconf works, uh, about as well as it does now. Specifically, if you disallow changes to the outputs other than decreasing payouts and thus implicitly increasing the transaction fee then there’s little opportunity to defraud via the usual channels. Zeroconf is still a profoundly bad idea though. If it ever became widespread, then that would inevitably lead to the creation of a darknet where alternative transactions could get posted which included kickbacks to the targets of previous mining rewards. There is no good counter to this. Zeroconf advocates should get over it.

Reiterating that transaction fees are the right way to handle everything: the current technique for avoiding denial of service by putting through lots of garbage transactions boils down to letting through larger transactions first, so anyone trying to make lots of transactions with a de minimis amount of cash fronted will have their amounts spread so thin that legitimate transactions will take priority. Since there are an average of 600 seconds between blocks, and blocks can handle about 4 transactions per second, if we add a factor of 10 (assuming that the attacker wants to keep transactions from going through even when blocks happen to be bigger), then an attacker who wanted to prevent any transactions of less than $10 from going through would have to front $10 * 600 * 4 * 10, or about $240,000, to keep that from happening. And that’s fronted, not spent: the attacker can always sell their coins off later (although their value would likely have been badly damaged in the interim). Even adding significantly to this wouldn’t make the security margin particularly good. Transaction fees really are needed.

Gallus: The wallet codebases are poorly written and maintained and can’t be realistically expected to be made to handle real transaction fees.

Simo: If they really needed to, then wallets would fix their shit. This is Bitcoin, where the whole point is supposed to be that all the endpoints mutually enforce security from everybody else. If you’re concerned about supporting code which is so shitty it shouldn’t have existed in the first place, you should go work for Microsoft. Besides, even a busted wallet can have its keys extracted and put into a wallet which isn’t busted.

Gallus: There isn’t any good way to handle transaction fees.

Simo: Receiver pays (where a new transaction is created spending the output of an old one so it can pay the fee for both of them to go through) works well, even when some wallets are busted, as does the previously mentioned conservative approach to allowing malleability.

Gallus: Receiver pays doubles the number of transactions, making the block size limit problem worse.

Simo: If a new opcode were added requiring that a particular thing *not* be in the current utxo set, then that would allow for multiple receiver pays to be bundled together with each of them using few bits, be very robust against history reorgs, and only require a single lookup in the already necessary utxo database on miners.

Gallus: This is all very complicated to handle.

Simo: It’s just software. See my earlier comment about working for Microsoft.

We need to find out how much those transactions are really worth to be able to use good judgement, and hitting the limit is the only way to find out.

Gallus: You’re proposing invoking disaster just to gather some academic data!

Simo: If anything we should be going the other way, artificially forcing transaction fees up in advance of needing to by limiting block sizes below the requirement. Then if there were problems we could let the limit go back to normal and spend some time fixing the problems without creating a compatibility problem. Besides, we don’t even know if any real damage would be done by hitting limits, because without a demonstrated willingness to pay transaction fees, even temporarily, we have little evidence that bitcoin transactions are creating any real value. Core developers favor doing an experiment like this more than they favor increasing the block size limit.

Getting back to the main point: increasing the maximum block size would be ruinous to the bitcoin ecosystem, with vastly fewer full nodes being run.

Gallus: The rate of bandwidth increase is exponential, and there will be plenty.

Simo: As of today, the amount of bandwidth needed to run a full node is a significant disincentive to running one. The start-up time to get the current blockchain history when starting a new node makes the problem much worse than the ongoing rate of download does. The rate of growth of bandwidth is much slower than Moore’s law is for computational power, and if you assume that everybody has mass quantities of bandwidth, it would be much better to use it to have wallets run full nodes and retire SPV.

Besides, increasing the block size is a hard fork, which is unlikely to even happen. At best it would result in two different chains. The miners, who hardly even respond to developers’ entreaties about urgent issues, have little reason to go for it, because the whole goal is to avoid or reduce transaction fees, which cuts directly into their bottom line, and demonstrating an ability to make a backwards-incompatible change undermines the claim that Bitcoin is a truly decentralized system.

Gallus: The new fork can be merge-mined along with the classic fork. Miners will do whatever is of marginal value to them, and if they can mine both at once at no extra cost they will.

Simo: With only partial miner cooperation the new fork would have substantially less security, and the two of them coexisting would be a disaster of indeterminate state of coins which were spent on one fork but not the other, causing far worse problems for wallets than transaction fees would.

Eventually the only mining incentive left will be transaction fees. If transaction fees aren’t made significant by then, disaster will ensue.

Gallus: Mining rewards could be changed as well.

Simo: Increasing mining rewards would be an even more outrageous hard fork than increasing the block size limit. It would cause extraordinary amounts of real-world waste for no proven value. Our goal is for Bitcoin to eventually be more than a cryptographic curiosity and an exercise in the platonic ideal of marxist value creation; it should provide some service of value. If it can’t do that, it deserves to fail and be abandoned.

Posted Tue Jun 2 01:48:18 2015 Tags:

TLDR: as of libinput 0.16 you can end a touchpad tap-and-drag with a final additional tap

libinput originally only supported single-tap and double-tap. With version 0.15 we now support multi-tap, so you can tap repeatedly to get a triple, quadruple, etc. click. This is quite useful in text editors where a triple click highlights a line, four clicks highlight a paragraph, and 28 clicks order a new touchpad from ebay. Multi-tap also works with drag-and-drop, so a triple tap followed by a finger down and hold will send three clicks followed by a single click.

We also support continuous tap-and-drag, which is something the synaptics driver only provided with the LockedDrags option: once the user is in dragging mode (x * tap + hold finger down) they can lift the finger and set it down again without the drag being interrupted. This is quite useful when you have to move across the screen, especially on smaller touchpads or for users that prefer slow acceleration.

Of course, this adds a timeout to the drag release since we need to wait and see whether the finger comes down again. To help accelerate this, we added a tap-to-release feature (contributed by Velimir Lisec): once in drag mode a final tap will release the button immediately. This is something that OS X has supported for years and after a bit of muscle memory retraining it becomes second nature quickly. So the new timeout-free way to tap-and-drag on a touchpad is now:


tap, finger-down, move, ... move, finger up, tap
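For callers that expose this as a configuration toggle, the tap API looks roughly like this (a sketch; whether the drag-lock knob was exposed in exactly this form at 0.16 is an assumption on my part):

#include <libinput.h>

/* Sketch: enable tap-to-click and drag lock on a tap-capable device. */
static void enable_tap_and_drag_lock(struct libinput_device *dev)
{
	if (libinput_device_config_tap_get_finger_count(dev) > 0) {
		libinput_device_config_tap_set_enabled(dev,
				LIBINPUT_CONFIG_TAP_ENABLED);
		libinput_device_config_tap_set_drag_lock_enabled(dev,
				LIBINPUT_CONFIG_DRAG_LOCK_ENABLED);
	}
}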
Update 03/06/15: add synaptics LockedDrags option reference
Posted Tue Jun 2 01:01:00 2015 Tags:

I’ve been trying not to follow the Great Blocksize Debate raging on reddit.  However, the lack of any concrete numbers has kind of irked me, so let me add one for now.

If we assume bandwidth is the main problem with running nodes, let’s look at average connection growth rates since 2008.  Google led me to NetMetrics (who seem to charge), and Akamai’s State Of The Internet (who don’t).  So I used the latter, of course:

[Graph: Akamai’s Average Connection Speed Chart, Q4/07 to Q4/14]

I tried to pick a range of countries, and here are the results:

Country       % Growth over 7 years   Per annum
Australia     348                     19.5%
Brazil        349                     19.5%
China         481                     25.2%
Philippines   258                     14.5%
UK            333                     18.8%
US            304                     17.2%


The countries which already had the best bandwidth grew about 17% a year, so I think that’s the best model for future growth patterns (China is now where the US was 7 years ago, for example).

If bandwidth is the main centralization concern, you’ll want block growth below 15%. That implies we could jump the cap to 3MB next year, and grow it 15% thereafter. Or, if you’re less conservative, 3.5MB next year, and 17% thereafter.
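To make the compounding concrete, here is a small check that the per-annum column follows from the 7-year figures, together with the cap trajectory under the conservative option (the year count is illustrative):

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* US: grew to 304% of its starting speed over 7 years. */
	printf("US per annum: %.1f%%\n",
	       100.0 * (pow(3.04, 1.0 / 7) - 1.0));	/* 17.2% */

	/* Conservative option: 3MB cap, then 15% per year. */
	double cap = 3.0;
	for (int year = 1; year <= 5; year++, cap *= 1.15)
		printf("year %d: %.2f MB\n", year, cap);
	return 0;
}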

Posted Mon Jun 1 01:20:36 2015 Tags: