

This is an idea. I don't have the time to work on it myself, but I thought I'd throw it out in case someone else finds it interesting.

When you install a Debian package, it pulls in its dependencies and recommended packages, and those pull in theirs. For simple cases, this is all fine, but sometimes there are surprises. Installing mutt on a base system pulls in libgpgme, which pulls in gnupg, which pulls in a pinentry package, which can pull in all of GNOME. Or at least so people claim.

It strikes me that it'd be cool for someone to implement a QA service for Debian that measures, for each package, how much installing it adds to the system. It should probably do this in various scenarios:

  • A base system, i.e., the output of debootstrap.
  • A build system, with build-essential installed.
  • A base GNOME system, with gnome-core installed.
  • A full GNOME system, with gnome installed.
  • Similarly for KDE and each other desktop environment in Debian.

The service would do the installs regularly (daily?), and produce reports. It would also send alerts, such as notifying the maintainers when the installed size grows too much compared to installing the package in stable, or compared to a previous run in unstable. For example, if installing mutt suddenly installs 100 gigabytes more than yesterday, it's probably a good idea to alert interested parties.

Implementing this should be fairly easy, since the actual test is just running debootstrap, and possibly apt-get install. Some experimentation with configuration, caching, and eatmydata may be useful to gain speed. Possibly actual package installation can be skipped, and the whole thing could be implemented just by analysing package metadata.
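
As a sketch of the metadata-only variant: on a Debian system, apt-get can simulate an install, and the Installed-Size fields of the packages it would pull in can be summed. The parsing below is a rough guess at what a real service would do, not a finished tool; the real thing would run this inside each of the scenarios listed above.

#!/usr/bin/env python3
# Rough sketch: estimate how much "apt-get install PACKAGE" would add to the
# current system, using only package metadata (no actual installation).
# Assumes a Debian-ish host with apt-get and apt-cache available.

import re
import subprocess
import sys


def packages_pulled_in(package):
    """Return the names of packages a simulated install would add."""
    out = subprocess.run(
        ["apt-get", "--simulate", "install", package],
        capture_output=True, text=True, check=True,
    ).stdout
    # Simulated installs are reported on lines like "Inst foo (1.2-3 ...)".
    return re.findall(r"^Inst (\S+)", out, flags=re.MULTILINE)


def installed_size_kib(package):
    """Look up the Installed-Size field (in KiB) from the package metadata."""
    out = subprocess.run(
        ["apt-cache", "show", package],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"^Installed-Size: (\d+)", out, flags=re.MULTILINE)
    return int(match.group(1)) if match else 0


if __name__ == "__main__":
    pkg = sys.argv[1]
    pulled = packages_pulled_in(pkg)
    total = sum(installed_size_kib(p) for p in pulled)
    print(f"{pkg}: {len(pulled)} new packages, about {total / 1024:.1f} MiB added")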

Maybe it even exists, and I just don't know about it. That'd be cool, too.

Posted Wed Oct 24 10:42:00 2018 Tags:

I've been learning Rust lately. As part of that, I rewrote my summain program from Python to Rust (see summainrs). It's not quite a 1:1 rewrite: the Python version outputs RFC822-style records, the Rust one uses YAML. The Rust version is my first attempt at using multithreading, something I never added to the Python version.

Results:

  • Input is a directory tree with 8.9 gigabytes of data in 9650 files and directories.
  • Each file gets stat'd, and regular files get SHA256 computed.
  • Run on a Thinkpad X220 laptop with a rotating hard disk. Two CPU cores, 4 hyperthreads. Mostly idle, but desktop-y things running in the background. (Not a very systematic benchmark.)
  • Python version: 123 seconds wall clock time, 54 seconds user, 6 seconds system time.
  • Rust version: 61 seconds wall clock time (50% of the Python version), 56 seconds user (104%), and 4 seconds system time (67%).

A nice speed improvement, I think. Especially since the difference between the single-threaded and multithreaded versions of the Rust program is four characters (par_iter instead of iter in the process_chunk function).
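
For contrast, here's a rough sketch of what adding the same parallelism to the Python version might look like, using a thread pool from concurrent.futures rather than Rust's rayon. The function names are made up for illustration, not the actual summain code; threads help here mainly because hashlib releases the GIL while hashing large buffers.

import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA256 of one regular file."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


def hash_files(paths):
    """Hash many files concurrently and return {path: hexdigest}."""
    with ThreadPoolExecutor() as pool:
        return dict(zip(paths, pool.map(sha256_of, paths)))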

Posted Mon Oct 15 10:59:00 2018 Tags:

I don't think any of Flatpak, Snappy, traditional Linux distros, non-traditional Linux distros, containers, online services, or other forms of software distribution are a good solution for all users. They all fail in some way, and each of them requires continued, ongoing effort to remain acceptable even within its limitations.

This week, there's been some discussion about Flatpak, a software distribution approach that's (mostly) independent of traditional Linux distributions. There's also Snappy, which is Canonical's similar thing.

The discussion started with the launch of a new website attacking Flatpak as a technology. I'm not going to link to it, since it's an anonymous attack and rant, and not constructive. I'd rather have a constructive discussion. I'm also not going to link to rebuttals, and will just present my own view, which I hope is different enough to be interesting.

The website raises the issue that Flatpak's sandboxing is not as good as it should be. This seems to be true. Some of Flatpak's defenders respond that it's an evolving technology, which seems fair. It's not necessary to be perfect; it's important to be better than what came before, and to constantly improve.

The website also raises the point that a number of flatpaks themselves contain unfixed security problems. I find this more worrying than an imperfect sandbox. A security problem inside a perfect sandbox can still be catastrophic: it can leak sensitive data, join a distributed denial of service attack, use excessive CPU and power, and otherwise cause mayhem. The sandbox may help in containing the problem somewhat, but to be useful for valid use, the sandbox needs to allow things that can also be used maliciously.

As a user, I want software that's...

  • easy to install and update
  • secure to install (what I install is what the developers delivered)
  • always up to date with security fixes, including for any dependencies (embedded in the software or otherwise)
  • reasonably up to date with other bug fixes
  • sufficiently up to date with features I want (but I don't care about newer features that I don't have a use for)
  • protective of my freedoms and privacy and other human rights, which includes (but is not restricted to) being able to self-host services and work offline

As a software developer, I additionally want my own software to be...

  • effortless to build
  • automatically tested in a way that gives me confidence it works for my users
  • easy to deliver to my users
  • easy to debug
  • not broken by changes to build and runtime dependencies, or at least have such changes be extremely obvious, meaning they result in a build error or at least an error during automated tests

These are requirements that are hard to satisfy. They require a lot of manual effort, and discipline, and I fear the current state of software development isn't quite there yet. As an example, the Linux kernel development takes great care to never break userland, but that requires a lot of care when making changes, a lot of review, and a lot of testing, and a willingness to go to extremes to achieve that. As a result, upgrading to a newer kernel version tends to be a low-risk operation. The glibc C library, used by most Linux distributions, has a similar track record.

But Linux and glibc are system software. Flatpak is about desktop software. Consider instead LibreOffice, the office suite. There's no reason why it couldn't be delivered to users as a Flatpak (and indeed it is). It's a huge piece of software, and it needs a very large number of libraries and other dependencies to work. These need to be provided inside the LibreOffice Flatpak, or by one or more of the Flatpak "runtimes", which are bundles of common dependencies. Making sure all of the dependencies are up to date can be partly automated, but not fully: someone, somewhere, needs to make the decision that a newer version is worth upgrading to right now, even if it requires changes in LibreOffice for the newer version to work.

For example, imagine LO uses a library to generate PDFs. A new version of the library reduces CPU consumption by 10%, but requires changes, because the library's API (programming interface) has changed radically. The API changes are necessary to allow the speedup. Should LibreOffice upgrade to the new version or not? If 10% isn't enough of a speedup to warrant the effort to make the LO changes, is 90%? An automated system could upgrade the library, but that would then break the LO build, resulting in something that doesn't work anymore.

Security updates are easier, since they usually don't involve API changes. An automated system could upgrade dependencies for security updates, and then trigger an automated build, test, and publish of a new Flatpak. However, this is made difficult by the fact that there is often no way to automatically and reliably find out that a security fix has been released. Again, manual work is required to find the security problem, to fix it, to communicate that there is a fix, and to upgrade the dependency. Some projects have partial solutions for that, but there seems to be nothing universal.

I'm sure most of this can be solved, some day, in some manner. It's definitely an interesting problem area. I don't have a solution, but I do think it's much too simplistic to say "Flatpaks will solve everything", or "the distro approach is best", or "just use the cloud".

Posted Thu Oct 11 10:12:00 2018 Tags:

I've started my new job. I now work in the release engineering team at Wikimedia, the organisation that runs sites such as Wikipedia. We help put new versions of the software that runs the sites into production. My role is to help make that process more automated and frequent.

Posted Wed Oct 10 09:02:00 2018

I now have a rudimentary roadmap for reaching 1.0 of vmdb2, my Debian image building tool.

Visual roadmap

The visual roadmap is generated from the following YAML file:

vmdb2_1_0:
  label: |
    vmdb2 is production ready
  depends:
    - ci_builds_images
    - docs
    - x220_install

docs:
  label: |
    vmdb2 has a user
    manual of acceptable
    quality

x220_install:
  label: |
    x220 can install Debian
    onto a Thinkpad x220
    laptop

ci_builds_images:
  label: |
    CI builds and publishes
    images using vmdb2
  depends:
    - amd64_images
    - arm_images

amd64_images:
  label: |
    CI: amd64 images

arm_images:
  label: |
    CI: arm images of
    various kinds
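
For illustration, here's a rough sketch of turning a YAML file like the one above into Graphviz input; this is not necessarily how the actual visual roadmap is produced, just the general idea (each key becomes a node, each "depends" entry an edge). It needs PyYAML.

import sys

import yaml


def yaml_to_dot(text):
    """Turn a roadmap YAML document into Graphviz dot source."""
    steps = yaml.safe_load(text)
    lines = ["digraph roadmap {"]
    for name, step in steps.items():
        label = step.get("label", name).strip().replace("\n", "\\n")
        lines.append(f'    {name} [label="{label}", shape=box];')
        for dep in step.get("depends", []):
            lines.append(f"    {name} -> {dep};")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(yaml_to_dot(sys.stdin.read()))

# Render with Graphviz, e.g. (script name is just an example):
#   python3 roadmap2dot.py < roadmap.yaml | dot -Tsvg > roadmap.svg
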
Posted Thu Sep 20 10:58:00 2018 Tags:

I've set up a new website for vmdb2, my tool for building Debian images (basically "debootstrap, except in a disk image"). As usual for my websites, it's ugly. Feedback welcome.

Posted Thu Sep 13 19:43:00 2018 Tags:

I'm starting a new job in about a month. Until then, it'd be really helpful if I could earn some money via a short-term contracting or consulting job. If your company or employer could benefit from any of the following, please get in touch. I will invoice via a Finnish company, not as a person (within the EU, at least, this makes it easier for the clients). I also reside in Finland, if that matters (meaning that meetings outside of Helsinki get tricky).

  • software architecture design and review
  • coding in Python, C, or shell; code review
  • documentation: writing, review
  • git training
  • help with automated testing: unit tests, integration tests
  • help with Ansible
  • packaging and distributing software as .deb packages

Posted Sat Sep 8 09:12:00 2018

In the modern world, a lot of computing happens on other people's computers. We use a lot of services provided by various parties. This is a problem for user freedom and software freedom. For example, when I use Twitter, the software runs on Twitter's servers, and it's entirely proprietary. Even if it were free software, even if it were using the Affero GPL license (AGPL), my freedom would be limited by the fact that I can't change the software running on Twitter's servers.

If I could, it would be a fairly large security problem. If I could, then anyone could, and they might not be good people like I am.

If the software were free, instead of proprietary, I could run it on my own server, or find someone else to run the software for me. This would make me more free.

That still leaves the data. My data would still be on Twitter's servers: all my tweets, direct messages, the lists of people I follow, and of those who follow me. Probably other things as well.

For true freedom in this context, I would need to have a way to migrate my data from Twitter to another service. For practical freedom, the migration should not just be possible in principle: it should not require excessive work or expense.

For Twitter specifically, there are freer alternatives, such as Mastodon.

For ick, my CI / CD engine, here is my current thinking: ick should not be a centralised service. It should be possible to pick and choose between instances of its various components: the controller, the workers, the artifact store, and Qvisqve (authentication server). Ditto for any additional components in the future.

Since users and the components need to have some trust in each other, and there may be payment involved, this may need some co-ordination, and it may not be possible to pick entirely freely. However, as a thought experiment, let's consider a scenario.

Alice has a bunch of non-mainstream computers she doesn't use much herself: Arm boards, RISC-V boards, PowerPC Macs, Amigas, etc. All in good working condition. She'd be happy to set them up as build workers, and let people use them, for a small fee to cover her expenses.

Bettina has a bunch of servers with lots of storage space. She'd be happy to let people use them as artifact stores, for a fee.

Cecilia has a bunch of really fast x86-64 machines, with lots of RAM and very fast NVMe disks. She'd also be happy to rent them out as build workers.

Dinah needs a CI system, but only has one small server, which would work fine as a controller for her own projects, but is too slow to comfortably do any actual building.

Eliza also needs a CI system, but wants to keep her projects separate from Dinah's, so wants to have her own controller. (Eliza and Dinah can't tolerate each other and do not trust each other.)

Fatima is trusted by everyone, except Eliza, and would be happy to run a secure server with Qvisqve.

Georgina is like Fatima, except Eliza trusts her, and Dinah doesn't.

The setup would be like this:

  • Alice and Cecilia run build workers. The workers trust both Fatima's and Georgina's Qvisqves. All of their workers are registered with both Qvisqves, and both Dinah's and Eliza's controllers.

  • Bettina's artifact store also trusts both Qvisqves.

  • Dinah creates an account on Fatima's Qvisqve. Eliza on Georgina's Qvisqve. They each get an API token from the respective Qvisqve.

  • When Dinah's project builds, her controller uses the API token to get an identity token from Fatima's Qvisqve, and gives that to each worker used in her builds. The worker checks the ID token, and then accepts work from Dinah's controller. The worker reports the time used to do the work to its billing system, and Alice or Cecilia uses that information to bill Dinah.

  • If a build needs to use an artifact store, the ID token is again used to bill Dinah.

  • For Eliza, the same thing happens, except with another Qvisqve, and costs from her builds go to her, not to Dinah.

This can be generalised to any number of ick components, which can be used criss-cross. Each component needs to be configured as to which Qvisqves it trusts.
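
To make the trust configuration a bit more concrete, here's a very rough sketch of the check a worker might make before accepting work. The Qvisqve URLs, the /introspect endpoint, and the token format are all hypothetical placeholders, not an existing API; the real protocol would need careful design.

import requests

# The Qvisqve instances this worker trusts (hypothetical URLs).
TRUSTED_QVISQVES = [
    "https://qvisqve.fatima.example",
    "https://qvisqve.georgina.example",
]


def token_is_valid(id_token):
    """Ask each trusted Qvisqve whether it issued this ID token."""
    for base_url in TRUSTED_QVISQVES:
        resp = requests.post(f"{base_url}/introspect", data={"token": id_token})
        if resp.ok and resp.json().get("active"):
            return True
    return False


def accept_work(id_token, build_step):
    """Run a build step only for controllers holding a valid ID token."""
    if not token_is_valid(id_token):
        raise PermissionError("ID token was not issued by a trusted Qvisqve")
    # ... run the build step, and record the time used for billing ...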

I think this would be a nicer thing to have than the centralised hosted ick I've been thinking about so far. Much more complicated, and much more work, of course. But interesting.

There are some interesting and difficult questions about security to solve. I don't want to start thinking about the details yet, I'll play with the general idea first.

What do you think? Send me your thoughts by email.

Posted Thu Aug 30 17:09:00 2018 Tags:

I was watching the Matthew Garrett "Heresies" talk from DebConf17 today. The following thought struck me:

True software freedom for this age: you can get the source code of a service you use, and can set it up on your own server. You can also get all your data from the service, and migrate it to another service (hosted by you or someone else). Further, all of this needs to be easy, fast, and cheap enough to be feasible, and there can't be "network effects" that lock you into a specific service instance.

I will need to think hard what this means for my future projects.

Posted Fri Aug 24 23:20:00 2018 Tags:

This week's issue of LWN has a quote by Linus Torvalds on translating kernel messages into something other than English. He's against it:

Really. No translation. No design for translation. It's a nasty nasty rat-hole, and it's a pain for everybody.

There's another reason I fundamentally don't want translations for any kernel interfaces. If I get an error report, I want to be able to just 'git grep' it. Translations make that basically impossible.

So the fact is, I want simple English interfaces. And people who have issues with that should just not use them. End of story. Use the existing error numbers if you want internationalization, and live with the fact that you only get the very limited error number.

I can understand Linus's point of view. The LWN readers are having a discussion about it, and one of the comments there provoked this blog post:

It somewhat bothers me that English, being the lingua franca of free software development, excludes a pretty huge part of the world from participation. I thought that for a significant part of the world, writing an English commit message has to be more difficult than writing code.

I can understand that point of view as well.

Here's my point of view:

  • It is entirely true that if a project requires English for communication within the project, it discriminates against those who don't know English well.

  • Not having a common language within a project, between those who contribute to the project, now and later, would pretty much destroy any hope of productive collaboration.

    If I have a free software project, and you ask me to merge something where commit messages are in Hindi, error messages in French, and code comments in Swahili, I'm not going to understand any of them. I won't merge what I don't understand.

    If I write my commit messages in Swedish, my code comments in Finnish, and my error messages by entering randomly chosen words from /usr/share/dict/words into a search engine, and taking the page title of the fourteenth hit, then you're not going to understand anything either. You're unlikely to make any changes to my project.

    When Bo finds the project in 2038, and needs it to prevent the apocalypse from 32-bit timestamps running out, and can't understand the README, humanity is doomed.

    Thus, on balance, I'm OK with requiring the use of a single language for intra-project communication.

  • Users should not be presented with text in a language foreign to them. However, this raises a support issue, where a user may copy-paste an error message in their native language, and ask for help, but the developers don't understand the language, and don't even know what the error is. If they knew the error was "permission denied", they could tell the user to run the chmod command to fix the permissions. This is a dilemma.

    I've solved the dilemma by having a unique error code for each error message. If the user tells me "R12345X: Xscpft ulkacsx ggg: /etc/obnam.conf!" I can look up R12345X and see that the error is that /etc/obnam.conf is not in the expected file format.

    This could be improved by making the "parameters" for the error message easy to parse. Perhaps something like this:

    R12345X: Xscpft ulkacsx ggg! filename=/etc/obnam.conf

    Maintaining such error codes by hand would be quite tedious, of course. I invented a module for doing that. Each error message is represented by a class, and the class creates its own error code by taking its Python module and class name, and computing an MD5 of that. The first five hexadecimal digits are the code, and they get surrounded by R and X to make it easier to grep. (A small sketch of the idea follows after this list.)

    (I don't know if something similar might be used for the Linux kernel.)

  • Humans and inter-human communication are difficult. In many cases, there is no solution that's good for everyone. But let's not give up.
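
Here is a minimal Python sketch of the error-code idea described above. It's a simplified illustration, not the actual module: each error is a class, and its code is derived from the MD5 of the class's module and name, so it stays stable as long as the class isn't renamed.

import hashlib


class StructuredError(Exception):

    msg = "Something went wrong"

    def __init__(self, **kwargs):
        super().__init__()
        self.kwargs = kwargs

    def error_code(self):
        """Derive a stable code from the Python module and class name."""
        name = f"{self.__class__.__module__}.{self.__class__.__name__}"
        digest = hashlib.md5(name.encode("utf-8")).hexdigest()
        return f"R{digest[:5]}X"

    def __str__(self):
        params = " ".join(f"{key}={value}" for key, value in self.kwargs.items())
        return f"{self.error_code()}: {self.msg} {params}".strip()


class WrongFileFormat(StructuredError):

    msg = "Configuration file is not in the expected format"


# str(WrongFileFormat(filename="/etc/obnam.conf")) then yields something like
# "R<five hex digits>X: Configuration file is not in the expected format
# filename=/etc/obnam.conf", where the code depends on the hash of the
# module and class name.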

Posted Fri Aug 3 15:49:00 2018