A Universal Package Specification

One of my favourite things about Unix-like operating systems is package management. It has many technical advantages, such as:

ease of installation - just search, install,
automatic dependency resolution,
reduction in storage usage by utilizing shared libraries (no DLL hell),
clean filesystem,
etc.

It also has many practical advantages, such as listing installed packages, showing package information (version, URL), and finding which package a file belongs to. Unfortunately, the implementation of this concept isn’t always perfect. For instance, the learning curve for Debian packaging is quite steep, and RPM has often been criticised for dependency hell.

When I first looked at Archlinux’s package management system, I was pleasantly surprised at how simple the PKGBUILD format is. It is a shell script, in which variables define metadata, and a build() function specifies the steps required to build the package.

Problems

The Archlinux User Repository allows users to contribute PKGBUILDs without a lengthy review process. The code behind it wasn’t in the greatest shape, so I decided to rewrite it in Python + Django. During development I ran into the problem of parsing the PKGBUILD format: I had to get the metadata into the database somehow. Since a PKGBUILD is just a shell script, I thought the easiest way would be to source it, have the shell output Python, and then eval the result. Unfortunately this has many security problems, including the execution of malicious code and infinite loops (the web server would hang). I was forced to write a minimal shell parser to extract the metadata. While this removes the security concerns, it raises others, such as inaccuracy and maintenance problems (the specification and the parser code are tightly coupled). It turns out the shell grammar isn’t exactly the simplest one.
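To give an idea of what such a minimal parser involves, here is a rough Python sketch. It is not the parser described above: the name extract_metadata and the regular expression are purely illustrative, and it only understands single-line, top-level assignments.

```python
import re
import shlex

# Matches unindented lines of the form name=value (hypothetical, simplified rule).
ASSIGNMENT = re.compile(r"^(\w+)=(.*)$")


def extract_metadata(text):
    """Collect top-level assignments without executing any shell code.

    Only `name=value` and single-line `name=(a b c)` are recognised.
    shlex.split raises ValueError on unbalanced quotes.
    """
    metadata = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match is None:
            continue  # functions, conditionals and indented lines are skipped
        key, raw = match.group(1), match.group(2).strip()
        if raw.startswith("(") and raw.endswith(")"):
            # Array: split the contents using shell-like quoting rules.
            metadata[key] = shlex.split(raw[1:-1])
        else:
            words = shlex.split(raw)
            metadata[key] = words[0] if words else ""
    return metadata


pkgbuild = '''\
pkgname="foo"
pkgver=1.0
arch=('i686' 'x86_64')
source=("http://foo.org/${pkgname}-${pkgver}.tar.gz")

build() {
  make
}
'''
metadata = extract_metadata(pkgbuild)
print(metadata["source"])
```

Anything inside a function or a conditional is ignored, and variables such as ${pkgname} are left unexpanded, which is exactly where the inaccuracy comes from.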

The PKGBUILD format has some warts, which are mostly due to the lack of data structures in bash, such as hash tables. For instance there are two arrays, source and md5sums. Each element in the md5sums array maps to the respective source element. As you can imagine this can easily result in bugs. Fortunately makepkg (the package creation utility) is able to create this field automatically.
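The fragility of the two parallel arrays can be shown with a small Python sketch. This is not what makepkg does internally; pair_sources is a hypothetical helper, and foo.patch is a made-up second source.

```python
def pair_sources(sources, md5sums):
    """Pair each source with the checksum at the same index.

    A length mismatch is the easy case to detect. If the arrays merely
    get out of order, the pairing still "succeeds" and the mistake only
    shows up when a download fails verification.
    """
    if len(sources) != len(md5sums):
        raise ValueError(
            f"{len(sources)} sources but {len(md5sums)} checksums"
        )
    return list(zip(sources, md5sums))


sources = ["foo-1.0.tar.gz", "foo.patch"]
md5sums = [
    "fd085a845298afb36f6676feac855e63",
    "2edd3a155dcbb6632029accd2926d33b",
]
pairs = pair_sources(sources, md5sums)
for source, checksum in pairs:
    print(source, checksum)
```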

Another problem associated with sources and checksums is with binary sources. A binary source is typically different for each supported architecture. There is no way of specifying this easily in the PKGBUILD format. A common solution is to do something like this:

arch=('i686' 'x86_64')
if [ "$CARCH" = "x86_64" ]; then
	source=("http://foo.com/${pkgname}-${pkgver}-x86_64.bin")
	md5sums=(4dacc4474e93bcd4e168fdc48c4e6aee)
else
	source=("http://foo.com/${pkgname}-${pkgver}-i386.bin")
	md5sums=(5001378e4f83d0d6db014eec9182f7b4)
fi

With this workaround, the checksum generation feature of makepkg no longer works properly, and extracting the metadata now requires an interpreter, not just a parser.

Solutions

An Archlinux user started defining an alternate PKGBUILD specification. It addresses some problems with the shell format, such as extensibility (to a degree) and ease of parsing. Unfortunately it is a completely new data format, and thus requires a parser of its own.

Lately I’ve been toying with the idea of creating a universal package specification. The idea is to provide a portable way of defining package metadata, while keeping it simple and extensible. Ideally any package manager would be able to use this format and find enough metadata in it to do what it needs to. The format is extensible through an extensions field, which lets package managers store any data they require that is not already included in the specification. If there is enough demand for an extension, it should be added to the next revision of the specification.

A common data serialization format is YAML. It is simple, easy to parse, and very versatile. For these reasons it was my first choice for the specification. There are already many different parsers in many different languages. Thus, the format should be easily accessible in most languages. Unfortunately it does not seem that bash is one of them, so parts of makepkg would have to be rewritten.

Suggested Format

name: foo
version: 1.0
release: 1
summary: A fictional package to show the YAML package format
description: >
  This package is an example of the YAML universal package format, which
  aims to be portable and extensible. This description can be as long as
  it needs to be.
architectures: # Keys denote architectures supported by this package
  any:
    sources: # URIs of source files, such as a source tarball
      - uri: "http://www.foo.org/${name}-${version}.tar.gz"
        checksums: # keys denote algorithm, values are the checksums
          md5: fd085a845298afb36f6676feac855e63
          sha1: e19f73b340aeae43b98908006094af62e0c7b5b9
      - uri: "shared_data-${version}.tar.gz"
        checksums:
          md5: 2edd3a155dcbb6632029accd2926d33b
          sha1: 776b13c7ff8e8b120a1ea4910a8b8b64c289e6b6
        extract: Yes # Whether an archive should be extracted
icon: foo.png # An icon for GUI package managers
licenses: [MIT]
url: http://www.foo.org
categories: [fictional, example] # Keywords associated with this package
distributions: null # Allows a group of packages to be installed as a
                    # single target
requires:
  - bar >= 2.1
  - eggs == 1.1.2
provides:
  - example
conflicts:
  - foobar
optional: # Keys denote optional packages, values describe why they are
          # optional and what features they provide
  - eggs: To make a nice omelette
  - spam: If you like that sort of thing...
extensions: # User-defined data
  # The following is metadata used by makepkg, but not used by other
  # common package managers.
  options: [strip, "!docs", libtool, emptydirs, zipman]
  backup:
    - /etc/foo.rc
  install: foo.install
  build: |
    ./configure --prefix=/usr
    make
    make DESTDIR="$pkgdir" install
#...

Most of these fields are analogous to those in the current PKGBUILD format. The major difference is the architectures field, a hash table whose keys denote the supported architectures. In the example, the any architecture has two sources, each with its URI and checksums. A conditional is no longer required: the entry for the appropriate architecture is simply retrieved and its sources are used. Each source URI is now stored together with its checksums in a one-to-one mapping, which reduces human error. There is still one problem with this, however. If several architectures share the same sources, there will be duplication. To rectify this to some degree, anchors can be used:

architectures:
  i686: &common_sources
    sources:
      - uri: "http://www.foo.org/${name}-${version}.tar.gz"
        checksums:
          md5: fd085a845298afb36f6676feac855e63
          sha1: e19f73b340aeae43b98908006094af62e0c7b5b9
  x86_64: *common_sources

In some cases it might even be useful to exploit hash merges (the YAML merge key, <<) to specify a common subset of sources and add architecture-specific ones.
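Once the document has been loaded, picking the sources for an architecture is little more than a dictionary lookup. The Python sketch below assumes the YAML has already been parsed into plain dicts and lists. The function sources_for is hypothetical, and falling back to the any entry is my assumption about how a package manager would treat that key.

```python
def sources_for(spec, arch):
    """Return the sources for arch, falling back to the 'any' entry."""
    architectures = spec.get("architectures") or {}
    entry = architectures.get(arch)
    if entry is None:
        entry = architectures.get("any")
    if entry is None:
        raise LookupError(f"architecture {arch!r} is not supported")
    return entry.get("sources", [])


# Roughly what a YAML parser would hand back for the anchored example,
# with the URI written out in full. An alias refers to the same data as
# its anchor, so a single dict is reused here.
common_sources = {
    "sources": [
        {
            "uri": "http://www.foo.org/foo-1.0.tar.gz",
            "checksums": {
                "md5": "fd085a845298afb36f6676feac855e63",
                "sha1": "e19f73b340aeae43b98908006094af62e0c7b5b9",
            },
        }
    ]
}
spec = {"architectures": {"i686": common_sources, "x86_64": common_sources}}

for source in sources_for(spec, "x86_64"):
    print(source["uri"], source["checksums"]["md5"])
```

Because each source carries its own checksums, there is no second array to keep in step.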

Another form of data re-use was common in PKGBUILDs: using variables to reference the package name and version in sources. It looked something like this:

pkgname="foo"
pkgver=1.0
source=("http://foo.org/${pkgname}-${pkgver}.tar.gz")

You might have noticed that a similar syntax was used in the uri:

- uri: "http://www.foo.org/${name}-${version}.tar.gz"

This is not something that YAML supports, so the placeholders would have to be expanded separately. I have not decided on this format yet, and perhaps YAML does indeed have an appropriate feature. For now these values can either be hardcoded or expanded in a second pass over the parsed data.
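One possible shape for that second pass, sketched in Python: the standard library's string.Template happens to use the same ${...} syntax. The function expand_uris is hypothetical, and the input is assumed to be the already-parsed document.

```python
from string import Template


def expand_uris(spec):
    """Return {architecture: [expanded URIs]} for a parsed specification.

    Only ${name} and ${version} are substituted. safe_substitute leaves
    any unknown placeholder untouched instead of raising an error.
    """
    variables = {"name": spec["name"], "version": str(spec["version"])}
    expanded = {}
    for arch, entry in spec["architectures"].items():
        expanded[arch] = [
            Template(source["uri"]).safe_substitute(variables)
            for source in entry["sources"]
        ]
    return expanded


spec = {
    "name": "foo",
    "version": 1.0,  # an unquoted 1.0 in YAML is loaded as a float
    "architectures": {
        "any": {
            "sources": [
                {"uri": "http://www.foo.org/${name}-${version}.tar.gz"},
            ]
        }
    },
}
uris = expand_uris(spec)
print(uris["any"][0])
```

One thing this brings out: because YAML reads an unquoted 1.0 as a number, a version such as 1.10 would lose its trailing zero, so the version field is probably best quoted.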

The extensions field is the interesting part. If the specification doesn’t have some required data, such as the options field in the PKGBUILD specification, it can be added here.