RSDP: the Reproducible Software Distribution Protocol
In broad terms, most if not all software development follows the same lifecycle:
- modify the source (change the code, update dependencies, add a build tool, ...)
- build it
- test it
- ship it
Given this model, discrepancies may arise when the build or test environments (which are controlled and maintained by the development team) are too different from the target environments (typically client machines, maintained by external entities). Those discrepancies can lead to many kinds of hard-to-detect bugs and vulnerabilities, when a dependency isn't found or, worse, when it silently behaves differently than it does in the test environments.
One solution to these problems is to impose stringent requirements on target environments, to bring them closer to a testing configuration. However, this forces developers to adhere to a strict dependency-locking discipline, and it increases the cost of supporting a variety of targets, which can become intractable for complex applications spanning multiple language ecosystems.
In this document, we present an approach to software builds that facilitates the building and distribution of software packages, and that meets the following criteria:
- seamless support for cross-platform builds, requiring little to no effort from developers to build for any system, on any system (even directly on client machines)
- minimal to no administrative access needed to build or install software, giving end users both more freedom and more security
- fully traceable dependencies, providing a complete and exhaustive bill of materials for free for any project
- language- and tool-agnostic, allowing arbitrarily complex combinations of tools and ecosystems to coexist
- lightweight and minimal, so that reproducing a given environment requires as few resources (data, compute, and network) as possible, which also makes it eco-friendly
- no implicit configuration or central authority, to enable a truly universal and decentralized workflow
Concepts and keywords
In order to achieve all the goals listed above, we first need to agree on a precise vocabulary to correctly describe the processes that are involved.
To distribute software to a given machine, we need to know three things:
- on what system should the software run
- where to find the software if it is installed (and thus, where to install it)
- how to build the software if it isn't
Note that there is no mention of the system on which the build takes place. This isn't an omission. Once the software is deployed on a target, there should be no difference whether it was built on one system or another.
In reality, there will be situations where a given package can only be built on a select number of platforms. In those cases, the build tools will simply not be available on other platforms, leading to clear errors when trying to fetch them. In such complex setups, it will also be possible to dispatch build tasks to machines that can handle them when necessary.
Regardless, we only need those three pieces of information to completely describe a given package. The most central of them is how to build the software, which we call a plan.
A given plan may build multiple parts of a software distribution (for instance, build a library and its API documentation in one go). Taken individually, we call those parts the components of a plan (for example, "lib" and "doc"). As such, a component is completely described by a pair <plan, component name>. We also, for symmetry's sake, define a notion of distribution, which is described by a pair <target, plan>.
Thus, a package is defined as a triple <target, plan, component name>.
Conversely, a plan should contain enough information to build its components for any target. We describe how this information can be structured later in this document.
Suppose we have a binary encoding for plans, independent of any platform. We can then define a plan ID to be a (cryptographic) hash of the encoding of a plan. This allows plans to be content-addressed by their plan ID: given a plan ID, it should be possible to request the corresponding plan and, more importantly, to verify upon reception that we indeed got the plan we asked for.
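To make this concrete, here is a minimal sketch of the content-addressing step in Python, assuming the cbor2 library for the binary encoding (the schema later in this document fixes SHA-256 as the content hash; the exact encoder is an implementation detail):

import hashlib

import cbor2  # assumed encoder; any deterministic, platform-independent encoding works


def plan_id(plan) -> bytes:
    # A plan ID is a cryptographic hash of the plan's binary encoding.
    return hashlib.sha256(cbor2.dumps(plan, canonical=True)).digest()


def verify_plan(received: bytes, expected_id: bytes) -> bool:
    # On reception, check that these bytes really are the plan we asked for.
    return hashlib.sha256(received).digest() == expected_id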
Similarly to plans, a package ID can be defined as a triple <target, plan ID, component name>, so that all the information necessary to build a package can be retrieved from a compact digest. The same goes for a component ID, which is just a pair <plan ID, component name>, and a distribution ID, which is a pair <target, plan ID>.
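For illustration, these identifiers are nothing more than small tuples wrapped around a plan ID; a sketch in Python (the type names are hypothetical, and targets are shown as strings here even though the plan format below encodes them as numeric system IDs):

from typing import NamedTuple

PlanId = bytes  # content hash of a plan's encoding


class ComponentId(NamedTuple):
    plan_id: PlanId
    component_name: str    # e.g. "lib" or "doc"


class DistributionId(NamedTuple):
    target: str            # e.g. "linux-x86-64"
    plan_id: PlanId


class PackageId(NamedTuple):
    target: str
    plan_id: PlanId
    component_name: str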
We end up with the following structural model:
                     +------+
                     | plan |
                     +------+
                    /        \
                   /          \
          +target /            \ +component name
                 /              \
                /                \
        +--------------+    +-----------+
        | distribution |    | component |
        +--------------+    +-----------+
                \                /
 +component name \              / +target
                  \            /
                   +-----------+
                   |  package  |
                   +-----------+
Now all we need is to define what it means to install software. Following in the footsteps of the Nix ecosystem, we can attribute a distinct path to each package in a central store, and only allow writing to that path when building the package. Suppose each package gets a path of the form <store>/<target>/<plan ID#64>.<component name>.<shortname> (where the "shortname" is the plan name, and the "plan ID#64" is the base64-encoded plan ID, to produce a valid file name).
The shortname isn't strictly needed at any point during distribution, but it can be useful when navigating the store filesystem, to be able to guess what a package contains (if you see /store/linux-x86-64/aaabbb...zzz.main.firefox-129, you don't need to look the plan up to know what program is installed there).
A package is thus said to be installed if its package path exists on the target filesystem. Otherwise, it must be built (or optionally downloaded from a cache).
(Additionally, since all the information in a package ID can be recovered from the package path, it is also possible to install a package given only its path.)
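As an illustration, here is a sketch of how such a package path could be derived and checked, in Python; the URL-safe base64 alphabet and the stripped padding are assumptions, since the text above only requires the result to be a valid file name:

import base64
import os


def package_path(store: str, target: str, plan_id: bytes,
                 component_name: str, shortname: str) -> str:
    # Build <store>/<target>/<plan ID#64>.<component name>.<shortname>.
    # URL-safe base64 keeps '/' out of the digest, so it stays a single path element.
    plan_id_64 = base64.urlsafe_b64encode(plan_id).decode().rstrip("=")
    return os.path.join(store, target,
                        f"{plan_id_64}.{component_name}.{shortname}")


def is_installed(path: str) -> bool:
    # A package is installed if its package path exists on the target filesystem.
    return os.path.exists(path)


# Example with a dummy all-zero plan ID:
print(package_path("/store", "linux-x86-64", bytes(32), "main", "firefox-129"))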
The plan file format
We now have all the concepts needed to describe a build plan in a platform-independent way. A valid plan should contain a CBOR encoding of the following schema, given as a JSON-like grammar for clarity (with tagged data shown as TAG_NUMBER~VALUE):
PLAN ::= 800~[ PLAN_NAME, BUILD_INFO, METADATA_TREE ]
PLAN_NAME ::= STRING([a-zA-Z0-9+~._-]+)
First of all, a plan contains a plan name, some build information, and some metadata. The plan name is a string of filename-safe characters, since it will be appended to the package path.
Plan metadata
METADATA_TREE ::= 121~METADATA_DIR
| 122~STRING
| 123~INT
METADATA_DIR ::= { ( METADATA_FIELD_NAME: METADATA_TREE )* }
The plan metadata can contain arbitrary user-defined data, organized in a directory-like structure. So, for example, you could store information about the author of a package and a short description as:
{
  "author": {
    "name": "Devid McDevson",
    "email": "devid.mcdevson@dev.org",
    "pubkey": "pub-xxxxxxxx",
    "age": 22
  },
  "synopsis": "An awesome package, that does wonderful stuff"
}
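Under the schema above, such a tree is encoded as nested tagged CBOR values: directories tagged 121, strings 122, and integers 123. A sketch of the encoding, again assuming the cbor2 library:

from cbor2 import CBORTag, dumps

metadata = CBORTag(121, {                              # METADATA_DIR
    "author": CBORTag(121, {                           # nested METADATA_DIR
        "name": CBORTag(122, "Devid McDevson"),        # STRING leaf
        "email": CBORTag(122, "devid.mcdevson@dev.org"),
        "pubkey": CBORTag(122, "pub-xxxxxxxx"),
        "age": CBORTag(123, 22),                       # INT leaf
    }),
    "synopsis": CBORTag(122, "An awesome package, that does wonderful stuff"),
})
encoded = dumps(metadata)                              # ready to embed in a plan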
Metadata attributes are completely arbitrary, and in no way needed to perform a correct build.
However, they can be used to provide human-friendly features, such as information about build times (for progress bars), or links to other versions of a plan.
For example, if you store a public key in a standardized metadata slot, you can sign additional metadata using the corresponding private key, without changing the plan (or the plan ID). This makes it possible to point users to a newer version of a plan, or to give information about the size of a package, both of which are by definition unknown at the time of producing the original plan.
This way of providing extra information preserves the distributed nature of the plan-based workflow. A developer does not need permission to suggest a path to upgrade, and users do not need to ask a central authority to tell them when one is available.
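To illustrate this out-of-band mechanism, here is a sketch of signing an extra metadata record with Ed25519 via the pyca/cryptography package; the signature scheme, the record's field names, and how the signed record gets published are all assumptions, not part of the plan format:

import cbor2
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The public half of this key would sit in a standardized metadata slot of the plan.
signing_key = Ed25519PrivateKey.generate()

# Extra information that did not exist when the plan (and thus its plan ID) was produced.
extra = {"superseded_by": "<plan ID of a newer version>", "package_size": 123456}
payload = cbor2.dumps(extra)
signature = signing_key.sign(payload)  # published alongside the payload; the plan itself is untouched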
Build information
TODO describe in more detail
BUILD_INFO ::= 121~COMMAND_BUILDER
| 122~SOURCE_BUILDER
| 123~INDIRECT_BUILDER
COMMAND_BUILDER ::= [ COMMAND_PATH, COMMAND_ARGS, COMMAND_COMPONENTS, COMMAND_ENV ]
SOURCE_BUILDER ::= { ( COMPONENT_NAME: [COMPONENT_SOURCE, COMPONENT_DEPS] )* }
INDIRECT_BUILDER ::= PATH_FROM(COMPONENT_ID)
COMMAND_PATH ::= PATH_FROM(COMPONENT_ID)
COMMAND_ARGS ::= [STRING]
COMMAND_COMPONENTS ::= { ( COMPONENT_NAME: COMPONENT_DEPS )* }
COMMAND_ENV ::= { ( VAR_NAME: VAR_VALUE )* }
COMPONENT_SOURCE ::= PER_TARGET(SOURCE_ID)
COMPONENT_DEPS ::= [ VAR_NAME* ]
VAR_VALUE ::= 121~ENV_DEPENDENCIES
| 122~PER_TARGET(ENV_STRING)
ENV_DEPENDENCIES ::= [ DEP_IS_TARGET, [ PER_TARGET(META_PACKAGE_ID)* ] ]
ENV_STRING ::= [ ENV_STRING_IS_FILE, STRING ]
# Source archives
SOURCE_TREE ::= 121~SOURCE_DIRECTORY
| 122~[ FILE_IS_EXECUTABLE, FILE_CONTENTS ]
| 123~SYMLINK_DEST
SOURCE_DIRECTORY ::= { (FILE_NAME: SOURCE_TREE)* }
FILE_CONTENTS ::= BYTES
SYMLINK_DEST ::= FILE_PATH
# Basic types
PER_TARGET(DATA) ::= [ DATA, { (SYSTEM_ID: DATA)* } ]
PATH_FROM(ROOT) ::= [ ROOT, [SUB_PATH*] ]
ENV_STRING_IS_FILE ::= BOOL
DEP_IS_TARGET ::= BOOL
FILE_IS_EXECUTABLE ::= BOOL
CONTENT_HASH(DATA) ::= BYTES<32> # A SHA-256 hash of the corresponding encoding
METADATA_FIELD_NAME ::= STRING([a-zA-Z0-9_-]+)
SUB_PATH ::= FILE_NAME
FILE_NAME ::= STRING([:filename:]+)
FILE_PATH ::= BYTES
COMPONENT_NAME ::= STRING([a-zA-Z0-9_-]+)
VAR_NAME ::= STRING([a-zA-Z_]+)
# Identifiers
SYSTEM_ID ::= 1 # Linux x86 64bit
| 2 # Linux ARM 64bit
META_SYSTEM_ID ::= 0 # The target system
| SYSTEM_ID
PLAN_ID ::= CONTENT_HASH(PLAN)
SOURCE_ID ::= CONTENT_HASH(SOURCE_TREE)
COMPONENT_ID ::= [ PLAN_ID, COMPONENT_NAME ]
META_PACKAGE_ID ::= [ SYSTEM_ID, PLAN_ID, META_SYSTEM_ID ]
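To make the schema concrete, here is one possible reading of it: a minimal plan with a single source-built component and empty metadata, again sketched with the cbor2 library. The plan name, component name, and source tree are made up for illustration; the tag numbers are the ones defined above:

import hashlib

import cbor2
from cbor2 import CBORTag

# SOURCE_ID: the content hash of a (here, elided) SOURCE_TREE encoding
src_id = hashlib.sha256(b"encoded source tree goes here").digest()

plan = CBORTag(800, [              # PLAN
    "hello-1.0",                   # PLAN_NAME
    CBORTag(122, {                 # BUILD_INFO, as a SOURCE_BUILDER
        "main": [                  # COMPONENT_NAME
            [src_id, {}],          # COMPONENT_SOURCE: PER_TARGET(SOURCE_ID), no per-system overrides
            [],                    # COMPONENT_DEPS: no environment variables needed
        ],
    }),
    CBORTag(121, {}),              # METADATA_TREE: an empty METADATA_DIR
])

plan_id = hashlib.sha256(cbor2.dumps(plan, canonical=True)).digest()  # PLAN_ID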
Why use RSDP?
There is already a rich ecosystem of tools that promise reproducible builds and installation of software (Nix, Lix and Guix, a lot of language-specific tooling, as well as Bazel, Docker, and too many others to list). It seems reasonable to ask whether another such tool would be necessary.
RSDP vs Nix (or Guix)
Pros of Nixoids:
- no administrative access on the host
- language-agnostic
- composable and expressive
Cons of Nixoids:
- implicit central configuration authority: nixpkgs
- heavyweight, and tied to a single toolset
- platform-specific builds
RSDP is heavily influenced by Nix, from the definition of a build step as a command to execute, to the notion of a store full of hash-indexed packages. Why, then, not simply use Nix?
The difference is mainly in the decoupling of plans from the system they will run on. In Nix, a derivation contains a system, and describes a build for that system only. The same derivation will not run on another system.
Worse, given two build systems $B$ and $B'$ and a target system $T$, the derivation that builds a given program for $T$ will differ depending on whether the program is built on $B$ or on $B'$.
A direct consequence of this multiplicity is that you can't distribute Nix derivations (those .drv files). This, in turn, is why the Nix ecosystem focuses on distributing Nix code instead.
RSDP vs containers/apptainers (Docker)
Pros of *tainers:
- simple to use
- no administrative access on the host
- language-agnostic
Cons of *tainers:
- platform-specific builds
- non-traceable software dependencies, buried within a whole system
- definitely not lightweight
- tool-specific, usually with a central configuration authority
- non-composable
This one is rather simple: containers don't compose well, if at all. Once an app has been packaged as a container, it is no longer possible to use it as a dependency of a more complex app (i.e., you can't have containers inside containers, and if you try, it quickly turns into a maintenance nightmare).
Moreover, containers are by their very nature tied to a given architecture. For instance, a Docker image that runs on x86 will contain a full x86 system, and won't be nearly as useful when run on an ARM chip.
RSDP vs language package managers
Pros of LPMs:
TODO