RSDP: the Reproducible Software Distribution Protocol
In broad terms, most if not all software development follows the same lifecycle:
- modify the source (change the code, update dependencies, add a build tool, ...)
- build it
- test it
- ship it
Given this model, discrepancies may arise when the build or test environments (which are controlled and maintained by the development team) are too different from the target environments (typically client machines, maintained by external entities). Those discrepancies can lead to many kinds of hard-to-detect bugs and vulnerabilities, when a dependency isn't found or, worse, when it silently behaves differently than it does in the test environments.
One solution to these problems is to impose stringent requirements on target environments, to bring them closer to a testing configuration. However, this forces developers to adhere to a strict dependency-locking discipline, and it increases the cost of supporting a variety of targets, which can become intractable for complex applications spanning multiple language ecosystems.
In this document, we present an approach to software builds that facilitates the building and distribution of software packages, and that meets the following criteria:
- seamless support for cross-platform builds, requiring little to no effort from developers to build for any system, on any system (even directly on client machines)
- minimal to no administrative access needed to build or install software, giving end users both more freedom and more security
- fully traceable dependencies, providing a complete and exhaustive bill of materials for free for any project
- language- and tool-agnostic, allowing arbitrarily complex combinations of tools and ecosystems to coexist
- lightweight and minimal, so that reproducing a given environment requires as few resources (data, compute, and network) as possible, which also makes it eco-friendly
- no implicit configuration or central authority, to enable a truly universal and decentralized workflow
Concepts and keywords
In order to achieve all the goals listed above, we first need to agree on a precise vocabulary to correctly describe the processes that are involved.
To distribute software to a given machine, we need to know three things:
- on what system should the software run
- where to find the software if it is installed (and thus, where to install it)
- how to build the software if it isn't
Note that there is no mention of the system on which the build takes place. This isn't an omission. Once the software is deployed on a target, there should be no difference whether it was built on one system or another.
In reality, there will be situations where a given package can only be built on a select number of platforms. In those cases, the build tools will simply not be available on other platforms, leading to clear errors when trying to fetch them. In such complex setups, it will also be possible to dispatch build tasks to machines that can handle them when necessary.
Regardless, we only need those three pieces of information to completely describe a given package. The most central of them is how to build the software, which we call a plan.
A given plan may build multiple parts of a software distribution (for instance, build a library and its API documentation in one go). Taken individually, we call those parts the components of a plan (for example, "lib" and "doc"). As such, a component is completely described by a pair <plan, component name>. We also, for symmetry's sake, define a notion of distribution, which is described by a pair <target, plan>.
Thus, a package is defined as a triple <target, plan, component name>.
Conversely, a plan should contain enough information to build its components for any target. We describe how this information can be structured later in this document.
Suppose we have a binary encoding for plans, independent of any platform. We can then define a plan ID to be a (cryptographic) hash of the encoding of a plan. This allows plans to be content-addressed by their plan ID: given a plan ID, it should be possible to request the corresponding plan and, more importantly, to verify upon reception that we indeed got the plan we asked for.
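To make this concrete, here is a minimal sketch of the content-addressing step in Python, assuming the cbor2 library for the binary encoding (the schema later in this document fixes SHA-256 as the content hash; the exact encoder is an implementation detail):

import hashlib

import cbor2  # assumed encoder; any deterministic, platform-independent encoding works


def plan_id(plan) -> bytes:
    # A plan ID is a cryptographic hash of the plan's binary encoding.
    return hashlib.sha256(cbor2.dumps(plan, canonical=True)).digest()


def verify_plan(received: bytes, expected_id: bytes) -> bool:
    # On reception, check that these bytes really are the plan we asked for.
    return hashlib.sha256(received).digest() == expected_id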
Similarly to plans, a package ID can be defined as a triple <target, plan ID, component name>, so that all the information necessary to build a package can be retrieved from a compact digest. The same goes for a component ID, which is just a pair <plan ID, component name>, and a distribution ID, which is a pair <target, plan ID>.
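For illustration, these identifiers are nothing more than small tuples wrapped around a plan ID; a sketch in Python (the type names are hypothetical, and targets are shown as strings here even though the plan format below encodes them as numeric system IDs):

from typing import NamedTuple

PlanId = bytes  # content hash of a plan's encoding


class ComponentId(NamedTuple):
    plan_id: PlanId
    component_name: str    # e.g. "lib" or "doc"


class DistributionId(NamedTuple):
    target: str            # e.g. "linux-x86-64"
    plan_id: PlanId


class PackageId(NamedTuple):
    target: str
    plan_id: PlanId
    component_name: str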
We end up with the following structural model:
                     +------+
                     | plan |
                     +------+
                    /        \
                   /          \
          +target /            \ +component name
                 /              \
                /                \
        +--------------+    +-----------+
        | distribution |    | component |
        +--------------+    +-----------+
                \                /
 +component name \              / +target
                  \            /
                   +-----------+
                   |  package  |
                   +-----------+
Now all we need is to define what it means to install software. Following in the footsteps of the Nix ecosystem, we can attribute a distinct path to each package in a central store, and only allow writing to that path when building the package. Suppose each package gets a path of the form <store>/<target>/<plan ID#64>.<component name>.<shortname> (where the "shortname" is the plan name, and the "plan ID#64" is the base64-encoded plan ID, to produce a valid file name).
The shortname isn't strictly needed at any point during distribution, but it can be useful when navigating the store filesystem, to be able to guess what a package contains (if you see /store/linux-x86-64/aaabbb...zzz.main.firefox-129, you don't need to look the plan up to know what program is installed there).
A package is thus said to be installed if its package path exists on the target filesystem. Otherwise, it must be built (or optionally downloaded from a cache).
(Additionally, since all the information in a package ID can be recovered from the package path, it is also possible to install a package given only its path.)
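As an illustration, here is a sketch of how such a package path could be derived and checked, in Python; the URL-safe base64 alphabet and the stripped padding are assumptions, since the text above only requires the result to be a valid file name:

import base64
import os


def package_path(store: str, target: str, plan_id: bytes,
                 component_name: str, shortname: str) -> str:
    # Build <store>/<target>/<plan ID#64>.<component name>.<shortname>.
    # URL-safe base64 keeps '/' out of the digest, so it stays a single path element.
    plan_id_64 = base64.urlsafe_b64encode(plan_id).decode().rstrip("=")
    return os.path.join(store, target,
                        f"{plan_id_64}.{component_name}.{shortname}")


def is_installed(path: str) -> bool:
    # A package is installed if its package path exists on the target filesystem.
    return os.path.exists(path)


# Example with a dummy all-zero plan ID:
print(package_path("/store", "linux-x86-64", bytes(32), "main", "firefox-129"))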
The plan file format
We now have all the concepts needed to describe a build plan in a platform-independent way. A valid plan should contain a CBOR encoding of the following schema, given as a JSON-like grammar for clarity (with tagged data shown as TAG_NUMBER~VALUE):
PLAN ::= 800~[ PLAN_NAME, BUILD_INFO, METADATA_TREE ]
PLAN_NAME ::= STRING([a-zA-Z0-9+~._-]+)
First of all, a plan contains a plan name, some build information, and some metadata. The plan name is a string of filename-safe characters, since it will be appended to the package path.
Plan metadata
METADATA_TREE ::= 121~METADATA_DIR
| 122~STRING
| 123~INT
METADATA_DIR ::= { ( METADATA_FIELD_NAME: METADATA_TREE )* }
The plan metadata can contain arbitrary user-defined data, organized in a directory-like structure. So, for example, you could store information about the author of a package and a short description as:
{
  "author": {
    "name": "Devid McDevson",
    "email": "devid.mcdevson@dev.org",
    "pubkey": "pub-xxxxxxxx",
    "age": 22
  },
  "synopsis": "An awesome package, that does wonderful stuff"
}
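Under the schema above, such a tree is encoded as nested tagged CBOR values: directories tagged 121, strings 122, and integers 123. A sketch of the encoding, again assuming the cbor2 library:

from cbor2 import CBORTag, dumps

metadata = CBORTag(121, {                              # METADATA_DIR
    "author": CBORTag(121, {                           # nested METADATA_DIR
        "name": CBORTag(122, "Devid McDevson"),        # STRING leaf
        "email": CBORTag(122, "devid.mcdevson@dev.org"),
        "pubkey": CBORTag(122, "pub-xxxxxxxx"),
        "age": CBORTag(123, 22),                       # INT leaf
    }),
    "synopsis": CBORTag(122, "An awesome package, that does wonderful stuff"),
})
encoded = dumps(metadata)                              # ready to embed in a plan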
Metadata attributes are completely arbitrary, and in no way needed to perform a correct build.
However, they can be used to provide human-friendly features, such as information about build times (for progress bars), or links to other versions of a plan.
For example, if you store a public key in a standardized metadata slot, you can sign additional metadata using the corresponding private key, without changing the plan (or the plan ID). This makes it possible to point users to a newer version of a plan, or to give information about the size of a package, both of which are by definition unknown at the time of producing the original plan.
This way of providing extra information preserves the distributed nature of the plan-based workflow. A developer does not need permission to suggest a path to upgrade, and users do not need to ask a central authority to tell them when one is available.
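To illustrate this out-of-band mechanism, here is a sketch of signing an extra metadata record with Ed25519 via the pyca/cryptography package; the signature scheme, the record's field names, and how the signed record gets published are all assumptions, not part of the plan format:

import cbor2
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The public half of this key would sit in a standardized metadata slot of the plan.
signing_key = Ed25519PrivateKey.generate()

# Extra information that did not exist when the plan (and thus its plan ID) was produced.
extra = {"superseded_by": "<plan ID of a newer version>", "package_size": 123456}
payload = cbor2.dumps(extra)
signature = signing_key.sign(payload)  # published alongside the payload; the plan itself is untouched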
Build information
TODO describe in more detail
BUILD_INFO ::= 121~COMMAND_BUILDER
| 122~SOURCE_BUILDER
| 123~INDIRECT_BUILDER
COMMAND_BUILDER ::= [ COMMAND_PATH, COMMAND_ARGS, COMMAND_COMPONENTS, COMMAND_ENV ]
SOURCE_BUILDER ::= { ( COMPONENT_NAME: [COMPONENT_SOURCE, COMPONENT_DEPS] )* }
INDIRECT_BUILDER ::= PATH_FROM(COMPONENT_ID)
COMMAND_PATH ::= PATH_FROM(COMPONENT_ID)
COMMAND_ARGS ::= [STRING]
COMMAND_COMPONENTS ::= { ( COMPONENT_NAME: COMPONENT_DEPS )* }
COMMAND_ENV ::= { ( VAR_NAME: VAR_VALUE )* }
COMPONENT_SOURCE ::= PER_TARGET(SOURCE_ID)
COMPONENT_DEPS ::= [ VAR_NAME* ]
VAR_VALUE ::= 121~ENV_DEPENDENCIES
| 122~PER_TARGET(ENV_STRING)
ENV_DEPENDENCIES ::= [ DEP_IS_TARGET, [ PER_TARGET(META_PACKAGE_ID)* ] ]
ENV_STRING ::= [ ENV_STRING_IS_FILE, STRING ]
# Source archives
SOURCE_TREE ::= 121~SOURCE_DIRECTORY
| 122~[ FILE_IS_EXECUTABLE, FILE_CONTENTS ]
| 123~SYMLINK_DEST
SOURCE_DIRECTORY ::= { (FILE_NAME: SOURCE_TREE)* }
FILE_CONTENTS ::= BYTES
SYMLINK_DEST ::= FILE_PATH
# Basic types
PER_TARGET(DATA) ::= [ DATA, { (SYSTEM_ID: DATA)* } ]
PATH_FROM(ROOT) ::= [ ROOT, [SUB_PATH*] ]
ENV_STRING_IS_FILE ::= BOOL
DEP_IS_TARGET ::= BOOL
FILE_IS_EXECUTABLE ::= BOOL
CONTENT_HASH(DATA) ::= BYTES<32> # A SHA-256 hash of the corresponding encoding
METADATA_FIELD_NAME ::= STRING([a-zA-Z0-9_-]+)
SUB_PATH ::= FILE_NAME
FILE_NAME ::= STRING([:filename:]+)
FILE_PATH ::= BYTES
COMPONENT_NAME ::= STRING([a-zA-Z0-9_-]+)
VAR_NAME ::= STRING([a-zA-Z_]+)
# Identifiers
SYSTEM_ID ::= 1 # Linux x86 64bit
| 2 # Linux ARM 64bit
META_SYSTEM_ID ::= 0 # The target system
| SYSTEM_ID
PLAN_ID ::= CONTENT_HASH(PLAN)
SOURCE_ID ::= CONTENT_HASH(SOURCE_TREE)
COMPONENT_ID ::= [ PLAN_ID, COMPONENT_NAME ]
META_PACKAGE_ID ::= [ SYSTEM_ID, PLAN_ID, META_SYSTEM_ID ]
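To make the schema concrete, here is one possible reading of it: a minimal plan with a single source-built component and empty metadata, again sketched with the cbor2 library. The plan name, component name, and source tree are made up for illustration; the tag numbers are the ones defined above:

import hashlib

import cbor2
from cbor2 import CBORTag

# SOURCE_ID: the content hash of a (here, elided) SOURCE_TREE encoding
src_id = hashlib.sha256(b"encoded source tree goes here").digest()

plan = CBORTag(800, [              # PLAN
    "hello-1.0",                   # PLAN_NAME
    CBORTag(122, {                 # BUILD_INFO, as a SOURCE_BUILDER
        "main": [                  # COMPONENT_NAME
            [src_id, {}],          # COMPONENT_SOURCE: PER_TARGET(SOURCE_ID), no per-system overrides
            [],                    # COMPONENT_DEPS: no environment variables needed
        ],
    }),
    CBORTag(121, {}),              # METADATA_TREE: an empty METADATA_DIR
])

plan_id = hashlib.sha256(cbor2.dumps(plan, canonical=True)).digest()  # PLAN_ID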
Why use RSDP?
There is already a rich ecosystem of tools that promise reproducible builds and installation of software (Nix, Lix and Guix, a lot of language-specific tooling, as well as Bazel, Docker, and too many others to list). It seems reasonable to ask whether another such tool would be necessary.
RSDP vs Nix (or Guix)
Pros of Nixoids:
- no administrative access on the host
- language-agnostic
- composable and expressive
Cons of Nixoids:
- implicit central configuration authority: nixpkgs
- heavyweight, and tied to a single toolset
- platform-specific builds
RSDP is heavily influenced by Nix, from the definition of a build step as a command to execute, to the notion of a store full of hash-indexed packages. Why, then, not simply use Nix?
The difference is mainly in the decoupling of plans from the system they will run on. In Nix, a derivation contains a system, and describes a build for that system only. The same derivation will not run on another system.
Worse, given two build systems $B$ and $B'$ and a target system $T$, the derivation that builds a given program for $T$ will differ depending on whether the program is built on $B$ or on $B'$.
A direct consequence of this multiplicity is that you can't distribute Nix derivations (those .drv files). This, in turn, is why the Nix ecosystem focuses on distributing Nix code instead.
RSDP vs containers/apptainers (Docker)
Pros of *tainers:
- simple to use
- no administrative access on the host
- language-agnostic
Cons of *tainers:
- platform-specific builds
- non-traceable software dependencies, buried within a whole system
- definitely not lightweight
- tool-specific, usually with a central configuration authority
- non-composable
This one is rather simple: containers don't compose well, if at all. Once an app has been packaged as a container, it is no longer possible to use it as a dependency of a more complex app (i.e., you can't have containers inside containers, and if you try, it quickly turns into a maintenance nightmare).
Moreover, containers are by their very nature tied to a given architecture. For instance, a Docker image that runs on x86 will contain a full x86 system, and won't be nearly as useful when run on an ARM chip.
RSDP vs language package managers
Pros of LPMs:
TODO