Fossil is not Relational

An Introduction to the Fossil Data Model

Upon hearing that Fossil is based on SQLite, it's natural for people unfamiliar with its internals to assume that Fossil stores its SCM-relevant data in a database-friendly way and that the SCM history can be modified via SQL. The truth, however, is far stranger than that.

This document introduces, at a relatively high level:

  1. The underlying enduring and immutable data format, which is independent of any specific storage engine.

  2. The blob table: Fossil's single point of SCM-relevant data storage.

  3. The transformation of (1) from its immutable raw form to a transient database-friendly form.

  4. Some of the consequences of this model.

Part 1: Artifacts

AllObjects: [
A: file "Artifacts" fill lightskyblue;
down; move to A.s; move 50%;
F: file "Client" "files";
right; move 1; up; move 50%;
B: cylinder "blob table"
right;
arrow from A.e to B.w;
arrow from F.e to B.w;
arrow dashed from B.e;
C: box rad 0.1 "Crosslink" "process";
arrow
AUX: cylinder "Auxiliary" "tables"
arc -> cw dotted from AUX.s to B.s;
] # end of AllObjects

The centerpiece of Fossil's architecture is a data format which describes what we call "artifacts." Each artifact represents the state of one atomic unit of SCM-relevant data, such as a single checkin, a single wiki page edit, a single modification to a ticket, creation or cancellation of tags, and similar SCM constructs. In the cases of checkins and ticket updates, an artifact may record changes to multiple files or multiple ticket fields, respectively, but the change as a whole is atomic. Though we often refer to both Fossil-specific SCM data and client-side content as artifacts, this document uses the term artifact solely for the former.

From the data format's main documentation:

"The global state of a fossil repository is kept simple so that it can endure in useful form for decades or centuries. A fossil repository is intended to be readable, searchable, and extensible by people not yet born."

This format has the following major properties:

Notably, the artifact file format does not...

Such aspects are all considered to be implementation details of higher-level applications (be they in the main fossil binary or a hypothetical 3rd-party application), and have no effect on the underlying artifact data model. That said, in Fossil:

Sidebar: SCM-relevant vs Non-SCM-relevant State

Certain data in Fossil are "SCM-relevant" and certain data are not. In short, SCM-relevant data are managed in a way consistent with controlled versioning of that data. Conversely, non-SCM-relevant data are essentially any state neither specified by nor unambiguously referenced by the artifact file format and are therefore not versioned.

SCM-relevant state includes:

Non-SCM-relevant state includes:

Terminology Hair-splitting: Manifest vs. Artifact

We sometimes refer to artifacts as "manifests," which is technically a term for artifacts which record checkins. The various other artifact types are arguably not "manifests," but are sometimes referred to as such because the internal APIs use that term.

A Very Basic Example

The following artifact, truncated for brevity, represents a typical checkin artifact (a.k.a. a manifest):

C Bug\sfix\sin\sthe\slocal\sdatabase\sfinder.
D 2007-07-30T13:01:08
F src/VERSION 24bbb3aad63325ff33c56d777007d7cd63dc19ea
F src/add.c 1a5dfcdbfd24c65fa04da865b2e21486d075e154
F src/blob.c 8ec1e279a6cd0cfd5f1e3f8a39f2e9a1682e0113
<SNIP>
F www/selfcheck.html 849df9860df602dc2c55163d658c6b138213122f
P 01e7596a984e2cd2bc12abc0a741415b902cbeea
R 74a0432d81b956bfc3ff5a1a2bb46eb5
U drh
Z c9dcc06ecead312b1c310711cb360bc3

Each line is a single data record called a "card." The first letter of each line tells us the type of data stored on that line and the following space-separated tokens contain the data for that line. Tokens which themselves contain spaces (notably the checkin comment) have those escaped as \s. The raw text of wiki pages/comments, forum posts, and ticket bodies/comments is stored directly in the corresponding artifact, but is stored in a way which makes such escaping unnecessary.
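The card layout described above can be illustrated with a minimal Python sketch. It is not Fossil's parser: the function names `parse_cards` and `unescape` are hypothetical, the sample is an abbreviated copy of the manifest shown above, and only the \s escape mentioned in this document is decoded (any other escapes the format may define are ignored here).

```python
from collections import defaultdict

# Abbreviated copy of the example manifest above (illustration only).
SAMPLE = """\
C Bug\\sfix\\sin\\sthe\\slocal\\sdatabase\\sfinder.
D 2007-07-30T13:01:08
F src/VERSION 24bbb3aad63325ff33c56d777007d7cd63dc19ea
F src/add.c 1a5dfcdbfd24c65fa04da865b2e21486d075e154
P 01e7596a984e2cd2bc12abc0a741415b902cbeea
U drh
"""


def unescape(token: str) -> str:
    """Decode the \\s escape; a simplification that handles no other escapes."""
    return token.replace("\\s", " ")


def parse_cards(text: str) -> dict:
    """Group cards by their leading type letter.

    Returns a dict mapping the card letter to a list of token lists,
    one token list per card of that type.
    """
    cards = defaultdict(list)
    for line in text.splitlines():
        if not line:
            continue
        # The first space separates the card type from its tokens.
        card_type, _, rest = line.partition(" ")
        cards[card_type].append([unescape(tok) for tok in rest.split(" ")])
    return dict(cards)


cards = parse_cards(SAMPLE)
print(cards["C"][0][0])   # Bug fix in the local database finder.
print(len(cards["F"]))    # 2
```

Grouping by card letter keeps repeated cards (such as the many F cards of a checkin) together while leaving single-instance cards like D and U easy to look up.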

The hashes seen above are a critical component of the architecture:

Part 2: The blob Table

AllObjects: [
A: file "Artifacts";
down; move to A.s; move 50%;
F: file "Client" "files" fill lightskyblue;
right; move 1; up; move 50%;
B: cylinder "blob table" fill lightskyblue;
right;
arrow from A.e to B.w;
arrow from F.e to B.w;
arrow dashed from B.e;
C: box rad 0.1 "Crosslink" "process";
arrow
AUX: cylinder "Auxiliary" "tables"
arc -> cw dotted from AUX.s to B.s;
] # end of AllObjects

The blob table is the core-most storage of a Fossil repository database, storing all SCM-relevant data (and only SCM-relevant data). Each row of this table holds a single artifact or the content for a single version of a single client-side file. Slightly truncated for clarity, its schema contains the following fields:

Sidebar: How does blob Distinguish Between Artifacts and Client Content?

Notice that the blob table has no flag saying "this record is an artifact" or "this record is client data." Similarly, there is no place in the database dedicated to keeping track of which blob records are artifacts and which are file content.

That said, (A) the type of a blob can be implied via certain table relationships and (B) the event table (the /timeline's main data source) incidentally has a list of artifacts and their sub-types (checkin, wiki, tag, etc.). However, given that all of those relationships, including the timeline, are transient, how can Fossil distinguish between the two types of data?

Fossil's artifact format is extremely rigid and is strictly enforced internally, with zero room provided for leniency. Every artifact which is internally created is re-parsed for validity before it is committed to the database, making it impossible for Fossil to inject an invalid artifact into the repository. Because of the strictness of the artifact parser, the chances that any given piece of arbitrary client data could be successfully parsed as an artifact, even if it is syntactically 99% similar to an artifact, are effectively zero.

Thus Fossil's rule of interpreting the contents of the blob table is: if it can be parsed as an artifact, it is an artifact, else it is opaque client-side data.

That rule is most often relevant in operations like rebuild and reconstruct, both of which necessarily have to sort out artifacts and non-artifact blobs from arbitrary collections of blobs.

It is, in fact, possible to store an artifact unrelated to the current repository in that repository, and it will be parsed and processed as an artifact (see below), but it likely refers to other artifacts or blobs which are not part of the current repository, thereby possibly introducing "strange" data into the UI. If this happens, it's potentially slightly confusing but is functionally harmless.
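The "if it parses, it is an artifact" rule from this sidebar can be sketched in Python as follows. The check in `toy_parse_artifact` is a hypothetical and greatly simplified stand-in for Fossil's real parser, which enforces far more constraints; only the control flow (attempt a strict parse, otherwise treat the blob as opaque) is meant to reflect the text above.

```python
import re

# Toy card syntax (an assumption for this sketch): one uppercase letter,
# one space, then at least one non-space character.
_CARD_RE = re.compile(r"[A-Z] \S.*")


def toy_parse_artifact(content: bytes):
    """Return the list of card lines, or None if the blob is not artifact-like."""
    try:
        text = content.decode("utf-8")
    except UnicodeDecodeError:
        return None
    if not text.endswith("\n"):
        return None
    lines = text[:-1].split("\n")
    # Strictness: every single line must be a well-formed card.
    if not all(_CARD_RE.fullmatch(line) for line in lines):
        return None
    # Toy rule: the final card must be a Z card, as in the example manifest.
    if not lines[-1].startswith("Z "):
        return None
    return lines


def classify(content: bytes) -> str:
    """Apply the rule: parseable means artifact, anything else is opaque."""
    return "artifact" if toy_parse_artifact(content) is not None else "opaque"


print(classify(b"D 2007-07-30T13:01:08\nU drh\nZ c9dcc06ecead312b1c310711cb360bc3\n"))
print(classify(b"int main(void){return 0;}\n"))
```

Because the classification depends only on the blob's bytes, it can be repeated at any time over any collection of blobs, which is what operations like rebuild and reconstruct rely on.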

Part 3: Crosslinking

AllObjects: [
A: file "Artifacts";
down; move to A.s; move 50%;
F: file "Client" "files";
right; move 1; up; move 50%;
B: cylinder "blob table"
right;
arrow from A.e to B.w;
arrow from F.e to B.w;
arrow dashed from B.e;
C: box rad 0.1 "Crosslink" "process" fill lightskyblue;
arrow
AUX: cylinder "Auxiliary" "tables" fill lightskyblue;
arc -> cw dotted from AUX.s to B.s;
] # end of AllObjects

Once an artifact is stored in the blob table, how does one perform SQL queries against its plain-text format? In short: One Does Not Simply Query the Artifacts.

Crosslinking, as it's colloquially known, is a one-way processing step which transforms an immutable artifact's state into something database-friendly. Crosslinking happens automatically every time Fossil generates, or is given, a new artifact. Crosslinking of any given artifact may update many different auxiliary tables, all of which are transient in the sense that they may be destroyed and then recreated by crosslinking all artifacts from the blob table (which is exactly what the rebuild command does). The overwhelming majority of individual database records in any Fossil repository are found in these transient auxiliary tables, though the blob table tends to account for the overwhelming majority of a repository's disk space.

This approach to mapping data from artifacts to the database gives Fossil the freedom to change its database model, effectively at will, with minimal client-side disruption (at most, a call to rebuild). This allows, for example, Fossil to take advantage of new improvements in SQLite without affecting compatibility with older repositories.
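The relationship between the durable blob table and the transient auxiliary tables can be sketched with Python's sqlite3 module. Everything here is a toy model under stated assumptions: the tables `toy_blob` and `toy_event`, their columns, and the functions `crosslink` and `rebuild` are hypothetical names for illustration and do not reproduce Fossil's schema or code. Content is stored as plain text to keep the sketch short.

```python
import sqlite3


def crosslink(db: sqlite3.Connection, rid: int, content: str) -> None:
    """Derive toy auxiliary rows from one blob if it looks like a checkin."""
    fields = {}
    for line in content.splitlines():
        if " " in line:
            letter, value = line.split(" ", 1)
            fields.setdefault(letter, value)
    if "D" not in fields or "U" not in fields:
        return  # treated as opaque client content in this toy model
    db.execute(
        "INSERT INTO toy_event(rid, mtime, user_name) VALUES (?, ?, ?)",
        (rid, fields["D"], fields["U"]),
    )


def rebuild(db: sqlite3.Connection) -> None:
    """Destroy the auxiliary table and recreate it purely from the blobs."""
    db.execute("DROP TABLE IF EXISTS toy_event")
    db.execute(
        "CREATE TABLE toy_event(rid INTEGER PRIMARY KEY, mtime TEXT, user_name TEXT)"
    )
    blobs = db.execute("SELECT rid, content FROM toy_blob ORDER BY rid").fetchall()
    for rid, content in blobs:
        crosslink(db, rid, content)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE toy_blob(rid INTEGER PRIMARY KEY, content TEXT)")
db.executemany(
    "INSERT INTO toy_blob(content) VALUES (?)",
    [
        ("D 2007-07-30T13:01:08\nU drh\n",),   # artifact-like blob
        ("hello, I am a client file\n",),      # opaque client content
    ],
)
rebuild(db)
print(db.execute("SELECT rid, mtime, user_name FROM toy_event").fetchall())
```

Since `toy_event` is derived entirely from `toy_blob`, its layout could be changed in a later version of `rebuild` without touching the stored blobs, which mirrors the freedom described above.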

Auxiliary tables hold data mappings such as:

And numerous other bits and pieces.

The many auxiliary tables maintained by the app-level code reference the blob table via its RID field, as that's far more efficient than using hashes (blob.uuid) as foreign keys. The contexts of those auxiliary data unambiguously tell us whether the referenced blobs are artifacts or file content, so there is no efficiency penalty there for hosting both opaque blobs and artifacts in the blob table.
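As a rough illustration of RID-based references, the following Python sketch resolves a hash to a RID once and then follows integer keys through an auxiliary table. The table names `toy_blob` and `toy_checkin_file`, their columns, the function `files_of`, and the placeholder hashes are all hypothetical and are not taken from Fossil's schema.

```python
import sqlite3

# Obviously fake 40-character placeholder hashes.
HASH_VERSION = "a" * 40
HASH_ADD_C = "b" * 40
HASH_CHECKIN = "c" * 40

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE toy_blob(rid INTEGER PRIMARY KEY, uuid TEXT UNIQUE NOT NULL);
    -- Auxiliary mapping: which file blobs belong to which checkin blob.
    -- Both references are integer RIDs rather than hash strings.
    CREATE TABLE toy_checkin_file(checkin_rid INTEGER, file_rid INTEGER, name TEXT);
    """
)
db.executemany(
    "INSERT INTO toy_blob(rid, uuid) VALUES (?, ?)",
    [(1, HASH_VERSION), (2, HASH_ADD_C), (3, HASH_CHECKIN)],
)
db.executemany(
    "INSERT INTO toy_checkin_file(checkin_rid, file_rid, name) VALUES (?, ?, ?)",
    [(3, 1, "src/VERSION"), (3, 2, "src/add.c")],
)


def files_of(db: sqlite3.Connection, checkin_uuid: str):
    """List (name, hash) pairs for a checkin, joining on RIDs throughout."""
    return db.execute(
        """
        SELECT cf.name, fb.uuid
          FROM toy_blob AS cb
          JOIN toy_checkin_file AS cf ON cf.checkin_rid = cb.rid
          JOIN toy_blob AS fb ON fb.rid = cf.file_rid
         WHERE cb.uuid = ?
         ORDER BY cf.name
        """,
        (checkin_uuid,),
    ).fetchall()


print(files_of(db, HASH_CHECKIN))
```

In this sketch the context of each column (`checkin_rid` versus `file_rid`) is what indicates whether a referenced blob is an artifact or file content; the blob table itself carries no such flag.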

The complete SQL schemas for the core-most auxiliary tables can be found at:

/finfo/src/schema.c?ci=trunk

Note, however, that all database tables are effectively internal APIs, with no API stability guarantees, and are subject to change at any time. Their structures therefore generally should not be relied upon in client-side scripts.

Part 4: Implications and Consequences of the Model

Some of the implications and consequences of Fossil's data model combined with the higher-level access via SQL include: