Fossil File Formats
The global state of a fossil repository is kept simple so that it can endure in useful form for decades or centuries. A fossil repository is intended to be readable, searchable, and extensible by people not yet born.
The global state of a fossil repository is an unordered set of artifacts. An artifact might be a source code file, the text of a wiki page, part of a trouble ticket, a description of a check-in including all the files in that check-in with the check-in comment and so forth. Artifacts are broadly grouped into two types: content artifacts and structural artifacts. Content artifacts are the raw project source-code files that are checked into the repository. Structural artifacts have special formatting rules and are used to show the relationships between other artifacts in the repository. It is possible for an artifact to be both a structure artifact and a content artifact, though this is rare. Artifacts can be text or binary.
In addition to the global state, each fossil repository also contains local state. The local state consists of web-page formatting preferences, authorized users, ticket display and reporting formats, and so forth. The global state is shared in common among all repositories for the same project, whereas the local state is often different in separate repositories. The local state is not versioned and is not synchronized with the global state. The local state is not composed of artifacts and is not intended to be enduring. This document is concerned with global state only. Local state is only mentioned here in order to distinguish it from global state.
1.0 Artifact Names
Each artifact in the repository is named by a hash of its content. No prefixes, suffixes, or other information is added to an artifact before the hash is computed. The artifact name is just the (lower-case hexadecimal) hash of the raw artifact.
Fossil currently computes artifact names using either SHA1 or SHA3-256. It is relatively easy to add new algorithms in the future, but there are no plans to do so at this time.
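The naming rule can be illustrated with a short Python sketch. The function name `artifact_name` and the algorithm labels `"sha1"` and `"sha3-256"` are hypothetical choices for this example and are not taken from Fossil itself.

```python
import hashlib

def artifact_name(content: bytes, algorithm: str = "sha3-256") -> str:
    """Return the lower-case hexadecimal hash of the raw artifact bytes.

    Nothing is prepended or appended to the content before hashing.
    """
    if algorithm == "sha1":
        return hashlib.sha1(content).hexdigest()
    if algorithm == "sha3-256":
        return hashlib.sha3_256(content).hexdigest()
    raise ValueError(f"unsupported algorithm: {algorithm}")
```

A SHA1 name is 40 hexadecimal characters long and a SHA3-256 name is 64 characters long.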
When referring to artifacts in tty commands or webpage URLs, it is sufficient to specify a unique prefix for the artifact name. If the input prefix is not unique, Fossil will show an error. Within a structural artifact, however, all references to other artifacts must be the complete hash.
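As a rough illustration of prefix lookup, the following Python sketch resolves a prefix against a list of known artifact names. The function `resolve_prefix` and its error types are hypothetical; how Fossil itself performs the lookup and reports errors is not described here.

```python
def resolve_prefix(prefix: str, known_names: list) -> str:
    """Return the single artifact name that starts with the given prefix."""
    matches = [name for name in known_names if name.startswith(prefix)]
    if not matches:
        raise KeyError(f"no artifact matches prefix {prefix!r}")
    if len(matches) > 1:
        # The prefix is ambiguous, so it cannot be used as a name.
        raise ValueError(f"prefix {prefix!r} is not unique")
    return matches[0]
```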
Prior to Fossil version 2.0, all names were formed from the SHA1 hash of the artifact. The key innovation in Fossil 2.0 was adding support for alternative hash algorithms.
2.0 Structural Artifacts
A structural artifact is an artifact with a particular format that is used to define the relationships between other artifacts in the repository. Fossil recognizes the following kinds of structural artifacts:
Manifests
Clusters
Control Artifacts
Wiki Pages
Ticket Changes
Attachments
Technical Notes
Forum Posts
These eight structural artifact types are described in subsections below.
Structural artifacts are ASCII text. The artifact may be PGP clearsigned. After removal of the PGP clearsign header and suffix (if any) a structural artifact consists of one or more "cards" separated by a single newline (ASCII: 0x0a) character. Each card begins with a single-character "card type". Zero or more arguments may follow the card type. All arguments are separated from each other and from the card-type character by a single space character. There is no surplus white space between arguments and no leading or trailing whitespace except for the newline character that acts as the card separator. All cards must be in strict lexicographical order (with one exception retained for compatibility with an historical bug, described in section 4.1). There may not be any duplicate cards.
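The card layout described above can be sketched as a small Python parser. This is an illustration only: `parse_cards` is a hypothetical name, the input is assumed to have had any PGP clearsign wrapper removed already, W cards (whose text spans several lines) are not handled, and "strict lexicographical order" is interpreted here as a comparison of whole card lines, which is an assumption of this sketch rather than a statement about Fossil's parser.

```python
def parse_cards(text: str) -> list:
    """Split a simple structural artifact into (card_type, arguments) pairs."""
    if text.endswith("\n"):
        text = text[:-1]  # drop the newline that terminates the last card
    cards = []
    previous = None
    for line in text.split("\n"):
        fields = line.split(" ")
        # The card type is exactly one character; an empty field would mean
        # leading, trailing, or doubled spaces, none of which are allowed.
        if len(fields[0]) != 1 or "" in fields[1:]:
            raise ValueError(f"malformed card: {line!r}")
        # Strictly increasing lines also rule out duplicate cards.
        if previous is not None and line <= previous:
            raise ValueError(f"card out of order or duplicated: {line!r}")
        previous = line
        cards.append((fields[0], fields[1:]))
    return cards
```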
In the current implementation (as of 2017-02-27) the artifacts that make up a fossil repository are stored as delta- and zlib-compressed blobs in an SQLite database. This is an implementation detail and might change in a future release. For the purpose of this article "file format" means the format of the artifacts, not how the artifacts are stored on disk. It is the artifact format that is intended to be enduring. The specifics of how artifacts are stored on disk, though stable, are not intended to live as long as the artifact format.
2.1 The Manifest
A manifest defines a check-in. A manifest contains a list of artifacts for each file in the project and the corresponding filenames, as well as information such as parent check-ins, the username of the programmer who created the check-in, the date and time when the check-in was created, and any check-in comments associated with the check-in.
Allowed cards in the manifest are as follows:
B baseline-manifest
C checkin-comment
D time-and-date-stamp
F filename ?hash? ?permissions? ?old-name?
N mimetype
P artifact-hash+
Q (+|-)artifact-hash ?artifact-hash?
R repository-checksum
T (+|-|*)tag-name * ?value?
U user-login
Z manifest-checksum
A manifest may optionally have a single B card. The B card specifies another manifest that serves as the "baseline" for this manifest. A manifest that has a B card is called a delta-manifest and a manifest that omits the B card is a baseline-manifest. The other manifest identified by the argument of the B card must be a baseline-manifest. A baseline-manifest records the complete contents of a check-in. A delta-manifest records only changes from its baseline.
A manifest must have exactly one C card. The sole argument to the C card is a check-in comment that describes the check-in that the manifest defines. The check-in comment is text. The following escape sequences are applied to the text: A space (ASCII 0x20) is represented as "\s" (ASCII 0x5C, 0x73). A newline (ASCII 0x0a) is "\n" (ASCII 0x5C, 0x6E). A backslash (ASCII 0x5C) is represented as two backslashes "\\". Apart from space and newline, no other whitespace characters are allowed in the check-in comment. Nor are any unprintable characters allowed in the comment.
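The three escape sequences listed above can be sketched in Python as follows. The names `encode_text` and `decode_text` are hypothetical, and the sketch covers only the escapes named in this document.

```python
def encode_text(text: str) -> str:
    """Apply the C-card escapes: backslash, space, and newline."""
    # Backslashes are doubled first so the escapes added below stay intact.
    return text.replace("\\", "\\\\").replace(" ", "\\s").replace("\n", "\\n")

_DECODE = {"s": " ", "n": "\n", "\\": "\\"}

def decode_text(encoded: str) -> str:
    """Reverse encode_text, scanning left to right one escape at a time."""
    out = []
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch == "\\":
            if i + 1 >= len(encoded) or encoded[i + 1] not in _DECODE:
                raise ValueError(f"bad escape at offset {i}")
            out.append(_DECODE[encoded[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Decoding scans sequentially rather than using chained replacements so that an escaped backslash followed by the letter "s" is not mistaken for an escaped space.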
A manifest must have exactly one D card. The sole argument to the D card is a date-time stamp in the ISO8601 format. The date and time should be in coordinated universal time (UTC). The format is one of:
YYYY-MM-DDTHH:MM:SS
YYYY-MM-DDTHH:MM:SS.SSS
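A minimal Python sketch of parsing the two timestamp forms is shown below. The function name `parse_timestamp` is hypothetical, and the result is tagged as UTC on the assumption that the D card follows the recommendation above.

```python
import re
from datetime import datetime, timezone

# Accept exactly the two layouts listed above: seconds with or without
# a three-digit fractional part.
_STAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d{3})?", re.ASCII)

def parse_timestamp(stamp: str) -> datetime:
    """Parse a D-card argument into a timezone-aware UTC datetime."""
    match = _STAMP.fullmatch(stamp)
    if not match:
        raise ValueError(f"not a valid date-time stamp: {stamp!r}")
    fmt = "%Y-%m-%dT%H:%M:%S.%f" if match.group(1) else "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
```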
A manifest has zero or more F cards. Each F card identifies a file that is part of the check-in. There are one, two, three, or four arguments. The first argument is the pathname of the file in the check-in relative to the root of the project file hierarchy. No ".." or "." directories are allowed within the filename. Space characters are escaped as in C card comment text. Backslash characters and newlines are not allowed within filenames. The directory separator character is a forward slash (ASCII 0x2F). The second argument to the F card is the lower-case hexadecimal artifact hash of the content artifact. The second argument is required for baseline manifests but is optional for delta manifests. When the second argument to the F card is omitted, it means that the file has been deleted relative to the baseline (files removed in a baseline manifest are simply not listed as F cards). The optional third argument defines any special access permissions associated with the file. This can be "x" to mean that the file is executable or "l" (small letter ell) to mean a symlink. All files are always readable and writable. This can be expressed by a "w" permission if desired but is optional. The file format might be extended with new permission letters in the future. The optional fourth argument is the name of the same file as it existed in the parent check-in. If the name of the file is unchanged from its parent, then the fourth argument is omitted.
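The F card rules, together with the baseline and delta relationship described for the B card, can be sketched in Python. The names `parse_f_card` and `effective_files` are hypothetical, and the hashes in the usage example are shortened placeholders rather than real artifact names.

```python
def parse_f_card(line: str):
    """Return (filename, hash_or_None, permissions, old_name_or_None)."""
    fields = line.split(" ")
    if fields[0] != "F" or not 2 <= len(fields) <= 5:
        raise ValueError(f"not a valid F card: {line!r}")
    # Pad the missing optional arguments with None.
    name, artifact_hash, perms, old_name = fields[1:] + [None] * (5 - len(fields))
    return (
        name.replace("\\s", " "),  # only spaces are escaped in filenames
        artifact_hash,
        perms or "",
        old_name.replace("\\s", " ") if old_name else None,
    )

def effective_files(baseline_cards, delta_cards) -> dict:
    """Overlay a delta manifest's F cards on those of its baseline."""
    files = {}
    for line in baseline_cards:
        name, artifact_hash, _, _ = parse_f_card(line)
        files[name] = artifact_hash
    for line in delta_cards:
        name, artifact_hash, _, _ = parse_f_card(line)
        if artifact_hash is None:
            files.pop(name, None)  # omitted hash: deleted relative to baseline
        else:
            files[name] = artifact_hash
    return files
```

For example, overlaying the delta cards `F b.c` and `F c.c 4444` on a baseline that lists `a.c` and `b.c` yields a check-in containing `a.c` and `c.c`.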
A manifest has zero or one N cards. The N card specifies the mimetype for the text in the comment of the C card. If the N card is omitted, a default mimetype is used.
A manifest has zero or one P cards. Most manifests have one P card. The P card has a varying number of arguments that define other manifests from which the current manifest is derived. Each argument is a lowercase hexadecimal artifact hash of a predecessor manifest. All arguments to the P card must be unique within that card. The first argument is the artifact hash of the direct ancestor of the manifest. Other arguments define manifests with which the first was merged to yield the current manifest. Most manifests have a P card with a single argument. The first manifest in the project has no ancestors and thus has no P card or (depending on the Fossil version) an empty P card (no arguments).
A manifest has zero or more Q cards. A Q card is similar to a P card in that it defines a predecessor to the current check-in. But whereas a P card defines the immediate ancestor or a merge ancestor, the Q card is used to identify a single check-in or a small range of check-ins which were cherry-picked for inclusion in or exclusion from the current manifest. The first argument of the Q card is the artifact ID of another manifest (the "target") which has had its changes included or excluded in the current manifest. The target is preceded by "+" or "-" to show inclusion or exclusion, respectively. The optional second argument to the Q card is another manifest artifact ID which is the "baseline" for the cherry-pick. If omitted, the baseline is the primary parent of the target. The changes included or excluded consist of all changes moving from the baseline to the target.
The Q card was added to the interface specification on 2011-02-26. Older versions of Fossil will reject manifests that contain Q cards.
A manifest may optionally have a single R card. The R card has a single argument which is the MD5 checksum of all files in the check-in except the manifest itself. The checksum is expressed as 32 characters of lowercase hexadecimal. The checksum is computed as follows: For each file in the check-in (except for the manifest itself) in strict sorted lexicographical order, take the pathname of the file relative to the root of the repository, append a single space (ASCII 0x20), the size of the file in ASCII decimal, a single newline character (ASCII 0x0A), and the complete text of the file. Compute the MD5 checksum of the result.
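The R card computation can be sketched in Python as below. The function name `r_card_checksum` is hypothetical. The sketch assumes that pathnames are in decoded form (real spaces rather than "\s"), that they are encoded as UTF-8, and that "strict sorted lexicographical order" means a bytewise sort of the pathnames; none of these details is confirmed by the text above.

```python
import hashlib

def r_card_checksum(files: dict) -> str:
    """Compute an R-card value from a {pathname: content_bytes} mapping."""
    md5 = hashlib.md5()
    for name in sorted(files, key=lambda n: n.encode("utf-8")):
        content = files[name]
        # Header: pathname, one space, decimal size, one newline.
        header = name.encode("utf-8") + b" " + str(len(content)).encode("ascii") + b"\n"
        md5.update(header)
        md5.update(content)
    return md5.hexdigest()
```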
A manifest might contain one or more T cards used to set tags or properties on the check-in. The format of the T card is the same as described in the Control Artifacts section below, except that the second argument is the single character "*" instead of an artifact ID. The * in place of the artifact ID indicates that the tag or property applies to the current artifact. It is not possible to encode the current artifact ID as part of an artifact, since the act of inserting the artifact ID would change the artifact ID, hence a * is used to represent "self". T cards are typically added to manifests in order to set the branch property and a symbolic name when the check-in is intended to start a new branch.
Each manifest has a single U card. The argument to the U card is the login of the user who created the manifest. The login name is encoded using the same character escapes as are used for the check-in comment argument to the C card.
A manifest must have a single Z card as its last line. The argument to the Z card is a 32-character lowercase hexadecimal MD5 hash of all prior lines of the manifest up to and including the newline character that immediately precedes the "Z", excluding any PGP clear-signing prefix. The Z card is a sanity check to prove that the manifest is well-formed and consistent.
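A minimal Python sketch of the Z card check follows. The function name `z_card_is_valid` is hypothetical, and the input is assumed to have had any PGP clear-signing wrapper removed already.

```python
import hashlib

def z_card_is_valid(artifact: bytes) -> bool:
    """Check the trailing Z card against the MD5 of all prior lines."""
    # The final line is "Z ", 32 hex digits, and a newline: 35 bytes.
    if len(artifact) < 35 or not artifact.endswith(b"\n"):
        return False
    body, z_line = artifact[:-35], artifact[-35:]
    if not z_line.startswith(b"Z "):
        return False
    if body and not body.endswith(b"\n"):
        return False  # the Z card must start on its own line
    expected = hashlib.md5(body).hexdigest().encode("ascii")
    return z_line[2:-1] == expected
```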
A sample manifest from Fossil itself can be seen here.
2.2 Clusters
A cluster is an artifact that declares the existence of other artifacts. Clusters are used during repository synchronization to help reduce network traffic. As such, clusters are an optimization and may be removed from a repository without loss or damage to the underlying project code.
Allowed cards in the cluster are as follows:
M artifact-id
Z checksum
A cluster contains one or more M cards followed by a single Z card. Each M card has a single argument which is the artifact ID of another artifact in the repository. The Z card works exactly like the Z card of a manifest. The argument to the Z card is the lower-case hexadecimal representation of the MD5 checksum of all prior cards in the cluster. The Z card is required.
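As an illustration, a cluster could be assembled in Python as follows. The name `build_cluster` is hypothetical, and sorting the M cards by artifact ID is this sketch's reading of the general card-ordering rule.

```python
import hashlib

def build_cluster(artifact_ids) -> str:
    """Return cluster text: sorted M cards followed by a Z card."""
    ids = sorted(set(artifact_ids))  # sorted, with duplicates removed
    if not ids:
        raise ValueError("a cluster needs at least one M card")
    body = "".join(f"M {artifact_id}\n" for artifact_id in ids)
    checksum = hashlib.md5(body.encode("ascii")).hexdigest()
    return f"{body}Z {checksum}\n"
```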
An example cluster from Fossil can be seen here.
2.3 Control Artifacts
Control artifacts are used to assign properties to other artifacts within the repository. Allowed cards in a control artifact are as follows:
D time-and-date-stamp
T (+|-|*)tag-name artifact-id ?value?
U user-name
Z checksum
A control artifact must have one D card, one U card, one Z card and one or more T cards. No other cards or other text is allowed in a control artifact. Control artifacts might be PGP clearsigned.
The D card and the Z card of a control artifact are the same as in a manifest.
The T card represents a tag or property that is applied to some other artifact. The T card has two or three arguments. The first argument is the tag name. The first character of the tag name is either "+", "-", or "*". The "+" means the tag should be added to the artifact. The "-" means the tag should be removed. The "*" character means the tag should be added to the artifact and all direct descendants (but not descendants through a merge) down to but not including the first descendant that contains a more recent "-", "*", or "+" tag with the same name. The second argument is the lowercase artifact ID of the artifact to which the tag is to be applied. The optional third argument is the value of the tag. A tag without a value is a Boolean.
When two or more tags with the same name are applied to the same artifact, the tag with the latest (most recent) date is used.
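The "latest date wins" rule can be sketched in Python. The function `effective_tags` and its tuple layout are hypothetical, the sketch looks at tags applied directly to a single artifact, and it deliberately ignores propagation of "*" tags to descendants.

```python
def effective_tags(tag_events) -> dict:
    """Resolve the tags on ONE artifact.

    tag_events is an iterable of (timestamp, prefix, name, value) tuples,
    where timestamp is a D-card string, prefix is "+", "-", or "*", and
    value is None for a Boolean tag.
    """
    latest = {}
    for stamp, prefix, name, value in tag_events:
        # ISO8601 strings in the D-card format compare chronologically.
        if name not in latest or stamp > latest[name][0]:
            latest[name] = (stamp, prefix, value)
    # A name whose most recent entry is a "-" card is not set at all.
    return {name: value for name, (_, prefix, value) in latest.items() if prefix != "-"}
```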
Some tags have special meaning. The "comment" tag when applied to a check-in will override the check-in comment of that check-in for display purposes. The "user" tag overrides the name of the check-in user. The "date" tag overrides the check-in date. The "branch" tag sets the name of the branch that the check-in belongs to. Symbolic tags begin with the "sym-" prefix.
The U card is the name of the user that created the control artifact. The Z card is the usual required artifact checksum.
An example control artifact can be seen here.
2.4 Wiki Pages
A wiki artifact defines a single version of a single wiki page. Wiki artifacts accept the following card types:
C change-comment
D time-and-date-stamp
L wiki-title
N mimetype
P parent-artifact-id+
U user-name
W size \n text \n
Z checksum
The D card is the date and time when the wiki page was edited. The P card specifies the parent wiki pages, if any. The L card gives the name of the wiki page. The optional N card specifies the mimetype of the wiki text. If the N card is omitted, the mimetype is assumed to be text/x-fossil-wiki. The U card specifies the login of the user who made this edit to the wiki page. The Z card is the usual checksum over the entire artifact and is required.
The W card is used to specify the text of the wiki page. The argument to the W card is an integer which is the number of bytes of text in the wiki page. That text follows the newline character that terminates the W card. The wiki text is always followed by one extra newline.
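Because the W card carries a byte count, the wiki text can be extracted without scanning for a terminator. The Python sketch below illustrates this; `extract_w_text` is a hypothetical name, and the sketch assumes the W card is not the first card of the artifact (a D card always precedes it).

```python
def extract_w_text(artifact: bytes) -> bytes:
    """Return the text carried by the W card of a wiki-style artifact."""
    marker = b"\nW "
    start = artifact.find(marker)  # the first match is the W card itself
    if start < 0:
        raise ValueError("no W card found")
    size_start = start + len(marker)
    size_end = artifact.index(b"\n", size_start)
    size = int(artifact[size_start:size_end])
    text_start = size_end + 1
    text = artifact[text_start:text_start + size]
    # The text must be complete and followed by exactly one extra newline.
    if len(text) != size or artifact[text_start + size:text_start + size + 1] != b"\n":
        raise ValueError("W card text is truncated or not newline-terminated")
    return text
```

Reading by length means the wiki text may itself contain lines that look like cards without confusing the parser.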
The C card on a wiki page is optional. The argument is a comment that explains why the change was made. The ability to have a C card on a wiki page artifact was added on 2019-12-02 at the suggestion of user George Krivov and is not currently used or generated by the implementation. Older versions of Fossil will reject a wiki-page artifact that includes a C card.
An example wiki artifact can be seen here.
2.5 Ticket Changes
A ticket-change artifact represents a change to a trouble ticket. The following cards are allowed on a ticket change artifact:
D time-and-date-stamp
J ?+?name ?value?
K ticket-id
U user-name
Z checksum
The D card is the usual date and time stamp and represents the point in time when the change was entered. The U card is the login of the programmer who entered this change. The Z card is the required checksum over the entire artifact.
Every ticket has a distinct ticket-id: a 40-character lower-case hexadecimal number. The ticket-id is given in the K card. A ticket exists if it contains one or more changes. The first "change" to a ticket is what brings the ticket into existence.
J cards specify changes to the "value" of "fields" in the ticket. If the value parameter of the J card is omitted, then the field is set to an empty string. Each fossil server has a ticket configuration which specifies the fields it understands. The ticket configuration is part of the local state for the repository and thus can vary from one repository to another. Hence a J card might specify a field that does not exist in the local ticket configuration. If a J card specifies a field that is not in the local configuration, then that J card is simply ignored.
The first argument of the J card is the field name. The second value is the field value. If the field name begins with "+" then the value is appended to the prior value. Otherwise, the value on the J card replaces any previous value of the field. The field name and value are both encoded using the character escapes defined for the C card of a manifest.
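The J card rules can be sketched in Python as follows. The function `apply_j_cards` and its argument layout are hypothetical, and names and values are assumed to have been decoded from their escaped form already.

```python
def apply_j_cards(ticket: dict, j_cards, known_fields) -> dict:
    """Apply (name, value) pairs from J cards to a ticket's field values.

    value is None when the J card has no value argument.
    """
    result = dict(ticket)
    for name, value in j_cards:
        append = name.startswith("+")
        field = name[1:] if append else name
        if field not in known_fields:
            continue  # fields missing from the local configuration are ignored
        value = value or ""  # an omitted value means the empty string
        if append:
            result[field] = result.get(field, "") + value
        else:
            result[field] = value
    return result
```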
An example ticket-change artifact can be seen here.
2.6 Attachments
An attachment artifact associates some other artifact that is the attachment (the source artifact) with a ticket or wiki page or technical note to which the attachment is connected (the target artifact). The following cards are allowed on an attachment artifact:
A filename target ?source?
C comment
D time-and-date-stamp
N mimetype
U user-name
Z checksum
The A card specifies a filename for the attachment in its first argument. The second argument to the A card is the name of the wiki page or ticket or technical note to which the attachment is connected. The third argument is either missing or else it is the lower-case artifact ID of the attachment itself. A missing third argument means that the attachment should be deleted.
The C card is an optional comment describing what the attachment is about. There can be at most one C card.
A single D card is required to give the date and time when the attachment was applied.
There may be zero or one N cards. The N card specifies the mimetype of the comment text provided in the C card. If the N card is omitted, the C card mimetype is taken to be text/plain.
A single U card gives the name of the user who added the attachment. If an attachment is added anonymously, then the U card may be omitted.
The Z card is the usual checksum over the rest of the attachment artifact. The Z card is required.
2.7 Technical Notes
A technical note or "technote" artifact (formerly known as an "event" artifact) associates a timeline comment and a page of text (similar to a wiki page) with a point in time. Technotes can be used to record project milestones, release notes, blog entries, process checkpoints, or news articles. The following cards are allowed on a technote artifact:
C comment
D time-and-date-stamp
E technote-time technote-id
N mimetype
P parent-artifact-id+
T +tag-name * ?value?
U user-name
W size \n text \n
Z checksum
The C card contains text that is displayed on the timeline for the technote. The C card is optional, but there can only be one.
A single D card is required to give the date and time when the technote artifact was created. This is different from the time at which the technote appears on the timeline.
A single E card gives the time of the technote (the point on the timeline where the technote is displayed) and a unique identifier for the technote. When there are multiple artifacts with the same technote-id, the one with the most recent D card is the only one used. The technote-id must be a 40-character lower-case hexadecimal string.
The optional N card specifies the mimetype of the text of the technote that is contained in the W card. If the N card is omitted, then the W card text mimetype is assumed to be text/x-fossil-wiki, which is the Fossil wiki format.
The optional P card specifies a prior technote with the same technote-id from which the current technote is an edit. The P card is a hint to the system that it might be space efficient to store one technote as a delta of the other.
A technote might contain one or more T cards used to set tags or properties on the technote. The format of the T card is the same as described in the Control Artifacts section above, except that the second argument is the single character "*" instead of an artifact ID and the name is always prefaced by "+". The * in place of the artifact ID indicates that the tag or property applies to the current artifact. It is not possible to encode the current artifact ID as part of an artifact, since the act of inserting the artifact ID would change the artifact ID, hence a * is used to represent "self". The "+" on the name means that tags can only be added and they can only be non-propagating tags. In a technote, T cards are normally used to set the background display color for timelines.
The optional U card gives the name of the user who entered the technote.
A single W card provides wiki text for the document associated with the technote. The format of the W card is exactly the same as for a wiki artifact.
The Z card is the required checksum over the rest of the artifact.
2.8 Forum Posts
Forum posts are intended as a mechanism for users and developers to discuss a project. Forum posts are like messages on a mailing list.
The following cards are allowed on a forum post artifact:
D time-and-date-stamp
G thread-root
H thread-title
I in-reply-to
N mimetype
P parent-artifact-id
U user-name
W size \n text \n
Z checksum
Every forum post must have either one I card and one G card or one H card. Forum posts are organized into topic threads. The initial post for a thread (the root post) has an H card giving the title or subject for that thread. The argument to the H card is a string in the same format as a comment string in a C card. All follow-up posts have an I card that indicates which prior post in the same thread the current forum post is replying to, and a G card specifying the root post for the entire thread. The argument to G and I cards is the artifact hash for the prior forum post to which the card refers.
In theory, it is sufficient for follow-up posts to have only an I card, since the G card value could be computed by following a chain of I cards. However, the G card is required in order to associate the artifact with a forum thread in the case where an intermediate artifact in the I card chain is shunned or otherwise becomes unreadable.
A single D card is required to give the date and time when the forum post was created.
The optional N card specifies the mimetype of the text of the forum post that is contained in the W card. If the N card is omitted, then the W card text mimetype is assumed to be text/x-fossil-wiki, which is the Fossil wiki format.
The optional P card specifies a prior forum post for which this forum post is an edit. For display purposes, only the child post is shown, though the historical post is retained as a record. If P cards are used and there exist multiple versions of the same forum post, then I cards for other artifacts refer to whichever version of the post was current at the time the reply was made, but G cards refer to the initial, unedited root post for the thread. Thus, following the chain of I cards back to the root of the thread may land on a different post than the one given in the G card. However, following the chain of I cards back to the thread root, then following P cards back to the initial version of the thread root must give the same artifact as is provided by the G card, otherwise the artifact containing the G card is considered invalid and should be ignored.
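The G card consistency rule can be sketched in Python. The function `g_card_is_consistent` and the dictionary layout are hypothetical: each post is represented as a mapping from card letter to its argument, and the check is meant for follow-up posts only, since root posts carry no G card.

```python
def g_card_is_consistent(post_id: str, posts: dict) -> bool:
    """Check a follow-up post's G card against its I and P chains.

    posts maps artifact hash -> dict with optional "I", "P", and "G" keys.
    A KeyError is raised if a post in the chain is missing (for example,
    because it was shunned).
    """
    current = post_id
    seen = set()
    while True:
        if current in seen:
            return False  # a cycle can never lead to a thread root
        seen.add(current)
        post = posts[current]
        if "I" in post:
            current = post["I"]  # step toward the thread root
        elif "P" in post:
            current = post["P"]  # at a root version: step to the prior edit
        else:
            break  # reached the initial, unedited root post
    return posts[post_id].get("G") == current
```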
In general, P cards may contain multiple arguments, indicating a merge. But since forum posts cannot be merged, the P card of a forum post may only contain a single argument.
The U card gives the name of the user who entered the forum post.
A single W card provides wiki text for the forum post. The format of the W card is exactly the same as for a wiki artifact.
The Z card is the required checksum over the rest of the artifact.
3.0 Card Summary
The following table summarizes the various kinds of cards that appear on Fossil artifacts. A blank entry means that combination of card and artifact is not legal. A number or range of numbers indicates the number of times a card may (or must) appear in the corresponding artifact type. For example, a value of 1 indicates a required unique card and 1+ indicates that one or more such cards are required.
Card Format | Manifest | Cluster | Control | Wiki | Ticket | Attachment | Technote | Forum |
---|---|---|---|---|---|---|---|---|
A filename target ?source? | | | | | | 1 | | |
B baseline | 0-1 | | | | | | | |
C comment-text | 1 | | | 0-1 | | 0-1 | 0-1 | |
D date-time-stamp | 1 | | 1 | 1 | 1 | 1 | 1 | 1 |
E technote-time technote-id | | | | | | | 1 | |
F filename ?uuid? ?permissions? ?oldname? | 0+ | | | | | | | |
G thread-root | | | | | | | | 0-1 |
H thread-title | | | | | | | | 0-1[4] |
I in-reply-to | | | | | | | | 0-1[4] |
J name ?value? | | | | | 1+ | | | |
K ticket-uuid | | | | | 1 | | | |
L wiki-title | | | | 1 | | | | |
M uuid | | 1+ | | | | | | |
N mimetype | 0-1 | | | 0-1 | | 0-1 | 0-1 | 0-1 |
P uuid ... | 0-1 | | | 0-1 | | | 0-1 | 0-1[5] |
Q (+\|-)uuid ?uuid? | 0+ | | | | | | | |
R md5sum | 0-1 | | | | | | | |
T (+\|*\|-)tagname uuid ?value?[1] | 0+ | | 1+[2] | | | | 0+[3] | |
U username | 1 | | 1 | 1 | 1 | 0-1 | 0-1 | 1 |
W size \n text \n | | | | 1 | | | 1 | 1 |
Z md5sum | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Footnotes:
1) T-card names may not be made up of only hexadecimal characters, as they would be indistinguishable from a hash prefix.
2) Tags in Control Artifacts may not be self-referential. i.e. their target hash may not be *.
3) Tags in Technotes must be self-referential. i.e. their target hash must be *. Similarly, technote tags may only be non-propagating "add" tags. i.e. their name prefix must be +.
4) Forum Posts must have either one H-card or one I-card, not both.
5) Forum Post P-cards may have only a single parent hash. i.e. they may not have merge parents.
4.0 Addenda
This section contains additional information which may be useful when implementing algorithms described above.
4.1 Relaxed Card Ordering Due To An Historical Bug
All cards of a structural artifact should be in lexicographical order. The Fossil implementation verifies this and rejects any structural artifact which has out-of-order cards. Furthermore, when Fossil is generating new structural artifacts, it runs the generated artifact through the parser to confirm that all cards really are in the correct order before committing the transaction. In this way, Fossil prevents bugs in the code from accidentally inserting misformatted artifacts. The test parse of newly created artifacts is part of the self-check strategy of Fossil. It takes a few more CPU cycles to double check each artifact before inserting it. The developers consider those CPU cycles well-spent.
However, the card-order safety check was accidentally disabled due to [a bug]. And while that bug was lurking undetected in the code, [another bug] caused the N cards of Technical Notes to occur after the P card rather than before. Thus for a span of several years, Technical Note artifacts were being inserted into Fossil repositories that had their N and P cards in the wrong order.
Both bugs have now been fixed. However, to prevent historical Technical Note artifacts that were inserted by users in good faith from being rejected by newer Fossil builds, the card ordering requirement is relaxed slightly. The actual implementation is this:
"All cards must be in strict lexicographic order, except that the N and P cards of a Technical Note artifact are allowed to be interchanged."
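The relaxed rule can be illustrated with a Python sketch that looks only at the sequence of card-type letters. The function `card_types_in_order` is hypothetical; it does not compare card arguments and is not a description of how Fossil's parser is written.

```python
def card_types_in_order(card_types: str, is_technote: bool = False) -> bool:
    """Check card-type letters, e.g. "CDENPUWZ", for non-decreasing order.

    For technotes, a P card that appears before the N card is tolerated.
    """
    letters = list(card_types)
    if is_technote and "N" in letters and "P" in letters:
        n, p = letters.index("N"), letters.index("P")
        if p < n:
            # Undo the historical interchange before checking the order.
            letters[n], letters[p] = letters[p], letters[n]
    # Repeated letters (such as several T cards) are allowed here.
    return all(a <= b for a, b in zip(letters, letters[1:]))
```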
Future versions of Fossil might strengthen this slightly to only allow the out of order N and P cards for Technical Notes entered before a certain date.
4.2 R-Card Hash Calculation
Given a manifest file named MF, the following Bash shell code demonstrates how to compute the value of the R card in that manifest. This example uses manifest [28987096ac]. Lines starting with # are shell input and other lines are output. This demonstration assumes that the file versions represented by the input manifest are checked out under the current directory.
    # head MF
    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1
    C Make\sthe\s"clearsign"\sPGP\ssigning\sdefault\sto\soff.
    D 2010-02-23T15:33:14
    F BUILD.txt 4f7988767e4e48b29f7eddd0e2cdea4555b9161c
    F COPYRIGHT-GPL2.txt 06877624ea5c77efe3b7e39b0f909eda6e25a4ec
    ...
    # grep '^R ' MF
    R c0788982781981c96a0d81465fec7192
    # for i in $(awk '/^F /{print $2}' MF); do \
      echo $i $(stat -c '%s' $i); \
      cat $i; \
    done | md5sum
    c0788982781981c96a0d81465fec7192 -
Minor caveats: the above demonstration will work only when none of the filenames in the manifest are "fossilized" (encoded) because they contain spaces. In that case the shell-generated hash would differ because the stat calls will fail to find such files (which are output in encoded form here). That approach also won't work for delta manifests. Calculating the R card for delta manifests requires traversing both the delta and its baseline in lexical order of the files, preferring the delta's copy if both contain a given file.