Artifact data scheme

An artifact can be thought of as a source of knowledge. For example, if I am keeping notes on a research paper, the artifact is that paper.

At a minimum, an artifact should have a standard header with metadata. It should store some authorship information (e.g. citation information). An artifact will have snapshots, each of which captures its content either at a specific point in time or in a specific format. A website might have snapshots for different times it was scraped; a book might have snapshots for different editions of the book or for different formats (e.g. PDF and EPUB).

The header

This datatype will be common to all objects, including structures later in the knowledge graph itself. In many cases, such as a blob, the tags will be empty as they will be inherited implicitly through the parent Artifact type.

type Header struct {
    ID         string
    Type       ObjectType
    Created    int64
    Modified   int64
    Categories []string
    Tags       []string
    Meta       Metadata
}

Metadata

Metadata is a mapping of keys to values. These values might not be strings; consider the case where we'd want to track a file size. Each value therefore carries a type tag alongside its contents. Metadata is defined as

type Value struct {
    Contents string
    Type     string
}

type Metadata map[string]Value
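
As a rough sketch of how a non-string value such as a file size could round-trip through this type: the "int" type tag and the IntValue and Int helpers below are my own assumptions for illustration, since the scheme doesn't fix a vocabulary of type names yet.

```go
package main

import (
	"fmt"
	"strconv"
)

// Value and Metadata mirror the definitions above.
type Value struct {
	Contents string
	Type     string
}

type Metadata map[string]Value

// IntValue encodes an integer as a Value. The "int" tag is a hypothetical
// convention, not something the scheme defines.
func IntValue(n int64) Value {
	return Value{Contents: strconv.FormatInt(n, 10), Type: "int"}
}

// Int looks up key and decodes it as an integer. It fails if the key is
// missing or the value is tagged with a different type.
func (m Metadata) Int(key string) (int64, error) {
	v, ok := m[key]
	if !ok {
		return 0, fmt.Errorf("metadata: no entry for %q", key)
	}
	if v.Type != "int" {
		return 0, fmt.Errorf("metadata: %q has type %q, not int", key, v.Type)
	}
	return strconv.ParseInt(v.Contents, 10, 64)
}

func main() {
	meta := Metadata{
		"filesize": IntValue(1048576),
		"language": {Contents: "en", Type: "string"},
	}
	size, err := meta.Int("filesize")
	fmt.Println(size, err)
}
```

Keeping Contents as a string means the map can be stored and serialised uniformly, at the cost of parsing on every typed read.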

Blobs

With these two types defined, we can define a blob. A Blob has an ID (in place of a full header), a content type, and some data.

type Blob struct {
    ID     string
    Format string // MIME type
    Body   io.ReadCloser
}
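
Because Body is an io.ReadCloser, a blob can be backed by a file, an HTTP response, or an in-memory buffer, but it can only be read once. A minimal sketch of building and draining one from memory follows; NewBlob and ReadAll are hypothetical helpers, not part of the scheme.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// Blob mirrors the definition above.
type Blob struct {
	ID     string
	Format string // MIME type
	Body   io.ReadCloser
}

// NewBlob wraps in-memory data as a Blob (hypothetical helper).
func NewBlob(id, format string, data []byte) *Blob {
	return &Blob{
		ID:     id,
		Format: format,
		Body:   io.NopCloser(bytes.NewReader(data)),
	}
}

// ReadAll drains the body and closes it. The body is consumed afterwards,
// so a second call returns no data.
func (b *Blob) ReadAll() ([]byte, error) {
	defer b.Body.Close()
	return io.ReadAll(b.Body)
}

func main() {
	b := NewBlob("blob-1", "text/plain", []byte("hello, artifact"))
	data, err := b.ReadAll()
	fmt.Println(string(data), err)
}
```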

Citations

A citation can be thought of as the bibliographic information for the artifact. Nothing in this should be strictly required. A citation occurs at the artifact level, but it could also occur at the snapshot level. This is like having base information (such as author and publisher) that applies to all of the snapshots, while the snapshot might override attributes like the specific edition.

Publishers

A starting point is the publisher type.

type Publisher struct {
    Header  Header
    Name    string
    Address string
}

This is simple enough; the publisher really just needs a name and address, and it gets a Header whose Metadata can be used to inject any additional fields.

Citations defined

Putting some of these pieces together:

type Citation struct {
    Header    Header
    DOI       string
    Title     string
    Year      int
    Published time.Time
    Authors   []string
    Publisher *Publisher
    Source    string
    Abstract  string
}

We are strictly interested in containing the fields; the presentation layer can handle linking to the DOI, for example.
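
The override behaviour described earlier (an artifact-level citation with snapshot-level overrides) could look something like the sketch below. It assumes that a zero value means "not set", which is my own convention and means a snapshot can't deliberately blank a field. Merge is a hypothetical helper, and the types are trimmed copies without the Header.

```go
package main

import (
	"fmt"
	"time"
)

// Publisher and Citation are trimmed copies of the types above; the Header
// is omitted to keep the sketch short.
type Publisher struct {
	Name    string
	Address string
}

type Citation struct {
	DOI       string
	Title     string
	Year      int
	Published time.Time
	Authors   []string
	Publisher *Publisher
	Source    string
	Abstract  string
}

// Merge returns a copy of base with every non-zero field of override
// applied on top. The copy is shallow: Authors and Publisher are shared
// with whichever citation supplied them.
func Merge(base, override *Citation) *Citation {
	if base == nil {
		base = &Citation{}
	}
	merged := *base
	if override == nil {
		return &merged
	}
	if override.DOI != "" {
		merged.DOI = override.DOI
	}
	if override.Title != "" {
		merged.Title = override.Title
	}
	if override.Year != 0 {
		merged.Year = override.Year
	}
	if !override.Published.IsZero() {
		merged.Published = override.Published
	}
	if len(override.Authors) > 0 {
		merged.Authors = override.Authors
	}
	if override.Publisher != nil {
		merged.Publisher = override.Publisher
	}
	if override.Source != "" {
		merged.Source = override.Source
	}
	if override.Abstract != "" {
		merged.Abstract = override.Abstract
	}
	return &merged
}

func main() {
	base := &Citation{Title: "An Example Book", Year: 2010, Authors: []string{"A. Author"}}
	secondEdition := &Citation{Year: 2014}
	c := Merge(base, secondEdition)
	fmt.Println(c.Title, c.Year, c.Authors)
}
```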

Snapshots

So we have the basic pieces in place now to define a snapshot:

type Snapshot struct {
    Header     Header
    ArtifactID string
    Stored     time.Time
    DateTime   time.Time
    Citation   *Citation
    Blobs      map[MIME]*Blob
}

It needs to know the ID of the artifact that it belongs to. We track the time it was stored, which could be a unix timestamp, but for consistency with the other fields we'll keep it as a standard time. DateTime is the time used for the snapshot; it can be built from the year in the citation if needed, or it could be more refined.

One design choice here that could be questioned is the use of the MIME type as the key for each blob. The example I can think of here is the [[no-bs-guide-to-math-and-physics]], which has a pair of PDFs: one for reading on a tablet, and one for printing. With a bare MIME type as the key, both would collide on "application/pdf". I think that could be solved by using a [[MIME types|media type]] parameter like "application/pdf; format=screen".
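
Go's standard mime package can already build and parse keys of that shape, so the parameter idea wouldn't need custom string handling. A small sketch follows; BlobKey and SplitBlobKey are hypothetical helpers, and the "format" parameter name is only the one suggested above, not a registered parameter.

```go
package main

import (
	"fmt"
	"mime"
)

// BlobKey builds a map key from a media type and a format variant, e.g.
// "screen" or "print" (hypothetical helper).
func BlobKey(mediaType, format string) string {
	return mime.FormatMediaType(mediaType, map[string]string{"format": format})
}

// SplitBlobKey recovers the bare media type and the format variant from a
// key. The format is empty if the key carries no such parameter.
func SplitBlobKey(key string) (mediaType, format string, err error) {
	mediaType, params, err := mime.ParseMediaType(key)
	if err != nil {
		return "", "", err
	}
	return mediaType, params["format"], nil
}

func main() {
	screen := BlobKey("application/pdf", "screen")
	print := BlobKey("application/pdf", "print")
	fmt.Println(screen)
	fmt.Println(print)

	mediaType, format, err := SplitBlobKey(screen)
	fmt.Println(mediaType, format, err)
}
```

The presentation layer can still group blobs by the bare media type by splitting the key first.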

The artifact type

Combining these together, we have the artifact type itself.

type Artifact struct {
    ID      string
    Type    ArtifactType
    Latest  time.Time // latest snapshot
    History map[time.Time]*Snapshot
}

The Type is an enumeration that can be added to; a few known types to start with are

  • Unknown
  • Custom
  • Article
  • Book
  • URL
  • Paper
  • Video
  • Image

If the type is "Custom", the Header should have a metadata entry under the key "ArtifactType" that names the custom type.
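
A sketch of how the enumeration and the "Custom" lookup might fit together. The constant order follows the list of known types, so the zero value is Unknown; the TypeName helper and the trimmed Metadata types are assumptions for illustration.

```go
package main

import "fmt"

// ArtifactType enumerates the known kinds of artifact. New kinds should be
// appended so existing values keep their numbers.
type ArtifactType int

const (
	Unknown ArtifactType = iota
	Custom
	Article
	Book
	URL
	Paper
	Video
	Image
)

var typeNames = [...]string{
	"Unknown", "Custom", "Article", "Book", "URL", "Paper", "Video", "Image",
}

// String returns the name of the type, or a numeric form for values
// outside the known range.
func (t ArtifactType) String() string {
	if t < 0 || int(t) >= len(typeNames) {
		return fmt.Sprintf("ArtifactType(%d)", int(t))
	}
	return typeNames[t]
}

// Value and Metadata mirror the metadata definitions.
type Value struct {
	Contents string
	Type     string
}

type Metadata map[string]Value

// TypeName resolves a display name. For Custom it consults the
// "ArtifactType" metadata entry and falls back to "Custom" if absent.
func TypeName(t ArtifactType, meta Metadata) string {
	if t == Custom {
		if v, ok := meta["ArtifactType"]; ok && v.Contents != "" {
			return v.Contents
		}
	}
	return t.String()
}

func main() {
	meta := Metadata{"ArtifactType": {Contents: "Podcast", Type: "string"}}
	fmt.Println(TypeName(Book, nil))
	fmt.Println(TypeName(Custom, meta))
}
```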

The Latest should refer to the most recent Snapshot.DateTime in its collection of snapshots.

Timestamps

All timestamps should be suitable for referencing dates prior to epoch 0; they should be encoded in UTC and locally converted. For example, if the client is uploading a new artifact, it should convert its local time to UTC, then send this to the server. We can enforce this in Go using the Local timezone, but it's not foolproof.

Next steps

  • Define protobufs.
  • Define a SQL schema.