> [!tldr] **Data Package** is an open source set of standards and associated software developed by a not-for-profit organization called "Open Knowledge".
> Data Package is a **standard** consisting of a set of **simple yet extensible specifications** to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates **[[Findability, Accessibility, Interoperability, and Reusability]] (FAIR)** of data.
To some extent, it is being rebranded under the name "Frictionless".
# Software
- **Open Data Editor** - an [[Spreadsheet|Excel]]-ish desktop editor
- **Data Curator** - GUI for describing, validating, and sharing data
- **[[Flatterer]]** - an opinionated converter of data between various formats including [[CSV]], [[JSON]], [[SQLite]], [[Postgres]], [[Parquet]], and [[Excel]]
- Several **validators** for various languages exist - often under the branding _frictionless-_(language shortname)
	- `frictionless-py`, `frictionless-js`, `frictionless-r`, plus libraries for Julia, Ruby, Java, PHP, Swift, and Go
### Play Around
I gave the [[CLI]] tool a go - it was super easy and fast.
```shell
# inside a folder full of TSV files
frictionless describe --yaml *.tsv > datapackage.yaml
```
...and I had a new [[YAML]] file describing all the [[TSV]]s in the folder pretty well.
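
To get a feel for what "describing" a file involves, here is a toy sketch of type inference in plain Python. It is **not** how `frictionless describe` works internally; the function names (`infer_type`, `describe_tsv`) and the three-type ladder are my own assumptions for illustration.

```python
import csv
import io


def infer_type(values):
    """Guess "integer", "number", or "string" for a column of raw strings."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"  # nothing to go on, so fall back to the loosest type
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            for value in non_empty:
                cast(value)
            return type_name  # every value parsed with this cast
        except ValueError:
            continue  # try the next, looser type
    return "string"


def describe_tsv(text):
    """Build a schema-like dict from TSV text whose first row is the header."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    columns = zip(*body) if body else [() for _ in header]
    return {
        "fields": [
            {"name": name, "type": infer_type(column)}
            for name, column in zip(header, columns)
        ]
    }


# Hypothetical sample data, not from a real file.
sample = "id\tcity\tarea\n1\tOslo\t454.0\n2\tBergen\t465.3\n"
schema = describe_tsv(sample)
print(schema)
```
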
> [!tip] The standard doesn't matter - it's what you can **do** with the tools that *use* it.
# Specs
## Data Package
A description of a collection of data in a single package. Includes [[Data Package (standard)#Data Resource]]s, and optionally:
- `name`, `id`, `licenses`, `title`, `description`, `homepage`, `image`, `version`, `created`, `keywords`, `contributors` (with given properties), `sources` (with given properties)
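
A minimal sketch of what such a descriptor might look like, built as a plain Python dict and dumped to JSON. All the values (package name, license, contributor, file name) are made up for illustration; only the property names come from the list above.

```python
import json

# Hypothetical package descriptor; every value here is invented.
package = {
    "name": "example-cities",
    "title": "Example Cities",
    "version": "0.1.0",
    "keywords": ["example", "cities"],
    "licenses": [{"name": "CC0-1.0"}],
    "contributors": [{"title": "Jane Doe"}],
    # The one part that is not optional: the data resources themselves.
    "resources": [
        {"name": "cities", "path": "cities.tsv"},
    ],
}

descriptor_text = json.dumps(package, indent=2)
print(descriptor_text)
```

Saved as `datapackage.json` next to the data files, this is the JSON counterpart of the YAML file generated by the CLI earlier.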
## Data Resource
A description of a single data source, e.g. a file or table. Includes `name` and `path` (both required), plus optional properties such as `title`, `description`, `format`, `mediatype` (MIME type), `encoding` (e.g. utf-8), `bytes` (file size), `hash`, `sources`, and `licenses`. For tabular data, it can also carry a [[Data Package (standard)#Table Dialect]] and a [[Data Package (standard)#Table Schema]].
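
As a rough sketch, the `bytes` and `hash` properties can be computed with the standard library. The helper name `describe_resource` is hypothetical, the `sha256:` prefix reflects my understanding of how the spec marks non-default hash algorithms, and the encoding is simply assumed rather than detected.

```python
import hashlib
import os
import tempfile


def describe_resource(name, path):
    """Build a resource-like dict for a file (a sketch, not a full implementation)."""
    with open(path, "rb") as handle:
        content = handle.read()
    return {
        "name": name,
        # Assumes the descriptor will live in the same folder as the file.
        "path": os.path.basename(path),
        "format": os.path.splitext(path)[1].lstrip("."),
        "encoding": "utf-8",  # assumed, not detected
        "bytes": len(content),
        "hash": "sha256:" + hashlib.sha256(content).hexdigest(),
    }


# Demo against a throwaway file with invented contents.
with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "cities.tsv")
    with open(file_path, "wb") as handle:
        handle.write(b"id\tcity\n1\tOslo\n")
    resource = describe_resource("cities", file_path)

print(resource)
```
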
## Table Dialect
> [!tip] Probably the best standard? Covers [[JSON]] and [[YAML]], too, weirdly.
A description of how the dataset should be interpreted - things like "what escape character are you using?". This is defined for lots of data formats, from [[CSV]] to [[Spreadsheet]]s to [[SQL]] tables.
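
To make that concrete, here is a sketch that maps a few dialect properties onto Python's built-in `csv` reader. The property names (`delimiter`, `quoteChar`, `doubleQuote`, `escapeChar`, `skipInitialSpace`) are the ones I believe the spec uses for delimited text; the helper `read_with_dialect` and the sample data are invented.

```python
import csv
import io

# Hypothetical dialect for a semicolon-separated file with single quotes.
dialect = {"delimiter": ";", "quoteChar": "'", "skipInitialSpace": True}


def read_with_dialect(text, dialect):
    """Parse delimited text using a dialect dict, falling back to csv defaults."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=dialect.get("delimiter", ","),
        quotechar=dialect.get("quoteChar", '"'),
        doublequote=dialect.get("doubleQuote", True),
        escapechar=dialect.get("escapeChar"),
        skipinitialspace=dialect.get("skipInitialSpace", False),
    )
    return list(reader)


rows = read_with_dialect("id; name\n1; 'Oslo; Norway'\n", dialect)
print(rows)
```

The quoted field keeps its embedded semicolon because the dialect tells the parser which quote character to respect.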
## Table Schema
The thing that made me learn about Data Package - table schema allows you to define and describe how different types of tables should look. You can insert validation rules for [[CSV]]s and other helpful metadata. You have overall schema descriptors, then descriptors for each of the fields contained therein. You can even handle [[Primary Key]]s and [[Foreign Key]]s.
```json
{
  "fields": [
    {
      "name": "name of field (e.g. column name)",
      "title": "A nicer human readable label or title for the field",
      "type": "A string specifying the type",
      "format": "A string specifying a format",
      "example": "An example value for the field",
      "description": "A description for the field"
      ...
    },
    ...
  ],
  "missingValues": [ ... ],
  "primaryKey": [ ... ],
  "foreignKeys": [ ... ]
}
```
This _can_ hook into [[Resource Description Framework|RDF]] at a field level. It can also hook into [[JSON Schema]] for Object/JSON type values. Sweet.
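
A small sketch of what the key declarations buy you: a concrete schema with a `primaryKey` and a `foreignKeys` entry, plus two toy checks against in-memory rows. The table and field names are invented, the foreign key shape (`fields` plus a `reference` with `resource` and `fields`) is how I understand the spec, and the helper functions are hypothetical rather than part of any Frictionless library.

```python
from collections import Counter

# Hypothetical schema for a "cities" table that points at a "countries" table.
schema = {
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "city", "type": "string"},
        {"name": "country_id", "type": "integer"},
    ],
    "missingValues": ["", "NA"],
    "primaryKey": ["id"],
    "foreignKeys": [
        {
            "fields": ["country_id"],
            "reference": {"resource": "countries", "fields": ["id"]},
        }
    ],
}


def duplicate_primary_keys(rows, schema):
    """Return primary key tuples that occur more than once."""
    key_fields = schema["primaryKey"]
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, count in counts.items() if count > 1]


def dangling_foreign_keys(rows, foreign_key, referenced_rows):
    """Return rows whose foreign key has no match in the referenced table."""
    local = foreign_key["fields"]
    remote = foreign_key["reference"]["fields"]
    known = {tuple(r[f] for f in remote) for r in referenced_rows}
    return [row for row in rows if tuple(row[f] for f in local) not in known]


cities = [
    {"id": 1, "city": "Oslo", "country_id": 47},
    {"id": 2, "city": "Bergen", "country_id": 47},
    {"id": 2, "city": "Lyon", "country_id": 33},  # duplicate id, unknown country
]
countries = [{"id": 47, "name": "Norway"}]

duplicates = duplicate_primary_keys(cities, schema)
dangling = dangling_foreign_keys(cities, schema["foreignKeys"][0], countries)
print(duplicates)
print(dangling)
```
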
> [!note] Collocation with Data
> You could, in theory, include a table schema, represented as [[YAML]][^1], as [[Frontmatter]] - so argues ME and the dude who wrote about [[Why does CSV + Frontmatter Not Exist?|CSVY]].
You can specify all sorts of [[Data Types]] and constraints, such as [[Enumeration]]s.
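
Here is a rough sketch of how a field's `type` and `constraints` could be applied to a raw cell value. It only handles three types and four constraints (`required`, `enum`, `minimum`, `maximum`); the function `check_value`, its error labels, and the example fields are assumptions of mine, not the behaviour of a real validator.

```python
# Only a handful of types are covered in this sketch.
CASTS = {"integer": int, "number": float, "string": str}


def check_value(raw, field, missing_values=("",)):
    """Return a list of violated rule names for one raw cell value."""
    constraints = field.get("constraints", {})
    if raw in missing_values:
        return ["required"] if constraints.get("required") else []
    try:
        value = CASTS[field["type"]](raw)
    except ValueError:
        return ["type"]  # the raw string cannot be cast to the declared type
    errors = []
    if "enum" in constraints and value not in constraints["enum"]:
        errors.append("enum")
    if "minimum" in constraints and value < constraints["minimum"]:
        errors.append("minimum")
    if "maximum" in constraints and value > constraints["maximum"]:
        errors.append("maximum")
    return errors


# Hypothetical field descriptors.
status = {
    "name": "status",
    "type": "string",
    "constraints": {"required": True, "enum": ["open", "closed"]},
}
rating = {
    "name": "rating",
    "type": "integer",
    "constraints": {"minimum": 1, "maximum": 5},
}

print(check_value("open", status))
print(check_value("pending", status))
print(check_value("", status))
print(check_value("7", rating))
print(check_value("x", rating))
```
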
****
# More
## Source
- https://datapackage.org/
[^1]: and since JSON is valid YAML, you can actually just use JSON.