Data is the new gold, but outside of the kinds of machine learning systems used to train [[Large Language Models]], data needs a few things to be _useful_.
This list is my own. More authoritative sources probably offer a better set of principles.
Good data is:
- **Discoverable & accessible** - you can't use what you don't know exists or can't find
- **Interpretable & documented** - you can't use what you can't comprehend
- **Trusted & secure** - you can't use what's been tampered with, lost, or destroyed
- **Traceable to process** - data produced by a process should carry some mark (metadata) identifying the particulars of the process that generated it
    - Ideally this points to a specific, serialized activation of the process, but at a minimum to a [[Process Specification]] of some kind.
- **Consistent & standardized** - the fewer anomalies and differences you have within a given dataset or between similar types of datasets, the easier they are to work with
    - Example: dates should use [[ISO 8601]], unless the database technology has built-in [[Data Types|Date Types]].
- **Complete** - arguably a facet of being "Trusted", but worth stating on its own: you shouldn't have missing data without knowing it
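
A few of these principles can be sketched in code. What follows is a minimal Python sketch under stated assumptions, not a standard or a recommended schema: the names `Provenance`, `process_spec`, `run_id`, `to_iso_date`, and `missing_fields`, along with the sample records, are hypothetical placeholders invented for illustration. It touches three items from the list: process metadata attached to a dataset, dates normalized to ISO 8601, and an explicit check for missing values.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Provenance:
    """Hypothetical metadata linking a dataset to the process that produced it."""
    process_spec: str               # minimum: which process specification applied
    run_id: Optional[str] = None    # ideal: the specific run of that process
    generated_at: str = field(
        # ISO 8601 timestamp in UTC, recorded when the metadata is created
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_iso_date(text: str, fmt: str) -> str:
    """Convert a date string in a known format to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(text, fmt).date().isoformat()


def missing_fields(records, required):
    """Return {row index: [missing field names]} for incomplete rows."""
    gaps = {}
    for i, row in enumerate(records):
        # Treat an absent key, None, and the empty string as "missing"
        absent = [name for name in required if row.get(name) in (None, "")]
        if absent:
            gaps[i] = absent
    return gaps


# Two illustrative records whose dates arrived in different formats
records = [
    {"sample": "A-1", "measured_on": to_iso_date("03/14/2024", "%m/%d/%Y"), "value": 1.2},
    {"sample": "A-2", "measured_on": to_iso_date("14.03.2024", "%d.%m.%Y"), "value": None},
]

dataset = {
    "provenance": Provenance(process_spec="assay-protocol-v2", run_id="run-0042"),
    "records": records,
}

gaps = missing_fields(records, ["sample", "measured_on", "value"])
print(records[0]["measured_on"])  # 2024-03-14
print(gaps)                       # {1: ['value']}
```

The point of `missing_fields` is not to forbid gaps but to make them known: a dataset can be incomplete and still usable, as long as nobody is surprised by what is missing.
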
****
# More
## Source
## Related