Data is the new gold... but data (outside of the kinds of machine learning systems used to train [[Large Language Models]]) needs a few things to be _useful_.

This list is from my own mind. There are probably more authoritative sources that would suggest a better set of principles.

Good data is:

- **Discoverable & accessible** - you can't use what you don't know exists or can't find
- **Interpretable & documented** - you can't use what you can't comprehend
- **Trusted & secure** - you can't use what's been tampered with, lost, or destroyed
- **Traceable to process** - data stemming from a process should bear some mark (metadata) of the particulars of the process that generated it
    - Ideally a specific, serialized activation of the process, but at a minimum a [[Process Specification]] of some kind.
- **Consistent & standardized** - the fewer anomalies and differences you have within a given dataset or between similar types of datasets, the easier they are to work with
    - Example: dates should use [[ISO 8601]], unless the database technology has built-in [[Data Types|Date Types]].
- **Complete** - a facet of being "Trusted": you shouldn't unknowingly have missing data

****

# More

## Source

## Related