There's no "one, holy way" to look at this field. This is one way to segment it, fuzzy borders and all. My thoughts on [[Religion]] actually feel relevant here. Also, this field is **constantly changing**.
# Data Tool Layers
## Summary
1. Data Storage
2. Ingestion & Transformation
3. [[Exploratory Data Analysis]]
4. Modeling, [[Machine Learning]], and [[Statistics (index)|statistics]]
5. Visualization & BI tools
## Tooling: Split on Code / No-Code
In most layers, the tools split into **code** and **no-code** (or *low-code*) options. When it comes to coding, the predominant language is [[Python]].
1. Data Storage - (split is a bit different here)
    1. **Machine-first**: flat files like [[CSV]], [[JSON]], [[Parquet]]
    2. **Human-first**: Spreadsheet files
    3. **Either**: Databases ([[Postgres]], [[SQLite]], [[MySQL]], [[DuckDB]], [[MongoDB]], etc.) & [[Data Warehouse]]s (BigQuery, Snowflake)
        1. **Code-based**: [[SQL]], ORMs, scripts
        2. **GUI-based**: pgAdmin, DBeaver, etc.
2. Ingestion & Transformation
    1. **Code**: [[Python]], [[SQL]], dbt
    2. **No-code**: Airbyte, Fivetran, Stitch
3. [[Exploratory Data Analysis]]
    1. **Code**: [[Jupyter Notebook]], [[Pandas vs Polars|pandas and Polars]]
    2. **No-code**: [[Spreadsheet]]s
4. Modeling / ML / Stats
    1. **Code**: `scikit-learn`, `statsmodels`, `PyTorch`
    2. **No-code**: (not many, not common)
5. Visualization
    1. **Code**: `matplotlib`, `seaborn`, [[Vega-Lite]]
    2. **No-code**: Tableau, Power BI, Looker
## Data Storage & Sourcing
- Flat Files
    - [[CSV]]
    - [[Parquet]]
    - [[JSON]]
- [[Relational Databases]]
    - [[Postgres]]
    - [[SQLite]]
    - [[MySQL]]
- Analytical Databases
    - [[DuckDB]]
    - ClickHouse
    - ...
- [[Data Warehouse]]s
    - BigQuery
    - Snowflake
    - Redshift
- Exotic
    - [[Graph Database]]
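
To make the flat-file vs. database distinction concrete, here is a minimal sketch that parses CSV text and loads it into an in-memory [[SQLite]] table using only the Python standard library. The sample data, the `weather` table, and the `load_csv_into_sqlite` helper are hypothetical names made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a CSV file on disk.
CSV_TEXT = """city,temp_c
Oslo,4.5
Lima,19.0
Cairo,28.5
"""


def load_csv_into_sqlite(csv_text: str) -> sqlite3.Connection:
    """Parse CSV text and load it into an in-memory SQLite table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # CSV values arrive as strings, so convert the numeric column explicitly.
    rows = [(r["city"], float(r["temp_c"])) for r in reader]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weather (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO weather (city, temp_c) VALUES (?, ?)", rows)
    conn.commit()
    return conn


conn = load_csv_into_sqlite(CSV_TEXT)
# Once the data is in a database, it can be queried with SQL.
warm = conn.execute(
    "SELECT city FROM weather WHERE temp_c > 10 ORDER BY city"
).fetchall()
print(warm)
```

The same pattern should carry over to a file on disk by swapping `io.StringIO(csv_text)` for an opened file handle, and to a persistent database by passing a file path instead of `":memory:"`.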
## Data Ingestion & Transformation
Includes cleaning & munging.
- ETL Frameworks
    - Airbyte
    - Fivetran
    - Stitch
- Workflow schedulers / orchestration --- use [[DAG|Directed Acyclic Graph]]s
    - Airflow (older, but widely used)
    - Prefect
    - Dagster (newer)
- Streaming
    - [[Kafka]]
    - Kinesis
- APIs
    - [[Python]] → Requests
    - [[JavaScript]] → Fetch
- Programmatic transforms
- Declarative transforms
- ...
- [[Python]] → Pandas, Polars
- [[SQL]] → CTEs & views
- Transformation Frameworks → dbt
- [[Spreadsheet]]s → Excel, etc
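
The orchestration tools above model a pipeline as a [[DAG|Directed Acyclic Graph]]: each task lists the tasks it depends on, and the scheduler runs them in an order that respects those dependencies. A rough sketch of that idea with the standard-library `graphlib` module (Python 3.9+) follows. The task names and the `run_order` helper are hypothetical, and this is not how Airflow, Prefect, or Dagster are actually implemented.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "load_users": set(),
    "load_orders": set(),
    "join": {"load_users", "load_orders"},
    "aggregate": {"join"},
    "report": {"aggregate"},
}


def run_order(dag: dict) -> list:
    """Return one valid execution order for the tasks in the DAG."""
    # static_order() raises CycleError if the graph contains a cycle.
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(order)

# A graph with a cycle is not a DAG, so no valid order exists.
try:
    run_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("cycle detected")
```

The two `load_*` tasks have no dependency on each other, so either may come first. A real scheduler could run them in parallel.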
## [[Exploratory Data Analysis]]
- Tabular inspection
- Visual exploration
- Summary statistics
- Notebooks → [[Jupyter Notebook]]s
- [[Python]] → [[Pandas]], [[Seaborn]], [[Matplotlib]]
- R → Tidyverse, ggplot2
- [[Spreadsheet]]s → [[PivotTables]]
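
As a small illustration of the "summary statistics" step, here is a sketch using only Python's standard-library `statistics` module. The `summarize` helper and the sample values are hypothetical. In a notebook, pandas' `DataFrame.describe()` would be the more usual way to get a similar table.

```python
import statistics


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "max": ordered[-1],
    }


# Hypothetical sample column.
temps = [4.5, 19.0, 28.5, 12.0]
summary = summarize(temps)
for name, value in summary.items():
    print(f"{name}: {value}")
```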
## Stats & Modeling
- Classical statistics
- Machine learning
- Probabilistic modeling
- [[Python]] → SciPy, statsmodels
- [[Machine Learning]] → scikit-learn, XGBoost
- Deep Learning → PyTorch, TensorFlow
- Probabilistic modeling → PyMC, Stan
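
For a feel of what the simplest "classical statistics" model does, here is a sketch of an ordinary least-squares line fit written in plain Python. The `fit_line` helper and the data are hypothetical. In practice a library such as statsmodels or scikit-learn would handle this, along with diagnostics this sketch leaves out.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data generated from y = 2x + 1 with no noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```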
## Visualization & Communication
- Static charts
- Interactive charts
- Dashboards
- Reports
- [[JavaScript]] → [[D3]], [[Vega-Lite]]
- [[Python]] → [[Matplotlib]], [[Seaborn]], Plotly
- [[BI Tools]] → Tableau, Power BI, Google's Looker
- APIs → FastAPI
- Apps → Streamlit, Dash
- Reports → Quarto, Jupyter Book
## Meta Tools & Concepts
Development & Reproducibility. Communication tools.
- Environment Management
- [[Python]] → `pip`, `venv`, `requirements.txt`
- [[JavaScript]] → [[Coding Package Manager|npm]], `package.json`
- [[Version Control]] → [[Git]] & GitHub
****
# More
## Source
- Grad school
- conversation with ChatGPT