Data Tools Overview - Gillespedia

There's no "one, holy way" to look at this field. This is one way to segment it along its fuzzy borders. My thoughts on [[Religion]] actually feel relevant. Oh, also: this field is **constantly changing**.

# Data Tool Layers

## Summary

1. Data Storage
2. Ingestion & Transformation
3. [[Exploratory Data Analysis]]
4. Modeling, [[Machine Learning]], and [[Statistics (index)|statistics]]
5. Visualization & BI tools

## Tooling: Split on Code / No-Code

Most layers have tools split into **code** and **no-code** (or *low-code*) tools. When it comes to coding, the predominant language is [[Python]].

1. Data Storage - (split is a bit different here)
    1. **Machine-first**: flat files like [[CSV]], [[JSON]], [[Parquet]]
    2. **Human-first**: Spreadsheet files
    3. **Either**: Databases ([[Postgres]], [[SQLite]], [[MySQL]], [[DuckDB]], [[MongoDB]], etc) & [[Data Warehouse]]s (BigQuery, Snowflake)
        1. **Code-based**: [[SQL]], ORMs, Scripts
        2. **GUI-based**: pgAdmin, DBeaver, etc
2. Ingestion & Transformation
    1. **Code**: [[Python]], [[SQL]], dbt
    2. **No-code**: Airbyte, Fivetran, Stitch
3. [[Exploratory Data Analysis]]
    1. **Code**: [[Jupyter Notebook]], [[Pandas vs Polars|pandas and Polars]]
    2. **No-code**: [[Spreadsheet]]s
4. Modeling / ML / Stats
    1. **Code**: `scikit-learn`, `statsmodels`, `PyTorch`
    2. **No-code**: (not many, not common)
5. Visualization
    1. **Code**: `matplotlib`, `seaborn`, [[Vega-Lite]]
    2. **No-code**: Tableau, Power BI, Looker

## Data Storage & Sourcing

- Flat Files
    - [[CSV]]
    - [[Parquet]]
    - [[JSON]]
- [[Relational Databases]]
    - [[Postgres]]
    - [[SQLite]]
    - [[MySQL]]
- Analytical Databases
    - [[DuckDB]]
    - ClickHouse
    - ...
- [[Data Warehouse]]s
    - BigQuery
    - Snowflake
    - Redshift
- Exotic
    - [[Graph Database]]

## Data Ingestion & Transformation

Includes cleaning & munging.
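A rough sketch of what "cleaning & munging" can look like in practice. This one uses only the Python standard library so it runs anywhere; pandas or Polars would be the more usual choice for real work. The sample data, the column names, and the `clean_rows` helper are all made up for illustration.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, inconsistent casing,
# a missing amount, and one exact duplicate row.
RAW_CSV = """name, city ,amount
 Ada ,london,10.5
bob,PARIS,
 Ada ,london,10.5
cy,Berlin,7
"""


def clean_rows(raw_text):
    """Return cleaned, de-duplicated rows as a list of dicts."""
    reader = csv.DictReader(io.StringIO(raw_text))
    seen = set()
    cleaned = []
    for row in reader:
        # Trim whitespace from both header names and values.
        record = {key.strip(): (value or "").strip() for key, value in row.items()}
        # Normalize casing so "london" and "LONDON" compare equal.
        record["name"] = record["name"].title()
        record["city"] = record["city"].title()
        # Keep a missing amount as None instead of guessing a value.
        record["amount"] = float(record["amount"]) if record["amount"] else None
        fingerprint = tuple(record.items())
        if fingerprint in seen:
            continue  # drop exact duplicates
        seen.add(fingerprint)
        cleaned.append(record)
    return cleaned


rows = clean_rows(RAW_CSV)
for row in rows:
    print(row)
```

The same steps (trim, normalize, handle missing values, de-duplicate) map onto most of the code and no-code tools in this layer; only the syntax changes.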
- ETL Frameworks
    - Airbyte
    - Fivetran
    - Stitch
- Workflow schedulers / orchestration (use [[DAG|Directed Acyclic Graph]]s)
    - Airflow (older, but widely used)
    - Prefect
    - Dagster (newer)
- Streaming
    - [[Kafka]]
    - Kinesis
- APIs
    - [[Python]] → Requests
    - [[JavaScript]] → Fetch
- Programmatic transforms
- Declarative transforms
- ...
- [[Python]] → Pandas, Polars
- [[SQL]] → CTEs & views
- Transformation Frameworks → dbt
- [[Spreadsheet]]s → Excel, etc

## [[Exploratory Data Analysis]]

- Tabular inspection
- Visual exploration
- Summary statistics
- Notebooks → [[Jupyter Notebook]]s
- [[Python]] → [[Pandas]], [[Seaborn]], [[Matplotlib]]
- R → Tidyverse, ggplot2
- [[Spreadsheet]]s → [[PivotTables]]

## Stats & Modeling

- Classical statistics
- Machine learning
- Probabilistic modeling
- [[Python]] → scipy, statsmodels
- [[Machine Learning]] → scikit-learn, XGBoost
- Deep Learning → PyTorch, TensorFlow
- Probabilistic modeling → PyMC, Stan

## Visualization & Communication

- Static charts
- Interactive charts
- Dashboards
- Reports
- [[JavaScript]] → [[D3]], [[Vega-Lite]]
- [[Python]] → [[Matplotlib]], [[Seaborn]], Plotly
- [[BI Tools]] → Tableau, Power BI, Google's Looker
- APIs → FastAPI
- Apps → Streamlit, Dash
- Reports → Quarto, Jupyter Book

## Meta Tools & Concepts

Development & Reproducibility. Communication tools.

- Environment Management
    - [[Python]] → pip, `venv`, `requirements.txt`
    - [[JavaScript]] → [[Coding Package Manager|npm]], `package.json`
- [[Version Control]] → [[Git]] & GitHub

****

# More

## Source

- Grad school
- conversation with ChatGPT