Poor Man's Lakehouse¶
A composable, local Open Source Lakehouse built with modern data engineering tools. Mix and match catalog backends, compute engines, and table formats to build your ideal data platform — all running locally on Docker.
Why?¶
Setting up a modern lakehouse stack typically requires cloud infrastructure, vendor lock-in, and complex orchestration. This project gives you the same architecture on your laptop:
- Experiment freely with Iceberg, Delta Lake, and different catalog backends
- Compare engines side-by-side: PySpark, Polars, DuckDB through a unified interface
- Learn lakehouse patterns without cloud costs or complexity
- Prototype locally before deploying to production
Architecture¶
+---------------------------------------------------------------+
| Your Application |
+---------------------------------------------------------------+
| LakehouseConnection | Spark Builders |
| (Polars, DuckDB, Arrow, | (PySpark + Iceberg/Delta) |
| Ibis, catalog browsing) | |
+---------------------------------------------------------------+
| get_catalog() — PyIceberg |
| Unified catalog factory for all backends |
+---------------------------------------------------------------+
| Catalog Layer |
| Nessie | Lakekeeper | PostgreSQL | AWS Glue |
+---------------------------------------------------------------+
| Storage Layer |
| MinIO (S3-compatible) | AWS S3 (Glue only) |
+---------------------------------------------------------------+
Components at a Glance¶
| Component | Purpose | Catalog Support |
|---|---|---|
get_catalog() |
PyIceberg catalog factory — single source of truth for catalog config | All (Nessie, Lakekeeper, PostgreSQL, Glue) |
LakehouseConnection |
Unified lightweight connector: catalog browsing, Polars/Arrow scans, DuckDB engine, Ibis wrappers, SQL, writes | All (via get_catalog()) |
SparkBuilder |
Configured SparkSession per catalog type | All |
| Sail | Rust-powered Spark Connect engine (no JVM) — Delta, Iceberg, S3 | Local file-based |
Quick Start¶
# Clone, install, configure
git clone https://github.com/montanarograziano/poor-man-lakehouse.git
cd poor-man-lakehouse
just install
cp .env.example .env
# Start with Lakekeeper (recommended)
just up lakekeeper
# Open a notebook and start querying
jupyter lab notebooks/
See the Installation Guide for detailed setup instructions.