Poor Man's Lakehouse

A composable, open-source lakehouse built with modern data engineering tools. Mix and match catalog backends, compute engines, and table formats to build your ideal data platform, all running locally on Docker.

Why?

Setting up a modern lakehouse stack typically means cloud infrastructure, vendor lock-in, and complex orchestration. This project gives you the same architecture on your laptop, so you can:

  • Experiment freely with Iceberg, Delta Lake, and different catalog backends
  • Compare engines (PySpark, Polars, DuckDB) side by side through a unified interface
  • Learn lakehouse patterns without cloud costs or complexity
  • Prototype locally before deploying to production

Architecture

+---------------------------------------------------------------+
|                      Your Application                         |
+---------------------------------------------------------------+
|   LakehouseConnection          |   Spark Builders             |
|   (Polars, DuckDB, Arrow,      |   (PySpark + Iceberg/Delta)  |
|    Ibis, catalog browsing)     |                              |
+---------------------------------------------------------------+
|                   get_catalog() - PyIceberg                   |
|   Unified catalog factory for all backends                    |
+---------------------------------------------------------------+
|                      Catalog Layer                            |
|   Nessie  |  Lakekeeper  |  PostgreSQL  |  AWS Glue           |
+---------------------------------------------------------------+
|                      Storage Layer                            |
|          MinIO (S3-compatible)  |  AWS S3 (Glue only)         |
+---------------------------------------------------------------+
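
The catalog layer in the diagram can be pictured as a single function that maps a backend name to PyIceberg connection properties. The sketch below is an assumption about how such a factory might look, not the project's actual get_catalog() implementation: the names catalog_properties and open_catalog, and every URL, port, and credential, are hypothetical placeholders. The property keys themselves (type, uri, s3.endpoint, and so on) are standard PyIceberg settings.

```python
"""Sketch: map a backend name to PyIceberg catalog properties.

All endpoints, ports, and credentials below are hypothetical placeholders.
"""

# Hypothetical MinIO settings shared by the locally hosted backends.
MINIO_PROPERTIES = {
    "s3.endpoint": "http://localhost:9000",
    "s3.access-key-id": "minio",
    "s3.secret-access-key": "minio-secret",
    "s3.region": "us-east-1",
}

# One entry per backend shown in the catalog layer of the diagram.
BACKENDS = {
    "lakekeeper": {"type": "rest", "uri": "http://localhost:8181/catalog"},
    "nessie": {"type": "rest", "uri": "http://localhost:19120/iceberg"},
    "postgres": {
        "type": "sql",
        "uri": "postgresql+psycopg2://user:pass@localhost:5432/catalog",
    },
    "glue": {"type": "glue"},  # Glue stores data in AWS S3, not MinIO
}


def catalog_properties(backend: str) -> dict[str, str]:
    """Return a fresh PyIceberg property dict for the named backend."""
    try:
        props = dict(BACKENDS[backend])  # copy so callers cannot mutate BACKENDS
    except KeyError:
        raise ValueError(f"unknown catalog backend: {backend!r}") from None
    if backend != "glue":
        props.update(MINIO_PROPERTIES)
    return props


def open_catalog(backend: str):
    """Open the catalog with PyIceberg (needs pyiceberg and a running backend)."""
    from pyiceberg.catalog import load_catalog  # imported lazily on purpose

    return load_catalog(backend, **catalog_properties(backend))
```

Keeping the property lookup separate from the load_catalog call means the configuration can be inspected or tested without any service running.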

Components at a Glance

| Component | Purpose | Catalog Support |
|---|---|---|
| get_catalog() | PyIceberg catalog factory; single source of truth for catalog config | All (Nessie, Lakekeeper, PostgreSQL, Glue) |
| LakehouseConnection | Unified lightweight connector: catalog browsing, Polars/Arrow scans, DuckDB engine, Ibis wrappers, SQL, writes | All (via get_catalog()) |
| SparkBuilder | Configured SparkSession per catalog type | All |
| Sail | Rust-powered Spark Connect engine (no JVM): Delta, Iceberg, S3 | Local file-based |
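
To make the SparkBuilder component more concrete, the following sketch shows the kind of settings a Spark session typically needs to talk to an Iceberg REST catalog backed by S3-compatible storage. The configuration keys are standard Apache Iceberg Spark options, but the function names, the catalog name, and all URLs are assumptions for illustration; the project's own builder may set different or additional options.

```python
def iceberg_rest_spark_conf(
    catalog: str = "lakehouse",                  # hypothetical catalog name
    uri: str = "http://localhost:8181/catalog",  # hypothetical REST endpoint
    warehouse: str = "warehouse",                # hypothetical warehouse name
    s3_endpoint: str = "http://localhost:9000",  # hypothetical MinIO endpoint
) -> dict[str, str]:
    """Build Spark settings for an Iceberg REST catalog on S3-compatible storage."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
        # MinIO is usually addressed path-style rather than virtual-host-style.
        f"{prefix}.s3.path-style-access": "true",
    }


def build_session(conf: dict[str, str]):
    """Apply the settings to a SparkSession (needs pyspark and the Iceberg jars)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName("lakehouse-sketch")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling build_session(iceberg_rest_spark_conf()) would only succeed with the Iceberg Spark runtime and AWS bundle jars on the classpath and the catalog service running.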

Quick Start

# Clone, install, configure
git clone https://github.com/montanarograziano/poor-man-lakehouse.git
cd poor-man-lakehouse
just install
cp .env.example .env

# Start with Lakekeeper (recommended)
just up lakekeeper

# Open a notebook and start querying
jupyter lab notebooks/

See the Installation Guide for detailed setup instructions.
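
After `just up lakekeeper`, the containers may take a moment to accept connections. A small check like the one below can confirm that the stack is reachable before you open a notebook. It uses only the Python standard library; the helper names are hypothetical, and the ports mentioned in the usage note are assumptions to be replaced with the values from your .env file.

```python
import socket
import time


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def wait_for(host: str, port: int, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll host:port up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if port_open(host, port):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, wait_for("localhost", 8181) and wait_for("localhost", 9000) would poll a catalog and a MinIO endpoint, assuming those are the ports your setup exposes.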