Connectors Overview

Poor Man's Lakehouse provides a small set of focused connectors for different use cases. This guide helps you choose the right one.

Decision Tree

Do you need to write data?
  YES -> LakehouseConnection (DuckDB engine), SparkBuilder, or Sail
  NO  -> Continue...

Do you need a JVM?
  NO  -> LakehouseConnection (Polars, DuckDB, Arrow scans) or Sail
  YES -> SparkBuilder or LakehouseConnection.ibis_pyspark()

Do you want PySpark API without a JVM?
  YES -> Sail (Rust-based Spark Connect engine)
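
The questions above can be sketched as a small helper. This is only one reading of the tree: the function name, its parameters, and the choice to check the "PySpark API without a JVM" question first are assumptions made for illustration, not part of the library.

```python
def choose_connector(
    needs_write: bool,
    needs_jvm: bool,
    wants_pyspark_without_jvm: bool = False,
) -> list[str]:
    """Return candidate connectors for a workload (hypothetical helper)."""
    # Most specific question first: PySpark API, but no Java available.
    if wants_pyspark_without_jvm:
        return ["Sail"]
    # Any of the three connectors can write data.
    if needs_write:
        return ["LakehouseConnection (DuckDB engine)", "SparkBuilder", "Sail"]
    # Read-only workloads: split on whether a JVM is required.
    if needs_jvm:
        return ["SparkBuilder", "LakehouseConnection.ibis_pyspark()"]
    return ["LakehouseConnection (Polars, DuckDB, Arrow scans)", "Sail"]


if __name__ == "__main__":
    # Read-only, no JVM: prints the JVM-free options.
    print(choose_connector(needs_write=False, needs_jvm=False))
```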

Connector Comparison

| Feature | LakehouseConnection | SparkBuilder | Sail |
|---|---|---|---|
| Read tables | Polars, Arrow, DuckDB, Ibis (PySpark/Polars/DuckDB) | PySpark | PySpark API |
| Write tables | DuckDB (Iceberg) | PySpark | Delta, Iceberg, Parquet |
| Catalog browsing | Namespaces, tables, schemas, snapshots | Via Spark SQL | - |
| SQL execution | DuckDB, PySpark (via Ibis) | Spark SQL | PySpark SQL |
| Requires JVM | Only for ibis_pyspark() | Yes | No (Rust) |
| Catalog support | All (Nessie, Lakekeeper, PostgreSQL, Glue) | All | Local file-based |
| Context manager | Yes | No | Manual start/stop |
| Backed by | PyIceberg + Ibis | Spark + Iceberg/Delta JARs | Rust Spark Connect |

When to Use Each

LakehouseConnection

Best for: Most use cases, including catalog browsing, reading tables into Polars/DuckDB/Arrow, writing via DuckDB, and multi-engine comparison via Ibis. None of these need a JVM except the Ibis PySpark engine (ibis_pyspark()).

import polars as pl  # needed for pl.col() in the filter below

from poor_man_lakehouse import LakehouseConnection

with LakehouseConnection() as conn:
    # Browse the catalog (no JVM)
    namespaces = conn.list_namespaces()
    tables = conn.list_tables("default")
    schema = conn.table_schema("default", "users")

    # Scan to Polars (no JVM)
    lf = conn.scan_polars("default", "users")
    result = lf.filter(pl.col("age") > 25).collect()

    # DuckDB SQL with attached Iceberg catalog
    result = conn.sql("SELECT * FROM lakekeeper.default.users WHERE age > 25")

    # Write data via DuckDB
    conn.create_table("default", "output", "id INTEGER, name VARCHAR")
    conn.write_table("default", "output", query="SELECT 1, 'Alice'")

    # Ibis multi-engine access
    duck_ibis = conn.ibis_duckdb()
    spark_ibis = conn.ibis_pyspark()  # this one needs JVM

See the Lakehouse Connector guide for full details.

SparkBuilder

Best for: Full Spark ecosystem access, complex ETL, Spark-specific features (broadcast joins, UDFs, streaming), or when you need Spark SQL specifically.

from poor_man_lakehouse import get_spark_builder, CatalogType

builder = get_spark_builder(CatalogType.LAKEKEEPER)
spark = builder.get_spark_session()

spark.sql("CREATE TABLE lakekeeper.default.users (id INT, name STRING)")
spark.sql("INSERT INTO lakekeeper.default.users VALUES (1, 'Alice')")
df = spark.sql("SELECT * FROM lakekeeper.default.users")
df.show()

See the Spark Builders guide for full details.

Sail (pysail)

Best for: PySpark-compatible workloads without a JVM, Delta/Iceberg/Parquet reads and writes, fast local development.

Sail is a Rust-based compute engine that implements the Spark Connect protocol. It provides the PySpark API without requiring Java.

from pysail.spark import SparkConnectServer
from pyspark.sql import SparkSession

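# Start a local Sail Spark Connect server
server = SparkConnectServer()
server.start()
addr = server.listening_address  # indexed below as (host, port)

# Connect a regular PySpark session to the Sail server
spark = SparkSession.builder.remote(f"sc://{addr[0]}:{addr[1]}").getOrCreate()

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.write.format("delta").save("s3://warehouse/my_table")

# Sail has no context manager, so stop the session and the server manually
spark.stop()
server.stop()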
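
Because the Sail server is started and stopped manually, an exception between start() and stop() would leave it running. A minimal sketch of a wrapper that guarantees cleanup is shown below. The helper name running_server is hypothetical and not part of pysail or poor_man_lakehouse; it only assumes the wrapped object exposes start() and stop(), as SparkConnectServer does in the example above.

```python
from contextlib import contextmanager


@contextmanager
def running_server(server):
    """Start a server-like object and always stop it on exit.

    Hypothetical helper: works with any object that has start() and stop().
    """
    server.start()
    try:
        yield server
    finally:
        # Runs even if the body of the with-block raises.
        server.stop()
```

With pysail installed, this could be used as `with running_server(SparkConnectServer()) as server:`, building the SparkSession from server.listening_address inside the block and calling spark.stop() before the block ends.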

S3 credentials via environment variables

Sail reads S3 config from env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT_URL, AWS_ALLOW_HTTP), not from Spark config properties.
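
As an illustration, the variables can be set from Python before the Sail server is started. The helper below is a hypothetical sketch, not part of pysail: its name is invented, the credentials and endpoint in the usage example are placeholders for a local S3-compatible store, and the "true" value for AWS_ALLOW_HTTP is an assumption about the expected format.

```python
import os


def configure_sail_s3_env(access_key, secret_key, endpoint_url=None, allow_http=False):
    """Export S3 settings as environment variables (hypothetical helper)."""
    os.environ["AWS_ACCESS_KEY_ID"] = access_key
    os.environ["AWS_SECRET_ACCESS_KEY"] = secret_key
    if endpoint_url is not None:
        # Only needed for non-AWS endpoints such as a local object store.
        os.environ["AWS_ENDPOINT_URL"] = endpoint_url
    if allow_http:
        # Assumed value format; plain HTTP is usually only for local testing.
        os.environ["AWS_ALLOW_HTTP"] = "true"


if __name__ == "__main__":
    # Placeholder values; replace with real credentials and endpoint.
    configure_sail_s3_env(
        access_key="example-access-key",
        secret_key="example-secret-key",
        endpoint_url="http://localhost:9000",
        allow_http=True,
    )
```

Call the helper before server.start() so the variables are already in place when the server starts.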

No REST catalog support yet

Sail does not support Lakekeeper/Nessie REST catalogs. It works with local file-based table formats and direct S3 paths.

Catalog Support Matrix

| Catalog | get_catalog() | LakehouseConnection | SparkBuilder |
|---|---|---|---|
| Nessie | REST | All features | NessieCatalogSparkBuilder |
| Lakekeeper | REST | All features | LakekeeperCatalogSparkBuilder |
| PostgreSQL | SQL | Browsing + scans (no DuckDB attach) | PostgresCatalogSparkBuilder |
| Glue | Glue | All features | GlueCatalogSparkBuilder |