Lakehouse Connector

LakehouseConnection is the unified lightweight connector that replaces the previous IbisConnection, PolarsClient, PyIcebergClient, and CatalogBrowser. It provides catalog browsing, native scans, DuckDB engine access, Ibis multi-engine wrappers, SQL execution, and DuckDB-based writes — all backed by a single PyIceberg catalog via get_catalog().

Basic Usage

from poor_man_lakehouse import LakehouseConnection

# Uses settings.CATALOG (nessie, lakekeeper, postgres, or glue)
conn = LakehouseConnection()

# Or specify explicitly
conn = LakehouseConnection(catalog_type="lakekeeper")

Catalog Browsing

Browse namespaces, tables, schemas, and snapshot history — no JVM required:

conn = LakehouseConnection()

# List namespaces
namespaces = conn.list_namespaces()
# ['default', 'staging']

# List tables in a namespace
tables = conn.list_tables("default")
# ['users', 'orders', 'events']

# Get table schema
schema = conn.table_schema("default", "users")
# [
#   {"field_id": 1, "name": "id", "type": "long", "required": True},
#   {"field_id": 2, "name": "name", "type": "string", "required": False},
#   {"field_id": 3, "name": "age", "type": "int", "required": False},
# ]

# Get snapshot history
history = conn.snapshot_history("default", "users")
# [
#   {
#     "snapshot_id": 123456789,
#     "timestamp_ms": 1711234567890,
#     "summary": {"operation": "append", "added-records": "100"}
#   },
# ]

# Load a raw PyIceberg Table object (for advanced metadata operations)
table = conn.load_table("default", "users")
print(table.schema())
print(table.metadata.partition_specs)
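
The snapshot entries above carry epoch-millisecond timestamps and string-valued summary fields, which are awkward to read or aggregate directly. A minimal sketch of a post-processing step, assuming each entry has the `snapshot_id`, `timestamp_ms`, and `summary` keys shown in the example output; `summarize_history` is a hypothetical helper and not part of LakehouseConnection:

```python
from datetime import datetime, timezone


def summarize_history(history: list[dict]) -> list[dict]:
    """Turn snapshot_history()-style entries into readable rows.

    Assumes each entry has "snapshot_id", "timestamp_ms", and "summary"
    keys, as in the example output above.
    """
    rows = []
    # Oldest snapshot first, regardless of the order the catalog returned.
    for entry in sorted(history, key=lambda e: e["timestamp_ms"]):
        committed_at = datetime.fromtimestamp(
            entry["timestamp_ms"] / 1000, tz=timezone.utc
        )
        summary = entry.get("summary", {})
        rows.append(
            {
                "snapshot_id": entry["snapshot_id"],
                "committed_at": committed_at.isoformat(),
                "operation": summary.get("operation", "unknown"),
                # Summary values are strings in the example, so convert first.
                "added_records": int(summary.get("added-records", "0")),
            }
        )
    return rows


# Sample data copied from the example output above.
history = [
    {
        "snapshot_id": 123456789,
        "timestamp_ms": 1711234567890,
        "summary": {"operation": "append", "added-records": "100"},
    },
]
for row in summarize_history(history):
    print(row)
```

In real use, the `history` list would come from `conn.snapshot_history("default", "users")` instead of the inline sample.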

Native Scans

Read Iceberg tables directly into Polars or Arrow — no JVM, no DuckDB, just PyIceberg:

Polars LazyFrame

import polars as pl

lf = conn.scan_polars("default", "users")

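# Build queries with Polars expressions
# (assumes the table also has "department" and "salary" columns,
# which the schema example earlier on this page does not list)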
result = (
    lf.filter(pl.col("age") > 25)
    .group_by("department")
    .agg(pl.col("salary").mean())
    .collect()
)

PyArrow Table

arrow_table = conn.scan_arrow("default", "users")
print(arrow_table.schema)
print(arrow_table.num_rows)

DuckDB Engine

Access a DuckDB connection with the Iceberg catalog already attached. For REST catalogs (Lakekeeper, Nessie), the catalog is mounted via the DuckDB Iceberg extension. For Glue, it uses the AWS credential chain.

# The DuckDB connection is lazily initialized on first access
duck = conn.duckdb_connection

# Query tables directly via SQL
result = duck.sql("SELECT * FROM lakekeeper.default.users LIMIT 10")

# Or use as an Ibis DuckDB backend
duck.list_tables()

DuckDB and PostgreSQL catalog

When CATALOG=postgres, the DuckDB connection gets S3 credentials configured but no Iceberg catalog is attached (DuckDB doesn't support attaching SQL-backed Iceberg catalogs). Use scan_polars() or scan_arrow() to read tables instead.
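
Code that has to run against several catalog types can hide this difference behind one function. The following is a sketch under stated assumptions rather than library behavior: `read_all_rows` is a hypothetical helper, the DuckDB catalog alias is assumed to equal the catalog type name (as the `lakekeeper.default.users` examples suggest), and the identifiers are assumed to be trusted because they are interpolated into SQL.

```python
def read_all_rows(conn, catalog_type: str, namespace: str, table: str):
    """Read a whole table as Arrow through whichever path the catalog supports.

    Assumption: for non-postgres catalogs the DuckDB catalog alias matches
    the catalog type name, as in the `lakekeeper.default.users` examples.
    """
    if catalog_type == "postgres":
        # No Iceberg catalog is attached to DuckDB, so read via PyIceberg.
        return conn.scan_arrow(namespace, table)
    # conn.sql() returns an Ibis table expression; materialize it as Arrow
    # so both branches hand back the same kind of object.
    expr = conn.sql(f"SELECT * FROM {catalog_type}.{namespace}.{table}")
    return expr.to_pyarrow()
```

Calling `read_all_rows(conn, "postgres", "default", "users")` would use `scan_arrow()`, while any other catalog type goes through DuckDB SQL.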

Ibis Multi-Engine Access

Get typed Ibis backends for DuckDB, Polars, or PySpark:

DuckDB via Ibis

duck_ibis = conn.ibis_duckdb()
# Same object as conn.duckdb_connection — full DuckDB Ibis backend
result = duck_ibis.sql("SELECT count(*) FROM lakekeeper.default.users")

Polars via Ibis

Polars has no native catalog support, so the table is loaded via PyIceberg and registered in a Polars Ibis connection:

polars_ibis = conn.ibis_polars("default", "users")
# Table is registered as "default.users" in the Polars backend
table = polars_ibis.table("default.users")
result = table.filter(table.age > 25).execute()

PySpark via Ibis

spark_ibis = conn.ibis_pyspark()
# Full PySpark Ibis backend connected to the current Spark session
result = spark_ibis.sql("SELECT * FROM default.users")

PySpark requires a JVM

ibis_pyspark() calls retrieve_current_spark_session(), which starts a JVM and Spark session. The other methods (DuckDB, Polars, Arrow scans) are JVM-free.

SQL Execution

Execute SQL queries through DuckDB or PySpark:

# DuckDB (default engine)
result = conn.sql("SELECT * FROM lakekeeper.default.users WHERE age > 25")

# PySpark
result = conn.sql("SELECT count(*) FROM default.users", engine="pyspark")

The engine parameter accepts "duckdb" (default) or "pyspark". With either engine, sql() returns an Ibis table expression.

Writing Tables (DuckDB)

DuckDB 1.5+ supports Iceberg writes through REST catalogs:

Create a Table

conn.create_table("default", "users", "id INTEGER, name VARCHAR, age INTEGER")
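
When a new table should mirror an existing one, the column-definition string can be derived from a `table_schema()` result instead of being typed by hand. A minimal sketch, assuming the `name` and `type` keys shown in the schema example; `ICEBERG_TO_DUCKDB` and `schema_to_columns` are hypothetical names, the type mapping is deliberately partial, and the `required` flag is ignored:

```python
# Partial, illustrative mapping from Iceberg primitive type names to DuckDB
# column types; extend it for the types your tables actually use.
ICEBERG_TO_DUCKDB = {
    "boolean": "BOOLEAN",
    "int": "INTEGER",
    "long": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "string": "VARCHAR",
    "date": "DATE",
    "timestamp": "TIMESTAMP",
}


def schema_to_columns(schema: list[dict]) -> str:
    """Build a "name TYPE, name TYPE" string from table_schema()-style dicts."""
    parts = []
    for field in schema:
        iceberg_type = field["type"]
        if iceberg_type not in ICEBERG_TO_DUCKDB:
            # Fail loudly instead of guessing a type for nested or rare types.
            raise ValueError(f"No DuckDB mapping for Iceberg type {iceberg_type!r}")
        parts.append(f'{field["name"]} {ICEBERG_TO_DUCKDB[iceberg_type]}')
    return ", ".join(parts)


# Schema copied from the Catalog Browsing example.
users_schema = [
    {"field_id": 1, "name": "id", "type": "long", "required": True},
    {"field_id": 2, "name": "name", "type": "string", "required": False},
    {"field_id": 3, "name": "age", "type": "int", "required": False},
]
print(schema_to_columns(users_schema))
# id BIGINT, name VARCHAR, age INTEGER
```

The resulting string has the same shape as the third argument passed to `create_table()` above.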

Insert from SQL

conn.write_table("default", "users", query="SELECT 1, 'Alice', 30")

Insert with Overwrite

conn.write_table("default", "users", query="SELECT 2, 'Bob', 25", mode="overwrite")

Insert from Ibis Expression

duck = conn.ibis_duckdb()
source = duck.sql("SELECT * FROM lakekeeper.default.staging_users")
conn.write_table("default", "users", data=source, mode="append")

Write modes

"append" (default): INSERT INTO — adds rows to existing data
"overwrite": INSERT OVERWRITE — replaces all existing data

Connection Lifecycle

LakehouseConnection can be used as a context manager for automatic cleanup:

# Context manager (recommended)
with LakehouseConnection() as conn:
    namespaces = conn.list_namespaces()
    lf = conn.scan_polars("default", "users")
    # DuckDB connection auto-closes on exit

# Manual cleanup
conn = LakehouseConnection()
try:
    result = conn.sql("SELECT 1")
finally:
    conn.close()  # clears cached DuckDB connection

The close() method clears the cached duckdb_connection property. The PyIceberg catalog itself is lightweight and doesn't need explicit cleanup.
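
The lazy-initialize-then-clear behavior described here is a common pattern and can be illustrated without the library. The sketch below is not the LakehouseConnection implementation; `LazyResource` and its `factory` argument are hypothetical stand-ins that show how a cached property, a `close()` method, and context-manager support fit together:

```python
from functools import cached_property


class LazyResource:
    """Stand-in for an object that owns one lazily created connection."""

    def __init__(self, factory):
        # Callable that builds the (potentially expensive) connection.
        self._factory = factory

    @cached_property
    def connection(self):
        # Runs on first access only; the result is stored on the instance.
        return self._factory()

    def close(self) -> None:
        # Dropping the cached value means the next access builds a fresh one.
        self.__dict__.pop("connection", None)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


created = []


def make_connection():
    created.append("connection")
    return object()


with LazyResource(make_connection) as resource:
    first = resource.connection
    second = resource.connection  # served from the cache, no second build
```

A real implementation may also disconnect the underlying backend inside `close()`; this sketch only drops the cached reference.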

Full API Summary

Category	Method	Returns	JVM Required
Browsing	`list_namespaces()`	`list[str]`	No
	`list_tables(namespace)`	`list[str]`	No
	`table_schema(namespace, table)`	`list[dict]`	No
	`snapshot_history(namespace, table)`	`list[dict]`	No
	`load_table(namespace, table)`	PyIceberg `Table`	No
Scans	`scan_polars(namespace, table)`	`pl.LazyFrame`	No
	`scan_arrow(namespace, table)`	`pa.Table`	No
DuckDB	`duckdb_connection` (property)	Ibis DuckDB backend	No
Ibis	`ibis_duckdb()`	Ibis DuckDB backend	No
	`ibis_polars(namespace, table)`	Ibis Polars backend	No
	`ibis_pyspark()`	Ibis PySpark backend	Yes
SQL	`sql(query, engine="duckdb")`	Ibis table expression	Only if `engine="pyspark"`
Write	`write_table(namespace, table, ...)`	`None`	No
	`create_table(namespace, table, schema)`	`None`	No
Lifecycle	`close()`	`None`	No