Skip to content

Configuration

All settings are managed via the Settings class in config.py, powered by Pydantic Settings. Values are loaded from environment variables and .env file.

Environment Variables

Catalog Selection

Variable Default Description
CATALOG nessie Active catalog backend: nessie, lakekeeper, postgres, glue
CATALOG_NAME nessie Catalog name used in Spark/DuckDB/PyIceberg configuration
CATALOG_DEFAULT_SCHEMA default Default schema/namespace

Match CATALOG to Docker Profile

Your CATALOG setting must match the Docker profile you started:

Docker Profile CATALOG Value
just up nessie nessie
just up lakekeeper lakekeeper
just up (core only) postgres
(no Docker needed) glue (uses AWS credentials)

AWS / MinIO

Variable Default Description
AWS_ACCESS_KEY_ID "" MinIO access key (or AWS key for Glue)
AWS_SECRET_ACCESS_KEY "" MinIO secret key (or AWS secret for Glue)
AWS_ENDPOINT_URL http://minio:9000 S3-compatible endpoint (not used with Glue)
AWS_DEFAULT_REGION eu-central-1 AWS region
BUCKET_NAME warehouse S3 bucket for table data

AWS Glue

Variable Default Description
GLUE_CATALOG_ID "" AWS account ID for cross-account Glue access; empty = default account

PostgreSQL

Variable Default Description
POSTGRES_HOST localhost PostgreSQL host
POSTGRES_USER postgres Database user
POSTGRES_PASSWORD "" Database password
POSTGRES_DB lakehouse_db Database name

Catalog URIs

Variable Default Used By
NESSIE_NATIVE_URI http://nessie:19120/api/v2 Nessie native API (Spark)
NESSIE_REST_URI http://nessie:19120/iceberg Nessie REST/Iceberg API (PyIceberg, DuckDB)
LAKEKEEPER_SERVER_URI http://lakekeeper:8181/catalog Lakekeeper REST catalog

Spark

Variable Default Description
SPARK_MASTER spark://localhost:7077 Spark master URL (use local[*] for no cluster)
SPARK_DRIVER_HOST 172.18.0.1 Driver host (Docker bridge IP)
SPARK_DRIVER_PORT 7001 Driver port
SPARK_DRIVER_BLOCK_MANAGER_PORT 7002 Block manager port

Logging

Variable Default Description
LOG_VERBOSITY DEBUG Log level: DEBUG, INFO, WARNING, ERROR
LOG_ROTATION_SIZE 100MB Log file rotation size
LOG_RETENTION 30 days Log file retention period

Computed Fields

Some settings are computed automatically from other values:

  • WAREHOUSE_BUCKET = s3://{BUCKET_NAME}/ -- automatically derived from BUCKET_NAME
  • SETTINGS_PATH = {REPO_PATH}/settings -- derived from REPO_PATH
  • LOG_FILE_PATH = {LOG_FOLDER}/{LOG_FILE_NAME} -- derived from log settings

These computed fields respect environment variable overrides of their source fields.

Storage Options

Two computed dictionaries are built at startup based on CATALOG:

  • S3_STORAGE_OPTIONS -- S3 credentials, endpoint, region for general S3 access (Polars, Delta)
  • ICEBERG_STORAGE_OPTIONS -- PyIceberg-specific S3 config. For Glue, this contains only region and warehouse. For other catalogs, it includes full S3 credentials and endpoint.

These are consumed internally by get_catalog() and LakehouseConnection. You rarely need to access them directly.

Programmatic Access

from poor_man_lakehouse import settings, reload_settings

# Access any setting
print(settings.CATALOG)              # "lakekeeper"
print(settings.AWS_ENDPOINT_URL)     # "http://minio:9000"
print(settings.WAREHOUSE_BUCKET)     # "s3://warehouse/" (computed)
print(settings.GLUE_CATALOG_ID)      # "" (empty = default account)

# Reload settings (clears cache, re-reads .env)
new_settings = reload_settings()