Spark Builders

The spark_connector module provides catalog-specific SparkSession builders. Each builder configures Spark with the correct JARs, extensions, and catalog settings for its backend.

Factory Pattern

from poor_man_lakehouse import get_spark_builder, CatalogType

# From enum
builder = get_spark_builder(CatalogType.NESSIE)

# From string
builder = get_spark_builder("lakekeeper")

# Get a configured SparkSession
spark = builder.get_spark_session()

Or use the current catalog from settings:

from poor_man_lakehouse import retrieve_current_spark_session

# Uses settings.CATALOG to determine which builder
spark = retrieve_current_spark_session()
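Internally, the factory can be pictured as a small dispatch table keyed by catalog type. The sketch below is illustrative, not the actual poor_man_lakehouse implementation: the builder classes are empty stand-ins named after the real ones, and `CatalogType` is modeled as a `str`-valued enum so that both enum members and raw strings resolve to the same builder.

```python
from enum import Enum

class CatalogType(str, Enum):
    # Mirrors the supported catalog types; values double as the string form.
    POSTGRES = "postgres"
    NESSIE = "nessie"
    LAKEKEEPER = "lakekeeper"
    GLUE = "glue"

# Hypothetical stand-ins for the real builder classes.
class PostgresCatalogSparkBuilder: ...
class NessieCatalogSparkBuilder: ...
class LakekeeperCatalogSparkBuilder: ...
class GlueCatalogSparkBuilder: ...

_BUILDERS = {
    CatalogType.POSTGRES: PostgresCatalogSparkBuilder,
    CatalogType.NESSIE: NessieCatalogSparkBuilder,
    CatalogType.LAKEKEEPER: LakekeeperCatalogSparkBuilder,
    CatalogType.GLUE: GlueCatalogSparkBuilder,
}

def get_spark_builder(catalog):
    # Accepts either a CatalogType member or its string value.
    return _BUILDERS[CatalogType(catalog)]()

assert isinstance(get_spark_builder("nessie"), NessieCatalogSparkBuilder)
assert isinstance(get_spark_builder(CatalogType.GLUE), GlueCatalogSparkBuilder)
```

A `str`-valued enum is a natural fit here because `CatalogType("lakekeeper")` and `CatalogType.LAKEKEEPER` normalize to the same key.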

Available Builders

NessieCatalogSparkBuilder

Uses Nessie's native catalog with git-like versioning.

builder = get_spark_builder(CatalogType.NESSIE)
spark = builder.get_spark_session()

spark.sql("CREATE TABLE nessie.default.users (id INT, name STRING)")
spark.sql("INSERT INTO nessie.default.users VALUES (1, 'Alice')")

LakekeeperCatalogSparkBuilder

Uses Lakekeeper's REST catalog interface with credential vending (no static S3 keys in Spark config).

builder = get_spark_builder(CatalogType.LAKEKEEPER)
spark = builder.get_spark_session()

spark.sql("SELECT * FROM lakekeeper.default.users").show()

Note

Lakekeeper always uses "lakekeeper" as the catalog name, regardless of the CATALOG_NAME setting.

PostgresCatalogSparkBuilder

Uses PostgreSQL as the Iceberg catalog backend via JDBC. This is the tested and recommended implementation for local development.

builder = get_spark_builder(CatalogType.POSTGRES)
spark = builder.get_spark_session()

GlueCatalogSparkBuilder

Uses AWS Glue as the Iceberg catalog backend. Credentials are resolved via the AWS default credential chain (environment variables, ~/.aws/credentials, IAM role); no static S3 keys appear in the Spark config.

builder = get_spark_builder(CatalogType.GLUE)
spark = builder.get_spark_session()

Optionally set GLUE_CATALOG_ID for cross-account Glue access.
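A rough sketch of how the optional cross-account setting might feed into the Spark config. The function below is hypothetical (how the real builder assembles its config is an assumption); `glue.id` is Iceberg's Glue catalog-id property, and the key is only added when GLUE_CATALOG_ID is set.

```python
import os

def glue_catalog_conf(catalog_name: str = "glue") -> dict:
    # Illustrative subset of an Iceberg-on-Glue catalog config; the real
    # builder's internals are an assumption.
    prefix = f"spark.sql.catalog.{catalog_name}"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    }
    # "glue.id" is Iceberg's Glue catalog-id property; adding it only when
    # GLUE_CATALOG_ID is set enables cross-account access.
    glue_id = os.environ.get("GLUE_CATALOG_ID")
    if glue_id:
        conf[f"{prefix}.glue.id"] = glue_id
    return conf

os.environ["GLUE_CATALOG_ID"] = "123456789012"  # example cross-account ID
conf = glue_catalog_conf()
```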

Supported Catalog Types

Enum Value             | String       | Builder Class                 | Credential Handling
CatalogType.POSTGRES   | "postgres"   | PostgresCatalogSparkBuilder   | Static S3 keys from settings
CatalogType.NESSIE     | "nessie"     | NessieCatalogSparkBuilder     | Static S3 keys from settings
CatalogType.LAKEKEEPER | "lakekeeper" | LakekeeperCatalogSparkBuilder | Credential vending (no static keys)
CatalogType.GLUE       | "glue"       | GlueCatalogSparkBuilder       | AWS credential chain (no static keys)

Common Configuration

All builders share:

  • Iceberg Spark runtime (the Spark 4.0 / Scala 2.13 build, version 1.10.1)
  • Delta Lake via configure_spark_with_delta_pip
  • Hadoop AWS for S3/MinIO access
  • Nessie Spark extensions for git-like operations
  • PostgreSQL JDBC driver for catalog metadata

The JARs are resolved via Maven/Ivy at session creation time.
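The resolved coordinates can be pictured as a `spark.jars.packages` string, which Spark's Ivy resolver consumes as a comma-separated list. The artifact names below follow the dependency list above, but the exact artifacts and version pins (other than Iceberg 1.10.1) are assumptions; the real builders may resolve a different set.

```python
# Illustrative Maven coordinates matching the shared dependency list above.
SCALA = "2.13"
ICEBERG_VERSION = "1.10.1"

packages = [
    # Iceberg Spark runtime (Spark 4.0 build for Scala 2.13)
    f"org.apache.iceberg:iceberg-spark-runtime-4.0_{SCALA}:{ICEBERG_VERSION}",
    # S3/MinIO access (version is an assumption)
    "org.apache.hadoop:hadoop-aws:3.4.0",
    # JDBC driver for catalog metadata (version is an assumption)
    "org.postgresql:postgresql:42.7.3",
]

# Spark's Ivy resolver expects one comma-separated string.
spark_jars_packages = ",".join(packages)
```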

Custom App Name

builder = get_spark_builder(CatalogType.NESSIE)
builder._app_name = "My Custom App"  # overrides the default app name (private attribute, no public setter)
spark = builder.get_spark_session()

Spark Cluster

When using the Docker Spark cluster (just up spark), set:

SPARK_MASTER="spark://localhost:7077"
SPARK_DRIVER_HOST="172.18.0.1"  # Docker bridge gateway IP
SPARK_DRIVER_PORT=7001
SPARK_DRIVER_BLOCK_MANAGER_PORT=7002

For local mode (no cluster):

SPARK_MASTER="local[*]"
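How these variables might map onto Spark properties can be sketched as follows. The property names are standard Spark configuration keys, but the mapping itself (and the fallback to local mode) is an assumption about the builders' internals.

```python
import os

# Values from the cluster env block above.
os.environ["SPARK_MASTER"] = "spark://localhost:7077"
os.environ["SPARK_DRIVER_HOST"] = "172.18.0.1"
os.environ["SPARK_DRIVER_PORT"] = "7001"
os.environ["SPARK_DRIVER_BLOCK_MANAGER_PORT"] = "7002"

def driver_conf() -> dict:
    # Driver networking matters here because the driver runs on the host
    # while executors run inside Docker; executors must be able to reach
    # the driver on fixed, known ports. Falls back to local mode when no
    # cluster master is configured.
    return {
        "spark.master": os.environ.get("SPARK_MASTER", "local[*]"),
        "spark.driver.host": os.environ.get("SPARK_DRIVER_HOST", "localhost"),
        "spark.driver.port": os.environ.get("SPARK_DRIVER_PORT", "0"),
        "spark.driver.blockManager.port": os.environ.get("SPARK_DRIVER_BLOCK_MANAGER_PORT", "0"),
    }

conf = driver_conf()
```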

See the Spark Cluster guide for details on the Docker-based Spark standalone cluster.