One Table, Two Engines: Building a Unified Lakehouse with Spark, Trino, and Apache Iceberg

Gamal Elkoumy

--

In today’s hybrid data world, interoperability is no longer a luxury — it’s a necessity. This blog dives into how you can seamlessly bridge Spark and Trino with Apache Iceberg for a unified, future-proof data lakehouse.

What is Interoperability?

Interoperability in a data lake architecture means enabling multiple execution engines, like Apache Spark and Trino, to seamlessly access and operate on the same dataset. This capability eliminates redundant processing, ensures consistency, and unlocks diverse workloads in a unified environment.

In this blog, I’ll walk you through configuring Spark and Trino to share Apache Iceberg tables stored on S3, backed by a Hive metastore on AWS RDS. With this setup, you’ll achieve seamless read and write operations across both engines, powering efficient hybrid workflows.

If you’re new to Apache Iceberg or curious why it’s the future-proof solution for data lakes, explore my previous blog on Apache Iceberg.

Why Interoperability Matters

In modern data processing, hybrid workloads are essential. Spark shines in handling complex ETL workflows, while Trino is unparalleled in ad-hoc analytics and federated queries. By enabling both engines to access and modify the same Iceberg tables stored on S3, backed by a Hive metastore, you eliminate redundancy, reduce latency, and maintain consistency across all operations. This interoperability ensures your data architecture is both efficient and flexible.

Setting Up The Environment

To create a shared Iceberg environment for Spark and Trino, ensure the following prerequisites are met:

  • IAM Permissions: Configure proper IAM roles to allow secure access to S3 and the Hive metastore for Spark and Trino nodes.
  • Separate Clusters: Use distinct clusters for Spark and Trino to optimize resources and isolate workloads.
  • Metastore Configuration:
      ◦ AWS RDS: Set up RDS as the Hive metastore, ensuring you have its JDBC URL, username, and password.
      ◦ Other Options: Iceberg also supports AWS Glue and Hadoop catalogs; use similar configurations as needed.
  • Networking: Enable connectivity to the Hive metastore via JDBC URI and open necessary ports, such as Hive Thrift (9083).

These steps establish a unified metadata layer, ensuring Spark and Trino can seamlessly share Iceberg tables for diverse workloads.
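
Before configuring either engine, it's worth confirming that the metastore endpoints are reachable from your cluster nodes. The snippet below is a minimal sketch using plain sockets; the hostnames are placeholders, and the ports assume the Thrift default (9083) and a MySQL-compatible RDS instance (3306):

import socket

# Placeholder endpoints -- replace with your metastore host and RDS endpoint.
ENDPOINTS = [
    ("<EMR-IP-OR-MASTER-NODE-IP>", 9083),   # Hive metastore Thrift port
    ("<RDS-ENDPOINT>", 3306),               # RDS port (3306 for MySQL/MariaDB)
]

for host, port in ENDPOINTS:
    try:
        # Attempt a TCP connection with a short timeout.
        with socket.create_connection((host, port), timeout=5):
            print(f"OK: {host}:{port} is reachable")
    except OSError as exc:
        print(f"FAILED: {host}:{port} -> {exc}")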

Configure Apache Spark for Iceberg

Apache Iceberg transforms your data lake into a high-performance, open data lakehouse. To achieve this, Apache Spark needs to integrate seamlessly with Iceberg through a Hive metastore. Here’s how to set up Spark for Iceberg while leveraging a Hive metastore hosted on AWS RDS.

Step 1: Setting Up the Hive Metastore

The Hive metastore serves as the metadata layer for your Iceberg tables, acting like a centralized “lookup table.” Each table name maps to the directory in S3 holding the files that represent its data. Here’s how to configure the Hive metastore:

"Classification": "hive-site",
"Properties": {
"javax.jdo.option.ConnectionDriverName": "<JDBC-DRIVER-OF-YOUR-RDS>",
"javax.jdo.option.ConnectionPassword": "<RDS-DB-PASSWORD>",
"javax.jdo.option.ConnectionURL": "<RDS-JDBC-CONNECTION-STRING>",
"javax.jdo.option.ConnectionUserName": "<RDS-DB-USERNAME>"
}

This configuration ensures the Hive metastore communicates with AWS RDS to store metadata.
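
If you create the cluster programmatically instead of through the console, the same classification can be supplied at launch time. The snippet below is a hypothetical sketch using boto3; the cluster name, release label, instance types, and IAM role names are placeholders, not recommendations:

import boto3

# Hypothetical sketch: passing the hive-site classification when creating an
# EMR cluster with boto3. All <PLACEHOLDER> values and sizing are illustrative.
hive_site = {
    "Classification": "hive-site",
    "Properties": {
        "javax.jdo.option.ConnectionDriverName": "<JDBC-DRIVER-OF-YOUR-RDS>",
        "javax.jdo.option.ConnectionURL": "<RDS-JDBC-CONNECTION-STRING>",
        "javax.jdo.option.ConnectionUserName": "<RDS-DB-USERNAME>",
        "javax.jdo.option.ConnectionPassword": "<RDS-DB-PASSWORD>",
    },
}

emr = boto3.client("emr")
emr.run_job_flow(
    Name="iceberg-spark-cluster",
    ReleaseLabel="<EMR-RELEASE-LABEL>",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Configurations=[hive_site],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)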

Step 2: Configuring Spark to Use the Hive Metastore

Once the Hive metastore is ready, Spark must connect to it via the Thrift protocol. The Hive URI includes the IP or hostname of the EMR master node or your dedicated Hive metastore host:

"Classification": "spark-defaults",
"Properties": {
"spark.hadoop.hive.metastore.uris": "thrift://<EMR-IP-OR-MASTER-NODE-IP>:<HIVE-POrT-NUMBER>",
"spark.hadoop.javax.jdo.option.ConnectionDriverName": "<JDBC-DRIVER-OF-YOUR-RDS>",
"spark.hadoop.javax.jdo.option.ConnectionPassword": "<RDS-DB-PASSWORD>",
"spark.hadoop.javax.jdo.option.ConnectionURL": "<RDS-JDBC-CONNECTION-STRING>",
"spark.hadoop.javax.jdo.option.ConnectionUserName": "<RDS-DB-USERNAME>",
"spark.jars": "<LOCATION-OF-SPARK-ICEBERG-JAR-FILE>", //"iceberg-spark3-runtime.jar"
"spark.sql.catalog.iceberg": "org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.iceberg.catalog-impl": "org.apache.iceberg.hive.HiveCatalog",
"spark.sql.catalog.iceberg.warehouse": "<S3-DIRECTORY-TO-LOCATE-YOUR-TABLES-BY-DEFAULT>",
"spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
}
  • spark.jars: Ensure you include the Iceberg runtime JAR for Spark (e.g., iceberg-spark3-runtime.jar).
  • spark.sql.catalog.iceberg: This configures Iceberg as the catalog for Spark, leveraging the Hive metastore to store metadata.
  • spark.sql.catalog.iceberg.warehouse: Points to the S3 directory where Iceberg tables will reside.
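
If you prefer to set these properties in code (for example, when testing outside of EMR classifications), the same configuration can be applied while building the SparkSession. This is a minimal PySpark sketch with placeholder values; it assumes the Iceberg runtime JAR is already on the driver and executor classpath:

from pyspark.sql import SparkSession

# Minimal sketch: the spark-defaults properties above, set on a SparkSession.
# All <PLACEHOLDER> values are illustrative.
spark = (
    SparkSession.builder.appName("iceberg-interop")
    .config("spark.hadoop.hive.metastore.uris",
            "thrift://<EMR-IP-OR-MASTER-NODE-IP>:9083")
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.catalog-impl", "org.apache.iceberg.hive.HiveCatalog")
    .config("spark.sql.catalog.iceberg.warehouse", "s3://<BUCKET>/<WAREHOUSE-PREFIX>/")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .enableHiveSupport()
    .getOrCreate()
)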

Step 3: Writing Data to Iceberg

Iceberg’s robust support for transactions and schema evolution makes data writes seamless. Use the following code to write data into an Iceberg table:

(
    df.writeTo("iceberg.<SCHEMA-NAME>.<TABLE-NAME>")
    .using("iceberg")
    .createOrReplace()
)

Note: The above code uses the DataFrameWriterV2 API introduced in Spark 3, which current Iceberg releases support. If you're using an older Spark or Iceberg release, consult the Iceberg documentation for compatibility.
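
For a concrete end-to-end illustration, here's a hypothetical PySpark snippet that builds a small DataFrame and writes it as a partitioned Iceberg table. The schema name (analytics), table name (events), and partition column are made up for this example:

from pyspark.sql import functions as F

# Hypothetical sample data; column names and the table identifier are illustrative.
events = spark.createDataFrame(
    [("u1", "click", "2024-01-01"), ("u2", "view", "2024-01-02")],
    ["user_id", "event_type", "event_date"],
)

(
    events.writeTo("iceberg.analytics.events")   # catalog.schema.table
    .using("iceberg")
    .partitionedBy(F.col("event_date"))          # optional partition spec
    .createOrReplace()
)

createOrReplace() creates the table if it doesn't exist and replaces it otherwise; use append() instead to add rows to an existing table.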

Step 4: Reading Data from Iceberg

To read data from Iceberg tables in Spark, you can use either of these methods:

  1. Using a table reference:

df = spark.table("iceberg.<SCHEMA-NAME>.<TABLE-NAME>")

  2. Using the format specification:

df = spark.read.format("iceberg").load("iceberg.<SCHEMA-NAME>.<TABLE-NAME>")
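
Both methods also support Iceberg time travel through read options. The sketch below is a minimal, hypothetical example; the snapshot ID and epoch-millisecond timestamp are placeholder values you would replace with real ones from your table's history:

# Read the table as of a specific snapshot ID (placeholder value).
df_snapshot = (
    spark.read.format("iceberg")
    .option("snapshot-id", 1234567890123456789)
    .load("iceberg.<SCHEMA-NAME>.<TABLE-NAME>")
)

# Read the table as it looked at a point in time (milliseconds since epoch).
df_asof = (
    spark.read.format("iceberg")
    .option("as-of-timestamp", 1704067200000)
    .load("iceberg.<SCHEMA-NAME>.<TABLE-NAME>")
)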

By setting up Spark with Iceberg, you enable powerful data operations, such as time travel and incremental snapshots, while maintaining consistency and performance. Ready to supercharge your workflows? Let’s dive into configuring Trino next!

Configure Trino for Iceberg

Setting up Trino to work with Apache Iceberg is straightforward and enables powerful ad-hoc querying on shared tables. Here’s how to configure Trino to connect to Iceberg through a Hive metastore.

Step 1: Set Up the Hive Metastore

Just like with Spark, Trino requires a Hive metastore to manage table metadata. Use the following configuration for your Hive metastore, connecting it to your AWS RDS instance:

"Classification": "hive-site",
"Properties": {
"javax.jdo.option.ConnectionDriverName": "<JDBC-DRIVER-OF-YOUR-RDS>",
"javax.jdo.option.ConnectionPassword": "<RDS-DB-PASSWORD>",
"javax.jdo.option.ConnectionURL": "<RDS-JDBC-CONNECTION-STRING>",
"javax.jdo.option.ConnectionUserName": "<RDS-DB-USERNAME>"
}

This step ensures that Trino and Spark share the same metadata repository, enabling seamless access to Iceberg tables.

Step 2: Configure Trino Catalog for Iceberg

In Trino, catalogs are configured using a catalog.properties file. To set up Iceberg, create or modify the configuration file at /etc/trino/conf/catalog/iceberg.properties with the following:

connector.name=iceberg
hive.metastore.uri=thrift://<EMR-IP-OR-MASTER-NODE-IP>:<HIVE-PORT-NUMBER>
iceberg.catalog.type=HIVE_METASTORE
  • connector.name: Specifies Iceberg as the connector for Trino.
  • hive.metastore.uri: Points to the Hive metastore, which must be accessible via the Thrift protocol.
  • iceberg.catalog.type: Specifies the metastore type (HIVE_METASTORE), enabling Iceberg to utilize the Hive configuration.

This configuration ensures that Trino queries the same metadata as Spark, providing a unified view of Iceberg tables.

Step 3: Query Iceberg Tables in Trino

Once the catalog is configured, querying Iceberg tables in Trino is as simple as writing standard SQL:

SELECT * FROM iceberg.<SCHEMA-NAME>.<TABLE-NAME>;

With this setup, Trino can perform fast, ad-hoc queries on Iceberg tables stored in S3, ensuring compatibility with Spark and leveraging the power of a shared Hive metastore.
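
If you'd rather run these queries from code than from the Trino CLI, one option is the trino Python client (pip install trino). The sketch below is hypothetical: the host, user, and table names are placeholders, and it assumes no authentication is required on your coordinator:

from trino.dbapi import connect

# Minimal sketch using the trino Python client; all placeholders are illustrative.
conn = connect(
    host="<TRINO-COORDINATOR-HOST>",
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="<SCHEMA-NAME>",
)

cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM <TABLE-NAME>")
print(cur.fetchone())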

By integrating Trino with Iceberg, you gain the ability to query large datasets effortlessly, while maintaining consistency across engines. Ready to unlock seamless data access? Start querying!

Benefits and Best Practices

Integrating Apache Spark and Trino with a shared Iceberg table ensures seamless workflows and efficient operations. Here’s why it matters and how to optimize:

Benefits

  • Operational Consistency: Spark and Trino access the same data, maintaining integrity.
  • Hybrid Workflows: Combine Spark’s ETL power with Trino’s ad-hoc analytics.
  • Efficiency: Reduce duplication and streamline resource usage.
  • Future-Ready: Iceberg’s ACID transactions, time travel, and schema evolution future-proof your architecture.

Best Practices

  • Monitor Metadata: Keep Hive metadata updated for consistent operations.
  • Optimize S3: Use compaction to reduce small files and hidden partitioning for faster queries.
  • Maintain Snapshots: Regularly clean up old snapshots to save storage (a maintenance sketch covering compaction and snapshot expiry follows this list).
  • Test Setup: Validate Spark and Trino integration with read/write operations.
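
As a starting point for the compaction and snapshot-cleanup items above, Iceberg ships stored procedures that can be invoked from Spark SQL (with the Iceberg SQL extensions enabled, as configured earlier). The table name and cutoff timestamp below are placeholders; tune both to your workload:

# Compact small data files into larger ones (placeholder table name).
spark.sql("""
    CALL iceberg.system.rewrite_data_files(
        table => 'analytics.events'
    )
""")

# Expire snapshots older than a cutoff timestamp to reclaim storage.
spark.sql("""
    CALL iceberg.system.expire_snapshots(
        table => 'analytics.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")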

Unlock the full potential of your data lake by implementing these best practices today!

