Discover why the lakehouse architecture has become the leading choice for data management and explore the future trends shaping data architecture.

The Lakehouse Has Won: What’s Next for Data Architecture

The landscape of data architecture reached a pivotal point by 2025. For years, organizations grappled with a fundamental dichotomy: the rigid reliability of centralized data warehouses versus the flexible, low-cost storage of open data lakes. Warehouses, while excellent for trustworthy business intelligence, proved cumbersome with diverse, unstructured data and struggled with elastic scaling. Conversely, data lakes offered freedom but often devolved into ungoverned “data swamps” lacking the transactional rigor and performance guarantees crucial for analytics. This schism created inefficiencies, silos, and a significant barrier to leveraging data for advanced analytics and emerging artificial intelligence workloads. Traditional daily batch processing, once the standard, came to look outdated, leaving decision-makers operating with stale information in a world demanding real-time insights.

This persistent frustration among data teams and business leaders underscored a growing need for a unified approach. The inability to seamlessly integrate varied data types, coupled with the escalating demands of continuous, low-latency data streams for AI, highlighted the limitations of existing paradigms. Enterprises found themselves constantly rebuilding, migrating, or patching systems, sacrificing agility for trust or vice versa. The pressure to innovate, driven by competitive markets and the promise of intelligent automation, made the status quo untenable. It was clear that a new foundation was required, one that could bridge the gap between structure and flexibility, ensuring data integrity while empowering rapid innovation.

Into this challenging environment, the data lakehouse emerged victorious, not as a mere compromise, but as a robust operating model. By combining the scalable, open storage of data lakes with the transactional integrity and metadata-driven management of data warehouses, it forged a unified platform. This architecture, solidified by 2025, leverages open table formats, transactional metadata, and multi-engine access over a single, governed body of data. It transforms raw information into a reliable, high-performance, and AI-ready asset, empowering organizations to finally meet the complex demands of 2026 and beyond. This guide delves into the foundational triumph of the lakehouse and explores the exciting new frontiers it unlocks for data engineers and architects.

In brief:

  • The data lakehouse has solidified its position as the preferred data architecture by 2025, overcoming the limitations of traditional data warehouses and data lakes.
  • Key innovations include file-level metadata tracking and open table formats (Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon) enabling ACID transactions, schema evolution, and time travel.
  • Modern lakehouse architecture is layered, separating storage, table formats, ingestion, catalog/governance, and consumption.
  • Streaming-first ingestion has become a standard, with platforms like Confluent, Aiven, and Redpanda offering native lakehouse integrations.
  • Specialized lakehouse catalogs (Apache Polaris, Gravitino, Nessie, Unity Catalog) are critical for governance, discoverability, and multi-engine access.
  • Ongoing optimization through compaction, metadata cleanup, and query acceleration is vital for sustaining performance at scale.
  • The lakehouse is evolving to support advanced workloads, including agentic AI, Python-native development (DuckDB, Dask, Daft, Bauplan), and integrated graph analytics (PuppyGraph).
  • Future horizons extend to edge inference with platforms like Spice AI, bringing AI processing closer to data sources, and exploring alternative metadata approaches like DuckLake.

The inevitable rise of the lakehouse: Addressing past limitations

The journey to the data lakehouse was paved by the very real challenges inherent in its predecessors: the data warehouse and the data lake. For decades, data warehouses stood as bastions of truth for business intelligence. They enforced strict schemas, guaranteeing data quality and consistency across an enterprise. However, by the 2010s and 2020s, their inherent rigidity became a significant impediment. Changes to source systems necessitated labor-intensive ETL (Extract, Transform, Load) processes to keep schemas synchronized, and novel data types like JSON, images, or sensor streams simply did not fit their structured mold. Furthermore, warehouses often coupled compute and storage, leading to escalating costs as data volumes grew, with organizations frequently overpaying for underutilized resources. Data freshness was another casualty, with ETL pipelines running daily or hourly, leaving decision-makers with outdated information in a fast-paced world. Crucially, warehouses, while excelling at structured SQL queries, proved ill-equipped to handle the diverse, often unstructured, and massive datasets required for machine learning and artificial intelligence.

In response, data lakes emerged, promising unbridled flexibility. By shifting to a “schema-on-read” approach, organizations could ingest virtually any data type—logs, media, semi-structured JSON, raw database dumps—at a fraction of the cost, thanks to cloud object storage. Data scientists embraced the direct access to raw, unmodeled data, accelerating their exploratory work. Yet, this newfound freedom introduced its own set of significant challenges. Without enforced schemas or strict governance, data lakes quickly devolved into “data swamps,” vast, uncurated collections of files that few trusted or understood. Security, data lineage, and access controls were often afterthoughts, inconsistently applied, making it difficult for teams to ascertain data existence or safety. Performance bottlenecks also surfaced; query engines had to scan immense directories of files, and without transactional guarantees, concurrent writes risked data corruption or incomplete results for analysts. The operational overhead of managing partitions, small files, and manual compactions became a persistent drain on engineering resources. The clear gap was a system that offered both the structural reliability of a warehouse and the open flexibility of a lake, laying the groundwork for the inevitable rise of the modern data lakehouse.

The core innovation: From directories to files in data management

Before the advent of modern lakehouse formats, Apache Hive was instrumental in making large-scale data within Hadoop clusters queryable using SQL. Hive introduced the Metastore, a vital component that stored table definitions—schemas, partitions, and locations. This allowed analysts to execute SQL-like queries over files residing in HDFS or cloud storage, marking an early, significant attempt to imbue a data-lake-like environment with a relational feel. However, Hive’s foundational approach, which managed tables by tracking directories of files, eventually presented structural limitations that became bottlenecks as data volumes and performance expectations scaled.

Evolution of table management: From Hive’s bottlenecks to modern metadata

Hive’s directory-centric table management meant that each table corresponded to a folder, and each partition to a subfolder. Query engines would scan these directories at runtime to discover files, a process that worked adequately with modest data volumes. However, in modern cloud object stores, operations like listing millions of files before query execution became agonizingly slow and expensive, often dominating total query time. Critically, Hive tables fundamentally lacked ACID (Atomicity, Consistency, Isolation, Durability) transactions. They were primarily append-only, leaving concurrent writers vulnerable to corrupting tables and exposing readers to partial data during updates. While later ACID extensions attempted to mitigate this, they introduced complexity and were not universally supported across engines. Modifying data or evolving schemas in Hive was also inefficient, requiring the rewriting of entire partitions or tables for updates, deletes, or schema changes, which often broke downstream jobs.

Another persistent issue was the “small files problem.” Frequent ingestion pipelines, especially streaming ones, generated floods of tiny files, degrading query performance because engines had to open and read from thousands of small files. Without built-in small-file management, engineers were burdened with implementing periodic compaction jobs. The turning point for modern lakehouse formats arrived with a deceptively simple yet profoundly impactful idea: shifting from tracking directories of files to explicitly tracking individual files within structured metadata. Instead of inferring table state from the file system, engines now read compact metadata that precisely lists every file belonging to a table, alongside statistics about its content. This innovation, driven by file-level manifests and versioned snapshots, enables faster query planning, atomic commits, robust schema evolution, and efficient time travel capabilities. It transformed raw data lakes into reliable, database-like systems, forming the bedrock of the lakehouse architecture.
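The shift from directory listing to file-level metadata can be sketched in a few lines of Python. This is a toy model, not the metadata layout of any real table format: the `DataFile` and `Table` classes, the `ts` column, and the file names are all assumptions made for illustration. It shows the three ideas named above: an explicit file list per snapshot, a commit that publishes a new snapshot in one step, and per-file statistics that let a planner skip files.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    path: str
    row_count: int
    min_ts: int  # smallest value of a hypothetical "ts" column in this file
    max_ts: int  # largest value of that column


@dataclass
class Table:
    # Each snapshot is an immutable tuple of the files that make up the table.
    snapshots: list = field(default_factory=lambda: [()])

    def commit(self, added=(), removed=()):
        """Publish a new snapshot in one step; readers never see a partial state."""
        current = self.snapshots[-1]
        removed_paths = {f.path for f in removed}
        kept = tuple(f for f in current if f.path not in removed_paths)
        self.snapshots.append(kept + tuple(added))
        return len(self.snapshots) - 1  # id of the new snapshot

    def plan_scan(self, ts_from, ts_to, snapshot_id=-1):
        """Pick files whose [min_ts, max_ts] range overlaps the query range."""
        return [
            f.path
            for f in self.snapshots[snapshot_id]
            if f.max_ts >= ts_from and f.min_ts <= ts_to
        ]


table = Table()
s1 = table.commit(added=[DataFile("a.parquet", 100, 0, 99),
                         DataFile("b.parquet", 100, 100, 199)])
s2 = table.commit(added=[DataFile("c.parquet", 50, 200, 249)])

print(table.plan_scan(150, 220))                  # planned from the current snapshot
print(table.plan_scan(150, 220, snapshot_id=s1))  # time travel to the first commit
```

No directory is listed at any point: planning reads only the snapshot's file list, and older snapshots stay readable because commits never mutate them.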

The new generation of open lakehouse formats

With file-level tracking established as a breakthrough, several open-source projects have redefined how data lakes function. These table formats provide the transactional, metadata-rich foundation that elevates a raw data lake into a comprehensive lakehouse. While sharing core principles like ACID transactions, schema evolution, and time travel, each project emphasizes distinct strengths and operational models, catering to varied enterprise needs.

Apache Iceberg: Openness and engine agnostic scalability

Originating at Netflix and now a top-level Apache project, Iceberg prioritizes engine-agnostic interoperability. Its hierarchical metadata structure—from table metadata to manifest lists, then to manifest files—allows it to scale efficiently to billions of files while maintaining rapid query planning. Key features include hidden partitioning and partition evolution, broad engine support across Spark, Flink, Trino, Dremio, and a strong commitment to openness via its REST catalog API. Iceberg’s rich schema evolution, encompassing column renames and type promotions, has made it a de facto standard for enterprises seeking a future-proof lakehouse built on open standards.
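Hidden partitioning is easier to see in miniature. The sketch below is a simplified model, not Iceberg's implementation or API: the `HiddenPartitionTable` class, the `event_ts` column, and the daily transform are assumptions chosen for illustration. The point is that writers and readers only ever mention the timestamp column, while the partition value is derived from it.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def day_transform(ts):
    """Partition transform: derive the partition value from the timestamp."""
    return ts.date()


class HiddenPartitionTable:
    def __init__(self):
        self.partitions = defaultdict(list)  # partition value -> rows

    def append(self, row):
        # Writers supply only event_ts; the partition value is derived here.
        self.partitions[day_transform(row["event_ts"])].append(row)

    def scan(self, start, end):
        """Return rows with start <= event_ts <= end, plus the partitions read."""
        wanted = set()
        day = start.date()
        while day <= end.date():
            wanted.add(day)
            day += timedelta(days=1)
        touched = sorted(p for p in self.partitions if p in wanted)
        rows = [
            r
            for p in touched
            for r in self.partitions[p]
            if start <= r["event_ts"] <= end
        ]
        return rows, touched


table = HiddenPartitionTable()
for day in (1, 2, 3):
    for hour in (6, 18):
        table.append({"event_ts": datetime(2025, 1, day, hour), "value": day * hour})

# The query filters on the column itself; pruning to one partition is implicit.
rows, touched = table.scan(datetime(2025, 1, 2, 0), datetime(2025, 1, 2, 23, 59))
print(len(rows), touched)
```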

Delta Lake: Spark-native transactions and optimizations

Initially developed by Databricks, Delta Lake popularized the concept of a transactional log for data lakes. It maintains an append-only transaction log (`_delta_log`) comprising JSON entries and Parquet checkpoints to meticulously track file states. Its core strengths include ACID transactions deeply integrated with Apache Spark, providing reliable time travel and schema evolution. Delta Lake also offers crucial optimizations such as Z-Ordering for data clustering, enhancing query performance. While strongly integrated within the Databricks ecosystem, its community adoption is steadily growing beyond Spark-centric workflows, making it a powerful choice for teams leveraging Spark for their data processing.
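How a transaction log yields table state can be illustrated by replaying it. The sketch below is a deliberately reduced model: it keeps only `add` and `remove` actions with a `path` field, ignores Parquet checkpoints and every other field a real log entry carries, and uses made-up file names. `live_files` is a hypothetical helper, not part of any Delta Lake library.

```python
import json

# One string per commit, each holding newline-delimited JSON actions.
LOG = [
    '{"add": {"path": "part-0000.parquet"}}\n{"add": {"path": "part-0001.parquet"}}',
    '{"add": {"path": "part-0002.parquet"}}',
    # A compaction-style commit: two files are replaced by one.
    '{"remove": {"path": "part-0000.parquet"}}\n'
    '{"remove": {"path": "part-0001.parquet"}}\n'
    '{"add": {"path": "part-0003.parquet"}}',
]


def live_files(log, version=None):
    """Replay commits 0..version and return the files that make up the table."""
    if version is None:
        version = len(log) - 1
    files = set()
    for commit in log[: version + 1]:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


print(sorted(live_files(LOG)))             # current table state
print(sorted(live_files(LOG, version=1)))  # time travel to version 1
```

Because the log is append-only, reading an older version is just a shorter replay, which is the intuition behind time travel in this model.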

Apache Hudi: Real-time updates and incremental processing

Apache Hudi, conceived at Uber, was an early pioneer in bringing database-like capabilities to data lakes, excelling in incremental processing and change data capture (CDC). It offers two storage modes: Copy-on-Write (CoW) for workloads optimized for reads, and Merge-on-Read (MoR) for write-heavy, near-real-time use cases. Hudi provides native upsert and delete operations, along with built-in indexing for efficient record-level management. With tight integrations across Spark, Flink, and Hive, Hudi is particularly well-suited for pipelines that demand frequent updates and streaming ingestion, enabling robust, real-time data platforms.
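The trade-off between the two storage modes can be shown with a toy model. The classes below are illustrative assumptions, not Hudi's API or file layout: a Python dict stands in for base files and a list of dicts stands in for delta logs. Copy-on-Write pays the rewrite cost on every upsert, while Merge-on-Read defers it to read time or to a later compaction.

```python
class CopyOnWriteTable:
    """Every upsert rewrites the base data, so reads need no merging."""

    def __init__(self):
        self.base = {}
        self.rewrites = 0

    def upsert(self, records):
        new_base = dict(self.base)  # stands in for rewriting base files
        new_base.update(records)
        self.base = new_base
        self.rewrites += 1

    def read(self):
        return dict(self.base)


class MergeOnReadTable:
    """Upserts are appended to a log; readers merge base and log on the fly."""

    def __init__(self):
        self.base = {}
        self.log = []

    def upsert(self, records):
        self.log.append(dict(records))  # cheap append, no rewrite

    def read(self):
        merged = dict(self.base)
        for delta in self.log:  # later deltas override earlier values
            merged.update(delta)
        return merged

    def compact(self):
        """Fold the log into the base so future reads are cheap again."""
        self.base = self.read()
        self.log = []


cow, mor = CopyOnWriteTable(), MergeOnReadTable()
for t in (cow, mor):
    t.upsert({"u1": "v1", "u2": "v1"})
    t.upsert({"u2": "v2"})

print(cow.read() == mor.read(), cow.rewrites, len(mor.log))
mor.compact()
```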

Apache Paimon: Streaming-first design for unified processing

A more recent entrant, Apache Paimon (formerly Flink Table Store) is architected around a streaming-first lakehouse design. It employs an LSM-tree (Log-Structured Merge-tree) style file organization to seamlessly unify both batch and stream processing. Its key features include native CDC and incremental queries, alongside deep integration with Apache Flink. Paimon boasts snapshot isolation with continuous compaction, ensuring data consistency and efficiency for high-frequency updates. Its growing ecosystem extends support to Spark, Hive, and other engines, making Paimon a compelling solution for event-driven architectures where real-time data ingestion and analytics converge.

Architecting the modern data lakehouse: A layered approach

At its essence, the data lakehouse is not a monolithic product but a cohesive architectural pattern. It skillfully merges the immense scalability and openness characteristic of data lakes with the transactional reliability and robust governance found in data warehouses. By 2025, a clear consensus emerged on its optimal structure: a successful lakehouse is built upon distinctly defined layers, each fulfilling a specific role while collaborating seamlessly to form a unified, powerful system.

The five core architectural layers explained

The foundation of any lakehouse is its Storage Layer, typically cloud object storage like Amazon S3, Azure Data Lake Storage, or Google Cloud Storage. This layer provides highly scalable, durable, and low-cost storage for all data types, decoupling compute from storage and allowing multiple engines to access the same data without duplication. Resting upon this, the Table Format Layer, powered by technologies like Apache Iceberg, Delta Lake, Hudi, or Paimon, transforms raw files into logical tables through metadata. This layer delivers ACID transactions, schema evolution, efficient query pruning, and time travel, converting a potential “data swamp” into structured, queryable datasets.

The Ingestion Layer is responsible for moving data into the lakehouse. This encompasses both batch ingestion, using tools like Fivetran or Airbyte, and increasingly, streaming ingestion from platforms such as Apache Kafka, Redpanda, or Flink for real-time event processing and change data capture. Next, the Catalog and Governance Layer acts as the central registry, tracking table metadata, enforcing access policies, and providing discovery and lineage. Examples include AWS Glue, Unity Catalog, Dremio Catalog, or open-source solutions like Apache Polaris. This layer bridges storage with consumption, ensuring data is both secure and discoverable.

Finally, the Federation and Consumption Layer is where data becomes actionable for end-users. Query engines like Dremio or Trino access tables via the catalog, providing federation capabilities to join data from disparate systems. These engines execute BI queries, run machine learning pipelines, or feed AI agents. A semantic layer often resides here, ensuring consistent metrics and business concepts across various tools, thus delivering a “single version of truth.” This layered design—storage, table format, ingestion, catalog, and consumption—has become the reference architecture, providing the flexibility, trust, and performance needed for modern data demands. For an even deeper dive into these layers, resources like the 2025 & 2026 Ultimate Guide to the Data Lakehouse Ecosystem provide extensive insights.

Operational excellence: Ingestion and streaming in the lakehouse

With the foundational lakehouse layers established, the next critical challenge involves the efficient and reliable flow of data into the system. Ingestion strategies are paramount, as they dictate not only data freshness but also the overall health of tables, the organization of files, and the downstream usability for analytics and AI. A well-designed ingestion pipeline balances the need for timely data with cost-effectiveness and system stability, ensuring data quality meets the demands of 2026 workloads.

Seamless batch and real-time data ingestion

Batch ingestion remains a prevalent entry point for many organizations. ETL/ELT services such as Fivetran and Airbyte efficiently extract data from SaaS applications, relational databases, and APIs, landing it directly into cloud object storage. Many of these tools now integrate with open table formats like Iceberg, Delta, and Hudi, circumventing the need to dump raw CSVs. Custom jobs, often built with Python, Spark, or dbt, transform and load data on schedules—whether nightly, hourly, or in micro-batches—offering predictable loads and simpler monitoring. However, data freshness is inherently limited by these schedules, and frequent batches can lead to an accumulation of small files if not carefully managed through writer-side batching or file-size thresholds.
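Writer-side batching with a file-size threshold can be sketched as follows. The `SizeThresholdWriter` class is a hypothetical stand-in for a real file writer: it only records how many records and bytes each flushed file would contain, and the 400-byte target is chosen to keep the example small.

```python
class SizeThresholdWriter:
    """Buffer records and emit a file only once the buffer reaches a target size."""

    def __init__(self, target_bytes):
        self.target_bytes = target_bytes
        self.buffer = []
        self.buffered_bytes = 0
        self.files = []  # (record_count, byte_size) for each flushed file

    def write(self, record):
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        if self.buffered_bytes >= self.target_bytes:
            self.flush()

    def flush(self):
        """Emit whatever is buffered; call once at the end of a batch."""
        if self.buffer:
            self.files.append((len(self.buffer), self.buffered_bytes))
            self.buffer = []
            self.buffered_bytes = 0


writer = SizeThresholdWriter(target_bytes=400)
for _ in range(10):
    writer.write(b"x" * 100)  # ten 100-byte records
writer.flush()                # end of batch: flush the remainder

print(writer.files)
```

Ten records land in three files instead of ten. In practice the threshold would be far larger, and the final partial file is the kind of leftover that later compaction picks up.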

Conversely, streaming ingestion has evolved from a luxury to an expectation in 2026, driven by the demand for real-time insights. Event streaming platforms like Apache Kafka, Redpanda, Aiven, Confluent, StreamNative, and Apache Pulsar capture high-velocity events, such as clickstream or IoT data, and push them into the lakehouse using specialized connectors or stream processors. Change Data Capture (CDC) pipelines, exemplified by tools like Debezium and Estuary Flow, replicate updates from operational databases into Iceberg or Delta tables with minimal latency, keeping the lakehouse in step with its sources. Stream processing engines like Apache Flink and Spark Structured Streaming apply transformations in-line, committing results directly to lakehouse tables. Making this convergence of batch and real-time reliable takes a few supporting practices: small-file management, idempotent pipeline design so that retries do not duplicate data, schema drift handling so that new source columns are added gracefully, and data observability platforms like Monte Carlo that alert on load failures or unexpected data volume deviations.
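Idempotent commits and schema drift handling are simple to model. The `IdempotentSink` class below is a hypothetical sketch, not the behavior of any particular connector: it assumes each batch carries a stable `batch_id`, deduplicates on that id, and widens the column list when a record brings a column it has not seen before.

```python
class IdempotentSink:
    def __init__(self, columns):
        self.columns = list(columns)
        self.rows = []
        self.committed_batches = set()

    def commit(self, batch_id, records):
        """Apply a batch once; a retried batch with the same id is a no-op."""
        if batch_id in self.committed_batches:
            return False
        for record in records:
            for column in record:
                if column not in self.columns:
                    self.columns.append(column)  # schema drift: add the column
        self.rows.extend(records)
        self.committed_batches.add(batch_id)
        return True

    def read(self):
        # Rows written before a column existed read back as None for it.
        return [{c: row.get(c) for c in self.columns} for row in self.rows]


sink = IdempotentSink(columns=["id", "amount"])
first = sink.commit("batch-1", [{"id": 1, "amount": 10}])
retry = sink.commit("batch-1", [{"id": 1, "amount": 10}])  # delivered again
drift = sink.commit("batch-2", [{"id": 2, "amount": 5, "currency": "EUR"}])

print(first, retry, drift)
print(sink.read())
```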

Embracing streaming-first workloads

If 2025 was about batch and real-time convergence, 2026 is truly the year of streaming-first lakehouses. Streaming is no longer an add-on; it’s an inherent capability, from ingestion to processing and query serving. Confluent, a commercial leader in Apache Kafka, has spearheaded this shift, with products like Tableflow and Stream Designer writing directly to Iceberg and Delta Lake, offering exactly-once guarantees for CDC ingestion. This transforms Kafka topics into real-time queryable lakehouse tables, reducing the need for custom Flink or Spark jobs. Similarly, Aiven has expanded its Kafka, Flink, and Postgres services with native Iceberg integrations, offering a turnkey solution for capturing, transforming, and landing events into a governed lakehouse.

Redpanda, providing Kafka-API-compatible streaming with superior throughput and lower latency, introduced Iceberg Topics in 2025, enabling every topic to automatically materialize into an Iceberg table. This innovation merges log storage with table metadata, allowing developers to interact with the same data as both a stream and a table. StreamNative, built around Apache Pulsar, extends the lakehouse deeper into event-driven architectures by unifying messaging and lakehouse storage, making historical message backlogs instantly queryable as tables. RisingWave focuses on streaming databases, publishing real-time materialized views directly into the lakehouse, bridging operational and historical analytics within a unified architecture. Other notable platforms include Materialize for real-time materialized views and ksqlDB for Kafka-native SQL transformations, all contributing to a lakehouse where batch and real-time seamlessly coexist, delivering always-fresh data.

Lakehouse catalogs: The control plane for open data

The lakehouse catalog serves as the indispensable control plane for open tables, meticulously tracking metadata locations, managing permissions, and exposing standardized APIs to various query engines. By 2026, understanding the diverse landscape of catalog options is crucial for data professionals aiming to build multi-engine, multi-cloud lakehouse environments. These catalogs are more than just directories; they are the central nervous system that ensures data discoverability, security, and interoperability across a complex data ecosystem.

Navigating the landscape of lakehouse catalog solutions

For teams standardizing on Apache Iceberg and seeking vendor-neutral interoperability, Apache Polaris (Incubating) is an open-source, full-featured REST catalog. It is designed to be consumed by Spark, Flink, Trino, Dremio, and StarRocks/Doris via the Iceberg REST API, with its open governance minimizing lock-in. Apache Gravitino (Incubating) offers a “catalog of catalogs” approach, managing metadata across heterogeneous sources—file stores, RDBMS, streams—and presenting a unified view, ideal for hybrid/multi-cloud estates needing a single governance and discovery layer. For those entrenched in specific cloud ecosystems, AWS Glue Data Catalog (with Lake Formation) provides a managed, Hive-compatible catalog with native support for Iceberg/Delta/Hudi tables, integrating seamlessly with Athena, EMR, and Redshift Spectrum.

Microsoft Fabric’s OneLake Catalog serves as the central, Delta-native catalog for its unified data platform, offering discovery and governance across Spark, SQL, and Power BI over ADLS/OneLake. On Google Cloud, Google BigLake offers an open lakehouse layer, where its Metastore catalogs Iceberg tables on GCS, enabling native BigQuery reads and multi-engine access via the Iceberg REST interface. For robust data versioning and reproducibility, Project Nessie acts as a Git-like, transactional catalog atop Iceberg, adding branches, tags, and time-travel capabilities for isolated dev/test environments. Databricks’ Unity Catalog, now open-sourced, aims for universal governance across Delta, Iceberg (via REST/UniForm), and other assets, offering broad policy enforcement across engines and clouds. Finally, Lakekeeper, a lightweight, Rust-based Apache Iceberg REST catalog, emphasizes speed and security, making it suitable for self-hosted open lakehouses. When designing a multi-engine Iceberg lakehouse, a common and effective pattern is to deploy Polaris as the primary REST catalog for engines, then layer in Nessie for advanced branching and isolated environments, offering both flexibility and powerful version control.
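The Git-like branching idea can be illustrated with a toy catalog. This is a conceptual sketch only: `BranchingCatalog`, its methods, and the metadata paths are invented for illustration and do not reflect the Nessie or Polaris APIs. The merge here simply lets the source branch win, whereas a real catalog would detect conflicts.

```python
class BranchingCatalog:
    """Maps branch name -> {table name -> pointer to current table metadata}."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, from_branch="main"):
        # A branch starts as a copy of pointers, not a copy of data.
        self.branches[name] = dict(self.branches[from_branch])

    def commit(self, branch, table, metadata_location):
        self.branches[branch] = {**self.branches[branch], table: metadata_location}

    def get(self, branch, table):
        return self.branches[branch][table]

    def merge(self, source, target="main"):
        # Naive merge: pointers on the source branch replace those on the target.
        self.branches[target] = {**self.branches[target], **self.branches[source]}


catalog = BranchingCatalog()
catalog.commit("main", "orders", "warehouse/orders/metadata/v1.json")

catalog.create_branch("etl_test")
catalog.commit("etl_test", "orders", "warehouse/orders/metadata/v2.json")

main_before_merge = catalog.get("main", "orders")  # still v1: work is isolated
catalog.merge("etl_test")
main_after_merge = catalog.get("main", "orders")   # now v2

print(main_before_merge, main_after_merge)
```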

Optimizing your lakehouse for performance and scale

Bringing a data lakehouse into production is just the beginning; sustaining its performance and efficiency at scale requires continuous optimization. Without diligent management, query times inevitably lengthen, storage costs can escalate, and data reliability may diminish. The core of optimization lies in meticulously managing both the physical layout of data and the growth of metadata. By proactively addressing these areas, organizations can ensure their lakehouse remains a high-performing and cost-effective asset, even as data volumes and workloads expand through 2026.

Maintaining peak performance: Compaction and metadata hygiene

A common challenge in lakehouse environments, particularly with frequent batch loads and streaming pipelines, is the proliferation of thousands of small Parquet or ORC files. This “small files problem” forces query engines to spend an inordinate amount of time opening files rather than scanning actual data, severely impacting performance. The solution lies in the compaction capabilities of modern table formats, which rewrite numerous small files into fewer, larger ones (ideally hundreds of MB up to about 1 GB each), drastically improving query efficiency. For instance, Apache Iceberg’s `rewriteDataFiles` action and Delta Lake’s `OPTIMIZE` command (which can be combined with `ZORDER BY` to cluster data) merge small files. Apache Hudi, designed for streaming, offers asynchronous background compaction for Merge-on-Read tables, continuously optimizing write-heavy use cases. These mechanisms are crucial for maintaining a healthy file structure and preventing performance degradation.
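The planning step behind compaction is essentially bin packing. The sketch below is a simplified illustration, not the algorithm used by `rewriteDataFiles` or `OPTIMIZE`: `plan_compaction` is a hypothetical helper that groups small files greedily until a group would exceed the target size, and it skips files that are already large enough.

```python
MB = 1024 * 1024


def plan_compaction(file_sizes, target_bytes):
    """Group small files so each group can be rewritten as one larger file."""
    groups, current, current_size = [], [], 0
    for path, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if size >= target_bytes:
            continue  # already at or above the target; leave it alone
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    # Rewriting a single file on its own gains nothing, so drop such groups.
    return [g for g in groups if len(g) > 1]


files = {
    "a.parquet": 10 * MB,
    "b.parquet": 20 * MB,
    "c.parquet": 30 * MB,
    "d.parquet": 100 * MB,
    "e.parquet": 200 * MB,
}
plan = plan_compaction(files, target_bytes=128 * MB)
print(plan)
```

With a 128 MB target, the three smallest files form one rewrite group, while the 100 MB and 200 MB files are left untouched.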

Equally vital is the management of metadata growth. Modern table formats retain snapshots for time travel, a powerful feature, but unchecked snapshot accumulation can lead to metadata bloat, affecting both storage and query planning. Regular metadata cleanup is essential: Iceberg’s `expireSnapshots` safely removes old snapshots and associated data files, while Delta Lake’s `VACUUM` command cleans up unreferenced files after a specified retention period. Hudi’s timeline service supports configurable retention for commits and delta logs. Beyond physical file layout and metadata, intelligent partitioning and clustering strategies significantly reduce the amount of data scanned per query. Iceberg’s hidden partitions abstract complexity from end-users, and its partition evolution capabilities allow dynamic changes to partitioning strategies without breaking historical data. Query acceleration techniques further boost speed; Dremio’s Reflections and materialized views provide always-fresh, cache-like performance enhancements that adapt to changing workloads without manual tuning. Column statistics and bloom filters stored within metadata also allow query engines to skip irrelevant files entirely, further streamlining execution. Lakehouse optimization is an ongoing discipline, ensuring predictable performance and controlled costs at scale.
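Snapshot expiration comes down to one safety rule: a data file may be deleted only when no retained snapshot still references it. The function below is a minimal sketch of that rule with made-up file names; `expire_snapshots` here is a hypothetical helper, not the Iceberg or Delta Lake call of a similar name.

```python
def expire_snapshots(snapshots, retain_last):
    """snapshots: list of (snapshot_id, set_of_file_paths), oldest first.

    Returns the retained snapshots and the files that are safe to delete.
    """
    if retain_last < 1:
        raise ValueError("retain_last must be at least 1")
    kept = snapshots[-retain_last:]
    expired = snapshots[:-retain_last]
    still_referenced = set().union(*(files for _, files in kept))
    expired_files = set().union(*(files for _, files in expired))
    return kept, expired_files - still_referenced


history = [
    (1, {"a.parquet", "b.parquet"}),
    (2, {"b.parquet", "c.parquet"}),  # a.parquet was replaced
    (3, {"c.parquet", "d.parquet"}),  # b.parquet was replaced
]
kept, deletable = expire_snapshots(history, retain_last=2)
print([sid for sid, _ in kept], sorted(deletable))
```

Only `a.parquet` is deletable: `b.parquet` was dropped from the latest snapshot but is still needed for time travel to snapshot 2.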

Beyond analytics: Lakehouse for agentic AI, Python, and graphs

The utility of the lakehouse extends far beyond traditional business intelligence, evolving into a multifaceted platform that drives the next generation of data-intensive applications. By the end of 2025, the industry conversation shifted profoundly from merely managing data to making it inherently intelligent and AI-ready. This transition underscores the lakehouse’s critical role in powering complex agentic AI, enabling Python-native development workflows, and even integrating sophisticated graph analytics directly within the data platform, transforming raw data into actionable intelligence across diverse domains.

Powering agentic AI with intelligent lakehouse platforms like Dremio

Traditional BI queries are often predictable, yielding weekly reports or dashboards. In contrast, AI-driven workloads, particularly those involving large language models and autonomous agents, generate dynamic, ad-hoc queries spanning datasets in unpredictable ways. This demands consistent low-latency responses, a self-optimizing platform, and seamless integration between structured data, semantic meaning, and AI agents. Dremio exemplifies this intelligent lakehouse paradigm. It functions as more than a query engine; it is a self-optimizing lakehouse platform built natively on Apache Iceberg and Arrow. Dremio’s Reflections provide always-fresh materializations that automatically accelerate queries, adapting to changing workloads without manual tuning. Its semantic layer offers a unified space to define datasets, metrics, and business concepts, ensuring consistency whether an analyst is writing SQL or an AI agent is generating queries. Through Arrow Flight and REST endpoints, Dremio streams data directly into Python, notebooks, or AI frameworks with zero-copy efficiency, effectively bridging the gap between analytics and machine learning pipelines. By embracing open standards like Iceberg, Polaris for catalogs, and Arrow, Dremio guarantees interoperability, allowing AI agents or external engines to interact with the same governed data without vendor lock-in. This enables successful agentic analytics applications, transforming raw data into reliable, actionable, and AI-ready insights.

Python-native tools for lakehouse workflows

Python has firmly established itself as the lingua franca of modern data engineering and data science, and the lakehouse ecosystem has embraced this trend. By 2026, a robust suite of Python-first tools and frameworks emerged, simplifying the ingestion, processing, analysis, and serving of data directly from open table formats such as Apache Iceberg, Delta, and Hudi. These tools facilitate lightweight experimentation and power production-grade pipelines, rivaling traditional big data stacks. DuckDB, often dubbed the “SQLite for analytics,” is an in-process analytical database that excels at local workloads. It directly queries Parquet files and integrates with Iceberg catalogs, making it ideal for prototyping, ad hoc exploration, and embedding analytics within applications due to its vectorized execution engine. Dask, a parallel computing framework, scales Python workflows from laptops to clusters, working with familiar NumPy, pandas, and scikit-learn APIs while distributing workloads, perfect for machine learning preprocessing or large-scale data transformations where Spark might be excessive. Daft, a newer distributed data processing engine, is optimized for AI and ML workloads. Built on Apache Arrow for fast columnar in-memory processing, Daft runs locally or on clusters, supporting both CPUs and GPUs, reading directly from Parquet and Iceberg sources for high-performance pipelines. Bauplan Labs delivers a serverless, Python-first lakehouse approach, enabling data pipelines written in Python to execute in an automatically scaling serverless runtime. Bauplan integrates Iceberg tables with Git-like branching via catalogs like Nessie, making schema and data versioning first-class features, emphasizing reproducibility and modular pipelines with minimal infrastructure overhead. Together, these tools provide Python developers with a comprehensive toolkit for building, maintaining, and consuming modern data lakehouses efficiently.

Graph analytics seamlessly integrated with PuppyGraph

While data lakehouses excel at tabular and relational analytics, many critical real-world problems are inherently graph-shaped, such as detecting fraud rings, analyzing identity networks, optimizing supply chains, and understanding data lineage. Traditionally, these challenges necessitated loading data into specialized graph databases, adding an extra layer of ETL and storage complexity. PuppyGraph revolutionizes this by bringing graph analytics directly into the lakehouse. PuppyGraph is a cloud-native graph engine that operates atop existing data in your lakehouse, allowing you to query Iceberg, Delta, Hudi, or Hive tables as a graph without proprietary databases or data duplication. It connects directly to open table formats, relational databases, and warehouses, automatically sharding and scaling queries. This means existing datasets can be transformed into a graph model in minutes, without additional ETL. PuppyGraph integrates seamlessly with Apache Iceberg (including REST catalogs like Tabular or Polaris), Delta Lake, Apache Hudi, Hive Metastore, AWS Glue, and traditional databases like PostgreSQL and Redshift. Each source is treated as a catalog, allowing the definition of a graph schema across multiple data sources. For instance, connecting customer nodes in PostgreSQL with transaction edges in Iceberg becomes a reality, all without moving the underlying data. Supporting popular graph query languages like Gremlin (Apache TinkerPop) and openCypher, PuppyGraph ensures compatibility with existing graph tooling. Its performance is optimized for large, complex traversals, demonstrating multi-hop traversals over hundreds of millions of edges in under a second, often surpassing traditional graph database performance in cached modes. 
Beyond traditional graph applications, PuppyGraph supports Graph RAG (Retrieval Augmented Generation), enabling LLMs and agents to query structured relationships for enhanced context and reasoning, effectively removing the barrier between tabular and graph analytics within the lakehouse.
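The idea of querying tabular rows as a graph, without first loading them into a separate graph store, can be shown with a small traversal. This is a conceptual sketch in plain Python and does not use PuppyGraph, Gremlin, or openCypher: the `transfers` rows, the account names, and the `within_hops` helper are assumptions for illustration.

```python
from collections import deque

# Rows of a hypothetical "transfers" table: (from_account, to_account).
transfers = [
    ("acct_a", "acct_b"),
    ("acct_b", "acct_c"),
    ("acct_c", "acct_d"),
    ("acct_x", "acct_y"),
]


def neighbors(account):
    """Read edges straight from the table rows; nothing is copied elsewhere."""
    return [dst for src, dst in transfers if src == account]


def within_hops(start, max_hops):
    """Breadth-first traversal: accounts reachable in at most max_hops steps."""
    seen = {start}
    reached = {}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                reached[nxt] = depth + 1
                frontier.append((nxt, depth + 1))
    return reached


print(within_hops("acct_a", max_hops=2))
```

A real engine would push the edge lookups down to the table format and parallelize them; the sketch only shows that the graph is a view over rows that already exist.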

Future horizons: Extending the lakehouse to the edge and beyond

As organizations increasingly adopt AI-driven applications, the edge has emerged as a crucial deployment frontier. Instead of centralizing all data processing, inference can now run closer to where data is generated: on IoT devices, factory floors, mobile applications, or in regional data centers. The lakehouse, traditionally a central cloud hub, is extending its reach outward. Platforms like Spice AI help enable this transition by bringing lakehouse capabilities closer to the source of action and insight. Comparisons between leading data platforms, such as Snowflake vs Databricks, likewise highlight how much foundational architecture choices matter.

Edge inference: Bringing AI closer to data generation with Spice AI

Edge inference matters for several reasons. It drastically reduces latency, enabling AI decisions in milliseconds for use cases like predictive maintenance or real-time fraud detection that cannot tolerate a round trip to the cloud. It lowers cost by reducing bandwidth and centralized cloud compute needs, especially for high-volume sensor data. It improves resilience, because applications keep working through intermittent network connectivity and sync back to the central lakehouse when a connection is available. It also brings privacy and compliance benefits, since processing data locally minimizes the movement of sensitive information.

Spice AI positions itself as an operational data lakehouse tailored for real-time and AI workloads at the edge. It uses the Rust-based DataFusion engine (part of the Arrow ecosystem) for high-performance local querying, so lightweight nodes can join, filter, and aggregate data directly. Spice AI combines vector search over embeddings with SQL-style queries, allowing edge applications to run semantic AI lookups and structured analytics in a single step. It runs in containers or edge environments with a small footprint and supports open table formats like Iceberg, Delta, and Hudi. Crucially, its hybrid sync capabilities allow local inferences and results to be materialized and then synchronized back to the central lakehouse, preserving global consistency without sacrificing local responsiveness.
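The combination of a structured filter with a vector lookup can be sketched in a few lines of plain Python. This is not Spice AI's API. The readings, the three-dimensional embeddings, and the function names are all made up for illustration, and a real engine would use an index rather than scanning every row. The sketch only shows the shape of a hybrid query: narrow the rows with a SQL-style predicate, then rank the survivors by embedding similarity.

```python
import math

# Hypothetical sensor readings cached on an edge node.
# The "embedding" vectors are invented values, not model output.
readings = [
    {"device": "pump-1", "site": "plant-a", "temp_c": 91.0, "embedding": [0.9, 0.1, 0.0]},
    {"device": "pump-2", "site": "plant-a", "temp_c": 62.0, "embedding": [0.8, 0.2, 0.1]},
    {"device": "fan-7", "site": "plant-a", "temp_c": 95.0, "embedding": [0.1, 0.9, 0.2]},
    {"device": "pump-9", "site": "plant-b", "temp_c": 97.0, "embedding": [1.0, 0.0, 0.0]},
]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_search(rows, query_embedding, predicate, top_k):
    """Apply the structured filter first, then rank the remaining rows by similarity."""
    candidates = [row for row in rows if predicate(row)]
    ranked = sorted(
        candidates,
        key=lambda row: cosine_similarity(row["embedding"], query_embedding),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    # Roughly: WHERE site = 'plant-a' AND temp_c > 80, ordered by similarity.
    hot_at_plant_a = lambda row: row["site"] == "plant-a" and row["temp_c"] > 80.0
    for row in hybrid_search(readings, [1.0, 0.0, 0.0], hot_at_plant_a, top_k=2):
        print(row["device"], row["temp_c"])
```

Filtering before ranking keeps the similarity computation small, which matters on constrained edge hardware. Whether a given engine filters first or searches the vector index first is an implementation detail this sketch does not try to capture.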

DuckLake: A fresh perspective on metadata management

While formats like Iceberg, Delta, and Hudi significantly advanced the lakehouse by bringing ACID transactions to data lakes, they also introduced operational complexity in the form of JSON and Avro metadata and manifest files and separate catalog services. DuckLake, a new open table format developed by the DuckDB team, offers a simpler alternative. Its core premise is to store the entire metadata layer (schemas, snapshots, table versions, statistics, and transactions) in a relational database, while keeping the actual table data as Parquet files in object storage or on local filesystems. This architecture eliminates the need for manifest lists, an external Hive Metastore, or additional catalog API services, relying instead on standard SQL tables to track metadata.

The design has several practical consequences. Commits are faster because each one is a single SQL transaction. Consistency is strong because it comes from the database rather than from an eventually consistent file store. Operations are simpler because the metadata database can be backed up or replicated like any other database. DuckLake also enables advanced features such as multi-table transactions, time travel, and transactional schema changes without a complex infrastructure stack. It ships as a DuckDB extension, which lets users create, insert, update, delete, and query tables with full ACID guarantees.

Although DuckLake is its own format, it is designed for interoperability. Its Parquet and delete files are compatible with Iceberg, and it can import Iceberg metadata directly, even preserving snapshot history. That makes it a flexible companion in mixed-format environments and a potential bridge for experimentation or migration.
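The idea of keeping table metadata in ordinary SQL tables can be illustrated with a small sketch that uses Python's built-in sqlite3 module. The table and column names below are hypothetical and are not DuckLake's actual schema, and SQLite merely stands in for whatever relational database holds the metadata. The sketch shows why a commit becomes a single SQL transaction, how one commit can touch more than one table, and how time travel reduces to a query against snapshot ids.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Create the (hypothetical) metadata tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS snapshot (
            snapshot_id  INTEGER PRIMARY KEY AUTOINCREMENT,
            committed_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS data_file (
            table_name       TEXT NOT NULL,
            path             TEXT NOT NULL,   -- Parquet file in object storage
            added_snapshot   INTEGER NOT NULL,
            removed_snapshot INTEGER          -- NULL while the file is live
        );
        """
    )
    return conn


def commit(conn, added=(), removed=()):
    """Record one snapshot; every file change lands in the same SQL transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        snapshot_id = conn.execute("INSERT INTO snapshot DEFAULT VALUES").lastrowid
        conn.executemany(
            "INSERT INTO data_file (table_name, path, added_snapshot) VALUES (?, ?, ?)",
            [(table, path, snapshot_id) for table, path in added],
        )
        conn.executemany(
            "UPDATE data_file SET removed_snapshot = ? "
            "WHERE table_name = ? AND path = ? AND removed_snapshot IS NULL",
            [(snapshot_id, table, path) for table, path in removed],
        )
    return snapshot_id


def files_at(conn, table_name, snapshot_id):
    """Time travel: list the data files that were live as of a given snapshot."""
    rows = conn.execute(
        "SELECT path FROM data_file "
        "WHERE table_name = ? AND added_snapshot <= ? "
        "AND (removed_snapshot IS NULL OR removed_snapshot > ?) "
        "ORDER BY path",
        (table_name, snapshot_id, snapshot_id),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_catalog()
    # One commit that touches two tables at once (a multi-table transaction).
    first = commit(conn, added=[("orders", "orders/a.parquet"),
                                ("customers", "customers/a.parquet")])
    # A later commit that replaces a file, as a rewrite or compaction might.
    second = commit(conn, added=[("orders", "orders/b.parquet")],
                    removed=[("orders", "orders/a.parquet")])
    print(files_at(conn, "orders", first))
    print(files_at(conn, "orders", second))
```

Because readers only ever see committed rows, there is no window in which a partially written manifest is visible. That is the property the relational approach gets from the database rather than from careful file ordering.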
