Components

We use an asynchronous event architecture to automate data processing, carrying data from raw application assets to production-ready datasets.

We've blended together a selection of popular open-source tools, managed serverless infrastructure, wiring, and simplified interfaces to modularize data productization. The three main components of the system are a data lake, data pipelines, and a compliance audit trail.

Data Lake

The basis of a Lagoon is an Iceberg data lake, a high-performance format for huge analytic tables. Storage is cheap; moving data and compute are expensive. Having made this mistake many times ourselves, we've learned the first step is always to get all of your raw data together, in a flexible format, for permanent storage. No matter how good your pipeline is, something always comes up and you need to re-materialize. Keeping your raw assets in structured Iceberg tables in S3 enables performant queries without additional data transfer costs, since pipelines run in-region.
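
As a rough sketch of what landing a raw asset can look like, here's one way to append a batch of records into an Iceberg table using the awswrangler library. The bucket, database, and table names are placeholders, and your Lagoon may wire this step up differently:

```python
import awswrangler as wr
import pandas as pd

# A batch of raw application records; in practice this might come from
# an event stream, an export job, or a direct upload.
raw_events = pd.DataFrame(
    {
        "event_id": ["a1", "a2"],
        "user_id": ["u-100", "u-101"],
        "payload": ['{"action": "signup"}', '{"action": "login"}'],
    }
)

# Append the batch into an Iceberg table backed by S3. Athena maintains
# the Iceberg metadata, so the table is immediately queryable.
# (All names below are placeholders.)
wr.athena.to_iceberg(
    df=raw_events,
    database="lagoon_raw",
    table="app_events",
    table_location="s3://my-lagoon-bucket/raw/app_events/",
    temp_path="s3://my-lagoon-bucket/tmp/",
)
```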

With Iceberg as the underlying table structure, you only pay for compute when queries run. The lake is pre-configured for use with AWS Athena for SQL and Spark-based querying, or you can query it with any number of popular tools, such as Snowflake, Databricks, Airflow, PySpark, EMR, dbt, Fivetran, and Flink.
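
For example, a quick way to run an Athena query from Python is awswrangler (the database and table names here are placeholders):

```python
import awswrangler as wr

# Run a SQL query through Athena against the lake; you pay only for
# the compute this query consumes.
df = wr.athena.read_sql_query(
    sql="""
        SELECT user_id, COUNT(*) AS event_count
        FROM app_events
        GROUP BY user_id
    """,
    database="lagoon_raw",
)
print(df.head())
```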

Data Pipelines

With your raw data assets consolidated and structured, the next step is to transform them into production-ready datasets. This often includes cleaning steps like standardization and filtering, unification steps like joining and enrichment, and compliance steps like de-identification and sanitization. Your Lagoon comes with dbt and dagster already set up for creating and automating data pipelines.
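
To make those steps concrete, here's a hedged illustration of the kind of logic involved, written as plain Python over a DataFrame. In your Lagoon this logic would live in dbt models instead, and every column name here is hypothetical:

```python
import hashlib

import pandas as pd

def clean_and_deidentify(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Standardization: normalize casing and whitespace.
    df["email"] = df["email"].str.strip().str.lower()

    # Filtering: drop records that fail basic validity checks.
    df = df[df["email"].str.contains("@", na=False)]

    # De-identification: replace a direct identifier with a one-way hash
    # so downstream consumers never see the raw value.
    df["user_key"] = df["email"].map(
        lambda e: hashlib.sha256(e.encode()).hexdigest()
    )
    return df.drop(columns=["email"])
```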

All your team needs to do is add your transformation logic using simple dbt models. The Lagoon comes set up with event automation and orchestration of DAGs using dagster. As you add new raw data assets, your pipelines automatically execute and deliver the results back into your S3 bucket as Iceberg tables for use in production workloads.
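
As a minimal sketch of that event automation, assuming a dagster sensor that watches your raw prefix and kicks off a dbt run (the bucket name and prefix are placeholders, and your Lagoon's actual wiring may differ):

```python
import subprocess

import boto3
from dagster import RunRequest, SkipReason, job, op, sensor

@op
def run_dbt_build():
    # Materialize the dbt models; results land back in S3 as Iceberg tables.
    subprocess.run(["dbt", "build"], check=True)

@job
def transform_job():
    run_dbt_build()

@sensor(job=transform_job)
def new_raw_asset_sensor(context):
    # Poll the raw prefix for objects newer than the last cursor.
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(
        Bucket="my-lagoon-bucket",  # placeholder bucket
        Prefix="raw/",
        StartAfter=context.cursor or "",
    )
    contents = resp.get("Contents", [])
    if not contents:
        yield SkipReason("No new raw assets under the prefix")
        return
    newest = max(obj["Key"] for obj in contents)
    context.update_cursor(newest)
    yield RunRequest(run_key=newest)
```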

🚧

We're working on an open-source dbt library of common transformations. If there's a particular step you'd love to see included, reach out! We'll get it on the roadmap.

Audit Trail

Compliance. More specifically, automated enforcement. Productizing and sharing data, even internally, is both necessary and sensitive. If you're handling large sets of data, you likely already have a data governance solution in place to capture and handle permissions. If not, we have a ton of tooling for mobile and web-based apps to collect and even incentivize consent.

However, the tricky part is rarely the collection of consent; it's ensuring that when you expose data, it's both safe and within the bounds of your legal agreements (at least, that's been our experience). Every Lagoon includes an immutable audit trail of data license records and transactions, with metadata compatible with Iceberg and dbt. Via a REST API, you can connect existing governance tools or applications to generate enforceable data licenses. Reference these licenses within pipelines or via query to ensure production data always stays within legal bounds.
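
As a rough sketch of that flow, a governance tool might register a license and a pipeline might verify it before materializing a dataset. The base URL, paths, and fields below are illustrative placeholders, not the actual API reference:

```python
import requests

LAGOON_API = "https://api.example-lagoon.dev"  # placeholder base URL

# Register an enforceable license from your governance tool.
# (Path and fields are illustrative placeholders.)
license_resp = requests.post(
    f"{LAGOON_API}/licenses",
    json={
        "dataset": "prod.user_events",
        "scope": "internal-analytics",
        "expires": "2026-01-01",
    },
    timeout=10,
)
license_resp.raise_for_status()
license_id = license_resp.json()["id"]

# Inside a pipeline, verify the license before exposing data.
check = requests.get(f"{LAGOON_API}/licenses/{license_id}", timeout=10)
check.raise_for_status()
assert check.json()["status"] == "active", "license is not active"
```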