Storage Layout

The easiest way to understand how a Lagoon works is to learn about the data storage model.

Each lagoon is a single S3 bucket containing all of your data assets. If you elect to self-host, the bucket comes with only the configuration a Lagoon needs to function. We recommend you also set up backups, replication, and logging.

Folder Structure

/load
Parquet data files waiting to be written into the Iceberg lake. Add files to the bucket using the key pattern load/<database>/<table>/<filename>.parquet and they will automatically either create or merge into the corresponding Iceberg tables under /stg. Daily partitioning is added automatically based on ingestion time, using the dbt/dagster-compatible column _etl_loaded_at. Once loaded, files are deleted from this directory so that duplicate data is not stored.
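
A minimal sketch of building a key in that pattern, in Python. The helper names load_key and upload_for_load, and the example database, table, and file names, are hypothetical and not part of Lagoon. The upload step assumes boto3 is installed and that you supply your own bucket name.

```python
from pathlib import PurePosixPath


def load_key(database: str, table: str, filename: str) -> str:
    """Build an object key of the form load/<database>/<table>/<filename>.parquet."""
    for part in (database, table, filename):
        # Each segment must be non-empty and must not add extra path levels.
        if not part or "/" in part:
            raise ValueError(f"invalid key segment: {part!r}")
    # Append the .parquet extension only if the caller left it off.
    if PurePosixPath(filename).suffix != ".parquet":
        filename = f"{filename}.parquet"
    return f"load/{database}/{table}/{filename}"


def upload_for_load(local_path: str, bucket: str, key: str) -> None:
    """Upload a local Parquet file to the lagoon bucket (assumes boto3)."""
    import boto3  # imported lazily so load_key works without boto3 installed

    boto3.client("s3").upload_file(local_path, bucket, key)


print(load_key("sales", "orders", "part-0"))
```

Validating the segments before upload helps keep a stray slash from placing a file under an unintended database or table.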

/stg
Iceberg tables full of raw application data live here under the key prefix stg/<database>/<table>/. Within each table directory are two folders, data/ and metadata/, containing the files necessary for Iceberg compatibility. If a table already exists, new data is merged in. If a table does not exist, it is created automatically using the schema from the first Parquet file loaded.
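
To illustrate the layout, here is a small Python sketch that splits an object key under /stg into its database, table, and folder parts. The StgKey type, the parse_stg_key function, and the example key are hypothetical; only the prefix shape and the data/ and metadata/ folder names come from the description above.

```python
from typing import NamedTuple


class StgKey(NamedTuple):
    database: str
    table: str
    folder: str  # either "data" or "metadata"
    rest: str    # remainder of the key inside that folder


def parse_stg_key(key: str) -> StgKey:
    """Split a key shaped like stg/<database>/<table>/<data|metadata>/<rest>."""
    # Limit the split to 4 so anything nested below the folder stays in `rest`.
    parts = key.lstrip("/").split("/", 4)
    if len(parts) != 5 or parts[0] != "stg" or parts[3] not in ("data", "metadata"):
        raise ValueError(f"not a recognized /stg table key: {key!r}")
    return StgKey(parts[1], parts[2], parts[3], parts[4])


print(parse_stg_key("stg/sales/orders/data/part-0.parquet"))
```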

/tmp
Temporary and intermediate tables required by your dbt pipelines are stored here as needed.

/prd
Iceberg tables full of production data live here under the key prefix prd/<database>/<table>/. Within each table directory are two folders, data/ and metadata/, containing the files necessary for Iceberg compatibility. This is where the results of your pipelines are delivered.

/prepare
A pre-load staging area for non-Parquet files. You can add files in any supported format (JSON, CSV, Avro, ORC); they are batched together, converted to Parquet, and moved into /load for loading into the data lake. Add files using the key prepare/<database>/<table>/<filename>.<filetype> for automatic conversion. You can asynchronously stream data directly to this directory or periodically bulk dump an export from a database.
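
As a rough sketch, a key for this directory could be built and checked in Python as follows. The prepare_key helper and the lowercase extension spellings in SUPPORTED_FILETYPES are assumptions made for illustration; only the four format names and the key pattern come from the text above.

```python
# Assumed lowercase extensions for the supported formats listed above.
SUPPORTED_FILETYPES = {"json", "csv", "avro", "orc"}


def prepare_key(database: str, table: str, filename: str) -> str:
    """Build a key of the form prepare/<database>/<table>/<filename>.<filetype>."""
    for part in (database, table, filename):
        # Each segment must be non-empty and must not add extra path levels.
        if not part or "/" in part:
            raise ValueError(f"invalid key segment: {part!r}")
    # Split on the last dot to find the extension.
    stem, dot, ext = filename.rpartition(".")
    if not dot or not stem or ext.lower() not in SUPPORTED_FILETYPES:
        raise ValueError(f"unsupported or missing file type: {filename!r}")
    return f"prepare/{database}/{table}/{filename}"


print(prepare_key("sales", "orders", "export.csv"))
```

Rejecting unsupported extensions on the client side gives an immediate error instead of a file that sits unconverted in the bucket.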

/trail
Contains all of the immutable audit trail records and the corresponding Iceberg-compatible metadata tables. Trail records are stored under /records in a pseudo-blockchain/reverse-linked-tree binary format. /metadata contains a copy of the metadata in Iceberg format for use in automating enforcement.

/assets
Stores non-data assets such as documentation and configuration files for use with other mytiki.com services such as our data storefront.

/stats
Statistics tables, also in Iceberg format, produced by your pipeline execution. Top-level statistics such as the last updated date and record count are added automatically. You can add a stat generation step to your pipeline for more insight. These tables are useful for generating performant dashboards and summary metrics, and are used by the mytiki.com data storefront service.

/log
All logs except S3 bucket logs (never store these in the same bucket) are recorded here. Logs are also stored as Iceberg tables for easy querying and debugging. Errors in pipeline execution, data loading, and more end up here.

/queries
Query requests and results should reside here. Athena queries that use the mytiki-lagoon workgroup are preconfigured for this bucket. When using an alternative asynchronous querying engine, consider pointing its output files to the same directory.
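
A minimal Python sketch of submitting an Athena query through that workgroup, assuming boto3 is installed and AWS credentials are configured. The helper names, the example SQL, and the assumption that the Athena database name matches your <database> are illustrative only; start_query_execution is the standard boto3 Athena call.

```python
def athena_request(sql: str, database: str) -> dict:
    """Build the arguments for an Athena query in the mytiki-lagoon workgroup."""
    return {
        "QueryString": sql,
        # Assumption: the Athena database name matches the lagoon <database>.
        "QueryExecutionContext": {"Database": database},
        "WorkGroup": "mytiki-lagoon",
    }


def run_query(sql: str, database: str) -> str:
    """Submit the query and return its execution ID (assumes boto3)."""
    import boto3  # imported lazily so athena_request works without boto3

    response = boto3.client("athena").start_query_execution(
        **athena_request(sql, database)
    )
    return response["QueryExecutionId"]


print(athena_request("SELECT COUNT(*) FROM orders", "sales"))
```

No output location is passed here because the text above says the workgroup is preconfigured for this bucket.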