Delta Lake is an open-source storage layer that sits on top of your existing data lake (e.g., AWS S3) and turns it into a reliable, ACID-compliant data lakehouse.
Delta is the default table format in Databricks.
When you create a table in Databricks, it creates a Delta table by default: Parquet data files plus a transaction log, stored in cloud storage.
How it Works
- Data files, in Parquet format, live in cloud object storage (S3, ADLS, or GCS).
- A transaction log (JSON + checkpoint files) called `_delta_log` records every change: inserts, updates, deletes.
- Readers and writers consult the log to guarantee consistency.

Think of it as:
- Raw files in a data lake + Delta Lake = an ACID table with history
In short: Delta Lake makes your data lake behave like a data warehouse (but cheaper and more flexible).
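To make that concrete, here is a minimal PySpark sketch (assuming a Spark session configured with the delta-spark package; the path `/tmp/delta/users` and the column name are illustrative) that writes a DataFrame as a Delta table and inspects the files it produces:

```python
import os
from pyspark.sql import SparkSession

# Assumes a Spark session with the Delta Lake extensions configured
# (e.g. via the delta-spark package); in a Databricks notebook, `spark` is predefined.
spark = SparkSession.builder.getOrCreate()

# Write a tiny DataFrame as a Delta table (illustrative path and column name)
df = spark.range(5).withColumnRenamed("id", "user_id")
df.write.format("delta").mode("overwrite").save("/tmp/delta/users")

# The directory now contains Parquet data files plus the _delta_log directory
print(os.listdir("/tmp/delta/users"))             # part-*.parquet, _delta_log/
print(os.listdir("/tmp/delta/users/_delta_log"))  # 00000000000000000000.json, ...
```

Every commit appends a new JSON entry to `_delta_log`; periodically Delta also writes a checkpoint file so readers don't have to replay the entire log.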
Delta Lake vs Traditional Data Lake
| Feature | Data Lake (plain Parquet/CSV) | Delta Lake |
|---|---|---|
| ACID Transactions | ❌ | ✅ |
| Schema Enforcement | ❌ | ✅ |
| Time Travel | ❌ | ✅ |
| Batch + Streaming | ❌ | ✅ |
| Performance Optimizations | ❌ | ✅ |
🔑 What Delta Lake Provides
- ACID Transactions
- Ensures consistency for concurrent reads and writes.
- Example: If multiple pipelines write to the same dataset, Delta Lake guarantees you don’t end up with corrupt or partial data (see the MERGE sketch after this list).
- Schema Enforcement & Evolution
- Prevents bad or unexpected data from being written (e.g., wrong column types).
- Can evolve the schema over time when new columns are added (see the mergeSchema sketch below).
- Time Travel (Versioning)
- Keeps a history of all changes to your tables.
- You can query older versions of the data (e.g., `SELECT * FROM table VERSION AS OF 5`); see the time-travel sketch below.
- Useful for audits, debugging, or reproducing reports.
- Unified Batch & Streaming
- Same Delta table can be read and written to by both batch jobs and streaming jobs.
- Removes the need to maintain separate batch and streaming pipelines (see the streaming sketch below).
- Performance Enhancements
- Stores data in Parquet format under the hood (columnar, compressed, efficient).
- Adds transaction logs (_delta_log) to keep track of changes.
- Optimizations: data skipping, Z-ordering, caching, compaction (see the OPTIMIZE sketch below).
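A sketch of the ACID guarantee in practice: the `MERGE` below commits atomically, so concurrent readers see either the old snapshot or the new one, never a partial write. It assumes the illustrative Delta table from the earlier sketch exists at `/tmp/delta/users` and that both sides share the `user_id` column.

```python
from delta.tables import DeltaTable

# Target Delta table (assumed to exist at this illustrative path)
target = DeltaTable.forPath(spark, "/tmp/delta/users")

# Upsert: matched rows are updated, new rows are inserted, all in one atomic commit
updates = spark.range(3, 8).withColumnRenamed("id", "user_id")
(target.alias("t")
    .merge(updates.alias("u"), "t.user_id = u.user_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```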
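Schema enforcement and evolution, sketched against the same hypothetical table: an append with an unexpected column is rejected unless you explicitly opt in with `mergeSchema`.

```python
# A new column ("country") that is not part of the table schema
new_rows = spark.createDataFrame([(100, "US")], ["user_id", "country"])

# Schema enforcement: this append would fail with an AnalysisException
# new_rows.write.format("delta").mode("append").save("/tmp/delta/users")

# Schema evolution: explicitly allow the new column to be added to the table schema
(new_rows.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/delta/users"))
```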
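Time travel, through both the DataFrame reader and SQL (the version number and path are illustrative; use `history()` to see which versions actually exist):

```python
from delta.tables import DeltaTable

# List the table's commit history recorded in _delta_log
DeltaTable.forPath(spark, "/tmp/delta/users").history().show()

# Read an older snapshot by version number (or use "timestampAsOf")
old_df = (spark.read.format("delta")
          .option("versionAsOf", 1)
          .load("/tmp/delta/users"))

# Equivalent SQL on a path-based table
spark.sql("SELECT * FROM delta.`/tmp/delta/users` VERSION AS OF 1").show()
```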
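Unified batch and streaming: the same Delta path can back a batch read and a streaming read at once. The checkpoint location and output path below are illustrative.

```python
# Batch read of the current snapshot
batch_df = spark.read.format("delta").load("/tmp/delta/users")

# Streaming read: new commits to the table arrive as micro-batches
stream_df = spark.readStream.format("delta").load("/tmp/delta/users")

query = (stream_df.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/delta/_checkpoints/users_copy")
         .start("/tmp/delta/users_copy"))
```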
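Finally, the performance knobs, issued as SQL through `spark.sql` (`OPTIMIZE ... ZORDER BY` requires Delta Lake 2.0+ or Databricks; the column choice is illustrative):

```python
# Compact small files and co-locate rows by user_id to improve data skipping
spark.sql("OPTIMIZE delta.`/tmp/delta/users` ZORDER BY (user_id)")

# Remove data files no longer referenced by the log (default retention period applies)
spark.sql("VACUUM delta.`/tmp/delta/users`")
```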