LakeFS
Enterprise data version control platform
From Wikipedia, the free encyclopedia
lakeFS is an open-source data version control system for managing data stored in object storage.[1] It provides Git-like operations such as branching, committing, merging, and reverting for large-scale data stored in systems including Amazon S3, Azure Blob Storage, and Google Cloud Storage, as well as other S3-compatible object storage platforms.[2] lakeFS is used in data engineering and machine learning workflows to manage changes to data, support reproducibility, and enable data governance across data lakes.[3] The software is available as an open-source project, as well as in enterprise and managed service offerings, including lakeFS Cloud.[3][1]
| lakeFS | |
|---|---|
| Original authors | Einat Orr, Oz Katz |
| Developer | Treeverse |
| Initial release | August 3, 2020 |
| Stable release | 1.72.0 |
| Written in | Go |
| Type | Data version control |
| License | Apache 2.0 |
| Website | lakefs |
| Repository | https://github.com/treeverse/lakeFS |
History
lakeFS was created in 2020 by Einat Orr and Oz Katz at Treeverse.[4] Its first public release, version 0.8.1, appeared in August 2020 and introduced Git-style operations with support for Amazon S3.[5]
In 2021, Treeverse raised $23 million in a Series A funding round led by Dell Technologies Capital, Norwest Venture Partners, and Zeev Ventures.[6] The same year, lakeFS was included in InfoWorld’s Best of Open Source Software (Bossie) awards.[7]
In June 2022, Treeverse introduced lakeFS Cloud, a managed service providing hosted lakeFS deployments for cloud-based data lakes.[3] Version 1.0 was released in October 2023, adding integrations with platforms such as Databricks and Apache Iceberg, as well as support for orchestration tools including Apache Airflow.[1][8] Public case studies and conference materials have described usage of lakeFS by organizations such as Microsoft, Volvo, and NASA.[1]
In July 2025, Treeverse announced an additional $20 million in growth funding to support further development of lakeFS.[9][10]
In November 2025, Treeverse announced the acquisition of the open-source data version control project DVC.[11]
Software
Overview
lakeFS provides Git-like operations such as branching, committing, merging, and reverting for datasets stored in object storage.[1] These operations are used to manage changes to data, test modifications in isolation, reproduce specific data states, and recover from errors or unintended updates.[2]
Architecture
lakeFS operates as a metadata layer on top of object storage systems such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.[2] It stores repository metadata describing commits, branches, and tags, enabling versioned views of data without copying underlying objects.[2]
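The idea of a metadata layer that versions data without copying objects can be illustrated with a minimal sketch. The code below is a toy model, not lakeFS's implementation or API: the class `ToyRepo`, its methods, and the object addresses are hypothetical names chosen for illustration. It assumes only what is described above, namely that commits map logical paths to objects already present in storage and that branches are pointers to commits.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """Immutable snapshot mapping logical paths to physical object addresses."""
    id: str
    parent: Optional[str]
    entries: Dict[str, str]
    message: str


class ToyRepo:
    """Hypothetical in-memory metadata layer; it never touches object data."""

    def __init__(self) -> None:
        root = Commit("c0", None, {}, "initial")
        self.commits: Dict[str, Commit] = {"c0": root}
        self.branches: Dict[str, str] = {"main": "c0"}

    def create_branch(self, name: str, source: str) -> None:
        # Creating a branch copies one pointer, not the underlying objects.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, changes: Dict[str, Optional[str]], message: str) -> str:
        head = self.commits[self.branches[branch]]
        entries = dict(head.entries)  # copies metadata only
        for path, address in changes.items():
            if address is None:
                entries.pop(path, None)  # None marks a deletion
            else:
                entries[path] = address
        commit_id = f"c{len(self.commits)}"
        self.commits[commit_id] = Commit(commit_id, head.id, entries, message)
        self.branches[branch] = commit_id
        return commit_id

    def read(self, ref: str, path: str) -> Optional[str]:
        # A ref is either a branch name or a commit id.
        commit_id = self.branches.get(ref, ref)
        return self.commits[commit_id].entries.get(path)


repo = ToyRepo()
first = repo.commit("main", {"events/day1.parquet": "s3://bucket/obj-a1"}, "load day 1")
repo.create_branch("experiment", "main")
repo.commit("experiment", {"events/day1.parquet": "s3://bucket/obj-b7"}, "rewrite day 1")
```

In this sketch the `main` branch and the commit stored in `first` still resolve the path to the original object, while `experiment` resolves it to the rewritten one; no object was duplicated to create the branch.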
The system provides access through multiple interfaces, including a web user interface, command-line tools, a REST API, and software development kits.[2] It is designed to integrate with existing data engineering and machine learning workflows, and can be deployed either in self-hosted environments or as a managed service.[3]
Functions
lakeFS provides version control functionality for data stored in object storage–based data lakes. Core features include:
- Atomic commits and version tracking for datasets, supporting reproducibility and auditability.[1]
- Branching and merging mechanisms that allow isolated development and testing without duplicating data.[2]
- Configurable hooks that can validate data or trigger external processes during commit and merge operations.[1]
- The ability to revert repositories to earlier states to recover from data errors or failed changes.[2]
- Recording of commit history and associated metadata for lineage tracking.[3]
- Support for managing data across multiple object storage systems, including Amazon S3, Azure Blob Storage, Google Cloud Storage, and MinIO.[3]
- Use of fixed data versions to reproduce experiments and machine learning model training.[1]
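Two of the features listed above, validation hooks that run before a merge and reverting to an earlier state, can be sketched in a few lines. This is a simplified illustration under stated assumptions rather than lakeFS's actual hook mechanism or merge algorithm: the function names, the `require_parquet` rule, and the "source overrides destination" merge are all hypothetical simplifications.

```python
from typing import Callable, Dict, List

Snapshot = Dict[str, str]               # logical path -> object address
Hook = Callable[[Snapshot], List[str]]  # returns a list of problems found


def require_parquet(snapshot: Snapshot) -> List[str]:
    """Hypothetical validation rule: every path must end in .parquet."""
    return [f"unexpected file type: {path}"
            for path in sorted(snapshot) if not path.endswith(".parquet")]


def merge(history: List[Snapshot], source: Snapshot, hooks: List[Hook]) -> Snapshot:
    """Overlay source on the latest snapshot; commit only if all hooks pass."""
    candidate = {**history[-1], **source}  # simplified: source entries win
    problems = [msg for hook in hooks for msg in hook(candidate)]
    if problems:
        # The destination history is left untouched when validation fails.
        raise ValueError("; ".join(problems))
    history.append(candidate)
    return candidate


def revert_to(history: List[Snapshot], index: int) -> Snapshot:
    """Restore an earlier state by appending it as a new snapshot."""
    restored = dict(history[index])
    history.append(restored)
    return restored


main = [{"sales/2024.parquet": "obj-1"}]
good_change = {"sales/2025.parquet": "obj-2"}
bad_change = {"sales/notes.tmp": "obj-3"}

merge(main, good_change, [require_parquet])
try:
    merge(main, bad_change, [require_parquet])
    rejected = False
except ValueError:
    rejected = True

revert_to(main, 0)
```

In the sketch, reverting appends the restored state as a new snapshot instead of deleting later ones, so the history of what happened remains available for auditing.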
Integrations
Coverage of lakeFS has described integrations with platforms such as Databricks and Apache Iceberg, as well as support for environments including Red Hat OpenShift.[1][2] Additional materials describe its use with Trino, for example to validate data changes before they are merged, and its compatibility with orchestration tools such as Apache Airflow.[12]