A Smarter Way to Manage Algorithm Changes in Data Pipelines with lakeFS

Doron Grinzaig
5 min read · May 23, 2021

Background

A major output of my team at Similarweb is a product called Shopper Intelligence. At a high level, we take cross-platform browsing and purchase data as inputs and feed them into proprietary algorithms that forecast future behavior on the Amazon marketplace. Our customers then use these predictions to make smarter decisions for their own businesses.

The more accurate our predictions, the more value we provide to our customers, so we are constantly testing new ways to improve the accuracy of our models.

Iteration Challenges

The strategy we use to improve our models is constant iteration. This involves frequently trying out new sources of data, model combinations, and weighting parameters.

Since we do not know in advance whether a change will increase prediction accuracy, it is important to compute results for multiple versions of both the model and the data, and to compare them in parallel.

Collection level prediction

We manage this with a simple numbered versioning system that is applied both to data collections and to the algorithms run over them. The results of a specific version of an algorithm applied over a specific version of a collection are saved under a unique path in S3 that contains both version numbers (v1, v2, and so on).
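As a made-up illustration of the scheme (the bucket, collection names, and version numbers are all hypothetical), a layout of this kind might look like:

s3://shopper-intelligence/collections/transactions/v2/ (input collection, version 2)
s3://shopper-intelligence/predictions/transactions_v2/results_v17/ (algorithm v17 over collection v2)
s3://shopper-intelligence/predictions/transactions_v2/results_v18/ (algorithm v18 over collection v2)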

While effective initially, this system results in a quickly growing list of lists of different predictions, as shown below.

Input: Dataset 1, V1
Output: Results V1, Results V2, …, Results V18

Input: Dataset 1, V2
Output: Results V1, Results V2, …, Results V18

Input: Dataset 1, V3
Output: Results V1, Results V2, …, Results V18

Input: Dataset 2, V1
Output: Results V1, Results V2, …, Results V32

Input: Dataset 2, V2
Output: Results V1, Results V2, …, Results V32

…

Input: Dataset xyz, Vx
Output: Results V1, Results V2, …, Results V8

The explosion in results datasets produces a manageability problem for the next phase of the pipeline.

Producing Cross-Collection Views

The next step in the pipeline is to produce a joined view over all results datasets. This is performed by a set of Spark jobs, each joining a subset of the results. These jobs are orchestrated by an Airflow DAG that takes the latest version of each dataset as input.
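To make the bookkeeping concrete, here is a rough sketch of the kind of invocation each DAG task had to be kept in sync with; the script name, bucket, and version numbers are all made up:

$ spark-submit join_predictions.py \
    s3://shopper-intelligence/predictions/transactions_v2/results_v18/ \
    s3://shopper-intelligence/predictions/clicks_v3/results_v18/ \
    s3://shopper-intelligence/predictions/searches_v1/results_v12/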

This gets messy fast when you consider that we have far more collections than the few sketched above. Apart from being error-prone, this kind of labeling also makes it hard to keep track of which algorithm version corresponds to which collection.

With many models and hundreds of different versions in total, this is not a simple task. The result is a laborious process to deploy a new algorithm to production, slowing down our pace of iteration.

Rolling back to a previous version in the event of an error is also tricky. Since the current version minus one is not necessarily the right one (development versions that were never released may have been tested in the interim), it is not always straightforward to know which version to roll back to.

Adopting lakeFS

To solve these problems, we needed a tool that could easily let us synchronize the versions of different datasets to a single version of our outputs. lakeFS enables this functionality through Git-like operations over collections in S3.

The first step was to create a repository in lakeFS containing all of our collections.
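Assuming the repository name used later in this post and a placeholder S3 bucket for the storage namespace, creating it is a one-liner:

$ lakectl repo create lakefs://prediction-repo s3://<bucket>/prediction-repo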

Next, we imported the latest version of each data collection (clicks, searches, transactions, and so on) into the repo.
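One way to do this is to copy each collection through the lakeFS S3 gateway and then commit it. The bucket and endpoint below are placeholders, the AWS CLI is assumed to be configured with lakeFS credentials, and lakeFS also offers zero-copy import commands if physically copying the data is too expensive:

$ aws s3 cp --recursive \
    s3://shopper-intelligence/collections/clicks/v3/ \
    s3://prediction-repo/main/collections/clicks/ \
    --endpoint-url https://lakefs.example.com
$ lakectl commit lakefs://prediction-repo/main -m "Import latest clicks collection"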

When a newer version of a collection is ready to be bumped to production, it is committed to the repository without an incremented version number in the path. Instead, we tag the unique commit ID generated by lakeFS with the version number.
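For a hypothetical new version of the clicks collection, that flow is just a commit followed by a tag (the tag name here is made up):

$ lakectl commit lakefs://prediction-repo/main -m "Bump clicks collection"
$ lakectl tag create lakefs://prediction-repo/clicks-v4 <commit_ID>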

Development environment

Let’s walk through the process of testing out a change to an algorithm, arbitrarily named A61, with lakeFS integrated into our environment.

The first step is to create a branch with the name of the algorithm. Note that this is purely a metadata operation and happens instantly.
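Using the branch name that appears in the merge command later in this post:

$ lakectl branch create lakefs://prediction-repo/a61-algo --source lakefs://prediction-repo/main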

We now have a place where we can safely test our changes, without affecting the main branch. Additionally, anyone else using the same repository will not see our changes.

Since lakeFS exposes an S3-compatible API, the paths to the branch are simply S3 paths: the only change required in the code is to add the branch name to the path.
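For example, for a hypothetical clicks collection, a job’s input path changes from a versioned location in S3 to a branch-qualified lakeFS path (the bucket and version number are made up):

Before: s3a://shopper-intelligence/collections/clicks/v3/
After: s3a://prediction-repo/a61-algo/collections/clicks/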

With this pattern, we no longer need to maintain the long list of versioned paths for the Spark jobs! The process is simplified to pointing at the main branch of the lakeFS repo:
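Here is a minimal sketch of the same made-up job from earlier, now reading everything from the main branch through the lakeFS S3 gateway; the endpoint and credential placeholders need to point at your own lakeFS installation:

$ spark-submit \
    --conf spark.hadoop.fs.s3a.endpoint=https://lakefs.example.com \
    --conf spark.hadoop.fs.s3a.access.key=<lakefs_access_key> \
    --conf spark.hadoop.fs.s3a.secret.key=<lakefs_secret_key> \
    --conf spark.hadoop.fs.s3a.path.style.access=true \
    join_predictions.py s3a://prediction-repo/main/predictions/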

Promoting a Ready Model

When a change to an algorithm is ready, we merge the results of the experiment branch into main via a lakectl command and then tag it:

$ lakectl merge lakefs://prediction-repo/a61-algo lakefs://prediction-repo/main
$ lakectl tag create lakefs://prediction-repo/v18 <commit_ID>

The change is now visible to consumers on the main branch and tagged with its model version number. By only merging dev branches into main once they are approved for production, we produce a commit history on the main branch that makes rollbacks easy: simply reverting to the previous commit (another lakectl command) exposes the correct version of the data to production again.
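As a minimal sketch, assuming we know the ID of the commit that introduced the bad version (for merge commits, lakectl may also need to be told which parent to revert against):

$ lakectl branch revert lakefs://prediction-repo/main <commit_ID>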

Conclusion

Managing the development lifecycle of a data-intensive product built over versioned datasets and algorithms is a hard problem. I hope the solution we’ve shared in this article inspires you to experiment more confidently in your own environment!
