Getting Started with DVC — a Git for Data and Models
If you are reading this blog, you are probably familiar with Git and how integral it has become to software development. Similarly, DVC is an open-source, Git-based version control tool for machine learning development that encourages best practices across teams. The advantages of DVC are:
- ML project version control: DVC lets you connect to storage providers such as AWS S3, Microsoft Azure Blob Storage, Google Drive, Google Cloud Storage and HDFS to store ML models and datasets.
- ML experiment management: DVC tracks metrics automatically and makes it easy to navigate between experiments.
- Deployment and collaboration: DVC introduces pipelines, which help bundle ML models, data, and code for delivery to production, remote machines, or a colleague’s computer.
Some of the use cases of DVC are:
- Versioning Data and Models: We can track versions of data and ML models using Git commits. For each file or folder tracked by DVC, a metafile with the .dvc extension is created. It contains metadata such as the md5 hash, size, number of files and path.
- CI/CD for Machine Learning: DVC helps manage data and models and build reproducible pipelines.
- Fast and Secure Data Caching Hub: DVC’s built-in data caching speeds up data transfers and lets us set up a shared DVC cache, which prevents repeated transfers by linking working files and directories to the cache.
- Experiment Tracking: Running DVC Experiments in your workspace captures relevant changes automatically (input data, source code, hyperparameters, artifacts, etc.). This helps in iterating quickly on experiments, creating checkpoints, and comparing results.
- Model Registry: DVC enables us to catalog ML models and their versions. This helps in organizing model versions from different sources, sharing metadata, and deploying specific models to dev, test and production environments.
- Data Registry: DVC enables cross-project reuse of data artifacts, i.e. different projects can depend on data kept in other repositories.
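To make the caching idea more concrete, here is a minimal sketch of a content-addressed cache, assuming a toy folder layout. It is not DVC's implementation: the function names (file_md5, store_in_cache, demo_cache) and the cache layout are hypothetical and only illustrate why two files with identical content need to be stored once.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex md5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_in_cache(cache_dir: Path, source: Path) -> Path:
    """Copy `source` into the toy cache, keyed by its content hash."""
    checksum = file_md5(source)
    # Hypothetical layout: <cache>/<first two hex chars>/<remaining chars>
    target = cache_dir / checksum[:2] / checksum[2:]
    if not target.exists():  # identical content is copied only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
    return target


def demo_cache() -> tuple:
    """Store two files with the same content and report what the cache holds."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        first = root / "train_a.csv"
        second = root / "train_b.csv"
        first.write_text("id,label\n1,cat\n")
        second.write_text("id,label\n1,cat\n")
        cache = root / "cache"
        path_a = store_in_cache(cache, first)
        path_b = store_in_cache(cache, second)
        entries = [p for p in cache.rglob("*") if p.is_file()]
        # (both files map to the same cache entry?, number of cached files)
        return path_a == path_b, len(entries)


if __name__ == "__main__":
    print(demo_cache())
```

Because the cache key is derived from the file content rather than the file name, a second copy of the same data adds nothing to the cache.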
Installation
You can install DVC from the PyPI repository using the following command:
pip install dvc
Depending on the type of remote storage that will be used, we have to install optional dependencies: [s3], [gdrive], [gs], [azure], [ssh], [hdfs], [webdav], [oss]. Use [all] to include them all. In this blog, we will be using Google Drive as remote storage, so we install the gdrive dependencies with:
pip install "dvc[gdrive]"
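As a small illustration of how the remote type determines the optional dependency, the sketch below maps the scheme of a remote URL to a pip command. The helper install_command and the scheme-to-extra table are assumptions made for this example, built from the list of extras above. They are not part of DVC.

```python
from urllib.parse import urlparse

# Assumed mapping from remote URL scheme to pip extra, based on the
# optional dependencies listed in the text (illustrative only).
EXTRA_BY_SCHEME = {
    "s3": "s3",
    "gdrive": "gdrive",
    "gs": "gs",
    "azure": "azure",
    "ssh": "ssh",
    "hdfs": "hdfs",
    "webdav": "webdav",
    "oss": "oss",
}


def install_command(remote_url: str) -> str:
    """Suggest a pip command for the given remote URL (hypothetical helper)."""
    scheme = urlparse(remote_url).scheme
    extra = EXTRA_BY_SCHEME.get(scheme)
    if extra is None:
        # Unknown scheme or local path: fall back to the plain package
        return "pip install dvc"
    return f'pip install "dvc[{extra}]"'


if __name__ == "__main__":
    print(install_command("gdrive://0AIac4JZqHhKmUk9PDA"))
```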
Getting Started
In this blog, we will see how to use DVC for tracking data and ML models with Google Drive as remote storage. Imagine a Git repository that contains source code alongside a data folder and a models folder.
The data and models folders will be much larger than the source code of the repository. This is where DVC comes into the picture: it tracks the data and models folders so that Git does not have to store them. Go to the root of the Git repository (the one that includes the data and models folders) and initialize DVC using the command:
dvc init
To start tracking the data and models directories, run the following commands:
dvc add data
dvc add models
This creates a special file with the .dvc extension for each folder (data.dvc and models.dvc). Each .dvc file contains metadata such as the md5 hash, size, number of files and path. These .dvc files are versioned with the source code in Git. The dvc add command also adds the data and models folders to the .gitignore file. Then, we need to commit the changes to Git using the following commands:
git add -A
git commit -m "track data and models using dvc"
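To get a feel for the kind of metadata a .dvc file records, the sketch below collects the size, file count and per-file md5 hashes of a folder. It is only an approximation for illustration: the helper names summarize and demo_summary are hypothetical, and DVC's own hashing of folders is more involved than hashing each file separately.

```python
import hashlib
import tempfile
from pathlib import Path


def summarize(path: Path) -> dict:
    """Collect size, file count and per-file md5 hashes for a file or folder."""
    if path.is_file():
        files = [path]
    else:
        files = sorted(p for p in path.rglob("*") if p.is_file())
    hashes = {
        # Key each hash by the file's path relative to the parent folder
        f.relative_to(path.parent).as_posix(): hashlib.md5(f.read_bytes()).hexdigest()
        for f in files
    }
    return {
        "path": path.name,
        "nfiles": len(files),
        "size": sum(f.stat().st_size for f in files),
        "md5": hashes,
    }


def demo_summary() -> dict:
    """Build a tiny data folder in a temporary directory and summarize it."""
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data"
        data.mkdir()
        (data / "a.txt").write_bytes(b"hello")
        (data / "b.txt").write_bytes(b"world!")
        return summarize(data)


if __name__ == "__main__":
    print(demo_summary())
```

If any file in the folder changes, its hash changes too. In the same way, a changed hash in the small .dvc file is what tells Git that a new version of the data exists, without Git storing the data itself.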
Gdrive Remote Configuration
Now, we need to configure the Google Drive remote storage. Go to your Google Drive and create a folder called dvc_storage. Open the folder and copy its folder ID, which is the last part of the URL:
https://drive.google.com/drive/folders/folder-id
# example: https://drive.google.com/drive/folders/0AIac4JZqHhKmUk9PDA
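Now, use the following command to set the dvc_storage folder created in Google Drive as the default remote storage. The -d flag marks myremote as the default remote, so that dvc push and dvc pull know which remote to use without an explicit -r myremote option:
dvc remote add -d myremote gdrive://folder-id
# example: dvc remote add -d myremote gdrive://0AIac4JZqHhKmUk9PDA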
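If you prefer not to copy the folder ID by hand, it can be extracted from the folder link. The following is a minimal sketch, assuming the link has the form shown above with the ID right after the folders segment. The helper gdrive_remote_url is a hypothetical name, not a DVC function.

```python
from urllib.parse import urlparse


def gdrive_remote_url(folder_url: str) -> str:
    """Turn a Google Drive folder link into a gdrive:// URL (hypothetical helper)."""
    # Split the URL path into non-empty segments; query strings are ignored
    parts = [p for p in urlparse(folder_url).path.split("/") if p]
    if "folders" not in parts or parts[-1] == "folders":
        raise ValueError(f"no folder ID found in {folder_url!r}")
    # The folder ID is assumed to be the segment right after "folders"
    folder_id = parts[parts.index("folders") + 1]
    return f"gdrive://{folder_id}"


if __name__ == "__main__":
    link = "https://drive.google.com/drive/folders/0AIac4JZqHhKmUk9PDA"
    print(gdrive_remote_url(link))
```

The returned string is the value passed as the last argument of dvc remote add.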
The remote configuration is stored in the .dvc/config file, so we commit the change to the Git repository using the commands:
git add -A
git commit -m "configure dvc remote storage"
To push the data to remote storage, we use the following command:
dvc push
Then, we push the changes to git using the command:
git push
To pull the data from remote storage, for example after cloning the repository on another machine, we can use the following command:
dvc pull
Conclusion
This blog aimed to help you get started with the basics of DVC and to set up DVC with Google Drive as remote storage. For advanced uses (such as CI/CD), the Google Drive remote needs to be configured with a Google Cloud project, as described in the DVC documentation [1]. Other supported storage types include AWS S3, Microsoft Azure Blob Storage, self-hosted SSH servers, HDFS and HTTP. Many DVC commands are analogous to Git commands (dvc fetch, dvc checkout, dvc status and more). There is also a DVC extension for Visual Studio Code, which makes things easier for developers using VS Code. Check out the GitHub repository [2] to learn more about DVC and everything it offers.
References:
[1] https://dvc.org/
[2] https://github.com/iterative/dvc