MLflow & Vectice: Auto-Documenting Beyond Model Data

Leverage Vectice on top of MLflow to auto-document and approve your models and datasets.

In this blog post, from data scientists to data scientists, we'll show how Vectice enhances data science project documentation and governance by seamlessly documenting key assets and insights at every step of the data science project lifecycle. Vectice complements tools like MLflow by adding critical context and visibility to drive accountability of ML projects.

Benefits of Integrating Vectice with MLflow

Vectice's MLflow integration lets you retrieve and augment the assets you previously stored in MLflow and centralize them in the Vectice app, which makes your machine-learning projects easier to manage.

With a few lines of code using the Vectice API library, key project assets such as models, datasets, code, and other artifacts are documented and accessible in a standard user interface for your team, stakeholders, and managers. Collaborating with colleagues, sharing progress, and soliciting input and reviews all become easier.

Vectice lets you define reusable best practices for driving your DS projects, enabling you to communicate effectively and consistently, following an approach trusted by your team.

Finally, Vectice simplifies the discovery of existing work to quickly locate essential models, their training datasets, and relevant documentation so you never have to start a project from scratch again.

By combining MLflow with Vectice, you will learn how to:

Document new and existing key models and assets from MLflow in Vectice with a few lines of code.
Capture and access model insights that were locked in the MLflow UI.
Access the assets from the MLflow UI, now documented in Vectice, along with insights about decisions and tradeoffs at every step of the ML project lifecycle.
Explore dataset lineage in a click, from a quick overview to detailed insights into how a model was created and which features were selected in the dataset.
Trigger a review process to get feedback from additional stakeholders and foster broader collaboration.

Identify Key Experiments in MLflow You Want to Document in Vectice

As part of your modeling process, you may save a large number of experiments in MLflow that are not relevant to the success of the project or whose insights are not worth sharing.

The many saved experiment runs in MLflow can be overwhelming and irrelevant to the broader team.

However, some runs and model versions are key to explaining the insights you gathered and the decisions you made, or are candidates for the next phase of your project, such as production deployment.
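One way to narrow many runs down to a few key ones is to rank them by a metric. The following is a minimal sketch, assuming only the standard attribute layout of an MLflow Run object (`info.run_id` and `data.metrics`). The helper names `pick_key_runs` and `fake_run`, the metric names, and the run IDs are hypothetical and used for illustration only.

```python
from types import SimpleNamespace


def pick_key_runs(runs, metric, top_n=3, higher_is_better=True):
    """Return the top_n runs ranked by `metric`, skipping runs that lack it."""
    scored = [r for r in runs if metric in r.data.metrics]
    scored.sort(key=lambda r: r.data.metrics[metric], reverse=higher_is_better)
    return scored[:top_n]


def fake_run(run_id, **metrics):
    """Stand-in with the same attribute layout as an MLflow Run (illustration only)."""
    return SimpleNamespace(
        info=SimpleNamespace(run_id=run_id),
        data=SimpleNamespace(metrics=metrics),
    )


runs = [
    fake_run("a1", accuracy=0.81),
    fake_run("b2", accuracy=0.93),
    fake_run("c3", loss=0.40),  # no accuracy logged, so it is skipped
    fake_run("d4", accuracy=0.88),
]

key_run_ids = [r.info.run_id for r in pick_key_runs(runs, "accuracy", top_n=2)]
print(key_run_ids)
```

With a real tracking server, the `runs` list could instead come from `MlflowClient().search_runs(...)`, which returns Run objects with the same attributes.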

Vectice enables you to organize key assets as part of your data science project documentation, solicit feedback from your peers and subject matter experts, share insights, and ask for formal reviews.

Thanks to the Vectice MLflow API integration, you can document an existing MLflow run in Vectice with a few lines of code, as shown below:

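import mlflow
from vectice import Model

run_id = run.info.run_id  # ID of an existing MLflow run (run_uuid is deprecated)

# Wrap the MLflow run as a Vectice model and log it to the current iteration
vect_baseline_model = Model.mlflow(run_id=run_id, client=mlflow)
iteration.log(vect_baseline_model)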

You can access our full sample notebook here on how to document MLflow experiments to Vectice: https://github.com/vectice/GettingStarted/blob/main/23.3/samples/mlflow_sample.ipynb

The logged experiment in Vectice includes all the information previously saved in MLflow, and you can now share it with a broader team and provide more context to your manager, domain experts, and other business partners, with proper access control and notifications.
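To see what information travels with a run, it can help to inspect the run before logging it. The following is a minimal sketch, assuming only the standard attribute layout of an MLflow Run object (`info.run_id`, `info.status`, `data.params`, `data.metrics`). `summarize_run` and the stand-in run are hypothetical helpers for illustration, not part of MLflow or Vectice.

```python
from types import SimpleNamespace


def summarize_run(run):
    """Gather a run's ID, status, parameters, and metrics into one plain dict."""
    return {
        "run_id": run.info.run_id,
        "status": run.info.status,
        "params": dict(run.data.params),
        "metrics": dict(run.data.metrics),
    }


# Stand-in with the same attribute layout as an MLflow Run (illustration only).
example_run = SimpleNamespace(
    info=SimpleNamespace(run_id="abc123", status="FINISHED"),
    data=SimpleNamespace(
        params={"max_depth": "5"},  # MLflow stores parameter values as strings
        metrics={"accuracy": 0.93},
    ),
)

summary = summarize_run(example_run)
print(summary)
```

Against a real tracking server, the same function could be called on the object returned by `mlflow.get_run(run_id)`.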

In the Vectice app, all the models' metadata and their artifacts that were in MLflow are now accessible as part of your DS project and visible to the rest of your team. You can easily edit and augment them from the Vectice app with complete access control to ensure the information remains secure.

The activity history gives you insights into what was changed and when, so changes to key experiments remain traceable when colleagues work with you on the same project.

The model version in the Vectice app shows the lineage, metrics, and properties of the model version.

Extend MLflow by Documenting Analytical Datasets

Vectice treats datasets as first-class citizens to facilitate lineage and column transformations, data exploration, and feature selection. Vectice is commonly used for documenting those use cases as part of your typical DS workflow.

Here is a short example of documenting a clean dataset with a few lines of code:

cleaned_dataset = Dataset.clean(
    name="Cleaned Dataset",
    resource=FileResource(paths="dataset.csv", dataframes=df_iris),
)
iteration.log(cleaned_dataset)

You can access our full sample notebook here on how to document MLflow experiments to Vectice: https://github.com/vectice/GettingStarted/blob/main/23.3/samples/mlflow_sample.ipynb

The dataset version in the Vectice app shows the lineage, properties, data location, and metadata.

Auto-Document MLflow Key Assets in a Few Clicks with Vectice

Once you have logged and augmented your key assets' metadata, you can document your projects in a few clicks thanks to Vectice's auto-generated widgets. Start by going to a project phase and clicking the “Insert” dropdown menu.

You then select an asset to insert:

The Model widget selection tool in the Vectice app documentation, used to auto-generate content in the Vectice project.

You can choose the granularity at which you want to view your models. Whether it is a comparison table or highlighting a key model, it’s possible with Vectice.

Here is an example of a model comparison table:

Attachments are crucial in documentation as they allow relevant graphs, visualizations, and other attachments to be directly included where needed. With Vectice, you always have the right attachment at the right place when you need it.

Inserted artifacts are also quick links to the detailed tab of the artifact, which can be used to drill down into the artifacts themselves.

You will find lineage, history, and a comprehensive set of capabilities for collaboration, such as feedback and reviews. Below is an example of the inserted widgets in the documentation:

The Vectice documentation page shows model versions and an inserted visualization that belongs to the model version.

By leveraging the various widgets, you can reorganize the content on the page in a way that is more suitable for you to share and make it easy to keep up to date with minimal effort.

Enrich Your MLflow Experiments with Vectice Dataset Lineage

Now, we will enrich your ML project with dataset lineage; the insights from lineage improve decision-making and add context to your ML project and documentation. Datasets can also have a lineage between each other. For example, your origin dataset can be associated with your cleaned dataset.

You will notice additional data lineage that was not previously available in MLflow, so you can capture detailed information about the datasets used for training, including the list of features, data transformations, and data origin. Below is the lineage between a dataset and a model:

The lineage generated in the Vectice app. The sample above has a dataset as an input, code that is automatically logged, and a model as the output.

By clicking on the artifacts in the lineage, you can view them; this means the lineage gives you quick access to more detailed information for each artifact. If a data subject matter expert or a security analyst in your team needed to view which dataset was used for a model, it would be a click away.

This allows them to see dataset statistics, columns, and resources, which makes it convenient to review sensitive information in context. The dataset overview can also be shared by exporting it to an Excel file.
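To give a feel for the kind of per-column statistics a reviewer might look at, here is a minimal sketch using only the Python standard library. It is not how Vectice computes its statistics. The function `column_statistics`, the chosen summary fields, and the sample column names are assumptions made for illustration.

```python
import csv
import io
import statistics


def column_statistics(csv_text):
    """Compute simple per-column statistics for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    stats = {}
    for column in reader.fieldnames or []:
        values = [row[column] for row in rows]
        present = [v for v in values if v not in ("", None)]
        summary = {"count": len(values), "missing": len(values) - len(present)}
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            numbers = []  # non-numeric column: report counts only
        if numbers:
            summary["mean"] = statistics.mean(numbers)
            summary["min"] = min(numbers)
            summary["max"] = max(numbers)
        stats[column] = summary
    return stats


# Tiny hypothetical sample with one missing numeric value.
sample = "sepal_length,species\n5.1,setosa\n4.9,setosa\n,virginica\n6.3,virginica\n"
stats = column_statistics(sample)
print(stats["sepal_length"])
print(stats["species"])
```

Numeric columns get a mean, minimum, and maximum, while text columns report only row and missing-value counts.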

Below is the lineage between the two datasets and the code used to log the dataset:

import pandas as pd

# Pass a file path and a pandas DataFrame (the DataFrame is used to capture statistics)
dataframe = pd.read_csv("file.csv")
dataset_resource = FileResource(paths="file.csv", dataframes=dataframe)

dataset = Dataset.origin(
    name="Dataset Origin",
    resource=dataset_resource,
)

iteration.log(dataset)

You can access our full sample notebook here on how to document MLflow experiments to Vectice: https://github.com/vectice/GettingStarted/blob/main/23.3/samples/mlflow_sample.ipynb

The lineage is two pointers to the input and output datasets. With the lineage, you can easily access the datasets and their versions with a simple click.
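The idea of lineage as pointers between inputs and outputs can be sketched as a small graph. This is a toy model, assuming artifacts are identified by name. `LineageGraph`, its methods, and the "Baseline Model" name are hypothetical and do not reflect how Vectice stores lineage internally.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage store: each edge points from an input artifact to an output."""

    def __init__(self):
        self._inputs_of = defaultdict(set)

    def link(self, source, target):
        """Record that `target` was derived from `source`."""
        self._inputs_of[target].add(source)

    def upstream(self, artifact):
        """Return every artifact that `artifact` depends on, directly or not."""
        seen = set()
        stack = [artifact]
        while stack:
            current = stack.pop()
            for parent in self._inputs_of.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


lineage = LineageGraph()
lineage.link("Dataset Origin", "Cleaned Dataset")
lineage.link("Cleaned Dataset", "Baseline Model")

print(sorted(lineage.upstream("Baseline Model")))
```

Walking the graph upstream from a model answers the reviewer's question of which datasets fed into it, which is the same question the lineage view answers with a click.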

Vectice Lineage shows datasets as inputs and outputs.

The lineage between datasets allows quick access to dataset statistics, the actual dataset source, and accompanying artifacts created for analysis.

Dataset Resources found in the Vectice App show the file resources and statistics of the associated files of the dataset.

You can also add your own data source to wrap your columnar data and metadata into Vectice. Vectice supports integrating data sources, including Google Cloud Storage, AWS S3, BigQuery, and Databricks.

Review More Than MLflow Metrics with Vectice Approval Features

Reviews in Vectice create the opportunity to ask for an official validation of your work and solicit feedback from other team members, so that the whole team is engaged in reviewing the work and aligned on the outcomes. You can take advantage of this feature as you see fit. Vectice aims to give teams the tools to collaborate without forcing strict workflows that cause friction.

Below is an example of a key experiment in MLflow; the model will be staged for production once approved. The data scientists are discussing which mitigation steps to take, as some of the data comes from external sources.

This example demonstrates one of the benefits of Vectice - you will always have access to why and when decisions were made.

By using MLflow and Vectice together, you can automatically generate documentation for your project milestones and key assets and solicit feedback from your colleagues at different stages of development. In this example, feedback from data experts and security analysts was gathered to support faster deployment and more trustworthy models in production.

Full documentation view in the Vectice App.

Final Thoughts

In this blog post, we illustrated how to combine MLflow and Vectice to enable seamless documentation and sharing of key assets in data science projects, fostering collaboration, best practices, and alignment around the work you produce as a data scientist.

We also saw how to log MLflow run information into Vectice using a few lines of code and how to capture dataset metadata and dataset lineage, which are other critical pieces of the ML lifecycle documentation and can be done very efficiently with the help of Vectice.

Looking Beyond

We only touched on a few capabilities of Vectice to promote project visibility, guide the team with best practices, establish project governance, and facilitate cross-functional collaboration. If you want to learn more about it, you can contact us or give it a spin:

➡️ Try Vectice today
