Why are Feature Stores Important?

This blog post was written by Preet Sanghavi as part of the Deepchecks Community Blog. If you would like to contribute your own blog post, feel free to reach out to us via blog@deepchecks.com. We typically pay a symbolic fee for content that’s accepted by our reviewers.


A machine learning project is composed of many components: data, code, environment, and of course the models. Many machine learning projects are delayed before they go into production due to inadequate feature engineering techniques, failure to store and reuse features, inconsistency between training and production features, and incoherent feature definitions. Feature stores emerged to address these problems. A feature store can be defined as a data management tool that stores features and serves them to models. Simply put, a feature store acts as a shared storage system that lets multiple teams access the same features in a consistent and reliable manner.

Now that we have understood the meaning of the term “feature store”, let us try to understand how one works. A feature store has five primary components.

Feature Store Components

1. Data Transformation

The primary task of this component is to process incoming data periodically and convert new data points into feature values.

2. Data Storage

Feature data is stored in this component, which makes it easy to extract and share features among different teams.

3. Feature Service

This component serves the machine learning model under consideration with all the features it needs, both for training and for making predictions.

4. Feature Metrics Monitoring

This component monitors and updates metrics about the features. These metrics are primarily concerned with the correctness and quality of the feature values.

5. Centralized Registry

This component stores all the feature definitions and metadata.
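
To make the five components more concrete, here is a minimal in-memory sketch in Python. It is only an illustration, not how any particular product is implemented: the class names `FeatureDefinition` and `MiniFeatureStore`, the feature `avg_order_value` and the sample records are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class FeatureDefinition:
    """Registry entry: feature name, description, and transformation."""
    name: str
    description: str
    transform: Callable[[dict], float]


@dataclass
class MiniFeatureStore:
    registry: Dict[str, FeatureDefinition] = field(default_factory=dict)
    storage: Dict[str, Dict[str, Optional[float]]] = field(default_factory=dict)

    def register(self, definition: FeatureDefinition) -> None:
        # Centralized registry: one definition per feature name.
        self.registry[definition.name] = definition

    def ingest(self, entity_id: str, raw: dict) -> None:
        # Data transformation + storage: compute every registered feature.
        row = self.storage.setdefault(entity_id, {})
        for name, definition in self.registry.items():
            try:
                row[name] = definition.transform(raw)
            except (KeyError, ZeroDivisionError):
                row[name] = None  # keep the gap visible for monitoring

    def serve(self, entity_id: str, names: List[str]) -> List[Optional[float]]:
        # Feature service: return the requested features in a fixed order.
        row = self.storage[entity_id]
        return [row[n] for n in names]

    def missing_rate(self, name: str) -> float:
        # Metrics monitoring: share of entities with no value for a feature.
        rows = list(self.storage.values())
        missing = sum(1 for r in rows if r.get(name) is None)
        return missing / len(rows) if rows else 0.0


store = MiniFeatureStore()
store.register(FeatureDefinition(
    "avg_order_value", "total spend divided by number of orders",
    lambda raw: raw["total_spend"] / raw["n_orders"]))
store.ingest("customer_1", {"total_spend": 120.0, "n_orders": 4})
store.ingest("customer_2", {"total_spend": 0.0, "n_orders": 0})

features = store.serve("customer_1", ["avg_order_value"])
rate = store.missing_rate("avg_order_value")
```

Because the transformation lives in the registry, every team that asks for `avg_order_value` gets a value computed the same way, and the missing-value rate gives a simple quality signal for that feature.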

Importance of Feature Stores:

  1. Features are often hosted in multiple places by different teams in an enterprise. Feature stores provide a centralized feature storage option.
  2. Feature stores help transform features into a form suitable for the production environment, reducing the need for extensive data engineering.
  3. Feature stores promote consistent reuse and sharing of features among different teams throughout the organization.
  4. Feature stores monitor the correctness of the feature pipelines in production environments.
  5. Feature stores provide consistent feature definitions, feature versioning, and metadata.

Features of a Feature Store:

1. Support for Temporal data

It is not uncommon to see static data used for building machine learning models on Kaggle, where managing the data and performing feature engineering is relatively easy. In applied, real-world problems, however, many data sources are involved and the data is updated continuously or at regular intervals. Such data is at high risk of mishandling or leakage.

Feature stores make it easier to manage temporal data by assigning a timestamp to every entry in the dataset. This makes it possible to retrieve the feature values for an entry exactly as they existed at a given point in time. This property is referred to as “point-in-time correctness”.
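
A point-in-time lookup can be sketched in a few lines of Python. The example below is a simplified illustration that assumes each feature value is stored with its event time and that the history is sorted; the `history` data and the helper `value_as_of` are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one customer: (event_time, value),
# sorted by event_time.
history = [
    (datetime(2023, 1, 1), 2),
    (datetime(2023, 2, 1), 5),
    (datetime(2023, 3, 1), 9),
]


def value_as_of(history, as_of):
    """Return the latest value recorded at or before `as_of`, else None."""
    times = [t for t, _ in history]
    idx = bisect_right(times, as_of)
    return history[idx - 1][1] if idx > 0 else None


# A label observed on 15 Feb must only see the value from 1 Feb.
feb_value = value_as_of(history, datetime(2023, 2, 15))
before_any = value_as_of(history, datetime(2022, 12, 31))
```

Here `feb_value` is 5 rather than 9: the March value did not exist yet on 15 February, so using it for a training example from that date would leak future information into the model.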

2. Entity Linking

Feature stores help data scientists think in terms of entities rather than the tables and columns of different databases. For example, when estimating customer lifetime value, a feature store shifts the focus from working with customer and product tables to linking customer and product entities. Simply put, a feature store identifies the relevant entities of a business and helps establish links between them.
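
As a rough sketch of the idea, the Python snippet below builds one feature row by following entity keys instead of writing table joins by hand. The feature tables, the `ENTITY_TABLES` mapping and the helper `build_feature_row` are hypothetical names chosen for illustration.

```python
# Hypothetical feature tables, each keyed by one entity's ID.
customer_features = {"c1": {"customer_tenure_days": 420}}
product_features = {"p9": {"product_avg_rating": 4.6}}

# Entity name -> feature table for that entity.
ENTITY_TABLES = {"customer": customer_features, "product": product_features}


def build_feature_row(entity_keys):
    """Merge the features of every entity referenced in `entity_keys`."""
    row = {}
    for entity, key in entity_keys.items():
        row.update(ENTITY_TABLES[entity].get(key, {}))
    return row


# One order links a customer entity to a product entity.
order_row = build_feature_row({"customer": "c1", "product": "p9"})
```

The caller only states which customer and which product are involved; where the underlying features are stored is hidden behind the entity names.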

3. Application-based support

Feature stores are highly adaptive. They support feature serving for both online and offline models. Online models are used for real-time predictions and are typically hosted behind a web service. Offline models, on the other hand, involve batch training or batch predictions and are generally preferred when working with large volumes of data.

Real-time models need the features of only a few records at a time, but they need them quickly in order to make fast predictions. Offline models, on the other hand, need a feature set built from a significantly larger corpus, and speed is much less of a concern. For this reason, the two modes have different architectures.
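
The difference between the two modes can be illustrated with a small Python sketch. It assumes a common design in which every write goes to both an append-only offline log and an online key-value store that keeps only the latest record; the `DualStore` class and the rider data are hypothetical.

```python
from collections import defaultdict


class DualStore:
    def __init__(self):
        self.offline = defaultdict(list)  # entity_id -> full history
        self.online = {}                  # entity_id -> latest features only

    def write(self, entity_id, event_time, features):
        # Every write is appended to the offline log for later batch use...
        self.offline[entity_id].append((event_time, features))
        # ...while the online store keeps only the newest record per entity.
        current = self.online.get(entity_id)
        if current is None or event_time >= current[0]:
            self.online[entity_id] = (event_time, features)

    def get_online(self, entity_id):
        # Single-key lookup, suited to real-time prediction.
        return self.online[entity_id][1]

    def get_offline(self):
        # Full scan of the history, suited to batch training.
        return [(eid, t, f) for eid, rows in self.offline.items() for t, f in rows]


store = DualStore()
store.write("rider_1", 1, {"trips_last_7d": 3})
store.write("rider_1", 2, {"trips_last_7d": 4})
latest = store.get_online("rider_1")
training_rows = store.get_offline()
```

The online lookup touches a single key, while the offline read scans the whole history, which mirrors the trade-off between latency and data volume described above.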

Creating a feature store:

There are multiple options for creating a feature store. Some of them are Amazon SageMaker Feature Store, Databricks, H2O.ai, Rasgo, Scribble Data, Kaskada, and Tecton. FeatureStoreComparison has built a great way to compare different feature stores using factors like products, pricing, and operations support.

Let us explore the AWS SageMaker Feature Store. It is important to note that the supported data types are string (the default), integral (integer) and fractional (float). Moreover, to start using the feature store, one needs to set up the SageMaker session, the boto3 session and the Feature Store session. More information about the AWS SageMaker Feature Store can be found in the AWS documentation.

One can make use of the following code to start using the feature store in Python.

import boto3
import sagemaker
from sagemaker.session import Session

# Create the SageMaker session first so that the region can be read from it
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
boto_session = boto3.Session(region_name=region)
role = sagemaker.get_execution_role()

# S3 location used by the offline feature store
default_bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker-featurestore'
offline_feature_store_bucket = 's3://{}/{}'.format(default_bucket, prefix)

# Clients for the SageMaker control plane and the Feature Store runtime
sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)

# Session that ties the boto3 session and both clients together
feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)

Python Code to implement the AWS SageMaker Feature Store
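
Once the sessions exist, a feature group needs a list of feature definitions. The sketch below shows one possible way to derive them from a sample record using plain Python. It assumes the `FeatureName`/`FeatureType` dictionary shape and the type names `String`, `Integral` and `Fractional` used by the SageMaker API; the helper `to_feature_definitions` and the sample record are hypothetical and not part of any AWS library.

```python
def to_feature_definitions(record):
    """Map a sample record to SageMaker-style feature definitions."""
    definitions = []
    for name, value in record.items():
        if isinstance(value, bool):
            feature_type = "String"      # assumption: store booleans as text
        elif isinstance(value, int):
            feature_type = "Integral"
        elif isinstance(value, float):
            feature_type = "Fractional"
        else:
            feature_type = "String"      # default type for everything else
        definitions.append({"FeatureName": name, "FeatureType": feature_type})
    return definitions


# Hypothetical sample record for a customer feature group.
sample = {"customer_id": "c1", "n_orders": 4, "avg_order_value": 30.0}
definitions = to_feature_definitions(sample)
```

The resulting list could then be supplied as the feature definitions when the feature group is created. Checking for `bool` before `int` matters because Python treats booleans as integers.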

Let us try to explore how different companies around the world make use of feature stores.

Use Case Examples:

1. iFood

iFood is a great example of a company that makes use of feature stores. As the name suggests, iFood is a food-tech company. It is one of the biggest in Latin America and has more than 20 million users. As the number of users increases, the amount of data generated every second also grows significantly. This data includes the addresses of customers, their food preferences, restaurant details, dish details, and so on.

iFood also makes use of machine learning and AI to recommend dishes to its users based on their food preferences. In order to train such models, iFood developed its own feature store. This helps the company generate relevant features from the data, serve features to different models, and design goal-oriented distributed data processing pipelines. The real-time feature store was built with Spark and Databricks.

2. Uber

Uber has revolutionized the world not only with its highly efficient cab services but also with its ability to deliver highly accurate machine learning systems. One of the major problems reportedly faced by Uber is scaling. As more and more people sign up and update their travel preferences, it becomes cumbersome to scale the machine learning operations.

Uber’s Michelangelo platform has gone a long way toward solving this problem. Its feature store was built as a purpose-driven platform for features: converting raw data into useful features, storing them, serving them to models for online use and offline batch training, and enabling feature reuse. Users benefit from this feature store every time they schedule a ride and interact with the app.

This can be understood with the help of the following steps:

  1. Fetch and store relevant feature sets from the user’s data, such as their location, address, and destination.
  2. Serve the features to the machine learning models to provide an accurate estimated time of arrival.
  3. Update the stored feature sets with the user’s latest data to maintain point-in-time correctness of the features.

Cons of using a Feature Store:

While there are multiple advantages of using a feature store, it is also important to realise its cons.

These can be illustrated as follows:

1. Research focused environment.

If a particular team in an organization is purely research oriented and not concerned with operating the model, building or maintaining a feature store can be overkill.

2. Usage complexity.

Feature stores are also fairly intricate to use. An organization will have to incur costs to hire a professional to build, manage and monitor a feature store. Moreover, it is difficult to select an efficient and easy-to-use feature store that understands the existing data fairly well and provides entity resolution.

3. Not feasible to build.

While many organizations have built their own feature stores, it took them years of work and cost them a fortune. It is not easy for budding organizations or startups to build their own feature stores from scratch. Moreover, it is difficult to choose a feature store that can cater to an organization’s needs efficiently. Thus, it is important to find the right balance and compare different options while working with a feature store.

To explore Deepchecks’ open-source library, go try it out yourself! Don’t forget to star their GitHub repo; it’s really a big deal for open-source-led companies like Deepchecks.
