Why and how to implement a feature store with Feast

What is a feature store?

Cerebrium · Nov 2, 2022

Before we dive into what a feature store is, a quick refresher: in machine learning, a feature is a piece of data used as input to a predictive model. It is the x in f(x) = y.

A feature store is an ML-specific system that:

  • Transforms raw data into feature values for use by ML models — think a data pipeline
  • Stores and manages this feature data, and
  • Serves feature data consistently for training and inference purposes

What problem are feature stores trying to solve?

Feature stores are trying to solve 3 problems:

  • When an ML model is trained on preprocessed data, the identical preprocessing steps must be carried out on incoming prediction requests, because the model needs data with the same characteristics as the data it was trained on. If we don’t do that, we get training-serving skew, and model predictions will not be as good.
  • Many companies use the same features across a variety of models, so a feature store acts as a central hub from which many models consume those features. It prevents repetitive engineering setup and inconsistent pre-processing steps for the same features.
  • It takes care of the engineering burden of pre-loading features into low-latency storage, while guaranteeing that those features were calculated the same way everywhere.

When to use a feature store

In most cases feature stores add unnecessary complexity and are only suited to specific ML use cases. You might even be asking, “If a feature store simply makes sure the same pre-processing happens on the data, why can’t I do that transformation during inference on the raw data?”

There are two scenarios where that isn’t viable:

  • The first situation is when the feature value will not be known by the client requesting the prediction and has to be computed on the server instead. In that case we need a mechanism to inject the feature values into incoming prediction requests, and the feature store plays that role. For example, one of the features of a dynamic pricing model may be the number of website visitors to the item listing over the past hour. The client (think of a mobile app) requesting the price of a hotel will not know this feature’s value; it has to be computed on the server using a streaming pipeline on clickstream data and inserted into the feature store. You can also imagine that if you have to fetch a lot of data, this cannot be done quickly enough at request time.
  • The second situation is to prevent unnecessary copies of the data. For example, consider that you have a feature that is computationally expensive and is used in multiple ML models. Rather than using a transform function and storing the transformed feature in multiple ML training datasets, it is much more efficient and maintainable to store it in a centralized repository.

To summarize, a feature store is most valuable when:

  • A feature is unknown to the client and needs to be fetched/computed server-side
  • A feature requires intensive computation
  • A feature is used by many different models

Typical Feature Store Architecture

Tutorial Brief

During model experimentation, let us assume that our data scientists suggested a slightly different feature structure for our model, one that requires low latency because some features need server-side processing.

Throughout this tutorial, we will be predicting whether a transaction made by a given user will be fraudulent. This prediction will be made in real-time as the user makes the transaction, so we need to be able to generate a prediction at low latency.

Our system will perform the following workflows:

  • Computing and backfilling feature data from raw data
  • Building point-in-time correct training datasets from feature data and training a model
  • Making online predictions from feature data

We will be using an open-source framework called Feast, which is built by the team at Tecton, one of the leading feature-store companies globally. Tecton is a hosted version of Feast that comes with a few more beneficial features, such as monitoring. We will then be deploying our application to AWS.

Data Setup

If you don’t have it, download the data required for this tutorial from here. It originally comes from a Kaggle dataset for Fraud Detection. Place the dataset in a data directory in the root of your project. You can run this project either in VS Code or in Google Colab here. You can also check out the GitHub repo here.

We’re going to convert this dataset into a format that Feast can understand: a parquet file. We also need to add two columns, event_timestamp and created_timestamp, so that Feast can index the data by time. We’ll do this by min-max normalizing the TransactionDT column, using the normalized column to assign each row a timestamp between a year ago and the current date, and then adding these columns to the data.

In a Python REPL or as a separate file, run the following code block:

import pandas as pd
from datetime import datetime, timedelta

df = pd.read_csv("data/train_transaction.csv")

# Normalize TransactionDT to the range [0, 1]
df["TransactionDT"] = df["TransactionDT"] / df["TransactionDT"].max()

# Map the normalized values onto timestamps spanning the last year
end = datetime.today()
start = (end - timedelta(days=365)).timestamp()
end = end.timestamp()
df["event_timestamp"] = pd.to_datetime(
    df["TransactionDT"].apply(lambda x: round(start + x * (end - start))), unit="s"
)
df["created_timestamp"] = df["event_timestamp"].copy()

# Keep only the columns we will register as features (plus the label and timestamps)
df = df[
    [
        "TransactionID",
        "ProductCD",
        "TransactionAmt",
        "P_emaildomain",
        "R_emaildomain",
        "card4",
        "M1",
        "M2",
        "M3",
        "created_timestamp",
        "event_timestamp",
        "isFraud",
    ]
]

# Lowercase the column names to match our feature definitions, then write to parquet
df.columns = [x.lower() for x in df.columns]
df.to_parquet("data/train_transaction.parquet")
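
If you want to confirm the conversion worked before moving on, a quick sanity check (assuming the paths used above) is to read the parquet file back and inspect it:

import pandas as pd

# Read the converted file back and confirm the columns and timestamp range look sane
check = pd.read_parquet("data/train_transaction.parquet")
print(check.dtypes)
print(check["event_timestamp"].min(), "to", check["event_timestamp"].max())
print(f"{check['isfraud'].mean():.2%} of transactions are labelled as fraud")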

Setup the Infrastructure

Since infrastructure and architecture are not the focus of this tutorial, we will use Terraform to quickly set up our infrastructure in AWS so we can continue with the rest of the tutorial.

Without deviating too much, let us quickly explain what Terraform is and the components we set up:

  • Terraform is an infrastructure-as-code tool that allows you to create and change infrastructure predictably. In plain English, think of it as a setup definition file: with one command you can create development and production environments that are exact replicas of each other.

The following is created from the terraform file:

  • S3 bucket — this is where we store the data files used in this tutorial
  • Redshift cluster — this is the AWS data warehouse we will be using
  • AWS Glue — this is the AWS ETL tool we will use to get our data from S3 into Redshift
  • AWS IAM roles — the roles these three resources need in order to interact with each other

Okay, enough geeking out on Terraform — let’s keep moving!

We need to set up our AWS credentials in order to deploy this Terraform setup to our account. To start, make sure you have your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables set. If not, go to your AWS console and follow the instructions below:

  • Go to the IAM service
  • Click “Users” in the sidebar
  • Go through the steps to create a user and attach the policies listed below.

If you already have a user, make sure you have the following permissions:

  • AmazonRedshiftDataFullAccess
  • AmazonS3FullAccess
  • AWSGlueConsoleFullAccess
  • IAMFullAccess

Once a user is created, you can click on your user and go to the tab that says “Security Credentials”. Scroll down and click the button that says “Create access key”. You should then see an Access Key and Secret Key generated for you.

Run the commands below, pasting in the generated keys:

export AWS_ACCESS_KEY_ID=<your-access-key>
export AWS_SECRET_ACCESS_KEY=<your-secret-key>

Install Terraform. We use Homebrew on macOS, but you can install it however you prefer.

brew install terraform

In your terminal, go to the “infra” folder that came along with this tutorial. We are going to initialise Terraform in this folder and apply the plan. Name the project fraud-classifier.

cd infra
terraform init
export TF_VAR_region="us-west-2"
export TF_VAR_project_name="fraud-classifier"
terraform apply -var="admin_password=thisISTestPassword1"

Once your infrastructure is deployed, you should see the following fields in the output in your terminal. Save these; we will need them later.

- `redshift_cluster_identifier`
- `redshift_spectrum_arn`
- `transaction_features_table`
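
If you would like to double-check the deployment from Python rather than clicking through the AWS console, a minimal sketch with boto3 might look like the following. The cluster identifier and bucket name are placeholders (use your own Terraform output values), and your IAM user needs permission to describe the cluster.

import boto3

# Fill these in from your Terraform outputs / setup
REGION = "us-west-2"
CLUSTER_IDENTIFIER = "<redshift_cluster_identifier>"
BUCKET_NAME = "<your-s3-bucket-name>"

# Confirm the Redshift cluster is up
redshift = boto3.client("redshift", region_name=REGION)
cluster = redshift.describe_clusters(ClusterIdentifier=CLUSTER_IDENTIFIER)
print("Redshift cluster status:", cluster["Clusters"][0]["ClusterStatus"])

# Confirm the S3 bucket is reachable (raises an error if it is not)
s3 = boto3.client("s3", region_name=REGION)
s3.head_bucket(Bucket=BUCKET_NAME)
print("S3 bucket reachable")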

We are now going to create an external schema called spectrum in Redshift, backed by the AWS Glue data catalog, so that Redshift can query the data we have in S3. Use the values from the previous output.

aws redshift-data execute-statement \
--region us-west-2 \
--cluster-identifier <redshift_cluster_identifier> \
--db-user admin \
--database dev \
--sql "create external schema spectrum from data catalog database 'dev' iam_role '<redshift_spectrum_arn>' create external database if not exists;"

You should then get a JSON result back. Grab the Id field returned, and run a describe-statement below with that value to check if the job completed successfully. You should see a Status of FINISHED.

aws redshift-data describe-statement --id <Id> --region us-west-2
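
If you prefer to script this check instead of re-running the CLI by hand, a small polling loop using boto3 (a sketch, assuming the same region and the Id returned above) could look like this:

import time
import boto3

STATEMENT_ID = "<Id>"  # the Id field returned by execute-statement

client = boto3.client("redshift-data", region_name="us-west-2")

# Poll until the statement reaches a terminal state
while True:
    response = client.describe_statement(Id=STATEMENT_ID)
    status = response["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(2)

print("Statement status:", status)
if status == "FAILED":
    print("Error:", response.get("Error"))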

If all of that ran successfully, we are done with our AWS setup!

Setup the feature store

To get started, let us install the Feast framework. Feast can be installed using pip.

pip install feast

Make sure you now cd back into the root of the project. In Feast, you define your features in a feature repository, configured via a .yaml file. To create a repository, run the command below and follow the prompts. The Redshift database name will be dev and the user will be admin. For the staging location, use the s3://fraud-classifier-bucket bucket that was created by the Terraform plan. Use arn:aws:iam::<account_number>:role/s3_spectrum_role as the S3 IAM role.

cd ..
# running this can take some time, we prefer you use the files in tutorial folder
feast init -t aws feature_repo

This will create a folder called feature_repo containing a few files, most of which are examples (you should delete driver_repo.py and test.py). The only one we care about is:

- feature_store.yaml: This is a configuration file where you define the location of your Redshift cluster, S3 bucket and DynamoDB table.

NB: Make sure to use the same AWS region you used in your terraform setup

This file contains the following fields:

  • project: The name you would like to call the project.
  • registry: The registry is a central catalog of all the feature definitions and their related metadata. It is a file that you can interact with through the Feast API.
  • provider: The cloud provider you are using — in our case AWS
  • online_store: The Online store is used for low-latency online feature value lookups. Feature values are loaded into the online store from data sources. Online stores only hold the latest values per entity key. An online store would be something such as Redis or DynamoDB — low latency.
  • offline_store: The offline store holds historic feature values; it does not generate them. It is used as the interface for querying existing features or for loading them into an online store for low-latency prediction. An offline store would be something like a data warehouse or storage bucket — high latency, but a lot of historical data.
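
To make those fields concrete, the feature_store.yaml for this setup might look roughly like the sketch below. The registry path, cluster identifier, account number and staging prefix are illustrative placeholders; use the values from your own Terraform outputs and init prompts.

project: feature_repo
registry: data/registry.db
provider: aws
online_store:
  type: dynamodb
  region: us-west-2
offline_store:
  type: redshift
  cluster_id: <redshift_cluster_identifier>
  region: us-west-2
  database: dev
  user: admin
  s3_staging_location: s3://fraud-classifier-bucket/staging
  iam_role: arn:aws:iam::<account_number>:role/s3_spectrum_role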

Since we are using AWS, we passed aws to the init command. You can replace that with other cloud providers (e.g. for Google Cloud you would use gcp).

Within the feature_repo folder, create a file called features.py in which we will define our features. Before we get started, we need to understand the concept of an Entity and a FeatureView:

  • Entity: An entity is a collection of semantically related features. For example, Uber would have customers and drivers as two separate entities that group features that correspond to those entities.
  • FeatureView: A feature view is an object that represents a logical group of time-series feature data as it is found in a data source. They consist of zero or more entities, one or more features and a data source.

Fill the file with the following contents:

from datetime import timedelta
from feast import Entity, Feature, FeatureView, RedshiftSource, ValueType

# The entity our features are keyed on
transaction = Entity(name="transactionid")

# The offline data source: a query against the external schema in Redshift
transaction_source = RedshiftSource(
    query="SELECT * FROM spectrum.transaction_features",
    event_timestamp_column="event_timestamp",
    created_timestamp_column="created_timestamp",
)

# A feature view grouping the transaction features served from this source
transaction_features = FeatureView(
    name="transaction_features",
    entities=["transactionid"],
    ttl=timedelta(days=30),
    features=[
        Feature(name="productcd", dtype=ValueType.STRING),
        Feature(name="transactionamt", dtype=ValueType.DOUBLE),
        Feature(name="p_emaildomain", dtype=ValueType.STRING),
        Feature(name="r_emaildomain", dtype=ValueType.STRING),
        Feature(name="card4", dtype=ValueType.STRING),
        Feature(name="m1", dtype=ValueType.STRING),
        Feature(name="m2", dtype=ValueType.STRING),
        Feature(name="m3", dtype=ValueType.STRING),
    ],
    batch_source=transaction_source,
)

First we create our transaction entity and define the SQL query that fetches the required features from our Redshift data warehouse. We then create a FeatureView that uses the Redshift source to fetch those features and defines the data type of each one. We also set a ttl of 30 days, which limits how far back Feast will look for feature values when joining them onto an entity.

Deploy the feature store by running feast apply from within the feature_repo/ folder.

cd feature_repo
feast apply

If everything was created correctly, you should see the following output:

Created entity transaction
Created feature view transaction_features
Deploying infrastructure for transaction_features

Next we load our features into the online store using the materialize-incremental command. This command loads the latest feature values from the data source into the online store, picking up from the last materialize call. There is an alternative command, materialize, that lets you load features from a specific date range rather than just the latest data. You can read more about it here.

CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
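
The same step can also be triggered from Python rather than the CLI; here is a minimal sketch using Feast’s Python API, assuming you run it from the project root so the feature_repo path resolves:

from datetime import datetime, timezone
from feast import FeatureStore

# Point Feast at the repository containing feature_store.yaml
store = FeatureStore(repo_path="feature_repo")

# Load feature values up to now into the online store (DynamoDB),
# continuing from the end of the previous materialization window
store.materialize_incremental(end_date=datetime.now(timezone.utc))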

If successful, you should see some activity in your terminal showing that it is uploading the features. Once completed, you should see the results in the DynamoDB instance on AWS. This will take just under 6 minutes, so you may want to grab a coffee!

Integrate the feature store with your model

In our project, we have two files with respect to our model:

  • run.py: This is a helper file that runs through the full model workflow. It fetches the historical transaction data, trains our model and then makes a prediction to determine whether the sample transaction is fraudulent.
  • fraud_detection_model.py: This file shows how we use Feast during model building as well as during inference.

Let’s go through run.py first as it’s quite simple. Here we load our training data, train our model and make a prediction against the online Feast store.

import boto3
import pandas as pd
from fraud_detection_model import FraudClassifierModel

# Get historic transactions from parquet
transactions = pd.read_parquet("data/train_transaction.parquet")

# Create model
model = FraudClassifierModel()

# Train model (using Redshift for transaction history features)
if not model.is_model_trained():
    model.train(transactions)

# Make online prediction (using DynamoDB for retrieving online features)
loan_request = {
    "transactionid": [3577537],
    "transactionamt": [30.95],
    "productcd": ["W"],
    "card4": ["mastercard"],
    "p_emaildomain": ["gmail.com"],
    "r_emaildomain": [None],
    "m1": ["T"],
    "m2": ["F"],
    "m3": ["F"],
}
result = model.predict(loan_request)

if result == 0:
    print("Transaction OK!")
elif result == 1:
    print("Transaction FRAUDULENT!")

For fraud_detection_model.py we won’t go through the entire file, just a few snippets. We start by defining our model features, referencing each one as [feature view name]:[feature name].

# Line 21
feast_features = [
    "transaction_features:productcd",
    "transaction_features:transactionamt",
    "transaction_features:p_emaildomain",
    "transaction_features:r_emaildomain",
    "transaction_features:card4",
    "transaction_features:m1",
    "transaction_features:m2",
    "transaction_features:m3",
]

During the initialization of our model we attach the feature store to our model object so we can use it later. The repo path is the folder containing the feature_store.yaml and the features.py we created above — Feast fetches its configuration from there.

# Line 57
self.fs = feast.FeatureStore(repo_path="feature_repo")

When we would like to train our model, we want to get the historical data relating to our features. The method below launches a job that executes a join of features from the offline store onto the entity dataframe.

An entity dataframe is the target dataframe onto which you would like to join feature values. It must contain a timestamp column called event_timestamp and all the entities (primary keys) needed to join feature tables onto it. Every entity referenced by the feature views being joined must be present as a column on the entity dataframe. In our case, transactions contains a column called transactionid, which we use to get all the transaction features. We should also ensure the target variable is attached to the entity dataframe.

Once completed, a job reference will be returned. This job reference can then be converted to a Pandas dataframe by calling to_df().

# Line 66
training_df = self.fs.get_historical_features(
    entity_df=transactions[["transactionid", "event_timestamp", "isfraud"]],
    features=self.feast_features,
).to_df()

When we do online inference (prediction) with our model, we don’t want to fetch all the historical data from our data warehouse, since that would take multiple seconds. Instead we want to fetch the data we need from a low-latency data source so we can keep response times low (~100 ms). We do that below with the get_online_features function.

# Line 108
transaction = request["transactionid"][0]
return self.fs.get_online_features(
    entity_rows=[{"transactionid": transaction}],
    features=self.feast_features,
).to_dict()

The above allows us to pass in a specific transaction id and get the feature values for that entity almost instantaneously. We can then use these values in our predict function to return a prediction for the transaction.
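
The dict returned by get_online_features maps each feature name to a list of values, one entry per entity row. As a hedged sketch (the helper name and the final predict call are illustrative, not part of the repo), you could reshape it into a single-row DataFrame before handing it to the model:

import pandas as pd

def online_features_to_frame(features_dict: dict) -> pd.DataFrame:
    # Each key is a feature name, each value a list with one entry per entity row
    frame = pd.DataFrame.from_dict(features_dict)
    # Drop the join key if the model was not trained on it
    return frame.drop(columns=["transactionid"], errors="ignore")

# Illustrative usage with the dict returned above:
# features = model.fs.get_online_features(...).to_dict()
# X = online_features_to_frame(features)
# prediction = some_classifier.predict(X)  # 'some_classifier' is a placeholder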

Now let us run our run.py file to see this live and check the output of our model:

python run.py

That’s it for our tutorial on feature stores! Feature stores can add a lot of value to your ML infrastructure when you use the same features across multiple models or need server-side feature calculations, but they do add some complexity. Feast is a great way to implement one, but if you want a more managed approach with extra functionality, such as identifying model drift, you can try Tecton or the feature stores native to the AWS and Google platforms.
