ML Platform Architecture: Building an ML Platform from Scratch


As ML and AI teams grow, it becomes increasingly important to standardize how models are designed, trained, deployed, and monitored. One way to improve standardization is through the adoption of an ML platform. In this article, I’ll outline the steps you can take to create your own ML platform from scratch using open-source tools.

But first, let’s address the elephant in the room:

“Why should I build my ML platform instead of just buying one?” 

Ah yes, “Build vs. Buy”; a debate as old as time – OK, not quite, but you know what I mean.

While building an ML platform from scratch can seem overwhelming, and many commercial ML platforms exist that claim to abstract away this complexity, buying an enterprise ML platform is rarely the best course of action.

To begin with, no ML platform is entirely turnkey. As such, ML teams must have significant knowledge of MLOps to get the most from the platform.

Additionally, licensing a proprietary ML platform creates vendor lock-in for an organization.

Why is that a problem?

Simply put, once an organization is locked into a specific vendor, it is dependent upon that vendor.

What if you need functionality that the vendor can’t provide, the vendor decides to raise the license cost, or you have to purchase additional add-ons to get the service you need? If the organization is locked in, they have two options: deal with it, or migrate to a different platform and start over from scratch.

For these reasons, it is often wiser to design your own platform from open-source tools than to force your business’s use case into the predefined mold of an enterprise ML platform.

Why Would You Want To Do So?

For starters, open-source tools offer greater flexibility than enterprise solutions: adding, adapting, or removing a given tool is significantly easier than attempting to do the same with enterprise tools.

Additionally, the best open-source tools are supported by a diverse community, which tends to result in extensive and accurate documentation. In other words, if you have a question, finding a solution is usually a web search away as opposed to contacting customer service or requesting a consultation with sales.

Finally, MLOps talent wants to learn while they earn.

As such, open-source tools provide MLOps team members with the opportunity to learn and share transferable skills with others in the community, which only serves to burnish your organization’s brand and reputation; it’s much easier to attract top talent if they know they’re going to master the tools found throughout the industry.

“But what if I just want to get started using ML tools? Do I really have to use open-source tools?“

Of course not; there is no one-size-fits-all approach to MLOps. We should always start by clearly stating our goal and what resources we have at our disposal.

For example, if time is of the essence because we’re attempting to create a proof-of-concept application that doesn’t need to scale up or down to meet demand, then using a ready-made proprietary tool that fits our use case could be the best option.

Additionally, open-source tools are built by and for programmers and DevOps engineers. To that end, if your team is light on engineering talent, then hiring engineers to build and maintain your ML platform could be the best place to start.

“So I have to choose between open-source and enterprise solutions?”

Thankfully, the answer is no. In fact, many organizations combine enterprise and open-source tools to create their platform. For example, an organization could train a model on a GPU instance with Amazon SageMaker but create and manage the pipeline with Pulumi or Sagify to avoid costly errors (more on that later). The question isn’t “We have this tech, what can we use it for?” It is, “We have this problem. What is the best tech we can use to solve it?”

With that out of the way, let’s get started!

Scope Of The Project

As the diagram below illustrates, the MLOps lifecycle is incredibly complicated.

Machine Learning Development Life Cycle Process (Source)

Additionally, the number of tools available to complete those tasks is massive!

Fortunately, we can create a basic platform architecture using only a handful of open-source tools to accomplish the following key tasks:

  • Data and code management
  • Creating a feature store
  • Managing training and experiments
  • Packaging and deploying the model
  • Serving the model
  • Monitoring the model in production

Additionally, all the tools we’ll describe are platform-agnostic, meaning they can be used on AWS, Azure, Paperspace, or any other cloud provider.

Which tools are those? 

Thank you so much for asking!

Versioning Data And Code

All reproducible projects require code repositories for version control and for continuous integration and continuous deployment (CI/CD). With that in mind, select a hosting site like GitHub, GitLab, or Bitbucket that provides Git-based version control along with CI/CD pipelines.

As for versioning data, emailing CSV files back and forth is a recipe for disaster. Sites like GitHub have size limits for files, so organizations need a separate method for versioning data. Tools like DVC make it simple to identify what data was used to train which models and where it is located (e.g., AWS S3, Azure Blob Storage).
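
To make the idea concrete, here is a minimal sketch of the pattern that data-versioning tools build on: the large file lives in remote storage, while a small pointer file containing a content hash is committed to Git. This is not DVC's API or file format; the function names, the `.pointer.json` suffix, and the `s3://example-bucket/data` location are all hypothetical and only serve the illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path, remote_url: str) -> Path:
    """Write a small JSON pointer file that is cheap to commit to Git."""
    pointer = {
        "path": data_path.name,
        "sha256": file_digest(data_path),
        "size_bytes": data_path.stat().st_size,
        "remote": remote_url,  # hypothetical storage location
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def data_matches_pointer(data_path: Path, pointer_path: Path) -> bool:
    """Check whether the data on disk is the version the pointer recorded."""
    pointer = json.loads(pointer_path.read_text())
    return pointer["sha256"] == file_digest(data_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = Path(tmp) / "train.csv"
        csv_path.write_text("age,income\n34,52000\n")
        pointer_path = write_pointer(csv_path, "s3://example-bucket/data")
        print(data_matches_pointer(csv_path, pointer_path))  # True

        # Any change to the data is detected through the hash.
        csv_path.write_text("age,income\n34,52000\n41,61000\n")
        print(data_matches_pointer(csv_path, pointer_path))  # False
```

Because the pointer file is versioned alongside the training code, checking out an old commit tells you exactly which data a given model was trained on.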

It’s crucial to monitor the data the ML model consumes, so be sure to use monitoring software like Deepchecks.

Creating A Feature Store

While on the topic of data, data science professionals spend just under 40% of their time cleaning and transforming data.

If you subscribe to the DRY principle (and if you’re reading this I’m sure you do), then you know how much time is saved by not repeating yourself. As such, employing a feature store like Feast or Hopsworks that will store, transform, serve, monitor, and register features frees the MLOps team to focus on solving business problems.

That said, some people argue that a feature store is superfluous, and this is especially true if you’re working on a proof-of-concept, have an incredibly small team, and/or are only training a limited number of models.

However, if you have any plans of scaling, then the old expression “an ounce of prevention is worth a pound of cure” is incredibly relevant.
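
As a rough illustration of what "not repeating yourself" means here, the sketch below is a toy in-memory feature store: each transformation is registered once and then reused for both training and serving. It is not the Feast or Hopsworks API; the class, method, and feature names are hypothetical, and a real feature store adds persistence, versioning, and monitoring on top.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass
class FeatureStore:
    """Toy in-memory feature store: define a feature once, reuse it everywhere."""

    _transforms: Dict[str, Callable[[Row], float]] = field(default_factory=dict)
    _values: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def register(self, name: str, transform: Callable[[Row], float]) -> None:
        """Register a named feature; duplicate names are rejected."""
        if name in self._transforms:
            raise ValueError(f"feature {name!r} is already registered")
        self._transforms[name] = transform

    def ingest(self, entity_id: str, raw: Row) -> None:
        """Compute and store every registered feature for one entity."""
        self._values[entity_id] = {
            name: transform(raw) for name, transform in self._transforms.items()
        }

    def get_features(self, entity_id: str, names: List[str]) -> List[float]:
        """Return the stored feature values in the requested order."""
        stored = self._values[entity_id]
        return [stored[name] for name in names]


if __name__ == "__main__":
    store = FeatureStore()
    store.register("debt_to_income", lambda row: row["debt"] / row["income"])
    store.register("is_adult", lambda row: float(row["age"] >= 18))

    store.ingest("customer-1", {"debt": 15000.0, "income": 60000.0, "age": 34.0})

    # The training pipeline and the serving API both read the same values.
    print(store.get_features("customer-1", ["debt_to_income", "is_adult"]))
    # [0.25, 1.0]
```

The point of the design is that the debt-to-income logic exists in exactly one place, so training and production cannot silently drift apart.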

Create The Relevant Infrastructure

The next step would be to create the infrastructure necessary to train, test, deploy, and monitor our model.

I highly recommend using an Infrastructure as Code (IaC) tool like Pulumi, Terraform, Chef, or Puppet, to create as much infrastructure as possible. You could choose to configure everything manually, but I wouldn’t recommend it.

IaC is beneficial on several levels.

To begin with, returning to the DRY principle, IaC saves considerable time by automating the generation of the infrastructure needed for projects.

Moreover, IaC significantly increases transparency by being self-documenting. Rather than relying on practitioners to document everything they do via the user interface, IaC captures that documentation in code that the team can review and version, and that new team members can read to quickly learn the team’s practices.

Finally, implementing IaC can help prevent costly errors by reducing the risk that someone forgets to shut down an instance, selects a GPU instance instead of a CPU instance, or makes any number of other mistakes that can lead to a huge bill. If a mistake is made, you can review the code and change it.
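
Once infrastructure is described as data in a repository, checks like these can run automatically before anything is provisioned. The sketch below is plain Python rather than Pulumi or Terraform code, and everything in it is an assumption made for illustration: the `InstanceSpec` fields, the instance type names, and the two policy rules.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical instance type names; real providers use their own catalogs.
GPU_TYPES = {"gpu-small", "gpu-large"}


@dataclass(frozen=True)
class InstanceSpec:
    """Declarative description of one compute instance."""

    name: str
    instance_type: str
    purpose: str  # "training" or "serving"
    auto_shutdown_minutes: int  # 0 means the instance never shuts itself down


def review(specs: Iterable[InstanceSpec]) -> List[str]:
    """Return human-readable problems found in the declared infrastructure."""
    problems = []
    for spec in specs:
        if spec.purpose == "serving" and spec.instance_type in GPU_TYPES:
            problems.append(f"{spec.name}: GPU instance requested for serving")
        if spec.purpose == "training" and spec.auto_shutdown_minutes == 0:
            problems.append(f"{spec.name}: training instance has no auto-shutdown")
    return problems


if __name__ == "__main__":
    declared = [
        InstanceSpec("trainer", "gpu-large", "training", auto_shutdown_minutes=0),
        InstanceSpec("api", "gpu-small", "serving", auto_shutdown_minutes=0),
        InstanceSpec("batch", "cpu-medium", "training", auto_shutdown_minutes=60),
    ]
    for problem in review(declared):
        print(problem)
```

Running a review like this in CI means the expensive mistake shows up in a pull request instead of on the invoice.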

So what infrastructure should we create?

Since we’ll track the results of our models in training and production, we’ll need a database for storing the metadata.

On top of that, we’ll need to create buckets/blobs for versioning our data and storing our model artifacts.

Container Management

To begin with, we’ll need to set up a container orchestration tool (e.g., Kubernetes, Docker Swarm, Nomad) for managing and serving our model on our cloud platform of choice.

Now I know what you’re thinking: “If I’m using [insert cloud platform here], why can’t I just use their tool?”

Two words: vendor lock-in. The more vendor-specific the tools you use are, the more difficult it will be to leave that vendor. A key reason for adopting open-source tools is flexibility and portability; the more open-source tools you use, the easier it is to move platforms if/when necessary.

And what is in those containers?

Serving the Model(s)

Once we’ve trained and validated our model (or models), we need to put it into production for it to create value. We can use FastAPI, Django REST framework, or Flask to create a web API for making predictions with live data.
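
To show the shape of such an API without pulling in a framework, here is a dependency-free sketch built on Python's standard-library WSGI support. The linear "model" (`WEIGHTS`, `BIAS`), the `/predict` route, and the `call` helper are hypothetical stand-ins; in practice you would load a trained model and let FastAPI or Flask handle routing and validation.

```python
import json
from io import BytesIO
from wsgiref.util import setup_testing_defaults

# Hypothetical stand-in for a trained model: a fixed linear scorer.
WEIGHTS = {"x1": 2.0, "x2": -1.0}
BIAS = 0.5


def predict(features):
    """Score one record; raises KeyError if a required feature is missing."""
    return BIAS + sum(WEIGHTS[name] * float(features[name]) for name in WEIGHTS)


def app(environ, start_response):
    """Minimal WSGI application exposing POST /predict."""
    headers = [("Content-Type", "application/json")]
    if environ["REQUEST_METHOD"] != "POST" or environ["PATH_INFO"] != "/predict":
        start_response("404 Not Found", headers)
        return [b'{"error": "not found"}']
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length))
        status, body = "200 OK", {"prediction": predict(payload)}
    except (KeyError, TypeError, ValueError) as exc:
        status, body = "400 Bad Request", {"error": f"bad request: {exc}"}
    start_response(status, headers)
    return [json.dumps(body).encode("utf-8")]


def call(wsgi_app, method, path, payload):
    """Invoke the app in-process, without opening a network socket."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {}
    setup_testing_defaults(environ)
    environ.update(
        {
            "REQUEST_METHOD": method,
            "PATH_INFO": path,
            "CONTENT_LENGTH": str(len(raw)),
            "wsgi.input": BytesIO(raw),
        }
    )
    captured = {}

    def start_response(status, response_headers):
        captured["status"] = status

    chunks = wsgi_app(environ, start_response)
    return captured["status"], json.loads(b"".join(chunks))


if __name__ == "__main__":
    print(call(app, "POST", "/predict", {"x1": 1.5, "x2": 1.0}))
    # ('200 OK', {'prediction': 2.5})
```

To serve real traffic during local experiments, the same `app` object can be passed to `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`; a production deployment would sit behind a proper server inside the containers described above.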

Model Monitoring

Organizations spend a tremendous amount of resources gathering data and training models before putting them into production. Unfortunately, models begin to decay as soon as they are put into production, for a myriad of reasons. A monitoring tool like Evidently is essential for detecting model drift and decay, that is, for identifying when the model’s performance in production differs significantly from its performance in training.
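
One common way to quantify drift in a single numeric feature is the Population Stability Index (PSI), which compares how values are distributed across bins in training data versus production data. The sketch below is a plain-Python illustration of that statistic, not Evidently's API; the bin count, the `eps` floor for empty bins, and the 0.25 alert threshold (a widely used rule of thumb) are assumptions you should tune for your own data.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float], current: Sequence[float], bins: int = 10
) -> float:
    """PSI between two samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread; cannot build bins")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            index = min(max(index, 0), bins - 1)  # clamp out-of-range values
            counts[index] += 1
        eps = 1e-6  # floor so empty bins do not produce log(0)
        return [max(count / len(values), eps) for count in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


if __name__ == "__main__":
    training = [float(v) for v in range(100)]
    production = [v + 50.0 for v in training]  # simulated upward shift

    psi = population_stability_index(training, production)
    if psi > 0.25:  # rule-of-thumb threshold, an assumption
        print(f"Drift detected (PSI={psi:.2f})")
```

A dedicated monitoring tool runs checks like this across every feature and the model's outputs on a schedule, and adds reporting on top.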

Tracking The Model

Lastly, we’ll want to track the results of the model and compare them with predictions made by updated versions. A tool like MLflow can track those results and store the metadata about the models in the database we created earlier.
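
The core of experiment tracking is simple enough to sketch with SQLite from the standard library: log a metric per model version, then query for the best one. This is not MLflow's API or schema; the `runs` table, the function names, and the example accuracy values are hypothetical and exist only to show the idea.

```python
import sqlite3
from typing import Optional, Tuple


def open_tracker(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the metadata database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS runs (
            run_id INTEGER PRIMARY KEY AUTOINCREMENT,
            model_version TEXT NOT NULL,
            metric TEXT NOT NULL,
            value REAL NOT NULL
        )
        """
    )
    return conn


def log_metric(
    conn: sqlite3.Connection, model_version: str, metric: str, value: float
) -> None:
    """Record one metric value for one model version."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO runs (model_version, metric, value) VALUES (?, ?, ?)",
            (model_version, metric, value),
        )


def best_version(
    conn: sqlite3.Connection, metric: str
) -> Optional[Tuple[str, float]]:
    """Return (model_version, value) with the highest value, or None if empty."""
    return conn.execute(
        "SELECT model_version, value FROM runs "
        "WHERE metric = ? ORDER BY value DESC LIMIT 1",
        (metric,),
    ).fetchone()


if __name__ == "__main__":
    tracker = open_tracker()
    log_metric(tracker, "v1", "accuracy", 0.81)  # example values, not real results
    log_metric(tracker, "v2", "accuracy", 0.86)
    print(best_version(tracker, "accuracy"))  # ('v2', 0.86)
```

A full tracking tool also records parameters, artifacts, and code versions for each run, which is what makes comparing model versions trustworthy.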

If you’ve made it this far, pat yourself on the back 😀

If all the steps above seem like a lot of work, that’s because they are!

However, to quote B. Dave Walters, “I’m not saying it’s going to be easy; I’m saying it’s going to be worth it.” By building an ML platform entirely from scratch, you will avoid vendor lock-in, increase the probability of attracting and retaining top talent, and scale up or down as needed to achieve your business goals.
