The development of ML models and their delivery to users is governed by the Machine Learning life cycle: a process that covers preparing data, training (building) models, and deploying them. Besides enabling businesses to derive value, it helps them manage their resources. These resources range from business assets like customer data and capital to human resources like the data scientists, ML engineers, and DevOps engineers collaborating to make the process successful.
Before this cyclic process commences, businesses need to define the problem they want to solve, create a roadmap, set objectives, and choose metrics to measure success or failure. That could mean customer segmentation for a coffee business using K-means clustering to increase the consumer conversion rate, or a recommendation system that helps customers easily find what they may want to buy on the site. All of this has to be figured out up front so that the teams involved have a clear direction.
The ML project life cycle can generally be divided into three main stages: data preparation, model creation, and deployment. All three of these components are essential for creating quality models that will bring added value to your business. It is called a cycle because, when properly executed, the insights gained from the existing model will direct and define the next model to be deployed.
Training Machine Learning models requires data. A lot of it. The data must be accurate, clean, and relevant to the task we hope our model will master. Preparing your data correctly will save you a lot of time debugging your model later on. Typically, this is achieved with a data pipeline: a sequence of data processing phases, from collecting the data to loading it into a target site such as a data lake or warehouse, depending on the project's demands. The Extract, Transform, and Load (ETL) pipeline is the pattern most practitioners use.
A data preparation process consists of the following:
- Data collection and labelling
- Data cleaning
- Data management
Data Collection and Labelling
Generally speaking, the more complex the task is to learn, the more data you will need. Thus, when possible, it is worth looking for existing datasets that might match your needs since the process of generating new datasets can be quite costly. When existing data doesn’t match the target task exactly, it might still be worth attempting to use a transfer learning approach to reduce the required dataset size (this approach is very popular in NLP).
When you do need to create your own dataset, here are some points to consider:
- Can you use “natural” data and annotate it (less expensive but less task-oriented), or must you create a synthetic dataset?
- Does data labeling require domain expertise (e.g., a medical professional), or can it be outsourced to non-experts (Amazon Mechanical Turk is a popular option if the answer is yes)?
When data is scarce, data augmentation can help you magnify your dataset by applying automatic alterations to the data. For example, if an image of a cat is rotated, it is still an image of a cat. More complex augmentation can manipulate the label as well. For example, in a sentiment analysis task, if we are given a movie review labeled “positive” and we add a negation to the review text, we can expect the new label to be “negative”.
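As a concrete illustration, here is a minimal sketch of label-preserving augmentation, using a tiny NumPy array to stand in for a real image (the array values and the `augment_rotations` helper are hypothetical):

```python
import numpy as np

def augment_rotations(image, label):
    """Generate rotated copies of an image; the label is unchanged."""
    return [(np.rot90(image, k), label) for k in range(4)]

# A tiny 2x2 "image" stands in for a real photo of a cat.
cat = np.array([[1, 2],
                [3, 4]])
augmented = augment_rotations(cat, "cat")
print(len(augmented))  # 4: the original orientation plus three rotations
```

A single labeled example became four, at the cost of a few array operations rather than new annotation work.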
Data Cleaning
Datasets often contain missing values, or values of the wrong type or range. If you’ve ever experienced a shifted entry in a spreadsheet, think about how that might ruin your dataset. Additionally, removing redundant features can greatly help the training process. Proper data cleaning can be laborious, but implementing and automating it well will boost the quality of your data, and therefore your model, with a minimal amount of effort.
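A minimal sketch of this kind of cleaning with pandas, assuming a toy DataFrame with hypothetical `age`, `income`, and `country` columns:

```python
import pandas as pd

df = pd.DataFrame({
    "age":     [34, None, 29, 41],
    "income":  [52000, 61000, None, 78000],
    "country": ["US", "US", "US", "US"],  # constant column: carries no signal
})

# Drop rows with missing values.
cleaned = df.dropna()

# Remove redundant features: here, columns with a single unique value.
redundant = [c for c in cleaned.columns if cleaned[c].nunique() <= 1]
cleaned = cleaned.drop(columns=redundant)
print(cleaned.shape)  # (2, 2): two complete rows remain, "country" is dropped
```

In a real pipeline you would likely impute rather than drop, and encode such steps as reusable, automated transformations rather than one-off scripts.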
The data validation process checks the quality, integrity, and accuracy of the source data before it is sent to a database. This step is very important for businesses: it reduces the likelihood that the ingested data will cause complications later in model development or degrade the model through data drift.
Teams can use libraries like Great Expectations to set expectations for their data or use tools like deepchecks to check the labels, dimensions, or data distributions to ensure that the data being used is adequate for the task.
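The specifics of each library's API vary, but the idea is to codify checks like the following plain-pandas sketch (the column names and allowed label set are hypothetical):

```python
import pandas as pd

def validate(df):
    """Return a list of failed checks; an empty list means the batch passes."""
    failures = []
    if df["age"].isna().any():
        failures.append("age has missing values")
    if not df["age"].between(0, 120).all():
        failures.append("age out of expected range")
    if not set(df["label"]).issubset({"churned", "retained"}):
        failures.append("unexpected label values")
    return failures

batch = pd.DataFrame({"age": [34, 29, 51],
                      "label": ["churned", "retained", "retained"]})
print(validate(batch))  # []: the batch passes all checks
```

Running such checks at ingestion time turns silent data problems into loud, actionable failures.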
Data Management
Databases undergo changes and revisions. When a new data source becomes available, it might make sense to add additional columns or tables, and ETL processes are often used to bring the data to its final format. Proper management and maintenance of the data is a must when it comes to building quality models. Data versioning can be used here to keep records of historical data, new sources, and any changes made to the database. Just like code versioning, this can save the company when mistakes are made, by preserving a more stable version of the data to fall back on.
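One lightweight way to capture the idea, sketched below under the assumption that a dataset snapshot can be serialized to JSON, is to derive a version tag from the data's content, similar in spirit to how data-versioning tools address snapshots by hash:

```python
import hashlib
import json

def dataset_version(records):
    """Derive a short, deterministic version tag from the data's content."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

v1 = dataset_version([{"id": 1, "age": 34}])
v2 = dataset_version([{"id": 1, "age": 35}])  # a single edit yields a new version
print(v1 != v2)  # True
```

Storing the tag alongside each trained model makes it possible to say exactly which data produced which model.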
Model Development
Model development is the core of the Machine Learning model life cycle. The central roles in this stage are the data scientist and the ML engineer.
Selecting an Architecture
Begin by selecting a baseline architecture: a relatively simple model that is expected to produce solid results with minimal effort. This baseline can later be compared to the more complex models trained afterwards. When possible, you may want to start with vanilla classical ML solutions (e.g., logistic regression, XGBoost), as they require minimal training resources and experimentation.
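Even simpler than logistic regression, an instructive floor is a majority-class baseline: predict the most frequent training label for every input. A minimal sketch (the class name and toy labels are hypothetical):

```python
from collections import Counter

class MajorityBaseline:
    """Predicts the most frequent training label; any real model must beat this."""
    def fit(self, X, y):
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]

y_train = ["retained", "retained", "churned", "retained"]
model = MajorityBaseline().fit(None, y_train)
preds = model.predict([[0.1], [0.9]])
print(preds)  # ['retained', 'retained']
```

If a complex model cannot clearly outperform this, the extra training cost is not buying anything.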
Later on, you may want to experiment with more complex DL architectures, ensembles, complex feature engineering, and feature selection. These methods will require more experimentation to find what best matches the problem you attempt to solve. Training these could be quite expensive, so limiting the space to explore by starting from well-established settings is a good idea.
In this phase, data scientists experiment with different architectures along with feature engineering and feature selection. These models are then trained on the training set in the hope that they will learn the desired task and generalize to new examples as well. For large models, the training process itself can be a large engineering operation (see GPT-3, for example). The validation set is then used for hyperparameter tuning and error analysis, which can lead to changing the model architecture or introducing new features.
These experiments can involve many different models with different architectures and hyperparameters, and so it is extremely important to manage and keep track of all trained models and their performance in a way that enables easy reconstruction (check out MLflow for example).
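A toy sketch of what such tracking boils down to: recording each run's parameters and metrics so the best run can be found and reconstructed later (the run IDs, parameters, and scores here are made up; tools like MLflow automate this and add artifact storage and a UI):

```python
runs = []

def log_run(run_id, params, metrics):
    """Record one experiment so it can be compared and reconstructed later."""
    runs.append({"run_id": run_id, "params": params, "metrics": metrics})

log_run("run-001", {"model": "logreg", "C": 1.0}, {"f1": 0.81})
log_run("run-002", {"model": "xgboost", "max_depth": 6}, {"f1": 0.86})

# Pick the run with the best validation F1 score.
best = max(runs, key=lambda r: r["metrics"]["f1"])
print(best["run_id"])  # run-002
```

The essential point is that every run is queryable after the fact, rather than buried in a notebook.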
Basic evaluation looks at metrics such as accuracy, precision, or F1 score to determine which model is the best fit to solve the problem. Proper evaluation should include in-depth investigation and understanding of when the model makes mistakes and why.
Evaluation metrics should enable you to not only compare different models to each other but also determine whether you’ve found a solution that satisfies your business goals. You may find that none of the models are good enough for your use case, in which case you can choose to experiment further, improve the dataset, or continue on to another task.
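These metrics all derive from the confusion matrix; here is a small sketch of computing them from hypothetical counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=40, fp=10, fn=10, tn=40)
print(acc, prec, rec, round(f1, 3))
```

Note that on imbalanced data, accuracy alone can look healthy while precision or recall collapses, which is why in-depth error analysis matters.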
Deployment
So you’ve trained a model that satisfies your needs and you’re quite happy with it. Deploying the model to production poses a new set of challenges. Here are some of the main issues that need to be dealt with:
- What kind of resources are required to run the model in production smoothly? Do you need a load balancing mechanism? What GPU capacity is required?
- How do you ensure that the model is still operating as expected? Has there been any significant data drift or concept drift that may deem your model unfit for the task?
- How can you perform Machine Learning model monitoring that informs you regarding model performance in real-time, alignment with KPIs and regulations, and whether something is broken in the data pipeline?
- How do you develop insights from the model’s performance in production to inform and help with retraining new models when the time comes?
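As a taste of what drift monitoring checks, here is a deliberately crude sketch that flags a feature whose live mean strays too far from the training-time reference (real monitoring uses proper statistical tests such as Kolmogorov-Smirnov; the threshold and data are hypothetical):

```python
import statistics

def mean_drift(reference, live, threshold=2.0):
    """Flag drift when the live mean strays more than `threshold`
    reference standard deviations from the reference mean."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    shift = abs(statistics.mean(live) - ref_mean) / ref_std
    return shift > threshold

reference = [10, 11, 9, 10, 12, 10, 11, 9]   # feature values at training time
stable = [10, 11, 10, 9]                     # production window, no drift
shifted = [18, 19, 17, 20]                   # production window, clear drift
print(mean_drift(reference, stable), mean_drift(reference, shifted))
```

Hooking such checks to alerts is what turns "the model quietly got worse" into an incident you can act on.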
Proper deployment and maintenance of ML models in production requires collaboration between data scientists, who generally know a lot about Machine Learning models but not much about production code and systems, and DevOps engineers, who may not understand much about the inner workings of ML models.
The Machine Learning life cycle is a guide for ML projects, and each of its stages requires tools to achieve the goals you set. This has been a high-level picture of each stage of the Machine Learning development process; with this simplified overview, you should know which steps to take when working on an ML project.
Proper business planning followed by a concerted effort by teams in each stage of the cycle can radically improve the quality and efficiency of ML solutions. This leads to better outcomes for both the business and the user in terms of what they perceive as value.