The field of machine learning involves many different roles, ranging from business stakeholders to data scientists and DevOps engineers. A thorough understanding of the ML model development life cycle will help you allocate resources correctly and see more clearly where you stand in the process.
Figure: The data science development life cycle consists of three main stages: data preparation, modelling, and deployment.
The ML project life cycle can generally be divided into three main stages: data preparation, model creation, and deployment. All three of these components are essential for creating quality models that will bring added value to your business. We call this process a cycle because when properly executed, the insights gained from the existing model will direct and define the next model to be deployed.
Training machine learning models requires data. A lot of it. The data must be accurate, clean, and relevant for the desired task we hope our model will master. Preparing your data correctly will save you a lot of time debugging your model later on.
Data Collection and Labelling
Generally speaking, the more complex the task, the more data you will need. When possible, it is therefore worth looking for existing datasets that match your needs, since generating a new dataset can be quite costly. When existing data doesn’t match the target task exactly, a transfer learning approach may still reduce the amount of task-specific data required (this approach is very popular in NLP).
When you do need to create your own dataset, here are some points to consider:
- Can you use “natural” data and annotate it (less expensive but less task oriented), or must you create a synthetic dataset?
- Does data labelling require domain expertise (e.g. a medical professional), or can it be outsourced to non-expert annotators? (Amazon Mechanical Turk is a popular choice when it can.)
When data is scarce, data augmentation can help you enlarge your dataset by applying automatic alterations to the existing examples. For example, if an image of a cat is rotated, it is still an image of a cat. More complex augmentation can change the label as well. For example, in a sentiment analysis task, if we take a movie review labeled “positive” and add a negation to the review text, we can expect the new label to be “negative”.
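The label-preserving case can be sketched in a few lines. This is only an illustration: it assumes a grayscale image is stored as a list of pixel rows, and the helper names `rotate_90`, `flip_horizontal` and `augment` are hypothetical, not taken from any library.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumption: grayscale image as rows of pixel values


def rotate_90(image: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def augment(dataset: List[Tuple[Image, str]]) -> List[Tuple[Image, str]]:
    """Return the original examples plus two altered copies of each.

    Both alterations keep the label unchanged: a rotated cat is still a cat.
    """
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((rotate_90(image), label))
        augmented.append((flip_horizontal(image), label))
    return augmented
```

Each original example yields three training examples here. Whether a given alteration really preserves the label depends on the task (a rotated digit "6" may read as a "9"), so the set of transformations has to be chosen per dataset.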
Datasets often have missing values, or they may contain values of the wrong type or range. If you’ve ever seen an entry shift to the wrong column in a spreadsheet, think about how that could corrupt your dataset. Additionally, removing redundant features can help the training process greatly. Data cleaning can be laborious, but once it is properly implemented and automated it will improve the quality of your data, and therefore your model, with little ongoing effort.
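As a rough sketch of what automated cleaning can look like, the snippet below drops rows with missing, mistyped or out-of-range values. The column names `age` and `income` and the accepted ranges are assumptions made up for this example.

```python
import math
from typing import Any, Dict, List, Optional


def clean_row(row: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a typed copy of the row, or None if it cannot be trusted."""
    try:
        age = int(row["age"])          # hypothetical column
        income = float(row["income"])  # hypothetical column
    except (KeyError, TypeError, ValueError):
        return None  # missing value or wrong type
    if math.isnan(income) or income < 0:
        return None  # value outside the plausible range
    if not 0 <= age <= 120:
        return None  # value outside the plausible range
    return {"age": age, "income": income}


def clean_dataset(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the rows that pass every check."""
    cleaned = (clean_row(row) for row in rows)
    return [row for row in cleaned if row is not None]
```

Dropping bad rows is the simplest policy. Depending on how much data you can afford to lose, imputing missing values or flagging rows for manual review may be a better fit.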
Databases undergo changes and revisions. When a new data source becomes available, it might make sense to add columns or tables. ETL (extract, transform, load) processes are often used to bring the data into its final format. Proper management and maintenance of the data is a must when it comes to building quality models.
Model development is the core of the machine learning model lifecycle. The central roles in this stage are the data scientist and the ML engineer.
Selecting an Architecture
Begin by selecting a baseline architecture. This should be a relatively simple model that is expected to give solid results with minimal effort, and it serves as the reference point against which more complex models are compared later on. You may want to start with classical ML solutions (e.g. logistic regression, XGBoost) when possible, as they require minimal training resources and experimentation.
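To make the idea of a cheap baseline concrete, here is a minimal sketch of logistic regression trained with batch gradient descent, written with the standard library only. The learning rate and epoch count are arbitrary assumptions; in practice you would more likely reach for an existing library implementation.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.1,    # assumed learning rate
    epochs: int = 200,  # assumed number of passes over the data
) -> Tuple[List[float], float]:
    """Fit weights and bias by minimising the average log loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in zip(X, y):
            score = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(score) - label  # gradient of log loss w.r.t. score
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights: Sequence[float], bias: float, features: Sequence[float]) -> int:
    """Predict class 1 when the estimated probability is at least 0.5."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return int(sigmoid(score) >= 0.5)
```

Whatever score this kind of model reaches on your validation data becomes the number that any more expensive architecture has to beat.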
Later on, you may want to experiment with more complex DL architectures, ensembles, complex feature engineering and feature selection. These methods require more experimentation to find what best matches the problem you are trying to solve. Training them can be quite expensive, so it is a good idea to limit the search space by starting from well-established settings.
In this phase, data scientists experiment with different architectures along with feature engineering and feature selection. The models are trained on the training set in the hope that they will learn the desired task and generalize to new examples. For large models, the training process itself can be a large engineering operation in its own right (see GPT-3, for example). The validation set is then used for hyperparameter tuning and error analysis, which can lead to changing the model architecture or introducing new features.
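The division of labour between the training set and the validation set can be sketched as follows. The example uses a tiny 1-D k-nearest-neighbours classifier purely as a stand-in model, with `k` as the hyperparameter being tuned; the split fraction and seed are assumptions.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Example = Tuple[float, int]  # (feature value, class label)


def train_val_split(
    examples: Sequence[Example], val_fraction: float = 0.25, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Shuffle reproducibly and hold out a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def knn_predict(train: Sequence[Example], x: float, k: int) -> int:
    """Majority vote among the k training points closest to x."""
    neighbours = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(train: Sequence[Example], eval_set: Sequence[Example], k: int) -> float:
    """Share of eval_set examples predicted correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in eval_set)
    return correct / len(eval_set)


def tune_k(
    train: Sequence[Example], val: Sequence[Example], candidates: Sequence[int]
) -> Tuple[int, Dict[int, float]]:
    """Fit on train only, score each candidate on val, keep the best."""
    scores = {k: accuracy(train, val, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

The important property is that the validation examples never influence the fitted model, only the choice between candidates. A separate test set, untouched during tuning, is still needed for a final unbiased estimate.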
These experiments can involve many different models with different architectures and hyperparameters, and so it is extremely important to manage and keep track of all trained models and their performance in a way that enables easy reconstruction (check out MLflow for example).
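To show the kind of bookkeeping this implies, here is a deliberately small stand-in that appends each run to a JSON Lines file. It is not MLflow's API; the function names and the record layout are assumptions for illustration only.

```python
import json
from typing import Any, Dict, List


def log_run(log_path: str, params: Dict[str, Any], metrics: Dict[str, float]) -> None:
    """Append one experiment record (hyperparameters plus results) to the log."""
    record = {"params": params, "metrics": metrics}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def load_runs(log_path: str) -> List[Dict[str, Any]]:
    """Read back every logged run, one JSON object per line."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def best_run(runs: List[Dict[str, Any]], metric: str) -> Dict[str, Any]:
    """Return the run with the highest value for the given metric."""
    return max(runs, key=lambda run: run["metrics"][metric])
```

For a run to be reconstructable, `params` should capture everything needed to repeat it, such as the random seed, the dataset version and the code revision. A dedicated tracking tool adds artifact storage and a UI on top of the same basic idea.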
Basic evaluation looks at metrics such as accuracy, precision or F1 score to determine which model is best suited to the problem. Proper evaluation should also include an in-depth investigation of when the model makes mistakes and why.
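For a binary classifier, these metrics all derive from the same four counts. The sketch below computes them and also returns the indices of the misclassified examples as a starting point for error analysis. It assumes labels are encoded as 0 and 1, with 1 as the positive class.

```python
from typing import Dict, List, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def misclassified_indices(y_true: Sequence[int], y_pred: Sequence[int]) -> List[int]:
    """Positions where the prediction disagrees with the true label."""
    return [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]
```

Accuracy alone can be misleading on imbalanced data, which is why precision, recall and F1 are usually reported alongside it. Reading through the examples behind `misclassified_indices` is often where the most useful ideas for new features come from.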
Evaluation metrics should enable you to not only compare different models to each other, but determine whether you’ve found a solution that satisfies your business goals. You may find that none of the models are good enough for your use-case, in which case you can choose to experiment further, improve the dataset or continue on to another task.
So you’ve trained a model that satisfies your needs and you’re quite happy with it. Deploying the model to production poses a new set of challenges. Here are some of the main issues that need to be dealt with:
- What kind of resources are required to run the model in production smoothly? Do you need a load balancing mechanism? What GPU capacity is required?
- How do you ensure that the model is still operating as expected? Has there been any significant data drift or concept drift that may render your model unfit for the task?
- How can you perform machine learning model monitoring that informs you regarding model performance in real time, alignment with KPIs and regulations, and whether something is broken in the data pipeline?
- How do you develop insights from the model’s performance in production to inform and help with retraining new models when the time comes?
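The data drift question in the list above can be approached, in its simplest form, by comparing a feature's values in production against the values seen during training. The sketch below measures how far the production mean has moved, in units of the training standard deviation. The threshold of 0.5 is an arbitrary assumption, and real monitoring setups typically compare whole distributions per feature rather than a single mean.

```python
import statistics
from typing import Sequence


def mean_shift_score(reference: Sequence[float], production: Sequence[float]) -> float:
    """Shift of the production mean, in reference standard deviations."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    if ref_std == 0:
        raise ValueError("reference values are constant; the score is undefined")
    return abs(statistics.fmean(production) - ref_mean) / ref_std


def has_drifted(
    reference: Sequence[float],
    production: Sequence[float],
    threshold: float = 0.5,  # assumed alerting threshold
) -> bool:
    """Flag the feature when its mean has moved further than the threshold."""
    return mean_shift_score(reference, production) > threshold
```

A check like this only covers data drift, meaning a change in the inputs. Detecting concept drift, where the relationship between inputs and labels changes, generally requires fresh labeled data from production.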
Proper deployment and maintenance of ML models in production requires collaboration between two groups. Data scientists generally know a lot about machine learning models but not much about production code and systems, while DevOps engineers might not understand much about the inner workings of ML models.
To conclude, we have discussed the main stages of machine learning projects and tried to paint a high-level picture of each. We strongly believe that a proper understanding of these stages, along with the parties involved in each of them, leads to a healthier and more successful process of creating ML solutions.