The importance of data in machine learning models: Is there a magic number that determines the amount of training data required?

This is perhaps one of the hardest questions to answer once a machine learning project is underway: how many examples does a training dataset need for the model to perform well?

Let’s take deep learning, one of the main approaches behind the boom of recent years, as a case study for this question, focusing on image classification models.

Two main reasons explain the resurgence of this area in recent years. On the one hand, data availability has increased, largely due to the emergence of the Internet, which has made it easier to collect and distribute large data sets, and to people’s interaction with new digital devices (laptops, mobile devices). On the other hand, the available computational resources have evolved, making it possible to run increasingly complex models and to develop new techniques for training deep networks [1, 2].

We speak of a resurgence because commercial applications have been using deep learning since the 1990s, when it was considered more of an art than a technology and only an expert with a specific set of skills could obtain good performance from the algorithm. That specific knowledge is still required today, but the amount of skill needed decreases as the amount of training data increases [2].

Figure 1 (taken from [3]) shows an example of how the performance of different types of algorithms evolves as the amount of training data increases.
With older, more traditional learning algorithms (such as linear or logistic regression), performance usually plateaus: the learning curve flattens out and the algorithm stops improving even when it is given more data [3].
On the other hand, if a small neural network (with few hidden layers and units) is trained on the same supervised learning task, it is likely to perform slightly better, and the gain can be larger still if a bigger (deeper) neural network is trained, which increases the complexity of the model [3].
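
The plateau described above can be observed by training the same model on growing subsets of the data and measuring accuracy on a fixed validation set. The following is a minimal sketch under stated assumptions, not the procedure used in [3]: the data are synthetic, and a nearest-centroid classifier stands in for a "traditional" algorithm. The function names are hypothetical.

```python
import random

def make_data(n, seed):
    """Synthetic 2-class data: two overlapping 2D Gaussian blobs (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 0.0 if label == 0 else 1.5
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    return data

def fit_centroids(train):
    """Nearest-centroid 'model': the mean point of each class seen in training."""
    sums = {}
    for (x, y), label in train:
        sx, sy, count = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, count + 1)
    return {label: (sx / c, sy / c) for label, (sx, sy, c) in sums.items()}

def accuracy(centroids, data):
    """Fraction of points whose closest centroid carries the correct label."""
    correct = 0
    for (x, y), label in data:
        predicted = min(
            centroids,
            key=lambda k: (x - centroids[k][0]) ** 2 + (y - centroids[k][1]) ** 2,
        )
        correct += predicted == label
    return correct / len(data)

def learning_curve(sizes, seed=0):
    """Validation accuracy of the model trained on the first n examples, per n."""
    train_pool = make_data(max(sizes), seed)
    validation = make_data(2000, seed + 1)  # fixed held-out set
    return [(n, accuracy(fit_centroids(train_pool[:n]), validation)) for n in sizes]

for n, acc in learning_curve([10, 100, 1000, 10000]):
    print(f"{n:>6} training examples -> validation accuracy {acc:.3f}")
```

With a simple model like this one, accuracy typically stops improving after a modest number of examples, because the overlap between the classes limits what the model can achieve regardless of how much data it sees.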

Some rules of thumb attempt to set objective values. For example, the authors of [2] state that, as of 2016, a supervised deep learning algorithm generally achieves acceptable performance with around 5,000 examples per category.
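
As a rough starting point, a labeled data set can be checked against this rule of thumb by counting examples per category. This is only a sketch: the function name, the example labels and the class sizes are hypothetical, and the threshold is the guideline quoted from [2], not a guarantee of performance.

```python
from collections import Counter

RULE_OF_THUMB_PER_CLASS = 5000  # the 2016 guideline quoted from [2]

def classes_below_rule(labels, minimum=RULE_OF_THUMB_PER_CLASS):
    """Return {category: number of missing examples} for under-represented categories."""
    counts = Counter(labels)
    return {label: minimum - count for label, count in counts.items() if count < minimum}

# Hypothetical data set with three categories of different sizes.
labels = ["cat"] * 6200 + ["dog"] * 4100 + ["bird"] * 900
print(classes_below_rule(labels))  # {'dog': 900, 'bird': 4100}
```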

However, the reality is that it is very difficult or virtually impossible to determine in advance and with complete certainty the ideal size of a data set.
Some of the factors that influence the amount of data required to train a model and achieve good performance are:

● The complexity of the learning task. In image classification, for example, this depends on how different the classes or categories are from each other and on the conditions under which the images are captured (the amount of noise in the image).
● Which data augmentation techniques can be applied to the data.
● Whether pre-trained models exist and can be used (transfer learning), that is, reusing part of the weights of a model trained on a similar task.
● The type of input data and its dimension or size.
● The preprocessing applied to the data (e.g., dimensionality reduction).
● The complexity of the model used, determined by its architecture.
● The quality of the data, which may hurt performance if the data are excessively noisy or lack the information needed to predict the desired outcome. It is also important that the training environment mirrors the deployment environment, meaning that the training data are similar in context to the inputs the model will receive once deployed.
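
To illustrate the data augmentation factor from the list above, the sketch below derives several training variants from a single image. It assumes an image is a plain list of pixel rows and uses only flips and a rotation. Real projects usually rely on an image library, and the helper names here are hypothetical.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]

def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]

def augment(image):
    """Return the original image plus three label-preserving variants."""
    return [
        image,
        flip_horizontal(image),
        flip_vertical(image),
        rotate_90_clockwise(image),
    ]

tiny_image = [[1, 2],
              [3, 4]]
for variant in augment(tiny_image):
    print(variant)
```

Whether a transformation preserves the label depends on the task: a flipped photo of a cat is still a cat, but a flipped or rotated handwritten digit may no longer be the same digit.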

Overfitting is an indicator that more data may need to be collected. However, especially when enlarging the data set is difficult due to the nature of the problem, some strategies can be tried first to improve the model’s generalization, such as using pre-trained models (transfer learning), reducing the complexity of the model, or adding regularization.
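
One common way to spot overfitting is to compare accuracy on the training set with accuracy on a held-out validation set. The helper below is a minimal sketch of that check. The 0.10 threshold is an arbitrary assumption chosen for illustration and not a value taken from the references.

```python
def generalization_gap(train_accuracy, validation_accuracy):
    """Difference between training and validation accuracy."""
    return train_accuracy - validation_accuracy

def looks_overfit(train_accuracy, validation_accuracy, max_gap=0.10):
    """Flag a model whose training accuracy exceeds validation accuracy by more than max_gap."""
    return generalization_gap(train_accuracy, validation_accuracy) > max_gap

print(looks_overfit(0.99, 0.78))  # True: the gap is about 0.21
print(looks_overfit(0.91, 0.89))  # False: the gap is about 0.02
```

A large gap suggests trying the strategies mentioned above, or collecting more data. A small gap with low accuracy on both sets points instead to a model that is too simple or to data that lack the necessary information.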

In conclusion, although one of the most reliable ways to improve an algorithm’s performance is probably to train a large model (a deep network) on a large amount of data, it is important to analyze each case individually and to start from a reasonably sized training set (using general rules of thumb as a starting point) that is of good quality and consistent with the target task, knowing that the optimal amount of data for the expected performance will ultimately depend on several factors.

References
[1] Chollet, François. Deep Learning with Python. Manning, 2017.
[2] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[3] Ng, Andrew. Machine Learning Yearning: Technical Strategy for AI Engineers in the Era of Deep Learning. 2019.