Are you working on a project where no data is available to you? Maybe you are expected to work with a small dataset or a dataset that has some missing values. Here are some ideas that might help you in the process.
Why is missing data a problem?
First things first, let’s understand the problem. Why is missing data an issue?
Let’s tackle the “No data” scenario first. Even though the problem of having no data might be obvious to data scientists, many people are not aware of the issue. Here is a high-level explanation of the problem. Imagine yourself sitting in an empty room. Someone tells you that in 10 minutes you will be asked to recognize Chinese characters. Let’s say that you have no previous knowledge of Chinese, nor is there a way you can learn anything. In 10 minutes, an interviewer comes and starts showing you some characters. You have two options - say nothing or guess the meaning. How well did you perform? Think of machine learning models as small, very simplified artificial brains. Whether we are talking about labeled or unlabeled data, training machine learning models requires data. Without it, the best models can do is guess.
Similar issues, most notably biased estimates, arise when working with small datasets or datasets with missing entries. Imagine yourself in the same room, but this time there is a Chinese visual dictionary with 50 characters in it. You read it, memorize some characters, and possibly detect some patterns that might help you recognize even more characters in the future. The interviewer comes and asks you to recognize 10 characters. Luckily, all 10 characters were in the book and your answers are 100% accurate. Now we can deduce that you have an excellent knowledge of Chinese! No? Why? What happens if you get another 10 characters you haven’t seen before? Based on just 10 characters, the interviewer can reach almost any conclusion, from “highly proficient in written Chinese” to “no knowledge of Chinese whatsoever”. You can imagine how much damage a poorly estimated machine learning model could do in production.
In conclusion, working with a limited amount of data will very likely result in poor models, biased estimates, wrong assumptions and incorrect error estimation. To solve these issues, we must minimise the amount of missing data, make the right assumptions when working with small datasets, and choose the right algorithm and analytics approach.
How much data do I need?
The more the merrier! But how much is enough? The amount of data required for a machine learning model to work depends mostly on the problem and the algorithm that is going to be used.
Here are some ideas on how you can decide how much data you need. Keep in mind that your model’s job is to capture correlations between input features and/or between input and output features. You should provide enough data to cover at least the most representative scenarios. The more complex the correlations are, the more data you need. Here’s an example. Imagine you have to create a connect-the-dots game where the result should be a sine wave. How many dots would you put? Could you describe it with one dot? Two? Three? Five? One hundred? When should you stop? Similarly, any model’s performance depends on how many good data samples you provide. Hopefully, it makes sense now that nonlinear algorithms will often require more data than linear ones.
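The connect-the-dots idea can be tried out directly. The following is only a rough sketch, assuming evenly spaced dots on one period of the sine wave joined by straight lines; the function name max_interpolation_error and the chosen dot counts are made up for this illustration.

```python
import math

def max_interpolation_error(n_dots, n_checks=1000):
    """Largest gap between sin(x) and the straight lines drawn through
    n_dots evenly spaced dots on the interval [0, 2*pi]."""
    period = 2 * math.pi
    xs = [period * i / (n_dots - 1) for i in range(n_dots)]
    ys = [math.sin(x) for x in xs]
    worst = 0.0
    for k in range(n_checks + 1):
        x = period * k / n_checks
        # Index of the line segment that contains x (clamped at the right edge).
        i = min(int(x / period * (n_dots - 1)), n_dots - 2)
        t = (x - xs[i]) / (xs[i + 1] - xs[i])
        y_line = ys[i] + t * (ys[i + 1] - ys[i])
        worst = max(worst, abs(math.sin(x) - y_line))
    return worst

# Hypothetical dot counts, chosen only for illustration.
errors = {n: max_interpolation_error(n) for n in (3, 5, 10, 100)}
for n, err in errors.items():
    print(f"{n:>3} dots -> max error {err:.4f}")
```

With three dots the drawing is close to a flat line, and the error shrinks as dots are added. At some point extra dots stop changing the picture in any way that matters, which is roughly the question to ask about your own dataset.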
Another approach you should consider is analysing your model’s performance when trained with different amounts of data. Maybe you will realise that you are providing more data than is needed, or maybe you will realise that the model performs better every time you add more data, in which case you should try to collect more. If neither of these works for you, try looking for similar problems that have already been solved. Papers more often than not report the size of the dataset used to solve a problem. Take it as a guideline.
If none of these apply to you, start somewhere and take an educated guess! How big is your problem? For complex deep learning problems such as image recognition, you will probably need hundreds of thousands to millions of data samples (images). For simpler problems, try a few hundred, a few thousand or tens of thousands of samples and see how your model performs. Try simpler models. Trial and error is your best friend.
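Training on different amounts of data and comparing the results is often called a learning curve. Here is a minimal sketch using synthetic data and a one-feature least-squares line; the data generator, the helper names and the training sizes are assumptions made for the example, not a recommendation for any particular problem.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mse(a, b, xs, ys):
    """Mean squared error of the line a * x + b on the given points."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)

def sample(n):
    """Synthetic data (an assumption): y = 2x + 1 plus Gaussian noise with sd 1."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys

train_x, train_y = sample(1000)
test_x, test_y = sample(500)

# Train on growing subsets and evaluate on the same held-out test set.
curve = {}
for n in (5, 20, 100, 1000):
    a, b = fit_line(train_x[:n], train_y[:n])
    curve[n] = mse(a, b, test_x, test_y)
    print(f"train size {n:>4}: test MSE {curve[n]:.3f}")
```

If the test error is still dropping at the largest size, more data will probably help. If it has flattened out near the noise level, collecting more is unlikely to pay off for this model.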
Handling missing data
Missing data is a common issue that occurs in almost every research project. Because it can lead to anything from biased estimates to invalid conclusions, missing data must be identified, understood and resolved. Here are some ideas on how you can handle your missing data.
Collecting more data
The rather obvious choice is collecting more data. But how do you do it? If your problem is domain specific and you have some unlabeled data, consider hiring a person who will label your dataset. It will save you some time. Depending on the problem you are trying to solve, consider conducting a survey, creating web crawlers (do check beforehand whether crawling a given website is legal), working with multiple datasets that have similar features, or trying some of the augmentation methods listed below. You could also try collecting more data in a similar domain. For example, if you are predicting weather for a certain country, include information from other countries as well; if you are working on sentiment analysis of comments on a certain website, collect comments or text from other websites as well. Another choice would be fine-tuning existing models using your dataset, if this is applicable to your problem.
On the other hand, if you are experiencing issues with missing values in your dataset, instead of removing tuples or dealing with errors and bad performance of your algorithm, consider imputation methods. Some popular methods include mean, regression, stochastic and multiple imputation. Give them a try.
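To make two of these methods concrete, here is a small sketch of mean imputation and regression imputation on a toy table. The column names and values are invented for the example, and the regression version assumes a single, fully observed predictor column.

```python
from statistics import mean

# Toy dataset (invented values); None marks a missing entry.
rows = [
    {"height_cm": 160.0, "weight_kg": 55.0},
    {"height_cm": 170.0, "weight_kg": 65.0},
    {"height_cm": 180.0, "weight_kg": 75.0},
    {"height_cm": 175.0, "weight_kg": None},
]

def mean_impute(rows, column):
    """Replace missing values in `column` with the mean of the observed ones."""
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [dict(r, **{column: fill if r[column] is None else r[column]})
            for r in rows]

def regression_impute(rows, target, predictor):
    """Replace missing `target` values with predictions from a least-squares
    line fitted on the complete rows (target as a function of predictor)."""
    complete = [r for r in rows if r[target] is not None]
    xs = [r[predictor] for r in complete]
    ys = [r[target] for r in complete]
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return [dict(r, **{target: slope * r[predictor] + intercept
                       if r[target] is None else r[target]})
            for r in rows]

mean_filled = mean_impute(rows, "weight_kg")
reg_filled = regression_impute(rows, "weight_kg", "height_cm")
print(mean_filled[3]["weight_kg"])  # mean of the observed weights
print(reg_filled[3]["weight_kg"])   # prediction based on height
```

Mean imputation ignores everything else known about the row, while regression imputation uses the relationship with another feature. Both return new rows and leave the original data untouched, so the two results can be compared side by side.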
Data augmentation methods
Why not make the best of what you have by using data augmentation methods? The task of these methods is to increase the amount of data that is already available by, let’s say, tinkering with some parameters (in a meaningful manner). Why should you consider this option? It is usually the cheapest one in terms of human effort, computational resources and time.
There are many ways to augment data or artificially generate more data. For example, if you are working with images, consider rotating, flipping or cropping them. This way, one image can result in multiple ones, already labeled if you are working with a labeled dataset. If you are working with tuples that contain features, think about which parameters can be modified or artificially created. Maybe averaging or mixing some features over similar tuples could result in new ones. Also, if you are working with multiclass classification, give the “One vs. Rest” strategy a try. All data points that do not belong to the observed class can serve as negative samples for your binary classifier.
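As an illustration of the image case, here is a minimal sketch that treats an image as a nested list of pixel values and produces flipped and rotated copies that keep the original label. The function names and the tiny 2x3 "image" are made up for the example. It also assumes the transformations do not change what the label means, which is not true for every task (a flipped digit or letter may become a different character).

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]

def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]

def augment(image, label):
    """Return the original image plus three transformed copies,
    each paired with the unchanged label."""
    variants = [
        image,
        flip_horizontal(image),
        flip_vertical(image),
        rotate_90_clockwise(image),
    ]
    return [(variant, label) for variant in variants]

# A tiny 2x3 "image" with a hypothetical label.
image = [[1, 2, 3],
         [4, 5, 6]]
samples = augment(image, "cat")
for variant, label in samples:
    print(label, variant)
```

One labeled image becomes four labeled samples without any extra labeling work.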
When it comes to artificially generating more data, it might be cheaper to first create a model that learns from the existing samples and then generates more data for you, for example a generative adversarial network (GAN).
Working with a small amount of data
If none of these work for you, all that is left is using algorithms that give decent performance even with a small amount of data. If you haven’t done it before, analyse your data! See if all features are really necessary and/or consider regularization and model averaging. Pay special attention to noise and outliers. They can have a much greater negative impact on your results when your dataset is small. Work with simpler models and rule out complex algorithms that involve non-linearity or feature interactions. Lastly, introduce confidence intervals. For example, when classifying your data, consider probabilistic classification. It can be quite helpful when analysing your model’s performance.
Here we presented our opinion on the importance of having enough data and some approaches you can try to make the best of what you already have. Even though there are many approaches you can try, most of them focus on using simpler, preferably linear models, uncertainty quantification and regularization. We hope you got an idea of how to handle and expand your dataset.
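One simple way to attach a confidence interval to a result from a small test set is the percentile bootstrap. The sketch below is written under stated assumptions: the labels and predictions are invented, the function name bootstrap_accuracy_ci is hypothetical, and the percentile method is only one of several ways to build such an interval.

```python
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000,
                          confidence=0.95, seed=0):
    """Accuracy with a percentile bootstrap confidence interval."""
    rng = random.Random(seed)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        # Resample the per-example results with replacement.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    alpha = (1 - confidence) / 2
    lower = stats[int(alpha * n_resamples)]
    upper = stats[int((1 - alpha) * n_resamples) - 1]
    return sum(correct) / n, lower, upper

# Invented example: 20 test samples, 16 of them classified correctly.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 8 + [0] * 2 + [0] * 8 + [1] * 2

accuracy, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
print(f"accuracy {accuracy:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

With only 20 test samples the interval is wide, which is exactly the point: a single accuracy number measured on a small dataset says less than it appears to.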