Machine Learning - Error Analysis

In machine learning, we usually follow the steps below to train an initial model and then gradually improve it.

Different learning algorithms often struggle with similar categories of examples. Building a quick-and-dirty implementation first is an effective way to identify errors and hard examples early, so that you can focus your effort on them.

Plot the learning curves of the training and test errors to figure out whether your algorithm may be suffering from high bias, high variance, or something else, and use that to decide whether more data, more features, and so on are likely to help. It is hard to tell in advance (i.e. before seeing the learning curves) where we should spend our time in the machine learning process, so we should let evidence guide that decision rather than rely on "gut feeling", which is often wrong.

The learning curves for high bias and high variance are as follows:
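The bias/variance diagnosis above can be sketched numerically. The following is a minimal example (toy linear data, all values hypothetical) that computes training and validation error as the training set grows, which is exactly what a learning curve plots:

```python
import numpy as np

# Toy dataset: y is linear in x plus Gaussian noise (hypothetical data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 200)
y = 2 * X + 1 + rng.normal(0, 1, 200)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

def mse(pred, target):
    """Mean squared error."""
    return float(np.mean((pred - target) ** 2))

train_errors, val_errors = [], []
sizes = range(10, 151, 10)
for m in sizes:
    # Fit a degree-1 polynomial (least squares) on the first m training examples.
    coeffs = np.polyfit(X_train[:m], y_train[:m], deg=1)
    train_errors.append(mse(np.polyval(coeffs, X_train[:m]), y_train[:m]))
    val_errors.append(mse(np.polyval(coeffs, X_val), y_val))

# Plotting train_errors and val_errors against `sizes` gives the learning curve:
# a large persistent gap suggests high variance; both errors high and close
# together suggest high bias.
```

With a well-specified model like this one, both curves converge toward the noise level as the training set grows; with an overly simple model they would instead plateau at a high error (high bias).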

Besides using the error value to evaluate our model, we can also use accuracy, precision, recall, and the F1 score.

Try a range of threshold values, evaluate each of them on the cross-validation set, and pick the threshold that gives the highest F1 score.
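The threshold sweep can be written out directly. Below is a small sketch (the predicted probabilities and labels are hypothetical stand-ins for a real cross-validation set):

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(probs, y_true, thresholds):
    """Evaluate every threshold on the CV set; return (best F1, threshold)."""
    scores = [(f1_score(y_true, (probs >= t).astype(int)), t) for t in thresholds]
    return max(scores)

# Hypothetical cross-validation predictions and true labels.
probs = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55])
y_cv  = np.array([0,   0,   1,    1,   1,    0,   1,   1])
best_f1, best_t = best_threshold(probs, y_cv, np.arange(0.05, 1.0, 0.05))
```

The key point is that the threshold is chosen on the cross-validation set, not the test set, so the reported test error remains an honest estimate.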

After examining the learning curves and making the corresponding improvements, we can perform error analysis, which can inspire new features, or reveal the current shortcomings of the system so that we can come up with improvements to it.

There is an example of error analysis for spam email detection:

In the above example, we find that emails of the "steal passwords" type are frequently misclassified. We can then focus on this type of email and try to find features that help the classifier handle it.
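In practice, error analysis often amounts to manually labeling the misclassified cross-validation examples with a category and counting. A minimal sketch (the category names other than "steal passwords" and all counts are hypothetical):

```python
from collections import Counter

# Hypothetical manual labels for 100 misclassified cross-validation emails.
error_categories = (
    ["steal passwords"] * 53
    + ["pharma"] * 12
    + ["replica/fake"] * 4
    + ["other"] * 31
)

counts = Counter(error_categories)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

The largest category ("steal passwords" here) is where new features are most likely to pay off, since fixing it removes the most errors.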

The following are some aspects on which we can improve our classifier in practice.

Here is another example of debugging a learning algorithm:

One question worth asking is: "Will a larger training set always help? (i.e. will adding more training data always improve performance?)"

This depends on whether the features carry sufficient information to predict the output accurately. One way to judge whether the features are sufficient is to ask whether a human expert in the domain could use them to predict accurately. If so, the current features are likely sufficient.

If the features are sufficient, the algorithm can achieve low bias (i.e. the training error is low). Then adding more training data is likely to make the test error small as well, since a large training set also helps ensure low variance.
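The claim that a larger training set lowers variance can be checked empirically: fit the same model on many fresh samples and measure how much the fitted parameters spread. A small sketch on hypothetical linear data:

```python
import numpy as np

rng = np.random.default_rng(1)

def slope_spread(m, trials=200):
    """Fit y = ax + b on `trials` fresh samples of size m.

    Returns the standard deviation of the fitted slope across trials,
    i.e. the variance of the estimator in the bias/variance sense.
    """
    slopes = []
    for _ in range(trials):
        x = rng.uniform(0, 10, m)
        y = 2 * x + 1 + rng.normal(0, 1, m)
        a, _b = np.polyfit(x, y, 1)
        slopes.append(a)
    return float(np.std(slopes))

spread_small = slope_spread(10)    # small training sets: noisy estimates
spread_large = slope_spread(200)   # large training sets: stable estimates
```

The fitted slope fluctuates much less with 200 examples than with 10, which is the "large training set ensures low variance" effect in miniature.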

Note: This is why we use a validation (or cross-validation) set: if we develop new features by examining the test set, we may end up choosing features that work well specifically for the test set, so ${ J }_{ test }(\theta )$ is no longer a good estimate of how well we generalize to new examples.
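The standard way to keep the test set untouched is a three-way split. A minimal sketch (the 60/20/20 fractions are a common convention, not a requirement):

```python
import numpy as np

def split_data(X, y, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle and split into train / cross-validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    n_cv = int(cv_frac * len(X))
    tr, cv, te = (idx[:n_train],
                  idx[n_train:n_train + n_cv],
                  idx[n_train + n_cv:])
    return (X[tr], y[tr]), (X[cv], y[cv]), (X[te], y[te])

# Toy data just to exercise the split.
X = np.arange(100).reshape(100, 1)
y = np.arange(100)
train_set, cv_set, test_set = split_data(X, y)
```

All model and feature choices (thresholds, new features from error analysis, regularization strength) are made on `cv_set`; `test_set` is touched once, at the very end, to report generalization error.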