In his 2017 Amazon shareholder letter, Jeff Bezos wrote something interesting about Alexa, Amazon’s voice-driven intelligent assistant:
In the U.S., U.K., and Germany, we’ve improved Alexa’s spoken language understanding by more than 25% over the last 12 months through enhancements in Alexa’s machine learning components and the use of semi-supervised learning techniques. (These semi-supervised learning techniques reduced the amount of labeled data needed to achieve the same accuracy improvement by 40 times!)
Given those results, it might be interesting to try semi-supervised learning on our own classification problems. But what is semi-supervised learning? What are its advantages and disadvantages? How can we use it?
What is semi-supervised learning?
As you might expect from the name, semi-supervised learning is intermediate between supervised learning and unsupervised learning. Supervised learning starts with training data that are tagged with the correct answers (target values). After the learning process, you wind up with a model with a tuned set of weights, which can predict answers for similar data that haven’t already been tagged.
Semi-supervised learning uses both tagged and untagged data to fit a model. In some cases, such as Alexa’s, adding the untagged data actually improves the accuracy of the model. In other cases, the untagged data can make the model worse; different algorithms have vulnerabilities to different data characteristics, as I’ll discuss below.
In general, tagging data costs money and takes time. That isn’t always an issue, since some data sets already have tags. But if you have a lot of data, only some of which is tagged, then semi-supervised learning is a good technique to try.
Semi-supervised learning algorithms
Semi-supervised learning goes back at least 15 years; Jerry Zhu of the University of Wisconsin wrote a literature survey in 2005. Semi-supervised learning has had a resurgence in recent years, not only at Amazon, because it reduces the error rate on important benchmarks.
Sebastian Ruder of DeepMind wrote a blog post in April 2018 about the semi-supervised learning algorithms that create proxy labels. These include self-training, multi-view learning, and self-ensembling.
Self-training uses a model’s own predictions on unlabeled data to add to the labeled data set. You set a threshold for the confidence level of a prediction, often 0.5 or higher; predictions above that threshold are treated as correct and added to the labeled data set. You keep retraining the model until no remaining predictions clear the threshold.
This raises the question of which model to use for training. As in most machine learning, you probably want to try every reasonable candidate model in the hopes of finding one that works well.
Self-training has had mixed success. The biggest flaw is that the model is unable to correct its own mistakes: one high-confidence (but wrong) prediction on, say, an outlier, can corrupt the whole model.
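The self-training loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production recipe: the nearest-centroid "model," the softmax-of-distances confidence score, the 0.9 threshold, and the synthetic two-blob data are all assumptions chosen to keep the example short.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    # Toy stand-in model (an assumption): one mean vector per class.
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_proba(centroids, X):
    # Softmax over negative squared distances, used as a rough confidence score.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def self_train(X_lab, y_lab, X_unlab, n_classes, threshold=0.9, max_rounds=20):
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        model = fit_centroids(X_lab, y_lab, n_classes)
        proba = predict_proba(model, X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing clears the threshold, so stop
        # Move confident predictions into the labeled set as proxy labels.
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
        X_unlab = X_unlab[~confident]
    return fit_centroids(X_lab, y_lab, n_classes), y_lab

# Synthetic data: two well-separated blobs, only two labeled points per class.
rng = np.random.default_rng(0)
X_all = np.vstack([rng.normal(-3.0, 1.0, size=(100, 2)),
                   rng.normal(3.0, 1.0, size=(100, 2))])
y_true = np.array([0] * 100 + [1] * 100)
lab_idx = np.array([0, 1, 100, 101])
unlab_mask = np.ones(len(X_all), dtype=bool)
unlab_mask[lab_idx] = False

centroids, y_grown = self_train(X_all[lab_idx], y_true[lab_idx],
                                X_all[unlab_mask], n_classes=2)
accuracy = (predict_proba(centroids, X_all).argmax(axis=1) == y_true).mean()
print(len(y_grown), accuracy)
```

Note that nothing in the loop ever removes a proxy label once it has been added, which is exactly why a single confident mistake can persist through every later round.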
Multi-view training trains different models on different views of the data, which may include different feature sets, different model architectures, or different subsets of the data. There are a number of multi-view training algorithms, but one of the best known is tri-training. Essentially, you create three diverse models; every time two models agree on the label of a data point, that point and its label are added to the training set of the third model. As with self-training, you stop when no more labels are being added to any of the models’ training sets.
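A simplified version of the tri-training idea might look like the following. It is a sketch under several assumptions: diversity comes from giving each model a different pair of features (the hypothetical `views` argument), the models are toy nearest-centroid classifiers, and the error-rate checks used in the published tri-training algorithm are omitted.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    # Toy stand-in model (an assumption): one mean vector per class.
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict(centroids, X):
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def majority(p0, p1, p2):
    # p0 wins if any other voter agrees with it; otherwise p1 if p1 == p2;
    # if all three disagree, fall back to p0.
    return np.where((p0 == p1) | (p0 == p2), p0, np.where(p1 == p2, p1, p0))

def tri_train(X_lab, y_lab, X_unlab, n_classes, views, max_rounds=10):
    # Each model sees the same labeled rows but its own subset of features.
    models = [fit_centroids(X_lab[:, v], y_lab, n_classes) for v in views]
    previous = None
    for _ in range(max_rounds):
        preds = [predict(m, X_unlab[:, v]) for m, v in zip(models, views)]
        proxy = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            # Proxy labels for model i: points where the other two agree (-1 otherwise).
            proxy.append(np.where(preds[j] == preds[k], preds[j], -1))
        if previous is not None and all(
                np.array_equal(a, b) for a, b in zip(proxy, previous)):
            break  # no proxy labels changed, so stop
        previous = proxy
        for i in range(3):
            keep = proxy[i] >= 0
            X_i = np.vstack([X_lab, X_unlab[keep]])
            y_i = np.concatenate([y_lab, proxy[i][keep]])
            models[i] = fit_centroids(X_i[:, views[i]], y_i, n_classes)
    return models

# Synthetic data: two blobs in three dimensions, two labeled points per class.
rng = np.random.default_rng(0)
X_all = np.vstack([rng.normal(-3.0, 1.0, size=(100, 3)),
                   rng.normal(3.0, 1.0, size=(100, 3))])
y_true = np.array([0] * 100 + [1] * 100)
lab_idx = np.array([0, 1, 100, 101])
unlab_mask = np.ones(len(X_all), dtype=bool)
unlab_mask[lab_idx] = False

views = [[0, 1], [1, 2], [0, 2]]
models = tri_train(X_all[lab_idx], y_true[lab_idx], X_all[unlab_mask], 2, views)
votes = [predict(m, X_all[:, v]) for m, v in zip(models, views)]
accuracy = (majority(*votes) == y_true).mean()
print(accuracy)
```

The final prediction here is a majority vote of the three models, which is one common choice rather than the only one.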
Self-ensembling typically uses a single model with several different configurations. In the ladder network method, the prediction on a clean example is used as a proxy label for a randomly perturbed example, with the aim of developing features that are robust to noise.
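The core idea of using a clean prediction as the proxy label for a perturbed copy can be shown without a full ladder network. The sketch below is not a ladder network; it is a plain logistic regression with an added consistency penalty, and the noise level, penalty weight `lam`, learning rate, and synthetic data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_consistent(X_lab, y_lab, X_unlab, noise_std=0.3, lam=1.0,
                     lr=0.1, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X_lab.shape[1])
    b = 0.0
    for _ in range(steps):
        # Supervised part: gradient of mean cross-entropy on labeled data.
        p_lab = sigmoid(X_lab @ w + b)
        g_lab = (p_lab - y_lab) / len(y_lab)

        # Consistency part: the clean prediction is a fixed proxy label
        # for a randomly perturbed copy of the same unlabeled example.
        target = sigmoid(X_unlab @ w + b)
        X_noisy = X_unlab + rng.normal(0.0, noise_std, size=X_unlab.shape)
        p_noisy = sigmoid(X_noisy @ w + b)
        # Gradient of mean squared error (p_noisy - target)^2 w.r.t. the logit.
        g_noisy = lam * 2.0 * (p_noisy - target) * p_noisy * (1.0 - p_noisy)
        g_noisy = g_noisy / len(X_unlab)

        w -= lr * (X_lab.T @ g_lab + X_noisy.T @ g_noisy)
        b -= lr * (g_lab.sum() + g_noisy.sum())
    return w, b

# Synthetic data: two blobs, two labeled points per class.
rng = np.random.default_rng(0)
X_all = np.vstack([rng.normal(-3.0, 1.0, size=(100, 2)),
                   rng.normal(3.0, 1.0, size=(100, 2))])
y_true = np.array([0] * 100 + [1] * 100)
lab_idx = np.array([0, 1, 100, 101])
unlab_mask = np.ones(len(X_all), dtype=bool)
unlab_mask[lab_idx] = False

w, b = train_consistent(X_all[lab_idx], y_true[lab_idx], X_all[unlab_mask])
pred = (sigmoid(X_all @ w + b) >= 0.5).astype(int)
accuracy = (pred == y_true).mean()
print(accuracy)
```

Treating the clean prediction as a constant target (no gradient flows through it) is a design choice in this sketch; it keeps the penalty focused on making the perturbed prediction match the clean one.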
Jerry Zhu’s 2007 tutorial also considers a number of other algorithms. These include generative models (such as ones that assume a Gaussian distribution for each class), semi-supervised support vector machines, and graph-based algorithms.
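As one concrete instance of the graph-based family, here is a minimal label propagation sketch: build a similarity graph over all points, then repeatedly spread label mass along its edges while holding the known labels fixed. The RBF kernel, the `sigma` bandwidth, the iteration count, and the synthetic data are assumptions for illustration, and real implementations usually use sparse graphs rather than a dense matrix.

```python
import numpy as np

def label_propagation(X, y, n_classes, sigma=1.0, n_iter=100):
    """y holds class indices for labeled points and -1 for unlabeled ones."""
    # Dense RBF similarity graph over all points, labeled and unlabeled.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    T = W / W.sum(axis=1, keepdims=True)  # row-normalized transition matrix

    labeled = y >= 0
    clamp = np.eye(n_classes)[y[labeled]]  # one-hot rows for known labels
    F = np.full((len(X), n_classes), 1.0 / n_classes)
    F[labeled] = clamp
    for _ in range(n_iter):
        F = T @ F            # each point averages its neighbors' label scores
        F[labeled] = clamp   # known labels are reset after every step
    return F.argmax(axis=1)

# Synthetic data: two blobs, two labeled points per class, the rest marked -1.
rng = np.random.default_rng(0)
X_all = np.vstack([rng.normal(-3.0, 1.0, size=(100, 2)),
                   rng.normal(3.0, 1.0, size=(100, 2))])
y_true = np.array([0] * 100 + [1] * 100)
y_partial = np.full(len(X_all), -1)
y_partial[[0, 1, 100, 101]] = y_true[[0, 1, 100, 101]]

pred = label_propagation(X_all, y_partial, n_classes=2)
accuracy = (pred == y_true).mean()
print(accuracy)
```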
Semi-supervised learning in the cloud
Semi-supervised learning is slowly making its way into mainstream machine learning services. For example, Amazon SageMaker Ground Truth uses Amazon Mechanical Turk for manual labeling and boundary determination of part of an image set and uses neural network training to label the rest of the image set.
Similar schemes can be used for other kinds of semi-supervised learning, including natural language processing, classification, and regression, on several services. However, you’ll have to write your own glue code for the semi-supervised algorithm on most of them.