There’s no doubt that we face a data deluge while also living through an AI revolution, and it is likely that each influences the other. Addressing these challenges together may be one of the biggest problems of the 21st century. I want to talk about one small aspect of one small step toward a solution. But first, a short detour…
Group testing is a statistical sampling technique that emerged during World War II as a way to rapidly estimate the number of soldiers infected with syphilis. Doctors at the time suggested pooling blood samples into groups, each of which could be tested for the infection. The advantage of such a method was that if a pool tested negative, every soldier in it was cleared at once; if it tested positive, then at least one soldier in the pool was infected. As it turns out, when the rate of infection (the percentage of the population carrying the disease) is very low (~1–5%), probability theory affords us very accurate estimates of that rate from very few pooled tests. Since then, a large community of biostatisticians has studied the problem and applied it to a wide variety of settings, from estimating the spread of crop diseases to HIV screening. Interestingly, there has also been a lot of work relating group testing to compressive sensing.
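To make the idea concrete, here is a minimal sketch of how pooled tests yield a prevalence estimate. The key observation is that a pool of k independent samples tests negative with probability (1 − p)^k, so inverting the observed fraction of negative pools recovers p. The function name and the simulated numbers below are illustrative, not from any particular paper:

```python
import random

def estimate_prevalence(pool_results, pool_size):
    """Estimate the infection rate from pooled test outcomes.

    If the true rate is p, a pool of k independent samples tests
    negative with probability (1 - p)^k, so
    p ~= 1 - (fraction of negative pools)^(1/k).
    """
    negative_fraction = pool_results.count(False) / len(pool_results)
    return 1.0 - negative_fraction ** (1.0 / pool_size)

# Simulate 10,000 people with a 2% infection rate, pooled in groups of 20.
random.seed(0)
population = [random.random() < 0.02 for _ in range(10_000)]
pool_size = 20
pools = [any(population[i:i + pool_size])
         for i in range(0, len(population), pool_size)]
print(estimate_prevalence(pools, pool_size))  # an estimate near the true 2%
```

Note that only 500 tests are run instead of 10,000, which is exactly the appeal of the method at low prevalence.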
We recently used this technique as an easy way to obtain “group labels” within a human-in-the-loop system, in order to estimate a classifier’s performance in the absence of labeled data (paper).
Proportion-SVM was proposed a couple of years ago as a weakly supervised learning setting that lets you learn a classifier when the only supervision available is the ratio of one class to the other. It is a fascinating problem, because it reduces the effort required for learning down to estimating proportions! A paper I have been reading solves this problem by initializing all samples with random labels (in a binary classification setting) and flipping them whenever a flip reduces the cost function. The authors also give an analytic solution, which solves the problem exactly after a convex relaxation of the objective. The cost function resembles a typical SVM objective, with an extra term that adds a penalty when the class proportion implied by the current labels differs from the one provided during training.
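The flavor of the flip-based approach can be sketched as follows. This is my own simplified rendition, not the paper’s algorithm: I use a least-squares linear fit as a dependency-free stand-in for the SVM solver, and the function names, the penalty weight `lam`, and the stopping rule are all assumptions for illustration. The cost is the usual hinge loss plus a penalty on the gap between the labels’ positive fraction and the given target proportion:

```python
import numpy as np

def fit_linear(X, y):
    # Least-squares stand-in for the SVM solver (keeps the sketch self-contained).
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    return np.linalg.lstsq(Xb, y, rcond=None)[0]

def cost(w, X, y, target_pos_frac, lam=5.0):
    # Hinge loss + penalty when the labels' proportion deviates from the target.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    margins = y * (Xb @ w)
    hinge = np.maximum(0.0, 1.0 - margins).mean()
    prop_penalty = lam * abs((y == 1).mean() - target_pos_frac)
    return hinge + prop_penalty

def alter_psvm(X, target_pos_frac, n_rounds=20, seed=0):
    rng = np.random.default_rng(seed)
    # Start from random labels drawn to roughly match the target proportion.
    y = np.where(rng.random(len(X)) < target_pos_frac, 1, -1)
    for _ in range(n_rounds):
        w = fit_linear(X, y)
        best = cost(w, X, y, target_pos_frac)
        improved = False
        for i in range(len(X)):
            y[i] = -y[i]                     # tentatively flip one label
            c = cost(w, X, y, target_pos_frac)
            if c < best:
                best, improved = c, True     # keep the flip
            else:
                y[i] = -y[i]                 # revert
        if not improved:
            break
    return w, y
```

A proper implementation would retrain a real SVM between flip passes and use the convex relaxation the authors derive; this sketch only illustrates how the proportion penalty steers an otherwise unlabeled problem.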
Imbalanced learning is increasingly relevant in a world deluged with data. More often than not, we have far more of one thing than another. To be concrete, consider spam filtering. For every 100 emails you receive, perhaps 10 are spam that you want your inbox to filter out automatically. In realistic scenarios the numbers can be even more skewed (1:99), so how do we learn under this constraint? Classifiers are known to perform poorly when trained on such skewed datasets, because they optimize the cost function almost entirely for the dominant class, ignoring the minority class since it barely affects the cost. (There’s a joke about the wealthiest 1% in there, but I’ll keep this discussion purely academic.) There are many strategies to account for and correct this imbalance in the training data so as to achieve the best performance on test data. A nice survey of these approaches can be found in this paper.
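One of the simplest correction strategies is to reweight the loss so each class contributes equally regardless of its frequency. The sketch below uses the same heuristic scikit-learn applies for `class_weight='balanced'` (weights inversely proportional to class counts); the function name and the 99:1 toy split are my own for illustration:

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class inversely to its frequency, so a 1:99 split
    doesn't let the dominant class swamp the training loss."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

y = np.array([0] * 99 + [1] * 1)   # 99:1 skew, like the spam example above
print(balanced_class_weights(y))   # the minority class gets a ~99x larger weight
```

Multiplying each sample’s loss term by its class weight makes misclassifying one rare positive as costly as misclassifying ninety-nine common negatives, which counteracts the “optimize only for the dominant class” failure mode described above.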
Here’s the case I’m making:
Group testing works splendidly when the rate (or proportion) is very low, on the order of 1–5%. As it turns out, much real-world data tends to be skewed in favor of one class, and classifiers trained on it generalize poorly. Proportion-SVMs, meanwhile, don’t generalize well when the proportions are skewed.
In the field, one could estimate proportions very accurately using group testing, and then learn a classifier from the data using only that proportion information between classes. This not only reduces the labeling effort, but offers a path toward deploying intelligent systems in the wild jungle that is the internet. The hurdle to cross before we get there is to modify existing methods, or propose new ones, for learning proportion-based classifiers on extremely skewed datasets, taking cues from the theory of imbalanced learning.