This paper proposes a method to “whittle” away parts of the image search space using relative attribute feedback from the user. It targets interactive image search: the user states how the desired images should differ from the current iteration of results (e.g., “Show me images like these, but sportier”), iterating until the results closely match what the user has in mind. To support this, the authors learn a function offline for each “nameable attribute” (sportiness, furriness, etc.) and use the feedback to progressively constrain the candidate set.
Image search is a hard problem in general: there has always been a semantic gap between the high-level attributes users think in and the low-level features algorithms use. In recent years, with the rise in popularity of image descriptors and large image databases, searching with images has seen growth. The authors argue that the best way to close this gap is to put a “human in the loop” who provides high-level feedback about the images they are interested in, thereby refining the search results. Feedback has been used to refine search before, but it is either very coarse (relevant/irrelevant) or requires adjusting parameters before the algorithm reiterates. The former is more intuitive but leaves the algorithm clueless about which aspect of the image the user found (ir)relevant; the latter is hard for a user who does not understand the intricacies of the underlying algorithm. Providing high-level attribute feedback is therefore a step forward for image search. For example, when browsing images of potential dates on a dating website, a user can say: “I am interested in someone who looks like this, but with longer hair and more smiling.”
The authors first learn functions offline that estimate the attributes of a given image. For the image search stage these attribute predictors are assumed to be available (the paper also describes how they can be learned). The training labels are gathered manually through Amazon’s MTurk, where workers answer questions comparing images with respect to each attribute.
Once these labels are gathered, the authors learn ranking functions, one per attribute, such that for two image descriptors x_i and x_j, if x_i has more of the attribute than x_j, x_i ranks higher under the learned function. The objective to be optimized is similar to SVM training and, in the authors’ words, “is solvable using similar decomposition algorithms”. Each image can then be described by its attributes, along with how strongly each attribute is present. The feedback is incorporated as additional constraints: in the next iteration, candidate images must have more of, less of, or a similar amount of the chosen attribute relative to the reference images shown in the current iteration.
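The pipeline above can be sketched in a few lines. This is a toy approximation, not the paper’s implementation: the function names (`learn_ranker`, `whittle`), the synthetic data, and the simple subgradient loop are mine; the authors use a RankSVM-style solver, for which a pairwise hinge loss max(0, 1 − w·(x_i − x_j)) stands in here.

```python
import numpy as np

def learn_ranker(X, pairs, reg=0.01, lr=0.1, epochs=200):
    """Learn w so that w.x_i > w.x_j for every ordered pair (i, j),
    where (i, j) means "image i has MORE of the attribute than image j".
    Subgradient descent on the pairwise hinge loss + L2 regularization."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = reg * w
        for i, j in pairs:
            diff = X[i] - X[j]
            if 1.0 - w @ diff > 0:      # margin violated for this pair
                grad -= diff
        w -= lr * grad / len(pairs)
    return w

def whittle(scores, kept, ref, direction):
    """Keep only candidates whose attribute score is above (resp. below)
    that of the reference image -- one feedback constraint."""
    if direction == "more":
        return [k for k in kept if scores[k] > scores[ref]]
    return [k for k in kept if scores[k] < scores[ref]]

# Toy data: a hidden 1-D "sportiness" attribute embedded in 2-D features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
truth = X[:, 0]                          # hidden attribute strength
pairs = [(i, j) for i in range(20) for j in range(20)
         if truth[i] > truth[j] + 0.5]   # comparisons a worker might give

w = learn_ranker(X, pairs)
scores = X @ w                           # attribute strength per image

kept = list(range(20))
ref = kept[0]
kept = whittle(scores, kept, ref, "more")  # "sportier than this one"
```

Each round of feedback intersects one more constraint with the surviving candidate set, which is what shrinks (“whittles”) the search space toward the target.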
The paper, “WhittleSearch: Image Search with Relative Attribute Feedback” by Kovashka, Parikh, and Grauman, can be found here.