[Review] Detecting Irregularities in Images and Video (ICCV, 2005)

note: This is a slightly older paper, but I am reviewing it because it appears to be a popular paper in the area of anomaly detection, in which I have a newfound interest.

  • What?

    This paper focuses on the problem of anomaly detection in images and videos. The problem is defined as finding (detecting) parts of the signal (image or video) that are significantly (and semantically?) different from the rest of the signal, which makes it closely related to the study of saliency in images. They pose the problem in the following way: given a signal (the “query”), they try to compose it using previously learned spatio-temporal image patches. The more patches that can be employed in explaining the query, the greater the familiarity of the query. They also cast the saliency problem in this framework: the database is now the rest of the image/video itself, and if any portion of the signal cannot be “sufficiently well explained” by the rest of the signal, then it is important (an anomaly).

(Figure: an illustration of the method employed in this paper.)

  • Why?

    One of the biggest concerns in security is identifying anomalies in video. The problem is an important one, and it was particularly hard when this paper was written (prior to the “activity recognition” topic gaining immense popularity). The authors also address the problem of saliency in a new way that they claim to be more intuitive. Their formulation is general enough to address several problems in vision, such as: 1. attention in images/video, 2. recognition of suspicious actions, and 3. recognition of suspicious objects.
  • How?

    Each spatial (or spatio-temporal, in the case of video) region is broken into hundreds of smaller patches at multiple scales, forming an “ensemble of patches” around a particular point in the image or video. Each patch within the ensemble carries two kinds of information: i) a descriptor vector (a simple one is used here: gradient information at different scales, stacked into a vector and normalized to sum to 1) and ii) its location in absolute coordinates. (A side note: since they are comparing patches, it is easy to see that no two patches are ever exactly the same. Therefore, if we want to measure similarity, there must be enough non-rigidity to allow for small variations. This is why they include geometric information along with the descriptors.)

    Given an ensemble of a query signal y, we would like to compute its joint likelihood with a hidden ensemble x in the database, P(y,x) = P(y|x)P(x). P(x) is non-parametric and must be estimated from the samples directly. To keep the likelihood tractable, the authors model the similarity between corresponding observed patches of x and y as a Gaussian distribution; with a few more assumptions of this kind, they are able to factorize the likelihood function. The relationship between all the variables involved is depicted as a Bayesian network, shown below. With P(y,x) in hand, for a query ensemble y we seek the hidden ensemble x that maximizes the posterior, i.e., its MAP (maximum a posteriori) assignment. The factorized expression for P(y,x) can be “phrased as a message passing (Belief Propagation) algorithm in the graph” shown below. In practice, the authors first generate a set of candidate patches based on descriptor similarity and only then compare the possible ensemble origins c_x. With the efficient search proposed in the paper, the method is shown to scale well as the number of patches grows. (A rough code sketch of the matching idea follows the figure below.)
    (Figure: the Bayesian network relating the observed query ensemble and the hidden database ensemble.)
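    To make the scoring idea concrete, here is a minimal Python sketch. The names (patch_descriptor, ensemble_log_likelihood, is_irregular) and the Gaussian bandwidths sigma_d and sigma_l are my own illustrative choices, not from the paper, and the sketch assumes a fixed correspondence between query and database patches, whereas the paper infers correspondences with belief propagation and an efficient candidate search.

```python
import numpy as np

def patch_descriptor(patch):
    """Crude version of the paper's simple descriptor: gradient
    magnitudes at two scales, stacked and normalized to sum to 1."""
    parts = []
    for step in (1, 2):  # step-2 subsampling stands in for a coarser scale
        p = patch[::step, ::step].astype(float)
        gy, gx = np.gradient(p)
        parts.append(np.hypot(gx, gy).ravel())
    d = np.concatenate(parts)
    return d / (d.sum() + 1e-12)

def ensemble_log_likelihood(query, database_ensemble, sigma_d=0.1, sigma_l=5.0):
    """Log-likelihood that a database ensemble 'explains' the query.
    Each ensemble is a list of (descriptor, location) pairs; both the
    descriptor and the geometric agreement are modeled as Gaussians."""
    score = 0.0
    for (dq, lq), (dx, lx) in zip(query, database_ensemble):
        score -= np.sum((dq - dx) ** 2) / (2 * sigma_d ** 2)  # appearance term
        score -= np.sum((np.asarray(lq) - np.asarray(lx)) ** 2) / (2 * sigma_l ** 2)  # geometry term
    return score

def is_irregular(query, database, threshold):
    """Flag the query point as irregular if no database ensemble
    explains it sufficiently well."""
    best = max(ensemble_log_likelihood(query, x) for x in database)
    return best < threshold
```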

The paper can be found here: http://www.wisdom.weizmann.ac.il/~boiman/publications/boiman_irani_detecting_irregularities.pdf

[Review] WhittleSearch: Image Search with Relative Attribute Feedback (CVPR, 2012)

  • What?
    This paper proposes a method to “whittle” away parts of the image search space using relative attribute feedback from the user. They attack the problem of image search where the user can state how the desired image differs from the current iteration of results, iterating until the search results closely match what the user has in mind. Ex: “Show me images like these, but sportier.” The authors learn a ranking function for each “nameable attribute” (sportiness, furriness, etc.) offline, and use the feedback to constrain the search at each iteration.
  • Why?  
    Image search is a hard problem in general: there has always been a semantic gap between the high-level attributes users think in and the low-level features algorithms use. In recent years, with the rise in popularity of image descriptors and large image databases, searching with images has grown considerably. The authors believe the best way to bridge this gap is to add a “human in the loop” who provides high-level feedback about the images they are interested in, thereby refining the search results. Using feedback to refine search has been done before, but the feedback is either very coarse (relevant/irrelevant) or involves adjusting parameters of the algorithm before reiterating. While the former is more intuitive, it leaves the algorithm clueless about which aspect of the image the user found (ir)relevant, and the latter is hard for a user who does not understand the intricacies of the underlying algorithm. Providing high-level relative feedback is therefore a step forward for image search. For example, when browsing images of potential dates on a dating website, a user can say: “I am interested in someone who looks like this, but with longer hair and more smiling.”
  • How?
    The authors first learn functions offline that predict the attributes of a given image (for the search experiments it is assumed that such attribute predictors are available, and it is also mentioned that they can be learned). The training data for these attributes is gathered manually using Amazon’s Mechanical Turk, where workers are asked to compare images. A sample question shown to a worker appears below:

    (Figure: a sample attribute-comparison question on MTurk.)

    Once the attribute predictors are in place, the authors learn ranking functions, one per attribute, such that for two image descriptors x_i and x_j, if x_i exhibits more of an attribute than x_j, it will be ranked higher by that attribute’s function. The objective being optimized is similar to SVM training and “is solvable using similar decomposition algorithms”. Each image can then be described by its attributes, along with a measure of how strongly each attribute is present. User feedback is incorporated as additional constraints: in the next iteration, images must have more, less, or a similar amount of the chosen attribute relative to the reference images of the current iteration. More details are in this excerpt from the paper (a small code sketch of the whittling step follows the excerpt):

    (Excerpt from the paper: incorporating user feedback.)
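    As a rough illustration of how such constraints whittle the search set, here is a minimal Python sketch. The names (rank_score, whittle, w_sporty) and the random data are mine, not the paper’s, and the paper actually ranks images by how many feedback constraints they satisfy rather than hard-filtering, so treat this as a simplified version of the idea.

```python
import numpy as np

def rank_score(w, x):
    """Linear ranking function for one attribute: a higher score means
    the image exhibits more of that attribute. The weights w are
    learned offline with a rank-SVM-style objective."""
    return float(np.dot(w, x))

def whittle(images, w, feedback):
    """Keep only images consistent with the user's relative feedback.
    `feedback` is a list of (reference_descriptor, relation) pairs,
    where relation is 'more' or 'less' for the attribute scored by w."""
    kept = []
    for x in images:
        s = rank_score(w, x)
        if all((s > rank_score(w, ref)) if rel == "more" else (s < rank_score(w, ref))
               for ref, rel in feedback):
            kept.append(x)
    return kept

# Toy usage: "show me images like this reference, but sportier"
rng = np.random.default_rng(0)
w_sporty = rng.normal(size=8)  # stand-in for an offline-learned ranker
database = [rng.normal(size=8) for _ in range(100)]
reference = database[0]
results = whittle(database, w_sporty, [(reference, "more")])
print(len(results), "images remain after whittling")
```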

The paper can be found here.