[Review] Detecting Irregularities in Images and Video (ICCV, 2005)

note: This is a slightly older paper, but I am reviewing it because it appeared to be a popular paper in the area of anomaly detection, in which I have a newfound interest.

  • What?

    This paper focuses on the problem of anomaly detection in images and videos. The problem is defined as finding (detecting) parts of the signal (image or video) that are significantly (and perhaps semantically) different from the rest of the signal; this is closely related to studying saliency in images. They pose the problem in the following way: given a signal (the "query"), they try to see whether it can be composed from previously learned spatio-temporal image patches. The higher the number of patches employed in explaining the query, the greater the familiarity of the query. They also address the saliency problem, in which the database is now the rest of the image/video itself: if any portion of the signal cannot be "sufficiently well explained" by the rest of the signal, then it is important (an anomaly). A minimal sketch of this "compose the query from a database" idea appears after the illustration below.

An illustration of the method employed in this paper
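
The following is a minimal sketch (under my own assumptions, not the authors' implementation) of that core idea: a query region is "familiar" to the extent that its patches can be matched to patches already seen in a database. All names and thresholds below are hypothetical.

```python
import numpy as np

def familiarity_score(query_descriptors, database_descriptors, sim_threshold=0.8):
    """Fraction of query patches 'explained' by some database patch.

    query_descriptors:    (Nq, d) array, one descriptor per query patch
    database_descriptors: (Nd, d) array of previously observed patches
    """
    db_norms = np.linalg.norm(database_descriptors, axis=1)
    explained = 0
    for q in query_descriptors:
        # cosine similarity of this query patch to every database patch
        sims = database_descriptors @ q / (db_norms * np.linalg.norm(q) + 1e-12)
        if sims.max() >= sim_threshold:
            explained += 1
    return explained / len(query_descriptors)

# Regions whose patches score low under such a measure would be the
# candidates flagged as irregular (or salient, when the "database"
# is simply the rest of the same image/video).
```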

  • Why?

    One of the biggest concerns in security is identifying anomalies in video. The problem is an important one, and it was particularly hard when this paper was written (before the "activity recognition" topic gained immense popularity). The authors also address the problem of saliency in a new way that they claim is more intuitive. Their formulation is general enough to address several problems in vision, such as: 1. attention in images/video, 2. recognition of suspicious actions, and 3. recognition of suspicious objects.
  • How?

    Each spatial (or spatio-temporal, in the case of video) region is broken down into several hundred smaller patches at various scales to form an "ensemble of patches" corresponding to a particular point in the image or video. Each patch within the ensemble is associated with two kinds of information: i) a descriptor vector (a simple one is used here, with gradient information at different scales stacked into a vector and normalized to sum to 1) and ii) its location in absolute coordinates. (A side note: since they are comparing patches, it is easy to see that no two patches are exactly the same. Therefore, if we want to measure similarity, there must be enough non-rigidity to allow for small variations. This is why they include geometric information and absolute coordinates.) Given an ensemble of a query signal y, we would like to compute the joint likelihood between it and a hidden ensemble x in the database, as P(y,x) = P(y|x)P(x). P(x) is non-parametric and must be defined from the samples directly. The authors assume the similarity between the i-th and j-th observed patches in x and y follows a Gaussian distribution to make the computation of the likelihood more tractable. With a few more assumptions of this kind, the authors are able to factorize the likelihood function. The relationship between all the variables involved is depicted as a Bayesian network, shown below. Now that we have P(y,x), for a query ensemble y we seek the hidden ensemble x that maximizes the MAP (maximum a posteriori) assignment. The factorized expression for P(y,x) can be "phrased as a message passing (Belief Propagation) algorithm in the graph" shown below. The authors first produce a set of possible candidates based on their descriptor similarity and then compare the possible set of origins c_x. Their method, with an efficient search procedure proposed in the paper, is shown to scale well with an increasing number of patches. A rough sketch of the descriptor and similarity computations follows the figure below.
    Bayesian network relating the query and database patch variables
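
As a rough illustration (again a sketch under my own assumptions, not the paper's code), the two ingredients described above could look something like this: a multi-scale gradient descriptor normalized to sum to 1, and a Gaussian similarity term combining descriptor and location differences.

```python
import numpy as np
from scipy import ndimage

def patch_descriptor(patch, scales=(1.0, 2.0, 4.0)):
    """Stack gradient magnitudes at several scales; normalize to sum to 1."""
    parts = []
    for s in scales:
        smoothed = ndimage.gaussian_filter(patch.astype(float), sigma=s)
        gy, gx = np.gradient(smoothed)
        parts.append(np.hypot(gx, gy).ravel())
    d = np.concatenate(parts)
    return d / (d.sum() + 1e-12)

def gaussian_similarity(d_query, d_db, loc_query, loc_db,
                        sigma_desc=0.1, sigma_loc=5.0):
    """Gaussian similarity combining descriptor and (relative) location terms."""
    desc_term = np.exp(-np.sum((d_query - d_db) ** 2) / (2 * sigma_desc ** 2))
    loc_term = np.exp(-np.sum((np.asarray(loc_query) - np.asarray(loc_db)) ** 2)
                      / (2 * sigma_loc ** 2))
    return desc_term * loc_term
```

In the paper these terms enter the factorized likelihood P(y,x); here they are only meant to make the descriptor/similarity idea concrete.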

This paper can be found here http://www.wisdom.weizmann.ac.il/~boiman/publications/boiman_irani_detecting_irregularities.pdf

[Review] WhittleSearch: Image Search with Relative Attribute Feedback (CVPR, 2012)

  • What?
    This paper proposes a method to "whittle" away parts of the image search space using user feedback. They attack the problem of image search where the user can describe how the desired results differ from the current iteration of results, iterating until the returned results closely match what the user has in mind, e.g., "Show me images like these, but sportier." The authors make use of this information by learning a ranking function for each "nameable attribute" (sportiness, furriness, etc.) offline and incorporating the feedback as constraints that refine the results.
  • Why?  
    Image search is a hard problem in general. There has always been a semantic gap between the high-level attributes users care about and the low-level features that are used. In recent years, owing to the rise in popularity of image descriptors and large image databases, searching with images has seen growth. The authors believe that the best way to attack this problem is to add a "human in the loop" providing high-level feedback about the images they are interested in, thereby refining the search results. Using feedback to refine search has been done before; however, the feedback is either very coarse (relevant/irrelevant) or involves adjusting parameters for the algorithm to reiterate. While the former is more intuitive, it leaves the algorithm clueless about which part of the image the user found (ir)relevant, and the latter is hard for a user who does not understand the intricacies of the underlying algorithm. Providing high-level relative feedback is therefore a step forward for image search. For example, when browsing images of potential dates on a dating website, a user can say: "I am interested in someone who looks like this, but with longer hair and more smiling."
  • How?
    The authors first learn functions offline to predict the attributes of a given image. For the image search part, it is assumed that these attribute predictors are available (it is also mentioned that such attributes can be learned). The first step is to gather relative attribute annotations for the images, which is done manually using Amazon's MTurk. A sample question posed to an annotator is shown below:

    Sample question on MTurk

    Once the predictive functions are learned, the authors learn ranking functions, one per attribute, such that for two image descriptors x_i and x_j, if x_i has more of an attribute than x_j, it will rank higher according to the learned function. The objective function to be optimized is similar to SVM training and "is solvable using similar decomposition algorithms". Now each image can be described by its attributes, and one also knows how much of each attribute is present in it. The feedback is incorporated as an additional constraint so that, in the next iteration, the desired attribute is more, less, or similar relative to the reference images presented in the current iteration. More details are in this excerpt from the paper; a small sketch of the feedback step follows the excerpt:

    Incorporating user feedback
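
The following is a hedged sketch of how relative-attribute feedback can "whittle" the candidate set: given attribute scores produced by the offline-learned ranking functions, each feedback statement keeps only the images that satisfy it relative to the chosen reference image. The function names and data layout here are my own assumptions, not the paper's.

```python
import numpy as np

def apply_feedback(attribute_scores, feedback):
    """Return indices of database images satisfying all feedback constraints.

    attribute_scores: dict mapping attribute name -> (N,) array of scores
                      for N database images (from the learned rankers)
    feedback: list of (attribute, relation, reference_index) tuples,
              where relation is "more" or "less" relative to the reference image
    """
    n = len(next(iter(attribute_scores.values())))
    keep = np.ones(n, dtype=bool)
    for attr, relation, ref in feedback:
        scores = attribute_scores[attr]
        if relation == "more":
            keep &= scores > scores[ref]
        else:
            keep &= scores < scores[ref]
    return np.nonzero(keep)[0]

# Hypothetical usage: keep images sportier than reference image 3
# but less formal than reference image 7.
# matches = apply_feedback(scores, [("sporty", "more", 3), ("formal", "less", 7)])
```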

The paper can be found here

A Reading Group for Computer Vision, Machine Learning & Pattern Recognition at Arizona State University

[edit: Jan 20, 2016] This post was written a long time ago, when I was interested in figuring out who was working on what within ASU. It turned out to be a lot harder than I expected to get something like this started, and I have since graduated from ASU. However, I hope this post will serve as an (incomplete) entry point into some of the vision research being conducted at ASU. Several newer faculty with exciting research areas have also joined since!

With every passing day, I realize more how important it is for me to document my opinions, findings, and thoughts on the topics I read about. Not only does this help me learn faster and expose myself to a broader spectrum of papers and ideas, it also helps me build a portfolio of my work. I realize this is something that's important to every PhD aspirant. It is my understanding that doing a PhD is a lot like a beginner's course in entrepreneurship: you have to develop ideas, form opinions, and sell them to your community. Of course, you don't go out of business and have your start-up crashing down if your product doesn't sell, but you have different pressures, like establishing your ideas firmly.

Reading Group
With those side notes aside, I wish to begin a small reading group here at ASU. I have read about a similar group at CMU (http://www.cs.cmu.edu/~misc-read/) that is now well established. Inspired by that, I believe a similar group would do folks at ASU a lot of good.

Research Groups working on Computer Vision and/or Machine Learning at ASU:
Although ASU’s research in EE is oriented strongly towards communication, networking, etc., a few professors are changing the research landscape to add to computer vision research. The CS department, however, is known to have strong faculty working on machine learning, pattern recognition, and some computer vision. I shall try to note all the groups here in order to make a comprehensive list of people who would be the ideal audience for the aforementioned seminar sessions.

Electrical, Computer & Energy Engineering:

    • Dr. Andreas Spanias’ group (Main contributors [1], [2]) – Sparse representations, dictionary learning methods, high-dimensional geometry, etc.
    • Dr. Lina Karam’s group – Well-established group; works mainly on compression codecs, visual quality assessment, saliency in images/videos, etc.
    • Dr. Pavan Turaga’s group – Relatively new; main focus on activity analysis, compressive sensing, dictionary learning, non-linear geometries, etc.

Next is the CS group, with many more faculty working on different aspects of these areas. I am less familiar with them and hence will just list them (in no particular order) for record’s sake.

Computing, Informatics and Decision Systems Engineering:

So, assuming each group has around 5 students and that at least 2 of them are interested in this reading session, that gives us around 15 students to start this off with, which seems like a reasonable number. Hopefully, if the graduate student association at ASU (GPSA) recognizes this as a grad organization, they'll even fund some part of it.

More updates as things progress.

— Rushil