note: This is a slightly older paper, but I am reviewing it because it appeared to be a popular paper in the area of anomaly detection, in which I have a new found interest.
- What? This paper focuses on the problem of anomaly detection in images and videos. This problem is defined as finding (detecting) parts of the signal (image or video) that are significantly (and semantically??) different from the rest of the signal. This is also similar to studying saliency in images. They pose the problem in the following way: Given a signal (‘query’) they try to see if they can compose it using previously learned spatio-temporal image patches. Higher the number of patches employed in explaining the query, greater the familiarity of the query. They also explain the saliency problem wherein, their database is now the rest of the image/video it self. That is, if any portion of the signal cannot “sufficiently well explained” by the rest of the signal then it is important (an anomaly).
An illustration of the method employed in this paper
One of the biggest concerns in security is identifying anomalies in video. The problem is an important one, particularly hard when this paper was written (prior to the “activity recognition” topic gaining immense popularity). The authors also address the problem of saliency in a new way that they claim to be more intuitive. This formulation of theirs is general enough to address several problems in vision such as : 1. Attention in images/video 2. Recognition of suspicious actions and 3. Recognition of suspicious objects.
Each spatial (or spatio-temporal, in case of video) patch is broken down into several hundreds of smaller patches in various scales to form an “ensemble of patches” corresponding to a particular point in the image or video. Each patch within the ensemble is associated with two kinds of information i) A descriptor vector (a simple one is used here, with gradient information at different scales that is stacked into a vector and normalized to sum to 1) and ii) its location in absolute coordinates. (A side note : Since they are comparing patches, it is easy to see that no two patches are exactly the same. Therefore, if we want to measure similarity there must be enough non-rigidity to allow for small variations. This is why they include geometric information and absolute coordinates.) Given an ensemble of a query signal y, we would like to compute the joint likelihood between it and a hidden patch in the database x, as P(y,x) = P(y|x)P(x). P(x) is non parametric and must be defined from the samples directly. The authors assume the similarity between the i^th & j^th observed patches in x and y as a Gaussian distribution to make the computation of the likelihood more tractable. With a few more assumptions similar to this, the authors are able to factorize the likelihood function. The relationship between all the variables involved is depicted as a Bayesian network as shown below. Now that we have P(y,x), for a query ensemble y, we seek the hidden ensemble x that maximizes its MAP (maximum a-posterior probability) assignment. The factorized expression for P(y,x) can be “phrased as a message passing (Belief Propagation)algorithm in the graph” shown below. The authors first produce a set of possible candidates based on their descriptor similarity and then compare the possible set of origins c_x. Their method with an efficient search that is proposed in the paper, is shown to scale well with increase in number of patches.
This paper can be found here http://www.wisdom.weizmann.ac.il/~boiman/publications/boiman_irani_detecting_irregularities.pdf