# Sensory Substitution & Enhancement for Perceiving High Dimensional Data

As humans, we are severely limited in our capability to understand, visualize or perceive dimensions beyond the spatial 3 dimensions. We can also perceive changes in time, which gives us the ability to visualize 4-D changes and patterns. However, it is extremely unintuitive for us to understand spaces that are any higher than 3D. I remember reading about this first in Michio Kaku’s Hyperspace, where he constructs an analogy between us and fictional 2D beings, that can only perceive two dimensions. These beings would look at everything in 2D, so we (3D beings) would look as contours constantly changing shapes, since they are projected onto the 2D plane that those beings can see.

Today, since we have very different kinds of data that is often high dimensional – “HD” –  (different from Big Data!) there is an increasing appetite for methods that allow us to visualize patterns and changes in these HD spaces. Areas such as topological data analysis (TDA) attempts to solve these problems using tools from Topology, a field in mathematics that has been very popular since the 1930s. Ayasdi a data startup founded by Gunner Carlsson of Stanford, attempts to provide solutions to data problems using topological tools, it recently raised \$55M in funding, and is seeing ~400% growth!

High dimensional data visualization is important and needs more attention from mathematicians and engineers. Interestingly a recent TED talk this year spoke about Sensory Substitution, which is basically the process where you take high dimensional real world signals such as images, audio etc. and map them to a much lower dimensional space and feed the low dimensional signal to the brain via different sensory inputs such as electrodes on the tongue, or tactic feedback on the skin etc. The claim is that the brain can automatically learn to “see” or “hear” given a few weeks of training with these new “eyes” or “ears”, since all perception happens in the brain. It is very inspiring and gives a lot of hope for people who have one or more of their primary senses which do not work as they were designed.

Now imagine, instead of “visualizing” 3D data all the time what if we could map dimension #4 to an auditory signal, #5 to a tactile signal, #6 to some other type of signal and have a human “perceive” the data (here I refer to perception as the generalization of visualization). How effective are we, as humans, in obtaining all this information at once and understanding the patterns in the high dimensional space? We will encounter cognitive overload, where beyond a certain point our brain cannot make any sense of the patterns. But as long as we operate within that limit, it will be very exciting to see how we understand 6- dimensional data. An added benefit is that we can map very HD data to 6 dimensions which a whole lot better than mapping it to 3 dimensions (see curse of dimensionality).

Going even further, if we perfect the art of data perception in HD spaces, can we also perceive why some machine learning algorithms fail and some don’t? Can we perceive boundaries between classes in massive datasets? The interesting thing about these questions is that I am not sure who could answer it better – a neuroscientist or an engineer!

# Learning Under an Extremely Weak Supervision Framework

There’s no doubt that we have a data deluge problem, while at the same time also going through an AI revolution, it is possible one influences the other.  Solving these issues together will possibly be one of the biggest challenges of the 21st century. I want to talk about one small aspect of one small step towards a solution. But first, a short detour..

Group Testing was a statistical sampling technique that emerged during WW II as a method to rapidly estimate the number of soldiers who suffered from syphilis. The doctors of the time, suggested pooling blood samples into pools which could be tested for the virus. The advantage of such a method was that all the soldiers who belonged to a pool, were free to go if the test for the virus was negative for that pool. Alternatively, if it was positive, then at least one soldier was infected. As it turns out, when the rate of infection (percentage of population that has the virus) is very low (~ 1-5%), probability theory affords us very accurate estimates of the rate of infection, by observing very few samples. Since then, a large community of bio-statisticians have studied the problem and applied it to a wide variety of applications ranging from estimating the spread of a crop disease to HIV.  Interestingly, there has been a lot of work studying group testing’s relation to compressive sensing as well.

We recently used this technique as an easy way to obtain “group labels”  within a Human-in-the-Loop system to estimate a classifier’s performance in the absence of labeled data (paper) .

Proportion-SVM was proposed a couple of years ago, as very weak supervised learning setting that allows you to learn a classifier, when the only supervision available is the ratio of one class to another. It is a fascinating problem, because it reduces the effort required for learning down to estimating proportions! A paper which I have been reading, solves this problem by setting all the samples to random labels (in a binary classification setting) and flips them to observe if their cost function reduces. They also have an analytic solution, which solves it exactly after a convex relaxation of the objective function.  The cost function resembles a typical SVM one, with the addition of an extra term that adds a penalty when the current proportion of classes is different from what was provided during training.

Imbalanced Learning is increasingly becoming relevant within our world deluded by data. More often than not, we have more (much more) of one thing than another. To be more precise, let us look at the example of spam filtering. For every 100 emails you receive, probably 10 are spam which you want your inbox to filter out automatically. In a realistic scenario, the numbers can be even more skewed (1:99), so how do we learn under this constrained setting? It is known that classifiers perform poorly when they are trained on such skewed datasets, because they tend to optimize the cost function almost entirely for the dominant class, ignoring the other class because it doesn’t significantly affect the cost (There’s a joke about the wealthiest 1% in there, but I’ll keep this discussion purely academic). There are many strategies to account for, and correct this imbalance in training data to achieve the best performance on test data. A nice survey of all the approaches is found in this paper.

Here’s the case I’m making:

Group testing works splendidly when the rate (or proportions) are very low, of the order of 1-5%. As it turns out, much of the real world data tends to be skewed in favor of one class, and classifiers that need to be learned will generalize poorly. Proportion SVMs don’t generalize well when the proportions are skewed.

In the field, one can estimate proportions very well using group testing, estimate a classifier on the data using only proportion information between classes. This not only reduces the effort required on labeling, but provides a path towards having intelligent systems in the wild jungle that is the internet. The hurdle that needs to be crossed before we can achieve this, is to develop techniques to change/modify or propose new methods to learn proportion-classifiers on extremely skewed datasets, taking ques from the theory of imbalanced learning.

# Online Subspace Learning (and recent applications in background subtraction)

In this post I will review two recent papers that I have been reading up on background subtraction, which follow the assumption that the background in an image approximately lies on a low dimensional subspace, under different lighting conditions. The two works namely –GROUSEGRASTA have shown interesting results for realtime background subtraction, which could potentially be useful and more robust than standard techniques which are fast like using the median filter, or mixture of Gaussians per pixel. Online Subspace Learning: The underlying premise of these methods is that if a dynamical system is very high dimensional, it can often be well approximated by a low dimensional subspace. As a result, instead of actually observing all the dimensions (which maybe impractical), one can randomly subsample the dimensions and make inferences on the lower dimensional data without losing a lot of information, i.e. learn the subspace accurately. All the three works I will focus on use the fact that the space of all subspaces is called the Grassmannian manifold. So the optimization of the cost function that will be proposed in each case will involve some gradient descent on the curved space.

(side note) The Grassmannian is an interesting manifold which I encounter in my work (example 1, 2) quite often but in a slightly different context. The Grassmannian provides a nice way to represent shapes in the context of human silhouettes for activity analysis where a shape is represented as a subspace.

More info on the Grassmann manifold and its properties here: The Geometry of algorithms with orthogonality constraints GROUSE: (Grassmannian Rank-One Update Subspace Estimation) To track a d-dimensional subspace of $\mathcal{R}^n$ that evolves over time denoted by $S[t]$. At every time instant we observe a vector $v_t \in S[t]$  at locations $\Omega_t \subset {1,\dots,n}$. Basically randomly picking dimensions within the large $n$ dimensional vector to accurately approximate the $d$ – dimensional subspace. Let $U$ be a basis for $S$. The cost function is as follows: $F(S;t) = \underset{a}{\text{min}} ||U_{\Omega_t}a - v_{\Omega_t}||^2$ Here $U$ lies on the Grassmann manifold as it represents the subspace. To solve for $U$ at each step, we need to find the gradient on the Grassmannian manifold. To do this, the authors represent the cost function in terms of $U$ rather than $U_{\Omega_t}$ as follows: $F(S;t) = \underset{a}{\text{min}} ||\Delta_{\Omega_t}(Ua - v_t||^2$ we have now pushed the indexing into $\Delta_{\Omega_t}$. Next, we take the derivate of this expression with respect to you and use that information to obtain the gradient of the function on the Grassmannian. This allows us to traverse along the direction of the gradient to find the best update of U using the following algorithm.   The important contribution of this work is that one can operate on a small subset of dimensions and still recover the entire subspace. This allows us to search and update the subspace much quicker than traditional methods. It is also been shown that such a technique is an excellent proxy for the amount of energy that lies in the subspace, provided that the number of dimensions ($|\Omega|$) is greater than the true subspace dimension ($d$). GRASTA : Grasmannian Robust Adaptive Subspace Tracking Algorithm Using the same idea of subspace tracking, the GRASTA attempts to address the problem of background subtraction. They make the assumption that in modeling a scene, the foreground belong to the objects that change more rapidly than the background. The simplest approach for BG-Subtraction is to keep a running mean or median of the previous frames which is efficient and easy to implement, but is not very responsive to quick changes in the scene like lighting  etc. Using the median gives a 0-dimensional model for the background where we can measure the residual of a given frame as its distance from a point (the median). This model can be made more powerful by keeping a d-dimensional subspace or affine set. Here each image is vectorized into a long vector, and such frames are observed. This matrix can be expressed in an d-dimensional orthonormal basis with weights (similar to above). Finally the model is described as below: At each time step t, a vector frame is generated as follows: $v_t = U_t w_t + s_t + \eta_t$ where $U$ is the orthonormal basis, $w_t$ is the weight/coefficient vector, $s_t$ is the sparse outlier vector that models the foreground (since only a small part of the long vector $v_t$ is the foreground), and the last term is the noise. The problem then is to find out the optimal basis $U_t$ and the weight vectors along with the sparse term. This is solved using the gradient descent on the Grassmannian as earlier and Alternating Direction Method of Multipliers (ADMM) respectively. The “robustness” as in GRASTA is from the fact that the cost function is formulated using the $\ell_1$ norm as opposed to $\ell_2$ norm as is done in the case of GROUSE.

# Thoughts from MMDS 2014, Berkeley

I had the opportunity to attend a workshop on Modern Massive Data Sets previous week and it had a wonderful mix of industry and academic talks from the people who are leading the efforts to understand, analyze and process massive data sets efficiently. I was attending as an intern at Dropcam, which is a home monitoring startup that currently has more inbound video than Youtube.

Naturally, this was an interesting conference for us with a lot of ideas, which I will try to summarize in this post. The mix between Graph Theory and Linear Algebra was perfect to satisfy CS and EE folks. A note – italics in blue are my thoughts from that day which may or may not be related to the talks. Day 1 highlights

• Large scale ML at Verizon – Ashok Srivastava: Recommending ads based on context, using Non-Negative Matrix Factorization and Multiple Kernel Anomaly Detection
• (What is the minimum amount of information to learn that can perform inferences?)
• Content based search from 50TB of video – Gerald Friedland, ICSI: While this was not a new problem, it was very relevant to us. The speaker spoke of searching inside videos using “mental concepts”. Another interesting point he made was about the difference between “precepts” and “concepts”.
• (there were other interesting talks, but I missed the second session of day 1)

Day 2 highlights

• Counterfactual Reasoning and Learning Systems – Leon Bottou, MSR Spoke about the complex systems that arise in most web-scale experiments such as search engines, ad placement engines etc. Where the user interaction, placement of an ad, context etc. contribute to a wealth of information during the event. The speaker’s claims as that these manifest themselves as a correlation/causation paradox. Some other thoughts – Reichenbach’s Principle or the common cause principle. If two events A, B are correlated, how does manipulating ‘B’ affect ‘A’? Related to A/B testing, counter factual reasoning. Relevant paper http://arxiv.org/abs/1209.2355
• Lot of PageRank based talks – PageRank, Personalized PageRank, Generalized Personalized PageRank!

Day 3 highlights

• “Embarrassingly Parallel” was a phrase that got thrown around very often. I had no idea what it meant so I looked it up: Any problem that does not require any/much effort to separate into parallel problems that can be solved simultaneously.
• Analyzing Big Graphs via Sketching and Streaming – Andrew McGregor, UMass This has to be among one of my favorite talks from the conference. Prof. McGregor set up the problem well enough that someone with little knowledge about the topic was able to quickly appreciate the novelty. Essentially their goal was to extend streaming and sketching algorithms to graphs. These techniques are popularly used to address problems in time series analysis where the length of the data maybe be very large that one requires to use special techniques of processing. The speaker’s goal was to devise an efficient way to determine if a graph was connected using sketching. Relevant survey – McGregor, SIGMOD 2014
• Sketching is an interesting domain where you want to project your data onto a lower dimensional space (like dimensionality reduction) where some properties such as the norm, distance between points etc. are preserved to some degree tau. When you impose a sparse structure on the input vector x, compressive sensing  tells us that when we use a random matrix transform, we are able to exactly recover x. However, x need not always be sparse, and we can use it like Andrew McGregor explained.
• IPython: A language independent framework for computation and data, Fernando Perez, UC Berkeley website: http://ipython.org I was very impressed by the presentation, IPython looks like a very good scientific computing language. For most of us in the scientific community, this maybe an effective and free competitor to MatLab. I have become fond of Python very quickly with its various packages such as NumPy, SciPy etc. which are sufficient for most numerical/data analysis. IPython also has the ability to interact with code then and there, which is very cool for analysis. When I typically write code, I end up spitting out various results, several times with different parameters. This is ideal for that purpose, with a very neat way of visualizing results. More information regarding handling different types of data can be found here: IPython Notebooks
• CUR Factorization: This was the first I heard about this factorization method, and I am glad I did! CUR matrix approximation basically attempts to factorize a matrix into three constituent matrices (‘C’ ‘U’ ‘R’) where ‘C’ and ‘R’ can be the columns and rows of the original matrix it self! This is similar to SVD decomposition, but it is a worse approximation. The reason it interests me (and many, more qualified others) is because this decomposition is much more intuitive to interpret than compared to the SVD. To put it simply, you can identify which columns of the matrix you want to keep, based on the approximation error that you can afford. I am interested in the problem of “smart sampling”, for applications in vision (shameless plug). Previous methods I have come across like Manifold Precis ( Shroff et al., 2011) which uses rank reducing QR decomposition to solve for a similar problem. I am definitely looking forward to reading about this, to see where I can apply it. At MMDS, a notable talk on this was by Mark Embre from Virginia Tech who proposed to achieve CUR factorization using discrete empirical interpolation.

It was a great workshop and I got to meet a lot of interesting folks!   Rushil

# [Review] Detecting Irregularities in Images and Video (ICCV, 2005)

note: This is a slightly older paper, but I am reviewing it because it appeared to be a popular paper in the area of anomaly detection, in which I have a new found interest.

• What?

This paper focuses on the problem of anomaly detection in images and videos. This problem is defined as finding (detecting) parts of the signal (image or video) that are significantly (and semantically??) different from the rest of the signal. This is also similar to studying saliency in images. They pose the problem in the following way: Given a signal (‘query’) they try to see if they can compose it using previously learned spatio-temporal image patches. Higher the number of patches employed in explaining the query, greater the familiarity of the query. They also explain the saliency problem wherein, their database is now the rest of the image/video it self. That is, if any portion of the signal cannot “sufficiently well explained” by the rest of the signal then it is important (an anomaly).

An illustration of the method employed in this paper

• Why

One of the  biggest concerns in security is identifying anomalies in video. The problem is an important one, particularly hard when this paper was written (prior to the “activity recognition” topic gaining immense popularity). The authors also address the problem of saliency in a new way that they claim to be more intuitive. This formulation of theirs is general enough to address several problems in vision such as : 1. Attention in images/video 2. Recognition of suspicious actions and 3. Recognition of suspicious objects.
• How?

Each spatial (or spatio-temporal, in case of video) patch is broken down into several hundreds of smaller patches in various scales to form an “ensemble of patches” corresponding to a particular point in the image or video. Each patch within the ensemble is associated with two kinds of information i) A descriptor vector (a simple one is used here, with gradient information at different scales that is stacked into a vector and normalized to sum to 1) and ii) its location in absolute coordinates. (A side note : Since they are comparing patches, it is easy to see that no two patches are exactly the same. Therefore, if we want to measure similarity there must be enough non-rigidity to allow for small variations. This is why they include geometric information and absolute coordinates.) Given an ensemble of a query signal y, we would like to compute the joint likelihood between it and a hidden patch in the database x, as P(y,x) = P(y|x)P(x). P(x) is non parametric and must be defined from the samples directly. The authors assume the similarity between the i^th & j^th observed patches in x and y as a Gaussian distribution to make the computation of the likelihood more tractable. With a few more assumptions similar to this, the authors are able to factorize the likelihood function. The relationship between all the variables involved is depicted as a Bayesian network as shown below. Now that we have P(y,x), for a query ensemble y, we seek the hidden ensemble x that maximizes its MAP (maximum a-posterior probability) assignment. The factorized expression for P(y,x) can be “phrased as a message passing (Belief Propagation)algorithm in the graph” shown below. The authors first produce a set of possible candidates based on their descriptor similarity and then compare the possible set of origins c_x. Their method with an efficient search that is proposed in the paper, is shown to scale well with increase in number of patches.

This paper can be found here http://www.wisdom.weizmann.ac.il/~boiman/publications/boiman_irani_detecting_irregularities.pdf

# [Review] WhittleSearch: Image Search with Relative Attribute Feedback (CVPR, 2012)

• What?
This paper proposes a method to “whittle” away parts of the image search space with some user feedback. They attack the problem of image searching where the user is able to provide attributes that are different from the current iteration of results, until the obtained search results are closely matching what the user has in mind.  Ex: “Show me images like these, but sportier“. The authors make use of this information to learn functions for each “nameable attribute” (sportiness, furriness etc.) offline which update themselves based on the feedback.
• Why?
Image search is a hard problem in general. There has always been a semantic gap between high level attributes and low level features that are used. In the recent years, due to the rise in popularity of image descriptors and large image databases, searching with images has seen growth. The authors believe that the best way to solve this problem is to add a “Human in the loop” providing high level feedback regarding the images they are interested in, thereby refining their search results.Using feedback to refine search has been used before, however the feedback is either very coarse (relevant/irrelevant) or involves adjusting parameters for the algorithm to reiterate. While the former is more intuitive, it leaves the algorithm clueless about which part of the image the user has found (ir)relevant and the latter is hard for the user who does not understand the intricacies of the working algorithm. This way by providing high level feedback, it is a step further into image search. For example, when browsing images of potential dates on a dating website, he can say: “I am interested in someone who looks like this, but with longer hair and more smiling.”
• How?
The authors first learn functions offline to determine the attributes for a given image. For the image search part it is assumed that these attributes are available (it is also mentioned that such attributes can be learned). The first step is to predict attributes for each image, performed manually using Amazon’s MTurk. A sample question for the user is shown below:

Sample question on MTurk

Once the predictive functions are learned, the authors learn ranking functions, one per attribute, such that for two image descriptors x_i and x_j, if x_i has more of one attribute than x_j, it will rank better according to the function learned. The objective function to be optimized is similar to SVM training and “is solvable using similar decomposition algorithms”. Now each image can be described by its attributes and one has the knowledge of how much of each attribute is present in it as well.  The feedback is incorporated as an additional constraint such that in the next iteration, the desired attribute is more or less or similar to the reference images presented in the current iteration. More details from this excerpt of the paper:

The paper can be found here

# A Reading Group for Computer Vision, Machine Learning & Pattern Recognition at Arizona State University

[edit: Jan 20, 2016] This post was written a long ago, when I was interested in figuring out who was working on what within ASU. It turned out to be a lot harder than I expected to get something like this started, and I have since graduated from ASU. However, I hope this post will serve as an (incomplete) entry point into some of the vision research that’s being conducted at ASU. There have also been several newer faculty who have joined with exciting research areas!

With every passing day, I realize more – how important it is for me to document my opinions, findings and thoughts on several topics that I read. Not only does this help me learn faster, expose myself to a broader spectrum of papers and ideas but also helps me create this portfolio of my works. I realize this is something that’s important to every PhD aspirant. It is my understanding that doing a PhD is a lot like a beginner’s course in Entrepreneurship. You have to develop ideas, form opinions and sell them to your community. Of course, you don’t go out of business and have your start-up crashing down if your product doesn’t sell, but you have different pressures like establishing your ideas firmly.

With those side notes apart, I wish to begin a small reading group here at ASU. I have read about a similar group in CMU (http://www.cs.cmu.edu/~misc-read/) that is well established now. Inspired by this blog post, a similar group will do folks at ASU a lot of good.

Research Groups working on Computer Vision and/or Machine Learning at ASU:
Although ASU’s research in EE is strongly towards communication, networking etc. Few professors are changing the research landscape to add to computer vision research. The CS department, however, is known to have strong faculty working on Machine Learning, Pattern Recognition and some computer vision. I shall try to note all the groups here in order to make a comprehensive list of people who will be the ideal audience for the aforementioned seminar sessions.

Electrical, Computer & Energy Engineering:

• Dr. Andreas Spanias’ group (Main contributors [1], [2]) –  Sparse representations, dictionary learning methods, high dimensional geometry etc.
• Dr. Lina Karam’s group – Well established group, work mainly on Compression codecs, Visual quality assessment, Saliency in Images/Videos etc.
• Dr. Pavan Turaga‘s group – Relatively new, main focus – topics on activity analysis, compressive sensing, dictionary learning, non linear geometries etc.

Next is the CS group, with many more faculty working on different aspects of these areas. I am less familiar with them and hence will just list them (in no particular order) for record’s sake.

Computing, Informatics and Decision Systems Engineering:

So assuming each of them have around 5 students and out of that at least 2 are interested in this reading session that gives us around ~15 students to start off this with. Which seems like a reasonable number. Hopefully, if the graduate student association of ASU (GPSA) recognizes this as a grad organization, they’ll even fund some part of it.