I had the opportunity to attend a workshop on Modern Massive Data Sets last week, and it had a wonderful mix of industry and academic talks from the people who are leading the efforts to understand, analyze and process massive data sets efficiently. I was attending as an intern at Dropcam, which is a home monitoring startup that currently has more inbound video than YouTube.
Naturally, this was an interesting conference for us with a lot of ideas, which I will try to summarize in this post. The mix between Graph Theory and Linear Algebra was perfect to satisfy CS and EE folks. A note – italics in blue are my thoughts from that day which may or may not be related to the talks.
Day 1 highlights
- Large scale ML at Verizon – Ashok Srivastava: Recommending ads based on context, using Non-Negative Matrix Factorization and Multiple Kernel Anomaly Detection
- (What is the minimum amount of information one needs to learn in order to perform inference?)
- Content based search from 50TB of video – Gerald Friedland, ICSI: While this was not a new problem, it was very relevant to us. The speaker spoke of searching inside videos using “mental concepts”. Another interesting point he made was about the difference between “percepts” and “concepts”.
- (there were other interesting talks, but I missed the second session of day 1)
Day 2 highlights
- Counterfactual Reasoning and Learning Systems – Leon Bottou, MSR: He spoke about the complex systems that arise in most web-scale experiments, such as search engines and ad placement engines, where the user interaction, the placement of an ad, the context etc. all contribute a wealth of information during each event. The speaker’s claim was that these manifest themselves as a correlation/causation paradox. Some other thoughts – Reichenbach’s Principle, or the common cause principle: if two events A and B are correlated, how does manipulating ‘B’ affect ‘A’? This is related to A/B testing and counterfactual reasoning. Relevant paper: http://arxiv.org/abs/1209.2355
- A lot of PageRank-based talks – PageRank, Personalized PageRank, Generalized Personalized PageRank!
Day 3 highlights
- “Embarrassingly Parallel” was a phrase that got thrown around very often. I had no idea what it meant so I looked it up: a problem that takes little or no effort to split into independent subproblems that can be solved simultaneously.
- Analyzing Big Graphs via Sketching and Streaming – Andrew McGregor, UMass: This has to be one of my favorite talks from the conference. Prof. McGregor set up the problem well enough that someone with little knowledge about the topic was able to quickly appreciate the novelty. Essentially their goal was to extend streaming and sketching algorithms to graphs. These techniques are popularly used to address problems in time series analysis, where the data may be so long that it requires special processing techniques. The speaker’s goal was to devise an efficient way to determine if a graph was connected using sketching. Relevant survey – McGregor, SIGMOD 2014
- Sketching is an interesting domain where you want to project your data onto a lower dimensional space (like dimensionality reduction) in which some properties, such as the norm or the distance between points, are preserved to some degree tau. When you impose a sparse structure on the input vector x, compressive sensing tells us that a random matrix transform with enough measurements lets us recover x exactly. However, x need not always be sparse, and sketching can still be used in that case, as Andrew McGregor explained.
- IPython: A language independent framework for computation and data, Fernando Perez, UC Berkeley. Website: http://ipython.org I was very impressed by the presentation; IPython looks like a very good environment for scientific computing. For most of us in the scientific community, this may be an effective and free competitor to MATLAB. I have become fond of Python very quickly with its various packages such as NumPy, SciPy etc., which are sufficient for most numerical/data analysis. IPython also lets you interact with code then and there, which is very cool for analysis. When I typically write code, I end up spitting out various results, several times with different parameters. This is ideal for that purpose, with a very neat way of visualizing results. More information regarding handling different types of data can be found here: IPython Notebooks
- CUR Factorization: This was the first I heard about this factorization method, and I am glad I did! CUR matrix approximation basically attempts to factorize a matrix into three constituent matrices (‘C’ ‘U’ ‘R’) where ‘C’ and ‘R’ can be made of columns and rows of the original matrix itself! This is similar to the SVD, but for a given rank its approximation error is at least as large as that of the truncated SVD. The reason it interests me (and many, more qualified others) is that this decomposition is much more intuitive to interpret than the SVD. To put it simply, you can identify which columns of the matrix you want to keep, based on the approximation error that you can afford. I am interested in the problem of “smart sampling”, for applications in vision (shameless plug). Previous methods I have come across include Manifold Precis (Shroff et al., 2011), which uses a rank-revealing QR decomposition to solve a similar problem. I am definitely looking forward to reading about this, to see where I can apply it. At MMDS, a notable talk on this was by Mark Embree from Virginia Tech, who proposed to achieve CUR factorization using discrete empirical interpolation.
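
To make the CUR idea a bit more concrete, here is a minimal sketch in Python with NumPy. It is not the discrete empirical interpolation approach from the talk; as an assumption for illustration, it picks the columns and rows with the largest leverage scores (computed from the top-k singular vectors) and sets U = pinv(C) · A · pinv(R). The function name `cur_decomposition` and the test matrix are made up for this example.

```python
import numpy as np

def cur_decomposition(A, k):
    """Illustrative rank-k CUR: keep the k columns and k rows of A with the
    largest leverage scores, then fit the small linking matrix U."""
    U_svd, _, Vt = np.linalg.svd(A, full_matrices=False)
    # Leverage scores: squared norms of the top-k singular vector entries.
    col_scores = np.sum(Vt[:k, :] ** 2, axis=0)
    row_scores = np.sum(U_svd[:, :k] ** 2, axis=1)
    cols = np.sort(np.argsort(col_scores)[-k:])
    rows = np.sort(np.argsort(row_scores)[-k:])
    C = A[:, cols]          # actual columns of A
    R = A[rows, :]          # actual rows of A
    # U minimizes ||A - C U R||_F for the chosen C and R.
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R, cols, rows

# Hypothetical test data: a 20 x 15 matrix of exact rank 3.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 15))

C, U, R, cols, rows = cur_decomposition(A, k=3)
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
print("kept columns:", cols, "kept rows:", rows)
print("relative error:", rel_err)
```

The interpretability argument shows up in `cols` and `rows`: they are indices into the original data, so you can point at the actual columns that were kept, which is not possible with singular vectors. On a matrix that is only approximately low rank the error would not be near zero, and it can never beat the truncated SVD of the same rank.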
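
Going back to the sketching note from Day 3, the "project to a lower dimension and roughly preserve norms and distances" idea can be tried out in a few lines. This is a small sketch using a dense Gaussian random projection (the Johnson-Lindenstrauss flavor), which is my assumption for illustration and not the graph sketch from Prof. McGregor's talk; the sizes `d` and `m` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 2000, 400  # original and sketch dimensions (arbitrary choices)

# Random sketching matrix, scaled so that E[||S x||^2] = ||x||^2.
S = rng.standard_normal((m, d)) / np.sqrt(m)

# Two dense (non-sparse) vectors.
x = rng.standard_normal(d)
y = rng.standard_normal(d)
sx, sy = S @ x, S @ y

# Norms and pairwise distances are approximately preserved.
norm_ratio = np.linalg.norm(sx) / np.linalg.norm(x)
dist_ratio = np.linalg.norm(sx - sy) / np.linalg.norm(x - y)
print("norm ratio:", norm_ratio)
print("distance ratio:", dist_ratio)

# The sketch is linear, so a streaming update to x can be applied
# directly to the sketch without ever storing x itself.
update = np.zeros(d)
update[7] = 3.0
sx_updated = sx + S @ update
```

Both ratios should come out close to 1, and they get closer as `m` grows. The linearity in the last step is what makes sketches a natural fit for streams: each incoming update only touches the small sketch.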
It was a great workshop and I got to meet a lot of interesting folks!
Rushil