The goal of computer vision is to impart machines with the ability to see, that is, to understand an image similar to a human. This consists of identifying which object categories are present in an image and where (car, person, trees), their actions (running, driving, sitting), and their relative locations (person inside the car, car above the road). The challenge of computer vision lies in obtaining a powerful discriminative representation of an image that allows us to infer the scene encoded within it. Consider, for instance, the representation of an image as captured by a camera, which consists of color values of each of its pixels. Two images that differ greatly in their color values may still depict the same scene (for example, images of the same location at day and night). At the same time, small changes in the color values may result in a completely different scene where objects have moved considerably. This makes raw color values unsuitable for understanding the scene depicted in the image.
Traditional computer vision approaches have relied on hand-designed representations of an image that are more amenable to scene interpretation. Given such a representation, there exist several principled formulations for learning to interpret scenes, which take advantage of the powerful mathematical programming framework of convex optimization. Convex optimization offers many computational advantages: it scales elegantly with the size of the problem, it provides global optimality, it offers convergence guarantees and it can be parallelised over multiple machines without affecting the accuracy.
Recent years have seen the rise of deep learning, and specifically convolutional neural networks (CNN), which aim to automatically obtain the representation of visual data from a large training set. While an automated approach is highly desirable due to its scalability, it comes with the challenge of solving highly complex non-convex mathematical programs. The aim of our research is to overcome these challenges by finding connections between convex optimization and deep learning. The key observation is that the non-convex programs encountered in deep learning for computer vision have a special structure that is closely related to convexity. Specifically, while the mathematical programs are not convex, they are of a difference-of-convex (DC) form. A DC program can be optimized efficiently by an iterative algorithm, which, at each iteration, solves a convex optimization problem.
Our aim is to exploit the structure of DC-CNNs to design the next generation of algorithms for computer vision. Specifically, we will build customized algorithms that will scale up the dimensionality of the CNN by orders of magnitude while keeping the computational cost low. Our algorithms will retain many of the highly desirable benefits of convex programming (convergence, quality guarantees, elegant scaling, distributed computing) while still allow the automatic estimation of image representations.
The impact of such principled and efficient algorithms is potentially huge. The new CNN architectures that this enables will allow researchers to address significantly more complex visual tasks. For example, a generative network that can provide a set of diverse future frames of a given video sequence, or a intelligent agent that can crawl the web for images and videos and complete the captions in order to bridge the gap between visual data and searchable content. Our research results will be made publicly available via open source software. The project is also likely to have a large academic impact, consolidating the leadership of the UK in machine learning and computer vision.