The amount of digital information accessible on the web and, more generally, in data repositories of various sorts is
growing at an ever faster pace. Increasingly, digital information means visual content (image and video), and this
development has resulted in a situation where computational solutions are lagging behind a diverse range of
current image/video search, processing and management needs. There is a big, and as yet unbridged, semantic
gap between visual content and language. Finding solutions for image/video retrieval, automatic image/video
annotation and similar challenges will require this gap to be bridged, and this in turn will require expertise from both
the computer vision (CV) and natural language processing (NLP) fields. Yet, while language and vision are the two
primary modalities for human perception and computer-mediated communication, the two corresponding computing
science disciplines hardly talk to each other, and this is part of the reason why the language-vision gap is still so
wide: NLP research is perhaps not aware enough of the range of possible applications involving visual content and
their specific language processing requirements; CV tends to underestimate the complexity of the
language processing problem, and currently uses mostly basic language processing technology, even though
sophisticated, high-performance tools exist.
We propose an EPSRC Network on Vision and Language, V&L Net, to create a forum for researchers from CV
and NLP to meet and exchange ideas, expertise and technology. The UK has some of the world's leading
researchers in NLP and CV. V&L Net aims to tap this body of expertise to create new strategic partnerships for
narrowing the language-vision gap by developing the theory required for solutions to the difficult challenges posed
by our increasingly multi-modal world. A successful network will place the UK at the forefront of developing solutions
at the language-vision intersection which have clear commercial potential.
Our overarching goal in V&L Net is the creation of a new interdisciplinary research community working towards
computational solutions for challenges that involve both language and vision. By (i) bringing researchers from the
two currently separate disciplines of computer vision and language processing together, (ii) facilitating access to
relevant information, expertise, and resources, and (iii) stimulating research and pump-priming individual research
projects, we aim to engender a substantial increase in interdisciplinary research activity. As more work draws on
expertise from both computer vision and language processing, we expect to see a step
change in progress towards solutions for a range of real-world challenges as well as theoretical questions. While
the latter will tend to have a more long-term impact (laying the groundwork for future breakthroughs), the former
have substantial potential to result in ground-breaking new products and services that will improve people's quality
of life in diverse ways even in the short to medium term. People with impairments in sight, hearing and cognitive
ability will benefit from assistive technology that will help them access multiple modalities. Improvements in image
search and retrieval will enhance the online search experience, as well as help institutions such as hospitals and police
forces to cope with the massive amounts of images and videos they deal with daily.