
Details of Grant 

EPSRC Reference: EP/K024272/1
Title: Modelling Discourse in Statistical Machine Translation
Principal Investigator: Specia, Dr L
Other Investigators:
Researcher Co-Investigators:
Project Partners:
GALA (Globalization and Localis Assoc), Microsoft, TAUS
Department: Computer Science
Organisation: University of Sheffield
Scheme: First Grant - Revised 2009
Starts: 01 December 2013
Ends: 31 May 2015
Value (£): 99,127
EPSRC Research Topic Classifications:
Artificial Intelligence, Comput./Corpus Linguistics
EPSRC Industrial Sector Classifications:
No relevance to Underpinning Sectors
Related Grants:
Panel History:
Panel Date: 16 Jan 2013
Panel Name: EPSRC ICT Responsive Mode - Jan 2013
Outcome: Announced
Summary on Grant Application Form
Automatic translation of human languages is an increasing necessity in our global society: large amounts of text are constantly produced in various languages and fast, cheap and accurate translation into a number of other languages is required to foster business and communication within and across nations. This high demand for translations cannot be fulfilled by human translators because of its sheer volume, cost and the lack of skilled professionals.

Different Machine Translation (MT) approaches have been proposed to automate translation. The most widely adopted approach is Statistical MT (SMT): the broad availability of free, open source SMT systems, along with significant improvements in their quality in recent years, has made SMT a very promising technology. This is evidenced by the many commercially successful SMT systems, such as those developed by Google, Microsoft and IBM.

Despite their recent success, SMT systems are still far from producing translations that reach human quality levels. A major limitation is that they translate sentences one by one, in isolation, without using any information about the context in which those sentences appear. This keeps systems computationally feasible; however, more advanced approaches that overcome this limitation are needed to improve SMT quality and make it a de facto translation technology. The context surrounding a sentence -- its discourse -- contains information about dependencies connecting words or expressions across sentences. Neglecting such connections can lead to incoherent and inconsistent translations:

-- Humans use different words to refer to the same concepts in different sentences. If the links between these words are not identified, sentences can be incoherently translated. E.g.: in "The man bought a leather bag" and "It was soft", Bing Translator misses the connection between "it" and "bag". For Portuguese it produces "[...]. *Ele *foi *suave", which conveys a completely inadequate meaning: "He went smooth".

-- The same text can appear in different sentences. If the links between these occurrences are not identified, they can be translated inconsistently. E.g.: in "He took cash from the bank" and "The bank was far away", only the first sentence has enough information about the correct meaning of "bank", and thus the second occurrence gets translated as "*margem" in Portuguese (river bank).

SMT is a young area and researchers have so far focused on overcoming issues within sentence boundaries. Most of these issues have been addressed to a large extent in recent years and it is now time to turn to discourse-level challenges. Very few attempts to deal with these challenges have been proposed. These are limited to pre- or post-processing strategies.

This project aims at explicitly modelling discourse-level relationships across sentences in SMT at translation time without compromising the scalability of existing approaches. The proposed approach includes (i) a novel framework to model discourse-level relationships by learning valid transitions across sentences based on rich linguistic information for both source and target languages and (ii) a constraint-based inference algorithm to use these relationships to guide the translation process while keeping it tractable. By decoupling model learning and inference, a basic SMT model will be augmented at inference time with document-wide constraints representing expected discourse relationships that are too expensive or unavailable at model learning time.
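The idea of adding document-wide constraints at inference time can be illustrated with a toy sketch. It is not the project's algorithm: it assumes each sentence already has a small list of candidate translations with sentence-level scores, and it applies one simple constraint (a source word should keep the translation it received earlier in the document) through a greedy left-to-right pass. The names `Hypothesis`, `decode_document` and `penalty`, the scores, and the Portuguese candidates are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One candidate translation of a single sentence (hypothetical structure)."""
    text: str               # candidate target sentence
    model_score: float      # sentence-level model score, higher is better
    lexical_choices: dict   # source word -> target word used in this candidate


def decode_document(nbest_lists, penalty=1.0):
    """Pick one candidate per sentence, penalising inconsistent word choices.

    The sentence-level scores are left untouched; the consistency constraint
    is applied only here, at inference time. The pass is greedy, so the cost
    stays linear in the number of sentences.
    """
    memory = {}   # source word -> target word chosen earlier in the document
    output = []
    for candidates in nbest_lists:
        def constrained_score(hyp):
            # Count word choices that contradict an earlier sentence.
            violations = sum(
                1
                for src, tgt in hyp.lexical_choices.items()
                if src in memory and memory[src] != tgt
            )
            return hyp.model_score - penalty * violations

        best = max(candidates, key=constrained_score)
        for src, tgt in best.lexical_choices.items():
            memory.setdefault(src, tgt)  # the first choice in the document wins
        output.append(best.text)
    return output


# Invented candidates for "He took cash from the bank" / "The bank was far away".
nbest_lists = [
    [
        Hypothesis("Ele tirou dinheiro do banco", -1.0, {"bank": "banco"}),
        Hypothesis("Ele tirou dinheiro da margem", -3.0, {"bank": "margem"}),
    ],
    [
        Hypothesis("A margem ficava longe", -1.0, {"bank": "margem"}),
        Hypothesis("O banco ficava longe", -1.5, {"bank": "banco"}),
    ],
]

isolated = decode_document(nbest_lists, penalty=0.0)     # sentence-by-sentence
constrained = decode_document(nbest_lists, penalty=1.0)  # with the constraint
```

With `penalty=0.0` each sentence is chosen in isolation and the second sentence keeps the river-bank reading; with `penalty=1.0` the choice made in the first sentence carries over to the second. A greedy pass is only one way to keep inference tractable, and it cannot revise an early choice in light of later sentences.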
Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Impacts
Description: This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: http://www.shef.ac.uk