Details of Grant 

EPSRC Reference: EP/J013293/1
Title: Learning Highly Structured Sparse Latent Variable Models
Principal Investigator: Silva, Dr RBd
Other Investigators:
Researcher Co-investigators:
Project Partners:
Department: Statistical Science
Organisation: University College London
Scheme: First Grant - Revised 2009
Starts: 01 October 2012
Ends: 31 December 2013
Value (£): 99,532
EPSRC Research Topic Classifications:
Artificial Intelligence
EPSRC Industrial Sector Classifications:
No relevance to Underpinning Sectors
Related Grants:
Panel History:
Panel Date: 01 Feb 2012
Panel Name: EPSRC ICT Responsive Mode - Feb 2012
Outcome: Announced
Summary on Grant Application Form
Technological advances have brought the ability to collect and analyse patterns in high-dimensional databases. One particular type of analysis concerns problems where the recorded variables indirectly measure hidden factors that explain away observed associations. For instance, the National NHS Staff Survey of 2009, taken by over one hundred thousand staff members, contained several questions on job satisfaction. It is only natural that the patterns of observed answers are the result of some common hidden factors that remain unrecorded. In particular, such answers could arguably be grouped by factors such as perceptions of the quality of work practice, support of colleagues and so on, that are only indirectly measured.

In practice, when making sense out of a high-dimensional data source, it is useful to reduce the observations to a small number of common factors. Since records are affected by sources of variability that are unrelated to the actual factors (think of someone having a bad day, or even typing wrong information by mistake), removing such artifacts is also part of the statistical problem. A model that estimates such transformations is said to perform "dimensionality reduction" and "smoothing".

There are a variety of methods to accomplish such tasks. At one end of the spectrum, there are models that assume the data match some very simple patterns such as bell curves and pre-determined factors. At the other end are very powerful models that allow for flexible patterns and even an infinite number of factors, inferred from data under very mild assumptions. The proposed work tries to bridge these extremes, because the shortcomings of the very flexible models are subtle but important. In particular, they can be very sensitive to changes in the data - meaning that very different conclusions about the hidden factors might be reached if a slightly different set of observations is provided. Moreover, there are computational concerns: calculating the desired estimates usually requires an iterative process that needs some initial guess about these estimates. So, even for a fixed dataset, results can vary considerably if such an initial guess is not carefully chosen. Our motivation is that if one does have these concerns, one might as well take the trouble of incorporating knowledge about the domain. The upshot: we do not aim to be general, and instead target applications where some reasonable domain knowledge exists. In particular, we focus on problems where the hidden targets of interest are pre-specified, but infinitely many others might exist. While we map our data to a fixed space of hidden variables, we provide an approach that is robust to the presence of an unbounded number of other, implicit, common factors. The proposed models are adaptive: they account for possible extra variability between the given hidden factors that would be missed by the simpler models. At the same time, they are designed to be less sensitive both to initial conditions and to small changes in the datasets.

Key Findings
The act of data collection usually implies the simultaneous measurement of many aspects of a social or natural phenomenon. For instance, surveys are usually implemented through questionnaires that probe respondents (such as clients of a company, patients in a medical study or staff members in an organisation) from a variety of perspectives (such as questions on job satisfaction, relationship to co-workers, welfare and so on). It is not uncommon that, due to the very structure in which such data collection is organised, one should expect that a few latent traits will explain the association among the recorded variables. However, a model for such data should, to some degree, include other latent traits that were not foreseen by the data scientist modelling the phenomenon of interest. Misspecification of latent traits can result in a bad fit of a postulated model to the data, while a statistical approach that does not exploit the background knowledge of the problem can result in uninterpretable outcomes and be exceedingly computationally demanding.

Our first key finding is that some background knowledge about the main expected latent traits can go a long way. We developed an approach that starts with a given partition of the measurements according to what should be the main latent traits that explain the overall data. For instance, the data scientist can group the questions in a questionnaire according to the main aspect they are intended to measure (so that questions about satisfaction with job duties can be grouped in a different set from those about satisfaction with financial compensation). Some models can infer a single number summarising the relationship of each group with respect to others, but this can be far too restrictive and waste some of the information in the data. Our method searches for potential new traits by identifying which associations implied by a partially built model do not match the ones supported by the empirical observations. It then iteratively "patches" a candidate structure with residual associations that correspond to new latent traits of very limited scope - hence, retaining most of the interpretability of the original main traits. We show that this simple "partition-and-patch" recipe for breaking apart a large system of measurements can provide better fitting models, while at the same time respecting the background knowledge of the data scientist and deviating from it using only very local modifications to the original specification.
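The residual-detection step of the "partition-and-patch" recipe can be illustrated with a minimal sketch. It is not the project's actual algorithm: the one-factor-per-group model, the function names, the loadings, the factor correlations and the threshold below are all assumptions chosen for illustration. The sketch compares the associations implied by a partition-based model with the observed ones and flags the pairs that disagree, which are the candidates for a local "patch".

```python
# Illustrative sketch only: all names, numbers and the threshold are hypothetical.

def implied_correlation(loadings, group, factor_corr):
    """Correlations implied by a model with one latent trait per group.

    Assumed model: corr(i, j) = loading_i * loading_j * corr(trait_i, trait_j).
    """
    n = len(loadings)
    return [
        [
            1.0 if i == j
            else loadings[i] * loadings[j] * factor_corr[group[i]][group[j]]
            for j in range(n)
        ]
        for i in range(n)
    ]


def flag_residual_pairs(observed, implied, threshold):
    """Return (i, j, residual) for pairs the model fails to explain, largest first."""
    n = len(observed)
    flagged = []
    for i in range(n):
        for j in range(i + 1, n):
            residual = observed[i][j] - implied[i][j]
            if abs(residual) > threshold:
                flagged.append((i, j, residual))
    flagged.sort(key=lambda t: abs(t[2]), reverse=True)
    return flagged


# Five questions partitioned into two groups by the data scientist.
group = [0, 0, 0, 1, 1]
loadings = [0.8, 0.7, 0.6, 0.9, 0.8]
factor_corr = [[1.0, 0.3], [0.3, 1.0]]
implied = implied_correlation(loadings, group, factor_corr)

# Synthetic "observed" correlations: the model plus one extra association
# between questions 1 and 3 that the partition does not account for.
observed = [row[:] for row in implied]
observed[1][3] += 0.25
observed[3][1] += 0.25

flagged = flag_residual_pairs(observed, implied, threshold=0.1)
```

In this toy setting only the pair (1, 3) is flagged, so a patch would add a narrow extra trait shared by those two questions and leave the rest of the partition untouched. A real analysis would use a statistical criterion rather than a fixed threshold.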

Our second finding is that we can avoid making assumptions about the nature of the individual measurements and instead model their relationships directly. This means that we can avoid assigning numerical meanings to graded answers in a questionnaire (for instance, which numbers would one assign to answers that vary from "Strongly agree" to "Strongly disagree"?). We developed advanced algorithms that work by directly comparing the ranks of the answers instead of the actual values, which reduces the number of assumptions required to draw conclusions from such data.
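A small sketch can show why comparing ranks sidesteps the choice of numbers. It uses Kendall's tau, a standard rank-based association measure, purely as a stand-in: the text above does not name the statistic used in the project, and the answers and the two codings below are made up. Because the measure depends only on the ordering of the answers, any order-preserving coding of the answer scale gives the same value.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / all pairs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return score / (n * (n - 1) / 2)


LEVELS = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]

# Hypothetical answers of five respondents to two questions.
q1 = ["Agree", "Strongly agree", "Disagree", "Neutral", "Agree"]
q2 = ["Neutral", "Agree", "Strongly disagree", "Disagree", "Strongly agree"]

# Two arbitrary codings that both respect the order of the answer scale.
coding_a = {level: i for i, level in enumerate(LEVELS)}  # 0, 1, 2, 3, 4
coding_b = dict(zip(LEVELS, [1, 2, 3, 10, 100]))         # very uneven spacing

tau_a = kendall_tau_a([coding_a[x] for x in q1], [coding_a[y] for y in q2])
tau_b = kendall_tau_a([coding_b[x] for x in q1], [coding_b[y] for y in q2])
```

Here `tau_a` and `tau_b` coincide even though the second coding stretches the gaps between answers. A value-based measure such as the Pearson correlation would generally change between the two codings.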

Our third finding concerns the nature of some probabilistic models that are built by modelling a large system of variables piece by piece. Each piece consists of a distribution over the values of a small number of variables, not unlike the idea of starting from a partition of the variables. The whole system has recently been defined in the literature as the product of such pieces instead of a seemingly complex system of latent traits. However, this poses complicated computational statistics problems: although the system as a whole is always well-defined, the product of such pieces needs to be translated into a function that measures the probability of a data point. Although it is possible to do some sort of piecewise fitting, as we had originally planned, we realised we could do much more. We found that this translation process can also be cast as operations in a highly structured latent variable system, but with a process of a very different computational structure. This unifies two schools of model construction in statistics, and allows them to exchange ideas for data-fitting algorithms and for languages that express more flexible probability distributions.
Potential use in non-academic contexts
A main application, as mentioned above, is modelling social data and using the result as a tool to understand the validity of the measurements (as indicated by the residual structure found by the approach, and by the new latent traits that can cast light on how appropriate the measurements were to begin with). This is of potential use not only for social scientists, but also in business contexts, where private surveys and marketing research can benefit from validating the measurement of latent traits that summarise the result of a survey. Moreover, in industry at large, there is potential for exploiting the product-of-distributions framework in predictive modelling, given further research on how to adapt such models to streaming data setups and possibly to network data.
No information has been submitted for this grant.
Sectors submitted by the Researcher
Information & Communication Technologies
Project URL:  
Further Information:  
Organisation Website: http://www.ucl.ac.uk