Non-negative matrix tri-factorisation with missing values, applied to drug sensitivity prediction
Thomas Alexander Brouwer, Computer Lab, University of Cambridge
As the amount of information we have about patients and diseases increases exponentially, we need new methods that can combine different types of information and thus enhance our analysis. One particular problem of interest is the prediction of drug sensitivity, the concentration needed to reduce gene activity by half, as this allows us to choose the best treatment strategy for patients. Due to the large number of possible drugs and cancer cell lines, datasets of drug sensitivity values are often incomplete, with a large fraction of entries missing. There is therefore a need for methods that can impute the missing values in the matrix while exploiting prior information about the drugs and diseases.
One family of methods that has proved particularly powerful for predicting missing values in matrices is matrix factorisation, where we factorise an observed matrix into the product of two or more matrices. A special form of matrix factorisation, non-negative matrix tri-factorisation (Ding et al., 2006), decomposes the original matrix into three matrices without any negative values. These matrices can be interpreted as performing clustering of rows and columns of the original matrix simultaneously. It has been used recently for clustering genes and phenotypes (Hwang et al., 2012), cancer subtypes (Liu et al., 2014), gene ontology terms (Gligorijevic et al., 2014), and protein interactions (Wang et al., 2013).
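As a minimal illustration of the tri-factorisation described above, the sketch below builds a small non-negative matrix as a product of three non-negative factors and reads cluster assignments off the outer factors. The names F, S and G, the toy values, and the argmax rule for assigning clusters are assumptions made for this example, not notation or choices taken from the cited papers.

```python
import numpy as np

# Hypothetical toy factors: 3 rows, 4 columns, 2 row clusters, 2 column clusters.
F = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])          # row-cluster memberships (rows x K)
S = np.array([[5.0, 1.0],
              [1.0, 4.0]])          # interactions between row and column clusters (K x L)
G = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.2, 0.8],
              [1.0, 0.0]])          # column-cluster memberships (columns x L)

# The approximated matrix is the product of the three non-negative factors.
R_hat = F @ S @ G.T

# One simple way to read off clusters: the largest membership in each row of F and G.
row_clusters = F.argmax(axis=1)
col_clusters = G.argmax(axis=1)
```

Here rows 0 and 1 fall into the same row cluster, and columns 0 and 3 into the same column cluster, which is the sense in which the factorisation clusters rows and columns simultaneously.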
These papers use multiplicative updates to find a good approximation of the factorisation, minimising the mean squared error between the original observed matrix and the product of the three inferred matrices. However, these updates require the entire original matrix to be known, which is often not the case. Previous research has typically dealt with this by setting unknown entries to a fixed value (e.g. 0 for no interaction), but the resulting solution is then fitted to these (potentially incorrect) values, and the quality of predictions for the missing entries suffers as a result.
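The effect of forcing unknown entries to zero can be seen in a small sketch. It compares the squared error computed over every entry of a zero-filled matrix with the error computed only over observed entries. The function name `squared_error`, the binary mask `M` (1 for observed, 0 for missing) and the toy data are assumptions for illustration.

```python
import numpy as np

def squared_error(R, R_hat, M=None):
    """Mean squared error over all entries, or only over entries where M == 1."""
    if M is None:
        M = np.ones_like(R)
    return float(np.sum(M * (R - R_hat) ** 2) / np.sum(M))

# Hypothetical ground truth (rank 1) with one entry treated as missing.
truth = np.outer([1.0, 2.0, 3.0], [2.0, 1.0])
M = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
R_zero_filled = truth * M           # the missing entry is forced to 0

# A perfect reconstruction of the truth is penalised under zero-filling,
# but not when only the observed entries are compared.
error_zero_filled = squared_error(R_zero_filled, truth)
error_masked = squared_error(R_zero_filled, truth, M)
```

An optimiser minimising the zero-filled error is therefore pulled away from the true values at exactly the entries it is meant to predict.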
In this research we revisit the non-negative matrix tri-factorisation (NMTF) algorithm and use a different cost function, the I-divergence (Lee and Seung, 2001), to find an approximation to the original matrix. Starting from an algorithm for NMTF that minimises the I-divergence (Yoo and Choi, 2009), we extend it by incorporating constraint matrices that express the similarity of entities. These additional matrices allow us to integrate data from different sources as prior knowledge of the underlying problem, guiding the clustering to a better solution and improving prediction quality. The resulting algorithm considers only the known values in the original matrix, resulting in better predictions for missing values.
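The idea of fitting only the known values under the I-divergence can be sketched as follows. This is not the algorithm presented in this work: it uses standard Lee and Seung style multiplicative updates restricted to observed entries, and it omits the similarity constraint matrices entirely. The function names, the mask `M`, the cluster counts `K` and `L`, the random initialisation and the small constant `eps` are all assumptions for illustration.

```python
import numpy as np

def masked_i_divergence(R, M, R_hat, eps=1e-12):
    """I-divergence between R and R_hat, summed over observed entries only."""
    observed = M > 0
    r, p = R[observed], R_hat[observed]
    return float(np.sum(r * np.log((r + eps) / (p + eps)) - r + p))

def nmtf_masked(R, M, K, L, iterations=200, seed=0, eps=1e-12):
    """Approximate R by F @ S @ G.T using entries where M == 1 (no constraint matrices)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = R.shape
    F = rng.uniform(0.1, 1.0, (n_rows, K))
    S = rng.uniform(0.1, 1.0, (K, L))
    G = rng.uniform(0.1, 1.0, (n_cols, L))
    R_obs = np.where(M > 0, R, 0.0)   # values at missing entries never enter the updates
    history = []
    for _ in range(iterations):
        # Update F with S and G fixed.
        ratio = R_obs / (F @ S @ G.T + eps)
        SG = S @ G.T
        F *= (ratio @ SG.T) / (M @ SG.T + eps)
        # Update G with F and S fixed.
        ratio = R_obs / (F @ S @ G.T + eps)
        FS = F @ S
        G *= (ratio.T @ FS) / (M.T @ FS + eps)
        # Update S with F and G fixed.
        ratio = R_obs / (F @ S @ G.T + eps)
        S *= (F.T @ ratio @ G) / (F.T @ M @ G + eps)
        history.append(masked_i_divergence(R, M, F @ S @ G.T))
    return F, S, G, history
```

Because the mask appears in both the numerator and the denominator of each update, whatever placeholder is stored at a missing entry has no influence on the factors; predictions for those entries are read from `F @ S @ G.T` afterwards.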
We use this new algorithm to study the Sanger Genomics of Drug Sensitivity in Cancer dataset, incorporating similarity measures between the drugs and between the cancer cell lines as the constraint matrices, and imputing the missing values in this matrix. The predictive power of the algorithm is verified using cross-validation, and its performance is compared to that of the previous methods. Furthermore, the results of clustering the drugs and cancer cell lines are studied.
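One way to set up cross-validation for matrix imputation is to hold out a subset of the observed entries in each fold, train on the rest, and score predictions on the held-out entries. The sketch below only generates the training and test masks; the function name, the fold count and the entry-wise splitting scheme are assumptions, and the exact protocol used in this work is not specified here.

```python
import numpy as np

def cross_validation_masks(M, n_folds=10, seed=0):
    """Yield (train_mask, test_mask) pairs that split the observed entries of M into folds."""
    rng = np.random.default_rng(seed)
    observed = np.argwhere(M > 0)          # (row, column) index pairs of observed entries
    rng.shuffle(observed)                  # shuffle the index pairs in place
    for held_out in np.array_split(observed, n_folds):
        test_mask = np.zeros_like(M, dtype=float)
        test_mask[held_out[:, 0], held_out[:, 1]] = 1.0
        train_mask = (M > 0).astype(float) - test_mask
        yield train_mask, test_mask
```

Each training mask can be passed to a factorisation routine in place of the full mask, and the error is then measured only where the corresponding test mask is 1.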
Daniel D. Lee and H. Sebastian Seung (2001). "Algorithms for Non-negative Matrix Factorization". Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference. MIT Press. pp. 556–562.
Chris Ding, Tao Li, Wei Peng, Haesun Park (2006). "Orthogonal Nonnegative Matrix Tri-factorizations for Clustering". KDD '06 Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining.
Jiho Yoo, Seungjin Choi (2009). "Probabilistic Matrix Tri-Factorization". IEEE International Conference on Acoustics, Speech and Signal Processing, 2009.
TaeHyun Hwang, Gowtham Atluri, MaoQiang Xie, Sanjoy Dey, Changjin Hong, Vipin Kumar, Rui Kuang (2012). "Co-clustering phenome–genome for phenotype classification and disease gene discovery". Nucleic Acids Research 40(19): e146.
Hua Wang, Heng Huang, Chris Ding, Feiping Nie (2013). "Predicting Protein–Protein Interactions from Multimodal Biological Data Sources via Nonnegative Matrix Tri-Factorization". Journal of Computational Biology 20(4): 344–358. doi:10.1089/cmb.2012.0273.
Yiyi Liu, Quanquan Gu, Jack P Hou, Jiawei Han, Jian Ma (2014). "A network-assisted co-clustering algorithm to discover cancer subtypes based on gene expression". BMC Bioinformatics 15:37.
Vladimir Gligorijevic, Vuk Janjic, Natasa Przulj (2014). "Integration of molecular network data reconstructs Gene Ontology". Bioinformatics, Vol. 30 (ECCB 2014), pp. i594–i600.