
Undiscovered Scientific Knowledge from Large Unstructured Text Collections in an Era of Big Data

last modified Aug 04, 2015 10:35 AM
Dr Nigel Collier and Dr Anna Korhonen, Department of Theoretical and Applied Linguistics

Nigel Collier and Anna Korhonen

Language Technology Laboratory
Department of Theoretical and Applied Linguistics
University of Cambridge
9 West Road, Cambridge CB3 9DP


In 1986 Donald Swanson introduced the notion of undiscovered scientific knowledge by connecting facts that scientists working in independent fields had reported about Raynaud’s Syndrome and dietary fish oil [1]. To date, text/data mining (TDM), a fusion of natural language processing, machine learning, bioinformatics and data science, has focussed on extracting facts and summarising them so that scientists can keep up to date in their fields. It can be argued, though, that the more ambitious goal of producing credible hypotheses for scientific investigation remains an urgent ‘to do’ on the list of human language technologists, especially as the vast amount of biomedical Big Data makes it ever more challenging for scientists to keep up to date.
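The kind of connection Swanson made can be illustrated with a minimal sketch of literature-based discovery: two topics that are never linked directly become a candidate hypothesis when they share intermediate concepts. The sketch below is only an illustration under stated assumptions. The function name `rank_hypotheses`, the toy co-occurrence table and the placeholder topics X, Y and Z are all hypothetical, and the data is hand-written rather than extracted from any real literature; it is not the method used in the projects described here.

```python
# Toy co-occurrence table: topic -> intermediate concepts it is discussed with.
# Hand-written for illustration only; not extracted from real abstracts.
COOCCURRENCES = {
    "Raynaud's syndrome": {"blood viscosity", "platelet aggregation", "vascular reactivity"},
    "fish oil": {"blood viscosity", "platelet aggregation", "vascular reactivity"},
    "topic X": {"platelet aggregation"},   # hypothetical placeholder
    "topic Y": {"unrelated concept"},      # hypothetical placeholder
    "topic Z": {"blood viscosity"},        # hypothetical placeholder
}

# Topic pairs assumed to be linked directly already, so they are not "undiscovered".
KNOWN_LINKS = {frozenset(("topic Z", "Raynaud's syndrome"))}


def rank_hypotheses(target, cooccurrences, known_links):
    """Rank topics that share intermediate concepts with `target`
    but have no known direct link to it."""
    target_terms = cooccurrences[target]
    ranked = []
    for topic, terms in cooccurrences.items():
        if topic == target or frozenset((topic, target)) in known_links:
            continue
        shared = sorted(terms & target_terms)
        if shared:
            ranked.append((topic, shared))
    # Most shared concepts first; ties broken alphabetically by topic name.
    ranked.sort(key=lambda item: (-len(item[1]), item[0]))
    return ranked


if __name__ == "__main__":
    for topic, shared in rank_hypotheses("Raynaud's syndrome", COOCCURRENCES, KNOWN_LINKS):
        print(f"{topic}: {len(shared)} shared concept(s): {', '.join(shared)}")
```

A real system would replace the hand-written table with concepts and relations extracted from the literature by TDM, and would need far more careful ranking and filtering than a count of shared terms.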

This talk introduces several exemplars of how our group at the Language Technology Laboratory has applied TDM to summarisation tasks involving high-volume, high-velocity and high-variety Big Data. We show that TDM based on state-of-the-art natural language processing and machine learning is being enhanced to understand the meaning of biomedical texts, and that new projects in cancer biology and public health are making progress towards the goal of generating and testing novel research hypotheses.

(1) In the BioCaster project (JST: 2008-2012) [2,3], working with global public health colleagues, we employed high-throughput TDM on multilingual news to detect and map infectious disease outbreaks in near real time on a global scale.

(2) In the PhenoMiner project (FP7 Marie Curie: 2012-2014) [4,5], large-scale phenotype acquisition is being carried out on the scientific literature to support the rapid building of vocabulary databases and knowledge integration with resources such as OMIM, the Human Phenotype Ontology and the Phenotype and Trait Ontology.

(3) In the SIPHS project (EPSRC: 2015-2020), e.g. [6], we are beginning to explore techniques for encoding personal health reports from a variety of social media sources to support real-time knowledge discovery about infectious diseases and adverse drug reactions.

(4) In the CRAB project (MRC, Swedish Research Council, FP7 Marie Curie: 2007-2015), e.g. [7], we have developed a publicly available text mining tool to support large-scale scientific literature review in chemical health risk assessment.

(5) In a recently funded project (MRC: 2015-2018) [8], we plan to explore how text mining could support literature-based discovery in cancer biology. Our aim is to develop a tool that cancer biologists can use to generate and test novel research hypotheses on the basis of knowledge already published in the scientific literature.



[1] Swanson, D. R. (1986). Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in biology and medicine, 30(1), 7-18.
[2] Collier, N., Doan, S., Kawazoe, A., Goodwin, R. M., Conway, M., Tateno, Y., ... & Taniguchi, K. (2008). BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics, 24(24), 2940-2941.
[3] Barboza, P., Vaillant, L., Mawudeku, A., Nelson, N. P., Hartley, D. M., Madoff, L. C., ... & Astagneau, P. (2013). Evaluation of epidemic intelligence systems integrated in the early alerting and reporting project for the detection of A/H5N1 influenza events. PloS one, 8(3), e57252.
[4] Collier, N., Oellrich, A., & Groza, T. (2013). Toward knowledge support for analysis and interpretation of complex traits. Genome biology, 14(9), 214.
[5] Collier, N., Tran, M. V., Le, H. Q., Ha, Q. T., Oellrich, A., & Rebholz-Schuhmann, D. (2013). Learning to Recognize Phenotype Candidates in the Auto-Immune Literature Using SVM Re-Ranking. PloS one, 8(10), e72965.
[6] Collier, N., Son, N. T., & Nguyen, N. M. (2011). OMG U got flu? Analysis of shared health messages for bio-surveillance. J. Biomedical Semantics, 2(S-5), S9.
[7] Korhonen, A., Ó Séaghdha, D., Silins, I., Sun, L., Hogberg, J., & Stenius, U. (2012). Text mining for literature review and knowledge discovery in cancer risk assessment and research. PloS one, 7(4), e33427.
[8] Korhonen, A., Guo, Y., Yetisgen-Yildiz, M., Stenius, S., Narita, M., & Lio, P. (2015). Improving Literature-Based Discovery with Text Mining. In Proceedings of CIBB. Cambridge, UK.