Extracting value from the data deluge
Big Data in R&D—blessing or curse? The answer depends on whether you can capture its hidden value and harness it for your purposes.
Data galore: Every second, an estimated 3.7 million email messages are sent worldwide, and every day 4.5 billion “like” clicks are registered on Facebook. In Western Europe alone, the volume of digital data generated grows by 30 percent each year. Computational power and speed are increasing dramatically; the ability to store, convey and make sense of massive amounts of data is making giant leaps forward. In just a few years, we have gone from thinking in terms of megabytes (MB) and gigabytes (GB, 1000 MB) to talking about terabytes (TB, 1000 GB) and petabytes (PB, 1000 TB) when referring to Big Data.
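As a side note, the decimal unit progression used in this article can be sketched in a few lines of Python (an illustrative snippet, not part of the original article; the helper name `to_bytes` is hypothetical):

```python
# Decimal byte units as used in the article (1 GB = 1000 MB, 1 TB = 1000 GB, ...)
UNITS = {"MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}

def to_bytes(value, unit):
    """Convert a value expressed in the given unit to bytes."""
    return value * UNITS[unit]

# One petabyte is a million gigabytes
assert to_bytes(1, "PB") == to_bytes(1_000_000, "GB")
```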
Bryn Roberts speaks of the proverbial “data tsunami”: the sheer velocity and volume of data being produced. He says: “The goal in dealing with the Big Data revolution is to capture, manage, integrate and interpret vast amounts of information to create value. This challenge is often described by the four Vs: volume of data; velocity with which the analysis needs to be performed; variety of data sources and the need to integrate them; veracity or trustworthiness of the data.”
Challenge of Big Data in R&D
Whilst the volume of data generated by human activity on social media, through credit card transactions or digital telecommunications is mind-boggling in itself, there is an added dimension in the life sciences, as Bryn emphasizes: “What really provides R&D with our Big Data conundrum is the complex biology that we are working with. This complexity compounds the other challenges: the sheer volume of data, the great variety of data sources from early discovery to the clinic that need to be integrated, and the often uncertain veracity of the data (especially regarding the scientific literature).”
Take the realm of gene sequencing. The human genome is made up of three billion base pairs, encoding around 20,000 genes. Sequencing such a genome, which took weeks only a few years ago, can be done today in a matter of hours, generating enormous amounts of data.
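A rough back-of-envelope calculation hints at why sequencing output grows so quickly. The sketch below is illustrative and not from the article: it assumes one byte per base call and 30x sequencing coverage (a common depth for whole-genome sequencing), applied to the three billion base pairs mentioned above:

```python
# Back-of-envelope estimate of raw sequencing data per human genome.
# Assumptions (illustrative): 1 byte per base call, 30x coverage;
# quality scores and metadata would roughly double this figure.
genome_bp = 3_000_000_000   # base pairs in the human genome
coverage = 30               # each position is read ~30 times
bytes_per_base = 1

raw_bytes = genome_bp * coverage * bytes_per_base
print(raw_bytes / 10**9, "GB per genome")  # 90.0 GB per genome
```

Tens of gigabytes per genome, multiplied across hundreds of samples, quickly reaches the tens-of-terabytes scale described for the CELLO project below.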
We need to reduce complex data into a model that is accessible for human comprehension
“All research data at Roche up to 2010 amounted to about 100 TB,” Bryn adds. “During 2011/12, we ran a project called CELLO, where the genomes from about 300 cancer cell lines were sequenced. Together with other data from the cells, we generated 100 TB of data in this single ‘experiment’—equal to 100 years of Roche research up until 2010!”
Another major challenge, according to Bryn, is to extract meaningful, trustworthy information from the more than 22 million life science publications available (the veracity challenge). Bryn explains: “A significant proportion of this literature contains assertions that we know are not reproducible. Yet to make effective decisions through the course of a drug discovery and development project, we must combine the claims from the scientific literature with external data and Roche internal data of many kinds such as high throughput screening, toxicology, target selectivity, metabolism and pharmacokinetics, in vitro and in vivo efficacy, imaging, etc. Add to these the complexity at the individual genetic level—for example, polymorphisms that affect the way in which drugs interact with their target biological molecules or are metabolized and eliminated from the body—then one really starts to appreciate the Big Data challenge in R&D, namely how to make data actionable to support key decisions, thus maximizing the chances of success.”
Computer guidance for human decision makers
Despite the data deluge in drug discovery and development, it may seem ironic that data sparsity remains a considerable challenge in R&D. Why? Although the volume of available data may be huge, much of it is not of sufficient quality to provide the insight needed for sound decisions. Says Bryn: “That is where the design of experiments becomes so important, because this ensures we generate data of sufficient quality and reliability to allow successful decision-making. Importantly, addressing the Big Data challenge in R&D requires a multidisciplinary approach where biologists, computer scientists, toxicologists, statisticians, chemists, and many others need to work in a highly collaborative way.”
Computers can help avoid the dangers of “inappropriate reductionism” by generating complex models that consider all of the relevant factors and present the output in a way that allows scientists to move forward without bias or ignoring important information. Says Bryn: “We somehow need to reduce large, complex data into a model that is accessible for human comprehension. We hope that in the next few years we can further supplement human decision making with computer guidance, using machine learning over Big Data.”
Regarding the issues that IT and informatics have to address in helping R&D move forward, Bryn highlights some of the biggest ones: “We need to address network bandwidth to move these very large datasets around the world, between research centers, in a reasonable timeframe, or find more effective ways to compute over the data cloud without moving datasets from their source locations.”
Information Technology, adds Bryn, will help with the integration of data from many sources and formats, as well as new approaches to visualize and explore very large information landscapes. “Analysis algorithms will also allow our scientists to extract meaning from massively complex data. Finally, novel human-computer interfaces will enable multidisciplinary teams to interact with their data more meaningfully, enabling effective decisions when moving projects forward.”
The American author John Naisbitt (b. 1929) wrote: “We are drowning in information but starved for knowledge.” pRED Informatics is doing everything it can to make sure that Roche scientists will neither be drowned in Big Data nor starved for relevant information.
- Less than 2% of all data today is analogue.
- According to one estimate, 90% of all digital data (approximately 1,200 exabytes) was produced in the last two years alone.
- Google crawls about 20 billion websites every day.
- If you were to burn that volume of digital data to CDs, you would get five stacks of CDs reaching from the earth to the moon.