Science’s Big Data Problem


[Image: data pipe. Credit: brunkfordbraun/Flickr]



Modern science seems to have data coming out of its ears. From genome sequencing machines capable of reading a human’s chromosomal DNA (about 1.5 gigabytes of data) in half an hour to particle accelerators like the Large Hadron Collider at CERN (which generates close to 100 terabytes of data a day), researchers are awash with information. Yet in this age of big data, science has a big problem: it is not doing nearly enough to encourage and enable the sharing, analysis and interpretation of the vast swathes of data that researchers are collecting.


Science is the archetypal empirical endeavour. The theoretical physicist and all-round entertainer Richard Feynman put it best: “It doesn’t matter how beautiful your theory is, it doesn’t matter how smart you are. If it doesn’t agree with the experiment, it’s wrong.” This has been the founding principle of science since its earliest days. Without the painstaking astronomical observations of Tycho Brahe, a sixteenth-century Danish nobleman, Johannes Kepler would not have determined that the planets move in elliptical orbits and Isaac Newton would not have had the foundations on which to build his law of universal gravitation.


In the late nineteenth and early twentieth century, without the ingenious experiments of Albert Michelson and Edward Morley, in which they demonstrated the constancy of the speed of light and the absence of the putative ether (producing perhaps the most famous negative result of all time), Albert Einstein would have lacked a critical empirical basis for his special theory of relativity.


So praise to the data gatherers, sharers and analysers who are essential to the continued progress of science on which we all rely — and which too often we take for granted. But even as the rest of society, from business and economics to journalism and art, wakes up to the power of big data, the world of research is, ironically, not doing nearly enough to embrace the power of information. A big-data mindset involves more than having a lot of petabytes on your hard drive, and science is falling short in three main areas.


Information Is Power

First, the power of information increases when it is shared. To see this you only have to look at the transformational effects of the Internet and the World Wide Web, and before that of other information technologies, from moveable type to the telegraph. Yet scientists are curiously reluctant to share their research findings, even with each other. True, it happens in some fields, such as genomics and astronomy, but in many others, including molecular biology and chemistry, secrecy is the norm.


A few years ago I attended a round-table meeting at a large US research organisation. The topic of discussion was data sharing among scientists, and how to encourage and enable it. Yet resistance, even among the experienced and enlightened people present, was palpable. Because of the intrinsically collaborative endeavour in which they’re engaged, academic researchers are often assumed to be more collaborative and less proprietary than their business counterparts. But the opposite is often true: as an employee of a commercial organisation, I wouldn’t dream of claiming ownership of information and withholding it from my colleagues in the way that scientists routinely do.


I would keep it from my competitors though, and there’s the rub. Scientific credit accrues to the authors of influential journal papers, not the providers of data (or experimental samples or software algorithms or any number of other kinds of contributions that people can make to the research process). With credit comes access to the scarce resources that researchers naturally seek, namely funding and employment. So until you’ve secured publication in a top journal everyone else is a competitor. If institutions and funders were to give more credit to open sharing of research data, scientific progress would accelerate and we would all benefit.


The Science of Data

A second, related problem is that we still tend to see data generation and analysis as merely a prelude to the real job of science, which is to generate insights and theories. In a sense this is a reasonable view — data unaccompanied by an explanatory theory is often useless. But as the historical examples above illustrate, we need observations and analyses too — theories without data are mere speculation, not science.


Yet even as the rest of the world embraces the concept of the data scientist, science itself has yet to catch up. It is not at all unusual for a researcher to spend a highly successful career specialising in the study of a single object, however well known or obscure: the heart, the fruit fly or the goldfish Mauthner neuron. Yet it remains highly unusual to specialise in the functional role of data gathering, analysis and display. This is exactly what data scientists do, and we need more of them in research. Unfortunately the organisational structures of science — from university departments and funding bodies to academic societies and journals — are siloed into subject areas, inhibiting functional specialisation that cuts across these traditional disciplinary boundaries.


Academics, employers and funders in particular should actively encourage functional specialists, especially data scientists, and ensure that it becomes no more unusual for a researcher to become an expert in data science than to specialise in dark matter or fullerenes.


Improved Tools

Finally, researchers need improved tools for managing, interpreting and sharing their data. Today most of them have better software at home for handling their music or photo collections than they have in the laboratory for doing the same with their research data. The reasons are not hard to see: there are around 7 million researchers in the world, making them about 0.1% of the human population. From the point of view of a big software company they therefore represent a relatively insignificant niche. But economically and culturally they are disproportionately important, and for all our sakes they deserve better.


This is an area in which commercial organisations (including Digital Science, the one I run) have important roles to play. Worldwide spending on research is on the order of a trillion dollars. The scientific publishing industry alone is worth tens of billions of dollars a year, and I believe the scientific software industry will overtake it. There are many scientists, software developers and entrepreneurs who are striving to build better tools to enable the next wave of scientific discovery on which we all depend. In whatever way we can provide it, they deserve our support.


Dr. Timo Hannay is Managing Director at Digital Science. He’s on Twitter at @digitalsci and @timohannay.


