Thursday, November 14, 2013

Big Data, Boundary Work and Computer Science


The Annual Meeting of the Society for Social Studies of Science this year (i.e. 4S 2013) was full of "big data" panels (Tom Boellstorff has convinced me not to capitalize the term). Many of these talks were critiques; the authors saw big data as a new form of positivism, and the rhetoric of big data as a sort of false consciousness that was sweeping the sciences*.

But what do scientists think of big data?

In a blog-post titled "The Big Data Brain Drain: Why Science is in Trouble," physicist Jake VanderPlas (his CV lists his interests as "Astronomy" and "Machine Learning") argues that the real reason big data is dangerous is that it moves scientists from the academy to corporations.
But where scientific research is concerned, this recently accelerated shift to data-centric science has a dark side, which boils down to this: the skills required to be a successful scientific researcher are increasingly indistinguishable from the skills required to be successful in industry. While academia, with typical inertia, gradually shifts to accommodate this, the rest of the world has already begun to embrace and reward these skills to a much greater degree. The unfortunate result is that some of the most promising upcoming researchers are finding no place for themselves in the academic community, while the for-profit world of industry stands by with deep pockets and open arms. [all emphasis in the original]
His argument proceeds in four steps. First, he argues that yes, new data is indeed being produced, and in stupendously large quantities. Second, processing this data (whether it's in biology or physics) requires a certain kind of scientist, one who is skilled in both statistics and software. Third, because of this, "scientific software" which can be used to clean, process, and visualize data becomes a key part of the research process. And finally, this scientific software needs to be built and maintained, and because the academy evaluates its scientists not for the software they build but for the papers they publish, all of these talented scientists are now moving to corporate research jobs (where they are appreciated not just for their results but also for their software). That, the author argues, is not good for science.

To those familiar with the history of 20th century science, this argument has the ring of deja vu. In The Scientific Life, for example, Steven Shapin argued that the fear that corporate research labs would cause a tear in the prevailing (Mertonian) norms of science, by attracting the best scientists away from the academy, was a big part of the scientific (and social scientific) landscape of the middle of the 20th century. And these fears were largely unfounded (partly because they were based on a picture of science that never existed, and partly because, as Shapin finds, scientific virtue remained nearly intact in its move from the academy to the corporate research lab). [And indeed, Lee Vinsel makes a similar point in his comment on a Scientific American blog-post that links to VanderPlas' post.]

But there's more here, I think, for STS to think about. First, notice the description of the new scientist in the world of big data:
In short, the new breed of scientist must be a broadly-trained expert in statistics, in computing, in algorithm-building, in software design, and (perhaps as an afterthought) in domain knowledge as well. [emphasis in the original].
This is an interesting description on so many levels. But the reason it's most interesting to me is that it fits exactly with the description of what a computer scientist does. I admit this is a bit of speculation, so feel free to disagree. But in the last few years, computer scientists have increasingly turned their attention to a variety of domains: for example, biology, romance, learning. And in each of these cases, their work looks exactly like the work that VanderPlas' "new breed of scientist" does. [Exactly? Probably not. But you get the idea.] Some of the computer scientists I observe who design software to help students learn work exactly in this way: they need some domain knowledge, but mostly they need the ability to code, and they need to know statistics, both in order to create machine learning algorithms and to validate their arguments to other practitioners.

In other words, what VanderPlas is saying is that practitioners of the sciences are starting to look more and more like computer scientists. His own CV, which I alluded to above, is a case in point: he lists his interests as both astronomy and machine learning. [Again, my point is not so much to argue that he is right or wrong, but that his blog-post is an indication of changes that are afoot.]

His solution to the "brain drain" is even more interesting, from an STS perspective. He suggests that the institutional structure of science should recognize and reward software-building so that the most talented people stay in academia and do not migrate to industry. In other words, the sciences should become even more like computer science institutionally so that the best people stay in academia. Interesting, no?

Computer science is an interesting field. The digital computer's development went hand-in-hand with the development of cybernetics and “systems theory”, theories that saw themselves as generalizable to any kind of human activity. Not surprisingly, the emerging discipline of computer science made it clear that it was not about computers per se; rather, computers were the tools that it would use to understand computation, which potentially applied to any kind of intelligent human activity that could be described as symbol processing (see, e.g., Artificial Intelligence pioneers Newell and Simon’s Turing Award speech). This has meant that computer science has had a wayward existence: it has typically flowed where the wind (meaning funding!) took it. In that sense, its path has been the polar opposite of that of mathematics, whose practitioners, as Alma's dissertation shows, have consciously policed the boundaries of mathematics. (Proving theorems was seen to be the essence of math; anything else was moved to adjoining disciplines.)

X-posted on Tumblr and the HASTS blog.  

--------------------

*The only exception to this that I found was Stuart Geiger's talk, titled "Hadoop as Grounded Theory: Is an STS Approach to Big Data Possible?," the abstract of which is worth quoting in full:
In this paper, I challenge the monolithic critical narratives which have emerged in response to “big data,” particularly from STS scholars. I argue that in critiquing “big data” as if it was a stable entity capable of being discussed in the abstract, we are at risk of reifying the very phenomenon we seek to interrogate. There are instead many approaches to the study of large data sets, some quite deserving of critique, but others which deserve a different response from STS. Based on participant-observation with one data science team and case studies of other data science projects, I relate the many ways in which data science is practiced on the ground. There are a diverse array of approaches to the study of large data sets, some of which are implicitly based on the same kinds of iterative, inductive, non-positivist, relational, and theory building (versus theory testing) principles that guide ethnography, grounded theory, and other methodologies used in STS. Furthermore, I argue that many of the software packages most closely associated with the big data movement, like Hadoop, are built in a way that affords many “qualitative” ontological practices. These emergent practices in the fields around data science lead us towards a much different vision of “big data” than what has been imagined by proponents and critics alike. I conclude by introducing an STS manifesto to the study of large data sets, based on cases of successful collaborations between groups who are often improperly referred to as quantitative and qualitative researchers.