In this first post in a series on the relevance of data science for survey research, we start with two fundamental questions: How can we leverage new data sources? And how do those sources compare to more traditional data collection efforts?
Defining “Survey Data”
Survey data are what former Census Director and current Georgetown Provost Robert Groves has referred to as “design data.” This is information collected from specific populations for a specific purpose, often by the researchers or persons who intend to use the information. Such data come from studies designed to ensure accuracy and reduce bias through three steps: the careful development of a data collection instrument or survey; the statistical sampling of the population of interest, with as much participation as possible; and, finally, statistical analyses that quantify the potential variability in the results, typically expressed as a “confidence interval.” In short, with design data the researcher seeks to exert “control” over as many of the factors that could influence the data as possible.
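To make the idea of a “confidence interval” concrete, here is a minimal Python sketch. It assumes a simple random sample and uses the common normal-approximation formula for a proportion; the function name and the sample figures (520 of 1,000 respondents) are made up for illustration, and real surveys with weighting or complex sample designs need more careful variance estimates.

```python
import math


def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation interval for a sample proportion.

    Assumes a simple random sample; z=1.96 corresponds to roughly
    95% confidence. Hypothetical helper, for illustration only.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p_hat = successes / n
    # Standard error of a proportion under simple random sampling.
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    # Clamp to the valid range for a proportion.
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)


# Illustrative numbers: 520 of 1,000 respondents answer "yes".
low, high = proportion_confidence_interval(successes=520, n=1000)
print(f"Estimated share: {520 / 1000:.1%}, interval: {low:.1%} to {high:.1%}")
```

The interval narrows as the sample grows, which is one reason design data put so much emphasis on sample size and participation.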
Defining “Organic Data”
“Organic data” arise out of systems, computers and digital machines. They either help the system run or are natural outputs of the system. These data are often massive in volume, sometimes requiring new techniques and software to manipulate them. The data are not designed for research purposes, but rather serve as the input or output of a system, process or platform. In fact, there are often key unknowns about “organic data,” such as what exactly the data represent or measure, the characteristics of the individuals the data represent, or even whether changes in the data over time reflect actual changes in people’s behaviors and attitudes or changes made to an algorithm by a system engineer. As a result, researchers do not have the control over these factors that they have with design data.
Moreover, organic data can vary widely in their degree of “structure.” At one end of the spectrum are highly structured data, which can be analyzed with a minimum of data cleaning or restructuring. Such data often include those captured in government program systems, commercial transactions, electronic medical records, or school records.
At the other end of the spectrum are “unstructured data,” that is, those that have no standard analytic structure and do require significant transformation and “scrubbing” before they can be readily analyzed. Such data include visual data from pictures or videos, satellite or radar images, and open-ended social media commentary. Between these two extremes are other types of data that can require more or less transformation before use, like mobile phone GPS data, sensor data or computer logs.
Making the Most of Available Data
While surveys have been a staple means of gathering essential data for research, evaluation and decision-making, our approach is changing rapidly as we leverage both new sources of information and established data science analytic techniques. These changes include:
- Increased use of “organic data” as an adjunct or at times replacement for “design” data;
- The application of newer techniques–machine learning, natural language processing and object recognition–to improve how we conduct surveys; and
- Continued vigilance in heeding the new cautions that come with a field changing this quickly.
The rapid expansion of computing power, along with new data science and computer science techniques for extracting, transforming, cleaning and ultimately analyzing such data, has dramatically expanded the potential information sources available to researchers. These sources make it possible to understand people’s attitudes and behaviors in ways that often differ from those assessed through more traditional survey research efforts.
But it’s not just the analysis tools that are important. In many ways, those are the easy part. Real transformation comes from combining the tools with experience and knowledge of the underlying data (designed or organic), which is what allows us to understand what the data can (and cannot) tell us.
Abt Associates is leading the way in working with our clients to understand how these various data sources can enhance the utility of the information being collected, which will lead to greater understanding of the issues and more informed decision-making.
In the next blog on this topic, we will explore how various data science analytic approaches—namely machine learning, natural language processing, and computer vision—are changing the ways in which we approach and evolve more traditional survey research.