The marriage of an increasingly powerful set of data science tools and approaches with more traditional survey research goals is moving data collection in some new and important directions. In particular, machine learning, natural language processing, and image recognition are sets of artificial intelligence techniques that are expanding how we conduct, evaluate, and analyze survey data. The result is the ability to conduct broader and more efficient data collection efforts with greater confidence in the quality of the data, thus providing more meaningful insights, decisions, and action.
Machine Learning
Machine learning (ML) is a set of analytic techniques that allows a computer application to predict outcomes more accurately without being explicitly reprogrammed over and over. Typically, this approach uses patterns and inference to classify data into different “buckets”—sometimes for descriptive purposes, other times for predictive ends. ML is the driving force behind a number of familiar activities, such as the “recommended for you” features on many streaming services, “fake news” bot detection on Twitter, and even the analysis of vast arrays of data in the attempt to discover new exoplanets.
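The "bucketing" idea can be illustrated with a minimal sketch: a 1-nearest-neighbor classifier, one of the simplest ML techniques, assigns a new observation to the bucket of its most similar training example. The features, labels, and data here are hypothetical, chosen only to show the mechanics.

```python
# Minimal sketch of ML-style classification: a 1-nearest-neighbor
# classifier that sorts observations into "buckets" by similarity.
# Features, labels, and training data are hypothetical.
import math

def classify(point, training_data):
    """Assign `point` the label of its closest training example."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    label, _ = min(
        ((label, distance(point, features))
         for features, label in training_data),
        key=lambda pair: pair[1],
    )
    return label

# Toy training set: (features, bucket label)
training = [
    ((1.0, 1.0), "bucket_a"),
    ((1.2, 0.8), "bucket_a"),
    ((5.0, 5.0), "bucket_b"),
]
print(classify((0.9, 1.1), training))  # → bucket_a
```

Production systems use far richer models, but the core pattern is the same: learn a mapping from features to buckets from labeled examples rather than hand-coding rules.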
In survey research, machine learning has a growing number of applications. For example, it can be used to analyze—in real time—an array of data about a survey or interviewer behavior (such as question-by-question timing, straight-lined answers, and case types) and predict potential cases of interviewer falsification.
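The kinds of signals described above can be sketched as simple quality flags. This is a hedged illustration, not the actual model: the thresholds and rules here are hypothetical, and a real system would feed such features into a trained classifier rather than fixed cutoffs.

```python
# Hedged sketch: flag an interview for review when per-question timings
# are implausibly fast or answers are straight-lined (all identical).
# Thresholds and rules are hypothetical, for illustration only.

def flag_interview(timings_sec, answers, min_median_sec=2.0):
    """Return a list of quality flags for one completed interview."""
    flags = []
    ordered = sorted(timings_sec)
    median = ordered[len(ordered) // 2]
    if median < min_median_sec:
        flags.append("too_fast")        # questions answered implausibly quickly
    if len(set(answers)) == 1:
        flags.append("straight_lined")  # identical answer to every question
    return flags

print(flag_interview([0.8, 1.1, 0.9, 1.0], [3, 3, 3, 3]))
# → ['too_fast', 'straight_lined']
```

Flagged cases would then go to human reviewers; the value of doing this in real time is that suspect interviewing can be caught while fieldwork is still under way.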
Another use case involves developing more effective and efficient sampling and sample allocation approaches. For instance, at Abt we have leveraged ML to develop models in large urban areas that help identify households with demographics that are harder to reach (such as households in Los Angeles that speak Asian languages) and then direct more call attempts to them by language-appropriate interviewers. Effectively, ML can help us categorize certain aspects of our survey data—be it sample data, information about interviewers or the data collection process, or even the survey responses themselves—in new ways to gain greater efficiency in our processes.
Natural Language Processing
Natural language processing (NLP) is another class of techniques for extracting, processing, and analyzing various forms of human communication, particularly speech and text. There is an entire field around text analytics in particular, which involves extracting and analyzing information from written sources, such as Tweets or open-ended survey responses. The goal is to convert “unstructured” information (spoken language, documents, social media input, website information, letters to government officials, and so on) into a form of data that can be easily analyzed. We see a number of everyday uses of NLP, from Google search engine results, to Amazon’s Alexa and Apple’s Siri, to an array of new techniques for analyzing medical records to better understand the origins and spread of diseases.
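A minimal sketch of the unstructured-to-structured conversion: tokenize open-ended responses and map them to analysis categories via keyword lists. The categories and keywords here are hypothetical, and real NLP pipelines use statistical language models rather than fixed word lists, but the output shape—structured counts from free text—is the same.

```python
# Illustrative sketch of turning unstructured text into structured data:
# tokenize open-ended survey responses and tally category keywords.
# The categories and keyword lists are hypothetical.
import re
from collections import Counter

CATEGORY_KEYWORDS = {
    "cost": {"price", "expensive", "cheap", "afford"},
    "service": {"staff", "helpful", "rude", "wait"},
}

def categorize(response):
    """Map a free-text response to the categories its words suggest."""
    tokens = set(re.findall(r"[a-z']+", response.lower()))
    return sorted(cat for cat, kws in CATEGORY_KEYWORDS.items()
                  if tokens & kws)

responses = [
    "The staff were very helpful",
    "Too expensive, I can't afford it",
]
tally = Counter(cat for r in responses for cat in categorize(r))
print(dict(tally))  # → {'service': 1, 'cost': 1}
```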
At Abt we have leveraged NLP in several unique ways. First, to help federal agencies pore over vast volumes of information from constituents, we developed a technique called CommentCounts that helps human coders sift through and analyze such data. More recently, we applied NLP and ML to the CommentCounts process to extract data from various forms sent in by constituents and then code or categorize these data into meaningful buckets to generate new insights into what constituents are thinking. In another use, we developed a process for the Centers for Disease Control to leverage web-scraping of data from public health websites across the country and then extract and code these data to better understand immunization practices at local and state levels.
Image Recognition
Lastly, image recognition (IR) is a set of techniques for extracting data and insights from visual media, such as pictures or videos. A more specific set of such techniques involves “object recognition,” which attempts to make sense of patterns within an image to identify objects of interest (such as people, cars, or commercial products). The primary objective is to leverage visual media as a source of data, determine whether objects of interest are present in the image, and, if so, locate them within it. IR is in regular use all around us, from facial recognition software (from auto-tagging pictures to providing log-in security on smartphones) to being a key element of self-driving cars, which must recognize and understand the context in which they are operating (Is that a person in the crosswalk? Is that a stop or a yield sign?).
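The presence-plus-location objective can be shown with a toy sketch. Real IR relies on trained neural networks operating on full-resolution images; here an "object" is simply any pixel above a brightness threshold in a hypothetical grid, and the output is a bounding box or a report of absence.

```python
# Toy sketch of the IR objective: decide whether an "object" (here,
# any sufficiently bright pixel) is present in an image and, if so,
# where. The pixel grid and threshold are hypothetical.

def locate_object(image, threshold=128):
    """Return a bounding box (top, left, bottom, right) or None."""
    hits = [(r, c) for r, row in enumerate(image)
            for c, value in enumerate(row) if value >= threshold]
    if not hits:
        return None                      # object not present
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows), max(cols))

image = [
    [0,   0,   0, 0],
    [0, 200, 210, 0],
    [0, 190,   0, 0],
]
print(locate_object(image))  # → (1, 1, 2, 2)
```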
While the commercial world has leveraged IR quite a bit, it’s a fairly new approach within the realm of survey data collection. For surveys, visual data can replace certain types of self-reports, such as portion size. Rather than ask respondents what and how much they are eating, researchers can have them take a picture of their meal; the size and contents can then be auto-coded using IR. More complex uses involve extracting information from satellite or other aerial data to create unique sampling frames for surveys in remote areas, or around particular points of interest.
These various forms of AI are evolving rapidly, driven both by the advent of cloud computing and by the growing number of researchers using and refining these techniques. This has important implications for the continued evolution of survey research, as these new AI tools help us augment our processes and methods, ultimately delivering greater insights from our data.
That said, these techniques are not without risks. In the next blog, we will explore some of the potential pitfalls of using AI techniques in the survey process.