While the field of data science offers many avenues for improving the way we conduct survey research (such as the use of new types of data and new data extraction and analytic techniques), this new era is not without its challenges. In fact, there is a veritable minefield of issues that need to be defused to ensure the results from these new approaches are valid and reliable. Published works are few, but major conferences, such as those of the American Association for Public Opinion Research and BigSurv, feature a growing number of scientific presentations that are deepening our understanding of the potential issues. Drawing from this work, there are at least three key concerns: (1) potential data issues, (2) the appropriate application of data science techniques, and (3) transparency of methods.
Potential Data Issues
The first set of questions to ask when using and analyzing new sources of data concerns the origins of the data and what the findings represent. Even though a dataset may have tens of thousands of records or more (i.e., "big data"), we still need to understand who and what those data represent. In this respect, many of the lessons learned from decades of survey methods and statistics still hold true. Concepts such as representation, measurement error, reliability, and validity are all germane. For example, we may use a dataset that purports to represent all of the people within a specific geography, such as a state or health planning district, but what information do we have that allows us to trust this dataset? When designing survey samples, the approach to selecting people and the measurement tools are determined by the researchers; with many secondary datasets, this is not the case.
Moreover, when data are pulled from systems such as medical records or social media platforms, how do we know that changes over time are real and not the product of system engineers making changes to the platform? And then there is the issue of "fake data" and bot-driven posts on social media sites: how can we tell whether such information comes from actual people?
At Abt, we take steps to learn as much as we can about any external data we may leverage. Many of our studies provide critical research and/or evaluation data for policymakers at all levels and therefore need to be held to a high standard.
Appropriate Application of Data Science Techniques
The second question to ask is "Are we applying new analytic techniques appropriately?" Despite the great branding in the term "machine learning" (ML), machines don't really "learn" per se; they do what programmers program them to do, so how we "train" and interpret algorithms can make a huge difference. For instance, many machine learning approaches depend upon a "training" dataset to develop the initial algorithm, which is then put to use with "production" data. If the training data do not accurately represent the characteristics found in the production data, however, then the classifications or predictions made by the ML model will likely be inaccurate.
For example, if an ML algorithm is developed to predict the types of individuals who might serve best in the leadership of a company for the purposes of mentoring, and the training data are drawn from historical company records, then the algorithm could produce biased results unfavorable to women if the company historically had an all- or predominantly male leadership team. From a survey perspective, if we attempt to predict which households in a sample are likely to respond in the latter half of a survey period by using data from those who responded earlier in the study, we will likely get biased results, as the characteristics of those who respond quickly to a survey often differ substantially from those who require many contact attempts.
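To make the survey example concrete, here is a minimal, purely illustrative Python sketch. All of the rates and variable names are assumptions invented for illustration, not real survey data; it simply shows how an estimate derived only from early responders can be biased relative to the full sample when response timing is related to a characteristic of interest:

```python
import random

random.seed(42)

# Hypothetical simulation: assume 60% of households are employed, and assume
# employed households are less likely to respond early (illustrative numbers).
population = []
for _ in range(10_000):
    employed = random.random() < 0.6
    responds_early = random.random() < (0.2 if employed else 0.5)
    population.append((employed, responds_early))

early = [p for p in population if p[1]]

# Employment rate in the full sample vs. among early responders only.
employment_rate_all = sum(p[0] for p in population) / len(population)
employment_rate_early = sum(p[0] for p in early) / len(early)

# A model "trained" only on early responders would learn the early-responder
# rate, which understates employment in the full population.
print(f"full sample: {employment_rate_all:.2f}")
print(f"early responders only: {employment_rate_early:.2f}")
```

Because response timing is correlated with employment in this toy setup, the early-responder subset is not representative, and any algorithm trained on it inherits that bias.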
At Abt, we are focused on such issues and have developed protocols and quality checks to ensure that we are applying the array of ML (and other AI) techniques in a rigorous manner. This includes trying to validate models with external or other sources of information, rather than simply relying on an algorithm’s processing speed or predictive capacity. Our goal is to lead the industry in the development and use of such metrics to ensure quality of insights.
Transparency of Methods
Finally, when assessing a study or set of findings, how can we be sure the study was done correctly? It should go without saying that in situations where the quality of the information generated needs to be rigorously assessed by others, a "black box" approach is not acceptable. Just as we document the details of our survey research methodology and findings, we need to ensure that, when we use data from external sources or develop new types of algorithms, we document the data used, variables, methods, outcomes, and potential sources of error (ideally quantified) when releasing data or results. This includes providing enough information to be able to replicate results and to ensure that approaches can be reproduced at scale (not simply "one-off" approaches).
Abt is committed to transparency, being one of the early members of the American Association for Public Opinion Research’s Transparency Initiative, which provides standards for ensuring visibility into how data are generated and used.
New sources of data and data science techniques can help us to evolve many aspects of survey research, but we need to remain true to the key concepts and quality measures we have applied for decades to more “designed” survey data. Only then can we have confidence in the data and insights we generate through exciting new methodologies.