Using Data Science to Analyze 50,000 Articles in Two Weeks
April 30, 2020
As COVID-19 spread in March 2020, the data science community came together in various ways to help researchers learn more about the disease and the ways it affects communities.
The COVID-19 Open Research Dataset (CORD-19) was created as part of a competition sponsored by Kaggle, an online data science community that hosts challenges for competitors to solve. The competition provided participants with a machine-readable dataset of more than 50,000 journal articles on COVID-19 and related respiratory diseases. Several Abt Data Science Fellows and a data scientist from Abt's Data Science, Surveys, and Enabling Technologies division took on this challenge, forming a sprint team to learn as much as possible about these diseases using data science methods, all in two weeks.
The team consisted of Data Science Fellows Katie Long, Xi Xi, Justin Stein, and Farhad Siraj; Suja Thomas, a data scientist with industry experience; and me. The team met daily for two weeks on 30-minute calls that focused on updates, mostly technical in nature. Over the first few days, the team realized that although the CORD-19 dataset was technically machine-learning-ready (many organizations had culled information from journal articles to make it so), it still needed a variety of cleaning steps before the team's algorithms could analyze the words correctly. We adapted standard techniques to the medical domain, such as stemming words (so that the algorithm recognizes "breathes" and "breathing" as very similar) and curating stopword lists (words that add no substantive meaning to a body of text). We frequently used ScispaCy, an open-source Python package that contains models for processing medical and scientific text. The team spent most of the first week prepping the vast amounts of text data for analysis.
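The preprocessing steps above can be sketched in a few lines. This is a toy, stdlib-only illustration of the idea, not the team's actual pipeline: the real work used ScispaCy's trained models, and the suffix rules and stopword list here (including the domain word "patient") are invented for the example.

```python
import re

# Illustrative stopword list; the team curated a medical-specific one.
# "patient" is included here as an example of a domain-specific stopword.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "are", "to", "that",
             "patient"}

def crude_stem(token):
    """Toy suffix-stripping stemmer so 'breathes' and 'breathing'
    both reduce to the same root, 'breath'."""
    for suffix in ("ing", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The patient breathes and is breathing normally"))
# → ['breath', 'breath', 'normally']
```

A real suffix stripper (e.g., the Porter stemmer) handles far more cases, but even this sketch shows why stemming matters: without it, an algorithm would treat "breathes" and "breathing" as unrelated tokens.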
Near the end of the first week, Katie Long suggested that a text-based analysis would be well served by a tool that helps readers search for the answer to a specific question. For example, what is the average incubation period of COVID-19? The eventual tool, QuickSearch, uses natural language processing algorithms to search the body of text and determine which journal articles are most likely to contain the answer to that question. In this case, QuickSearch's top result suggests an incubation period of 4.8 days, and the top 10 results average close to five days. This is in line with the Centers for Disease Control and Prevention's publicly reported median time of 4-5 days from exposure to symptom onset.
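The article doesn't detail QuickSearch's internals, but one common way to rank documents against a question is TF-IDF weighting with cosine similarity. The sketch below, with a hypothetical three-document mini-corpus standing in for the 50,000 articles, shows that approach in plain Python:

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for the CORD-19 articles.
docs = [
    "the incubation period of the virus averaged five days in this cohort",
    "ventilator supply chains strained hospital capacity during the surge",
    "median incubation time from exposure to symptom onset was estimated",
]

def tfidf_vectors(texts):
    """Build a TF-IDF vector (term -> weight) for each text."""
    tokenized = [t.split() for t in texts]
    n = len(tokenized)
    # Document frequency: how many texts contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({term: (count / len(toks)) * math.log(n / df[term])
                        for term, count in tf.items()})
    return vectors

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def quicksearch(question, texts):
    """Rank texts by TF-IDF cosine similarity to the question,
    most relevant first. Returns document indices."""
    vecs = tfidf_vectors(texts + [question])  # vectorize question too
    q_vec = vecs[-1]
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vecs[:-1])]
    return [i for _, i in sorted(scores, reverse=True)]

ranking = quicksearch("what is the average incubation period", docs)
# The article about the incubation period ranks first.
```

Production question-answering tools typically go further, using learned embeddings rather than raw term counts, but the shape of the problem, scoring every document against the question and surfacing the top matches, is the same.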
In addition to its search capabilities, the tool's interface enables users to visualize the most important topics within the body of text. When confronted with a collection of 50,000 journal articles, it's imperative to understand the main ideas in that ocean of words, and to do so quickly. QuickSearch surfaces up to 15 main topics in a visual interface and lists the most relevant journal articles within each topic. This gives users a fast way to get the lay of the land in a large collection of documents and to rapidly identify the most important papers in their topics of interest.
The QuickSearch tool was submitted to the Kaggle competition in mid-April. And we’re already using the tool and processes of the sprint in our work. For example, as soon as the competition wrapped up, the team started another two-week, big-text-data sprint that provided valuable subject-area context for an upcoming proposal. A variation of the QuickSearch tool used more than 25,000 publicly available abstracts that had been digitally scraped from a large content library to better understand the main topics in the subject matter of interest. We envision conducting these types of sprints more frequently, particularly in situations where it’s necessary to analyze a large amount of text quickly and comprehensively.