DIME: Abt’s Data Integration and Management Ecosystem

March 28, 2022

Historically, many of Abt’s research projects have had a relatively long timeline. We would collect data for years. Then we would take several months to produce interim and final analyses and data sets. If we knew the data could have a major policy impact, we would build in significant extra time for quality review.

This approach is no longer feasible. In the Centers for Disease Control and Prevention (CDC) Covid-19 projects I have been working on for the last year, we have been processing data and delivering analytic files containing hundreds of thousands of records, nearly 200 million data points, every other week. CDC sometimes needs the data in a day, rather than in months.

And the stakes couldn’t be higher. CDC uses our data to support a wide range of public health decisions that affect people’s daily lives and could be life-or-death for some, such as mask mandates and vaccination recommendations. This is just one of many projects where our data and analytic deliveries need to be both extremely rapid and completely reliable.

To meet this need, Abt developed the Data Integration and Management Ecosystem (DIME). DIME is a full-featured, secure, and compliant suite of interoperable tools for high-quality, rapid data processing and analysis.

[Figure: Data Integration and Management Ecosystem graphic]

The life cycle begins with the data integration layer, which automatically downloads, transforms, and merges data from multiple study sites and external data sources. Data can be ingested in multiple formats and over multiple protocols, including standard protocols commonly used in the healthcare space. Data sources can include survey data, lab data, or third-party administrative data such as medical records or registries. We also use this layer to maintain our repository of publicly available data that we use for multiple studies.

Data passing through the integration layer is pushed to the central data warehouse (CDW), which tracks raw, intermediate, and final datasets. We use the CDW to maintain data lineage and a single source of truth.

To ensure the validity of the data, DIME includes a library of more than 250 standard quality checks to detect missing data, invalid or improbable responses, duplications, and dataset-level anomalies, such as unexpected aggregate results. One key issue that DIME addresses is accounting for complex conditional logic when determining whether data are missing. DIME parses the logic from the data dictionary and flags data as missing only when the respondent saw the question in the first place.
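As a rough illustration of the conditional-logic idea, the sketch below flags a value as missing only if the respondent would have seen the question. It is not DIME's actual code: the dictionary layout, the `shown_if` key, and the function names are all assumptions made for this example.

```python
# Hypothetical data dictionary. Each question may carry a "shown_if" rule,
# written as (parent_question, required_answer), describing when the
# respondent would have seen it. None means the question is always shown.
DATA_DICTIONARY = {
    "vaccinated": {"shown_if": None},
    "vaccine_brand": {"shown_if": ("vaccinated", "yes")},
}


def was_shown(question, record, dictionary):
    """Return True if the skip logic says the respondent saw the question."""
    rule = dictionary[question]["shown_if"]
    if rule is None:
        return True
    parent, required_value = rule
    # A question is shown only if its parent was itself shown and was
    # answered with the required value.
    return (
        was_shown(parent, record, dictionary)
        and record.get(parent) == required_value
    )


def find_missing(record, dictionary):
    """List questions that were shown to the respondent but left unanswered."""
    return [
        question
        for question in dictionary
        if was_shown(question, record, dictionary)
        and record.get(question) in (None, "")
    ]
```

With this sketch, a respondent who answered "no" to `vaccinated` is not flagged for a blank `vaccine_brand`, while a respondent who answered "yes" is.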

DIME also includes tools for generating reports on quality issues for external data collection sites, posting these reports on dashboards and a collaboration space, and tracking issues to conclusion.

DIME’s free-text review tool streamlines processes for reviewing free-text fields included in the dataset to strip personally identifying information (PII) and to categorize the free text into analytic categories. The tool uses a machine learning-based suggestion engine followed by an analyst review. Changes are retained for all subsequent runs of the same dataset, so data that are pulled on a repeating schedule have to be reviewed only once.
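One way to picture the "review only once" behavior is a cache of analyst decisions that is consulted on every new data pull. The sketch below is a simplified assumption about how such a cache could work, not a description of DIME's internals. The `ReviewCache` class and `process_run` function are hypothetical, and the machine learning suggestion step is left out.

```python
import hashlib


class ReviewCache:
    """Stores analyst decisions keyed by field name and a hash of the text."""

    def __init__(self):
        self._decisions = {}

    @staticmethod
    def _key(field, text):
        # Hash the raw text so the cache key itself does not contain PII.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return (field, digest)

    def record(self, field, text, redacted_text, category):
        """Save the analyst's redaction and category for this exact text."""
        self._decisions[self._key(field, text)] = (redacted_text, category)

    def lookup(self, field, text):
        """Return (redacted_text, category), or None if never reviewed."""
        return self._decisions.get(self._key(field, text))


def process_run(rows, cache, field):
    """Split a data pull into already-reviewed rows and rows needing review."""
    resolved, pending = [], []
    for row in rows:
        decision = cache.lookup(field, row[field])
        if decision is None:
            pending.append(row)
        else:
            redacted, category = decision
            resolved.append({**row, field: redacted, "category": category})
    return resolved, pending
```

On the first pull every row lands in `pending`. After the analyst's decisions are recorded, later pulls containing the same text resolve automatically and only new responses need review.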

The analytic file development library enables the rapid construction of analytic files, including standard re-usable code for manipulating data and constructing derived variables, with rules that shift over time. Analysts can use DIME’s data dictionary to track derivations associated with any given data element constructed at a particular point in time.
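To make "rules that shift over time" concrete, here is a minimal sketch of a derived variable whose definition depends on the date at which it is constructed. The variable, thresholds, dates, and function names are invented for illustration and do not come from any Abt or CDC specification.

```python
from datetime import date

# Hypothetical rule versions for a derived "older_adult" flag.
# Each entry is (effective_from, label, function of age).
OLDER_ADULT_RULES = [
    (date(2021, 1, 1), "v1: age >= 65", lambda age: age >= 65),
    (date(2022, 1, 1), "v2: age >= 60", lambda age: age >= 60),
]


def rule_as_of(rules, as_of):
    """Return the latest rule whose effective date is on or before as_of."""
    applicable = [rule for rule in rules if rule[0] <= as_of]
    if not applicable:
        raise ValueError(f"no rule in effect on {as_of}")
    return max(applicable, key=lambda rule: rule[0])


def derive(rules, as_of, value):
    """Apply the rule in effect on as_of; return the result and rule label."""
    _, label, func = rule_as_of(rules, as_of)
    return func(value), label
```

Returning the rule label alongside the value is one way to record, for each data element, which derivation was in force at that point in time.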

DIME also includes more than 50 dashboard templates. These dashboards monitor all aspects of study implementation, such as recruitment, enrollment, survey response rates, protocol adherence, and data quality at the site, survey, and variable levels. The dashboards incorporate frequently used data content, such as participant demographics, service or treatment uptake, health and environment factors, and behavioral practices. Dashboards can be filtered so that individual sites or other stakeholders can see only results appropriate to them.

At the end of the processing cycle, DIME facilitates posting processed data to our secure file transfer protocol (FTP) data exchange platform or to other locations for ongoing delivery to the client or to other designated recipients or data dissemination environments.

DIME is built within Abt’s Data Collection and Analytic Computing Environment (DC-ACE). All the data remain on the platform from the time we receive them from external sources until we deliver them to the client. This platform guarantees the ongoing availability and performance of the ecosystem. It is built on Amazon Web Services using FedRAMP Moderate services. DC-ACE and DIME are fully FISMA Moderate and HIPAA compliant.

By using DIME on the CDC’s Abt-led RECOVER project, we have been able to deliver more than 200,000 records and 200 million cells accurately on a biweekly basis, and often more frequently. CDC uses these data on tight deadlines for public health guidance, White House press conferences, medical journals, and mainstream news outlets. We have been able to spin up DIME for similar projects in a month.