Data integrity is crucial in unlocking novel data-based insights.
Today's most valuable gold rush is invisible. In labs and back offices worldwide, companies are mining for something arguably more valuable than gold, oil, or precious gems: their own data. In the early 21st century, data are considered many companies’ most prized asset, particularly in the life sciences and biopharmaceutical industries due to the digitalization of research data previously recorded on paper and the explosion of opportunities created by this “digital transformation”.
Unlike oil or precious gems, data are abundant. However, as with precious gems, quality is paramount. High-quality data can improve efficiency and unlock insights that lead to new discoveries and support better manufacturing processes, especially with the help of machine learning (ML) and artificial intelligence (AI). Relying on bad data, however, can lead to wrong conclusions and have significant cost implications.
The biopharmaceutical industry is lagging behind other industries in terms of digital transformation. To catch up, companies will need to treat their data as one of their most valuable assets, prioritize the capture of high-quality data at every level of the business going forward, and check the integrity of their historical data. In this new gold rush, data integrity is no longer an afterthought for the information technology team; it is the foundation for success.
Data integrity is the assurance that data are accurate, unchanged, and traceable. In the biopharma industry, data include everything from cell culture and bioreactor readings, molecular structures and mixture compositions, and genetic information and protein structures to clinical trial participant forms and email lists. During the data lifecycle, these data are created, recorded, and often transferred from one system to another before being used.
Data integrity requires every step of this process to be consistent, clear, and traceable for all mission-critical data. Data must be generated and documented accurately. They must be protected from modification (intentional or accidental) and from unwanted deletion or destruction. Furthermore, the data must be maintained in such a way that they can be returned to past states to verify previous analyses, reused to perform new analyses, or disposed of safely if required (as with personal data regulated under the General Data Protection Regulation).
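As an illustration only, the short Python sketch below shows one common way to detect silent modification: store a cryptographic fingerprint of a record at capture time and re-check it before reuse. The record fields and values are invented for the example and are not drawn from any particular system.

import hashlib
import json
from datetime import datetime, timezone

def fingerprint(record: dict) -> str:
    """Compute a stable SHA-256 fingerprint of a data record."""
    canonical = json.dumps(record, sort_keys=True)  # canonical form so key order does not matter
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Capture an illustrative bioreactor reading together with its fingerprint
reading = {"batch": "B-0421", "parameter": "pH", "value": 7.02,
           "recorded_at": datetime.now(timezone.utc).isoformat()}
stored_hash = fingerprint(reading)

# Later, before reusing the record, confirm nothing has changed since capture
assert fingerprint(reading) == stored_hash, "Record was modified after capture"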
If data are not accurate or complete, then the information they represent may be skewed. Poor-quality data in any area can impact study results, patient safety, business credibility, and the regulatory process. Poor-quality data can also lead to poor decision-making due to misleading insights and incorrect ML models.
When poor-quality data drive business decisions, the quality of those decisions, and therefore confidence in them, is affected in all domains, such as the safety, quality, and efficacy of therapeutics. This, in turn, may slow regulatory reviews and time to market, ultimately impacting profits. Data integrity, because of its impact on patient safety, is one of the key concerns of FDA, whose top 10 citation types (1, 2) in recent years have been related to data integrity.
On the flip side, good data are a critical enabler of digital transformation. The biopharma industry is in a state of flux (3): companies are scrambling to take advantage of recent digital advances to improve operational efficiency, reduce costs, speed time to market, explore existing data in a new light, and lead the competition.
However, building a culture of data integrity means scrutinizing data handling across the business, from production to consumption. This can be a vast undertaking, but the end goal is clear: data should be complete, consistent, and accurate (4). In this regard, two widely accepted data principles can guide companies toward good, high-quality data.
First, companies should follow the findable, accessible, interoperable, and reusable (F.A.I.R.) (5) principles for scientific data management and stewardship. The F.A.I.R. principles focus on making it easier for both people and machines to find, use, and reuse high-quality data.
Second, data should follow the attributable, legible, contemporaneously recorded, original or a true copy, and accurate (ALCOA) principles, which complement the F.A.I.R. principles (4). In addition to these guidelines, companies would be wise to follow the ALCOA+ principles, which add complete, consistent, enduring, and available to the original attributes. This guidance is recommended by various regulatory bodies and industry organizations such as FDA, the World Health Organization, and the International Society for Pharmaceutical Engineering.
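As a rough illustration of what these attributes can look like in practice, the hypothetical Python record below attaches ALCOA-style metadata to a single result. The field names and values are assumptions chosen for the example, not a standard schema.

# An illustrative metadata envelope for one result, loosely following ALCOA;
# every field name and value here is invented for the example.
result_record = {
    "value": 98.4,                          # the measured result itself
    "unit": "percent purity",               # explicit units keep the value unambiguous
    "recorded_by": "analyst_0123",          # attributable: who generated the data
    "recorded_at": "2023-08-30T14:02:11Z",  # contemporaneously recorded
    "source_system": "HPLC-07",             # original system, for traceability
    "method_id": "SOP-214-v3",              # links the result to the governing procedure
    "is_original": True,                    # original record, or flagged as a true copy
}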
In addition to being accurate and complete, data should be easy to find and accessible for decision-making. Notably, interoperability must be a priority: data should be readable and consumable across multiple systems to provide a complete picture, because missing pieces may lead to wrong conclusions.
Additionally, data scientists often must spend time gathering and formatting data before any processing can begin. Instead, data should be reusable and durable to enable repeat analyses that confirm conclusions. Reusability is a particularly important attribute of good data because data that have already proven useful in the past are likely to be useful in the future as well. However, accessing “historical” data can be challenging, as they may have been created on legacy systems and in old formats, including paper. This leads to compatibility and accessibility issues because the data may not be indexed or in a machine-readable format.
Various factors contribute to shoddy data. Manual transcription of data from one system to another during experiments, for example, can lead to errors. While these errors can be as simple as a misplaced decimal point, the consequences remain significant. Inconsistent experimental protocols mean inconsistent data between experiments, with some data not recorded, or recorded and labeled incorrectly, limiting the data’s value and reusability. Additionally, a lack of traceability and audit capability, coupled with poor quality control of the data, means errors and inconsistencies can slip through the cracks and persist.
Thankfully, advances in technology mean that new tools can help address these pitfalls. Newer scientific instruments can gather data automatically, with no manual handling or human transcription, eliminating data-entry errors. Most modern systems also provide auditing functionality that traces the history of the data, making the data cleanly traceable and creating a strong foundation for data integrity and regulatory compliance.
Data management systems can integrate data from instruments and various other sources into a unified data backbone, organizing data so they can be accessed from one place. When deciding what metadata to capture alongside the results, it is important to look at the full picture, understand what will be required for downstream processing, and ensure the data are labeled appropriately and stored in a format that supports interoperability. The system the data came from should also be recorded, both to ensure traceability and to allow access to the original data set. Most of this traceability is achieved automatically by the auditing functionality of modern systems.
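The hypothetical Python sketch below illustrates one simple way such a backbone might preserve traceability: each record is tagged with the system it came from before the sources are combined. The system names, samples, and values are invented for illustration.

# Illustrative sketch of pulling results from two hypothetical source systems
# into one unified view while preserving where each record came from.
def tag_source(records, system_id):
    """Attach the originating system to every record so it stays traceable."""
    return [{**r, "source_system": system_id} for r in records]

lims_results = [{"sample": "S-001", "assay": "titer", "value": 2.1, "unit": "g/L"}]
eln_results = [{"sample": "S-001", "assay": "viability", "value": 94.0, "unit": "%"}]

backbone = tag_source(lims_results, "LIMS-A") + tag_source(eln_results, "ELN-B")
# Every row now carries enough context to trace it back to the original data set.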
With the right tools in place, companies should also focus on effective quality assurance and quality control. Quality assurance takes place prior to data capture and involves the design and installation of quality systems, standard operating procedures (SOPs), and staff training, all with the intent of preventing issues before data are collected. Quality assurance is an ongoing journey and should be constantly optimized.
Quality control involves monitoring the step-by-step workflows in detailed SOPs and making sure directions are followed; it also involves detecting, investigating, and, if possible, correcting data issues after data capture. Quality control should be based on quality assurance protocols and performed by the appropriate staff members. In many cases, quality control focuses on auditing by exception: when well-documented protocols are in place, any deviation should be recorded by trained staff members. These deviations should be easy to spot and investigate, and the investigation should provide full visibility into their root causes. Clear documentation can result in corrective actions, including SOP updates if required, to prevent future deviations. The recording and review of these deviations is critical, and the resulting records are an important artifact during inspections by regulatory agencies (6).
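To make the idea of auditing by exception concrete, the illustrative Python sketch below flags only the readings that fall outside assumed SOP limits for review. The limits, parameters, and values are placeholders, not real specifications.

# Assumed, example-only SOP limits: parameter -> (low, high)
SOP_LIMITS = {"pH": (6.8, 7.2), "temperature_C": (36.5, 37.5)}

def find_deviations(readings):
    """Return the readings that fall outside their SOP limits, for review."""
    deviations = []
    for r in readings:
        low, high = SOP_LIMITS[r["parameter"]]
        if not low <= r["value"] <= high:
            deviations.append(r)
    return deviations

readings = [{"parameter": "pH", "value": 7.4}, {"parameter": "temperature_C", "value": 37.0}]
for d in find_deviations(readings):
    print("Deviation to investigate:", d)  # each flagged record gets a documented review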
Good data have always been foundational for decision-making, and they are equally foundational for learning. For learning, one particularly important attribute of good data is completeness: data sets should capture not only successful attempts but also failed ones (7), that is, experiments that did not lead to the desired result.
In life and in science, one often learns more from failure than from success; as it turns out, this is as true for machines as it is for humans. While ML and AI algorithms can sift through more data more quickly than humans can, they need to learn from unsuccessful attempts to identify what actually works.
Consider a chemical reaction that generated a 0% yield of the expected product. That experiment is as important as, if not more important than, all the chemical reactions under the same conditions that produced the expected transformation. The null result may help uncover incompatibilities between the parameters of the desired reaction; avoiding these in the future will save time and costs. Further analysis of what happened may also lead to valuable insights.
In the past, companies and researchers were biased against failure data, preferring to publish only successes. In the era of ML, this mindset must change. If data sets do not sufficiently represent failures and outliers, the algorithms will learn from an incomplete picture, and the predictions those algorithms provide will not be sound. Even more than in the past, data completeness is now a core component of data integrity.
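As a toy illustration of why failures matter to ML, the Python sketch below (using scikit-learn) trains a simple reaction-outcome classifier. Without the rows labeled as failures, the model would have nothing to learn about what does not work. The features, conditions, and labels are invented for the example.

# Toy example: a reaction-outcome classifier can only learn what fails if
# failed experiments are present in the training set.
from sklearn.linear_model import LogisticRegression

# Each row: [temperature_C, catalyst_loading_mol_pct]; label 1 = product formed, 0 = no yield
X = [[60, 5], [80, 5], [60, 10], [25, 1], [25, 2], [30, 1]]
y = [1, 1, 1, 0, 0, 0]  # without the zeros, a model would predict success everywhere

model = LogisticRegression().fit(X, y)
print(model.predict([[70, 8], [25, 1]]))  # expected: likely success, then likely failure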
With a foundation of good data, AI and ML can drive huge advances and efficiency gains through in silico modeling, experimental prediction, and improved decision-making. With appropriate access controls, these digitalization tools can also allow cross-departmental data analysis. Companies are using AI and ML to re-analyze historical “good” data to discover trends or insights that were not originally picked up, leading to potential new drugs or new applications of existing drugs. AI and ML built on good data are also informing experiment and study design, helping to predict some preclinical properties to speed the process-development phase of the drug lifecycle (Figure 1). This can save costs by allowing teams to perform fewer actual experiments and instead rely more on predictions and analyses.
AI and ML are already reshaping the biopharmaceutical industry. Many companies will lean on these tools to power drug discovery, streamline processes, reduce costs, and accelerate time to market. But to strike gold, companies must first make their data reliable and easy to mine. ML, like human learning, requires a foundation of accurate information. For biopharmaceuticals, that means high-quality data supported by rigorous data integrity.
Companies must ensure that data are accurate, complete, reusable, and accessible. To do that, they must take a fresh look at human and digital processes with those goals in mind and carefully review their data management strategy. The companies that make data integrity a top priority will reap the rewards of this decade's digital gold rush.
1. The FDA Group. FDA Warning Letter & Inspection Observation Trends. Feb. 6, 2023.
2. FDA. Inspections, Data Dashboard. datadashboard.fda.gov/ora/cd/inspections.html (accessed Aug. 30, 2023).
3. Weiss, S. An Integrated Approach to the Data Lifecycle in BioPharma. Pharm. Technol. 2022, 46 (8).
4. FDA, Draft Guidance for Industry, Data Integrity and Compliance with CGMP (CDER, April 2016).
5. Wilkinson, M. D.; Dumontier, M.; Aalbersberg, I. J.; et al. The FAIR Guiding Principles for Scientific Data Management and Stewardship. Sci. Data 2016, 3, 160018. DOI: 10.1038/sdata.2016.18
6. Kumar, K. Data Integrity in Pharmaceutical Industry. J. Anal. Pharm. Res. 2016, 2 (6), 00040. DOI: 10.15406/japlr.2016.02.00040
7. Cepelewicz, J. Lab Failures Turn to Gold in Search for New Materials. Scientific American, May 6, 2016.
Nathalie Batoux and Dan Rayner work in Data Strategy Innovations at IDBS.