Capturing and curating R&D data are crucial to realizing the full value of advanced analytics.
Many pharmaceutical and biotechnology organizations are rich in data but poor in insights. Scientific data are often siloed, lacking context, and, therefore, inaccessible. Artificial intelligence (AI) and machine learning (ML) are promising tools for enhancing R&D. In drug discovery, AI can perform a variety of tasks, from identifying potential drug targets to predicting drug efficacy and optimizing treatment strategies. However, there is often a missing link between drug discovery data and AI outcomes. The integration of AI in drug discovery hinges on robust data management.
Demand for traditional drugs and new modalities has increased data digitization across the pharmaceutical industry (1). Meanwhile, the latest AI and ML methods are transforming the R&D process within drug discovery, delivering on the promise to accelerate the journey from laboratory insights to lifesaving therapies. This transformative power of AI inspires optimism for the future of drug discovery, but much of the potential is trapped in isolated and undefined data.
AI is only as powerful as the data it consumes, and it works best with data that have the proper quality, detail, and context. While there are abundant scientific data in pharmaceuticals and biotechnology, the data are often siloed, poorly modeled, and ultimately inaccessible. A meticulous approach to capturing and curating R&D data is indispensable to unlock the full potential of novel AI methodologies. Examining how data are ingested, stored, organized, and maintained is crucial. A strong AI-ready data management strategy can help drug discovery leaders leverage AI to deliver innovative drugs to patients faster.
As AI has matured, many industries have experienced the benefits of automating tedious, data-heavy tasks. Now, the drug discovery industry faces growing demand to extract insights from high volumes of data, just as cutting-edge AI technology is becoming more widely available.
The challenge of data complexity. Pharmaceutical R&D has always been data-intensive, involving many data types from various sources, including laboratory instruments, clinical trials, and real-world evidence.
These data are often stored in disparate systems, leading to silos that make it difficult for both humans and machines to access and utilize the data. What’s more, modern drug therapies are increasingly complex. Drug modalities such as antibodies and cell and gene therapies comprise large, intricate molecular entities, and developing these complicated solutions requires large amounts of data. This complexity is an opportunity for innovation, but it also underscores the need for AI’s interpretive capabilities in drug discovery (see Figure 1).
As science grows in complexity, researchers need new tools and different approaches to curating data to ensure quality and consistency. Simple tools for storing and organizing data were adequate for managing the work that previous generations performed, but they lack the capabilities required for today’s complex research needs. Fortunately, AI can reduce human workload and help researchers reach targets in less time (1). When paired with proper data management, it can rapidly find patterns and discover hidden insights that humans would otherwise miss.
AI’s capabilities. AI can consume and analyze complex datasets, and ML can identify valuable patterns within the data. Much like how the manufacturing industry has outsourced heavy lifting and repetitive tasks to robots, the pharmaceutical industry can delegate complex computing tasks to AI systems, enabling humans to find insights, make decisions, and explore new capabilities. Tools such as generative AI and ML algorithms give back quality time to scientists and fuel what scientists do best: think, experiment, and discover new therapies faster.
There are already many use cases for AI in drug discovery. In drug design and molecule generation, generative AI models can analyze existing databases of chemical compounds and learn how to generate novel molecules that are likely to have the desired efficacy and safety profile. In this way, researchers can employ AI to design new molecular structures with desired properties for treating specific diseases. Additionally, AI models can analyze biological data to understand the complex interactions within biological systems and identify points where therapeutic intervention could be beneficial.
Generative AI can also aid in discovering new biomarkers for diseases, which can lead to the development of more precise and targeted therapies. By analyzing patient data, including genetic information, AI models can predict how individual patients might respond to certain treatments, leading to more personalized and effective therapy plans.
Furthermore, AI models can simulate the interaction between drugs and biological systems, predicting drugs’ efficacy and potential side effects before real-world testing. Researchers can also use AI to design clinical trials, identify suitable participants, and predict outcomes, thus speeding up the development process and reducing costs. Another benefit of AI is its speed in finding, analyzing, and summarizing large volumes of external literature. For example, researchers can employ techniques such as natural language processing (NLP) and semantic search to quickly find and understand relevant issues from large bodies of work.
AI is clearly making a tremendous impact on the pharma industry, and the future holds even more promise. However, as the volume and complexity of research data grow, AI’s success depends on data management. Take ML, for example. An ML algorithm’s effectiveness depends on the quality of the training dataset. As AI capabilities continue to advance, there will be even more opportunities to streamline the scientific process and unite disparate sources of information—as long as organizations practice strategic data management.
Cloud computing. As technology advances and labs continue introducing more automation to their processes, the volume of data will increase, thus underscoring the need for proactive data management. Traditional on-premises infrastructure struggles to keep pace with the demands of modern labs, often leading to inefficiencies and increased costs. As a result, many research organizations have embraced cloud computing, which provides both data storage and the computational power to process the data. But simply storing data in the cloud is not enough. No matter where the data are stored, they need context in order to be useful. Metadata, or data about data, are essential. Without metadata, data are like a cabinet full of unlabeled chemicals. In order to utilize AI, researchers must label data with information about the experiment, such as experimental methods, conditions, and the nature of the data.
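As a rough illustration of labeling data with experimental context, the following minimal Python sketch pairs a set of measurements with a metadata dictionary and reports which context fields are missing. The class name, the three required keys, and the example values are all hypothetical; they simply mirror the kinds of information named above (methods, conditions, and the nature of the data).

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class LabeledDataset:
    """Raw measurements bundled with the context needed to interpret them."""
    values: list[float]
    metadata: dict[str, Any] = field(default_factory=dict)


# Hypothetical required keys, chosen to mirror the context named in the text.
REQUIRED_KEYS = ("experimental_method", "conditions", "data_type")


def missing_metadata(dataset: LabeledDataset) -> list[str]:
    """Return the required metadata keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not dataset.metadata.get(key)]


# The same numbers, once without labels and once with illustrative labels.
unlabeled = LabeledDataset(values=[0.12, 0.15, 0.11])
labeled = LabeledDataset(
    values=[0.12, 0.15, 0.11],
    metadata={
        "experimental_method": "absorbance assay",
        "conditions": {"temperature_c": 25, "ph": 7.4},
        "data_type": "optical density",
    },
)

print(missing_metadata(unlabeled))  # every required key is reported
print(missing_metadata(labeled))    # nothing is reported
```

A check of this kind could run at the point of data capture, so that unlabeled results are flagged before they reach a training dataset.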
By moving data management to the cloud, pharmaceutical companies can leverage the extensive infrastructure, security, and scalability that cloud service providers offer. This shift allows organizations to focus on their core expertise—drug discovery and development—instead of troubleshooting technical problems.
While simply moving existing applications to the cloud (a process known as “lift-and-shift”) can offer some benefits, it often fails to unlock cloud computing’s full potential. The real advantages are realized through the adoption of cloud-native applications—those specifically designed to exploit cloud computing principles and technologies from the ground up.
Cloud-native solutions provide a rich ecosystem of services and connected capabilities that are indispensable for modern pharmaceutical R&D. They facilitate seamless integration of data from multiple devices, enable the application of advanced analytics and AI, and support real-time collaboration across geographically dispersed teams.
Cloud-based Software as a Service (SaaS) platforms seamlessly integrate with existing information technology (IT) ecosystems, providing a unified and flexible approach to data management. These integration-ready SaaS applications enable pharmaceutical companies to streamline their R&D workflows, reduce manual processes, and enhance collaboration across global teams. Plus, the cloud-based approach allows for frequently updating software with new releases and capabilities. The ability to update is essential as research needs change, AI capabilities advance, and new security vulnerabilities develop.
Electronic lab notebooks (ELNs) provide a central tool for data capture, analysis, and reporting. Cloud-based ELNs offer robust integration capabilities, allowing data from various sources to exist in a single, accessible platform. This centralization not only simplifies data management but also breaks down silos, fostering a collaborative environment where researchers can share insights and work together more effectively.
They also enable the creation of seamless workflows that connect different stages of the drug development process. For instance, integration between an ELN and a sample management system can automate the flow of substance registration data, minimizing the need for manual data entry. This approach saves time and reduces the chance of human error, ensuring that ML algorithms are working with accurate data.
User experience. Like many technological advances, AI will only be effective if users embrace it. Data entry must be simple enough to encourage researchers to adopt AI-based research methods. User-friendly interfaces can help scientists quickly learn to navigate these platforms, manipulate data, and customize workflows to suit their needs. Features such as inventory tracking and integration with familiar tools make it easier for researchers to embrace new technologies and drive innovation.
Data security. As pharmaceutical R&D evolves, robust data security and compliance are crucial for safeguarding sensitive research information, especially as AI becomes integral to the industry.
While scientists leverage AI to accelerate discoveries, cybercriminals are also using AI to develop sophisticated attack methods. The benefits of using AI far outweigh the risks, but organizations must take proactive measures to prevent attacks. A layered approach to security includes data encryption and rigorous access controls, continuous monitoring and proactive threat intelligence, and regular vulnerability scans and risk assessments.
FAIR principle. The FAIR principle, defined in a March 2016 paper in the journal Scientific Data by a consortium of scientists and organizations, states that data should be findable, accessible, interoperable, and reusable. Under this principle, a given data point must have metadata and a unique identifier. Furthermore, both humans and machines should be able to read the data, which must be registered or indexed in a searchable resource (2) (see Figure 2).
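The findability requirements described here (metadata, a unique identifier, and registration in a searchable resource) can be sketched in a few lines of Python. This is only an illustrative sketch, not a FAIR-compliant implementation: the in-memory dictionary stands in for a real searchable index, and the function names, metadata keys, and example values are assumptions made for the example.

```python
import json
import uuid


def register(index: dict, payload: dict, metadata: dict) -> str:
    """Store a record under a new unique identifier and return that identifier."""
    if not metadata:
        raise ValueError("a record needs metadata to be findable")
    record_id = str(uuid.uuid4())
    # JSON text keeps the record readable by both people and programs.
    index[record_id] = json.dumps(
        {"id": record_id, "metadata": metadata, "data": payload}
    )
    return record_id


def search(index: dict, key: str, value) -> list[str]:
    """Return identifiers of records whose metadata has the given key and value."""
    hits = []
    for record_id, raw in index.items():
        if json.loads(raw)["metadata"].get(key) == value:
            hits.append(record_id)
    return hits


# Hypothetical usage: register one result, then find it again by its metadata.
index: dict = {}
rid = register(
    index,
    payload={"ic50_nm": 42.0},
    metadata={"assay": "kinase inhibition", "target": "hypothetical target A"},
)
print(search(index, "assay", "kinase inhibition") == [rid])
```

The sketch covers only the findable part of the principle; accessibility, interoperability, and reuse depend on access protocols, shared vocabularies, and licensing that are outside its scope.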
Scientific inquiry is unpredictable, and the data collected today might need to answer unforeseen questions tomorrow. Rigid data schemas risk losing valuable context and potential insights. A different methodology called “late binding of schema” offers the flexibility that scientific discovery requires.
This methodology involves capturing data in a way that allows for its structure to be defined and modified closer to the time of analysis. By enabling data to be restructured as new hypotheses and analytical needs emerge, late binding of schema preserves the data’s full informational content and contextual richness. This flexibility is especially critical in the context of AI and ML, where the quality of training datasets directly influences model performance.
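One way to picture late binding of schema is the minimal Python sketch below: records are stored exactly as captured, and a schema is applied only when an analysis reads them. The record fields, the schema format (an output name mapped to a source key and a converter), and the `bind` helper are hypothetical choices made for illustration, not a description of any particular product.

```python
from typing import Any, Callable

# Records are stored exactly as captured; no structure is imposed at write time.
raw_records: list[dict[str, Any]] = [
    {"sample": "S-001", "reading": "0.42", "unit": "AU", "note": "plate edge well"},
    {"sample": "S-002", "reading": "0.57", "unit": "AU"},
]

# A schema maps an output field name to (source key, converter function).
Schema = dict[str, tuple[str, Callable[[Any], Any]]]


def bind(records: list[dict[str, Any]], schema: Schema) -> list[dict[str, Any]]:
    """Project raw records onto a schema chosen at analysis time."""
    rows = []
    for record in records:
        row = {}
        for field_name, (source_key, convert) in schema.items():
            value = record.get(source_key)
            # Missing source values become None instead of raising an error.
            row[field_name] = convert(value) if value is not None else None
        rows.append(row)
    return rows


# First analysis: only the numeric signal matters.
signal_schema: Schema = {"sample_id": ("sample", str), "signal": ("reading", float)}

# A later question needs the free-text note, which was never discarded.
qc_schema: Schema = {"sample_id": ("sample", str), "qc_note": ("note", str)}

print(bind(raw_records, signal_schema))
print(bind(raw_records, qc_schema))
```

Because the raw records are never reshaped on the way in, a second schema can surface context (here, the well-position note) that a fixed schema defined up front might have dropped.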
It is not only the output of today’s research that has changed, but also the way research is conducted. Collaborating with outsourced partners and colleagues in different locations is not only possible, but common. The modern way of working requires seamless, secure data sharing across teams and scientific disciplines. As teams become more distributed, there is a need for continuous integration with a globalized scientific community.
As AI progresses, so will the way researchers capture, integrate, and analyze data. Gathering data and transferring them into programs for analysis and review will continue to become more streamlined and automated so that, eventually, data could flow automatically across systems and algorithms and into predictive models.
In the drug development industry, unifying data and contextualizing data are paramount for advancing scientific research and innovation. By annotating data with rich metadata and employing flexible data modeling, researchers can transform overwhelming volumes of information into structured, navigable resources ready for complex analysis by today’s AI tools and new tools that will become available in the future.
The collaborative nature of scientific research necessitates a data management system that facilitates seamless and secure data sharing across teams and disciplines. This approach prevents data silos and allows for the collective expertise of the scientific community to enrich and expand upon the available data. Integrating diverse data types—such as medicinal chemistry and biochemical data, screening data, and pre-clinical observations—becomes increasingly critical for a holistic understanding of therapeutic outcomes as research becomes a global endeavor.
Flexibility in data management is also crucial, particularly in the context of AI and ML. Adopting methodologies such as late binding of schema allows for the data structure to be defined and refined as close as possible to the analysis time. This ensures that data remain agile and adaptable enough for modern scientific discovery. By enabling the reorganization of data and correction of errors without restarting the process, researchers save time and maintain the integrity and utility of their data, ultimately accelerating R&D progress and driving breakthroughs in the pharmaceutical industry.
Effective data management may have previously been considered an operational decision, but it is now a strategic decision. When data are contextualized, centralized, and accessible, AI can enable pharmaceutical companies to accelerate innovation and improve patient outcomes.
1. Paul, D.; Sanap, G.; Shenoy, S.; et al. Artificial Intelligence in Drug Discovery and Development. Drug Discovery Today 2021, 26 (1), 80–93. DOI: 10.1016/j.drudis.2020.10.010
2. Wilkinson, M. D.; Dumontier, M.; Aalbersberg, I. J.; et al. The FAIR Guiding Principles for Scientific Data Management and Stewardship. Sci. Data 2016, 3 (1), 160018. DOI: 10.1038/sdata.2016.18
Chris Stumpf is director of Drug Discovery Informatics Solutions, Revvity Signals.
Pharmaceutical Technology®
Vol. 48, No. 11
November 2024
Pages: 27–30
When referring to this article, please cite it as Stumpf, C. Untethered Data: Unifying and Contextualizing Drug Discovery Data. Pharmaceutical Technology 2024, 48 (11), 27–30.