PHARMA SEMANTIC SEARCH: CONNECTING REGULATORY INFORMATION TO INTERNAL R&D DATA VIA A KNOWLEDGE GRAPH
Pharmaceutical companies must go through a regulatory submission process when registering new products. The internal view of R&D data is usually maintained separately from the data used in that submission process. Integrating information from regulatory documents and R&D databases therefore often involves manually searching through documents and databases, which takes significant effort.
brox IT Solutions built a solution for a large German pharma company that integrates data from the regulatory process and R&D databases to overcome these challenges. The solution is based on a knowledge graph, which makes it straightforward to extend the data, and includes a corresponding search frontend that lets non-technical users access the data easily. Reducing the manual effort of searching through databases and documents can significantly decrease costs.
A pharma company needed to integrate the data represented in regulatory submission documents with data from internal databases, such as data on substances and molecules, as well as organizational master data. The company wanted to integrate the data in order to:
- ensure the data quality of submission documents
- find out which substances are registered in which countries
- direct research effort to areas that result in products
The frontend for exploring the data had to support searching and filtering for relevant information, and it had to let users with no data-science or analytics background interact with the data.
One of the main challenges of this project was relating data from the R&D and regulatory domains to one another. Because the regulatory data included text-mined documents, identifiers in the documents did not always exactly match the identifiers and names used in other databases. After data cleansing, the data had to be matched to the internal master data on substances and legal entities, which was already maintained in a knowledge graph. The results of the matching process also had to be stored in a knowledge graph to allow integration with other sources.
Another challenge was making the data available to non-technical users via a frontend. To offer an interaction pattern these users already knew, the frontend was based on a search engine. That search engine had to be integrated with the knowledge graph and support faceted search over the data represented in the graph.
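To make the idea of faceted search more concrete, the following is a minimal sketch of how a request body for such a search could be assembled for Elasticsearch. The function name and the field names (`country`, `legal_entity`) are assumptions chosen for illustration and are not taken from the project; the sketch also assumes that the facet fields are indexed as keyword fields.

```python
from typing import Any


def build_faceted_query(
    text: str,
    filters: dict[str, str],
    facet_fields: list[str],
) -> dict[str, Any]:
    """Build an Elasticsearch request body for a faceted full-text search.

    All field names passed in are hypothetical examples.
    """
    return {
        "query": {
            "bool": {
                # Free-text part of the search entered by the user.
                "must": [{"simple_query_string": {"query": text}}],
                # Facet values the user has already selected narrow the results.
                "filter": [
                    {"term": {field: value}} for field, value in filters.items()
                ],
            }
        },
        # One terms aggregation per facet returns the value counts
        # that a frontend can render as clickable filters.
        "aggs": {
            field: {"terms": {"field": field}} for field in facet_fields
        },
    }
```

Calling `build_faceted_query("substance-123", {"country": "DE"}, ["country", "legal_entity"])` yields a body that restricts the hits to one country and asks for value counts on both facet fields.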
The solution to these challenges had two main parts: the data integration and the frontend implementation.
Data integration was accomplished using the following components:
- ETL software was used to extract data from the text-mining results and create a graph.
- The text-mining results were matched to the master data using matching patterns expressed as SPARQL queries.
- The resulting data was then ingested into an RDF database.
The frontend implementation used the following components:
- The search engine was built using Elasticsearch.
- Data was ingested into Elasticsearch by using rdflib to extract it from the graph store and the Elasticsearch client library for Python to index it.
- The search frontend was implemented using Searchkit, whose easy and accessible templates allowed for a fast implementation.
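The ingestion step can be sketched as follows, assuming that each resource in the graph becomes one flat search document whose fields are derived from the predicate names. The function names, the index name `regulatory-search`, and the document layout are assumptions for illustration; only `elasticsearch.helpers.bulk` is an existing call of the Elasticsearch client library for Python. Because an rdflib `Graph` can be iterated as (subject, predicate, object) triples, it can be passed directly as `triples`.

```python
from collections import defaultdict
from typing import Any, Iterable, Iterator


def triples_to_documents(
    triples: Iterable[tuple[Any, Any, Any]],
) -> dict[str, dict[str, list[str]]]:
    """Group triples by subject into one flat document per resource."""
    docs: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        # Use the last segment of the predicate IRI as the field name.
        iri = str(predicate).rstrip("/#")
        field = iri.rsplit("/", 1)[-1].rsplit("#", 1)[-1]
        docs[str(subject)][field].append(str(obj))
    return {subject: dict(fields) for subject, fields in docs.items()}


def to_bulk_actions(
    docs: dict[str, dict[str, list[str]]], index_name: str
) -> Iterator[dict[str, Any]]:
    """Yield one bulk indexing action per document."""
    for doc_id, source in docs.items():
        yield {"_index": index_name, "_id": doc_id, "_source": source}


def index_graph(es_client: Any, graph: Iterable, index_name: str = "regulatory-search"):
    """Index all resources of a graph (for example an rdflib Graph)."""
    # Imported here so the pure conversion functions work without the client.
    from elasticsearch import helpers

    docs = triples_to_documents(graph)
    return helpers.bulk(es_client, to_bulk_actions(docs, index_name))
```

Using the resource IRI as the document ID means that re-running the ingestion overwrites existing documents instead of creating duplicates.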
The solution gave the pharma company several benefits that were previously out of reach:
- Inconsistencies between regulatory and R&D data can be discovered. Finding these inconsistencies can reduce regulatory risk.
- Connections between products, substances, and the legal entities that are allowed to sell them can now be found in the graph. Finding this information manually usually requires hours or even days of searching through documents, so the graph offers potential cost savings.
- Regulatory data can be easily accessed and filtered by country, internal substance identifier, related company, and other aspects. This helps with getting an overview of the company's current market access and hence makes the search for new streams of revenue easier.
- The solution is easily extensible: now that the regulatory data is contained in the graph, additional use cases can be built on that data by extending the graph and building a new frontend on top.
- The solution was implemented within weeks because a graph was already present for the internal R&D data and tools like Pentaho, graph databases, and Searchkit facilitated quick prototyping. Thus, the costs of new applications built with a similar approach can be reduced.
Dr. Matthias Jurisch
Manager Information Management Unit