The issue of how to cost-effectively create knowledge graphs for data integration was in the air. I was impressed by how present the topic of data integration was, not only at Bayer but for all the pharmaceutical companies represented there. In my talk, titled 'Domain-specific knowledge graphs for knowledge management and knowledge discovery', I emphasized that all data integration activities require clear use cases and competency questions to scope a project adequately and get the most out of the data. But cost is an issue. Semantics and ontologies are key to data integration, and they rest on five principles for semantic data integration:
- Normalization: Semantics is inherently reductionist, abstracting from details and focusing on commonalities. Practically, this is achieved by mapping data to an agreed-upon set of IDs and vocabulary elements.
- Reuse: Normalization is achieved by reusing vocabulary and IDs: do not invent your own IDs or vocabulary elements if suitable ones already exist. Otherwise, integration will simply not happen.
- Commitment: Commitment is about agreeing to understand a certain concept in the same way as the stakeholder who introduced it; this works even without formal axiomatic definitions. It is how speech works: we exchange messages in natural language without formally agreeing on the definition of every single word.
- Grouping: Normalization and typing allow different entities to be grouped together at a certain abstraction level. This is key for aggregation (see below), that is, computing summaries for data grouped according to some criterion.
- Aggregation: The ultimate goal of any semantic data integration exercise. At the end of the day, we are less interested in single data points than in aggregating all data points or entities that share certain characteristics or features, and in providing informative summaries/statistics for the aggregated elements.
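To make the principles concrete, here is a minimal sketch of normalization, grouping, and aggregation over toy data. All identifiers and values are hypothetical (the local source IDs, the CHEBI mappings, and the assay values are invented for illustration); the point is only the pattern: map local IDs to shared ones, group by the shared ID, then summarize each group.

```python
from collections import defaultdict

# Agreed-upon vocabulary (hypothetical): local ID -> shared canonical ID.
# Normalization and reuse: both sources commit to the same target IDs.
ID_MAP = {
    "src1:aspirin": "CHEBI:15365",
    "src2:ASA": "CHEBI:15365",
    "src1:ibuprofen": "CHEBI:5855",
}

# Toy records from two different sources
records = [
    {"id": "src1:aspirin", "assay_value": 4.2},
    {"id": "src2:ASA", "assay_value": 3.8},
    {"id": "src1:ibuprofen", "assay_value": 7.1},
]

# Normalization: rewrite each record with the shared ID
for r in records:
    r["id"] = ID_MAP[r["id"]]

# Grouping: collect values per shared ID
groups = defaultdict(list)
for r in records:
    groups[r["id"]].append(r["assay_value"])

# Aggregation: an informative summary (here, the mean) per group
summary = {cid: sum(vals) / len(vals) for cid, vals in groups.items()}
print(summary)
```

Only after the two aspirin records are normalized to the same shared ID do they end up in one group, which is what makes the final aggregation meaningful across sources.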
I also talked about the challenges of incorporating unstructured / textual information into knowledge graphs via text mining. Errors are unavoidable when using machine reading / information extraction techniques. On the other hand, by deploying machine reading, we are able to ingest information from text at a speed and scale that no single human would ever be able to match.
So, "Where is the sweet spot in the trade-off between being able to 'machine read' a large number of documents and having to live with errors?"
The panel after the talks on the morning of the 7th of December was very informative and lively. There were very interesting discussions on the role of foundational/upper ontologies in data integration, the cost of integrating data using knowledge graphs compared to a standard data warehouse approach, the challenge of dealing with datasets and vocabularies that constantly evolve, the question of how to implement quality assurance / quality control over an evolving knowledge graph, and how to effectively involve users in this process. Big questions!
It was a great conference, and a pleasure to speak to such an interesting and knowledgeable audience. The post-its all around and all the brainstorming going on were really inspiring and fruitful. When is the next edition?