A framework for the continuous curation of a knowledge base system

Ahmadi, Naser

Thesis

Entity-centric knowledge graphs (KGs) are becoming increasingly popular for gathering information about entities. The schemas of KGs are semantically rich, with many different types and predicates to define the entities and their relationships.

Exploiting the knowledge in these KGs requires an understanding of their structure and patterns. Their rich data structure can express entities with semantic types and relationships, often domain-specific, that must be made explicit and understood to get the most out of the data.

Although different applications can benefit from such rich structure, this comes at a price.

A significant challenge with KGs is the quality of their data. Without high-quality data, applications cannot rely on the KG. However, because KGs are created and updated automatically, they contain a great deal of noisy and inconsistent data, and the large number of triples in a KG makes manual validation infeasible.

In fact, KG creation and maintenance is a never-ending process, and semi-automatic curation techniques are needed both for adding new facts to a KG and for removing noise and inconsistencies. Computational methods can be employed in the creation and curation of KGs.

Deep learning is one such computational method: it can be exploited to find new relationships by matching entities in a KG. Rule mining systems are another computational approach that can help users improve the quality of KGs.

In this line of work, logical rules are used to express dependencies between entities in KGs. They are useful in tasks such as query answering, data curation, and automatic reasoning, but they are not included with the KGs. These rules must be defined manually or discovered using rule mining techniques.
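As an illustration of how a logical rule expresses a dependency between entities, the following minimal sketch applies a single Horn rule to a toy set of triples and returns the facts it implies that are missing from the KG. The rule `bornIn(x, c) ∧ locatedIn(c, k) ⇒ bornInCountry(x, k)`, the predicate names, the function `apply_rule`, and the data are all hypothetical and chosen for illustration only; they are not taken from the thesis.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def apply_rule(kg: Set[Triple]) -> Set[Triple]:
    """Apply the illustrative rule
        bornIn(x, c) AND locatedIn(c, k) => bornInCountry(x, k)
    and return the implied facts that are not already in the KG.
    """
    inferred: Set[Triple] = set()
    for (x, p1, c) in kg:
        if p1 != "bornIn":
            continue
        for (c2, p2, k) in kg:
            # Join the two body atoms on the shared variable c.
            if p2 == "locatedIn" and c2 == c:
                fact = (x, "bornInCountry", k)
                if fact not in kg:
                    inferred.add(fact)
    return inferred


# Toy KG (hypothetical data).
kg = {
    ("Ada", "bornIn", "London"),
    ("Alan", "bornIn", "London"),
    ("London", "locatedIn", "UK"),
    ("Alan", "bornInCountry", "UK"),  # already present, so not re-derived
}

new_facts = apply_rule(kg)
print(new_facts)
```

In a curation setting, facts derived this way would typically be treated as candidates for a human curator to review rather than being added automatically.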

In this thesis, we present different tools that can be utilized in the process of continuous creation and curation of KGs.

We first present an approach designed to create a KG in the accounting field by matching entities. The proposed approach starts by extracting entities from auditing documents and then finds the links between related entities. This is especially challenging because auditing entities can have different granularities, such as activities, taxonomies, and topics.

We then introduce methods for the continuous curation of KGs. We present an algorithm for conditional rule mining and apply it to large graphs. Our results show that conditional rules can help human curators find more accurate rules for specific types of entities.

Next, we describe RuleHub, an extensible corpus of rules for public KGs which provides functionalities for the archival and the retrieval of rules. RuleHub defines different measures for capturing the confidence and the quality of each rule.
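To give a sense of what a rule quality measure can look like, the sketch below computes the support and the standard confidence of a simple rule `body(x, y) ⇒ head(x, y)` over a toy KG. These two measures are widely used in the rule mining literature; whether RuleHub uses exactly these definitions is an assumption here. The function `rule_stats`, the predicate names, and the data are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def rule_stats(kg: Set[Triple], body_pred: str, head_pred: str) -> Tuple[int, float]:
    """Support and standard confidence of body_pred(x, y) => head_pred(x, y).

    support    = number of (x, y) pairs satisfying both body and head
    confidence = support / number of (x, y) pairs satisfying the body
    """
    body_pairs = {(s, o) for (s, p, o) in kg if p == body_pred}
    head_pairs = {(s, o) for (s, p, o) in kg if p == head_pred}
    support = len(body_pairs & head_pairs)
    confidence = support / len(body_pairs) if body_pairs else 0.0
    return support, confidence


# Toy KG (hypothetical data). The fact locatedIn(Bern, Switzerland) is missing.
kg = {
    ("Rome", "capitalOf", "Italy"), ("Rome", "locatedIn", "Italy"),
    ("Paris", "capitalOf", "France"), ("Paris", "locatedIn", "France"),
    ("Bern", "capitalOf", "Switzerland"),
    ("Lyon", "locatedIn", "France"),
}

support, confidence = rule_stats(kg, "capitalOf", "locatedIn")
print(support, round(confidence, 3))
```

Here the rule is correct in reality, yet its confidence falls below 1 because the KG is incomplete. This is one reason a single measure may not be enough to judge a rule.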

We also report methods for using logical rules in two different applications: teaching soft rules to pre-trained language models (RuleBert) and explainable fact checking (ExpClaim).

Type: Thesis

Date: 2021-12-08

Department: Data Science

Eurecom Ref: 6647