Saturday, January 25, 2020

Combining the data of multiple databases in a single natural language query

In our time we have at our disposal a growing number of open data sources. We can query these individually, and that's nice. But wouldn't it be awesome if we could make queries that combined information from diverse data sources at once?

Suppose that you have two or open data databases at your disposal. One contains information about the relations between famous persons, the other about books and authors. What if you were to ask:
Did Charles Darwin's grandfather write "The Loves of the Plants"?

Database 1 might hold

parent(`a11`, `a34`)
parent(`a34`, `a71`)
name(`a11`, "Erasmus Darwin")
name(`a71`, "Charles Darwin")

and database 2 might hold

author(`x21-r11`, `w909`)
person(`x21-r11`, "Darwin", "Erasmus")
work(`w909`, "The Loves of the Plants")

You can see that one part of the information necessary to answer the question is available in database 1, while the order part is present in database 2, but it is not clear how these parts could be combined. Is it possible to combine these two to answer the question?

Shared identities

I think the answer to this problem can be found in what I will call shared identities: database independent ids that can be used to reason about entities that are present in more than one database.

The shared identity consists of an id and a number of database ids. The shared identities that would be used in the example above could be

[type: person, shared-id: darwin-105, db-ids: {db1: a71}]
[type: person, shared-id: darwin-99, db-ids: {db1: a34}]
[type: person, shared-id: darwin-21, db-ids: {db1: a11, db2: x21-r11}]
The shared identity has an id ("darwin-21") that is present in neither database, yet it can be used to populate the query. At the same time the ids of the databases where this entity is represented are available ("db1", "db2").

How does that work?

When the question is processed, a system will first look up the names "Charles Darwin" and "The Loves of the Plants" in the databases. This results in "a71" from db1 and "w909" from db2.

Next, the system will look up the shared identities from these two entities. Now most entity types do not have shared identities. "Work" for example only occurs in database 2, so there's no need for an extra super-database construct. The id from db2, "w909", can be used as shared identity for the work "The loves of the Plants". Persons, on the other hand, are represented in both databases, and so we look up the shared identity for "a71", which is "darwin-105".

Our query Did Charles Darwin's grandfather write "The Loves of the Plants"? will then look simply like this:

parent(E1, E2)
parent(E2, `darwin-105`)
author(E1, `w909`)

Or: did the parent of the parent of `darwin-105` author `w909`?

When this query is processed, the system will present the relations to the underlying databases.

parent(E2, `darwin-105`)
 contains a shared identity. It's id's will be looked up and be replaced before the relation is passed to the database.

parent(E2, `a71`)
The database will return a34 and the system will look up its shared id: `darwin-99`. E2 will be bound to `darwin-99`.

The system will work with shared identities everywhere, except when it deals with an individual database directly.

parent(E1, E2)
will then be

parent(E1, `darwin-99`)

and E1 will be found to be `darwin-21`. In order to resolve author(`darwin-21`, `w909`), `darwin-21` will need to transformed into the database id `x21-r11`.

Who creates the shared identities?

Shared identities are stored as mapping tables: database-id <==> shared id, one table for each entity type in a database.

Only entities that live in multiple databases need shared identities. For entities that live in a single database we can still use the term "shared identity" but only use the database id. No mapping tables are used.

The mapping tables can be constructed manually or be generated by rules or heuristics. It must be done off-line, not during the query, because it takes too much time.

The tables used in the example above can be constructed as follows: for each person, take the name from db1, check if the same name occurs in db2, by appending first name and last name. If so, add an entry to the person mapping tables of db1 and db2. Take the unique database id. Make up a shared identity, it doesn't matter what.

db1: darwin-21 = a11
db2: darwin-21 = x21-r11

If a person occurs in only one database, just add the mapping for that single database.


On SQLAlchemy

I've been using SQLAlchemy and reading about it for a few months now, and I don't get it. I don't mean I don't get SQLAlchem...