Unstructured Text: 2020

Saturday, November 28, 2020

The right way of looking up names in a database while understanding a human sentence - this is it!

This is about natural language processing. Parsing a sentence. That holds names. Of persons, or cars, or whatever. When these names are linked to ID's in the database.

Let's just get to the point. These are the steps you need to take to parse a sentence, turn it into a semantic representation and resolve the names in the input to ID's in the database.

Parse
Relationize
Find Sorts
Find Names

The first thing to note is that these steps can be taken sequentially. It is not necessary to combine them.

Parse

Parsing the sentence is done with a parser using a grammar that contains rewrite rules. Each rewrite rule has pieces of semantics attached. I suggest you use Earley's algorithm. Some example syntax rule with semantics (= sense):

{ rule: s(S1) -> interrogative(S1),                                     
  sense: go:intent(question) }
  
{ rule: interrogative(P1) -> 'who' copula(_) np(E1),                    
  sense: go:quant_check($np, go:intent(who, E1) dom:person(E1)) }
  
{ rule: proper_noun_group(N1) -> proper_noun(N1) proper_noun(N1) proper_noun(N1) }
{ rule: proper_noun_group(N1) -> proper_noun(N1) proper_noun(N1) }
{ rule: proper_noun_group(N1) -> proper_noun(N1) }

When the parse is complete you have one or more syntax trees and each node in the tree may have semantic attachments.

In the rules above, take a special look at the `proper_noun_group` and `proper_noun` categories. The `proper_noun_group` matches a multi-word name (there are three rules, one for 1 word, one for 2 words, and one for 3 words). The `proper_noun` category matches a single word in the name.

Insight 1: the `proper_noun` category matches any word.

That's right: this grammar has a category that matches any word. Why? Because names can be anything, from "Michael Jackson" to "IBM", "iPhone", "the miracles", "X11", and of course the legendary band the the (i mean The The). If these names are in a database, there is no need to guess which words are names, just look them up. This is how to do it.

You may be worried, nay, you should be worried, that this approach leads to many, many parse trees. And yes, this is true, a little, but not much. It leads so some more parse trees, but not a lot. And this is because of

Insight 2: syntax constrains the number of possible parse trees enormously

While any word can be a proper noun, and thus a noun, these proper nouns have to fit in with the rest of the parse tree. Once the parse is complete, only valid parse trees remain. Thus reducing the number of parse trees enormously.

Relationize and extract names

Combine the semantic attachments to create a single semantic representation of the sentence. No need for the syntax anymore.

When you walk through the syntax tree to combine the semantic representations into a single representation, you occasionally come across these `proper_noun_group` nodes that matched names. When you do, pick up the name-words from the child-nodes and append them into the full multi-word name. Create a mapping from variable to name, like this

E1: "Michael Jackson"
E2: "Lisa Marie Presley"

The variables refer to the entities that form the arguments of the relations in the semantic representation. Here's a simple semantic representation that we can use as an example:

marry(E1, E2)

We must now continue to find the database ID with these variables so that we may replace the variables with database ID's, like this

marry(19932, 20021)

Once we have that, we can check this relation to the database entry. But how do we find the name in the database?

Find sorts

Usually there is some table that maps names to ids, of course. And in the simple case where you need only a single entity type, this is straightforward. But I want to go a bit beyond that and consider the case where you have tens or hundreds of entity types, each with its own tables. Must you look up the name in each of these tables? No. And what if the name is present in multiple tables? Ouch.

The proper way to address this is to find out the sorts of the entities involved. The sort is the type of the entity.

Each relation has a semantic selection that constrains its sorts. For example:

marry(person, person)

Once you find this out, you can map the variables to sorts, like this:

E1: person
E2: person

Now you know that its two persons that married. And you can look up the names in the person table, not the country table.

Find names

In the last step you take the variables, the names and the sorts, and you check which database supports which sorts and which table belongs to that particular sort. You may need database specific information about how to extract the ID from the table. But at least you know where to look for the name.

Database 1:

sort = person, locate = relation(Id, _, Name, _, _)

At this point you know the exact table you need to look for the name, and you need to do only a single query to find the ID with the name. When this is done, you have the variable - ID link, like this:

E1: 19932
E2: 20021

And you can replace the variables with constants:

marry(19932, 20021)

As you can see, this representation does not have naming information in it any more. It has direct links to the primary key of the tables. And you know what this means: speed! And also

Insight 3: resolve names before executing the semantic representation

Afterword

Why did I use this slightly hysterical way of writing an article? Because this is a hard subject to get attention. And because it took me five years to get to this point, and it was a real revelation when I found out. It finally clicked. It all came together. It's so simple, once you see it.

I hope you enjoyed this and it makes you want to use it. Good luck!

Saturday, January 25, 2020

Combining the data of multiple databases in a single natural language query

In our time we have at our disposal a growing number of open data sources. We can query these individually, and that's nice. But wouldn't it be awesome if we could make queries that combined information from diverse data sources at once?

Suppose that you have two or open data databases at your disposal. One contains information about the relations between famous persons, the other about books and authors. What if you were to ask:

Did Charles Darwin's grandfather write "The Loves of the Plants"?

Database 1 might hold

parent(`a11`, `a34`)
parent(`a34`, `a71`)
name(`a11`, "Erasmus Darwin")
name(`a71`, "Charles Darwin")

and database 2 might hold

author(`x21-r11`, `w909`)
person(`x21-r11`, "Darwin", "Erasmus")
work(`w909`, "The Loves of the Plants")

You can see that one part of the information necessary to answer the question is available in database 1, while the order part is present in database 2, but it is not clear how these parts could be combined. Is it possible to combine these two to answer the question?

Shared identities

I think the answer to this problem can be found in what I will call shared identities: database independent ids that can be used to reason about entities that are present in more than one database.

The shared identity consists of an id and a number of database ids. The shared identities that would be used in the example above could be

[type: person, shared-id: darwin-105, db-ids: {db1: a71}]
[type: person, shared-id: darwin-99, db-ids: {db1: a34}]
[type: person, shared-id: darwin-21, db-ids: {db1: a11, db2: x21-r11}]

The shared identity has an id ("darwin-21") that is present in neither database, yet it can be used to populate the query. At the same time the ids of the databases where this entity is represented are available ("db1", "db2").

How does that work?

When the question is processed, a system will first look up the names "Charles Darwin" and "The Loves of the Plants" in the databases. This results in "a71" from db1 and "w909" from db2.

Next, the system will look up the shared identities from these two entities. Now most entity types do not have shared identities. "Work" for example only occurs in database 2, so there's no need for an extra super-database construct. The id from db2, "w909", can be used as shared identity for the work "The loves of the Plants". Persons, on the other hand, are represented in both databases, and so we look up the shared identity for "a71", which is "darwin-105".

Our query Did Charles Darwin's grandfather write "The Loves of the Plants"? will then look simply like this:

parent(E1, E2)
parent(E2, `darwin-105`)
author(E1, `w909`)

Or: did the parent of the parent of `darwin-105` author `w909`?

When this query is processed, the system will present the relations to the underlying databases.

parent(E2, `darwin-105`)

contains a shared identity. It's id's will be looked up and be replaced before the relation is passed to the database.

parent(E2, `a71`)

The database will return a34 and the system will look up its shared id: `darwin-99`. E2 will be bound to `darwin-99`.

The system will work with shared identities everywhere, except when it deals with an individual database directly.

parent(E1, E2)

will then be

parent(E1, `darwin-99`)

and E1 will be found to be `darwin-21`. In order to resolve author(`darwin-21`, `w909`), `darwin-21` will need to transformed into the database id `x21-r11`.

Who creates the shared identities?

Shared identities are stored as mapping tables: database-id <==> shared id, one table for each entity type in a database.

Only entities that live in multiple databases need shared identities. For entities that live in a single database we can still use the term "shared identity" but only use the database id. No mapping tables are used.

The mapping tables can be constructed manually or be generated by rules or heuristics. It must be done off-line, not during the query, because it takes too much time.

The tables used in the example above can be constructed as follows: for each person, take the name from db1, check if the same name occurs in db2, by appending first name and last name. If so, add an entry to the person mapping tables of db1 and db2. Take the unique database id. Make up a shared identity, it doesn't matter what.

db1: darwin-21 = a11
db2: darwin-21 = x21-r11

If a person occurs in only one database, just add the mapping for that single database.

Unstructured Text