Unstructured Text: November 2020

This is about natural language processing. Parsing a sentence. That holds names. Of persons, or cars, or whatever. When these names are linked to ID's in the database.

Let's just get to the point. These are the steps you need to take to parse a sentence, turn it into a semantic representation and resolve the names in the input to ID's in the database.

Parse
Relationize
Find Sorts
Find Names

The first thing to note is that these steps can be taken sequentially. It is not necessary to combine them.

Parse

Parsing the sentence is done with a parser using a grammar that contains rewrite rules. Each rewrite rule has pieces of semantics attached. I suggest you use Earley's algorithm. Some example syntax rule with semantics (= sense):

{ rule: s(S1) -> interrogative(S1),                                     
  sense: go:intent(question) }
  
{ rule: interrogative(P1) -> 'who' copula(_) np(E1),                    
  sense: go:quant_check($np, go:intent(who, E1) dom:person(E1)) }
  
{ rule: proper_noun_group(N1) -> proper_noun(N1) proper_noun(N1) proper_noun(N1) }
{ rule: proper_noun_group(N1) -> proper_noun(N1) proper_noun(N1) }
{ rule: proper_noun_group(N1) -> proper_noun(N1) }

When the parse is complete you have one or more syntax trees and each node in the tree may have semantic attachments.

In the rules above, take a special look at the `proper_noun_group` and `proper_noun` categories. The `proper_noun_group` matches a multi-word name (there are three rules, one for 1 word, one for 2 words, and one for 3 words). The `proper_noun` category matches a single word in the name.

Insight 1: the `proper_noun` category matches any word.

That's right: this grammar has a category that matches any word. Why? Because names can be anything, from "Michael Jackson" to "IBM", "iPhone", "the miracles", "X11", and of course the legendary band the the (i mean The The). If these names are in a database, there is no need to guess which words are names, just look them up. This is how to do it.

You may be worried, nay, you should be worried, that this approach leads to many, many parse trees. And yes, this is true, a little, but not much. It leads so some more parse trees, but not a lot. And this is because of

Insight 2: syntax constrains the number of possible parse trees enormously

While any word can be a proper noun, and thus a noun, these proper nouns have to fit in with the rest of the parse tree. Once the parse is complete, only valid parse trees remain. Thus reducing the number of parse trees enormously.

Relationize and extract names

Combine the semantic attachments to create a single semantic representation of the sentence. No need for the syntax anymore.

When you walk through the syntax tree to combine the semantic representations into a single representation, you occasionally come across these `proper_noun_group` nodes that matched names. When you do, pick up the name-words from the child-nodes and append them into the full multi-word name. Create a mapping from variable to name, like this

E1: "Michael Jackson"
E2: "Lisa Marie Presley"

The variables refer to the entities that form the arguments of the relations in the semantic representation. Here's a simple semantic representation that we can use as an example:

marry(E1, E2)

We must now continue to find the database ID with these variables so that we may replace the variables with database ID's, like this

marry(19932, 20021)

Once we have that, we can check this relation to the database entry. But how do we find the name in the database?

Find sorts

Usually there is some table that maps names to ids, of course. And in the simple case where you need only a single entity type, this is straightforward. But I want to go a bit beyond that and consider the case where you have tens or hundreds of entity types, each with its own tables. Must you look up the name in each of these tables? No. And what if the name is present in multiple tables? Ouch.

The proper way to address this is to find out the sorts of the entities involved. The sort is the type of the entity.

Each relation has a semantic selection that constrains its sorts. For example:

marry(person, person)

Once you find this out, you can map the variables to sorts, like this:

E1: person
E2: person

Now you know that its two persons that married. And you can look up the names in the person table, not the country table.

Find names

In the last step you take the variables, the names and the sorts, and you check which database supports which sorts and which table belongs to that particular sort. You may need database specific information about how to extract the ID from the table. But at least you know where to look for the name.

Database 1:

sort = person, locate = relation(Id, _, Name, _, _)

At this point you know the exact table you need to look for the name, and you need to do only a single query to find the ID with the name. When this is done, you have the variable - ID link, like this:

E1: 19932
E2: 20021

And you can replace the variables with constants:

marry(19932, 20021)

As you can see, this representation does not have naming information in it any more. It has direct links to the primary key of the tables. And you know what this means: speed! And also

Insight 3: resolve names before executing the semantic representation

Afterword

Why did I use this slightly hysterical way of writing an article? Because this is a hard subject to get attention. And because it took me five years to get to this point, and it was a real revelation when I found out. It finally clicked. It all came together. It's so simple, once you see it.

I hope you enjoyed this and it makes you want to use it. Good luck!

Unstructured Text

Saturday, November 28, 2020

The right way of looking up names in a database while understanding a human sentence - this is it!

Parse

Relationize and extract names

Find sorts

Find names

Afterword

On SQLAlchemy

Report Abuse