Let's just get to the point. These are the steps you need to take to parse a sentence, turn it into a semantic representation and resolve the names in the input to ID's in the database.
- Parse
- Relationize
- Find Sorts
- Find Names
The first thing to note is that these steps can be taken sequentially. It is not necessary to combine them.
Parse
Parsing the sentence is done with a parser using a grammar that contains rewrite rules. Each rewrite rule has pieces of semantics attached. I suggest you use Earley's algorithm. Some example syntax rule with semantics (= sense):
{ rule: s(S1) -> interrogative(S1), sense: go:intent(question) } { rule: interrogative(P1) -> 'who' copula(_) np(E1), sense: go:quant_check($np, go:intent(who, E1) dom:person(E1)) } { rule: proper_noun_group(N1) -> proper_noun(N1) proper_noun(N1) proper_noun(N1) } { rule: proper_noun_group(N1) -> proper_noun(N1) proper_noun(N1) } { rule: proper_noun_group(N1) -> proper_noun(N1) }
When the parse is complete you have one or more syntax trees and each node in the tree may have semantic attachments.
In the rules above, take a special look at the `proper_noun_group` and `proper_noun` categories. The `proper_noun_group` matches a multi-word name (there are three rules, one for 1 word, one for 2 words, and one for 3 words). The `proper_noun` category matches a single word in the name.
In the rules above, take a special look at the `proper_noun_group` and `proper_noun` categories. The `proper_noun_group` matches a multi-word name (there are three rules, one for 1 word, one for 2 words, and one for 3 words). The `proper_noun` category matches a single word in the name.
Insight 1: the `proper_noun` category matches any word.
That's right: this grammar has a category that matches any word. Why? Because names can be anything, from "Michael Jackson" to "IBM", "iPhone", "the miracles", "X11", and of course the legendary band the the (i mean The The). If these names are in a database, there is no need to guess which words are names, just look them up. This is how to do it.
You may be worried, nay, you should be worried, that this approach leads to many, many parse trees. And yes, this is true, a little, but not much. It leads so some more parse trees, but not a lot. And this is because of
Insight 2: syntax constrains the number of possible parse trees enormously
While any word can be a proper noun, and thus a noun, these proper nouns have to fit in with the rest of the parse tree. Once the parse is complete, only valid parse trees remain. Thus reducing the number of parse trees enormously.
Relationize and extract names
Combine the semantic attachments to create a single semantic representation of the sentence. No need for the syntax anymore.
When you walk through the syntax tree to combine the semantic representations into a single representation, you occasionally come across these `proper_noun_group` nodes that matched names. When you do, pick up the name-words from the child-nodes and append them into the full multi-word name. Create a mapping from variable to name, like this
- E1: "Michael Jackson"
- E2: "Lisa Marie Presley"
The variables refer to the entities that form the arguments of the relations in the semantic representation. Here's a simple semantic representation that we can use as an example:
marry(E1, E2)
We must now continue to find the database ID with these variables so that we may replace the variables with database ID's, like this
marry(19932, 20021)
Once we have that, we can check this relation to the database entry. But how do we find the name in the database?
Find sorts
Usually there is some table that maps names to ids, of course. And in the simple case where you need only a single entity type, this is straightforward. But I want to go a bit beyond that and consider the case where you have tens or hundreds of entity types, each with its own tables. Must you look up the name in each of these tables? No. And what if the name is present in multiple tables? Ouch.
The proper way to address this is to find out the sorts of the entities involved. The sort is the type of the entity.
Each relation has a semantic selection that constrains its sorts. For example:
marry(person, person)
Once you find this out, you can map the variables to sorts, like this:
- E1: person
- E2: person
Now you know that its two persons that married. And you can look up the names in the person table, not the country table.
Find names
In the last step you take the variables, the names and the sorts, and you check which database supports which sorts and which table belongs to that particular sort. You may need database specific information about how to extract the ID from the table. But at least you know where to look for the name.
Database 1:
sort = person, locate = relation(Id, _, Name, _, _)
At this point you know the exact table you need to look for the name, and you need to do only a single query to find the ID with the name. When this is done, you have the variable - ID link, like this:
- E1: 19932
- E2: 20021
And you can replace the variables with constants:
marry(19932, 20021)
As you can see, this representation does not have naming information in it any more. It has direct links to the primary key of the tables. And you know what this means: speed! And also
Insight 3: resolve names before executing the semantic representation
Afterword
Why did I use this slightly hysterical way of writing an article? Because this is a hard subject to get attention. And because it took me five years to get to this point, and it was a real revelation when I found out. It finally clicked. It all came together. It's so simple, once you see it.
I hope you enjoyed this and it makes you want to use it. Good luck!