One Big Query vs Many Small QueriesFor Nli-Go I chose to send many little requests to the database. Each request deals with a single relation and has no joins and aggregations. For a triple store such as SPARQL, each relation just has two arguments. In this case there are only four queries per relation, depending on the variables that were bound in advance:
SELECT A, B FROM RELATION_P SELECT A FROM RELATION_P WHERE B = Y SELECT B FROM RELATION_P WHERE A = X SELECT 1 FROM RELATION_P WHERE A = X AND B = Y
There are major benefits of the small queries approach:
- Very little knowledge is required of the idiosyncrasies of the database and of the database management system.
- Given that the tables are properly indexed, all queries are quick.
- It is possible to add extra logic to the user query not available from the database.
- It is possible to combine the results of multiple databases for a single user query. These databases may have completely types.
- The system does not make use the processing power of the database.
- The system need to do (very) many queries
- More knowledge of the nli system is required.
OptimizationTo me the benefits outweigh the drawbacks, but in order to have a performing system, I need to deal with the fact that the database does not help in creating the fastest query execution.
This comes down to the fact that the Query Optimizer of the database is not used for the full user query. Mainly, this means that the queried relations are not presented to the database in the most efficient order.
A simple example will make this clearer. Suppose the following two queries were created from a user input sentence "who married Lord Byron?"
SELECT A, B FROM MARRIED_TO SELECT 1 FROM NAME WHERE A = ? AND B = "Lord Byron"The first query asks for all A, B involved in a marriage relation. The second query asks for the ID of a person whose name is given. When executed in the shown order, the first query yields, say, 40.000 rows. Next, each of the resulting values is entered into the second query. The second query is executed 40.000 times, each time with different values. As you can see, even a simple set of relations gets out of control easily.
Since the database does not do query optimization, the nli system needs to perform this task.
David H.D. Warren wrote an interesting article about query optimization for CHAT-80, a Prolog based natural language interface. It's called
Efficient Processing of Interactive Relational Database Queries Expressed in LogicHis algorithm to determine the order of the simple queries consists of two steps:
- Calculate the cost for each query
- Order the queries by cost, lowest first.
For the query
SELECT ID, NAME FROM PERSONthe cost is calculated as
size(PERSON)where "size(PERSON)" is the number of rows in PERSON. Note that the query has no constraints.
For the query
SELECT NAME FROM PERSON WHERE ID = 8721the cost is calculated as
size(PERSON) / size_unique(ID)Where "size(PERSON)" is the number of records in the table PERSON, and "size_unique(ID)" is the number of unique values in the column ID. Note that the query has one constraint: ID.
For the query
SELECT NAME FROM PERSON WHERE COUNTRY = "England" AND AGE = 87the cost is calculated as
size(PERSON) / ( size_unique(COUNTRY) * size_unique(AGE) )Where "size(PERSON)" is the number of records in the table PERSON, and "size_unique(COUNTRY)" is the number of unique values in the column COUNTRY. Note that the query has two constraints: COUNTRY and AGE.
The resultThe cost of a bound NAME query is much lower than that for an unbound MARRIED_TO query, hence the NAME query will be scheduled for execution first. This query returns only a single row, and the second query will need to be executed only once.
In this algorithm, queries with many unbound variables are very costly. With more bound variables, the cost drops rapidly. It also takes into account that tables with less rows are cheaper than tables with many rows and that columns with many distinct values (higher cardinality) are cheaper than those with little distinct values.
It is a very simple, yet efficient way of optimizing conjunctive queries.