Thursday, August 31, 2017

Query Optimization

When building a natural language interface system you must ask yourself if you are going to send many simple queries to the database, or a single complex query and have the database execute it a single run.

One Big Query vs Many Small Queries

For Nli-Go I chose to send many little requests to the database. Each request deals with a single relation and has no joins and aggregations. For a triple store such as SPARQL, each relation just has two arguments. In this case there are only four queries per relation, depending on the variables that were bound in advance:

SELECT A, B FROM RELATION_P
SELECT A    FROM RELATION_P WHERE           B = Y
SELECT    B FROM RELATION_P WHERE A = X
SELECT 1    FROM RELATION_P WHERE A = X AND B = Y

There are major benefits of the small queries approach:
  • Very little knowledge is required of the idiosyncrasies of the database and of the database management system.
  • Given that the tables are properly indexed, all queries are quick.
  • It is possible to add extra logic to the user query not available from the database.
  • It is possible to combine the results of multiple databases for a single user query. These databases may have completely types.
There are also some major drawbacks, of course:
  • The system does not make use the processing power of the database.
  • The system need to do (very) many queries
  • More knowledge of the nli system is required.

Optimization

To me the benefits outweigh the drawbacks, but in order to have a performing system, I need to deal with the fact that the database does not help in creating the fastest query execution.
This comes down to the fact that the Query Optimizer of the database is not used for the full user query. Mainly, this means that the queried relations are not presented to the database in the most efficient order. 
A simple example will make this clearer. Suppose the following two queries were created from a user input sentence "who married Lord Byron?"
SELECT A, B FROM MARRIED_TO
SELECT 1    FROM NAME WHERE           A = ? AND B = "Lord Byron"
The first query asks for all A, B involved in a marriage relation. The second query asks for the ID of a person whose name is given. When executed in the shown order, the first query yields, say, 40.000 rows. Next, each of the resulting values is entered into the second query. The second query is executed 40.000 times, each time with different values. As you can see, even a simple set of relations gets out of control easily.
Since the database does not do query optimization, the nli system needs to perform this task.
David H.D. Warren wrote an interesting article about query optimization for CHAT-80, a Prolog based natural language interface. It's called
Efficient Processing of Interactive Relational Database Queries Expressed in Logic
His algorithm to determine the order of the simple queries consists of two steps:
  • Calculate the cost for each query
  • Order the queries by cost, lowest first.
The cost is calculated by taking the size of the relation, and dividing it by the product of the number of distinct values in each of the bound arguments. I will explain this with examples:
For the query
SELECT ID, NAME FROM PERSON
the cost is calculated as
size(PERSON)
where "size(PERSON)" is the number of rows in PERSON. Note that the query has no constraints.
For the query
SELECT NAME FROM PERSON WHERE ID = 8721
the cost is calculated as
size(PERSON) / size_unique(ID)
Where "size(PERSON)" is the number of records in the table PERSON, and "size_unique(ID)" is the number of unique values in the column ID. Note that the query has one constraint: ID.
For the query
SELECT NAME FROM PERSON WHERE COUNTRY = "England" AND AGE = 87
the cost is calculated as
size(PERSON) / ( size_unique(COUNTRY) * size_unique(AGE) )
Where "size(PERSON)" is the number of records in the table PERSON, and "size_unique(COUNTRY)" is the number of unique values in the column COUNTRY. Note that the query has two  constraints: COUNTRY and AGE.

The result

The cost of a bound NAME query is much lower than that for an unbound MARRIED_TO query, hence the NAME query will be scheduled for execution first. This query returns only a single row, and the second query will need to be executed only once.

In this algorithm, queries with many unbound variables are very costly. With more bound variables, the cost drops rapidly. It also takes into account that tables with less rows are cheaper than tables with many rows and that columns with many distinct values (higher cardinality) are cheaper than those with little distinct values.

It is a very simple, yet efficient way of optimizing conjunctive queries.

No comments:

Post a Comment

On SQLAlchemy

I've been using SQLAlchemy and reading about it for a few months now, and I don't get it. I don't mean I don't get SQLAlchem...