Unstructured Text: Semantic web marvels in a relational database

This blog article first appeared on http://techblog.procurios.nl/k/n618/news/view/34300/14863/Semantic-web-marvels-in-a-relational-database---part-I-Case-Study.html

01 June 2009

You have heard about the semantic web. You know it is described as the future of the Web. But you are still wondering how this vision is going to make your applications better. Can it speed up application development? Can it help you to build complex datastructures? Can you use Object Oriented principles? This article shows how it can be done. And more.

By Patrick van Bergen

The semantic web is framework developed by the W3C under supervision of Tim Berners Lee. Its basic assumption is that data should be self-descriptive in a global way. That means that data does not just express numbers, dates and text, it also explicitly expresses the types of relationship these fields have for their objects. Using this uniform datastructure, it will be easier to interchange data between different servers, and most of all, data can be made accessible to global search engines.

That is a big thing. But is that all? Can't you just provide an RDF import / export tool for your data and be done? Are there any intrinsic reasons why you would base you entire datastructure on the semantic web?

In a series of two articles I will try to explain how we at Procurios implemented semantic web concepts, what the theoretical background of our implementation is, and what benefits a semantic web has over a traditional relational database. In this first article I will explain how we implemented a semantic web in a relational database (we used MySQL), added an object oriented layer on top, and even created a data revision control system from it.

Triples

In a classic relational database, data is stored in records. Each record contains multiple fields. These fields contain data that may belong to some object. The relation between the field and the object it belongs to is not represented as data in the database. It is only available as metadata in the form of the column (name, datatype, collation, foreign keys). An object is not explictly modelled, but rather via a series of linked tables.

A semantic web is a network of interrelated triples ("subject-predicate-object" triplets) whose predicates are part of the data themselves. Moreover, each object has an identifier that is not just an integer number that means only something inside the database only. It is a URI that may have a distinct meaning worldwide.

A triple is a record containing three values: either (uri, uri, uri) or (uri, uri, value). In the first form the triple relates one object to another, as in the fact "Vox inc. is a supplier" (Both "Vox inc.", "is a", and "supplier" are semantic subjects identified by a uri). In the second form the triple links a constant value to a subject, as in "Vox inc.'s phone number is 0842 020 9090". A naive implementation would look like this:

CREATE TABLE `triple` (
    `subject`                varchar(255) NOT NULL,
    `predicate`              varchar(255) NOT NULL,
    `object`                 longtext,
);

This table provides a complete implementation for the semantic web. However, it is too slow to be used in any serious application. Now, there are various ways in which this basic form can be optimized, but to my knowledge there is no best practise available. Several problems have to met:

How to identify a triple uniquely (If this is necessary for your application. The combination of subject, predicate, object itself is not unique)
How to search fast, given a subject and predicate? ("Give me the names of these set of people"?)
How to search fast, given a predicate and an object? ("Give me the persons whose name begins with `Moham`"?)

To solve these problems we came up with the following changes:

Create a single triple index table that only stores triple ids.
Create separate triple tables for each of the major datatypes needed (varchar(255), longtext, integer, double, datetime)
The triple tables reference the index table by foreign key.
Add two extra indexes for the two ways the tables are used: a subject-predicate combined key and a predicate-object combined key.

Here's the triple index table (we are using MySQL):

CREATE TABLE `triple` (
    `triple_id`                int(11) NOT NULL auto_increment,
    PRIMARY KEY (`triple_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

and here's the triple table for the datatype "datetime" (the other datatypes are handled similarly)

CREATE TABLE `triple_datetime` (
    `triple_id`                int(11) NOT NULL,
    `subject_id`               int(11) NOT NULL,
    `predicate_id`             int(11) NOT NULL,
    `object`                   datetime NOT NULL,
    `active`                   tinyint(1) NOT NULL DEFAULT '1',
    PRIMARY KEY (`triple_id`),
    KEY (`subject_id`, `predicate_id`),
    KEY (`predicate_id`, `object`),
    CONSTRAINT `triple_datetime_ibfk_1` FOREIGN KEY (`triple_id`) REFERENCES `triple` (`triple_id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The table definition should speak for itself, except for the field "active". This field is not necessary at this point, but I will need it in the next section.

The predicate_id refers to a separate "uri" table where the full uris of these predicates are stored. However, this is not necessary and the uris may be stored in the triple_longtext table as well.

The two combined keys have an interesting side-effect: the application developer never needs to be concerned again about using the right keys. Effective keys have been added by default.

To query this triplestore building SQL queries by hand may be a daunting task. It requires a special query language to be effective. More about that below.

All data of a given object can be queried by selecting all triples with a given subject id (one query per datatype triple table). That seems to be inefficient and it is: compared to the situation where an object can be stored in a single record, the triplestore is always slower. However, in a more complex situation a relational database requires you to join many tables to fetch all data. We use 5 separate queries (one per datatype table) to fetch all object data from the triplestore. This turned out faster than a single union query on the five queries. We use the same 5 queries to fetch all data of any desired number of objects. Here a the queries needed to fetch object data from three objects identified by the ids 12076, 12077, and 12078:

SELECT `object` FROM `triple_varchar` WHERE `subject_id` IN (12076, 12077, 12078);
SELECT `object` FROM `triple_longtext` WHERE `subject_id` IN (12076, 12077, 12078);
SELECT `object` FROM `triple_integer` WHERE `subject_id` IN (12076, 12077, 12078);
SELECT `object` FROM `triple_double` WHERE `subject_id` IN (12076, 12077, 12078);
SELECT `object` FROM `triple_datetime` WHERE `subject_id` IN (12076, 12077, 12078);

You can see that the object-data is fetched from the database without having to provide explicit type or attribute information. The type of the object is stored in one of the triples. This is useful in case of inheritance where the exact type of an object can only be determined at runtime.

Arrays and multilinguality

Many object attributes have an array datatype (an unordered set). To model these in a relational database you would need a separate table for each of these attributes. Querying all attributes of a series of objects including these array attributes is far from easy. In the triple store you can model an unordered set as a series of triples having the same subject and predicate and a different object. When you query all object data, you will get the array values the same way as you get the scalar values.

Multilinguality is also a hassle in relational databases. For each of the attributes that need to be available in more than one language the table structure needs to be adjusted and it is hard to avoid data duplication. In a triplestore you can treat a multilingual attribute almost like an array element. The only thing is that the predicates are similar but not the same. We use the following uri's for the representation of different language variants of an attribute: http://our-business.com/supplier/description#nl, http://our-business.com/supplier/description#en, http://our-business.com/supplier/description#de (in the tables these predicates are replaced by their integer ids for faster querying).

Data revision control

Version control is pretty common for developers when it comes to storing previous versions of their code. It allows you to track the changes of the code, revert to a previous version, and to work on the same file together. Still, when it comes to data, version control is very uncommon. And I think that is mainly because the overhead to create such a system is huge in a traditional relational database.

One of the requirements for our framework was that there should be some form of data-change history available. And when you think of it, it is actually really simple to keep track of all the changes that are made to the data if you use triples. And that's because from a version-control point of view, all that changes each revision is that some triples are added, and others are removed.

So all that is needed is two more tables, one to keep track of the revision-data, like, who made the change, when, and a short description for future reference, and another to track all the added and removed triples in this revision:

CREATE TABLE `revision` (
    `revision_id`                int(11) not null auto_increment,
    `user_id`                    int(11),
    `revision_timestamp`         int(11) not null,
    `revision_description`       varchar(255),
    PRIMARY KEY  (`revision_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE IF NOT EXISTS `mod_lime_revision_action` (
    `action_id`                   int(11) NOT NULL AUTO_INCREMENT,
    `revision_id`                 int(11) NOT NULL,
    `triple_id`                   int(11) NOT NULL,
    `action`                      enum ('ACTIVATE', 'DEACTIVATE') NOT NULL,
    `section_id`                  int(11),
    PRIMARY KEY  (`action_id`),
    CONSTRAINT `revision_triple_ibfk_1` FOREIGN KEY (`revision_id`) REFERENCES `revision` (`revision_id`) ON DELETE CASCADE,
    CONSTRAINT `revision_triple_ibfk_2` FOREIGN KEY (`triple_id`) REFERENCES `triple` (`triple_id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Each time a user changes the data, a new revision is stored in the database, along with a list of all triples that are added or deactivated, and a compact description of the change. A triple that was already available in an inactive state is made active. If no such triple was present, an active one is created. Triples are never really removed, they are only set to be inactive.

If you query the triplestore (the set of all triples), you need to ensure that only the active triples are queried.

With this information, you can:

List all revisions made to the data, showing who made the change and when, along with a small description of the change.
Revert changes back to a previous revision, by performing the revisions backwards: activate the deactivated triples, deactivate the activated triples. It is also possible to undo a single revision, that is not even the last one. But beware that revisions following it may have dependencies on it.
Work together on an object by merging the changes made by the two users using the difference in data between the start and end revisions.

Object database

Businesses are used to work with objects. A web of data needs to be structured first before it can be used for common business purposes. To this end we decided to build an object oriented layer on top of the triplestore. But even though the Web Ontology Language (OWL) was designed for this purpose, we did not use it, since we needed only a very small subset anyway and we wanted complete freedom for our modelling activities, because processing speed was very high on our priority list. I will not cover all the details here, since it is a very extensive project, but I want to mention the following features:

The database was set up as a repository: no direct database access is possible by the application developer. Object creation, modification, destruction, and querying is done via the repository API. This provided the OOP principles of information hiding, modularity.
Object types could be associated with PHP classes. This is no requirement, but it proved really easy to generate object types from PHP classes. This provided us with the principle of polymorphism.
Not only simple objects are modelled as objects (a set of triples, having the same subject), but object types as well. Furthermore, the attributes of the types are modelled as objects as well. Objects and their types can be used in the same query.
Object types can be subtyped. The triplestore allows us to query objects of a given type and all its subtypes in a straightforward way.
The attributes of objects can be subtyped as well. This allows you to add datatype restrictions to the attributes of subtypes that were not applicable on a higher level up the type hierarchy.

These features are very powerful. It is possible to build a real Object database using triples as a datastructure only. Types and attributes are treated the same as normal objects. This means that the same code can be used to manipulate normal data as wel as metadata. Also, to implement inheritance is relatively easy, since object data is not chunked in single rows any more.

Query language

After some time we felt that the simple queries we were performing on the object database were too constraining. We wanted the same power that SQL provides. And on top of that, since we continue to use normal relation tables as well, the object queries needed to be able combine the object database with the relational tables. For these reasons, the semantic web query language of choice, SPARQL, was insufficient for our purposes. We now build SQL-like queries using method chaining on a query object. The object then creates the SQL query.

I mention this because you really need to build or use a new query language when starting to work with either a triplestores or an object database. The underlying triple store is too Spartan for an application developer. The lower level SQL queries consist of many self-joins connecting the subject of one to the object of another. Very hard to understand.

Afterword

I wrote this article because I think this sort of detailed information about emerging datastructures is lacking on the internet. It is also great to work for a company (Procurios!) that agrees with me that knowledge should be given back to the community if possible. Gerbert Kaandorp from Backbase once asked me very seriously what I had done to promote the semantic web, like it was some kind of holy mission. I hope this article has made a small contribution and that it inspires some of you to build your own semantic web based object datastore. Let me know what you think!

Unstructured Text

Monday, February 1, 2021

Semantic web marvels in a relational database - part I: Case Study

01 June 2009

Triples

Arrays and multilinguality

Data revision control

Object database

Query language

Afterword

No comments:

Post a Comment

Unbound variable unification in Prolog

Report Abuse