Monday, February 1, 2021

How do I write my own parser? (for JSON)

This blog post first appeared on http://techblog.procurios.nl/k/n618/news/view/14605/14863/how-do-i-write-my-own-parser-(for-json).html

05 December 2008

If no parser is available for the file format you need, writing one yourself may be easier than you think. What file structures are manageable? What would be the design of such a parser? How do you make sure it is complete? Here we describe the process of building a JSON parser in C#, and release the source code.

By Patrick van Bergen

[Download the JSON parser / generator for C#]

The software is subject to the MIT license: you are free to use it in any way you like, as long as the license notice is kept with it.

For our synchronization module (which we use to synchronize data between diverse business applications) we chose JSON as the data-exchange format. JSON is just a little better suited to a PHP web environment than XML, because:

  • The PHP functions json_encode() and json_decode() allow you to convert data structures from and to JSON strings
  • JSON can be sent directly to the browser in an Ajax request
  • It takes up less space than XML, which is important in server-to-browser traffic.
  • A JSON string can be composed of only ASCII characters, while still being able to express all Unicode characters, thus avoiding any character-conversion issues during transport.

So JSON is very convenient for PHP. But of course we wanted to be able to synchronize with Windows applications as well, and because C# is better suited to that environment, this part of the module was written in that language. The .NET framework just didn't have its own JSON parser / encoder at the time, and the open-source software written for this task often came as a whole package of classes and constraints, while sometimes the JSON implementation wasn't even complete.

We just wanted a single class that could be imported and that used the most basic building blocks of our application: the ArrayList and the Hashtable. Also, all aspects of JSON would have to be implemented, there should be a JSON generator as well, and of course it should be fast.
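
As a sketch of what that API boils down to: JSON.JsonDecode appears in the unit test below, and a generator method, here assumed to be called JSON.JsonEncode as the decoder's counterpart, converts the same building blocks back to a string.

// Build a structure from Hashtables and ArrayLists...
Hashtable supplier = new Hashtable();
supplier["name"] = "Vox inc.";
supplier["rating"] = 8.5;

ArrayList tags = new ArrayList();
tags.Add("hardware");
tags.Add("software");
supplier["tags"] = tags;

// ...generate JSON from it, and parse it back again
string json = JSON.JsonEncode(supplier);
// json is now something like {"name":"Vox inc.","rating":8.5,"tags":["hardware","software"]}
Hashtable decoded = (Hashtable)JSON.JsonDecode(json);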

We didn't need more reasons to write our own parser. Writing a parser happens to be a very satisfying thing to do, and it is one of the best ways to learn a new programming language thoroughly, especially if you use unit testing to guarantee that the parser / generator matches the language specification exactly. JSON's specification is easy to find: the website http://www.json.org/ is as clear as one could wish for.

You start by writing the unit tests. You should really write all tests before starting the implementation, but such patience is seldom found in a programmer. You can at least start by writing some obvious tests that help you create a consistent API. This is an example of a simple object test:

string json;
Hashtable o;
bool success = true;

json = "{\"name\":123,\"name2\":-456e8}";
o = (Hashtable)JSON.JsonDecode(json);
success = success && ((double)o["name"] == 123);
success = success && ((double)o["name2"] == -456e8);

Eventually you should write all tests needed to check all aspects of the language, because your users (other programmers) will assume that the parser just works.

OK. Parsers. Parsers are associated with specialized software: so-called compiler compilers (of which Yacc is the best known). Using such software will make sure that the parser is fast, but it does not do all the work for you. What's more, it can be even easier to write the entire parser yourself than to do all the preparatory work for the compiler compiler.

A compiler compiler is needed for languages with a high level of ambiguity. A language expression is parsed from left to right. If a language contains many structures that cannot be identified at the start of the parse, it is advisable to use a tool that can manage the emerging complexity.

Unambiguous languages are better suited to building the parser manually, using recursive functions to process the recursive nature of the language. The parser looks ahead one or more tokens to identify the next construct. For JSON it is sufficient to look ahead a single token, which classifies it as an LL(1) language (see also http://en.wikipedia.org/wiki/LL_parser).

A parser takes as input a string of tokens. Tokens are the most elementary building blocks of a language, like "+", "{", "[", but also complete numbers like "-1.345e5" and strings like "'The Scottish highlander looked around.'". The parse phase is usually preceded by a tokenization phase. In our JSON parser this step is integrated into the parser, because in almost all cases it is enough to just read the next character in the string to determine the next token. This saves the allocation of a token table in memory.
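
To give an idea, here is a minimal sketch of such an integrated tokenizer. The two method names and the token constants TOKEN_NONE, TOKEN_COMMA, TOKEN_CURLY_CLOSE and TOKEN_COLON match those in the ParseObject example further down; the remaining constants (TOKEN_STRING, TOKEN_NUMBER, and so on) are assumed for the sake of the sketch.

// Returns the type of the next token without consuming it.
protected int LookAhead(char[] json, int index)
{
    // index is passed by value here, so the caller's position is untouched
    int saveIndex = index;
    return NextToken(json, ref saveIndex);
}

// Returns the type of the next token and moves the index past it (simple
// tokens only; strings and numbers are left in place for ParseString / ParseNumber).
protected int NextToken(char[] json, ref int index)
{
    // skip whitespace, then classify the token by its first character
    while (index < json.Length && Char.IsWhiteSpace(json[index])) {
        index++;
    }
    if (index == json.Length) {
        return JSON.TOKEN_NONE;
    }

    char c = json[index];
    switch (c) {
        case '{': index++; return JSON.TOKEN_CURLY_OPEN;
        case '}': index++; return JSON.TOKEN_CURLY_CLOSE;
        case '[': index++; return JSON.TOKEN_SQUARED_OPEN;
        case ']': index++; return JSON.TOKEN_SQUARED_CLOSE;
        case ',': index++; return JSON.TOKEN_COMMA;
        case ':': index++; return JSON.TOKEN_COLON;
        case '"': return JSON.TOKEN_STRING;
    }
    if (Char.IsDigit(c) || c == '-') {
        return JSON.TOKEN_NUMBER;
    }
    // "true", "false" and "null" would be recognized (and consumed) here as well
    return JSON.TOKEN_NONE;
}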

The parser takes a string as input and returns a C# data structure, consisting of ArrayLists, Hashtables, a number of scalar value types, and null. The string is processed from left to right. An index (pointer) keeps track of the current position in the string at any moment. At each level of the parse process the parser performs these steps:

  • Look ahead 1 token to determine the type of the next construct
  • Choose the function to parse the construct
  • Call this function and integrate the returned value in the construct that is currently being built.

A nice example is the recursive function "ParseObject" that parses an object:

protected Hashtable ParseObject(char[] json, ref int index)
{
    Hashtable table = new Hashtable();
    int token;

    // {
    NextToken(json, ref index);

    bool done = false;
    while (!done) {
        token = LookAhead(json, index);
        if (token == JSON.TOKEN_NONE) {
            return null;
        } else if (token == JSON.TOKEN_COMMA) {
            NextToken(json, ref index);
        } else if (token == JSON.TOKEN_CURLY_CLOSE) {
            NextToken(json, ref index);
            return table;
        } else {

            // name
            string name = ParseString(json, ref index);
            if (name == null) {
                return null;
            }

            // :
            token = NextToken(json, ref index);
            if (token != JSON.TOKEN_COLON) {
                return null;
            }

            // value
            bool success = true;
            object value = ParseValue(json, ref index, ref success);
            if (!success) {
                return null;
            }

            table[name] = value;
        }
    }

    return table;
}

The function is only called after a look-ahead has determined that the construct starts with an opening curly brace, so this token may simply be skipped. Next, the string is parsed for as long as the closing brace is not found, or until the end of the string is reached (a syntax error, but one that needs to be caught). Between the braces there is a number of "name": value pairs, separated by commas. This algorithm can be found literally in the function, which makes it very readable and thus easy to debug. The function builds a Hashtable and returns it to the calling function. The parser consists mainly of these kinds of functions.
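
The dispatch step described in the three-step list above lives in ParseValue. The following is a sketch of it; the signature matches the call in ParseObject, while ParseNumber, ParseArray and the extra token constants are assumed counterparts of the functions shown here.

protected object ParseValue(char[] json, ref int index, ref bool success)
{
    // step 1: look ahead one token; steps 2 and 3: choose and call the parse function
    switch (LookAhead(json, index)) {
        case JSON.TOKEN_STRING:
            return ParseString(json, ref index);
        case JSON.TOKEN_NUMBER:
            return ParseNumber(json, ref index);
        case JSON.TOKEN_CURLY_OPEN:
            return ParseObject(json, ref index);
        case JSON.TOKEN_SQUARED_OPEN:
            return ParseArray(json, ref index);
        case JSON.TOKEN_TRUE:
            NextToken(json, ref index);   // consume the keyword
            return true;
        case JSON.TOKEN_FALSE:
            NextToken(json, ref index);
            return false;
        case JSON.TOKEN_NULL:
            NextToken(json, ref index);
            return null;
    }
    success = false;
    return null;
}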

If you create your own parser, you will always need to take into account that the incoming string may be grammatically incorrect. Users expect the parser to be able to tell on which line the error occurred. Our parser only remembers the index, but it also contains an extra function that returns the immediate context of the position of the error, comparable to the error messages that MySQL generates.
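
Such a context function can be as simple as the following sketch (the name and the window size are illustrative, not the actual implementation):

protected string GetErrorContext(char[] json, int index)
{
    // return the text immediately surrounding the position where parsing
    // failed, comparable to MySQL's "near '...'" error messages
    const int window = 20;
    int start = Math.Max(0, index - window);
    int end = Math.Min(json.Length, index + window);
    return new string(json, start, end - start);
}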

If you want to know more about parsers, it is good to know that there is a standard work on this subject, which recently (2006) saw its second edition:

Aho, A.V., Sethi, R., and Ullman, J.D. (1986). Compilers: Principles, Techniques, and Tools.

Semantic web marvels in a relational database - part II: Comparing alternatives

This blog article first appeared on http://techblog.procurios.nl/k/n618/news/view/34441/14863/Semantic-web-marvels-in-a-relational-database---part-II-Comparing-alternatives.html

15 June 2009

In this article I will compare the basic technical details of current relational database alternatives.

By Patrick van Bergen

In the first article I explained the relational database mapping of our semantic web implementation. In this article I will place this work into perspective by exploring related techniques.

In the last few years developers have been looking for ways to overcome certain shortcomings of relational database systems. RDBMSes are general-purpose data stores that are flexible enough to store any type of data. However, there are several cases in which the relational model proves inefficient:

  • An object has many attributes (100+), many of which are optional. It would be a waste of space to store all these attributes in separate columns.
  • An object has many attributes with multiple values. Since each of these attributes needs a separate table, the object data will be distributed over many tables. This is inefficient in terms of development time and maintenance, as well as query time.
  • Class inheritance. Since most software is Object Oriented these days, the objects in code need to be mapped to the database structure. In the case of class inheritance, where attributes are inherited from superclasses, it is a big problem to store objects in, and query them from, an RDBMS efficiently.
  • Types and attributes are not objects. In an RDBMS the data of a model is separate from the metadata (attribute names, datatypes, foreign key constraints, etc.). Types and attributes are not like normal objects. This is inefficient in areas where types and attributes need to be added, changed and removed regularly, just like any other data, because separate code has to be written to manipulate and query them. In short, first-order predicate logic no longer suffices for many new applications; second-order logic is needed.
  • Scalability. This is often named as the reason to leave the RDBMS behind. However, since relational databases have been optimized for decades, they do scale. Nevertheless, in this age of global, real-time web applications, the techniques provided by RDBMS vendors may prove inadequate, or simply too expensive.

In the following I will provide a simple explanation of the basic principles of alternative database techniques, along with some pointers to more in-depth information. I hope you will forgive me my non-expert view on these subjects; for detailed information on any given subject, look elsewhere. This article is meant to be just a quick overview that illustrates the basic concepts through examples.

RDBMS, or Row-Oriented database

In a relational database management system, pieces of data are grouped together in a record. In this article I will consider the case where the data stored is meant to represent the attributes of an object. Seen this way, a record is a group of attributes of an object. Here's an example of such a table of objects:

object id | color | width | height | name
----------+-------+-------+--------+---------
3         | red   | 100   | 100    | my box
4         | green | 50    | 500    | old beam

Metadata (the column names) is shown in the header row of each table; the id columns serve as keys / foreign keys.

Need more attributes? Add more columns. Need an attribute with multiple values? Add a table and link it to the first. The RDBMS chooses speed over flexibility. Speed was a big deal 40 years ago, when this database type was designed. And it still is a big deal today. For large amounts of simple data, there is absolutely no need to leave this model.
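
For example, giving the objects above a multi-valued "tag" attribute means adding a table and a link (a sketch; the table and column names are made up for the occasion):

CREATE TABLE `object_tag` (
    `object_id`    int(11) NOT NULL,
    `tag`          varchar(255) NOT NULL,
    KEY (`object_id`)
);

-- one record per value
INSERT INTO `object_tag` (`object_id`, `tag`)
VALUES (3, 'storage'), (3, 'red'), (4, 'construction');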

Semantic net

Storing semantic information as triples is an old idea in the field of Knowledge Representation. As early as 1956, semantic nets were used for this purpose. In this technique the relations between objects are represented by plain labels. Each "record" stores only a single attribute, or one element of an array attribute. Most notable are the absence of metadata and the fact that object data is distributed over many records.

object id | predicate | value
----------+-----------+---------
3         | color     | red
3         | width     | 100
3         | height    | 100
3         | name      | my box
4         | color     | green
4         | width     | 50
4         | height    | 500
4         | name      | old beam


Need more attributes? No need to change the table structure. Need an attribute with multiple values? Same thing. 

Entity-Attribute-Value

The Entity-Attribute-Value model of knowledge representation uses some form of triples, just like the semantic web. Its primary use is described by Wikipedia as "Entity-Attribute-Value model (EAV), also known as object-attribute-value model and open schema is a data model that is used in circumstances where the number of attributes (properties, parameters) that can be used to describe a thing (an "entity" or "object") is potentially very vast, but the number that will actually apply to a given entity is relatively modest. In mathematics, this model is known as a sparse matrix."

Attribute metadata is stored in separate attribute tables, which are not triples. EAV is a sort of middle ground between semantic nets and the semantic web: attributes have explicit properties, but these are fixed in number.

EAV can be used to model classes and relationships as in EAV/CR.

EAV is used in Cloud computing databases like Amazon's SimpleDB and Google's App Engine.

object id | attribute id | value
----------+--------------+---------
3         | 1            | red
3         | 2            | 100
3         | 3            | 100
3         | 4            | my box
4         | 1            | green
4         | 2            | 50
4         | 3            | 500
4         | 4            | old beam

attribute id | name   | datatype | unique
-------------+--------+----------+-------
1            | color  | char(6)  | true
2            | width  | double   | true
3            | height | double   | true
4            | name   | string   | true


Need more attributes? Add them in the attribute table. Attributes with multiple values? No extra work. The schema of the attributes is stored explicitly in the database, but attributes are treated differently from objects.

Column-Oriented databases

From wikipedia: "A column-oriented DBMS is a database management system (DBMS) which stores its content by column rather than by row."

object id | color
----------+------
3         | red
4         | green

object id | width
----------+------
3         | 100
4         | 50

object id | height
----------+-------
3         | 100
4         | 500

object id | name
----------+---------
3         | my box
4         | old beam

Google's BigTable is based, in part, on column orientation. Its tables use reversed URIs as object and column identifiers, and have a "third dimension" in that older revisions of the data are stored in the same table.

Correlation databases

A correlation database is "value based": every constant value is stored only once. All these values are stored together, except that values are grouped by datatype. All values are indexed. "In addition to typical data values, the data value store contains a special type of data for storing relationships between tables...but with a CDBMS, the relationship is known by the dictionary and stored as a data value."

I have not found a clear example of what this datastructure looks like, but we can infer that the internal structure must look something like the following. Note: I may be completely wrong here!

The values table (actually there is one table per major datatype, i.e. one for integers, strings, dates, etc.):

value id | value
---------+----------------------
1        | red
2        | green
3        | 100
4        | 50
5        | 500
6        | my box
7        | old beam
8        | <object 1>
9        | <object 2>
10       | <relationship color>
11       | <relationship width>
12       | <relationship height>
13       | <relationship name>

And then there is at least one table containing the relationships (or: "associations") between the values. The relationships are stored as values themselves:

value id 1 | association | value id 2
-----------+-------------+-----------
8          | 10          | 1
8          | 11          | 3
8          | 12          | 3
8          | 13          | 6
9          | 10          | 2
9          | 11          | 4
9          | 12          | 5
9          | 13          | 7


Hierarchical model, Network model, Navigational database

For the sake of completeness I have to name these models. The hierarchical model stores tree-like structures only, requiring each piece of data to have a single "parent". The network model allows a piece of data to have multiple parents. Both models were superseded by the relational model, but they are still used for special-purpose applications. A navigational database allows such trees / DAGs to be traversed by following paths.

Object-Oriented databases

In an object-oriented database all attributes of a class are stored together. From what I've read on the internet I conclude that the actual storage structure of an OODBMS is sort of an implementation detail, which means that the performance characteristics of the database will depend heavily on the implementation chosen. Development of this model was first in the hands of the ODMG, but control was transferred to the Java Community Process, which built the Java Data Objects specification. This specification states the requirements for such a database, but does not prescribe the implementation.

Some special properties:

  • Class inheritance is supported in the data model.
  • Object nesting: an object can contain (not just link to) other objects

When mapped to an RDBMS, via a so-called ORM (Object Relational Mapping), objects are commonly stored in a standard relational way: one column per (single-valued) attribute. To implement inheritance, the columns of all base classes of an object are joined. This can be done at design time (create one big table containing the columns of all parent classes) or at query time (join the parent class tables).

class id | object id | color | width | height | name
---------+-----------+-------+-------+--------+---------
101      | 3         | red   | 100   | 100    | my box
101      | 4         | green | 50    | 500    | old beam

class id | class name | parent class
---------+------------+-------------
101      | Object     |
102      | Bar        | 101


Document based databases

A document-based database is a different beast altogether. It lacks a database schema completely, and a complete object is stored in a single cell. In the case of CouchDB, this is done by encoding the object (or: document) in JSON. Real-time querying of the source table is thus impossible; one needs to create views on the data.

object id | document
----------+-------------------------------------------------------------
3         | {"color":"red","width":100,"height":100,"name":"my box"}
4         | {"color":"green","width":50,"height":500,"name":"old beam"}


Triplestores

Some triplestores are publicly available. Commonly they have an RDF interface. Their performance can be measured using the Lehigh University Benchmark (LUBM). The most advanced open-source triplestores are Sesame and ARC.

object id | attribute id | value
----------+--------------+---------
3         | 101          | red
3         | 102          | 100
3         | 103          | 100
3         | 104          | my box
4         | 101          | green
4         | 102          | 50
4         | 103          | 500
4         | 104          | old beam
101       | 104          | color
102       | 104          | width
103       | 104          | height
104       | 104          | name

Very little has been made public about the way triplestores are implemented in a relational database. A laudable exception to this is the Jena2 database schema. Unfortunately, the schema appears to be very inefficient, since the URIs are not indexed but are used literally.

A charmingly simple, though seemingly resource-intensive, implementation was made for expasy4j: triples are stored in a single table, but for query speed a separate column is reserved for each datatype.

Another, somewhat better implementation was made for OpenLink Virtuoso: it uses indexed URIs, but all constants are placed in a single field with datatype "ANY".

Conclusion

I hope this article has shown you a little of why developers are looking for alternatives to the familiar RDBMS, and what forms these alternatives have currently taken. The field is quite diverse and developments are being made by many different parties. It will be interesting to see how this evolves and which alternative(s) will eventually become the successor of the relational database.


Semantic web marvels in a relational database - part I: Case Study

This blog article first appeared on http://techblog.procurios.nl/k/n618/news/view/34300/14863/Semantic-web-marvels-in-a-relational-database---part-I-Case-Study.html 

01 June 2009

You have heard about the semantic web. You know it is described as the future of the Web. But you are still wondering how this vision is going to make your applications better. Can it speed up application development? Can it help you to build complex datastructures? Can you use Object Oriented principles? This article shows how it can be done. And more.

By Patrick van Bergen

The semantic web is a framework developed by the W3C under the supervision of Tim Berners-Lee. Its basic assumption is that data should be self-descriptive in a global way. That means that data does not just express numbers, dates and text; it also explicitly expresses the type of relationship these fields have with their objects. Using this uniform data structure, it becomes easier to interchange data between different servers, and most of all, data can be made accessible to global search engines.

That is a big thing. But is that all? Can't you just provide an RDF import / export tool for your data and be done? Are there any intrinsic reasons why you would base your entire datastructure on the semantic web?

In a series of two articles I will try to explain how we at Procurios implemented semantic web concepts, what the theoretical background of our implementation is, and what benefits a semantic web has over a traditional relational database. In this first article I will explain how we implemented a semantic web in a relational database (we used MySQL), added an object oriented layer on top, and even created a data revision control system from it.

Triples

In a classic relational database, data is stored in records. Each record contains multiple fields. These fields contain data that may belong to some object. The relation between a field and the object it belongs to is not represented as data in the database; it is only available as metadata in the form of the column definition (name, datatype, collation, foreign keys). An object is not explicitly modelled either: it exists only implicitly, spread over a series of linked tables.

A semantic web is a network of interrelated triples ("subject-predicate-object" triplets) whose predicates are part of the data themselves. Moreover, each object has an identifier that is not just an integer number with meaning only inside the database: it is a URI that may have a distinct meaning worldwide.

A triple is a record containing three values: either (uri, uri, uri) or (uri, uri, value). In the first form the triple relates one object to another, as in the fact "Vox inc. is a supplier" ("Vox inc.", "is a", and "supplier" are all semantic subjects identified by a URI). In the second form the triple links a constant value to a subject, as in "Vox inc.'s phone number is 0842 020 9090". A naive implementation would look like this:

CREATE TABLE `triple` (
    `subject`      varchar(255) NOT NULL,
    `predicate`    varchar(255) NOT NULL,
    `object`       longtext
);

This table provides a complete implementation of the semantic web. However, it is too slow to be used in any serious application. Now, there are various ways in which this basic form can be optimized, but to my knowledge there is no best practice available. Several problems have to be met:

  • How to identify a triple uniquely, if this is necessary for your application (the combination of subject, predicate and object is itself not unique)?
  • How to search fast, given a subject and a predicate ("give me the names of this set of people")?
  • How to search fast, given a predicate and an object ("give me the persons whose name begins with `Moham`")?

To solve these problems we came up with the following changes:

  • Create a single triple index table that only stores triple ids.
  • Create separate triple tables for each of the major datatypes needed (varchar(255), longtext, integer, double, datetime)
  • The triple tables reference the index table by foreign key.
  • Add two extra indexes for the two ways the tables are used: a subject-predicate combined key and a predicate-object combined key.

Here's the triple index table (we are using MySQL):

CREATE TABLE `triple` (
    `triple_id`                int(11) NOT NULL auto_increment,
    PRIMARY KEY (`triple_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

and here's the triple table for the datatype "datetime" (the other datatypes are handled similarly):

CREATE TABLE `triple_datetime` (
    `triple_id`                int(11) NOT NULL,
    `subject_id`              int(11) NOT NULL,
    `predicate_id`          int(11) NOT NULL,
    `object`                   datetime NOT NULL,
    `active`                   tinyint(1) NOT NULL DEFAULT '1',
    PRIMARY KEY (`triple_id`),
    KEY (`subject_id`, `predicate_id`),
    KEY (`predicate_id`, `object`),
    CONSTRAINT `triple_datetime_ibfk_1` FOREIGN KEY (`triple_id`) REFERENCES `triple` (`triple_id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The table definition should speak for itself, except for the field "active". This field is not necessary at this point, but I will need it in the next section.

The predicate_id refers to a separate "uri" table where the full URIs of these predicates are stored. However, this is not necessary, and the URIs may be stored in the triple_longtext table as well.

The two combined keys have an interesting side-effect: the application developer never needs to be concerned again about using the right keys. Effective keys have been added by default.

Building SQL queries by hand to query this triplestore may be a daunting task; a special query language is needed to be effective. More about that below.

All data of a given object can be queried by selecting all triples with a given subject id (one query per datatype triple table). That seems inefficient, and it is: compared to the situation where an object can be stored in a single record, the triplestore is always slower. However, in a more complex situation a relational database also requires you to join many tables to fetch all data. We use 5 separate queries (one per datatype table) to fetch all object data from the triplestore; this turned out to be faster than a single UNION query over the five queries. We use the same 5 queries to fetch the data of any desired number of objects. Here are the queries needed to fetch the object data of three objects, identified by the ids 12076, 12077, and 12078:

SELECT `subject_id`, `predicate_id`, `object` FROM `triple_varchar` WHERE `subject_id` IN (12076, 12077, 12078);
SELECT `subject_id`, `predicate_id`, `object` FROM `triple_longtext` WHERE `subject_id` IN (12076, 12077, 12078);
SELECT `subject_id`, `predicate_id`, `object` FROM `triple_integer` WHERE `subject_id` IN (12076, 12077, 12078);
SELECT `subject_id`, `predicate_id`, `object` FROM `triple_double` WHERE `subject_id` IN (12076, 12077, 12078);
SELECT `subject_id`, `predicate_id`, `object` FROM `triple_datetime` WHERE `subject_id` IN (12076, 12077, 12078);

You can see that the object data is fetched from the database without having to provide explicit type or attribute information. The type of the object is stored in one of the triples. This is useful in the case of inheritance, where the exact type of an object can only be determined at runtime.

Arrays and multilinguality

Many object attributes have an array datatype (an unordered set). To model these in a relational database you would need a separate table for each of these attributes. Querying all attributes of a series of objects including these array attributes is far from easy. In the triple store you can model an unordered set as a series of triples having the same subject and predicate and a different object. When you query all object data, you will get the array values the same way as you get the scalar values.

Multilinguality is also a hassle in relational databases. For each attribute that needs to be available in more than one language, the table structure needs to be adjusted, and it is hard to avoid data duplication. In a triplestore you can treat a multilingual attribute almost like an array element. The only difference is that the predicates are similar but not the same. We use the following URIs for the representation of the different language variants of an attribute: http://our-business.com/supplier/description#nl, http://our-business.com/supplier/description#en, http://our-business.com/supplier/description#de (in the tables these predicates are replaced by their integer ids for faster querying).
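
Fetching one language variant then simply comes down to selecting on the right predicate id (a sketch; the id 207, standing for the #en predicate above, is illustrative):

SELECT `subject_id`, `object`
FROM `triple_longtext`
WHERE `predicate_id` = 207    -- http://our-business.com/supplier/description#en
  AND `active` = 1;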

Data revision control

Version control is pretty common for developers when it comes to storing previous versions of their code. It allows you to track changes to the code, revert to a previous version, and work together on the same file. Still, when it comes to data, version control is very uncommon, and I think that is mainly because the overhead of creating such a system is huge in a traditional relational database.

One of the requirements for our framework was that some form of data-change history should be available. And when you think of it, it is actually really simple to keep track of all changes made to the data if you use triples. That's because, from a version-control point of view, all that changes in each revision is that some triples are added and others are removed.

So all that is needed is two more tables: one to keep track of the revision data (who made the change, when, and a short description for future reference), and another to track all the triples added and removed in each revision:

CREATE TABLE `revision` (
    `revision_id`              int(11) NOT NULL auto_increment,
    `user_id`                  int(11),
    `revision_timestamp`       int(11) NOT NULL,
    `revision_description`     varchar(255),
    PRIMARY KEY (`revision_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE IF NOT EXISTS `mod_lime_revision_action` (
    `action_id`                   int(11) NOT NULL AUTO_INCREMENT,
    `revision_id`                 int(11) NOT NULL,
    `triple_id`                   int(11) NOT NULL,
    `action`                   enum ('ACTIVATE', 'DEACTIVATE') NOT NULL,
    `section_id`                  int(11),
    PRIMARY KEY  (`action_id`),
    CONSTRAINT `revision_triple_ibfk_1` FOREIGN KEY (`revision_id`) REFERENCES `revision` (`revision_id`) ON DELETE CASCADE,
    CONSTRAINT `revision_triple_ibfk_2` FOREIGN KEY (`triple_id`) REFERENCES `triple` (`triple_id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Each time a user changes the data, a new revision is stored in the database, along with a list of all triples that are activated or deactivated, and a compact description of the change. A triple that was already available in an inactive state is simply made active again. If no such triple was present, an active one is created. Triples are never really removed; they are only made inactive.
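
In SQL terms, a single change comes down to something like the following (a sketch of what our application code effectively does; the ids are illustrative):

-- record who changed what, and when
INSERT INTO `revision` (`user_id`, `revision_timestamp`, `revision_description`)
VALUES (17, UNIX_TIMESTAMP(), 'Corrected supplier phone number');
SET @revision_id = LAST_INSERT_ID();

-- the old triple is deactivated, the new one activated
UPDATE `triple_varchar` SET `active` = 0 WHERE `triple_id` = 9001;
UPDATE `triple_varchar` SET `active` = 1 WHERE `triple_id` = 9002;

-- both actions are logged as part of the revision
INSERT INTO `mod_lime_revision_action` (`revision_id`, `triple_id`, `action`)
VALUES (@revision_id, 9001, 'DEACTIVATE'), (@revision_id, 9002, 'ACTIVATE');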

If you query the triplestore (the set of all triples), you need to ensure that only the active triples are queried.

With this information, you can:

  • List all revisions made to the data, showing who made the change and when, along with a small description of the change.
  • Revert back to a previous revision, by performing the revisions backwards: activate the deactivated triples, and deactivate the activated triples. It is even possible to undo a single revision that is not the last one, but beware that revisions following it may depend on it.
  • Work together on an object, by merging the changes made by two users using the difference in data between the start and end revisions.

Object database

Businesses are used to working with objects. A web of data needs to be structured before it can be used for common business purposes. To this end we decided to build an object-oriented layer on top of the triplestore. Even though the Web Ontology Language (OWL) was designed for this purpose, we did not use it: we needed only a very small subset anyway, and we wanted complete freedom in our modelling activities, because processing speed was very high on our priority list. I will not cover all the details here, since it is a very extensive project, but I want to mention the following features:

  • The database was set up as a repository: no direct database access is possible for the application developer. Object creation, modification, destruction, and querying are done via the repository API. This provided the OOP principles of information hiding and modularity.
  • Object types could be associated with PHP classes. This is not a requirement, but it proved really easy to generate object types from PHP classes. This provided us with the principle of polymorphism.
  • Not only are simple objects modelled as objects (a set of triples having the same subject), but object types are as well. Furthermore, the attributes of these types are modelled as objects too. Objects and their types can be used in the same query.
  • Object types can be subtyped. The triplestore allows us to query objects of a given type and all its subtypes in a straightforward way.
  • The attributes of objects can be subtyped as well. This allows you to add datatype restrictions to the attributes of subtypes that do not apply higher up the type hierarchy.

These features are very powerful. It is possible to build a real object database using only triples as the datastructure. Types and attributes are treated the same as normal objects, which means that the same code can be used to manipulate normal data as well as metadata. Also, implementing inheritance is relatively easy, since object data is no longer chunked into single rows.

Query language

After some time we felt that the simple queries we were performing on the object database were too constraining. We wanted the same power that SQL provides. And on top of that, since we continue to use normal relational tables as well, the object queries needed to be able to combine the object database with the relational tables. For these reasons the semantic web query language of choice, SPARQL, was insufficient for our purposes. We now build SQL-like queries using method chaining on a query object; the object then creates the SQL query.

I mention this because you really need to build or use a new query language when you start working with either a triplestore or an object database. The underlying triplestore is too Spartan for an application developer: the lower-level SQL queries consist of many self-joins connecting the subject of one triple to the object of another, and they are very hard to understand.
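
To give an impression, the earlier question "give me the persons whose name begins with `Moham`" looks roughly like this at the SQL level, assuming that object-to-object triples (such as type membership) live in the integer table, and that 1, 2 and 55 are the ids of the 'is a' predicate, the 'name' predicate and the person type:

SELECT nm.`subject_id`, nm.`object` AS `name`
FROM `triple_integer` AS tp        -- "X is a person"
JOIN `triple_varchar` AS nm        -- "X's name is ..."
    ON nm.`subject_id` = tp.`subject_id`
WHERE tp.`predicate_id` = 1        -- 'is a'
    AND tp.`object` = 55           -- the person type
    AND nm.`predicate_id` = 2      -- 'name'
    AND nm.`object` LIKE 'Moham%'
    AND tp.`active` = 1
    AND nm.`active` = 1;

And this is the simple case: every extra attribute in the question adds another self-join of this kind.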

Afterword

I wrote this article because I think this sort of detailed information about emerging datastructures is lacking on the internet. It is also great to work for a company (Procurios!) that agrees with me that knowledge should be given back to the community if possible. Gerbert Kaandorp from Backbase once asked me very seriously what I had done to promote the semantic web, like it was some kind of holy mission. I hope this article has made a small contribution and that it inspires some of you to build your own semantic web based object datastore. Let me know what you think!

On SQLAlchemy

I've been using SQLAlchemy and reading about it for a few months now, and I don't get it. I don't mean I don't get SQLAlchem...