Monday, February 1, 2021

Semantic web marvels in a relational database - part II: Comparing alternatives

This blog article first appeared on http://techblog.procurios.nl/k/n618/news/view/34441/14863/Semantic-web-marvels-in-a-relational-database---part-II-Comparing-alternatives.html

15 June 2009

In this article I will compare the basic technical details of current relational database alternatives.

By Patrick van Bergen

In the first article I explained the relational database mapping of our semantic web implementation. In this article I will place this work into perspective by exploring related techniques.

The last few years developers are looking for ways to overcome certain shortcomings of relational database systems. RDBMSes are general purpose data stores that are flexible enough to store any type of data. However, these are several cases in which the relational model proves inefficient:

  • An object has many attributes (100+), many of which are optional. It would be wasting space to store all these attributes in separate columns.
  • Many attributes with multiple values. Since each of these attributes needs a separate table, the object data will be distributed over many tables. This is inefficient in terms of development time, maintenance, as well as query time.
  • Class inheritance. Since most software is Object Oriented these days the objects in code will need to be mapped to the database structure. In the case of class inheritance, where attributes are inherited from superclasses, it is a big problem to store objects in, and query them from, an RDBMS efficiently.
  • Types and attributes are not objects. In an RDBMS the data of a model is separate from the metadata (attribute names, datatypes, foreign key constraints, etc.). Types and attributes are not like normal objects. This is inefficient in areas where types and attributes need to be added, changed and removed regularly, just like any other data. It is inefficient to write separate code to manipulate and query types and attributes. In short, first order predicate logic no longer suffices for many new applications. The second order is needed.
  • Scalability. Is an aspect often named as the reason to leave RDBMS. However, since relational databases have been optimized for decades, they do scale. Nevertheless, in this age of global, real-time webapplications, techniques provided by RDBMS manufacturers may prove to be inadequate, or simply too expensive.

In the following I will provide a simple understanding of the basic principles of alternative database techniques, along with some pointers to more in-depth information. I hope you will forgive me my non-expert view on these subjects. For detailed information on any given subject, look elsewhere. This article is meant to be just a quick overview, aimed to waken some concepts provided by the examples.

RDBMS, or Row-Oriented database

In a relational database management system, pieces of data are grouped together in a record. In this article I will consider the case where the data stored is meant to represent the attributes of an object. Seen this way, a record is a group of attributes of an object. Here's an example of such a table of objects:

object idcolorwidthheightname
3red100100my box
4green50500old beam


Metadata is shown in gray. Keys / foreign keys are shown in bold typeface.

Need more attributes? Add more columns. Need an attribute with multiple values? Add a table and link it to the first. The RDBMS chooses speed over flexibility. Speed was a big deal 40 years ago, when this database type was designed. And it still is a big deal today. For large amounts of simple data, there is absolutely no need to leave this model.

Semantic net

Storing semantic information as triples is an old idea in the field of Knowledge Representation. As early as 1956, semantic nets were used for this purpose. In this technique the relations between objects are represented by plain labels. Each "record" stores only a single attribute, or one element of an array-attribute. Most notable are the absense of metadata and the fact that object data is distributed over many records.

object idpredicatevalue
3colorred
3width100
3height100
3name

my box

4colorgreen
4width50
4height500
4nameold beam


Need more attributes? No need to change the table structure. Need an attribute with multiple values? Same thing. 

Entity-Attribute-Value

The Entity-Attribute-Value model of knowledge representation uses some form of triples, just like the semantic web. Its primary use is described by Wikipedia as "Entity-Attribute-Value model (EAV), also known as object-attribute-value model and open schema is a data model that is used in circumstances where the number of attributes (properties, parameters) that can be used to describe a thing (an "entity" or "object") is potentially very vast, but the number that will actually apply to a given entity is relatively modest. In mathematics, this model is known as a sparse matrix."

Attribute metadata is stored in separate attribute tables, which are not triples. EAV is a sort of middle between semantic nets and semantic web: attributes have explicit properties, but these are fixed in amount.

EAV can be used to model classes and relationships as in EAV/CR.

EAV is used in Cloud computing databases like Amazon's SimpleDB and Google's App Engine.

object idattribute idvalue
31red
32100
33100
34

my box

41green
4250
43500
44old beam

 

attribute idnamedatatypeunique
1colorchar(6)true
2widthdoubletrue
3heightdoubletrue
4namestringtrue


Need more attributes? Add them in the attribute table. Attributes with multiple values? No extra work. The schema of the attributes is stored in the database explicitly, but attributes are treated different from the objects.

Column-Oriented databases

From wikipedia: "A column-oriented DBMS is a database management system (DBMS) which stores its content by column rather than by row."

object idcolor
3red
4green

 

object idwidth
3100
450

 

object idheight
3100
4500

 

object idname
3my box
4old beam

 

Google's BigTable is based, in part, on column-orientation. Their tables use reversed URI's as object and column identifiers, and have a "third dimension" in that older revisions of the data are stored in the same table.

References:

Correlation databases

correlation database is "value based": every constant value is stored only once. All these values are stored together, except that values are grouped by datatype. All values are indexed. "In addition to typical data values, the data value store contains a special type of data for storing relationships between tables...but with a CDBMS, the relationship is known by the dictionary and stored as a data value."

I have not found a clear example of what this datastructure looks like, but we can infer that the internal structure must look something like the following. Note: I may be completely wrong here!

The values-table (actually there is one table per major datatype; i.e. integers, strings, dates, etc.)

value idvalue
1red
2green
3100
450
5500
6my box
7old beam
8<object 1>
9<object 2>
10<relationship color>
11<relationship width>
12<relationship height>
13<relationship name>

 

and then there is at least a table containing the relationships (or: "associations") between the values. The relationships are stored as values themselves:

value id 1associationvalue id 2
8101
8113
8123
8136
9102
9114
9125
9137


References:

Hierarchical model, Network model, Navigational database

For the sake of completeness I have to name these models. The hierarchical model stores tree-like structures only, requiring each piece of data to have a single "parent". The network model allows a piece of data to have multiple parents. Both models were superseded by the relational model, but they are still used for special-purpose applications. A navigational database allows to traverse such trees / DAGs by following paths.

 

 

Object-Oriented databases

In an object-oriented database all attributes of a class are stored together. From what I've read on the internet I conclude that the actual storage structure of an OODBMS is sort of an implementation detail. This means that performance characteristics of the database will depend heavily on the type of implementation chosen. Development of this model was first in the hands of the ODMG, but control was transferred to the Java Community Proces that build the Java Data Objects specification. This specification names the conditions for such a database, but does not guide the implementation.

Some special properties:

  • Class inheritance is supported in the data model.
  • Object nesting: an object can contain (not just link to) other objects

Mapped to an RDBMS, a so called ORM (Object Relational Mapping), objects are commonly stored in a standard relational way: one column per (single valued) attribute. To implement inheritance, the columns of all base classes of an object are joined. This can be done at design-time (create a big table containing the columns of all parent classes) or at query-time (join parent class tables).

class idobject idcolorwidthheightname
1013red100100my box
1014green50500old beam

 

class idclass nameparent class
101Object 
102Bar101


References:

Document based databases

document based database is a different beast altogether. It lacks a database schema completely, and a complete object is stored in a single cell. In the case of CouchDB, this is done by encoding the object (or: document) in JSON. Real-time querying of the source table is thus impossible, one needs to create views on the data.

object iddocument
3{"color":"red","width":100,"height":100,"name":"my box"}
4{"color":"green","width":50,"height":500,"name":"old beam"}


References:

Triplestores

Some triplestores are publicly available. Commonly they have an RDF interface. Their performance can be measured using the Lehigh University Benchmark (LUBM). The most advanced open source triplestores are Sesame, and ARC.

object idattribute idvalue
3101red
3102100
3103100
3104my box
4101green
410250
4103500
4104old beam
101104color
102104width
103104height
104104name

 

Very little has been made public about the way triplestores are implemented in a relational database. A laudable exception to this is the Jena2 database schema. Unfortunately, the schema appears to be very inefficient, since the URIs are not indexed but are used literally.

A charmingly simple implementation that seems resource intensive was made for expasy4j: triples are stored in a single table, but for query speed, a single column is reserved for each separate datatype.

Another, somewhat better implementation was made for OpenLink Virtuoso: it uses indexed uris, but all constants are placed in a single field datatyped "ANY".

Conclusion

I hope this article has shown you a little bit why developers are looking for alternatives for the familiar RDBMS and which forms these currently have taken. Currently the field is quite diverse and developments are being made by many different parties. It will be interesting to see how this evolves and which alternative(s) will eventually become the successor of the relational database.


No comments:

Post a Comment

On SQLAlchemy

I've been using SQLAlchemy and reading about it for a few months now, and I don't get it. I don't mean I don't get SQLAlchem...