This blog article first appeared on http://techblog.procurios.nl/k/n618/news/view/34441/14863/Semantic-web-marvels-in-a-relational-database---part-II-Comparing-alternatives.html
15 June 2009
In this article I will compare the basic technical details of current relational database alternatives.
In the first article I explained the relational database mapping of our semantic web implementation. In this article I will place this work into perspective by exploring related techniques.
The last few years developers are looking for ways to overcome certain shortcomings of relational database systems. RDBMSes are general purpose data stores that are flexible enough to store any type of data. However, these are several cases in which the relational model proves inefficient:
- An object has many attributes (100+), many of which are optional. It would be wasting space to store all these attributes in separate columns.
- Many attributes with multiple values. Since each of these attributes needs a separate table, the object data will be distributed over many tables. This is inefficient in terms of development time, maintenance, as well as query time.
- Class inheritance. Since most software is Object Oriented these days the objects in code will need to be mapped to the database structure. In the case of class inheritance, where attributes are inherited from superclasses, it is a big problem to store objects in, and query them from, an RDBMS efficiently.
- Types and attributes are not objects. In an RDBMS the data of a model is separate from the metadata (attribute names, datatypes, foreign key constraints, etc.). Types and attributes are not like normal objects. This is inefficient in areas where types and attributes need to be added, changed and removed regularly, just like any other data. It is inefficient to write separate code to manipulate and query types and attributes. In short, first order predicate logic no longer suffices for many new applications. The second order is needed.
- Scalability. Is an aspect often named as the reason to leave RDBMS. However, since relational databases have been optimized for decades, they do scale. Nevertheless, in this age of global, real-time webapplications, techniques provided by RDBMS manufacturers may prove to be inadequate, or simply too expensive.
In the following I will provide a simple understanding of the basic principles of alternative database techniques, along with some pointers to more in-depth information. I hope you will forgive me my non-expert view on these subjects. For detailed information on any given subject, look elsewhere. This article is meant to be just a quick overview, aimed to waken some concepts provided by the examples.
RDBMS, or Row-Oriented database
In a relational database management system, pieces of data are grouped together in a record. In this article I will consider the case where the data stored is meant to represent the attributes of an object. Seen this way, a record is a group of attributes of an object. Here's an example of such a table of objects:
object id | color | width | height | name |
3 | red | 100 | 100 | my box |
4 | green | 50 | 500 | old beam |
Metadata is shown in gray. Keys / foreign keys are shown in bold typeface.
Need more attributes? Add more columns. Need an attribute with multiple values? Add a table and link it to the first. The RDBMS chooses speed over flexibility. Speed was a big deal 40 years ago, when this database type was designed. And it still is a big deal today. For large amounts of simple data, there is absolutely no need to leave this model.
Semantic net
Storing semantic information as triples is an old idea in the field of Knowledge Representation. As early as 1956, semantic nets were used for this purpose. In this technique the relations between objects are represented by plain labels. Each "record" stores only a single attribute, or one element of an array-attribute. Most notable are the absense of metadata and the fact that object data is distributed over many records.
object id | predicate | value |
3 | color | red |
3 | width | 100 |
3 | height | 100 |
3 | name | my box |
4 | color | green |
4 | width | 50 |
4 | height | 500 |
4 | name | old beam |
Need more attributes? No need to change the table structure. Need an attribute with multiple values? Same thing.
Entity-Attribute-Value
The Entity-Attribute-Value model of knowledge representation uses some form of triples, just like the semantic web. Its primary use is described by Wikipedia as "Entity-Attribute-Value model (EAV), also known as object-attribute-value model and open schema is a data model that is used in circumstances where the number of attributes (properties, parameters) that can be used to describe a thing (an "entity" or "object") is potentially very vast, but the number that will actually apply to a given entity is relatively modest. In mathematics, this model is known as a sparse matrix."
Attribute metadata is stored in separate attribute tables, which are not triples. EAV is a sort of middle between semantic nets and semantic web: attributes have explicit properties, but these are fixed in amount.
EAV can be used to model classes and relationships as in EAV/CR.
EAV is used in Cloud computing databases like Amazon's SimpleDB and Google's App Engine.
object id | attribute id | value |
3 | 1 | red |
3 | 2 | 100 |
3 | 3 | 100 |
3 | 4 | my box |
4 | 1 | green |
4 | 2 | 50 |
4 | 3 | 500 |
4 | 4 | old beam |
attribute id | name | datatype | unique |
1 | color | char(6) | true |
2 | width | double | true |
3 | height | double | true |
4 | name | string | true |
Need more attributes? Add them in the attribute table. Attributes with multiple values? No extra work. The schema of the attributes is stored in the database explicitly, but attributes are treated different from the objects.
Column-Oriented databases
From wikipedia: "A column-oriented DBMS is a database management system (DBMS) which stores its content by column rather than by row."
object id | color |
3 | red |
4 | green |
object id | width |
3 | 100 |
4 | 50 |
object id | height |
3 | 100 |
4 | 500 |
object id | name |
3 | my box |
4 | old beam |
Google's BigTable is based, in part, on column-orientation. Their tables use reversed URI's as object and column identifiers, and have a "third dimension" in that older revisions of the data are stored in the same table.
References:
Correlation databases
A correlation database is "value based": every constant value is stored only once. All these values are stored together, except that values are grouped by datatype. All values are indexed. "In addition to typical data values, the data value store contains a special type of data for storing relationships between tables...but with a CDBMS, the relationship is known by the dictionary and stored as a data value."
I have not found a clear example of what this datastructure looks like, but we can infer that the internal structure must look something like the following. Note: I may be completely wrong here!
The values-table (actually there is one table per major datatype; i.e. integers, strings, dates, etc.)
value id | value |
1 | red |
2 | green |
3 | 100 |
4 | 50 |
5 | 500 |
6 | my box |
7 | old beam |
8 | <object 1> |
9 | <object 2> |
10 | <relationship color> |
11 | <relationship width> |
12 | <relationship height> |
13 | <relationship name> |
and then there is at least a table containing the relationships (or: "associations") between the values. The relationships are stored as values themselves:
value id 1 | association | value id 2 |
8 | 10 | 1 |
8 | 11 | 3 |
8 | 12 | 3 |
8 | 13 | 6 |
9 | 10 | 2 |
9 | 11 | 4 |
9 | 12 | 5 |
9 | 13 | 7 |
References:
Hierarchical model, Network model, Navigational database
For the sake of completeness I have to name these models. The hierarchical model stores tree-like structures only, requiring each piece of data to have a single "parent". The network model allows a piece of data to have multiple parents. Both models were superseded by the relational model, but they are still used for special-purpose applications. A navigational database allows to traverse such trees / DAGs by following paths.
Object-Oriented databases
In an object-oriented database all attributes of a class are stored together. From what I've read on the internet I conclude that the actual storage structure of an OODBMS is sort of an implementation detail. This means that performance characteristics of the database will depend heavily on the type of implementation chosen. Development of this model was first in the hands of the ODMG, but control was transferred to the Java Community Proces that build the Java Data Objects specification. This specification names the conditions for such a database, but does not guide the implementation.
Some special properties:
- Class inheritance is supported in the data model.
- Object nesting: an object can contain (not just link to) other objects
Mapped to an RDBMS, a so called ORM (Object Relational Mapping), objects are commonly stored in a standard relational way: one column per (single valued) attribute. To implement inheritance, the columns of all base classes of an object are joined. This can be done at design-time (create a big table containing the columns of all parent classes) or at query-time (join parent class tables).
class id | object id | color | width | height | name |
101 | 3 | red | 100 | 100 | my box |
101 | 4 | green | 50 | 500 | old beam |
class id | class name | parent class |
101 | Object | |
102 | Bar | 101 |
References:
- JSR 243: Java Data Objects 2.0 specification
- Building Scalable Database Applications: Object-Oriented Design, Architectures and Implementations
- Objects, objects everywhere
Document based databases
A document based database is a different beast altogether. It lacks a database schema completely, and a complete object is stored in a single cell. In the case of CouchDB, this is done by encoding the object (or: document) in JSON. Real-time querying of the source table is thus impossible, one needs to create views on the data.
object id | document |
3 | {"color":"red","width":100,"height":100,"name":"my box"} |
4 | {"color":"green","width":50,"height":500,"name":"old beam"} |
References:
Triplestores
Some triplestores are publicly available. Commonly they have an RDF interface. Their performance can be measured using the Lehigh University Benchmark (LUBM). The most advanced open source triplestores are Sesame, and ARC.
object id | attribute id | value |
3 | 101 | red |
3 | 102 | 100 |
3 | 103 | 100 |
3 | 104 | my box |
4 | 101 | green |
4 | 102 | 50 |
4 | 103 | 500 |
4 | 104 | old beam |
101 | 104 | color |
102 | 104 | width |
103 | 104 | height |
104 | 104 | name |
Very little has been made public about the way triplestores are implemented in a relational database. A laudable exception to this is the Jena2 database schema. Unfortunately, the schema appears to be very inefficient, since the URIs are not indexed but are used literally.
A charmingly simple implementation that seems resource intensive was made for expasy4j: triples are stored in a single table, but for query speed, a single column is reserved for each separate datatype.
Another, somewhat better implementation was made for OpenLink Virtuoso: it uses indexed uris, but all constants are placed in a single field datatyped "ANY".
Conclusion
I hope this article has shown you a little bit why developers are looking for alternatives for the familiar RDBMS and which forms these currently have taken. Currently the field is quite diverse and developments are being made by many different parties. It will be interesting to see how this evolves and which alternative(s) will eventually become the successor of the relational database.
No comments:
Post a Comment