IDC
International Data Corporation

Addressing the Semantic Gap in Databases: Lazy Software and the Associative Model of Data

Carl W. Olofson
IDCBulletin #27774 - Aug 2002

Table of Contents - Abstract - Document



IDC Opinion

Opinion Q & A
What is the significance of semantics in data management, and is anyone offering more support than that afforded by relational database management systems (DBMSs)?

Flexible data management, including combined data from multiple sources, and the ability to manage data over time as the data and its definitional characteristics change require support for the semantic rules governing the data. Yet, relational DBMSs are inherently limited in their ability to provide such support.

IDC has identified Lazy Software as one vendor that offers a refreshingly different approach to the problem of structured data management, an approach that goes a long way toward addressing the increasingly difficult task of managing the wide variety of data involved in ebusiness systems today.


Document Body

Overview

When DBMSs were first conceived they were intended to provide centralized enforcement of integrity rules. They were also intended to act as clearinghouse facilities for querying and reporting on enterprise data and to provide efficient storage and retrieval of structured data. As the relational model has become dominant, database technology has focused largely on the problem of maintaining common structures of data using a simple and flexible cataloguing scheme so that such data could be efficiently shared by multiple application programs and updated by multiple concurrent users. Issues such as data integrity enforcement and ease in the query and reporting of more than simple lists of data were largely cast aside as too hard to resolve.

Now, data structures have become more complex, and there is greater emphasis on business intelligence, data integration, and the use of deep structures such as those supported by XML. The absence of semantic awareness in the DBMS is reflected in ever more complex database program code and arcane tools for seeking out, combining, formatting, and reporting on data. It has also created difficulties in building enterprise information portals (EIPs) and other enterprise information access points.

In this bulletin IDC examines the nature of semantically based structural rules for data and the deficiencies of the relational model in dealing with them for integrity assurance and ease of query and reporting. IDC then considers the associative model of data presented by Lazy Software and the extent to which it addresses these issues. IDC also looks at how the associative model is supported by Lazy Software’s two products: a post-relational DBMS called Sentences and a data collection, query, and reporting facility called LazyView.

Semantic Support in the Database

What Are Data Semantics?

Data semantics are those characteristics of data that capture its meaningful use and management. In the simplest extant forms of software support for data semantics, we find data dictionaries or repositories that can capture metadata (i.e., data about data) at a level that enables DBAs, programmers, and power users to locate, retrieve, and present meaningful information based on stored data.

Such metadata often may include business definitions and some usage context in addition to rules about format and structure. It enables programmers and DBAs to develop applications, move data from database to database, and reuse data in a variety of other ways while ensuring that the data is being used properly and in a manner consistent with its designated purpose.

Elements of Semantic Structures: Optionality, Cardinality, Dependency, Containment, Roles (Polymorphism), Recursion, and Type Hierarchies

The semantics of data imply certain rules regarding format and structure. For instance, a customer must have at least one order or a purchase history, an employee must have a supervisor, and so forth. These rules govern not just atomic data items or even entities but also associations among entities and structures of associations.

The elements of semantic structures may be broken down as follows:

How Data Semantics Can Be Related to Natural Language

A good deal of the effort in relating natural language systems to databases is in defining how the syntactic rules of statements can be mapped to the flat structures of tables, rows, columns, and relationships. If the database is fully defined using such rules as are described above, then building natural language support is limited to ensuring that the lexicon covers all the entities, attributes, and associations of the data structure and that there is an appropriately rich set of synonyms. This is because the kind of structure that results from the application of semantic rules to a database already has a form similar to that of a natural language sentence, with a subject, verb, direct object, indirect objects, adjectives, and so on.

The Role of Semantics in Database Management

A database that is driven and constrained by semantic rules offers certain benefits to the IT organization. These benefits include much better automated integrity assurance, more sophisticated query and reporting capabilities that can be used by nontechnical people (because the semantic rules structure reflects the business model), and the ability to move data to data warehouses and other systems without “cleansing” it or checking periodically to see if it is consistent (because the semantic rules ensure its consistency on an ongoing basis).

Evolution of Semantic Support in Databases

The idea of including semantic rules in the operations of databases is not new. Ever since IBM developed the Formatted File System (FFS), a data storage system that stored its own format along with the data, data management specialists have been looking for ways to enforce the rules of data. The network DBMS, developed by Charles Bachman in the mid-1960s, came closest with its support for various kinds of associations that had rules for capturing most, though not all, of those listed above. With the advent of the relational model, however, the DBMS was relegated to a data cataloguing system based on mathematical set theory, and most of the semantic rules support that was developed in earlier DBMSs were discarded.

Limitations of the Relational Model

As has been previously stated, the relational model supports highly flexible cataloguing and reporting of data organized in flat structures. It offers very little support for semantics. Relational DBMSs support the following semantic rules:

That’s it. They do not support cardinality, containment, or type hierarchies. The relational model makes recursive structures very difficult to represent, even with the recursion executed in the application program with no support in the catalogue. It makes exclusivity, containment, and roles nearly impossible to implement. It is important to bear in mind that all this is not to say that one cannot build a database with such semantics in it. However, the RDBMS will not enforce its semantic rules, and in some cases the relational model makes implementation nearly impossible.

The Onus Is on the Programmer and the User

Essential integrity rules cannot be enforced by the RDBMS; they must be ensured through the proper coding of all applications that update the database. It also means that those programs had better be well commented or that data models are kept up to date because the database catalogue will give maintenance programmers no guidance whatsoever regarding the proper updating of the database. To get around this problem, RDBMS vendors have added triggers and stored procedures that, together, can protect data structures from corruption, but only by coding procedures in the stored procedure language, which must also be properly commented or otherwise documented.

For users, a third normal form schema is an absolute mystery since it represents common business objects as a range of tables related through foreign keys. For instance, to look at the status of an order, one might need to query the customer, order, order-item, inventory-item, and product tables, knowing what the foreign keys are and how to build a nested SELECT statement. Also, the tables may not have obvious names. To avoid this problem, DBAs often create view tables that contain the complex query and that return a tabular result.

Examples of Problem Areas

Following are examples of data models that cannot be well supported by a relational database.

Traversing Multiple Tables (The Order-Entry Example)

Most real-world data collections involve nested structures. An example is an order form. Here, one sees an assortment of different data, including customer data, billing data, the items ordered, the date promised, and so on. Data is normally entered in this form, and people expect to see it in this form. However, normalization requires that such data be distributed across a number of tables, and retrieving an order involves selecting data from tables in the right order using foreign key associations. An example of an entity-relationship model for an order is in Figure 1, below.

Figure 1 - Order Entry Data Model
Addressing the Semantic Gap in Databases: Lazy Software and the Associative Model of Data

Source: IDC, 2002

This is a very simplified diagram that does not show many of the relevant semantic characteristics of such a model. Nonetheless, one can clearly see that there is one customer for an order, but a customer may place many orders. An order may be placed for many products, and each product may have many orders placed for it. The nearest way of representing this structure relationally appears in Figure 2.

Figure 2 - Order Entry Relational Data Structure
Addressing the Semantic Gap in Databases: Lazy Software and the Associative Model of Data

Source: IDC, 2002

Here we can see that there are four tables. The primary key of the table Customer is Custno. Order has a compound primary key of Ordno (a value that makes the combination unique) and Custno. Because Order is dependent on Customer for its existence and identity, part of its primary key (Custno) is a foreign key reference to the primary key of Customer. The OrderItem table has a triple compound key of Ordno, Custno, and Itemno. The Ordno-Custno compound value is a foreign key reference to the primary key of the Order table, and the ProdID column is a foreign key reference to the Product table, thereby creating a many-to-many association between Order and Product containing the necessary attributes for the association, such as the sale price.

Since nulls are not to be allowed for the OrderItem columns Custno, Ordno, Itemno, or ProdID, a row cannot exist in this table unless all the foreign key values have been inserted and resolved. Thus, the RDBMS will enforce the integrity of this relationship, but not its mandatory nature. If the row is deleted, the RDBMS will cheerfully permit the Order row to continue to exist with no items. The only way to ensure that an Order row exists only if it has OrderItem rows associated with it is to write code using triggers and stored procedures.

Tertiary Relationships (The Student-Class Example)

Although data modelers have generally been trained to model data using only binary models, this creates awkward situations when dealing with three or more entities that have clear relational interdependencies. An example of such a situation, modeled with binary relationships like the Order example, appears in Figure 3.

Figure 3 - A Teacher-Student-Class Binary Relationship Data Model
Addressing the Semantic Gap in Databases: Lazy Software and the Associative Model of Data

Source: IDC, 2002

Here, a class has one teacher and many students. A teacher may teach many classes. A student may take many classes. This structure is not the most convenient, however. If one wishes to know what students a teacher teaches, one must navigate the question through class by asking something like, “What classes does the teacher teach, and what students take those classes?”

A more streamlined structure that expresses the same thing with tighter associations is provided by using a ternary relationship; that is, one that has three rather than two participants. This relationship is illustrated in Figure 4.

Figure 4 - A Student-Class-Teacher Ternary Relationship Data Model
Addressing the Semantic Gap in Databases: Lazy Software and the Associative Model of Data

Source: IDC, 2002

In the prior example, extra notation is necessary to show the optionality and cardinality of the relationships modeled so that the semantics of the overall structure may be represented. Here, it is clear that teachers, classes, and students are in a combined relationship that requires less notation to represent. With such a model, it becomes possible to ask simply, “Which students does this teacher teach?”

The relational model, however, does not afford such simplicity. It must be driven by the binary model because there is no straightforward way to represent a ternary relationship such as this one without violating the rules of normalization. The corresponding relational model is illustrated in Figure 5.

Figure 5 - Student-Class-Teacher Relational Data Structure
Addressing the Semantic Gap in Databases: Lazy Software and the Associative Model of Data

Source: IDC, 2002

Again, there are four tables. The primary key of the Teacher table is FacltID (Faculty ID). The Class table has a primary key of ClassID and a column called FacltID that is a foreign key reference to the Teacher table. Because Class is not dependent on Teacher, the foreign key reference is not part of its primary key. The Student table has a primary key of StdntID. The Enrollment table is a cross-reference table that has a compound primary key, one element of which (ClassID) is a foreign key reference to the Class table and the other of which (StdntID) is a foreign key reference to the Student table. To find out what students are taught by a given teacher named Smith, the SQL would look like this:

SELECT UNIQUE StdntNam FROM Student WHERE StdntID EQ SELECT StdntID FROM Enrollment WHERE ClassID EQ SELECT ClassID FROM Class WHERE FacltID EQ SELECT FacltID FROM Teacher WHERE Profname EQ “Smith”;

This results in searching the Teacher table to find the Faculty ID for the teacher named Smith, then searching the Class table for each class that has that faculty ID as its teacher; then for each such class, searching the Enrollment table for the Student ID of each student enrolled in that class; then selecting all the rows of the Student table with those Student Ids; and finally returning a list of Student Names, eliminating duplicates (the “UNIQUE” qualifier). This process seems to involve a good deal of complexity to answer a simple question.

Recursive Relationships (The Parts Assembly Example)

Recursive relationships are natural and common in many areas of information management. Examples include just about anything that involves hierarchies of entities of the same type, such as organization charts. Such structures, however, are very difficult to manage using a relational database. Figure 6 shows an example of a recursive relationship involving the assembly of parts into higher order parts which, in turn, are components of still larger assemblies.

Figure 6 - Parts Assembly Recursive Relationship Data Model
Addressing the Semantic Gap in Databases: Lazy Software and the Associative Model of Data

Source: IDC, 2002

This model shows that a part may consist in an assembly of other parts, and it may also be a component of an assembly of parts. An assembly may have many component parts, but it represents just one containing part. Implicit — but not indicated here — is the fact that no parts assembly structure should exist in which a part can somehow be a component of itself. Also not indicted here is the fact that a part cannot be a component of more than one assembly part.

Using this structure, it should be possible to answer these two questions: “What are the component parts of this part?” and “Where is this part in the overall assembly?” This structure is very difficult to support in a relational database. About the closest one can come is illustrated in Figure 7.

Figure 7 contains two tables. The Part table has a primary key of PartID and other columns such as its description. The Assembly table has a composite primary key with two columns in it, each of which is a foreign key reference to the Part table. The Assembly table is a cross-reference table that associates two rows of the Part table with each other. Note that there is nothing here, and in fact nothing in the relational model, that can ensure that the RDBMS will prevent a part from being a component of itself either directly or indirectly (Part A is a component of Part B, Part B is a component of Part A, for instance). It can ensure that a part is included in just one assembly by making the Component column of the Assembly table unique.

Figure 7 - Parts Assembly Recursive Relationship Relational Data Structure
Addressing the Semantic Gap in Databases: Lazy Software and the Associative Model of Data

Source: IDC, 2002

SQL cannot be used to answer the two questions posed above. Instead, one must ask a series of incremental questions. To learn all the components of a part, one must ask: What are the direct components of this part? for each result set it must then ask the question again until the result set is null. Going in the other direction (from the component to the final assembly) is similar.

Heterogeneous Collections (The Product Inventory Example)

We must sometimes track a collection of items of different types and perform collective actions on them, such as taking inventory. The relational model is not useful here because it insists upon collecting entities of the same type only. As a result, we must take indirect approaches to dealing with heterogeneous collections and may not be able to do so in a third normal form database.

Figure 8 shows a heterogeneous product inventory example. Here we see seasonal, clothing, dry good, and grocery items being managed in inventories by store. There are many operations that may be performed on all of them collectively, yet they differ from one another and have different attributes. Note that the relationship is a binary one, with Store on one end and either Seasonal, Clothing, Dry Good, or Grocery on the other.

Note also that depending on which it is, the properties Display Until, On Rack, or Sell By may also apply to the relationship in connection with the participation of that entity. One or more different types of those items (e.g., seasonal and clothing) may be related to a single Store in any number through the Inventory relationship, but each item can be in the inventory of just one store. An example of how this might be handled in a relational database is illustrated in Figure 9.

Note also that depending on which it is, the properties Display Until, On Rack, or Sell By may also apply to the relationship in connection with the participation of that entity. One or more different types of those items (e.g., seasonal and clothing) may be related to a single Store in any number through the Inventory relationship, but each item can be in the inventory of just one store. An example of how this might be handled in a relational database is illustrated in Figure 9.

Figure 8 - A Product Inventory Heterogeneous Collections Data Model
Addressing the Semantic Gap in Databases: Lazy Software and the Associative Model of Data

Source: IDC, 2002

In order to mediate the relationship with the store, a proxy for the items themselves is created. The Store table has a primary key of StoreID. The Grocery table has a primary key of GrocID, a column indicating whether or not the item is perishable and other columns appropriate for groceries. The DryGood table has a primary key of DGID as well as a column naming the vendor from whom the item came and other columns appropriate to dry goods. The Clothing table has a primary key of ClothID as well as a column indicating the style of the item and other columns appropriate to clothing. The Seasonal table has a primary key of DGID — it would probably be considered a subtype of DryGood if relational databases supported subtypes — and has various columns appropriate to seasonal items.

The Inventory table connects them to the store. It has a compound primary key of the SKU of the item and StoreID, which is a foreign key reference to the Store table. It has columns with values common to all inventory items and a column called InvType. This column indicates which type the item is. Program code must set this type, and then ensure that the appropriate foreign key is valued and that all others are NULL. If InvType indicates “grocery,” then GrocID (the foreign key to the Grocery table) must be valued, along with the SellBy column (which contains the “sell by” date of the item). DGID, ClothID, OnRack, SeasDGID, and DispUntil must all be NULL. If InvType indicates “dry good,” then DGID (the foreign key to the DryGood table) must be valued, and the others must be NULL, and so on.

Figure 9 - Heterogeneous Product Inventory Relational Data Structure
Addressing the Semantic Gap in Databases: Lazy Software and the Associative Model of Data

Source: IDC, 2002

This model shows the complete abdication by the RDBMS of any responsibility to manage the integrity of heterogeneous relationships, leaving the task to program code, either in stored procedures or in application programs.

Lazy Software and the Associative Model

As stated previously, the relational model is not the only database model. Some prior models, including the network model and the hierarchical model, had more capability to provide support for semantic rules, though these too had their limitations since they required specific knowledge of database organization to use them. Lazy Software offers another approach, based on an approach to database management that draws in part from relational, network, and inverted-list systems. Its approach is called the Associative Model.

About Lazy Software

Lazy Software is a firm based in the United Kingdom that was founded by Simon Williams, Melinda Horton, and Simon Haigh, all of whom were also the cofounders of the development tools software company Synon. The company has been capitalized in part from the proceeds of the sale of Synon to Sterling Software in 1998. (Sterling was subsequently acquired by Computer Associates.) The company offers a post-relational DBMS that is based on the Associative Model, storing data in semantically rich structures. The DBMS is called Sentences.

The Associative Model of Data

At the heart of the Associative Model is the assertion that maintaining data associations through binary relationships is not sufficient to have meaningful, reusable data structures. Rather than store data with references to other data, the Associative Model involves storing data in objects of two types: entity types and association types. Unlike association models involving binary associations only, however, the Associative Model allows association types to have identifiers so that instances of associations can be related to each other.

As a result, structures of associated associations can result in ternary relationships, or, in fact, n-ary (having any number of participating entity types) relationships. For example, whereas a binary association model would force us to traverse the structure illustrated in Figure 1 by saying “customer Jones placed order numbered 456123, and then “the order numbered 456123 was for the product blender,” an association model that supports associated associations allows the model to put that together and say “customer Jones placed an order for the product blender.” Note that the order number is not mentioned since a cross-association traversal is not necessary to find out what was ordered.

The Post-Relational DBMS Sentences

Sentences is the name of the DBMS that Lazy Software offers, based on the Associative Model. It derives its name from the fact that the act of storing data using self-describing entities and structuring their relationships with associations that can link entities to other entities or associations or that can link associations to other associations enables structures that take the form of a sentence, with a subject, verb, optionally a direct object or objects and, optionally, some number of potentially nested indirect objects. It appears to support most of the semantic rules discussed earlier in this bulletin.

Lazy argues that because Sentences uses self-describing structures that can be explored and traversed dynamically, it enables what it calls “omnicompetent programming;” that is, one can use it without setup and can discover data without knowledge of the schema, much like a spreadsheet. This feature lends itself to discovery and to complex structure management, structures such as might be found in an XML document or in a semantic Web services registry.

Sentences is also easily expanded. Its server instances, called “chapters,” can contain schemas or data or a combination of the two and may be added or removed from the configuration at will.

The Data Query, Reporting, and Aggregation Tool: LazyView

The ability of Sentences to store and present data in a highly descriptive manner is given broader utility by a complementary product called LazyView. This product enables the near real-time extraction, aggregation, and reorganization of data from relational data sources, including Oracle, Sybase, Microsoft SQL Server, and IBM’s DB2, into the format of the Sentences DBMS. It does not copy and load but rather acts more like a pipe, taking requests from the Sentences engine, using SQL to retrieve the data, and mapping rules to render it in the expected manner.

The data from LazyView is presented in the context of something Lazy calls a “pseudo-chapter,” which looks like a “chapter” to the Sentences server software, but is really a proxy for the back-end data.

Implications for the eBusiness Platform

Sentences appears to offer clear and compelling benefits in a number of key areas of data management that are not especially well served by relational or object-relational DBMS technology. Those benefits are outlined in the following section.

Integrated Management and Access of Data in Disparate Sources

The LazyView product demonstrates the potential of Sentences to collect, aggregate, and compile data from disparate sources. IDC suspects the possible list of sources could extend well beyond relational databases in the future to include XML data servers, content management servers, email servers, and so forth. The ability to capture and present complex data in a semantically clear and consistent manner makes this product attractive both in using the integrated data access features of LazyView and by storing it for repeated query and analysis on the Sentences database itself.

Support of Complex Structures that Require Flexibility (XML)

Currently, RDBMSs support XML documents either by storing them as text BLOBs and providing sophisticated XML keyword-aware search facilities or by “shredding” them into elements that may be mapped to tables according to predefined rules. The former approach does not allow for the dynamic recombination of data in the XML documents or for high-speed aggregation or similar operations against document details. The latter requires that the document be known in advance in order to define the tables and mapping rules, and it inhibits flexibility in dealing with new document types or changes to the format of existing ones.

Sentences should possess both the structural flexibility to handle any XML structure and the dynamism to be able to fully capture the contents of documents even if the document type has not been seen before or anticipated. It could, therefore, be used as an XML data server or as a means of compiling or re-rendering XML content for user query tools, EIPs, reports, or analytical tools.

Potential as Registry Database for Semantic Web Services

IDC has previously suggested that Web services need support for semantic frameworks or context of services in order to ensure the level of flexibility necessary for a dynamic commercial Web services market to flourish. Such a development would call for a Web services registry that could capture all the nuances of the affinity rules (rules that show how a service definition in one taxonomy relates to a service requirement definition in another taxonomy in the registry) as well as the protocol definitions and relationships among Web services.

Sentences would seem an ideal DBMS candidate to consider as the engine for such a registry due to its ability to assimilate new structures dynamically and to manage complex structures that have rich semantic rules underlying them.

Conclusion

Sentences represents the rebirth of an old idea: that databases are about more than just storing and retrieving structured data; they are about maintaining repositories of useful information and recombining it in different ways to make it even more useful. Relational DBMSs are not set up to handle such a task.

This is not to say the relational DBMSs do not have their strengths. They are very good at high-speed transaction processing involving data that has a relatively static structure and can always be rendered in two dimensions (rows and columns) and they are very good at aggregating and reporting such data. They are not so good at managing more complex data: at data in more than two dimensions, at data where the format or structure may vary, or where heterogeneous data is to be combined and reported.

The book titled Associative Model of Data contends that relational DBMSs are best limited to the online transaction processing functions that they currently perform; that they are not appropriately extended beyond such functions because packaged applications deployed on them are unreasonably difficult to maintain due to the overwhelming data management burden placed upon them by the programmer. In particular, the book questions the utility of relational databases in managing online content or business intelligence data such as that required by customer relationship management (CRM) applications.

IDC is reluctant to join in such sweeping assertions. The leading RDBMS vendors, most notably Oracle and IBM, have made strides in optimizing their products to efficiently manage business intelligence data (through explicit support of so-called “star schema” constructs and better SQL optimization), insert programmatic intelligence into their databases using Java-stored procedures and enterprise Java beans (EJBs) deployed within a DBMS-based J2EE-compliant EJB container, and to step outside the relational paradigm when necessary to manage complex data within and without the database itself.

Admittedly, the resulting environments tend to be very complex to manage, but they provide a programmatic means of shielding application programmers from that complexity. (Pity the DBA.) IDC also concedes that all this seems a bit like trying to jam a square peg into a round hole. Nonetheless, these vendors have done a fair job of shaving its corners.

Having said all that, it is also clear that the demand for the ability to customize and tune applications will drive more component-oriented development (as has been happening with the emergence of J2EE and .NET and predominant development frameworks), and it also seems clear that such flexibility may extend to the data as well. XML, complex content, email, legacy data sources, and other online data all demand coordinated, comprehensive management, and current relational DBMS technology seems inherently incapable of fully addressing such a demand.

Lazy Software’s Sentences and LazyView products offer relief in this area, providing such data management capability and enhancing the overall application environment. Areas of opportunity for Lazy and for potential users and partners of Lazy include the following:

Lazy Software should consider targeting these areas. Vendors offering, or planning to offer, products in these areas should consider partnering with Lazy or reselling an embedded copy of the Sentences DBMS and LazyView. Vendors offering similar DBMS products, or offering products that compete with Lazy functionality technically or conceptually, would be well advised to take this contender seriously.

Learn More

Related Research

Bibliography

Williams, Simon: The Associative Model of Data, Lazy Software Ltd. (second edition), 2002. ISBN 1-903453-01-1


Table of Contents - Abstract - Document


Quoting IDC Information and Data: Internal Documents and Presentations - Quoting individual sentences and paragraphs for use in your company's internal communications does not require permission from IDC. The use of large portions or the reproduction of any IDC document in its entirety does require prior written approval and may involve some financial consideration. External Publication - Any IDC Information that is to be used in advertising, press releases, or promotional materials requires prior written approval from the appropriate IDC Vice President or Country Manager. A draft of the proposed document should accompany any such request

Copyright 1994-2002 International Data Corporation.
Reproduction without written permission is completely forbidden.
For copies please contact Cheryl Toffel, (508) 935-4389