Posts Tagged ‘business semantics management’

Collibra wins Datanews award for start-up of the year

June 1, 2010

Some months after winning the award for best start-up at the European Semantic Technology Conference 2009, we have won the prestigious Belgian Datanews award as well.


Collibra and IBM Research join forces in European research on service-oriented architectures

May 20, 2010

I am happy to announce that my company Collibra has acquired considerable co-funding in ACSI, a European FP7 research project worth 5 million Euro. ACSI stands for “Artifact-Centric Service Interoperation”. The coordinator is IBM Research Haifa, and the kick-off of the project will be held in June at their premises in Israel. Details of the international consortium are below.

ACSI will serve to dramatically reduce the effort and lead-time of designing, deploying, maintaining, and joining into environments that support service collaborations. This will be achieved by developing a rich framework around the novel notions of dynamic artifacts and interoperation hubs, enabling a substantial simplification in the establishment and maintenance of service collaborations.


Interoperation between electronic services, and more generally the business processes embodied by these services, is one of the most challenging and pressing issues in today’s increasingly globalized and de-centralized economy. Out-sourcing, globalization, and the automation of business processes continue to increase. However, today, there is no effective, flexible, scalable, and principled approach to enable the interoperation of services across enterprise boundaries in support of shared (business) goals. This is a major roadblock to the automation of these kinds of collaboration, and more broadly, to the design, deployment, and operation of innovative value nets. The ACSI project is aimed directly at filling this vacuum.

Based on an innovative foundation, the ACSI research will develop scientific advances, techniques, and tools to dramatically simplify the design and deployment of infrastructure to support service collaborations, the ability of services to join such collaborations, and the evolution of such collaborations as the marketplace and competitive landscape change.

A Brand New Approach

Artifact-Centric Service Interoperation (ACSI) is based on two fundamental constructs: the interoperation hub and dynamic artifacts. Business-driven intelligent operation of these constructs will be grounded by business semantics.

An interoperation hub serves as a virtual rendezvous for multiple services that work together towards a common goal. Our research will develop a principled, easy-to-use framework for creating, deploying, maintaining, and joining into ACSI interoperation hubs in essentially any application domain. Similar to EasyChair or, an ACSI interoperation hub will serve as the anchor for a collaborative IT environment that supports large numbers of service collaborations that operate independently, but focus on essentially common goals. These hubs are primarily reactive, serving as a kind of structured whiteboard to which participating services can refer. The hubs can be updated with information relevant to the group, assist the services by carrying out selected tasks, or notify services of key events.

Example of interoperation hub that supports collaboration around hiring

The interoperation hubs used in ACSI will be structured around dynamic artifacts, also known as “business artifacts” or “business entities”. These provide a holistic marriage of data and processes, serving as the basic building block for modeling, specifying, and implementing services and business processes. In the context of single enterprises, it has been shown that the use of artifacts can lead to substantial cost savings in the design and deployment of business operations and processes, and can dramatically improve communication between stakeholders. Artifacts can give an end-to-end view of how key conceptual business entities evolve as they move through the business operations, in many cases across two or more silos. As a result, artifacts can substantially simplify the management or “hand-off” of data and processing between services and organizations.

A key pillar of the ACSI research is to generalize the advantages of dynamic artifacts to the broader context of interoperation hubs and service collaborations. While the interoperation hubs themselves will take advantage of the artifact paradigm, the services participating in such hubs are not required to be artifact-centric; they can be conventional SOA services, including legacy applications with SOA adapters.


ACSI provides an approach to populating the web with semantically rich building blocks, around which services can cluster to create a broad variety of service collaborations and value networks.

The ACSI interoperation hub framework, in conjunction with the underlying ACSI artifact paradigm, provides a rich structure around which many subsequent scientific and technology advances can be made. The ACSI research will substantially extend current verification and synthesis techniques to incorporate data along with process, and will develop the next generation of process mining research by generalizing it to handle data along with process.

The project aims to achieve dramatic savings over conventional approaches to service interoperation across several areas: design and deployment, on-boarding, day-to-day operation, maintenance, data transformation automation, and evolvability. This will be accomplished while enabling rich flexibility for the different service collaborations using a given interoperation hub.

The technology can be applicable in key challenge areas of societal importance, including government, energy, healthcare, supply chain logistics (especially in industries such as food or heavy manufacture with deep upstream supply chains), and heavy manufacture (e.g., airline industries). The mechanisms incorporated into the ACSI framework to support rich variation within a single hub can be especially advantageous in domains, such as human resources, where there are significant differences from country to country.

The ACSI interoperation hub framework will provide a paradigm shift in the way that services, and more generally enterprises, can work together.


IBM Research – Haifa (coordinator)

Università degli Studi di Roma La Sapienza

Libera Università di Bolzano

Imperial College of Science, Technology and Medicine

Technische Universiteit Eindhoven

Tartu Ülikool

Indra Software Labs SLU

Collibra NV

Magritte Flirting with Semantics

December 15, 2009
…or to be more precise “Rene Magritte flirting with semiotics”. I spent my lazy Sunday (according to my sense of the word) at the new Rene Magritte Museum in Brussels (not to be confused with his birthplace house which is also a smaller museum about his life).

Magritte Museum

Magritte went through various phases of “his” interpretation of surrealism expressing what he calls the inexpressible. The common thread was the study of the function between words (or signs) and images (syntax), scientifically called semiotics. One masterpiece (see below; produced in the context of a NYC exposition) performs a methodological exploration of the construction of a semiotic tetrahedron.

A semiotic tetrahedron is a quadrangular commutative diagram that is constructed when an actor perceives a physical object (say a cat) in the domain, consequently renders a mental conception of this perceived object, and finally chooses a representation for his conception (say the string “yojo”). Semantics is defined by the relationship between an object and the object that represents it. An actor’s ontology is a representation of all conceptions the actor believes are observable, hence exist, in the domain plus their inter-relationships (which are also conceptions). Representations (either orally uttered or in written form) are essential for actors to socialize their observations and so, by learning from others’, to refine their interpretation of the world.
To facilitate this process, although they share the same physical world, actors have to align their representations of the domain. In other words, they must reconcile (parts of) their ontologies to build a common language. E.g., they plausibly do not speak the same language and hence do not share the same word for referring to a cat. As computer systems need to communicate through formal languages, solving this ontology construction problem is one of the most important routes of research in information and Web sciences. It is the core of business semantics management and Collibra.

The masterpiece (translated from the original French) is depicted below (originally published in La Revolution Surrealiste in 1929; found in Conceptual Art by Tony Godfrey, Phaidon). In 18 combinations of semiotic tetrahedrons, he prescribes a methodology for interpreting his paintings, which are complex semiotic puzzles. E.g., the first one illustrates synonymy. Check the second row: the first one is not about the boot or the sea, but about the inexpressible emotion that emerges when conceiving the objects and their relationships altogether. The second on the third row makes the remarkable observation that not everything can be represented. Indeed, wovon man nicht sprechen kann, darüber muß man schweigen (“whereof one cannot speak, thereof one must be silent”). Finally, the third on the second row shows a perfect example of a semiotic tetrahedron.

Another famous example is La Trahison des Images (1929). This painting represents a pipe and states “ceci n’est pas une pipe”: indeed, it is merely Magritte’s personal representation of a pipe. Moreover, this very particular representation is contingent on his state of being at the moment of painting. This state includes signs like his mood, the room he was in, the absence of his wife… Any small phenomenon in his state of being influences the very colour tone or shape of the pipe painting. This is illustrated by the many pipes he painted. Many of them also have arbitrary signs (words or images) surrounding the “pipe”, by which he wants to show us that these signs (which, as part of the painting process, also form part of the actual state of being) have influenced the representation. Ironically this is one of his most realistic paintings :-)

From Horta and Hergé to Khnopff and Delvaux, Brussels has many hidden secrets to discover. But the top of surrealism is found in the Magritte Museum. If you have a couple of hours, do not hesitate to visit all three floors. We booked our visit in advance on-line, which is highly recommended.

Enterprise Data World

October 21, 2009

At the Enterprise Data World Conference in March 2010 in San Francisco, I will be talking about business semantics-driven integration of service-oriented applications. Unlike other so-called ontology languages that focus on formal aspects, business semantics define the contextual meaning of key business assets for your organization in terms of business facts and rules.

Business Semantics have a dual utility: the derived business semantics do not only provide a shared glossary to augment human understanding, but can also be used to automate meaningful data integration during process integration.

Many organisations start to realise the potential of business semantics to leverage information management, and take initiatives at the grass roots. However sustainable and meaningful business semantics management must be organised and cultivated organisation-wide by the right balance of people, methods, and tools. In this talk you will learn that:

  1. your organisation already has much valuable metadata as building blocks for business semantics;
  2. reconciliation of metadata into sharable and reusable semantic patterns requires a systematic approach and careful selection of technologies;
  3. application of business semantics for EAI is much better than traditional point-to-point or hub-and-spoke approaches;
  4. identify business drivers that convince your senior management to prioritise and free the necessary budget and resources accordingly;
  5. outline a roadmap to implement business semantics management as part of the overall information architecture and governance plan.

We illustrate these points with realistic case studies, and point out important challenges for the future. Rendez-vous Wed 17 March at 9h30 in the Hilton at Union Sq., San Francisco.

Business Semantics-driven Data Matching: a Case Study for Competency-based HRM

October 11, 2009

De Baer, P.; Tang, Y.; and De Leenheer, P. (2009) An Ontology-based Data Matching Framework: Case study for Competency-based HRM. In Proc. of the 4th International ISWC Workshop on Ontology Matching (OM 2009), CEUR

As part of the European PROLIX (Process Oriented Learning and Information eXchange) project, VUB STARLab designed a generic ontology-based data matching framework (ODMF). Within the project, the ODMF is used to calculate the similarity between data elements, e.g. competency, function, person, task, and qualification, based on competency-information. Several ontology-based data matching strategies were implemented and evaluated as part of the ODMF. In this article we describe the ODMF and discuss the implemented matching strategies.


Semantic data matching plays an important role in many modern ICT systems. Examples are data mining [6], electronic markets [1], HRM [2], service discovery [5], etc. Many existing solutions, for example [2], make use of description logics and are often tightly linked to certain ontology engineering platforms and/or domains of data matching. This often leads to a knowledge bottleneck because many potential domain users and domain experts may not be familiar with description logics or the specific platform at hand. To avoid such a potential technical barrier, we designed the ODMF so that it is independent of a particular ontology engineering platform and does not require the use of description logics. Instead, we make use of the combination of an ontologically structured terminological database [3] and a DOGMA ontology [4] to describe data. Both the DOGMA ontology and the terminological database make use of natural language to describe meaning. On top of this semantic data model we developed an interpreter module and a comparison module. Both the interpreter and the comparator make use of a library of matching algorithms. The matching algorithms have access to the data model via a Java API, and may be written in any programming language that can access this API. Via the terminology base, data can be described and interpreted in different natural languages. We believe that this multilingualism will improve the usefulness of the framework within an international setting.

The ODMF is designed to support data matching in general. So far, however, the ODMF has only been implemented and evaluated as part of the European integrated PROLIX project. Within the PROLIX platform, the ODMF supports semantic matching of competency-based data elements, e.g. competency, function, person, task, and qualification.

Matching strategies

We implemented and evaluated several ontology-based data matching algorithms within the ODMF. These algorithms relate to three major groups: (1) string matching, (2) lexical matching, and (3) graph matching. However, most matching algorithms make use of a combination of these techniques.

  1. String matching techniques are useful to identify data objects, e.g. competences and qualifications, using a (partial) lexical representation of the object. We selected two matching tools for this type of data matching: (a) regular expressions and (b) the SecondString library.
  2. Lexical matching techniques are useful to identify data objects, e.g. competences and qualifications, using a (partial) lexical representation of the object. In addition to plain string matching techniques, linguistic information is used to improve the matching. We selected two techniques to improve the matching: (a) tokenization and lemmatization and (b) the use of an ontologically structured terminological database.
  3. Graph matching techniques are useful (a) to calculate the similarity between two given objects and (b) to find related objects for a given object.
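
The three groups of techniques can be illustrated with a minimal Python sketch. It is not the ODMF itself: Python's difflib stands in for the Java SecondString library, the lemmatizer is deliberately naive, and the terminological database (TERM_BASE), the ontology graph (GRAPH) and all function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical terminological database: each set groups synonymous terms.
TERM_BASE = [
    {"competence", "competency", "skill"},
    {"qualification", "certificate", "diploma"},
]

# Hypothetical ontology graph: object -> directly related concepts.
GRAPH = {
    "java developer": {"programming", "java", "testing"},
    "python developer": {"programming", "python", "testing"},
    "accountant": {"bookkeeping", "tax law"},
}


def regex_match(pattern, label):
    """String matching (a): does the label match a regular expression?"""
    return re.search(pattern, label, flags=re.IGNORECASE) is not None


def string_similarity(a, b):
    """String matching (b): character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lemmatize(token):
    """Naive stand-in for a real lemmatizer: strips simple plural endings."""
    if token.endswith("ies"):
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def canonical_terms(text):
    """Tokenize, lemmatize, and map each lemma to one term per synonym set."""
    terms = set()
    for token in re.findall(r"[a-z]+", text.lower()):
        lemma = lemmatize(token)
        synset = next((s for s in TERM_BASE if lemma in s), {lemma})
        terms.add(min(synset))  # alphabetically first synonym as canonical form
    return terms


def jaccard(a, b):
    """Overlap of two sets, 0.0 if either is empty."""
    return len(a & b) / len(a | b) if a and b else 0.0


def lexical_similarity(a, b):
    """Lexical matching: compare labels after linguistic normalisation."""
    return jaccard(canonical_terms(a), canonical_terms(b))


def graph_similarity(a, b):
    """Graph matching (a): similarity of two objects via shared neighbours."""
    return jaccard(GRAPH.get(a, set()), GRAPH.get(b, set()))


def related_objects(obj, threshold=0.3):
    """Graph matching (b): objects whose neighbourhoods overlap enough."""
    return sorted(o for o in GRAPH
                  if o != obj and graph_similarity(obj, o) >= threshold)
```

With this toy data, "Programming skills" and "programming competencies" are lexically identical after normalisation, although their plain string similarity is well below 1.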

For more information on this framework you can contact the main author of this article.

Social Performance in Collaborative Business Semantics Management

September 9, 2009

The “living” ontologies that will furnish the Semantic Web are lacking. The problem is that in ontology engineering practice, the underlying methodological and organisational principles to involve the community are mostly ignored. Each of the activities involved in the community-based ontology evolution methodology requires certain skills and tools which domain experts usually lack. Finding a social arrangement of roles and responsibilities that must supervise the consistent implementation of methods and tools is a wicked problem. Based on three technology-independent problem dimensions of ontology construction, we propose a set of social performance indicators (SPIs) to bring insights into the social arrangement evolving the ontology, and how it should be adapted to the changing needs of the community. We illustrate the SPIs on data from a realistic experiment in the domain of competency-centric HRM.

Actions grouped per part of the ontology over time.

The illustration here is a sneak preview of what’s to come. It is the analysis of an SPI that observes the balance between the human resources spent on the respective parts (computational/formal vs. substantial/informal parts) of the representation of individual concept types through time. This may indicate the need to adapt the social arrangement accordingly. The actions are grouped per part of the ontology: G0 for the discussion part; G1 for the formal part; and G2 for the informal part. G3, G4 and G5 stand for creating, deleting and moving concept pages, respectively. The graph shows three moments (i.e., 3/26, 4/2, and 4/23) where all groups peak. These moments indicate (i) an intermediary deadline for a new ontology version to be accepted, and (ii) consequently, a point where the domain is rescoped for another iteration of the ontology evolution cycle, resulting in a temporarily higher production. The initial scoping peak is the largest, while the following two peaks become gradually smaller. This indicates that the ontology reaches a fixpoint as the final deadline approaches, as more concepts covering the domain become mature. There are two isolated peaks of actions on the formal parts in the second iteration: 29 actions on 2009-04-09 and 22 on 2009-04-16. This shift of balance between formal and informal actions is the result of a general request by the core domain expert to spend more resources on the formalisation of core concept types.
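
As a rough illustration of how such an indicator could be computed from an action log, here is a minimal Python sketch. The log entries are invented for the example and are not the experiment's data; the function names are hypothetical, and only the group codes G0 to G5 come from the description above.

```python
from collections import Counter, defaultdict

# Hypothetical action log: one (date, group) pair per action.
# G0 = discussion, G1 = formal part, G2 = informal part.
ACTIONS = [
    ("2009-04-02", "G0"), ("2009-04-02", "G1"),
    ("2009-04-02", "G2"), ("2009-04-02", "G2"),
    ("2009-04-09", "G1"), ("2009-04-09", "G1"),
    ("2009-04-09", "G1"), ("2009-04-09", "G2"),
]


def actions_per_group(log):
    """Count the actions per group for every date in the log."""
    counts = defaultdict(Counter)
    for date, group in log:
        counts[date][group] += 1
    return counts


def formal_share(counts, date):
    """Share of formal (G1) actions among formal + informal (G2) actions."""
    day = counts.get(date, Counter())
    total = day["G1"] + day["G2"]
    return day["G1"] / total if total else None


def peak_dates(counts, group, threshold):
    """Dates on which a group reaches at least `threshold` actions."""
    return sorted(d for d, day in counts.items() if day[group] >= threshold)


counts = actions_per_group(ACTIONS)
print(formal_share(counts, "2009-04-09"))  # 0.75 for the invented log above
print(peak_dates(counts, "G1", 3))         # ['2009-04-09']
```

A sustained shift in the formal share between iterations is the kind of signal that could prompt a change in the social arrangement.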

The full experiment will be presented at and published by the International Semantic Web Conference (ISWC) 2009.

Full article: De Leenheer, P., Debruyne, C., Peeters, J. (2009) Towards Social Performance Indicators for Community-based Ontology Evolution. In Proc. of ISWC Workshop on Collaborative Construction, Management and Linking of Structured Knowledge (CK2009)

Ontology Elicitation Defined in Encyclopedia of Database Systems

August 17, 2009

My entry (download a pre-publication draft here with permission of Springer) on Ontology Elicitation will soon be published in the Encyclopedia of Database Systems by Springer. The Encyclopedia, under the editorial guidance of Ling Liu and M. Tamer Özsu, will be a multi-volume, comprehensive, and authoritative reference on databases, data management, and database systems. Since it will be available in both print and online formats, researchers, students, and practitioners will benefit from advanced search functionality and convenient interlinking possibilities with related online content. The Encyclopedia’s online version will be accessible on the SpringerLink platform.

De Leenheer, P. (2009) Ontology Elicitation. In Encyclopedia of Database Systems, editors-in-chief Liu, L. and Özsu, T., Springer, forthcoming Spring 2009.

The Virtue of Naming concepts

July 16, 2009

Everybody knows the Pizza Ontology that has been used for ages now to demonstrate tools and methods in the Semantic Web community. Nowadays the Beer Ontology is gaining interest, and I wonder how many concept types the Belgian beer namespace will consist of, as there is no clear enumeration of that :-) Anyway, when talking about pizza or even about Belgian beers, we are still playing around with small ontologies.

(Too long) names to decontextualise the proliferation of concept types

Seriously, an ontology should refer to context-independent and language-neutral concepts. However, natural language (vocabulary etc.) is still needed to represent these concepts. Wittgenstein once said:

“The limits of my language mean the limits of my world.”

When building large conceptual frameworks of thousands of concept types, vocabulary is usually exhausted before finishing. BTW, is the job ever finished given the proliferation of concepts in communities? Anyway, (as in natural language) terms will have different meanings depending on the context. E.g., the term java can refer to coffee, an island, or a programming language. In the latter case we can even doubt whether we are talking about java as a sub-type or an instance of the concept type programming language. Let’s not see how deep the philosophical rabbit hole goes here. IMHO, in a formal semantic system we could consider introducing a fuzzy parameter that can switch between both perspectives.

Now, let’s get back to the ambiguity problem of vocabulary. Lacking better solutions, many of these large ontologies have chosen very long labels to refer to their concepts in an unambiguous manner (as the title of this blog post already suggests). Usually, these labels are concatenations of a number of parameters that determine the context of the label. Consider, for example, the IFRS Taxonomy 2009, which is a complete translation of International Financial Reporting Standards (IFRSs) as of 1 January 2009 into XBRL:

Picture 1

The label for the illustrated concept reads (first take a deep breath):


And this is not a single occurrence. The IFRS taxonomy counts hundreds of concept labels of this size. See for yourself:

Picture 2

This may be ok for the single person who built the ontology and actually chose the labels, but once shared it is not understandable for machines, or even for other users. This situation creates a vicious circle: long labels are difficult to navigate, hence users introduce new concept types as they cannot retrieve what they are looking for. When defining these new concept types, they have no choice but to invent new labels “with the wet index”, inexorably aggravating the situation.


The problem is also found when people tend to overcategorise. This is an excerpt of a product taxonomy from Kevin Jenkins during a discussion on SemWeb on this matter:

Product (Root Class)
--- software
------ desktop software
------------ desktop internet software
------------------- desktop internet access software (individual)
------------------- desktop internet browser software (individual)
------------------- desktop internet messaging software (individual)
------------ desktop multimedia software
------------------- desktop multimedia 3d software (individual)
------------------- desktop multimedia audio software (individual)
------------------- desktop multimedia video software (individual)
------ internet software
------------ internet saas software
------------------- internet saas collaboration software (individual)
------------------- internet saas videosharing software (individual)
------------ internet cloud software
------ enterprise software

In order to differentiate a subtype from its parent, a term is appended to the more general label. According to Azamat Abdoullaev, long classification is done according to the scheme “noun specifying another noun”, like below:

((subsubclass)(subclass(class)): audio multimedia desktop software.

He compares it with the problem of URI schemes or computer directory (folder, catalog) names, where it would be written as a root hierarchy:


However, this is not how humans talk to each other. Humans tend to contextualise their concepts through sentences in which they qualify certain attributes. This is done in terms of facts. E.g., the following example shows four facts about a Person.

Person drives Car with Brand “Minerva” and married to Woman with Name “Athena”.

The fact types used here are:

Person drives Car
Car with Brand
Person married to Woman
Woman with Name

Hence, using simple fact types we can describe very complex concept types without even using categorisation in many cases. The terms used to refer to the concept types of course need to be disambiguated. There is no deus ex machina here: context is a social construct as well that has to be included in the ontology.
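
To make this concrete, here is a minimal Python sketch that stores the four fact types above as plain data and derives a description of Person from them, without any subtype hierarchy. The FactType class and the helper functions are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactType:
    """A binary fact type: a subject concept plays a role towards an object concept."""
    subject: str
    role: str
    obj: str

    def verbalize(self):
        return f"{self.subject} {self.role} {self.obj}"


# The four fact types from the example sentence.
FACT_TYPES = [
    FactType("Person", "drives", "Car"),
    FactType("Car", "with", "Brand"),
    FactType("Person", "married to", "Woman"),
    FactType("Woman", "with", "Name"),
]


def describe(concept, fact_types):
    """Verbalize every fact type in which the concept is the subject."""
    return [ft.verbalize() for ft in fact_types if ft.subject == concept]


def reachable(concept, fact_types):
    """All concept types that contextualise the given concept, directly or not."""
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        for ft in fact_types:
            if ft.subject == current and ft.obj not in seen:
                seen.add(ft.obj)
                stack.append(ft.obj)
    return seen


print(describe("Person", FACT_TYPES))
# ['Person drives Car', 'Person married to Woman']
print(sorted(reachable("Person", FACT_TYPES)))
# ['Brand', 'Car', 'Name', 'Woman']
```

The short labels stay short: the context of Person comes from the facts it participates in, not from a concatenated name.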

Context as first-class citizen

Context is an unavoidable construct when representing ontologies, as I already discussed in an earlier publication, particularly when stakeholders in a community use different vocabularies to refer to common concept types.

In our approach we use a context identifier g to articulate a term t with a concept type identifier c, using the following function:

(g, t) -> c


Hence, c is a URI that refers to a language-neutral and context-independent concept type. This can be represented in the WordNet manner in terms of a gloss (=informal description) plus a synset (=set of synonymous terms). For one of the terms in the above fact types this would be (based on WordNet):

(drivingfordummies, person)->(gloss,synset)
gloss="a human being"

This assumes that the fact type was extracted from a book called Driving for Dummies. So by keeping track of the context of elicitation g of every fact type, we can disambiguate the involved terms properly without the need for very long labels.
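
A minimal Python sketch of such a context-dependent lookup follows. Only the context drivingfordummies, the term person and the gloss "a human being" come from the example; the URIs, the synsets, the two java contexts and all names in the code are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptType:
    """A language-neutral concept type, described WordNet-style."""
    uri: str
    gloss: str
    synset: frozenset


# Hypothetical concept types; the URIs and synsets are invented.
PERSON = ConceptType(
    uri="http://example.org/concept/person",
    gloss="a human being",
    synset=frozenset({"person", "individual", "human"}),
)
JAVA_LANGUAGE = ConceptType(
    uri="http://example.org/concept/java-language",
    gloss="a programming language",
    synset=frozenset({"java"}),
)
JAVA_COFFEE = ConceptType(
    uri="http://example.org/concept/java-coffee",
    gloss="a beverage brewed from coffee beans",
    synset=frozenset({"java", "coffee"}),
)

# Articulation table: (context identifier g, term t) -> concept type c.
ARTICULATION = {
    ("drivingfordummies", "person"): PERSON,
    ("programming-handbook", "java"): JAVA_LANGUAGE,  # hypothetical context
    ("coffee-menu", "java"): JAVA_COFFEE,             # hypothetical context
}


def articulate(context, term):
    """Resolve a term within its context of elicitation; None if unknown."""
    return ARTICULATION.get((context, term.lower()))


print(articulate("drivingfordummies", "Person").gloss)  # a human being
```

The same short term java resolves to different concept types depending on the context, so neither label has to encode its context in the name itself.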

Further Reading

In my PhD, I developed a methodology that enables a community to collaboratively construct an ontology architecture consisting of several layers (upper common, lower common, stakeholder level).

  • The top layer refers to language-neutral and context-independent concepts that are already agreed and applied by the community.
  • The lowest stakeholder layer consists of “stakeholder perspectives” on these upper layers, specialising the upper layer with locally relevant concept types represented by local vocabularies.
  • Gradually these lower perspectives are reconciled in the lower common layer, and when a new version is produced parts are promoted to the upper common layer.

Hence, the community does not only have to agree on the concept types (gloss) but also on the preferred terms (synset) to refer to these concept types.

Animation film about Collibra’s Superman!

June 26, 2009

We asked Beshart (who also created our Collibra logo) to produce a ludic animation film that shows how current approaches to integration can lead to complex situations, and how Semantic Data Integration (in the form of a Collibra Superman or “Green Knight”) provides the answer. Check it out for yourself, and give us your feedback!

The video can be found on Vimeo. Read more on Semantic Data Integration at our website and leave a comment on the movie.

Business Semantics Management identified as key solution in PriceWaterhouseCoopers’ Technology Forecast

June 22, 2009

In a ground-breaking Technology Forecast Report, PriceWaterhouseCoopers (PWC) has identified semantic technology as a solution for bridging information gaps, facilitating interoperability and unlocking hidden data in the enterprise.

The report describes how the technology can lower costs by reusing data and decreasing the number of custom connections between databases. Via my WordPress blog, the PWC analysts learned about the scientific boilerplate of Business Semantics Management, and decided to list Collibra as a key technology provider in the final version of the report, which is great news and a real recognition of our efforts. ReadWriteWeb has also covered the report in the meantime and we are featured in that article, too.

We’re very excited about being listed in the report, and invite all companies interested in our solutions to get in touch with us.

Note: My previous HTML-based website counted around 8,000 visitors over a period of 6 years (2002-2008). Since I converted it into a WordPress blog in January 2009, the visitor count has ticked past 2,000 today. That is more than I used to have in one year using plain HTML+CSS. Bravo to WordPress, who contributed to the fact that our hard work finally hits the “eye of the beholder”, and gets promoted by well-known trend spotters like PWC! Below you can see the statistics of the visits to this blog, which clearly peaked during the SemTech conference 14-18 June.

Picture 1