Integration Functionality

This document outlines the system context and major functionality of integration. It reflects the author's viewpoint, based on experience gained over many years. In general, every integration problem is unique, and each one can be approached in many ways. Having a system context as a frame of reference helps in analyzing the integration problem at hand. The functionality section presents a collection of the main requirements and features and is by no means complete. The list gives an impression of the complexity of integration and of the variety of integration problems.

System Context

The system context defines the frame of reference for the systems that are being integrated. The systems considered for integration here are HAD systems, that is, heterogeneous, autonomous and distributed:

Heterogeneous. In the general case, each system has its own data model and its own data schema. These are conceptualized and developed independently of each other, resulting in differences in conceptual semantics as well as in schema. Different systems therefore have some disjoint concepts, some overlapping concepts and some common concepts. The same is true for the data schema. Because the data models and data schemas differ to some extent, the systems are heterogeneous.
Autonomous. Systems are in general autonomous, meaning that they perform state changes independently of other systems. They do not coordinate changes through coordination technology, distributed transactions, or similar mechanisms. As a consequence, the state of one system to be integrated does not imply a certain state of another system to be integrated. Any integration solution must be able to deal with any state that any of the systems to be integrated can be in, as well as with the fact that state changes can happen at any time.
Distributed. In general, systems run on servers and have their state managed in databases. Some systems might share the same server and database infrastructure; others do not. It is safe to assume that systems do not share infrastructure and that the infrastructure they use is mostly disjoint. This implies that, in general, network access is required to reach these systems and that there is no central environment from which the systems can be accessed locally. An integration solution must therefore be aware of the distribution and of the fact that different systems can be available and unavailable at different times and for different periods of time.
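One common way to cope with systems that are unavailable at different times is to buffer messages and forward them once the target is reachable again. The following is a minimal sketch of that idea, not a definitive implementation. The classes FlakySystem and StoreAndForward and their methods are hypothetical names introduced only for illustration, and the "network" is simulated in memory.

```python
from collections import deque


class FlakySystem:
    """Hypothetical remote system that is unavailable during some periods."""

    def __init__(self):
        self.available = False
        self.received = []

    def deliver(self, message):
        if not self.available:
            raise ConnectionError("system unavailable")
        self.received.append(message)


class StoreAndForward:
    """Buffers messages locally and forwards them when the target is reachable."""

    def __init__(self, target):
        self.target = target
        self.pending = deque()

    def send(self, message):
        # Always store first, then try to forward.
        self.pending.append(message)
        self.flush()

    def flush(self):
        # Deliver in order; stop at the first failure and keep the rest queued.
        while self.pending:
            try:
                self.target.deliver(self.pending[0])
            except ConnectionError:
                return False
            self.pending.popleft()
        return True


target = FlakySystem()
link = StoreAndForward(target)
link.send({"id": 1})      # target is down, message stays pending
link.send({"id": 2})      # still down, two messages pending
target.available = True   # target comes back at some later time
link.flush()              # both messages are now delivered in order
```

The sender never needs to know when the target is available; it only hands messages to the buffer. A real solution would also persist the buffer so that pending messages survive a restart.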

A note on heterogeneity. Although heterogeneity is often seen as a data-specific system property, it can also be found in the remote access interfaces of those systems. Different systems use different technologies to expose their functionality, which causes heterogeneity far beyond data-level heterogeneity. Integration has to overcome this type of heterogeneity, too.
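Interface heterogeneity is often hidden behind adapters that present one common interface to the integration logic. The sketch below assumes two hypothetical systems, one returning semicolon-separated lines and one returning JSON documents. All class names, method names and field names are invented for illustration.

```python
import json


class LegacyCsvSystem:
    """Hypothetical system exposing records as semicolon-separated lines."""

    def fetch_line(self, key):
        return f"{key};Ada Lovelace;London"


class JsonApiSystem:
    """Hypothetical system exposing records as JSON documents."""

    def request(self, path):
        customer_id = path.rsplit("/", 1)[-1]
        return json.dumps(
            {"id": customer_id, "fullName": "Ada Lovelace", "city": "London"}
        )


class LegacyCsvAdapter:
    """Maps the line-based interface to the common get_customer() interface."""

    def __init__(self, system):
        self.system = system

    def get_customer(self, customer_id):
        cid, name, city = self.system.fetch_line(customer_id).split(";")
        return {"id": cid, "name": name, "city": city}


class JsonApiAdapter:
    """Maps the JSON interface to the common get_customer() interface."""

    def __init__(self, system):
        self.system = system

    def get_customer(self, customer_id):
        doc = json.loads(self.system.request(f"/customers/{customer_id}"))
        return {"id": doc["id"], "name": doc["fullName"], "city": doc["city"]}


sources = [LegacyCsvAdapter(LegacyCsvSystem()), JsonApiAdapter(JsonApiSystem())]
customers = [source.get_customer("42") for source in sources]
```

Both adapters return the same dictionary shape, so the calling code does not need to know which access technology sits behind each system.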

As a borderline case, the systems to be integrated may be homogeneous, coordinated and centralized. While this case is certainly "nicer", it is not the usual one. Therefore, HAD systems are the target and the context.

System vs. Data Integration?

In some cases, authors and professionals make a case for the distinction between system and data integration. Is there really a fundamental difference?

Data integration usually refers to transferring data from one system to another system. In the extreme case this is database synchronization (replication or master-slave). In many cases, however, the data is exchanged between systems through application programming interfaces (APIs). In this case the data is extracted from the database into a main memory representation before being sent to the target system (using web services, queues, or other mechanisms).
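The path described above (database, main memory representation, transport, target system) can be sketched with the Python standard library. This is only an illustration under stated assumptions: the customer table, the JSON message format and the helper functions export_customers and import_customers are hypothetical, and an in-process queue stands in for a real web service or message queue.

```python
import json
import queue
import sqlite3

# Source and target systems, each with its own (here identical) schema.
source_db = sqlite3.connect(":memory:")
source_db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
source_db.executemany(
    "INSERT INTO customer VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
)

target_db = sqlite3.connect(":memory:")
target_db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")

channel = queue.Queue()  # stands in for the transport between the systems


def export_customers(db, out):
    """Extract rows into an in-memory representation and send them."""
    for cid, name in db.execute("SELECT id, name FROM customer ORDER BY id"):
        record = {"id": cid, "name": name}
        out.put(json.dumps(record))


def import_customers(db, inp):
    """Receive messages and write them into the target database."""
    while not inp.empty():
        record = json.loads(inp.get())
        db.execute(
            "INSERT OR REPLACE INTO customer VALUES (?, ?)",
            (record["id"], record["name"]),
        )
    db.commit()


export_customers(source_db, channel)
import_customers(target_db, channel)
rows = target_db.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
```

After the run, rows holds the two customers copied from the source database.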

System integration is usually a superset that includes data integration as well as remote functionality integration. In this case one system "invokes" the other system's functionality by sending data, in some cases together with processing directives. Because data can also be sent as part of this approach, system integration is the superset.
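A request that carries both data and processing directives might look like the following sketch. The InventorySystem class, the "reserve" operation and the "dry_run" directive are hypothetical and chosen only to show how a directive changes the way the invoked functionality is executed.

```python
class InventorySystem:
    """Hypothetical target system exposing remotely invocable functionality."""

    def __init__(self):
        self.stock = {"widget": 10}

    def handle(self, request):
        operation = request["operation"]
        data = request["data"]
        directives = request.get("directives", {})

        if operation == "reserve":
            item, quantity = data["item"], data["quantity"]
            if self.stock.get(item, 0) < quantity:
                return {"status": "rejected", "reason": "insufficient stock"}
            # The directive controls how the request is processed:
            # a dry run checks feasibility without changing state.
            if not directives.get("dry_run", False):
                self.stock[item] -= quantity
            return {"status": "ok", "remaining": self.stock[item]}

        return {"status": "error", "reason": f"unknown operation {operation!r}"}


inventory = InventorySystem()
check = inventory.handle({
    "operation": "reserve",
    "data": {"item": "widget", "quantity": 3},
    "directives": {"dry_run": True},
})
booked = inventory.handle({
    "operation": "reserve",
    "data": {"item": "widget", "quantity": 3},
})
```

The first call only checks whether the reservation is possible, so the stock stays at 10. The second call performs the reservation and leaves 7 in stock.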

Direct vs. Indirect Integration

Systems can communicate with each other directly to achieve integration, or indirectly through a third system, a so-called integration middle-man.

Direct Integration. Direct integration between systems is accomplished by one system directly invoking the interfaces of another system. The invoking system therefore must decide when to invoke a given interface and which data to pass. The invoked system simply executes the implementation of the invoked interface.
Indirect Integration. In the case of indirect integration there is a middle-man system involved. Each system to be integrated communicates only with the middle-man. A middle-man can be a service bus, an integration broker, or a custom-implemented system. In indirect integration, the control can lie with the middle-man, meaning that it initiates all communication, or with the individual systems to be integrated, meaning that each system invokes the middle-man as a broker whenever it needs integration functionality.
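The indirect variant in which control lies with the individual systems can be sketched as a small in-memory broker. This is an illustration only: the Middleman class with its subscribe and publish methods, and the OrderEntry, Billing and Shipping systems, are hypothetical names, and a real service bus or integration broker offers far more than this.

```python
class Middleman:
    """Hypothetical integration broker holding the routing logic."""

    def __init__(self):
        self.routes = {}  # message type -> list of handlers
        self.log = []     # every message passes through one place

    def subscribe(self, message_type, handler):
        self.routes.setdefault(message_type, []).append(handler)

    def publish(self, message_type, payload):
        self.log.append((message_type, payload))
        for handler in self.routes.get(message_type, []):
            handler(payload)


class Billing:
    def __init__(self):
        self.invoices = []

    def on_order(self, order):
        self.invoices.append(order["id"])


class Shipping:
    def __init__(self):
        self.shipments = []

    def on_order(self, order):
        self.shipments.append(order["id"])


class OrderEntry:
    """Knows only the middle-man, not Billing or Shipping."""

    def __init__(self, broker):
        self.broker = broker

    def place_order(self, order_id):
        self.broker.publish("order_placed", {"id": order_id})


broker = Middleman()
billing, shipping = Billing(), Shipping()
broker.subscribe("order_placed", billing.on_order)
broker.subscribe("order_placed", shipping.on_order)
OrderEntry(broker).place_order("A-1")
```

In a direct integration, OrderEntry would instead hold references to Billing and Shipping and call them itself. Here the routing lives in one place, and the broker's log gives a single view of all messages exchanged.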

There are several pros and cons for each approach. Direct integration is quicker to achieve, but it requires the systems to know about each other. It is also difficult to reason about the overall system state in terms of protocol consistency and data consistency.

Indirect integration takes more effort to implement, but since the integration logic is separate from the integrated systems, the overall system state and consistency are easier to reason about.

Functionality

Integration functionality can range from a very simple best-effort file transfer to a very complex data replication process that ensures data consistency and integrity.

In principle, there are several different integration functionalities:

Data Copy. This case integrates two systems by copying data from one to the other. In the simplest form, the data is exported from one system and imported into another system. No further assumption is made about state or consistency. The exporting system decides what to export, and the importing system decides how to import the data and how to act in the presence of conflicts.
Data Replication. Data replication ensures that the data in one system is replicated into a second system. In the simplest case, the schema of the two systems is the same at the database level, so that replication does not have to deal with data transformation. In a more complex case, data might have to be transformed (syntactically or semantically) in order to be properly replicated, and the transformation can happen on the originating system, on the target system, or on the middle-man (if one is involved). Data replication might be instantaneous, meaning that both systems progress in lock-step, or lagging, meaning that the target system might not be fully replicated at any given point in time.
State Replication. In this case the objective is not to transfer data, but to advance the state of systems. If the source system changes state, the same state change is carried out in the target system. So instead of integrating by sending data, the integration takes place by sending state changes from one system to the other.
Process Integration. Process integration is not concerned with keeping systems "the same" in terms of data or state, but with integrating different functionalities in a certain causal order. Each system being integrated implements a certain business functionality, and once one functionality has been accomplished, the integration determines the next system that has to execute the next functionality. Integration in this sense is like threading systems together to accomplish functionality that no single system is able to achieve.
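As one illustration of the list above, lagging data replication with a transformation step can be sketched as follows. This is a minimal sketch under assumptions: the change-log approach is one possible technique among many, and the names ChangeLogSource, LaggingReplica, transform and the first/last/full_name fields are hypothetical.

```python
class ChangeLogSource:
    """Hypothetical source system that records every change in an ordered log."""

    def __init__(self):
        self.rows = {}
        self.log = []

    def upsert(self, key, first, last):
        self.rows[key] = {"first": first, "last": last}
        self.log.append((key, dict(self.rows[key])))


def transform(row):
    # The source schema has first/last; the target has a single full_name.
    return {"full_name": f"{row['first']} {row['last']}"}


class LaggingReplica:
    """Target system that applies the source's changes some time later."""

    def __init__(self, source):
        self.source = source
        self.rows = {}
        self.position = 0  # index of the next log entry to apply

    def lag(self):
        # Number of source changes not yet replicated.
        return len(self.source.log) - self.position

    def catch_up(self, max_entries=None):
        end = len(self.source.log)
        if max_entries is not None:
            end = min(end, self.position + max_entries)
        for key, row in self.source.log[self.position:end]:
            self.rows[key] = transform(row)
        self.position = end


source = ChangeLogSource()
replica = LaggingReplica(source)
source.upsert(1, "Ada", "Lovelace")
source.upsert(2, "Grace", "Hopper")
lag_before = replica.lag()   # two changes are not yet replicated
replica.catch_up(max_entries=1)
lag_partial = replica.lag()  # one change is still outstanding
replica.catch_up()
lag_after = replica.lag()    # replica is fully caught up
```

Between the calls to catch_up the target is not fully replicated, which is exactly the lagging case. Instantaneous replication would instead apply each change to the target as part of the same step that changes the source.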

This is only a high-level characterization of systems integration. Each functionality can be subdivided into more specialized cases. At the same time, it can be argued that a client-server relationship is also an integration of systems.