Integration Functionality
This document outlines the system context and major functionality of integration. It reflects the author's viewpoint,
based on experience gained over many years. In general, every integration problem is unique, and each
integration problem can be approached in many ways. Having a system context as a frame of reference helps in analyzing
the integration problem at hand. The functionality section presents a collection of the main requirements and features
and is by no means complete. The list gives an impression of the complexity of integration and of the variety
of integration problems.
System Context
The system context defines the frame of reference for the systems that are being integrated. Systems considered
for integration here are HAD systems: heterogeneous, autonomous and distributed:
- Heterogeneous. In the general case, each system has its own data model and its own data schema. These are
conceptualized and developed independently of each other, resulting in conceptual and semantic
differences as well as schema differences. Different systems therefore have some disjoint concepts,
some overlapping concepts and some common concepts. The same is true for the data schema. Because the
data model and data schema differ to some extent, the systems are heterogeneous.
- Autonomous. Systems are in general autonomous, meaning that they perform state changes independently
of other systems. They do not coordinate changes through coordination technology, distributed transactions, or
similar mechanisms. As a consequence, the state of one system that needs to be integrated does not imply a
certain state of another system that needs to be integrated. Any integration solution must be able to deal
with any state that any of the systems to be integrated can be in, as well as with the fact that
state changes can happen at any time.
- Distributed. In general, systems run on servers and have their state managed in databases. Some
systems might share the same server and database infrastructure; others do not. It is safe to assume that systems
do not share infrastructure and that the infrastructure they use is mostly disjoint. This implies that, in general,
network access is required to reach these systems and that there is no central environment from which the systems
can be accessed locally. An integration solution must therefore be aware of the distribution and of the fact that different
systems can be available and unavailable at different times for different periods of time.
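One common way to cope with a remote system that is temporarily unavailable is to retry the call with increasing delays. The following is a minimal sketch of that idea, not a prescribed solution; `SystemUnavailable` and `call_with_retry` are hypothetical names introduced only for illustration, and the `sleep` parameter exists so the waiting behaviour can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SystemUnavailable(Exception):
    """Raised by a (hypothetical) connector when the remote system cannot be reached."""


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying with exponential backoff while the system is unavailable."""
    for attempt in range(attempts):
        try:
            return operation()
        except SystemUnavailable:
            if attempt == attempts - 1:
                raise  # give up: the system stayed unavailable for every attempt
            sleep(base_delay * 2 ** attempt)  # wait 0.5s, 1s, 2s, ... by default
    raise ValueError("attempts must be at least 1")
```

Because the integrated systems are autonomous, the state read after a successful retry may differ from the state that existed when the first attempt was made, so the caller should not assume anything about it.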
A note on heterogeneity. Although heterogeneity is often seen as a data-specific system property, it
can also be found in the remote access interfaces of those systems. Different systems use different technologies to
expose their functionality, which causes heterogeneity far beyond data-level heterogeneity. Integration has to
overcome this type of heterogeneity, too.
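Both kinds of heterogeneity are often hidden behind adapters that expose one common interface and one common record shape. The sketch below assumes two made-up systems: a "CRM" that returns dictionaries with split name fields and a "billing" system that exposes delimited text lines. All class and field names are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Customer:
    """Common (canonical) record that the integration works with."""
    customer_id: str
    full_name: str


class CustomerSource(Protocol):
    """Common access interface that every adapter provides."""
    def fetch(self, customer_id: str) -> Customer: ...


class CrmAdapter:
    """Wraps a hypothetical system that returns dict records with split names."""

    def __init__(self, records: dict[str, dict[str, str]]) -> None:
        self._records = records

    def fetch(self, customer_id: str) -> Customer:
        rec = self._records[customer_id]
        return Customer(customer_id, f"{rec['first']} {rec['last']}")


class BillingAdapter:
    """Wraps a hypothetical system that exposes text lines of the form 'id;NAME'."""

    def __init__(self, lines: list[str]) -> None:
        self._lines = lines

    def fetch(self, customer_id: str) -> Customer:
        for line in self._lines:
            ident, name = line.split(";", 1)
            if ident == customer_id:
                return Customer(ident, name.title())  # normalise upper-case names
        raise KeyError(customer_id)
```

Code that is written against `CustomerSource` does not need to know which of the two systems it is talking to.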
As a borderline case it is possible that the integration is with systems that are homogeneous, coordinated
and centralized. While this case is certainly "nicer", it is not the usual case. Therefore, HAD systems are the
target and the context.
System vs. Data Integration?
In some cases, authors and professionals make a case for the distinction between system and data integration. Is
there really a fundamental difference?
Data integration usually refers to transferring data from one system to another system. In the extreme case
this is database synchronization (replication or master-slave). In many cases, however, the data is exchanged
between systems through application programming interfaces (APIs). In this case the data is extracted from the
database into a main memory representation before being sent to the target system (using web services, queues,
or other mechanisms).
System integration is usually a superset that includes data integration as well as remote functionality integration.
In the latter case one system "invokes" the other system's functionality by sending data and, in some cases,
processing directives. Since data can also be sent as part of this approach, system integration is the superset.
Direct vs. Indirect Integration
Systems can directly communicate with each other to achieve integration, or through a third system, a so-called
integration middleman.
- Direct Integration. Direct integration between systems is accomplished by one system directly invoking
the interfaces of another system. The invoking system therefore must decide when to invoke a given interface and
which data to pass. The invoked system simply executes the implementation of the invoked interface.
- Indirect Integration. In the case of indirect integration there is a middle-man system involved. Each
system to be integrated only communicates with the middle-man. A middle-man can be a service bus, an integration
broker, or a custom-implemented system. In indirect integration, control can lie with the middle-man, meaning that
it initiates all communication, or with the individual systems to be integrated, meaning that each system invokes
the middle-man as a broker whenever it needs integration functionality.
There are several pros and cons for each approach. Direct integration is quicker to achieve, but it requires
the systems to know about each other. It is also difficult to reason about the overall system state in terms
of protocol consistency and data consistency.
Indirect integration takes more effort to implement, but since the integration logic is separate from
the integrated systems, the overall system state and consistency are easier to reason about.
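The difference between the two approaches can be sketched in a few lines. In this illustration the middle-man is a tiny in-process publish/subscribe broker; `Broker`, `Warehouse`, and the topic name `order.created` are hypothetical, and a real service bus or integration broker would add networking, persistence, and error handling.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class Broker:
    """Hypothetical middle-man: systems talk to it and never to each other."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)
        self.log: list[tuple[str, dict]] = []  # one place to inspect all traffic

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        self.log.append((topic, message))
        for handler in self._subscribers[topic]:
            handler(message)


class Warehouse:
    """Hypothetical target system with a single interface, `reserve`."""

    def __init__(self) -> None:
        self.reserved: list[str] = []

    def reserve(self, order: dict) -> None:
        self.reserved.append(order["order_id"])


warehouse = Warehouse()

# Direct integration: the caller knows the warehouse and invokes its interface.
warehouse.reserve({"order_id": "A1"})

# Indirect integration: the caller only knows the broker and a topic name.
broker = Broker()
broker.subscribe("order.created", warehouse.reserve)
broker.publish("order.created", {"order_id": "A2"})
```

The `log` kept by the broker hints at why the indirect approach is easier to reason about: all exchanged messages pass through one place.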
Functionality
Integration functionality can range from a very simple best-effort file transfer to a very complex data replication
process that ensures data consistency and integrity.
In principle, there are several different integration functionalities:
- Data Copy. This case integrates two systems by copying data from one to the other. In the simplest form,
the data is exported from one system and imported into another system. No further assumption is made about state or
consistency. The exporting system decides what to export, and the importing system decides how to import the data and
how to act in the presence of conflicts.
- Data Replication. Data replication ensures that the data in one system is replicated into a second
system. In the simplest case, the schema of the two systems is the same on a database level, so that replication does
not have to deal with data transformation. In a more complex case data might have to be transformed (syntactically
or semantically) in order to be properly replicated. The transformation can happen on the originating system,
the target system, or the middle-man (if one is involved). Data replication might be instantaneous, meaning that both
systems progress in lock-step, or lagging, meaning that the target system might not be fully replicated at any
given point in time.
- State Replication. In this case the objective is not to transfer data, but to advance the state of the systems.
If the source system changes its state, the same state change is carried out in the target system. So instead of
integration by sending data, the integration takes place by sending state changes from one system to the other.
- Process Integration. Process integration is not concerned with keeping systems "the same" as each
other in terms of data or state, but with integrating various different functionalities in a certain
causal order. Each system being integrated implements a certain business functionality, and once one functionality is
accomplished the integration determines the next system that has to execute the next functionality. Integration in
this sense is like threading systems together to accomplish functionality that no single system is able to achieve.
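The contrast between data replication and state replication in the list above can be illustrated with a small sketch. It assumes a hypothetical `OrderSystem` whose only state change is `ship`; the function names and the shape of the change list are likewise assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class OrderSystem:
    """Hypothetical system holding order records keyed by order id."""
    orders: dict[str, dict] = field(default_factory=dict)

    def ship(self, order_id: str) -> None:
        # A state change carried out by the system's own logic and rules.
        order = self.orders[order_id]
        if order["status"] != "paid":
            raise ValueError(f"cannot ship order in status {order['status']!r}")
        order["status"] = "shipped"


def replicate_data(source: OrderSystem, target: OrderSystem) -> None:
    """Data replication: copy the source's records into the target as they are."""
    for order_id, order in source.orders.items():
        target.orders[order_id] = dict(order)


def replicate_state(changes: list[tuple[str, str]], target: OrderSystem) -> None:
    """State replication: replay each (operation, order_id) on the target."""
    for operation, order_id in changes:
        if operation == "ship":
            target.ship(order_id)  # the target applies its own rules
        else:
            raise ValueError(f"unknown state change {operation!r}")


source = OrderSystem({"A1": {"status": "paid"}})
target = OrderSystem()

replicate_data(source, target)             # target now holds a copy of A1
source.ship("A1")                          # the source changes state ...
replicate_state([("ship", "A1")], target)  # ... and the change is replayed
```

With data replication the target receives the resulting records; with state replication it receives the change itself and executes it, so its own validation (here the check for the `paid` status) still applies.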
This is only a high-level characterization of systems integration. Each category can be subdivided into more specialized
cases. At the same time, it can be argued that a client-server relationship is also an integration of systems.
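To make the process integration category more concrete, the following sketch treats three hypothetical systems as plain functions and lets the integration decide which one runs next. The step names, the credit limit of 1000, and the routing rules in `next_step` are invented for illustration and do not come from any particular product.

```python
from typing import Callable, Optional


def check_credit(order: dict) -> dict:
    """Stand-in for a credit system; the limit of 1000 is an assumption."""
    return {**order, "credit_ok": order["amount"] <= 1000}


def reserve_stock(order: dict) -> dict:
    """Stand-in for a warehouse system."""
    return {**order, "reserved": True}


def send_invoice(order: dict) -> dict:
    """Stand-in for a billing system."""
    return {**order, "invoiced": True}


SYSTEMS: dict[str, Callable[[dict], dict]] = {
    "check_credit": check_credit,
    "reserve_stock": reserve_stock,
    "send_invoice": send_invoice,
}


def next_step(current: str, order: dict) -> Optional[str]:
    """Integration logic: choose the next system from the last step's result."""
    if current == "check_credit":
        return "reserve_stock" if order["credit_ok"] else None
    if current == "reserve_stock":
        return "send_invoice"
    return None  # the process ends after invoicing


def run_process(order: dict, start: str = "check_credit") -> tuple[dict, list[str]]:
    """Run the systems in causal order and return the final order and the trace."""
    step: Optional[str] = start
    trace: list[str] = []
    while step is not None:
        order = SYSTEMS[step](order)
        trace.append(step)
        step = next_step(step, order)
    return order, trace
```

None of the three functions knows about the others; the causal order lives entirely in `next_step`, which is the part that a process integration solution would own.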