Christoph Bussler

Integration

What is Integration?

Integration is establishing consistent communication and data exchange between heterogeneous, autonomous, and distributed (HAD) systems.

Reading about all the wonderful web service technology, EAI technology, B2B technology, Service Bus, etc., one could get the impression that there is a working approach to "clean" and "easy" integration. With clean integration I mean that the messages are well-defined in structure and vocabulary, that their exchange sequence is well-defined, and at runtime it simply works without issues or problems. Any run-time failure is detectable and can be automatically addressed and rectified.

The reality is, however, very different. Integration is still complicated and needs a lot of diligence in its implementation to work reliably. Monitoring remains difficult and error detection and repair take a lot of time. This web site is dedicated to discuss the nitty-gritty details, challenging use cases, opinions, as well as solutions in context of daily integration challenges.

Top

Functionality

Integration can mean many things to many people. From a simple copy of a file, to a complex replication across several application systems, all can be considered "integration". It is therefore very important to build up a frame of reference in order to have a meaningful discussion on the functionality and eventually semantics of what integration means.

First, it has to be defined what the systems are that require integration. And then it has to be defined what kinds of integration can be accomplished between the systems. For example, a system can be a database and a possible integration is to send changes from one database to another one without considering failures. Another example is a Java class that accesses a specific web service as a remote entity. Yet another example is two databases that are integrated by defining views across them and making the views the access interface.

The functionality discussion is quite involved and takes place on a separate page: Integration Functionality

Top

Context

This project was a classical B2B integration problem where two systems had to be integrated (source and target system). The two systems belonged to two different corporations and were running inside their clouds. The integration had to cross the boundaries of the clouds and the technology used for that was XML over HTTP. The integration functionality required was state replication.

The data was kept in a relational database system and the state changes happened inside the database as data changes. Plus, once a state change happened, an indicator in the data set was updated to record the fact that a state change happened.

In every integration, the source system has to decide, when to initiate integration. In case of state replication, a state change (not necessarily every state change) triggers the integration: the state changes, and a document containing either the state change or the new state is sent to the target system (for it to process the change). In this context the document sent to the target system contains the new state.

Integration is often associated with a queuing architecture. The usual way to using queuing in context of integration is that the source system queues up the documents that need to be sent to the target system. The queuing system itself, or a queue monitor, fetch each document from the queue and send it to the target system. Once the target system received the document, it is removed from the queue. If there is an error situation, the document remains in the queue.

There are several functional problems that can arise (not necessarily have to) depending on the context:

Queue Delay. A document entered into the queue will stay in that queue until a successful transmission to the target system took place. Depending on the queue length, this might be a while, especially in presence of network outages and failures.
Document Ordering. Because documents can be in the queue for a while, it is possible that several state changes take place in the source system and therefore there might be several documents in the queue with different state changes of the same state. In order for the target system to stay consistent, the documents have to be transmitted in exactly that order.
Error Handling. If for some reason the target system does not successfully receive the document and e.g. returns an error, the document in the queue has to be dealt with. One immediate thought is to remove the document and put it into an error queue. While this in general might be a good approach, this also means that all subsequent documents that contain state changes of the same state have to be put into the error queue without trying to send them to the target system. And this has to take place until the error is resolved. A more advanced version would put the state into an error state, too, so that no more state changes are put into the queue. Fundamentally, state changes are not triggering integration anymore until the error is resolved. Once the error is resolved, all the documents from the error queue have to be inserted in that order into the regular queue again before processing of that state can resume.

In case of many errors on the target system side (because of target system processing errors or communication errors), the error handling on the source system can be quite involved in the sense that a lot of different documents have to be managed and their state.

Top

Approach: Query-Based Integration

The context of the project was such that a lot of errors were occurring on the target system side as well as on the communication layer side crossing several cloud boundaries on the way.

Aside from a high-frequency state change situation, and additional interesting business logic specific behavior was that that state changes were cumulative. This means that not every state change had to be sent to the target system. If two were skipped, the third state change subsumed the previous two.

The question asked was: is it possible to achieve the integration without a queuing system to avoid all the error handling complexity?

Several alternatives were explored. A first alternative was to have a daemon running that queries the data base for changed data items. Since there was an indicator in the data set, this would be a simple select query. Once the data is retrieved, the data is packaged for transmission to the target system. Once the transmissing happened, the indicator was reset. Since the reset is an update, the daemon was transactional.

This approach has several issues. One is, a single query can return many data items that have a state change. Each has to be transmitted one-by-one. This takes time and during that time there is an open transaction. This might be quite an issue if there are hundreds or thousands of state changes processed inside one transaction. Secondly, if the daemon fails (or the database fails), the transaction is aborted, and all the data items are reset. However, the actually transmissions cannot be reset as they are non-transactional. So an inconsistent state would be the result which would be difficult to repair.

An alternative approach was implemented. The basic idea was to use the "select for update skip locked" statement. In that SQL statement several functionalities were achieved:

only changed data items were selected
the items were ordered by date of change so that the oldest change was processed first
only the oldest change was selected (meaning, the query returns exactly one result, or none if there was no change)
the selected data item was marked "in transmission"
and items "in transmissions" were not selected, either

Basically, the query selected the oldest state change not in transmission and put it into state "in transmission". All that happens in one transaction. Once the query returned (and the transaction was committed), the actually transmission was initiated. If it succeeded, the data item was updated in a separate transaction (and therefore not being fetched anymore until the next state change). If the the transmission failed, a re-try took place. If that still failed, then the data item was put into error state (and data items in error state were not fetched either).

In terms of error handling, if the transmission fails x number of times, the data item is put into error state. If the transmitting system crashes, then the transmission did not take place, and the data item is "in transmission". After re-start, those "in transmission" items can be tried again.

Those data items in error state can be changed back to normal once the error condition was resolved. And then they will be picked up again by the query.

This approach closed the window of error situations significantly, but, of course, did not completely eliminate them as the transmission is non-transactional. In the worst case, the transmission happened, but the data item was not updated and remains in "in transmission". This happens when the transmitting code crashes after a successful transmission, but before updating the data item. While this is a real possibility, the window of code is fairly small and would affect only one data item.