Christoph Bussler
Integration
Integration is establishing consistent communication and data exchange between heterogeneous, autonomous, and distributed (HAD) systems.
Reading about all the wonderful web service technology, EAI technology, B2B technology, Service Bus, etc., one could get the impression that there is a working approach to "clean" and "easy" integration. By clean integration I mean that the messages are well-defined in structure and vocabulary, that their exchange sequence is well-defined, and that at runtime everything simply works without issues or problems. Any run-time failure is detectable and can be automatically addressed and rectified.
The reality is, however, very different. Integration is still complicated and needs a lot of diligence in its implementation to work reliably. Monitoring remains difficult, and error detection and repair take a lot of time. This web site is dedicated to discussing the nitty-gritty details, challenging use cases, opinions, as well as solutions in the context of daily integration challenges.
Integration can mean many things to many people. From a simple copy of a file, to a complex replication across several application systems, all can be considered "integration". It is therefore very important to build up a frame of reference in order to have a meaningful discussion on the functionality and eventually semantics of what integration means.
First, it has to be defined what the systems are that require integration. And then it has to be defined what kinds of integration can be accomplished between the systems. For example, a system can be a database and a possible integration is to send changes from one database to another one without considering failures. Another example is a Java class that accesses a specific web service as a remote entity. Yet another example is two databases that are integrated by defining views across them and making the views the access interface.
The functionality discussion is quite involved and takes place on a separate page: Integration Functionality
This project was a classical B2B integration problem where two systems had to be integrated (source and target system). The two systems belonged to two different corporations and were running inside their clouds. The integration had to cross the boundaries of the clouds and the technology used for that was XML over HTTP. The integration functionality required was state replication.
The data was kept in a relational database system and the state changes happened inside the database as data changes. Plus, once a state change happened, an indicator in the data set was updated to record the fact that a state change happened.
In every integration, the source system has to decide when to initiate the integration. In the case of state replication, a state change (not necessarily every state change) triggers the integration: the state changes, and a document containing either the state change or the new state is sent to the target system (for it to process the change). In this context the document sent to the target system contains the new state.
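To make the indicator mechanism concrete, here is a minimal sketch using SQLite. The actual schema of the project is not known; the table, column names, and states ('created', 'shipped') are assumptions for illustration. The key point is that a state change and the setting of the indicator happen in the same transaction.

```python
import sqlite3

# Assumed schema: each row carries its business state plus an indicator
# column recording that the state changed since the last replication.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_item (
        id      INTEGER PRIMARY KEY,
        state   TEXT NOT NULL,
        changed INTEGER NOT NULL DEFAULT 0  -- 1 = state change not yet replicated
    )
""")

def update_state(item_id, new_state):
    # A state change updates the row and sets the indicator
    # in one and the same transaction.
    with conn:
        conn.execute(
            "UPDATE data_item SET state = ?, changed = 1 WHERE id = ?",
            (new_state, item_id),
        )

conn.execute("INSERT INTO data_item (id, state) VALUES (1, 'created')")
update_state(1, "shipped")
row = conn.execute("SELECT state, changed FROM data_item WHERE id = 1").fetchone()
```

After the state change, the row shows the new state and the indicator is set, ready to be picked up for transmission.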
Integration is often associated with a queuing architecture. The usual way to use queuing in the context of integration is that the source system queues up the documents that need to be sent to the target system. The queuing system itself, or a queue monitor, fetches each document from the queue and sends it to the target system. Once the target system has received the document, it is removed from the queue. In an error situation, the document remains in the queue.
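The queuing pattern just described can be sketched in a few lines. This is a deliberately simplified model (the `send` callable standing in for the delivery to the target system is a hypothetical placeholder): a document is removed only after successful delivery, and on error it stays queued for a later retry.

```python
from collections import deque

outbound = deque()  # documents waiting to be sent to the target system

def enqueue(document):
    outbound.append(document)

def drain(send):
    # 'send' is a hypothetical callable returning True on successful delivery.
    while outbound:
        document = outbound[0]
        if send(document):
            outbound.popleft()  # remove only after successful delivery
        else:
            break               # document stays queued for a later retry

enqueue({"id": 1})
enqueue({"id": 2})
drain(lambda d: d["id"] == 1)   # delivery of the second document fails
```

After the drain, the first document is gone and the second remains queued, which is exactly the state that the source system's error handling then has to manage.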
There are several functional problems that can (but do not necessarily have to) arise, depending on the context:
In case of many errors on the target system side (because of target system processing errors or communication errors), the error handling on the source system can be quite involved, in the sense that a lot of different documents, and their states, have to be managed.
The context of the project was such that a lot of errors were occurring on the target system side as well as on the communication layer, which crossed several cloud boundaries on the way.
Aside from the high-frequency state change situation, an additional interesting, business-logic-specific behavior was that state changes were cumulative. This means that not every state change had to be sent to the target system: if two were skipped, the third state change subsumed the previous two.
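Cumulative state changes mean that only the latest not-yet-sent state per data item matters. A tiny sketch (the item identifiers and state names are made up for illustration):

```python
# Only the latest state per item needs to be transmitted;
# intermediate, not-yet-sent changes are subsumed by the newest one.
pending = {}  # item id -> latest state awaiting transmission

def record_change(item_id, new_state):
    pending[item_id] = new_state  # overwrites any not-yet-sent state

record_change(1, "packed")
record_change(1, "shipped")
record_change(1, "delivered")  # subsumes the two earlier changes

to_send = list(pending.items())
```

A single transmission of "delivered" replaces three queued documents, which is what makes a queue-free design attractive in the first place.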
The question asked was: is it possible to achieve the integration without a queuing system to avoid all the error handling complexity?
Several alternatives were explored. A first alternative was to have a daemon running that queries the database for changed data items. Since there was an indicator in the data set, this would be a simple select query. Once the data is retrieved, it is packaged for transmission to the target system. Once the transmission happened, the indicator was reset. Since the reset is an update, the daemon was transactional.
This approach has several issues. One is that a single query can return many data items that have a state change. Each has to be transmitted one by one. This takes time, and during that time there is an open transaction. This might be quite an issue if there are hundreds or thousands of state changes processed inside one transaction. Secondly, if the daemon fails (or the database fails), the transaction is aborted, and all the data items are reset. However, the actual transmissions cannot be reset as they are non-transactional. The result would be an inconsistent state that would be difficult to repair.
An alternative approach was implemented. The basic idea was to use the "select for update skip locked" statement. With that one SQL statement several functionalities were achieved:
Basically, the query selected the oldest state change not in transmission and put it into the state "in transmission". All of that happens in one transaction. Once the query returned (and the transaction was committed), the actual transmission was initiated. If it succeeded, the data item was updated in a separate transaction (and was therefore not fetched anymore until the next state change). If the transmission failed, a retry took place. If that still failed, the data item was put into an error state (and data items in error state were not fetched either).
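The claim step can be sketched as follows. The real statement was along the lines of selecting the oldest eligible row with "for update skip locked" and marking it "in transmission" in the same transaction; the exact SQL dialect and schema of the project are not known. SQLite has no "skip locked", so this single-threaded simulation (column names and status values are assumptions) claims the oldest changed row and flips its status in one transaction.

```python
import sqlite3

# Simulated claim step; in a database supporting it, the select would be
#   SELECT ... WHERE status = 'normal'
#   ORDER BY changed_at LIMIT 1 FOR UPDATE SKIP LOCKED
# so that concurrent daemons skip rows already locked by another claimer.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_item (
        id         INTEGER PRIMARY KEY,
        state      TEXT NOT NULL,
        status     TEXT NOT NULL DEFAULT 'normal',  -- normal | in transmission | error
        changed_at INTEGER NOT NULL                  -- ordering of state changes
    )
""")
conn.executemany(
    "INSERT INTO data_item (id, state, changed_at) VALUES (?, ?, ?)",
    [(1, "shipped", 100), (2, "packed", 200)],
)

def claim_oldest():
    # One transaction: pick the oldest eligible row, mark it 'in transmission'.
    with conn:
        row = conn.execute(
            "SELECT id FROM data_item WHERE status = 'normal' "
            "ORDER BY changed_at LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE data_item SET status = 'in transmission' WHERE id = ?",
            (row[0],),
        )
        return row[0]

claimed = claim_oldest()
```

Once the claiming transaction commits, the row is invisible to further claims, and the (non-transactional) transmission can proceed outside any open transaction.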
In terms of error handling, if the transmission fails x number of times, the data item is put into the error state. If the transmitting system crashes, then the transmission did not take place, and the data item remains "in transmission". After a restart, those "in transmission" items can be tried again.
Data items in the error state can be changed back to normal once the error condition has been resolved. They will then be picked up again by the query.
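The status lifecycle described above amounts to a small state machine. A sketch (the status names follow the text; the transition table is my reading of it, not the project's code):

```python
# normal -> in transmission      (claimed by the query)
# in transmission -> normal      (transmission succeeded, or retried after restart)
# in transmission -> error       (transmission failed x times)
# error -> normal                (operator resolved the error condition)
VALID = {
    "normal": {"in transmission"},
    "in transmission": {"normal", "error"},
    "error": {"normal"},
}

def transition(current, target):
    if target not in VALID[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

status = "normal"
status = transition(status, "in transmission")  # claimed by the query
status = transition(status, "error")            # transmission failed repeatedly
status = transition(status, "normal")           # error condition resolved
```

Since the query only ever fetches items in "normal", both "in transmission" and "error" act as fences that keep an item from being claimed twice.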
This approach narrowed the window for error situations significantly but, of course, did not completely eliminate it, as the transmission is non-transactional. In the worst case, the transmission happened, but the data item was not updated and remains "in transmission". This happens when the transmitting code crashes after a successful transmission but before updating the data item. While this is a real possibility, the window in the code is fairly small and would affect only one data item.
© Christoph Bussler