Distributed systems in low reliability environments
This is the challenge of health IT at the primary care level. Most of the medical-objects destinations are Australian general practices which while computerised, are only just computerised and the stability and management of their systems is well… poor in general.
On top of this we run real time messaging using full PKI security for privacy and authentication. It is a challenge at times and we are continuing to learn, sometimes the hard way. The average life of a client install is < 12 months and often no one in the practice knows which computer is on, and frequently that computer disappears as part of an upgrade or reshuffle.
Managing this over thousands of sites is something we are trying to perfect. The first lesson we learnt was to get the content of the messages standards compliant. This has been vital as when we know that the payload meets Australian standards we can safely tweak the envelope to suite to recipients software, which often can’t handle compliant HL7. Compliant messages have allowed us to develop a large number of message modifiers or decorators which can be chained together in a huge number of ways to achieve a message that can be consumed. Even with a huge array of standard modifiers we still run into problems. Corporate systems are virtually impossible to tweak and the fact that they read to report date from the wrong place is something that you sometimes just have to accept. So the next element that makes it work is scriptable modifiers. We have a pascal script interface into our hl7 and xml parser that allow virtually any modification to be made. This is less routine and scalable that the canned modifiers, but is something you just need to have.
The other area that has proved vital is ongoing monitoring of server uptime. This allows us to detect clients with problems before someone notices that there have been no results for a while. Installation of new anti-virus software or changes to authentication parameters commonly cause this. We have also found sever instances of ISP misconfiguration of routers that block traffic between 2 clients.
Forwarding of errors in sending to support is another vital health monitor. Analysis of this is still not to the stage we would like it as many clients are disconnected after hours and weekends and the monitoring has to try and take this into account. Automating analysis of the problem is also an area we can improve on. We have started to identify patterns in the errors that mean something but the analysis is quite complex. We are considering using GELLO (Clinical decision support Language) to try and recognise these patterns in an automated way.
Double checking of new user and upgrade configuration is now much enhanced also. Users are good at finding shortcuts and at one stage we found support found a way to copy installation setups, followed by a few modifications. This made their setup much quicker but they did not change the GUID that is used for the URL on the proxy server…. The pleasing thing was that the PKI infrastructure prevented breaches of privacy as a wrong recipient could not decrypt the data but it had us scratching our heads. The sender and recipient both had each others keys and they were signed and correct but still we were getting errors in both authentication and decryption. In the end several clients were randomly getting each others messages so it would fail commonly, but if the correct client eventually received the message it would work. It was only when we noticed no hits in the client log despite a failed transmission that we tweaked to this one. The servers now avidly check all setups to prevent this happening.
The process has also made us aware of the difficulties in testing distributed systems. We have complex test setups but trying to simulate the real world of a distrubuted messaging system is also impossible. You have to try but the ongoing monitoring of connectivity and errors is something you have to have and the real world will always throw up surprises.
However we are excited about how well it works and how fast distributed systems are. The real world works quicker than any test setup and the message processing is done on different machines. In a test setup on a few machines you can almost follow it by looking at logging windows. In the real world it moves so fast that you have no hope of following it. Now to build something useful on top of it, after all that’s why we built it!
Leave a Reply