Sunday, May 27, 2007

Is eNatis broken because of IT basics being ignored?

The eNatis scandal has captured South Africa's imagination and been the subject of much ire. The reasons are easy to understand. If you were one of those queuing for days for a driver's licence or a test that is essential to getting a job, or missing work for days in order to be there, you'd also be upset. Or if you were a car manufacturer losing business (car sales were down 23% after the system's introduction owing to the inability to process new car registrations) and facing laying off workers, you'd also be upset.

But what bothers me is that I can almost see the reasons. I say that having been nowhere near the systems or the licensing service, but I can guess. Our minister of transport has told parliament that the new system is processing up to 619 000 transactions per day, up from 287 000 transactions on the old system [source].

What happened to require that increase in transactions? A backlog of work caused by downtime during the implementation could be one cause, but I'm betting on another.

The new system is apparently a centralised conglomeration of old systems, allowing better citizen management and, supposedly, better service. That could certainly account for some of the increase, but I'm betting on another cause.

The new system also appears to follow a much more "fat server, thin client" architecture. That certainly has the potential to require a massive increase in "transactions" between branches and the data centre, and I think that's the beginning of the issue. Some IT and PR blokes are feeding the minister numbers to make the system seem less inept: he's talking about IT transactions (e.g. record updates) versus citizen transactions (e.g. new car registrations). That is a completely meaningless statistic, because the relationship between citizen transactions and IT transactions depends entirely on the system's development architecture. In fact, poor development is typically characterised by a large ratio of IT transactions to customer transactions, usually the result of a lack of understanding of scalability and resource requirements. I'll bet that the new development is characterised by a massive number of database transactions per citizen transaction, and that is at the heart of this mess.
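To make the point concrete, here is a minimal sketch (purely illustrative, nothing to do with the actual eNatis code; all names are hypothetical) of how a "chatty" design turns one citizen transaction into many IT transactions, while a batched design does the same work in one round trip:

```python
# Illustrative sketch: one citizen-facing transaction fanning out into
# many IT/database transactions. All names and schemas are hypothetical.

class CountingDb:
    """Fake database that just counts the calls (IT transactions) it receives."""
    def __init__(self):
        self.calls = 0

    def execute(self, statement):
        self.calls += 1

def register_car_chatty(db, owner_id, car):
    # A chatty design issues one call per lookup/field update.
    db.execute(f"SELECT * FROM owners WHERE id={owner_id}")
    for field, value in car.items():
        db.execute(f"UPDATE registrations SET {field}='{value}' WHERE owner={owner_id}")
    db.execute(f"SELECT * FROM registrations WHERE owner={owner_id}")  # re-read to confirm
    db.execute("INSERT INTO audit_log VALUES (...)")

def register_car_batched(db, owner_id, car):
    # A batched design performs the same citizen transaction in one round trip.
    db.execute("INSERT INTO registrations (owner, vin, make, model, colour) VALUES (...)")

car = {"vin": "ABC123", "make": "Toyota", "model": "Corolla", "colour": "red"}

chatty = CountingDb()
register_car_chatty(chatty, 42, car)

batched = CountingDb()
register_car_batched(batched, 42, car)

print(chatty.calls, batched.calls)  # the chatty design needs 7 IT transactions to the batched design's 1
```

Multiply that ratio by hundreds of thousands of citizen transactions a day and the "619 000 transactions" figure stops sounding impressive and starts sounding like a symptom.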

Further, given my guess that the new architecture follows a thin-client model, I wouldn't be surprised to find the developer has built a terminal-server configuration, meaning that sessions are actually hosted on the server and users see only an image of the session on their machines. Terminal-server implementations mimic the days of mainframes and dumb terminals (some of you may remember those old green screens). Such architectures are great for call centres or data-processing shops where people sit in the same building as the server, but they're typically lousy in distributed deployments, where the design puts a huge load on the network. The admission that many problems are being caused by link failures makes me believe this guess is true.
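A back-of-envelope calculation shows why. The numbers below are assumptions I've picked for illustration, not measurements of eNatis, but the shape of the result holds for any plausible figures: shipping screen images over a branch link costs orders of magnitude more bandwidth than shipping only the data.

```python
# Back-of-envelope sketch with ASSUMED numbers (not eNatis measurements):
# terminal-server screen traffic vs data-only traffic for one branch.

BRANCH_USERS = 20              # concurrent sessions at one branch (assumption)

# Terminal-server model: the server pushes screen repaints to every user.
SCREEN_UPDATE_BYTES = 15_000   # one compressed partial-screen repaint (assumption)
UPDATES_PER_MIN = 30           # repaints per user per minute (assumption)

# Data-only model: the client renders locally and sends just the record.
RECORD_BYTES = 2_000           # one citizen-transaction record (assumption)
RECORDS_PER_MIN = 2            # records per user per minute (assumption)

terminal_bps = BRANCH_USERS * SCREEN_UPDATE_BYTES * UPDATES_PER_MIN * 8 / 60
data_bps = BRANCH_USERS * RECORD_BYTES * RECORDS_PER_MIN * 8 / 60

print(f"terminal-server model: {terminal_bps / 1000:.0f} kbit/s per branch")
print(f"data-only model:       {data_bps / 1000:.1f} kbit/s per branch")
```

With these assumptions the terminal-server model needs roughly a hundred times the bandwidth per branch, on links that in 2007 South Africa were often slow and unreliable to begin with.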

Finally, I'm shocked at the level of ineptitude that allowed this to happen. The developers apparently ignored the auditor-general's report, which predicted an 80% chance of failure. But long before that report was delivered, testing should have picked up these problems. These days load testing is easy to do: automated scripts can fire transactions at a system far faster than humans can, and it would have become apparent very quickly whether the system was scaling or not. The project manager's excuse that the problems were caused by things that could not be tested during development does not wash.
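The kind of test I mean takes a few dozen lines. This sketch (the "system under test" is a stand-in stub, not anything real) fires simulated transactions concurrently and measures throughput; run it with the concurrency and transaction rates the branches were expected to generate and a non-scaling system reveals itself immediately:

```python
# Minimal load-test sketch: fire simulated transactions concurrently and
# measure throughput. The system under test is a hypothetical stand-in.

import time
from concurrent.futures import ThreadPoolExecutor

def system_under_test(txn_id):
    # Stand-in for a real transaction (assumption): ~10 ms of wall-clock work.
    time.sleep(0.01)
    return txn_id

def run_load_test(n_transactions, concurrency):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(system_under_test, range(n_transactions)))
    elapsed = time.perf_counter() - start
    return len(results), elapsed

done, elapsed = run_load_test(n_transactions=200, concurrency=20)
print(f"{done} transactions in {elapsed:.2f}s -> {done / elapsed:.0f} txn/s")
```

Rerun the same script at 2x, 5x, 10x the expected load and plot the throughput: if it flattens or collapses, you have your answer well before go-live.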

In the meantime, I'll bet that some of the hardware, networking and database suppliers are making a fortune as the government attempts to scale the infrastructure to cope with the bad design.

Of course the above is all conjecture, but I'd place some bets that it's true.
