Why IT Doesn't Work


Dirty Data

I was briefing the senior architect of an extremely complex DoD program a while back. Tens of millions of dollars slated for just this part of the system and far more for certain other connected elements. He is an altogether excellent guy and has been in IT for about twenty years.

So I said, as I outlined how a couple of subsystems should communicate, "It would be best, of course, if the two subsystems had a transactional connection, so that data integrity will be protected." He nodded and we discussed other options also, but I realized after a bit that we weren't quite communicating. I suggested that maybe we were using the term "transactional" differently and asked for his definition. He offered that transactional programs were short in duration -- that was his concept.

Now listen -- this is awful. Every IT professional or manager should have the concept of transactionality clear in his mind, and especially a senior architect. I'm sure all readers of WITDW know all about this, but just in case one or two don't, transactionality means that when some hunk of software starts executing and updates some data:

  • Either all the data is updated, or none, even if there is a hardware or software failure at the least opportune moment
  • No other hunks of software see half-updated data even if they access the data while another hunk is processing
  • Once the hunk commits the data (at completion or, sometimes, at selected intermediate points), the data changes are guaranteed to be permanent, even if, e.g., the hard drive fails a millisecond later.
  • [This is a casual explanation of what are called the ACID properties of a transactional solution: atomicity, consistency, isolation, and durability.]

If you think it through, some of these stipulations sound impossible -- if the server loses power in the middle of updating your customer file to, say, lower the base interest rate and double the late charge fee, obviously half of the customers are updated and half aren't. The data is a mess. You can't roll the clock back and undo what's been done.

Except you can. If you have magical widgets like write-ahead logs, image copies, two-phase commit, journals, and transaction-isolation tables. Which is exactly what database, transaction management, and application server products from IBM, BEA, Microsoft, and Oracle have inside them.

Since about 1970.
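For readers who like to see it run, here is a minimal sketch of the all-or-nothing behavior, using Python's built-in sqlite3 module. The customers table, the rates, and the PowerFailure exception are all made up for illustration. A raised exception stands in for the power failure here; a real database product gets the same result after an actual crash by replaying or undoing its journal when it restarts.

```python
import sqlite3


class PowerFailure(Exception):
    """Hypothetical stand-in for a crash in the middle of a batch update."""


def make_db():
    """Build an in-memory database with ten made-up customers."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, base_rate REAL, late_fee REAL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(i, 5.0, 25.0) for i in range(1, 11)],
    )
    conn.commit()
    return conn


def reprice(conn, fail_after=None):
    """Lower the base rate and double the late fee for every customer.

    If fail_after is given, simulate a failure after that many rows.
    Returns True if the whole batch committed, False if it was rolled back.
    """
    ids = [row[0] for row in conn.execute("SELECT id FROM customers ORDER BY id")]
    try:
        # The connection as a context manager commits if the block finishes
        # and rolls back if an exception escapes it.
        with conn:
            for n, cid in enumerate(ids, start=1):
                conn.execute(
                    "UPDATE customers SET base_rate = base_rate - 0.5, "
                    "late_fee = late_fee * 2 WHERE id = ?",
                    (cid,),
                )
                if fail_after is not None and n == fail_after:
                    raise PowerFailure("lights out halfway through")
        return True
    except PowerFailure:
        return False


def count_updated(conn):
    """Count customers that carry the doubled late fee."""
    row = conn.execute("SELECT COUNT(*) FROM customers WHERE late_fee = 50.0")
    return row.fetchone()[0]


if __name__ == "__main__":
    db = make_db()
    print("committed:", reprice(db, fail_after=5), "updated:", count_updated(db))
    print("committed:", reprice(db), "updated:", count_updated(db))
```

The failed run leaves no customer half-updated, and the second run updates all of them. The point of the sketch is only that the application code never has to write its own "undo" logic.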

Maybe It's Yoko's Fault

So technology has existed since before the Beatles broke up that enables the construction of systems that should never lose data and never have inconsistent data. Any of your corporate systems corrupted or lost data since Abbey Road? Maybe your IT guys forgot to use this technology?

This technology was, at first, limited to data on a single system (think a mainframe at a bank here) and subject to other technical limitations (TANSTAAFL), but within its applicable problem domains, it was a terrific improvement over either (a) trying to invent solutions to this problem or (b) having corrupt data.

Now back to my architect friend from above. What's this got to do with an interface between two subsystems? About fifteen years ago, distributed transactionality began to be implemented in the vendor products. With this technology, clever architects could design enterprise solutions that connected applications on different servers (and, eventually, using different vendors' technology) such that a business transaction could execute across multiple machines and still have the ACID features I defined above.

I will once again say that this technology is nearly magical -- a 'new employee' record could be created on, say, the Payroll system, the 401K system, the employee club system, and the PAC-dunning system (built on mainframes, Unix, Intel, and Linux, perhaps) from one data entry screen and when the response is displayed 'Employee added,' every system is current. The transaction managers on all four systems interoperate, no matter where they are on the network, to guarantee that either all or none of the systems add the record.
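The "all or none" guarantee across machines is usually achieved with the two-phase commit mentioned earlier. Here is a toy, in-memory sketch of the idea in Python. The Participant class and the add_everywhere function are hypothetical names invented for this illustration, not any vendor's API, and the sketch leaves out everything that makes the real thing hard: durable logging of votes and decisions, timeouts, network failures, and recovery when the coordinator itself dies.

```python
class Participant:
    """A made-up stand-in for one system's transaction manager."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # whether this system will vote yes
        self.records = []             # committed data
        self.pending = None           # staged, not yet committed

    def prepare(self, record):
        """Phase 1: stage the record and vote yes (True) or no (False)."""
        if not self.can_commit:
            return False
        self.pending = record
        return True

    def commit(self):
        """Phase 2: make the staged record permanent."""
        self.records.append(self.pending)
        self.pending = None

    def rollback(self):
        """Phase 2 (failure path): discard the staged record."""
        self.pending = None


def add_everywhere(participants, record):
    """Add record to every participant, or to none of them."""
    prepared = []
    for p in participants:
        if p.prepare(record):
            prepared.append(p)
        else:
            # One "no" vote aborts the whole business transaction.
            for q in prepared:
                q.rollback()
            return False
    # Everyone voted yes, so everyone commits.
    for p in participants:
        p.commit()
    return True


if __name__ == "__main__":
    systems = [
        Participant("payroll"),
        Participant("401k"),
        Participant("employee club", can_commit=False),
        Participant("PAC"),
    ]
    print(add_everywhere(systems, "new employee"), [p.records for p in systems])
    systems[2].can_commit = True
    print(add_everywhere(systems, "new employee"), [p.records for p in systems])
```

In the first call one system votes no, so no system ends up with the record. In the second call all four vote yes and all four get it.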

Is this technology always needed? Nah -- there are four or five ways of interfacing systems that are appropriate for different circumstances. The whole buzz-field of Enterprise Application Integration (EAI) was invented a few years ago by software vendors to give a name to various means of (usually) loose, risky application integration. But transactionality is the most rigorous way of connecting systems and when you need it, you need it.

Big Problem or Detail?

As always, the question is, how does this affect your enterprise? Is it a technical detail that only software and database engineers should worry about, or is it something senior technologists and management also need to think about? Well, it's the latter, unfortunately:

  • Management needs to assess the risks of data loss or corruption and balance them against the costs, if any, of building better-architected systems. These risks might show up as customer satisfaction problems, legal exposures, accounting errors, IT remediation costs, or even safety issues.

    I'll provide examples of some of these in future columns, but as a teaser I'll just mention a telecom client who estimated that data cleansing to repair corruption cost over $20M annually. To say nothing of the lead architect of an Internet front-end to his firm's legacy financial systems, who asserted to me that his front-end only induced account errors in 0.03% of the transactions it performed. This meant that after 12 months, 75,000 of their customer accounts had wrong balances.
     
  • Technologists at all levels certainly need to understand where transactional systems and connections are vital and where they are not. (I'll actually cut my architect friend some slack -- he had spent his entire career in a single, very complex problem domain that dealt much more with algorithms than data. But the new architecture he was building should have considered transactionality. This type of occurrence, common in IT, is a good argument that all IT staff need to work across at least a moderate variety of problem spaces before they are deemed experts. I'll talk about this later, too.)
     
  • And finally, the enterprise technology decision maker -- CTO, CIO, Chief Architect, whoever -- clearly needs to understand this deeply. Before applications can be built that do not corrupt data they store or share, the enterprise infrastructure needs to be in place to provide support. This may require an investment, or it may just require a memo directing the way systems should be built. In either case, it has to come top down, as do other elements of an Enterprise Architecture, of which this is a part.

I'll close today's sermon with this comment: I believe the failure to take data integrity seriously enough in many enterprises has a couple of roots. (a) It can be a subtle problem and, frankly, a lot of IT decision makers don't want to tackle it. It's a better career path to make some pretty slides about a new application that is under development.

And (b) it's a good example of what I call the "fail to scale" problem in IT. Neither IT systems nor IT skills scale up very easily. If a little departmental system corrupts its data 0.03% of the time, someone may have to re-key some data once every year or so -- no big deal. Thus technologists who started with small or casual applications, which can be fixed by hand when errors creep in, are led astray by their intuition. Their assumptions and methods fail when they are tasked with building serious, high-volume systems. After all, how big is 0.03%? It seems so low as to be, like, nearly zero, right?
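The arithmetic is worth doing once. The short Python sketch below works backward from the two figures quoted above (0.03% and 75,000) to the transaction volume they imply, assuming each bad transaction lands on a different account. The departmental volume of 3,000 transactions a year is my own illustrative guess, chosen only because it matches "once every year or so."

```python
ERRORS_PER_10K = 3  # 0.03% expressed as 3 errors per 10,000 transactions


def expected_errors(transactions, errors_per_10k=ERRORS_PER_10K):
    """Expected number of bad transactions at the given error rate."""
    return transactions * errors_per_10k / 10_000


def implied_transactions(errors, errors_per_10k=ERRORS_PER_10K):
    """Transaction volume needed to produce this many errors at the given rate."""
    return errors * 10_000 // errors_per_10k


if __name__ == "__main__":
    # Assumed volume for a small departmental system (illustrative only).
    print("departmental errors per year:", expected_errors(3_000))
    # Volume implied by 75,000 bad accounts at 0.03%.
    print("implied annual transactions:", implied_transactions(75_000))
```

Same error rate, very different consequences: under one error a year for the small system, and a quarter of a billion transactions behind the 75,000 bad accounts.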

I say that's wrong. 75,000 accounts in error is not OK. $20M is real money. That's what I think.

 

     

 


Copyright © 2005 Why IT Doesn't Work
Last modified: 11/28/2005
No Project Managers were harmed during construction of this site.