I stumbled upon this blog entry about a debugging story
on a blog amusingly named "a.out". Go read it now before you read on here. It is pretty amusing...
Also - I liked the name a.out - it's been a while since I've seen one of those files :)
Anyway - onto my favorite debugging story. It is very similar to the a.out story above - very similar. Just goes to show that sometimes debugging takes more than computer skills - somethings you have to be like Sherlock Holmes.
It was in the first half of the 1990's - either late 1993 or early 1994. Version 6 of Oracle was still fairly heavily used - Version 7.0 was just released in late 1992. A customer was using an off the shelf time and attendance application for a large factory. It was a brand new implementation, just ported from Informix. This installation was one of the first Oracle installs and the biggest installation for the time and attendance company to date.
But - they were having a problem. At various points in the day the application would freeze - just totally stop. Unfortunately those points in the day were known as morning shift, lunch, evening shift - the exact points in time people would be checking in and out of the application. In short - it seemed to freeze under load. This was something that according to the time and attendance company "never ever happened with the Informix version - so it must be an Oracle bug" (of course it must be....)
Anyway - I was asked to take a look along with another guy I worked with. We went to the factory and started looking at the problem. Now remember - this is way before ASH/AWR, sql_trace was in its infancy, statspack did not exist - utlestat/bstat were useless, OEM didn't exist (sqldba lmode=y was state of the art). There wasn't much to go on - what we had to do initially was wait for the issue to happen and see what we could see.
That day - during lunch, it happened - the application froze. People could not clock back into work - the queues of people trying to get back in were building up. We could see there was enqueue contention - but we couldn't figure out what was causing it. It appeared to be a read only application blocking everyone (not possible if you asked us - it was read only and reads don't block writes in Oracle). We knew it was a read only application because the time and attendance folks told us it was - just a report that ran every 15 minutes to produce a list of who was clocked in and who wasn't.
So, we set down to write some scripts to start capturing the limited v$ information we had at that time - very little existed but we set out to capture it. And then we waited. And waited. And then it happened again - but the information we captured didn't really help us.
Getting a little desperate now (the time and attendance folks were suggesting that maybe the factory use Informix instead of Oracle since Oracle was apparently 'broken') - we kept trying to go forward. Someone suggested that we go out on the factory floor and observe the people using the application - to which we said "that doesn't sound useful, no thanks". After a few more incidents and not being able to see the solution - it was suggested again that we observe the application users in action. This time - since nothing else was working - we said ok.
So, we go onto the floor right before lunch was ending. The setup was such that there were a bunch of PC's lined up - sort of like cash registers at a large grocery store, and people would line up at the PC's and swipe their ID card to clock in. Everything was going good - people would walk up to the PC - hit enter and swipe their card. It was going like clockwork - until - it wasn't. The incident occurred again - the system froze.
What we observed was that some of the lines where still moving - but as people moved into those lines, they would freeze shortly as well. Pretty soon - all of the lines were 'stuck' - well, all but one. There was one PC at the end no one was using. We walked over to it and saw the screen said "hit enter to continue". We hit enter to continue and.... The event was over.
That was a clue... So, we started asking questions - the most important ones being "what took place in the application right before the hit enter to continue was on the screen" and "what took place right after". The time and attendance people said that right before that message came up - they would update the record of the person clocking in or out - and then after hitting enter they would be set up for the next person by committing. Simple enough - but how could that be blocked by a simple report?
So, we looked at the code for the report. It was read only - in a manner of speaking. Turns out - they wanted a 'read consistent'
report and the way they achieved that in Informix was to use SELECT FOR UPDATE. Informix didn't do multi-versioning, it didn't have read consistency. By using SELECT FOR UPDATE they ensured that the results from Informix would be consistent with respect to the time the query completed.
When they "ported" their application to Oracle - they didn't change a thing - they assumed "sql is sql is sql, all databases are the same". So, this read only report was doing a big old SELECT FOR UPDATE - and the SELECT FOR UPDATE was a simple select from the user table - of the entire table. In short - it would eventually lock the entire table (when it worked).
So, this is what was happening. The end users were so used to swiping a card and walking away from their old system (just clock in and leave) that they continued this behavior with the new system. They would swipe and walk away - leaving the blinking "hit enter to continue". The next person in line would hit enter, swipe and leave. But now ask yourself - what would happen if there was no next person in line? Well, the last person through that line would leave their record locked in the users table - it would remain locked until someone hit enter.
That doesn't sound too bad yet - after all the only thing that was going to update that sole record would be that user clocking out later - they would only block themselves. That would have been true except for that report. Every 15 minutes that report would wake up and issue a SELECT FOR UPDATE against the users table. And of course that SELECT FOR UPDATE would - block. But it would only block after locking a certain amount of that table! Suppose the user with the locked record was physically in the "middle" of the table - the report would now have locked the entire top half of the table. That would prevent anyone whose record was in the top half of the table from clocking in or out - the locking issue would raise its head and the system would freeze.
So, the problem wasn't Oracle - and the problem was really tricky to figure out. If the person that left their record locked was near the "top" of the table - the problem didn't really come up. If the person that left their record locked was near the bottom of the table - the problem would show up big time. And it was caused by two things:
- A poorly placed commit - don't put the commit after a point in time you waited for the end user
- A misunderstanding on how to get read consistent results in Oracle
The problem never arose in the Informix applications because the installations using Informix had all been very small - and/or - the end users interacted with the application differently (they would swipe and hit enter - instead of hit enter and swipe!).
Moving the commit was the quick fix. Reviewing the code and removing FOR UPDATE from many of the queries (it would many times be there 'just in case') was the real fix.
But - it was a debugging story that required no computers to debug. Just like a defective floor plate from a.out...