Sunday, May 27, 2012

Doctors...

This recent blog post by Seth Godin reminded me a lot of my introduction to Effective Oracle by Design of a few years ago. What was true then is still so true today...

Here is an excerpt from my book that mirrors what he just wrote:


I will use yet another analogy to describe how this book will present information.  Pretend for a moment that the developer is instead a medical doctor and the application is the patient.  There are many types of MD’s:

  • The emergency room (ER) doctor. They do “triage” – separating the hopeless from the ones that can be helped.  Performing quick fixes to keep patients alive for as long as possible.  They strive for short term band-aids to fix up the patient.  They will take a patient with a heart attack induced by smoking, bad diet and no exercise and get them stabilized.

  • The operating room (OR) doctor.  They get the patient after the ER doctor has triaged them and patched them up.  They strive for long term fixes to keep the patient not only alive but as fully functioning as possible.  They perform the by-pass operation on that heart attack attempting to clear the arteries.

  • The physical therapist (PT).  They get the patient after the operating room doctor is finished and begin a long and painful (not to mention expensive) process of rehabilitation.

  • The preventative medicine doctor.  They strive to avoid the above three doctors at all costs.  They counsel the patient to quit smoking, eat a healthy diet, and exercise – developing a phased plan to get them in shape.  If they do their job right – with the exception of unfortunate accidents (like a car accident), the patient will never see the ER, OR or PT doctors.
Now, the world needs all types of doctors – accidents do happen after all.  But one of the most important types of doctors is that last one, the preventative medicine doctor.  The one that tries hard to avoid having their patient need the other three.

It is my belief (experience) that most people and books approach tuning using the mindset of the first three doctor types above.  They are in support of the hero developer; e.g. the ER or OR doctor.  Perhaps that is partially due to my observation that pre-emptive good design and implementation is mostly a thankless exercise.  These developers seem to get all of the fame as they snatch the patient from the grasp of death (save the system by doing something miraculous).  They get called in at the last moment; work horribly hard for an extended period of time trying to keep the patient alive (and get paid handsomely as well).  The physical therapists are the unlucky souls that get the system after the ER/OR doctor has patched it up.  They are the ones responsible for keeping this system going.

I feel I am well equipped to speak from that perspective.  I am in fact one of those “heroes”.  I am called in to “lay hands on” systems and make them better.  I could write that book, I’ve been told I should write that book – but I won’t.

What is missing is the comprehensive approach that includes the preventative medicine doctor training.  There are some out there – my favorites being Guy Harrison’s developer book, as well as Jonathan Lewis’s DBA book.  These books, including my own “Expert One on One Oracle” work to remove the need for the hero.  Remember – firefighters are heroes when they do their job, we all just hope we never need them personally!

Wednesday, May 23, 2012

When is a foreign key not a foreign key...

I learn or relearn something new every day about Oracle.  Just about every day really!

Last week I was in Belgrade Serbia delivering a seminar and an attendee reminded me of something I knew once but had totally forgotten about.  It had to do with foreign keys and the dreaded NULL value.

Many of you might think the following to be not possible, we'll start with the tables:


ops$tkyte%ORA11GR2> create table p
  2  ( x int,
  3    y int,
  4    z int,
  5    constraint p_pk primary key(x,y)
  6  )
  7  /
Table created.


ops$tkyte%ORA11GR2> create table c
  2  ( x int,
  3    y int,
  4    z int,
  5    constraint c_fk_p foreign key (x,y) references p(x,y)
  6  )
  7  /
Table created.

Looks like a normal parent child relationship - a row may exist in C if and only if a parent row exists in P.  If that is true - then how can this happen:

ops$tkyte%ORA11GR2> select count( x||y ) from p;

COUNT(X||Y)
-----------
          0

ops$tkyte%ORA11GR2> select count( x||y ) from c;

COUNT(X||Y)
-----------
          1

There are zero records in P - none.  There is at least one record in C and that record has a non-null foreign key.  What is happening?

It has to do with NULLs and foreign keys and the default "MATCH NONE" rule in place.  If your foreign key allows NULLs and your foreign key is a composite key - then you must be careful of the condition where by only SOME of the foreign key attributes are not null.  For example - to achieve the above magic, I inserted:

ops$tkyte%ORA11GR2> insert into c values ( 1, null, 0 );
1 row created.

The database cannot validate a foreign key when it is partially null.  In order to enforce the "MATCH FULL" option of a foreign key - you would want to add a constraint to your table:

ops$tkyte%ORA11GR2> alter table c add constraint check_nullness
  2  check ( ( x is not null and y is not null ) or
  3          ( x is null and y is null ) )
  4  /
Table altered.

That will ensure either:
  • All of the columns are NULL in the foreign key
  • None of the columns are NULL in the foreign key
As long as that constraint is in place - your foreign key will work as  you probably think it should work.


Tuesday, May 15, 2012

UKOUG 2012...

The call for papers for the UKOUG 2012 conference ends in less than three short weeks!  If you were planning on going to the conference (and even if not) - you should consider submitting a paper.

I've been a long time supporter of all of the user groups and their conferences and I can attest to the quality of the UKOUG event.  The conference is chock full of technical talks with hundreds of sessions to choose from.  There is something for everyone there.

If you've never presented before, don't let that deter you from submitting a paper.  No one knows the anxiety that public speaking can bring better than I - I've written about it before. You'll find the conference to be an entirely different experience on the other side of the podium.  In addition to the experience of presenting, the networking and exposure that comes with being a speaker won't hurt you at all.  Whether you are a DBA or developer - having good public speaking skills is a necessity today - and using the conference as a way to build those skills is a great way to start.

Additionally - what you have to say is important and relevant to the user community as a whole.  A good conference needs a lot of speakers, from many diverse disciplines, with diverse backgrounds - the more speakers the merrier.  Don't think you don't have anything to offer - everyone does.  And don't feel that your topic wouldn't be interesting to someone else - it will be.  There are a lot of people out there trying to do some of the same things you've done and they'd love to hear how you did it.

That is one of the things about user groups I really like - they bring together a lot of people doing similar things - but in a different way.  You'll learn something new - and they will too.  

The UKOUG is one of the larger and well run conferences out there - don't be afraid to talk.  Challenge yourself to get up there and just do it.  You won't be sorry (ok, maybe in the minutes leading up to it you will be - but you'll get over that :) ) 

Hope to see you there - and don't chicken out!

Thursday, May 10, 2012

Pokemon and When Others...

Question: What do Pokemon and When Others in PL/SQL have in common?

For many developers the answer is:

Asnwer: You gotta catch 'em all.

I saw this other blog post on "Dodgy Coder" and was greatly amused by the comparison of Pokemon and catching all exceptions and basically making them disappear.

I've written about this "worst" practice a lot.  You can see some of them at:



You really DON'T gotta catch em' all!  And if you do - you really gotta throw them again (re-raise them)

Wednesday, May 09, 2012

Another debugging story...

There is a saying "correlation is not causation" meaning - just because you observe A and then observe B, it does not mean that A causes B - even if every single time you observe A - you see B as a 'side effect'.

Here is a great story to back that up - a debugging session where the end users had determined that they could not send email more than 500 miles.  Read it and note how much work the end users had gone into 'proving' that email cannot be sent more than 500 miles...

I use a similar story when talking about bind variables and bind variable peeking.  In the story - I make the claim that "when ever it rains on a Monday morning - you have to restart the database in the afternoon to make it perform correctly".  That is - rain on Mondays implies poor performance, and the only fix is to reboot.

This conclusion was reached via empirical observation - every single time, *every single time*, it would rain on Monday mornings - the database would respond with really poor response times to certain critical queries in the afternoon.  Every single time, without failure.  It must be that rain affects database performance somehow.

But here is what really was happening.  The database in question was a database that was backed up using cold backups (DBA's were of the archaic type).  These cold backups took place Sunday nights - late at night.  That meant that every Monday morning all of the caches would be virtually empty.  No parsed SQL for the applications in particular would be present.  So - Monday morning was the morning of hard parses every single week.

Now, in the critical application - there was a query that went against some very skewed data.  99.9999% of the time - the queries would be very selective, they would return 1 or 2 rows.  The other times - they would be very non-selective returning much of the table - but these queries were rare.

Now, the main user of this application was a person that came in early most of the time - they would get into the office at 6am and start working.  They always used bind inputs that returned very few rows - hence the problem queries would hard parse and tend to use indexes.  When this happened - everyone was happy.

The other set of users - they came in later, usually around 9am or so.  They would run the non-selective query and were OK with the index performance.  It might not have been the *best* possible performance - but it worked OK for them.

So, what was the link to rain?  Well, the office where everyone worked was located in Washington DC - when it rains in DC it doesn't matter if you leave home at 6am - it'll been noon before you get to work.  If you leave at 7am - it'll be noon when you get to work.  If you leave at 8am - it'll still be noon before you get to work.  In short - it really didn't matter when you left for work - if you are driving into the city - it is going to take a long time.  So, when it rained, the person that did the selective queries would just hit the snooze button on the alarm and go back to sleep.  Not worth driving in yet.

However, the people that did the non-selective queries - they lived in an apartment across the street from the  office.  They were not affected by the rain.  They came in at 9am regardless.  Since they ran the non-selective queries and the cache was empty - they would hard parse the queries and result with full scan plans.  This group of people was in fact a little convinced that rain *helped* performance a bit - when it rained on Mondays - they saw slightly better performance in the morning.

When the person that slept in finally got to work - and ran the application - they would definitely notice the full scans and how slow the database was.  Since it was raining - and they had observed this hundreds of times before - they call the data center - tell them "It is Monday, it is raining, you know what to do", after the reboot, everything is running fast again.

But - did it have anything to do with rain - or was it something entirely different :)

Getting to the root cause (bind peeking issue) can lead you to an efficient, effective corrective action (perhaps query plan stability for a certain set of queries for example).

Going down the empirical path of "correlation must imply causation" will lead to never actually fixing the problem - and the invention of many myths...

Tuesday, May 08, 2012

Requesting your input!!

Usually I'm pushing information out - now I need to pull some in.

We're evaluating the usage of XMLDB in the database community. We would like to know what components of XMLDB are most frequently used. Please let us know which of the following you use:

  • XMLType storage
    • XML store as CLOB
    • XML store as Binary XML
    • XML store as Object Relational
  • Indexing
    • Unstructured
    • Structured
    • Functional indexes
    • Text Indexes
  • XML <--> Relational interaction
    • XML Generation from relational
    • XMLType views on relational data
    • Relational views on XML data (using XMLTable)
  • XDB repository
    • Protocols
    • Webservices
    • File/folder access
    • advanced features like versioning, events, xlink, anything else
Any background on how you use the capability will be of interest. Just leave a comment and we'll definitely be reading them all.


Thanks for taking the time to respond!

Monday, May 07, 2012

My favorite debugging story...

I stumbled upon this blog entry about a debugging story on a blog amusingly named "a.out".  Go read it now before you read on here.  It is pretty amusing...

Also - I liked the name a.out - it's been a while since I've seen one of those files :)

Anyway - onto my favorite debugging story.  It is very similar to the a.out story above - very similar.  Just goes to show that sometimes debugging takes more than computer skills - somethings you have to be like Sherlock Holmes.

It was in the first half of the 1990's - either late 1993 or early 1994.  Version 6 of Oracle was still fairly heavily used - Version 7.0 was just released in late 1992.  A customer was using an off the shelf time and attendance application for a large factory.  It was a brand new implementation, just ported from Informix. This installation was one of the first Oracle installs and the biggest installation for the time and attendance company to date.

But - they were having a problem.  At various points in the day the application would freeze - just totally stop.  Unfortunately those points in the day were known as morning shift, lunch, evening shift - the exact points in time people would be checking in and out of the application.  In short - it seemed to freeze under load.  This was something that according to the time and attendance company "never ever happened with the Informix version - so it must be an Oracle bug" (of course it must be....)

Anyway - I was asked to take a look along with another guy I worked with.  We went to the factory and started looking at the problem.  Now remember - this is way before ASH/AWR, sql_trace was in its infancy, statspack did not exist - utlestat/bstat were useless, OEM didn't exist (sqldba lmode=y was state of the art).  There wasn't much to go on - what we had to do initially was wait for the issue to happen and see what we could see.

That day - during lunch, it happened - the application froze.  People could not clock back into work - the queues of people trying to get back in were building up.  We could see there was enqueue contention - but we couldn't figure out what was causing it.  It appeared to be a read only application blocking everyone (not possible if you asked us - it was read only and reads don't block writes in Oracle).  We knew it was a read only application because the time and attendance folks told us it was - just a report that ran every 15 minutes to produce a list of who was clocked in and who wasn't.

So, we set down to write some scripts to start capturing the limited v$ information we had at that time - very little existed but we set out to capture it.  And then we waited.  And waited.  And then it happened again - but the information we captured didn't really help us.

Getting a little desperate now (the time and attendance folks were suggesting that maybe the factory use Informix instead of Oracle since Oracle was apparently 'broken') - we kept trying to go forward.  Someone suggested that we go out on the factory floor and observe the people using the application - to which we said "that doesn't sound useful, no thanks".  After a few more incidents and not being able to see the solution - it was suggested again that we observe the application users in action.  This time - since nothing else was working - we said ok.

So, we go onto the floor right before lunch was ending.  The setup was such that there were a bunch of PC's lined up - sort of like cash registers at a large grocery store, and people would line up at the PC's and swipe their ID card to clock in.  Everything was going good - people would walk up to the PC - hit enter and swipe their card.  It was going like clockwork - until - it wasn't.  The incident occurred again - the system froze.

What we observed was that some of the lines where still moving - but as people moved into those lines, they would freeze shortly as well.  Pretty soon - all of the lines were 'stuck' - well, all but one.  There was one PC at the end no one was using.  We walked over to it and saw the screen said "hit enter to continue".  We hit enter to continue and.... The event was over.

That was a clue...  So, we started asking questions - the most important ones being "what took place in the application right before the hit enter to continue was on the screen" and "what took place right after".  The time and attendance people said that right before that message came up - they would update the record of the person clocking in or out - and then after hitting enter they would be set up for the next person by committing.  Simple enough - but how could that be blocked by a simple report?

So, we looked at the code for the report.  It was read only - in a manner of speaking.  Turns out - they wanted a 'read consistent' report and the way they achieved that in Informix was to use SELECT FOR UPDATE.  Informix didn't do multi-versioning, it didn't have read consistency.  By using SELECT FOR UPDATE they ensured that the results from Informix would be consistent with respect to the time the query completed.

When they "ported" their application to Oracle - they didn't change a thing - they assumed "sql is sql is sql, all databases are the same".  So, this read only report was doing a big old SELECT FOR UPDATE - and the SELECT FOR UPDATE was a simple select from the user table - of the entire table.  In short - it would eventually lock the entire table (when it worked).

So, this is what was happening.  The end users were so used to swiping a card and walking away from their old  system (just clock in and leave) that they continued this behavior with the new system.  They would swipe and walk away - leaving the blinking "hit enter to continue".  The next person in line would hit enter, swipe and leave.  But now ask yourself - what would happen if there was no next person in line?  Well, the last person through that line would leave their record locked in the users table - it would remain locked until someone hit enter.

That doesn't sound too bad yet - after all the only thing that was going to update that sole record would be that user clocking out later - they would only block themselves.  That would have been true except for that report.  Every 15 minutes that report would wake up and issue a SELECT FOR UPDATE against the users table. And of course that SELECT FOR UPDATE would - block.  But it would only block after locking a certain amount of that table!  Suppose the user with the locked record was physically in the "middle" of the table - the report would now have locked the entire top half of the table.  That would prevent anyone whose record was in the top half of the table from clocking in or out - the locking issue would raise its head and the system would freeze.

So, the problem wasn't Oracle - and the problem was really tricky to figure out.  If the person that left their record locked was near the "top" of the table - the problem didn't really come up.  If the person that left their record locked was near the bottom of the table - the problem would show up big time.  And it was caused by two things:


  • A poorly placed commit - don't put the commit after a point in time you waited for the end user 
  • A misunderstanding on how to get read consistent results in Oracle
The problem never arose in the Informix applications because the installations using Informix had all been very small - and/or - the end users interacted with the application differently (they would swipe and hit enter - instead of hit enter and swipe!).

Moving the commit was the quick fix.  Reviewing the code and removing FOR UPDATE from many of the queries (it would many times be there 'just in case') was the real fix.

But - it was a debugging story that required no computers to debug.  Just like a defective floor plate from a.out...

Saturday, May 05, 2012

Not that this is the last time...

Not that this is the last time I'll write a blog entry ever again - but...

I've noticed that many times I've wanted to blog something recently - it would have been really really short and I just said "that would be silly to blog a sentence".  So.... I've decided to try twitter


You can follow me there if you are interested.