Monday, 23 January 2017

purging a cursor when Stuck in execution plans



I was at a UKOUG RAC/HA special interest group meeting last month. My favorite presentation of the day {out of what was a very good selection} was by Harald Van Breederode, an Oracle University DBA trainer from the Netherlands.  Harald’s presentation was on Checkpoints, a feature of Oracle that most DBAs tend to know a little about but never look into extensively. If you are a member of the UKOUG you can get the slides {If you are not a member, and you are in the UK {{or even Europe}}, nag your boss, get a membership – the conferences and meetings are a great source of info}.
Anyway, that is not the topic of this Blog. I finally downloaded a copy of the slides today and I checked out Harald’s blog. I immediately learnt something, which isthe topic of this blog.
In Oracle 10.2.0.4 (and upwards I assume) you can now flush a specific cursor out of the library cache using dbms_shared_pool.purge. You need to create the package first, it is not installed by default:-
@?/rdbms/admin/dbmspool
Package created.
Grant succeeded.
View created.
Package body created.
You also need to set an event to allow the purge to work. See metalink note 751876.1:
“The fix has gone into 10.2.0.4. However, the fix is event protected. You need to set the event 5614566, to make use of purge. Unless the event is set, purge will have no effect.”
alter session set events ‘5614566 trace name context forever’;
Now that you have the feature installed you can find the address and hash_value of a cursor you want to purge with eg:
select address, hash_value from v$sqlarea
where sql_text = ‘select count(c2) from skew where c1 = :bind’;
ADDRESS HASH_VALUE
——– ———-
27308318 2934790721
And purge it.
exec sys.dbms_shared_pool.purge(‘&address, &hash_value’,’c’)
PL/SQL procedure successfully completed.
For more comprehensive details check out
Why would you want to flush a specific cursor out of the library cache? Because it is performing poorly and you want the CBO to have another go at coming up with a plan. Before this nice sys.dbms_shared_pool.purge function you had three other options to flush the cursor out.
  • You can flush the whole shared pool
    alter system flush shared_pool;
    You really want to avoid doing this on a live system, as it flushes all SQL, PL/SQL (functions, packages, procedures} and data dictionary information, which will now have to be read back in when next required. It’s like unloading a whole delivery truck just to throw out one letter and then reloading the truck.
  • Alter one of the objects referenced by the SQL statement, with some DDL. Some suggest doing a grant or revoke on a table, I used to try adding a table comment {to what it already was}. I’m still not that comfortable with that option on a live system as you are doing something to a live object.
  • Another option is to gather new stats on one of the objects referenced by the cursor, which is fine so long as there is a small object you can quickly execute a dbms_stats.gather_xxxx_statistics on.
So I’ll look more into this package as I think it is cleaner way to do it. Mind you, there is that setting of an event on a live system…
How does a statement get a poor plan that is going to be fixed simply by asking the CBO to parse it again?
In Harald’s posting he also covers a main reason as to why you would want to flush a sql cursor out of the shared pool. His worked example shows how a SQL statement with a bind variable is open to getting an execution plan the first time it is parsed which is suitable for that first value of the bind – but NOT suitable for most other executions. eg the first value passed to the bind is maybe a very common one matching a large percentage of the table and so an index is not used, but most executions of the statement are with a bind value that matches very few records, and thus the index lookup is the best plan. But once the sql statement is parsed, the plan will be used until the statement disappears out of the  shared pool.
Another cause of a poor plan is when the statistics for an object referenced by the SQL statement changes. The CBO sometimes just chooses a plan that is not good. You may be thinking that, in this case, the optimizer will simply come to the same plan if asked again. Maybe not.
An issue I kept encountering at one site was very interesting. One of a small handful of simple SQL statements would go rouge overnight. Usually about 2 or 3am in the morning. Instead of a plan using a a highly specific index and a couple of table joins, a very, very poor plan was used instead. It only ever occurred early in the morning and only when the automated statistics gathering job had gathered stats on one of the tables involved. It took a while to spot this as the SQL would usually go rogue a while after the stats on the relevant tables had been gathered. This is because SQL statements are not invalidated when the underlying segments have their stats re-gathered by the automated job, they are invalidated “a little later”. It seems on Oracle 10.2 to be within an hour of the gather but not always. {To be totally accurate, this delayed invalidation is due to the DBMS_STATS parameter “no invalidate” defaulting to the value DBMS_STATS.AUTO_INVALIDATE but it can be overridden if you wish}
What seemed to be happening, though I never 100% proved it {so I am guessing, do not take this as a fact} is that one table would have new stats and the range of values for a column would include recent data {let’s say values 1 to 10,000,000}. Another table had the old information and so the range of known values was different {1 to 9,200,000}. This discrepancy would cause the wrong plan to be chosen. {I have a wilder theory which is that the indexes for a table had a different range of values for a column as the table stats had, but there are problems with my theory}.
By the time I logged in to the system in the morning to fix the overnight performance problem, stats on all relevant tables had finished being gathered and prompting the code to re-parse was the solution.
That leads me to the last point {sorry, a very long and wordy post again}.
Sometimes you can’t purge the cursor. The execution plan is stuck. Why?
Let’s say you have a cursor that is executed say 100 times a minute. Normally each execution runs in less than 50 milliseconds. All is fine. It has now gone rogue and it is taking 5 seconds to run, as the execution plan is poor. Each execution completes but at any time there is always at least one session running the code, usually several.
A SQL cursor will not be flushed from the shared pool if it is in use. Thus this rogue cursor gets stuck in the SGA. You can’t alter any of the objects as you can never get an exclusive lock on them. You flush the shared pool in desperation {even on your live system} and the cursor stays there, as even alter system flush shared_pool will not flush out an in-flight cursor. You could try locking the table, but like the DML, you are likely never to get that lock.
In this situation you have to identify the application running the code, stop it and wait for all executions to finish. Not maybe an issue in my example of 5 seconds to complete, but I’ve had the same problem with code run every few minutes now taking over an hour, so we could not wait an hour for it to sort out. In that situation we also had to kill sessions.
Of course, stopping and starting the database will cure the problem but the business may not be too happy about the whole database being shut down. {This prompts a thought – I’ve never noticed this but I wonder if you get the same problem on RAC but only impacting one node?}
This new package may help with stuck execution plans in that you can just keep trying over and over again to flush the one cursor until you hit a moment when no current execution is running.


================


Troubleshooting performance problems is an art by itself, especially when the problem is transient and only shows up once in a while. A classic example of such a problem is an unlucky peeked bind variable causing a sub-optimal execution plan based on the first execution of a SQL statement containing a bind variable on a skewed column. The problem is twofold: first to find such a statement and second to make the problem to go away. The latter is what this posting is all about. In order to demonstrate the problem and the fix we need to have something to play around with.
SQL> create table skew(c1 number(6), c2 char(20));

Table created.

SQL> insert into skew select 1,1 from dual connect by level <= 10000;

10000 rows created.

SQL> update skew set c1 = 2 where rownum <= 10;

10 rows updated.

SQL> create index skew_idx on skew(c1);

Index created.

SQL> exec dbms_stats.gather_table_stats(null,'skew', -
> method_opt => 'for all columns size 2')

PL/SQL procedure successfully completed.

SQL> select c1, count(*) from skew group by c1;

        C1   COUNT(*)
---------- ----------
         1       9990
         2         10
We now have an indexed table with skewed data in it with current object statistics in place including a histogram on the skewed column. Lets execute a query using a bind variable on the skewed column and see what the query optimizer expects and what execution plan it considers optimal.
SQL> var x number
SQL> exec :x := 1;

PL/SQL procedure successfully completed.

SQL> select count(c2) from skew where c1 = :x;

 COUNT(C2)
----------
      9990

SQL> select *
  2  from table(dbms_xplan.display_cursor(null,null,'basic rows'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select count(c2) from skew where c1 = :x

Plan hash value: 568322376

-------------------------------------------
| Id  | Operation          | Name | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT   |      |       |
|   1 |  SORT AGGREGATE    |      |     1 |
|   2 |   TABLE ACCESS FULL| SKEW |  9990 |
-------------------------------------------
The example above shows that the query optimizer predicted the cardinality correctly and choosed the optimal execution plan based upon this information. It could do so because there is a histogram available to describe the data skew in the table. Now see what happens if we bind the value 2 and execute the query again:
SQL> exec :x := 2;

PL/SQL procedure successfully completed.

SQL> select count(c2) from skew where c1 = :x;

 COUNT(C2)
----------
        10

SQL> select *
  2  from table(dbms_xplan.display_cursor(null,null,'basic rows'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select count(c2) from skew where c1 = :x

Plan hash value: 568322376

-------------------------------------------
| Id  | Operation          | Name | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT   |      |       |
|   1 |  SORT AGGREGATE    |      |     1 |
|   2 |   TABLE ACCESS FULL| SKEW |  9990 |
-------------------------------------------
Because the statement is not hard parsed again the same execution plan is used based on the prediction of 9990 rows. Because the query only returns 10 rows this execution plan is no longer optimal for the given value for the bind variable. If this query gets executed many times with this value of the bind variable we do have a performance problem for as long as this execution plan remains in the library cache. If this is indeed the case it might be beneficial to flush this cursor out of the shared pool. Starting with 10.2.0.4.0 this can be done using the PURGE procedure in the DBMS_SHARED_POOL package as demonstrated below:
SQL> @?/rdbms/admin/dbmspool

Package created.

Grant succeeded.

View created.

Package body created.

SQL> select address, hash_value from v$sqlarea
  2  where sql_text = 'select count(c2) from skew where c1 = :x';

ADDRESS  HASH_VALUE
-------- ----------
27308318 2934790721

SQL> exec sys.dbms_shared_pool.purge('&addressash_value','c')

PL/SQL procedure successfully completed.
Because the DBMS_SHARED_POOL package is not installed at database creation time, it has to be installed manually as shown above. The PURGE procedure needs the ADDRESS and HASH_VALUE of the cursor being flushed and the flag ‘C’ to indicate that we are flushing a cursor. This knowledge comes out of the dbmspool.sql script. The ADDRESS and HASH_VALUE can be retrieved from V$SQLAREA as shown in the example. A successful execution of the PURGE procedure indicates that the parent cursor is gone among with its children. A next execution of the query will force a hard parse and the creation of a new execution plan as we can see below:
SQL> select count(c2) from skew where c1 = :x;

 COUNT(C2)
----------
        10

SQL> select *
  2  from table(dbms_xplan.display_cursor(null,null,'basic rows'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select count(c2) from skew where c1 = :x

Plan hash value: 3493361220

---------------------------------------------------------
| Id  | Operation                    | Name     | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT             |          |       |
|   1 |  SORT AGGREGATE              |          |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID| SKEW     |    10 |
|   3 |    INDEX RANGE SCAN          | SKEW_IDX |    10 |
---------------------------------------------------------
This time the query optimizer predicted the correct number of rows for the given value of the bind variable and selected the optimal execution plan for the given situation. The difficulty is of course to detect these situations before we can correct them. An indication could be a difference in the predicted number of rows and the actual number of rows in an execution plan, but therefore we need to set the STATISTICS_LEVEL parameter to ALL or add the GATHER_PLAN_STATISTICS hint to all possible affected statements which might be difficult to do. Once a possible affected statement has been found we can see the used bind value in the execution plan by using the PEEKED_BINDS options in the format specifier in the call to DBMS_XPLAN.
SQL> exec :x := 1;

PL/SQL procedure successfully completed.

SQL> select count(c2) from skew where c1 = :x;

 COUNT(C2)
----------
      9990

SQL> select *
  2  from table(dbms_xplan.display_cursor(null,null,'basic rows peeked_binds'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select count(c2) from skew where c1 = :x

Plan hash value: 3493361220

---------------------------------------------------------
| Id  | Operation                    | Name     | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT             |          |       |
|   1 |  SORT AGGREGATE              |          |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID| SKEW     |    10 |
|   3 |    INDEX RANGE SCAN          | SKEW_IDX |    10 |
---------------------------------------------------------

Peeked Binds (identified by position):
--------------------------------------

   1 - :X (NUMBER): 2
In this final example we bounded the value 1 again and executed the query which retrieved 9990 rows whilst the execution plan shows a prediction of only 10 rows. By using PEEKED_BINDS we receive extra information from DBMS_XPLAN telling us that this particular execution plan is based on the value 2 of the first bind variable in the statement which is named ‘:x’ and is a number data type.
Conclusion: By using the PURGE procedure in the DBMS_SHARED_POOL package we can flush a cursor out of the Library Cache when the execution plan causes performance problems due to an unlucky bind variable value. However this is only a temporary solution. The definitive solution is Adaptive Cursor Sharing which is introduced in Oracle11g.

========================================

Purging Cursors From the Library Cache Using Full_hash_value

Introduction: Purging cursors from the library cache is a useful technique to keep handy for troubleshooting. Oracle has introduced a procedure call to do that in version 11 with backports to 10g. Besides  this has been covered by several blogs already (including , Oracle support (note 457309.1 for example) and the actual package file in $ORACLE_HOME/rdbms/admin/dbmspool.sql

Most of the examples and discussions in the links above utilize with the following syntax:

SQL> exec sys.dbms_shared_pool.purge(‘&address, &hash_value’,'c’)

What's new in 11.2:

A new (overloaded) procedure in dbms_shared_pool.purge is available in 11.2 and allows to purge statements identified by thier full_hash_value of the statement. One of the advantages compared to the previous method is that the full_hash_value is a property of a given sql statement and does not depend on the memory address of the (parent) cursor. Note this has been tested in 11.2.0.3 64 bit for Linux.


Example:

myapp_user_SQL> select /*MYTEST*/ sysdate from dual; -- put test SQL statement that we want to flush in the following
admin_user_SQL> select a.FULL_HASH_VALUE from V$DB_OBJECT_CACHE a where name='select /*MYTEST*/ sysdate from dual';  
-- find full_hash_value to be used in the next step
-- in this example the full_hash_value is 98d0f8fcbddf4095175e36592011cc2c
admin_user_SQL> exec sys.dbms_shared_pool.purge(HASH=>'98d0f8fcbddf4095175e36592011cc2c',namespace=>0,heaps=>1) 

Additional info:

full_hash_value is a 128-bit MD5 hash of the sql statement
A few methods to find full_hash_value given different input are listed here below:
  • find full_hash_value from cache, query v$db_object_cache 
    • select a.FULL_HASH_VALUE from V$DB_OBJECT_CACHE a where name='select /*MYTEST*/ sysdate from dual';
  • find full_hash_value from hash_value
    • select full_hash_value from  v$db_object_cache where hash_value=538037292
  • find full_hash_value from sql_id
    • find hash_value from sql_id using DBMS_UTILITY.SQLID_TO_SQLHASH
    • select  full_hash_value from  v$db_object_cache where hash_value= DBMS_UTILITY.SQLID_TO_SQLHASH('1frjqb4h13m1c');
  • compute full_hash_value from SQL tex
namespace=>0 means 'SQL  AREA' , which is the relevant namespace for cursors.

Conclusions:

We have discussed a method to purge cursors for the library cache that uses the full_hash_value of the cursor instead of the address and hash_value which is the more common approach (and the only one documented in previous versions). This method discussed here is available in 11.2.

Thanks
Srini

No comments:

Post a Comment


No one has ever become poor by giving