Friday, August 9, 2013

Why is my SYSAUX tablespace so big? Is it worth adding a datafile to it?

Link explaining more:

http://mwidlake.wordpress.com/?s=table+partitions&searchbutton=go%21


The client I was working with at the time was about to go live with a new application and database. For various reasons I was a little anxious about how the Java application (the user interface) would actually call the stored PL/SQL code I had helped develop. Initial workloads would be low, so I asked for STATISTICS_LEVEL to be set to ALL so that extra execution information (bind variables amongst other things) would be gathered. This was on version 10.2.0.4 Enterprise Edition, by the way, on a 4-node RAC.
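For reference, a sketch of the sort of command involved in making that change (the exact command and change-control steps used at the site are not shown in this post; the scope and sid clauses are just one reasonable choice for a RAC):

-- check the current setting
show parameter statistics_level

-- set the level to ALL on all instances (illustrative)
alter system set statistics_level = ALL scope=both sid='*';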
We went live, issues were encountered and resolved, the usual way these things work. Then, a few weeks in and while everything was still very “exciting” from a problem-resolution perspective, I got an odd email from the DBA team: did we want them to add another datafile to the SYSAUX tablespace? Huh? I checked. I’d been watching the size of our application’s tablespaces but not the others {well, I was not supposed to be a DBA and I was watching an awful lot of other things}. Our SYSAUX tablespace was around 160GB in size, having pretty much filled its 5th datafile. Why? I checked to see what was taking up the space in the tablespace:
select * from
 (select owner,segment_name||'~'||partition_name segment_name,bytes/(1024*1024) size_m
 from dba_segments
 where tablespace_name = 'SYSAUX'
 ORDER BY BLOCKS desc)
 where rownum < 40
 
OWNER              SEGMENT_NAME                                             SIZE_M
------------------ -------------------------------------------------- ------------
SYS                WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3911            27,648
SYS                WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3911         26,491
SYS                WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3537            23,798
SYS                WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3537         22,122
SYS                WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4296            17,378
SYS                WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4296         16,818
SYS                WRH$_ACTIVE_SESSION_HISTORY~WRH$_ACTIVE_14459270_3911        136
SYS                WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3911                       96
SYS                WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3537                       72
SYS                WRH$_SQLSTAT~WRH$_SQLSTA_14459270_4296                       47
SYS                WRH$_LATCH_MISSES_SUMMARY_PK~WRH$_LATCH__14459270_3537        45
SYS                I_WRI$_OPTSTAT_H_OBJ#_ICOL#_ST~                              41
SYS                WRH$_SYSMETRIC_SUMMARY~                                      40
SYS                WRH$_LATCH_MISSES_SUMMARY_PK~WRH$_LATCH__14459270_           37
As you can see, almost all the space is being taken up by WRH$_LATCH_CHILDREN and WRH$_LATCH_CHILDREN_PK partitions; they are massive compared to the other objects. A quick Google did not come up with much beyond many hits simply listing what lives in SYSAUX, plus the odd person who had also seen SYSAUX filled up with these objects and suggested ways to clear down the space – nothing about the cause.
I had a chat with the DBAs and we quickly decided that this was going to be something to do with AWR, given the names of the objects – “WRH$_” objects are the tables underlying AWR. The DBA suggested my settings of a 15-minute snapshot interval and 35-day retention were too aggressive. I knew this was not the case; I’ve had more aggressive snapshot intervals and longer retention periods on far busier systems than this. I did not have access to Metalink at that point, so I asked the DBAs to raise a ticket, which they duly did.
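For reference, the snapshot interval and retention can be checked and changed via DBMS_WORKLOAD_REPOSITORY; a sketch of the settings described above (the arguments are in minutes, run as a suitably privileged user):

-- current AWR snapshot settings
select snap_interval, retention from dba_hist_wr_control;

-- 15-minute snapshots, 35-day retention
begin
  dbms_workload_repository.modify_snapshot_settings(
    interval  => 15,
    retention => 35 * 24 * 60);
end;
/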
Oracle support cogitated for a couple of days and came back with the advice to reduce the retention period. Hmmmm. Via the DBA I asked Oracle support to explain why those objects were so large when I had not seen this issue on several other systems. Was it a bug? I had by now compared notes with a friend from a previous site, which had 5-minute snapshot intervals and a two-month retention period, and their SYSAUX tablespace was about 10GB all in. I did not want to go changing things if we did not know they would fix the issue; we really wanted to stop the growth of SYSAUX as soon as possible, not just try a hunch.
As you probably realise from the title of this blog, the issue was not the snapshot interval or retention period but STATISTICS_LEVEL=ALL. The one and only hit you get in Metalink if you search on WRH$_LATCH_CHILDREN is note 874518.1. From V10.1.0.2 to V11.1.0.7, setting this parameter to ALL is known to create a lot of data about latch children and not clear it down when the AWR data is purged (Bug 8289729). The advice was to change STATISTICS_LEVEL and make the snapshot interval larger. I’d suggest you just need to alter the STATISTICS_LEVEL, unless you really, really need that extra information gathered. It seemed to take Oracle Support an extra day or two to find that note for us. {I’ve since checked Metalink directly to confirm all this.}
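The change itself is small; a sketch of the sort of command involved (TYPICAL is the default level, and the exact scope clause used at the site is not shown in this post):

-- drop back from ALL to the default across the RAC
alter system set statistics_level = TYPICAL scope=both sid='*';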
So, with a known issue identified, we felt confident that altering the initialisation parameter would solve the problem. It took a while for us to change the STATISTICS_LEVEL on the production system – Change Control at that site is rather robust. The delay allowed us to see some other impacts of the issue.
The MMON process, which looks after AWR data, was becoming a top session in our OEM performance screens. In particular, a statement with SQL ID 2prbzh4qfms7u that inserted into the WRH$_LATCH_CHILDREN table was taking several seconds to run each time and was running quite often {I include the SQL ID as it may well be the same on many Oracle V10 systems, as it is internal code}:
The internal SQL inserting into wrh$_latch_children was becoming demanding
This was doing a lot of IO, by far the majority of the IO on our system at the time – it was a new system and we had been able to tune out a lot of the physical IO.
The physical IO requirements and 15-20 second elapsed time made this our most demanding statement on the system
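If you want to see how such a statement is behaving on your own system, something along these lines will show it (a sketch; the columns chosen are just illustrative):

select sql_id, executions,
       round(elapsed_time/1000000)      elapsed_s,
       round(user_io_wait_time/1000000) io_wait_s,
       disk_reads
from   v$sql
where  sql_id = '2prbzh4qfms7u';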
We also now started to have issues with MMON running out of undo space when it ran at the same time as our daily load. This was particularly unfortunate as it coincided with a period of “intense management interest” in the daily load…
What was happening to the size of the SYSAUX tablespace?
Enter the tablespace (or leave null)> sys
 
TS_NAME              ORD      SUM_BLKS        SUM_K  MAX_CHNK_K NUM_CHNK
-------------------- ----- ----------- ------------ ----------- --------
SYSAUX               alloc  58,187,904  465,503,232  33,553,408       14
                     free       10,728       85,824      21,504       20
SYSTEM               alloc     128,000    1,024,000   1,024,000        1
                     free       68,360      546,880     546,752        3
 
4 rows selected.
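The script producing that summary is not reproduced here; a query along these lines (an assumption on my part) gives much the same information, summing allocated space from DBA_DATA_FILES and free space from DBA_FREE_SPACE:

select tablespace_name ts_name, 'alloc' ord,
       sum(blocks) sum_blks, sum(bytes)/1024 sum_k,
       max(bytes)/1024 max_chnk_k, count(*) num_chnk
from   dba_data_files
where  tablespace_name in ('SYSTEM','SYSAUX')
group  by tablespace_name
union all
select tablespace_name, 'free',
       sum(blocks), sum(bytes)/1024,
       max(bytes)/1024, count(*)
from   dba_free_space
where  tablespace_name in ('SYSTEM','SYSAUX')
group  by tablespace_name
order  by 1, 2;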
 
select * from
(select owner,segment_name||'~'||partition_name segment_name,bytes/(1024*1024) size_m
 from dba_segments
 where tablespace_name = 'SYSAUX'
 ORDER BY BLOCKS desc)
where rownum < 40
 
OWNER    SEGMENT_NAME                                                     SIZE_M
-------- ------------------------------------------------------------ ----------
SYS      WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6201                     30262
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5817                     29948
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5435                     28597
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4675                     28198
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3911                     27648
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5817                  27144
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6585                     26965
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6201                  26832
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4675                  26741
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3911                  26491
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4296                     26307
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5435                  26248
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4296                  25430
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6585                  25064
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5058                     24611
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5058                  23161
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6966                      9209
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6966                   8462
         WRH$_SYSMETRIC_SUMMARY~                                             152
         WRH$_ACTIVE_SESSION_HISTORY~WRH$_ACTIVE_14459270_3911               136
         WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3911                               96
 
@sysaux_conts
 
OWNER                          OBJ_PART_NAME                                SIZE_M
------------------------------ ---------------------------------------- ----------
SYS                            WRH$_LATCH_CHILDREN-WRH                  231745.063
SYS                            WRH$_LATCH_CHILDREN_PK-WRH               215573.063
SYS                            WRH$_SQLSTAT-WRH                           711.0625
SYS                            WRH$_LATCH_MISSES_SUMMARY_PK-WRH           439.0625
SYS                            WRH$_ACTIVE_SESSION_HISTORY-WRH            437.0625
SYS                            WRH$_LATCH_PARENT-WRH                      292.0625
SYS                            WRH$_LATCH-WRH                             276.0625
SYS                            WRH$_LATCH_MISSES_SUMMARY-WRH              273.0625
SYS                            WRH$_SEG_STAT-WRH                          268.0625
SYS                            WRH$_LATCH_PARENT_PK-WRH                   239.0625
SYS                            WRH$_SYSSTAT_PK-WRH                        237.0625
Yes, that is close to half a terabyte of SYSAUX and it is all used. More partitions have appeared, and the totals for the largest segments show that WRH$_LATCH_CHILDREN and WRH$_LATCH_CHILDREN_PK make up the vast majority of the space used.
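The sysaux_conts script run above is not reproduced here either; a guess at the sort of thing it does (an assumption on my part) is to group the partitioned WRH$ segments under a common name and sum their sizes:

select owner,
       segment_name||'-'||substr(partition_name,1,3) obj_part_name,
       sum(bytes)/(1024*1024) size_m
from   dba_segments
where  tablespace_name = 'SYSAUX'
group  by owner, segment_name||'-'||substr(partition_name,1,3)
order  by 3 desc;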
Shortly after, we finally got permission to change the live system. The impact was immediate: MMON dropped from being the most demanding session, that SQL statement dropped down the rankings and the issues with running out of undo ceased.
I was anxious to see whether the old data would get purged, as the Metalink note had suggested it would not be. Thankfully, that was not the case. The space was slowly released as normal purging of data outside the retention period took place, and after just over a month the SYSAUX tablespace contained a lot less information and was mostly free space:
OWNER      OBJ_PART_NAME                                SIZE_M
---------- ---------------------------------------- ----------
SYS        WRH$_LATCH_MISSES_SUMMARY_PK-WRH           512.0625
SYS        WRH$_LATCH_MISSES_SUMMARY-WRH              350.0625
SYS        WRH$_LATCH-WRH                             304.0625
SYS        WRH$_SQLSTAT-WRH                           280.0625
SYS        WRH$_LATCH_PK-WRH                          259.0625
SYS        WRH$_SYSSTAT_PK-WRH                        247.0625
SYS        WRH$_SERVICE_STAT_PK-WRH                   228.0625
SYS        WRH$_PARAMETER_PK-WRH                      201.0625
SYS        WRH$_PARAMETER-WRH                         169.0625
SYS        WRH$_SYSSTAT-WRH                           169.0625
SYS        WRH$_SEG_STAT-WRH                          161.0625
SYS        WRH$_SYSTEM_EVENT_PK-WRH                   156.0625
SYS        WRH$_SYSMETRIC_SUMMARY-                         152
SYS        WRH$_SYSTEM_EVENT-WRH                      133.0625
SYS        WRH$_SERVICE_STAT-WRH                      123.0625
SYS        WRH$_ACTIVE_SESSION_HISTORY-WRH            115.0625
 
TS_NAME              ORD      SUM_BLKS        SUM_K  MAX_CHNK_K NUM_CHNK
-------------------- ----- ----------- ------------ ----------- --------
SYSAUX               alloc  58,251,904  466,015,232  33,553,408       15
                     free   57,479,400  459,835,200   4,063,232    1,208
SYSTEM               alloc     128,000    1,024,000   1,024,000        1
                     free       68,048      544,384     544,320        2
Now, how do we get that space back? I left that with the DBA team to resolve.
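For what it is worth, the options they would have been weighing up look something like this (a sketch only; the datafile name and snapshot IDs are placeholders, and a datafile can only be shrunk if the free space sits at its tail, which is by no means guaranteed here):

-- shrink a SYSAUX datafile once the space inside it has been freed
alter database datafile '/u01/oradata/PROD/sysaux05.dbf' resize 10g;

-- or purge old AWR snapshots explicitly rather than waiting for
-- the retention window to do it
begin
  dbms_workload_repository.drop_snapshot_range(
    low_snap_id  => 1000,
    high_snap_id => 2000);
end;
/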
Oh, one last thing. I mentioned the above to a couple of the OakTable lot in the pub a few weeks back. Their instant response was to say “You set STATISTICS_LEVEL to ALL on a live system?!? You are mad!”
