Database CPU Utilisation

How do you measure CPU consumption retrospectively? More importantly how do you find an individual SQL\process who consumed all your CPU yesterday? Let’s take the below case study to quickly demonstrate how we can quickly find our culprit.

Let’s jump over the formalities of monitoring and alerting and assume someone or something has alerted us of excessive CPU consumption. On quick inspection it is quite apparent when the CPU spike occurred. 13-Aug 18:00 to be precise.

Illustration 1: CPU Spike Found

Now we have the precise date and time of our CPU spike we can sort through SQL operations based on CPU consumption for the given time period. For a quick dump of your recent CPU consumption you can use the below SQL:

AWR Script:

select x.sql_id||'::'||x.plan_hash_value,

sum(x.cpu_time_delta)/1000000

from

(select snap_id,

      sql_id,

      plan_hash_value,

      cpu_time_delta

     from dba_hist_sqlstat) x,

(select snap_id,

      max(to_char((trunc(begin_interval_time, 'mi')),'DD/MON/YYYY HH24:MI')) as start_time,

      max(to_char((trunc(end_interval_time, 'mi')),'DD/MON/YYYY HH24:MI')) as end_time

      from dba_hist_snapshot

        where begin_interval_time >= to_date('13/AUG/2017 17:45:00','DD/MON/YYYY HH24:MI:SS')

          and end_interval_time <=   to_date('13/AUG/2017 18:16:00','DD/MON/YYYY HH24:MI:SS')

      group by snap_id ) s 

where s.snap_id = x.snap_id 

group by x.sql_id||'::'||x.plan_hash_value

order by 2

STATSPACK Script:

select x.sqlexec,
sum(x.CPU_T)
from
(select sql_id||'::'||hash_value as sqlexec,
nvl(decode(greatest(CPU_TIME, nvl(lag(CPU_TIME) over
(partition by stats$sql_summary.dbid, stats$sql_summary.instance_number,
sql_id||'::'||hash_value order by stats$sql_summary.snap_id),0)),
CPU_TIME, CPU_TIME - lag(CPU_TIME) over
(partition by stats$sql_summary.dbid, stats$sql_summary.instance_number,
sql_id||'::'||hash_value order by stats$sql_summary.snap_id),CPU_TIME), 0) /1000000 CPU_T
from       stats$sql_summary, stats$snapshot nm
where stats$sql_summary.snap_id = nm.snap_id
and nm.snap_time >= sysdate - 1
) x
group by x.sqlexec

 

For the purpose of this case study we have used "Speedway" to quickly lookup the CPU consumption details. Depending on the type of Oracle licensing you are using "Speedway" can extract the CPU details from AWR or STATPACK:

Illustration 2: SQL sorted by CPU consumption

Turns out this SQL was a “Create Table as Select” using 5 parallel processes. Considering the database instance was caged at 2 CPU’s the overall performance to the database was drastically impacted.

 

Further to the above article there are some simple diagnostic tools which can monitor CPU utilisation. We have found one of the most informative and verbose diagnostic tool to be OSW(or commonly known as Oracle System Watcher).

OSW Sample Output

zzz ***Thu May 9 16:00:15 EET 2013

System configuration: lcpu=16 mem=45056MB ent=1.00

 

kthr    memory              page              faults              cpu

----- -----------    ------------------------ ------------   -----------------------

 r  b   avm   fre    re  pi  po  fr   sr  cy   in    sy   cs us sy id wa    pc    ec

 5  0 4182496 45273   0   0   0   0    0   0  217 30477 8249 34 27 39  0  1.44 143.6

 2  0 4182504 45261   0   0   0   0    0   0   53  7833 6815  8  7 85  0  0.28  28.1

 1  0 4182269 45492   0   0   0   0    0   0   85 10636 7602  8  9 82  0  0.33  32.7

The below performance metrics collected by OSW pertain to Unix AIX

 

kthr: Information about kernel thread states

r: Average number of runnable kernel threads over the sampling interval. Runnable threads consist of the threads that are ready but still waiting to run, and the threads that are already running.

b: Average number of kernel threads that are placed in the Virtual Memory Manager (VMM) wait queue (awaiting resource, awaiting input/output) over the sampling interval.

 

Memory: Information about the usage of virtual and real memory. Virtual pages are considered active if they are accessed. A page is 4096 bytes

avm: Active virtual pages.

fre: Size of the free list. Note: A large portion of real memory is used as a cache for file system data. It is not unusual for the size of the free list to remain small.

 

Page: Information about page faults and paging activity. This information is averaged over the interval and given in units per second.

re: Pager input/output list.

pi: Pages that are paged in from paging space.

po: Pages paged out to paging space.

fr: Pages freed (page replacement).

sr: Pages that are scanned by page-replacement algorithm.

cy: Clock cycles by page-replacement algorithm.

 

Faults: Trap and interrupt rate averages per second over the sampling interval.

in: Device interrupts.

sy: System calls.

cs: Kernel thread context switches.

 

CPU: Breakdown of percentage usage of processor time

us: User time.If the current physical processor consumption of the uncapped partitions exceeds the entitled capacity, the percentage becomes relative to the number of physical processor consumed (pc).

sy: System time. If the current physical processor consumption of the uncapped partitions exceeds the entitled capacity, the percentage becomes relative to the number of physical processor consumed (pc).

id: Processor idle time. If the current physical processor consumption of the uncapped partitions exceeds the entitled capacity, the percentage becomes relative to the number of  physical processor consumed (pc).

wa: Processor idle time during which the system had outstanding disk/NFS I/O request. If the current physical processor consumption of the uncapped partitions exceeds the entitled capacity, the percentage becomes relative to the number of physical processor consumed (pc).

pc:  Number of physical processors used. Displayed only if the partition is running with shared processor.

ec: The percentage of entitled capacity that is consumed. Displayed only if the partition is running with shared processor. Because the time base over which this data is computed can vary, the entitled capacity percentage can sometimes exceed 100%. This excess is noticeable only with small sampling intervals.

Complete Database Performance Management