ORACLE RAC Health & Performance
So how do you know if your cluster is unhealthy and not performing up to scratch? A clustered database can offer more than just high availability if it is healthy and performing well. By distributing your workload across an active-active node configuration, you can take advantage of every pocket of your cluster's compute power. However, an active-active configuration can also mean performance degradation in an unhealthy RAC, since every node depends on the interconnect to exchange blocks with its peers.
​
The ORACLE performance tuning tool "Speedway" was used to conduct the RAC analysis below.
Cluster Network Latency Analysis
In the first set of illustrations we look at the overall network latency between the RAC nodes. The EXADATA interconnect ping statistics below are consistently well under 1 millisecond. This is a great result and translates into minimal cluster wait time in an active-active cluster.
Illustration 1: EXADATA cluster ping statistics
However, when we analyse the network latency for a cluster hosted on IBM P-Series frames, we get a completely different result: Speedway shows a latency of greater than 1 millisecond. At this point it is quite obvious that the EXADATA InfiniBand interconnect is going to deliver better cluster performance.
Illustration 2: IBM P-Series frame cluster ping statistics
Cluster Error Counters
In the next set of illustrations we look at the errors incurred during block transmission over the network. Ideally we want to see zero block errors. As shown below, the EXADATA InfiniBand network experiences zero errors.
Illustration 3: EXADATA cluster block error counters
However, when we analyse the block error counters for the P-Series frames, we once again see a different result.
Illustration 4: IBM P-Series cluster block error counters
So what’s the next step? No amount of database tuning is going to resolve network latency and block transmission errors. One thing we can do is focus on the network protocol that carries the cluster communication. The User Datagram Protocol, commonly known as UDP, is responsible for this traffic. The output of the netstat command below gives us some error metrics for UDP:
​
udp:
    3979565887 datagrams received
    0 incomplete headers
    0 bad data length fields
    0 bad checksums
    28099 dropped due to no socket
    295723 broadcast/multicast datagrams dropped due to no socket
    0 socket buffer overflows
    3979242065 delivered
    51004598 datagrams output
The netstat counters are cumulative, so a single snapshot tells us little. To gauge the UDP error rate we want to poll netstat frequently and capture the delta in each error counter between polls. There are two ways to go about this. You can write your own netstat polling script or, better yet, you can take advantage of Oracle's OSWatcher (OSW). OSW polls system statistics, including netstat, at a regular interval.
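For those who prefer to roll their own, here is a minimal sketch of such a polling script in Python. It is an illustration only, not part of Speedway or OSW. The function names, the 30 second interval and the default command ("netstat -p udp") are assumptions, and the command will likely need adjusting for your platform. The parser expects output shaped like the sample above: a "udp:" header followed by lines of the form "<number> <label>".

```python
import re
import subprocess
import time

# A counter line is a number followed by a label,
# e.g. "28099 dropped due to no socket".
COUNTER_LINE = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")


def parse_udp_counters(text):
    """Return {label: value} for the counter lines under the 'udp:' header."""
    counters = {}
    in_udp = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.endswith(":"):
            # Section header such as "udp:" or "tcp:"; track only UDP.
            in_udp = stripped.lower() == "udp:"
            continue
        match = COUNTER_LINE.match(line)
        if in_udp and match:
            counters[match.group(2)] = int(match.group(1))
    return counters


def counter_deltas(previous, current):
    """Change in each counter between two samples (labels present in both)."""
    return {label: current[label] - previous[label]
            for label in current if label in previous}


def poll(interval_seconds=30, command=("netstat", "-p", "udp")):
    """Hypothetical polling loop: print every counter that moved since the last poll.

    The command and interval are assumptions; change them to suit your OS.
    Runs until interrupted.
    """
    previous = None
    while True:
        output = subprocess.run(command, capture_output=True,
                                text=True, check=True).stdout
        current = parse_udp_counters(output)
        if previous is not None:
            stamp = time.strftime("%H:%M:%S")
            for label, delta in counter_deltas(previous, current).items():
                if delta:
                    print(f"{stamp} {delta:+d} {label}")
        previous = current
        time.sleep(interval_seconds)
```

Calling poll() starts the loop. A steadily growing delta for "socket buffer overflows" or "bad checksums" is the kind of signal to look for. A negative delta usually just means the counters were reset, for example by a reboot.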
Once armed with performance and error metrics for your cluster and for UDP, gathered using the techniques above, you can make informed decisions about network and operating system tuning to help alleviate the problem.