RAC/CRS/Voting disk failover Tests
Failure Scenario | How to Test it? | Oracle Recovery |
Private Network Failure between Nodes | Pull the PRIVATE port network cable out of RAC node 1 or 2 | Clusterware evicts one of the nodes, and Oracle will push all connections from that node to the surviving node (node 1 to node 2 or vice versa). If an application is certified against RAC, the failover is unnoticeable to end users. |
Public Network Failure | Pull the PUBLIC port network cables out of both RAC node 1 and node 2 | The application will return an error because it won’t be able to connect to the database |
RMAN Backup/Restore/Recovery:
I will put a script in place to back up the entire database together with its archive logs, keeping 2 days' worth of backups. The backup sets will be stored in the ASM storage area for a quick restore if needed.
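As a rough illustration of what such a backup script could contain, the sketch below generates an RMAN command file. It assumes that "2 days" means a 2-day recovery window and that the default backup destination is already a recovery area on ASM; the function name `build_rman_backup_script` is hypothetical and not part of any Oracle tooling.

```python
def build_rman_backup_script(retention_days: int = 2) -> str:
    """Return RMAN commands for a full database backup plus archive logs.

    Assumption: backups older than the recovery window may be deleted.
    """
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return "\n".join([
        # Keep enough backups to recover to any point in the last N days.
        f"CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF {retention_days} DAYS;",
        "RUN {",
        # Back up all data files together with the archived redo logs.
        "  BACKUP DATABASE PLUS ARCHIVELOG;",
        # Remove backups that the retention policy no longer needs.
        "  DELETE NOPROMPT OBSOLETE;",
        "}",
    ])
```

The returned text could be written to a file and passed to RMAN, for example with `rman target / cmdfile=<file>`, from whatever scheduler runs the nightly job.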
RMAN Restore:
Assumption: a complete RMAN backup set is available in the FLASH_RECOVERY_AREA.
Scenario | Database Crash | How to Test it? | Oracle Recovery |
Loss of Control file | NO | Delete a control file from ASM storage | The control file will be multiplexed, so deleting one copy won’t bring the Oracle database down |
Loss of Redo Log | NO | Delete a redo log from ASM storage | The redo logs will be multiplexed, so deleting one log member won’t bring the Oracle database down |
Loss of non-system Data file | NO | Delete a data file from ASM storage | Oracle will raise an alert and continue to function, setting the data file OFFLINE. If application data resides in the unavailable data file, users will receive an Oracle error like “file XYZ is offline”. We can restore the file from the latest RMAN backup. |
Loss of SYSAUX data file | NO | Delete the data files used for the SYSAUX tablespace | Losing the SYSAUX tablespace does not result in a database crash. We can restore the files from the latest RMAN backup and recover the entire database until a point in time. |
Loss of SYSTEM data file | YES | Delete the data file used for the SYSTEM tablespace | Will bring the entire RAC system down. We can restore the files from the latest RMAN backup and recover the entire database. |
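The difference between the NO and YES rows above can be sketched as two RMAN command sequences. This is only an illustration under stated assumptions: the helper name `build_restore_commands` is hypothetical, the file number has to be looked up first (for example in `V$DATAFILE`), and the sketch performs a complete recovery of a single data file rather than the point-in-time recovery mentioned for SYSAUX.

```python
from typing import List


def build_restore_commands(file_no: int, crashes_database: bool) -> List[str]:
    """Return RMAN commands to restore and recover one data file.

    crashes_database=True models the SYSTEM case, where the database is
    down and has to be mounted before the restore.
    """
    restore = [f"RESTORE DATAFILE {file_no};", f"RECOVER DATAFILE {file_no};"]
    if crashes_database:
        # Database is down: mount it, repair the file, then open it.
        return ["STARTUP MOUNT;"] + restore + ["ALTER DATABASE OPEN;"]
    # Database stays open: take only the damaged file offline while repairing.
    return (
        [f"SQL 'ALTER DATABASE DATAFILE {file_no} OFFLINE';"]
        + restore
        + [f"SQL 'ALTER DATABASE DATAFILE {file_no} ONLINE';"]
    )
```

For a non-system file the rest of the database remains available during the restore, which is why the table marks those scenarios as not causing a crash.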
Cluster Component failure:
Scenario | Database Crash | How to Test it? | Oracle Recovery |
Loss of Voting Disk | NO | Disable the SAN volume used for the Voting Disk | RAC will continue to function as long as the private interconnect between the RAC nodes is working fine. I will schedule a backup of the voting disk every 4 hours. The voting disk contains transient data, so even an old backup is OK for a restore. |
Loss of Cluster Registry | YES | Disable the SAN volume used for the Cluster Registry | The cluster registry (OCR) is multiplexed between SAN volumes. In case of a total failure to access the OCR volumes, we need to restore it. |
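The 4-hourly voting disk backup mentioned in the table could be scheduled through cron. The sketch below only builds the crontab line; it assumes an older Clusterware release where the voting disk sits on a raw device and is copied with `dd` (newer releases do not support `dd` backups of the voting disk), and the helper name and both paths are hypothetical placeholders.

```python
def voting_disk_cron_entry(device: str, backup_dir: str, every_hours: int = 4) -> str:
    """Return a crontab line that copies the voting disk with dd.

    The same backup file is overwritten on every run, which is acceptable
    here because even an old copy is considered good enough for a restore.
    """
    if not 1 <= every_hours <= 23:
        raise ValueError("every_hours must be between 1 and 23")
    # Minute 0 of every Nth hour, every day.
    schedule = f"0 */{every_hours} * * *"
    return f"{schedule} dd if={device} of={backup_dir}/votedisk.bak"
```

For example, `voting_disk_cron_entry("/dev/raw/raw2", "/backup/crs")` yields a line that can be added to the crontab of the Clusterware owner. For the cluster registry, `ocrconfig -showbackup` lists the backups that Clusterware keeps, and `ocrconfig -restore` restores from one of them.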
Temporary loss of SAN storage
If the SAN is completely lost, the entire RAC system will crash. I am assuming that all SAN volumes used for data/backup/OCR/voting disk are lost at the same time, so the chances of data corruption are minimal: corruption is possible when nodes evict each other and overwrite each other's data blocks, but with no access to SAN storage the nodes won’t be able to carry out any tasks.
Check the alert log messages for each instance, kill any dangling Oracle processes, and restart the clusterware/RAC instances once the SAN is back.
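Checking the alert log after the SAN returns can be partly automated. The following is a minimal sketch that counts `ORA-` error codes in an alert log; the function name `summarize_ora_errors` is hypothetical, and the location of the alert log depends on the installation, so it is left to the caller.

```python
import re
from collections import Counter
from typing import Iterable

# Oracle error codes look like ORA-00204: the prefix plus five digits.
ORA_ERROR = re.compile(r"\bORA-\d{5}\b")


def summarize_ora_errors(lines: Iterable[str]) -> Counter:
    """Count how often each ORA- error code appears in the given lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(ORA_ERROR.findall(line))
    return counts
```

Passing an open alert log file to `summarize_ora_errors` and printing `most_common()` gives a quick view of which errors each instance hit during the outage, before the clusterware is restarted.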
Complete loss of SAN storage
If the entire SAN array is blown away, there is no easy way to recover it. We will have to re-install the Oracle RAC software and restore the old database/OCR/voting disks from TAPE. We will have to use RMAN to rebuild the ASM data structures.