Wednesday, 16 December 2015

(TAF) Transparent Application Failover and testing






Part 4: (TAF) Transparent Application Failover and testing.


Transparent Application Failover (TAF) is the ability of a connection to fail over and reconnect to a surviving node. Uncommitted transactions are rolled back, but TAF can be configured to resume SELECT operations. Before you can use TAF you have to create a service. Initially, services were used for connection load balancing, but over the years they have developed into a method to distribute and manage workload across database instances. For more information on services, see Oracle's documentation:
http://www.oracle.com/technology/obe/10gr2_db_vmware/ha/rac/rac.htm
There are a few ways to create a service. You can use Grid Control, srvctl or dbca. For this exercise I am going to use DBCA.
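For comparison, here is a rough sketch of what the srvctl route might look like. It assumes the names used in this walkthrough (database RACDB, instances RACDB1 and RACDB2, service RAC) and the 10g srvctl option style; the run helper is only an illustration and simply prints each command when srvctl is not on the PATH.

```shell
#!/bin/sh
# Sketch: create, start and check the RAC service with srvctl (10g syntax).
# Names are assumptions taken from the DBCA walkthrough in this post.
DB_NAME=RACDB
SERVICE=RAC
PREFERRED=RACDB1,RACDB2   # both instances flagged as preferred

# Hypothetical helper: echo the command, and run it only if srvctl exists.
run() {
  echo "+ $*"
  if command -v srvctl >/dev/null 2>&1; then
    "$@"
  fi
}

# -r lists the preferred instances, -P sets the TAF policy (NONE, BASIC, PRECONNECT).
run srvctl add service -d "$DB_NAME" -s "$SERVICE" -r "$PREFERRED" -P BASIC
run srvctl start service -d "$DB_NAME" -s "$SERVICE"
run srvctl status service -d "$DB_NAME" -s "$SERVICE"
```

As far as I know srvctl does not write a tnsnames.ora entry the way DBCA does, so the client entry would have to be added by hand.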
Launch DBCA. Select Oracle Real Application Clusters database and click on next:
clip_image002[1]
Select Services Management and click on Next:

clip_image004[1]
RACDB should be selected, so click the Next button:

clip_image006[1]
On the following screen click on the Add button. When prompted for a service name enter RAC and click on the OK button:

clip_image008[1]

On the following screen verify that both RAC instances (RACDB1, RACDB2) have the Preferred option selected. Preferred means that the service will run on that instance when the environment first starts. An instance flagged as Available will not run the service unless a preferred instance fails. Since this is a RAC environment we want users to be distributed across both nodes, so both instances need to be set to Preferred.
The TAF policy defines the failover settings. Select Basic. Basic means that the session is reconnected when a node fails. Preconnect means that a shadow process is created in advance on another node in the cluster, to reduce failover time if there is a failure.

clip_image010[1]
Once the changes above have been completed, click on the finish button. A popup will appear prompting you to configure the services. Select OK.

clip_image012[1]
clip_image014[1]
Once the service configuration has completed, you will be asked if you’d like to perform another operation. After you select No, dbca will close.

clip_image016[1]
A check of the $ORACLE_HOME/network/admin/tnsnames.ora file will show the following new entry:
clip_image018[1]

Let's break down this new entry and highlight a few key lines:
Line 5: (LOAD_BALANCE = yes)
This line indicates that Oracle will randomly select one of the addresses defined on the previous two lines and connect to that node's listener. This is called client-side connect-time load balancing.

Lines 9-13: These lines are the TAF settings.
Line 10: (TYPE = SELECT)
There are two types of failover. SELECT means that the session is re-authenticated on a surviving node and, in addition, open SELECT statements are re-executed. Rows already returned to the client are discarded, so the query picks up where it left off.
The other option is SESSION. In a SESSION failover, only the user's session is re-authenticated.
Line 11: (METHOD = BASIC)
BASIC means that the connection to a surviving node is not established until the failover occurs.
PRECONNECT means a shadow process is created on a backup instance to reduce failover time. There are some additional considerations when choosing this setting so be sure to read up on it.
Lines 12 and 13: (RETRIES = 180)
(DELAY = 5)
These set the maximum number of reconnection attempts and the number of seconds to wait between attempts.
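Since the line numbers above refer to a screenshot, here is a sketch of an entry laid out so that the numbering matches. It is a reconstruction, not a copy of the generated file: the VIP host names (raclinux1-vip, raclinux2-vip), the port and the SERVER line are assumptions. The script writes the sample to a scratch file and prints it with line numbers.

```shell
#!/bin/sh
# Sketch: write a sample TAF entry to a scratch file and print it numbered,
# so the output lines up with the "Line N" references in the text.
# Host names and port are assumptions, not taken from the screenshot.
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
RAC =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = raclinux1-vip)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = raclinux2-vip)(PORT = 1521))
    (LOAD_BALANCE = yes)
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = RAC)
      (FAILOVER_MODE =
        (TYPE = SELECT)
        (METHOD = BASIC)
        (RETRIES = 180)
        (DELAY = 5)
      )
    )
  )
EOF
nl -ba "$SAMPLE"
```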

Testing TAF

Session 1: Log in to the database via the service created above:

clip_image020[1]

Session 2: Log in as sysdba and query gv$session to determine which instance Scott is connected to:

clip_image022[1]

Scott is connected to instance 2 (which resides on raclinux2), so let's shut down that instance and see what happens. As sysdba, connect to RACDB2 and issue shutdown immediate. Once the instance has shut down, re-execute the query above to see which instance Scott is now connected to:

clip_image024[1]

Why hasn’t the session failed over? Because Scott wasn’t executing a query when node 2 was shut down. Re-execute the select count(*) from emp statement, then query gv$session again:
clip_image026[1]
If a session is inactive, it will not fail over until another statement is issued.
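The screenshots are not reproduced here, so the following is a sketch of the kind of gv$session query that can be used for this check, not necessarily the exact one I ran. It relies on the FAILOVER_TYPE, FAILOVER_METHOD and FAILED_OVER columns of gv$session, and assumes OS authentication for the sysdba connection. If sqlplus is not available it just prints the statement.

```shell
#!/bin/sh
# Sketch: show which instance SCOTT is on and whether TAF has kicked in.
# FAILED_OVER reads YES once the session has been re-established by TAF.
TAF_QUERY=$(cat <<'SQL'
select inst_id, username, failover_type, failover_method, failed_over
from gv$session
where username = 'SCOTT';
SQL
)

if command -v sqlplus >/dev/null 2>&1; then
  # Assumes the OS user is in the dba group (OS authentication).
  printf '%s\n' "$TAF_QUERY" | sqlplus -s "/ as sysdba"
else
  # No Oracle client here: only print the statement.
  printf '%s\n' "$TAF_QUERY"
fi
```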

Part 3 - Issues and Resolutions

The following is a list of the issues I encountered while installing 10gR2 RAC on OEL5 under VMware, along with their solutions. Some of the items below are things I tried that didn't resolve an issue, but I thought they might be interesting.
1. If you are using VMware on a slow computer, especially when 2 nodes are running, you may experience locking issues. If so, the cause could be timeouts on the shared disk. In your VMware config files add the following:
reslck.timeout="1200"

On the bottom right-hand side of your VMware window you will see disk icons that flash green when they are in use. If VMware hangs and any of the shared disks are green then this is probably the issue.
2. eth0 has to be bridged, or you will see the following when running vipca or in your VIP log:

Interface eth0 checked failed (host=raclinux1)
Invalid parameters, or failed to bring up VIP (host=raclinux1)

3. If you didn’t disable the firewall during the install, root.sh will be unable to start CRS on the second node, raclinux2. If so, stop iptables by executing:

/sbin/service iptables stop

To prevent iptables from starting after a reboot, execute the following as root:

/usr/sbin/system-config-services

Scroll down until you see iptables. If it is checked, remove the check and click on the Save button. You can also stop the service from this program by highlighting iptables and clicking the Stop button.
Subject: Root.sh Unable To Start CRS On Second Node Doc ID: Note:369699.1
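If you would rather not use the GUI, the same thing can probably be done from the command line with chkconfig. This is a sketch that assumes the usual EL5 paths (/sbin/service, /sbin/chkconfig); the disable_iptables function and the --apply switch are names I made up for the example. Without --apply nothing is changed.

```shell
#!/bin/sh
# Sketch: stop iptables now and keep it from starting at boot (run as root).
# Paths are the usual EL5 locations; adjust if yours differ.
disable_iptables() {
  /sbin/service iptables stop
  /sbin/chkconfig iptables off
  /sbin/chkconfig --list iptables   # every runlevel should now show "off"
}

# Only touch the system when explicitly asked to.
if [ "${1:-}" = "--apply" ]; then
  disable_iptables
else
  echo "dry run: pass --apply as root to stop and disable iptables"
fi
```

Run it on both nodes, since the firewall has to be off on each of them.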
4. The following two workarounds are already addressed in the installation guide but here they are for reference.
Subject: VIPCA FAILS COMPLAINING THAT INTERFACE IS NOT PUBLIC Doc ID: Note:316583.1
Subject: 10gR2 RAC Install issues on Oracle EL5 or RHEL5 or SLES10 (VIPCA Failures) Doc ID: Note:414163.1
5. Subject: CRS-0215: Could not start resource 'ora..vip' Doc ID: Note:356535.1
During a VIP status check, your public interface's gateway is pinged. If you don't have a gateway specified, this check will fail, and the script then assumes there is a problem with the Ethernet adapter interface. To resolve this, set the parameter FAIL_WHEN_DEFAULTGW_NOT_FOUND in $ORA_CRS_HOME/bin/racgvip to 0.
clip_image002

This doesn’t mean the VIP will fail over; there are some additional checks. Also, the parameter FAIL_WHEN_DEFAULTGW_NOT_FOUND only applies if you don't have a gateway defined in your network setup. If you entered a gateway IP address as per my guide, even though it may not be pingable, this setting will have no effect.
6. While troubleshooting VIP failovers I found the following note which details how to increase the timeouts for the VIP service. This didn’t solve any of the issues I encountered but I thought it may be interesting to note:
Subject: ASM Instance Shuts Down Cleanly On Its Own Doc ID: Note:277274.1
7. If you are using VMware on a slow computer you may experience a problem where the VIPs fail over frequently. If that happens, increasing the value of the parameter CHECK_TIMES to 10 may help.
In $ORA_CRS_HOME/bin/racgvip change the following line:
# number of time to check to determine if the interface is down
CHECK_TIMES=2
-- to --
# number of time to check to determine if the interface is down
CHECK_TIMES=10

NOTE: This will only help when the problem is caused by a slow response from the gateway. Please do NOT use this workaround in other situations. It has the side effect of increasing the time it takes to detect an unresponsive public interface.

Subject: VIP Going Offline Intermittantly - Slow Response from Default Gateway Doc ID: Note:399213.1
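If you prefer to script the racgvip change instead of editing the file by hand, something along these lines should work. It is only a sketch: set_param is a hypothetical helper, and it assumes the parameter sits on its own line in plain NAME=value form, as in the excerpt above. The demo runs against a scratch file, not the real racgvip.

```shell
#!/bin/sh
# Sketch: change a NAME=value line in a script, keeping a .bak copy.
# Usage: set_param FILE NAME VALUE
set_param() {
  sed -i.bak "s/^$2=.*/$2=$3/" "$1"
}

# Demonstrate on a scratch file that mimics the relevant racgvip lines.
DEMO=$(mktemp)
printf '%s\n' \
  '# number of time to check to determine if the interface is down' \
  'CHECK_TIMES=2' > "$DEMO"

set_param "$DEMO" CHECK_TIMES 10
cat "$DEMO"
```

On a real node the call would be set_param "$ORA_CRS_HOME/bin/racgvip" CHECK_TIMES 10 on each node. The same helper should cover FAIL_WHEN_DEFAULTGW_NOT_FOUND from item 5, provided that line has the same form.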

8. CRS services in an unknown state after reboot. Watching the crs logs I noticed the services weren’t waiting for the appropriate timeout value before erroring. This is a bug with 10.2.0.1 CRS:
Patch 4698419
[10201-050630.0024A_LINUX64]STARTING CRS RESOURCES FAILS WITHOUT WAITING TIMEOUT

 

Part 2 – Installing OCR, ASM and Database


Pre-Req: Download the 10.2.0.1 clusterware and 10.2.0.1 database software from Technet.
  1. Install CVUQDISK package.
    Unzip the clusterware software and cd to ./clusterware/rpm directory. Run the following commands on both nodes:

    [root@raclinux1 ~]# export CVUQDISK_GRP=dba
    [root@raclinux1 ~]# rpm -iv cvuqdisk-1.0.1-1.rpm
    Preparing packages for installation…
    cvuqdisk-1.0.1-1


  2. Clusterware Install
    Change directory to /clusterware and execute:

    runInstaller -ignoreSysPrereqs
    clip_image002

    Specify the inventory location and group name:

    clip_image004

    Specify the Home details:

    clip_image006

    Ignore warnings on the Prerequisite checks screen below. At the time 10.2.0.1 was released Oracle Enterprise Linux didn’t exist, so the installer doesn’t view it as a supported OS.

    clip_image008

    Cluster Configuration. Click on Add and fill in the information for the second node, raclinux2 as per the screenshot below:

    clip_image010

    Private Interconnect Enforcement… Select eth0, click the Edit button and select public for the interface type since it is the primary network card:

    clip_image012

    Oracle Cluster Registry (OCR) location. These have been mounted as /dev/raw/raw3 and /dev/raw/raw4:

    clip_image014

    Voting Disk Locations:
    These have been mounted as /dev/raw/raw1, /dev/raw/raw2 and /dev/raw/raw6:

    clip_image016

    Click Install on the Summary screen:

    clip_image018

    Once the install has completed you will be prompted to execute the following scripts on both nodes:

    clip_image020

    The CRS Home root.sh script executes ocrconfig (Oracle Cluster Registry Configuration Tool) and clscfg (Cluster Configuration Tool). These tools format the voting disks, start up the software and add the daemons to the inittab (OS startup scripts).

    Before executing root.sh, for both nodes, edit the vipca and srvctl files under the CRS bin directory. Search for the string LD_ASSUME_KERNEL and find the line where this variable is set. Unset the variable by placing the following on the next line:

    unset LD_ASSUME_KERNEL

    See Note 414163.1 for details.
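The vipca and srvctl edit can also be scripted. The sketch below assumes the variable is set by a line of the form "export LD_ASSUME_KERNEL" in both files, so check your copies before relying on it; patch_ld_assume is a name invented for the example, and the demo runs on a scratch file with made-up contents.

```shell
#!/bin/sh
# Sketch: add "unset LD_ASSUME_KERNEL" right after the line that exports it.
# Usage: patch_ld_assume FILE   (a .bak copy of FILE is kept)
patch_ld_assume() {
  awk '{ print }
       /^[ \t]*export[ \t]+LD_ASSUME_KERNEL/ { print "unset LD_ASSUME_KERNEL" }' \
      "$1" > "$1.new" &&
    cp "$1" "$1.bak" &&
    cat "$1.new" > "$1" &&   # overwrite in place so file permissions are kept
    rm -f "$1.new"
}

# Demonstrate on a scratch file (contents are an assumption, not the real vipca).
DEMO=$(mktemp)
cat > "$DEMO" <<'EOF'
LD_ASSUME_KERNEL=2.4.19
export LD_ASSUME_KERNEL
echo "rest of script"
EOF

patch_ld_assume "$DEMO"
cat "$DEMO"
```

On the real system it would be called once for vipca and once for srvctl in the CRS bin directory, on both nodes.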
    Execute root.sh. (Note: don’t return to the runInstaller and click OK to signify that the root.sh script has finished until directed to do so in a few steps.) Towards the end of the root.sh output on raclinux2 you will see the following error:

    Error 0(Native: listNetInterfaces:[3])
    [Error 0(Native: listNetInterfaces:[3])]

    As per the same note above, on raclinux2 execute the following from the CRS bin directory as root:

    ./oifcfg setif -global eth0/192.168.0.0:public
    ./oifcfg setif -global eth1/10.10.10.0:cluster_interconnect
    clip_image022

    From the same directory launch vipca on raclinux2:

    clip_image024

    On the following screen, enter the information shown in the screenshot. When you put the IP Alias Name in, the IP Address column will auto populate:

    clip_image026

    A Summary screen is shown, on which you click on the Finish button:

    clip_image028

    Once the config finishes you should see the configuration results screen:

    clip_image030

    From a command window, as oracle, execute crs_stat -t from the Cluster Home bin directory and you should see that all services are online:

    clip_image032

    Now, go back to raclinux1 and click OK to confirm that you have finished executing root.sh on both nodes. The following screen will appear; once each of the tools completes successfully, click on Next:

    clip_image034

    The installation is now complete and you can click on Exit:

    clip_image036
  3. Install ASM.
    Unzip the database 10.2.0 archive and launch:

    runInstaller -ignoreSysPrereqs
    clip_image038

    Select Enterprise Edition and click on next:

    clip_image040

    Change the Oracle Home name and path to reflect this is an ASM install:

    clip_image042

    The runInstaller will detect the cluster, so make sure raclinux2 is checked in the following screen:

    clip_image044

    There will be some warnings in the Prerequisite check screen. These can be ignored for the same reason as in the clusterware install:

    clip_image046

    Since we are performing an ASM install, select the Configure Automatic Storage Management (ASM) option and enter a password for the sys account:

    clip_image048

    Configure ASM Storage: Since we are using ASMLib you should see the VOL1 you created earlier in the following screen. Select External Redundancy and the ORCL:VOL1 disk:

    clip_image050

    Finally, click the install button:

    clip_image052

    After the install you’ll be prompted to execute the root.sh scripts on both nodes. Once they have finished, click on the OK button.

    clip_image054

    Installation is now complete and you can click on exit:

    clip_image056
  4. Install Database Software. From the database software staging directory launch:

    runInstaller -ignoreSysPrereqs
    When prompted for Installation Type, select Enterprise edition and click on Next:

    clip_image058

    Verify Home Details:

    clip_image060

    Make sure both nodes are selected in the Cluster Installation Screen:

    clip_image062

    Ignore warnings and click on Yes then on the next button:

    clip_image064

    In the Select configuration Option screen select install database Software only:

    clip_image066

    Review the Summary Screen and click on Install:

    clip_image068

    Once the installation is complete you will be shown the following screen, click on Exit:

    clip_image070

  5. Install agent. Note: This assumes you already have Grid Control or access to a Grid Control installation. If you do not, you can skip this step and manage the environment using Database Control. Download a copy of Enterprise Manager and from raclinux1 launch the runInstaller:

    clip_image072

    If you selected the Mass Agent download from OTN the only option available and preselected is “Additional Management Agent”. Click next and in the following screen modify the Parent Directory to: /home/oracle/product/10.2.0

    clip_image074

    Since this is a clustered environment you will be prompted for a cluster or local install. Select cluster and verify both nodes are selected, and then click next.

    clip_image076

    Select the location of an existing Grid Control Install:

    clip_image078

    Click next and again on the next screen, ignore Oracle Configuration Manager Registration. On the last screen, review the summary and click on Install:

    clip_image080

    Installing:

    clip_image082

    When prompted, execute the root.sh script on each node, and in the correct order:

    clip_image084

    After the installation, click exit:

    clip_image086

  6. Creating the cluster database.

    Change to the $ORACLE_HOME/bin directory and launch dbca. Select the option to create an Oracle Real Application Clusters database:

    clip_image088

    Select Create a Database:

    clip_image090

    Click on the Select All button to make sure both nodes are highlighted:

    clip_image092

    Select the general purpose template:

    clip_image094
    For the global database name and sid, enter RACDB:

    clip_image096

    Select your grid control location in the following window. If you installed the agent earlier it will be automatically selected. If not, Use Database Control will be selected. Click next:

    clip_image098

    Choose a password:

    clip_image100

    Under storage options choose ASM:

    clip_image102

    You’ll be prompted for the ASM sys password:

    clip_image104

    Select the DATA Disk Group:

    clip_image106

    Select Oracle-Managed Files:

    clip_image108
    I didn’t create a second disk group for a flash recovery area, so just click next on the following screen:

    clip_image110

    Choose the sample schemas so you have some data to play with:

    clip_image112

    You can create services now if you’d like or later via dbca or srvctl:

    clip_image114
    You can customize the initialization parameters to your liking. I chose a custom memory configuration with 200MB for the SGA and 25MB for the PGA. The rest were defaults:

    clip_image116
    In the Database Storage window click on next:

    clip_image118
    Finally, click finish to start the creation process:

    clip_image120

    After you click the Finish button you will be prompted with a summary screen. You should review it to make sure everything looks fine then click on ok:

    clip_image122

    If you selected the Generate Database Creation Script option, the scripts will be generated first. Once that completes, a popup will appear letting you know it was successful. Click OK and you will be returned to the previous screen, where you click Finish again:
    clip_image124

    Once the install completes you will be presented with a screen similar to the one below:

    clip_image126
