Chapter 30: Maintaining Highly-Available Environments
How to Maintain Highly-Available Environments
As an operator, it is your responsibility to monitor the environment and address
problems that occur. In highly-available environments, scheduler and database failures
adversely affect, and sometimes disable, high-availability. You can ensure that you
maintain a properly functioning highly-available environment by resolving database and
scheduler failures.
In highly-available cluster environments, use the monitoring tools provided by your
cluster management software and database vendor to monitor the state of the cluster
and the state of the database. We recommend that you also monitor the scheduler log
on the active node. When your cluster or clustered database is not functioning properly,
consult the documentation for your cluster management software or your database
vendor and follow the instructions to restore the environment.
In high-availability and dual event server modes, monitor the scheduler logs for
database and scheduler failures and take appropriate action to return to
high-availability mode when failures occur.
Note: For more information about monitoring tools provided by your cluster
management software and database vendor, see the documentation for those products.
The following diagram demonstrates how you can maintain a highly-available
environment:
Follow these steps:
1. Monitor the scheduler log.
2. Restore the failed scheduler.
■ On UNIX
■ On Windows
3. Recover the failed database.
■ On UNIX
■ On Windows
Monitor the Scheduler Log
The scheduler log displays information about alarms and related error messages so that
you know when you need to take action to resolve problems with your CA Workload
Automation AE environment. For example, the scheduler log displays alarm information
when the scheduler issues a database rollover alarm.
■ If you are operating in a highly-available cluster environment, monitor the
scheduler log on the active node. The scheduler log does not display all of the
information that you need to monitor the environment. Ensure that you also use
the monitoring tools that are provided by your cluster manager to monitor the state
of the cluster.
■ If you are operating in high-availability mode, monitor the state of the dormant
scheduler.
The primary scheduler is typically the active scheduler. In this case, the shadow
scheduler begins processing events and issues an alarm when the primary scheduler
fails (EP_ROLLOVER).
If you restore the primary scheduler when the primary failback mode is set to 1, the
shadow scheduler remains active and the primary scheduler runs dormant. In this
case, the primary scheduler resumes processing events and issues an alarm when
the shadow scheduler fails (EP_ROLLOVER).
■ If you are operating in dual event server mode, monitor the active scheduler log.
When a database rollover occurs, CA Workload Automation AE begins operating in
single event server mode and the active scheduler issues a database rollover alarm
(DB_ROLLOVER).
■ If you are operating with a clustered database, use the monitoring tools that are
provided by your cluster manager and database vendor to monitor the state of the
database.
■ In all CA Workload Automation AE environments, monitor the scheduler log for
alarm information and error messages so that you can take appropriate action to
resolve problems with the environment as they occur.
Note: For information about the monitoring tools that are provided by your cluster
manager and database vendor, see the documentation for those products.
Follow these steps:
1. Log in as the root user on the machine or node where the scheduler is installed.
2. Open the operating system or instance command prompt as follows:
■ (UNIX) Run the shell that is sourced to use CA Workload Automation AE.
The operating system command prompt appears.
■ (Windows) Click Start, Programs, CA, Workload Automation AE, Command
Prompt (instance_name).
The instance command prompt appears.
3. Enter the following command:
autosyslog -e
The scheduler log appears. When CA Workload Automation AE encounters a
problem, the scheduler issues an alarm. The log displays the alarm information and
related error messages so that you can resolve the problem.
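As a rough illustration of what to look for, the following sketch scans scheduler log text for the two rollover alarms named in this section. The function name is hypothetical, and the layout of each log line is an assumption; only the alarm names EP_ROLLOVER and DB_ROLLOVER come from this chapter.

```shell
# Hypothetical helper: report which rollover alarms appear in log text on stdin.
# Assumption: the alarm name appears literally somewhere in the log line.
scan_rollover_alarms() {
    local line found_ep=0 found_db=0
    # The "|| [ -n ... ]" clause keeps a final line that has no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            *EP_ROLLOVER*) found_ep=1 ;;
        esac
        case "$line" in
            *DB_ROLLOVER*) found_db=1 ;;
        esac
    done
    if [ "$found_ep" -eq 1 ]; then
        echo "EP_ROLLOVER: a scheduler rollover was logged; restore the failed scheduler"
    fi
    if [ "$found_db" -eq 1 ]; then
        echo "DB_ROLLOVER: a database rollover was logged; recover the failed database"
    fi
    if [ "$found_ep" -eq 0 ] && [ "$found_db" -eq 0 ]; then
        echo "no rollover alarms found"
    fi
}
```

You might feed the function a saved copy of the log, for example scan_rollover_alarms < saved_scheduler_log.txt, where the file name is a placeholder.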
Restore the Failed Scheduler on UNIX
If you are operating CA Workload Automation AE in high-availability mode and the
active scheduler fails, high-availability is disabled. If you are operating in a
highly-available cluster environment, the action that you need to take to return to a
highly-available environment depends on your cluster management software.
In highly-available cluster environments, manual intervention following a failover is
usually not required. The cluster manager may still issue informational messages. When
the cluster manager issues messages indicating that manual intervention is required,
take one of the following actions:
■ Review messages and follow the instructions that they provide.
■ Consult the documentation for your cluster management software and the
instructions for manually recovering the failed cluster node.
■ Contact your network administrator.
If you are operating in high-availability mode and high-availability is disabled because
the active scheduler fails, return to high-availability mode by restoring the failed
scheduler.
Follow these steps:
1. If the primary failback mode is set to 0 and you are restoring the primary scheduler,
stop the shadow scheduler.
Note: By default, the primary failback mode is set to 0 and this step is required.
When the primary failback mode is set to 1 or 2, restart the primary scheduler
without stopping the shadow scheduler.
a. Log in to a machine in the instance with a client installation as the root user.
You can issue commands that execute client utilities.
b. Run the shell that is sourced to use CA Workload Automation AE.
The operating system command prompt appears.
c. Enter the following command:
sendevent -E STOP_DEMON -v ROLE=S
2. If high-availability mode was disabled because a failover occurred, restore the
primary scheduler.
a. Log in to the machine where the primary scheduler is installed and run the shell
that is sourced to use CA Workload Automation AE.
The operating system command prompt appears.
b. Enter the following command:
eventor
The primary scheduler is restored.
The primary scheduler either runs dormant or resumes processing events,
depending on the primary failback mode. CA Workload Automation AE returns to
high-availability mode.
Important! If the primary failback mode is set to 0, the primary scheduler resumes
processing events as soon as it is restored but CA Workload Automation AE does
not return to high-availability mode. To return to high-availability mode, restore the
shadow scheduler.
If the primary failback mode is set to 1 or 2, CA Workload Automation AE returns to
high-availability mode.
3. If high-availability mode was disabled because of a failback that was not automatic
or manual, use the eventor command to restore the shadow scheduler.
Notes:
■ Automatic failbacks occur when primary failback mode is set to 2 and you
restore the primary scheduler. You can control when a failback occurs by
setting primary failback mode to 1 and initiating a manual failback only when
you want the primary scheduler to resume processing events.
■ The shadow scheduler process restarts itself in a dormant state when an
automatic failback occurs or when you initiate a manual failback. The shadow
scheduler does not stop running when you initiate a manual failback. In both
cases, no action is required to return to high-availability mode.
■ Failbacks also occur when the shadow scheduler fails. In this case,
high-availability mode is disabled until you restore the shadow scheduler.
The primary scheduler continues processing events, the shadow scheduler is restored
but runs dormant, and CA Workload Automation AE returns to high-availability mode.
Note: For more information about setting the Primary Failback Mode, see the UNIX
Implementation Guide.
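The order of commands in this procedure depends on the primary failback mode. The following dry-run sketch prints the sequence for restoring a failed primary scheduler without executing anything. The function name and the output layout are hypothetical; the commands themselves are the ones shown in the steps above.

```shell
# Hypothetical dry-run helper: print the restore sequence for a failed
# primary scheduler on UNIX. Argument: primary failback mode (0, 1, or 2).
# Nothing is executed; the lines are only printed for review.
primary_restore_plan() {
    local mode="$1"
    case "$mode" in
        0)
            # Mode 0: stop the shadow first, then restore both schedulers.
            echo "client machine: sendevent -E STOP_DEMON -v ROLE=S"
            echo "primary machine: eventor"
            echo "shadow machine: eventor"
            ;;
        1|2)
            # Modes 1 and 2: restart the primary without stopping the shadow.
            echo "primary machine: eventor"
            ;;
        *)
            echo "unknown primary failback mode: $mode" >&2
            return 1
            ;;
    esac
}
```

For example, primary_restore_plan 0 prints three lines, and primary_restore_plan 1 prints only the eventor line for the primary machine.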
Restore the Failed Scheduler on Windows
If you are operating CA Workload Automation AE in high-availability mode and the
active scheduler fails, high-availability is disabled. If you are operating in a
highly-available cluster environment, the action that you need to take to return to a
highly-available environment depends on your cluster management software.
In highly-available cluster environments, manual intervention following a failover is
usually not required. The cluster manager may still issue informational messages. When
the cluster manager issues messages indicating that manual intervention is required,
take one of the following actions:
■ Review messages and follow the instructions that they provide.
■ Consult the documentation for your cluster management software and the
instructions for manually recovering the failed cluster node.
■ Contact your network administrator.
If you are operating in high-availability mode and high-availability is disabled because
the active scheduler fails, return to high-availability mode by restoring the failed
scheduler.
Follow these steps:
1. If the primary failback mode is set to Off and you are restoring the primary
scheduler, stop the shadow scheduler.
Note: By default, the primary failback mode is set to Off and this step is required.
When the primary failback mode is set to Immediate or Dormant, restart the primary
scheduler without stopping the shadow scheduler.
a. Log in to a machine in the instance with a client installation as the root user.
You can issue commands that execute client utilities.
b. Click Start, Programs, CA, Workload Automation AE, Command Prompt
(instance_name).
The instance command prompt appears.
c. Enter the following command:
sendevent -E STOP_DEMON -v ROLE=S
2. If high-availability mode was disabled because a failover occurred, recover the
primary scheduler.
a. Log in to the machine where the scheduler is installed and click Start,
Programs, CA, Workload Automation AE, Administrator.
The Instance - CA Workload Automation AE Administrator window opens.
b. Select an instance from the Instance drop-down list in the Settings pane, and
then click the Services icon on the toolbar.
The Services - CA Workload Automation AE Administrator window appears,
displaying a list of services installed on the selected instance.
c. Right-click the Scheduler service, and select Start.
Important! If the primary failback mode is set to Off, the primary scheduler
resumes processing events as soon as it is restored but CA Workload
Automation AE does not return to high-availability mode. To return to
high-availability mode, restore the shadow scheduler.
If the primary failback mode is set to Dormant or Immediate, CA Workload
Automation AE returns to high-availability mode.
3. If high-availability mode was disabled because a failback occurred, restore the
shadow scheduler.
Notes:
■ Automatic failbacks occur when primary failback mode is set to Immediate and
you restore the primary scheduler. You can control when a failback occurs by
setting primary failback mode to Dormant and initiating a manual failback only
when you want the primary scheduler to resume processing events.
■ The shadow scheduler process restarts itself in a dormant state when an
automatic failback occurs or when you initiate a manual failback. The shadow
scheduler does not stop running when you initiate a manual failback. In both
cases, no action is required to return to high-availability mode.
■ Failbacks also occur when the shadow scheduler fails. In this case,
high-availability mode is disabled until you restore the shadow scheduler.
The primary scheduler continues processing events, the shadow scheduler is
restored but runs dormant, and CA Workload Automation AE returns to
high-availability mode.
The scheduler recovers and CA Workload Automation AE returns to high-availability
mode.
Note: For more information about setting the Primary Failback Mode, see the Windows
Implementation Guide.
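The Windows procedure names the primary failback modes, while the UNIX procedure numbers them. The following sketch maps one to the other. The mapping is inferred by comparing the two procedures in this chapter (0 and Off are the defaults, 1 and Dormant allow manual failback, 2 and Immediate trigger automatic failback), so treat it as an assumption, and the function name is hypothetical.

```shell
# Hypothetical helper: translate a UNIX primary failback mode number into
# the name used in the Windows procedure.
# Assumption: the 0/1/2 to Off/Dormant/Immediate mapping is inferred from
# this chapter, not taken from a product reference.
failback_mode_label() {
    case "$1" in
        0) echo "Off" ;;
        1) echo "Dormant" ;;
        2) echo "Immediate" ;;
        *)
            echo "unknown primary failback mode: $1" >&2
            return 1
            ;;
    esac
}
```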
Recover the Failed Database on UNIX
If you are using dual event server mode as your database failover solution and the
scheduler initiates a database rollover, dual event server mode is disabled.
Highly-available cluster environments do not function properly without a
highly-available database. A database rollover does not disable high-availability mode,
but the risk of downtime and data loss increases without a failover solution for your
database. To mitigate this risk, restore the failed database.
If you are operating with a clustered database and a database failure occurs, some
cluster managers require manual intervention to recover the failed database. In this
case, take one of the following actions:
■ Review messages issued by the cluster manager and database vendor, and follow
the instructions that the messages provide.
■ Consult the documentation for your cluster management software and your
database software, and the instructions for manually recovering from a database
failure.
■ Contact your database or network administrator.
Notes:
■ You can operate with a clustered database only when you have cluster
management software installed and are using a cluster aware database. Some
cluster managers restore a failed database automatically, so no action is required.
In this case, your cluster manager issues messages indicating that the database is
restored.
■ If you have cluster management software installed, we recommend that you set up
a highly-available cluster environment instead of configuring high-availability mode.
■ When a rollover occurs, CA Workload Automation AE backs up the configuration file
before commenting out the failed event server. You can use the backup file to
restore pre-rollover configuration settings when you reconfigure dual event servers.
If you are using dual event server mode as your database failover solution and the
scheduler rolls over the database, CA Workload Automation AE begins operating in
single event server mode. To return to dual event server mode, recover from the failure
and reconfigure dual event servers.
Follow these steps:
1. Review the scheduler log file to determine the problem that caused the failure.
2. Consult your database software documentation, and follow the instructions for
manually resolving the problem that caused the database failure.
The failed database is recovered, and you can reconfigure dual event servers.
3. Log in to a machine in the instance with a client installation, and run the shell
that is sourced to use CA Workload Automation AE.
The operating system command prompt appears.
4. Enter the following command:
sendevent -E STOP_DEMON -v ALL
All server processes that are running on the instance stop.
5. If you want to restore the pre-rollover configuration settings, delete the modified
configuration file and rename the backup file.
a. Delete $AUTOUSER/config.$AUTOSERV.
b. Locate $AUTOUSER/config.$AUTOSERV.rollover and change the name to
$AUTOUSER/config.$AUTOSERV.
c. Repeat these actions on every server machine that is installed in the instance.
6. If you want to specify new configuration settings when you restore the failed event
server, modify the $AUTOUSER/config.$AUTOSERV file.
a. Open the configuration file on the machine where the primary scheduler is
installed and locate the following parameter:
#EventServer_1|#EventServer_2
b. Edit the parameter as follows:
EventServer_1|EventServer_2
c. Specify the primary event server and the secondary event server:
■ To make the new database the secondary event server, add the following
parameter:
EventServer_2=SYBASE_SVR:SYBASE_DB,DBPORT,DBHOST |
ORACLE_SVR,DBPORT,DBHOST
SYBASE_SVR:SYBASE_DB,DBPORT,DBHOST
Identifies the Sybase database for the second event server.
ORACLE_SVR,DBPORT,DBHOST
Identifies the Oracle database for the second event server.
■ To make the restored database the secondary event server, specify the
active database in the EventServer_1 parameter and the restored database
in the EventServer_2 parameter.
■ To make the new database the primary event server, specify the existing
database that is defined in the EventServer_1 parameter as the secondary
event server by changing it to EventServer_2, then add the following
parameter:
EventServer_1=SYBASE_SVR:SYBASE_DB,DBPORT,DBHOST |
ORACLE_SVR,DBPORT,DBHOST
■ To make the restored database the primary event server, verify that the
restored database is specified in the EventServer_1 parameter and the
active database is specified in the EventServer_2 parameter.
d. Specify the database reconnect behavior for the second event server by
modifying the following parameter in the configuration file:
DBEventReconnect=value, value2
value
Identifies the database reconnect behavior for the first event server.
Limits: 0-99
value2
Identifies the database reconnect behavior for the second event server.
Limits: 0-99
Note: During typical installation, CA Workload Automation AE sets the
reconnect value for the single event server to 50 by default. During a custom
installation in which you enable dual event server mode, CA Workload
Automation AE sets the reconnect value for both event servers to 50, 5 by
default. Ensure that you add a reconnect value for the second event server
when you configure CA Workload Automation AE to run in dual event server
mode after running it in single event server mode. Optionally, you can modify
the default reconnect value for the first event server.
The secondary event server is configured on the primary scheduler machine.
7. Repeat the secondary event server configuration on every server machine in the
instance. Ensure that the event server information in the configuration file is the
same on each of these machines.
8. Run the CA Workload Automation AE bulk copy script. The script that you run
depends on your database vendor.
■ Oracle
Open the $AUTOSYS/dbobj/ORA directory and run the following script:
perl autobcpORA.pl source_server target_server source_userid
source_password target_userid target_password dump_file oracle_directory
■ Sybase
Open the $AUTOSYS/dbobj/SYB directory and run the following script:
perl autobcpSYB.pl source_server source_db target_server target_db
source_userid source_password target_userid target_password dump_file
blk_size
source_server
Defines the name of the source Oracle System ID (for example, AEDB) or
Sybase server name (for example, SourceServer). For Sybase, the source server
name is defined in the interfaces file.
source_db
Defines the source Sybase database (for example, AEDB).
source_userid
Defines the user ID that is used to connect to the source Oracle System ID, or
Sybase server.
Note: On Oracle, use aedbadmin as the source user ID.
source_password
Defines the password that corresponds to the user ID that is used to connect to
the source Oracle System ID, or Sybase server.
target_server
Defines the target Oracle System ID (for example, AEDB2), or Sybase server
name (for example, DestinationServer). For Sybase, the target server name is
defined in the interfaces file.
Note: For Oracle, the source server must be different from the target server.
target_db
Defines the target Sybase database (for example, AEDB2).
Note: The autobcpDB script deletes all of the data in the target database and
replaces it with the data in the source database. If you want to save the data in
the target database, archive it before you run the autobcpDB script.
target_userid
Defines the user ID that is used to connect to the target Oracle System ID, or
Sybase server.
Note: On Oracle, use aedbadmin as the target user ID.
target_password
Defines the password that corresponds to the user ID that is used to connect to
the target Oracle System ID, or Sybase server.
dump_file
Defines the temporary file that is used in the transfer of data from one
database to the other database.
Note: Specify a file that is local to the computer where this script is running.
oracle_directory
Defines the path to the Oracle home directory.
blk_size
(Optional) Specifies the number of rows that can be inserted from the
dump_file to the destination database at a time.
Default: 5000
Note: The Sybase script uses the default value when you run it in the interactive
mode, or when you do not specify the blk_size value. Do not specify a large value
because the transaction log encounters problems when it becomes too full.
The event servers are synchronized.
9. Restart the server processes for the instance. If you are operating in a
highly-available cluster environment, restart the services on the active node.
a. Open the operating system command prompt, and enter the following
commands:
unisrvcntr start waae_sched.$AUTOSERV
unisrvcntr start waae_server.$AUTOSERV
b. Repeat this action on every server machine in the instance. If you are operating
in a highly-available cluster environment, restart the components on the active
node only.
Note: In highly-available cluster environments, the scheduler and application
server actively execute work on the active node only. The cluster manager
prompts the scheduler or application server on one of the passive nodes to
begin executing tasks only when it detects a failure of the same component on
the active node.
The database is restored.
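The EventServer_1, EventServer_2, and DBEventReconnect entries described in step 6 follow a fixed layout. The following sketch formats those entries so that you can review them before you edit the configuration file. The function names are hypothetical, the server names, ports, and hosts in the usage example are placeholders, and writing the reconnect values without a space after the comma is an assumption.

```shell
# Hypothetical helpers: format configuration entries using the layouts in step 6.

# Arguments: index (1 or 2), Sybase server, Sybase database, port, host
sybase_event_server_line() {
    printf 'EventServer_%s=%s:%s,%s,%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Arguments: index (1 or 2), Oracle server, port, host
oracle_event_server_line() {
    printf 'EventServer_%s=%s,%s,%s\n' "$1" "$2" "$3" "$4"
}

# Arguments: reconnect value for the first and the second event server.
# Each value must be within the documented limits of 0-99.
db_event_reconnect_line() {
    local v
    for v in "$1" "$2"; do
        case "$v" in
            [0-9]|[0-9][0-9]) ;;
            *)
                echo "reconnect value must be a number from 0 to 99: $v" >&2
                return 1
                ;;
        esac
    done
    # Assumption: no space after the comma.
    printf 'DBEventReconnect=%s,%s\n' "$1" "$2"
}
```

For example, oracle_event_server_line 2 AEDB2 1521 dbhost2 prints EventServer_2=AEDB2,1521,dbhost2, and db_event_reconnect_line 50 5 prints the default values for dual event server mode.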
High-availability is restored. To maintain your highly-available environment, continue
monitoring the scheduler log and recovering from scheduler and database failures when
they occur.
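The bulk copy scripts in step 8 take their arguments in a different order for each database vendor. The following sketch assembles the command line for review before you run it. The function names are hypothetical, every value in the usage example is a placeholder, and the sketch assumes that no argument contains spaces.

```shell
# Hypothetical helpers: print the bulk copy command line without running it.
# Note: the printed line contains the passwords that you pass in.

# Arguments: source_server target_server source_userid source_password
#            target_userid target_password dump_file oracle_directory
build_autobcp_ora_cmd() {
    if [ "$#" -ne 8 ]; then
        echo "expected 8 arguments, got $#" >&2
        return 2
    fi
    # For Oracle, the source server must differ from the target server.
    if [ "$1" = "$2" ]; then
        echo "source and target server must be different" >&2
        return 1
    fi
    printf 'perl autobcpORA.pl %s\n' "$*"
}

# Arguments: source_server source_db target_server target_db source_userid
#            source_password target_userid target_password dump_file [blk_size]
build_autobcp_syb_cmd() {
    if [ "$#" -lt 9 ] || [ "$#" -gt 10 ]; then
        echo "expected 9 or 10 arguments, got $#" >&2
        return 2
    fi
    printf 'perl autobcpSYB.pl %s\n' "$*"
}
```

For example, build_autobcp_ora_cmd AEDB AEDB2 aedbadmin src_pw aedbadmin tgt_pw /tmp/autobcp.dmp /opt/oracle prints the Oracle command with the arguments in the documented order. Run the printed command from the directory named in step 8.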
Recover the Failed Database on Windows
If you are using dual event server mode as your database failover solution and the
scheduler initiates a database rollover, dual event server mode is disabled.
Highly-available cluster environments do not function properly without a
highly-available database. A database rollover does not disable high-availability mode,
but the risk of downtime and data loss increases without a failover solution for your
database. To mitigate this risk, restore the failed database.
If you are operating with a clustered database and a database failure occurs, some
cluster managers require manual intervention to recover the failed database. In this
case, take one of the following actions:
■ Review messages issued by the cluster manager and database vendor, and follow
the instructions that the messages provide.
■ Consult the documentation for your cluster management software and your
database software, and the instructions for manually recovering from a database
failure.
■ Contact your database or network administrator.
Notes:
■ You can operate with a clustered database only when you have cluster
management software installed and are using a cluster aware database. Some
cluster managers restore a failed database automatically, so no action is required.
In this case, your cluster manager issues messages indicating that the database is
restored.
■ If you have cluster management software installed, we recommend that you set up
a highly-available cluster environment instead of configuring high-availability mode.
If you are using dual event server mode as your database failover solution and the
scheduler rolls over the database, CA Workload Automation AE begins operating in
single event server mode. To return to dual event server mode, recover from the failure
and reconfigure dual event servers.
Follow these steps:
1. Review the scheduler log file to determine the problem that caused the failure.
2. Consult your database software documentation, and follow the instructions for
manually resolving the problem that caused the database failure.
The failed database is recovered, and you can reconfigure dual event servers.
3. Log in to a machine in the instance with a client installation, and click Start,
Programs, CA, Workload Automation AE, Command Prompt (instance_name).
The instance command prompt opens.
4. Enter the following command:
sendevent -E STOP_DEMON -v ALL
All server processes that are running in the instance stop.
5. Restore, enable, and reconfigure the failed database (event server).
a. Review the scheduler log file to determine the problem that caused the failure.
b. Consult your database software documentation, and follow the instructions for
manually resolving the problem that caused the database failure.
The status of the failed database changes and you can reconfigure dual event
servers.
c. Click the Event Server icon on the toolbar in any CA Workload Automation AE
Administrator window.
The Event Server - CA Workload Automation AE Administrator window appears.
d. Clear the A Database Rollover Has Occurred check box, verify the event server
configuration settings and make any desired modifications, and then click
Apply.
Important! Ensure that the configuration settings for Event Server 2 are
identical to the configuration settings for Event Server 1.
The failed event server is restored and the configuration settings are saved, but
CA Workload Automation AE does not return to dual event server mode until
you synchronize the event servers.
6. Run the CA Workload Automation AE bulk copy script. The directory and script that
you use to synchronize the event servers following a database failure are the same as
the directory and script that you use to switch to dual event server mode when you did
not enable dual event servers during installation. The script that you run depends
on your database vendor.
■ Oracle
Open the %AUTOSYS%\dbobj\ORA directory and run the following script:
perl autobcpORA.pl source_server target_server source_userid
source_password target_userid target_password dump_file oracle_directory
■ Sybase
Open the %AUTOSYS%\dbobj\SYB directory and run the following script:
perl autobcpSYB.pl source_server source_db target_server target_db
source_userid source_password target_userid target_password dump_file
blk_size
■ Microsoft SQL Server
Open the %AUTOSYS%\dbobj\MSQ directory and run the following script:
perl autobcpMSQ.pl source_server source_db target_server target_db
source_userid source_password target_userid target_password dump_file
The event servers are synchronized.
7. Restart the server processes for the instance. If you are operating in a
highly-available cluster environment, restart the services on the active node.
a. Click Start, Programs, CA, Workload Automation AE, Administrator.
The Instance - CA Workload Automation AE Administrator window opens.
b. Select an instance from the Instance drop-down list in the Settings pane.
c. Click the Services icon on the toolbar.
A list of the services that are running on the instance appears.
d. Right-click the scheduler service and select Start, then right-click the
application server service and select Start.
e. If you are operating in a highly-available cluster environment, restart the
components on the active node only. Otherwise, repeat these actions on every
server machine in the instance.
Note: In highly-available cluster environments, the scheduler and application
server actively execute work on the active node only. The cluster manager
prompts the scheduler or application server on one of the passive nodes to
begin executing tasks only when it detects a failure of the same component on
the active node.
The database is recovered and CA Workload Automation AE returns to dual event
server mode.
High-availability is restored. To maintain your highly-available environment, continue
monitoring the scheduler log and recovering from scheduler and database failures when
they occur.