Autosys tutorial: Chapter 30: Maintaining Highly-Available Environments

Chapter 30: Maintaining Highly-Available Environments

How to Maintain Highly-Available Environments

As an operator, it is your responsibility to monitor the environment and address

problems that occur. In highly-available environments, scheduler and database failures

adversely affect, and sometimes disable, high-availability. You can ensure that you

maintain a properly functioning highly-available environment by resolving database and

scheduler failures.

In highly-available cluster environments, use the monitoring tools provided by your

cluster management software and database vendor to monitor the state of the cluster

and the state of the database. We recommend that you also monitor the scheduler log

on the active node. When your cluster or clustered database is not functioning properly,

consult the documentation for your cluster management software or your database

vendor and follow the instructions to restore the environment.

In high-availability and dual event server modes, monitor the scheduler logs for

database and scheduler failures and take appropriate action to return to

high-availability mode when failures occur.

Note: For more information about monitoring tools provided by your cluster

management software and database vendor, see the documentation for those products.

The following steps summarize how you can maintain a highly-available

environment:

1. Monitor the scheduler log.

2. Restore the failed scheduler.

■ On UNIX

■ On Windows

3. Recover the failed database.

■ On UNIX

■ On Windows

Monitor the Scheduler Log

The scheduler log displays information about alarms and related error messages so that

you know when you need to take action to resolve problems with your CA Workload

Automation AE environment. For example, the scheduler log displays alarm information

when the scheduler issues a database rollover alarm.

■ If you are operating in a highly-available cluster environment, monitor the

scheduler log on the active node. The scheduler log does not display all of the

information that you need to monitor the environment. Ensure that you also use

the monitoring tools that are provided by your cluster manager to monitor the state

of the cluster.

■ If you are operating in high-availability mode, monitor the state of the dormant

scheduler.

The primary scheduler is typically the active scheduler. In this case, the shadow

scheduler begins processing events and issues an alarm when the primary scheduler

fails (EP_ROLLOVER).

If you restore the primary scheduler when the primary failback mode is set to 1, the

shadow scheduler remains active and the primary scheduler runs dormant. In this

case, the primary scheduler resumes processing events and issues an alarm when

the shadow scheduler fails (EP_ROLLOVER).

■ If you are operating in dual event server mode, monitor the active scheduler log.

When a database rollover occurs, CA Workload Automation AE begins operating in

single event server mode and the active scheduler issues a database rollover alarm

(DB_ROLLOVER).

■ If you are operating with a clustered database, use the monitoring tools that are

provided by your cluster manager and database vendor to monitor the state of the

database.

■ In all CA Workload Automation AE environments, monitor the scheduler log for

alarm information and error messages so that you can take appropriate action to

resolve problems with the environment as they occur.

Note: For information about the monitoring tools that are provided by your cluster

manager and database vendor, see the documentation for those products.

Follow these steps:

1. Log in as the root user on the machine or node where the scheduler is installed.

2. Open the operating system or instance command prompt as follows:

■ (UNIX) Run the shell that is sourced to use CA Workload Automation AE.

The operating system command prompt appears.

■ (Windows) Click Start, Programs, CA, Workload Automation AE, Command

Prompt (instance_name).

The instance command prompt appears.

3. Enter the following command:

autosyslog -e

The scheduler log appears. When CA Workload Automation AE encounters a

problem, the scheduler issues an alarm. The log displays the alarm information and

related error messages so that you can resolve the problem.
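The monitoring described above can be partly automated by scanning the log output for the two rollover alarms that this chapter names. The following is a minimal sketch, assuming the log is available as lines of text (for example, captured from autosyslog -e). The function name find_rollover_alarms and the sample log lines are hypothetical, and the real scheduler log layout may differ.

```python
import re

# Alarm names come from the text above; the log line layout is an assumption.
ROLLOVER_ALARMS = ("EP_ROLLOVER", "DB_ROLLOVER")
_ALARM_RE = re.compile(r"\b(" + "|".join(ROLLOVER_ALARMS) + r")\b")


def find_rollover_alarms(log_lines):
    """Return (line_number, alarm_name, line_text) for each line naming a rollover alarm."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        match = _ALARM_RE.search(line)
        if match:
            hits.append((number, match.group(1), line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Hypothetical log excerpt; real scheduler log lines will look different.
    sample = [
        "[10/08/2015 09:00:01] scheduler heartbeat ok\n",
        "[10/08/2015 09:00:07] ALARM: DB_ROLLOVER\n",
        "[10/08/2015 09:02:30] ALARM: EP_ROLLOVER\n",
    ]
    for number, alarm, text in find_rollover_alarms(sample):
        print(f"line {number}: {alarm}: {text}")
```

An EP_ROLLOVER hit points to the scheduler restore procedures, and a DB_ROLLOVER hit points to the database recovery procedures.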

Restore the Failed Scheduler on UNIX

If you are operating CA Workload Automation AE in high-availability mode and the

active scheduler fails, high-availability is disabled. If you are operating in a

highly-available cluster environment, the action that you need to take to return to a

highly-available environment depends on your cluster management software.

In highly-available cluster environments, manual intervention following a failover is

usually not required. The cluster manager may still issue informational messages. When

the cluster manager issues messages indicating that manual intervention is required,

take one of the following actions:

■ Review messages and follow the instructions that they provide.

■ Consult the documentation for your cluster management software and the

instructions for manually recovering the failed cluster node.

■ Contact your network administrator.

If you are operating in high-availability mode and high-availability is disabled because

the active scheduler fails, return to high-availability mode by restoring the failed

scheduler.

Follow these steps:

1. If the primary failback mode is set to 0 and you are restoring the primary scheduler,

stop the shadow scheduler.

Note: By default, the primary failback mode is set to 0 and this step is required.

When the primary failback mode is set to 1 or 2, restart the primary scheduler

without stopping the shadow scheduler.

a. Log in to a machine in the instance with a client installation as the root user.

You can issue commands that execute client utilities.

b. Run the shell that is sourced to use CA Workload Automation AE.

The operating system command prompt appears.

c. Enter the following command:

sendevent -E STOP_DEMON -v ROLE=S

2. If high-availability mode was disabled because a failover occurred, restore the

primary scheduler.

a. Log in to the machine where the primary scheduler is installed and run the shell

that is sourced to use CA Workload Automation AE.

The operating system command prompt appears.

b. Enter the following command:

eventor

The primary scheduler is restored.

The primary scheduler either runs dormant or resumes processing events,

depending on the primary failback mode. CA Workload Automation AE returns to

high-availability mode.

Important! If the primary failback mode is set to 0, the primary scheduler resumes

processing events as soon as it is restored but CA Workload Automation AE does

not return to high-availability mode. To return to high-availability mode, restore the

shadow scheduler.

If the primary failback mode is set to 1 or 2, CA Workload Automation AE returns to

high-availability mode.

3. If high-availability mode was disabled because of a failback that was not automatic

or manual, use the eventor command to restore the shadow scheduler.

Notes:

■ Automatic failbacks occur when primary failback mode is set to 2 and you

restore the primary scheduler. You can control when a failback occurs by

setting primary failback mode to 1 and initiating a manual failback only when

you want the primary scheduler to resume processing events.

■ The shadow scheduler process restarts itself in a dormant state when an

automatic failback occurs or when you initiate a manual failback. The shadow

scheduler does not stop running when you initiate a manual failback. In both

cases, no action is required to return to high-availability mode.

■ Failbacks also occur when the shadow scheduler fails. In this case,

high-availability mode is disabled until you restore the shadow scheduler.

The primary scheduler continues processing events, the shadow scheduler is restored

but runs dormant, and CA Workload Automation AE returns to high-availability mode.

Note: For more information about setting the Primary Failback Mode, see the UNIX

Implementation Guide.
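The way the primary failback mode changes the UNIX procedure above can be sketched as a small decision function. This is an illustration only: the function name unix_restore_plan and the machine labels ("client", "primary", "shadow") are hypothetical, and the commands are only listed, not executed.

```python
def unix_restore_plan(failback_mode):
    """List (machine, command) pairs for restoring a failed primary scheduler.

    failback_mode is the primary failback mode from the procedure: 0, 1, or 2.
    """
    if failback_mode not in (0, 1, 2):
        raise ValueError("primary failback mode must be 0, 1, or 2")

    plan = []
    if failback_mode == 0:
        # Step 1: stop the shadow scheduler before restoring the primary.
        plan.append(("client", "sendevent -E STOP_DEMON -v ROLE=S"))
    # Step 2: start the primary scheduler on its own machine.
    plan.append(("primary", "eventor"))
    if failback_mode == 0:
        # Step 3: high-availability mode returns once the shadow is restored.
        plan.append(("shadow", "eventor"))
    return plan


if __name__ == "__main__":
    for mode in (0, 1, 2):
        print(mode, unix_restore_plan(mode))
```

With mode 1 or 2 the plan contains only the eventor call on the primary machine, which matches the note that the shadow scheduler does not need to be stopped in those modes.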

Restore the Failed Scheduler on Windows

If you are operating CA Workload Automation AE in high-availability mode and the

active scheduler fails, high-availability is disabled. If you are operating in a

highly-available cluster environment, the action that you need to take to return to a

highly-available environment depends on your cluster management software.

In highly-available cluster environments, manual intervention following a failover is

usually not required. The cluster manager may still issue informational messages. When

the cluster manager issues messages indicating that manual intervention is required,

take one of the following actions:

■ Review messages and follow the instructions that they provide.

■ Consult the documentation for your cluster management software and the

instructions for manually recovering the failed cluster node.

■ Contact your network administrator.

If you are operating in high-availability mode and high-availability is disabled because

the active scheduler fails, return to high-availability mode by restoring the failed

scheduler.

Follow these steps:

1. If the primary failback mode is set to Off and you are restoring the primary

scheduler, stop the shadow scheduler.

Note: By default, the primary failback mode is set to Off and this step is required.

When the primary failback mode is set to Immediate or Dormant, restart the primary

scheduler without stopping the shadow scheduler.

a. Log in to a machine in the instance with a client installation as the root user.

You can issue commands that execute client utilities.

b. Click Start, Programs, CA, Workload Automation AE, Command Prompt

(instance_name).

The instance command prompt appears.

c. Enter the following command:

sendevent -E STOP_DEMON -v ROLE=S

2. If high-availability mode was disabled because a failover occurred, recover the

primary scheduler.

a. Log in to the machine where the scheduler is installed and click Start,

Programs, CA, Workload Automation AE, Administrator.

The Instance - CA Workload Automation AE Administrator window opens.

b. Select an instance from the Instance drop-down list in the Settings pane, and

then click the Services icon on the toolbar.

The Services - CA Workload Automation AE Administrator window appears,

displaying a list of services installed on the selected instance.

c. Right-click the Scheduler service, and select Start.

Important! If the primary failback mode is set to Off, the primary scheduler

resumes processing events as soon as it is restored but CA Workload

Automation AE does not return to high-availability mode. To return to

high-availability mode, restore the shadow scheduler.

If the primary failback mode is set to Dormant or Immediate, CA Workload

Automation AE returns to high-availability mode.

3. If high-availability mode was disabled because a failback occurred, restore the

shadow scheduler.

Notes:

■ Automatic failbacks occur when primary failback mode is set to Immediate and

you restore the primary scheduler. You can control when a failback occurs by

setting primary failback mode to Dormant and initiating a manual failback only

when you want the primary scheduler to resume processing events.

■ The shadow scheduler process restarts itself in a dormant state when an

automatic failback occurs or when you initiate a manual failback. The shadow

scheduler does not stop running when you initiate a manual failback. In both

cases, no action is required to return to high-availability mode.

■ Failbacks also occur when the shadow scheduler fails. In this case,

high-availability mode is disabled until you restore the shadow scheduler.

The primary scheduler continues processing events, the shadow scheduler is

restored but runs dormant, and CA Workload Automation AE returns to

high-availability mode.

The scheduler recovers and CA Workload Automation AE returns to high-availability

mode.

Note: For more information about setting the Primary Failback Mode, see the Windows

Implementation Guide.
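The Windows procedure names the primary failback modes Off, Dormant, and Immediate, while the UNIX procedure uses 0, 1, and 2. The sketch below pairs them up and records what the procedure says about each setting. The pairing is inferred from the parallel wording of the two procedures and is an assumption, as are the dictionary and function names.

```python
# Pairing inferred from the parallel UNIX and Windows procedures (an assumption).
WINDOWS_TO_UNIX_FAILBACK_MODE = {"Off": 0, "Dormant": 1, "Immediate": 2}


def describe_failback_mode(name):
    """Summarize what the procedure above says about one Windows setting."""
    if name not in WINDOWS_TO_UNIX_FAILBACK_MODE:
        raise ValueError(f"unknown primary failback mode: {name!r}")
    return {
        "unix_value": WINDOWS_TO_UNIX_FAILBACK_MODE[name],
        # Only Off requires stopping the shadow scheduler first (step 1).
        "stop_shadow_first": name == "Off",
        # Immediate fails back automatically; Dormant waits for a manual failback.
        "failback": {"Off": None, "Dormant": "manual", "Immediate": "automatic"}[name],
    }


if __name__ == "__main__":
    for mode_name in WINDOWS_TO_UNIX_FAILBACK_MODE:
        print(mode_name, describe_failback_mode(mode_name))
```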

Recover the Failed Database on UNIX

If you are using dual event server mode as your database failover solution and the

scheduler initiates a database rollover, dual event server mode is disabled.

Highly-available cluster environments do not function properly without a

highly-available database. A database rollover does not disable high-availability mode,

but operating without a failover solution for your database increases the risk of

downtime and data loss. To mitigate this risk, restore the failed database.

If you are operating with a clustered database and a database failure occurs, some

cluster managers require manual intervention to recover the failed database. In this

case, take one of the following actions:

■ Review messages issued by the cluster manager and database vendor, and follow

the instructions that the messages provide.

■ Consult the documentation for your cluster management software and your

database software, and the instructions for manually recovering from a database

failure.

■ Contact your database or network administrator.

Notes:

■ You can operate with a clustered database only when you have cluster

management software installed and are using a cluster aware database. Some

cluster managers restore a failed database automatically, so no action is required.

In this case, your cluster manager issues messages indicating that the database is

restored.

■ If you have cluster management software installed, we recommend that you set up

a highly-available cluster environment instead of configuring high-availability mode.

■ When a rollover occurs, CA Workload Automation AE backs up the configuration file

before commenting out the failed event server. You can use the backup file to

restore pre-rollover configuration settings when you reconfigure dual event servers.

If you are using dual event server mode as your database failover solution and the

scheduler rolls over the database, CA Workload Automation AE begins operating in

single event server mode. To return to dual event server mode, recover from the failure

and reconfigure dual event servers.

Follow these steps:

1. Review the scheduler log file to determine the problem that caused the failure.

2. Consult your database software documentation, and follow the instructions for

manually resolving the problem that caused the database failure.

The failed database is recovered, and you can reconfigure dual event servers.

3. Log in to a machine in the instance with a client installation, and run the shell

that is sourced to use CA Workload Automation AE.

The operating system command prompt appears.

4. Enter the following command:

sendevent -E STOP_DEMON -v ALL

All server processes that are running on the instance stop.

5. If you want to restore the pre-rollover configuration settings, delete the modified

configuration file and rename the backup file.

a. Delete $AUTOUSER/config.$AUTOSERV.

b. Locate $AUTOUSER/config.$AUTOSERV.rollover and change the name to

$AUTOUSER/config.$AUTOSERV.

c. Repeat these actions on every server machine that is installed in the instance.

6. If you want to specify new configuration settings when you restore the failed event

server, modify the $AUTOUSER/config.$AUTOSERV file.

a. Open the configuration file on the machine where the primary scheduler is

installed and locate the following parameter:

#EventServer_1|#EventServer_2

b. Edit the parameter as follows:

EventServer_1|EventServer_2

c. Specify the primary event server and the secondary event server:

■ To make the new database the secondary event server, add the following

parameter:

EventServer_2=SYBASE_SVR:SYBASE_DB,DBPORT,DBHOST |

ORACLE_SVR,DBPORT,DBHOST

SYBASE_SVR:SYBASE_DB,DBPORT,DBHOST

Identifies the Sybase database for the second event server.

ORACLE_SVR,DBPORT,DBHOST

Identifies the Oracle database for the second event server.

■ To make the restored database the secondary event server, specify the

active database in the EventServer_1 parameter and the restored database

in the EventServer_2 parameter.

■ To make the new database the primary event server, specify the existing

database that is defined in the EventServer_1 parameter as the secondary

event server by changing it to EventServer_2, then add the following

parameter:

EventServer_1=SYBASE_SVR:SYBASE_DB,DBPORT,DBHOST |

ORACLE_SVR,DBPORT,DBHOST

■ To make the restored database the primary event server, verify that the

restored database is specified in the EventServer_1 parameter and the

active database is specified in the EventServer_2 parameter.

d. Specify the database reconnect behavior for the second event server by

modifying the following parameter in the configuration file:

DBEventReconnect=value, value2

value

Identifies the database reconnect behavior for the first event server.

Limits: 0-99

value2

Identifies the database reconnect behavior for the second event server.

Limits: 0-99

Note: During typical installation, CA Workload Automation AE sets the

reconnect value for the single event server to 50 by default. During a custom

installation in which you enable dual event server mode, CA Workload

Automation AE sets the reconnect value for both event servers to 50, 5 by

default. Ensure that you add a reconnect value for the second event server

when you configure CA Workload Automation AE to run in dual event server

mode after running it in single event server mode. Optionally, you can modify

the default reconnect value for the first event server.

The secondary event server is configured on the primary scheduler machine.
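The configuration edits in step 6 can be sketched as a text transformation: remove the comment character from the EventServer_1 and EventServer_2 lines, and make sure DBEventReconnect carries a value for each event server. This is a rough illustration, assuming one parameter per line in key=value form. The function name enable_dual_event_servers is hypothetical, and the default of 5 for the second reconnect value follows the note above.

```python
def enable_dual_event_servers(config_text, second_reconnect=5):
    """Uncomment EventServer_1/EventServer_2 and give DBEventReconnect two values."""
    if not 0 <= second_reconnect <= 99:
        raise ValueError("reconnect value must be between 0 and 99")

    result = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(("#EventServer_1=", "#EventServer_2=")):
            # Remove the comment character that the rollover added.
            line = stripped[1:]
        elif stripped.startswith("DBEventReconnect="):
            values = [v.strip() for v in stripped.split("=", 1)[1].split(",") if v.strip()]
            if len(values) == 1:
                # Single event server setting: add a value for the second server.
                values.append(str(second_reconnect))
            line = "DBEventReconnect=" + ",".join(values)
        result.append(line)
    return "\n".join(result) + "\n"


if __name__ == "__main__":
    # Hypothetical configuration fragment; server names, ports, and hosts are made up.
    sample = (
        "EventServer_1=AEDB,1521,hostA\n"
        "#EventServer_2=AEDB2,1521,hostB\n"
        "DBEventReconnect=50\n"
    )
    print(enable_dual_event_servers(sample), end="")
```

The function only returns the edited text. Writing it back to the configuration file, and repeating the change on every server machine, is left to the operator.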

7. Repeat the secondary event server configuration on every server machine in the

instance. Ensure that the event server information in the configuration file is the

same on each of these machines.

8. Run the CA Workload Automation AE bulk copy script. The script that you run

depends on your database vendor.

■ Oracle

Open the $AUTOSYS/dbobj/ORA directory and run the following script:

perl autobcpORA.pl source_server target_server source_userid

source_password target_userid target_password dump_file oracle_directory

■ Sybase

Open the $AUTOSYS/dbobj/SYB directory and run the following script:

perl autobcpSYB.pl source_server source_db target_server target_db

source_userid source_password target_userid target_password dump_file

blk_size

source_server

Defines the name of the source Oracle System ID (for example, AEDB) or

Sybase server name (for example, SourceServer). For Sybase, the source server

name is defined in the interfaces file.

source_db

Defines the source Sybase database (for example, AEDB).

source_userid

Defines the user ID that is used to connect to the source Oracle System ID, or

Sybase server.

Note: On Oracle, use aedbadmin as the source user ID.

source_password

Defines the password that corresponds to the user ID that is used to connect to

the source Oracle System ID, or Sybase server.

target_server

Defines the target Oracle System ID (for example, AEDB2), or Sybase server

name (for example, DestinationServer). For Sybase, the target server name is

defined in the interfaces file.

Note: For Oracle, the source server must be different from the target server.

target_db

Defines the target Sybase database (for example, AEDB2).

Note: The autobcpDB script deletes all of the data in the target database and

replaces it with the data in the source database. If you want to save the data in

the target database, archive it before you run the autobcpDB script.

target_userid

Defines the user ID that is used to connect to the target Oracle System ID, or

Sybase server.

Note: On Oracle, use aedbadmin as the target user ID.

target_password

Defines the password that corresponds to the user ID that is used to connect to

the target Oracle System ID, or Sybase server.

dump_file

Defines the temporary file that is used in the transfer of data from one

database to the other database.

Note: Specify a file that is local to the computer where this script is running.

oracle_directory

Defines the path to the Oracle home directory.

blk_size

(Optional) Specifies the number of rows that can be inserted from the

dump_file to the destination database at a time.

Default: 5000

Note: The Sybase script uses the default value when you run it in the interactive

mode, or when you do not specify the blk_size value. Do not specify a large value

because the transaction log encounters problems when it becomes too full.

The event servers are synchronized.
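Because the bulk copy scripts take many positional arguments, it is easy to pass them in the wrong order. The sketch below assembles the argument list in the order shown in step 8 and enforces the rule that the Oracle source and target servers must differ. The function name build_autobcp_command and the vendor labels "oracle" and "sybase" are hypothetical, and the function only builds the list without running the script.

```python
# Script names and argument order as listed in step 8.
ARGUMENT_ORDER = {
    "oracle": ("autobcpORA.pl", [
        "source_server", "target_server", "source_userid", "source_password",
        "target_userid", "target_password", "dump_file", "oracle_directory",
    ]),
    "sybase": ("autobcpSYB.pl", [
        "source_server", "source_db", "target_server", "target_db",
        "source_userid", "source_password", "target_userid", "target_password",
        "dump_file",
    ]),
}


def build_autobcp_command(vendor, blk_size=None, **values):
    """Return the command as an argument list; it is not executed here."""
    if vendor not in ARGUMENT_ORDER:
        raise ValueError(f"unsupported vendor: {vendor!r}")
    script, names = ARGUMENT_ORDER[vendor]
    missing = [name for name in names if name not in values]
    if missing:
        raise ValueError("missing arguments: " + ", ".join(missing))
    if vendor == "oracle" and values["source_server"] == values["target_server"]:
        raise ValueError("for Oracle, the source and target servers must differ")
    command = ["perl", script] + [str(values[name]) for name in names]
    if vendor == "sybase" and blk_size is not None:
        # blk_size is optional; the script uses its default when it is omitted.
        command.append(str(blk_size))
    return command


if __name__ == "__main__":
    # All values below are placeholders.
    print(build_autobcp_command(
        "sybase", blk_size=1000,
        source_server="SourceServer", source_db="AEDB",
        target_server="DestinationServer", target_db="AEDB2",
        source_userid="user1", source_password="secret1",
        target_userid="user2", target_password="secret2",
        dump_file="/tmp/ae_dump",
    ))
```

An argument list in this form can be passed to a process launcher without shell quoting. Keep in mind that the script overwrites the target database, so archive the target first if its data matters.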

9. Restart the server processes for the instance. If you are operating in a

highly-available cluster environment, restart the services on the active node.

a. Open the operating system command prompt, and enter the following

commands:

unisrvcntr start waae_sched.$AUTOSERV

unisrvcntr start waae_server.$AUTOSERV

b. Repeat this action on every server machine in the instance. If you are operating

in a highly-available cluster environment, restart the components on the active

node only.

Note: In highly-available cluster environments, the scheduler and application

server actively execute work on the active node only. The cluster manager

prompts the scheduler or application server on one of the passive nodes to

begin executing tasks only when it detects a failure of the same component on

the active node.

The database is restored.

High-availability is restored. To maintain your highly-available environment, continue

monitoring the scheduler log and recovering from scheduler and database failures when

they occur.
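Step 5 of the UNIX procedure above (delete the modified configuration file, then rename the rollover backup) can be sketched as a single replace operation. This is a minimal illustration, assuming the file names config.$AUTOSERV and config.$AUTOSERV.rollover under $AUTOUSER as described in that step. The function name restore_rollover_config is hypothetical.

```python
import os
from pathlib import Path


def restore_rollover_config(autouser, autoserv):
    """Replace config.<instance> with its .rollover backup and return the path."""
    current = Path(autouser) / f"config.{autoserv}"
    backup = Path(autouser) / f"config.{autoserv}.rollover"
    if not backup.is_file():
        raise FileNotFoundError(f"no rollover backup found at {backup}")
    # os.replace overwrites the modified file, which covers both the delete
    # and the rename in one call.
    os.replace(backup, current)
    return current


if __name__ == "__main__":
    # Runs only when both environment variables are set in the sourced shell.
    autouser = os.environ.get("AUTOUSER")
    autoserv = os.environ.get("AUTOSERV")
    if autouser and autoserv:
        print(restore_rollover_config(autouser, autoserv))
```

As in the manual procedure, run this only after all server processes are stopped, and repeat it on every server machine in the instance.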

Recover the Failed Database on Windows

If you are using dual event server mode as your database failover solution and the

scheduler initiates a database rollover, dual event server mode is disabled.

Highly-available cluster environments do not function properly without a

highly-available database. A database rollover does not disable high-availability mode,

but operating without a failover solution for your database increases the risk of

downtime and data loss. To mitigate this risk, restore the failed database.

If you are operating with a clustered database and a database failure occurs, some

cluster managers require manual intervention to recover the failed database. In this

case, take one of the following actions:

■ Review messages issued by the cluster manager and database vendor, and follow

the instructions that the messages provide.

■ Consult the documentation for your cluster management software and your

database software, and the instructions for manually recovering from a database

failure.

■ Contact your database or network administrator.

Notes:

■ You can operate with a clustered database only when you have cluster

management software installed and are using a cluster aware database. Some

cluster managers restore a failed database automatically, so no action is required.

In this case, your cluster manager issues messages indicating that the database is

restored.

■ If you have cluster management software installed, we recommend that you set up

a highly-available cluster environment instead of configuring high-availability mode.

If you are using dual event server mode as your database failover solution and the

scheduler rolls over the database, CA Workload Automation AE begins operating in

single event server mode. To return to dual event server mode, recover from the failure

and reconfigure dual event servers.

Follow these steps:

1. Review the scheduler log file to determine the problem that caused the failure.

2. Consult your database software documentation, and follow the instructions for

manually resolving the problem that caused the database failure.

The failed database is recovered, and you can reconfigure dual event servers.

3. Log in to a CA Workload Automation AE machine in the instance with a client

installation and click Start, Programs, CA, Workload Automation AE, Command

Prompt (instance_name).

The instance command prompt opens.

4. Enter the following command:

sendevent -E STOP_DEMON -v ALL

All server processes that are running in the instance stop.

5. Restore, enable, and reconfigure the failed database (event server).

a. Review the scheduler log file to determine the problem that caused the failure.

b. Consult your database software documentation, and follow the instructions for

manually resolving the problem that caused the database failure.

The status of the failed database changes and you can reconfigure dual event

servers.

c. Click the Event Server icon on the toolbar in any CA Workload Automation AE

Administrator window.

The Event Server - CA Workload Automation AE Administrator window appears.

d. Clear the A Database Rollover Has Occurred check box, verify the event server

configuration settings and make any desired modifications, and then click

Apply.

Important! Ensure that the configuration settings for Event Server 2 are

identical to the configuration settings for Event Server 1.

The failed event server is restored and the configuration settings are saved, but

CA Workload Automation AE does not return to dual event server mode until

you synchronize the event servers.

6. Run the CA Workload Automation AE bulk copy script. The directory and script that

you use to synchronize the event servers following a database failure are the same as the

directory and script that you use to switch to dual event server mode when you did

not enable dual event servers during installation. The script that you run depends

on your database vendor.

■ Oracle

Open the %AUTOSYS%\dbobj\ORA directory and run the following script:

perl autobcpORA.pl source_server target_server source_userid

source_password target_userid target_password dump_file oracle_directory

■ Sybase

Open the %AUTOSYS%\dbobj\SYB directory and run the following script:

perl autobcpSYB.pl source_server source_db target_server target_db

source_userid source_password target_userid target_password dump_file

blk_size

■ Microsoft SQL Server

Open the %AUTOSYS%\dbobj\MSQ directory and run the following script:

perl autobcpMSQ.pl source_server source_db target_server target_db

source_userid source_password target_userid target_password dump_file

The event servers are synchronized.
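On Windows, each database vendor has its own directory and script under %AUTOSYS%\dbobj. The lookup below is a small sketch of that mapping. The vendor labels ("oracle", "sybase", "mssql") and the function name bulk_copy_script_path are hypothetical, and the installation path in the demo is a placeholder.

```python
import ntpath

# Directory and script names as listed in the step above.
BULK_COPY_SCRIPTS = {
    "oracle": ("ORA", "autobcpORA.pl"),
    "sybase": ("SYB", "autobcpSYB.pl"),
    "mssql": ("MSQ", "autobcpMSQ.pl"),
}


def bulk_copy_script_path(autosys_home, vendor):
    """Return the Windows path of the bulk copy script for the given vendor."""
    try:
        directory, script = BULK_COPY_SCRIPTS[vendor]
    except KeyError:
        raise ValueError(f"unsupported database vendor: {vendor!r}") from None
    # ntpath joins with backslashes regardless of the platform running this code.
    return ntpath.join(autosys_home, "dbobj", directory, script)


if __name__ == "__main__":
    # Placeholder installation directory standing in for %AUTOSYS%.
    print(bulk_copy_script_path(r"C:\AutoSysHome", "mssql"))
```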

7. Restart the server processes for the instance. If you are operating in a

highly-available cluster environment, restart the services on the active node.

a. Click Start, Programs, CA, Workload Automation AE, Administrator.

The Instance - CA Workload Automation AE Administrator window opens.

b. Select an instance from the Instance drop-down list in the Settings pane.

c. Click the Services icon on the toolbar.

A list of the services that are running on the instance appears.

d. Right-click the scheduler service and select Start, then right-click the

application server service and select Start.

e. If you are operating in a highly-available cluster environment, restart the

components on the active node only. Otherwise, repeat these actions on every

server machine in the instance.

Note: In highly-available cluster environments, the scheduler and application

server actively execute work on the active node only. The cluster manager

prompts the scheduler or application server on one of the passive nodes to

begin executing tasks only when it detects a failure of the same component on

the active node.

The database is recovered and CA Workload Automation AE returns to dual event

server mode.

High-availability is restored. To maintain your highly-available environment, continue

monitoring the scheduler log and recovering from scheduler and database failures when

they occur.

Thursday, 8 October 2015