Recovery & Failover

Obsidian ensures continuity in the execution of your scheduled jobs by providing a variety of built-in recovery and failover mechanisms:

  • Multiple concurrent hosts support
  • Instance outage recovery
  • Configurable recovery options by job
  • Resubmission of abnormally terminating jobs
  • Standby nodes


Multiple Concurrent Hosts

Obsidian can run out of the box with multiple hosts via clustering. As long as at least one host is running and jobs are not constrained to specific hosts, a server failure will not prevent jobs from running on schedule. Any jobs running when a server fails are recovered as described in Instance Outage Recovery. No special configuration is required to run multiple hosts, but you do need paid licenses after the first free node. Just start up a node and it joins the available service pool.

Instance Outage Recovery

When a server fails, any jobs that were running on it cannot complete normally. The remaining hosts detect that those jobs have had no activity and mark them as Died, which allows other hosts to run the job at its subsequent scheduled times. Because no special configuration is required to run multiple hosts, once the cause of the failure is resolved you can simply start the server again and it rejoins the pool. A restarted instance reuses its license as long as the license has not been claimed by another host.
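
The detection step described above can be pictured with a minimal sketch. It is not Obsidian's implementation: the function name `mark_died`, the timestamp dictionary and the `inactivity_limit` threshold are all assumptions made for illustration.

```python
def mark_died(last_activity_by_job, now, inactivity_limit):
    """Classify running jobs by how long ago they last reported activity.

    last_activity_by_job: hypothetical mapping of job id -> last activity time (seconds).
    inactivity_limit: hypothetical threshold (seconds) after which a job counts as dead.
    """
    statuses = {}
    for job_id, last_activity in last_activity_by_job.items():
        silent_for = now - last_activity
        # A job that has been silent for longer than the limit is treated as Died.
        statuses[job_id] = "Died" if silent_for > inactivity_limit else "Running"
    return statuses


if __name__ == "__main__":
    print(mark_died({"report": 100, "cleanup": 10}, now=120, inactivity_limit=60))
```

Once a job is classified as Died in this model, nothing blocks another host from picking up its next scheduled run.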

Configurable Job Recovery

A recovery mode can be specified per job to determine what happens when no instance was available to execute its scheduled runs. When an instance next starts up, all missed scheduled run times are evaluated against each job's configured recovery mode.

  • All recovery means that all missed runs will be executed. This logically requires the job to have access to the scheduled run time, which can be retrieved from the Context. See Implementing Jobs for details.
  • None recovery means no recovery takes place. Missed jobs will be marked as such.
  • Last recovery means only the last missed run will be executed.
  • Conflicted recovery means that a conflicted job will be run once higher priority conflicts clear, regardless of the time elapsed since it was originally scheduled. See Conflicts for more details.
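
As a rough illustration of how the All, None and Last modes differ, the sketch below picks which missed run times would be executed. The function name and the string constants are hypothetical, not Obsidian identifiers, and Conflicted recovery is left out because it depends on conflict priorities rather than on missed run times.

```python
from datetime import datetime


def runs_to_recover(mode, missed_run_times):
    """Return the missed scheduled run times that would be executed on startup."""
    ordered = sorted(missed_run_times)
    if mode == "ALL":
        return ordered        # every missed run is executed, oldest first
    if mode == "LAST":
        return ordered[-1:]   # only the most recent missed run (empty if none)
    if mode == "NONE":
        return []             # nothing is run; the runs are only marked as missed
    raise ValueError(f"unsupported recovery mode: {mode}")


if __name__ == "__main__":
    missed = [datetime(2024, 1, 1, hour) for hour in (1, 2, 3)]
    for mode in ("ALL", "LAST", "NONE"):
        print(mode, runs_to_recover(mode, missed))
```

With All recovery each recovered run receives its own scheduled time, which is why the job needs to read that time rather than rely on the current clock.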

Startup / Shutdown Mode

Obsidian supports several modes for automatically triggering jobs on startup and shutdown. The job must be in an executable state (e.g. not Disabled and not Chain Active). These jobs run immediately after startup and just prior to shutdown of individual Obsidian nodes or of the entire cluster.

  • None indicates that no special startup or shutdown behaviour will be in effect.
  • Host Startup indicates the job will run once on each host immediately after startup.
  • Host Shutdown indicates the job will run once on each host immediately prior to completion of shutdown requests.
  • Host Both indicates the job will run once on each host immediately after startup and once on each host immediately prior to completion of shutdown requests.
  • Cluster Startup indicates the job will run once after the first host in a cluster starts up.
  • Cluster Shutdown indicates the job will run once immediately prior to completion of the shutdown request on the last running host.
  • Cluster Both indicates the job will run once after the first host in a cluster starts up and once immediately prior to completion of the shutdown request on the last running host.
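
The modes listed above can be read as a small decision table. The sketch below is one way to model it, assuming hypothetical mode names such as `HOST_STARTUP` and a flag saying whether this host sees any other host as running; it is not Obsidian's own logic.

```python
VALID_MODES = {
    "NONE",
    "HOST_STARTUP", "HOST_SHUTDOWN", "HOST_BOTH",
    "CLUSTER_STARTUP", "CLUSTER_SHUTDOWN", "CLUSTER_BOTH",
}


def should_run(mode, event, other_hosts_running):
    """Decide whether a job with the given mode runs for a 'startup' or 'shutdown' event."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    if event not in ("startup", "shutdown"):
        raise ValueError(f"unknown event: {event}")
    if mode == "NONE":
        return False
    scope, phase = mode.split("_")
    if phase not in ("BOTH", event.upper()):
        return False          # e.g. a startup-only job during shutdown
    if scope == "HOST":
        return True           # host-scoped jobs run on every host
    # Cluster-scoped jobs run only on the first host up or the last host down.
    return not other_hosts_running


if __name__ == "__main__":
    print(should_run("CLUSTER_SHUTDOWN", "shutdown", other_hosts_running=True))
    print(should_run("CLUSTER_SHUTDOWN", "shutdown", other_hosts_running=False))
```

In this model a cluster-scoped shutdown job is skipped whenever another host still appears to be running, so the order in which nodes are stopped matters.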

When using the Cluster Shutdown and Cluster Both modes, it is important that nodes do not recognize other nodes as still running during shutdown. The best way to ensure these jobs run as expected is to make sure each node is completely shut down before triggering shutdown on the next, continuing until all nodes are shut down.

As of Obsidian 4.10.0

Resubmission of Abnormally Terminating Jobs

Obsidian allows you to attempt re-execution of any job that did not complete normally, whatever the reason for the failure. You can correct the conditions that caused the failure and resubmit the job for execution. A user with admin rights can also configure which abnormal termination states allow resubmission.

Auto Retry Abnormally Terminating Jobs

Obsidian supports automatic retries for abnormally terminating jobs, up to a configured maximum number of retries or until the job is manually resubmitted, reaches a normally scheduled run time or is chained to. You may also specify the minimum interval in minutes between retries and whether the interval should increase exponentially between subsequent retries.
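
To make the effect of the exponential option concrete, the sketch below lists the wait before each retry. The doubling factor is an assumption, since this page does not state the growth rate Obsidian uses, and the function and parameter names are illustrative only.

```python
def retry_delays(min_interval_minutes, max_retries, exponential=False):
    """Return the wait in minutes before each retry attempt, in order."""
    delays = []
    for attempt in range(max_retries):
        # Assumed growth: the interval doubles with each retry when exponential is on.
        factor = 2 ** attempt if exponential else 1
        delays.append(min_interval_minutes * factor)
    return delays


if __name__ == "__main__":
    print(retry_delays(5, 4))                    # [5, 5, 5, 5]
    print(retry_delays(5, 4, exponential=True))  # [5, 10, 20, 40]
```

Under this assumption four retries at a 5 minute minimum span 20 minutes of waiting with a fixed interval and 75 minutes with exponential growth.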

Standby Nodes

Obsidian also supports standby nodes. Standby nodes are a strategy for ensuring ongoing load balancing and distribution even after instance failures. When an instance fails, your standby nodes join the serving pool once the license lease expires and share in the load of executing your jobs.

If you choose to limit your licensing to a single node, you can still achieve limited failover by keeping a standby node running. The standby node will not be able to obtain a license and will therefore not run any jobs. If the primary node fails, the standby node can claim the license once its lease has expired and start running jobs. Carfey Software encourages the use of multiple licensed hosts in your production environment if at all possible, as you can suffer an outage while the standby node waits for the lease to expire.
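
The single-license failover described above can be modelled as a lease that only its current holder may renew. This is a sketch under stated assumptions: the `License` class, its fields and the 60 second lease are hypothetical and do not reflect how Obsidian stores or times its licenses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class License:
    holder: Optional[str] = None
    lease_expires_at: float = 0.0

    def try_claim(self, host, now, lease_seconds):
        """Claim or renew the license; return True if the host now holds it."""
        is_free = self.holder is None
        is_renewal = self.holder == host
        is_expired = now >= self.lease_expires_at
        if is_free or is_renewal or is_expired:
            self.holder = host
            self.lease_expires_at = now + lease_seconds
            return True
        return False  # another host still holds an unexpired lease


if __name__ == "__main__":
    lic = License()
    print(lic.try_claim("primary", now=0, lease_seconds=60))   # True
    print(lic.try_claim("standby", now=30, lease_seconds=60))  # False
    # The primary fails and stops renewing; its lease runs out at t=60.
    print(lic.try_claim("standby", now=60, lease_seconds=60))  # True
```

The gap between the primary's failure and the lease expiry is the window in which no node runs jobs, which is the outage risk the recommendation above refers to.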