Concepts & Terminology
Watch
A Watch is a monitoring mechanism designed to ensure that workflows or processes are running as expected. Workflows signal their operation to a Watch via API Check-ins.
There are two types of Watch: Notify Only and Missed Action.
Notify Only Watch
A Notify Only watch functions as a simple notification relay. Upon receiving a check-in, it triggers an alert to the team. Since similar notifications can be directly implemented within a workflow, the core value of Automation Watchdog lies primarily in its Missed Action watch functionality.
Missed Action Watch
A Missed Action watch operates by expecting regular Check-ins from the workflow, which confirm its ongoing status.
Watch States
When check-ins are received within the expected timeframe, the Watch maintains a Relax State, signaling normal operation. However, if a check-in is missed or delayed beyond the designated period, the Watch shifts to an Error State, automatically generating an alert to inform the team of a potential issue.
Error states persist through deactivation and activation. If a watch is deactivated while in an Error state, it remains in Error when reactivated. This ensures support teams can see whether issues were resolved or just masked by a deactivation cycle. To explicitly clear errors, use the Clear Errors action.
Watch Activations
Watches are Activated either by a Schedule, by an Activation Event, by Activate On Check-In for eligible non-scheduled watches, or by Activate On Deactivation from another watch. A schedule-activated watch remains active for the duration defined by the schedule. An event-activated Watch remains active until it is explicitly deactivated by a subsequent event. A Watch can only transition from the Relax state to the Error state, indicating a missed check-in, when it is active.
An active watch expects check-ins within either Fixed or Cascading time periods.
Check-ins on Inactive Watches
By default, check-ins on inactive watches are rejected. The watch must be activated before check-ins are accepted.
If Activate On Check-In is enabled on an eligible non-scheduled watch, the first valid check-in activates the condition and is then processed normally.
Activate On Check-In
Activate On Check-In allows an inactive non-scheduled Missed Action watch to activate itself from the first check-in instead of requiring a separate activation API call.
Supported Modes
- Standard
- Condition By Queue
- Condition By Machine
How It Works
- The watch starts inactive and Activate On Check-In is enabled.
- The first valid check-in arrives.
- The relevant watch condition is activated and its initial overdue tracking is set.
- That same check-in is processed as the first check-in for the run.
Notes
- This feature is only available for watches without a schedule.
- Queue and machine identifiers are still required when the selected watch mode needs them.
- If this feature is disabled, inactive check-ins are rejected until the watch is activated by API call, UI action, or schedule.
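The acceptance rule described above can be sketched as a small in-memory model. This is only an illustration of the documented behaviour, not the product's implementation: the `Watch` class, its field names, and `handle_check_in` are hypothetical and do not correspond to any documented API.

```python
from dataclasses import dataclass


@dataclass
class Watch:
    """Hypothetical stand-in for a Missed Action watch."""
    has_schedule: bool = False
    activate_on_check_in: bool = False
    active: bool = False
    check_ins: int = 0


def handle_check_in(watch: Watch) -> bool:
    """Return True if the check-in is accepted, False if it is rejected."""
    if not watch.active:
        # Only non-scheduled watches are eligible for Activate On Check-In.
        eligible = watch.activate_on_check_in and not watch.has_schedule
        if not eligible:
            return False  # default: check-ins on inactive watches are rejected
        watch.active = True  # the first valid check-in activates the watch
    watch.check_ins += 1  # the same check-in counts as the first of the run
    return True
```

With the feature disabled, the first call returns False and the watch stays inactive. With it enabled on a non-scheduled watch, the first call both activates the watch and records the check-in.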
Activate On Deactivation
Activate On Deactivation allows one watch to activate another watch when the source watch finishes and deactivates. In the watch form, this is configured with the Activate Watch field.
This is useful for handoff-style workflows. For example, a performer watch can activate a reporter watch when all the performer workers have deactivated after processing work from a queue.
How It Works
- The source watch is configured with a target watch in Activate Watch.
- The source watch runs normally.
- When the source watch deactivates and becomes fully inactive, Automation Watchdog automatically sends an activation request to the target watch.
- The target watch starts its own monitoring cycle using its own configuration.
Notes
- The target watch must be a valid watch in the same organization.
- The target watch must not have a schedule.
- The target watch is activated in the same environment as the source watch deactivation.
- For queue- and machine-based watches, the target activation happens when the source watch actually becomes inactive. If other queues or machines are still active, the linked activation does not fire yet.
- This feature chains activation between watches. It does not pass queue or machine values into the target watch.
Deactivate-By Deadline
A Deactivate-By Deadline sets a time limit on an activation. If the watch condition has not been deactivated by the deadline, the watch enters an error state with the deactivateByDeadline error reason.
Deadlines can be specified when activating a watch condition:
- Relative deadline: a duration from the activation time, e.g., +30m (30 minutes), +2h (2 hours), +1d (1 day)
- Absolute deadline: an ISO 8601 timestamp, e.g., 2026-01-25T10:00:00Z
This is useful for workflows that must complete within a known time window. For example, if a set of performers should complete the work that was loaded into the queue within 3 hours, activate the watch with a +3h deadline.
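To make the two deadline formats concrete, here is a minimal sketch of how a client might turn either form into an absolute time. How Automation Watchdog itself parses deadlines is not documented here, so `resolve_deadline` is a hypothetical helper, and it assumes only the three unit suffixes shown above (m, h, d).

```python
import re
from datetime import datetime, timedelta, timezone

# Unit suffixes taken from the examples above; other suffixes are not assumed.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def resolve_deadline(spec: str, activated_at: datetime) -> datetime:
    """Resolve a relative (+3h) or absolute (ISO 8601) deadline string."""
    match = re.fullmatch(r"\+(\d+)([mhd])", spec)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        return activated_at + timedelta(**{_UNITS[unit]: amount})
    # Absolute form. Older Python versions reject a trailing "Z",
    # so it is rewritten as an explicit UTC offset first.
    return datetime.fromisoformat(spec.replace("Z", "+00:00"))


activated = datetime(2026, 1, 25, 7, 0, tzinfo=timezone.utc)
print(resolve_deadline("+3h", activated))
print(resolve_deadline("2026-01-25T10:00:00Z", activated))
```

For an activation at 07:00 UTC on 2026-01-25, both calls resolve to the same instant, 10:00 UTC on that day.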
Watch Check-in Strategy
There are two strategies for watch check-ins:
Fixed Period
A Fixed Period of 15 minutes means a check-in is required at least once every 15 minutes. For example, a check-in may be required between 9:00 - 9:15, 9:15 - 9:30, 9:30 - 9:45 and so forth.
Cascading Period
A Cascading Period of 15 minutes means a check-in is expected within 15 minutes of the last check-in received. For example, if a check-in is required by 9:15 and one is received at 9:07, the next check-in is required by 9:22.
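The difference between the two strategies is easiest to see by computing the next deadline for the same check-in. The sketch below uses hypothetical function names and rests on one assumption that the text does not state outright: in a Fixed Period watch, windows are counted from a known start time, and a check-in satisfies the window it falls in, so the next one is due by the end of the following window.

```python
from datetime import datetime, timedelta


def next_deadline_cascading(check_in: datetime, period: timedelta) -> datetime:
    """Cascading: the deadline moves to one period after the last check-in."""
    return check_in + period


def next_deadline_fixed(window_start: datetime, check_in: datetime,
                        period: timedelta) -> datetime:
    """Fixed: the next check-in is due by the end of the following window."""
    # Index of the window that contains this check-in (0 for the first window).
    index = (check_in - window_start) // period
    # That window is satisfied, so the next deadline is the end of window index + 1.
    return window_start + (index + 2) * period


period = timedelta(minutes=15)
start = datetime(2026, 1, 25, 9, 0)
check_in = datetime(2026, 1, 25, 9, 7)

print(next_deadline_cascading(check_in, period))      # 2026-01-25 09:22:00
print(next_deadline_fixed(start, check_in, period))   # 2026-01-25 09:30:00
```

A check-in at 9:07 pushes the Cascading deadline to 9:22, while the Fixed deadline stays pinned to the 9:15 - 9:30 window boundary.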
Check-in Outcomes
Each check-in can optionally include an outcome to report whether the workflow step succeeded or failed:
| Outcome | Value | Description |
|---|---|---|
| Success | 1 | The workflow step completed successfully |
| Failure | 2 | The workflow step encountered an error |
When no outcome is provided, the check-in still satisfies the watch's timing requirements but does not affect threshold monitoring. Outcomes are used by Threshold Monitoring to evaluate the quality of your workflow over the current run.
Threshold Monitoring
Threshold Monitoring evaluates the quality of your workflow's check-ins beyond simple timing. When enabled, the watch tracks recent outcomes in a sliding window and enters an error state when quality drops below configured limits.
Configuration
| Setting | Description |
|---|---|
| Window Size | Maximum number of recent outcomes to evaluate (up to 1000) |
| Minimum Observations | Number of outcomes required before evaluation begins |
| Threshold Percent | Minimum success rate percentage required (e.g., 80 means at least 80% of outcomes must be successes) |
| Consecutive Failure Limit | Maximum consecutive failures allowed before triggering an error |
How It Works
- Each check-in with an outcome is added to a rolling buffer of recent outcomes.
- Once the number of outcomes reaches Minimum Observations, evaluation begins.
- Percent rule: If the success rate in the window drops below Threshold Percent, the error reason thresholdPercent is added.
- Consecutive rule: If consecutive failures reach or exceed Consecutive Failure Limit, the error reason thresholdConsecutive is added.
- Both rules are evaluated independently, so a watch can have both error reasons active simultaneously.
- Threshold counters persist across activation, deactivation, and occurrence boundaries. Use the Clear Errors action to reset them, or for scheduled watches enable Reset on Good Standing to start a new occurrence with fresh counters when the condition is already in good standing.
Note: Threshold monitoring works with all watch modes including Condition By Machine and Condition By Queue, where each machine or queue maintains its own independent threshold counters.
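The two threshold rules can be sketched with a bounded buffer. This is an illustrative model only: the `ThresholdMonitor` class is hypothetical, and it assumes that both rules wait for Minimum Observations before they are evaluated, which is one reading of the steps above. The outcome values and error-reason names are the ones given in this document.

```python
from collections import deque
from typing import Optional, Set

SUCCESS, FAILURE = 1, 2  # outcome values from the Check-in Outcomes table


class ThresholdMonitor:
    """Hypothetical sliding-window model of Threshold Monitoring."""

    def __init__(self, window_size: int, min_observations: int,
                 threshold_percent: float, consecutive_failure_limit: int):
        self.outcomes = deque(maxlen=window_size)  # oldest outcomes drop off
        self.min_observations = min_observations
        self.threshold_percent = threshold_percent
        self.consecutive_failure_limit = consecutive_failure_limit
        self.consecutive_failures = 0

    def record(self, outcome: Optional[int] = None) -> Set[str]:
        """Record one check-in and return the active threshold error reasons."""
        if outcome is not None:  # check-ins without an outcome are ignored here
            self.outcomes.append(outcome)
            if outcome == SUCCESS:
                self.consecutive_failures = 0
            else:
                self.consecutive_failures += 1
        return self.error_reasons()

    def error_reasons(self) -> Set[str]:
        reasons: Set[str] = set()
        if not self.outcomes or len(self.outcomes) < self.min_observations:
            return reasons  # not enough data to evaluate yet
        successes = sum(1 for o in self.outcomes if o == SUCCESS)
        if 100 * successes / len(self.outcomes) < self.threshold_percent:
            reasons.add("thresholdPercent")
        if self.consecutive_failures >= self.consecutive_failure_limit:
            reasons.add("thresholdConsecutive")
        return reasons
```

With a window of 5, a minimum of 3 observations, an 80% threshold, and a consecutive limit of 2, the sequence success, failure, failure raises both reasons on the third check-in. A following success clears thresholdConsecutive but leaves thresholdPercent active, because the success rate is still only 50%.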
Reset on Good Standing
When Reset on Good Standing is enabled, threshold counters reset at the start of a new scheduled occurrence only if the condition is already in good standing: Relax state with no error reasons.
This allows a new scheduled run to start with a fresh threshold window without silently clearing unresolved issues from the previous run.
A normal successful check-in does not wipe the threshold window. Successful check-ins continue the rolling window and only reset the consecutive failure streak.
If the condition is still in Error, the counters and error state are preserved until the workflow recovers naturally or the team explicitly uses Clear Errors.
Expected Machine Count
Some workflows expect more than one machine to participate in the same run. In those cases, Expected Machine Count can be configured to require a minimum number of machines to report before the watch is considered healthy.
How It Works
- The watch is activated with an effective expected machine count, either from the watch configuration or an activation-time override.
- Machines begin checking in as normal.
- If fewer than the expected number of machines have reported, the watch does not clear its overdue tracking yet.
- If the watch reaches its overdue deadline before enough machines have reported, it enters error with the Expected Machines error reason.
- Once the required number of machines report, normal recovery behavior resumes and the watch can return to Relax if no other error reasons remain.
Supported Modes
- Standard multi-machine: Any machine can still satisfy the watch timing, but the watch can be configured to wait for a minimum number of active machines before clearing overdue state.
- Condition By Machine: The placeholder/aggregated condition waits for the expected number of machines, while each machine still tracks its own independent state.
Notes
- This feature is most useful when a process is expected to fan out across a known number of workers.
- Condition By Queue is not currently included in this feature.
- Deactivation clears the active expected-machine requirement for the current condition.
- Clear Errors also removes the current expected-machine gate from the condition being cleared.
Watch Error Reasons
A watch condition tracks multiple independent Error Reasons simultaneously. The condition is in an error state when one or more reasons are active, and returns to a relax state when all reasons are cleared.
| Error Reason | Trigger | Cleared When |
|---|---|---|
| Overdue | Check-in not received before the expected deadline | A check-in is received |
| Threshold Percent | Success rate in the sliding window drops below the configured percentage | Success rate recovers above the threshold |
| Threshold Consecutive | Consecutive failures reach or exceed the configured limit | A successful check-in resets the consecutive failure streak |
| Expected Machines | Fewer than the configured number of machines have reported before the condition becomes overdue | The configured number of machines report, or Clear Errors is used |
| Deactivate-By Deadline | The activation deadline passes without the condition being deactivated | Clear Errors action is used |
Multiple error reasons can be active at the same time. For example, a watch condition can be both overdue (no check-in received) and over the consecutive failure limit.
Clear Errors
The Clear Errors action explicitly resets a watch condition to a clean state:
- State returns to Relax
- All error reasons are removed
- All threshold counters are zeroed (success count, failure count, consecutive failures, outcome buffer)
Clear Errors is available on both active and inactive conditions. It does not change the activation state; it only clears errors. This is the only way to remove persistent error reasons like Deactivate-By Deadline.
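The relationship between error reasons, state, deactivation, and Clear Errors can be modelled in a few lines. This is a sketch under stated assumptions rather than the product's data model: `WatchCondition` and its methods are hypothetical, and the reason name `overdue` is an illustrative identifier (only `deactivateByDeadline`, `thresholdPercent`, and `thresholdConsecutive` are named in this document).

```python
from dataclasses import dataclass, field
from typing import List, Set

# Per the table above, only Clear Errors removes this reason.
PERSISTENT_REASONS = {"deactivateByDeadline"}


@dataclass
class WatchCondition:
    """Hypothetical model of one watch condition."""
    active: bool = True
    error_reasons: Set[str] = field(default_factory=set)
    outcome_buffer: List[int] = field(default_factory=list)
    consecutive_failures: int = 0

    @property
    def state(self) -> str:
        # Error while any reason is active, Relax once all are cleared.
        return "error" if self.error_reasons else "relax"

    def resolve(self, reason: str) -> None:
        """Clear one reason through normal operation, e.g. a check-in arrives."""
        if reason not in PERSISTENT_REASONS:
            self.error_reasons.discard(reason)

    def deactivate(self) -> None:
        self.active = False  # error reasons are deliberately left in place

    def clear_errors(self) -> None:
        self.error_reasons.clear()
        self.outcome_buffer.clear()
        self.consecutive_failures = 0
        # The activation state is not changed.


condition = WatchCondition(error_reasons={"overdue", "deactivateByDeadline"})
condition.resolve("overdue")   # a check-in arrives
print(condition.state)         # error: the deadline reason is still active
condition.clear_errors()
print(condition.state)         # relax
```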
Handling Errors, Clearing Them, and Alerts
Use the following rules when deciding whether to deactivate a watch, let it recover naturally, or use Clear Errors:
| Situation | What Happens |
|---|---|
| A watch enters an error state | The condition moves to Error and an alert email is sent to the organization's mailing list |
| Additional error reasons are added while already in error | An updated error email is sent showing the newly added reason(s) |
| One error reason clears but other error reasons remain | The condition stays in Error and an updated error email is sent showing the resolved reason(s) and any remaining active reasons |
| All error reasons clear through normal operation | The condition returns to Relax and a relax email is sent |
| A watch is deactivated while still in error | The watch becomes inactive, operational timing fields such as Overdue At are cleared, but the condition remains in Error so unresolved issues stay visible |
| Clear Errors is used on an inactive or active condition | The condition is reset to Relax, all error reasons are removed, threshold counters are reset, and the activation state is unchanged |
Practical guidance:
- Use deactivate when the workflow should stop being monitored for now, but you still want the current issue to remain visible.
- Use Clear Errors when you want to explicitly acknowledge and reset the condition.
- Use a normal successful check-in when you want the watch to recover naturally and preserve the monitoring history of what resolved.
- For Condition By Machine, alerts can come from an individual machine condition and from the placeholder/aggregated watch condition depending on what is in error.
For active watches, errors are normally expected to clear through the workflow's own behavior. For example, an Overdue error clears when the required check-in arrives, and threshold-based errors clear when the necessary successful check-ins bring the condition back into good standing. When that happens, the correct system behavior is to return the condition to Relax automatically.
If a watch is deactivated while still in Error, a manual Clear Errors action is required to acknowledge and reset it. This is intentional. Deactivation means monitoring for that period has stopped, so even if the underlying workflow later runs and would otherwise have resolved the issue, the system still requires a human acknowledgment. This helps ensure the team has explicitly seen and accepted that the problem existed, rather than silently clearing an issue after monitoring was intentionally turned off.
Advanced Concepts
Multi-Machine Workflow
Automation Watchdog provides robust monitoring for workflows running on any number of machines. Here's how it handles multi-machine scenarios:
Machine Identification: Machines identify themselves to the Watch through API check-ins. This allows the Watch to track each machine's activity individually.
Watch Activation/Deactivation:
- The Watch is activated either directly or when a machine for the watch is activated.
- The overall Watch remains active as long as at least one machine is active.
- Individual machines handle their own deactivation when they've completed their assigned tasks for the current iteration.
- The Watch deactivates only after all active machines have deactivated.
Dynamic Tracking: The Watch dynamically adapts to situations where only a subset of machines is active. It focuses on tracking only those machines involved in the current iteration, whether they've provided check-ins or were explicitly activated.
Check-in Application:
- The Watch applies check-in requirements to the overall Watch, not to each individual machine independently.
- For example, in a Cascading Watch with a 5-minute period: if Machine 1 checks in, the next expected check-in is 5 minutes later. If Machine 2 checks in shortly after, the next expected check-in moves to 5 minutes after Machine 2's check-in.
- The Watch maintains a single expected check-in time, not a separate one for each machine.
- For independent per-machine monitoring, see Condition By Machine.
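The shared-deadline behaviour described above can be sketched as follows. The `SharedDeadlineWatch` class and its method names are hypothetical and model only a Cascading watch in standard multi-machine mode; they are not part of any documented API.

```python
from datetime import datetime, timedelta
from typing import Optional, Set


class SharedDeadlineWatch:
    """Hypothetical model: one deadline shared by every machine on the watch."""

    def __init__(self, period: timedelta):
        self.period = period
        self.active_machines: Set[str] = set()
        self.next_due: Optional[datetime] = None  # single expected check-in time

    @property
    def active(self) -> bool:
        # The watch stays active while at least one machine is active.
        return bool(self.active_machines)

    def activate(self, machine: str, at: datetime) -> None:
        self.active_machines.add(machine)
        self.next_due = at + self.period

    def check_in(self, machine: str, at: datetime) -> None:
        if not self.active:
            raise RuntimeError("check-ins on an inactive watch are rejected")
        self.active_machines.add(machine)  # a new machine joins the current run
        self.next_due = at + self.period   # any machine moves the shared deadline

    def deactivate(self, machine: str) -> None:
        self.active_machines.discard(machine)
        if not self.active_machines:
            self.next_due = None  # the watch deactivates after the last machine


watch = SharedDeadlineWatch(timedelta(minutes=5))
start = datetime(2026, 1, 25, 9, 0)
watch.activate("machine-1", start)
watch.check_in("machine-1", start + timedelta(minutes=1))
watch.check_in("machine-2", start + timedelta(minutes=2))
print(watch.next_due)  # 2026-01-25 09:07:00
```

Machine 2's check-in at 9:02 replaces the deadline that Machine 1 set, so the watch as a whole is next due at 9:07.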
Condition By Queue
Watches can be configured to monitor a workflow that operates across multiple queues. This is particularly useful when a single workflow is reused for different queues. Instead of creating a separate Watch for each queue, you can use the 'Condition By Queue' setting. Here's how it works:
Queue Identification: The Queue is identified in the API call. This allows the Watch to track activity for each queue individually.
Individual Tracking: The Watch monitors each queue's check-ins and tracks its progress. The check-in requirements are applied to each queue individually.
Watch Activation/Deactivation:
- The Watch is activated either directly or when a queue for the watch is activated.
- The overall Watch remains active as long as at least one queue is active.
- Individual queues handle their own deactivation when they've completed their assigned tasks for the current iteration.
- The Watch deactivates only after all active queues have deactivated.
Dynamic Tracking: The Watch dynamically adapts to situations where only a subset of queues is active. It focuses on tracking only those queues involved in the current iteration, whether they've provided check-ins or were explicitly activated.
Check-in Application:
- The Watch applies check-in requirements to each active queue in the Watch.
- For example, in a Cascading Watch with a 5-minute period: if Queue 1 checks in, the next expected check-in is 5 minutes later only for Queue 1. If Queue 2 does not check in within its own 5 minutes, Queue 2 transitions to an Error state.
- This Watch also supports multi-machine workflows for each Queue.
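In contrast to the shared deadline of standard multi-machine mode, Condition By Queue keeps one deadline per queue. The sketch below illustrates that for a Cascading period. The `QueueConditions` class is hypothetical, and it assumes a queue counts as overdue once the current time is strictly past its deadline.

```python
from datetime import datetime, timedelta
from typing import Dict, List


class QueueConditions:
    """Hypothetical model: an independent Cascading deadline per queue."""

    def __init__(self, period: timedelta):
        self.period = period
        self.next_due: Dict[str, datetime] = {}  # active queues only

    @property
    def active(self) -> bool:
        # The overall watch is active while at least one queue is active.
        return bool(self.next_due)

    def activate(self, queue: str, at: datetime) -> None:
        self.next_due[queue] = at + self.period

    def check_in(self, queue: str, at: datetime) -> None:
        if queue not in self.next_due:
            raise KeyError(f"queue {queue!r} is not active")
        self.next_due[queue] = at + self.period  # only this queue's deadline moves

    def deactivate(self, queue: str) -> None:
        self.next_due.pop(queue, None)

    def overdue(self, now: datetime) -> List[str]:
        """Queues whose deadline has passed (assumed: strictly after the deadline)."""
        return sorted(q for q, due in self.next_due.items() if now > due)


watch = QueueConditions(timedelta(minutes=5))
start = datetime(2026, 1, 25, 9, 0)
watch.activate("queue-1", start)
watch.activate("queue-2", start)
watch.check_in("queue-1", start + timedelta(minutes=4))
print(watch.overdue(start + timedelta(minutes=6)))  # ['queue-2']
```

Queue 1's check-in at 9:04 extends only its own deadline to 9:09, so at 9:06 Queue 2 alone is overdue.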
Condition By Machine
Condition By Machine (CBM) provides independent per-machine monitoring. Unlike standard multi-machine mode where a single check-in deadline is shared across all machines, CBM gives each machine its own watch condition with independent state, deadlines, and threshold tracking.
How It Works
Independent Conditions: Each machine has its own watch condition that tracks state (relax or error), check-in deadline, and threshold counters independently.
Self-Registration: Machines register automatically when they first check in. If the watch is already active, the machine joins that active run. If Activate On Check-In is enabled on a non-scheduled CBM watch, the first machine check-in can activate the placeholder and register the machine in the same request.
Aggregated State: The overall watch state is determined by aggregation. The watch is in an error state if any machine is in an error state, and in a relax state only when all machines are in a relax state.
Independent Deactivation: Each machine deactivates independently when it completes its work. The overall watch deactivates when all machines have deactivated.
Expected Machine Count Support: The placeholder condition can optionally require a configured number of machines to report before it clears its overdue state.
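The aggregation and deactivation rules above reduce to two small checks. This is a minimal sketch with hypothetical function names, using plain dictionaries in place of real machine conditions.

```python
from typing import Dict


def aggregate_state(machine_states: Dict[str, str]) -> str:
    """Overall state: error if any machine is in error, otherwise relax."""
    return "error" if "error" in machine_states.values() else "relax"


def watch_is_active(machine_active: Dict[str, bool]) -> bool:
    """The overall watch deactivates only when every machine has deactivated."""
    return any(machine_active.values())


states = {"machine-1": "relax", "machine-2": "error", "machine-3": "relax"}
print(aggregate_state(states))  # error

states["machine-2"] = "relax"
print(aggregate_state(states))  # relax
```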
When to Use CBM vs Standard Multi-Machine
| Scenario | Recommended Mode |
|---|---|
| Any machine checking in satisfies the watch | Standard Multi-Machine |
| Each machine must independently meet check-in requirements | Condition By Machine |
| You need per-machine error tracking and alerting | Condition By Machine |
| Simple "at least one machine is working" monitoring | Standard Multi-Machine |
Tip: For CBM workflows, either activate the watch first through the API or a schedule, or enable Activate On Check-In for non-scheduled watches so the first machine check-in can activate and register itself.