When a self-hosted Step Runner fails, it's crucial to diagnose the issue promptly. This guide provides a systematic approach to identifying and addressing common problems, so that you can either resolve them yourself or gather the right information before contacting support.

To view the list of Step Runners in your workspace, navigate to Build > Integrations > Step Runner.

Diagnose Step Runner issues

Check Step Runner health

The minimum version requirement for the in-product diagnostics is v25.06.4.

To check the health of your Step Runners:

Navigate to the Step Runner: Go to Build > Integrations and select Step Runner.
Check the diagnostics: In the Status Details column, click Show Diagnostics.

The diagnostic collector reports six error conditions:

Docker disk space is above the threshold 
Docker CPU is above the threshold 
Docker memory is above the threshold 
Kubernetes CPU is above the threshold 
Kubernetes memory is above the threshold 
URL connectivity error

Diagnostic data is reported every two minutes. If after five minutes the Runner hasn't communicated, its status changes from Healthy to Unhealthy and a Runner health system event is logged. Both Healthy to Unhealthy and Unhealthy to Healthy state changes are logged as system events, which can be used to trigger workflows via the Runner status change trigger.
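
The timing rule above can be sketched as a small shell function. This is only an illustration: the function name runner_status is hypothetical, and treating exactly five minutes of silence as still Healthy is an assumption, not Torq's documented behavior.

```shell
# Hypothetical helper: classify a Runner from the time of its last report.
# Arguments: last report time and current time, both as Unix epoch seconds.
runner_status() {
  last_report=$1
  now=$2
  threshold=300   # five minutes, per the rule described above
  if [ $(( now - last_report )) -gt "$threshold" ]; then
    echo "Unhealthy"
  else
    echo "Healthy"
  fi
}
```

For example, runner_status 1700000000 "$(date +%s)" classifies a Runner whose last report arrived at epoch time 1700000000.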

To resolve connectivity and resource-allocation issues, see Remediate Step Runner issues.

Analyze Step Runner failures in workflows

A failed step's Execution Log often provides initial clues about the reason for the failure.

To analyze failed Runner-instantiated step executions in workflows:

Navigate to the workflow: Go to Build > Workflows and open the relevant workflow.
Check the Execution Log: Click the failed step and select the Execution Log tab. Connectivity issues between the Runner and Torq servers may be evident.

Other connectivity errors or timeouts may indicate problems inside your own network. Verify that no firewall is blocking the Runner's access to the end-user application.

Audit self-hosted Step Runners in Kubernetes

When workflows are running, steps are initiated by the Step Runner as jobs in the torq Kubernetes namespace (unless manually altered in the installation YAML). The Step Runner is a Kubernetes pod running in the same namespace.

Kubernetes is assumed as the default Step Runner host.

To audit Runner activity:

Retrieve the list of events: Run kubectl get events --namespace=torq to get the list of events that took place in the namespace. The output should include events (reasons) such as Pulled, Created, Started, Pulling, and Scheduled.
View step-execution jobs: Run kubectl get jobs --namespace=torq to view currently running jobs in the namespace.
Pull Runner logs: Use kubectl get pods --namespace=torq to find the pod name and then run kubectl logs <POD NAME> --namespace=torq to retrieve the detailed logs.
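
The three audit commands above can be wrapped in one function so that the output is captured in a single pass. This is a minimal sketch: the function name audit_runner and the NAMESPACE and KUBECTL variables are illustrative, not part of the Torq installation.

```shell
# Sketch: print events, jobs, and pods for the Runner namespace in one pass.
# NAMESPACE and KUBECTL are illustrative overrides, not Torq settings.
audit_runner() {
  ns="${NAMESPACE:-torq}"
  kubectl_bin="${KUBECTL:-kubectl}"
  echo "== events =="
  "$kubectl_bin" get events --namespace="$ns"
  echo "== jobs =="
  "$kubectl_bin" get jobs --namespace="$ns"
  echo "== pods =="
  "$kubectl_bin" get pods --namespace="$ns"
}
```

Running audit_runner > runner-audit.txt saves the output to a file. Setting KUBECTL=echo first prints the commands instead of running them.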

Check time synchronization

Step Runner operation relies on accurate time synchronization. Ensure that your deployment's time is synchronized to a public NTP server or a private server that syncs with public NTP resources. This process differs based on whether the Runner is Docker- or Kubernetes-based.

Docker-based deployments: The command to check time synchronization depends on your base operating system. The three most common mechanisms are:
1. Classic NTP daemon:
  1. Peering: Use ntpq -p to show the list of NTP peers and synchronization status. Look for a * next to a peer, indicating that it is currently synced.
  2. Daemon status: Use systemctl status ntp or service ntp status to check the status of the daemon.
2. Chrony NTP client:
  1. Peering: Use chronyc tracking to show sync details. chronyc sources -v will provide detailed peer information.
  2. Daemon status: Use systemctl status chronyd to check the status of the daemon.
3. systemd-timesyncd:
  1. Peering: Use timedatectl show-timesync --all to show sync details.
Kubernetes-based deployments: Kubernetes does not manage time synchronization itself. Because it relies on the underlying host/node operating system to keep the system clock in sync, each node must have time synced properly with an NTP server. Ensure your deployment supports this.
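
If you are unsure which mechanism a host or node uses, a small probe can suggest which status command to run. The sketch below is an assumption-laden helper, not a Torq tool: the function name time_sync_command is hypothetical and the probing order is arbitrary.

```shell
# Sketch: print the time-sync status command for the first tool found on PATH.
# The probing order (chronyc, ntpq, timedatectl) is an arbitrary choice.
# Note: timedatectl may be installed even when systemd-timesyncd is not active.
time_sync_command() {
  if command -v chronyc >/dev/null 2>&1; then
    echo "chronyc tracking"
  elif command -v ntpq >/dev/null 2>&1; then
    echo "ntpq -p"
  elif command -v timedatectl >/dev/null 2>&1; then
    echo "timedatectl show-timesync --all"
  else
    echo "none"
  fi
}
```

The function only prints a suggestion. Run the printed command yourself on the Docker host or on each Kubernetes node.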

Check CPU and memory usage in Docker and Kubernetes Step Runners

To verify the Step Runner has sufficient resources to execute steps, you can check CPU and memory usage spikes. This process differs based on whether the Runner is Docker- or Kubernetes-based.

Check the cloud provider: Inspect your cloud provider health-monitoring metrics for the Runner host or cluster's performance history.
Check the host or cluster: Inspect the local Runner host or cluster for errors related to resource consumption (for example, in the dmesg output or the system log). You can attach the list of errors if you end up contacting Torq support.
Check the Runner's status: Run the command docker ps in Docker or kubectl --namespace torq get pods in Kubernetes to see whether the Runner is currently up and running.
Check memory and CPU usage: Run the command docker stats in Docker or kubectl --namespace torq top pods in Kubernetes to see the memory and CPU usage of the Runner's containers or pods.
Retrieve additional Runner activity: Run the command docker logs <CONTAINER ID> >& myFile.log in Docker or kubectl --namespace torq logs <POD ID> >& myFile.log in Kubernetes to get additional information about the Runner's activity and redirect the output to a file that can be sent to a support representative.
1. The <CONTAINER ID> or <POD ID> is listed in the output of the previous commands.
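
The log-retrieval step can be sketched as two small functions. The function names and the DOCKER, KUBECTL, and NAMESPACE variables are illustrative assumptions. The redirection > file 2>&1 is the portable equivalent of the >& shorthand used above.

```shell
# Sketch: save Runner logs (stdout and stderr) to a file for support.
# DOCKER, KUBECTL, and NAMESPACE are illustrative overrides; the namespace defaults to "torq".
collect_docker_logs() {      # usage: collect_docker_logs <CONTAINER ID> <output file>
  "${DOCKER:-docker}" logs "$1" > "$2" 2>&1
}

collect_k8s_logs() {         # usage: collect_k8s_logs <POD ID> <output file>
  "${KUBECTL:-kubectl}" --namespace "${NAMESPACE:-torq}" logs "$1" > "$2" 2>&1
}
```

For example, collect_docker_logs "<CONTAINER ID>" myFile.log writes the container's output to myFile.log.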

Run an internet speed test for Step Runners

Step Runners require a minimum bandwidth of 10 Mbps. A bandwidth of at least 100 Mbps is recommended for optimal performance.
A faster internet connection will expedite step execution for Runners downloading container images for individual steps.
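
To get a feel for the difference, the following sketch estimates how long an image download takes at a given bandwidth. The function name and the 500 MB image size are hypothetical, and the estimate ignores latency and protocol overhead.

```shell
# Sketch: rough transfer time in seconds for a container image.
# Arguments: image size in megabytes, link speed in megabits per second.
download_seconds() {
  size_mb=$1
  mbps=$2
  echo $(( size_mb * 8 / mbps ))   # 8 bits per byte; integer division
}
```

Under these assumptions, download_seconds 500 10 prints 400 and download_seconds 500 100 prints 40, so a hypothetical 500 MB image takes roughly ten times longer at the minimum bandwidth than at the recommended one.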

To execute a workflow that tests internet speed for Step Runners:

Download the workflow: Download the workflow template below to run an internet speed test.
Import the workflow: Navigate to Build > Workflows, click Import workflow, and select the workflow from your device.
Run the workflow: Review the step parameters and run the workflow. If successful, you will receive the results of the internet speed test.

Remediate Step Runner issues

Once you've diagnosed your Step Runner's issue, you can proceed with one of the following recommended actions:

If CPU or memory usage is high, you should allocate more resources to the host or cluster hosting the Runner and rerun the failing step.
If the Runner is healthy, connectivity between the Runner and Torq is not the problem. The issue may instead lie in the internal connectivity between the Runner and the service it is trying to reach, so check that path.
If the Runner is not operational, regenerate the Runner install command and run it to deploy a new service. Contact your support representative with the extracted logs to understand the problem with the original malfunctioning service.
If there are no identifiable connectivity or resource-limitation issues, contact your support representative with the extracted logs.

Reinstall unhealthy Step Runners

To regenerate an install command and reinstall an unhealthy Runner:

Navigate to the Step Runner: Go to the Step Runner page and select the desired Runner.
Regenerate the command: Click the More Options menu and select Regenerate install command.
Specify the deployment type: Select Docker or Kubernetes.
Execute the command: Copy the new install command and execute it on a host within 24 hours.

Retrieve the IP address of self-hosted Step Runners

The following procedure will only work if the Runner is healthy and can communicate with Torq servers.

To find a Runner's public IP or the environments where it is deployed:

Download the workflow: Download the workflow template below to find the Runner's public IP address.
Import the workflow: Navigate to Build > Workflows, click Import workflow, and select the workflow from your device.
Run the workflow: Review the step parameters and run the workflow. If successful, you will receive the Runner's public IP address and the traceroute results.

Monitor log file growth for self-hosted Docker Step Runners

Monitoring log file growth for self-hosted Step Runners deployed on Docker can prevent disk-space issues. Left unchecked, these issues can result in orphaned containers and unrecognized failed steps.

To monitor log file growth and free up disk space:

Check disk utilization:
1. Run df -h to display total disk-space usage.
2. Run du --max-depth=2 | sort -n -r | head to list the largest directories (limited to two levels deep) and identify where disk space is being used.
Inspect the Docker environment:
1. Run docker ps -a to view running and stopped containers.
2. Run docker container ls to view active containers.
3. Run docker image ls to list all Docker images in the system.
Remove unused resources:
1. Run docker container prune to remove stopped containers.
  1. (Optional) Use a filter such as --filter "until=168h" to target only containers older than a specified time frame, in this case 7 days (168 hours).
2. Run docker image prune with the -a flag to remove all unused images, not only dangling ones.
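
The two prune commands can be combined into one helper that applies the same age filter to both. This is a sketch under assumptions: the function name prune_older_than and the DOCKER override are illustrative, and --force skips Docker's confirmation prompt, so review what will be removed before using it.

```shell
# Sketch: prune stopped containers and unused images older than a given age.
# DOCKER is an illustrative override (set DOCKER=echo for a dry run).
prune_older_than() {         # usage: prune_older_than 168h
  age=$1
  "${DOCKER:-docker}" container prune --force --filter "until=$age"
  "${DOCKER:-docker}" image prune --all --force --filter "until=$age"
}
```

For example, prune_older_than 168h targets resources created more than seven days ago.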

Truncate oversized container log files

To truncate container log files:

Identify directories using excessive disk space: Use the du command to check if any directories are consuming too much disk space. For example, the /var/lib/docker/containers/ directory stores logs for each container and can grow significantly.
Locate the problematic container log: Use the container's long ID to find the specific log causing the issue.
```
docker ps -a --no-trunc | grep <container_id>
```
Truncate the oversized log file: To resolve the issue quickly, truncate the problematic log file.
```
truncate -s 0 /var/lib/docker/containers/<container_id>/*-json.log
```
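
Before truncating, it can help to list which log files are actually oversized. The sketch below uses find for this. The function name and the size threshold are assumptions chosen for illustration.

```shell
# Sketch: list container log files larger than a given number of bytes.
# Arguments: log root directory and minimum size in bytes (both chosen by the caller).
list_large_logs() {
  find "$1" -type f -name '*-json.log' -size "+${2}c"
}
```

For example, list_large_logs /var/lib/docker/containers 1073741824 lists logs larger than 1 GiB. Reading that directory typically requires root privileges.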

Remediate step failures on self-hosted Runners with `signal: killed` errors

If you encounter a signal: killed error message during a step execution on a self-hosted Runner, it typically indicates that the Runner or its host is hitting resource restrictions.

To remediate a step failure caused by a signal: killed error:

Verify Runner compatibility: Check if the Runner works correctly with other steps or commands. If not, there may be a broader issue with the Runner setup that needs addressing before proceeding.
Assess data requirements: Evaluate what is unique about the step's input. Does the step process or return a large amount of data, and does it require high memory? If yes, this is likely contributing to the issue.
Compare environments: Test how the same step, with the same input and expected output, operates on a local Runner. If the error persists, it might be due to the Runner's guardrail restrictions.
Examine logs for errors: Depending on your environment, check the following logs for any errors that occurred during the execution time:
1. For Kubernetes (K8s):
  1. Run kubectl get events -n torq to view events.
  2. Run kubectl --namespace torq get pods to list pods.
  3. Run kubectl --namespace torq logs <pod id> >& myFile.log to retrieve specific pod logs.
  4. Run kubectl --namespace torq logs to retrieve general logs.
2. For Docker host:
  1. Run docker logs <CONTAINER ID> >& myFile.log to retrieve container logs.
  2. Run sudo journalctl -u docker.service to check Docker service logs.
  3. Run dmesg | grep -i network to review system messages related to network issues, or check /var/log/syslog and /var/log/messages.
Increase resources: If resource limitations are causing the issue, increase the default RAM/CPU allocation for your Step Runner.
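
A signal: killed message is often what a process reports after receiving SIGKILL, and on Linux the kernel's out-of-memory (OOM) killer is one common sender. The sketch below filters log text for typical OOM-killer messages. The function name is hypothetical and the patterns are common kernel phrasings, not an exhaustive list.

```shell
# Sketch: keep only lines that look like Linux OOM-killer messages.
# Reads log text on standard input; prints nothing if no line matches.
filter_oom_lines() {
  grep -i -E 'out of memory|oom-kill|oom_reaper' || true
}
```

On a Docker host, dmesg | filter_oom_lines (possibly with sudo) shows whether the kernel killed a process around the time of the failure.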

Address `runner was not reachable` errors in the Execution Log

When executing a step on a self-hosted Runner, you may encounter the following error message:

The runner was not reachable. Verify that the runner's container is running.

However, the step's Execution Log displays a non-error output. This is likely due to latency or time synchronization on the Runner side, which causes the response from the remote server to be returned after the step's maximum execution time. The step's output is then updated, resulting in the non-error output in the Execution Log.

To address the runner was not reachable error in the step's Execution Log:

Address any time synchronization issues: Ensure that the Runner host's clock is synchronized with an NTP server, as described in Check time synchronization.
Increase the timeout: Increase the step's timeout to give the Step Runner enough time to respond.
Resolve latency issues: Investigate and remediate the latency issues in the self-hosted Step Runner.

Force steps to run as containers

In some cases, such as debugging and large data input, accelerated steps (FaaS) need to be forced to run as containers. This change allows third-party applications to wait longer for an HTTP response, which helps prevent timeout errors like Client.Timeout exceeded while awaiting headers. By default, FaaS execution is limited to 60 seconds and the default timeout is 30 seconds.

To force a step to run as a container and enable larger HTTP timeouts:

Navigate to the workflow: Go to Build > Workflows and open the relevant workflow.
Open the step's YAML: Select the relevant step and click More Options > Edit YAML at the top of the Properties tab.
Edit the YAML: Under options, change disable to true and save the YAML.
```
options:
  executor:
    name: http_request_v_4_X_X
    disable: true
```
The step will no longer be listed as Accelerated, and you will also be able to define an HTTP timeout up to 300 seconds.