Force-cancel workflow on Server UI
Cos
We had some very old jobs stuck in "queued" state for months, which were clearly never going to run. These jobs were not present on any Nomad nodes, but could still be seen via the API and the web UI. When I asked support how to get rid of them, they gave me instructions for using kubectl exec to log into a frontend pod and then using the REPL to run circle.http.api.admin-commands/force-cancel-build on each job.
That worked, but there should be a button on the frontend web UI to do the same thing.
Jobs on Server can fall into this kind of zombie state, and it's not practical to expect people to run a REPL directly on the frontend pod to clear them out. Doing so is also riskier than using a purpose-built button.
Nathan Fish
Cos is there a particular activity that these old jobs stuck in "queued" prevent you from accomplishing? Is this an annoyance in the UI or does it prevent work from being done?
Cos
Nathan Fish CircleCI did not provide sufficient application metrics for real monitoring and SLOs on our server, so I had to fill some of the gaps using the admin API. One of the biggest monitoring gaps in CircleCI Server is the lack of metrics about the number, age, and distribution of queued jobs, so I wrote a scraper that collects this information from the API and sends it as metrics to the monitoring platform. This gave us charts and graphs of job counts by status on our dashboards, as well as alerts for things like "oldest queued job", so we could be notified if any job has been in queued status for longer than a set threshold, which usually indicates a problem with the Nomad cluster.
When an old job like this remains in the data, it effectively disables a lot of that monitoring. It makes some dashboard charts unreadable by introducing a line that is far out of scale with the rest, and it forces us to disable the alert. For the whole period until we were able to remove this job, we were basically flying blind on some of our important monitoring and alerting, and one of our SLOs was being calculated incorrectly, which made it useless for that time.
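To illustrate the kind of calculation involved, here is a minimal sketch of the metrics step, not the actual scraper. It assumes the job records have already been fetched and reduced to dictionaries with a "status" string and a timezone-aware "queued_at" timestamp; those field names and both function names are hypothetical, not part of the CircleCI API.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def summarize_jobs(jobs, now=None):
    """Return job counts by status and the age of the oldest queued job.

    `jobs` is a list of dicts with hypothetical keys "status" and
    "queued_at" (a timezone-aware datetime). This shape is an assumption.
    """
    now = now or datetime.now(timezone.utc)
    counts = Counter(job["status"] for job in jobs)
    queued_ages = [
        (now - job["queued_at"]).total_seconds()
        for job in jobs
        if job["status"] == "queued"
    ]
    # With no queued jobs at all, report an age of zero.
    oldest = max(queued_ages, default=0.0)
    return {"counts": dict(counts), "oldest_queued_seconds": oldest}


def oldest_queued_alert(summary, threshold_seconds):
    """True when the oldest queued job exceeds the alert threshold."""
    return summary["oldest_queued_seconds"] > threshold_seconds


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    jobs = [
        {"status": "queued", "queued_at": now - timedelta(minutes=5)},
        {"status": "running", "queued_at": now - timedelta(minutes=10)},
        # A zombie job: queued for months and never going to run.
        {"status": "queued", "queued_at": now - timedelta(days=90)},
    ]
    summary = summarize_jobs(jobs, now)
    print(summary["counts"])
    print(oldest_queued_alert(summary, threshold_seconds=30 * 60))
```

With the zombie job in the list, the oldest queued age is about 90 days, so a 30-minute threshold fires permanently and the value dwarfs everything else on a chart. Remove that one record and the same data reports 5 minutes, which is why a single stuck job is enough to make the alert unusable.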
Kelvin Tay
As a user, I feel this force-cancel workflow in the UI would also be important for Cloud customers, especially for users who don't want to use the API directly.