Bug #5234

closed

"update-osmo-ci-on-slaves" hangs: gtp0-deb10build32 is offline

Added by osmith 3 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
09/16/2021
Due date:
% Done:

100%

Spec Reference:

Description

The "update-osmo-ci-on-slaves" job is hanging forever, waiting for gtp0-deb10build32 to come online.

1. Is the node supposed to be offline?

2. Having the job "update-osmo-ci-on-slaves" hang forever if one of the nodes is offline isn't so great:
  • it does not send a notification mail
  • new changes from osmo-ci.git are only rolled out after I (or someone else) manually cancel the currently running job
I suggest we improve this. Some ideas:
  • build-timeout plugin, so we can abort the build after one hour or so (should be enough time to build docker containers etc. if needed). Then we should actually get failure mails if one node is unexpectedly down.
    • Not sure if this is currently maintained though, the Jenkins page says "This plugin is up for adoption!"
  • the matrix-project plugin we already use has an elastic-axis extension. If we install that, we can skip offline nodes.
    • If we want a mail notification that nodes are not available, we can probably set something up too, as a separate job or elsewhere in the Jenkins config.
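
For the "separate job" idea, a minimal sketch of a check that lists offline nodes could look like the following. It assumes the Jenkins remote access API at /computer/api/json, which reports a "displayName" and an "offline" flag for each node. The JENKINS_URL value and both function names are placeholders made up for this illustration, not part of the current setup, and authentication is left out.

```python
import json
from urllib.request import urlopen

# Assumption: placeholder base URL, replace with the real Jenkins instance.
JENKINS_URL = "https://jenkins.example.org"


def offline_nodes(payload: dict) -> list:
    """Return the display names of all nodes that Jenkins reports as offline."""
    return [
        node["displayName"]
        for node in payload.get("computer", [])
        if node.get("offline")
    ]


def fetch_nodes(base_url: str = JENKINS_URL) -> dict:
    """Fetch the node list from the Jenkins remote access API (no auth)."""
    with urlopen(f"{base_url}/computer/api/json", timeout=30) as resp:
        return json.load(resp)
```

A periodic job could call offline_nodes(fetch_nodes()) and exit non-zero if the list is not empty, so that the usual failure mail gets sent.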

Related issues

Related to Osmocom.org Servers - Bug #5061: build2-deb9build-ansible is offline | Resolved | sysmocom | 03/08/2021

Actions #1

Updated by osmith 3 months ago

  • Related to Bug #5061: build2-deb9build-ansible is offline added
Actions #2

Updated by osmith 3 months ago

Actually, all gtp0* nodes are offline:

  • gtp0-deb9build
  • gtp0-deb10fr
  • gtp0-deb10build32
Actions #3

Updated by laforge 3 months ago

On Thu, Sep 16, 2021 at 08:10:17AM +0000, osmith [REDMINE] wrote:

1. Is the node supposed to be offline?

It is as much supposed to be offline as everything in my basement is "supposed" to be offline,
as it has been physically removed due to water damage related reconstruction. Sorry for that.

Actions #4

Updated by laforge about 1 month ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

The nodes have been back online for at least the past week or so.

Side note:
After the Debian 11 upgrade of the root OS, I had to use "systemd.unified_cgroup_hierarchy=0" kernel arguments in order to make docker-in-deb9-lxc and docker-in-deb10-lxc continue to work.
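
For context: Debian 11 switched systemd to the unified cgroup hierarchy (cgroup v2) by default, and the kernel argument above selects the older layout again. A rough sketch for checking what a host is actually running follows. It assumes the standard /proc/cmdline and /sys/fs/cgroup paths; the function names are made up for this example, and the parser does not handle quoted arguments containing spaces.

```python
from pathlib import Path


def kernel_arg(cmdline: str, key: str):
    """Return the value of key=value from a kernel command line, or None."""
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if name == key and sep:
            return value
    return None


def uses_unified_cgroups(cgroup_root: str = "/sys/fs/cgroup") -> bool:
    """cgroup v2 (unified) exposes cgroup.controllers at the mount root."""
    return (Path(cgroup_root) / "cgroup.controllers").exists()
```

On the host, kernel_arg(Path("/proc/cmdline").read_text(), "systemd.unified_cgroup_hierarchy") should return "0" once the argument is in effect, and uses_unified_cgroups() should then return False.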
