Wednesday 3 November 2021

autopkgtests on ppc64el fixed now

Hi all,

The issues with ppc64el autopkgtests should be resolved now.
IS spent a ton of time yesterday analysing the situation,
whacking virtual switches, rebooting hypervisors and all
that fun stuff.

The queue has made some progress, but it will still be
a couple of days until ppc64el has caught up with the
other architectures.

-- What happened

Already at the archive opening, ppc64el capacity was apparently
reduced - we ran about one test per minute. Presumably this was
due to issues in the bos02 cloud.

In the bos02 cloud, instances received configuration for two
network interfaces despite only having one, and cloud-init then
failed to bring up networking. The reason for this was apparently
a load-balancing issue: multiple nodes had allocated networking
for the servers.

Last Friday, bos01 started failing completely, as new hypervisor
nodes that were not yet ready had been marked active, and all
requests were being allocated to them.

This bos01 outage coincided with me changing the script to
reject broken bos02 machines immediately, so it led to
some confusion on my side.

On Monday, I did some further investigation and tried to see
if I could boot an image and hack around cloud-init to bring
up networking on one interface. This was not successful - even
deleting the "down" interface and rebooting the server did not
bring up networking, so I gave up.

On Tuesday, I noticed the reason for the bos01 failure - all
new servers were allocated on hypervisor "cybelle.None" - hmm,
that looked odd. And it was, as mentioned above.

I also noticed that all the hanging instances on bos02 were on
'floette' before being moved to another node, which hopefully
helped IS dig out the issue.

Initial whacking of OVS and other components on floette
did not yield stable results, so IS later rebooted some
hypervisor nodes, and the cloud seems to be stable once
again.

-- Changes to autopkgtest-cloud

We can now monitor failures per cloud on the Grafana dashboard[1],
allowing us to debug issues more effectively and find out which
cloud is broken :)

BTW, I'm still looking for help with merging the "failed" and
"successful" graphs into a single "failure rate" graph. It might
be impossible with our InfluxDB + Grafana combo, however.
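
For what it's worth, here is a rough sketch of the kind of thing I
had in mind. It only works if both counters end up as fields of a
single measurement, and the measurement, field and tag names below
are invented, not what the workers actually write:

    from influxdb import InfluxDBClient

    # Connection details and schema are invented for illustration.
    client = InfluxDBClient(host="localhost", port=8086,
                            database="autopkgtest")

    # failed / (failed + passed) per hour and per cloud; in a
    # Grafana panel the same InfluxQL would use $timeFilter and
    # $__interval instead of the literals below.
    query = (
        'SELECT sum("failed") / (sum("failed") + sum("passed")) '
        'AS failure_rate FROM "test_results" '
        'WHERE time > now() - 7d '
        'GROUP BY time(1h), "cloud"'
    )
    for point in client.query(query).get_points():
        print(point["time"], point["failure_rate"])

If the two counters live in separate measurements, plain InfluxQL
can't combine them as far as I know, which would explain why this
looks impossible without something like Flux.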

Server creation now rejects servers with two IP addresses immediately,
instead of waiting for SSH to time out on the first of them. If the
problems pop back up again, they'll either be worked around faster
or, if they are fairly persistent, the workers will fail more often
and stop, so the "workers in error state" KPI increases.
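
To illustrate the kind of check, here is a rough sketch using
openstacksdk - this is not the actual worker code, and the function
and cloud names are made up:

    import openstack

    def reject_multi_ip_server(conn, server_id):
        """Throw away a freshly created server that got more than
        one IP address. Returns True if the server was rejected."""
        server = conn.compute.get_server(server_id)
        # server.addresses maps network name -> list of address dicts
        addresses = [a["addr"]
                     for net in (server.addresses or {}).values()
                     for a in net]
        if len(addresses) > 1:
            # Broken networking allocation: delete the instance right
            # away instead of waiting for SSH to the first address to
            # time out.
            conn.compute.delete_server(server)
            return True
        return False

    # Usage sketch:
    # conn = openstack.connect(cloud="bos02")
    # if reject_multi_ip_server(conn, new_server.id):
    #     ...  # recreate the server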

[1] https://ubuntu-release.kpi.ubuntu.com/d/76Oe_0-Gz/autopkgtest?orgId=1

--
debian developer - deb.li/jak | jak-linux.org - free software dev
ubuntu core developer | i speak de, en
