Friday 17 February 2023

Re: What's going on with proposed migration?

On Thu, Feb 16, 2023 at 03:37:47PM -0800, Brian Murray wrote:
> First off I want to apologize for not sending this earlier. The Ubuntu
> QA team was focused on restoring the service but could have been more
> communicative regarding what was going on.
>
> In brief there was an outage with some underlying infrastructure
> provided by Canonical IS which took us a little while to catch and then
> a while longer to work around due to a multitude of failures. However,
> the set of failures we encountered have exposed some issues with
> the service running the autopkgtests which we plan to address in the
> near future.
>
> If anybody is interested in the potentially boring nitty-gritty details
> I'd be happy to send a follow up email.

Here's that follow-up email, since several people were interested in
it.

The proposed migration environment occasionally has issues with test
runners for different architectures in some Canonical data centers.
This time, however, the failure was with Ceph in the cloud environment
(PS4.5) hosting the autopkgtest infrastructure. The orchestrators, which
dispatch the tests and store log files while the tests are running, used
a 200G /tmp partition that was Ceph-backed. The Ceph failure wasn't
immediately obvious because the orchestrators could still read and write
to /tmp, just very slowly.
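
For illustration only, here is a minimal sketch of a write-latency probe
of the kind that could surface this sort of degradation, where writes
still succeed but take far too long. The function names, the 1 MiB
payload, and the 2 second threshold are all assumptions made for the
example; none of this is part of autopkgtest-cloud.

```python
import os
import tempfile
import time


def write_latency_seconds(directory, size_bytes=1024 * 1024):
    """Time how long it takes to write and fsync size_bytes into directory."""
    payload = b"\0" * size_bytes
    start = time.monotonic()
    # The temporary file is removed automatically when the block exits.
    with tempfile.NamedTemporaryFile(dir=directory) as probe:
        probe.write(payload)
        probe.flush()
        os.fsync(probe.fileno())  # force the data down to the backing store
    return time.monotonic() - start


def is_degraded(directory, threshold_seconds=2.0):
    """Return True when one probe write takes longer than the threshold.

    The threshold is an arbitrary assumption; a sensible value depends
    on the storage backend and the normal load on the system.
    """
    return write_latency_seconds(directory) > threshold_seconds
```

Something like is_degraded("/tmp") could be run periodically and alerted
on. A single probe is noisy, so a real check would probably want several
samples before raising an alarm.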

Our first attempt at fixing it involved using the existing
orchestrators, which only had a 50G root partition, with a
loopback-mounted file backed by compressing xfs. Unsurprisingly, we ran
out of free space on those orchestrators.

We then ran into difficulties replacing the existing orchestrators with
systems with a larger root partition due to some issues with Juju, which
had to be worked around by using the underlying commands.[1]

The systems with the largest disks available (200G) did not have the
same number of processors or amount of RAM available as the previous
ones. As a result, the processes running autopkgtest ran out of memory
and multiple jobs were OOM-killed. The systems then ran out of free
space because the working directories of the killed tests were not
cleaned up[2], so jobs running autopkgtest failed again[3].
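
To make the disk space problem concrete, here is a minimal sketch of
one way leftover working directories could be swept up. It is not the
autopkgtest-cloud fix: the function name, the directory layout (one
subdirectory per test under a common base) and the age-based heuristic
are all assumptions made for the example.

```python
import shutil
import time
from pathlib import Path


def remove_stale_workdirs(base, max_age_seconds, now=None):
    """Delete subdirectories of base not modified within max_age_seconds.

    Returns the sorted names of the directories that were removed.
    Note that a directory's mtime only changes when entries directly
    inside it are added, removed or renamed, so this is a rough
    heuristic rather than a reliable liveness check.
    """
    now = time.time() if now is None else now
    removed = []
    for entry in Path(base).iterdir():
        # Only real directories are candidates; skip files and symlinks.
        if not entry.is_dir() or entry.is_symlink():
            continue
        if now - entry.stat().st_mtime > max_age_seconds:
            shutil.rmtree(entry, ignore_errors=True)
            removed.append(entry.name)
    return sorted(removed)
```

A more robust approach would record which process owns each working
directory and remove it once that process is gone, rather than relying
on age alone.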

We've replaced the second set of orchestrators with ones that have a
100G disk but the same number of processors and amount of RAM as the
initial ones. This has put us back in the state we were in initially,
where the remaining issues have to do with actually running tests for
certain architectures.

[1] I also simultaneously tried migrating the staging proposed migration
environment to PS5 but failed due to some configuration issues.
[2] A bug which needs fixing in the autopkgtest-cloud code.
[3] This also ended up leaving stale processes around which we might
fix.

Cheers,
-------
Brian Murray
