Tuesday 16 April 2024

Searching for autopkgtest regressions in Noble

Hi everyone,

As the Noble release approaches, and since the final part of the cycle brought some big changes to the archive (the time_t transition, the xz CVE fix), the Server team decided to compare the autopkgtest results from just before the end of February, specifically 2024-02-28 (our guess at a date before the time_t work started), with the results as of now (2024-04-16). This comparison can reveal potential regressions caused by those big changes and gives us a chance to address them before the release.

For each package on each architecture, we take the latest test result from before the reference date (2024-02-28) and compare it with the latest test run (as of 2024-04-16) of the same package on the same architecture. We are using the autopkgtest SQLite database available here [1]. I call it "bad news" when the tests of a package on a given architecture were passing before the reference date and are failing now.
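The comparison described above can be sketched roughly as follows. This is not the actual script [2]: the flattened `runs` table, its column names, and the `find_bad_news` helper are assumptions made for illustration, and the real autopkgtest database has its own schema. It relies only on the fact that run ids start with a timestamp, so plain string comparison orders them.

```python
import sqlite3

# Run ids begin with YYYYMMDD_HHMMSS, so comparing strings compares dates.
REFERENCE_DATE = "20240228"


def find_bad_news(conn, reference_date=REFERENCE_DATE):
    """Return {(package, arch): {"before": run_id, "after": run_id}} for
    pairs whose latest run before the reference date passed and whose
    latest run overall failed."""
    rows = conn.execute(
        "SELECT package, arch, run_id, passed FROM runs ORDER BY run_id"
    ).fetchall()
    before, latest = {}, {}
    for package, arch, run_id, passed in rows:
        key = (package, arch)
        if run_id < reference_date:
            before[key] = (run_id, bool(passed))  # last one seen wins
        latest[key] = (run_id, bool(passed))
    bad = {}
    for key, (before_run, before_passed) in before.items():
        after_run, after_passed = latest[key]
        if before_passed and not after_passed:
            bad[key] = {"before": before_run, "after": after_run}
    return bad


# Hypothetical in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (package TEXT, arch TEXT, run_id TEXT, passed INTEGER)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        ("adsys", "arm64", "20240227_182345_d0549@", 1),
        ("adsys", "arm64", "20240416_122755_08ec0@", 0),
        # Made-up rows: a package that keeps passing, and one with no earlier run.
        ("examplepkg", "amd64", "20240220_101010_aaaaa@", 1),
        ("examplepkg", "amd64", "20240410_101010_bbbbb@", 1),
        ("adsys", "amd64", "20240301_090000_ccccc@", 0),
    ],
)
bad_news = find_bad_news(conn)
print(bad_news)
```

Only the adsys/arm64 pair is reported here: it is the only one with a passing run before the reference date and a failing latest run.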

The script I used to do this is available here [2], and I used the mapping of packages to teams [3] to get the list of packages. The output of the script is a JSON file that looks like the following for one package:
    "adsys": {
        "arm64": {
            "before": {
                "result": "all tests passed",
                "test_run_id": "20240227_182345_d0549@",
                "triggers": "samba/2:4.19.5+dfsg-1ubuntu1"
            },
            "after": {
                "result": "at least one test failed",
                "test_run_id": "20240416_122755_08ec0@",
                "triggers": "sssd/2.9.4-1.1ubuntu6 c-ares/1.27.0-1.0ubuntu1 samba/2:4.19.5+dfsg-4ubuntu9"
            }
        }
    }

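For anyone who wants to skim one of these reports, here is a minimal sketch of reading it, assuming the file is a JSON object keyed by package and then by architecture, as in the excerpt above. The `summarize` helper is hypothetical and not part of the script [2]; the sample data is the adsys entry from this message, wrapped in an outer object so that it parses.

```python
import json

# The adsys excerpt from this message, wrapped in an outer JSON object.
REPORT_JSON = """
{
  "adsys": {
    "arm64": {
      "before": {
        "result": "all tests passed",
        "test_run_id": "20240227_182345_d0549@",
        "triggers": "samba/2:4.19.5+dfsg-1ubuntu1"
      },
      "after": {
        "result": "at least one test failed",
        "test_run_id": "20240416_122755_08ec0@",
        "triggers": "sssd/2.9.4-1.1ubuntu6 c-ares/1.27.0-1.0ubuntu1 samba/2:4.19.5+dfsg-4ubuntu9"
      }
    }
  }
}
"""


def summarize(report):
    """Return one summary line per package/architecture pair in the report."""
    lines = []
    for package in sorted(report):
        for arch in sorted(report[package]):
            before = report[package][arch]["before"]
            after = report[package][arch]["after"]
            lines.append(
                f"{package} [{arch}]: {before['test_run_id']} ({before['result']}) "
                f"-> {after['test_run_id']} ({after['result']})"
            )
    return lines


report = json.loads(REPORT_JSON)
summary = summarize(report)
for line in summary:
    print(line)
```

The two run ids in each line are what you would look up when checking by hand whether the failure is a real regression.
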
Attached is the output of the script for the following teams:
- foundations-bugs
- desktop-packages
- kernel-packages
- ubuntu-server

The Server team is already going through the list to check whether there are any real regressions that need work. Keep in mind that not every package listed is necessarily a real problem: the test may have failed because of a bad trigger or an autopkgtest infrastructure issue, so a manual check is required to confirm that it is a real regression. It is also important to note that the script always takes only the latest test result before the reference date, so a failure being analysed could also simply come from a flaky test.

If you have any questions or suggestions about this, let me know.

I hope this is useful for other teams.

--   Lucas Kanashiro