Wednesday 17 March 2021

Re: General mechanism to supply "rich history" to git-ubuntu

Hi Colin,

Thank you for the review!

On Wed, Mar 17, 2021 at 05:16:29PM +0000, Colin Watson wrote:
> I don't see why Vcs-Git-Refs would ever need to be plural: if a commit
> isn't reachable from any single ref, then adding another ref will never
> be helpful. This should just be Vcs-Git-Ref, singular, which also
> simplifies code that uses it (since it will in practice always be a
> single ref, any code to handle more than one ref there would be
> poorly-tested).

I made it plural for the widest future compatibility. For example, dgit
uses an effective Vcs-Git-Refs value of refs/dgit/*, which I treat as
plural since the glob encompasses multiple refs. Similarly, statically
specifying refs/heads/* could work too (albeit inefficiently). However,
neither of these would be necessary for my git-ubuntu case, so I'm happy
to use Vcs-Git-Ref if the feeling is that Vcs-Git-Refs is "YAGNI".
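For concreteness, I'm imagining something like the following appearing
in a .changes file. The URL, ref and commit are made up, and the
companion fields shown around the Refs field are illustrative rather
than a statement of the spec:

    Vcs-Git: https://git.launchpad.net/~uploader/ubuntu/+source/hello
    Vcs-Git-Refs: refs/heads/upload-2.10-2ubuntu1
    Vcs-Git-Commit: 0123456789abcdef0123456789abcdef01234567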

> The spec should probably also say that the commit in question isn't
> guaranteed to be reachable from that ref in the long term, but only by
> the git-ubuntu importer in the short term (however exactly you define
> that). Since the repository may be owned by the uploader, it doesn't
> really seem practical to impose stronger lifetime constraints.

Agreed - this was my intention, and I will make it explicit. I'm not
sure I can define "short term" well, so I'd prefer to leave that
undefined and treat it as best-effort, with a view to eventually
replacing the entire mechanism with a push method as discussed below.
If the ref isn't there for long enough, we fall back to the "no rich
history found" position described below.

> What happens if the repository is temporarily unreachable?

> What happens if the repository is renamed before git-ubuntu gets to it?

> What happens if the repository is private, as might be the case if the
> upload is a security upload whose contents are embargoed before it hits
> the archive?

[elaboration clipped]

For now, these cases are handled as "no rich history found" and a commit
is synthesized instead. This is a graceful degradation that must exist
anyway since there is currently no requirement that an uploader supply
rich history at all.
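To make the degradation concrete, here's a rough sketch of how I
imagine the importer-side decision going. This isn't actual git-ubuntu
code and the helper functions are hypothetical:

    import subprocess

    def import_upload(upload, repo_url, ref, commit):
        """Adopt the uploader's rich history if possible, else synthesize."""
        try:
            # Fetch only the advertised ref from the uploader's repository.
            subprocess.run(['git', 'fetch', repo_url, ref],
                           check=True, timeout=300)
            # The advertised commit must be reachable from the fetched ref
            # (it must also match the uploaded source; checked elsewhere).
            subprocess.run(['git', 'merge-base', '--is-ancestor',
                            commit, 'FETCH_HEAD'], check=True)
            adopt_rich_history(upload, commit)    # hypothetical helper
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            # Unreachable, renamed or private repository, or the ref is
            # already gone: all reduce to "no rich history found".
            synthesize_commit(upload)             # hypothetical helper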

For "temporaily unreachable" I could add back-off and retry indeed, as
an implementation detail. I don't intend to do this in my initial
implementation, but it could be added later if it becomes a problem. In
the meantime, for the initial implementation I see rich history adoption
as a "bonus" rather than there being any kind of guarantee. If we wanted
to guarantee it later by handling this failure case, but that wouldn't
require a change to the spec, which is my main goal right now. However
there still would be the "how long is long enough?" issue you identified
above.
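If we did add the retry later, I imagine it as a small self-contained
wrapper along these lines - again just a sketch, with a made-up
exception type and made-up numbers:

    import time

    class TransientFetchError(Exception):
        """Raised by the fetch callable on a failure worth retrying."""

    def fetch_with_backoff(fetch, attempts=5, initial_delay=60):
        """Call fetch(), retrying with exponential back-off."""
        delay = initial_delay
        for attempt in range(attempts):
            try:
                return fetch()
            except TransientFetchError:
                if attempt == attempts - 1:
                    raise    # caller then falls back to synthesizing
                time.sleep(delay)
                delay *= 2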

"Renamed repository" I see as the equivalent to "deleted ref before the
importer got to it" which, according to my spec, is the same as "didn't
supply valid rich history". So it reduces to a requirement that the
uploader "do it right", which I admit is not perfect, but something I
hope that won't be a problem in practice and something that tooling can
help with. The imperfection leads me to your pull vs. push discussion
that I'll go into below.

"Private repository" I also see as equivalent to "didn't supply valid rich
history". It's a client-side problem to ensure that the repository is
public at the time the importer sees it (which shouldn't be a problem in
the case of an embargo because if the importer can see it, it's
published publicly anyway).
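Client-side tooling could make that check cheap: before (or right
after) dput, verify anonymously that the advertised ref is visible and
points at the upload commit. A sketch, assuming the uploader passes the
anonymous https URL rather than an authenticated ssh one:

    import subprocess

    def ref_publicly_visible(public_url, ref, commit):
        """Check anonymously that `ref` exists and points at `commit`."""
        out = subprocess.run(['git', 'ls-remote', public_url, ref],
                             capture_output=True, text=True,
                             check=True).stdout
        return any(line.split()[0] == commit for line in out.splitlines())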

Importantly, these "didn't supply valid rich history" cases aren't
errors, since the fallback is always to synthesize a suitable commit.
My thoughts on getting rid of this fallback are for much further down
the line, as discussed below.

> dgit avoids all these problems by having a push model rather than a pull
> model: "dgit push-source" pushes the appropriate commit to a specialized
> git server, which can then make sure not to lose it. Now, at present in
> Debian this has the side-effect of restricting the set of people who can
> push, and it's certainly not entirely obvious how we would go about such
> a thing in Ubuntu with Launchpad (I think we should avoid any design
> that requires setting up another git server that needs to know about
> developer identities and permissions etc.), but I think it's at least
> worth thinking about before committing ourselves to a pull model.

I see the pull model I'm proposing here as an intermediate step: I'd
like to implement it first and look at switching to a push model later.

> What would we need in Launchpad if we were going to try to do this on a
> push basis? Brainstorming a bit, these are some approaches that came to
> mind, bearing in mind that some of these ideas may be terrible:
>
> * Might it be practical to tell Launchpad to reserve some kind of token
> corresponding to the commit in question guaranteeing that that commit
> would be reachable until the token is consumed, which git-ubuntu
> could then pass in the .changes file and the importer could consume?
>
> * Perhaps the upload could include a git bundle relative to some other
> version already in the archive? (This could be large, though.)
>
> * Could we work out a way to allow any contributor to push to some kind
> of holding area associated with the importer-owned repository, and
> then the importer would only point a ref at that once the upload has
> been processed? (I'm not sure how we would prevent an attacker from
> being able to force such a repository to grow without bound, though.)
>
> * Could we use merge proposals for this somehow? An upload is in some
> sense a proposal to merge some changes into the primary archive, and
> I know "git ubuntu submit" already integrates with merge proposals on
> an experimental basis. That might allow Launchpad to know that a
> given commit is interesting and should be made available - indeed we
> already have plans to expose virtual refs that correspond to merge
> proposals, although I don't think those are quite done yet. If we
> were to take this approach, then the ref that we point to could be
> made to appear in the target repository instead, which avoids the
> collection of issues around the source repository disappearing or
> moving.

Thank you for the brainstorming! This is the kind of thing I would like
to achieve in the end, yes.

Here's another alternative. I'm curious to hear what you think about
this:

* Eventually, I'd like Launchpad to support "uploading" directly from
  git somehow, rather than round-tripping through dput. git-ubuntu
  provides a lossless bi-directional mapping between git trees and
  Debian source packages[1]. Launchpad could use this to provide an API
  method that, given a repository, a commit hash and a ref from which
  to reach it, synthesizes a dput upload as if it had been signed by
  the person making the API call. This would not preclude uploaders
  from continuing to use dput, but would permit a more direct push
  method without having to send the "here's the rich history" metadata
  through a different route. A sketch of what such a call might look
  like follows.
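To make the shape of that idea concrete, an uploader's tooling might
eventually do something like the following. To be clear, no such
Launchpad API method exists today; apart from the launchpadlib login
boilerplate, every name here is invented:

    from launchpadlib.launchpad import Launchpad

    lp = Launchpad.login_with('git-ubuntu upload', 'production')
    archive = lp.distributions['ubuntu'].main_archive
    # Hypothetical method and arguments, purely for illustration.
    archive.createSourceFromGit(
        git_repository='~uploader/ubuntu/+source/hello',
        ref='refs/heads/upload-2.10-2ubuntu1',
        commit_sha1='0123456789abcdef0123456789abcdef01234567',
        to_series='hirsute')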

> Of these, albeit with only half an hour's thought, I think my favourite
> is the last one: using merge proposals feels quite elegant, and is
> perhaps only a change in how your spec would be used rather than a
> format-level change. I may have missed something, though. What do you
> think?

Relying on merge proposals would work perfectly well for my team, whose
peer review policy means that nearly all uploads go through git-ubuntu
and merge proposals anyway. But what about teams who do not work that
way? They would be forced to use merge proposals just to get git-ubuntu
rich history preservation. Community flavor teams, for example, have
limited developer resources and so are unlikely to get any other
benefit from round-tripping through merge proposals. We could arrange
tooling so that it is no additional effort from an uploader's
perspective, but it seems to me an unnecessary hack to (ab)use merge
proposals in this way.

> > Notably there's a Dgit field defined by Debian Policy against dsc files,
> > which is used for a very similar purpose[2].
>
> I'm only a dgit user rather than an expert in its implementation, but I
> believe that the Dgit field is used by dgit to retrospectively work out
> the commit that represents a given source package version in the archive
> as part of preparing a newer version that ought to be a descendant of
> that commit, rather than as part of an upload instruction. That's why
> it lives in the .dsc file rather than the .changes: after an upload has
> been processed, the .changes is not stored in any authenticatable way.
> But it's true that the current specification of Dgit explicitly relies
> on the repository being at a well-known and persistent location.

This is my understanding also.

Thanks,

Robie