Monday 16 January 2017

Re: netplan and post-up/pre-down scripts

On Mon, Jan 16, 2017 at 7:35 AM, Mark Shuttleworth <mark@ubuntu.com> wrote:
Would 'got-link' and 'lost-link' be good names for this?

I'm not certain a new event name is needed for this functionality; it seems to me that the current definition of 'up' isn't quite correct.[1] (But all this might be a moot point depending on what is supported in networkd, and how it behaves.)

I understand there have been several attempts to address this in the past, such as the 'allow-hotplug' option, ifplugd, ifupdown-extra, NetworkManager, and now networkd. IMHO, no solution is complete unless it properly separates adminStatus from operStatus, and holds off on confirming "link up" until both are "up". For backward compatibility, a boolean flag (similar to "allow-hotplug") should indicate whether or not the system is allowed to continue booting if the interface is down.[2]
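
As a rough illustration of the "both must be up" rule, here is a minimal Python sketch. It assumes a Linux host where /sys/class/net/<iface>/flags holds the interface flags as a hex string (bit 0x1 being IFF_UP, i.e. administratively up) and /sys/class/net/<iface>/operstate holds the RFC 2863-style operational state. The function names are hypothetical and not part of netplan or networkd.

```python
from pathlib import Path

IFF_UP = 0x1  # administrative "up" bit in the Linux interface flags


def admin_up(flags_text: str) -> bool:
    """Parse the hex string read from /sys/class/net/<iface>/flags."""
    return bool(int(flags_text.strip(), 16) & IFF_UP)


def oper_up(operstate_text: str) -> bool:
    """Treat only the literal state "up" as operationally up."""
    return operstate_text.strip() == "up"


def link_up(flags_text: str, operstate_text: str) -> bool:
    """Confirm "link up" only when adminStatus AND operStatus are both up."""
    return admin_up(flags_text) and oper_up(operstate_text)


def read_link_up(iface: str, sysfs: Path = Path("/sys/class/net")) -> bool:
    """Read both values from sysfs for one interface (assumed paths)."""
    base = sysfs / iface
    return link_up((base / "flags").read_text(),
                   (base / "operstate").read_text())
```

Note that some drivers (and the loopback interface) report "unknown" as their operstate, so a real implementation would have to decide how to treat that value; this sketch simply counts it as not up.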

Another subtle detail is that if an interface is administratively down, there should be an option to cause the NIC to take its physical link down. That way, whatever is on the other side of the link doesn't assume its peer is active. (This is standard behavior on a router or a switch, but may be atypical for a server... so I think the default behavior should continue to be "leave the physical link up".)

Regards,
Mike


[1]: I would refer to the IF-MIB definitions for administrative and operational status, which haven't changed in a long time. They can be found in RFC 2863 sections 3.1.12 and 3.1.13[1]. There is also discussion (amendments to a previous RFC) about when to send the "linkUp" trap. (To summarize, only when a link is both operationally and administratively up.) See the relevant states here:


Contrast this with the default behavior of an auto/static interface: an interface is considered UP if its operStatus AND adminStatus were "up" within 5 minutes of boot. After that, you can throw all your assumptions out the window; the interface will stay DOWN even if its operational status later changes from "down" to "up", and the system will hobble along in a half-configured state.


[2]: I think that should default to allowing boot, to prevent the UX nightmares that occur when the boot process waits for interfaces it thinks should be up. If a particular service is finicky enough to not handle a missing interface gracefully, the admin can manually configure the flag to /not/ allow boot.

The current behavior is also strange because if an interface becomes operationally "down" after the five-minute timeout, the system takes no action, pretending nothing happened. (Why did we just wait 5 minutes for an interface to be up, if we weren't going to care if it later went down?) If a service /seriously/ depends on an interface being up, and cannot handle changes in interface status, the admin should configure that service to start upon receiving a link up event for that interface, and to stop upon receiving a link down event.
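
To make the start/stop idea concrete, here is a small Python sketch of a tracker that fires a callback on every transition of the combined state, not just during a boot-time window. The class name and callbacks are hypothetical; how the admin and oper changes are delivered (netlink, networkd, polling) is left out on purpose.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LinkTracker:
    """Track adminStatus and operStatus; fire callbacks on link up/down."""
    on_up: Callable[[], None]     # e.g. start the dependent service
    on_down: Callable[[], None]   # e.g. stop the dependent service
    admin: bool = False
    oper: bool = False

    @property
    def up(self) -> bool:
        # "Link up" means both statuses are up, as in RFC 2863.
        return self.admin and self.oper

    def update(self, admin: Optional[bool] = None,
               oper: Optional[bool] = None) -> None:
        """Apply a status change and emit an event only on a transition."""
        was_up = self.up
        if admin is not None:
            self.admin = admin
        if oper is not None:
            self.oper = oper
        if self.up and not was_up:
            self.on_up()
        elif was_up and not self.up:
            self.on_down()


events = []
tracker = LinkTracker(on_up=lambda: events.append("start"),
                      on_down=lambda: events.append("stop"))
tracker.update(admin=True)   # admin up only: no event yet
tracker.update(oper=True)    # both up: link up event
tracker.update(oper=False)   # cable pulled long after boot: link down event
tracker.update(oper=True)    # cable back: link up event again
```

The point of the sketch is that there is no timeout anywhere: a link that comes up an hour after boot is handled the same way as one that comes up in the first second.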