Expand TestPrimaryRoutesProperty (5 -> 9 ops). New ops mirror the
production shapes the failure cases hit: BatchProbeResults via
UpdateNodes, SimultaneousDisconnect via UpdateNodes, SetApprovedRoutes
that leaves announced RoutableIPs intact, OfflineExpiry that keeps
Unhealthy set. The model now tracks announced and approved separately
and recomputes the intersection.
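
A minimal sketch of the "announced vs. approved" intersection the model recomputes. The function and helper names are hypothetical, and exact prefix equality is an assumption of this sketch; the real model may match prefixes differently.

```go
package main

import (
	"fmt"
	"net/netip"
)

// effectiveRoutes returns the announced prefixes that are also approved,
// preserving the announced order. Exact prefix equality is an assumption
// of this sketch, not a statement about headscale's matching rules.
func effectiveRoutes(announced, approved []netip.Prefix) []netip.Prefix {
	allowed := make(map[netip.Prefix]struct{}, len(approved))
	for _, p := range approved {
		allowed[p] = struct{}{}
	}

	var out []netip.Prefix
	for _, p := range announced {
		if _, ok := allowed[p]; ok {
			out = append(out, p)
		}
	}
	return out
}

// prefixes parses CIDR strings and panics on malformed input.
func prefixes(cidrs ...string) []netip.Prefix {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		out = append(out, netip.MustParsePrefix(c))
	}
	return out
}

func main() {
	announced := prefixes("10.0.0.0/24", "10.1.0.0/24")
	approved := prefixes("10.1.0.0/24", "192.168.0.0/16")
	fmt.Println(effectiveRoutes(announced, approved))
}
```

Changing the approved set leaves the announced set untouched, which is the property the SetApprovedRoutes op relies on.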
Strengthen the per-op assertions to cover invariants the model alone
cannot prove: every primary must be online, every primary must
currently advertise its prefix, no flap onto an unhealthy candidate
when a healthy one was available, no flap off a previous primary that
remains a healthy candidate. The check now takes a pre-op snapshot so
the anti-flap rule has a stable reference.
Add TestHAProberProperty in servertest. It drives a real TestServer
with three HA-route-advertising clients through rapid-drawn sequences
of ClientDisconnect / ClientReconnect / ProberTick / WaitForSnapshot
ops and re-checks the same shape invariants after every step.
Document the system in hscontrol/state/HA_INVARIANTS.md: a state
machine over (Healthy+Online, Unhealthy+Online, Offline,
OfflineExpired), fifteen numbered invariants with predicates and
violation paths, and a coverage matrix mapping each invariant to its
unit, servertest, and integration tests. Three rows pin the recent
fixes to the invariants they enforce.
electPrimaryRoutes' all-unhealthy fallback picked candidates[0] when
the previous primary was no longer a candidate. The Phase-5
simultaneous dual-disconnect path in
TestHASubnetRouterFailoverDockerDisconnect hits this asymmetrically: a
batched probe cycle marks both
routers unhealthy with prev=r2 preserved, then the grace-period
Disconnect for r2 drops it from candidates. With prev gone and the
remaining r1 still carrying its Unhealthy bit, the fallback pointed
peers at the cable-pulled r1 — flapping primary to an unreachable
node and tripping requirePrimaryStable.
Leave the prefix unmapped when prev is gone and every candidate is
unhealthy. Peers see no advertiser instead of an unreachable one,
which is honest: the next probe cycle re-evaluates and picks
whichever node responds. The property-test model that mirrored the
old behaviour is updated to match.
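
The resulting election rule for a single prefix can be sketched as below. This is an illustration under stated assumptions, not the body of electPrimaryRoutes: the types are hypothetical, candidates are assumed sorted by NodeID, a prev of 0 means "no previous primary", and picking the lowest-ID healthy candidate is an assumption.

```go
package main

import "fmt"

// NodeID and candidate are hypothetical stand-ins for the real types.
type NodeID uint64

type candidate struct {
	ID        NodeID
	Unhealthy bool
}

// electPrimary picks the primary for one prefix. It returns ok=false when
// the prefix should be left unmapped. prev == 0 means no previous primary.
func electPrimary(candidates []candidate, prev NodeID) (NodeID, bool) {
	var (
		firstHealthy NodeID
		haveHealthy  bool
		prevPresent  bool
		prevHealthy  bool
	)
	for _, c := range candidates {
		if c.ID == prev {
			prevPresent = true
			prevHealthy = !c.Unhealthy
		}
		if !c.Unhealthy && !haveHealthy {
			firstHealthy, haveHealthy = c.ID, true
		}
	}

	switch {
	case prevHealthy:
		return prev, true // anti-flap: keep a healthy previous primary
	case haveHealthy:
		return firstHealthy, true // fail over to a healthy candidate
	case prevPresent:
		return prev, true // all unhealthy: preserve prev
	default:
		return 0, false // prev gone, all unhealthy: leave unmapped
	}
}

func main() {
	// prev (r2) has been dropped; only an unhealthy r1 remains.
	id, ok := electPrimary([]candidate{{ID: 1, Unhealthy: true}}, 2)
	fmt.Println(id, ok)
}
```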
requirePrimaryStable in TestHASubnetRouterFailoverDockerDisconnect
Phase 5a (simultaneous cable-pull of both routers) intermittently
caught the primary flipping to the offline r1. Both probe goroutines
mark their target unhealthy back-to-back; SetNodeUnhealthy publishes a
fresh NodeStore snapshot each call, so the intermediate snapshot — r1
unhealthy, r2 still healthy — runs the election with one healthy
candidate left and picks it. The next snapshot then enters the
all-unhealthy preserve-prev path, which preserves the wrong choice.
Collect probe results from the cycle and apply them through a new
NodeStore.UpdateNodes batched op so the election only runs once, with
the cycle's final health state. PolicyChange dispatch moves outside
the wg.Go goroutines and fires once if the primary assignment
actually changed.
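
A toy model of why batching matters. The store type, its fields, and the UpdateNodes signature here are hypothetical; only the idea (one election per probe cycle instead of one per result) comes from the change above.

```go
package main

import "fmt"

type NodeID uint64

// store is a hypothetical miniature: every published change re-runs the
// election, as each NodeStore snapshot does in the description above.
type store struct {
	ids       []NodeID // candidates, sorted by ID
	unhealthy map[NodeID]bool
	primary   NodeID
	elections int
}

func newStore(primary NodeID, ids ...NodeID) *store {
	return &store{ids: ids, unhealthy: map[NodeID]bool{}, primary: primary}
}

// elect keeps a healthy primary, fails over to the first healthy candidate,
// and preserves the current primary when every candidate is unhealthy.
func (s *store) elect() {
	s.elections++
	if !s.unhealthy[s.primary] {
		return
	}
	for _, id := range s.ids {
		if !s.unhealthy[id] {
			s.primary = id
			return
		}
	}
}

// SetNodeUnhealthy publishes one change and elects immediately.
func (s *store) SetNodeUnhealthy(id NodeID) {
	s.unhealthy[id] = true
	s.elect()
}

// UpdateNodes applies a whole probe cycle, then elects once.
func (s *store) UpdateNodes(results map[NodeID]bool) {
	for id, unhealthy := range results {
		s.unhealthy[id] = unhealthy
	}
	s.elect()
}

func main() {
	perCall := newStore(1, 1, 2)
	perCall.SetNodeUnhealthy(1) // intermediate state: fails over to 2
	perCall.SetNodeUnhealthy(2) // all unhealthy: preserves the wrong choice

	batched := newStore(1, 1, 2)
	batched.UpdateNodes(map[NodeID]bool{1: true, 2: true})

	fmt.Println("per-call:", perCall.primary, perCall.elections)
	fmt.Println("batched: ", batched.primary, batched.elections)
}
```

In the per-call variant the primary moves during the intermediate state and stays moved; in the batched variant the election sees only the final health state and the primary does not change.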
electPrimaryRoutes' all-unhealthy fallback picked candidates[0]
(lowest NodeID) regardless of who was prev. Under cable-pull
semantics IsOnline lags reality (long-poll TCP half-open), so
both routers stay in candidates and both go Unhealthy via the
prober — the fallback then churned primary to a node that was
itself unreachable.
Prefer prev when still in candidates; fall through to
candidates[0] only when prev is gone. Anti-blackhole holds.
Update the property test reference model and split the unit
test into existence (KeepsAPrimary) and identity
(PreservesPrevious) cases.
Fixes #3203
Replace routes.PrimaryRoutes reads with NodeStore. Connect bumps
SessionEpoch; Disconnect re-checks it inside UpdateNode so the
check and mutation are atomic against a concurrent Connect on
the same node.
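
The epoch check can be illustrated with a small sketch. Everything here is hypothetical apart from the SessionEpoch idea itself: a mutex stands in for whatever serialises NodeStore updates, and the Connect/Disconnect signatures are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// node and store are hypothetical stand-ins for the real types.
type node struct {
	Online       bool
	SessionEpoch uint64
}

type store struct {
	mu sync.Mutex
	n  node
}

// UpdateNode runs fn while holding the lock, so a check and a mutation
// made inside fn are atomic with respect to other updates.
func (s *store) UpdateNode(fn func(*node)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	fn(&s.n)
}

// Connect bumps the epoch and returns it to the new session.
func (s *store) Connect() uint64 {
	var epoch uint64
	s.UpdateNode(func(n *node) {
		n.SessionEpoch++
		n.Online = true
		epoch = n.SessionEpoch
	})
	return epoch
}

// Disconnect marks the node offline only if no newer session has
// connected since the caller's session started.
func (s *store) Disconnect(epoch uint64) bool {
	applied := false
	s.UpdateNode(func(n *node) {
		if n.SessionEpoch != epoch {
			return // stale disconnect from an older session
		}
		n.Online = false
		applied = true
	})
	return applied
}

func main() {
	s := &store{}
	oldSession := s.Connect()
	newSession := s.Connect() // reconnect races ahead of the old disconnect

	fmt.Println(s.Disconnect(oldSession), s.n.Online)
	fmt.Println(s.Disconnect(newSession), s.n.Online)
}
```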
The connect_race regression test is carried in its final
SessionEpoch form.
Updates #3203
NodeStore's writer goroutine now resolves GivenName collisions inside
applyBatch: on PutNode/UpdateNode the landing label gets -N appended
until unique, matching Tailscale SaaS. Empty labels fall back to the
literal "node".
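
A rough sketch of the suffixing rule. The function name is hypothetical, starting the suffix at 1 is an assumption, and DNS label length limits are ignored here.

```go
package main

import "fmt"

// uniqueLabel returns label if it is free, otherwise label-N for the
// smallest N >= 1 that is not taken. Empty labels fall back to "node".
// Starting at 1 is an assumption of this sketch.
func uniqueLabel(label string, taken map[string]bool) string {
	if label == "" {
		label = "node"
	}
	if !taken[label] {
		return label
	}
	for n := 1; ; n++ {
		candidate := fmt.Sprintf("%s-%d", label, n)
		if !taken[candidate] {
			return candidate
		}
	}
}

func main() {
	taken := map[string]bool{"laptop": true, "laptop-1": true}
	fmt.Println(uniqueLabel("laptop", taken))
	fmt.Println(uniqueLabel("", taken))
}
```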
SetGivenName exposes the admin-rename path: validates via
dnsname.ValidLabel and rejects on collision with ErrGivenNameTaken,
so renames do not silently rewrite behind the caller.
Updates #3188
Updates #2926
Updates #2343
Updates #2762
Tagged nodes are owned by their tags, not a user. Enforce this
invariant at every write path:
- createAndSaveNewNode: do not set UserID for tagged PreAuthKey
registration; clear UserID when advertise-tags are applied
during OIDC/CLI registration
- SetNodeTags: clear UserID/User when tags are assigned
- processReauthTags: clear UserID/User when tags are applied
during re-authentication
- validateNodeOwnership: reject tagged nodes with non-nil UserID
- NodeStore: skip nodesByUser indexing for tagged nodes since
they have no owning user
- HandleNodeFromPreAuthKey: add fallback lookup for tagged PAK
re-registration (tagged nodes indexed under UserID(0)); guard
against nil User deref for tagged nodes in different-user check
Since tagged nodes now have user_id = NULL, ListNodesByUser
will not return them and DestroyUser naturally allows deleting
users whose nodes have all been tagged. The ON DELETE CASCADE
FK cannot reach tagged nodes through a NULL foreign key.
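
The ownership invariant can be sketched as a single check. The node type and error value are hypothetical, and the real validateNodeOwnership may enforce more than the one rule shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical stand-in; UserID is nil when no user owns the node.
type node struct {
	Tags   []string
	UserID *uint
}

var errTaggedNodeHasUser = errors.New("tagged node must not have a UserID")

// validateNodeOwnership rejects tagged nodes that still carry a user.
// Only the rule named in the list above is sketched here.
func validateNodeOwnership(n node) error {
	if len(n.Tags) > 0 && n.UserID != nil {
		return errTaggedNodeHasUser
	}
	return nil
}

func main() {
	uid := uint(1)
	fmt.Println(validateNodeOwnership(node{Tags: []string{"tag:server"}, UserID: &uid}))
	fmt.Println(validateNodeOwnership(node{Tags: []string{"tag:server"}}))
}
```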
Also tone down shouty comments throughout state.go.
Fixes #3077
This PR changes tags from being something that exists on nodes in addition to users to being their own thing. It is part of moving our tags support towards the correct Tailscale-compatible implementation.
There are probably rough edges in this PR, but the intention is to get it in and then start fixing bugs from the 0.28.0 milestone (long-standing tags issues) to discover what works and what doesn't.
Updates #2417
Closes #2619
This PR addresses some consistency issues that were introduced or discovered with the nodestore.
nodestore:
Now returns the node that was put or updated once the write has finished. This closes a race condition where reading the node back did not necessarily return it with the given change applied, and it ensures we also get all the other updates from that batch write.
auth:
Authentication paths have been unified and simplified. This removes a lot of bad branches and ensures we only do the minimal work.
A comprehensive auth test set has been created so we do not have to run integration tests to validate auth and it has allowed us to generate test cases for all the branches we currently know of.
integration:
Added a lot more tooling and checks to validate that nodes reach the expected state when they come up and go down. Standardised between the different auth models. A lot of this is to support or detect issues in the changes to nodestore (races) and auth (inconsistencies after login and reaching the correct state).
This PR was assisted, particularly the tests, by Claude Code.
Initial work on a nodestore which stores all of the nodes
and their relations in memory, with peer relationships
precalculated.
It is a copy-on-write structure, replacing the "snapshot"
when a change to the structure occurs. It is optimised for reads,
and while batches are not fast, they are grouped together
to do less of the expensive peer calculation if there are many
changes rapidly.
Writes will block until committed, while reads are never
blocked.
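
A minimal copy-on-write sketch of the idea, not the actual nodestore: the types are hypothetical, a mutex stands in for the batching writer, and the peer precalculation is omitted. Readers load the current snapshot atomically; writers build a new map and swap it in.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// snapshot is immutable once published.
type snapshot struct {
	nodes map[uint64]string
}

type store struct {
	mu  sync.Mutex               // serialises writers only
	cur atomic.Pointer[snapshot] // readers never take the mutex
}

func newStore() *store {
	s := &store{}
	s.cur.Store(&snapshot{nodes: map[uint64]string{}})
	return s
}

// Get reads from whichever snapshot is current; it never blocks on writers.
func (s *store) Get(id uint64) (string, bool) {
	name, ok := s.cur.Load().nodes[id]
	return name, ok
}

// PutBatch copies the current snapshot, applies every change in the batch,
// and publishes the result once. It returns after the new snapshot is live.
func (s *store) PutBatch(batch map[uint64]string) {
	s.mu.Lock()
	defer s.mu.Unlock()

	old := s.cur.Load()
	next := &snapshot{nodes: make(map[uint64]string, len(old.nodes)+len(batch))}
	for id, name := range old.nodes {
		next.nodes[id] = name
	}
	for id, name := range batch {
		next.nodes[id] = name
	}
	s.cur.Store(next)
}

func main() {
	s := newStore()
	before := s.cur.Load() // a reader holding the old snapshot

	s.PutBatch(map[uint64]string{1: "r1", 2: "r2"})

	name, ok := s.Get(1)
	fmt.Println(name, ok, len(before.nodes))
}
```

A reader that loaded the old snapshot keeps a consistent view; it simply does not see the batch until it loads again.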
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>