headscale

mirror of https://github.com/juanfont/headscale.git synced 2026-05-24 19:18:41 +09:00

Author	SHA1	Message	Date
Kristoffer Dalby	e2f2f9211f	state, servertest: property-test HA election + invariant catalogue Expand TestPrimaryRoutesProperty (5 -> 9 ops). New ops mirror the production shapes the failure cases hit: BatchProbeResults via UpdateNodes, SimultaneousDisconnect via UpdateNodes, SetApprovedRoutes that leaves announced RoutableIPs intact, OfflineExpiry that keeps Unhealthy set. The model now tracks announced and approved separately and recomputes the intersection. Strengthen the per-op assertions to cover invariants the model alone cannot prove: every primary must be online, every primary must currently advertise its prefix, no flap onto an unhealthy candidate when a healthy one was available, no flap off a previous primary that remains a healthy candidate. The check now takes a pre-op snapshot so the anti-flap rule has a stable reference. Add TestHAProberProperty in servertest. It drives a real TestServer with three HA-route-advertising clients through rapid-drawn sequences of ClientDisconnect / ClientReconnect / ProberTick / WaitForSnapshot ops and re-checks the same shape invariants after every step. Document the system in hscontrol/state/HA_INVARIANTS.md: a state machine over (Healthy+Online, Unhealthy+Online, Offline, OfflineExpired), fifteen numbered invariants with predicates and violation paths, and a coverage matrix mapping each invariant to its unit, servertest, and integration tests. Three rows pin the recent fixes to the invariants they enforce.	2026-05-18 17:18:08 +02:00
Kristoffer Dalby	c7630b505b	state: leave prefix unmapped when all primary candidates unhealthy electPrimaryRoutes' all-unhealthy fallback picked candidates[0] when the previous primary was no longer a candidate. The Phase-5 simultaneous dual-disconnect path in TestHASubnetRouterFailoverDockerDisconnect hits this asymmetrically: a batched probe cycle marks both routers unhealthy with prev=r2 preserved, then the grace-period Disconnect for r2 drops it from candidates. With prev gone and the remaining r1 still carrying its Unhealthy bit, the fallback pointed peers at the cable-pulled r1, flapping primary to an unreachable node and tripping requirePrimaryStable. Leave the prefix unmapped when prev is gone and every candidate is unhealthy. Peers see no advertiser instead of an unreachable one, which is honest: the next probe cycle re-evaluates and picks whichever node responds. The property-test model that mirrored the old behaviour is updated to match.	2026-05-18 17:18:08 +02:00
Kristoffer Dalby	de6be71a86	state: batch HA probe results so dual-disconnect cannot flap primary requirePrimaryStable in TestHASubnetRouterFailoverDockerDisconnect Phase 5a (simultaneous cable-pull of both routers) intermittently caught the primary flipping to the offline r1. Both probe goroutines mark their target unhealthy back-to-back; SetNodeUnhealthy publishes a fresh NodeStore snapshot each call, so the intermediate snapshot — r1 unhealthy, r2 still healthy — runs the election with one healthy candidate left and picks it. The next snapshot then enters the all-unhealthy preserve-prev path, which preserves the wrong choice. Collect probe results from the cycle and apply them through a new NodeStore.UpdateNodes batched op so the election only runs once, with the cycle's final health state. PolicyChange dispatch moves outside the wg.Go goroutines and fires once if the primary assignment actually changed.	2026-05-18 17:18:08 +02:00
Kristoffer Dalby	3d5c0af4e7	state: preserve previous primary when all HA advertisers unhealthy electPrimaryRoutes' all-unhealthy fallback picked candidates[0] (lowest NodeID) regardless of who was prev. Under cable-pull semantics IsOnline lags reality (long-poll TCP half-open), so both routers stay in candidates and both go Unhealthy via the prober — the fallback then churned primary to a node that was itself unreachable. Prefer prev when still in candidates; fall through to candidates[0] only when prev is gone. Anti-blackhole holds. Update the property test reference model and split the unit test into existence (KeepsAPrimary) and identity (PreservesPrevious) cases. Fixes #3203	2026-04-29 18:08:39 +01:00
Kristoffer Dalby	437754aeea	state: switch consumers to NodeStore primary routes Replace routes.PrimaryRoutes reads with NodeStore. Connect bumps SessionEpoch; Disconnect re-checks it inside UpdateNode so the check and mutation are atomic against a concurrent Connect on the same node. The connect_race regression test is carried in its final SessionEpoch form. Updates #3203	2026-04-29 18:08:39 +01:00
Kristoffer Dalby	da927eb018	state: compute primary routes inside NodeStore snapshot Add primaries and isPrimary maps to Snapshot plus an election algorithm. No callers yet. Updates #3203	2026-04-29 18:08:39 +01:00
Kristoffer Dalby	a2c3ac095e	state: auto-bump GivenName on collision and add SetGivenName NodeStore's writer goroutine now resolves GivenName collisions inside applyBatch: on PutNode/UpdateNode the landing label gets -N appended until unique, matching Tailscale SaaS. Empty labels fall back to the literal "node". SetGivenName exposes the admin-rename path: validates via dnsname.ValidLabel and rejects on collision with ErrGivenNameTaken, so renames do not silently rewrite behind the caller. Updates #3188 Updates #2926 Updates #2343 Updates #2762	2026-04-18 15:12:21 +01:00
Kristoffer Dalby	93860a5c06	all: apply formatter changes	2026-04-13 17:23:47 +01:00
Kristoffer Dalby	75e56df9e4	hscontrol: enforce that tagged nodes never have user_id Tagged nodes are owned by their tags, not a user. Enforce this invariant at every write path: - createAndSaveNewNode: do not set UserID for tagged PreAuthKey registration; clear UserID when advertise-tags are applied during OIDC/CLI registration - SetNodeTags: clear UserID/User when tags are assigned - processReauthTags: clear UserID/User when tags are applied during re-authentication - validateNodeOwnership: reject tagged nodes with non-nil UserID - NodeStore: skip nodesByUser indexing for tagged nodes since they have no owning user - HandleNodeFromPreAuthKey: add fallback lookup for tagged PAK re-registration (tagged nodes indexed under UserID(0)); guard against nil User deref for tagged nodes in different-user check Since tagged nodes now have user_id = NULL, ListNodesByUser will not return them and DestroyUser naturally allows deleting users whose nodes have all been tagged. The ON DELETE CASCADE FK cannot reach tagged nodes through a NULL foreign key. Also tone down shouty comments throughout state.go. Fixes #3077	2026-02-20 21:51:00 +01:00
Kristoffer Dalby	ce580f8245	all: fix golangci-lint issues (#3064)	2026-02-06 21:45:32 +01:00
Kristoffer Dalby	72fcb93ef3	cli: ensure tagged-devices is included in profile list (#2991)	2026-01-09 16:31:23 +01:00
Kristoffer Dalby	eb788cd007	make tags first class node owner (#2885) This PR changes tags from being something that exists on nodes in addition to users to being its own thing. It is part of moving our tags support towards the correct Tailscale-compatible implementation. There are probably rough edges in this PR, but the intention is to get it in, and then start fixing bugs from the 0.28.0 milestone (long-standing tags issue) to discover what works and what doesn't. Updates #2417 Closes #2619	2025-12-02 12:01:25 +01:00
Kristoffer Dalby	db293e0698	hscontrol/state: make NodeStore batch configuration tunable (#2886)	2025-11-28 16:38:29 +01:00
Kristoffer Dalby	2bf1200483	policy: fix autogroup:self propagation and optimize cache invalidation (#2807)	2025-10-23 17:57:41 +02:00
Kristoffer Dalby	fddc7117e4	stability and race conditions in auth and node store (#2781) This PR addresses some consistency issues that were introduced or discovered with the nodestore. nodestore: Now returns the node that is being put or updated when it is finished. This closes a race condition where, when we read it back, we do not necessarily get the node with the given change, and it ensures we get all the other updates from that batch write. auth: Authentication paths have been unified and simplified. It removes a lot of bad branches and ensures we only do the minimal work. A comprehensive auth test set has been created so we do not have to run integration tests to validate auth, and it has allowed us to generate test cases for all the branches we currently know of. integration: added a lot more tooling and checks to validate that nodes reach the expected state when they come up and down. Standardised between the different auth models. A lot of this is to support or detect issues in the changes to nodestore (races) and auth (inconsistencies after login and reaching correct state). This PR was assisted, particularly tests, by Claude Code.	2025-10-16 12:17:43 +02:00
Kristoffer Dalby	50ed24847b	debug: add json and improve Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>	2025-09-09 09:40:00 +02:00
Kristoffer Dalby	9d236571f4	state/nodestore: in memory representation of nodes Initial work on a nodestore which stores all of the nodes and their relations in memory, with relationships for peers precalculated. It is a copy-on-write structure, replacing the "snapshot" when a change to the structure occurs. It is optimised for reads, and while batches are not fast, they are grouped together to do less of the expensive peer calculation if there are many changes rapidly. Writes will block until committed, while reads are never blocked. Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>	2025-09-09 09:40:00 +02:00

17 Commits
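
The election rule that commits 3d5c0af4e7 and c7630b505b describe for a single prefix can be sketched roughly as below. This is an illustration only, not the code of electPrimaryRoutes: the `NodeID` and `candidate` types, the `electPrimary` name, and the convention that `prev == 0` means "no previous primary" are all assumptions made for the example. It assumes the candidate list has already been filtered to online nodes advertising an approved route and is sorted by node ID.

```go
package main

import "fmt"

// NodeID and candidate are hypothetical stand-ins for this sketch;
// they are not headscale's actual types.
type NodeID uint64

type candidate struct {
	ID        NodeID
	Unhealthy bool
}

// electPrimary chooses a primary for one prefix.
// candidates is assumed to be pre-filtered (online, approved, advertising)
// and sorted by ID. prev is the previous primary, or 0 if there was none
// (assumption: real node IDs start at 1).
// The boolean result is false when the prefix should stay unmapped.
func electPrimary(prev NodeID, candidates []candidate) (NodeID, bool) {
	var firstHealthy NodeID
	haveHealthy := false
	prevPresent := false

	for _, c := range candidates {
		if c.ID == prev {
			prevPresent = true
			if !c.Unhealthy {
				// Anti-flap: a healthy previous primary keeps the role.
				return prev, true
			}
		}
		if !c.Unhealthy && !haveHealthy {
			firstHealthy, haveHealthy = c.ID, true
		}
	}

	if haveHealthy {
		// prev is gone or unhealthy: fail over to the lowest healthy ID.
		return firstHealthy, true
	}
	if prevPresent {
		// All candidates unhealthy: keep prev rather than churn (3d5c0af4e7).
		return prev, true
	}
	// All unhealthy and prev is gone: leave the prefix unmapped (c7630b505b).
	return 0, false
}

func main() {
	bothDown := []candidate{{ID: 1, Unhealthy: true}, {ID: 2, Unhealthy: true}}
	fmt.Println(electPrimary(2, bothDown)) // prev still a candidate

	onlyR1Down := []candidate{{ID: 1, Unhealthy: true}}
	fmt.Println(electPrimary(2, onlyR1Down)) // prev dropped from candidates
}
```

Returning a separate boolean keeps "no primary" distinct from any node ID, which matches the commit's point that peers should see no advertiser rather than an unreachable one.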