Commit Graph

35 Commits

Author SHA1 Message Date
Kristoffer Dalby
e2f2f9211f state, servertest: property-test HA election + invariant catalogue
Expand TestPrimaryRoutesProperty (5 -> 9 ops). New ops mirror the
production shapes the failure cases hit: BatchProbeResults via
UpdateNodes, SimultaneousDisconnect via UpdateNodes, SetApprovedRoutes
that leaves announced RoutableIPs intact, OfflineExpiry that keeps
Unhealthy set. The model now tracks announced and approved separately
and recomputes the intersection.

Strengthen the per-op assertions to cover invariants the model alone
cannot prove: every primary must be online, every primary must
currently advertise its prefix, no flap onto an unhealthy candidate
when a healthy one was available, no flap off a previous primary that
remains a healthy candidate. The check now takes a pre-op snapshot so
the anti-flap rule has a stable reference.

Add TestHAProberProperty in servertest. It drives a real TestServer
with three HA-route-advertising clients through rapid-drawn sequences
of ClientDisconnect / ClientReconnect / ProberTick / WaitForSnapshot
ops and re-checks the same shape invariants after every step.

Document the system in hscontrol/state/HA_INVARIANTS.md: a state
machine over (Healthy+Online, Unhealthy+Online, Offline,
OfflineExpired), fifteen numbered invariants with predicates and
violation paths, and a coverage matrix mapping each invariant to its
unit, servertest, and integration tests. Three rows pin the recent
fixes to the invariants they enforce.
2026-05-18 17:18:08 +02:00
Kristoffer Dalby
de6be71a86 state: batch HA probe results so dual-disconnect cannot flap primary
requirePrimaryStable in TestHASubnetRouterFailoverDockerDisconnect
Phase 5a (simultaneous cable-pull of both routers) intermittently
caught the primary flipping to the offline r1. Both probe goroutines
mark their target unhealthy back-to-back; SetNodeUnhealthy publishes a
fresh NodeStore snapshot each call, so the intermediate snapshot — r1
unhealthy, r2 still healthy — runs the election with one healthy
candidate left and picks it. The next snapshot then enters the
all-unhealthy preserve-prev path, which preserves the wrong choice.

Collect probe results from the cycle and apply them through a new
NodeStore.UpdateNodes batched op so the election only runs once, with
the cycle's final health state. PolicyChange dispatch moves outside
the wg.Go goroutines and fires once if the primary assignment
actually changed.
2026-05-18 17:18:08 +02:00
Kristoffer Dalby
fb8eecae25 state: defer HA failover when probe target reconnected mid-cycle
The HA prober dispatches a PingRequest, waits ProbeTimeout (5s), and
marks the node unhealthy if no callback arrives. A node that bounced
its poll session between probe cycles satisfies two conditions that
conspire to fail TestHASubnetRouterFailover: a probe queued against
the previous session is silently dropped when the worker writes to
the closed connection (timeout always fires), and a probe sent
immediately after reconnect lands while wgengine is still rebuilding
magicsock state from the new netmap. Either path installs a spurious
unhealthy bit, which sends the preserved-primary anti-flap the wrong
way.

Record the session observed at dispatch time and drop the timeout
path if the node reconnected since. Require the session to survive
a full probe cycle before a timeout can drive a failover.
2026-05-18 17:18:08 +02:00
Kristoffer Dalby
b5b786f519 servertest: cover broader-dst via grant in filter test
TestGrantViaSubnetFilterRules pins exact-equality dst. Add a sibling
for the broader-dst case so the regression sits at the server level
alongside the policy-engine unit test.

Updates #3267
2026-05-18 14:02:00 +02:00
Kristoffer Dalby
7a20db9f49 types: persist Node JSON slices via named IsZero types
Endpoints, Tags and ApprovedRoutes serialize as JSON on Node. GORM's
struct Updates path skips fields it considers zero, and reflect treats
a nil slice as zero — clearing any of these columns via the State
persist path would leave the previous value in the database.

Introduce Strings, Prefixes and AddrPorts as named slice types whose
IsZero() always reports false, so GORM keeps the column in the UPDATE
regardless of the slice being nil or empty. JSON marshalling is
unchanged: nil serializes to null, empty to []. List() returns the
underlying unnamed slice for callers (mainly testify assertions over
reflect.DeepEqual) that distinguish the named type from its base.

Regenerated types_clone.go and types_view.go follow the field-type
swap. Test assertions across hscontrol/{db,state,servertest} updated
to call .List() where reflect.DeepEqual previously matched the raw
slice type.

Fixes #3110
2026-05-15 11:21:58 +02:00
Kristoffer Dalby
64d13f77e8 types/config, types/node: model default-auto-update from auto_update.enabled
Tailscale stamps tailcfg.NodeAttrDefaultAutoUpdate on every node's
CapMap with a JSON bool reflecting the tailnet-wide auto-update
default. Headscale grows an auto_update.enabled config option and
emits the cap accordingly from TailNode -- the cap leaves the
unmodelledTailnetStateCaps strip list and is compared in full by the
nodeAttrs compat suite.

testNodeAttrsSuccess drives cfg.AutoUpdate.Enabled from
tf.Input.Tailnet.Settings.DevicesAutoUpdatesOn so each capture's
expected emission matches the SaaS state it was taken under. Two
captures cover both branches:

  - nodeattrs-tailnet-devices-auto-updates-on  -> [true]
  - nodeattrs-tailnet-devices-auto-updates-off -> [false]

The Tailscale v2 TailnetSettings API does not expose the Send Files
toggle, so the compat suite cannot vary cfg.Taildrop.Enabled per
capture. TestTaildropDisabledWithholdsFileSharingCap covers the off
path directly in servertest.
2026-05-13 14:22:30 +02:00
Kristoffer Dalby
8ea4cd3faa types/node, policy/v2: drop taildrive caps from baseline emission
Taildrive (drive:share and drive:access) is policy-driven per
Tailscale's documented behaviour
(https://tailscale.com/docs/features/taildrive). The previous
always-on baseline emission diverged from SaaS for every node not
targeted by a drive nodeAttr -- a real semantic divergence that the
compat suite caught once the test moved to comparing TailNode output
against the captured netmaps.

types.Node.TailNode no longer stamps the drive pair. Operators
wanting taildrive add a nodeAttrs entry:

  "nodeAttrs": [
    { "target": ["*"], "attr": ["drive:share", "drive:access"] }
  ]

unmodelledTailnetStateCaps shrinks accordingly. The baseline-divergence
group is gone; every entry left in the list is genuinely unmodelled
(user-role caps, unimplemented features, tailnet metadata, internal
tuning).

servertest's TestNodeAttrsBaselineCapsAlwaysOn expects the smaller
baseline (admin + ssh + file-sharing). Integration TestGrantCapDrive
grants the drive caps explicitly via NodeAttrs to exercise the
policy-driven emission path.
2026-05-13 14:22:30 +02:00
Kristoffer Dalby
078b9e308f policy/v2: SaaS-derived compat tests for nodeAttrs
Adds a data-driven test that loads testdata/nodeattrs_results/*.hujson
and diffs the captured SaaS-rendered netmaps against headscale's
compileNodeAttrs output. Each capture is one scenario the SaaS
control plane has rendered against the same policy headscale is asked
to compile -- the test enforces shape parity per node.

tailnet_state_caps.go enumerates the caps SaaS emits where headscale
has no equivalent concept yet (user-role admin/owner, tailnet lock,
services host, app connectors, internal magicsock and SSH tuning,
tailnet-state metadata) plus the always-on baseline (admin, ssh,
file-sharing) and the taildrive pair. stripUnmodelledTailnetStateCaps
filters both sides of cmp.Diff so the comparison focuses on the
policy-driven caps. PeerCapMap encodes which caps the Tailscale
client reads from the peer view (suggest-exit-node when exit routes
are approved, etc.) for use by the mapper.

testcapture switches to typed tailcfg/netmap/filtertype/apitype
values so schema drift between the capture tool and headscale
becomes a compile error rather than a silent test failure. Existing
compat suites (acl, grants, routes, ssh, issue_3212) move to the
typed shape.

The 53 SelfNode netmap captures and the 7 anonymizer-corrupted
suggest-charmander -> suggest-exit-node restorations in
routes_results / issue_3212 ride along.
2026-05-13 14:22:30 +02:00
Kristoffer Dalby
6fcff9e352 mapper, state: deliver nodeAttrs through MapResponse and harden nextdns DoH rewrite
WithSelfNode and buildTailPeers merge each node's policy CapMap
into the tailcfg.Node.CapMap they emit. State.NodeCapMap and
State.NodeCapMaps wrap the policy manager: NodeCapMap returns a
defensive clone per call; NodeCapMaps snapshots the full per-node
map once for batched callers, amortising pm.mu acquisition across
a peer build.

generateDNSConfig grew a per-node CapMap argument so it can apply
nodeAttr-driven DNS overlays. The nextdns DoH rewrite hardens against
policy-controlled inputs:

  - nextDNSDoHHost anchors the prefix match instead of substring,
    so a hostile resolver URL cannot smuggle a nextdns hostname in
    a path or query.
  - nextDNSProfileFromCapMap accepts only profile names matching
    [A-Za-z0-9._-]{1,64} and picks the lexicographically first when
    multiple are granted -- deterministic, no shell metacharacters
    or URL fragments through.
  - addNextDNSMetadata composes the rewritten URL via url.Parse +
    url.Values rather than fmt.Sprintf, so existing query strings
    on the resolver URL survive and metadata cannot inject a new
    component.

WithTaildropEnabled in servertest controls cfg.Taildrop.Enabled per
test so cap/file-sharing emission can be toggled in tests that need
to verify the off path.
2026-05-13 14:22:30 +02:00
Lealem Amedie
542091e82b Add unit test 2026-05-11 09:25:26 +01:00
Kristoffer Dalby
76ee29352b servertest: cover via-grant exit-node visibility end-to-end
TestGrantViaExitNodeInternetVisibility boots a server, applies a
policy that scopes autogroup:internet to a tag, registers a tagged
exit advertiser and a regular client, and asserts the client's netmap
surfaces the exit node with 0.0.0.0/0 and ::/0 in AllowedIPs — the
substrate the Tailscale client reads to populate
`tailscale exit-node list`.

TestGrantViaExitNodeNoFilterRules retains its assertion (literal /0
absent from the exit node's PacketFilter, matching SaaS PacketFilter
encoding); only its docstring is updated to reflect that the exit
node now does receive a TheInternet-shaped rule, just not the
literal /0 form.

Updates #3233
2026-04-30 19:22:45 +01:00
Kristoffer Dalby
863fa2f815 servertest, integration: cover HA both-offline recovery
Three regression tests for the user scenario: an in-process
Disconnect/Reconnect, a tailscale-down/up integration test, and
an iptables -j DROP cable-pull integration test.

Updates #3203
2026-04-29 18:08:39 +01:00
Kristoffer Dalby
9f7c8e9a07 state: clear Unhealthy when node leaves HA candidate set
Restore the legacy auto-clear at write boundaries that drop HA
candidacy: Disconnect, SetApprovedRoutes(empty), and
UpdateNodeFromMapRequest shrinking advertised routes to empty.
Plus a defensive guard in SetNodeUnhealthy.

Updates #3203
2026-04-29 18:08:39 +01:00
Kristoffer Dalby
437754aeea state: switch consumers to NodeStore primary routes
Replace routes.PrimaryRoutes reads with NodeStore. Connect bumps
SessionEpoch; Disconnect re-checks it inside UpdateNode so the
check and mutation are atomic against a concurrent Connect on
the same node.

The connect_race regression test is carried in its final
SessionEpoch form.

Updates #3203
2026-04-29 18:08:39 +01:00
Kristoffer Dalby
436d3db28e servertest: add dynamic HA failover tests
Existing HA tests verify server-side primary election; these add
end-to-end assertions from a viewer client's perspective, that marking
the primary unhealthy or revoking its approved route propagates through
the netmap so the viewer sees the flip.

Updates #3157
2026-04-17 16:31:49 +01:00
Kristoffer Dalby
5a7cafdf85 noise: reject non-HEAD on PingResponseHandler
chi routes only HEAD to the handler, but assert explicitly so a
future router config change cannot silently accept GET/POST and leak
latency bytes or side-effects.

Updates #3157
2026-04-17 16:31:49 +01:00
Kristoffer Dalby
164d659dd2 servertest: add TestViaGrantHACompat for via+HA compat tests
Data-driven tests for via grants combined with HA primary routes:
crossed via tags on same prefix, mixed via+regular across HA pairs,
four-way HA, and the kitchen-sink scenario. Each case uses an inline
topology captured from SaaS.

Updates #3157
2026-04-17 16:31:49 +01:00
Kristoffer Dalby
7d104b8c8d servertest: add via grant map compat tests
End-to-end exercise of via-grant compilation against SaaS captures:
peer visibility, AllowedIPs, PrimaryRoutes, and per-rule src/dst
reachability from each viewer's perspective.

Updates #3157
2026-04-17 16:31:49 +01:00
Kristoffer Dalby
0378e2d2c6 servertest: add HA health probing tests
Five scenarios: healthy probes, failover on unhealthy primary,
no-flap recovery, connect clears unhealthy, no-op without HA routes.

Updates #2129
Updates #2902
2026-04-16 15:10:56 +01:00
Kristoffer Dalby
97778c9930 all: add tests for PingRequest implementation
Unit tests for Change (IsEmpty, Merge, Type, PingNode constructor),
ping tracker (register/complete/cancel lifecycle, concurrency, latency),
and end-to-end servertests exercising the full round-trip with real
controlclient.Direct instances.

Updates #2902
Updates #2129
2026-04-15 10:53:35 +01:00
Kristoffer Dalby
0d4f2293ff state: replace zcache with bounded LRU for auth cache
Replace zcache with golang-lru/v2/expirable for both the state auth
cache and the OIDC state cache. Add tuning.register_cache_max_entries
(default 1024) to cap the number of pending registration entries.

Introduce types.RegistrationData to replace caching a full *Node;
only the fields the registration callback path reads are retained.
Remove the dead HSDatabase.regCache field. Drop zgo.at/zcache/v2
from go.mod.
2026-04-10 14:09:57 +01:00
Kristoffer Dalby
ff29af63f6 servertest: use memnet networking and add WithNodeExpiry option
Replace httptest (real TCP sockets) with tailscale.com/net/memnet
so all connections stay in-process. Wire the client's tsdial.Dialer
to the server's memnet.Network via SetSystemDialerForTest,
preserving the full Noise protocol path.

Also update servertest to use the new Node.Ephemeral.InactivityTimeout
config path introduced in the types refactor, and add WithNodeExpiry
server option for testing default node key expiry behaviour.

Updates #1711
2026-04-08 13:00:22 +01:00
Kristoffer Dalby
30dce30a9d testdata: convert .json to .hujson with header comments
Rename all 594 test data files from .json to .hujson and add
descriptive header comments to each file documenting what policy
rules are under test and what outcome is expected.

Update test loaders in all 5 _test.go files to parse HuJSON via
hujson.Parse/Standardize/Pack before json.Unmarshal.

Add cross-dependency warning to via_compat_test.go documenting
that GRANT-V29/V30/V31/V36 are shared with TestGrantsCompat.

Add .gitignore exemption for testdata HuJSON files.
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
b762e4c350 integration: remove exit node via grant tests
Remove TestGrantViaExitNodeSteering and TestGrantViaMixedSteering.
Exit node traffic forwarding through via grants cannot be validated
with curl/traceroute in Docker containers because Tailscale exit nodes
strip locally-connected subnets from their forwarding filter.

The correctness of via exit steering is validated by:
- Golden MapResponse comparison (TestViaGrantMapCompat with GRANT-V31
  and GRANT-V36) comparing full netmap output against Tailscale SaaS
- Filter rule compatibility (TestGrantsCompat with GRANT-V14 through
  GRANT-V36) comparing per-node PacketFilter rules against Tailscale SaaS
- TestGrantViaSubnetSteering (kept) validates via subnet steering with
  actual curl/traceroute through Docker, which works for subnet routes

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
c36cedc32f policy/v2: fix via grants in BuildPeerMap, MatchersForNode, and ViaRoutesForPeer
Use per-node compilation path for via grants in BuildPeerMap and MatchersForNode to ensure via-granted nodes appear in peer maps. Fix ViaRoutesForPeer golden test route inference to correctly resolve via grant effects.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
6a55f7d731 policy/v2: add via exit steering golden captures and tests
Add golden test data for via exit route steering and fix via exit grant compilation to match Tailscale SaaS behavior. Includes MapResponse golden tests for via grant route steering verification.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
0431039f2a servertest: add regression tests for via grant filter rules
Add three tests that verify control plane behavior for grant policies:

- TestGrantViaSubnetFilterRules: verifies the router's PacketFilter
  contains destination rules for via-steered subnets. Without per-node
  filter compilation for via grants, these rules were missing and the
  router would drop forwarded traffic.

- TestGrantViaExitNodeFilterRules: same verification for exit nodes
  with via grants steering autogroup:internet traffic.

- TestGrantIPv6OnlyPrefixACL: verifies that address-based aliases
  (Prefix, Host) resolve to exactly the literal prefix and do not
  expand to include the matching node's other IP addresses. An
  IPv6-only host definition produces only IPv6 filter rules.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
3ca4ff8f3f state,servertest: add grant control plane tests and fix via route ReduceRoutes filtering
Add servertest grant policy control plane tests covering basic grants, via grants, and cap grants. Fix ReduceRoutes in State to apply route reduction to non-via routes first, then append via-included routes, preventing via grant routes from being incorrectly filtered.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
53b8a81d48 servertest: support tagged pre-auth keys in test clients
WithTags was defined but never passed through to CreatePreAuthKey.
Fix NewClient to use CreateTaggedPreAuthKey when tags are specified,
enabling tests that need tagged nodes (e.g. via grant steering).

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
3d53f97c82 hscontrol/servertest: fix test expectations for eventual consistency
Three corrections to issue tests that had wrong assumptions about
when data becomes available:

1. initial_map_should_include_peer_online_status: use WaitForCondition
   instead of checking the initial netmap. Online status is set by
   Connect() which sends a PeerChange patch after the initial
   RegisterResponse, so it may not be present immediately.

2. disco_key_should_propagate_to_peers: use WaitForCondition. The
   DiscoKey is sent in the first MapRequest (not RegisterRequest),
   so peers may not see it until a subsequent map update.

3. approved_route_without_announcement: invert the test expectation.
   Tailscale uses a strict advertise-then-approve model -- routes are
   only distributed when the node advertises them (Hostinfo.RoutableIPs)
   AND they are approved. An approval without advertisement is a dormant
   pre-approval. The test now asserts the route does NOT appear in
   AllowedIPs, matching upstream Tailscale semantics.

Also fix TestClient.Reconnect to clear the cached netmap and drain
pending updates before re-registering. Without this, WaitForPeers
returned immediately based on the old session's stale data.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
00c41b6422 hscontrol/servertest: add race, stress, and poll race tests
Add three test files designed to stress the control plane under
concurrent and adversarial conditions:

- race_test.go: 14 tests exercising concurrent mutations, session
  replacement, batcher contention, NodeStore access, and map response
  delivery during disconnect. All pass the Go race detector.

- poll_race_test.go: 8 tests targeting the poll.go grace period
  interleaving. These confirm a logical TOCTOU race: when a node
  disconnects and reconnects within the grace period, the old
  session's deferred Disconnect() can overwrite the new session's
  Connect(), leaving IsOnline=false despite an active poll session.

- stress_test.go: sustained churn, rapid mutations, rolling
  replacement, data integrity checks under load, and verification
  that rapid reconnects do not leak false-offline notifications.

Known failing tests (grace period TOCTOU race):
- server_state_online_after_reconnect_within_grace
- update_history_no_false_offline
- rapid_reconnect_peer_never_sees_offline
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
ab4e205ce7 hscontrol/servertest: expand issue tests to 24 scenarios, surface 4 issues
Split TestIssues into 7 focused test functions to stay under cyclomatic
complexity limits while testing more aggressively.

Issues surfaced (4 failing tests):

1. initial_map_should_include_peer_online_status: Initial MapResponse
   has Online=nil for peers. Online status only arrives later via
   PeersChangedPatch.

2. disco_key_should_propagate_to_peers: DiscoPublicKey set by client
   is not visible to peers. Peers see zero disco key.

3. approved_route_without_announcement_is_visible: Server-side route
   approval without client-side announcement silently produces empty
   SubnetRoutes (intersection of empty announced + approved = empty).

4. nodestore_correct_after_rapid_reconnect: After 5 rapid reconnect
   cycles, NodeStore reports node as offline despite having an active
   poll session. The connect/disconnect grace period interleaving
   leaves IsOnline in an incorrect state.

Passing tests (20) verify:
- IP uniqueness across 10 nodes
- IP stability across reconnect
- New peers have addresses immediately
- Node rename propagates to peers
- Node delete removes from all peer lists
- Hostinfo changes (OS field) propagate
- NodeStore/DB consistency after route mutations
- Grace period timing (8-20s window)
- Ephemeral node deletion (not just offline)
- 10-node simultaneous connect convergence
- Rapid sequential node additions
- Reconnect produces complete map
- Cross-user visibility with default policy
- Same-user multiple nodes get distinct IDs
- Same-hostname nodes get unique GivenNames
- Policy change during connect still converges
- DERP region references are valid
- User profiles present for self and peers
- Self-update arrives after route approval
- Route advertisement stored as AnnouncedRoutes
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
f87b08676d hscontrol/servertest: add policy, route, ephemeral, and content tests
Extend the servertest harness with:
- TestClient.Direct() accessor for advanced operations
- TestClient.WaitForPeerCount and WaitForCondition helpers
- TestHarness.ChangePolicy for ACL policy testing
- AssertDERPMapPresent and AssertSelfHasAddresses

New test suites:
- content_test.go: self node, DERP map, peer properties, user profiles,
  update history monotonicity, and endpoint update propagation
- policy_test.go: default allow-all, explicit policy, policy triggers
  updates on all nodes, multiple policy changes, multi-user mesh
- ephemeral_test.go: ephemeral connect, cleanup after disconnect,
  mixed ephemeral/regular, reconnect prevents cleanup
- routes_test.go: addresses in AllowedIPs, route advertise and approve,
  advertised routes via hostinfo, CGNAT range validation

Also fix node_departs test to use WaitForCondition instead of
assert.Eventually, and convert concurrent_join_and_leave to
interleaved_join_and_leave with grace-period-tolerant assertions.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
ca7362e9aa hscontrol/servertest: add control plane lifecycle and consistency tests
Add three test files exercising the servertest harness:

- lifecycle_test.go: connection, disconnection, reconnection, session
  replacement, and mesh formation at various sizes.
- consistency_test.go: symmetric visibility, consistent peer state,
  address presence, concurrent join/leave convergence.
- weather_test.go: rapid reconnects, flapping stability, reconnect
  with various delays, concurrent reconnects, and scale tests.

All tests use table-driven patterns with subtests.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
0288614bdf hscontrol: add servertest harness for in-process control plane testing
Add a new hscontrol/servertest package that provides a test harness
for exercising the full Headscale control protocol in-process, using
Tailscale's controlclient.Direct as the client.

The harness consists of:
- TestServer: wraps a Headscale instance with an httptest.Server
- TestClient: wraps controlclient.Direct with NetworkMap tracking
- TestHarness: orchestrates N clients against a single server
- Assertion helpers for mesh completeness, visibility, and consistency

Export minimal accessor methods on Headscale (HTTPHandler, NoisePublicKey,
GetState, SetServerURL, StartBatcher, StartEphemeralGC) so the servertest
package can construct a working server from outside the hscontrol package.

This enables fast, deterministic tests of connection lifecycle, update
propagation, and network weather scenarios without Docker.
2026-03-19 07:05:58 +01:00