Commit Graph

3970 Commits

Author SHA1 Message Date
Kristoffer Dalby
9b1a6b6c05 integration: add cap/relay grant peer relay lifecycle test
Add ConnectToNetwork to the TailscaleClient interface for multi-network test scenarios and implement peer relay ping support. Use these to test that cap/relay grants correctly enable peer-to-peer relay connections between tagged nodes.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
8573ff9158 policy/v2: fix grant-only policies returning FilterAllowAll
compileFilterRules checked only pol.ACLs == nil to decide whether
to return FilterAllowAll (permit-any). Policies that use only Grants
(no ACLs) had nil ACLs, so the function short-circuited before
compiling any CapGrant rules. This meant cap/relay, cap/drive, and
any other App-based grant capabilities were silently ignored.

Check both ACLs and Grants are empty before returning FilterAllowAll.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
a739862c65 integration: add via grant route steering tests
Add integration tests validating that via grants correctly steer
routes to designated nodes per client group:

- TestGrantViaSubnetSteering: two routers advertise the same
  subnet, via grants steer each client group to a specific router.
  Verifies per-client route visibility, curl reachability, and
  traceroute path.

- TestGrantViaExitNodeSteering: two exit nodes, via grants steer
  each client group to a designated exit node. Verifies exit
  routes are withdrawn from non-designated nodes and the client
  rejects setting a non-designated exit node.

- TestGrantViaMixedSteering: cross-steering where subnet routes
  and exit routes go to different servers per client group.
  Verifies subnet traffic uses the subnet-designated server while
  exit traffic uses the exit-designated server.

Also add autogroupp helper for constructing AutoGroup aliases in
grant policy configurations.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
8358017dcf policy/v2,state,mapper: implement per-viewer via route steering
Via grants steer routes to specific nodes per viewer. Until now,
all clients saw the same routes for each peer because route
assembly was viewer-independent. This implements per-viewer route
visibility so that via-designated peers serve routes only to
matching viewers, while non-designated peers have those routes
withdrawn.

Add ViaRouteResult type (Include/Exclude prefix lists) and
ViaRoutesForPeer to the PolicyManager interface. The v2
implementation iterates via grants, resolves sources against the
viewer, matches destinations against the peer's advertised routes
(both subnet and exit), and categorizes prefixes by whether the
peer has the via tag.

Add RoutesForPeer to State which composes global primary election,
via Include/Exclude filtering, exit routes, and ACL reduction.
When no via grants exist, it falls back to existing behavior.

Update the mapper to call RoutesForPeer per-peer instead of using
a single route function for all peers. The route function now
returns all routes (subnet + exit), and TailNode filters exit
routes out of the PrimaryRoutes field for HA tracking.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
28be15f8ea policy/v2: handle autogroup:internet in via grant compilation
compileViaGrant only handled *Prefix destinations, skipping
*AutoGroup entirely. This meant via grants with
dst=[autogroup:internet] produced no filter rules even when the
node was an exit node with approved exit routes.

Switch the destination loop from a type assertion to a type switch
that handles both *Prefix (subnet routes) and *AutoGroup (exit
routes via autogroup:internet). Also check ExitRoutes() in
addition to SubnetRoutes() so the function doesn't bail early
when a node only has exit routes.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
687cf0882f policy/v2: implement autogroup:danger-all support
Add autogroup:danger-all as a valid source alias that matches ALL IP
addresses including non-Tailscale addresses. When used as a source,
it resolves to 0.0.0.0/0 + ::/0 internally but produces SrcIPs: ["*"]
in filter rules. When used as a destination, it is rejected with an
error matching Tailscale SaaS behavior.

Key changes:
- Add AutoGroupDangerAll constant and validation
- Add sourcesHaveDangerAll() helper and hasDangerAll parameter to
  srcIPsWithRoutes() across all compilation paths
- Add ErrAutogroupDangerAllDst for destination rejection
- Remove 3 AUTOGROUP_DANGER_ALL skip entries (K6, K7, K8)

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
4f040dead2 policy/v2: implement grant validation rules matching Tailscale SaaS
Implement comprehensive grant validation including: accept empty sources/destinations (they produce no rules), validate grant ip/app field requirements, capability name format, autogroup constraints, via tag existence, and default route CIDR restrictions.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
54db47badc policy/v2: implement via route compilation for grants
Compile grants with "via" field into FilterRules that are placed only
on nodes matching the via tag and actually advertising the destination
subnets. Key behavior:

- Filter rules go exclusively to via-nodes with matching approved routes
- Destination subnets not advertised by the via node are silently dropped
- App-only via grants (no ip field) produce no packet filter rules
- Via grants are skipped in the global compileFilterRules since they
  are node-specific

Reduces grant compat test skips from 41 to 30 (11 newly passing).

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
0e3acdd8ec policy/v2: implement CapGrant compilation with companion capabilities
Compile grant app fields into CapGrant FilterRules matching Tailscale
SaaS behavior. Key changes:

- Generate CapGrant rules in compileFilterRules and
  compileGrantWithAutogroupSelf, with node-specific /32 and /128
  Dsts for autogroup:self grants
- Add reversed companion rules for drive→drive-sharer and
  relay→relay-target capabilities, ordered by original cap name
- Narrow broad CapGrant Dsts to node-specific prefixes in
  ReduceFilterRules via new reduceCapGrantRule helper
- Skip merging CapGrant rules in mergeFilterRules to preserve
  per-capability structure
- Remove ip+app mutual exclusivity validation (Tailscale accepts both)
- Add semantic JSON comparison for RawMessage types and netip.Prefix
  comparators in test infrastructure

Reduces grant compat test skips from 99 to 41 (58 newly passing).

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
ebe0f4078d policy/v2: preserve non-wildcard source IPs alongside wildcard ranges
When an ACL source list contains a wildcard (*) alongside explicit
sources (tags, groups, hosts, etc.), Tailscale preserves the individual
IPs from non-wildcard sources in SrcIPs alongside the merged wildcard
CGNAT ranges. Previously, headscale's IPSetBuilder would merge all
sources into a single set, absorbing the explicit IPs into the wildcard
range.

Track non-wildcard resolved addresses separately during source
resolution, then append their individual IP strings to the output
when a wildcard is also present. This fixes the remaining 5 ACL
compat test failures (K01 and M06 subtests).

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
dda35847b0 policy/v2: reorder ACL self grants to match Tailscale rule ordering
When an ACL has non-autogroup destinations (groups, users, tags, hosts)
alongside autogroup:self, emit non-self grants before self grants to
match Tailscale's filter rule ordering. ACLs with only autogroup
destinations (self + member) preserve the policy-defined order.

This fixes ACL-A17, ACL-SF07, and ACL-SF11 compat test failures.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
f95b254ea9 policy/v2: exclude exit routes from ReduceFilterRules
Add exit route check in ReduceFilterRules to prevent exit nodes from receiving packet filter rules for destinations that only overlap via exit routes. Remove resolved SUBNET_ROUTE_FILTER_RULES grant skip entries and update error message formatting for grant validation.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
e05f45cfb1 policy/v2: use approved node routes in wildcard SrcIPs
Per Tailscale documentation, the wildcard (*) source includes "any
approved subnets" — the actually-advertised-and-approved routes from
nodes, not the autoApprover policy prefixes.

Change Asterix.resolve() to return just the base CGNAT+ULA set, and
add approved subnet routes as separate SrcIPs entries in the filter
compilation path. This preserves individual route prefixes that would
otherwise be merged by IPSet (e.g., 10.0.0.0/8 absorbing 10.33.0.0/16).

Also swap rule ordering in compileGrantWithAutogroupSelf() to emit
non-self destination rules before autogroup:self rules, matching the
Tailscale FilterRule wire format ordering.

Remove the unused AutoApproverPolicy.prefixes() method.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
995ed0187c policy/v2: add advertised routes to compat test topologies
Add routable_ips and approved_routes fields to the node topology
definitions in all golden test files. These represent the subnet
routes actually advertised by nodes on the Tailscale SaaS network
during data capture:

  Routes topology (92 files, 6 router nodes):
    big-router:     10.0.0.0/8
    subnet-router:  10.33.0.0/16
    ha-router1:     192.168.1.0/24
    ha-router2:     192.168.1.0/24
    multi-router:   172.16.0.0/24
    exit-node:      0.0.0.0/0, ::/0

  ACL topology (199 files, 1 router node):
    subnet-router:  10.33.0.0/16

  Grants topology (203 files, 1 router node):
    subnet-router:  10.33.0.0/16

The route assignments were deduced from the golden data by analyzing
which router nodes receive FilterRules for which destination CIDRs
across all test files, and cross-referenced with the MTS setup
script (setup_grant_nodes.sh).

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
927ce418d2 policy/v2: use bare IPs in autogroup:self DstPorts
Use ip.String() instead of netip.PrefixFrom(ip, ip.BitLen()).String()
when building DstPorts for autogroup:self destinations. This produces
bare IPs like "100.90.199.68" instead of CIDR notation like
"100.90.199.68/32", matching the Tailscale FilterRule wire format.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
93d79d8da9 policy: include IPv6 in identity-based alias resolution
AppendToIPSet now adds both IPv4 and IPv6 addresses for nodes, matching Tailscale's FilterRule wire format where identity-based aliases (tags, users, groups, autogroups) resolve to both address families. Update ReduceFilterRules test expectations to include IPv6 entries.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
500442c8f1 policy/v2: convert routes compat tests to data-driven format with Tailscale SaaS captures
Replace 8,286 lines of inline Go test expectations with 92 JSON golden files captured from Tailscale SaaS. The data-driven test driver validates route filtering, auto-approval, HA routing, and exit node behavior against real Tailscale output.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
2fb71690e8 policy/v2: convert ACL compat tests to data-driven format with Tailscale SaaS captures
Replace 9,937 lines of inline Go test expectations with 215 JSON golden files captured from Tailscale SaaS. The new data-driven test driver compares headscale's filter compilation output against real Tailscale behavior for each node in an 8-node topology.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
9f7aa55689 policy/v2: refactor alias resolution to use ResolvedAddresses
Introduce ResolvedAddresses type for structured IP set results. Refactor all Alias.Resolve() methods to return ResolvedAddresses instead of raw IPSets. Restrict identity-based aliases to matching address families, fix nil dereferences in partial resolution paths, and update test expectations for the new IP format (bare IPs, IP ranges instead of CIDR prefixes).

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
0fa9dcaff8 policy/v2: add data-driven grants compatibility test with Tailscale SaaS captures
Rename tailscale_compat_test.go to tailscale_acl_compat_test.go to make room for the grants compat test. Add 237 GRANT-*.json golden test files captured from Tailscale SaaS and a data-driven test driver that compares headscale's grant filter compilation against real Tailscale behavior.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
f74ea5b8ed hscontrol/policy/v2: add Grant policy format support
Add support for the Grant policy format as an alternative to ACL format,
following Tailscale's policy v2 specification. Grants provide a more
structured way to define network access rules with explicit separation
of IP-based and capability-based permissions.

Key changes:

- Add Grant struct with Sources, Destinations, InternetProtocols (ip),
  and App (capabilities) fields
- Add ProtocolPort type for unmarshaling protocol:port strings
- Add Grant validation in Policy.validate() to enforce:
  - Mutual exclusivity of ip and app fields
  - Required ip or app field presence
  - Non-empty sources and destinations
- Refactor compileFilterRules to support both ACLs and Grants
- Convert ACLs to Grants internally via aclToGrants() for unified
  processing
- Extract destinationsToNetPortRange() helper for cleaner code
- Rename parseProtocol() to toIANAProtocolNumbers() for clarity
- Add ProtocolNumberToName mapping for reverse lookups

The Grant format allows policies to be written using either the legacy
ACL format or the new Grant format. ACLs are converted to Grants
internally, ensuring backward compatibility while enabling the new
format's benefits.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
53b8a81d48 servertest: support tagged pre-auth keys in test clients
WithTags was defined but never passed through to CreatePreAuthKey.
Fix NewClient to use CreateTaggedPreAuthKey when tags are specified,
enabling tests that need tagged nodes (e.g. via grant steering).

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
15c1cfd778 types: include ExitRoutes in HasNetworkChanges
When exit routes are approved, SubnetRoutes remains empty because exit
routes (0.0.0.0/0, ::/0) are classified separately. Without checking
ExitRoutes, the PolicyManager cache is not invalidated on exit route
approval, causing stale filter rules that lack via grant entries for
autogroup:internet destinations.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
a76b4bd46c ci: switch integration tests to ARM runners
Switch all integration test jobs (build, build-postgres, test
template) from ubuntu-latest (x86_64) to ubuntu-24.04-arm (aarch64).

ARM runners on GitHub Actions are free for public repos and tend
to have more consistent performance characteristics than the
shared x86_64 pool. This should reduce flakiness caused by
resource contention on congested runners.

Updates #3125
2026-03-31 22:06:25 +02:00
Kristoffer Dalby
a9a2001ae7 integration: scale remaining hardcoded timeouts and replace pingAllHelper
Apply CI-aware scaling to all remaining hardcoded timeouts:

- requireAllClientsOfflineStaged: scale the three internal stage
  timeouts (15s/20s/60s) with ScaledTimeout.
- validateReloginComplete: scale requireAllClientsOnline (120s)
  and requireAllClientsNetInfoAndDERP (3min) calls.
- WaitForTailscaleSyncPerUser callers in acl_test.go (3 sites, 60s).
- WaitForRunning callers in tags_test.go (10 sites): switch to
  PeerSyncTimeout() to match convention.
- WaitForRunning/WaitForPeers direct callers in route_test.go.
- requireAllClientsOnline callers in general_test.go and
  auth_key_test.go.

Replace pingAllHelper with assertPingAll/assertPingAllWithCollect:

- Wraps pings in EventuallyWithT so transient docker exec timeouts
  are retried instead of immediately failing the test.
- Timeout scales with the ping matrix size (2s per ping budget for
  2 full sweeps) so large tests get proportionally more time.
- Uses CollectT correctly, fixing the broken EventuallyWithT usage
  in TestEphemeral where the old t.Errorf bypassed CollectT.
- Follows the established assert*/assertWithCollect naming.

Updates #3125
2026-03-31 22:06:25 +02:00
Kristoffer Dalby
acb8cfc7ee integration: make docker execute and ping timeouts CI-aware
The default docker execute timeout (10s) is the root cause of
"dockertest command timed out" errors across many integration tests
on CI. On congested GitHub Actions runners, docker exec latency
alone can consume 2-5 seconds of this budget before the command
even starts inside the container.

Replace the hardcoded 10s constant with a function that returns
20s on CI, doubling the budget for all container commands
(tailscale status, headscale CLI, curl, etc.).

Similarly, scale the default tailscale ping timeout from 200ms to
400ms on CI. This doubles the per-attempt budget and the docker
exec timeout for pings (from 200ms*5=1s to 400ms*5=2s), giving
more headroom for docker exec overhead.

Updates #3125
2026-03-31 22:06:25 +02:00
Kristoffer Dalby
f1e5f1346d integration/acl: add tag verification step to TestACLTagPropagationPortSpecific
TestACLTagPropagationPortSpecific failed twice on CI because it jumped
from SetNodeTags directly to checking curl, without first verifying the
tag change was applied on the server. This races against server-side
processing.

Add a tag verification step (matching TestACLTagPropagation's pattern)
and bump the Step 4 timeout from 60s to 90s since port-specific filter
changes require both endpoints to process the new PacketFilter from
the MapResponse while the WireGuard tunnel stays up.

Updates #3125
2026-03-31 22:06:25 +02:00
Kristoffer Dalby
210f58f62e integration: use CI-scaled timeouts for all EventuallyWithT assertions
Wrap all 329 hardcoded EventuallyWithT timeouts across 12 test files
with integrationutil.ScaledTimeout(), which applies a 2x multiplier
on CI runners. This addresses the systemic issue where hardcoded
timeouts that work locally are insufficient under CI resource
contention.

Variable-based timeouts (propagationTime, assertTimeout in
route_test.go and totalWaitTime in auth_oidc_test.go) are wrapped
at their definition site so all downstream usages benefit.

The retry intervals (second duration parameter) are intentionally
NOT scaled, as they control polling frequency, not total wait time.

Updates #3125
2026-03-31 22:06:25 +02:00
Kristoffer Dalby
a147b0cd87 integration/acl: use CurlFailFast for all negative curl assertions
Replace Curl() with CurlFailFast() in all negative curl assertions
(where the test expects the connection to fail). CurlFailFast uses
1 retry and 2s max time instead of 3 retries and 5s max, which
avoids wasting time on unnecessary retries when we expect the
connection to be blocked.

This affects 21 call sites across 7 test functions:

- TestACLAllowUser80Dst
- TestACLDenyAllPort80
- TestACLAllowUserDst
- TestACLAllowStarDst
- TestACLNamedHostsCanReach
- TestACLDevice1CanAccessDevice2
- TestPolicyUpdateWhileRunningWithCLIInDatabase
- TestACLAutogroupSelf
- TestACLPolicyPropagationOverTime

Where possible, the inline Curl+Error pattern is replaced with the
assertCurlFailWithCollect helper introduced in the previous commit.

Updates #3125
2026-03-31 22:06:25 +02:00
Kristoffer Dalby
a7edcf3b0f integration: add CI-scaled timeouts and curl helpers for flaky ACL tests
Add ScaledTimeout to scale EventuallyWithT timeouts by 2x on CI,
consistent with the existing PeerSyncTimeout (60s/120s) and
dockertestMaxWait (300s/600s) conventions.

Add assertCurlSuccessWithCollect and assertCurlFailWithCollect helpers
following the existing *WithCollect naming convention.
assertCurlFailWithCollect uses CurlFailFast internally for aggressive
timeouts, avoiding wasted retries when expecting blocked connections.

Apply these to the three flakiest ACL tests:

- TestACLTagPropagation: swap NetMap and curl verification order so
  the fast NetMap check (confirms MapResponse arrived) runs before
  the slower curl check. Use curl helpers and scaled timeouts.

- TestACLTagPropagationPortSpecific: use curl helpers and scaled
  timeouts.

- TestACLHostsInNetMapTable: scale the 10s EventuallyWithT timeout.

Updates #3125
2026-03-31 22:06:25 +02:00
Kristoffer Dalby
fda72ad1a3 Update main.md
Co-authored-by: nblock <nblock@users.noreply.github.com>
2026-03-31 13:36:31 +02:00
Kristoffer Dalby
dfaf120f2a docs: add development builds install page
Move the container image and binary download details from the README
into a dedicated documentation page at setup/install/main. This gives
development builds a proper home in the docs site alongside the other
install methods. The README now links to the docs page instead.
2026-03-31 13:36:31 +02:00
Kristoffer Dalby
e171d30179 ci: add build workflow for main branch
Build and push multi-arch container images (linux/amd64, linux/arm64)
to GHCR and Docker Hub on every push to main that changes Go or Nix
files. Images are tagged as main-<short-sha> using ko with the same
distroless base image as release builds.

Cross-compiled binaries for linux and darwin (amd64, arm64) are
uploaded as workflow artifacts. The README links to these via
nightly.link for stable download URLs.
2026-03-31 13:36:31 +02:00
Kristoffer Dalby
0c6b9f5348 goreleaser: remove unused ts2019 build tag
The ts2019 build tag is no longer used. Remove it from the
goreleaser build configuration.
2026-03-31 13:36:31 +02:00
Florian Preinstorfer
f3512d50df Switch to mkdocs-materialx
The project mkdocs-material is in maintenance-only mode and their
successor is not ready yet.

Use the modern, refreshed theme and drop the pymdownx.magiclink
extension.
2026-03-25 22:30:03 +01:00
Florian Preinstorfer
efd83da14e Explicitly mention that a headscale username should *not* end with @
See: #3149
2026-03-20 19:44:33 +01:00
Tanayk07
568baf3d02 fix: align banner right-side border to consistent 64-char width 2026-03-19 07:08:35 +01:00
Tanayk07
5105033224 feat: add prominent warning banner for non-standard IP prefixes
Add a highly visible ASCII-art warning banner that is printed at
startup when the configured IP prefixes fall outside the standard
Tailscale CGNAT (100.64.0.0/10) or ULA (fd7a:115c:a1e0::/48) ranges.

The warning fires once even if both v4 and v6 are non-standard, and
the warnBanner() function is reusable for other critical configuration
warnings in the future.

Also updates config-example.yaml to clarify that subsets of the
default ranges are fine, but ranges outside CGNAT/ULA are not.

Closes #3055
2026-03-19 07:08:35 +01:00
Kristoffer Dalby
3d53f97c82 hscontrol/servertest: fix test expectations for eventual consistency
Three corrections to issue tests that had wrong assumptions about
when data becomes available:

1. initial_map_should_include_peer_online_status: use WaitForCondition
   instead of checking the initial netmap. Online status is set by
   Connect() which sends a PeerChange patch after the initial
   RegisterResponse, so it may not be present immediately.

2. disco_key_should_propagate_to_peers: use WaitForCondition. The
   DiscoKey is sent in the first MapRequest (not RegisterRequest),
   so peers may not see it until a subsequent map update.

3. approved_route_without_announcement: invert the test expectation.
   Tailscale uses a strict advertise-then-approve model -- routes are
   only distributed when the node advertises them (Hostinfo.RoutableIPs)
   AND they are approved. An approval without advertisement is a dormant
   pre-approval. The test now asserts the route does NOT appear in
   AllowedIPs, matching upstream Tailscale semantics.

Also fix TestClient.Reconnect to clear the cached netmap and drain
pending updates before re-registering. Without this, WaitForPeers
returned immediately based on the old session's stale data.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
1053fbb16b hscontrol/state: fix online status reset during re-registration
Two fixes to how online status is handled during registration:

1. Re-registration (applyAuthNodeUpdate, HandleNodeFromPreAuthKey) no
   longer resets IsOnline to false. Online status is managed exclusively
   by Connect()/Disconnect() in the poll session lifecycle. The reset
   caused a false offline blip: the auth handler's change notification
   triggered a map regeneration showing the node as offline to peers,
   even though Connect() would set it back to true moments later.

2. New node creation (createAndSaveNewNode) now explicitly sets
   IsOnline=false instead of leaving it nil. This ensures peers always
   receive a known online status rather than an ambiguous nil/unknown.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
b09af3846b hscontrol/poll,state: fix grace period disconnect TOCTOU race
When a node disconnects, serveLongPoll defers a cleanup that starts a
grace period goroutine. This goroutine polls batcher.IsConnected() and,
if the node has not reconnected within ~10 seconds, calls
state.Disconnect() to mark it offline. A TOCTOU race exists: the node
can reconnect (calling Connect()) between the IsConnected check and
the Disconnect() call, causing the stale Disconnect() to overwrite
the new session's online status.

Fix with a monotonic per-node generation counter:

- State.Connect() increments the counter and returns the current
  generation alongside the change list.
- State.Disconnect() accepts the generation from the caller and
  rejects the call if a newer generation exists, making stale
  disconnects from old sessions a no-op.
- serveLongPoll captures the generation at Connect() time and passes
  it to Disconnect() in the deferred cleanup.
- RemoveNode's return value is now checked: if another session already
  owns the batcher slot (reconnect happened), the old session skips
  the grace period entirely.

Update batcher_test.go to track per-node connect generations and
pass them through to Disconnect(), matching production behavior.

Fixes the following test failures:
- server_state_online_after_reconnect_within_grace
- update_history_no_false_offline
- nodestore_correct_after_rapid_reconnect
- rapid_reconnect_peer_never_sees_offline
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
00c41b6422 hscontrol/servertest: add race, stress, and poll race tests
Add three test files designed to stress the control plane under
concurrent and adversarial conditions:

- race_test.go: 14 tests exercising concurrent mutations, session
  replacement, batcher contention, NodeStore access, and map response
  delivery during disconnect. All pass the Go race detector.

- poll_race_test.go: 8 tests targeting the poll.go grace period
  interleaving. These confirm a logical TOCTOU race: when a node
  disconnects and reconnects within the grace period, the old
  session's deferred Disconnect() can overwrite the new session's
  Connect(), leaving IsOnline=false despite an active poll session.

- stress_test.go: sustained churn, rapid mutations, rolling
  replacement, data integrity checks under load, and verification
  that rapid reconnects do not leak false-offline notifications.

Known failing tests (grace period TOCTOU race):
- server_state_online_after_reconnect_within_grace
- update_history_no_false_offline
- rapid_reconnect_peer_never_sees_offline
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
ab4e205ce7 hscontrol/servertest: expand issue tests to 24 scenarios, surface 4 issues
Split TestIssues into 7 focused test functions to stay under cyclomatic
complexity limits while testing more aggressively.

Issues surfaced (4 failing tests):

1. initial_map_should_include_peer_online_status: Initial MapResponse
   has Online=nil for peers. Online status only arrives later via
   PeersChangedPatch.

2. disco_key_should_propagate_to_peers: DiscoPublicKey set by client
   is not visible to peers. Peers see zero disco key.

3. approved_route_without_announcement_is_visible: Server-side route
   approval without client-side announcement silently produces empty
   SubnetRoutes (intersection of empty announced + approved = empty).

4. nodestore_correct_after_rapid_reconnect: After 5 rapid reconnect
   cycles, NodeStore reports node as offline despite having an active
   poll session. The connect/disconnect grace period interleaving
   leaves IsOnline in an incorrect state.

Passing tests (20) verify:
- IP uniqueness across 10 nodes
- IP stability across reconnect
- New peers have addresses immediately
- Node rename propagates to peers
- Node delete removes from all peer lists
- Hostinfo changes (OS field) propagate
- NodeStore/DB consistency after route mutations
- Grace period timing (8-20s window)
- Ephemeral node deletion (not just offline)
- 10-node simultaneous connect convergence
- Rapid sequential node additions
- Reconnect produces complete map
- Cross-user visibility with default policy
- Same-user multiple nodes get distinct IDs
- Same-hostname nodes get unique GivenNames
- Policy change during connect still converges
- DERP region references are valid
- User profiles present for self and peers
- Self-update arrives after route approval
- Route advertisement stored as AnnouncedRoutes
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
f87b08676d hscontrol/servertest: add policy, route, ephemeral, and content tests
Extend the servertest harness with:
- TestClient.Direct() accessor for advanced operations
- TestClient.WaitForPeerCount and WaitForCondition helpers
- TestHarness.ChangePolicy for ACL policy testing
- AssertDERPMapPresent and AssertSelfHasAddresses

New test suites:
- content_test.go: self node, DERP map, peer properties, user profiles,
  update history monotonicity, and endpoint update propagation
- policy_test.go: default allow-all, explicit policy, policy triggers
  updates on all nodes, multiple policy changes, multi-user mesh
- ephemeral_test.go: ephemeral connect, cleanup after disconnect,
  mixed ephemeral/regular, reconnect prevents cleanup
- routes_test.go: addresses in AllowedIPs, route advertise and approve,
  advertised routes via hostinfo, CGNAT range validation

Also fix node_departs test to use WaitForCondition instead of
assert.Eventually, and convert concurrent_join_and_leave to
interleaved_join_and_leave with grace-period-tolerant assertions.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
ca7362e9aa hscontrol/servertest: add control plane lifecycle and consistency tests
Add three test files exercising the servertest harness:

- lifecycle_test.go: connection, disconnection, reconnection, session
  replacement, and mesh formation at various sizes.
- consistency_test.go: symmetric visibility, consistent peer state,
  address presence, concurrent join/leave convergence.
- weather_test.go: rapid reconnects, flapping stability, reconnect
  with various delays, concurrent reconnects, and scale tests.

All tests use table-driven patterns with subtests.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
0288614bdf hscontrol: add servertest harness for in-process control plane testing
Add a new hscontrol/servertest package that provides a test harness
for exercising the full Headscale control protocol in-process, using
Tailscale's controlclient.Direct as the client.

The harness consists of:
- TestServer: wraps a Headscale instance with an httptest.Server
- TestClient: wraps controlclient.Direct with NetworkMap tracking
- TestHarness: orchestrates N clients against a single server
- Assertion helpers for mesh completeness, visibility, and consistency

Export minimal accessor methods on Headscale (HTTPHandler, NoisePublicKey,
GetState, SetServerURL, StartBatcher, StartEphemeralGC) so the servertest
package can construct a working server from outside the hscontrol package.

This enables fast, deterministic tests of connection lifecycle, update
propagation, and network weather scenarios without Docker.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
82c7efccf8 mapper/batcher: serialize per-node work to prevent out-of-order delivery
processBatchedChanges queued each pending change for a node as a
separate work item. Since multiple workers pull from the same channel,
two changes for the same node could be processed concurrently by
different workers. This caused two problems:

1. MapResponses delivered out of order — a later change could finish
   generating before an earlier one, so the client sees stale state.
2. updateSentPeers and computePeerDiff race against each other —
   updateSentPeers does Clear() + Store() which is not atomic relative
   to a concurrent Range() in computePeerDiff.

Bundle all pending changes for a node into a single work item so one
worker processes them sequentially. Add a per-node workMu that
serializes processing across consecutive batch ticks, preventing a
second worker from starting tick N+1 while tick N is still in progress.

Fixes #3140
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
81b871c9b5 integration/acl: replace custom entrypoints with WithPackages
Replace inline WithDockerEntrypoint shell scripts in
TestACLTagPropagation and TestACLTagPropagationPortSpecific with
the standard WithPackages and WithWebserver options.

The custom entrypoints used fragile fixed sleeps and lacked the
robust network/cert readiness waits that buildEntrypoint provides.

Updates #3139
2026-03-16 03:57:05 -07:00
Kristoffer Dalby
e5ebe3205a integration: standardize test infrastructure options
Make embedded DERP server and TLS the default configuration for all
integration tests, replacing the per-test opt-in model that led to
inconsistent and flaky test behavior.

Infrastructure changes:
- DefaultConfigEnv() includes embedded DERP server settings
- New() auto-generates a proper CA + server TLS certificate pair
- CA cert is installed into container trust stores and returned by
  GetCert() so clients and internal tools (curl) trust the server
- CreateCertificate() now returns (caCert, cert, key) instead of
  discarding the CA certificate
- Add WithPublicDERP() and WithoutTLS() opt-out options
- Remove WithTLS(), WithEmbeddedDERPServerOnly(), and WithDERPAsIP()
  since all their behavior is now the default or unnecessary

Test cleanup:
- Remove all redundant WithTLS/WithEmbeddedDERPServerOnly/WithDERPAsIP
  calls from test files
- Give every test a unique WithTestName by parameterizing aclScenario,
  sshScenario, and derpServerScenario helpers
- Add WithTestName to tests that were missing it
- Document all non-standard options with inline comments explaining
  why each is needed

Updates #3139
2026-03-16 03:57:05 -07:00
Kristoffer Dalby
87b8507ac9 mapper/batcher: replace connected map with per-node disconnectedAt
The Batcher's connected field (*xsync.Map[types.NodeID, *time.Time])
encoded three states via pointer semantics:

  - nil value:    node is connected
  - non-nil time: node disconnected at that timestamp
  - key missing:  node was never seen

This was error-prone (nil meaning 'connected' inverts Go idioms),
redundant with b.nodes + hasActiveConnections(), and required keeping
two parallel maps in sync. It also contained a bug in RemoveNode where
new(time.Now()) was used instead of &now, producing a zero time.

Replace the separate connected map with a disconnectedAt field on
multiChannelNodeConn (atomic.Pointer[time.Time]), tracked directly
on the object that already manages the node's connections.

Changes:
  - Add disconnectedAt field and helpers (markConnected, markDisconnected,
    isConnected, offlineDuration) to multiChannelNodeConn
  - Remove the connected field from Batcher
  - Simplify IsConnected from two map lookups to one
  - Simplify ConnectedMap and Debug from two-map iteration to one
  - Rewrite cleanupOfflineNodes to scan b.nodes directly
  - Remove the markDisconnectedIfNoConns helper
  - Update all tests and benchmarks

Fixes #3141
2026-03-16 02:22:56 -07:00