connectionEntry.send() is on the hot path: called once per connection
per broadcast tick. time.After allocates a timer that sits in the
runtime timer heap until it fires (50 ms), even when the channel send
succeeds immediately. At 1000 connected nodes, every tick leaks 1000
timers into the heap, creating continuous GC pressure.
Replace with time.NewTimer + defer timer.Stop() so the timer is
removed from the heap as soon as the fast-path send completes.
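A minimal sketch of the timer pattern described above. The helper name
trySend and its generic signature are hypothetical and not taken from
the headscale source; only time.NewTimer and Timer.Stop are real API.

```go
package main

import "time"

// trySend is a hypothetical helper: it attempts to deliver v on ch and
// gives up after timeout. It reports whether the value was delivered.
func trySend[T any](ch chan<- T, v T, timeout time.Duration) bool {
	// Unlike time.After, the timer is owned by this call, so it can be
	// stopped as soon as the function returns.
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case ch <- v:
		// Fast path: the deferred Stop releases the timer right away.
		return true
	case <-timer.C:
		return false
	}
}
```

With a buffered channel that has room, the first case is taken and the
timer is stopped on return instead of lingering for the full timeout.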
Add embedded DERP server, TLS, and netfilter=off to match the
infrastructure configuration used by all other ACL integration tests.
Without these options, the test fails intermittently because traffic
routes through external DERP relays and iptables initialization fails
in Docker containers.
Updates #3139
Remove the Batcher interface since there is only one implementation.
Rename LockFreeBatcher to Batcher and merge batcher_lockfree.go into
batcher.go.
Drop type assertions in debug.go now that mapBatcher is a concrete
*mapper.Batcher pointer.
Rewrite multiChannelNodeConn.send() to use a three-step approach that
never holds the lock while sending:
1. RLock: snapshot connections slice (cheap pointer copy)
2. Unlock: send to all connections (50ms timeouts happen here)
3. Lock: remove failed connections by pointer identity
Previously, send() held the write lock for the entire duration of
sending to all connections. With N stale connections each timing out
at 50ms, this blocked addConnection/removeConnection for N*50ms.
The new approach holds the lock only for O(N) pointer
operations, not for N*50ms I/O waits.
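The snapshot, send, prune sequence above could look roughly like this.
The types conn and multiConn are simplified stand-ins invented for the
sketch; the real multiChannelNodeConn carries more state.

```go
package main

import (
	"sync"
	"time"
)

// conn and multiConn are hypothetical, simplified stand-ins.
type conn struct {
	ch chan string
}

type multiConn struct {
	mu    sync.RWMutex
	conns []*conn
}

// send delivers msg to every connection and returns how many stale
// connections were pruned. No lock is held while waiting on sends.
func (m *multiConn) send(msg string, timeout time.Duration) int {
	// Step 1: snapshot the slice under the read lock.
	m.mu.RLock()
	snapshot := make([]*conn, len(m.conns))
	copy(snapshot, m.conns)
	m.mu.RUnlock()

	// Step 2: send without any lock; timeouts happen here.
	failed := make(map[*conn]struct{})
	for _, c := range snapshot {
		timer := time.NewTimer(timeout)
		select {
		case c.ch <- msg:
		case <-timer.C:
			failed[c] = struct{}{}
		}
		timer.Stop()
	}
	if len(failed) == 0 {
		return 0
	}

	// Step 3: take the write lock briefly and drop failed connections
	// by pointer identity. Connections added meanwhile are kept.
	m.mu.Lock()
	defer m.mu.Unlock()
	kept := make([]*conn, 0, len(m.conns))
	for _, c := range m.conns {
		if _, bad := failed[c]; !bad {
			kept = append(kept, c)
		}
	}
	removed := len(m.conns) - len(kept)
	m.conns = kept
	return removed
}
```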
Replace the two-phase Load-check-Delete in cleanupOfflineNodes with
xsync.Map.Compute() for atomic check-and-delete. This prevents the
TOCTOU race where a node reconnects between the hasActiveConnections
check and the Delete call.
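To illustrate why the check and the delete must happen as one atomic
step, here is a sketch that uses a mutex-guarded map as a stand-in for
xsync.Map.Compute(). The names nodeRegistry, deleteIfOffline, and
reconnect are hypothetical; this is not the xsync API.

```go
package main

import "sync"

// nodeConn and nodeRegistry are hypothetical stand-ins for the
// batcher's per-node state and its node map.
type nodeConn struct {
	activeConns int
}

type nodeRegistry struct {
	mu    sync.Mutex
	nodes map[int]*nodeConn
}

func newNodeRegistry() *nodeRegistry {
	return &nodeRegistry{nodes: make(map[int]*nodeConn)}
}

// deleteIfOffline checks for active connections and deletes the entry
// in a single critical section, so a reconnect cannot slip in between
// the check and the delete.
func (r *nodeRegistry) deleteIfOffline(id int) bool {
	r.mu.Lock()
	defer r.mu.Unlock()

	n, ok := r.nodes[id]
	if !ok {
		return false
	}
	if n != nil && n.activeConns > 0 {
		return false // node reconnected; keep it
	}
	delete(r.nodes, id)
	return true
}

// reconnect registers a new active connection for the node.
func (r *nodeRegistry) reconnect(id int) {
	r.mu.Lock()
	defer r.mu.Unlock()

	n := r.nodes[id]
	if n == nil {
		n = &nodeConn{}
		r.nodes[id] = n
	}
	n.activeConns++
}
```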
Add nil guards on all b.nodes.Load() and b.nodes.Range() call sites
to prevent nil pointer panics from concurrent cleanup races.
Move per-node pending changes from a shared xsync.Map on the batcher
into multiChannelNodeConn, protected by a dedicated mutex. The new
appendPending/drainPending methods provide atomic append and drain
operations, eliminating data races in addToBatch and
processBatchedChanges.
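A sketch of what an atomic append/drain pair guarded by a dedicated
mutex might look like. The generic pendingQueue type is an assumption
made for illustration; the real methods live on multiChannelNodeConn
and operate on headscale's change type.

```go
package main

import "sync"

// pendingQueue is a hypothetical container for per-node pending items.
type pendingQueue[T any] struct {
	mu      sync.Mutex
	pending []T
}

// appendPending adds items under the mutex.
func (q *pendingQueue[T]) appendPending(items ...T) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending = append(q.pending, items...)
}

// drainPending returns everything queued so far and resets the queue
// in the same critical section, so no item is lost or seen twice.
func (q *pendingQueue[T]) drainPending() []T {
	q.mu.Lock()
	defer q.mu.Unlock()
	out := q.pending
	q.pending = nil
	return out
}
```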
Add sync.Once to multiChannelNodeConn.close() to make it idempotent,
preventing panics from concurrent close calls on the same channel.
Add started atomic.Bool to guard Start() against being called
multiple times, preventing orphaned goroutines.
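The idempotent close and the start guard can be sketched together as
follows. The worker type and its fields are hypothetical; only
sync.Once and atomic.Bool are taken from the description above.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// worker is a hypothetical type combining both guards.
type worker struct {
	started   atomic.Bool
	closeOnce sync.Once
	done      chan struct{}
}

func newWorker() *worker {
	return &worker{done: make(chan struct{})}
}

// Start launches the background goroutine at most once and reports
// whether this call was the one that started it.
func (w *worker) Start() bool {
	if !w.started.CompareAndSwap(false, true) {
		return false // already started; do not spawn a second goroutine
	}
	go func() {
		<-w.done // placeholder for the real work loop
	}()
	return true
}

// Close is safe to call repeatedly and concurrently: the channel is
// closed exactly once, so a second call cannot panic.
func (w *worker) Close() {
	w.closeOnce.Do(func() {
		close(w.done)
	})
}
```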
Add comprehensive concurrency tests validating these changes.
Add comprehensive unit tests for the LockFreeBatcher covering
AddNode/RemoveNode lifecycle, addToBatch routing (broadcast, targeted,
full update), processBatchedChanges deduplication, cleanup of offline
nodes, close/shutdown behavior, IsConnected state tracking, and
connected map consistency.
Add benchmarks for connection entry send, multi-channel send and
broadcast, peer diff computation, sentPeers updates, addToBatch at
various scales (10/100/1000 nodes), processBatchedChanges, broadcast
delivery, IsConnected lookups, connected map enumeration, connection
churn, and concurrent send+churn scenarios.
Widen setupBatcherWithTestData to accept testing.TB so benchmarks can
reuse the same database-backed test setup as unit tests.
Add golang.org/x/tools/cmd/stress as a tool dependency for running
tests under repeated stress to surface flaky failures.
Update flake vendorHash for the new go.mod dependencies.
Buffer the AuthRequest verdict channel to prevent a race where the
sender blocks indefinitely if the receiver has already timed out, and
increase the auth followup test timeout from 100ms to 5s to prevent
spurious failures under load.
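A sketch of the buffered verdict channel, assuming a single verdict per
request. The type and method names are hypothetical, and the
non-blocking send that drops duplicate verdicts is an addition made for
the example, not something the commit states.

```go
package main

import "time"

// verdict and authRequest are hypothetical stand-ins.
type verdict struct {
	approved bool
}

type authRequest struct {
	verdictCh chan verdict
}

// newAuthRequest creates the channel with capacity 1 so that one
// verdict can be stored even when nobody is receiving any more.
func newAuthRequest() *authRequest {
	return &authRequest{verdictCh: make(chan verdict, 1)}
}

// finish records the verdict without blocking. It reports false when a
// verdict is already queued (assumption: later verdicts are dropped).
func (r *authRequest) finish(v verdict) bool {
	select {
	case r.verdictCh <- v:
		return true
	default:
		return false
	}
}

// wait blocks until a verdict arrives or the timeout elapses.
func (r *authRequest) wait(timeout time.Duration) (verdict, bool) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case v := <-r.verdictCh:
		return v, true
	case <-timer.C:
		return verdict{}, false
	}
}
```

With an unbuffered channel, a finish call that arrives after wait has
timed out would block forever; with capacity 1 it returns immediately.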
Skip postgres-backed tests when the postgres server is unavailable
instead of calling t.Fatal, which was preventing the rest of the test
suite from running.
Add TestMain to db, types, and policy/v2 packages to chdir to the
source directory before running tests. This ensures relative testdata/
paths resolve correctly when the test binary is executed from an
arbitrary working directory (e.g., via "go tool stress").
When stale-send cleanup prunes a connection from the batcher, the old serveLongPoll session needs an explicit stop signal. Pass a stop hook into AddNode and trigger it when that connection is removed, so the session exits through its normal cancel path instead of relying on channel closure from the batcher side.
When the batcher timed out sending to a node, it removed the channel from multiChannelNodeConn but left the old serveLongPoll goroutine running on that channel. That left a live stale session behind: it no longer received new updates, but it could still keep the stream open and block shutdown.
Close the pruned channel when stale-send cleanup removes it so the old map session exits after draining any buffered update.
A connection can already be removed from multiChannelNodeConn by the stale-send cleanup path before serveLongPoll reaches its deferred RemoveNode call. In that case RemoveNode used to return early on "channel not found" and never updated the node's connected state.
Drop that early return so RemoveNode still checks whether any active connections remain and marks the node disconnected when the last one is gone.
Update mdformat and related packages from python313Packages to
python314Packages. All four packages (mdformat, mdformat-footnote,
mdformat-frontmatter, mdformat-mkdocs) are available in the updated
nixpkgs.
Updates #1261
Explicitly set derp.urls to an empty list in the NixOS VM test,
matching the upstream nixpkgs test. The VMs have no internet
access, so fetching the default Tailscale DERP map would silently
fail and add unnecessary timeout delay to the test run.
Add missing typed options from the upstream nixpkgs module:
- configFile: read-only option exposing the generated config path
  for composability with other NixOS modules
- dns.split: split DNS configuration with proper type checking
- dns.extra_records: typed submodule with name/type/value validation
Sync descriptions and assertions with upstream:
- Use Tailscale doc link for override_local_dns description
- Remove redundant requirement note from nameservers.global
- Match upstream assertion message wording and expression style
Update systemd script to reference cfg.configFile instead of a
local let-binding, matching the upstream pattern.
Add end-to-end integration test that validates localpart:*@domain
SSH user mapping with real Tailscale clients. The test sets up an
SSH policy with localpart entries and verifies that users can SSH
into tagged servers using their email local-part as the username.
Updates #3049
Add support for localpart:*@<domain> entries in SSH policy users.
When a user SSHes into a target, their email local-part becomes the
OS username (e.g. alice@example.com → OS user alice).
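As a rough illustration of the mapping, the following sketch parses a
localpart:*@<domain> entry and derives the OS username from an email.
The function names are hypothetical and the case-insensitive domain
comparison is an assumption; the real logic lives in types.go and
filter.go.

```go
package main

import (
	"errors"
	"strings"
)

const localpartPrefix = "localpart:*@"

// parseLocalpartDomain (hypothetical) extracts the domain from an
// entry of the form localpart:*@<domain>. Only the wildcard form is
// accepted.
func parseLocalpartDomain(entry string) (string, error) {
	domain, ok := strings.CutPrefix(entry, localpartPrefix)
	if !ok {
		return "", errors.New("entry must have the form localpart:*@<domain>")
	}
	if domain == "" || strings.Contains(domain, "@") {
		return "", errors.New("entry has an invalid domain")
	}
	return domain, nil
}

// localpartFor (hypothetical) returns the email local-part when the
// email belongs to domain. Assumption: domains compare
// case-insensitively.
func localpartFor(email, domain string) (string, bool) {
	local, emailDomain, ok := strings.Cut(email, "@")
	if !ok || local == "" || !strings.EqualFold(emailDomain, domain) {
		return "", false
	}
	return local, true
}
```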
Type system (types.go):
- SSHUser.IsLocalpart() and ParseLocalpart() for validation
- SSHUsers.LocalpartEntries(), NormalUsers(), ContainsLocalpart()
- Enforces format: localpart:*@<domain> (wildcard-only)
- UserWildcard.Resolve for user:*@domain SSH source aliases
- acceptEnv passthrough for SSH rules
Compilation (filter.go):
- resolveLocalparts: pure function mapping users to local-parts
  by email domain. No node walking, easy to test.
- groupSourcesByUser: single walk producing per-user principals
  with sorted user IDs, and tagged principals separately.
- ipSetToPrincipals: shared helper replacing 6 inline copies.
- selfPrincipalsForNode: self-access using pre-computed byUser.
The approach separates data gathering from rule assembly. Localpart
rules are interleaved per source user to match Tailscale SaaS
first-match-wins ordering.
Updates #3049
NodeView.CanAccess called node2.AsStruct() on every check. In peer-map construction we run CanAccess in O(n^2) pair scans (often twice per pair), so that per-call clone multiplied into large heap churn.
Add ReadLog method to headscale integration container for log
inspection. Split SSH check mode tests into CLI and OIDC variants
and add comprehensive test coverage:
- TestSSHOneUserToOneCheckModeCLI: basic check mode with CLI approval
- TestSSHOneUserToOneCheckModeOIDC: check mode with OIDC approval
- TestSSHCheckModeUnapprovedTimeout: rejection on cache expiry
- TestSSHCheckModeCheckPeriodCLI: session expiry and re-auth
- TestSSHCheckModeAutoApprove: auto-approval within check period
- TestSSHCheckModeNegativeCLI: explicit rejection via CLI
Update existing integration tests to use headscale auth register.
Updates #1850
Add SSH check period tracking so that recently authenticated users
are auto-approved without requiring manual intervention each time.
Introduce SSHCheckPeriod type with validation (min 1m, max 168h,
"always" for every request) and encode the compiled check period
as URL query parameters in the HoldAndDelegate URL.
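The validation rules above (minimum 1m, maximum 168h, "always") could
be sketched like this. The checkPeriod struct and parseCheckPeriod
function are hypothetical and do not reflect the actual SSHCheckPeriod
definition.

```go
package main

import (
	"fmt"
	"time"
)

const (
	minCheckPeriod = time.Minute
	maxCheckPeriod = 168 * time.Hour
)

// checkPeriod is a hypothetical representation: either "always"
// (check on every request) or a bounded duration.
type checkPeriod struct {
	Always   bool
	Duration time.Duration
}

// parseCheckPeriod validates a policy value against the stated bounds.
func parseCheckPeriod(s string) (checkPeriod, error) {
	if s == "always" {
		return checkPeriod{Always: true}, nil
	}
	d, err := time.ParseDuration(s)
	if err != nil {
		return checkPeriod{}, fmt.Errorf("parsing check period %q: %w", s, err)
	}
	if d < minCheckPeriod || d > maxCheckPeriod {
		return checkPeriod{}, fmt.Errorf(
			"check period %s outside allowed range %s to %s",
			d, minCheckPeriod, maxCheckPeriod,
		)
	}
	return checkPeriod{Duration: d}, nil
}
```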
The SSHActionHandler checks recorded auth times before creating a
new HoldAndDelegate flow. Auth timestamps are stored in-memory:
- Default period (no explicit checkPeriod): auth covers any
  destination, keyed by source node with Dst=0 sentinel
- Explicit period: auth covers only that specific destination,
  keyed by (source, destination) pair
Auth times are cleared on policy changes.
Updates #1850
Add gRPC service definitions for managing auth requests:
AuthRegister to register interactive auth sessions and
AuthApprove/AuthReject to approve or deny pending requests
(used for SSH check mode).
Updates #1850
Implement the SSH "check" action which requires additional
verification before allowing SSH access. The policy compiler generates
a HoldAndDelegate URL that the Tailscale client calls back to
headscale. The SSHActionHandler creates an auth session and waits for
approval via the generalised auth flow.
Sort check (HoldAndDelegate) rules before accept rules to match
Tailscale's first-match-wins evaluation order.
Updates #1850
Extract shared HTML/CSS design into a common template and create
generalised auth success and web auth templates that work for both
node registration and SSH check authentication flows.
Updates #1850
Generalise the registration pipeline to a more general auth pipeline
supporting both node registrations and SSH check auth requests.
Rename RegistrationID to AuthID, unexport AuthRequest fields, and
introduce AuthVerdict to unify the auth finish API.
Add the urlParam generic helper for extracting typed URL parameters
from chi routes, used by the new auth request handler.
Updates #1850
Replace gorilla/mux with go-chi/chi as the HTTP router and add a
custom zerolog-based request logger to replace chi's default
stdlib-based middleware.Logger, consistent with the rest of the
application.
Updates #1850
Extract the existing-node lookup logic from HandleNodeFromPreAuthKey
into a separate method. This reduces the cyclomatic complexity from
32 to 28, below the gocyclo limit of 30.
Updates #3077
Tagged nodes no longer have user_id set, so ListNodes(user) cannot
find them. Update integration tests to use ListNodes() (all nodes)
when looking up tagged nodes.
Add a findNode helper to locate nodes by predicate from an
unfiltered list, used in ACL tests that have multiple nodes per
scenario.
Updates #3077
Test the tagged-node-survives-user-deletion scenario at two layers:
DB layer (users_test.go):
- success_user_only_has_tagged_nodes: tagged nodes with nil
  user_id do not block user deletion and survive it
- error_user_has_tagged_and_owned_nodes: user-owned nodes
  still block deletion even when tagged nodes coexist
App layer (grpcv1_test.go):
- TestDeleteUser_TaggedNodeSurvives: full registration flow
  with tagged PreAuthKey verifies nil UserID after registration,
  absence from nodesByUser index, user deletion succeeds, and
  tagged node remains in global node list
Also update auth_tags_test.go assertions to expect nil UserID
on tagged nodes, consistent with the new invariant.
Updates #3077
Tagged nodes are owned by their tags, not a user. Enforce this
invariant at every write path:
- createAndSaveNewNode: do not set UserID for tagged PreAuthKey
  registration; clear UserID when advertise-tags are applied
  during OIDC/CLI registration
- SetNodeTags: clear UserID/User when tags are assigned
- processReauthTags: clear UserID/User when tags are applied
  during re-authentication
- validateNodeOwnership: reject tagged nodes with non-nil UserID
- NodeStore: skip nodesByUser indexing for tagged nodes since
  they have no owning user
- HandleNodeFromPreAuthKey: add fallback lookup for tagged PAK
  re-registration (tagged nodes indexed under UserID(0)); guard
  against nil User deref for tagged nodes in different-user check
Since tagged nodes now have user_id = NULL, ListNodesByUser
will not return them and DestroyUser naturally allows deleting
users whose nodes have all been tagged. The ON DELETE CASCADE
FK cannot reach tagged nodes through a NULL foreign key.
Also tone down shouty comments throughout state.go.
Fixes #3077
Tagged nodes are owned by their tags, not a user. Previously
user_id was kept as "created by" tracking, but this prevents
deleting users whose nodes have all been tagged, and the
ON DELETE CASCADE FK would destroy the tagged nodes.
Add a migration that sets user_id = NULL on all existing tagged
nodes. Subsequent commits enforce this invariant at write time.
Updates #3077
Add --disable flag to "headscale nodes expire" CLI command and
disable_expiry field handling in the gRPC API to allow disabling
key expiry for nodes. When disabled, the node's expiry is set to
NULL and IsExpired() returns false.
The CLI follows the new grpcRunE/RunE/printOutput patterns
introduced in the recent CLI refactor.
Also fix NodeSetExpiry to persist directly to the database instead
of going through persistNodeToDB which omits the expiry field.
Fixes #2681
Co-authored-by: Marco Santos <me@marcopsantos.com>