Add NodeAttrsTaildriveShare and NodeAttrsTaildriveAccess to the node capability map, enabling Taildrive file sharing when granted via policy. Add integration test verifying the full cap/drive grant lifecycle.
Updates #2180
compileFilterRules checked only pol.ACLs == nil to decide whether
to return FilterAllowAll (permit-any). Policies that use only Grants
(no ACLs) had nil ACLs, so the function short-circuited before
compiling any CapGrant rules. This meant cap/relay, cap/drive, and
any other App-based grant capabilities were silently ignored.
Fix: check that both ACLs and Grants are empty before returning FilterAllowAll.
Updates #2180
Via grants steer routes to specific nodes per viewer. Until now,
all clients saw the same routes for each peer because route
assembly was viewer-independent. This implements per-viewer route
visibility so that via-designated peers serve routes only to
matching viewers, while non-designated peers have those routes
withdrawn.
Add ViaRouteResult type (Include/Exclude prefix lists) and
ViaRoutesForPeer to the PolicyManager interface. The v2
implementation iterates via grants, resolves sources against the
viewer, matches destinations against the peer's advertised routes
(both subnet and exit), and categorizes prefixes by whether the
peer has the via tag.
Add RoutesForPeer to State which composes global primary election,
via Include/Exclude filtering, exit routes, and ACL reduction.
When no via grants exist, it falls back to existing behavior.
Update the mapper to call RoutesForPeer per-peer instead of using
a single route function for all peers. The route function now
returns all routes (subnet + exit), and TailNode filters exit
routes out of the PrimaryRoutes field for HA tracking.
Updates #2180
compileViaGrant only handled *Prefix destinations, skipping
*AutoGroup entirely. This meant via grants with
dst=[autogroup:internet] produced no filter rules even when the
node was an exit node with approved exit routes.
Switch the destination loop from a type assertion to a type switch
that handles both *Prefix (subnet routes) and *AutoGroup (exit
routes via autogroup:internet). Also check ExitRoutes() in
addition to SubnetRoutes() so the function doesn't bail early
when a node only has exit routes.
Updates #2180
Add autogroup:danger-all as a valid source alias that matches ALL IP
addresses including non-Tailscale addresses. When used as a source,
it resolves to 0.0.0.0/0 + ::/0 internally but produces SrcIPs: ["*"]
in filter rules. When used as a destination, it is rejected with an
error matching Tailscale SaaS behavior.
Key changes:
- Add AutoGroupDangerAll constant and validation
- Add sourcesHaveDangerAll() helper and hasDangerAll parameter to
srcIPsWithRoutes() across all compilation paths
- Add ErrAutogroupDangerAllDst for destination rejection
- Remove 3 AUTOGROUP_DANGER_ALL skip entries (K6, K7, K8)
Updates #2180
Implement comprehensive grant validation:
- Accept empty sources/destinations (they produce no rules)
- Validate grant ip/app field requirements
- Validate capability name format and autogroup constraints
- Check via tag existence
- Enforce default route CIDR restrictions
Updates #2180
Compile grants with "via" field into FilterRules that are placed only
on nodes matching the via tag and actually advertising the destination
subnets. Key behavior:
- Filter rules go exclusively to via-nodes with matching approved routes
- Destination subnets not advertised by the via node are silently dropped
- App-only via grants (no ip field) produce no packet filter rules
- Via grants are skipped in the global compileFilterRules since they
are node-specific
Reduces grant compat test skips from 41 to 30 (11 newly passing).
Updates #2180
Compile grant app fields into CapGrant FilterRules matching Tailscale
SaaS behavior. Key changes:
- Generate CapGrant rules in compileFilterRules and
compileGrantWithAutogroupSelf, with node-specific /32 and /128
Dsts for autogroup:self grants
- Add reversed companion rules for drive→drive-sharer and
relay→relay-target capabilities, ordered by original cap name
- Narrow broad CapGrant Dsts to node-specific prefixes in
ReduceFilterRules via new reduceCapGrantRule helper
- Skip merging CapGrant rules in mergeFilterRules to preserve
per-capability structure
- Remove ip+app mutual exclusivity validation (Tailscale accepts both)
- Add semantic JSON comparison for RawMessage types and netip.Prefix
comparators in test infrastructure
Reduces grant compat test skips from 99 to 41 (58 newly passing).
Updates #2180
When an ACL source list contains a wildcard (*) alongside explicit
sources (tags, groups, hosts, etc.), Tailscale preserves the individual
IPs from non-wildcard sources in SrcIPs alongside the merged wildcard
CGNAT ranges. Previously, headscale's IPSetBuilder would merge all
sources into a single set, absorbing the explicit IPs into the wildcard
range.
Track non-wildcard resolved addresses separately during source
resolution, then append their individual IP strings to the output
when a wildcard is also present. This fixes the remaining 5 ACL
compat test failures (K01 and M06 subtests).
Updates #2180
When an ACL has non-autogroup destinations (groups, users, tags, hosts)
alongside autogroup:self, emit non-self grants before self grants to
match Tailscale's filter rule ordering. ACLs with only autogroup
destinations (self + member) preserve the policy-defined order.
This fixes ACL-A17, ACL-SF07, and ACL-SF11 compat test failures.
Updates #2180
Add exit route check in ReduceFilterRules to prevent exit nodes from receiving packet filter rules for destinations that only overlap via exit routes. Remove resolved SUBNET_ROUTE_FILTER_RULES grant skip entries and update error message formatting for grant validation.
Updates #2180
Per Tailscale documentation, the wildcard (*) source includes "any
approved subnets" — the actually-advertised-and-approved routes from
nodes, not the autoApprover policy prefixes.
Change Asterix.resolve() to return just the base CGNAT+ULA set, and
add approved subnet routes as separate SrcIPs entries in the filter
compilation path. This preserves individual route prefixes that would
otherwise be merged by IPSet (e.g., 10.0.0.0/8 absorbing 10.33.0.0/16).
Also swap rule ordering in compileGrantWithAutogroupSelf() to emit
non-self destination rules before autogroup:self rules, matching the
Tailscale FilterRule wire format ordering.
Remove the unused AutoApproverPolicy.prefixes() method.
Updates #2180
Add routable_ips and approved_routes fields to the node topology
definitions in all golden test files. These represent the subnet
routes actually advertised by nodes on the Tailscale SaaS network
during data capture:
Routes topology (92 files, 6 router nodes):
big-router: 10.0.0.0/8
subnet-router: 10.33.0.0/16
ha-router1: 192.168.1.0/24
ha-router2: 192.168.1.0/24
multi-router: 172.16.0.0/24
exit-node: 0.0.0.0/0, ::/0
ACL topology (199 files, 1 router node):
subnet-router: 10.33.0.0/16
Grants topology (203 files, 1 router node):
subnet-router: 10.33.0.0/16
The route assignments were deduced from the golden data by analyzing
which router nodes receive FilterRules for which destination CIDRs
across all test files, and cross-referenced with the MTS setup
script (setup_grant_nodes.sh).
Updates #2180
Use ip.String() instead of netip.PrefixFrom(ip, ip.BitLen()).String()
when building DstPorts for autogroup:self destinations. This produces
bare IPs like "100.90.199.68" instead of CIDR notation like
"100.90.199.68/32", matching the Tailscale FilterRule wire format.
Updates #2180
AppendToIPSet now adds both IPv4 and IPv6 addresses for nodes, matching Tailscale's FilterRule wire format where identity-based aliases (tags, users, groups, autogroups) resolve to both address families. Update ReduceFilterRules test expectations to include IPv6 entries.
Updates #2180
Replace 8,286 lines of inline Go test expectations with 92 JSON golden files captured from Tailscale SaaS. The data-driven test driver validates route filtering, auto-approval, HA routing, and exit node behavior against real Tailscale output.
Updates #2180
Replace 9,937 lines of inline Go test expectations with 215 JSON golden files captured from Tailscale SaaS. The new data-driven test driver compares headscale's filter compilation output against real Tailscale behavior for each node in an 8-node topology.
Updates #2180
Introduce ResolvedAddresses type for structured IP set results. Refactor all Alias.Resolve() methods to return ResolvedAddresses instead of raw IPSets. Restrict identity-based aliases to matching address families, fix nil dereferences in partial resolution paths, and update test expectations for the new IP format (bare IPs, IP ranges instead of CIDR prefixes).
Updates #2180
Rename tailscale_compat_test.go to tailscale_acl_compat_test.go to make room for the grants compat test. Add 237 GRANT-*.json golden test files captured from Tailscale SaaS and a data-driven test driver that compares headscale's grant filter compilation against real Tailscale behavior.
Updates #2180
Add support for the Grant policy format as an alternative to ACL format,
following Tailscale's policy v2 specification. Grants provide a more
structured way to define network access rules with explicit separation
of IP-based and capability-based permissions.
Key changes:
- Add Grant struct with Sources, Destinations, InternetProtocols (ip),
and App (capabilities) fields
- Add ProtocolPort type for unmarshaling protocol:port strings
- Add Grant validation in Policy.validate() to enforce:
- Mutual exclusivity of ip and app fields
- Required ip or app field presence
- Non-empty sources and destinations
- Refactor compileFilterRules to support both ACLs and Grants
- Convert ACLs to Grants internally via aclToGrants() for unified
processing
- Extract destinationsToNetPortRange() helper for cleaner code
- Rename parseProtocol() to toIANAProtocolNumbers() for clarity
- Add ProtocolNumberToName mapping for reverse lookups
The Grant format allows policies to be written using either the legacy
ACL format or the new Grant format. ACLs are converted to Grants
internally, ensuring backward compatibility while enabling the new
format's benefits.
Updates #2180
WithTags was defined but never passed through to CreatePreAuthKey.
Fix NewClient to use CreateTaggedPreAuthKey when tags are specified,
enabling tests that need tagged nodes (e.g. via grant steering).
Updates #2180
When exit routes are approved, SubnetRoutes remains empty because exit
routes (0.0.0.0/0, ::/0) are classified separately. Without checking
ExitRoutes, the PolicyManager cache is not invalidated on exit route
approval, causing stale filter rules that lack via grant entries for
autogroup:internet destinations.
Updates #2180
Add a highly visible ASCII-art warning banner that is printed at
startup when the configured IP prefixes fall outside the standard
Tailscale CGNAT (100.64.0.0/10) or ULA (fd7a:115c:a1e0::/48) ranges.
The warning fires once even if both v4 and v6 are non-standard, and
the warnBanner() function is reusable for other critical configuration
warnings in the future.
Also updates config-example.yaml to clarify that subsets of the
default ranges are fine, but ranges outside CGNAT/ULA are not.
Closes #3055
Three corrections to issue tests that had wrong assumptions about
when data becomes available:
1. initial_map_should_include_peer_online_status: use WaitForCondition
instead of checking the initial netmap. Online status is set by
Connect() which sends a PeerChange patch after the initial
RegisterResponse, so it may not be present immediately.
2. disco_key_should_propagate_to_peers: use WaitForCondition. The
DiscoKey is sent in the first MapRequest (not RegisterRequest),
so peers may not see it until a subsequent map update.
3. approved_route_without_announcement: invert the test expectation.
Tailscale uses a strict advertise-then-approve model -- routes are
only distributed when the node advertises them (Hostinfo.RoutableIPs)
AND they are approved. An approval without advertisement is a dormant
pre-approval. The test now asserts the route does NOT appear in
AllowedIPs, matching upstream Tailscale semantics.
Also fix TestClient.Reconnect to clear the cached netmap and drain
pending updates before re-registering. Without this, WaitForPeers
returned immediately based on the old session's stale data.
Two fixes to how online status is handled during registration:
1. Re-registration (applyAuthNodeUpdate, HandleNodeFromPreAuthKey) no
longer resets IsOnline to false. Online status is managed exclusively
by Connect()/Disconnect() in the poll session lifecycle. The reset
caused a false offline blip: the auth handler's change notification
triggered a map regeneration showing the node as offline to peers,
even though Connect() would set it back to true moments later.
2. New node creation (createAndSaveNewNode) now explicitly sets
IsOnline=false instead of leaving it nil. This ensures peers always
receive a known online status rather than an ambiguous nil/unknown.
When a node disconnects, serveLongPoll defers a cleanup that starts a
grace period goroutine. This goroutine polls batcher.IsConnected() and,
if the node has not reconnected within ~10 seconds, calls
state.Disconnect() to mark it offline. A TOCTOU race exists: the node
can reconnect (calling Connect()) between the IsConnected check and
the Disconnect() call, causing the stale Disconnect() to overwrite
the new session's online status.
Fix with a monotonic per-node generation counter:
- State.Connect() increments the counter and returns the current
generation alongside the change list.
- State.Disconnect() accepts the generation from the caller and
rejects the call if a newer generation exists, making stale
disconnects from old sessions a no-op.
- serveLongPoll captures the generation at Connect() time and passes
it to Disconnect() in the deferred cleanup.
- RemoveNode's return value is now checked: if another session already
owns the batcher slot (reconnect happened), the old session skips
the grace period entirely.
Update batcher_test.go to track per-node connect generations and
pass them through to Disconnect(), matching production behavior.
Fixes the following test failures:
- server_state_online_after_reconnect_within_grace
- update_history_no_false_offline
- nodestore_correct_after_rapid_reconnect
- rapid_reconnect_peer_never_sees_offline
Add three test files designed to stress the control plane under
concurrent and adversarial conditions:
- race_test.go: 14 tests exercising concurrent mutations, session
replacement, batcher contention, NodeStore access, and map response
delivery during disconnect. All pass the Go race detector.
- poll_race_test.go: 8 tests targeting the poll.go grace period
interleaving. These confirm a logical TOCTOU race: when a node
disconnects and reconnects within the grace period, the old
session's deferred Disconnect() can overwrite the new session's
Connect(), leaving IsOnline=false despite an active poll session.
- stress_test.go: sustained churn, rapid mutations, rolling
replacement, data integrity checks under load, and verification
that rapid reconnects do not leak false-offline notifications.
Known failing tests (grace period TOCTOU race):
- server_state_online_after_reconnect_within_grace
- update_history_no_false_offline
- rapid_reconnect_peer_never_sees_offline
Split TestIssues into 7 focused test functions to stay under cyclomatic
complexity limits while testing more aggressively.
Issues surfaced (4 failing tests):
1. initial_map_should_include_peer_online_status: Initial MapResponse
has Online=nil for peers. Online status only arrives later via
PeersChangedPatch.
2. disco_key_should_propagate_to_peers: DiscoPublicKey set by client
is not visible to peers. Peers see zero disco key.
3. approved_route_without_announcement_is_visible: Server-side route
approval without client-side announcement silently produces empty
SubnetRoutes (intersection of empty announced + approved = empty).
4. nodestore_correct_after_rapid_reconnect: After 5 rapid reconnect
cycles, NodeStore reports node as offline despite having an active
poll session. The connect/disconnect grace period interleaving
leaves IsOnline in an incorrect state.
Passing tests (20) verify:
- IP uniqueness across 10 nodes
- IP stability across reconnect
- New peers have addresses immediately
- Node rename propagates to peers
- Node delete removes from all peer lists
- Hostinfo changes (OS field) propagate
- NodeStore/DB consistency after route mutations
- Grace period timing (8-20s window)
- Ephemeral node deletion (not just offline)
- 10-node simultaneous connect convergence
- Rapid sequential node additions
- Reconnect produces complete map
- Cross-user visibility with default policy
- Same-user multiple nodes get distinct IDs
- Same-hostname nodes get unique GivenNames
- Policy change during connect still converges
- DERP region references are valid
- User profiles present for self and peers
- Self-update arrives after route approval
- Route advertisement stored as AnnouncedRoutes
Extend the servertest harness with:
- TestClient.Direct() accessor for advanced operations
- TestClient.WaitForPeerCount and WaitForCondition helpers
- TestHarness.ChangePolicy for ACL policy testing
- AssertDERPMapPresent and AssertSelfHasAddresses
New test suites:
- content_test.go: self node, DERP map, peer properties, user profiles,
update history monotonicity, and endpoint update propagation
- policy_test.go: default allow-all, explicit policy, policy triggers
updates on all nodes, multiple policy changes, multi-user mesh
- ephemeral_test.go: ephemeral connect, cleanup after disconnect,
mixed ephemeral/regular, reconnect prevents cleanup
- routes_test.go: addresses in AllowedIPs, route advertise and approve,
advertised routes via hostinfo, CGNAT range validation
Also fix node_departs test to use WaitForCondition instead of
assert.Eventually, and convert concurrent_join_and_leave to
interleaved_join_and_leave with grace-period-tolerant assertions.
Add three test files exercising the servertest harness:
- lifecycle_test.go: connection, disconnection, reconnection, session
replacement, and mesh formation at various sizes.
- consistency_test.go: symmetric visibility, consistent peer state,
address presence, concurrent join/leave convergence.
- weather_test.go: rapid reconnects, flapping stability, reconnect
with various delays, concurrent reconnects, and scale tests.
All tests use table-driven patterns with subtests.
Add a new hscontrol/servertest package that provides a test harness
for exercising the full Headscale control protocol in-process, using
Tailscale's controlclient.Direct as the client.
The harness consists of:
- TestServer: wraps a Headscale instance with an httptest.Server
- TestClient: wraps controlclient.Direct with NetworkMap tracking
- TestHarness: orchestrates N clients against a single server
- Assertion helpers for mesh completeness, visibility, and consistency
Export minimal accessor methods on Headscale (HTTPHandler, NoisePublicKey,
GetState, SetServerURL, StartBatcher, StartEphemeralGC) so the servertest
package can construct a working server from outside the hscontrol package.
This enables fast, deterministic tests of connection lifecycle, update
propagation, and network weather scenarios without Docker.
processBatchedChanges queued each pending change for a node as a
separate work item. Since multiple workers pull from the same channel,
two changes for the same node could be processed concurrently by
different workers. This caused two problems:
1. MapResponses delivered out of order — a later change could finish
generating before an earlier one, so the client sees stale state.
2. updateSentPeers and computePeerDiff race against each other —
updateSentPeers does Clear() + Store() which is not atomic relative
to a concurrent Range() in computePeerDiff.
Bundle all pending changes for a node into a single work item so one
worker processes them sequentially. Add a per-node workMu that
serializes processing across consecutive batch ticks, preventing a
second worker from starting tick N+1 while tick N is still in progress.
Fixes #3140
The Batcher's connected field (*xsync.Map[types.NodeID, *time.Time])
encoded three states via pointer semantics:
- nil value: node is connected
- non-nil time: node disconnected at that timestamp
- key missing: node was never seen
This was error-prone (nil meaning 'connected' inverts Go idioms),
redundant with b.nodes + hasActiveConnections(), and required keeping
two parallel maps in sync. It also contained a bug in RemoveNode where
new(time.Time) was used instead of &now, producing a zero time.
Replace the separate connected map with a disconnectedAt field on
multiChannelNodeConn (atomic.Pointer[time.Time]), tracked directly
on the object that already manages the node's connections.
Changes:
- Add disconnectedAt field and helpers (markConnected, markDisconnected,
isConnected, offlineDuration) to multiChannelNodeConn
- Remove the connected field from Batcher
- Simplify IsConnected from two map lookups to one
- Simplify ConnectedMap and Debug from two-map iteration to one
- Rewrite cleanupOfflineNodes to scan b.nodes directly
- Remove the markDisconnectedIfNoConns helper
- Update all tests and benchmarks
Fixes #3141
The Noise handshake accepts any machine key without checking
registration, so all endpoints behind the Noise router are reachable
without credentials. Three handlers used io.ReadAll without size
limits, allowing an attacker to OOM-kill the server.
Fix:
- Add http.MaxBytesReader middleware (1 MiB) on the Noise router.
- Replace io.ReadAll + json.Unmarshal with json.NewDecoder in
PollNetMapHandler and RegistrationHandler.
- Stop reading the body in NotImplementedHandler entirely.
Remove XTestBatcherChannelClosingRace (~95 lines) and
XTestBatcherScalability (~515 lines). These were disabled by
prefixing with X (making them invisible to go test) and served
as dead code. The functionality they covered is exercised by the
active test suite.
Updates #2545
L8: Rename SCREAMING_SNAKE_CASE test constants to idiomatic Go
camelCase. Remove highLoad* and extremeLoad* constants that were
only referenced by disabled (X-prefixed) tests.
L10: Fix misleading assert message that said "1337" while checking
for region ID 999.
L12: Remove emoji from test log output to avoid encoding issues
in CI environments.
Updates #2545
L1: Replace crypto/rand with an atomic counter for generating
connection IDs. These identifiers are process-local and do not need
cryptographic randomness; a monotonic counter is cheaper and
produces shorter, sortable IDs.
L5: Use getActiveConnectionCount() in Debug() instead of directly
locking the mutex and reading the connections slice. This avoids
bypassing the accessor that already exists for this purpose.
L6: Extract the hardcoded 15*time.Minute cleanup threshold into
the named constant offlineNodeCleanupThreshold.
L7: Inline the trivial addWork wrapper; AddWork now calls addToBatch
directly.
Updates #2545
Move connectionEntry, multiChannelNodeConn, generateConnectionID, and
all their methods from batcher.go into a dedicated file. This reduces
batcher.go from ~1170 lines to ~800 and separates per-node connection
management from batcher orchestration.
Pure move — no logic changes.
Updates #2545
- TestBatcher_CloseBeforeStart_DoesNotHang: verifies Close() before
Start() returns promptly now that done is initialized in NewBatcher.
- TestBatcher_QueueWorkAfterClose_DoesNotHang: verifies queueWork
returns via the done channel after Close(), even without Start().
- TestIsConnected_FalseAfterAddNodeFailure: verifies IsConnected
returns false after AddNode fails and removes the last connection.
- TestRemoveConnectionAtIndex_NilsTrailingSlot: verifies the backing
array slot is nil-ed after removal to avoid retaining pointers.
Updates #2545
M7: Nil out trailing *connectionEntry pointers in the backing array
after slice removal in removeConnectionAtIndexLocked and send().
Without this, the GC cannot collect removed entries until the slice
is reallocated.
M1: Initialize the done channel in NewBatcher instead of Start().
Previously, calling Close() or queueWork before Start() would select
on a nil channel, blocking forever. Moving the make() to the
constructor ensures the channel is always usable.
M2: Move b.connected.Delete and b.totalNodes decrement inside the
Compute callback in cleanupOfflineNodes. Previously these ran after
the Compute returned, allowing a concurrent AddNode to reconnect
between the delete and the bookkeeping update, which would wipe the
fresh connected state.
M3: Call markDisconnectedIfNoConns on AddNode error paths. Previously,
when initial map generation or send timed out, the connection was
removed but b.connected retained its old nil (= connected) value,
making IsConnected return true for a node with zero connections.
Updates #2545
Add four unit tests guarding fixes introduced in recent commits:
- TestConnectionEntry_SendFastPath_TimerStopped: verifies the
time.NewTimer fix (H1) does not leak goroutines after many
fast-path sends on a buffered channel.
- TestBatcher_CloseWaitsForWorkers: verifies Close() blocks until all
worker goroutines exit (H3), preventing sends on torn-down channels.
- TestBatcher_CloseThenStartIsNoop: verifies the one-shot lifecycle
contract; Start() after Close() must not spawn new goroutines.
- TestBatcher_CloseStopsTicker: verifies Close() stops the internal
ticker to prevent resource leaks.
Updates #2545
Remove Caller(), channel pointer formatting (fmt.Sprintf("%p",...)),
and mutex timing from send(), addConnection(), and
removeConnectionByChannel(). Move per-broadcast summary and
no-connection logs from Debug to Trace. Remove per-connection
"attempting"/"succeeded" logs entirely; keep Warn for failures.
These methods run on every MapResponse delivery, so the savings
compound quickly under load.
Updates #2545
Close() previously closed the done channel and returned immediately,
without waiting for worker goroutines to exit. This caused goroutine
leaks in tests and allowed workers to race with connection teardown.
The ticker was also never stopped, leaking its internal goroutine.
Add a sync.WaitGroup to track the doWork goroutine and every worker
it spawns. Close() now calls wg.Wait() after signalling shutdown,
ensuring all goroutines have exited before tearing down connections.
Also stop the ticker to prevent resource leaks.
Document that a Batcher must not be reused after Close().
connectionEntry.send() is on the hot path: called once per connection
per broadcast tick. time.After allocates a timer that sits in the
runtime timer heap until it fires (50 ms), even when the channel send
succeeds immediately. At 1000 connected nodes, every tick leaks 1000
timers into the heap, creating continuous GC pressure.
Replace with time.NewTimer + defer timer.Stop() so the timer is
removed from the heap as soon as the fast-path send completes.
Remove the Batcher interface since there is only one implementation.
Rename LockFreeBatcher to Batcher and merge batcher_lockfree.go into
batcher.go.
Drop type assertions in debug.go now that mapBatcher is a concrete
*mapper.Batcher pointer.
Rewrite multiChannelNodeConn.send() to use a two-phase approach:
1. RLock: snapshot connections slice (cheap pointer copy)
2. Unlock: send to all connections (50ms timeouts happen here)
3. Lock: remove failed connections by pointer identity
Previously, send() held the write lock for the entire duration of
sending to all connections. With N stale connections each timing out
at 50ms, this blocked addConnection/removeConnection for N*50ms.
The two-phase approach holds the lock only for O(N) pointer
operations, not for N*50ms I/O waits.
Replace the two-phase Load-check-Delete in cleanupOfflineNodes with
xsync.Map.Compute() for atomic check-and-delete. This prevents the
TOCTOU race where a node reconnects between the hasActiveConnections
check and the Delete call.
Add nil guards on all b.nodes.Load() and b.nodes.Range() call sites
to prevent nil pointer panics from concurrent cleanup races.