Commit Graph

551 Commits

Author SHA1 Message Date
Kristoffer Dalby
d243adaedd types,mapper,integration: enable Taildrive and add cap/drive grant lifecycle test
Add NodeAttrsTaildriveShare and NodeAttrsTaildriveAccess to the node capability map, enabling Taildrive file sharing when granted via policy. Add integration test verifying the full cap/drive grant lifecycle.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
8573ff9158 policy/v2: fix grant-only policies returning FilterAllowAll
compileFilterRules checked only pol.ACLs == nil to decide whether
to return FilterAllowAll (permit-any). Policies that use only Grants
(no ACLs) had nil ACLs, so the function short-circuited before
compiling any CapGrant rules. This meant cap/relay, cap/drive, and
any other App-based grant capabilities were silently ignored.

Check that both ACLs and Grants are empty before returning FilterAllowAll.

Updates #2180
2026-04-01 14:10:42 +01:00
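A minimal sketch of the corrected guard described in the commit above. The types here (Policy, ACL, Grant, FilterRule) are simplified stand-ins for illustration, not headscale's actual definitions:

```go
package main

import "fmt"

// Simplified stand-ins for the real policy types; names are
// illustrative, not headscale's exact API.
type ACL struct{ Action string }
type Grant struct{ App map[string][]string }
type Policy struct {
	ACLs   []ACL
	Grants []Grant
}

type FilterRule struct{ SrcIPs []string }

var FilterAllowAll = []FilterRule{{SrcIPs: []string{"*"}}}

// compileFilterRules sketches the fix: only when BOTH ACLs and Grants
// are empty does the policy mean "permit any". The buggy version
// checked pol.ACLs == nil alone, so grant-only policies (nil ACLs)
// short-circuited into the allow-all branch and their CapGrant rules
// were never compiled.
func compileFilterRules(pol *Policy) []FilterRule {
	if pol == nil || (len(pol.ACLs) == 0 && len(pol.Grants) == 0) {
		return FilterAllowAll
	}
	// ... compile ACLs and Grants (including CapGrant rules) here ...
	return []FilterRule{}
}

func main() {
	grantOnly := &Policy{Grants: []Grant{{App: map[string][]string{"cap/drive": nil}}}}
	fmt.Println(len(compileFilterRules(grantOnly))) // 0: grant-only compiles rules, not allow-all
}
```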
Kristoffer Dalby
8358017dcf policy/v2,state,mapper: implement per-viewer via route steering
Via grants steer routes to specific nodes per viewer. Until now,
all clients saw the same routes for each peer because route
assembly was viewer-independent. This implements per-viewer route
visibility so that via-designated peers serve routes only to
matching viewers, while non-designated peers have those routes
withdrawn.

Add ViaRouteResult type (Include/Exclude prefix lists) and
ViaRoutesForPeer to the PolicyManager interface. The v2
implementation iterates via grants, resolves sources against the
viewer, matches destinations against the peer's advertised routes
(both subnet and exit), and categorizes prefixes by whether the
peer has the via tag.

Add RoutesForPeer to State which composes global primary election,
via Include/Exclude filtering, exit routes, and ACL reduction.
When no via grants exist, it falls back to existing behavior.

Update the mapper to call RoutesForPeer per-peer instead of using
a single route function for all peers. The route function now
returns all routes (subnet + exit), and TailNode filters exit
routes out of the PrimaryRoutes field for HA tracking.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
28be15f8ea policy/v2: handle autogroup:internet in via grant compilation
compileViaGrant only handled *Prefix destinations, skipping
*AutoGroup entirely. This meant via grants with
dst=[autogroup:internet] produced no filter rules even when the
node was an exit node with approved exit routes.

Switch the destination loop from a type assertion to a type switch
that handles both *Prefix (subnet routes) and *AutoGroup (exit
routes via autogroup:internet). Also check ExitRoutes() in
addition to SubnetRoutes() so the function doesn't bail early
when a node only has exit routes.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
687cf0882f policy/v2: implement autogroup:danger-all support
Add autogroup:danger-all as a valid source alias that matches ALL IP
addresses including non-Tailscale addresses. When used as a source,
it resolves to 0.0.0.0/0 + ::/0 internally but produces SrcIPs: ["*"]
in filter rules. When used as a destination, it is rejected with an
error matching Tailscale SaaS behavior.

Key changes:
- Add AutoGroupDangerAll constant and validation
- Add sourcesHaveDangerAll() helper and hasDangerAll parameter to
  srcIPsWithRoutes() across all compilation paths
- Add ErrAutogroupDangerAllDst for destination rejection
- Remove 3 AUTOGROUP_DANGER_ALL skip entries (K6, K7, K8)

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
4f040dead2 policy/v2: implement grant validation rules matching Tailscale SaaS
Implement comprehensive grant validation: accept empty sources/destinations (they produce no rules), and validate the grant ip/app field requirements, capability name format, autogroup constraints, via tag existence, and default route CIDR restrictions.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
54db47badc policy/v2: implement via route compilation for grants
Compile grants with "via" field into FilterRules that are placed only
on nodes matching the via tag and actually advertising the destination
subnets. Key behavior:

- Filter rules go exclusively to via-nodes with matching approved routes
- Destination subnets not advertised by the via node are silently dropped
- App-only via grants (no ip field) produce no packet filter rules
- Via grants are skipped in the global compileFilterRules since they
  are node-specific

Reduces grant compat test skips from 41 to 30 (11 newly passing).

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
0e3acdd8ec policy/v2: implement CapGrant compilation with companion capabilities
Compile grant app fields into CapGrant FilterRules matching Tailscale
SaaS behavior. Key changes:

- Generate CapGrant rules in compileFilterRules and
  compileGrantWithAutogroupSelf, with node-specific /32 and /128
  Dsts for autogroup:self grants
- Add reversed companion rules for drive→drive-sharer and
  relay→relay-target capabilities, ordered by original cap name
- Narrow broad CapGrant Dsts to node-specific prefixes in
  ReduceFilterRules via new reduceCapGrantRule helper
- Skip merging CapGrant rules in mergeFilterRules to preserve
  per-capability structure
- Remove ip+app mutual exclusivity validation (Tailscale accepts both)
- Add semantic JSON comparison for RawMessage types and netip.Prefix
  comparators in test infrastructure

Reduces grant compat test skips from 99 to 41 (58 newly passing).

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
ebe0f4078d policy/v2: preserve non-wildcard source IPs alongside wildcard ranges
When an ACL source list contains a wildcard (*) alongside explicit
sources (tags, groups, hosts, etc.), Tailscale preserves the individual
IPs from non-wildcard sources in SrcIPs alongside the merged wildcard
CGNAT ranges. Previously, headscale's IPSetBuilder would merge all
sources into a single set, absorbing the explicit IPs into the wildcard
range.

Track non-wildcard resolved addresses separately during source
resolution, then append their individual IP strings to the output
when a wildcard is also present. This fixes the remaining 5 ACL
compat test failures (K01 and M06 subtests).

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
dda35847b0 policy/v2: reorder ACL self grants to match Tailscale rule ordering
When an ACL has non-autogroup destinations (groups, users, tags, hosts)
alongside autogroup:self, emit non-self grants before self grants to
match Tailscale's filter rule ordering. ACLs with only autogroup
destinations (self + member) preserve the policy-defined order.

This fixes ACL-A17, ACL-SF07, and ACL-SF11 compat test failures.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
f95b254ea9 policy/v2: exclude exit routes from ReduceFilterRules
Add an exit route check in ReduceFilterRules to prevent exit nodes from receiving packet filter rules for destinations that only overlap their exit routes. Remove the resolved SUBNET_ROUTE_FILTER_RULES grant skip entries and update the error message formatting for grant validation.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
e05f45cfb1 policy/v2: use approved node routes in wildcard SrcIPs
Per Tailscale documentation, the wildcard (*) source includes "any
approved subnets" — the actually-advertised-and-approved routes from
nodes, not the autoApprover policy prefixes.

Change Asterix.resolve() to return just the base CGNAT+ULA set, and
add approved subnet routes as separate SrcIPs entries in the filter
compilation path. This preserves individual route prefixes that would
otherwise be merged by IPSet (e.g., 10.0.0.0/8 absorbing 10.33.0.0/16).

Also swap rule ordering in compileGrantWithAutogroupSelf() to emit
non-self destination rules before autogroup:self rules, matching the
Tailscale FilterRule wire format ordering.

Remove the unused AutoApproverPolicy.prefixes() method.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
995ed0187c policy/v2: add advertised routes to compat test topologies
Add routable_ips and approved_routes fields to the node topology
definitions in all golden test files. These represent the subnet
routes actually advertised by nodes on the Tailscale SaaS network
during data capture:

  Routes topology (92 files, 6 router nodes):
    big-router:     10.0.0.0/8
    subnet-router:  10.33.0.0/16
    ha-router1:     192.168.1.0/24
    ha-router2:     192.168.1.0/24
    multi-router:   172.16.0.0/24
    exit-node:      0.0.0.0/0, ::/0

  ACL topology (199 files, 1 router node):
    subnet-router:  10.33.0.0/16

  Grants topology (203 files, 1 router node):
    subnet-router:  10.33.0.0/16

The route assignments were deduced from the golden data by analyzing
which router nodes receive FilterRules for which destination CIDRs
across all test files, and cross-referenced with the MTS setup
script (setup_grant_nodes.sh).

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
927ce418d2 policy/v2: use bare IPs in autogroup:self DstPorts
Use ip.String() instead of netip.PrefixFrom(ip, ip.BitLen()).String()
when building DstPorts for autogroup:self destinations. This produces
bare IPs like "100.90.199.68" instead of CIDR notation like
"100.90.199.68/32", matching the Tailscale FilterRule wire format.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
93d79d8da9 policy: include IPv6 in identity-based alias resolution
AppendToIPSet now adds both IPv4 and IPv6 addresses for nodes, matching Tailscale's FilterRule wire format where identity-based aliases (tags, users, groups, autogroups) resolve to both address families. Update ReduceFilterRules test expectations to include IPv6 entries.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
500442c8f1 policy/v2: convert routes compat tests to data-driven format with Tailscale SaaS captures
Replace 8,286 lines of inline Go test expectations with 92 JSON golden files captured from Tailscale SaaS. The data-driven test driver validates route filtering, auto-approval, HA routing, and exit node behavior against real Tailscale output.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
2fb71690e8 policy/v2: convert ACL compat tests to data-driven format with Tailscale SaaS captures
Replace 9,937 lines of inline Go test expectations with 215 JSON golden files captured from Tailscale SaaS. The new data-driven test driver compares headscale's filter compilation output against real Tailscale behavior for each node in an 8-node topology.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
9f7aa55689 policy/v2: refactor alias resolution to use ResolvedAddresses
Introduce ResolvedAddresses type for structured IP set results. Refactor all Alias.Resolve() methods to return ResolvedAddresses instead of raw IPSets. Restrict identity-based aliases to matching address families, fix nil dereferences in partial resolution paths, and update test expectations for the new IP format (bare IPs, IP ranges instead of CIDR prefixes).

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
0fa9dcaff8 policy/v2: add data-driven grants compatibility test with Tailscale SaaS captures
Rename tailscale_compat_test.go to tailscale_acl_compat_test.go to make room for the grants compat test. Add 237 GRANT-*.json golden test files captured from Tailscale SaaS and a data-driven test driver that compares headscale's grant filter compilation against real Tailscale behavior.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
f74ea5b8ed hscontrol/policy/v2: add Grant policy format support
Add support for the Grant policy format as an alternative to ACL format,
following Tailscale's policy v2 specification. Grants provide a more
structured way to define network access rules with explicit separation
of IP-based and capability-based permissions.

Key changes:

- Add Grant struct with Sources, Destinations, InternetProtocols (ip),
  and App (capabilities) fields
- Add ProtocolPort type for unmarshaling protocol:port strings
- Add Grant validation in Policy.validate() to enforce:
  - Mutual exclusivity of ip and app fields
  - Required ip or app field presence
  - Non-empty sources and destinations
- Refactor compileFilterRules to support both ACLs and Grants
- Convert ACLs to Grants internally via aclToGrants() for unified
  processing
- Extract destinationsToNetPortRange() helper for cleaner code
- Rename parseProtocol() to toIANAProtocolNumbers() for clarity
- Add ProtocolNumberToName mapping for reverse lookups

The Grant format allows policies to be written using either the legacy
ACL format or the new Grant format. ACLs are converted to Grants
internally, ensuring backward compatibility while enabling the new
format's benefits.

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
53b8a81d48 servertest: support tagged pre-auth keys in test clients
WithTags was defined but never passed through to CreatePreAuthKey.
Fix NewClient to use CreateTaggedPreAuthKey when tags are specified,
enabling tests that need tagged nodes (e.g. via grant steering).

Updates #2180
2026-04-01 14:10:42 +01:00
Kristoffer Dalby
15c1cfd778 types: include ExitRoutes in HasNetworkChanges
When exit routes are approved, SubnetRoutes remains empty because exit
routes (0.0.0.0/0, ::/0) are classified separately. Without checking
ExitRoutes, the PolicyManager cache is not invalidated on exit route
approval, causing stale filter rules that lack via grant entries for
autogroup:internet destinations.

Updates #2180
2026-04-01 14:10:42 +01:00
Tanayk07
568baf3d02 fix: align banner right-side border to consistent 64-char width 2026-03-19 07:08:35 +01:00
Tanayk07
5105033224 feat: add prominent warning banner for non-standard IP prefixes
Add a highly visible ASCII-art warning banner that is printed at
startup when the configured IP prefixes fall outside the standard
Tailscale CGNAT (100.64.0.0/10) or ULA (fd7a:115c:a1e0::/48) ranges.

The warning fires once even if both v4 and v6 are non-standard, and
the warnBanner() function is reusable for other critical configuration
warnings in the future.

Also updates config-example.yaml to clarify that subsets of the
default ranges are fine, but ranges outside CGNAT/ULA are not.

Closes #3055
2026-03-19 07:08:35 +01:00
Kristoffer Dalby
3d53f97c82 hscontrol/servertest: fix test expectations for eventual consistency
Three corrections to issue tests that had wrong assumptions about
when data becomes available:

1. initial_map_should_include_peer_online_status: use WaitForCondition
   instead of checking the initial netmap. Online status is set by
   Connect() which sends a PeerChange patch after the initial
   RegisterResponse, so it may not be present immediately.

2. disco_key_should_propagate_to_peers: use WaitForCondition. The
   DiscoKey is sent in the first MapRequest (not RegisterRequest),
   so peers may not see it until a subsequent map update.

3. approved_route_without_announcement: invert the test expectation.
   Tailscale uses a strict advertise-then-approve model -- routes are
   only distributed when the node advertises them (Hostinfo.RoutableIPs)
   AND they are approved. An approval without advertisement is a dormant
   pre-approval. The test now asserts the route does NOT appear in
   AllowedIPs, matching upstream Tailscale semantics.

Also fix TestClient.Reconnect to clear the cached netmap and drain
pending updates before re-registering. Without this, WaitForPeers
returned immediately based on the old session's stale data.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
1053fbb16b hscontrol/state: fix online status reset during re-registration
Two fixes to how online status is handled during registration:

1. Re-registration (applyAuthNodeUpdate, HandleNodeFromPreAuthKey) no
   longer resets IsOnline to false. Online status is managed exclusively
   by Connect()/Disconnect() in the poll session lifecycle. The reset
   caused a false offline blip: the auth handler's change notification
   triggered a map regeneration showing the node as offline to peers,
   even though Connect() would set it back to true moments later.

2. New node creation (createAndSaveNewNode) now explicitly sets
   IsOnline=false instead of leaving it nil. This ensures peers always
   receive a known online status rather than an ambiguous nil/unknown.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
b09af3846b hscontrol/poll,state: fix grace period disconnect TOCTOU race
When a node disconnects, serveLongPoll defers a cleanup that starts a
grace period goroutine. This goroutine polls batcher.IsConnected() and,
if the node has not reconnected within ~10 seconds, calls
state.Disconnect() to mark it offline. A TOCTOU race exists: the node
can reconnect (calling Connect()) between the IsConnected check and
the Disconnect() call, causing the stale Disconnect() to overwrite
the new session's online status.

Fix with a monotonic per-node generation counter:

- State.Connect() increments the counter and returns the current
  generation alongside the change list.
- State.Disconnect() accepts the generation from the caller and
  rejects the call if a newer generation exists, making stale
  disconnects from old sessions a no-op.
- serveLongPoll captures the generation at Connect() time and passes
  it to Disconnect() in the deferred cleanup.
- RemoveNode's return value is now checked: if another session already
  owns the batcher slot (reconnect happened), the old session skips
  the grace period entirely.

Update batcher_test.go to track per-node connect generations and
pass them through to Disconnect(), matching production behavior.

Fixes the following test failures:
- server_state_online_after_reconnect_within_grace
- update_history_no_false_offline
- nodestore_correct_after_rapid_reconnect
- rapid_reconnect_peer_never_sees_offline
2026-03-19 07:05:58 +01:00
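The generation-counter guard described above reduces to a small pattern. This is a sketch with a simplified nodeState type; the real code lives on State, is keyed by node ID, and returns the generation alongside the change list:

```go
package main

import (
	"fmt"
	"sync"
)

// nodeState sketches the monotonic per-node generation guard.
type nodeState struct {
	mu     sync.Mutex
	gen    uint64
	online bool
}

// Connect marks the node online and returns the generation the caller
// must later present to Disconnect.
func (n *nodeState) Connect() uint64 {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.gen++
	n.online = true
	return n.gen
}

// Disconnect is a no-op when a newer generation exists, making stale
// disconnects from an old session's grace-period goroutine harmless.
func (n *nodeState) Disconnect(gen uint64) bool {
	n.mu.Lock()
	defer n.mu.Unlock()
	if gen < n.gen {
		return false // a newer session reconnected; ignore
	}
	n.online = false
	return true
}

func main() {
	var n nodeState
	old := n.Connect()
	_ = n.Connect() // node reconnects before the grace period fires
	fmt.Println(n.Disconnect(old), n.online) // false true: stale disconnect rejected
}
```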
Kristoffer Dalby
00c41b6422 hscontrol/servertest: add race, stress, and poll race tests
Add three test files designed to stress the control plane under
concurrent and adversarial conditions:

- race_test.go: 14 tests exercising concurrent mutations, session
  replacement, batcher contention, NodeStore access, and map response
  delivery during disconnect. All pass the Go race detector.

- poll_race_test.go: 8 tests targeting the poll.go grace period
  interleaving. These confirm a logical TOCTOU race: when a node
  disconnects and reconnects within the grace period, the old
  session's deferred Disconnect() can overwrite the new session's
  Connect(), leaving IsOnline=false despite an active poll session.

- stress_test.go: sustained churn, rapid mutations, rolling
  replacement, data integrity checks under load, and verification
  that rapid reconnects do not leak false-offline notifications.

Known failing tests (grace period TOCTOU race):
- server_state_online_after_reconnect_within_grace
- update_history_no_false_offline
- rapid_reconnect_peer_never_sees_offline
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
ab4e205ce7 hscontrol/servertest: expand issue tests to 24 scenarios, surface 4 issues
Split TestIssues into 7 focused test functions to stay under cyclomatic
complexity limits while testing more aggressively.

Issues surfaced (4 failing tests):

1. initial_map_should_include_peer_online_status: Initial MapResponse
   has Online=nil for peers. Online status only arrives later via
   PeersChangedPatch.

2. disco_key_should_propagate_to_peers: DiscoPublicKey set by client
   is not visible to peers. Peers see zero disco key.

3. approved_route_without_announcement_is_visible: Server-side route
   approval without client-side announcement silently produces empty
   SubnetRoutes (intersection of empty announced + approved = empty).

4. nodestore_correct_after_rapid_reconnect: After 5 rapid reconnect
   cycles, NodeStore reports node as offline despite having an active
   poll session. The connect/disconnect grace period interleaving
   leaves IsOnline in an incorrect state.

Passing tests (20) verify:
- IP uniqueness across 10 nodes
- IP stability across reconnect
- New peers have addresses immediately
- Node rename propagates to peers
- Node delete removes from all peer lists
- Hostinfo changes (OS field) propagate
- NodeStore/DB consistency after route mutations
- Grace period timing (8-20s window)
- Ephemeral node deletion (not just offline)
- 10-node simultaneous connect convergence
- Rapid sequential node additions
- Reconnect produces complete map
- Cross-user visibility with default policy
- Same-user multiple nodes get distinct IDs
- Same-hostname nodes get unique GivenNames
- Policy change during connect still converges
- DERP region references are valid
- User profiles present for self and peers
- Self-update arrives after route approval
- Route advertisement stored as AnnouncedRoutes
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
f87b08676d hscontrol/servertest: add policy, route, ephemeral, and content tests
Extend the servertest harness with:
- TestClient.Direct() accessor for advanced operations
- TestClient.WaitForPeerCount and WaitForCondition helpers
- TestHarness.ChangePolicy for ACL policy testing
- AssertDERPMapPresent and AssertSelfHasAddresses

New test suites:
- content_test.go: self node, DERP map, peer properties, user profiles,
  update history monotonicity, and endpoint update propagation
- policy_test.go: default allow-all, explicit policy, policy triggers
  updates on all nodes, multiple policy changes, multi-user mesh
- ephemeral_test.go: ephemeral connect, cleanup after disconnect,
  mixed ephemeral/regular, reconnect prevents cleanup
- routes_test.go: addresses in AllowedIPs, route advertise and approve,
  advertised routes via hostinfo, CGNAT range validation

Also fix node_departs test to use WaitForCondition instead of
assert.Eventually, and convert concurrent_join_and_leave to
interleaved_join_and_leave with grace-period-tolerant assertions.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
ca7362e9aa hscontrol/servertest: add control plane lifecycle and consistency tests
Add three test files exercising the servertest harness:

- lifecycle_test.go: connection, disconnection, reconnection, session
  replacement, and mesh formation at various sizes.
- consistency_test.go: symmetric visibility, consistent peer state,
  address presence, concurrent join/leave convergence.
- weather_test.go: rapid reconnects, flapping stability, reconnect
  with various delays, concurrent reconnects, and scale tests.

All tests use table-driven patterns with subtests.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
0288614bdf hscontrol: add servertest harness for in-process control plane testing
Add a new hscontrol/servertest package that provides a test harness
for exercising the full Headscale control protocol in-process, using
Tailscale's controlclient.Direct as the client.

The harness consists of:
- TestServer: wraps a Headscale instance with an httptest.Server
- TestClient: wraps controlclient.Direct with NetworkMap tracking
- TestHarness: orchestrates N clients against a single server
- Assertion helpers for mesh completeness, visibility, and consistency

Export minimal accessor methods on Headscale (HTTPHandler, NoisePublicKey,
GetState, SetServerURL, StartBatcher, StartEphemeralGC) so the servertest
package can construct a working server from outside the hscontrol package.

This enables fast, deterministic tests of connection lifecycle, update
propagation, and network weather scenarios without Docker.
2026-03-19 07:05:58 +01:00
Kristoffer Dalby
82c7efccf8 mapper/batcher: serialize per-node work to prevent out-of-order delivery
processBatchedChanges queued each pending change for a node as a
separate work item. Since multiple workers pull from the same channel,
two changes for the same node could be processed concurrently by
different workers. This caused two problems:

1. MapResponses delivered out of order — a later change could finish
   generating before an earlier one, so the client sees stale state.
2. updateSentPeers and computePeerDiff race against each other —
   updateSentPeers does Clear() + Store() which is not atomic relative
   to a concurrent Range() in computePeerDiff.

Bundle all pending changes for a node into a single work item so one
worker processes them sequentially. Add a per-node workMu that
serializes processing across consecutive batch ticks, preventing a
second worker from starting tick N+1 while tick N is still in progress.

Fixes #3140
2026-03-19 07:05:58 +01:00
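The bundling-plus-per-node-mutex idea from the commit above can be sketched like this. nodeWork and process are illustrative stand-ins for the batcher's real work-item types; the point is that one worker drains a node's whole bundle while workMu keeps a second worker from interleaving:

```go
package main

import (
	"fmt"
	"sync"
)

// nodeWork sketches per-node serialization: all pending changes for a
// node travel as one bundle, and workMu serializes processing across
// consecutive batch ticks.
type nodeWork struct {
	mu      sync.Mutex // workMu: tick N+1 waits for tick N
	applied []string
}

// process applies one tick's bundle of changes atomically with
// respect to other workers handling the same node.
func (n *nodeWork) process(changes []string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.applied = append(n.applied, changes...)
}

func main() {
	var n nodeWork
	bundle := []string{"route-change", "endpoint-change"} // one tick's pending changes
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // several workers, each holding a whole bundle
		wg.Add(1)
		go func() {
			defer wg.Done()
			n.process(bundle)
		}()
	}
	wg.Wait()
	fmt.Println(len(n.applied)) // 8: bundles never interleave mid-bundle
}
```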
Kristoffer Dalby
87b8507ac9 mapper/batcher: replace connected map with per-node disconnectedAt
The Batcher's connected field (*xsync.Map[types.NodeID, *time.Time])
encoded three states via pointer semantics:

  - nil value:    node is connected
  - non-nil time: node disconnected at that timestamp
  - key missing:  node was never seen

This was error-prone (nil meaning 'connected' inverts Go idioms),
redundant with b.nodes + hasActiveConnections(), and required keeping
two parallel maps in sync. It also contained a bug in RemoveNode where
new(time.Time) was used instead of &now, producing a zero time.

Replace the separate connected map with a disconnectedAt field on
multiChannelNodeConn (atomic.Pointer[time.Time]), tracked directly
on the object that already manages the node's connections.

Changes:
  - Add disconnectedAt field and helpers (markConnected, markDisconnected,
    isConnected, offlineDuration) to multiChannelNodeConn
  - Remove the connected field from Batcher
  - Simplify IsConnected from two map lookups to one
  - Simplify ConnectedMap and Debug from two-map iteration to one
  - Rewrite cleanupOfflineNodes to scan b.nodes directly
  - Remove the markDisconnectedIfNoConns helper
  - Update all tests and benchmarks

Fixes #3141
2026-03-16 02:22:56 -07:00
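The replacement state can be sketched as below. The field and helper names mirror the commit, but the surrounding type is stripped down to just the disconnect tracking; note that nil now means "no disconnect recorded", the usual Go reading:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// multiChannelNodeConn (simplified) carries its own disconnect
// timestamp instead of a shared map where a nil value meant
// "connected".
type multiChannelNodeConn struct {
	disconnectedAt atomic.Pointer[time.Time]
}

func (c *multiChannelNodeConn) markConnected() { c.disconnectedAt.Store(nil) }

func (c *multiChannelNodeConn) markDisconnected() {
	now := time.Now()
	c.disconnectedAt.Store(&now)
}

func (c *multiChannelNodeConn) isConnected() bool { return c.disconnectedAt.Load() == nil }

// offlineDuration reports how long the node has been offline, or 0.
func (c *multiChannelNodeConn) offlineDuration() time.Duration {
	t := c.disconnectedAt.Load()
	if t == nil {
		return 0
	}
	return time.Since(*t)
}

func main() {
	var c multiChannelNodeConn
	fmt.Println(c.isConnected()) // true: zero value has no disconnect recorded
	c.markDisconnected()
	fmt.Println(c.isConnected()) // false
	c.markConnected()
	fmt.Println(c.isConnected()) // true
}
```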
Kristoffer Dalby
60317064fd mapper/batcher: serialize per-node work to prevent out-of-order delivery
processBatchedChanges queued each pending change for a node as a
separate work item. Since multiple workers pull from the same channel,
two changes for the same node could be processed concurrently by
different workers. This caused two problems:

1. MapResponses delivered out of order — a later change could finish
   generating before an earlier one, so the client sees stale state.
2. updateSentPeers and computePeerDiff race against each other —
   updateSentPeers does Clear() + Store() which is not atomic relative
   to a concurrent Range() in computePeerDiff.

Bundle all pending changes for a node into a single work item so one
worker processes them sequentially. Add a per-node workMu that
serializes processing across consecutive batch ticks, preventing a
second worker from starting tick N+1 while tick N is still in progress.

Fixes #3140
2026-03-16 02:22:46 -07:00
Juan Font
4d427cfe2a noise: limit request body size to prevent unauthenticated OOM
The Noise handshake accepts any machine key without checking
registration, so all endpoints behind the Noise router are reachable
without credentials. Three handlers used io.ReadAll without size
limits, allowing an attacker to OOM-kill the server.

Fix:
- Add http.MaxBytesReader middleware (1 MiB) on the Noise router.
- Replace io.ReadAll + json.Unmarshal with json.NewDecoder in
  PollNetMapHandler and RegistrationHandler.
- Stop reading the body in NotImplementedHandler entirely.
2026-03-16 09:28:31 +01:00
Kristoffer Dalby
afd3a6acbc mapper/batcher: remove disabled X-prefixed test functions
Remove XTestBatcherChannelClosingRace (~95 lines) and
XTestBatcherScalability (~515 lines). These were disabled by
prefixing with X (making them invisible to go test) and served
as dead code. The functionality they covered is exercised by the
active test suite.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
feaf85bfbc mapper/batcher: clean up test constants and output
L8: Rename SCREAMING_SNAKE_CASE test constants to idiomatic Go
camelCase. Remove highLoad* and extremeLoad* constants that were
only referenced by disabled (X-prefixed) tests.

L10: Fix misleading assert message that said "1337" while checking
for region ID 999.

L12: Remove emoji from test log output to avoid encoding issues
in CI environments.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
86e279869e mapper/batcher: minor production code cleanup
L1: Replace crypto/rand with an atomic counter for generating
connection IDs. These identifiers are process-local and do not need
cryptographic randomness; a monotonic counter is cheaper and
produces shorter, sortable IDs.

L5: Use getActiveConnectionCount() in Debug() instead of directly
locking the mutex and reading the connections slice. This avoids
bypassing the accessor that already exists for this purpose.

L6: Extract the hardcoded 15*time.Minute cleanup threshold into
the named constant offlineNodeCleanupThreshold.

L7: Inline the trivial addWork wrapper; AddWork now calls addToBatch
directly.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
7881f65358 mapper: extract node connection types to node_conn.go
Move connectionEntry, multiChannelNodeConn, generateConnectionID, and
all their methods from batcher.go into a dedicated file. This reduces
batcher.go from ~1170 lines to ~800 and separates per-node connection
management from batcher orchestration.

Pure move — no logic changes.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
2d549e579f mapper/batcher: add regression tests for M1, M3, M7 fixes
- TestBatcher_CloseBeforeStart_DoesNotHang: verifies Close() before
  Start() returns promptly now that done is initialized in NewBatcher.

- TestBatcher_QueueWorkAfterClose_DoesNotHang: verifies queueWork
  returns via the done channel after Close(), even without Start().

- TestIsConnected_FalseAfterAddNodeFailure: verifies IsConnected
  returns false after AddNode fails and removes the last connection.

- TestRemoveConnectionAtIndex_NilsTrailingSlot: verifies the backing
  array slot is nil-ed after removal to avoid retaining pointers.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
50e8b21471 mapper/batcher: fix pointer retention, done-channel init, and connected-map races
M7: Nil out trailing *connectionEntry pointers in the backing array
after slice removal in removeConnectionAtIndexLocked and send().
Without this, the GC cannot collect removed entries until the slice
is reallocated.

M1: Initialize the done channel in NewBatcher instead of Start().
Previously, calling Close() or queueWork before Start() would select
on a nil channel, blocking forever. Moving the make() to the
constructor ensures the channel is always usable.

M2: Move b.connected.Delete and b.totalNodes decrement inside the
Compute callback in cleanupOfflineNodes. Previously these ran after
the Compute returned, allowing a concurrent AddNode to reconnect
between the delete and the bookkeeping update, which would wipe the
fresh connected state.

M3: Call markDisconnectedIfNoConns on AddNode error paths. Previously,
when initial map generation or send timed out, the connection was
removed but b.connected retained its old nil (= connected) value,
making IsConnected return true for a node with zero connections.

Updates #2545
2026-03-14 02:52:28 -07:00
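The M7 pointer-retention fix is a small but easy-to-miss slice idiom, sketched here with a simplified connectionEntry and a hypothetical removeAt helper (the real code does this inside removeConnectionAtIndexLocked and send()):

```go
package main

import "fmt"

type connectionEntry struct{ id int }

// removeAt removes the element at i by shifting the tail down, then
// nils the now-unused trailing slot so the backing array stops
// retaining the pointer and the GC can collect the removed entry (M7).
func removeAt(conns []*connectionEntry, i int) []*connectionEntry {
	copy(conns[i:], conns[i+1:])
	conns[len(conns)-1] = nil // release the trailing pointer
	return conns[:len(conns)-1]
}

func main() {
	conns := []*connectionEntry{{1}, {2}, {3}}
	backing := conns[:3] // a view of the shared backing array
	conns = removeAt(conns, 0)
	fmt.Println(len(conns), backing[2] == nil) // 2 true: trailing slot no longer pins the entry
}
```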
Kristoffer Dalby
8e26651f2c mapper/batcher: add regression tests for timer leak and Close lifecycle
Add four unit tests guarding fixes introduced in recent commits:

- TestConnectionEntry_SendFastPath_TimerStopped: verifies the
  time.NewTimer fix (H1) does not leak goroutines after many
  fast-path sends on a buffered channel.

- TestBatcher_CloseWaitsForWorkers: verifies Close() blocks until all
  worker goroutines exit (H3), preventing sends on torn-down channels.

- TestBatcher_CloseThenStartIsNoop: verifies the one-shot lifecycle
  contract; Start() after Close() must not spawn new goroutines.

- TestBatcher_CloseStopsTicker: verifies Close() stops the internal
  ticker to prevent resource leaks.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
57a38b5678 mapper/batcher: reduce hot-path log verbosity
Remove Caller(), channel pointer formatting (fmt.Sprintf("%p",...)),
and mutex timing from send(), addConnection(), and
removeConnectionByChannel(). Move per-broadcast summary and
no-connection logs from Debug to Trace. Remove per-connection
"attempting"/"succeeded" logs entirely; keep Warn for failures.

These methods run on every MapResponse delivery, so the savings
compound quickly under load.

Updates #2545
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
051a38a4c4 mapper/batcher: track worker goroutines and stop ticker on Close
Close() previously closed the done channel and returned immediately,
without waiting for worker goroutines to exit. This caused goroutine
leaks in tests and allowed workers to race with connection teardown.
The ticker was also never stopped, leaking its internal goroutine.

Add a sync.WaitGroup to track the doWork goroutine and every worker
it spawns. Close() now calls wg.Wait() after signalling shutdown,
ensuring all goroutines have exited before tearing down connections.
Also stop the ticker to prevent resource leaks.

Document that a Batcher must not be reused after Close().
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
3276bda0c0 mapper/batcher: replace time.After with NewTimer to avoid timer leak
connectionEntry.send() is on the hot path: called once per connection
per broadcast tick. time.After allocates a timer that sits in the
runtime timer heap until it fires (50 ms), even when the channel send
succeeds immediately. At 1000 connected nodes, every tick leaks 1000
timers into the heap, creating continuous GC pressure.

Replace with time.NewTimer + defer timer.Stop() so the timer is
removed from the heap as soon as the fast-path send completes.
2026-03-14 02:52:28 -07:00
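The hot-path fix above is this standard pattern. The send helper and message type are simplified stand-ins for connectionEntry.send():

```go
package main

import (
	"fmt"
	"time"
)

// send attempts a channel send with a 50 ms deadline. Using
// time.NewTimer + Stop (instead of time.After) removes the timer from
// the runtime heap the moment the fast-path send succeeds, so a tick
// across 1000 nodes no longer leaves 1000 pending timers behind.
func send(ch chan<- string, msg string) bool {
	timer := time.NewTimer(50 * time.Millisecond)
	defer timer.Stop() // frees the timer right after a fast-path send
	select {
	case ch <- msg:
		return true
	case <-timer.C:
		return false // slow consumer: give up on this connection
	}
}

func main() {
	ch := make(chan string, 1) // buffered: the fast path wins immediately
	fmt.Println(send(ch, "map-response")) // true
	fmt.Println(send(ch, "map-response")) // false: buffer full, 50 ms timeout fires
}
```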
Kristoffer Dalby
2058343ad6 mapper: remove Batcher interface, rename to Batcher struct
Remove the Batcher interface since there is only one implementation.
Rename LockFreeBatcher to Batcher and merge batcher_lockfree.go into
batcher.go.

Drop type assertions in debug.go now that mapBatcher is a concrete
*mapper.Batcher pointer.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
9b24a39943 mapper/batcher: add scale benchmarks
Add benchmarks that systematically test node counts from 100 to
50,000 to identify scaling limits and validate performance under
load.
2026-03-14 02:52:28 -07:00
Kristoffer Dalby
3ebe4d99c1 mapper/batcher: reduce lock contention with two-phase send
Rewrite multiChannelNodeConn.send() to use a two-phase approach:
1. RLock: snapshot connections slice (cheap pointer copy)
2. Unlock: send to all connections (50ms timeouts happen here)
3. Lock: remove failed connections by pointer identity

Previously, send() held the write lock for the entire duration of
sending to all connections. With N stale connections each timing out
at 50ms, this blocked addConnection/removeConnection for N*50ms.
The two-phase approach holds the lock only for O(N) pointer
operations, not for N*50ms I/O waits.
2026-03-14 02:52:28 -07:00
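The three phases above can be sketched as follows. connEntry, trySend, and the non-blocking default (standing in for the real 50 ms timeout) are simplifications for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

type connEntry struct{ ch chan string }

// trySend stands in for the real timeout-bounded send.
func (c *connEntry) trySend(msg string) bool {
	select {
	case c.ch <- msg:
		return true
	default:
		return false
	}
}

type nodeConn struct {
	mu    sync.RWMutex
	conns []*connEntry
}

func (n *nodeConn) send(msg string) {
	// Phase 1 (RLock): cheap snapshot of the connections slice.
	n.mu.RLock()
	snapshot := append([]*connEntry(nil), n.conns...)
	n.mu.RUnlock()

	// Phase 2 (unlocked): slow I/O; timeouts no longer block
	// addConnection/removeConnection for N*50ms.
	var failed []*connEntry
	for _, c := range snapshot {
		if !c.trySend(msg) {
			failed = append(failed, c)
		}
	}
	if len(failed) == 0 {
		return
	}

	// Phase 3 (Lock): O(N) removal of failures by pointer identity.
	n.mu.Lock()
	defer n.mu.Unlock()
	for _, f := range failed {
		for i, c := range n.conns {
			if c == f {
				n.conns = append(n.conns[:i], n.conns[i+1:]...)
				break
			}
		}
	}
}

func main() {
	good := &connEntry{ch: make(chan string, 1)}
	stale := &connEntry{ch: make(chan string)} // unbuffered, no reader: send fails
	n := &nodeConn{conns: []*connEntry{good, stale}}
	n.send("tick")
	fmt.Println(len(n.conns)) // 1: the stale connection was pruned
}
```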
Kristoffer Dalby
da33795e79 mapper/batcher: fix race conditions in cleanup and lookups
Replace the two-phase Load-check-Delete in cleanupOfflineNodes with
xsync.Map.Compute() for atomic check-and-delete. This prevents the
TOCTOU race where a node reconnects between the hasActiveConnections
check and the Delete call.

Add nil guards on all b.nodes.Load() and b.nodes.Range() call sites
to prevent nil pointer panics from concurrent cleanup races.
2026-03-14 02:52:28 -07:00