Commit Graph

4141 Commits

Author SHA1 Message Date
Florian Preinstorfer
e13f0458bb Remove redundant prefix 2026-05-12 14:12:29 +02:00
Florian Preinstorfer
68b0014871 Use distroless without quotes 2026-05-12 14:12:29 +02:00
Florian Preinstorfer
484462898b Remove link to sqlite
Other mentions of SQLite don't link either.
2026-05-12 14:12:29 +02:00
Florian Preinstorfer
45b698dbac Shorten container introduction 2026-05-12 14:12:29 +02:00
Florian Preinstorfer
14ce7e9106 Remove link to Arch AUR headscale-git
It's outdated and unmaintained.
2026-05-12 14:12:29 +02:00
Florian Preinstorfer
84c7f0d450 Link to development builds 2026-05-12 14:12:29 +02:00
Florian Preinstorfer
c7f221dd0a Fix typo and wording 2026-05-12 14:12:29 +02:00
Florian Preinstorfer
163363a12a Use docs instead of KB 2026-05-12 14:12:29 +02:00
Kristoffer Dalby
f03d41ea9a CHANGELOG: document policy tests (beta)
Fixes #1803
2026-05-12 11:54:54 +01:00
Kristoffer Dalby
d5b2837231 policy/v2: match default proto set for tests with no proto
The policy `tests` block lets entries omit `proto`. Tailscale's client
maps that to the default protocol set {TCP, UDP, ICMP, ICMPv6} — the
captured packet_filter_matches show all four IANA numbers explicitly
when no proto is set — and a rule restricted to any one of them
satisfies an empty-proto reachability test.

srcReachesDst was passing the empty Protocol through unchanged, which
landed an empty []int in ruleMatchesProto. The matcher then short-
circuited to "no match" for every rule with a non-empty IPProto
restriction, including TCP-only grants compiled from `ip: ["tcp:80"]`.
The bug surfaced in the captured allpass-acls-and-grants-mixed
scenario: the grant `tag:client → webserver:80` was reachable in the
compiled filter but the empty-proto test could not see it.

Expand the empty Protocol to the default set at the call site so
ruleMatchesProto's intersection check sees the right requested
protocols. Drop the now-dead empty-requestedProtos branch from the
matcher. The last divergence drops out of knownPolicyTesterDivergences
as a result.

Updates #1803
2026-05-12 11:54:54 +01:00
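The expansion described in this commit can be sketched as follows. This is a minimal illustration, not headscale's code: `expandRequested` and `matchesProto` are hypothetical stand-ins for the call-site expansion and for `ruleMatchesProto`, and the rule that an unrestricted rule matches any protocol is an assumption.

```go
package main

import "fmt"

// IANA protocol numbers for the default set named in the commit message.
const (
	protoICMP   = 1
	protoTCP    = 6
	protoUDP    = 17
	protoICMPv6 = 58
)

// expandRequested is a hypothetical helper: an empty protocol list in a
// test entry is replaced by the default set {TCP, UDP, ICMP, ICMPv6}.
func expandRequested(requested []int) []int {
	if len(requested) == 0 {
		return []int{protoTCP, protoUDP, protoICMP, protoICMPv6}
	}
	return requested
}

// matchesProto stands in for ruleMatchesProto. Assumption: a rule with
// no protocol restriction matches everything; otherwise the rule matches
// when its protocols intersect the requested ones.
func matchesProto(ruleProtos, requested []int) bool {
	if len(ruleProtos) == 0 {
		return true
	}
	for _, r := range ruleProtos {
		for _, q := range requested {
			if r == q {
				return true
			}
		}
	}
	return false
}

func main() {
	tcpOnly := []int{protoTCP}
	// An empty-proto test now reaches a TCP-only rule.
	fmt.Println(matchesProto(tcpOnly, expandRequested(nil)))
	// An explicit UDP test still does not.
	fmt.Println(matchesProto(tcpOnly, expandRequested([]int{protoUDP})))
}
```

Because the expansion happens before the matcher runs, the matcher never sees an empty requested set and needs no special branch for it.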
Kristoffer Dalby
e4e209f919 policy/v2: canonicalize Protocol form during unmarshal
Tailscale accepts both named ("tcp") and numeric IANA ("6") protocol
forms wherever a Protocol value is allowed. Headscale stored whichever
form the user wrote, leaving downstream code with two equivalents to
handle separately. validateProtocolPortCompatibility only recognised
the named constants and rejected the numeric form, so a policy with
`proto: "6", dst: ["host:443"]` was rejected at parse time even though
SaaS accepts it.

Resolve the disagreement by normalising to the named form during
Protocol.UnmarshalJSON. Every downstream consumer now sees one form
regardless of what the user wrote, so layered guards like
`|| protocol == "6"` in the validator are unnecessary.

Updates #1803
2026-05-12 11:54:54 +01:00
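A minimal sketch of normalising during unmarshal, assuming a string-backed `Protocol` type. The lookup table is an illustrative subset and `parseProtocol` is a hypothetical helper; headscale's actual type and table may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Protocol is a hypothetical string-backed protocol value.
type Protocol string

// numericToNamed maps IANA numbers to named forms (illustrative subset).
var numericToNamed = map[string]Protocol{
	"1":   "icmp",
	"6":   "tcp",
	"17":  "udp",
	"132": "sctp",
}

// UnmarshalJSON stores the named form regardless of which form the user
// wrote, so downstream code only ever compares against one spelling.
func (p *Protocol) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return err
	}
	if named, ok := numericToNamed[s]; ok {
		*p = named
		return nil
	}
	*p = Protocol(s)
	return nil
}

// parseProtocol decodes a single JSON string into a Protocol.
func parseProtocol(raw string) (Protocol, error) {
	var p Protocol
	err := json.Unmarshal([]byte(raw), &p)
	return p, err
}

func main() {
	for _, raw := range []string{`"6"`, `"tcp"`, `"17"`} {
		p, err := parseProtocol(raw)
		fmt.Println(raw, "->", p, err)
	}
}
```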
Kristoffer Dalby
f172dba0e3 policy/v2: validate tests block at parse boundary
A `tests` entry describes one connection attempt to one specific
host on one specific port over a connection-oriented protocol, and
asserts whether it is allowed or denied. Five shape rules follow —
single-port dst, proto in {tcp, udp, sctp, ""}, no
autogroup:internet dst, no CIDR-typed dst (raw `/N` or hosts:-alias
to a multi-host prefix), at least one of accept/deny — and every
one was previously silently accepted by headscale even though
Tailscale SaaS rejects them as "test(s) failed".

Enforce them in one pass over `pol.Tests` from `Policy.validate()`,
reusing the existing parse-time multierr aggregation. The same
shapes remain valid inside ACL or Grant destinations where the rule
does not apply; the validator only walks the tests array.

The compat runner now treats parse-time errors equivalently to
SetPolicy errors so the captured Tailscale body still matches via
substring regardless of which step surfaces the rejection. Nine
divergences resolved by this validation pass drop out of
knownPolicyTesterDivergences.

Updates #1803
2026-05-12 11:54:54 +01:00
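The single-pass validation with aggregated errors might look roughly like this. It is a sketch under assumptions: `policyTest`, `validateDst` and `validateTests` are hypothetical names, destinations are treated as plain `host:port` strings, the hosts:-alias case is not covered, and `errors.Join` stands in for the multierr aggregation the commit mentions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// policyTest is a simplified, hypothetical stand-in for one tests entry.
type policyTest struct {
	Src    string
	Proto  string
	Accept []string
	Deny   []string
}

var allowedProtos = map[string]bool{"": true, "tcp": true, "udp": true, "sctp": true}

// validateDst checks one "host:port" destination against the shape rules.
func validateDst(dst string) error {
	i := strings.LastIndex(dst, ":")
	if i < 0 {
		return fmt.Errorf("dst %q: missing port", dst)
	}
	host, port := dst[:i], dst[i+1:]
	switch {
	case host == "autogroup:internet":
		return fmt.Errorf("dst %q: autogroup:internet is not allowed in tests", dst)
	case strings.Contains(host, "/"):
		return fmt.Errorf("dst %q: CIDR destinations are not allowed in tests", dst)
	case port == "" || port == "*" || strings.ContainsAny(port, ",-"):
		return fmt.Errorf("dst %q: exactly one port is required", dst)
	}
	return nil
}

// validateTests walks every entry once and returns all problems together.
func validateTests(tests []policyTest) error {
	var errs []error
	for i, t := range tests {
		if !allowedProtos[t.Proto] {
			errs = append(errs, fmt.Errorf("test %d: unsupported proto %q", i, t.Proto))
		}
		if len(t.Accept) == 0 && len(t.Deny) == 0 {
			errs = append(errs, fmt.Errorf("test %d: need at least one of accept/deny", i))
		}
		for _, dst := range append(append([]string{}, t.Accept...), t.Deny...) {
			if err := validateDst(dst); err != nil {
				errs = append(errs, fmt.Errorf("test %d: %w", i, err))
			}
		}
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	err := validateTests([]policyTest{
		{Src: "alice@", Accept: []string{"webserver:80"}},
		{Src: "bob@", Proto: "icmp", Deny: []string{"10.0.0.0/24:22", "db:1-100"}},
	})
	fmt.Println(err)
}
```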
Kristoffer Dalby
c0774a739b policy/v2: add policytester captures recorded from Tailscale SaaS
57 captures covering the alias × outcome matrix for the tests block,
recorded against a real Tailscale SaaS tailnet. Replayed by
TestPolicyTesterCompat.

Bump the check-added-large-files pre-commit threshold to 1024 KB —
captures include verbose per-node netmaps and one is 620 KB.

Updates #1803
2026-05-12 11:54:54 +01:00
Kristoffer Dalby
7bc701179b policy/v2: add policytester compat test runner
Pin headscale's accept/reject decision and error body against
Tailscale SaaS by replaying captures recorded from a real tailnet.
Mirrors the tailscale_grants_compat_test.go pattern: glob over
testdata/policytest_results/, one t.Run per file, parse-or-SetPolicy
error must contain the captured api_response_body.message.

errPolicyTestsFailed is "test(s) failed" — Tailscale's literal body —
so substring match works against captured response bodies. Per-test
detail (src, dst, expected vs got) is preserved below the prefix for
the CLI / config-reload paths that don't have an audit endpoint.

knownPolicyTesterDivergences gates the 12 mismatches the captures
will surface so the suite stays green; engine fixes in follow-up
commits drop the entries as each is resolved.

Updates #1803
2026-05-12 11:54:54 +01:00
Kristoffer Dalby
b29ae25356 policy/v2: evaluate the tests block on user-initiated writes
v2 silently dropped policy.tests, so a policy that contradicted its
own assertions still applied. Resolve src/dst via the existing Alias
machinery, walk the compiled global filter rules (acls and grants
both contribute), and run on every user-write boundary: SetPolicy,
the file watcher, and `headscale policy check`. A failing test
rejects the write before it mutates live state.

Boot-time reload skips evaluation; an already-stored policy that
references a deleted user shouldn't lock the server out.

`headscale policy check` is a thin frontend for the new CheckPolicy
gRPC method. The server-side handler builds a fresh PolicyManager
from the request bytes and the state's live users/nodes, runs
SetPolicy on the sandbox so the tests block executes, and returns
the result through gRPC status. No persistence, no policy_mode
coupling. --bypass-grpc-and-access-database-directly opens the DB
directly when the server is not running.

cmd/headscale/cli/root.go no longer special-cases `policy check` in
init() (the early return from PR #2580 broke --config registration
and viper priming for --bypass).

integration/cli_policy_test.go covers policy_mode={file,database} x
fixture={acl-only, acl+passing-tests, acl+failing-tests} x
bypass={false,true} = 12 rows.

Updates #1803

Co-authored-by: Janis Jansons <janhouse@gmail.com>
2026-05-12 11:54:54 +01:00
Kristoffer Dalby
56146de377 proto: add CheckPolicy RPC
CheckPolicy validates a candidate policy against a running server's
live users and nodes (running its tests block) without persisting
anything. Used by 'headscale policy check' to replace the in-process
validation path the CLI runs today, which would otherwise need its
own database connection.

Updates #1803
2026-05-12 11:54:54 +01:00
Kristoffer Dalby
c3df84e354 policy/matcher: include CapGrant.Dsts in match destinations
MatchFromFilterRule only read DstPorts[].IP into the destination
IPSet. Cap-grant-only filter rules (e.g. tailscale.com/cap/relay)
carry their destinations in CapGrant[].Dsts, so the derived matchers
had empty dest sets and BuildPeerMap / ReduceNodes never exposed the
cap target to its source nodes. Without a companion IP-level grant
the relay node stayed invisible, so clients never tried to use it
and connections sat on DERP.

Union CapGrant[].Dsts into the destination IPSet alongside DstPorts.
Restores peer-visibility for any cap-grant-only relationship; the
peer-relay flow is the most visible instance.

Fixes #3256
2026-05-11 14:55:06 +01:00
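The union described above can be illustrated with simplified local types. The structs below are hypothetical mirrors of the filter-rule shape the commit describes, not the real Tailscale types, and the sketch only parses CIDR-form `DstPorts` entries.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Simplified, hypothetical mirrors of the filter-rule fields involved.
type netPortRange struct{ IP string }
type capGrant struct{ Dsts []netip.Prefix }
type filterRule struct {
	DstPorts []netPortRange
	CapGrant []capGrant
}

// destPrefixes collects destinations from both DstPorts and CapGrant,
// so a cap-grant-only rule no longer yields an empty destination set.
func destPrefixes(rule filterRule) []netip.Prefix {
	var out []netip.Prefix
	for _, dp := range rule.DstPorts {
		// Assumption: only CIDR strings are handled in this sketch.
		if p, err := netip.ParsePrefix(dp.IP); err == nil {
			out = append(out, p)
		}
	}
	for _, cg := range rule.CapGrant {
		out = append(out, cg.Dsts...)
	}
	return out
}

// capOnlyRule builds a rule whose destinations live only in CapGrant.
func capOnlyRule(cidrs ...string) filterRule {
	var dsts []netip.Prefix
	for _, c := range cidrs {
		dsts = append(dsts, netip.MustParsePrefix(c))
	}
	return filterRule{CapGrant: []capGrant{{Dsts: dsts}}}
}

func main() {
	rule := capOnlyRule("100.64.0.7/32")
	fmt.Println(destPrefixes(rule))
}
```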
Kristoffer Dalby
795a1efe9b ci: fetch full history in golangci-lint job
revgrep needs pull_request.base.sha in the local clone to compute
the diff against new code. With fetch-depth: 2, only HEAD and one
parent are fetched, so a stale base SHA (when main moves between
PR syncs) is not reachable and revgrep falls through, surfacing
pre-existing issues outside the PR scope.
2026-05-11 10:34:58 +01:00
Kristoffer Dalby
dc733767c4 Dockerfile.tailscale-HEAD,Dockerfile.derper: bump golang to 1.26.3
tailscale upstream go.mod now requires 1.26.3.
2026-05-11 10:34:58 +01:00
Lealem Amedie
542091e82b Add unit test 2026-05-11 09:25:26 +01:00
Lealem Amedie
6cd919d411 mapper: include UserProfiles in policy-change MapResponses 2026-05-11 09:25:26 +01:00
Kristoffer Dalby
2f907edf87 hscontrol/types: regenerate types_clone.go for viewer bump
cmd/viewer in tailscale.com/cmd v1.97.0-pre emits new(*x) instead
of ptr.To(*x). No behaviour change.
2026-05-11 08:46:12 +01:00
Kristoffer Dalby
9621a97ebe ci, pre-commit: validate vendor hash via vendorhash check
Replace the grep/awk hash extraction in build.yml with a structured
vendorhash check step; the PR review comment now reads expected/
actual values directly from $GITHUB_OUTPUT instead of scraping Nix
stderr. Add a prek hook so divergence is caught locally before push.
2026-05-11 08:46:12 +01:00
Kristoffer Dalby
e470774f6a cmd/vendorhash: track vendor SRI in flakehashes.json
Move the headscale vendorHash out of flake.nix into a content-
addressed flakehashes.json maintained by a small Go tool. The
schema and goModFingerprint algorithm mirror upstream tailscale's
tool/updateflakes so a future shared library extraction is trivial.

vendorhash check verifies flakehashes.json against the current
go.mod/go.sum. Hot path is a sha256 over those two files, so
re-runs without input change are essentially free; only an actual
fingerprint drift triggers go mod vendor + nardump.SRI.

vendorhash update recomputes both fields and rewrites the JSON.
The nix-vendor-sri devShell shim now wraps it.
2026-05-11 08:46:12 +01:00
Kristoffer Dalby
980622e9a5 flake.nix, go.mod: bump tailscale.com to v1.97.0-pre
Pulls in the cmd/nardump library split (tailscale/tailscale#19551)
so flakehashes.json tooling can import nardump.SRI directly.

Side effects: Go directive bumps to 1.26.2 and the nixpkgs lock
advances to a revision shipping go 1.26.2.
2026-05-11 08:46:12 +01:00
Kristoffer Dalby
4e0c2b8556 cmd/headscale/cli: validate users in policy check
Add --bypass-grpc-and-access-database-directly to policy check so
the new ambiguous-user validator runs against the live user list.
Without the flag, policy check stays a syntax-only check and the
success message says so.

Updates #3160
2026-05-09 11:28:12 +01:00
Kristoffer Dalby
bc9fb6d403 hscontrol/policy/v2: reject ambiguous user references at load time
When a user@ token resolved to more than one DB row, ACL and SSH
rules referencing it were silently dropped at compile time, leaving
clients with SSHPolicy={rules: null} and no signal to the admin.

Validate every Username reference in groups, tagOwners,
autoApprovers, ACLs and SSH rules at NewPolicyManager and SetPolicy
and return ErrMultipleUsersFound. Missing-user tokens stay tolerant
per #2863.

Updates #3160
2026-05-09 11:28:12 +01:00
Möhsün Babayev
585d0c01bc docs(config): fix typo in config-example.yaml
Fixes a typo in the description of `metrics_listen_addr` property.
2026-05-09 05:14:08 +02:00
Möhsün Babayev
01eb5402f9 docs(setup): fix typo in requirements.md
Fix the typo in spelling of "Let's Encrypt".
2026-05-09 05:14:08 +02:00
MunMunMiao
e597f4c8a0 Add Headscale UI to web UI documentation 2026-05-09 05:02:44 +02:00
SAY-5
01e548e030 state: avoid nil deref in registration handlers when old user is missing
Mirror the guard from HandleNodeFromPreAuthKey in HandleNodeFromAuthPath.
Both functions log the old user's name in the "different user" branch
when an existing NodeStore entry under the same machine key belongs to
another user. UserView.Name dereferences the backing User pointer
unconditionally, so when the cached node was loaded with a non-nil
UserID but a nil User (Preload join missed the row, or upstream code
left the snapshot in that shape), the log call panics with a nil-pointer
dereference at hscontrol/types/types_view.go:97.

The panic is caught by the http2 server's runHandler for the noise
control plane, so the process keeps running but every retry produces a
new panic — production has observed bursts of ~1.9k panics per hour
during a tailscaled reconnect loop. The gRPC/OIDC entry has no equivalent
recover and would surface the panic to the caller.

Guard both call sites with oldUser.Valid() and fall back to an empty
old-user name when the pointer is nil. The "Creating new node for
different user" log line still includes the existing node ID, hostname,
machine key, and new user, so operator visibility is preserved.

Add reproduction tests for both handlers seeding the orphan shape
directly into NodeStore via PutNodeInStoreForTest.

Co-Authored-By: Kristoffer Dalby <kristoffer@dalby.cc>
2026-05-06 07:23:02 +01:00
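The guard can be sketched with a simplified view type. `User`, `UserView` and `oldUserName` below are hypothetical reductions for illustration; the real view type is generated and has more methods.

```go
package main

import "fmt"

// User and UserView are simplified stand-ins for the real types.
type User struct{ Name string }

// UserView wraps a possibly nil *User, like a generated read-only view.
type UserView struct{ u *User }

// Valid reports whether the view has a backing User.
func (v UserView) Valid() bool { return v.u != nil }

// Name dereferences unconditionally and panics on an invalid view.
func (v UserView) Name() string { return v.u.Name }

// oldUserName is a hypothetical helper showing the guard: fall back to
// an empty name instead of dereferencing a nil pointer while logging.
func oldUserName(v UserView) string {
	if !v.Valid() {
		return ""
	}
	return v.Name()
}

func main() {
	fmt.Printf("%q\n", oldUserName(UserView{}))
	fmt.Printf("%q\n", oldUserName(UserView{u: &User{Name: "alice"}}))
}
```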
Kristoffer Dalby
9482cdf590 testdata: drop unused uppercase SSH-*.hujson fixtures
The 39 SSH-*.hujson files in hscontrol/policy/v2/testdata/ssh_results/
were legacy hand-written "expected SSH rules" snippets superseded by
the lowercase tscap captures (ssh-*.hujson). The active loader in
TestSSHDataCompat globs ssh-*.hujson; filepath.Glob is case-sensitive
on Linux so the uppercase set was loaded by no test.

The duplication caused permanent dirty git state on case-insensitive
filesystems (APFS, NTFS) where only one of SSH-A1.hujson and
ssh-a1.hujson can physically exist in the working tree.

Add an assertion to TestSSHDataCompat that the loader picks up every
*.hujson under ssh_results/ so future fixture migrations cannot leave
stranded files behind.

Fixes #3240
2026-05-05 11:59:01 +01:00
primewildy
3d0f597b23 oidc: handle groups claim as string or array (FlexibleStringSlice)
Some OIDC providers (notably JumpCloud) return the `groups` claim as
a plain string when the user belongs to a single group, rather than
a single-element array:

  Single group:    {"groups": "MyGroup"}
  Multiple groups: {"groups": ["Group1", "Group2"]}

This causes `json.Unmarshal` to fail with:

  cannot unmarshal string into Go struct field OIDCClaims.groups of type []string

This is the same class of issue as juanfont#2293 (FlexibleBoolean for
email_verified). The fix follows the same pattern: introduce a
FlexibleStringSlice type with a custom UnmarshalJSON that accepts
both a string and a []string, and use it for the Groups field in
both OIDCClaims and OIDCUserInfo.
2026-05-04 15:26:53 +02:00
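A minimal sketch of such a type, assuming the behaviour described in the message. The `claims` struct and `parseGroups` helper are hypothetical and exist only to exercise the type; the project's implementation may differ in detail.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FlexibleStringSlice accepts either a JSON string or a JSON array of
// strings and always yields a slice.
type FlexibleStringSlice []string

func (f *FlexibleStringSlice) UnmarshalJSON(b []byte) error {
	// Try the array form first.
	var many []string
	if err := json.Unmarshal(b, &many); err == nil {
		*f = many
		return nil
	}
	// Fall back to a single string and wrap it.
	var one string
	if err := json.Unmarshal(b, &one); err != nil {
		return fmt.Errorf("groups is neither a string nor a []string: %w", err)
	}
	*f = FlexibleStringSlice{one}
	return nil
}

// claims is a hypothetical, reduced claims struct.
type claims struct {
	Groups FlexibleStringSlice `json:"groups"`
}

// parseGroups decodes a claims document and returns its groups.
func parseGroups(raw string) ([]string, error) {
	var c claims
	if err := json.Unmarshal([]byte(raw), &c); err != nil {
		return nil, err
	}
	return c.Groups, nil
}

func main() {
	fmt.Println(parseGroups(`{"groups": "MyGroup"}`))
	fmt.Println(parseGroups(`{"groups": ["Group1", "Group2"]}`))
}
```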
Kristoffer Dalby
76ee29352b servertest: cover via-grant exit-node visibility end-to-end
TestGrantViaExitNodeInternetVisibility boots a server, applies a
policy that scopes autogroup:internet to a tag, registers a tagged
exit advertiser and a regular client, and asserts the client's netmap
surfaces the exit node with 0.0.0.0/0 and ::/0 in AllowedIPs — the
substrate the Tailscale client reads to populate
`tailscale exit-node list`.

TestGrantViaExitNodeNoFilterRules retains its assertion (literal /0
absent from the exit node's PacketFilter, matching SaaS PacketFilter
encoding); only its docstring is updated to reflect that the exit
node now does receive a TheInternet-shaped rule, just not the
literal /0 form.

Updates #3233
2026-04-30 19:22:45 +01:00
Kristoffer Dalby
2b7f15abaa policy/v2: surface autogroup:internet via grants on exit nodes
A grant of the form `{src: alice, dst: autogroup:internet, via:
tag:exit1}` was loading without error but stripping every exit node
from alice's view: `tailscale exit-node list` returned "no exit nodes
found".

Two sites skipped autogroup:internet at the compile / steering layer:
compileViaForNode's *AutoGroup arm produced no FilterRule for the
via-tagged exit node, and ViaRoutesForPeer's *AutoGroup arm produced
no Include/Exclude. With pm.needsPerNodeFilter true, the exit node's
matchers were empty, BuildPeerMap could not link source to exit, and
RoutesForPeer's ReduceRoutes stripped 0.0.0.0/0 and ::/0 from
AllowedIPs.

The skip belongs at the wire-format layer (ReduceFilterRules), not at
the compile layer that also feeds internal matchers. Lift
autogroup:internet handling into both *AutoGroup arms with the same
shape used for *Prefix destinations: emit a TheInternet rule on
via-tagged exit advertisers; surface peer.ExitRoutes() in Include
when the peer carries the via tag, Exclude otherwise.
ReduceFilterRules continues to keep the rule on exit-route
advertisers' wire output and strip it elsewhere, preserving SaaS
PacketFilter encoding.

Also drop compileViaForNode's early len(SubnetRoutes)==0 return:
SubnetRoutes excludes exit routes, so the early return pre-empted the
autogroup:internet branch on nodes that only advertise exit routes.

Existing tests pinning the buggy behaviour (TestViaRoutesForPeer
subtests, TestCompileViaGrant case) flipped to the new contract.

Fixes #3233
2026-04-30 19:22:45 +01:00
Kristoffer Dalby
ecaf56e0a0 integration: drop Force flag on docker network disconnect
Force-disconnect leaves stale routes in the container's network
namespace: libnetwork removes the host-side veth but the
namespace-internal route survives. The next ConnectNetwork on the
same network then fails with "cannot program address X/16 in sandbox
interface because it conflicts with existing route", and the route
never resolves on its own. Bounded retry around ConnectNetwork
exhausts MaxElapsedTime instead of recovering.

Without Force, libnetwork drains the namespace routes synchronously
during disconnect and ConnectNetwork sees a clean slate. Cable-pull
semantics are preserved: docker still tears down the endpoint at the
namespace level, leaving in-flight TCP half-open inside the
container's view. This was verified via paired probe timeouts in HA
prober logs while both routers are physically disconnected.

Fixes #3234
2026-04-30 12:52:05 +01:00
Kristoffer Dalby
94ec607bca state: per-goroutine deadline in HA probe cycle
`time.After(ProbeTimeout)` returned a single channel shared by every
probe goroutine in the cycle. Only the first goroutine to receive the
deadline tick drains the channel; any other goroutine still waiting on
its `responseCh` is then stuck forever, `wg.Wait()` never returns, and
the scheduler loop in `app.go` stalls on the next tick. The condition
fires whenever two or more nodes time out in the same cycle — common
under cable-pull where IsOnline lags reality and both routers stay in
the candidate set as half-open TCP.

Move the timer inside each goroutine so every probe has its own
deadline.

Updates #3234
2026-04-30 12:52:05 +01:00
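The fix can be sketched as below. A channel from a single `time.After` call delivers one value, so only one waiting goroutine is ever released; giving each goroutine its own timer avoids the hang. `probeAll` and `demoTimeout` are hypothetical names, not the project's.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demoTimeout is an illustrative probe timeout.
const demoTimeout = 20 * time.Millisecond

// probeAll waits for each response channel with a per-goroutine timer.
// results[i] is true when responses[i] delivered before the deadline.
func probeAll(responses []chan struct{}, timeout time.Duration) []bool {
	results := make([]bool, len(responses))
	var wg sync.WaitGroup
	for i, ch := range responses {
		wg.Add(1)
		go func(i int, ch <-chan struct{}) {
			defer wg.Done()
			// Each goroutine owns its timer; no deadline tick is shared.
			timer := time.NewTimer(timeout)
			defer timer.Stop()
			select {
			case <-ch:
				results[i] = true
			case <-timer.C:
				results[i] = false
			}
		}(i, ch)
	}
	wg.Wait() // returns even when several probes time out together
	return results
}

func main() {
	silentA := make(chan struct{})
	silentB := make(chan struct{})
	ready := make(chan struct{}, 1)
	ready <- struct{}{}
	fmt.Println(probeAll([]chan struct{}{silentA, silentB, ready}, demoTimeout))
}
```

Each goroutine writes only its own slice index, so the results need no extra locking.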
Kristoffer Dalby
d1443a431c integration: skip subpackage tests in workflow generator
The generator scans `integration/` recursively for `Test*` functions
and emits one CI job per match. Helper subpackages like
`dockertestutil` and `tsic` host plain unit tests that should run
under `go test`, not as Docker-based integration matrix entries.
Limit the scan to depth 1 so only top-level `integration/*_test.go`
files contribute job names.
2026-04-30 12:52:05 +01:00
Kristoffer Dalby
155e42f892 integration: retry transient docker network ops
Libnetwork endpoint cleanup is eventually consistent. A back-to-back
disconnect+connect on the same network can race teardown and return a
transient error. Wrap the daemon calls in bounded exponential backoff
so TestHASubnetRouterFailoverDockerDisconnect no longer flakes on
phase 4c reconnect.

Fixes #3234
2026-04-30 12:52:05 +01:00
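Bounded exponential backoff around a flaky call can be sketched with the standard library alone. This is an illustration of the technique, not the helper the integration tests use; `retryWithBackoff`, `flakyOp` and the constants are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative bounds for the demo.
const (
	demoInitial    = 5 * time.Millisecond
	demoMaxElapsed = 200 * time.Millisecond
)

var errTransient = errors.New("transient network error")

// retryWithBackoff calls op until it succeeds, doubling the delay after
// each failure, and gives up once the next wait would exceed maxElapsed.
func retryWithBackoff(op func() error, initial, maxElapsed time.Duration) error {
	start := time.Now()
	delay := initial
	for {
		err := op()
		if err == nil {
			return nil
		}
		if time.Since(start)+delay > maxElapsed {
			return fmt.Errorf("giving up: %w", err)
		}
		time.Sleep(delay)
		delay *= 2
	}
}

// flakyOp returns an operation that fails the given number of times
// before succeeding, plus a pointer to its attempt counter.
func flakyOp(failures int) (func() error, *int) {
	attempts := 0
	return func() error {
		attempts++
		if attempts <= failures {
			return errTransient
		}
		return nil
	}, &attempts
}

func main() {
	op, attempts := flakyOp(2)
	err := retryWithBackoff(op, demoInitial, demoMaxElapsed)
	fmt.Println(err, *attempts)
}
```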
Kristoffer Dalby
3d5c0af4e7 state: preserve previous primary when all HA advertisers unhealthy
electPrimaryRoutes' all-unhealthy fallback picked candidates[0]
(lowest NodeID) regardless of who was prev. Under cable-pull
semantics IsOnline lags reality (long-poll TCP half-open), so
both routers stay in candidates and both go Unhealthy via the
prober — the fallback then churned primary to a node that was
itself unreachable.

Prefer prev when still in candidates; fall through to
candidates[0] only when prev is gone. Anti-blackhole holds.

Update the property test reference model and split the unit
test into existence (KeepsAPrimary) and identity
(PreservesPrevious) cases.

Fixes #3203
2026-04-29 18:08:39 +01:00
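The all-unhealthy fallback can be sketched as a small function. `fallbackPrimary` is a hypothetical reduction of that one branch, assuming `candidates` is sorted by ascending NodeID; it is not the full election algorithm.

```go
package main

import "fmt"

// NodeID is a simplified stand-in for the node identifier type.
type NodeID uint64

// fallbackPrimary picks a primary when every candidate is unhealthy.
// It keeps the previous primary if that node is still a candidate and
// only falls through to candidates[0] when the previous one is gone.
func fallbackPrimary(candidates []NodeID, prev NodeID, hasPrev bool) (NodeID, bool) {
	if len(candidates) == 0 {
		return 0, false
	}
	if hasPrev {
		for _, c := range candidates {
			if c == prev {
				return prev, true
			}
		}
	}
	return candidates[0], true
}

func main() {
	// Previous primary 2 is still a candidate: no churn to node 1.
	fmt.Println(fallbackPrimary([]NodeID{1, 2}, 2, true))
	// Previous primary 3 has left: fall back to the lowest NodeID.
	fmt.Println(fallbackPrimary([]NodeID{1, 2}, 3, true))
}
```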
Kristoffer Dalby
27c9113af8 integration: regenerate workflow for HA docker disconnect test
Updates #3203
2026-04-29 18:08:39 +01:00
Kristoffer Dalby
7bb86f2c16 integration: HA cable-pull lifecycle test
Add DisconnectFromNetwork/ReconnectToNetwork on TailscaleClient
backed by pool.Client.DisconnectNetwork.

Exercise single-router fail+recover either side, sequential dual
failure, and simultaneous dual failure. The dual-failure legs
assert no flap to a known-bad primary; the single-router-return
legs check traffic only because docker network disconnect
transiently fails probes on sibling routers.

Fails on parent; passes after the fix.

Updates #3203
2026-04-29 18:08:39 +01:00
Kristoffer Dalby
863fa2f815 servertest, integration: cover HA both-offline recovery
Three regression tests for the user scenario: an in-process
Disconnect/Reconnect, a tailscale-down/up integration test, and
an iptables -j DROP cable-pull integration test.

Updates #3203
2026-04-29 18:08:39 +01:00
Kristoffer Dalby
9f7c8e9a07 state: clear Unhealthy when node leaves HA candidate set
Restore the legacy auto-clear at write boundaries that drop HA
candidacy: Disconnect, SetApprovedRoutes(empty), and
UpdateNodeFromMapRequest shrinking advertised routes to empty.
Plus a defensive guard in SetNodeUnhealthy.

Updates #3203
2026-04-29 18:08:39 +01:00
Kristoffer Dalby
66ac785c22 state: delete routes package, port primary route tests
Remove hscontrol/routes/. Port the named scenarios and the rapid
property test to hscontrol/state/.

Updates #3203
2026-04-29 18:08:39 +01:00
Kristoffer Dalby
437754aeea state: switch consumers to NodeStore primary routes
Replace routes.PrimaryRoutes reads with NodeStore. Connect bumps
SessionEpoch; Disconnect re-checks it inside UpdateNode so the
check and mutation are atomic against a concurrent Connect on
the same node.

The connect_race regression test is carried in its final
SessionEpoch form.

Updates #3203
2026-04-29 18:08:39 +01:00
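The epoch check can be illustrated with a mutex-guarded store. This is a sketch under assumptions: the real NodeStore is not necessarily mutex-based, and `store`, `node` and the method signatures here are hypothetical. The point shown is that the comparison and the mutation happen inside the same update.

```go
package main

import (
	"fmt"
	"sync"
)

// node holds the two runtime fields relevant to the race.
type node struct {
	Online       bool
	SessionEpoch uint64
}

// store is a hypothetical stand-in for NodeStore.
type store struct {
	mu    sync.Mutex
	nodes map[int]*node
}

func newStore(ids ...int) *store {
	s := &store{nodes: map[int]*node{}}
	for _, id := range ids {
		s.nodes[id] = &node{}
	}
	return s
}

// UpdateNode runs fn on the node while holding the lock.
func (s *store) UpdateNode(id int, fn func(*node)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if n, ok := s.nodes[id]; ok {
		fn(n)
	}
}

// Connect bumps the epoch and returns the value for this session.
func (s *store) Connect(id int) uint64 {
	var epoch uint64
	s.UpdateNode(id, func(n *node) {
		n.SessionEpoch++
		n.Online = true
		epoch = n.SessionEpoch
	})
	return epoch
}

// Disconnect only applies if no newer session has connected since.
func (s *store) Disconnect(id int, epoch uint64) bool {
	applied := false
	s.UpdateNode(id, func(n *node) {
		if n.SessionEpoch != epoch {
			return // a newer Connect won the race; leave the node online
		}
		n.Online = false
		applied = true
	})
	return applied
}

// online reports the node's current state.
func (s *store) online(id int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.nodes[id].Online
}

func main() {
	s := newStore(1)
	old := s.Connect(1)
	cur := s.Connect(1) // reconnect before the old session's Disconnect
	fmt.Println(s.Disconnect(1, old), s.online(1))
	fmt.Println(s.Disconnect(1, cur), s.online(1))
}
```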
Kristoffer Dalby
da927eb018 state: compute primary routes inside NodeStore snapshot
Add primaries and isPrimary maps to Snapshot plus an election
algorithm. No callers yet.

Updates #3203
2026-04-29 18:08:39 +01:00
Kristoffer Dalby
942313a10a types: move DebugRoutes from routes to types
Unblocks deletion of the routes package.

Updates #3203
2026-04-29 18:08:39 +01:00
Kristoffer Dalby
1fe682b141 types: add Unhealthy and SessionEpoch fields to Node
Runtime-only (gorm:"-") fields read by the HA primary route refactor.

Updates #3203
2026-04-29 18:08:39 +01:00
Kristoffer Dalby
010a5564c5 all: rephrase prose to fit codebase voice
Reword comments, one doc paragraph, and one test failure message
so the prose reads naturally. No behaviour change.
2026-04-29 16:22:19 +01:00