# mcts-odin
Generic, optimized Monte Carlo Tree Search for Odin
AlphaZero-style PUCT with Dirichlet root noise and First-Play Urgency. Optional fast rollouts, leaf-parallel batched playouts with virtual loss, and OS-thread parallelism. Games plug in via a small Game vtable — the core knows nothing about Go, chess, or any specific game.
Ships with eleven demo games (tic-tac-toe, Connect Four, Reversi, Hex, Breakthrough, Gomoku, Dots and Boxes, Amazons, Quoridor, Nine Men's Morris, Go) and a 137-test suite that runs leak-free under Odin's memory tracker.
## Quick start
```odin
package main

import "mcts"
import ttt "games/tictactoe"

main :: proc() {
	g := ttt.game()          // mcts.Game vtable
	state := ttt.new_state() // tree takes ownership
	cfg := mcts.default_config()

	tree: mcts.Tree
	mcts.init(&tree, &g, state, cfg, seed = 42)
	defer mcts.destroy(&tree)

	mcts.run_simulations(&tree, 1000, my_evaluator, &g)
	action := mcts.select_action(&tree, temperature = 0.0)
}
```
The evaluator is your value/policy function. See `tictactoe_selfplay` for a complete runnable example and `nn_evaluator_skeleton` for the policy/value plumbing pattern a real NN-backed evaluator needs.
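Until a real network is wired up, a uniform-policy evaluator is enough to exercise the search. A minimal sketch, assuming a callback that fills a prior slice and returns a value for the side to move (the signature and names here are illustrative, not the library's actual evaluator type):

```odin
package main

// Hypothetical evaluator sketch: spread prior mass uniformly over the
// legal actions and return a neutral value estimate, so PUCT behaves
// like plain UCT driven purely by search statistics. The parameter
// order and types are assumptions; check mcts's evaluator type.
uniform_evaluator :: proc(state: rawptr, actions: []i32, priors: []f32) -> (value: f32) {
	p := 1.0 / f32(len(actions)) // assumes at least one legal action
	for _, i in actions {
		priors[i] = p
	}
	return 0 // no value network: neutral estimate
}
```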
## Throughput
9×9 Go · 1600 sims/move × 32 moves · uniform-policy evaluator · single-thread · `-o:speed -no-bounds-check`
| Implementation | sims/s | relative |
|---|---|---|
| mcts-odin v0.6.0 | ~108,000 | — |
| autogodin (C++) | 8,470 | 12.8× slower |
| autogodin (Odin) | 2,859 | 37.8× slower |
Not strictly comparable — mcts-odin's bench runs the evaluator inline; autogodin's marshals through a Python callback. The cumulative gap reflects the do/undo lift, packed slot storage, SoA hot fields, linear-space priors, per-Tree scratch arena, subtree reuse, branchless argmax, FPU (broader/shallower tree), and a clone-free is_legal_flat that probes positional superko via incremental Zobrist. Full details →
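The superko probe mentioned above rests on standard incremental Zobrist hashing: XOR-ing a per-(cell, color) key in or out updates the position hash in O(1) per stone instead of rehashing the board. A generic sketch of the idea, not mcts-odin's actual code (all names here are illustrative):

```odin
package main

BOARD_CELLS :: 81 // 9×9
COLORS      :: 2

zobrist_keys: [BOARD_CELLS][COLORS]u64

// splitmix64 keeps the sketch self-contained; any good 64-bit PRNG works.
splitmix64 :: proc(s: ^u64) -> u64 {
	s^ += 0x9E3779B97F4A7C15
	z := s^
	z = (z ~ (z >> 30)) * 0xBF58476D1CE4E5B9
	z = (z ~ (z >> 27)) * 0x94D049BB133111EB
	return z ~ (z >> 31)
}

init_zobrist :: proc(seed: u64) {
	s := seed
	for cell in 0 ..< BOARD_CELLS {
		for color in 0 ..< COLORS {
			zobrist_keys[cell][color] = splitmix64(&s)
		}
	}
}

// XOR is its own inverse: call once to place a stone, again to remove it,
// so a do/undo pair restores the hash exactly.
toggle_stone :: proc(hash: u64, cell, color: int) -> u64 {
	return hash ~ zobrist_keys[cell][color]
}
```

Because the hash is maintained move by move, a legality check can compare the candidate hash against the history set without cloning the board.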
## Eleven demo games
- Tic-tac-toe
- Connect Four
- Reversi
- Hex
- Breakthrough
- Gomoku
- Dots and Boxes
- Amazons
- Quoridor
- Nine Men's Morris
- Go
## Architecture
```
┌──────────────────────── Tree ───────────────────────┐
│                                                     │
│  working_state ───┐                                 │
│  (owned by tree)  │  single state mutated in        │
│                   │  place via do_move / undo_move  │
│                   ▼                                 │
│  ┌─── nodes[] ────┐                                 │
│  │ Node 0 (root)  │   Hot fields in                 │
│  │ ├─ actions[]   │   parallel SoA on               │
│  │ ├─ priors[]    │   the Tree:                     │
│  │ └─ child[]     │     node_N[]                    │
│  ├────────────────┤     node_N_virt[]               │
│  │ Node 1, 2, …   │     node_Q[]                    │
│  │ (packed slots) │                                 │
│  └────────────────┘                                 │
│                                                     │
│  arena          permanent: nodes, slot arrays       │
│  scratch_arena  per-run:   descent paths, deltas    │
│                                                     │
└─────────────────────────────────────────────────────┘
```
Three deliberate choices drive the throughput:
- No per-node state copies. Nodes are pure tree bookkeeping; the tree threads `working_state` through `do_move` on the way down and `undo_move` on the way up. A Go-board clone is several times costlier than a do/undo pair, and a deep tree creates thousands of nodes.
- Packed slot storage. Per-node `actions[k]` / `priors[k]` / `child[k]` are tightly packed slices, sized at first expansion. Hot fields (`N`, `N_virt`, `Q`) live in parallel SoA arrays on the Tree; the PUCT inner loop reads ~12 bytes per child rather than chasing a full Node struct.
- Two arenas per tree. A growing arena owns nodes and slot arrays for the lifetime of the tree; a separate scratch arena is `free_all`-reset at the top of every `run_simulations` call. The caller's `context.temp_allocator` is never touched.
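The PUCT inner loop over those packed slices and SoA arrays can be sketched as follows. This is a simplified illustration under assumed names and signatures (including the fixed FPU constant), not the library's actual selection code:

```odin
package main

import "core:math"

// Argmax of Q + U over one node's children, touching only the packed
// prior slice and the SoA visit/value arrays. `child[i] < 0` marks an
// unexpanded child slot.
select_child :: proc(priors: []f32, child: []i32, node_N: []i32, node_Q: []f32,
                     parent_N: i32, c_puct: f32, fpu_value: f32) -> int {
	best_i := 0
	best_score := f32(-1e30)
	sqrt_parent := math.sqrt(f32(parent_N))
	for p, i in priors {
		c := child[i]
		n: i32 = 0
		if c >= 0 { n = node_N[c] }
		// First-Play Urgency: an unvisited child gets a fixed optimistic
		// value instead of 0, so fresh moves are tried earlier
		// (a broader, shallower tree, as the intro notes).
		q := fpu_value
		if n > 0 { q = node_Q[c] }
		u := c_puct * p * sqrt_parent / f32(1 + n)
		if q + u > best_score {
			best_score = q + u
			best_i = i
		}
	}
	return best_i
}
```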
## Three drivers, one tree
```odin
// Sequential — one evaluator call per leaf. Fine for fast in-process
// value functions and uniform priors.
mcts.run_simulations(&tree, 1600, my_evaluator, &g)

// Batched — leaf-parallel with virtual loss; the evaluator gets a slice
// of cloned leaf states per call. For GPU NN forward passes.
mcts.run_simulations_batched(&tree, 1600, batch_size = 16,
	my_batched_evaluator, &g)

// OS-thread parallel — N workers run descent / eval / backup concurrently.
// Atomics on N / N_virt / Q + a coarse expand mutex; virtual loss decouples
// the descents.
mcts.run_simulations_threaded(&tree, 1600, n_threads = 8,
	my_evaluator, &g)
```
Near-linear scaling on slow evaluators (50 µs/call benchmark, 9×9 Go): n=2: 1.93×, n=4: 3.81×, n=8: 7.15×. For cheap evaluators (microseconds per call) the mutex and CAS-loop contention can erase the speedup — use the sequential or batched paths there.
## Latest release
v0.6.0 — Nine Men's Morris demo game. First demo with explicit game phases (placement → sliding → flying) that reshape the action space mid-game, and the first where one MCTS action atomically contains a follow-up sub-action by the same player (mill formation → opponent-piece removal). Eleven demo games / 137 tests.