# mcts-odin
Generic, optimized Monte Carlo Tree Search for Odin
AlphaZero-style PUCT with Dirichlet root noise and First-Play Urgency. Optional fast rollouts, leaf-parallel batched playouts with virtual loss, and OS-thread parallelism. Games plug in via a small Game vtable — the core knows nothing about Go, chess, or any specific game.
Ships with eleven demo games (tic-tac-toe, Connect Four, Reversi, Hex, Breakthrough, Gomoku, Dots and Boxes, Amazons, Quoridor, Nine Men's Morris, Go) and a 137-test suite that runs leak-free under Odin's memory tracker.
## Quick start
```odin
package main

import "mcts"
import ttt "games/tictactoe"

main :: proc() {
	g := ttt.game()          // mcts.Game vtable
	state := ttt.new_state() // tree takes ownership
	cfg := mcts.default_config()

	tree: mcts.Tree
	mcts.init(&tree, &g, state, cfg, seed = 42)
	defer mcts.destroy(&tree)

	mcts.run_simulations(&tree, 1000, my_evaluator, &g)
	action := mcts.select_action(&tree, temperature = 0.0)
}
```
The evaluator is your value/policy function. See `tictactoe_selfplay` for a complete runnable example and `nn_evaluator_skeleton` for the policy/value plumbing pattern a real NN-backed evaluator needs.
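Until a real network is wired up, a uniform-policy evaluator is enough to exercise the search. A minimal sketch, assuming a callback that fills a prior slice and returns a value for the side to move (the signature and names here are illustrative, not the library's actual evaluator type):

```odin
package main

// Hypothetical evaluator sketch: spread prior mass uniformly over the
// legal actions and return a neutral value estimate, so PUCT behaves
// like plain UCT driven purely by search statistics. The parameter
// order and types are assumptions; check mcts's evaluator type.
uniform_evaluator :: proc(state: rawptr, actions: []i32, priors: []f32) -> (value: f32) {
	p := 1.0 / f32(len(actions)) // assumes at least one legal action
	for _, i in actions {
		priors[i] = p
	}
	return 0 // no value network: neutral estimate
}
```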
## Throughput
9×9 Go · 1600 sims/move × 32 moves · uniform-policy evaluator · single-thread · `-o:speed -no-bounds-check`
| Implementation | sims/s | relative |
|---|---|---|
| mcts-odin v0.6.0 | ~108,000 | — |
| autogodin (C++) | 8,470 | 12.8× slower |
| autogodin (Odin) | 2,859 | 37.8× slower |
Not strictly comparable — mcts-odin's bench runs the evaluator inline; autogodin's marshals through a Python callback. The cumulative gap reflects the do/undo lift, packed slot storage, SoA hot fields, linear-space priors, per-Tree scratch arena, subtree reuse, branchless argmax, FPU (broader/shallower tree), and a clone-free is_legal_flat that probes positional superko via incremental Zobrist. Full details →
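The superko probe mentioned above rests on standard incremental Zobrist hashing: XOR-ing a per-(cell, color) key in or out updates the position hash in O(1) per stone instead of rehashing the board. A generic sketch of the idea, not mcts-odin's actual code (all names here are illustrative):

```odin
package main

BOARD_CELLS :: 81 // 9×9
COLORS      :: 2

zobrist_keys: [BOARD_CELLS][COLORS]u64

// splitmix64 keeps the sketch self-contained; any good 64-bit PRNG works.
splitmix64 :: proc(s: ^u64) -> u64 {
	s^ += 0x9E3779B97F4A7C15
	z := s^
	z = (z ~ (z >> 30)) * 0xBF58476D1CE4E5B9
	z = (z ~ (z >> 27)) * 0x94D049BB133111EB
	return z ~ (z >> 31)
}

init_zobrist :: proc(seed: u64) {
	s := seed
	for cell in 0 ..< BOARD_CELLS {
		for color in 0 ..< COLORS {
			zobrist_keys[cell][color] = splitmix64(&s)
		}
	}
}

// XOR is its own inverse: call once to place a stone, again to remove it,
// so a do/undo pair restores the hash exactly.
toggle_stone :: proc(hash: u64, cell, color: int) -> u64 {
	return hash ~ zobrist_keys[cell][color]
}
```

Because the hash is maintained move by move, a legality check can compare the candidate hash against the history set without cloning the board.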
## Eleven demo games
- Tic-tac-toe
- Connect Four
- Reversi
- Hex
- Breakthrough
- Gomoku
- Dots and Boxes
- Amazons
- Quoridor
- Nine Men's Morris
- Go
## Architecture
```
┌──────────────────────── Tree ───────────────────────┐
│                                                     │
│  working_state ───┐                                 │
│  (owned by tree)  │  single state mutated in        │
│                   │  place via do_move / undo_move  │
│                   ▼                                 │
│  ┌─── nodes[] ────┐                                 │
│  │ Node 0 (root)  │   Hot fields in                 │
│  │ ├─ actions[]   │   parallel SoA on               │
│  │ ├─ priors[]    │   the Tree:                     │
│  │ └─ child[]     │     node_N[]                    │
│  ├────────────────┤     node_N_virt[]               │
│  │ Node 1, 2, …   │     node_Q[]                    │
│  │ (packed slots) │                                 │
│  └────────────────┘                                 │
│                                                     │
│  arena          permanent: nodes, slot arrays       │
│  scratch_arena  per-run:   descent paths, deltas    │
│                                                     │
└─────────────────────────────────────────────────────┘
```
Three deliberate choices drive the throughput:
- No per-node state copies. Nodes are pure tree bookkeeping; the tree threads `working_state` through `do_move` on the way down and `undo_move` on the way up. A Go-board clone is several times costlier than a do/undo pair, and a deep tree creates thousands of nodes.
- Packed slot storage. Per-node `actions[k]` / `priors[k]` / `child[k]` are tightly packed slices, sized at first expansion. Hot fields (`N`, `N_virt`, `Q`) live in parallel SoA arrays on the Tree; the PUCT inner loop reads ~12 bytes per child rather than chasing a full Node struct.
- Two arenas per tree. A growing arena owns nodes and slot arrays for the lifetime of the tree; a separate scratch arena is `free_all`-reset at the top of every `run_simulations` call. The caller's `context.temp_allocator` is never touched.
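The PUCT inner loop over those packed slices and SoA arrays can be sketched as follows. This is a simplified illustration under assumed names and signatures (including the fixed FPU constant), not the library's actual selection code:

```odin
package main

import "core:math"

// Argmax of Q + U over one node's children, touching only the packed
// prior slice and the SoA visit/value arrays. `child[i] < 0` marks an
// unexpanded child slot.
select_child :: proc(priors: []f32, child: []i32, node_N: []i32, node_Q: []f32,
                     parent_N: i32, c_puct: f32, fpu_value: f32) -> int {
	best_i := 0
	best_score := f32(-1e30)
	sqrt_parent := math.sqrt(f32(parent_N))
	for p, i in priors {
		c := child[i]
		n: i32 = 0
		if c >= 0 { n = node_N[c] }
		// First-Play Urgency: an unvisited child gets a fixed optimistic
		// value instead of 0, so fresh moves are tried earlier
		// (a broader, shallower tree, as the intro notes).
		q := fpu_value
		if n > 0 { q = node_Q[c] }
		u := c_puct * p * sqrt_parent / f32(1 + n)
		if q + u > best_score {
			best_score = q + u
			best_i = i
		}
	}
	return best_i
}
```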
## Three drivers, one tree
```odin
// Sequential — one evaluator call per leaf. Fine for fast in-process
// value functions and uniform priors.
mcts.run_simulations(&tree, 1600, my_evaluator, &g)

// Batched — leaf-parallel with virtual loss; the evaluator gets a slice
// of cloned leaf states per call. For GPU NN forward passes.
mcts.run_simulations_batched(&tree, 1600, batch_size = 16,
	my_batched_evaluator, &g)

// OS-thread parallel — N workers run descent / eval / backup concurrently.
// Atomics on N / N_virt / Q + a coarse expand mutex; virtual loss decouples
// the descents.
mcts.run_simulations_threaded(&tree, 1600, n_threads = 8,
	my_evaluator, &g)
```
Near-linear scaling on slow evaluators (50 µs/call benchmark, 9×9 Go): n=2: 1.93×, n=4: 3.81×, n=8: 7.15×. For cheap evaluators (microseconds per call) the mutex and CAS-loop contention can erase the speedup — use the sequential or batched paths there.
## Latest release
v0.6.0 — Nine Men's Morris demo game. First demo with explicit game phases (placement → sliding → flying) that reshape the action space mid-game, and the first where one MCTS action atomically contains a follow-up sub-action by the same player (mill formation → opponent-piece removal). Eleven demo games / 137 tests.