Pulse

Pulse is a self-describing, high-performance tabular data processing engine. It ships as a Go library (github.com/frankbardon/pulse) and as a single CLI binary (bin/pulse). Every .pulse file carries its own schema in the header, so consumers (programs, agents, and humans) can discover what a file contains without an external catalog.

The library is the primary deliverable. The CLI is a thin adapter that exposes the same operations on the command line, and an embedded MCP server (pulse mcp) exposes them to LLM agents.

Where to go from here

| If you are… | Start with |
| --- | --- |
| New to Pulse | Installation, Your First Cohort, CLI Tour |
| Driving Pulse from the shell | Command Line Reference |
| Embedding Pulse in a Go program | Library Embedding |
| Curious about the binary format | .pulse File Format |
| Hacking on Pulse itself | Internals and Contributing |
| Wiring Pulse into an LLM agent | MCP Integration (Pointer), then the in-binary skill pack |

LLM-facing surface

LLM agents do not read this site. Pulse exposes a Model Context Protocol server (pulse mcp) and ships 19 embedded skills under skills/ that LLMs load on demand via the pulse_skills_list and pulse_skills_get tools. The skill voice is MCP-only (tool calls, JSON payloads). This site is the human-facing counterpart — same engine, different idiom.

See How LLMs Use Pulse for a short pointer table.

Source of truth

The authoritative architectural contract for Pulse lives in the repository’s CLAUDE.md. When this site and CLAUDE.md disagree, CLAUDE.md wins; please open an issue.

Installation

Audience: new users who want a working pulse binary on their PATH.

This page walks through installing Pulse, the prerequisites it needs, and how to verify the install. Pulse is distributed as a single static Go binary; there is no installer, no daemon, and no config file.

LLM agents using MCP: see the getting-started skill via pulse_skills_get — it covers session bootstrap rather than local install.

Prerequisites

| Requirement | Minimum |
| --- | --- |
| Go toolchain | 1.24 (see go.mod) |
| OS | Linux, macOS, or Windows (anywhere Go cross-compiles) |
| Disk | A few MB for the binary; cohort files live wherever you point PULSE_DATA_DIR |

go.mod is the source of truth for the supported Go version; if it drifts from this page the go.mod value wins.

Install with go install

The fastest path on a developer machine:

go install github.com/frankbardon/pulse/cmd/pulse@latest

This places the pulse binary in $(go env GOBIN) when GOBIN is set, and otherwise in $(go env GOPATH)/bin (typically ~/go/bin). Make sure that directory is on your PATH.

Pin a specific release by replacing @latest with a tag:

go install github.com/frankbardon/pulse/cmd/pulse@v0.2.0

Build from source

The same binary, built reproducibly from a checkout:

git clone https://github.com/frankbardon/pulse.git
cd pulse
make build
# Binary at ./bin/pulse

The Makefile is documented in CLAUDE.md → Build / Dev / Test Workflow; the relevant targets are make build, make test, make lint, and make cover.

Configure the data directory

Pulse reads and writes .pulse files under a base directory named by the PULSE_DATA_DIR environment variable. Most commands accept absolute paths and work without it, but pulse mcp requires the variable so the MCP server can enumerate cohorts:

export PULSE_DATA_DIR=/var/data/pulse

The repo Makefile auto-loads a .env file from the repo root, so you can also drop PULSE_DATA_DIR=... there for local development.

PULSE_DATA_DIR is the only environment variable that any command requires. See Flag Reference for the full list of CLI flags and environment knobs.

Verify

pulse --version
pulse --json | head -20

pulse --json prints the root manifest — the full self-description of commands, components, field types, and embedded skills. If you see a top-level format_version: "1.0" envelope, the install is working.
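A scripted version of this check could look like the minimal sketch below. It relies only on what this page states, namely that the manifest is a JSON envelope with a top-level format_version of "1.0". The function name looks_like_manifest is hypothetical, and the sample string stands in for real pulse --json output.

```python
import json


def looks_like_manifest(text: str) -> bool:
    """Return True if text parses as a JSON object whose format_version is "1.0"."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(doc, dict) and doc.get("format_version") == "1.0"


# Stand-in for captured `pulse --json` output; a real manifest carries far more.
sample = '{"format_version": "1.0", "data": {}, "errors": [], "warnings": []}'
print(looks_like_manifest(sample))  # True
```

In practice the text would come from the binary itself, for example via subprocess.run(["pulse", "--json"], capture_output=True, text=True).stdout.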

Your First Cohort

Audience: new CLI users. This is a five-minute tour: import a CSV, inspect the resulting .pulse file, run an aggregation, and export the result back.

LLM agents using MCP: the equivalent tour for an agent is the getting-started skill, fetched via pulse_skills_get. That skill speaks in tool calls and JSON payloads; this page speaks in shell commands.

1. Pick a CSV

For this walkthrough we’ll assume a file called sales.csv with columns like:

order_id,region,product,units,revenue,sold_on
1,west,widget,3,29.97,2024-01-04
2,east,gadget,1,19.99,2024-01-04
3,west,widget,7,69.93,2024-01-05
...

Any CSV with a header row works. Pulse also imports TSV, NDJSON, JSON-array, Parquet, Arrow IPC, and Excel — see Flag Reference for per-format flags.

2. Import to a .pulse file

pulse import csv --input sales.csv --output sales.pulse

Pulse samples up to 500 rows by default to infer a schema (you can change that with --sample-rows). Each column gets a typed binary representation and, if it looks like a low-cardinality string, a categorical dictionary.
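To make the idea of sample-based inference concrete, here is a rough sketch of the general technique. It is not Pulse's inference code: the kind names ("integer", "float", "categorical", "string"), the max_categories cutoff, and the function names are all assumptions made for illustration, and Pulse's real type catalog is richer.

```python
import csv
import io
from itertools import islice


def _parses(values, convert):
    """Return True if every value converts without error."""
    try:
        for value in values:
            convert(value)
    except (ValueError, TypeError):
        return False
    return True


def _kind_of(values, max_categories):
    """Classify one column from its sampled values."""
    if not values:
        return "string"
    if _parses(values, int):
        return "integer"
    if _parses(values, float):
        return "float"
    # Hypothetical rule: few distinct strings means "low cardinality".
    if len(set(values)) <= max_categories:
        return "categorical"
    return "string"


def infer_kinds(csv_text, sample_rows=500, max_categories=16):
    """Guess a coarse kind per column from at most sample_rows data rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(islice(reader, sample_rows))
    return {
        name: _kind_of([row[name] for row in rows], max_categories)
        for name in (reader.fieldnames or [])
    }


text = "order_id,region,revenue\n1,west,29.97\n2,east,19.99\n3,west,69.93\n"
print(infer_kinds(text))
```

Because only a sample is inspected, a column can be misclassified when later rows differ from early ones, which is one reason to supply an explicit schema for important imports.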

Want to control the schema explicitly? Generate a template, edit it, and re-import:

# Editable schema template
pulse import schema-template sales.csv > sales.schema.json

# Edit sales.schema.json — set types, add descriptions
# Then import with the schema
pulse import csv --input sales.csv --schema sales.schema.json --output sales.pulse

See Field Types for the type catalog and Dictionary Blocks for how categoricals are encoded.

3. Inspect

The .pulse file is fully self-describing. Read it back:

pulse cohort inspect sales.pulse

Output is a table of fields, their types, and the description string stored in the header. Add --json for the structured envelope, or --full-dict to print every categorical entry instead of truncating after 100.

pulse cohort inspect sales.pulse --json

The envelope is documented in pulse cohort inspect.

4. Validate a request before running it

Pulse separates validation from execution. Write a tiny request file:

{
  "cohort": {"filename": "sales.pulse"},
  "groups": [{"type": "GROUP_CATEGORY", "field": "region"}],
  "aggregations": [
    {"type": "AGG_COUNT", "field": "order_id", "label": "orders"},
    {"type": "AGG_SUM", "field": "revenue", "label": "total_revenue"}
  ]
}

Save it as request.json, then check whether it makes sense against the cohort’s schema:

pulse api predict --request request.json

You’ll see Valid: true, the schema’s field count, and any warnings (e.g., a numeric aggregation applied to a categorical field). Predict never reads record data, so it’s safe to iterate on a request without touching a multi-GB cohort.
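When requests are generated by a script rather than written by hand, the same file can be assembled programmatically. The sketch below mirrors the request shown in this step; the helper name build_request is hypothetical, and only the keys and operator types that appear on this page are used.

```python
import json


def build_request(cohort_file, group_field, aggregations):
    """Assemble a request dict; aggregations is a list of (type, field, label)."""
    return {
        "cohort": {"filename": cohort_file},
        "groups": [{"type": "GROUP_CATEGORY", "field": group_field}],
        "aggregations": [
            {"type": agg_type, "field": field, "label": label}
            for agg_type, field, label in aggregations
        ],
    }


request = build_request(
    "sales.pulse",
    "region",
    [("AGG_COUNT", "order_id", "orders"), ("AGG_SUM", "revenue", "total_revenue")],
)
print(json.dumps(request, indent=2))
```

Writing the result with json.dump(request, open_file) produces a request.json that can be passed to pulse api predict --request.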

See pulse api predict and the debugging-with-predict skill for the full predict loop.

5. Execute

pulse api process --request request.json --json

The response is wrapped in the standard envelope (format_version, data, errors, warnings). data carries the result rows and a metadata block with total_rows and filtered_rows.

If your result is large, swap --json for --stream to receive rows as NDJSON, one line at a time — useful for pipelines that don’t want to buffer the whole result. See Streaming & ProcessStream for which request shapes actually stream end-to-end inside the engine vs which buffer.

6. Export

When you are done with the .pulse file, export to whatever your downstream tool understands:

pulse export csv     --input sales.pulse --output sales.out.csv
pulse export parquet --input sales.pulse --output sales.out.parquet
pulse export excel   --input sales.pulse --output sales.out.xlsx

To skip the intermediate .pulse entirely and convert in one shot, use pulse convert source.csv target.parquet — see the top-level README for the full convert recipe.

What you didn’t see

  • Compose: batch multiple requests in one call — pulse api compose.
  • Ask: natural-language one-shot — pulse api ask.
  • Sample / Facet: cheap read-only probes — api sample, api facet.
  • Window / Feature / Test operators: pull from the skill pack (window-operations, feature-engineering, statistical-testing) via pulse skills show <name>.

For a full map of the CLI, see the CLI Tour.

CLI Tour

Audience: anyone who wants a map of every pulse subcommand before diving into per-command details.

This page is a one-liner index of the CLI tree. Each row links to its detailed chapter where applicable; commands that are minor variants of each other (per-format import/export leaves) are listed compactly.

LLM agents using MCP: there is no equivalent skill — agents drive Pulse through MCP tools, not the CLI. Start at the getting-started skill instead.

Top-level groups

pulse [--json] [--slim]
├── import      Tabular → .pulse (csv, tsv, ndjson, jsonarray, parquet, arrow, excel)
├── export      .pulse  → tabular (same format set)
├── convert     Tabular → tabular, with .pulse as the transparent middle
├── cohort      Inspect or filter an existing .pulse file
├── api         Processing operations (process, compose, ask, predict, sample, facet)
├── synth       Generate synthetic cohorts (from-schema, from-profile)
├── profile     Capture a statistical profile of a cohort
├── skills      Read the embedded LLM skill pack
└── mcp         Run the Model Context Protocol server over stdio

Bare pulse --json prints the self-describing root manifest — commands, components, field types, and skill metadata in one envelope. Pass --slim to drop prose descriptions for size-sensitive clients.

API operations

These commands form the “processing facade”: the same operations are exposed through the Go library API and the MCP tool set.

| Command | Purpose | Chapter |
| --- | --- | --- |
| pulse api process | Execute one request against a cohort | api process |
| pulse api compose | Execute multiple requests in batch / parallel | api compose |
| pulse api ask | Parse a natural-language query and execute | api ask |
| pulse api predict | Validate a request without executing | api predict |
| pulse api sample | Return up to N rows | api sample |
| pulse api facet | Return distinct values of a field | api facet |

Cohort lifecycle

| Command | Purpose | Chapter |
| --- | --- | --- |
| pulse cohort inspect PATH | Read header + schema (no record data) | cohort inspect |
| pulse cohort filter | Write a filtered subset to a new .pulse | See Internals → Architecture |

Import / export / convert

pulse import <format> and pulse export <format> share the same flag shape per format: --input and --output on both, plus --schema on import. Supported formats today:

csv · tsv · ndjson · jsonarray · parquet · arrow · excel

Each format has a per-leaf command (e.g. pulse import csv). Run pulse import --help or pulse export --help for the full list.

pulse convert SOURCE TARGET chains import + export with no intermediate file unless --keep-pulse PATH is passed. Format is auto-detected from extensions.

Synthetic data

| Command | Purpose | Chapter |
| --- | --- | --- |
| pulse synth from-schema | Generate from a JSON spec | synth from-schema |
| pulse synth from-profile | Generate from a captured profile | synth from-profile |
| pulse profile create | Capture a profile from an existing cohort | profile create |

Self-description & LLM surface

| Command | Purpose | Chapter |
| --- | --- | --- |
| pulse --json | Root manifest (commands, components, field types, skills) | manifest |
| pulse skills list | List embedded skills with metadata | How LLMs Use Pulse |
| pulse skills show NAME | Print a skill’s full markdown body | How LLMs Use Pulse |
| pulse mcp | Serve MCP over stdio | mcp |

Cross-cutting flags

Most leaves accept --json (envelope output), --no-defaults (turn off smart operator-type inference), and the operation-specific flags documented per page. Full list: Flag Reference.

The single environment variable to know is PULSE_DATA_DIR — see Installation.

pulse api process

Audience: CLI users running a single processing request against a cohort.

pulse api process executes one types.Request against a .pulse file and prints the result. It’s the most-used leaf in the binary.

LLM agents using MCP: the equivalent surface is the pulse_process MCP tool — see skills/request-recipes.md for request skeletons.

Synopsis

pulse api process --request FILE [--json] [--stream] [--no-defaults]

Flags

| Flag | Alias | Type | Default | Purpose |
| --- | --- | --- | --- | --- |
| --request | -r | string | (required) | Path to the request JSON file |
| --json | | bool | false | Emit the result wrapped in the JSON envelope |
| --stream | | bool | false | Stream rows as NDJSON (one per line) instead of buffering |
| --no-defaults | | bool | false | Disable smart operator-type inference; require explicit Type on every aggregation and grouper |

--stream and --json select different output shapes and are not meant to be combined: --stream emits one JSON object per line, while --json emits the full envelope.

Request file shape

The request file is a types.Request serialised to JSON. Minimal example:

{
  "cohort": {"filename": "sales.pulse"},
  "aggregations": [
    {"type": "AGG_SUM", "field": "revenue", "label": "total_revenue"}
  ]
}

The full request grammar — filterers, groupers, attributes, window operators, features, sort, tests, post-tests — is documented in types.Request; the LLM-facing companion is skills/request-recipes.md.

Output

Text mode (default)

Pretty-printed JSON of the Response struct: a data array of result rows plus a metadata block with total_rows, filtered_rows, and cohort_file.

--json

The standard envelope:

{
  "format_version": "1.0",
  "data": {
    "data": [ /* result rows */ ],
    "metadata": { "total_rows": 1000, "filtered_rows": 800, "cohort_file": "sales.pulse" }
  },
  "errors": [],
  "warnings": []
}
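A programmatic consumer typically unwraps this envelope before touching the rows. The following is a minimal sketch under the envelope shape shown above; PulseError and unwrap_process are hypothetical names, and treating any non-empty errors array as fatal is an assumption about how a caller would want to behave.

```python
import json


class PulseError(Exception):
    """Raised when an envelope carries a non-empty errors array."""


def unwrap_process(envelope_text):
    """Return (rows, metadata) from a `pulse api process --json` envelope."""
    env = json.loads(envelope_text)
    if env.get("errors"):
        raise PulseError(env["errors"])
    payload = env["data"]
    return payload["data"], payload["metadata"]


# Sample envelope modelled on the shape documented above.
sample_envelope = json.dumps({
    "format_version": "1.0",
    "data": {
        "data": [{"total_revenue": "119.89"}],
        "metadata": {"total_rows": 1000, "filtered_rows": 800, "cohort_file": "sales.pulse"},
    },
    "errors": [],
    "warnings": [],
})
rows, metadata = unwrap_process(sample_envelope)
print(len(rows), metadata["filtered_rows"])  # 1 800
```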

--stream

NDJSON of result rows, one per line. No envelope, no metadata footer. Run pulse api predict beforehand to confirm that streamable is true; requests that predict reports as buffered still emit through this path, but the engine materialises their result before any row is written.

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | Any error: wrapped in the envelope’s errors array under --json, or printed to stderr otherwise |

Examples

Quick aggregation

cat > req.json <<'EOF'
{
  "cohort": {"filename": "sales.pulse"},
  "aggregations": [{"type": "AGG_COUNT", "field": "order_id", "label": "n"}]
}
EOF

pulse api process --request req.json

Filter, group, and aggregate

cat > req.json <<'EOF'
{
  "cohort": {"filename": "sales.pulse"},
  "filterers": [{"type": "FILTER_RANGE", "field": "revenue", "values": ["100", "10000"]}],
  "groups":    [{"type": "GROUP_CATEGORY", "field": "region"}],
  "aggregations": [
    {"type": "AGG_COUNT",   "field": "order_id", "label": "orders"},
    {"type": "AGG_AVERAGE", "field": "revenue",  "label": "avg_rev"}
  ]
}
EOF

pulse api process --request req.json --json

Stream rows into a downstream pipeline

pulse api process --request req.json --stream | \
    jq -c 'select(.avg_rev > 500)'
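Where jq is not available, the same row-at-a-time filtering can be done in a short script. This sketch is roughly equivalent to the jq filter above; it assumes avg_rev arrives as a JSON number and skips rows where it is missing or not numeric. The function name select_rows is hypothetical.

```python
import json


def select_rows(lines, field="avg_rev", threshold=500):
    """Yield parsed rows whose numeric field exceeds threshold, one line at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        value = row.get(field)
        if isinstance(value, (int, float)) and value > threshold:
            yield row


# Sample NDJSON lines standing in for `pulse api process --stream` output.
sample_lines = [
    '{"region": "west", "avg_rev": 750.5}',
    '{"region": "east", "avg_rev": 120}',
    "",
    '{"region": "north"}',
]
for row in select_rows(sample_lines):
    print(json.dumps(row))
```

Passing sys.stdin as lines lets the script sit at the end of a shell pipeline without buffering the whole result.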

pulse api compose

Audience: CLI users executing a batch of related requests in one call.

pulse api compose runs multiple types.Request entries against one or more cohorts. The whole batch is one ComposedRequest; the engine can run the entries sequentially or in parallel against a bounded worker pool.

LLM agents using MCP: see the pulse_compose MCP tool and the compose-requests skill.

Synopsis

pulse api compose --request FILE [--json] [--stream]
                  [--parallel N] [--no-fail-fast]
                  [--no-defaults]

Flags

| Flag | Alias | Type | Default | Purpose |
| --- | --- | --- | --- | --- |
| --request | -r | string | (required) | Composed-request JSON path |
| --json | | bool | false | Wrap output in the standard envelope |
| --stream | | bool | false | Stream rows as NDJSON; each line is {"index": N, "row": {...}} |
| --parallel | | int | 1 | Worker count; 0 = GOMAXPROCS, 1 = sequential |
| --no-fail-fast | | bool | false | Aggregate errors across slots instead of cancelling on first failure (parallel mode only) |
| --no-defaults | | bool | false | Disable smart operator-type inference |

Request file shape

{
  "requests": [
    { "cohort": {"filename": "sales.pulse"}, "aggregations": [...] },
    { "cohort": {"filename": "sales.pulse"}, "groups":       [...] },
    { "cohort": {"filename": "ops.pulse"},   "filterers":    [...] }
  ]
}

Each requests[i] is a full types.Request. Slots are independent — they may target different cohorts, use different operators, etc.

Output ordering

Responses come back in input order, regardless of --parallel. A worker that finishes early waits its turn before emitting. So responses[i] always corresponds to request.requests[i].
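The ordering guarantee is a common pattern: run work concurrently, but collect results by slot position rather than by completion time. The sketch below illustrates that pattern with Python's thread pool. It is not Pulse's implementation, and run_in_order and make_task are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_in_order(tasks, workers=4):
    """Run zero-argument callables concurrently; return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        # Collect by position, not by completion time.
        return [future.result() for future in futures]


def make_task(index, delay):
    """Build a task that sleeps for delay seconds and then returns its index."""
    def task():
        time.sleep(delay)
        return index
    return task


# Slot 0 is the slowest, yet it still comes back first.
tasks = [make_task(0, 0.05), make_task(1, 0.0), make_task(2, 0.01)]
print(run_in_order(tasks))  # [0, 1, 2]
```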

Parallel mode

--parallel N:

  • 1 (default): sequential Compose, equivalent to running each request through pulse api process in a loop.
  • 0: runtime.GOMAXPROCS workers.
  • >1: exactly N workers.

Workers share Pulse’s read-only registries; per-request stateful operators are constructed fresh. See Parallel Compose for full mechanics.

FailFast semantics

With --no-fail-fast unset (the default, fail-fast on):

  • The first failing request cancels in-flight siblings.
  • The command exits non-zero with the first error.

With --no-fail-fast:

  • Every request runs to its own completion (or per-request timeout).
  • Errors aggregate into a single SERVICE_INTERNAL error whose details.failed_indices lists the slot indices that failed.
  • Successful slots populate the response array; failed slots are null.
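The aggregated mode described above can be pictured as follows. This is an illustrative sketch of the bookkeeping only: it runs tasks sequentially for brevity, whereas the flag applies to parallel mode, and run_all, ok, and boom are hypothetical names.

```python
def run_all(tasks):
    """Run every task; return (responses, failed_indices).

    Failed slots are None in responses, mirroring the null slots described above.
    """
    responses, failed_indices = [], []
    for index, task in enumerate(tasks):
        try:
            responses.append(task())
        except Exception:
            responses.append(None)
            failed_indices.append(index)
    return responses, failed_indices


def ok():
    return {"rows": 1}


def boom():
    raise ValueError("simulated slot failure")


responses, failed_indices = run_all([ok, boom, ok])
print(responses, failed_indices)  # [{'rows': 1}, None, {'rows': 1}] [1]
```

A caller can then pair each non-null response with its request by index and retry only the slots listed in failed_indices.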

Output

--json

{
  "format_version": "1.0",
  "data": [ /* response per slot, in input order */ ],
  "errors": [],
  "warnings": []
}

--stream

{"index": 0, "row": { ... }}
{"index": 0, "row": { ... }}
{"index": 1, "row": { ... }}

The index field identifies which slot’s request produced each row.
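A consumer that needs per-slot results can regroup the interleaved lines by index. The sketch below assumes only the {"index": N, "row": {...}} line shape shown above; rows_by_slot is a hypothetical helper name.

```python
import json
from collections import defaultdict


def rows_by_slot(lines):
    """Group compose --stream lines into {index: [row, ...]}."""
    grouped = defaultdict(list)
    for line in lines:
        if line.strip():
            record = json.loads(line)
            grouped[record["index"]].append(record["row"])
    return dict(grouped)


# Sample lines standing in for `pulse api compose --stream` output.
stream_lines = [
    '{"index": 0, "row": {"region": "west"}}',
    '{"index": 0, "row": {"region": "east"}}',
    '{"index": 1, "row": {"orders": 3}}',
]
print(rows_by_slot(stream_lines))
```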

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | All requests succeeded |
| 1 | One or more requests failed (fail-fast: first error; aggregated: any failure) |

Examples

Sequential batch

pulse api compose --request batch.json --json

Parallel with 4 workers, aggregated errors

pulse api compose --request batch.json --parallel 4 --no-fail-fast --json

Stream a parallel batch into a downstream consumer

pulse api compose --request batch.json --parallel 4 --stream | \
    jq -c 'select(.index == 2)'

pulse api ask

Audience: CLI users running a one-shot natural-language query against a cohort, or any caller who wants “predict + process” in one call.

pulse api ask is the unified entry point. It optionally translates a natural-language query into a request via the built-in parser, validates the request (predict), and, if validation succeeds, executes it. The MCP server uses the same library facade internally for the pulse_ask tool.

LLM agents using MCP: the LLM-side counterpart is the pulse_ask MCP tool. The query-router-prompt skill gives a system-prompt template for routing natural language into Pulse requests.

Synopsis

pulse api ask  [--file FILE] [--query "..."] [--request FILE]
               [--on-invalid abort|suggest] [--predict]
               [--json] [--no-defaults]

You must pass at least one of --query or --request.

Flags

| Flag | Alias | Type | Default | Purpose |
| --- | --- | --- | --- | --- |
| --file | -f | string | (none) | Cohort .pulse file path |
| --query | -q | string | (none) | Natural-language query string |
| --request | -r | string | (none) | Optional structured request JSON path |
| --on-invalid | | string | "abort" | Predict-invalid behaviour: "abort" returns an error; "suggest" returns the response with suggestions populated |
| --predict | | bool | false | Validate without executing |
| --json | | bool | false | Emit the standard envelope |
| --no-defaults | | bool | false | Disable smart operator-type inference |

How the parser fills the request

When --query is set, the parser reads the cohort’s schema and synthesises a types.Request slot-by-slot. If --request is also provided, explicit fields in that request always win on collision — the parser only fills empty slots.

The parser populates these slots from the query today: Aggregations, Groups, Filterers, Windows, Sort, Tests. Other slots in the parsed request are ignored.
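The "explicit wins, parser fills empty slots" rule amounts to a one-directional merge. The sketch below shows that merge on plain dictionaries; it is an illustration rather than the parser's code, fill_empty_slots is a hypothetical name, and treating None, empty lists, empty objects, and empty strings as "empty" is an assumption.

```python
def _is_empty(value):
    """Treat None and empty containers or strings as an unfilled slot (assumption)."""
    return value is None or value == [] or value == {} or value == ""


def fill_empty_slots(explicit, parsed):
    """Return a merged request where explicit values win and parsed fills gaps."""
    merged = dict(parsed)
    for slot, value in explicit.items():
        if not _is_empty(value):
            merged[slot] = value
    return merged


explicit = {
    "filterers": [{"type": "FILTER_RANGE", "field": "revenue", "values": ["100", "1000"]}],
    "groups": [],
}
parsed = {
    "filterers": [{"type": "FILTER_RANGE", "field": "revenue", "values": ["0", "50"]}],
    "groups": [{"type": "GROUP_CATEGORY", "field": "region"}],
}
merged = fill_empty_slots(explicit, parsed)
print(merged["filterers"][0]["values"], merged["groups"][0]["field"])
```

Here the explicit filterer survives the collision, while the empty groups slot is taken from the parsed query.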

Output

Text mode

A human-readable summary:

Query: average revenue by region
Matched fields: [revenue region]
Confidence: 0.92

Resolved request:
{ ...the synthesised types.Request... }

{ ...result rows, if executed... }

--json

Full AskResponse envelope:

{
  "format_version": "1.0",
  "predict": { /* PredictResult */ },
  "process": { /* Response, if executed */ },
  "suggestions": [],
  "query_resolution": {
    "query": "average revenue by region",
    "matched_fields": ["revenue", "region"],
    "confidence": 0.92
  },
  "errors": [],
  "warnings": []
}

process is omitted when --predict is set or when predict reported invalid and execution was skipped.

Confidence and unresolved queries

query_resolution.confidence is in [0, 1]. A confidence of 0 means the parser found no usable structure; it is reported in errors as PULSE_QUERY_UNRESOLVED. A confidence below 1 with at least one matched field is reported in warnings as PULSE_QUERY_AMBIGUOUS, together with the reasons. The query-router-prompt skill describes the parser’s grammar.

OnInvalid behaviours

| Value | Behaviour |
| --- | --- |
| "abort" (default) | Return a SERVICE_VALIDATION error if predict reports invalid |
| "suggest" | Return the response with suggestions populated from errors/fixup_metadata.go |

Use "suggest" when you want fixup hints (e.g., “did you mean field revenue?”) rather than a hard fail.

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | Validation failed (abort), parser failed, or process errored |

Examples

Pure natural-language query

pulse api ask --file sales.pulse --query "average revenue by region" --json

Query plus partial structured request

cat > partial.json <<'EOF'
{
  "filterers": [{"type": "FILTER_RANGE", "field": "revenue", "values": ["100", "1000"]}]
}
EOF
pulse api ask --file sales.pulse --request partial.json --query "by region" --json

Predict-only probe

pulse api ask --request req.json --predict --json

Suggest fixups instead of erroring

pulse api ask --request typo.json --on-invalid suggest --json

See also

  • pulse api predict: standalone validation
  • pulse api process: execute a pre-validated request
  • Library: pulse.Ask, the Go-side counterpart
  • skills/query-router-prompt.md: LLM prompt template for routing
  • skills/request-recipes.md: canonical request skeletons

pulse cohort inspect

Audience: CLI users reading a .pulse file’s schema without running a query — the human-side counterpart of the inspect library method and the pulse_inspect MCP tool. Defined in internal/cli/cohort.go.

pulse cohort inspect reads only the file’s header and schema; it never reads record data, so its cost does not grow with the number of records in the cohort.

LLM agents using MCP: see the cohort-schema-design skill and the pulse_inspect tool.

Synopsis

pulse cohort inspect PATH [--json] [--full-dict]

Flags

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| --json | bool | false | Emit the standard envelope |
| --full-dict | bool | false | Print every categorical dictionary entry (default truncates at 100) |

Output (text mode)

Fields: 7
  order_id              u64                  Stable order identifier
  region                categorical_u8       Sales region label
    dictionary: 4 entries
  product               categorical_u16      Product SKU
    dictionary: 240 entries (truncated)
  units                 u32                  Units sold per line
  revenue               decimal128           Line revenue (precision 18, scale 2)
  sold_on               date                 Date the order shipped
  ...

Dictionaries with > 100 entries are flagged (truncated) — pass --full-dict to print every entry.

Output (--json)

{
  "format_version": "1.0",
  "data": {
    "field_count": 7,
    "fields": [
      {
        "name": "order_id",
        "type": "u64",
        "byte_offset": 0,
        "bit_position": 0,
        "description": "Stable order identifier",
        "description_source": "schema"
      },
      {
        "name": "region",
        "type": "categorical_u8",
        "byte_offset": 8,
        "bit_position": 0,
        "description": "Sales region label",
        "description_source": "schema",
        "dictionary": {
          "total_entries": 4,
          "truncated": false,
          "entries": ["east", "west", "north", "south"]
        }
      }
    ]
  },
  "errors": [],
  "warnings": []
}

Fields with empty descriptions on disk get a synthesised fallback ("Categorical field: <name>" / "Numeric field: <name>"); their description_source is "synthesized" rather than "schema".
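The description_source marker makes it easy to find fields that still need a hand-written description. A minimal sketch, assuming the inspect envelope shape shown above; synthesized_fields is a hypothetical helper name and the sample envelope is invented for the example.

```python
import json


def synthesized_fields(envelope_text):
    """Return names of fields whose description_source is "synthesized"."""
    env = json.loads(envelope_text)
    return [
        field["name"]
        for field in env["data"]["fields"]
        if field.get("description_source") == "synthesized"
    ]


# Sample envelope: one described field, one with a synthesised fallback.
inspect_sample = json.dumps({
    "format_version": "1.0",
    "data": {
        "field_count": 2,
        "fields": [
            {"name": "order_id", "type": "u64",
             "description": "Stable order identifier", "description_source": "schema"},
            {"name": "units", "type": "u32",
             "description": "Numeric field: units", "description_source": "synthesized"},
        ],
    },
    "errors": [],
    "warnings": [],
})
print(synthesized_fields(inspect_sample))  # ['units']
```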

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | File not found, truncated, magic-byte mismatch, or unsupported format version |

Examples

# Human-readable inspect
pulse cohort inspect data.pulse

# Full envelope for programmatic consumers
pulse cohort inspect data.pulse --json

# Show all categorical entries
pulse cohort inspect data.pulse --full-dict --json | jq '.data.fields[] | select(.dictionary)'

pulse api predict

Audience: CLI users validating a request before running it.

pulse api predict validates a types.Request against a .pulse file’s schema without executing it. It reads only the header and schema — never record data — so it’s a cheap, safe iteration loop against arbitrarily large cohorts.

LLM agents using MCP: see the pulse_predict MCP tool and the debugging-with-predict skill. Predict is the LLM’s primary “would this work?” probe.

Synopsis

pulse api predict --request FILE [--json] [--strict]

Flags

| Flag | Alias | Type | Default | Purpose |
| --- | --- | --- | --- | --- |
| --request | -r | string | (required) | Request JSON path |
| --json | | bool | false | Emit the standard envelope |
| --strict | | bool | false | Treat warnings as errors |

Structural ban

descriptor/predict.go must not import service/ or processing/. The TestPredictNoExecutionImports test enforces this, so predict is guaranteed never to touch the executor.

Output (text mode)

Valid: true
Schema: 7 fields
Warning [PULSE_AGG_NOT_MEANINGFUL_FOR_CATEGORICAL]: AGG_AVG on field region (categorical_u8)

Without --strict, the command still exits 0 despite that warning. With --strict, the warning becomes an error and the command exits non-zero.

Output (--json)

{
  "format_version": "1.0",
  "data": {
    "valid": true,
    "schema_info": {"field_count": 7},
    "streamable": false,
    "streamable_reasons": [
      "AGG_MEDIAN on field price"
    ],
    "request": { /* the request as predict resolved it, with defaults applied */ }
  },
  "errors":  [],
  "warnings": [
    {"code": "PULSE_AGG_NOT_MEANINGFUL_FOR_CATEGORICAL", "message": "..."}
  ]
}

streamable reports whether the request will execute on the streaming Process path; streamable_reasons lists every gate that forced the buffered path. See Performance Notes for the full streaming/buffered table.

request echoes the request after defaults have been applied so you can see what would actually run. To suppress defaults, run with --no-defaults on the executing leaf (api process, api compose); predict reports defaults_applied regardless.

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Valid (or valid with warnings, in non-strict mode) |
| 1 | Invalid, or --strict with at least one warning |
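The exit-code rule can be restated as a small decision function, which is handy when a CI job already has the --json envelope and wants to apply strictness itself. This is a sketch of the rule in the table above, not Pulse's code; predict_exit_code is a hypothetical name.

```python
import json


def predict_exit_code(envelope_text, strict=False):
    """Return the exit code implied by a predict envelope: 0 for pass, 1 for fail."""
    env = json.loads(envelope_text)
    data = env.get("data") or {}
    if not data.get("valid"):
        return 1
    if strict and env.get("warnings"):
        return 1
    return 0


# Sample envelope: valid request that carries one warning.
with_warning = json.dumps({
    "format_version": "1.0",
    "data": {"valid": True},
    "errors": [],
    "warnings": [{"code": "PULSE_AGG_NOT_MEANINGFUL_FOR_CATEGORICAL", "message": "..."}],
})
print(predict_exit_code(with_warning), predict_exit_code(with_warning, strict=True))  # 0 1
```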

Examples

Quick validity check

pulse api predict --request req.json

Programmatic check with envelope

pulse api predict --request req.json --json | \
    jq -e '.data.valid == true' >/dev/null && echo "OK"

Strict mode for CI

pulse api predict --request req.json --strict --json

Detect that a request will buffer

pulse api predict --request req.json --json | \
    jq '.data | {streamable, streamable_reasons}'

Common warning codes

| Code | What to do |
| --- | --- |
| PULSE_AGG_NOT_MEANINGFUL_FOR_CATEGORICAL | Use AGG_COUNT / AGG_FREQUENCY instead of AGG_SUM / AGG_AVG on categoricals |
| PULSE_AGG_NOT_MEANINGFUL_FOR_DECIMAL | Decimal-typed field; switch to a decimal-aware aggregator |
| PULSE_FIELD_DESCRIPTION_LOW_QUALITY | Edit the schema description; re-import |
| PULSE_FEAT_TARGET_LEAKAGE_RISK | The feature operator references the target column; reorganise the pipeline |

The full code-by-code recovery playbook lives in skills/error-code-reference.md and at Troubleshooting.

pulse api sample

Audience: CLI users grabbing a quick peek at a few rows from a cohort — for debugging, sanity-checking an import, or seeding a template request.

pulse api sample returns the first N rows from a .pulse file decoded back to a map of field → value. There is no filter, no aggregation, no transformation — just a typed view of raw rows.

LLM agents using MCP: see the pulse_sample MCP tool. It returns the same shape over the MCP transport.

Synopsis

pulse api sample --input PATH [--count N] [--json]

Flags

| Flag | Alias | Type | Default | Purpose |
| --- | --- | --- | --- | --- |
| --input | -i | string | (required) | Cohort .pulse file path |
| --count | -n | int | 10 | Rows to sample |
| --json | | bool | false | Emit the standard envelope |

Output (text mode)

Pretty-printed JSON of the row array:

[
  {
    "order_id": 1,
    "region": "west",
    "product": "widget",
    "units": 3,
    "revenue": "29.97",
    "sold_on": "2024-01-04"
  },
  ...
]

Decimal128 values are serialised as strings to preserve precision.
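On the consuming side, the string form should be parsed with a decimal type rather than a binary float, otherwise the precision the serialisation preserved is lost again. A minimal sketch using Python's standard decimal module; total_revenue is a hypothetical helper and the rows are taken from the sample CSV used in this book.

```python
from decimal import Decimal


def total_revenue(rows):
    """Sum the string-encoded revenue values exactly."""
    return sum((Decimal(row["revenue"]) for row in rows), Decimal("0"))


rows = [{"revenue": "29.97"}, {"revenue": "19.99"}, {"revenue": "69.93"}]
print(total_revenue(rows))  # 119.89
```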

Output (--json)

{
  "format_version": "1.0",
  "data": [ /* row array */ ],
  "errors": [],
  "warnings": []
}

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | File not found, truncated, or unsupported version |

Examples

# 10 rows
pulse api sample --input sales.pulse

# 100 rows, envelope-wrapped
pulse api sample --input sales.pulse --count 100 --json

# Pipe into jq
pulse api sample --input sales.pulse --count 100 | jq '.[] | .revenue'

When sample is the wrong tool

  • For filtered subsets, use pulse api process with a FILTER_* and no aggregation — the result will be one row per matching record.
  • For distinct values of a single field, use pulse api facet.
  • For schema-only views (types, descriptions, dictionaries), use pulse cohort inspect.

pulse api facet

Audience: CLI users enumerating distinct values for a single field — a cheap probe for “what are the regions in this cohort?” without building a full filter.

pulse api facet returns the distinct values of one field in a .pulse file. For categorical fields it reads the dictionary directly (no record scan). For non-categorical fields it scans records.

LLM agents using MCP: see the pulse_facet MCP tool.

Synopsis

pulse api facet --input PATH --field NAME [--json]

Flags

| Flag | Alias | Type | Default | Purpose |
| --- | --- | --- | --- | --- |
| --input | -i | string | (required) | Cohort .pulse file path |
| --field | -f | string | (required) | Field name to facet on |
| --json | | bool | false | Emit the standard envelope |

Output (text mode)

One value per line:

east
north
south
west

Output (--json)

{
  "format_version": "1.0",
  "data": ["east", "north", "south", "west"],
  "errors": [],
  "warnings": []
}

Performance notes

| Field type | Behaviour |
| --- | --- |
| categorical_u8 / _u16 / _u32 | Read directly from the schema’s inline dictionary; O(distinct values), no record scan |
| Non-categorical | Full scan; values collected into a set, then returned sorted |

For columns with very high cardinality on the non-categorical path, expect memory proportional to distinct value count.
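The non-categorical path follows the familiar scan, collect, sort pattern, which is why memory tracks the number of distinct values rather than the number of records. A sketch of that pattern on in-memory records follows; facet_scan is a hypothetical name, and the real engine reads binary records rather than dictionaries.

```python
def facet_scan(records, field):
    """Collect distinct values of field across records, returned sorted."""
    seen = set()
    for record in records:
        seen.add(record[field])  # the set holds one entry per distinct value
    return sorted(seen)


records = [
    {"region": "west", "units": 3},
    {"region": "east", "units": 1},
    {"region": "west", "units": 7},
]
print(facet_scan(records, "region"))  # ['east', 'west']
```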

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | File not found, field name not found, or unsupported version |

Examples

# Read categorical dictionary
pulse api facet --input sales.pulse --field region

# JSON envelope
pulse api facet --input sales.pulse --field region --json

# Pipe into another command
pulse api facet --input sales.pulse --field region | while IFS= read -r r; do
    echo "Region: $r"
done

pulse manifest

Audience: CLI users (and orchestration agents) discovering Pulse’s self-description — what commands exist, which aggregators are registered, which field types are supported, and what skills the binary ships with.

The manifest is the bare-pulse invocation with --json. It is deterministic and process-wide: it never depends on cohort data or the filesystem.

LLM agents using MCP: the manifest is also available via the pulse_manifest MCP tool. Agents typically call this once per session and cache the result.

Synopsis

pulse --json [--slim]

(There is no pulse manifest subcommand — the manifest is the root command’s --json output.)

Flags

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| --json | bool | false | Emit the manifest as a JSON envelope |
| --slim | bool | false | Drop prose descriptions from the manifest payload (smaller for size-sensitive clients) |

Manifest shape

From descriptor/manifest.go:

{
  "format_version": "1.0",
  "data": {
    "commands":   [ /* every CLI leaf with a usage line */ ],
    "operators":  [ /* every aggregator / attribute / filterer / grouper / window / feature */ ],
    "tests":      [ /* every tier-1 statistical test */ ],
    "post_tests": [ /* every tier-2 post-test variant */ ],
    "distributions": [ /* every synth distribution kind */ ],
    "errors":     [ /* every registered error code with a description */ ],
    "mcp_tools":  [ /* every MCP tool name + description */ ],
    "field_types":[ /* every .pulse field type */ ],
    "skills":     [ /* every embedded skill with metadata */ ]
  },
  "errors":   [],
  "warnings": []
}

Every list is sorted deterministically (alphabetical or category + alphabetical). The same Pulse binary always emits the same manifest bytes (modulo --slim).
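
A client that parses the manifest once at boot could start from something like the sketch below. It relies only on the envelope keys shown above. The envelope struct, the manifestSections helper, and the embedded sample payload are illustrative and not part of Pulse's Go API; each list is kept as raw JSON because this page does not specify the per-entry shape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// envelope mirrors the top-level keys of `pulse --json` shown above.
// Each list under "data" stays as raw JSON (per-entry shape is an unknown here).
type envelope struct {
	FormatVersion string                     `json:"format_version"`
	Data          map[string]json.RawMessage `json:"data"`
	Errors        []json.RawMessage          `json:"errors"`
	Warnings      []json.RawMessage          `json:"warnings"`
}

// manifestSections decodes an envelope and returns its format version
// plus the sorted names of the sections found under "data".
func manifestSections(raw []byte) (string, []string, error) {
	var env envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return "", nil, fmt.Errorf("decode manifest: %w", err)
	}
	names := make([]string, 0, len(env.Data))
	for name := range env.Data {
		names = append(names, name)
	}
	sort.Strings(names)
	return env.FormatVersion, names, nil
}

// sample is a trimmed, hand-written stand-in for real `pulse --json` output.
const sample = `{"format_version":"1.0","data":{"operators":[],"commands":[],"skills":[]},"errors":[],"warnings":[]}`

func main() {
	version, sections, err := manifestSections([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println(version, sections)
}
```

In a real program the bytes would come from running pulse --json (or from the library's Manifest method) rather than from a constant.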

Determinism gates

Several CI tests enforce manifest completeness — see Testing Conventions. Notably:

  • TestManifestOperatorsComplete — every registered operator appears in the manifest.
  • TestManifestTestsComplete / TestManifestPostTestsComplete — every registered statistical test appears.
  • TestManifestDistributionsComplete, TestManifestErrorCodesComplete, TestManifestMCPToolsComplete — same for distributions, error codes, and MCP tools.
  • TestManifestStreamableMatchesTypes — every operator’s streamable flag mirrors the per-type method.

When to use the manifest

| Use case | Reach for |
| --- | --- |
| Discover what’s available | pulse --json |
| Confirm a specific operator’s params and emit type | jq '.data.operators[] \| …' |
| List embedded skills with their applies_to | jq '.data.skills[]' |
| Generate documentation or client stubs | Parse the full manifest once at boot |
| Quick “is this name a real operator?” | pulse --json --slim \| … |

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Always (the manifest is in-memory, deterministic, never errors) |

Examples

pulse --json | jq '.data | keys'

Slim variant for embedding in an agent’s system prompt

pulse --json --slim > manifest.slim.json

List every aggregator with its emitted type

pulse --json | jq '.data.operators[] | select(.category == "aggregation") | {name, emits_type}'

Confirm a feature operator’s parameters

pulse --json | jq '.data.operators[] | select(.name == "FEAT_BUCKETIZE")'

pulse synth from-schema

Audience: CLI users generating a synthetic .pulse cohort from a declarative spec — for testing, demos, and bootstrapping fixtures.

pulse synth from-schema reads a JSON synth spec (field-by-field distributions, row count, optional pairwise correlations) and writes a deterministic .pulse file. Same (spec, seed) pair produces a byte-identical output.

LLM agents using MCP: see the pulse_synth MCP tool and the synthetic-data skill — it covers spec authoring, the 12 supported distributions, and constraint patterns.

Synopsis

pulse synth from-schema --spec FILE --output FILE
                        [--rows N] [--seed N] [--json]

Flags

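| Flag | Alias | Type | Default | Purpose |
| --- | --- | --- | --- | --- |
| --spec | -s | string | (required) | Synth spec JSON path |
| --output | -o | string | (required) | Output .pulse file path |
| --rows | | int | from spec | Override row_count in the spec |
| --seed | | int | 0 | Deterministic RNG seed |
| --json | | bool | false | Emit the standard envelope |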

Spec shape (sketch)

{
  "row_count": 10000,
  "fields": [
    {"name": "id",      "type": "u64",            "distribution": "monotonic_from", "from": 1},
    {"name": "region",  "type": "categorical_u8", "distribution": "weighted_categorical",
                         "weights": {"east": 0.4, "west": 0.4, "north": 0.1, "south": 0.1}},
    {"name": "revenue", "type": "f64",            "distribution": "lognormal", "mu": 4.0, "sigma": 0.8},
    {"name": "sold_on", "type": "date",           "distribution": "uniform_date",
                         "from": "2024-01-01", "to": "2024-12-31"}
  ]
}

Full spec grammar (constraints, correlations, regex, …) lives in skills/synthetic-data.md and synth/.

Supported distributions

bernoulli, constant, exponential, lognormal, monotonic_from, normal, pareto, poisson, regex, uniform, uniform_date, weighted_categorical.

The full catalog (with parameters) is in skills/synthetic-data.md and pulse --json | jq '.data.distributions'.
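
A spec can be checked against this list before it ever reaches the CLI. The sketch below is one way to do that in Go, using the twelve names listed above and the name / distribution keys from the spec sketch. The helper unknownDistributions and the specHeader type are hypothetical, and the hard-coded list can drift from the binary, so the manifest remains the authoritative catalog.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// supported holds the distribution names listed on this page.
var supported = map[string]bool{
	"bernoulli": true, "constant": true, "exponential": true, "lognormal": true,
	"monotonic_from": true, "normal": true, "pareto": true, "poisson": true,
	"regex": true, "uniform": true, "uniform_date": true, "weighted_categorical": true,
}

// specHeader decodes only the parts of a spec this check needs.
type specHeader struct {
	Fields []struct {
		Name         string `json:"name"`
		Distribution string `json:"distribution"`
	} `json:"fields"`
}

// unknownDistributions returns "field: distribution" for every field whose
// distribution is not in the supported set.
func unknownDistributions(raw []byte) ([]string, error) {
	var spec specHeader
	if err := json.Unmarshal(raw, &spec); err != nil {
		return nil, fmt.Errorf("parse spec: %w", err)
	}
	var bad []string
	for _, f := range spec.Fields {
		if !supported[f.Distribution] {
			bad = append(bad, f.Name+": "+f.Distribution)
		}
	}
	return bad, nil
}

func main() {
	// "gaussian" is deliberately wrong; the listed name is "normal".
	raw := []byte(`{"row_count": 10, "fields": [
		{"name": "id", "type": "u64", "distribution": "monotonic_from", "from": 1},
		{"name": "score", "type": "f64", "distribution": "gaussian"}
	]}`)
	bad, err := unknownDistributions(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(bad)
}
```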

Determinism

Same (spec, seed) → byte-identical output. The seed is an int64; default 0. Use a fixed seed for fixtures and a random seed for load-testing variation.
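
The general principle behind seed-driven determinism can be shown with Go's standard library alone. The sketch below does not use Pulse's generator (which this page does not describe); it only demonstrates that a pseudo-random stream seeded with the same int64 yields the same bytes, and therefore the same digest. The digest helper is hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/rand"
)

// digest generates n pseudo-random values from a generator seeded with seed
// and returns the hex SHA-256 of their byte encoding.
func digest(seed int64, n int) string {
	rng := rand.New(rand.NewSource(seed))
	h := sha256.New()
	var b [8]byte
	for i := 0; i < n; i++ {
		binary.LittleEndian.PutUint64(b[:], rng.Uint64())
		h.Write(b[:])
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	// Same seed twice: the digests match.
	fmt.Println(digest(42, 1000) == digest(42, 1000))
	// Different seeds: the streams, and so the digests, differ.
	fmt.Println(digest(42, 1000) == digest(43, 1000))
}
```

The same comparison works on real output: hash two .pulse files generated from the same spec and seed and compare the digests.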

Output

Text mode

Generated 10000 rows -> sales.pulse (rejected 0)

rejected counts rows that failed user-defined constraints. If the rejection rate is too high to make progress, the command fails with PULSE_SYNTH_CONSTRAINT_INFEASIBLE.

--json

{
  "format_version": "1.0",
  "data": {
    "output_path": "sales.pulse",
    "rows_generated": 10000,
    "rows_rejected": 0,
    "seed": 0
  },
  "errors": [],
  "warnings": []
}

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | Spec parse error, unknown distribution, infeasible constraints, or output write failure |

Common error codes

| Code | Cause |
| --- | --- |
| PULSE_SYNTH_DISTRIBUTION_UNKNOWN | Spec references a distribution name not in the catalog |
| PULSE_SYNTH_CONSTRAINT_INFEASIBLE | Constraints reject too high a fraction of generated rows |

Examples

# Build sales.pulse from a spec
pulse synth from-schema --spec sales.spec.json --output sales.pulse --seed 42

# Override row count without editing the spec
pulse synth from-schema --spec sales.spec.json --output sales.pulse --rows 1000

# Programmatic envelope
pulse synth from-schema --spec sales.spec.json --output sales.pulse --json

pulse synth from-profile

Audience: CLI users generating a synthetic .pulse cohort whose distributions match a real cohort — typically to share a sanitised replica without exposing the underlying rows.

pulse synth from-profile reads a profile JSON captured by pulse profile create and writes a synthetic .pulse file whose per-field distributions and (optional) pairwise correlations follow the profile. The profile retains no individual rows from the source; only summary statistics.

LLM agents using MCP: see the pulse_synth_from_profile MCP tool and the synthetic-data skill.

Synopsis

pulse synth from-profile --profile FILE --output FILE --rows N
                         [--seed N] [--json]

Flags

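| Flag | Alias | Type | Default | Purpose |
| --- | --- | --- | --- | --- |
| --profile | -p | string | (required) | Profile JSON path |
| --output | -o | string | (required) | Output .pulse file path |
| --rows | | int | (required) | Rows to generate |
| --seed | | int | 0 | Deterministic RNG seed |
| --json | | bool | false | Emit the standard envelope |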

--rows is required (unlike from-schema, which can pull it from the spec) because the profile does not carry a generation count of its own.

Determinism

Same (profile, seed, rows) triple → byte-identical output. Seeds are int64; default 0.

Profile shape

The profile is a synth.Profile JSON object produced by pulse profile create. It carries per-field type, descriptive statistics, top-K categorical entries (default K = 32), optional pairwise correlations (when --include-correlations was passed at profile-creation time), and a row count.

See pulse profile create for how to capture one, and synth/ for the underlying Go types.

Output

Text mode

Generated 1000 rows -> sales.synth.pulse (rejected 0)

--json

Same envelope shape as synth from-schema.

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | Profile parse error, infeasible constraints, or output write failure |

Examples

# Capture once
pulse profile create --input sales.pulse --output sales.profile.json

# Re-generate any number of times with different seeds
pulse synth from-profile --profile sales.profile.json --output sales.s42.pulse --rows 10000 --seed 42
pulse synth from-profile --profile sales.profile.json --output sales.s43.pulse --rows 10000 --seed 43

Limitations

  • Categorical tails: anything past the captured top-K is replaced with a sentinel “other” bucket sized to its observed weight.
  • Correlations: pairwise only, and only between numeric fields. The profile capture flag --include-correlations opts in; without it, fields are generated independently.
  • Decimal and geo fields: regenerated within the same type family but with synthetic value distributions; downstream uses that depend on exact field values (e.g. joinable identifiers) need the schema-driven path instead.
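
The categorical-tail behaviour in the first bullet can be pictured with a short sketch: keep the K most frequent values and fold everything else into a single "other" weight. This is an illustration of the idea only; the topK helper and entry type are hypothetical, and the actual capture logic in synth/ may differ (for example in tie-breaking).

```go
package main

import (
	"fmt"
	"sort"
)

// entry is one retained categorical value and its share of rows.
type entry struct {
	Value  string
	Weight float64
}

// topK keeps the k most frequent values as normalised weights and returns
// the combined weight of everything that was cut off.
func topK(counts map[string]int, k int) ([]entry, float64) {
	if k < 0 {
		k = 0
	}
	total := 0
	all := make([]entry, 0, len(counts))
	for v, c := range counts {
		total += c
		all = append(all, entry{Value: v, Weight: float64(c)})
	}
	if total == 0 {
		return nil, 0
	}
	// Descending by count; ties broken by value so the result is deterministic.
	sort.Slice(all, func(i, j int) bool {
		if all[i].Weight != all[j].Weight {
			return all[i].Weight > all[j].Weight
		}
		return all[i].Value < all[j].Value
	})
	if k > len(all) {
		k = len(all)
	}
	other := 0.0
	for i := range all {
		all[i].Weight /= float64(total)
		if i >= k {
			other += all[i].Weight // tail beyond the top K
		}
	}
	return all[:k], other
}

func main() {
	counts := map[string]int{"east": 40, "west": 40, "north": 15, "south": 5}
	kept, other := topK(counts, 2)
	fmt.Println(kept, other)
}
```

With K = 2 in this example, north and south disappear as named values and survive only as the tail weight, which is why exact dictionaries cannot be recovered from a profile.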

pulse profile create

Audience: CLI users capturing a statistical profile of an existing cohort — typically to feed into pulse synth from-profile.

pulse profile create reads a .pulse file and writes a JSON profile: per-field type, descriptive statistics, top-K categorical entries, optional pairwise correlations. The profile retains no individual rows from the source.

LLM agents using MCP: see the pulse_profile MCP tool.

Synopsis

pulse profile create --input PATH --output PATH
                     [--top-k N] [--include-stats]
                     [--include-correlations] [--correlation-top-k N]
                     [--sample-limit N] [--json]

Flags

| Flag | Alias | Type | Default | Purpose |
| --- | --- | --- | --- | --- |
| --input | -i | string | (required) | Source .pulse cohort |
| --output | -o | string | (required) | Output profile JSON path |
| --top-k | | int | 32 | Top-K categorical entries to retain per field |
| --include-stats | | bool | true | Include percentile / std stats |
| --include-correlations | | bool | false | Capture pairwise numeric correlations |
| --correlation-top-k | | int | 16 | Cap on retained correlation pairs |
| --sample-limit | | int | 0 (unlimited) | Cap rows ingested for the profile (0 disables) |
| --json | | bool | false | Also print the envelope to stdout |

What the profile captures

| Field type | What is recorded |
| --- | --- |
| Numeric (u*, f*, decimal128) | Count, min, max, mean, stddev; percentiles if --include-stats |
| Categorical | Top-K most-frequent values + their frequencies; “other” tail weight |
| date | Min, max, count |
| nullable_* | Null count alongside the above |

What the profile does NOT capture

  • Individual rows.
  • The full categorical dictionary beyond --top-k.
  • Correlations unless --include-correlations is set.

This is by design — profiles are intended to be safe to share with parties who shouldn’t see the underlying data.

Output

The profile JSON is always written to --output. With --json, the envelope is also written to stdout, typically to be piped into jq or another consumer.

Profile schema lives in synth/profile.go and is documented in skills/synthetic-data.md.

Text mode summary

Profiled 50000 rows from sales.pulse -> sales.profile.json

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | Read error, unsupported field type (PULSE_PROFILE_FIELD_UNSUPPORTED), or write failure |

Examples

Minimal profile

pulse profile create --input sales.pulse --output sales.profile.json

Rich profile with correlations

pulse profile create --input sales.pulse --output sales.profile.json \
    --include-stats --include-correlations --top-k 64 --correlation-top-k 32

Sample-limited profile for a huge cohort

pulse profile create --input ops.pulse --output ops.profile.json --sample-limit 1000000

Round-trip with synth

pulse profile create --input sales.pulse --output sales.profile.json
pulse synth from-profile --profile sales.profile.json --output sales.synth.pulse --rows 10000 --seed 1
pulse cohort inspect sales.synth.pulse

pulse mcp

Audience: operators wiring Pulse into an MCP-aware AI client (Claude Desktop, Claude Code, generic MCP clients).

pulse mcp runs the Model Context Protocol server over stdio. The AI client launches pulse mcp as a subprocess, speaks MCP over its stdio streams, and shuts it down on session close.

LLM agents using MCP: the agent-side guide is the mcp-integration skill — fetch it via pulse_skills_get for the tool catalog and request shapes. This page is for the human setting the server up.

Synopsis

pulse mcp [--data-dir PATH] [--bind-on-open]

The command reads stdin, writes MCP responses on stdout, and writes a one-line startup notice (and any subsequent diagnostics) on stderr.

Flags

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| --data-dir | string | from PULSE_DATA_DIR env var | Cohort base directory |
| --bind-on-open | bool | true | Register session-scoped JSON-schema-bound tool variants on successful pulse_inspect |

A data directory is required, supplied either through the --data-dir flag or the PULSE_DATA_DIR environment variable. Without one, the MCP server fails to start:

data directory required: set PULSE_DATA_DIR or pass --data-dir

--bind-on-open

When a session calls pulse_inspect successfully, the server can register session-scoped tool variants whose JSON Schemas constrain field-name parameters to the cohort’s actual fields. This narrows the LLM’s choices and prevents typos at parameter-binding time.

Default: true. Pass --bind-on-open=false if your client binds tool schemas itself.
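
To make the idea concrete, the sketch below builds a JSON Schema in which a field-name parameter is restricted to an enum of the cohort's fields. It is a guess at the general shape, not a copy of the real binding code: the parameter name "field", the toolSchema helper, and the overall schema layout are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bindFieldEnum returns a JSON Schema fragment for a string parameter that
// only accepts the given field names.
func bindFieldEnum(fields []string) map[string]any {
	return map[string]any{
		"type": "string",
		"enum": fields,
	}
}

// toolSchema wraps the bound parameter in an object schema.
// The parameter name "field" is a placeholder chosen for this sketch.
func toolSchema(fields []string) ([]byte, error) {
	schema := map[string]any{
		"type":       "object",
		"properties": map[string]any{"field": bindFieldEnum(fields)},
		"required":   []string{"field"},
	}
	return json.Marshal(schema)
}

func main() {
	// Field names as they might come back from inspecting a cohort.
	out, err := toolSchema([]string{"region", "revenue"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A client validating against such a schema rejects a misspelled field name before the call reaches the engine, which is the benefit the flag is after.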

The binding logic lives in internal/mcp/schema_bind.go; see skills/mcp-integration.md for the LLM-facing implications.

Wiring it into Claude Desktop

~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "pulse": {
      "command": "/usr/local/bin/pulse",
      "args": ["mcp"],
      "env": {
        "PULSE_DATA_DIR": "/var/data/pulse"
      }
    }
  }
}

Restart the client. The Pulse tools (pulse_manifest, pulse_ask, pulse_inspect, pulse_predict, pulse_process, pulse_compose, pulse_sample, pulse_facet, pulse_import, pulse_drop, pulse_imports_list, pulse_examples_search, pulse_examples_get, pulse_errors_lookup, pulse_skills_list, pulse_skills_get) and resources (pulse://*.pulse, pulse-skill://*) appear in the tool/resource list.

Wiring it into Claude Code

~/.claude.json (or per-project .claude.json):

{
  "mcpServers": {
    "pulse": {
      "command": "/usr/local/bin/pulse",
      "args":    ["mcp"],
      "env":     { "PULSE_DATA_DIR": "/var/data/pulse" }
    }
  }
}

The full LLM-side recipe (including resource URIs and the schema binding details) is in skills/mcp-integration.md.

Exit codes

pulse mcp is a long-running process. It exits non-zero only on fatal startup failure (missing data dir, transport error). Once serving, an MCP client controls the lifecycle.

Examples

Foreground run for debugging

PULSE_DATA_DIR=/tmp/pulse-data ./bin/pulse mcp
# Stderr: pulse mcp: serving over stdio (data dir: /tmp/pulse-data, bind-on-open: true)

Disable schema binding

PULSE_DATA_DIR=/tmp/pulse-data ./bin/pulse mcp --bind-on-open=false

Inspect what the server registers

# Manifest exposes the MCP tool list
pulse --json | jq '.data.mcp_tools[]'

See also

  • How LLMs Use Pulse: the pointer table from this site into the skill pack
  • skills/mcp-integration.md: LLM-side wiring, tool catalog, resource schemes, schema binding
  • Deployment: production hardening notes
  • Troubleshooting: common MCP failure modes

Flag Reference

Audience: CLI users who want one page that lists every flag and every environment variable in scope across the binary.

The per-command pages list each command’s full flag set; this page is the cross-cutting reference for flags that appear on multiple commands and for the environment variables Pulse reads.

LLM agents using MCP: there is no LLM-facing skill for the CLI surface. Agents go via MCP tools (pulse_process, pulse_inspect, …) — see skills/mcp-integration.md.

Global flags

Available on the bare pulse invocation:

| Flag | Effect |
| --- | --- |
| --json | Print the root manifest as JSON (envelope-wrapped) |
| --slim | With --json, drop prose descriptions for size-sensitive clients |

Both default to off. pulse --json is the discovery entry point — it emits the manifest documented at pulse manifest.

Environment variables

| Variable | Used by | Required | Purpose |
| --- | --- | --- | --- |
| PULSE_DATA_DIR | All commands when no path override is given; required by pulse mcp | conditionally | Base directory for cohort files. Relative cohort paths resolve against it. |

PULSE_DATA_DIR is the only PULSE_* environment variable today. The Makefile auto-loads a repo-root .env file so you can keep it (and any future env vars) there for development.

When embedding the library, you can bypass the env var entirely by passing pulse.Options{DataDir: "/path"} or pulse.Options{FS: myFs}.

--json envelope

Almost every leaf command accepts --json, which switches output from human prose to a structured envelope. The envelope shape is fixed and documented in CLAUDE.md → Output Format Contract:

{
  "format_version": "1.0",
  "data":     { /* operation-specific result */ },
  "errors":   [ /* {"code": "...", "message": "...", "details": {...}} */ ],
  "warnings": [ /* same shape */ ]
}

format_version is currently "1.0". errors and warnings are always arrays (never null) so JSON consumers can index without nullable-check overhead.
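
A consumer of the envelope might decode it and surface the first reported error as a Go error, as in the sketch below. The struct fields follow the keys documented above; the parseEnvelope helper is hypothetical, and the error message in the sample is invented for illustration (only the code name comes from this site).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// entry is one element of the "errors" or "warnings" array.
type entry struct {
	Code    string         `json:"code"`
	Message string         `json:"message"`
	Details map[string]any `json:"details"`
}

// envelope keeps "data" raw because its shape is operation-specific.
type envelope struct {
	FormatVersion string          `json:"format_version"`
	Data          json.RawMessage `json:"data"`
	Errors        []entry         `json:"errors"`
	Warnings      []entry         `json:"warnings"`
}

// parseEnvelope decodes raw, gates on the format version, and converts the
// first envelope error (if any) into a Go error.
func parseEnvelope(raw []byte) (*envelope, error) {
	var env envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, fmt.Errorf("decode envelope: %w", err)
	}
	if env.FormatVersion != "1.0" {
		return nil, fmt.Errorf("unsupported format_version %q", env.FormatVersion)
	}
	if len(env.Errors) > 0 {
		first := env.Errors[0]
		return &env, fmt.Errorf("%s: %s", first.Code, first.Message)
	}
	return &env, nil
}

func main() {
	// Hand-written sample; the message text is made up.
	failed := `{"format_version":"1.0","data":null,
		"errors":[{"code":"PULSE_SYNTH_DISTRIBUTION_UNKNOWN","message":"no such distribution"}],
		"warnings":[]}`
	_, err := parseEnvelope([]byte(failed))
	fmt.Println(err)
}
```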

Shared per-command flags

Several flags appear on multiple commands with identical semantics.

--no-defaults

Available on: api process, api compose, api ask.

Disable the runtime smart-defaults pass that infers operator Type from the named field’s schema type when the caller omits it. Forces the request to be source-of-truth. See pulse.New & Options for the underlying library option.

--stream

Available on: api process, api compose.

Stream result rows as NDJSON (one row per line) instead of buffering the full result. For compose, each line carries an {"index": N, "row": {...}} shape so consumers know which sub-request produced each row. See Streaming & ProcessStream.
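
On the consuming side, the compose stream can be read line by line and bucketed by sub-request. The sketch below assumes only the {"index": N, "row": {...}} line shape described above; groupByIndex and groupString are hypothetical helpers, and the sample rows are made up.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// composeLine is one NDJSON line from a streamed compose run.
type composeLine struct {
	Index int            `json:"index"`
	Row   map[string]any `json:"row"`
}

// groupByIndex reads NDJSON from r and buckets rows by sub-request index.
// bufio.Scanner caps a line at 64 KiB by default; very wide rows would
// need a larger buffer set with Scanner.Buffer.
func groupByIndex(r io.Reader) (map[int][]map[string]any, error) {
	out := make(map[int][]map[string]any)
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Bytes()
		if len(line) == 0 {
			continue // tolerate blank lines
		}
		var cl composeLine
		if err := json.Unmarshal(line, &cl); err != nil {
			return nil, fmt.Errorf("decode line: %w", err)
		}
		out[cl.Index] = append(out[cl.Index], cl.Row)
	}
	return out, sc.Err()
}

// groupString is a convenience wrapper for in-memory input.
func groupString(s string) (map[int][]map[string]any, error) {
	return groupByIndex(strings.NewReader(s))
}

func main() {
	stream := `{"index": 0, "row": {"region": "east"}}
{"index": 1, "row": {"total": 42}}
{"index": 0, "row": {"region": "west"}}
`
	groups, err := groupString(stream)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(groups[0]), len(groups[1]))
}
```

In practice the reader would be the stdout pipe of the pulse process rather than a string.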

--strict

Available on: api predict.

Treat warnings (e.g. low-quality field description) as errors. Useful in CI gates that want the strictest possible validation.

--full-dict

Available on: cohort inspect.

Print full categorical dictionaries instead of truncating after 100 entries. Pair with --json for programmatic consumption.

--seed / --rows

synth from-schema and synth from-profile both take --seed (deterministic RNG seed) and --rows. On from-schema, --rows overrides the spec’s row_count; on from-profile it is required. See the per-command pages.

Help

Every command supports --help:

pulse --help
pulse api --help
pulse api process --help
pulse mcp --help

--help output is the urfave/cli v3 default — a usage block, description, flag list, and an examples block where applicable.

Cross-references

| If you need… | Go to |
| --- | --- |
| Per-command synopsis & examples | CLI Tour and each cli/ page |
| Library-side equivalents | Library Embedding |
| MCP-side equivalents | How LLMs Use Pulse |
| Envelope and error code semantics | Troubleshooting and skills/error-code-reference.md |

Go API Overview

Audience: Go developers embedding Pulse in a binary or a service.

Pulse is library-first. The CLI in cmd/pulse/ is a thin adapter around the package documented here. If you’re reaching for os/exec to shell out to the binary from Go, stop and use the library directly — you’ll skip a process boundary and gain typed responses.

LLM agents using MCP: there is no LLM-facing skill that covers Go embedding directly. Agents speak MCP; this page is for the programs that host them.

Module path

import "github.com/frankbardon/pulse"

Sub-packages you’ll commonly touch:

| Package | Purpose |
| --- | --- |
| github.com/frankbardon/pulse | Public facade (Pulse, Options, Request, Response, Ask, …) |
| github.com/frankbardon/pulse/types | Request/response structs, component-type constants (AGG_*, …) |
| github.com/frankbardon/pulse/io | Tabular adapter interfaces (Reader, Writer, ImportJob, ExportJob, ConvertJob) |
| github.com/frankbardon/pulse/io/<fmt> | Per-format readers/writers (csv, tsv, ndjson, jsonarray, parquet, arrow, excel) |
| github.com/frankbardon/pulse/fs | afero-backed filesystem config (fs.New, fs.Default, fs.NewMemMap) |
| github.com/frankbardon/pulse/errors | Typed CodedError system and code constants |
| github.com/frankbardon/pulse/descriptor | Manifest, predict, inspect (no-execute operations) |
| github.com/frankbardon/pulse/synth | Synthetic data generator and profile types |
| github.com/frankbardon/pulse/skills | Embedded skill pack: skills.List(), skills.Get(name) |

The internal/ subtree (internal/cli, internal/mcp, internal/query) is exactly that — internal. Don’t import it.

The facade

Construct a Pulse once per process (or per filesystem boundary) and re-use it:

p, err := pulse.New(pulse.Options{
    DataDir: "/var/data/pulse",
})
if err != nil {
    return err
}

The full Options shape (custom afero.Fs, smart-default toggling) is documented at pulse.New & Options.

Public methods

From pulse.go:

| Method | Purpose |
| --- | --- |
| Open(ctx, path) (*Cohort, error) | Read header + schema, return a typed Cohort handle |
| Process(ctx, req) (*Response, error) | Execute one request |
| ProcessStream(ctx, req) (RowIter, error) | Same, pull-based iterator over result rows |
| Compose(ctx, req) ([]*Response, error) | Execute a batch sequentially |
| ComposeParallel(ctx, req, opts) ([]*Response, error) | Execute a batch in parallel with a worker pool |
| Ask(ctx, askReq) (*AskResponse, error) | Unified entry: predict + (optionally) process, with natural-language query support |
| Import(ctx, job) (*ImportReport, error) | Tabular → .pulse |
| Export(ctx, job) (*ExportReport, error) | .pulse → tabular |
| Convert(ctx, job) (*ConvertReport, error) | Tabular → tabular, with .pulse as the transparent middle |
| Inspect(ctx, path) (*InspectResult, error) | Read header + schema only (no record data) |
| Predict(ctx, req) (*PredictResult, error) | Validate a request without executing |
| Sample(ctx, path, n) ([]Record, error) | Up to n rows |
| Facet(ctx, path, field) ([]string, error) | Distinct values of a field |
| Synth(ctx, spec, out, opts) (*SynthResult, error) | Generate a synthetic cohort |
| Profile(ctx, path, opts) (*Profile, error) | Statistical summary suitable for from-profile synthesis |
| Manifest(ctx) *Manifest | Deterministic root self-description |
| Fs() afero.Fs | The underlying filesystem (used by pulse mcp and other embedders) |

Re-exported type aliases let you write pulse.Request instead of types.Request:

type (
    Request         = types.Request
    Response        = types.Response
    ComposedRequest = types.ComposedRequest
    SynthSpec       = synth.Spec
    Profile         = synth.Profile
    // … and so on
)

Minimum viable embed

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/frankbardon/pulse"
    "github.com/frankbardon/pulse/types"
)

func main() {
    ctx := context.Background()

    p, err := pulse.New(pulse.Options{DataDir: "/var/data/pulse"})
    if err != nil {
        log.Fatal(err)
    }

    resp, err := p.Process(ctx, &pulse.Request{
        Cohort: &types.Cohort{Filename: "sales.pulse"},
        Aggregations: []*types.Aggregation{
            {Type: types.AGG_AVERAGE, Field: "revenue", Label: "avg_revenue"},
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(resp.Data)
}

pulse.New & Options

Audience: Go embedders constructing a Pulse instance.

pulse.New(pulse.Options{...}) is the single entry point. There is no config file, no init function, no global state. Every option is declared in code (or comes from PULSE_DATA_DIR when the field is left empty).

LLM agents using MCP: the MCP server constructs its own Pulse instance from CLI flags. Agents don’t see this surface.

The Options struct

From pulse.go:

type Options struct {
    // DataDir is the base directory for cohort files.
    // Defaults to PULSE_DATA_DIR if empty and FS is not set.
    DataDir string

    // FS is an optional custom filesystem.
    // When set, DataDir is ignored for filesystem construction.
    FS afero.Fs

    // DisableDefaults turns off the smart-defaults pass that infers
    // operator Type from the named field's schema type when the caller
    // omits it. Defaults to false (defaults enabled). Predict still
    // computes and reports DefaultsApplied independently — this flag
    // governs only what the runtime mutates on the live request.
    DisableDefaults bool
}

Field reference

DataDir string

The base directory for .pulse files. Relative cohort paths ({"filename": "data.pulse"}) resolve against this directory.

| Source | Result |
| --- | --- |
| Non-empty Options.DataDir | Used directly |
| Empty + FS non-nil | DataDir is ignored; the FS is the trust boundary |
| Empty + FS nil | Pulse falls back to fs.Default(), which reads PULSE_DATA_DIR |

Example:

p, err := pulse.New(pulse.Options{DataDir: "/var/data/pulse"})

FS afero.Fs

A custom afero.Fs implementation. When set, it fully overrides the filesystem layer — DataDir is unused, and PULSE_DATA_DIR is not consulted. Use this for tests (afero.NewMemMapFs()) or non-local backends (S3-backed afero.Fs, encrypted overlays, …).

Example:

import "github.com/spf13/afero"

p, err := pulse.New(pulse.Options{
    FS: afero.NewMemMapFs(),
})

See Custom Filesystems for in-depth usage and the hermetic-test pattern.

DisableDefaults bool

The runtime smart-defaults pass infers an operator’s Type from the named field’s schema type when the caller omits it (e.g. AGG_SUM on a numeric field defaults appropriately; categorical fields default toward AGG_COUNT). Set DisableDefaults = true to require an explicit Type on every aggregation and grouper — useful when you want the request to be source-of-truth and never be silently re-typed.

This option only governs the runtime mutation. predict independently computes and reports DefaultsApplied in its result envelope, so callers can see what would have been inferred even when defaults are disabled.

CLI parity: pulse api process --no-defaults, pulse api compose --no-defaults, pulse api ask --no-defaults.

Defaults at a glance

| Field omitted from Options | Effective behaviour |
| --- | --- |
| DataDir and FS both empty | Pulse calls fs.Default() → reads PULSE_DATA_DIR env var. Errors if unset and the operation needs filesystem access. |
| DataDir only | Uses an afero.NewOsFs() rooted at DataDir. |
| FS only | Uses the provided FS verbatim. |
| Both | FS wins; DataDir is ignored. |
| DisableDefaults omitted | Defaults enabled. |
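
The filesystem precedence in the table can be written down as a small decision function. The sketch below is a model of the documented rules, not Pulse's implementation: the options struct, its HasFS flag (standing in for "FS is non-nil" so no afero import is needed), and resolveSource are all hypothetical, and the real library may defer the missing-directory error until an operation actually needs the filesystem.

```go
package main

import (
	"errors"
	"fmt"
)

// options models the two filesystem-related fields of pulse.Options.
type options struct {
	DataDir string
	HasFS   bool // stands in for "FS is non-nil"
}

// resolveSource reports which filesystem source wins under the rules above.
// env is the value of PULSE_DATA_DIR ("" when unset).
func resolveSource(o options, env string) (string, error) {
	switch {
	case o.HasFS:
		// A custom FS always wins; DataDir and the env var are ignored.
		return "custom FS", nil
	case o.DataDir != "":
		return "OS filesystem rooted at " + o.DataDir, nil
	case env != "":
		return "OS filesystem rooted at " + env, nil
	default:
		return "", errors.New("data directory required")
	}
}

func main() {
	src, err := resolveSource(options{DataDir: "/var/data/pulse", HasFS: true}, "/env/dir")
	fmt.Println(src, err)
}
```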

Re-using a Pulse instance

Pulse is safe for concurrent use across goroutines once constructed. The internal registries are read-only after New; each Process call constructs fresh stateful operators per request, so multiple goroutines can call Process/ProcessStream/Compose in parallel against the same Pulse.

For batch parallelism, prefer ComposeParallel — it shares the read-only registries and bounds concurrency for you.
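
For callers who do fan out by hand, a bounded pattern keeps the number of in-flight calls under control. The sketch below uses a stand-in processFunc with string placeholders instead of the real request and response types (a closure can capture the Pulse instance and context); runBounded is a hypothetical helper and says nothing about how ComposeParallel is implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// processFunc stands in for a call such as p.Process; the string types are
// placeholders for the real request and response structs.
type processFunc func(req string) (string, error)

// runBounded calls process for every request with at most limit calls in
// flight, and returns results and errors in request order.
func runBounded(process processFunc, reqs []string, limit int) ([]string, []error) {
	if limit < 1 {
		limit = 1
	}
	results := make([]string, len(reqs))
	errs := make([]error, len(reqs))
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i, req := range reqs {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit calls are already running
		go func(i int, req string) {
			defer wg.Done()
			defer func() { <-sem }()
			// Each goroutine writes only its own slot, so no lock is needed.
			results[i], errs[i] = process(req)
		}(i, req)
	}
	wg.Wait()
	return results, errs
}

func main() {
	upper := func(req string) (string, error) { return "done:" + req, nil }
	results, _ := runBounded(upper, []string{"a", "b", "c"}, 2)
	fmt.Println(results)
}
```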

Tearing down

There is no explicit Close() method on Pulse. The filesystem is a borrowed handle; if you supply a custom FS, the embedder is responsible for any cleanup that FS requires. Streaming consumers should still call RowIter.Close() so that the underlying readers release their buffers.

pulse.Ask — Unified Entry Point

Audience: Go embedders who want a single call that validates a request and then optionally executes it.

Ask is the one-shot facade. It collapses predict, process, and the natural-language query parser into a single typed call. The MCP server uses this same method internally for the pulse_ask tool.

LLM agents using MCP: the corresponding LLM-facing surface is the pulse_ask MCP tool, documented in skills/mcp-integration.md and skills/request-recipes.md.

When to use Ask vs Process

| Goal | Reach for |
| --- | --- |
| Validate a request without running it | Predict (or Ask{Predict: true}) |
| Validate then execute in one call | Ask |
| Translate a natural-language string into a request and execute | Ask with Query set |
| Execute a request you’ve already validated separately | Process (lower overhead) |

If you’re already inside a tight loop that validates once and runs many similar requests, prefer Process: Ask does the predict pass on every call.

Request shape

From pulse.go:

type AskRequest struct {
    File      string         `json:"file,omitempty"`
    Request   *types.Request `json:"request,omitempty"`
    Query     string         `json:"query,omitempty"`
    OnInvalid string         `json:"on_invalid,omitempty"`
    Predict   bool           `json:"predict,omitempty"`
}

| Field | Meaning |
| --- | --- |
| File | Cohort path. When set and Request.Cohort is nil, Ask synthesises a Cohort from the path. |
| Request | Structured types.Request. Optional when Query is set; the parser fills empty slots. |
| Query | Natural-language query string (“average revenue by region”). Parsed against the cohort’s schema. |
| OnInvalid | "abort" (default) returns a SERVICE_VALIDATION error on predict-invalid; "suggest" returns the response with Suggestions populated. |
| Predict | When true, skip execution after a successful predict. The “what would happen if I ran this” probe. |

Response shape

type AskResponse struct {
    FormatVersion   string                      `json:"format_version"`
    Predict         *descriptor.PredictResult   `json:"predict"`
    Process         *Response                   `json:"process,omitempty"`
    Suggestions     []errors.Fixup              `json:"suggestions,omitempty"`
    QueryResolution *QueryResolution            `json:"query_resolution,omitempty"`
    Errors          []*descriptor.EnvelopeEntry `json:"errors"`
    Warnings        []*descriptor.EnvelopeEntry `json:"warnings"`
}

  • Predict is always populated.
  • Process is set only when execution ran.
  • Suggestions is populated only when predict reported invalid and OnInvalid == "suggest".
  • QueryResolution is set only when Query was non-empty; it echoes the parser’s matched fields and aggregate confidence in [0, 1].

Examples

Structured request, predict-only

resp, err := p.Ask(ctx, &pulse.AskRequest{
    Request: &pulse.Request{
        Cohort: &types.Cohort{Filename: "sales.pulse"},
        Aggregations: []*types.Aggregation{
            {Type: types.AGG_SUM, Field: "revenue", Label: "total"},
        },
    },
    Predict: true,
})

Natural-language query

resp, err := p.Ask(ctx, &pulse.AskRequest{
    File:  "sales.pulse",
    Query: "average revenue by region",
})
if err != nil {
    log.Fatal(err)
}
// QueryResolution is set here because Query was non-empty.
fmt.Printf("matched: %v (conf %.2f)\n",
    resp.QueryResolution.MatchedFields,
    resp.QueryResolution.Confidence)

The parser fills the structured request from the query and runs Process. Explicit fields in Request always win on collision — the parser only fills empty slots.

Query plus a partial structured request

resp, err := p.Ask(ctx, &pulse.AskRequest{
    File: "sales.pulse",
    Request: &pulse.Request{
        Filterers: []*types.Filterer{
            {Type: types.FILTER_RANGE, Field: "revenue", Values: []string{"100", "1000"}},
        },
    },
    Query: "average revenue by region",
})

The structured Filterers win; the parser supplies Aggregations and Groups from the query.

Suggest fixups instead of erroring

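resp, err := p.Ask(ctx, &pulse.AskRequest{
    Request:   req,
    OnInvalid: "suggest",
})
if err != nil {
    log.Fatal(err)
}
// Suggestions is populated only when predict reported the request invalid.
for _, fix := range resp.Suggestions {
    fmt.Println(fix.Code, fix.Message, fix.Hint)
}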

Fixup templates live in errors/fixup_metadata.go and are documented per code in skills/error-code-reference.md.

Errors and warnings

AskResponse.Errors and AskResponse.Warnings flatten the descriptor envelope’s entries plus any issues the query parser raised (PULSE_QUERY_UNRESOLVED, PULSE_QUERY_AMBIGUOUS). The arrays are always present (never nil) so JSON consumers can index without null-checks — same shape as the descriptor envelope.

FormatVersion mirrors the descriptor envelope version ("1.0") so callers can gate on a single value across endpoints.

Custom Filesystems

Audience: Go embedders running Pulse in tests (hermetic, no disk), in cloud-storage-backed environments (S3, GCS, Azure Blob via afero), or behind a custom storage layer.

Pulse routes all file I/O through afero.Fs. Pass any afero.Fs-conformant filesystem to pulse.New(pulse.Options{FS: ...}) and Pulse never touches the OS filesystem directly.

LLM agents using MCP: the MCP server’s filesystem is fixed at startup via PULSE_DATA_DIR or --data-dir. Agents don’t swap filesystems mid-session.

In-memory testing pattern

The single most common reason to override the filesystem is hermetic tests. Use fs.NewMemMap() (which wraps afero.NewMemMapFs() with the right config) or pass the afero filesystem directly:

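import (
    "testing"

    "github.com/frankbardon/pulse"
    "github.com/spf13/afero"
)

func TestSomething(t *testing.T) {
    // A fresh in-memory filesystem per test keeps tests isolated.
    p, err := pulse.New(pulse.Options{FS: afero.NewMemMapFs()})
    if err != nil {
        t.Fatal(err)
    }
    _ = p // placeholder so the snippet compiles until p is used below

    // Write a .pulse file into the in-memory FS, then process it.
    // ...
}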

The in-memory FS persists for the life of the FS reference. Create a fresh one per test for isolation.

Custom storage backends

Anything that implements afero.Fs works. Common patterns:

  • S3 / GCS / Azure Blob — via community afero adapters (afero/gcsfs, afero/s3).
  • Encrypted overlays — wrap a base FS with envelope encryption per file.
  • Read-only mounts: afero.NewReadOnlyFs(base), for production cohort serving where any write would be a mistake rather than intended behaviour.

Example with a hypothetical S3 wrapper:

import (
    "log"

    "github.com/frankbardon/pulse"
    "example.com/myorg/aferos3" // hypothetical adapter; substitute your own
)

func main() {
    s3fs := aferos3.New(aferos3.Config{
        Bucket: "my-pulse-cohorts",
        Region: "us-east-1",
    })
    p, err := pulse.New(pulse.Options{FS: s3fs})
    if err != nil {
        log.Fatal(err)
    }
    _ = p // p reads and writes cohort files from S3 transparently.
}

The fs package

The lower-level constructors live in fs/:

| Function | Purpose |
| --- | --- |
| fs.New(opts ...Option) (*fs.Config, error) | Build a config with fs.WithFs(...) / fs.WithDataDir(...) |
| fs.Default() (*fs.Config, error) | Read PULSE_DATA_DIR from the environment |
| fs.NewMemMap() *fs.Config | In-memory test config |

You can also bypass pulse.Options entirely and construct a service from a *fs.Config, but the public facade is the intended entry point. pulse.New(pulse.Options{FS: yourFs}) covers every embedding case.

Path resolution

Pulse resolves a Cohort to a path with this rule (see resolveCohortPath in pulse.go):

if cohort.DataDir != "" → "<DataDir>/<Filename>"
else                    → "<Filename>"

The custom FS is then asked to open that path. For an afero.MemMapFs, an absolute-looking path like /var/data/sales.pulse is just a key in the in-memory map — no need to mirror the OS layout.
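The rule is small enough to restate as a sketch. The Cohort struct and resolvePath function below are hypothetical stand-ins that mirror only the two-branch rule shown above; they are not the library's resolveCohortPath.

```go
package main

import "fmt"

// Cohort is a hypothetical stand-in carrying only the two fields the rule reads.
type Cohort struct {
	DataDir  string
	Filename string
}

// resolvePath restates the documented rule: join DataDir and Filename with a
// slash when DataDir is set, otherwise use Filename unchanged.
func resolvePath(c Cohort) string {
	if c.DataDir != "" {
		return c.DataDir + "/" + c.Filename
	}
	return c.Filename
}

func main() {
	fmt.Println(resolvePath(Cohort{DataDir: "/var/data", Filename: "sales.pulse"}))
	fmt.Println(resolvePath(Cohort{Filename: "sales.pulse"}))
}
```

With an in-memory filesystem, whatever string this produces is simply the key the file was stored under.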

What custom filesystems do NOT do

  • Pulse never falls back to os.Open if the custom FS fails. The custom FS is the only filesystem; if it errors, that error propagates verbatim.
  • The MCP server (pulse mcp) currently uses afero.NewOsFs() only. Custom filesystems are a library-side capability today.
  • The Go race detector and go test -race work normally with in-memory filesystems; tests can run highly concurrently without fighting over a real directory.

Streaming & ProcessStream

Audience: Go embedders feeding rows into an HTTP response, an NDJSON pipeline, or any consumer that wants result rows one at a time instead of buffering the full set.

pulse.ProcessStream returns a pull-based iterator. The API is stable regardless of whether the underlying request shape streams inside the engine — non-streamable requests return the same iterator, they just buffer once internally before yielding.

LLM agents using MCP: see skills/request-recipes.md for the MCP-side streaming surface (pulse_process with the streaming option). The Streamable predicate is the same on both surfaces.

The iterator API

type RowIter = service.RowIter

// In service:
type RowIter interface {
    Next(ctx context.Context) (Row, bool, error)
    Close() error
    Metadata() *ResponseMetadata
}

type Row = service.Row // map[string]any

Usage:

iter, err := p.ProcessStream(ctx, req)
if err != nil {
    return err
}
defer iter.Close()

for {
    row, ok, err := iter.Next(ctx)
    if err != nil {
        return err
    }
    if !ok {
        break
    }
    // … emit row …
}

meta := iter.Metadata() // available after drain

Metadata() returns the full ResponseMetadata (total rows, filtered rows, cohort file) once the iterator has been drained.

What actually streams

ProcessStream always returns an iterator, but the engine only avoids the buffered intermediate row set for a subset of request shapes. Run pulse api predict (or Predict from the library) and check the Streamable flag in the result:

pred, err := p.Predict(ctx, req)
if err != nil {
    return err
}
if !pred.Streamable {
    // Each reason names a request feature that forces the buffered path.
    for _, reason := range pred.StreamableReasons {
        log.Printf("buffered because: %s", reason)
    }
}

The streaming-eligible request shapes are listed in Performance Notes → Streaming path.

The complement — the request shapes that force the buffered path — is at Performance Notes → Buffered path.

Streamable=false doesn’t mean the iterator is broken; it just means rows materialise inside the engine before Next yields them. The output API is identical either way.

CLI parity

pulse api process --stream writes NDJSON to stdout, one row per line. pulse api compose --stream does the same with an index field per row identifying which sub-request produced it.

Cancellation

Every Next call accepts a context. Cancellation propagates to the underlying reader; rows that are already in flight may still be returned before Next returns (_, false, ctx.Err()). Close() releases any reader resources and is safe to call multiple times.

Backpressure

The iterator is pull-based: the engine produces rows only as fast as the consumer calls Next. For HTTP responders that flush periodically, this means you can stream a multi-GB result set through a constant-memory buffer.
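As a rough illustration of the pull loop feeding an NDJSON writer, here is a self-contained sketch. It redeclares Row and a trimmed RowIter locally (Metadata is omitted), and sliceIter, writeNDJSON, and render are hypothetical helpers rather than Pulse APIs; a real caller would pass the iterator returned by ProcessStream.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Row and RowIter restate the documented shapes locally (Metadata omitted).
type Row = map[string]any

type RowIter interface {
	Next(ctx context.Context) (Row, bool, error)
	Close() error
}

// sliceIter is a hypothetical in-memory iterator used only for this demo.
type sliceIter struct {
	rows []Row
	pos  int
}

func (s *sliceIter) Next(ctx context.Context) (Row, bool, error) {
	if err := ctx.Err(); err != nil {
		return nil, false, err
	}
	if s.pos >= len(s.rows) {
		return nil, false, nil
	}
	r := s.rows[s.pos]
	s.pos++
	return r, true, nil
}

func (s *sliceIter) Close() error { return nil }

// writeNDJSON pulls one row, encodes it, and only then asks for the next,
// so at most one row is held by this function at a time.
func writeNDJSON(ctx context.Context, w io.Writer, it RowIter) (int, error) {
	defer it.Close()
	enc := json.NewEncoder(w) // Encode appends a newline after each value
	n := 0
	for {
		row, ok, err := it.Next(ctx)
		if err != nil {
			return n, err
		}
		if !ok {
			return n, nil
		}
		if err := enc.Encode(row); err != nil {
			return n, err
		}
		n++
	}
}

// render drains the rows into a string; it exists for the demo and for tests.
func render(rows []Row) (string, int, error) {
	var sb strings.Builder
	n, err := writeNDJSON(context.Background(), &sb, &sliceIter{rows: rows})
	return sb.String(), n, err
}

func main() {
	out, n, err := render([]Row{
		{"region": "east", "total": 10},
		{"region": "west", "total": 7},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
	fmt.Println("rows:", n)
}
```

In an HTTP handler the io.Writer would be the response writer, flushed every so often; the loop itself stays the same.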

For pipelines that want to fan rows out across goroutines, copy each row into your own struct before processing — Row is map[string]any and the engine may re-use the backing data after Next returns. Treat it as borrowed.
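One way to take ownership of a borrowed row is to extract the fields you need into a plain struct before handing it to another goroutine. In this sketch the Sale type and the "region" and "total" field names are invented for illustration; the Go types behind real row values depend on the cohort schema.

```go
package main

import (
	"errors"
	"fmt"
)

type Row = map[string]any

// Sale is a hypothetical owned copy of the fields a consumer cares about.
type Sale struct {
	Region string
	Total  float64
}

// toSale copies values out of the borrowed map. Strings and floats are
// copied by value, so the result no longer depends on the map.
func toSale(r Row) (Sale, error) {
	region, ok := r["region"].(string)
	if !ok {
		return Sale{}, errors.New("region: missing or not a string")
	}
	total, ok := r["total"].(float64)
	if !ok {
		return Sale{}, errors.New("total: missing or not a float64")
	}
	return Sale{Region: region, Total: total}, nil
}

func main() {
	row := Row{"region": "east", "total": 12.5}
	s, err := toSale(row)
	if err != nil {
		panic(err)
	}
	row["region"] = "overwritten" // simulate the engine recycling the map
	fmt.Printf("%+v\n", s)        // the owned copy is unaffected
}
```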

Inside the engine

Under the hood, ProcessStream calls one of four orchestrator modes depending on the request shape: single-pass streaming, grouped streaming, two-pass streaming, or the buffered fallback. The choice is made via processing.CanStreamRequest(req, schema), which is the same predicate Predict.Streamable reports — this parity is enforced by TestPredict_Streamable_MatchesRuntime.

If you find a request that predict says is streamable but Next materialises something large, that’s a parity drift and a bug — please report it with the request JSON.

Parallel Compose

Audience: Go embedders running multiple requests concurrently against the same cohort or set of cohorts.

pulse.ComposeParallel fans a ComposedRequest across a bounded worker pool. Workers share the engine’s read-only registries; each Process call constructs fresh stateful operators per request, so concurrent execution is safe.

LLM agents using MCP: the MCP server today exposes pulse_compose as a sequential operation. Parallelism is a library-side capability.

When to use

| Goal | Reach for |
|---|---|
| Single request, single result | Process |
| Single request, pulled as rows | ProcessStream |
| Batch of independent requests, in order, sequential | Compose |
| Batch of independent requests, in parallel, with bounded workers | ComposeParallel |

Order of results is preserved regardless of completion order — a worker that finishes early is held until its slot’s index is the next to emit. So callers can index responses[i] against req.Requests[i] directly.
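The index-preserving property can be sketched with the standard library alone. The code below is not Pulse's implementation: it collects results into a pre-sized slice rather than holding workers back, and runOrdered, sleeper, and demo are hypothetical names. It shows why responses[i] lines up with the i-th request even when jobs finish out of order.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type job func(context.Context) (string, error)

// runOrdered runs jobs on at most `workers` goroutines and stores each
// result at the index of the job that produced it.
func runOrdered(ctx context.Context, workers int, jobs []job) ([]string, []error) {
	if workers < 1 {
		workers = 1
	}
	results := make([]string, len(jobs))
	errs := make([]error, len(jobs))
	sem := make(chan struct{}, workers) // counting semaphore bounds concurrency
	var wg sync.WaitGroup
	for i, j := range jobs {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` jobs are in flight
		go func(i int, j job) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i], errs[i] = j(ctx) // each goroutine owns slot i only
		}(i, j)
	}
	wg.Wait()
	return results, errs
}

// sleeper returns a job that finishes after ms milliseconds.
func sleeper(name string, ms int) job {
	return func(ctx context.Context) (string, error) {
		select {
		case <-time.After(time.Duration(ms) * time.Millisecond):
			return name, nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// demo runs three jobs whose completion order differs from submission order.
func demo() ([]string, []error) {
	jobs := []job{sleeper("a", 30), sleeper("b", 10), sleeper("c", 20)}
	return runOrdered(context.Background(), 2, jobs)
}

func main() {
	res, _ := demo()
	fmt.Println(res) // [a b c]
}
```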

ComposeOptions

From service/compose_parallel.go, re-exported as pulse.ComposeOptions:

type ComposeOptions struct {
    // MaxWorkers caps concurrent in-flight Process calls. Zero means
    // runtime.GOMAXPROCS; negatives clamp to 1.
    MaxWorkers int

    // PerRequestTimeout, if positive, derives a context.WithTimeout for
    // each request.
    PerRequestTimeout time.Duration

    // FailFast cancels in-flight siblings on the first request error.
    // Defaults to true. Set false to aggregate all errors instead.
    FailFast bool
}
| Field | Default | Notes |
|---|---|---|
| MaxWorkers | runtime.GOMAXPROCS(0) | 0 resolves to GOMAXPROCS; negatives clamp to 1 |
| PerRequestTimeout | unlimited | When positive, each worker derives context.WithTimeout |
| FailFast | true | First error cancels siblings and returns immediately |

Example

ctx := context.Background()

composed := &pulse.ComposedRequest{
    Requests: []*pulse.Request{req1, req2, req3, req4},
}

resps, err := p.ComposeParallel(ctx, composed, pulse.ComposeOptions{
    MaxWorkers:        4,
    PerRequestTimeout: 30 * time.Second,
    FailFast:          true,
})
if err != nil {
    return err
}

for i, resp := range resps {
    fmt.Printf("request %d: %d rows\n", i, len(resp.Data))
}

FailFast semantics

With FailFast = true (the default):

  • The first request to return an error cancels the shared context.
  • In-flight siblings observe cancellation via ctx.Err() and return early.
  • ComposeParallel returns (nil, theFirstError).

With FailFast = false:

  • Every request runs to completion (or its own per-request timeout).
  • Errors are aggregated into a single SERVICE_INTERNAL error whose details map carries failed_indices (a list of slot indices that errored).
  • Successful slots populate the returned response array; failed slots are nil at their index.

CLI parity

pulse api compose --request batch.json --parallel 4
pulse api compose --request batch.json --parallel 4 --no-fail-fast

--parallel N:

  • 1 (default) → sequential Compose.
  • 0 → runtime.GOMAXPROCS.
  • > 1 → exactly that many workers.

--no-fail-fast mirrors FailFast = false.

Performance considerations

  • Each worker performs its own filesystem reads. If your cohort lives on slow remote storage, parallelism amortises latency well; on local SSD the gain is smaller and CPU-bound.
  • Streaming aggregations are CPU-friendly: ComposeParallel over a pool of streaming requests scales near-linearly with the worker count.
  • Buffered request shapes (window operators, median, …) hold memory per request. Watch MaxWorkers × per_request_peak_memory.
  • The internal registries are read-only and shared across workers with no locking; only the per-request operator instances are fresh allocations.

Safety

  • Pulse is safe for concurrent use after New.
  • Per-request operator state (running sums, dictionaries, sorted buffers) is allocated fresh inside each Process call.
  • The afero.Fs you supply must itself be safe for concurrent reads — every shipped backend (OsFs, MemMapFs) is.

Header Layout

Audience: anyone reading or writing .pulse files by hand (forensics, custom readers, debugging a truncated file). The Go library handles all of this for you; this page documents the wire format.

The header is fixed-size: 9 bytes, consisting of an 8-byte magic identifier and a 1-byte format version.

LLM agents using MCP: see the cohort-schema-design skill via pulse_skills_get. It speaks in field-type semantics rather than byte layout; this page covers the bytes.

Constants

These live in encoding/header.go:

| Name | Value | Purpose |
|---|---|---|
| MagicBytes | []byte{'P','U','L','S','E', 0x00, 0x00, 0x00} | 8-byte identifier; rejects non-Pulse files |
| FormatVersion | 0x01 (today) | Current .pulse wire format |
| HeaderSize | 9 | Total header byte count |

Byte layout

Offset  Length  Field
------  ------  -----
0       8       Magic: "PULSE\0\0\0"
8       1       Format version (currently 0x01)
9       —       Schema block begins here

That’s the entire fixed header. The schema block immediately follows; see Schema Block.
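For a custom reader, the header check comes down to a few comparisons. The sketch below uses only the constants listed above; checkHeader is a hypothetical name, and its error strings are illustrative rather than the library's ENCODING_INVALID envelope.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// Values taken from the constants table above.
var magic = []byte{'P', 'U', 'L', 'S', 'E', 0x00, 0x00, 0x00}

const (
	headerSize    = 9
	formatVersion = 0x01
)

// checkHeader validates the fixed 9-byte header and returns the version byte.
func checkHeader(b []byte) (byte, error) {
	if len(b) < headerSize {
		return 0, errors.New("truncated header")
	}
	if !bytes.Equal(b[:8], magic) {
		return 0, errors.New("bad magic: not a .pulse file")
	}
	if b[8] != formatVersion {
		return b[8], fmt.Errorf("unsupported pulse format version 0x%02x", b[8])
	}
	return b[8], nil
}

func main() {
	file := []byte{'P', 'U', 'L', 'S', 'E', 0, 0, 0, 0x01, 0x02, 0x00}
	v, err := checkHeader(file)
	fmt.Println(v, err)
	// The schema block would start at file[headerSize:].
}
```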

Version semantics

The format version is single-byte. The reader at encoding.ReadHeader rejects unknown versions with the ENCODING_INVALID error code:

ENCODING_INVALID: unsupported pulse format version
{"version": <byte>}

This is the fail-loud guard against silently mis-decoding a file written by a future binary that introduced a new field type or layout change. A forward-incompatible change bumps the version; the older reader stops at header parse instead of producing wrong rows.

The current value is 0x01. The envelope format_version ("1.0") that all CLI --json output carries is unrelated — it tracks the JSON output schema, not the binary file format.

Hexdump sanity check

A freshly-written .pulse file starts with:

00000000  50 55 4c 53 45 00 00 00  01  ..  ..  ..  ..  ..
          |P  U  L  S  E  \0 \0 \0|ver| schema starts here

If running file path/to/data.pulse reports only “data” and the first nine bytes don’t match the above, the file is either truncated or corrupted; see Troubleshooting.

What comes next

The schema block follows the header. Read it as documented in Schema Block; it carries per-field descriptors, inline categorical dictionaries, and decimal/H3 metadata. After the schema, fixed-width records start — see Record Layout.

Field Types

Audience: anyone designing a cohort schema, decoding a .pulse file by hand, or trying to understand which type to pick for a column.

Pulse supports 17 field types, each with a fixed type byte, a fixed (or bit-packed) byte size, and well-defined semantics. The full list, mirrored from CLAUDE.md → All 17 field types:

LLM agents using MCP: see the cohort-schema-design skill via pulse_skills_get — it covers nullability, bit-packing trade-offs, and “which type to pick” with MCP-side examples.

The catalog

| Type | Byte value | ByteSize | Notes |
|---|---|---|---|
| u8 | 0 | 1 | Unsigned 8-bit integer |
| u16 | 1 | 2 | Unsigned 16-bit integer |
| u32 | 2 | 4 | Unsigned 32-bit integer |
| u64 | 3 | 8 | Unsigned 64-bit integer |
| f32 | 4 | 4 | 32-bit IEEE 754 float |
| f64 | 5 | 8 | 64-bit IEEE 754 float |
| nullable_bool | 6 | 0 | Bit-packed tri-state (null/true/false) |
| nullable_u4 | 7 | 0 | Bit-packed, 4-bit nullable unsigned |
| nullable_u8 | 8 | 1 | Nullable 8-bit unsigned |
| nullable_u16 | 9 | 2 | Nullable 16-bit unsigned |
| date | 10 | 4 | Date as 32-bit value |
| packed_bool | 11 | 0 | Bit-packed boolean |
| categorical_u8 | 12 | 1 | Categorical with up to 256 dictionary entries |
| categorical_u16 | 13 | 2 | Categorical with up to 65,536 entries |
| categorical_u32 | 14 | 4 | Categorical with up to 4,294,967,295 entries |
| decimal128 | 15 | 16 | Fixed-point exact decimal; per-field (precision, scale) ≤ (38, 38) |
| nullable_decimal128 | 16 | 16 | decimal128 plus an INT128_MIN null sentinel |

The Go source-of-truth for this table is encoding/field_type.go; the FieldType enum’s iota order is the byte-value order above.
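A non-Go reader needs the same mapping in its own code. The sketch below rebuilds the catalog as an iota enum with a ByteSize method, working only from the table above; the constant names and the Known helper are illustrative and may not match encoding/field_type.go.

```go
package main

import "fmt"

// FieldType restates the catalog; iota order equals the on-disk byte value.
type FieldType uint8

const (
	U8 FieldType = iota // 0
	U16
	U32
	U64
	F32
	F64
	NullableBool
	NullableU4
	NullableU8
	NullableU16
	Date
	PackedBool
	CategoricalU8
	CategoricalU16
	CategoricalU32
	Decimal128
	NullableDecimal128 // 16
)

// Known reports whether t is one of the 17 catalogued type bytes.
func (t FieldType) Known() bool { return t <= NullableDecimal128 }

// ByteSize returns whole bytes per value; 0 marks a bit-packed type.
// Callers should check Known first, since unknown bytes also yield 0 here.
func (t FieldType) ByteSize() int {
	switch t {
	case U8, NullableU8, CategoricalU8:
		return 1
	case U16, NullableU16, CategoricalU16:
		return 2
	case U32, F32, Date, CategoricalU32:
		return 4
	case U64, F64:
		return 8
	case Decimal128, NullableDecimal128:
		return 16
	}
	return 0
}

func main() {
	fmt.Println(Date, Date.ByteSize())
	fmt.Println(PackedBool, PackedBool.ByteSize())
	fmt.Println(FieldType(42).Known())
}
```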

Type families

Plain integers and floats

u8, u16, u32, u64, f32, f64. Standard little-endian encoding, full range, no null sentinel. Use these when you know the column never carries a missing value.

Nullable integers

nullable_u8, nullable_u16, nullable_u4, nullable_bool. Each reserves one in-band value (or one in-band bit pattern) to mean “null”. For the byte-sized variants the encoding is straightforward; for the sub-byte variants (nullable_u4, nullable_bool, packed_bool) Pulse packs multiple fields into shared bytes — see Record Layout → Bit-packing.

ByteSize() returns 0 for the bit-packed types because they don’t allocate whole bytes of their own; the schema reader uses BitPosition to locate them within shared bytes.

Date

date is a 32-bit count of days since the Unix epoch. The range is ~5.8 million years on either side of 1970 — effectively unbounded for real data.
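The range figure follows from 2^31 days divided by roughly 365.24 days per year, which is about 5.88 million years. Converting a stored day count to a calendar date is a one-liner, sketched below. The dateFromDays helper is hypothetical, and it takes a signed count because the "either side of 1970" range implies one; that signedness is an assumption, not something this page pins down.

```go
package main

import (
	"fmt"
	"time"
)

// dateFromDays converts a day count relative to 1970-01-01 (UTC) to a time.
// Negative counts land before the epoch.
func dateFromDays(days int) time.Time {
	epoch := time.Date(1970, time.January, 1, 0, 0, 0, 0, time.UTC)
	return epoch.AddDate(0, 0, days)
}

func main() {
	fmt.Println(dateFromDays(0).Format("2006-01-02"))     // 1970-01-01
	fmt.Println(dateFromDays(19723).Format("2006-01-02")) // 2024-01-01
}
```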

Categoricals

categorical_u8, categorical_u16, categorical_u32. Each stores its string-to-ID mapping inline as a dictionary block immediately after the field’s schema entry. Pick the smallest variant that fits your cardinality (Pulse’s import path auto-selects during inference).

Dictionary mechanics are documented in Dictionary Blocks.

Decimal128

decimal128 and nullable_decimal128 are 16-byte fixed-point decimal numbers. Each field carries a per-field (precision, scale) pair written into the schema after the description; precision and scale both top out at 38 (PULSE_DECIMAL_OVERFLOW, PULSE_DECIMAL_PRECISION_LOSS).

Use these for currency and any other column where IEEE-754 rounding is not acceptable. See the financial-cohorts skill for full semantics including banker’s rounding and divide-by-zero policy.

Unknown type bytes

The schema reader rejects unknown FieldType bytes at parse time with ENCODING_INVALID. This is the same fail-loud strategy as the header version check: a file written by a future binary that introduced a new type fails immediately at schema parse, not later during row decode where the corruption could go unnoticed.

What you can do with each type

| Concern | Source |
|---|---|
| Which aggregators are meaningful on which types | skills/aggregation-guide.md (LLM) / api process (CLI) |
| Decimal arithmetic semantics | skills/financial-cohorts.md (LLM) |
| Categorical dictionary limits | Dictionary Blocks |

Schema Block

Audience: anyone decoding a .pulse file by hand or writing a non-Go reader. The schema block follows the 9-byte header and carries one descriptor per column.

This page draws on the byte-layout invariants for .pulse files in CLAUDE.md and on the on-disk format documented in encoding/schema.go.

Top-level shape

u16 field_count
field_record × field_count

Each field_record is variable-width (it includes UTF-8 name and description strings, and may include a categorical dictionary or decimal/H3 metadata). The reader walks them sequentially.

Per-field record

In write order — see WriteSchema / ReadSchema in encoding/schema.go:

| # | Field | Size | Encoding |
|---|---|---|---|
| 1 | type | 1 byte | FieldType byte (see Field Types) |
| 2 | name_length | 2 bytes | u16 little-endian |
| 3 | name | name_length bytes | UTF-8 |
| 4 | byte_offset | 4 bytes | u32 LE; offset within a record |
| 5 | bit_position | 1 byte | u8; bit position within byte_offset (bit-packed types only) |
| 6 | csv_column_idx | 2 bytes | u16 LE; source column index at import time |
| 7 | description | 2 bytes length + UTF-8 | Capped at 1000 bytes (PULSE_IMPORT_DESCRIPTION_TOO_LONG) |
| 8 | (decimal only) precision | 1 byte | decimal128 and nullable_decimal128 only |
| 9 | (decimal only) scale | 1 byte | same |
| 10 | (categorical only) dictionary | variable | See Dictionary Blocks |

Order matters: every reader walks these in the listed order, so a malformed record stops the parse with ENCODING_INVALID.
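To make the walk concrete, here is a sketch that reads items 1 through 7 of a single field record in the listed order. It stops before the conditional trailers (precision/scale and the dictionary). The fieldPrefix struct and the function names are hypothetical; only the sizes and ordering come from the table above.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// fieldPrefix holds the unconditional part of one field record.
type fieldPrefix struct {
	Type        byte
	Name        string
	ByteOffset  uint32
	BitPosition uint8
	CSVColumn   uint16
	Description string
}

// readString16 reads a u16 little-endian length followed by that many bytes.
func readString16(r io.Reader) (string, error) {
	var n uint16
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return "", err
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

// readFieldPrefix consumes items 1-7 in write order; any short read aborts.
func readFieldPrefix(r io.Reader) (fieldPrefix, error) {
	var f fieldPrefix
	var err error
	if err = binary.Read(r, binary.LittleEndian, &f.Type); err != nil {
		return f, err
	}
	if f.Name, err = readString16(r); err != nil {
		return f, err
	}
	if err = binary.Read(r, binary.LittleEndian, &f.ByteOffset); err != nil {
		return f, err
	}
	if err = binary.Read(r, binary.LittleEndian, &f.BitPosition); err != nil {
		return f, err
	}
	if err = binary.Read(r, binary.LittleEndian, &f.CSVColumn); err != nil {
		return f, err
	}
	if f.Description, err = readString16(r); err != nil {
		return f, err
	}
	return f, nil
}

// parsePrefix is a convenience wrapper over a byte slice.
func parsePrefix(b []byte) (fieldPrefix, error) {
	return readFieldPrefix(bytes.NewReader(b))
}

func main() {
	raw := []byte{
		0x02,           // type: u32
		0x02, 0x00, 'i', 'd', // name_length + name
		0x04, 0x00, 0x00, 0x00, // byte_offset = 4
		0x00,       // bit_position
		0x01, 0x00, // csv_column_idx = 1
		0x03, 0x00, 'k', 'e', 'y', // description
	}
	f, err := parsePrefix(raw)
	fmt.Printf("%+v %v\n", f, err)
}
```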

Byte offsets and bit positions

byte_offset is the offset of this field’s first byte within a record. For bit-packed types (packed_bool, nullable_bool, nullable_u4), byte_offset plus bit_position together locate the field’s bits within a byte that may be shared with adjacent fields.

For non-packed types, bit_position is always 0.

Record layout mechanics — including the bit-packing rule, record-size computation, and how the encoder packs adjacent sub-byte fields — are in Record Layout.

Conditional trailers

Two trailers attach only to specific field types:

  • decimal128 / nullable_decimal128 get a (precision, scale) pair (u8, u8). Both ≤ 38.
  • Categorical types (categorical_u8, categorical_u16, categorical_u32) get a full dictionary block in line — see Dictionary Blocks.

A field with none of the above writes nothing after the description.

Field descriptions

The description string is UTF-8 with a 2-byte length prefix. The import path rejects descriptions longer than 1000 bytes (PULSE_IMPORT_DESCRIPTION_TOO_LONG) and warns on low-quality descriptions (empty, under 10 characters, or generic words like "n/a", "tbd", "unknown", "field", "data", "value", "column") — that warning is PULSE_FIELD_DESCRIPTION_LOW_QUALITY, upgraded to an error under --strict.

When the description is empty, pulse cohort inspect synthesises a fallback string (“Categorical field: ” or “Numeric field: ”) with description_source = "synthesized". The original bytes on disk remain empty.

Reader behaviour

encoding.ReadSchema is intentionally strict:

  • Field count limit comes from the u16 prefix (max 65,535 fields).
  • Unknown type bytes fail loud (ENCODING_INVALID).
  • Truncated records fail loud at the first short read.
  • The reader produces a *encoding.Schema with one encoding.Field per record; Schema.Field(name) looks fields up by name.

After the schema block, record data starts at the file’s first byte past the schema. The record layout is documented in Record Layout.

Dictionary Blocks

Audience: anyone decoding categorical fields, sizing a categorical type during import, or chasing a dictionary-overflow error.

Categorical fields (categorical_u8, categorical_u16, categorical_u32) store their string-to-ID mapping inline, immediately after the field’s schema entry. The dictionary is part of the schema block, not the record data.

LLM agents using MCP: the cohort-schema-design skill covers when to pick which categorical width; the import-best-practices skill covers fail-closed semantics on overflow.

On-disk layout

From encoding/dictionary.go:

u32 count
(u16 strlen + utf8 bytes) × count

Sizes are little-endian. Each entry’s ID is its insertion index (0..count-1); ID lookups during decode use the ID found in the record byte(s) and resolve to the string at that index.
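A sketch of a decoder for this layout follows. readDictionary is a hypothetical function, not the one in encoding/dictionary.go, and it deliberately skips the shared-buffer optimisation in favour of clarity.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// readDictionary decodes: u32 count, then count × (u16 strlen + UTF-8 bytes).
// The slice index of each returned string is its categorical ID.
func readDictionary(b []byte) ([]string, error) {
	if len(b) < 4 {
		return nil, errors.New("dictionary: short count")
	}
	count := binary.LittleEndian.Uint32(b)
	pos := 4
	var entries []string // grown on demand; count is untrusted input
	for i := uint32(0); i < count; i++ {
		if len(b)-pos < 2 {
			return nil, fmt.Errorf("dictionary: entry %d: short length", i)
		}
		n := int(binary.LittleEndian.Uint16(b[pos:]))
		pos += 2
		if len(b)-pos < n {
			return nil, fmt.Errorf("dictionary: entry %d: short payload", i)
		}
		entries = append(entries, string(b[pos:pos+n]))
		pos += n
	}
	return entries, nil
}

func main() {
	raw := []byte{
		0x02, 0x00, 0x00, 0x00, // count = 2
		0x02, 0x00, 'N', 'Y', // ID 0
		0x02, 0x00, 'C', 'A', // ID 1
	}
	dict, err := readDictionary(raw)
	fmt.Println(dict, err)
}
```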

Sizing the type

| Type | Max entries | Bytes per record value |
|---|---|---|
| categorical_u8 | 256 | 1 |
| categorical_u16 | 65,536 | 2 |
| categorical_u32 | 4,294,967,295 | 4 |

The import path samples the source (--sample-rows, default 500) to estimate cardinality and picks the smallest width that fits. You can also force a width by editing the schema template (pulse import schema-template SOURCE).
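The "smallest width that fits" choice can be written directly from the table. The function below is a hypothetical sketch of that selection, not the import path's code. Because the real estimate comes from a sample, the chosen width can still overflow on the full source.

```go
package main

import (
	"errors"
	"fmt"
)

// smallestCategorical returns the narrowest categorical type whose
// documented entry cap covers the given number of distinct values.
func smallestCategorical(distinct uint64) (string, error) {
	switch {
	case distinct <= 256:
		return "categorical_u8", nil
	case distinct <= 65536:
		return "categorical_u16", nil
	case distinct <= 4294967295:
		return "categorical_u32", nil
	}
	return "", errors.New("cardinality exceeds every categorical width")
}

func main() {
	for _, n := range []uint64{12, 257, 70000} {
		t, err := smallestCategorical(n)
		fmt.Println(n, t, err)
	}
}
```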

Overflow and unbounded errors

AddWithLimit enforces the per-type cap and returns PULSE_IMPORT_CATEGORICAL_OVERFLOW when the source has more distinct values than the dictionary can hold:

{
  "code": "PULSE_IMPORT_CATEGORICAL_OVERFLOW",
  "message": "categorical dictionary overflow: max 256 entries",
  "details": {"max_entries": 256, "value": "the_257th_distinct_string"}
}

The companion code PULSE_IMPORT_CATEGORICAL_UNBOUNDED fires when the import path detects an effectively unbounded categorical column (the schema declared categorical_u32 and the column still grew past the caller-provided guardrails). Both errors halt the import — fail-closed, no partial output.

Recovery options, in order of preference:

  1. Re-import with a wider categorical type (categorical_u8 → categorical_u16 → categorical_u32).
  2. Drop the categorical encoding. Pulse has no native variable-length string type, so in practice this means adding a pre-import transform that buckets the values.
  3. Pre-filter the source to a smaller distinct set and re-import.

Inspect behaviour

pulse cohort inspect --json reports each categorical field’s dictionary entry count and sample values. By default the inline list is capped at 100 entries (DefaultDictionaryLimit); pass --full-dict to print the full dictionary:

pulse cohort inspect data.pulse --full-dict --json

Both forms include a truncated: true|false flag and a total_entries count for programmatic consumers.

Performance notes

Dictionary reads are amortised: the reader allocates one shared byte buffer for all string payloads, then does one string(...) copy per entry. This avoids the “one allocation per entry” overhead that naively reading length-prefixed strings would produce. The dictionary itself is held in memory for the life of the cohort’s schema parse.

For very large dictionaries, the categorical_u32 path is still O(N) to deserialise; if you find yourself near the 32-bit cap, you almost certainly want a different model (a separate lookup table, or a plain integer column with the strings stored externally).

Record Layout

Audience: anyone hand-decoding row data or implementing a non-Go reader. The schema block ends; record data starts immediately after.

Records are fixed-width. Every row in a cohort occupies the same number of bytes, computed from the schema’s field types. Variable-width data (strings) lives in the schema (as categorical dictionaries) or is not directly supported.

LLM agents using MCP: the record byte layout is an implementation detail the MCP surface hides — there is no LLM-facing skill for it. The MCP tools operate on the inspect / process / sample abstractions.

Computing record size

Record size is the sum of FieldType.ByteSize() over all schema fields, plus the shared bytes that hold the bit-packed sub-byte fields. For non-packed types, ByteSize() returns the obvious value (u32 = 4, f64 = 8, decimal128 = 16); for packed types (packed_bool, nullable_bool, nullable_u4), ByteSize() returns 0 and the field shares a byte with adjacent packed fields.

The writer (encoding/record.go) lays out fields in the order they appear in the schema; the reader walks the same order with the per-field ByteOffset and BitPosition recorded in the schema.

Encoding per type

From WriteFieldValue / ReadFieldValue in encoding/record.go:

| Type family | Encoding |
|---|---|
| u8 / nullable_u8 / categorical_u8 | 1 byte, unsigned |
| u16 / nullable_u16 / categorical_u16 | 2 bytes, little-endian unsigned |
| u32 / date / categorical_u32 | 4 bytes, little-endian unsigned |
| u64 | 8 bytes, little-endian unsigned |
| f32 | 4 bytes, little-endian IEEE 754 |
| f64 | 8 bytes, little-endian IEEE 754 |
| decimal128 / nullable_decimal128 | 16 bytes, little-endian two’s-complement integer (scaled by 10^scale); null sentinel is INT128_MIN for the nullable variant |
| packed_bool / nullable_bool / nullable_u4 | Bit-packed (see below) |
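The decimal128 row is the least obvious to decode by hand, so here is a sketch using math/big. It assumes exactly what the table states: 16 little-endian bytes holding a two's-complement integer that is the value multiplied by 10^scale, with INT128_MIN as the null sentinel. All function names are hypothetical.

```go
package main

import (
	"fmt"
	"math/big"
)

// decodeInt128 turns 16 little-endian two's-complement bytes into a big.Int.
func decodeInt128(b [16]byte) *big.Int {
	be := make([]byte, 16)
	for i := range b {
		be[15-i] = b[i] // big.Int.SetBytes expects big-endian
	}
	v := new(big.Int).SetBytes(be)
	if be[0]&0x80 != 0 { // sign bit set: value is v - 2^128
		v.Sub(v, new(big.Int).Lsh(big.NewInt(1), 128))
	}
	return v
}

// isNullDecimal reports whether b holds INT128_MIN (0x80 followed by zeros
// when read most-significant byte first).
func isNullDecimal(b [16]byte) bool {
	if b[15] != 0x80 {
		return false
	}
	for _, x := range b[:15] {
		if x != 0 {
			return false
		}
	}
	return true
}

// decimalString renders the unscaled integer with `scale` fractional digits.
func decimalString(b [16]byte, scale int) string {
	v := decodeInt128(b)
	s := new(big.Int).Abs(v).String()
	if scale > 0 {
		for len(s) <= scale {
			s = "0" + s // left-pad so there is a digit before the point
		}
		s = s[:len(s)-scale] + "." + s[len(s)-scale:]
	}
	if v.Sign() < 0 {
		s = "-" + s
	}
	return s
}

func main() {
	price := [16]byte{0x40, 0xE2, 0x01} // 123456 little-endian, rest zero
	fmt.Println(decimalString(price, 2)) // 1234.56
}
```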

Bit-packing

Sub-byte types share whole bytes with their packed neighbours. The schema records both ByteOffset (the shared byte’s offset) and BitPosition (which bit slot within that byte).

  • packed_bool — 1 bit (true/false).
  • nullable_bool — 2 bits (one null bit, one value bit) for the tri-state encoding.
  • nullable_u4 — 5 bits (one null bit, four value bits) for the nullable 4-bit unsigned encoding.

The writer aligns these into shared bytes from low bit to high bit; adjacent packed fields stack into the same byte until the byte is full, after which a new byte begins. ByteSize() == 0 is the schema reader’s signal that a field type shares bytes — non-zero ByteSize fields never share.
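Pulling a slot out of a shared byte is a shift and a mask. The helper below is a generic sketch: bitsAt is a hypothetical name, and the widths passed in would be the 1, 2, or 5 bits listed above. Which bit within a slot is the null flag is not specified on this page, so the sketch returns the raw slot and leaves that interpretation to the caller.

```go
package main

import "fmt"

// bitsAt extracts `width` bits from b starting at bit `pos`,
// where position 0 is the lowest bit.
func bitsAt(b byte, pos, width uint) byte {
	mask := byte(1)<<width - 1
	return (b >> pos) & mask
}

func main() {
	shared := byte(0b0000_0101) // hypothetical shared byte
	fmt.Println(bitsAt(shared, 0, 1)) // 1-bit slot at bit 0 → 1
	fmt.Println(bitsAt(shared, 1, 2)) // 2-bit slot at bits 1-2 → 2
}
```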

Null sentinels

| Type | Null encoding |
|---|---|
| nullable_u8 | 0xFF |
| nullable_u16 | 0xFFFF |
| nullable_u4 | Dedicated bit pattern within the packed byte |
| nullable_bool | Dedicated bit within the packed byte |
| nullable_decimal128 | INT128_MIN (0x8000…0000) |

All other types (u8, u16, u32, u64, f32, f64, date, packed_bool, decimal128, and the categoricals) are non-nullable: the import path either coerces or rejects rows with missing values (PULSE_IMPORT_ROW_ERROR). Pick the nullable_* variant when you need to preserve the difference between “zero” and “missing”.

Reading a record

The Go decoder lives at encoding.Reader / encoding.ReadRecord(*Schema, []byte). A non-Go reader can follow the same recipe:

  1. Compute record size from the schema.
  2. Read record_size bytes.
  3. For each schema field in declaration order:
    • If ByteSize() > 0, decode the value at the field’s ByteOffset.
    • If ByteSize() == 0, decode the bit slot at (ByteOffset, BitPosition) using the type’s bit-pattern rules.
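The recipe can be sketched for the byte-sized case as follows. The three-field schema (id, score, amount), the field struct, and readRecord are all invented for illustration; bit-packed fields are left out, and the nullable_u16 sentinel comes from the null-sentinel table above.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// field is a pared-down schema entry: name, type name, and byte offset.
type field struct {
	name   string
	typ    string
	offset int
}

// demoSchema is a hypothetical 14-byte record: u32, nullable_u16, f64.
var demoSchema = []field{
	{"id", "u32", 0},
	{"score", "nullable_u16", 4},
	{"amount", "f64", 6},
}

var sizes = map[string]int{"u32": 4, "nullable_u16": 2, "f64": 8}

// readRecord decodes each field at its offset, in declaration order.
func readRecord(schema []field, rec []byte) (map[string]any, error) {
	out := make(map[string]any, len(schema))
	for _, f := range schema {
		size, ok := sizes[f.typ]
		if !ok {
			return nil, fmt.Errorf("field %q: unsupported type %q", f.name, f.typ)
		}
		if f.offset+size > len(rec) {
			return nil, fmt.Errorf("field %q: short record", f.name)
		}
		b := rec[f.offset:]
		switch f.typ {
		case "u32":
			out[f.name] = binary.LittleEndian.Uint32(b)
		case "nullable_u16":
			if v := binary.LittleEndian.Uint16(b); v == 0xFFFF {
				out[f.name] = nil // in-band null sentinel
			} else {
				out[f.name] = v
			}
		case "f64":
			out[f.name] = math.Float64frombits(binary.LittleEndian.Uint64(b))
		}
	}
	return out, nil
}

func main() {
	rec := []byte{
		0x2A, 0x00, 0x00, 0x00, // id = 42
		0xFF, 0xFF, // score = null
		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xF8, 0x3F, // amount = 1.5
	}
	row, err := readRecord(demoSchema, rec)
	fmt.Println(row["id"], row["score"], row["amount"], err)
}
```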

Forward compatibility

Records carry no type tag — they’re a packed binary blob whose interpretation comes entirely from the schema block. That’s why the file’s format version (in the header) and unknown field-type bytes (in the schema block) both fail loud at parse time: the records themselves cannot self-correct, so the format gates everything before record data is observed.

MCP Integration

Audience: operators wiring Pulse into an MCP-aware AI client (Claude Desktop, Claude Code, Cursor, Zed, custom hosts), and embedders who want to expose Pulse to an LLM agent.

This page is the human-facing guide: what the server does, how to wire it up, what the LLM sees, and how to debug a misbehaving session. Agent-facing guidance ships inside the binary as the mcp-integration skill — fetch it via pulse_skills_get (or pulse skills show mcp-integration).

What pulse mcp is

pulse mcp runs the Pulse library as a Model Context Protocol (MCP) server. The host (Claude Desktop, Claude Code, etc.) launches it as a subprocess, speaks JSON-RPC over its stdio streams, and shuts it down on session close. The LLM sees Pulse as a set of tools (callable functions), resources (browseable URIs), and prompts (canned slash commands).

┌─────────────┐  stdio JSON-RPC  ┌────────────┐  Go calls  ┌─────────────┐
│  AI client  │ ───────────────→ │ pulse mcp  │ ─────────→ │ pulse.Pulse │
│   (host)    │ ←─────────────── │ (this bin) │ ←───────── │  (library)  │
└─────────────┘                  └────────────┘            └─────────────┘
                                       │
                                       └── stderr ─→ host log pane

The server is a thin translator. Every tool wraps a public method on pulse.Pulse; the same code path powers the CLI.

Quickstart

# 1. Build and place on PATH
make build && cp ./bin/pulse /usr/local/bin/

# 2. Pick a data directory
mkdir -p /var/data/pulse

# 3. Wire into your host (see below) and restart it

# 4. From the LLM session, call:
#    pulse_manifest      → cache once
#    pulse_ask           → run analyses

Wiring into a host

Claude Desktop

Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "pulse": {
      "command": "/usr/local/bin/pulse",
      "args": ["mcp"],
      "env": {
        "PULSE_DATA_DIR": "/var/data/pulse"
      }
    }
  }
}

Restart Claude Desktop. Pulse tools appear in the tool picker.

Claude Code

claude mcp add pulse --env PULSE_DATA_DIR=/var/data/pulse -- pulse mcp

Or by hand in ~/.claude.json (or per-project .claude.json):

{
  "mcpServers": {
    "pulse": {
      "command": "/usr/local/bin/pulse",
      "args":    ["mcp"],
      "env":     { "PULSE_DATA_DIR": "/var/data/pulse" }
    }
  }
}

Cursor / Zed / generic stdio hosts

Any host that speaks the MCP stdio transport can launch pulse mcp the same way — provide the binary path, the mcp argument, and the PULSE_DATA_DIR env var.

What the LLM sees

Tool surface

Sixteen tools, registered at server start. Names and order match internal/mcp/mcptools/meta.go.

| Tool | Purpose |
|---|---|
| pulse_manifest | Call first. Self-description: commands, operators (with accepted types + streamability), tier-1/tier-2 tests, regressions, synth distributions, error code list, MCP tool list, cohort field types with operator cross-references. Cache once per session. |
| pulse_ask | Preferred entry point. One-shot: optional auto-import → inspect → predict → execute. Accepts source (raw file path) + query (natural language, beta) or a structured request. |
| pulse_inspect | Read .pulse header + schema (no record bytes). Side effect: registers session-scoped schema-bound tool variants (see below). |
| pulse_predict | Validate a request against the schema without executing. Returns errors, warnings, applied defaults, streamability reasons. |
| pulse_process | Execute one pre-built request. |
| pulse_compose | Execute a batch of requests against the same cohort in one round trip. |
| pulse_sample | Return up to N rows for preview / diagnostics. |
| pulse_facet | Distinct values for a single field. |
| pulse_import | Convert a tabular source (csv, tsv, ndjson, jsonarray, parquet, arrow, excel) into a managed .pulse handle under imports/, with TTL-tracked sidecar. Pulse-format inputs pass through. |
| pulse_drop | Delete a managed-import handle and its sidecar. |
| pulse_imports_list | Enumerate managed handles with sidecar metadata (source, format, imported_at, expires_at, ttl, expired flag, pinned flag). |
| pulse_examples_search | Search the embedded request-example library by query, taxonomy tags (ANDed), or category. |
| pulse_examples_get | Fetch one runnable example body by name. |
| pulse_errors_lookup | Per-code Message + Fixup detail (kept out of the manifest for context economy). |
| pulse_skills_list | Embedded skill metadata. |
| pulse_skills_get | Fetch one skill body by name. |

Natural-language query is beta. Heuristic parsing only — silent misinterpretation is possible. The LLM should always check the query_resolution and resolved request in the response before trusting results. For production, author a structured request against the cached manifest and skip the query field.

Resources

| URI scheme | Yields |
|---|---|
| pulse://<path> | One resource per .pulse file under the data directory. Read returns descriptor.InspectResult JSON (header + schema only, no record bytes). |
| pulse-skill://<name> | One per embedded skill. Read returns the markdown body. |

Resources are registered once at server start. Files added afterwards do not appear until the server restarts. Listing is cheap because the server only reads header bytes.

Prompts

| Name | Args | Returns |
|---|---|---|
| pulse-bootstrap | none | A short instructions block telling the assistant what to call (and in what order) before authoring any request, and where the authoritative references live. Inject at session start. |
| pulse-author-request | question | A guided tool-call sequence for translating a natural-language analytical question into a Pulse request: manifest → examples search → ask. |

Hosts that surface prompts as slash commands let users trigger these directly.

The two-call default for nearly every user request:

  1. pulse_manifest once at session start. No arguments. Cache the payload — it is deterministic for a binary version and carries every fact needed to author a valid request.

  2. pulse_ask for everything else. It collapses import + inspect + predict + execute into one round trip. When the user hands the LLM a raw file:

    {
      "request": "{\"source\":\"data.csv\",\"query\":\"average revenue by month\"}"
    }
    

    When the cohort already exists as a managed handle or .pulse file:

    {
      "request": "{\"cohort\":{\"filename\":\"sales.pulse\"},\"query\":\"top 5 regions by revenue\"}"
    }
    

    On predict-invalid with on_invalid="suggest", the response carries structured Fixup entries derived from each error code’s metadata so the LLM can repair the request without another round trip.

Reach for the multi-step path (pulse_inspect → pulse_predict → pulse_process) only when:

  • diagnosing a failed predict and you want the full envelope,
  • previewing rows (pulse_sample) or value distributions (pulse_facet),
  • pre-staging a managed handle with a specific name / TTL / pinning (pulse_import),
  • batching multiple requests in one call (pulse_compose).
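In the pulse_ask examples above, the request argument is a JSON string that itself contains JSON, so a host-side client has to encode twice. The sketch below builds that shape with encoding/json. The askRequest and toolArgs types are hypothetical and only mirror the keys visible in the first example; they are not types exported by Pulse.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// askRequest mirrors the inner payload of the raw-file example.
type askRequest struct {
	Source string `json:"source"`
	Query  string `json:"query"`
}

// toolArgs mirrors the outer tool-call arguments: one string field.
type toolArgs struct {
	Request string `json:"request"`
}

// askArgs encodes the inner request, then embeds it as a string.
func askArgs(source, query string) (string, error) {
	inner, err := json.Marshal(askRequest{Source: source, Query: query})
	if err != nil {
		return "", err
	}
	outer, err := json.Marshal(toolArgs{Request: string(inner)})
	if err != nil {
		return "", err
	}
	return string(outer), nil
}

func main() {
	args, err := askArgs("data.csv", "average revenue by month")
	if err != nil {
		panic(err)
	}
	fmt.Println(args)
}
```

Letting the encoder produce the escaped inner string avoids hand-written backslashes, which are easy to get wrong.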

Managed imports + TTL

pulse_import lets the LLM hand the server any tabular file and address it from then on as if it were a .pulse.

  • Convertible formats (csv, tsv, ndjson, jsonarray, parquet, arrow, excel) are imported into $PULSE_DATA_DIR/imports/<handle>.pulse with a sidecar <handle>.pulse.meta.json carrying imported_at, expires_at, ttl_seconds, source path, source format, and row count. result.managed=true.
  • Pulse passthroughs (.pulse extension) under PULSE_DATA_DIR are not copied — the server returns the relative path verbatim with managed=false. A .pulse outside PULSE_DATA_DIR is copied into the managed pool.

Source path resolution. Relative source paths resolve against PULSE_DATA_DIR. Absolute paths read from the host filesystem through a separate “source fs.”

Import jail. Absolute source paths are confined to a single directory tree (the jail root). Default: the working directory the MCP server was launched from. Paths that escape the jail (including ..) return PULSE_IMPORT_SOURCE_FORBIDDEN. Override via pulse.Options.ImportSourceJailRoot when embedding.
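The confinement rule can be pictured as a lexical containment check. The sketch below is an assumption about how such a check might look, not Pulse's implementation: insideJail is a hypothetical helper, it assumes POSIX-style paths, and it does not resolve symlinks, which a real jail would also need to handle.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// insideJail reports whether candidate stays within jailRoot after lexical
// cleaning. Symlinks are NOT resolved in this sketch.
func insideJail(jailRoot, candidate string) bool {
	rel, err := filepath.Rel(filepath.Clean(jailRoot), filepath.Clean(candidate))
	if err != nil {
		return false
	}
	// A relative path that is ".." or starts with "../" has left the root.
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

func main() {
	root := "/srv/work"
	for _, p := range []string{
		"/srv/work/data/sales.csv",
		"/srv/work/../etc/passwd",
		"/srv/workspace/sales.csv",
	} {
		fmt.Println(p, insideJail(root, p))
	}
}
```

Comparing via filepath.Rel rather than a plain string prefix keeps a sibling directory such as /srv/workspace from being mistaken for a child of /srv/work.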

Sliding TTL. Default lifetime is 7d (overridable via PULSE_IMPORT_TTL, or per-import via the ttl field — accepts Go duration like "24h", day form like "7d", or "pin" for never-expire). Every subsequent inspect/predict/process/sample/facet/ask against the handle slides expires_at forward. The pool self-sweeps on every pulse_import call — no daemon required. Inspect with pulse_imports_list; evict manually with pulse_drop.
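The three accepted ttl forms and the sliding behaviour can be sketched in a few lines of Go. The parseTTL and slide helpers are hypothetical names chosen for this illustration. The day form is handled explicitly because Go's time.ParseDuration has no "d" unit.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseTTL accepts "pin", a day form such as "7d", or a Go duration such as
// "24h". The boolean result is true when the handle should never expire.
func parseTTL(s string) (time.Duration, bool, error) {
	if s == "pin" {
		return 0, true, nil
	}
	if strings.HasSuffix(s, "d") {
		days, err := strconv.Atoi(strings.TrimSuffix(s, "d"))
		if err != nil || days < 0 {
			return 0, false, fmt.Errorf("invalid day-form ttl %q", s)
		}
		return time.Duration(days) * 24 * time.Hour, false, nil
	}
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0, false, fmt.Errorf("invalid ttl %q: %w", s, err)
	}
	return d, false, nil
}

// slide returns the new expires_at after the handle is touched at accessedAt.
func slide(accessedAt time.Time, ttl time.Duration) time.Time {
	return accessedAt.Add(ttl)
}

func main() {
	ttl, pinned, err := parseTTL("7d")
	if err != nil {
		panic(err)
	}
	fmt.Println(ttl, pinned, slide(time.Now(), ttl).Format(time.RFC3339))
}
```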

Schema-bound enums

After a successful pulse_inspect (or after pulse_ask opens a cohort), the server registers session-scoped variants of the action tools (pulse_process, pulse_predict, pulse_compose, pulse_sample, pulse_facet) whose JSON Schemas embed enum constraints on field-name parameters. The LLM picks field names from a typed list rather than free-texting and discovering on predict that the name was wrong.

What gets constrained on bound pulse_process / pulse_predict / pulse_compose schemas:

Path | Enum
aggregations[].field | All cohort field names
aggregations[].type | Full aggregator catalogue (AGG_*)
attributes[].field | Numeric fields only (includes decimal)
attributes[].type | Full attribute catalogue (ATTR_*)
filterers[].field | All cohort field names
filterers[].type | Full filterer catalogue (FILTER_*)
groups[].field | All cohort field names
groups[].type | Full grouper catalogue (GROUP_*)
windows[].field, windows[].partition_by[] | All cohort field names
windows[].order_by[].field | Numeric and date fields
windows[].type | Full window catalogue (WIN_*)
tests[].field, tests[].field2 | Numeric fields only
tests[].split_by / rows / cols / subject_field | All cohort field names
tests[].type | Full test catalogue (TEST_*)
pulse_facet field arg | All cohort field names

Trigger and lifecycle. Binding fires on a successful pulse_inspect, or when pulse_ask opens a cohort. mcp-go auto-fires notifications/tools/list_changed on AddSessionTools; the host refreshes its tool list and picks up the bound schemas on the next list. Bound tools share names with the global tools — session-scoped variants override globals for that session.

Limitations.

  • Multi-file sessions: the latest inspect wins. Track multiple cohorts client-side.
  • No per-element type ↔ field correlation: JSON Schema can’t easily express “if aggregations[i].type == AGG_SUM then aggregations[i].field must be numeric.” Operator–type compatibility lives in the type property description; strict validation remains pulse_predict’s job.
  • Transport support: binding requires a session that implements SessionWithTools. SSE / Streamable HTTP transports work; on stdio, binding is a no-op fallback and the global (unbound) schemas remain in effect. The manifest’s accepts_types table is still authoritative, so authoring is not blocked — just less ergonomic.
  • Empty enums omitted: when the cohort has zero fields in a category (e.g. no geo fields), the enum is omitted entirely rather than emitted as [].

Disable binding entirely with --bind-on-open=false.
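To make the enum constraint and the empty-enum rule concrete, here is a small Go sketch that builds the JSON Schema fragment for one field-name parameter. The fieldProperty and schemaJSON helpers are hypothetical; the real bound schemas are generated by the server and may carry more keys.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fieldProperty builds a JSON Schema fragment for a field-name parameter.
// With no candidate names the "enum" key is left out entirely rather than
// emitted as [], matching the limitation listed above.
func fieldProperty(names []string) map[string]any {
	prop := map[string]any{"type": "string"}
	if len(names) > 0 {
		prop["enum"] = names
	}
	return prop
}

// schemaJSON renders the fragment as compact JSON for inspection.
func schemaJSON(names []string) string {
	b, err := json.Marshal(fieldProperty(names))
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(schemaJSON([]string{"revenue", "units"}))
	fmt.Println(schemaJSON(nil))
}
```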

Configuration

Env var | Purpose | Default
PULSE_DATA_DIR | Cohort base directory. Required. | (none — server fails to start without it)
PULSE_IMPORTS_DIR | Subdirectory for managed-import handles. | imports
PULSE_IMPORT_TTL | Default TTL for managed handles. Accepts Go duration (24h, 30m), day form (7d, 30d), or pin. | 7d

Embedders can override per-instance via pulse.Options{DataDir, ImportsDir, ImportTTL, ImportSourceJailRoot, FS, ImportSourceFS, BindOnOpen} — see pulse.go.

Transport caveats

  • Stdio. The default and only transport pulse mcp ships today. Schema binding is a no-op (see Limitations). Stdout is the JSON-RPC channel; stderr is the log channel — never write structured output to stdout outside the protocol.
  • SSE / Streamable HTTP. Not exposed by the mcp CLI leaf yet. The underlying mcp-go server supports them; embedders can call mcp.NewWithOptions(p, ...) and serve via mcp-go’s SSE / streamable HTTP entry points directly.

Troubleshooting

Symptom | Cause | Fix
data directory required: set PULSE_DATA_DIR or pass --data-dir | Neither env var nor flag set | Pass PULSE_DATA_DIR in the host’s env block, or --data-dir in args
Tools don’t appear in the host UI after editing config | Host caches tool list | Restart the host fully (not just the conversation)
pulse_import returns PULSE_IMPORT_SOURCE_FORBIDDEN for an absolute path | Path escapes the import jail (default = server’s working dir) | Either move the file under the jail, launch the server from a higher-level directory, or set pulse.Options.ImportSourceJailRoot when embedding
pulse_inspect succeeds but bound enums never fire | Stdio session — binding is a no-op there | Use pulse_predict for validation; the manifest’s accepts_types lists give the LLM the same information
Tool calls hang | Host wrote non-protocol bytes to the server’s stdin, or server wrote non-protocol bytes to stdout | Check server stderr; restart the session. pulse mcp itself only writes a one-line startup notice to stderr at boot
pulse_ask with query returns nonsense or wrong fields | Natural-language parsing is heuristic and beta | Inspect query_resolution in the response. For production, author a structured request against the cached manifest

To see what the server registers without launching the host:

pulse --json | jq '.data.mcp_tools[]'
pulse manifest --json | jq '.data.skills[]'

Skill cross-reference for LLM agents

If you are writing a system prompt for an LLM agent that uses Pulse, point it at these skills rather than at this site:

LLM task | Skill
MCP wiring, tool surface, schema binding | mcp-integration
Author a Process request | request-recipes
Compose multiple sub-requests in one call | compose-requests
Iterate on a request with pulse_predict | debugging-with-predict
Look up an error code or warning | error-code-reference
Pick an aggregator / filterer | aggregation-guide
Pick an attribute (z-score, percentile, formula, …) | attribute-composition
Design a grouper | grouper-design
Use a window operator (WIN_*) | window-operations
Use a feature engineer (FEAT_*) | feature-engineering
Run a statistical test (tier-1 or tier-2) | statistical-testing
Fit a regression (OLS, GLM, Bayesian) | regression-modeling
Generate synthetic data | synthetic-data
Understand a cohort’s schema layout | cohort-schema-design
Import a tabular source into .pulse | import-best-practices
Pick an export format | export-format-selection
Work with decimal128 (currency, precise arithmetic) | financial-cohorts
Route a natural-language query to a Pulse request | query-router-prompt
Get started end-to-end (LLM walkthrough) | getting-started

The agent should call pulse_skills_list once at session start to enumerate the catalog, then pulse_skills_get on demand. The returned text is authoritative; this site does not duplicate it and may lag.

Request Example Library

Pulse ships a searchable, embedded catalogue of runnable request JSON files spanning every operator category. They are checked into the repo under examples/, mounted into the binary at compile time via //go:embed, and surfaced through three peer access paths:

Access path | Best for
pulse_examples_search / pulse_examples_get (MCP tools) | LLM agents authoring requests against a running Pulse server
pulse examples search / pulse examples show (CLI) | Developers exploring at a shell
pulse.ExamplesSearch / pulse.ExampleGet (Go API) | Embedders building higher-level UIs

What the library contains

Every example is a complete types.Request JSON body — the same shape you hand to pulse_process. Each file is annotated with a structured _meta block describing the example. Pulse’s JSON unmarshaller ignores unknown fields by default, so the _meta block is invisible at execution time; the file remains runnable verbatim.

{
  "_meta": {
    "name": "t_test_one_sample",
    "category": "tests",
    "tags": ["hypothesis-test", "t-test", "tier-1-test", "parametric", "one-sample", "streaming-friendly"],
    "operators": ["AGG_AVERAGE", "AGG_COUNT", "TEST_T"],
    "description": "One-sample t-test comparing revenue mean against the hypothesized mu=100."
  },
  "cohort": {...},
  ...
}

Fetching via pulse_examples_get returns the request body with the _meta block already stripped, so you can pass it straight to pulse_process / pulse_predict.
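If you read an example file from disk instead, the same stripping can be done by hand. The sketch below uses only the standard library; stripMeta is a hypothetical helper, and the sample document is a shortened stand-in rather than a real example file. Re-marshalling a map sorts the top-level keys, which does not change the JSON's meaning.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stripMeta removes the top-level "_meta" key and leaves every other
// top-level value untouched (values are carried as raw JSON).
func stripMeta(raw []byte) ([]byte, error) {
	var doc map[string]json.RawMessage
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	delete(doc, "_meta")
	return json.Marshal(doc)
}

func main() {
	// Shortened, hypothetical example body for illustration only.
	raw := []byte(`{"_meta":{"name":"t_test_one_sample"},"cohort":{"filename":"sales.pulse"}}`)
	out, err := stripMeta(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```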

Searching the library

Three filter dimensions, all optional and combined with AND:

Filter | Behaviour
query | Case-insensitive substring across the example’s name, description, and operator list
tags | An example must carry every requested tag
category | Exact match against the example’s directory (aggregations, attributes, features, filterers, groupers, regression, tests, windows)
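The filter semantics can be restated as a small predicate. This is a sketch under assumptions: the example struct and matches function are hypothetical and only mirror the _meta fields shown earlier, so the library's actual matching code may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
)

// example holds the _meta fields the three filters look at (hypothetical type).
type example struct {
	Name, Category, Description string
	Tags, Operators             []string
}

func contains(list []string, want string) bool {
	for _, v := range list {
		if v == want {
			return true
		}
	}
	return false
}

// matches applies the three optional filters with AND semantics.
// An empty query, nil tags, or empty category leaves that dimension unfiltered.
func matches(ex example, query string, tags []string, category string) bool {
	if category != "" && ex.Category != category {
		return false
	}
	for _, t := range tags {
		if !contains(ex.Tags, t) {
			return false
		}
	}
	if query == "" {
		return true
	}
	haystack := strings.ToLower(ex.Name + "\n" + ex.Description + "\n" + strings.Join(ex.Operators, "\n"))
	return strings.Contains(haystack, strings.ToLower(query))
}

func main() {
	ex := example{
		Name:        "t_test_one_sample",
		Category:    "tests",
		Description: "One-sample t-test comparing revenue mean against the hypothesized mu=100.",
		Tags:        []string{"hypothesis-test", "t-test", "parametric", "one-sample"},
		Operators:   []string{"AGG_AVERAGE", "AGG_COUNT", "TEST_T"},
	}
	fmt.Println(matches(ex, "test_t", []string{"t-test", "parametric"}, "tests"))
	fmt.Println(matches(ex, "", []string{"time-series"}, ""))
}
```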

CLI

pulse examples search --query welch                       # find Welch-related examples
pulse examples search --tag time-series --tag tier-2-test # AND tag filter
pulse examples search --category tests --json             # JSON envelope
pulse examples show t_test_one_sample                     # print runnable JSON
pulse examples show t_test_one_sample --json              # full record (with _meta)

MCP

// arguments to pulse_examples_search
{"query": "welch"}
{"tags": ["time-series", "tier-2-test"]}
{"category": "features"}

Go API

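// Needs imports: context, encoding/json, fmt, log, and github.com/frankbardon/pulse.
ctx := context.Background()

p, err := pulse.New(pulse.Options{DataDir: "/data"})
if err != nil {
    log.Fatal(err)
}

// Search: the query and tag filters combine with AND; "" leaves the category unfiltered.
hits := p.ExamplesSearch("welch", []string{"experiment-analysis"}, "")
for _, h := range hits {
    fmt.Println(h.Name, "—", h.Description)
}

// Fetch and run:
ex, ok := p.ExampleGet("t_test_one_sample")
if !ok {
    log.Fatal("example t_test_one_sample not found")
}
var req pulse.Request
if err := json.Unmarshal(ex.Body, &req); err != nil {
    log.Fatal(err)
}
resp, err := p.Process(ctx, &req)
if err != nil {
    log.Fatal(err)
}
_ = resp // inspect the response here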

Tag taxonomy

Tags are curated and validated by a CI gate (TestExamples_TagsFromTaxonomy). The taxonomy spans five dimensions:

Dimension | Tags
Domain / use case | time-series, cohort-analysis, experiment-analysis, correlation-analysis, comparison, before-after, top-n, distribution-shape, cross-tabulation, proportion-analysis, trend-detection, outlier-detection, cardinality-analysis, data-quality, geo-analysis, financial, feature-engineering
Statistical method | hypothesis-test, t-test, parametric, nonparametric, paired, one-sample, two-sample, k-sample, repeated-measures, post-hoc, normality-test, homogeneity-test, exact-test
Regression / modeling | regression, ecological, ols, glm, logistic, bayesian, regularization, ridge, lasso, elasticnet, polynomial, resampling, jackknife, selection, stepwise
Pipeline machinery | tier-1-test, tier-2-test, composed, pre-filter, feature-pipeline, window-operator, streaming-friendly, buffered-pipeline
Risk / edge | leakage-safe, leakage-risk, small-sample

The category (directory name) is not repeated in the tags — _meta.category carries that.

Adding a new example

  1. Write the request JSON under examples/<category>/. Use existing files as shape templates. Keep cohort.data_dir = ".data" and reference one of the fixture cohorts.
  2. Add a _meta block at the top of the file:
    • name — snake_case (lowercase words joined by underscores, as in t_test_one_sample), unique across the whole library.
    • category — must match the parent directory.
    • tags — pick 3-6 from the taxonomy above.
    • operators — the list of AGG_* / ATTR_* / FILTER_* / GROUP_* / WIN_* / FEAT_* / TEST_* types appearing in the body, alphabetized and deduped.
    • description — one-sentence, present-tense summary.
  3. Re-run go test ./examples/... ./descriptor/... to confirm the new file passes:
    • TestExamples_AllParseAsRequest
    • TestExamples_UniqueNames
    • TestExamples_TagsFromTaxonomy
    • TestExamples_OperatorsMatchBody
    • TestExamples_CategoryMatchesDirectory
    • TestManifestExamplesPopulated
  4. The annotation tool at cmd/annotate-examples/ is idempotent and may be re-used; updating its in-source annotations slice and re-running will rewrite the file’s _meta block in canonical form.

Regression Modeling

Pulse exposes regression through a compact, composable surface. Three operators, two orthogonal modifiers, and one upstream feature transform together cover every textbook regression variant. This chapter is the human-facing counterpart to skills/regression-modeling.md; agents should fetch the skill via pulse_skills_get rather than read this page.

Overview

Operator | Engine | Streaming
REG_OLS | Ordinary least squares + optional regularization | Streams sufficient statistics (Phase 1 + 2)
REG_GLM | Generalized linear model via IRLS | Always buffered (Newton-Raphson refit)
REG_BAYES_LINEAR | Bayesian linear regression (conjugate NIG) | Streams sufficient statistics (Phase 4)

Two spec-level modifiers compose with any of the three:

  • Resample ∈ {jackknife, bootstrap} — replaces analytical SE / p-values with resample-based estimates. Forces buffered.
  • Selection ∈ {forward, backward, stepwise} — drives AIC- or BIC-based greedy subset search. Requires Criterion. Forces buffered.

One upstream feature operator (FEAT_POLY) extends the linear core to polynomial regression. Per-row attributes (ATTR_REG_FITTED, ATTR_REG_RESIDUAL, ATTR_REG_LEVERAGE) attach per-record diagnostics in the output row stream.

The 13 textbook names → Pulse specs

The Indeed regression taxonomy double-counts (Simple ≡ Linear univariate, Multiple ≡ Multiple Linear) and treats orthogonal wrappers (Jackknife, Stepwise) as families. Pulse does not. The table below maps each textbook name onto the corresponding Pulse spec and links to a runnable example file under examples/regression/.

# | Indeed name | Pulse expression | Example
1 | Simple | REG_OLS with one predictor | examples/regression/02_simple_linear.json
2 | Multiple | REG_OLS with multiple predictors | examples/regression/03_multiple_linear.json
3 | Linear | = #1 | examples/regression/02_simple_linear.json
4 | Multiple Linear | = #2 | examples/regression/03_multiple_linear.json
5 | Logistic | REG_GLM{Family:"binomial", Link:"logit"} | examples/regression/04_logistic.json
6 | Ridge | REG_OLS{Penalty:"l2", Alpha:λ} | examples/regression/05_ridge.json
7 | Lasso | REG_OLS{Penalty:"l1", Alpha:λ} | examples/regression/06_lasso.json
8 | Polynomial | FEAT_POLY{Field:x, Degree:n} upstream → REG_OLS | examples/regression/07_polynomial.json
9 | Bayesian Linear | REG_BAYES_LINEAR{Prior:"nig"} | examples/regression/08_bayesian_linear.json
10 | Jackknife | any regression with Resample:"jackknife" | examples/regression/09_jackknife.json
11 | Elastic Net | REG_OLS{Penalty:"elasticnet", Alpha, L1Ratio} | examples/regression/10_elasticnet.json
12 | Ecological | GROUP_* upstream → REG_OLS over group means (composed request) | examples/regression/01_ecological_fallacy.json
13 | Stepwise | any regression with Selection:"stepwise", Criterion:"aic"|"bic" | examples/regression/11_stepwise.json

Streamability matrix

Spec | Streamable | Memory | Notes
REG_OLS no penalty | yes | O(p²) | sufficient stats: n, Σx, Σy, XᵀX, Xᵀy, Σy²
REG_OLS + l1 / l2 / elasticnet | yes | O(p²) | streaming Gram; regularized solve at finalize
REG_BAYES_LINEAR (conjugate NIG) | yes | O(p²) | streaming sufficient stats + closed-form posterior update
REG_GLM (binomial / poisson / gamma) | no | O(n·p) | IRLS / Newton requires multiple passes
Any regression with Resample != "" | no | O(n·p) | LOO / bootstrap refit
Any regression with Selection != "" | no | O(n·p) | refit per candidate subset

pulse_predict reports per-request streamability on PredictResult.Streamable, mirroring the runtime gate.
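To see why unpenalized OLS streams in constant memory, consider the one-predictor case. The sketch below accumulates running sums one row at a time and solves at the end. It illustrates the sufficient-statistics idea only and is not Pulse's engine: olsStats is a hypothetical type, and Σy² (needed for residual variance and standard errors) is left out for brevity.

```go
package main

import "fmt"

// olsStats holds the running sums for a single-predictor least-squares fit.
type olsStats struct {
	n, sx, sy, sxx, sxy float64
}

// add folds one row into the sums; no row is retained afterwards.
func (s *olsStats) add(x, y float64) {
	s.n++
	s.sx += x
	s.sy += y
	s.sxx += x * x
	s.sxy += x * y
}

// fit solves the normal equations from the sums alone.
// ok is false when there are fewer than two rows or x has no variance.
func (s *olsStats) fit() (intercept, slope float64, ok bool) {
	den := s.n*s.sxx - s.sx*s.sx
	if s.n < 2 || den == 0 {
		return 0, 0, false
	}
	slope = (s.n*s.sxy - s.sx*s.sy) / den
	intercept = (s.sy - slope*s.sx) / s.n
	return intercept, slope, true
}

func main() {
	var s olsStats
	for x := 0.0; x < 5; x++ {
		s.add(x, 2*x+1) // rows generated from y = 2x + 1
	}
	b0, b1, ok := s.fit()
	fmt.Println(b0, b1, ok)
}
```

With p predictors the same idea stores XᵀX and Xᵀy, which is where the O(p²) memory figure in the table comes from.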

Operator reference

REG_OLS

Ordinary least squares with optional regularization.

Param | Required | Notes
target | yes | Numeric response field.
predictors | yes | One or more numeric predictor fields.
penalty | no | "" (default), "l1", "l2", or "elasticnet".
alpha | conditional | Required and > 0 when penalty != "".
l1_ratio | conditional | Required and in [0, 1] when penalty == "elasticnet".
max_iters | no | Coordinate-descent cap (default 1000).
tol | no | Convergence tolerance (default 1e-6).
resample | no | "jackknife" or "bootstrap". Downgrades streaming.
selection | no | "forward", "backward", or "stepwise". Requires criterion. Downgrades streaming.

Modifier compatibility: Resample and Selection may be combined; Selection runs first, Resample re-fits the selected subset.

Error codes: PROCESSING_REGRESSION_RANK_DEFICIENT, PROCESSING_REGRESSION_SINGULAR_GRAM, PROCESSING_REGRESSION_NO_CONVERGE, PROCESSING_REGRESSION_INSUFFICIENT_DATA, PROCESSING_REGRESSION_APPROXIMATE_SE (warning, l1/elasticnet without resample), PROCESSING_REGRESSION_REGULARIZED_SELECTION (warning, penalty + selection), PROCESSING_CONFIG.

REG_GLM

Generalized linear model via iteratively-reweighted least squares.

Param | Required | Notes
target | yes | Numeric response.
predictors | yes | One or more numeric predictor fields.
family | yes | "binomial", "poisson", or "gamma".
link | no | Family-specific default when empty (binomial → logit, poisson → log, gamma → inverse).
max_iters | no | IRLS iteration cap (default 50).
tol | no | Convergence tolerance (default 1e-8).
resample | no | "jackknife" or "bootstrap".
selection | no | Subset-selection wrapper; requires criterion.

Always buffered. Setting penalty / alpha / l1_ratio on a REG_GLM spec is rejected with PROCESSING_CONFIG; regularized GLM is reserved for a later phase.

Error codes: PROCESSING_REGRESSION_INVALID_FAMILY, PROCESSING_REGRESSION_INVALID_LINK, PROCESSING_REGRESSION_NO_CONVERGE, PROCESSING_REGRESSION_INSUFFICIENT_DATA, PROCESSING_CONFIG.

REG_BAYES_LINEAR

Bayesian linear regression with a conjugate Normal-Inverse-Gamma prior.

Param | Required | Notes
target | yes | Numeric response.
predictors | yes | One or more numeric predictor fields.
prior | no | Only "nig" accepted in v1. Default "nig".
prior_mu | no | Length p+1 mean vector (intercept first); defaults to zero.
prior_precision | no | Scalar ε ≥ 0 on the precision matrix ε·I. Default 1e-3.
prior_shape | no | Inverse-gamma shape a₀. Default 1e-3.
prior_rate | no | Inverse-gamma rate b₀. Default 1e-3.
credible_level | no | Posterior interval mass. Default 0.95.

Modifier compatibility: Resample and Selection are rejected for REG_BAYES_LINEAR at spec validation — the posterior already conveys uncertainty via credible intervals, and stepwise feature selection on a Bayesian model is a posterior-based question the conjugate-NIG engine doesn’t support.

Setting penalty / alpha / l1_ratio / family / link on a Bayes spec is rejected with PROCESSING_CONFIG.

Error codes: PROCESSING_REGRESSION_RANK_DEFICIENT, PROCESSING_REGRESSION_INSUFFICIENT_DATA, PROCESSING_CONFIG.

Modifiers

Resample

Layered on top of any base operator (except REG_BAYES_LINEAR).

Value | Behavior
"" | No resampling. Closed-form / asymptotic standard errors.
"jackknife" | Leave-one-out resampling. SE = sqrt((n−1)/n · Σᵢ (β⁽⁻ⁱ⁾ − β̄)²).
"bootstrap" | Non-parametric bootstrap. bootstrap_iters (default 1000), rng_seed (0 → time-seeded; non-zero → reproducible).

For l1 / elasticnet OLS, setting Resample is the rigorous answer for standard errors: it suppresses the PROCESSING_REGRESSION_APPROXIMATE_SE warning (the SEs are now resample-based, not plug-in over the active set).
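The jackknife formula in the table can be checked on a simple statistic. The sketch below applies it to the sample mean instead of a regression coefficient, which is a simplification chosen so the result is easy to verify: for the mean, the jackknife SE equals the usual s/√n. The jackknifeSE helper is hypothetical and is not how Pulse refits models.

```go
package main

import (
	"fmt"
	"math"
)

// jackknifeSE computes sqrt((n-1)/n * sum_i (theta_i - thetaBar)^2), where
// theta_i is the statistic recomputed with observation i left out.
func jackknifeSE(xs []float64, stat func([]float64) float64) float64 {
	n := len(xs)
	if n < 2 {
		return math.NaN()
	}
	loo := make([]float64, n)
	buf := make([]float64, 0, n-1)
	var bar float64
	for i := range xs {
		buf = append(buf[:0], xs[:i]...)
		buf = append(buf, xs[i+1:]...)
		loo[i] = stat(buf)
		bar += loo[i]
	}
	bar /= float64(n)
	var ss float64
	for _, v := range loo {
		ss += (v - bar) * (v - bar)
	}
	return math.Sqrt(float64(n-1) / float64(n) * ss)
}

func mean(xs []float64) float64 {
	var s float64
	for _, x := range xs {
		s += x
	}
	return s / float64(len(xs))
}

func main() {
	xs := []float64{2, 4, 4, 4, 5, 5, 7, 9}
	fmt.Println(jackknifeSE(xs, mean))
}
```

Each leave-one-out pass recomputes the statistic over n−1 rows, which is why setting Resample forces the buffered path.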

Selection

Layered on top of any base operator (except REG_BAYES_LINEAR).

Value | Behavior
"" | No subset selection.
"forward" | Start from intercept-only; add the predictor that lowers the criterion most.
"backward" | Start from full model; remove the predictor whose absence lowers the criterion most.
"stepwise" | Bidirectional sweep; try every add and every remove per cycle.

Requires Criterion ∈ {"aic", "bic"}.

  • AIC = -2·logL + 2·k. Lighter penalty; may retain weak predictors at moderate n.
  • BIC = -2·logL + log(n)·k. Heavier per-parameter penalty; rejects noise predictors more reliably at moderate n.
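A short numeric sketch shows how the two criteria can disagree about the same candidate predictor. The log-likelihood values are made up for illustration, and the aic and bic helpers are hypothetical, written directly from the two formulas above.

```go
package main

import (
	"fmt"
	"math"
)

// aic and bic follow the formulas above: k is the parameter count,
// n the number of rows, logL the maximized log-likelihood.
func aic(logL float64, k int) float64 {
	return -2*logL + 2*float64(k)
}

func bic(logL float64, k, n int) float64 {
	return -2*logL + math.Log(float64(n))*float64(k)
}

func main() {
	const n = 100
	// Hypothetical fits: adding one predictor raises logL by 1.5.
	baseLogL, baseK := -120.0, 3
	candLogL, candK := -118.5, 4

	fmt.Println("AIC keeps the predictor:", aic(candLogL, candK) < aic(baseLogL, baseK))
	fmt.Println("BIC keeps the predictor:", bic(candLogL, candK, n) < bic(baseLogL, baseK, n))
}
```

At n = 100 the BIC penalty per parameter is log(100) ≈ 4.6, more than twice the AIC penalty of 2, so a gain of 3 in −2·logL is enough for AIC but not for BIC.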

SelectedFeatures lists the chosen subset; Coefficients drops non-selected predictors entirely (absence ≠ zero — selection’s contract is stronger). Selection may be combined with Resample: Selection picks the active subset, then Resample replaces SE / p-values on the selected model.

Compositional patterns

Polynomial regression — FEAT_POLY + REG_OLS

Polynomial regression is linear in the coefficients; the non-linearity lives in the feature space. Use FEAT_POLY upstream to materialize x_2, x_3, …, x_<degree> derived columns, then list them alongside the original x in predictors:

{
  "features": [
    {"type": "FEAT_POLY", "field": "x", "label": "x", "params": {"degree": 3}}
  ],
  "regressions": [
    {"type": "REG_OLS", "name": "polyfit", "target": "y",
     "predictors": ["x", "x_2", "x_3"]}
  ]
}

Degree is gated at [2, 10]. Numerical stability is the caller’s responsibility: x^10 is already about 1e30 at |x| = 1000, so Gram-matrix entries (sums of products up to x^20) can span dozens of orders of magnitude, and the matrix conditions poorly long before float64 overflows. Centre or standardize predictors before requesting FEAT_POLY.
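The advice to centre or standardize can be illustrated with a small Go sketch. Both helpers are hypothetical and only mimic the x_2, x_3, … column naming from the request above; they are not the FEAT_POLY implementation. The standardize helper uses the population standard deviation and assumes the input is non-empty and not constant.

```go
package main

import (
	"fmt"
	"math"
)

// standardize returns (x - mean) / stddev for each value.
// Assumes len(xs) > 0 and at least two distinct values.
func standardize(xs []float64) []float64 {
	n := float64(len(xs))
	var mean float64
	for _, x := range xs {
		mean += x
	}
	mean /= n
	var ss float64
	for _, x := range xs {
		ss += (x - mean) * (x - mean)
	}
	sd := math.Sqrt(ss / n)
	out := make([]float64, len(xs))
	for i, x := range xs {
		out[i] = (x - mean) / sd
	}
	return out
}

// polyColumns returns x^2 ... x^degree, one value per derived column.
func polyColumns(x float64, degree int) []float64 {
	if degree < 2 {
		return nil
	}
	out := make([]float64, 0, degree-1)
	p := x
	for d := 2; d <= degree; d++ {
		p *= x
		out = append(out, p)
	}
	return out
}

func main() {
	raw := []float64{1000, 2000, 3000}
	z := standardize(raw)
	fmt.Println("raw powers:         ", polyColumns(raw[2], 3))
	fmt.Println("standardized powers:", polyColumns(z[2], 3))
}
```

After standardizing, the derived columns stay within a few units of zero instead of growing by three orders of magnitude per degree.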

Ecological regression — group → regress

“Ecological regression” is a regression fit on aggregated group-level statistics — per-precinct means, per-county sums, per-region rates — rather than individual-level rows. Use pulse_compose with two slots: slot 1 produces per-group means via GROUP_* + AGG_AVERAGE, slot 2 fits REG_OLS over the aggregate output (or, in practice, over a pre-aggregated .pulse file).

The two slots are intentionally independent; Pulse does not pipe slot-1 results into slot-2 as cohort input. Either (a) materialize slot 1’s aggregate as its own .pulse cohort upstream, or (b) treat slot 1 as the audit trail (per-group means visible in the composed response) and run slot 2 over a pre-aggregated fixture.

Caution — the ecological fallacy. A significant group-level slope does not imply an individual-level association. Robinson (1950) showed that ecological correlations and individual correlations can take opposite signs in the same data: a per-state regression of literacy on race might suggest a strong relationship that vanishes (or reverses) at the per-person level. Aggregation collapses within-group variation, leaving only between-group structure that frequently encodes confounders.

When ecological regression is the right tool: aggregate-only data (census output, public-health summary tables); genuinely group-level research questions (“do counties with higher median income have higher turnout?”). When it is the wrong tool: individual-level claims; causal claims. Annotate consumer-facing prose with this caveat; Pulse cannot enforce it.

Robinson, W.S. (1950). “Ecological Correlations and the Behavior of Individuals.” American Sociological Review 15(3): 351–357.

Per-row regression attributes

Three attribute operators emit per-record diagnostics from a fitted regression onto the row stream.

Attribute | Emits per row
ATTR_REG_FITTED | ŷ_i = Xᵢ β — the model’s prediction at each row.
ATTR_REG_RESIDUAL | y_i − ŷ_i — the per-row residual.
ATTR_REG_LEVERAGE | h_ii = Xᵢ (XᵀX)⁻¹ Xᵢᵀ — the i-th diagonal of the hat matrix.

Each attribute references a sibling regression spec by regression_name. See skills/attribute-composition.md for the parameter table.
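For a single predictor the three diagnostics have closed forms, which makes them easy to compute by hand. The sketch below is an illustration under that one-predictor assumption, with a hypothetical rowDiag type; in that case the hat-matrix diagonal reduces to h_ii = 1/n + (xᵢ − x̄)² / Σ(xⱼ − x̄)².

```go
package main

import "fmt"

// rowDiag carries the three per-row diagnostics for one observation.
type rowDiag struct {
	Fitted, Residual, Leverage float64
}

// diagnostics fits y = b0 + b1*x by least squares and returns per-row
// fitted values, residuals, and leverages.
// Assumes len(xs) == len(ys) and at least two distinct x values.
func diagnostics(xs, ys []float64) []rowDiag {
	n := float64(len(xs))
	var mx, my float64
	for i := range xs {
		mx += xs[i]
		my += ys[i]
	}
	mx /= n
	my /= n
	var sxx, sxy float64
	for i := range xs {
		dx := xs[i] - mx
		sxx += dx * dx
		sxy += dx * (ys[i] - my)
	}
	slope := sxy / sxx
	intercept := my - slope*mx
	out := make([]rowDiag, len(xs))
	for i := range xs {
		fit := intercept + slope*xs[i]
		dx := xs[i] - mx
		out[i] = rowDiag{Fitted: fit, Residual: ys[i] - fit, Leverage: 1/n + dx*dx/sxx}
	}
	return out
}

func main() {
	xs := []float64{1, 2, 3, 4, 10}
	ys := []float64{2, 4, 5, 4, 12}
	for i, d := range diagnostics(xs, ys) {
		fmt.Printf("row %d: fitted=%.3f residual=%.3f leverage=%.3f\n", i, d.Fitted, d.Residual, d.Leverage)
	}
}
```

Two sanity checks hold for any such fit with an intercept: the residuals sum to zero and the leverages sum to the number of coefficients (here 2).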

Error codes

Look up full prose via pulse_errors_lookup or pulse errors lookup CODE.

Code | Meaning (one-liner)
PROCESSING_REGRESSION_NOT_IMPLEMENTED | Reserved as of Phase 8; no engine returns this today.
PROCESSING_REGRESSION_RANK_DEFICIENT | XᵀX is singular; add regularization or drop a predictor.
PROCESSING_REGRESSION_NO_CONVERGE | IRLS or coordinate descent failed within MaxIters.
PROCESSING_REGRESSION_SINGULAR_GRAM | XᵀX non-invertible even after regularization; increase alpha.
PROCESSING_REGRESSION_INVALID_FAMILY | REG_GLM Family outside {binomial, poisson, gamma}.
PROCESSING_REGRESSION_INVALID_LINK | Link incompatible with the chosen Family.
PROCESSING_REGRESSION_INSUFFICIENT_DATA | Filtered set has fewer rows than predictors + 1, or below resample minimum.
PROCESSING_REGRESSION_APPROXIMATE_SE | Warning: l1 / elasticnet SE is a plug-in approximation; set resample for rigor.
PROCESSING_REGRESSION_REGULARIZED_SELECTION | Warning: penalty != "" plus selection != "" is unusual.
PROCESSING_CONFIG | Invalid spec combination (e.g. Bayes + Resample, GLM + Penalty).

Worked examples

Every Indeed name has a runnable JSON file under examples/regression/, listed in the Example column of the mapping table above. Fetch one via pulse_examples_get or read the file directly.

Architecture Overview

Source of truth: the canonical architectural contract is CLAUDE.md at the repository root. This chapter restates its design principles for human readers; if the two ever disagree, CLAUDE.md is authoritative.

Pulse is a high-performance, self-describing tabular data processing engine. It ships as a Go library (github.com/frankbardon/pulse) and as a CLI binary (cmd/pulse/). The library is the primary deliverable; the CLI is a thin adapter over it.

Design principles

  • Library-first. The pulse.go facade (pulse.New, pulse.Options, pulse.Process, pulse.Compose, pulse.Import, pulse.Export, pulse.Convert, pulse.Inspect, pulse.Predict, pulse.Sample, pulse.Facet) is the public API. The CLI calls the library; it never contains business logic.
  • Self-describing. Every .pulse file carries its schema in the header. The descriptor/ package provides manifest, predict, and inspect operations that expose the system’s capabilities and validate requests without executing them.
  • Skill-augmented. The skills/ package embeds 19 markdown skill files into the binary via //go:embed. LLM agents (and Nexus, the orchestration layer that consumes Pulse) can call skills.List() and skills.Get(name) at boot time to inject domain-specific guidance into their context.
  • Nexus relationship. Pulse is a standalone processing engine. Nexus is the upstream orchestration agent that calls Pulse’s library API or CLI. Pulse has no dependency on Nexus. Nexus discovers Pulse’s capabilities via pulse manifest --json and loads skills from the embedded skill pack.

The next chapter, Package Layout, shows where each of these concerns lives in the source tree.

Package Layout

Source of truth: this tree is mirrored from the “Package layout” section of CLAUDE.md. If the project structure changes, that file is updated first; this page follows.

pulse/
├── cmd/
│   └── pulse/              # CLI binary (the only binary)
├── pulse.go                # Public facade — pulse.New, pulse.Options
├── service/                # Orchestration layer; wires processing to encoding
├── processing/             # Aggregators, attributes, filterers, groupers, windows, features
│   ├── window/             # WIN_* operators (LAG, LEAD, RANK, RUNNING_*, EWMA, ...)
│   └── feature/            # FEAT_* pre-filter feature engineers (LOG, SQRT, BUCKETIZE, ...)
├── encoding/               # Dynamic schema + record codec (.pulse binary format)
├── io/                     # Bidirectional tabular <-> .pulse adapters
│   ├── csv/                # CSV reader + writer
│   ├── tsv/                # TSV reader + writer
│   ├── ndjson/             # NDJSON reader + writer
│   ├── jsonarray/          # JSON-array reader + writer (single top-level array of flat objects)
│   ├── jsonshared/         # Value coercion helpers shared by ndjson and jsonarray
│   ├── arrow/              # Arrow IPC / Feather V2 reader + writer; shared Arrow<->Pulse type maps
│   ├── parquet/            # Parquet reader + writer (delegates type maps to io/arrow)
│   └── excel/              # Excel reader + writer (Excelize)
├── fs/                     # afero-based filesystem abstraction + extension hook
├── errors/                 # Typed error codes (CodedError system)
├── types/                  # Request/response structs (JSON-serializable)
├── descriptor/             # Self-description: manifest, predict, inspect, envelope
├── skills/                 # Embedded markdown skill pack (//go:embed)
│   ├── index.json          # Manifest of all bundled skills
│   └── *.md                # Individual skill files with YAML frontmatter
├── synth/                  # Synthetic data generator (from-schema, from-profile)
├── docs/                   # mdBook source for this site (published to GitHub Pages)
└── internal/
    ├── cli/                # CLI internals (descriptor walker, json action)
    └── mcp/                # MCP server: tool + resource handlers wrapping pulse.Pulse
        └── mcptools/       # Leaf metadata package (tool names + descriptions) consumed by descriptor

Adding an Aggregator

Audience: Pulse internals contributors adding a new AGG_* operator.

This page is a step-by-step recipe. The same content lives in CLAUDE.md → Common Claude Code Workflows → Adding a new aggregator; this is the human-readable mirror.

From CLAUDE.md, Common Claude Code Workflows.

1. Declare the type constant

Add the new constant to types/types.go and the slice returned by types.AllAggregationTypes(). Example, for a hypothetical AGG_GINI:

const (
    // ... existing constants ...
    AGG_GINI AggregationType = "AGG_GINI"
)

func AllAggregationTypes() []AggregationType {
    return []AggregationType{
        // ... existing entries, alphabetised ...
        AGG_GINI,
    }
}

The exhaustiveness tests (TestStreamability_AggregationsKnown and friends) will fail until you add the streamability case in step 4.

2. Implement the aggregator and register it

The operator implementation lives in processing/. Write the factory function (newGini(...) returning the aggregator interface) and register it in aggregatorRegistry in processing/registry.go.

If the aggregator can update one row at a time, also implement the OnlineAggregator interface so it joins the streaming Process path. Sort-based or sum-of-deviation aggregators (like AGG_MEDIAN, AGG_ZSCORE) skip this interface and run in the buffered path.
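To see why a sort-based aggregator cannot update one row at a time, here is a standalone sketch of the computation the hypothetical AGG_GINI would need. It does not use Pulse's aggregator or OnlineAggregator interfaces, whose signatures are not shown on this page; gini is an illustrative function that assumes non-negative inputs.

```go
package main

import (
	"fmt"
	"sort"
)

// gini computes the Gini coefficient with the rank-weighted formula
// G = 2*sum(i*x_i)/(n*sum(x)) - (n+1)/n over values sorted ascending
// (i is the 1-based rank). The ranks are only known once every value has
// been seen, so the whole column must be buffered and sorted first.
func gini(values []float64) float64 {
	n := len(values)
	if n == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...) // copy; leave the input untouched
	sort.Float64s(sorted)
	var weighted, total float64
	for i, v := range sorted {
		weighted += float64(i+1) * v
		total += v
	}
	if total == 0 {
		return 0
	}
	return 2*weighted/(float64(n)*total) - float64(n+1)/float64(n)
}

func main() {
	fmt.Println(gini([]float64{1, 1, 1, 1})) // perfectly equal
	fmt.Println(gini([]float64{0, 0, 0, 1})) // one row holds everything
}
```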

3. Tests

Tests come first: write them in processing/aggregator_test.go before the implementation from step 2, run the suite, confirm they fail informatively, then write the implementation until green. See Testing Conventions.

4. Declare streamability

Add a case for the new type in types/streamability.go:

func (t AggregationType) Streamable() bool {
    switch t {
    // ...
    case AGG_GINI:
        return false // sort-based
    }
}

Add the same row to the table in types/streamability_test.go.

If the aggregator is online, also expect TestRegistryStreamabilityMatchesTypes to compare your OnlineAggregator implementation against the AggregationType.Streamable() return value — they must agree.

5. Update the skill pack

Add a section for the new aggregator in skills/aggregation-guide.md. Cover when to use it, what its inputs and outputs look like, and any caveats (sort cost, memory, supported field types).

The CI gate TestSkillsCoverAllComponents parses the skill body for the operator name; the section can live anywhere in the file as long as the name appears.

6. Declare the capability metadata

Add a row to descriptor/capabilities_aggregations.go describing the operator’s params, accepted field types, emitted type, and the streamable hint. TestManifestOperatorsComplete enforces that every registered aggregator has a capability row.

7. CLAUDE.md and registered-component lists

Update CLAUDE.md’s “Current registered components” section with the new aggregator name in the right alphabetised slot. If the operator interacts with categorical fields in a special way, also update descriptor/predict.go’s numericAggregations map.

8. Run the gates

go test ./skills/ -run TestSkillsCoverAllComponents
go test ./descriptor/ -run 'TestManifest|TestPredict'
go test ./processing/ -run TestRegistryStreamability
go test ./...

The full Update Demand row for aggregators says: skill update + capability declaration + CLAUDE.md update + the existing test coverage. All four ride in the same PR. See The Update Demand.

Adding an I/O Format

Audience: internals contributors adding a new bidirectional tabular format (a peer to the existing csv/, tsv/, ndjson/, jsonarray/, arrow/, parquet/, excel/ sub-packages).

From CLAUDE.md, Common Claude Code Workflows.

1. Create the sub-package

Each format is a sub-package under io/. Create io/<format>/<format>.go with both a reader and a writer.

The two interfaces to implement live in io/:

// Reader
type Reader interface {
    ReadHeader() ([]string, error)
    ReadRows(ctx context.Context, fn func(row []string) error) error
    Close() error
}

// Writer
type Writer interface {
    WriteHeader(columns []string) error
    WriteRow(values []string) error
    Close() error
}

If the reader needs schema inference (header sample, then full import), also implement io.ResetReader.Reset() so the import job can rewind after sampling.

2. Tests

Add io/<format>/<format>_test.go with the standard round-trip checks: write rows, read them back, verify equality. Hermetic tests should use afero.NewMemMapFs() — see Testing Conventions.
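As an illustration of the round-trip check, here is a minimal sketch that implements both interfaces for a made-up, in-memory format. The names pipeWriter, pipeReader, newPipeReader, and roundTrip are hypothetical and exist only for this example; a real sub-package would route file access through the injected filesystem rather than a bytes.Buffer.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"strings"
)

// Reader and Writer repeat the interfaces listed above.
type Reader interface {
	ReadHeader() ([]string, error)
	ReadRows(ctx context.Context, fn func(row []string) error) error
	Close() error
}

type Writer interface {
	WriteHeader(columns []string) error
	WriteRow(values []string) error
	Close() error
}

// pipeWriter is a hypothetical toy format: one record per line, fields joined
// by "|". It does no escaping, so values must not contain "|" or newlines.
type pipeWriter struct{ buf *bytes.Buffer }

func (w *pipeWriter) WriteHeader(columns []string) error { return w.WriteRow(columns) }
func (w *pipeWriter) WriteRow(values []string) error {
	_, err := w.buf.WriteString(strings.Join(values, "|") + "\n")
	return err
}
func (w *pipeWriter) Close() error { return nil }

// pipeReader reads the toy format back from an in-memory string.
type pipeReader struct {
	lines []string
	pos   int
}

func newPipeReader(data string) *pipeReader {
	return &pipeReader{lines: strings.Split(strings.TrimSuffix(data, "\n"), "\n")}
}

// ReadHeader returns the first line and positions the reader on the first row.
func (r *pipeReader) ReadHeader() ([]string, error) {
	r.pos = 1
	return strings.Split(r.lines[0], "|"), nil
}

// ReadRows streams every remaining line to fn, stopping on error or cancellation.
func (r *pipeReader) ReadRows(ctx context.Context, fn func(row []string) error) error {
	for ; r.pos < len(r.lines); r.pos++ {
		if err := ctx.Err(); err != nil {
			return err
		}
		if err := fn(strings.Split(r.lines[r.pos], "|")); err != nil {
			return err
		}
	}
	return nil
}
func (r *pipeReader) Close() error { return nil }

// roundTrip writes a header and rows, then reads them back through the Reader.
func roundTrip(header []string, rows [][]string) ([]string, [][]string, error) {
	var buf bytes.Buffer
	var w Writer = &pipeWriter{buf: &buf}
	if err := w.WriteHeader(header); err != nil {
		return nil, nil, err
	}
	for _, row := range rows {
		if err := w.WriteRow(row); err != nil {
			return nil, nil, err
		}
	}
	if err := w.Close(); err != nil {
		return nil, nil, err
	}

	var r Reader = newPipeReader(buf.String())
	defer r.Close()
	gotHeader, err := r.ReadHeader()
	if err != nil {
		return nil, nil, err
	}
	var gotRows [][]string
	err = r.ReadRows(context.Background(), func(row []string) error {
		gotRows = append(gotRows, row)
		return nil
	})
	return gotHeader, gotRows, err
}

func main() {
	header, rows, err := roundTrip([]string{"id", "name"}, [][]string{{"1", "ada"}, {"2", "bob"}})
	fmt.Println(header, rows, err)
}
```

A real test would compare the returned header and rows against the inputs in a table-driven loop and add cases for empty input and cancelled contexts.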

3. Wire it into the CLI

The CLI registers per-format leaves in internal/cli/import.go and internal/cli/export.go. Add the format string to:

  • The switch in makeImportReader(format, ...) in import.go.
  • The corresponding newWriterForFormat(format, ...) switch in export.go.
  • The Commands: slice on ImportCommand() and ExportCommand() in the same files (one importFormatCmd("yourformat") / exportFormatCmd("yourformat") line).

The pulse convert leaf auto-detects format from extension via formatFromExt; add the extension mapping if the new format has a canonical file extension.
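For orientation, extension-based detection usually reduces to a small lookup. The sketch below is not the real formatFromExt: the function name formatFromPath, the map extToFormat, and the specific extensions (in particular .xlsx for excel) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extToFormat is a hypothetical mapping; check the real formatFromExt for the
// extensions Pulse actually recognises.
var extToFormat = map[string]string{
	".csv":     "csv",
	".tsv":     "tsv",
	".ndjson":  "ndjson",
	".parquet": "parquet",
	".xlsx":    "excel",
}

// formatFromPath returns the format name for a path, or false when the
// extension is unknown. Matching is case-insensitive.
func formatFromPath(path string) (string, bool) {
	ext := strings.ToLower(filepath.Ext(path))
	format, ok := extToFormat[ext]
	return format, ok
}

func main() {
	fmt.Println(formatFromPath("exports/Report.CSV"))
	fmt.Println(formatFromPath("notes.txt"))
}
```

Returning a boolean instead of a default format keeps unknown extensions an explicit error at the call site.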

4. Schema mapping

If the new format has a native type system (Arrow / Parquet do, CSV does not), share the type map with neighbouring formats via the io/arrow package the way Parquet already does. CSV / TSV / NDJSON / JSON-array share io/jsonshared for value coercion.

5. Skill update

Add or update a skill that points users at the new format. If the new format is primarily an export concern, update skills/export-format-selection.md. If it has import-side considerations (schema inference, null markers, type ambiguity), update skills/import-best-practices.md.

If the format adds a CLI flag (e.g. --sheet for Excel), update skills/getting-started.md so TestSkillsCoverAllCliLeaves keeps passing.

6. Convert and orchestration plumbing

Make sure both directions flow through pio.ImportJob and pio.ExportJob. The orchestration layer is format-agnostic; you should not need to touch service/ unless the new format requires special metadata (e.g., Parquet’s per-column statistics).

7. Run the gates

go test ./io/<format>/...
go test ./skills/ -run TestSkillsCoverAll
go test ./...

For format-specific perf, add benchmarks (Benchmark<Format>...) in the sub-package. There’s no required perf gate today, but neighbouring formats have benchmarks you can mirror as a baseline.

Adding a Statistical Test

Audience: internals contributors adding a new TEST_* operator — tier-1 (row-stream) or tier-2 (post-test on the materialised result set).

The recipe mirrors the aggregator and feature recipes; the test-specific moving parts are streamability, the test catalog, and the registered-test capability table.

From CLAUDE.md, “Update Demand” rows for statistical tests and tier-2 post-test variants.

1. Decide tier

  • Tier 1. Runs against the raw row stream, alongside aggregators. Online-moments tests (TEST_T, TEST_WELCH, TEST_CHISQ, TEST_ANOVA_F) stay in the streaming Process path. Sort-required tests (TEST_KS) force the buffered path.
  • Tier 2. Runs after the result set is materialised, in req.PostTests. Always buffered.

2. Declare the type constant

Add to types/types.go:

const (
    // ... existing constants ...
    TEST_GINI_TREND TestType = "TEST_GINI_TREND"
)

Add it to types.AllTestTypes().

3. Implement and register

Tests live in processing/test_*.go. Existing examples to mirror:

  • processing/test_t.go — online tier-1 test.
  • processing/test_anova.go — tier-1 ANOVA with grouper support.
  • processing/test_post.go and processing/test_post_more.go — tier-2 post-tests.
  • processing/test_studentized.go — numerical integration utilities (used by TEST_TUKEY_HSD).

Register the test in processing/test.go (the registry construction calls). For tier-2 variants, declare both the base type and the variant identifier the post-test surface uses.

4. Streamability

Add a case in types/streamability.go for the new TestType:

func (t TestType) Streamable() bool {
    switch t {
    // ...
    case TEST_GINI_TREND:
        return false // sort-based
    }
}

Add the matching row in types/streamability_test.go so TestStreamability_TestsKnown passes.
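The shape of that table check can be sketched as follows. Everything here is a local stand-in: TestType, allTestTypes, expected, and checkStreamability are hypothetical names, and the default return of true is an assumption, not a statement about types/streamability.go.

```go
package main

import "fmt"

// TestType and the constants are local stand-ins for the types package.
type TestType string

const (
	TEST_T          TestType = "TEST_T"
	TEST_KS         TestType = "TEST_KS"
	TEST_GINI_TREND TestType = "TEST_GINI_TREND"
)

// allTestTypes plays the role of types.AllTestTypes().
func allTestTypes() []TestType { return []TestType{TEST_T, TEST_KS, TEST_GINI_TREND} }

// Streamable mirrors the switch shape shown above; the fall-through value is
// an assumption made for this sketch.
func (t TestType) Streamable() bool {
	switch t {
	case TEST_KS, TEST_GINI_TREND:
		return false // sort-based
	}
	return true
}

// expected is the table a streamability test would carry: one row per type.
var expected = map[TestType]bool{
	TEST_T:          true,
	TEST_KS:         false,
	TEST_GINI_TREND: false,
}

// checkStreamability reports every type that has no table row, or whose row
// disagrees with Streamable().
func checkStreamability() []string {
	var problems []string
	for _, tt := range allTestTypes() {
		want, ok := expected[tt]
		switch {
		case !ok:
			problems = append(problems, string(tt)+": missing table row")
		case want != tt.Streamable():
			problems = append(problems, string(tt)+": table disagrees with Streamable()")
		}
	}
	return problems
}

func main() {
	fmt.Println(len(checkStreamability()), "problems")
}
```

Forgetting the table row for a new constant shows up as a "missing table row" entry, which is the failure this step is meant to prevent.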

5. Capability declaration

Add a row to descriptor/capabilities_tests.go:

  • For a tier-1 test, declare it in the tier-1 catalog (testCapabilities).
  • For a tier-2 post-test, declare it in postTestCapabilities.

TestManifestTestsComplete and TestManifestPostTestsComplete enforce that the manifest enumerates every registered test.

6. Skill update

Add an entry to skills/statistical-testing.md under “Operator catalog”. Describe the test’s family, inputs, outputs (statistic, p, df, effect size, …), and any preconditions (PULSE_TEST_* error codes it can raise). For tier-2 variants, also document the variant field shape since the post-test API exposes it.

7. Tests

Use the same TDD pattern as for aggregators. The processing package has rich existing test files to model new cases against: processor_test_pipeline_test.go, test_parametric_test.go, test_nonparametric_test.go, test_post_more_test.go. Add hermetic fixtures that exercise the streaming and buffered paths.

8. Error codes

If your test introduces a new failure mode, add a code to errors/codes.go (mirror the existing PULSE_TEST_* family), register its description row in descriptor/capabilities_errors.go, and document recovery in skills/error-code-reference.md. See the Adding an Aggregator recipe for the same pattern at the aggregator layer.

9. CLAUDE.md

Update CLAUDE.md’s “Current registered components → statistical tests” line with the new operator. If the test introduces a new preconditions class (e.g. paired sample, repeated measures), also add a sentence describing it in the parent paragraph.

10. Run the gates

go test ./processing/ -run TestType_Streamable
go test ./types/    -run TestStreamability_TestsKnown
go test ./descriptor/ -run TestManifest
go test ./skills/    -run TestSkillsCoverAll
go test ./...

See The Update Demand for the full row that governs statistical-test changes.

The Update Demand

Source of truth: this chapter is mirrored from the “Update Demand” section of CLAUDE.md. Both files are kept in lock-step; CLAUDE.md is authoritative if they ever diverge (a TestUpdateDemandTableCovers CI gate enforces table coverage against the registries).

Any change to Pulse code, configuration, file format, or public surface MUST update the corresponding skill file(s) and CLAUDE.md in the same PR. This is not a courtesy. It is a non-skippable CI failure if any of the trigger conditions below is met without the corresponding doc update.

Trigger → required update

| If you change… | You MUST also update… | Enforced by |
|---|---|---|
| A registered aggregator | skills/aggregation-guide.md (add or update the section for that aggregator) | TestSkillsCoverAllComponents |
| A registered attribute | skills/attribute-composition.md | TestSkillsCoverAllComponents |
| A registered filterer | skills/aggregation-guide.md (filtering section) | TestSkillsCoverAllComponents |
| A registered grouper | skills/grouper-design.md | TestSkillsCoverAllComponents |
| A registered window operator | skills/window-operations.md | TestSkillsCoverAllWindowTypes |
| An error code (added/removed/renamed) | skills/error-code-reference.md | TestSkillsCoverAllErrorCodes |
| A CLI leaf (added/removed/flag added) | CLAUDE.md “Common Claude Code Workflows” + skills/getting-started.md if user-facing | TestSkillsCoverAllCliLeaves |
| A --json envelope or format_version | CLAUDE.md “Output Format Contract” | TestClaudeMdMentionsFormatVersion |
| A .pulse file format change (header layout, new field type) | CLAUDE.md “Code Conventions” + skills/cohort-schema-design.md | TestClaudeMdMentionsFormatVersion, TestSkillsCoverAllFieldTypes |
| A new non-skippable CI gate | CLAUDE.md (gate listed by name in the relevant section) | TestClaudeMdMentionsAllNonSkippableGates |
| A new architectural decision | CLAUDE.md (relevant section) + PRD if applicable | reviewer enforcement |
| An environment variable | CLAUDE.md “Build / Dev / Test Workflow” + skills/getting-started.md | TestClaudeMdMentionsAllEnvVars |
| A registered MCP tool (added/removed) | skills/mcp-integration.md (Tool surface table) + internal/mcp/mcptools/meta.go (name + description) | TestSkillsCoverAllMCPTools, TestManifestMCPToolsComplete |
| A new MCP action tool with field-name parameters | internal/mcp/schema_bind.go (add a per-tool JSON Schema builder + entry in Bind) + skills/mcp-integration.md (Schema-bound enums section) | TestMCPSchemaBinding_RemovesInvalidFields, TestMCPSchemaBinding_AllFieldsInFiltererEnum, TestMCPSchemaBinding_SampleAndFacetFieldEnum, TestMCPSchemaBinding_InspectSucceedsRegistersBindings, TestMCPSchemaBinding_BindOnOpenFalse |
| A registered feature operator | skills/feature-engineering.md (operator catalog) + capability declaration in descriptor/capabilities_features.go | TestSkillsCoverAllComponents, TestManifestOperatorsComplete |
| A registered synth distribution kind | skills/synthetic-data.md (Supported distributions) + capability declaration in descriptor/capabilities_distributions.go | TestSkillsCoverAllSynthDistributions, TestManifestDistributionsComplete |
| A registered statistical test (TEST_*) | skills/statistical-testing.md (Operator catalog) + types/streamability.go + types/streamability_test.go + capability declaration in descriptor/capabilities_tests.go | TestStreamability_TestsKnown, TestManifestTestsComplete |
| A registered tier-2 post-test variant | Capability declaration in descriptor/capabilities_tests.go (postTestCapabilities) | TestManifestPostTestsComplete |
| A registered aggregator/attribute/filterer/grouper/window capability metadata | Capability declaration in descriptor/capabilities_<category>.go (params, accepts_types, emits_type, streamable_hint) | TestManifestOperatorsComplete |
| A new error code | Description row in descriptor/capabilities_errors.go (errorMetaTable) | TestManifestErrorCodesComplete |
| An error code’s fixup template | Entry in errors/fixup_metadata.go (codeMetadata) + **Fixup**: line in skills/error-code-reference.md under that code | TestCodesHaveFixups, TestSkillsErrorCodeFixupsDocumented |
| A new operator’s streaming capability | types/streamability.go (case for the new type) + table in types/streamability_test.go | TestRegistryStreamabilityMatchesTypes, TestStreamability_*Known, TestManifestStreamableMatchesTypes |
| The default operator table | CLAUDE.md “Code Conventions → Smart defaults” + skills/getting-started.md (“Defaults” section) | TestDefaults_Applied + reviewer enforcement |
| A natural-query parsing route (new grammar shape) | internal/query/query.go grammar + internal/query/query_test.go fixtures + skills/query-router-prompt.md (router prompt grammar) + skills/request-recipes.md (target shapes) | TestNaturalQuery_HeuristicGrammar |

The Update Demand applies recursively to itself: when a new trigger row is added (e.g., a new component category, a new contract), this table MUST be updated in the same PR. TestUpdateDemandTableCovers (non-skippable) parses this table and asserts every registered component category and contract type has a row.
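To make the self-check concrete, a coverage gate of this kind can be sketched as "extract the first column of a pipe-delimited table, then confirm each required category appears in some row". This is not the real TestUpdateDemandTableCovers; firstColumn, uncovered, the pipe-table input format, and the substring matching rule are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// firstColumn returns the first cell of every pipe-delimited table row,
// skipping the header separator row (---).
func firstColumn(markdown string) []string {
	var cells []string
	for _, line := range strings.Split(markdown, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "|") {
			continue
		}
		parts := strings.Split(strings.Trim(line, "|"), "|")
		cell := strings.TrimSpace(parts[0])
		if strings.Trim(cell, "-: ") == "" {
			continue // empty cell or separator row
		}
		cells = append(cells, cell)
	}
	return cells
}

// uncovered returns the categories that no trigger row mentions.
// Matching is a case-insensitive substring test (an assumption).
func uncovered(markdown string, categories []string) []string {
	rows := firstColumn(markdown)
	var missing []string
	for _, c := range categories {
		found := false
		for _, r := range rows {
			if strings.Contains(strings.ToLower(r), strings.ToLower(c)) {
				found = true
				break
			}
		}
		if !found {
			missing = append(missing, c)
		}
	}
	return missing
}

func main() {
	table := "| If you change | You MUST also update |\n" +
		"|---|---|\n" +
		"| A registered aggregator | skills/aggregation-guide.md |\n" +
		"| A registered grouper | skills/grouper-design.md |"
	fmt.Println(uncovered(table, []string{"aggregator", "grouper", "window operator"}))
}
```

A gate built this way fails as soon as a new component category is registered without a matching trigger row, which is the recursive property described above.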

If you find yourself wanting to defer the doc/skill update to “a follow-up PR,” stop. The follow-up PR will not happen, and the next Claude Code session will read a stale CLAUDE.md and produce wrong code. Update in the same PR or do not merge.

Deployment

Audience: operators standing up Pulse as a standalone CLI, an MCP process under an AI client, or an embedded Go library inside a larger binary.

Pulse is a single static Go binary. There is no installer, no config file, and no daemon; every deployment story is some shape of “put the binary somewhere, set PULSE_DATA_DIR, run it”.

LLM agents using MCP: see the mcp-integration skill via pulse_skills_get for the MCP-side wiring details. This page covers the operator side.

Mode 1: Standalone CLI

go install github.com/frankbardon/pulse/cmd/pulse@latest
export PULSE_DATA_DIR=/var/data/pulse
pulse --version

That’s the full install. The CLI tree is mapped in the CLI Tour.

Mode 2: MCP stdio server (Claude Desktop, Claude Code, generic MCP clients)

pulse mcp runs the Model Context Protocol over stdio. AI clients launch the process, speak MCP over its standard streams, and shut it down on session close.

The full wiring guide is in the mcp-integration skill. Quick reference for Claude Desktop:

// ~/Library/Application Support/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "pulse": {
      "command": "/usr/local/bin/pulse",
      "args": ["mcp"],
      "env": {
        "PULSE_DATA_DIR": "/var/data/pulse"
      }
    }
  }
}

For Claude Code (~/.claude.json) and other clients the shape is the same — see the mcp-integration skill (pulse skills show mcp-integration) for the canonical recipes.

Flags worth knowing:

| Flag | Default | Purpose |
|---|---|---|
| --data-dir | from PULSE_DATA_DIR | Override the cohort base directory |
| --bind-on-open | true | Register session-scoped JSON-schema-bound tool variants on successful pulse_inspect. Disable for clients that bind tool schemas themselves. |

See pulse mcp for the full command page.

Mode 3: Embedded Go library

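import "github.com/frankbardon/pulse"

// DataDir takes the place of PULSE_DATA_DIR when embedding.
p, err := pulse.New(pulse.Options{
    DataDir: "/var/data/pulse",
})
if err != nil {
    return err // surface the construction error to the caller
}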

When embedding, you can bypass PULSE_DATA_DIR entirely by passing DataDir (as above) or a custom afero.Fs. See Library Embedding for the full surface.

Production hardening

  • Filesystem permissions. pulse mcp reads everything under PULSE_DATA_DIR. Treat the directory as the trust boundary — run the process as a user that can only read what it should serve.
  • Stdio plumbing. The MCP stdio transport carries protocol messages on stdin/stdout, and clients may capture stderr as well. Pulse writes a one-line startup notice (pulse mcp: serving over stdio...) on stderr and never logs request/response payloads, so MCP clients can surface stderr without leaking data.
  • Resource limits. Streaming aggregations stay memory-bounded; buffered request shapes (window operators, median/percentile, decimal/geo paths) can materialise large intermediate row sets. Use pulse api predict to check Streamable before running an unfamiliar request — see Performance Notes.
  • No mutating background state. Pulse never writes to a cohort during process/compose. The only write paths are import, export, synth, profile, and cohort filter — explicit by flag.

Upgrades

Drop in a new binary and restart the MCP process (or the calling client). The .pulse file format carries a one-byte version field (currently 0x01); a file written by a future binary with a newer format version is rejected loudly at parse time instead of failing silently at row decode. See Header Layout.

Performance Notes

Audience: operators sizing a Pulse deployment, and library users debugging memory or latency surprises.

Pulse is built to keep “the streaming path” the default for most analytical requests. When the engine has to leave that path it says so — via the Streamable flag in pulse api predict — and falls back to a buffered execution. This page tells you what stays streaming, what buffers, and how to read predict’s diagnostics.

LLM agents using MCP: there is no direct skill counterpart for this page — debugging-with-predict covers how to drive predict; this page tells operators what predict’s answers imply.

Streaming path: what stays out of memory

The streaming Process path covers four orchestrator modes (from CLAUDE.md → What streams today):

  • Single-pass streaming. No-group requests with online aggregators (COUNT, SUM, AVG, STDDEV, VARIANCE, RANGE, FREQUENCY, MODE, SKEWNESS, KURTOSIS, DISTINCT_COUNT) on numeric (non-decimal) fields. Row-local attributes (FORMULA, DATE_PART) apply inline.
  • Grouped streaming. Groupers implementing the streaming key path (GROUP_CATEGORY, GROUP_RANGE, GROUP_ROUNDED) drive per-key online aggregator buckets. Memory is O(distinct_groups × per-aggregator-state).
  • Two-pass streaming. Two-pass attributes (ATTR_ZSCORE, ATTR_TSCORE, ATTR_NORMALIZED) compute population stats via Welford-Pébaÿ pass 1, then emit per-row values in pass 2.
  • Streaming features. Every registered FEAT_* operator implements the streaming computer interface and composes with the three modes above.

These paths benefit from three optimisations landed during the streaming refactor (commit cdd72d5): record reuse (the same record buffer flows through the pipeline), zero-allocation decoding into reused buffers, and an mmap reader for .pulse files large enough to benefit from demand paging.
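The two-pass mode is easiest to see in miniature. The sketch below uses plain Welford updates for mean and variance (not the full Pébaÿ formulation for higher moments or merging) and a z-score as the pass-2 output. The names moments, zscores, and the scan callback are hypothetical, and using the population standard deviation is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// moments holds Welford running state: count, mean, and the sum of squared
// deviations from the mean.
type moments struct {
	n    int
	mean float64
	m2   float64
}

// add folds one value into the running moments in O(1) memory.
func (m *moments) add(x float64) {
	m.n++
	delta := x - m.mean
	m.mean += delta / float64(m.n)
	m.m2 += delta * (x - m.mean)
}

// stddev returns the population standard deviation (an assumption; a sample
// estimator would divide by n-1 instead).
func (m *moments) stddev() float64 {
	if m.n == 0 {
		return 0
	}
	return math.Sqrt(m.m2 / float64(m.n))
}

// zscores scans the source twice: pass 1 accumulates moments, pass 2 emits
// (x - mean) / stddev per row. scan stands in for re-iterating a cohort.
func zscores(scan func(yield func(float64)), emit func(float64)) {
	var m moments
	scan(m.add)
	sd := m.stddev()
	scan(func(x float64) {
		if sd == 0 {
			emit(0) // constant column: every row sits on the mean
			return
		}
		emit((x - m.mean) / sd)
	})
}

func main() {
	data := []float64{2, 4, 4, 4, 5, 5, 7, 9}
	scan := func(yield func(float64)) {
		for _, x := range data {
			yield(x)
		}
	}
	zscores(scan, func(z float64) { fmt.Printf("%.2f ", z) })
	fmt.Println()
}
```

Neither pass retains rows, so memory stays constant while the scan cost doubles.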

Buffered path: when Pulse has to materialise

pulse api predict reports Streamable=false and lists every buffering reason. The current set, from CLAUDE.md:

  • AGG_MEDIAN, AGG_PERCENTILE, and AGG_ZSCORE — require sorts or summed deviations.
  • ATTR_PERCENTILE — sorted view of every value; no streaming algorithm preserves exact rank.
  • GROUP_QUANTILE, GROUP_DATE — finalize-time work over the full set.
  • Window operators (WIN_*) — operate on a sorted post-aggregate row set.
  • Decimal-typed field aggregations — precision-preserving path.
  • Two-pass attributes combined with features or groups — orchestration matrix not yet extended.
  • Tier-1 statistical tests combined with groupers, features, or two-pass attributes — same orchestration limit.
  • Tier-2 post-tests (req.PostTests) — always run after the result set is materialised, regardless of TestType.

Reading predict output

pulse api predict --request request.json --json | jq '.data | {streamable, streamable_reasons}'
{
  "streamable": false,
  "streamable_reasons": [
    "AGG_MEDIAN on field price"
  ]
}

If streamable_reasons is empty and streamable=true, the request executes without buffering. Each reason is a one-line gate that pushed the request to the buffered path; you can drop or substitute the offending operator (e.g., AGG_AVG instead of AGG_MEDIAN) and re-run predict.
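When driving predict from a Go program instead of jq, the same two fields can be decoded directly. This sketch models only the data.streamable and data.streamable_reasons fields visible in the output above; predictEnvelope and bufferingReasons are hypothetical names, and the rest of the envelope is deliberately ignored.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// predictEnvelope models only the fields visible in the jq output above.
type predictEnvelope struct {
	Data struct {
		Streamable        bool     `json:"streamable"`
		StreamableReasons []string `json:"streamable_reasons"`
	} `json:"data"`
}

// bufferingReasons returns nil when the request streams, otherwise the
// reasons predict listed for falling back to the buffered path.
func bufferingReasons(raw []byte) ([]string, error) {
	var env predictEnvelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, fmt.Errorf("decode predict output: %w", err)
	}
	if env.Data.Streamable {
		return nil, nil
	}
	return env.Data.StreamableReasons, nil
}

func main() {
	raw := []byte(`{"data":{"streamable":false,"streamable_reasons":["AGG_MEDIAN on field price"]}}`)
	reasons, err := bufferingReasons(raw)
	fmt.Println(reasons, err)
}
```

A caller can log the reasons and decide whether to substitute operators before running the request.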

Memory rules of thumb

| Path | Memory profile |
|---|---|
| Single-pass streaming | Constant: O(aggregator state) |
| Grouped streaming | O(distinct_groups × per-aggregator state) |
| Two-pass streaming | Constant; cost is 2× iter scan (typically OS-page-cached) |
| Buffered | O(filtered_rows × output_width) for the working set, plus per-operator state |
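The grouped-streaming row is worth a small illustration: state is kept per distinct key, never per input row. The sketch below is not Pulse's grouper or aggregator code; row, bucket, groupedAverages, and sliceSource are hypothetical names.

```go
package main

import "fmt"

// row is a minimal (group key, numeric value) pair.
type row struct {
	key   string
	value float64
}

// bucket is the per-group online state needed for COUNT, SUM, and AVG.
type bucket struct {
	count int
	sum   float64
}

// groupedAverages consumes rows one at a time. Memory grows with the number
// of distinct keys, not with the number of rows.
func groupedAverages(next func() (row, bool)) map[string]float64 {
	buckets := map[string]*bucket{}
	for r, ok := next(); ok; r, ok = next() {
		b := buckets[r.key]
		if b == nil {
			b = &bucket{}
			buckets[r.key] = b
		}
		b.count++
		b.sum += r.value
	}
	out := make(map[string]float64, len(buckets))
	for k, b := range buckets {
		out[k] = b.sum / float64(b.count)
	}
	return out
}

// sliceSource adapts a slice to the pull-style iterator used above.
func sliceSource(rows []row) func() (row, bool) {
	i := 0
	return func() (row, bool) {
		if i >= len(rows) {
			return row{}, false
		}
		r := rows[i]
		i++
		return r, true
	}
}

func main() {
	src := sliceSource([]row{{"a", 1}, {"b", 10}, {"a", 3}})
	fmt.Println(groupedAverages(src))
}
```

A high-cardinality grouping key therefore costs memory even on the streaming path, which is why the table lists distinct_groups as the scaling factor.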

Concurrency

pulse.ComposeParallel (CLI: pulse api compose --parallel N) fans ComposedRequest slots over a bounded worker pool. Workers share the engine’s read-only registries; each Process call constructs fresh stateful operators per request, so concurrent execution is safe. Defaults: MaxWorkers = GOMAXPROCS, FailFast = true. See Parallel Compose.
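The bounded-pool-with-fail-fast pattern can be sketched with the standard library alone. This is an illustration of the general technique, not the implementation of pulse.ComposeParallel; runBounded and runDemo are hypothetical names, and cancelling via context is an assumption about how fail-fast might be signalled.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"runtime"
	"sync"
)

// runBounded runs every task on at most maxWorkers goroutines and returns one
// error slot per task. With failFast set, the first failure cancels the
// context so tasks that have not started yet are skipped.
func runBounded(ctx context.Context, maxWorkers int, failFast bool, tasks []func(context.Context) error) []error {
	if maxWorkers <= 0 {
		maxWorkers = runtime.GOMAXPROCS(0) // mirrors the documented default
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	errs := make([]error, len(tasks))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < maxWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				if err := ctx.Err(); err != nil {
					errs[i] = err // skipped after an earlier failure
					continue
				}
				if err := tasks[i](ctx); err != nil {
					errs[i] = err
					if failFast {
						cancel()
					}
				}
			}
		}()
	}
	for i := range tasks {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return errs
}

// runDemo runs three tasks where the second one fails.
func runDemo(maxWorkers int, failFast bool) []error {
	ok := func(context.Context) error { return nil }
	boom := func(context.Context) error { return errors.New("boom") }
	return runBounded(context.Background(), maxWorkers, failFast,
		[]func(context.Context) error{ok, boom, ok})
}

func main() {
	fmt.Println(runDemo(1, true))
	fmt.Println(runDemo(1, false))
}
```

Each task index is written by exactly one worker and read only after wg.Wait, so the result slice needs no extra locking.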

When to embed vs shell out

For high-throughput pipelines, embed Pulse directly via the Go library — you avoid one process boundary per request and can stream rows through your own writer with ProcessStream. For ad-hoc analysis, JSON-in/JSON-out via pulse api process --json is faster to write and easier to debug.

Troubleshooting

Audience: operators chasing a specific failure mode in production (file not found, permission errors, MCP transport issues, common error codes).

This page is organised by symptom. For per-code recovery detail (Message + Fixup templates), fetch metadata via the pulse_errors_lookup MCP tool ({"code": "PULSE_XXX"}) or pulse errors lookup CODE on the command line. The error-code-reference skill explains the envelope shape, the DOMAIN_CATEGORY naming convention, and the repair workflow that chains predict-side suggestions into structured fixups.

LLM agents using MCP: call pulse_errors_lookup for per-code detail — code=PULSE_XXX for one code, domain=PULSE to enumerate, query="..." for keyword search. The skill is the orientation; the tool is the catalog. This page focuses on operational symptoms that don’t reduce to a single error code.

“data directory required: set PULSE_DATA_DIR or pass --data-dir”

pulse mcp refuses to start. The MCP leaf is the one place the binary insists on a base directory because it enumerates cohorts at session start.

Fix: export PULSE_DATA_DIR in the client’s MCP config, or pass --data-dir /path/to/data on the command line. The pulse mcp page has the full example.

“file not found” / “no such file or directory”

The cohort path was resolved against the wrong base. Pulse prefers absolute paths; with PULSE_DATA_DIR set, relative paths resolve against it.

Fix: call pulse cohort inspect /absolute/path/data.pulse to verify the file is where you think it is. If you’re running inside pulse mcp, check the data-dir line on stderr at startup.

“permission denied”

Pulse runs as your user; it does not escalate. When deployed as an MCP process under a different user (e.g. via launchd / systemd), the cohort directory and files must be readable by that user.

Fix: check id inside the MCP startup banner on stderr; check the file mode with ls -l; widen the group as needed.

“invalid pulse magic bytes” / “unsupported pulse format version”

The file isn’t a .pulse file, or it was written by a newer binary that introduced a format version this binary does not know. The reader rejects unknown versions at parse time (see Header Layout) so an older binary doesn’t silently mis-decode a newer file.

Fix: verify the file with file path/to/data.pulse and the first nine bytes (hexdump -C). The expected magic is 50 55 4c 53 45 00 00 00 followed by a version byte (0x01 today).

“truncated pulse header”

The file is shorter than nine bytes or was cut off mid-write.

Fix: re-import. If you suspect a partial write, also check whether the writer was killed mid-flush — Pulse writes the header first, then the schema, then the records, so a truncated file usually fails here before any data is observed.
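The three header failures above (bad magic, unknown version, truncation) can be told apart with a nine-byte read. The sketch below uses the magic and version values quoted in this page; checkHeader, checkHeaderBytes, and the error variables are hypothetical names, not Pulse's reader API.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// pulseMagic and supportedVersion are the values quoted in the text above.
var pulseMagic = []byte{0x50, 0x55, 0x4c, 0x53, 0x45, 0x00, 0x00, 0x00}

const supportedVersion = 0x01

var (
	errTruncated = errors.New("truncated pulse header")
	errMagic     = errors.New("invalid pulse magic bytes")
	errVersion   = errors.New("unsupported pulse format version")
)

// checkHeader reads the first nine bytes and classifies the failure, if any.
func checkHeader(r io.Reader) error {
	var head [9]byte
	if _, err := io.ReadFull(r, head[:]); err != nil {
		if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
			return errTruncated
		}
		return err
	}
	if !bytes.Equal(head[:8], pulseMagic) {
		return errMagic
	}
	if head[8] != supportedVersion {
		return fmt.Errorf("%w: 0x%02x", errVersion, head[8])
	}
	return nil
}

// checkHeaderBytes is a convenience wrapper for in-memory data.
func checkHeaderBytes(b []byte) error { return checkHeader(bytes.NewReader(b)) }

func main() {
	good := append(append([]byte{}, pulseMagic...), supportedVersion)
	fmt.Println(checkHeaderBytes(good))
	fmt.Println(checkHeaderBytes([]byte("PULSE")))
}
```

Checking length first, then magic, then version matches the order in which the symptoms appear when a file is cut off early.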

SERVICE_VALIDATION errors

A field name in the request doesn’t exist in the cohort, or an operator targets a field of the wrong type.

Fix: run pulse api predict on the same request — predict diagnoses validation failures without executing. Common cases: typo in field name; numeric aggregation on a categorical field (warning code PULSE_AGG_NOT_MEANINGFUL_FOR_CATEGORICAL); two-pass attribute combined with a feature (currently buffered, not invalid — predict will flag this in streamable_reasons).

PULSE_IMPORT_* errors

Import-time failures. The two most common:

  • PULSE_IMPORT_CATEGORICAL_OVERFLOW — too many distinct values for the chosen categorical width. Either bump the width (categorical_u16 / categorical_u32), drop the categorical encoding, or filter the source before re-importing. See Dictionary Blocks.
  • PULSE_IMPORT_DESCRIPTION_TOO_LONG — schema field description exceeds 1000 bytes. Trim it.

PULSE_FIELD_DESCRIPTION_LOW_QUALITY

A warning by default, an error under --strict. The description is empty, under ten characters, or a generic placeholder ("n/a", "tbd", "unknown", "field", "data", "value", "column").

Fix: edit the description in the schema JSON, re-import with --schema.
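The three documented rules (empty, under ten characters, generic placeholder) are simple enough to pre-check before importing. This is a sketch, not Pulse's validator: lowQualityDescription and placeholders are hypothetical names, and trimming whitespace, ignoring case, and counting runes rather than bytes are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// placeholders is the generic-placeholder list quoted in the text above.
var placeholders = map[string]bool{
	"n/a": true, "tbd": true, "unknown": true, "field": true,
	"data": true, "value": true, "column": true,
}

// lowQualityDescription applies the three documented rules. Whitespace
// trimming, case folding, and rune counting are assumptions of this sketch.
func lowQualityDescription(desc string) bool {
	d := strings.TrimSpace(desc)
	if d == "" {
		return true
	}
	if placeholders[strings.ToLower(d)] {
		return true
	}
	return utf8.RuneCountInString(d) < 10
}

func main() {
	for _, d := range []string{"", "TBD", "price", "Customer signup date in UTC"} {
		fmt.Printf("%q low quality: %v\n", d, lowQualityDescription(d))
	}
}
```

Running a check like this over the schema JSON before import avoids a failed run under --strict.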

MCP “tool not found” / “no tools registered”

An MCP client connects but sees no Pulse tools.

Fix: check the client’s MCP log (Claude Desktop surfaces this in ~/Library/Logs/Claude/). Common causes: pulse binary is not on PATH, the wrong working directory, or PULSE_DATA_DIR is not set in the MCP env block. Re-read pulse mcp.

mmap / file-mapping failures

On very large .pulse files the streaming reader uses memory mapping where available. If your environment forbids mmap (some sandboxed containers, very locked-down macOS configurations), the reader falls back to a buffered read.

Fix: typically transparent. If you suspect a regression, run with verbose Go runtime tracing or compare against a non-mmap file by copying it to /tmp and re-running.

When in doubt: predict, then process

Almost every “why doesn’t this work” question is answerable by

pulse api predict --request request.json --json

Predict reads only the header and schema — it never touches record data — and returns the full envelope of errors, warnings, and the streamable flag. If predict says valid:true and process still fails, the bug is in the processing layer, not the request.

Development Setup

Audience: new contributors getting their first PR ready.

This page is the short version. The fuller treatment of the repo’s conventions, CI gates, and Update Demand lives in the Internals section and in CLAUDE.md at the repository root.

Clone

git clone https://github.com/frankbardon/pulse.git
cd pulse

Tooling

Pulse needs only the Go toolchain — there is no Node, Python, or container build. Install Go 1.24+ (see go.mod for the canonical version).

The repo also uses staticcheck for make lint; it is auto-installed on first run via go run.

Common targets

| Command | What it does |
|---|---|
| make build | Builds the CLI binary to bin/pulse (default goal) |
| make test | Runs go test ./... |
| make fmt | Runs go fmt ./... |
| make vet | Runs go vet ./... |
| make lint | Runs go vet then staticcheck ./... |
| make cover | Runs tests with coverage; outputs coverage.out |
| make clean | Removes bin/ and coverage.out |

A .env file at the repo root is auto-loaded and exported, so PULSE_DATA_DIR and any other PULSE_* env vars can live there for local development.

Run the binary you just built

make build
./bin/pulse --version
./bin/pulse --json | head -20

The CLI tree itself is mapped in the CLI Tour.

Where things live

The package layout is documented at Internals → Package Layout. Two pointers worth knowing on day one:

  • Public facade: pulse.go — every Go embedder API lives here.
  • CLI internals: internal/cli/ — one file per command group; never put processing logic here.

Read this before writing code

Style Guide

Audience: anyone writing code or docs in the Pulse repository.

This page summarises the conventions enforced by review and by CI. The authoritative source is the “Code Conventions” section of CLAUDE.md; copy that file’s rules when in doubt.

Go style

  • Standard gofmt / go vet cleanliness — make lint is the gate.
  • Module path is github.com/frankbardon/pulse. The standard-library io collision is handled by aliasing the project’s package as pio "github.com/frankbardon/pulse/io".
  • Library-first: business logic lives in library packages, never in cmd/pulse/. The CLI parses flags, calls the library, formats output.
  • All file I/O routes through the injected afero.Fs — never os.Open/os.ReadFile directly in library code, because that defeats fs.NewMemMap() for tests and the extension hook for custom storage backends.

Naming

  • Component types use SCREAMING_SNAKE_CASE: AGG_COUNT, ATTR_ZSCORE, FILTER_INCLUDE, GROUP_CATEGORY, WIN_LAG, FEAT_LOG, TEST_T.
  • Error codes use DOMAIN_CATEGORY format, organised by the six domains listed in CLAUDE.md (ENCODING, PROCESSING, SERVICE, DATA, CLI, PULSE).
  • Field types use lowercase snake (u8, nullable_bool, categorical_u16, decimal128).

Structural bans

These are enforced by non-skippable CI gates:

| Ban | Enforced by |
|---|---|
| descriptor/ MUST NOT import service/ or processing/ | TestPredictNoExecutionImports |
| descriptor/ MUST NOT use fmt.Sprintf for JSON construction | TestDescriptorNoFmtSprintf |
| Golden files in descriptor/testdata/ MUST NOT be hand-edited | TestGoldensNotHandEdited |
| No predecessor-project string prefixes (legacy “Orbit” naming) in error codes or constants | TestNoOrbitReferences, TestNoOrbitPrefix |
| CLAUDE.md MUST mention every PULSE_* env var, every non-skippable gate, the current format_version | TestClaudeMd* family |

See the Pull Request Process for how these surface during review.

Comments and prose

  • Public Go symbols carry a godoc-shaped comment opening with the symbol name.
  • Skill files use YAML frontmatter (name, description, type, applies_to) and are LLM-facing — keep them in MCP voice (tool calls, JSON payloads). The human-facing equivalent is this site; cross-link from each side.
  • mdBook chapters open with a one-sentence summary and an Audience line. See any of the already-authored chapters in this site for the tone.

The Update Demand

The single most important convention: if your code change ships without the corresponding CLAUDE.md and skill updates, CI will fail. The Update Demand chapter is the authoritative table of triggers and the gates that enforce them. Read it before opening a PR that touches a registered surface (new aggregator, new error code, new CLI flag, new field type, …).

Testing Conventions

Audience: contributors writing tests, regenerating goldens, or trying to figure out which CI gate to run locally before pushing.

From CLAUDE.md, CI gates and Common Claude Code Workflows.

Style

  • Table-driven tests are the default. Put cases in a []struct{...} with a name field, and run each one with t.Run(tc.name, func(t *testing.T) { ... }).
  • Hermetic by construction: anything that touches the filesystem uses fs.NewMemMap() so tests don’t depend on disk state.
  • New code lands with tests in the same PR — TDD first, then implementation. A test that passes without the implementation is suspicious; the test is probably wrong.

Running tests

# Full suite
go test ./...

# Single package
go test ./processing/...

# Verbose, specific test
go test ./service/... -v -run TestProcess

# Coverage report
make cover

# Fuzz the .pulse header
go test ./encoding/... -fuzz FuzzPulseFileHeader -fuzztime 30s

Non-skippable CI gates

These tests guard structural invariants. If one of them fails, the underlying conventions (not the test) are what need re-thinking. Their full names appear in CLAUDE.md so the TestClaudeMdMentionsAllNonSkippableGates self-check can find them.

| Gate | Guards |
|---|---|
| TestPredictNoExecutionImports | descriptor/predict.go does not import service/ or processing/ |
| TestDescriptorNoFmtSprintf | descriptor/ never builds JSON via fmt.Sprintf |
| TestGoldensNotHandEdited | descriptor/testdata/* hashes match the generator |
| TestClaudeMdMentionsFormatVersion | CLAUDE.md references the current envelope format_version |
| TestClaudeMdMentionsAllEnvVars | Every PULSE_* env var has a CLAUDE.md row |
| TestClaudeMdMentionsAllNonSkippableGates | This very table is the source; CLAUDE.md must list every gate by name |
| TestUpdateDemandTableCovers | The Update Demand table covers every registered component category |
| TestPerPackageCoverageFloors | Package directories exist and meet documented coverage floors |
| TestNoOrbitReferences, TestNoOrbitPrefix, TestNoOrbitPrefixes | No predecessor-project string prefixes leak in |
| TestSkillsCoverAll* | Skill files mention every registered component, error code, distribution, CLI leaf, field type, MCP tool |
| TestSkillsManifestConsistent | skills/index.json matches the .md files and frontmatter |
| TestSkillsFrontmatter_RequiredFields | Every skill has name, description, type, applies_to |
| TestRegistryStreamabilityMatchesTypes | Aggregator OnlineAggregator capability matches AggregationType.Streamable() |
| TestPredict_Streamable_MatchesRuntime | PredictResult.Streamable mirrors processing.CanStreamRequest |
| TestStreamability_*Known | Every All*Types() entry has a streamability table row |
| TestCanStreamRequest_RegressionMatrix | Regression matrix on the exported CanStreamRequest helper |
| TestManifest*Complete | Manifest enumerates every registered operator, test, distribution, MCP tool, error code |
| TestManifestStreamableMatchesTypes | Manifest Streamable flags mirror the type-level methods |
| TestCodesHaveFixups, TestSkillsErrorCodeFixupsDocumented | Each error code has a fixup template and the skill row to match |
| TestDefaults_Applied | Smart-default operator-type inference behaves as documented |
| TestNaturalQuery_HeuristicGrammar | The internal/query parser fixtures cover its documented shapes |

(See CLAUDE.md “CI gates” for the full prose; this table is the quick-reference.)

Running a subset of gates locally

# All descriptor contract gates
go test ./descriptor/ -run 'TestPredictNoExecution|TestDescriptorNoFmtSprintf|TestGoldensNotHandEdited'

# Skill coverage gates
go test ./skills/ -run 'TestSkillsCoverAll|TestSkillsManifestConsistent|TestSkillsFrontmatter'

# CLAUDE.md gates
go test . -run 'TestClaudeMd|TestUpdateDemandTable'

# Predecessor-reference scrub
go test . -run TestNoOrbitReferences

Regenerating golden files

Golden files live in descriptor/testdata/. Each ends with a // golden-hash: <sha256> line; TestGoldensNotHandEdited verifies the hash. After a legitimate change to the generator:

go test ./descriptor/ -run 'Test.*Golden' -update
go test ./descriptor/ -run TestGoldensNotHandEdited   # confirms the new hash sticks

Never hand-edit a golden file — the gate will catch you.

Adding a new gate

If your change introduces a structural invariant, add a test for it under the same naming convention (TestX), and add it to the table in CLAUDE.md so TestClaudeMdMentionsAllNonSkippableGates recognises it. The Update Demand lists this as a trigger row.
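
As a rough idea of what a "gates are documented" check boils down to, the Go sketch below reports gate names that a document never mentions. It is an illustration only: missingGates and TestMyNewInvariant are hypothetical names, and the real TestClaudeMdMentionsAllNonSkippableGates may parse CLAUDE.md quite differently.

```go
package main

import (
	"fmt"
	"strings"
)

// missingGates (hypothetical) returns every gate name that doc never mentions.
// It uses plain substring matching, which is an assumption of this sketch.
func missingGates(doc string, gates []string) []string {
	var missing []string
	for _, g := range gates {
		if !strings.Contains(doc, g) {
			missing = append(missing, g)
		}
	}
	return missing
}

func main() {
	// Stand-in for the gate table in CLAUDE.md.
	doc := "TestNoOrbitReferences: scrub\nTestDefaults_Applied: defaults\n"

	// TestMyNewInvariant is a made-up gate that has not been documented yet.
	gates := []string{"TestNoOrbitReferences", "TestDefaults_Applied", "TestMyNewInvariant"}

	fmt.Println(missingGates(doc, gates)) // prints [TestMyNewInvariant]
}
```

In a real gate the same loop would live inside a func TestX(t *testing.T) and call t.Errorf for each missing name. Note that substring matching treats a name as documented whenever a longer name containing it appears, so a stricter check would match whole identifiers.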

Pull Request Process

Audience: contributors preparing to open or land a PR.

This page is a checklist. The longer prose lives in CONTRIBUTING.md and the Update Demand chapter.

1. Branch and commit shape

  • One feature or fix per PR. Keep the diff focused.
  • Conventional Commits in the subject line: feat(...), fix(...), chore(...), docs(...), perf(...), refactor(...), test(...).
  • The PR title is usually the lead commit’s subject.
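
The subject-line shape above can be checked mechanically. The Go sketch below is a hypothetical helper (validSubject is not part of Pulse) that accepts only the seven types listed here, with an optional scope and an optional "!" breaking-change marker as allowed by the Conventional Commits specification.

```go
package main

import (
	"fmt"
	"regexp"
)

// subjectRE encodes: type, optional (scope), optional "!", then ": description".
// The type list mirrors the one on this page.
var subjectRE = regexp.MustCompile(
	`^(feat|fix|chore|docs|perf|refactor|test)(\([^()]+\))?!?: \S.*$`)

// validSubject (hypothetical) reports whether a commit subject matches the shape.
func validSubject(subject string) bool {
	return subjectRE.MatchString(subject)
}

func main() {
	for _, s := range []string{
		"feat(cli): add export flag", // typed, scoped subject
		"fix: handle empty header",   // scope is optional
		"Added some stuff",           // no type prefix
		"feat(cli) add export flag",  // missing ": " separator
	} {
		fmt.Printf("%-30q %v\n", s, validSubject(s))
	}
}
```

The first two subjects match and the last two do not. A check like this could run in a commit-msg hook, but whether the project enforces the format automatically is not stated here.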

2. Tests first

A PR that adds a new aggregator, error code, field type, I/O format, statistical test, or skill must include tests in the same PR. The testing-first preference is documented in Testing Conventions. Implementation that lands without tests will be sent back; tests that pass without the implementation are suspicious and probably wrong.

3. The Update Demand

This is the single biggest source of “your PR was bounced” feedback. The full table lives in The Update Demand; the short version is:

| Change category | Doc/skill update required in the same PR |
| --- | --- |
| Registered aggregator / attribute / filterer / grouper | The matching skill file + the operator capability table |
| Registered window / feature / synth distribution / statistical test | Same: skill + capability file |
| Error code (added / removed / renamed) | errors/codes.go, skills/error-code-reference.md, descriptor/capabilities_errors.go |
| CLI leaf (added or flag added) | CLAUDE.md “Common Claude Code Workflows” + skills/getting-started.md if user-facing |
| --json envelope change | CLAUDE.md “Output Format Contract” |
| .pulse file format change | CLAUDE.md “Code Conventions” + skills/cohort-schema-design.md |
| New environment variable | CLAUDE.md “Build / Dev / Test Workflow” + skills/getting-started.md |
| New non-skippable CI gate | List it by name in CLAUDE.md |

If you find yourself wanting to defer the doc update to a follow-up PR, stop. The follow-up PR will not happen, and the next contributor will read stale guidance. Update in the same PR or do not merge.

4. Pre-flight checks

make fmt
make lint
make test

For change-category-specific gates, see Testing → Running a subset of gates locally.

5. Open the PR

  • Use the bug-report or feature-request template as a starting point if applicable.
  • Fill in the PR template’s “Summary” and “Test plan” sections.
  • Link related issues with Closes #N.
  • Do not force-push (git push --force) to main. Force-pushing your own feature branch is fine before review starts.

6. Review and CI

CI runs the full go test ./... plus the non-skippable gates listed in Testing → Non-skippable CI gates. A failing gate means a structural invariant is broken, not a flaky test; fix the root cause rather than retrying.

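When a pre-commit hook or PR check fails, create a new commit with the fix. Do not git commit --amend after a failure: if a pre-commit hook rejected the commit, that commit was never created, so --amend would rewrite the previous commit instead; if a PR check failed, the commit has usually already been pushed.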

7. Merge

  • Squash-merge is the default; the squash message follows Conventional Commits.
  • Once merged, the deploy workflow rebuilds and publishes this docs site to https://frankbardon.github.io/pulse/.

For changes that introduce a new architectural decision, also update the relevant section of CLAUDE.md and reference the PRD (if one exists) in the PR description.