Pulse

Pulse is a self-describing, high-performance tabular data processing engine. It ships as a Go library (github.com/frankbardon/pulse) and as a single CLI binary (bin/pulse). Every .pulse file carries its own schema in the header, so consumers (programs, agents, and humans) can discover what a file contains without an external catalog.

The library is the primary deliverable. The CLI is a thin adapter that exposes the same operations on the command line, and an embedded MCP server (pulse mcp) exposes them to LLM agents.

Where to go from here

If you are…	Start with
New to Pulse	Installation → Your First Cohort → CLI Tour
Driving Pulse from the shell	Command Line Reference
Embedding Pulse in a Go program	Library Embedding
Curious about the binary format	.pulse File Format
Hacking on Pulse itself	Internals and Contributing
Wiring Pulse into an LLM agent	MCP Integration (Pointer), then the in-binary skill pack

LLM-facing surface

LLM agents do not read this site. Pulse exposes a Model Context Protocol server (pulse mcp) and ships 19 embedded skills under skills/ that LLMs load on demand via the pulse_skills_list and pulse_skills_get tools. The skill voice is MCP-only (tool calls, JSON payloads). This site is the human-facing counterpart — same engine, different idiom.

See How LLMs Use Pulse for a short pointer table.

Source of truth

The authoritative architectural contract for Pulse lives in the repository’s CLAUDE.md. When this site and CLAUDE.md disagree, CLAUDE.md wins; please open an issue.

Repository: https://github.com/frankbardon/pulse
Hosted docs: https://frankbardon.github.io/pulse/

Installation

Audience: new users who want a working pulse binary on their PATH.

This page walks through installing Pulse, the prerequisites it needs, and how to verify the install. Pulse is distributed as a single static Go binary; there is no installer, no daemon, and no config file.

LLM agents using MCP: see the getting-started skill via pulse_skills_get — it covers session bootstrap rather than local install.

Prerequisites

Requirement	Minimum
Go toolchain	1.24 (see `go.mod`)
OS	Linux, macOS, or Windows (anywhere Go cross-compiles)
Disk	A few MB for the binary; cohort files live wherever you point `PULSE_DATA_DIR`

go.mod is the source of truth for the supported Go version; if this page drifts from it, the go.mod value wins.

Install with `go install`

The fastest path on a developer machine:

go install github.com/frankbardon/pulse/cmd/pulse@latest

This installs a pulse binary into $GOBIN if that variable is set, and otherwise into $(go env GOPATH)/bin (typically ~/go/bin). Make sure that directory is on your PATH.

Pin a specific release by replacing @latest with a tag:

go install github.com/frankbardon/pulse/cmd/pulse@v0.2.0

Build from source

The same binary, built reproducibly from a checkout:

git clone https://github.com/frankbardon/pulse.git
cd pulse
make build
# Binary at ./bin/pulse

The Makefile is documented in CLAUDE.md → Build / Dev / Test Workflow; the relevant targets are make build, make test, make lint, and make cover.

Configure the data directory

Pulse reads and writes .pulse files under a base directory called PULSE_DATA_DIR. Most commands accept absolute paths and will work without it, but pulse mcp requires the variable so the MCP server can enumerate cohorts:

export PULSE_DATA_DIR=/var/data/pulse

The repo Makefile auto-loads a .env file from the repo root, so you can also drop PULSE_DATA_DIR=... there for local development.

PULSE_DATA_DIR is the only environment variable Pulse ever requires, and only pulse mcp strictly needs it. See Flag Reference for the full list of CLI flags and environment knobs.

Verify

pulse --version
pulse --json | head -20

pulse --json prints the root manifest — the full self-description of commands, components, field types, and embedded skills. If you see a top-level format_version: "1.0" envelope, the install is working.
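For scripted install checks, the same test can be done in code. The following is a minimal Go sketch, assuming only the four top-level envelope keys named on this site (format_version, data, errors, warnings); the Envelope type and CheckEnvelope function are hypothetical helpers, not part of the Pulse library. It expects the complete output of pulse --json, not the head -20 excerpt.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope mirrors the four top-level keys described in these docs.
// Entries in errors and warnings are kept raw because their full shape
// is not specified on this page.
type Envelope struct {
	FormatVersion string            `json:"format_version"`
	Data          json.RawMessage   `json:"data"`
	Errors        []json.RawMessage `json:"errors"`
	Warnings      []json.RawMessage `json:"warnings"`
}

// CheckEnvelope returns nil when raw is valid JSON, carries
// format_version "1.0", and has an empty errors array.
func CheckEnvelope(raw []byte) error {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return fmt.Errorf("not valid JSON: %w", err)
	}
	if env.FormatVersion != "1.0" {
		return fmt.Errorf("unexpected format_version %q", env.FormatVersion)
	}
	if len(env.Errors) > 0 {
		return fmt.Errorf("envelope carries %d error(s)", len(env.Errors))
	}
	return nil
}

func main() {
	// Hypothetical sample payload; in practice feed it the output of `pulse --json`.
	sample := []byte(`{"format_version":"1.0","data":{},"errors":[],"warnings":[]}`)
	if err := CheckEnvelope(sample); err != nil {
		fmt.Println("install check failed:", err)
		return
	}
	fmt.Println("envelope OK")
}
```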

Where to go next

New to the file format and vocabulary? Your First Cohort
Want a quick map of every command? CLI Tour
Embedding Pulse in a Go program? Go API Overview
Wiring Pulse into an MCP-aware client? pulse mcp

Your First Cohort

Audience: new CLI users. This is a five-minute tour: import a CSV, inspect the resulting .pulse file, run an aggregation, and export the result back.

LLM agents using MCP: the equivalent tour for an agent is the getting-started skill, fetched via pulse_skills_get. That skill speaks in tool calls and JSON payloads; this page speaks in shell commands.

1. Pick a CSV

For this walkthrough we’ll assume a file called sales.csv with columns like:

order_id,region,product,units,revenue,sold_on
1,west,widget,3,29.97,2024-01-04
2,east,gadget,1,19.99,2024-01-04
3,west,widget,7,69.93,2024-01-05
...

Any CSV with a header row works. Pulse also imports TSV, NDJSON, JSON-array, Parquet, Arrow IPC, and Excel — see Flag Reference for per-format flags.

2. Import to a `.pulse` file

pulse import csv --input sales.csv --output sales.pulse

Pulse samples up to 500 rows by default to infer a schema (you can change that with --sample-rows). Each column gets a typed binary representation and, if it looks like a low-cardinality string, a categorical dictionary.

Want to control the schema explicitly? Generate a template, edit it, and re-import:

# Editable schema template
pulse import schema-template sales.csv > sales.schema.json

# Edit sales.schema.json — set types, add descriptions
# Then import with the schema
pulse import csv --input sales.csv --schema sales.schema.json --output sales.pulse

See Field Types for the type catalog and Dictionary Blocks for how categoricals are encoded.

3. Inspect

The .pulse file is fully self-describing. Read it back:

pulse cohort inspect sales.pulse

Output is a table of fields, their types, and the description string stored in the header. Add --json for the structured envelope, or --full-dict to print every categorical entry instead of truncating after 100.

pulse cohort inspect sales.pulse --json

The envelope is documented in pulse cohort inspect.

4. Validate a request before running it

Pulse separates validation from execution. Write a tiny request file:

{
  "cohort": {"filename": "sales.pulse"},
  "groups": [{"type": "GROUP_CATEGORY", "field": "region"}],
  "aggregations": [
    {"type": "AGG_COUNT", "field": "order_id", "label": "orders"},
    {"type": "AGG_SUM", "field": "revenue", "label": "total_revenue"}
  ]
}

Save it as request.json, then check whether it makes sense against the cohort’s schema:

pulse api predict --request request.json

You’ll see Valid: true, the schema’s field count, and any warnings (for example, a numeric aggregation applied to a categorical field). Predict never reads record data, so it’s safe to iterate on a request without touching a multi-GB cohort.

See pulse api predict and the debugging-with-predict skill for the full predict loop.

5. Execute

pulse api process --request request.json --json

The response is wrapped in the standard envelope (format_version, data, errors, warnings). data carries the result rows and a metadata block with total_rows and filtered_rows.

If your result is large, swap --json for --stream to receive rows as NDJSON, one line at a time — useful for pipelines that don’t want to buffer the whole result. See Streaming & ProcessStream for which request shapes actually stream end-to-end inside the engine vs which buffer.

6. Export

When you’re done with the .pulse file, export it to whatever format your downstream tool understands:

pulse export csv     --input sales.pulse --output sales.out.csv
pulse export parquet --input sales.pulse --output sales.out.parquet
pulse export excel   --input sales.pulse --output sales.out.xlsx

To skip the intermediate .pulse entirely and convert in one shot, use pulse convert source.csv target.parquet — see the top-level README for the full convert recipe.

What you didn’t see

Compose: batch multiple requests in one call — pulse api compose.
Ask: natural-language one-shot — pulse api ask.
Sample / Facet: cheap read-only probes — api sample, api facet.
Window / Feature / Test operators: pull from the skill pack (window-operations, feature-engineering, statistical-testing) via pulse skills show <name>.

For a full map of the CLI, see the CLI Tour.

CLI Tour

Audience: anyone who wants a map of every pulse subcommand before diving into per-command details.

This page is a one-liner index of the CLI tree. Each row links to its detailed chapter where applicable; commands that are minor variants of each other (per-format import/export leaves) are listed compactly.

LLM agents using MCP: there is no equivalent skill — agents drive Pulse through MCP tools, not the CLI. Start at the getting-started skill instead.

Top-level groups

pulse [--json] [--slim]
├── import      Tabular → .pulse (csv, tsv, ndjson, jsonarray, parquet, arrow, excel)
├── export      .pulse  → tabular (same format set)
├── convert     Tabular → tabular, with .pulse as the transparent middle
├── cohort      Inspect or filter an existing .pulse file
├── api         Processing operations (process, compose, ask, predict, sample, facet)
├── synth       Generate synthetic cohorts (from-schema, from-profile)
├── profile     Capture a statistical profile of a cohort
├── skills      Read the embedded LLM skill pack
└── mcp         Run the Model Context Protocol server over stdio

Bare pulse --json prints the self-describing root manifest — commands, components, field types, and skill metadata in one envelope. Pass --slim to drop prose descriptions for size-sensitive clients.

API operations

The “processing facade” — these are the operations exposed via the Go library API and the MCP tool set.

Command	Purpose	Chapter
`pulse api process`	Execute one request against a cohort	api process
`pulse api compose`	Execute multiple requests in batch / parallel	api compose
`pulse api ask`	Parse a natural-language query and execute	api ask
`pulse api predict`	Validate a request without executing	api predict
`pulse api sample`	Return up to N rows	api sample
`pulse api facet`	Return distinct values of a field	api facet

Cohort lifecycle

Command	Purpose	Chapter
`pulse cohort inspect PATH`	Read header + schema (no record data)	cohort inspect
`pulse cohort filter`	Write a filtered subset to a new `.pulse`	See Internals → Architecture

Import / export / convert

pulse import <format> and pulse export <format> share the same flag shape per format (--input, --output, --schema for import). Supported formats today:

csv · tsv · ndjson · jsonarray · parquet · arrow · excel

Each format has a per-leaf command (e.g. pulse import csv). Run pulse import --help or pulse export --help for the full list.

pulse convert SOURCE TARGET chains import + export with no intermediate file unless --keep-pulse PATH is passed. Format is auto-detected from extensions.

Synthetic data

Command	Purpose	Chapter
`pulse synth from-schema`	Generate from a JSON spec	synth from-schema
`pulse synth from-profile`	Generate from a captured profile	synth from-profile
`pulse profile create`	Capture a profile from an existing cohort	profile create

Self-description & LLM surface

Command	Purpose	Chapter
`pulse --json`	Root manifest (commands, components, field types, skills)	manifest
`pulse skills list`	List embedded skills with metadata	How LLMs Use Pulse
`pulse skills show NAME`	Print a skill’s full markdown body	How LLMs Use Pulse
`pulse mcp`	Serve MCP over stdio	mcp

Cross-cutting flags

Most leaves accept --json (envelope output), --no-defaults (turn off smart operator-type inference), and the operation-specific flags documented per page. Full list: Flag Reference.

The single environment variable to know is PULSE_DATA_DIR — see Installation.

pulse api process

Audience: CLI users running a single processing request against a cohort.

pulse api process executes one types.Request against a .pulse file and prints the result. It’s the most-used leaf in the binary.

LLM agents using MCP: the equivalent surface is the pulse_process MCP tool — see skills/request-recipes.md for request skeletons.

Synopsis

pulse api process --request FILE [--json] [--stream] [--no-defaults]

Flags

Flag	Alias	Type	Default	Purpose
`--request`	`-r`	string	(required)	Path to the request JSON file
`--json`		bool	false	Emit the result wrapped in the JSON envelope
`--stream`		bool	false	Stream rows as NDJSON (one per line) instead of buffering
`--no-defaults`		bool	false	Disable smart operator-type inference; require explicit `Type` on every aggregation and grouper

--stream and --json select different output shapes, so pass one or the other: --stream emits one JSON object per line, while --json emits the full envelope.

Request file shape

The request file is a types.Request serialised to JSON. Minimal example:

{
  "cohort": {"filename": "sales.pulse"},
  "aggregations": [
    {"type": "AGG_SUM", "field": "revenue", "label": "total_revenue"}
  ]
}

The full request grammar — filterers, groupers, attributes, window operators, features, sort, tests, post-tests — is documented in types.Request; the LLM-facing companion is skills/request-recipes.md.
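When request files are generated by a program, typed structs help avoid key typos. The sketch below is a hedged illustration: Cohort, Operator, and Request are local stand-ins that mirror only the JSON keys shown on this site, not the library's types.Request, and NewRevenueByRegion is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins that mirror only the JSON keys shown in these docs.
// They are NOT the library's types.Request.
type Cohort struct {
	Filename string `json:"filename"`
}

type Operator struct {
	Type   string   `json:"type"`
	Field  string   `json:"field"`
	Label  string   `json:"label,omitempty"`
	Values []string `json:"values,omitempty"`
}

type Request struct {
	Cohort       Cohort     `json:"cohort"`
	Filterers    []Operator `json:"filterers,omitempty"`
	Groups       []Operator `json:"groups,omitempty"`
	Aggregations []Operator `json:"aggregations,omitempty"`
}

// NewRevenueByRegion builds a request that sums revenue per region.
func NewRevenueByRegion(cohortFile string) Request {
	return Request{
		Cohort: Cohort{Filename: cohortFile},
		Groups: []Operator{{Type: "GROUP_CATEGORY", Field: "region"}},
		Aggregations: []Operator{
			{Type: "AGG_SUM", Field: "revenue", Label: "total_revenue"},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(NewRevenueByRegion("sales.pulse"), "", "  ")
	if err != nil {
		panic(err)
	}
	// Save this output as req.json and pass it via --request.
	fmt.Println(string(out))
}
```

Running pulse api predict against the generated file remains the way to confirm that the field and operator names match the cohort's schema.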

Output

Text mode (default)

Pretty-printed JSON of the Response struct: a data array of result rows plus a metadata block with total_rows, filtered_rows, and cohort_file.

`--json`

The standard envelope:

{
  "format_version": "1.0",
  "data": {
    "data": [ /* result rows */ ],
    "metadata": { "total_rows": 1000, "filtered_rows": 800, "cohort_file": "sales.pulse" }
  },
  "errors": [],
  "warnings": []
}

`--stream`

NDJSON of result rows, one per line. No envelope, no metadata footer. Run pulse api predict beforehand to confirm that streamable is true; request shapes that predict reports as buffered still emit through this path, but the engine materialises the full result first.
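A downstream consumer can process the stream one line at a time without holding the whole result in memory. This is a minimal Go sketch of such a consumer, assuming each line is a flat JSON object and that a numeric column named avg_rev exists (as in the grouped example on this page); KeepRow is a hypothetical helper.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// KeepRow decodes one NDJSON line and reports whether its numeric field
// is strictly greater than threshold. A missing or non-numeric field
// yields false.
func KeepRow(line []byte, field string, threshold float64) (bool, error) {
	var row map[string]any
	if err := json.Unmarshal(line, &row); err != nil {
		return false, fmt.Errorf("bad NDJSON line: %w", err)
	}
	v, ok := row[field].(float64) // encoding/json decodes JSON numbers to float64
	return ok && v > threshold, nil
}

func main() {
	// Intended use: pulse api process --request req.json --stream | go run .
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 64*1024), 1<<20) // accept rows up to 1 MiB
	for sc.Scan() {
		line := sc.Bytes()
		if len(line) == 0 {
			continue
		}
		keep, err := KeepRow(line, "avg_rev", 500)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		if keep {
			fmt.Println(string(line))
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```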

Exit codes

Code	Meaning
0	Success
1	Any error — wrapped in the envelope’s `errors` array under `--json`, or printed to stderr otherwise

Examples

Quick aggregation

cat > req.json <<'EOF'
{
  "cohort": {"filename": "sales.pulse"},
  "aggregations": [{"type": "AGG_COUNT", "field": "order_id", "label": "n"}]
}
EOF

pulse api process --request req.json

Filter, group, and aggregate

cat > req.json <<'EOF'
{
  "cohort": {"filename": "sales.pulse"},
  "filterers": [{"type": "FILTER_RANGE", "field": "revenue", "values": ["100", "10000"]}],
  "groups":    [{"type": "GROUP_CATEGORY", "field": "region"}],
  "aggregations": [
    {"type": "AGG_COUNT",   "field": "order_id", "label": "orders"},
    {"type": "AGG_AVERAGE", "field": "revenue",  "label": "avg_rev"}
  ]
}
EOF

pulse api process --request req.json --json

Stream rows into a downstream pipeline

pulse api process --request req.json --stream | \
    jq -c 'select(.avg_rev > 500)'

See also

pulse api compose: batch of requests in one call
pulse api ask: natural-language one-shot
pulse api predict: validate without executing
pulse api sample: quick row preview
Library: pulse.New & Options (the Go-side equivalent of --no-defaults)
Library: Streaming & ProcessStream (what streams vs. what buffers)

pulse api compose

Audience: CLI users executing a batch of related requests in one call.

pulse api compose runs multiple types.Request entries against one or more cohorts. The whole batch is one ComposedRequest; the engine can run the entries sequentially or in parallel against a bounded worker pool.

LLM agents using MCP: see the pulse_compose MCP tool and the compose-requests skill.

Synopsis

pulse api compose --request FILE [--json] [--stream]
                                  [--parallel N] [--no-fail-fast]
                                  [--no-defaults]

Flags

Flag	Alias	Type	Default	Purpose
`--request`	`-r`	string	(required)	Composed-request JSON path
`--json`		bool	false	Wrap output in the standard envelope
`--stream`		bool	false	Stream rows as NDJSON; each line is `{"index": N, "row": {...}}`
`--parallel`		int	1	Worker count; 0 = `GOMAXPROCS`, 1 = sequential
`--no-fail-fast`		bool	false	Aggregate errors across slots instead of cancelling on first failure (parallel mode only)
`--no-defaults`		bool	false	Disable smart operator-type inference

Request file shape

{
  "requests": [
    { "cohort": {"filename": "sales.pulse"}, "aggregations": [...] },
    { "cohort": {"filename": "sales.pulse"}, "groups":       [...] },
    { "cohort": {"filename": "ops.pulse"},   "filterers":    [...] }
  ]
}

Each requests[i] is a full types.Request. Slots are independent — they may target different cohorts, use different operators, etc.

Output ordering

Responses come back in input order, regardless of --parallel. A worker that finishes early waits its turn before emitting. So responses[i] always corresponds to request.requests[i].
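One common way to get input-order results from a bounded worker pool is to have each worker write into the result slot that matches its input index. The Go sketch below illustrates that general technique under stated assumptions; it is not taken from Pulse's source, and RunOrdered is a hypothetical name.

```go
package main

import (
	"fmt"
	"sync"
)

// RunOrdered applies fn to every input on at most `workers` goroutines and
// returns results in input order: results[i] always belongs to inputs[i].
func RunOrdered(inputs []string, workers int, fn func(string) string) []string {
	if workers < 1 {
		workers = 1
	}
	results := make([]string, len(inputs))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is handed to exactly one worker, so these
				// writes never overlap.
				results[i] = fn(inputs[i])
			}
		}()
	}
	for i := range inputs {
		jobs <- i
	}
	close(jobs)
	wg.Wait() // results are only read after every worker has finished
	return results
}

func main() {
	slots := []string{"slot-0", "slot-1", "slot-2", "slot-3"}
	out := RunOrdered(slots, 2, func(s string) string { return "done:" + s })
	for i, r := range out {
		fmt.Println(i, r)
	}
}
```

Because completion order never affects the slot a result lands in, callers can index the output by request position regardless of the worker count.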

Parallel mode

--parallel N:

1 (default) — sequential Compose, equivalent to running each request through pulse api process in a loop.
0 — runtime.GOMAXPROCS workers.
>1 — exactly N workers.

Workers share Pulse’s read-only registries; per-request stateful operators are constructed fresh. See Parallel Compose for full mechanics.

FailFast semantics

With --no-fail-fast unset (the default, fail-fast on):

The first failing request cancels in-flight siblings.
The command exits non-zero with the first error.

With --no-fail-fast:

Every request runs to its own completion (or per-request timeout).
Errors aggregate into a single SERVICE_INTERNAL error whose details.failed_indices lists the slot indices that failed.
Successful slots populate the response array; failed slots are null.

Output

`--json`

{
  "format_version": "1.0",
  "data": [ /* response per slot, in input order */ ],
  "errors": [],
  "warnings": []
}

`--stream`

{"index": 0, "row": { ... }}
{"index": 0, "row": { ... }}
{"index": 1, "row": { ... }}

The index field identifies which slot’s request produced each row.
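A consumer that wants per-slot results back can bucket the lines by index. The following Go sketch assumes only the documented line shape, {"index": N, "row": {...}}; StreamLine and GroupByIndex are hypothetical helpers and the sample rows are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StreamLine mirrors one line of compose --stream output.
type StreamLine struct {
	Index int            `json:"index"`
	Row   map[string]any `json:"row"`
}

// GroupByIndex decodes stream lines and buckets rows by slot index,
// preserving arrival order within each slot.
func GroupByIndex(lines []string) (map[int][]map[string]any, error) {
	bySlot := make(map[int][]map[string]any)
	for n, line := range lines {
		var sl StreamLine
		if err := json.Unmarshal([]byte(line), &sl); err != nil {
			return nil, fmt.Errorf("line %d: %w", n+1, err)
		}
		bySlot[sl.Index] = append(bySlot[sl.Index], sl.Row)
	}
	return bySlot, nil
}

func main() {
	// Hypothetical sample lines in the documented shape.
	lines := []string{
		`{"index": 0, "row": {"region": "west", "orders": 2}}`,
		`{"index": 0, "row": {"region": "east", "orders": 1}}`,
		`{"index": 1, "row": {"total_revenue": 119.89}}`,
	}
	bySlot, err := GroupByIndex(lines)
	if err != nil {
		panic(err)
	}
	fmt.Println("rows in slot 0:", len(bySlot[0]))
	fmt.Println("rows in slot 1:", len(bySlot[1]))
}
```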

Exit codes

Code	Meaning
0	All requests succeeded
1	One or more requests failed (fail-fast: first error; aggregated: any failure)

Examples

Sequential batch

pulse api compose --request batch.json --json

Parallel with 4 workers, aggregated errors

pulse api compose --request batch.json --parallel 4 --no-fail-fast --json

Stream a parallel batch into a downstream consumer

pulse api compose --request batch.json --parallel 4 --stream | \
    jq -c 'select(.index == 2)'

See also

pulse api process: single-request leaf
Library: Parallel Compose (Go-side equivalents)
skills/compose-requests.md (LLM): request composition patterns

pulse api ask

Audience: CLI users running a one-shot natural-language query against a cohort, or any caller who wants “predict + process” in one call.

pulse api ask is the unified entry point. It optionally translates a natural-language query into a request via the built-in parser, validates the resulting request (predict), and, on success, executes it. The MCP server uses the same library facade internally for the pulse_ask tool.

LLM agents using MCP: the LLM-side counterpart is the pulse_ask MCP tool. The query-router-prompt skill gives a system-prompt template for routing natural language into Pulse requests.

Synopsis

pulse api ask  [--file FILE] [--query "..."] [--request FILE]
               [--on-invalid abort|suggest] [--predict]
               [--json] [--no-defaults]

You must pass at least one of --query or --request.

Flags

Flag	Alias	Type	Default	Purpose
`--file`	`-f`	string	(none)	Cohort `.pulse` file path
`--query`	`-q`	string	(none)	Natural-language query string
`--request`	`-r`	string	(none)	Optional structured request JSON path
`--on-invalid`		string	`"abort"`	Predict-invalid behaviour: `"abort"` returns an error; `"suggest"` returns the response with `suggestions` populated
`--predict`		bool	false	Validate without executing
`--json`		bool	false	Emit the standard envelope
`--no-defaults`		bool	false	Disable smart operator-type inference

How the parser fills the request

When --query is set, the parser reads the cohort’s schema and synthesises a types.Request slot-by-slot. If --request is also provided, explicit fields in that request always win on collision — the parser only fills empty slots.

The parser populates these slots from the query today: Aggregations, Groups, Filterers, Windows, Sort, Tests. Other slots in the parsed request are ignored.
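The "explicit wins, parser fills empty slots" rule can be pictured as a key-by-key merge. The Go sketch below is an illustration under assumptions, not the parser's actual code: it treats a request as a map of top-level JSON keys, and it assumes that a missing key, null, [] and {} all count as an empty slot. Slots, isEmpty, and Merge are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Slots views a request as its top-level JSON keys
// ("aggregations", "groups", "filterers", ...).
type Slots map[string]json.RawMessage

// isEmpty treats an absent value, null, [] and {} as an unfilled slot.
// This definition of "empty" is an assumption made for the sketch.
func isEmpty(v json.RawMessage) bool {
	switch strings.TrimSpace(string(v)) {
	case "", "null", "[]", "{}":
		return true
	}
	return false
}

// Merge starts from the parsed slots and overlays every non-empty explicit
// slot, so explicit values always win on collision.
func Merge(explicit, parsed Slots) Slots {
	out := make(Slots, len(parsed)+len(explicit))
	for k, v := range parsed {
		out[k] = v
	}
	for k, v := range explicit {
		if !isEmpty(v) {
			out[k] = v
		}
	}
	return out
}

func main() {
	var explicit, parsed Slots
	// The explicit request pins the filter; the parser proposes a grouper
	// and its own filter.
	e := `{"filterers":[{"type":"FILTER_RANGE","field":"revenue","values":["100","1000"]}]}`
	p := `{"groups":[{"type":"GROUP_CATEGORY","field":"region"}],"filterers":[{"type":"FILTER_RANGE","field":"units","values":["1","5"]}]}`
	if err := json.Unmarshal([]byte(e), &explicit); err != nil {
		panic(err)
	}
	if err := json.Unmarshal([]byte(p), &parsed); err != nil {
		panic(err)
	}
	merged, err := json.Marshal(Merge(explicit, parsed))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(merged))
}
```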

Output

Text mode

A human-readable summary:

Query: average revenue by region
Matched fields: [revenue region]
Confidence: 0.92

Resolved request:
{ ...the synthesised types.Request... }

{ ...result rows, if executed... }

`--json`

Full AskResponse envelope:

{
  "format_version": "1.0",
  "predict": { /* PredictResult */ },
  "process": { /* Response, if executed */ },
  "suggestions": [],
  "query_resolution": {
    "query": "average revenue by region",
    "matched_fields": ["revenue", "region"],
    "confidence": 0.92
  },
  "errors": [],
  "warnings": []
}

process is omitted when --predict is set or when predict reported invalid and execution was skipped.

Confidence and unresolved queries

query_resolution.confidence is in [0, 1]. A confidence of 0 means the parser found no usable structure; it is reported as PULSE_QUERY_UNRESOLVED in errors. A confidence below 1 with at least one matched field is reported as PULSE_QUERY_AMBIGUOUS in warnings, along with the reasons. The query-router-prompt skill describes the parser’s grammar.

OnInvalid behaviours

Value	Behaviour
`"abort"` (default)	Return a `SERVICE_VALIDATION` error if predict reports invalid
`"suggest"`	Return the response with `suggestions` populated from `errors/fixup_metadata.go`

Use "suggest" when you want fixup hints (e.g., “did you mean field revenue?”) rather than a hard fail.

Exit codes

Code	Meaning
0	Success
1	Validation failed (`abort`), parser failed, or process errored

Examples

Pure natural-language query

pulse api ask --file sales.pulse --query "average revenue by region" --json

Query plus partial structured request

cat > partial.json <<'EOF'
{
  "filterers": [{"type": "FILTER_RANGE", "field": "revenue", "values": ["100", "1000"]}]
}
EOF
pulse api ask --file sales.pulse --request partial.json --query "by region" --json

Predict-only probe

pulse api ask --request req.json --predict --json

Suggest fixups instead of erroring

pulse api ask --request typo.json --on-invalid suggest --json

See also

pulse api predict: standalone validation
pulse api process: execute a pre-validated request
Library: pulse.Ask (Go-side counterpart)
skills/query-router-prompt.md: LLM prompt template for routing
skills/request-recipes.md: canonical request skeletons

pulse cohort inspect

Audience: CLI users reading a .pulse file’s schema without running a query — the human-side counterpart of the inspect library method and the pulse_inspect MCP tool. Defined in internal/cli/cohort.go.

pulse cohort inspect reads only the file’s header and schema; it never reads record data. Its cost therefore depends on the size of the header and schema, not on the number of records.

LLM agents using MCP: see the cohort-schema-design skill and the pulse_inspect tool.

Synopsis

pulse cohort inspect PATH [--json] [--full-dict]

Flags

Flag	Type	Default	Purpose
`--json`	bool	false	Emit the standard envelope
`--full-dict`	bool	false	Print every categorical dictionary entry (default truncates at 100)

Output (text mode)

Fields: 7
  order_id              u64                  Stable order identifier
  region                categorical_u8       Sales region label
    dictionary: 4 entries
  product               categorical_u16      Product SKU
    dictionary: 240 entries (truncated)
  units                 u32                  Units sold per line
  revenue               decimal128           Line revenue (precision 18, scale 2)
  sold_on               date                 Date the order shipped
  ...

Dictionaries with > 100 entries are flagged (truncated) — pass --full-dict to print every entry.

Output (`--json`)

{
  "format_version": "1.0",
  "data": {
    "field_count": 7,
    "fields": [
      {
        "name": "order_id",
        "type": "u64",
        "byte_offset": 0,
        "bit_position": 0,
        "description": "Stable order identifier",
        "description_source": "schema"
      },
      {
        "name": "region",
        "type": "categorical_u8",
        "byte_offset": 8,
        "bit_position": 0,
        "description": "Sales region label",
        "description_source": "schema",
        "dictionary": {
          "total_entries": 4,
          "truncated": false,
          "entries": ["east", "west", "north", "south"]
        }
      }
    ]
  },
  "errors": [],
  "warnings": []
}

Fields with empty descriptions on disk get a synthesised fallback ("Categorical field: <name>" / "Numeric field: <name>"); their description_source is "synthesized" rather than "schema".
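Programmatic consumers can use description_source and dictionary.truncated to flag fields that deserve a second look. This Go sketch decodes only the keys shown in the envelope above; the struct types and FieldsToReview are hypothetical helpers, and the sample payload is abbreviated for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Dictionary and Field mirror a subset of the keys in the inspect envelope.
type Dictionary struct {
	TotalEntries int      `json:"total_entries"`
	Truncated    bool     `json:"truncated"`
	Entries      []string `json:"entries"`
}

type Field struct {
	Name              string      `json:"name"`
	Type              string      `json:"type"`
	Description       string      `json:"description"`
	DescriptionSource string      `json:"description_source"`
	Dictionary        *Dictionary `json:"dictionary,omitempty"`
}

type InspectEnvelope struct {
	Data struct {
		FieldCount int     `json:"field_count"`
		Fields     []Field `json:"fields"`
	} `json:"data"`
}

// FieldsToReview returns, in schema order, the names of fields that carry a
// synthesised description or whose dictionary listing was truncated.
func FieldsToReview(raw []byte) ([]string, error) {
	var env InspectEnvelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, err
	}
	var names []string
	for _, f := range env.Data.Fields {
		synthesized := f.DescriptionSource == "synthesized"
		truncated := f.Dictionary != nil && f.Dictionary.Truncated
		if synthesized || truncated {
			names = append(names, f.Name)
		}
	}
	return names, nil
}

func main() {
	// Abbreviated, hypothetical payload in the documented shape.
	sample := []byte(`{"data":{"field_count":2,"fields":[
		{"name":"order_id","type":"u64","description_source":"schema"},
		{"name":"units","type":"u32","description_source":"synthesized"}]}}`)
	names, err := FieldsToReview(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```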

Exit codes

Code	Meaning
0	Success
1	File not found, truncated, magic-byte mismatch, or unsupported format version

Examples

# Human-readable inspect
pulse cohort inspect data.pulse

# Full envelope for programmatic consumers
pulse cohort inspect data.pulse --json

# Show all categorical entries
pulse cohort inspect data.pulse --full-dict --json | jq '.data.fields[] | select(.dictionary)'

See also

Format → Header Layout
Format → Schema Block
Format → Dictionary Blocks
Library: pulse.Inspect (Go counterpart)
skills/cohort-schema-design.md: LLM-facing schema-design skill

pulse api predict

Audience: CLI users validating a request before running it.

pulse api predict validates a types.Request against a .pulse file’s schema without executing it. It reads only the header and schema — never record data — so it’s a cheap, safe iteration loop against arbitrarily large cohorts.

LLM agents using MCP: see the pulse_predict MCP tool and the debugging-with-predict skill. Predict is the LLM’s primary “would this work?” probe.

Synopsis

pulse api predict --request FILE [--json] [--strict]

Flags

Flag	Alias	Type	Default	Purpose
`--request`	`-r`	string	(required)	Request JSON path
`--json`		bool	false	Emit the standard envelope
`--strict`		bool	false	Treat warnings as errors

Structural ban

descriptor/predict.go must not import service/ or processing/. TestPredictNoExecutionImports enforces this ban, which guarantees that predict never touches the executor.

Output (text mode)

Valid: true
Schema: 7 fields
Warning [PULSE_AGG_NOT_MEANINGFUL_FOR_CATEGORICAL]: AGG_AVG on field region (categorical_u8)

Without --strict, the command still exits 0 despite that warning. With --strict, the warning becomes an error and the command exits non-zero.

Output (`--json`)

{
  "format_version": "1.0",
  "data": {
    "valid": true,
    "schema_info": {"field_count": 7},
    "streamable": false,
    "streamable_reasons": [
      "AGG_MEDIAN on field price"
    ],
    "request": { /* the request as predict resolved it, with defaults applied */ }
  },
  "errors":  [],
  "warnings": [
    {"code": "PULSE_AGG_NOT_MEANINGFUL_FOR_CATEGORICAL", "message": "..."}
  ]
}

streamable reports whether the request will execute on the streaming Process path; streamable_reasons lists every gate that forced the buffered path. See Performance Notes for the full streaming/buffered table.

request echoes the request after defaults have been applied so you can see what would actually run. To suppress defaults, run with --no-defaults on the executing leaf (api process, api compose); predict reports defaults_applied regardless.

Exit codes

Code	Meaning
0	Valid (or valid with warnings, in non-strict mode)
1	Invalid, or `--strict` with at least one warning
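The exit-code table can also be applied by a caller that already holds a predict envelope, for example a CI step that captured the --json output earlier. This Go sketch assumes only the data.valid and warnings keys shown above; PredictEnvelope and ExitCode are hypothetical helpers, not part of the Pulse library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Warning mirrors the {"code", "message"} entries shown in the envelope.
type Warning struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// PredictEnvelope decodes only the keys this sketch needs.
type PredictEnvelope struct {
	Data struct {
		Valid bool `json:"valid"`
	} `json:"data"`
	Warnings []Warning `json:"warnings"`
}

// ExitCode applies the documented rule: 1 when the request is invalid, or
// when strict is set and at least one warning is present; otherwise 0.
func ExitCode(raw []byte, strict bool) (int, error) {
	var env PredictEnvelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return 1, err
	}
	if !env.Data.Valid {
		return 1, nil
	}
	if strict && len(env.Warnings) > 0 {
		return 1, nil
	}
	return 0, nil
}

func main() {
	// Hypothetical captured envelope: valid, with one warning.
	sample := []byte(`{"format_version":"1.0","data":{"valid":true},"errors":[],
		"warnings":[{"code":"PULSE_AGG_NOT_MEANINGFUL_FOR_CATEGORICAL","message":"..."}]}`)
	for _, strict := range []bool{false, true} {
		code, err := ExitCode(sample, strict)
		if err != nil {
			panic(err)
		}
		fmt.Printf("strict=%v exit=%d\n", strict, code)
	}
}
```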

Examples

Quick validity check

pulse api predict --request req.json

Programmatic check with envelope

pulse api predict --request req.json --json | \
    jq -e '.data.valid == true' >/dev/null && echo "OK"

Strict mode for CI

pulse api predict --request req.json --strict --json

Detect that a request will buffer

pulse api predict --request req.json --json | \
    jq '.data | {streamable, streamable_reasons}'

Common warning codes

Code	What to do
`PULSE_AGG_NOT_MEANINGFUL_FOR_CATEGORICAL`	Use `AGG_COUNT` / `AGG_FREQUENCY` instead of `AGG_SUM` / `AGG_AVG` on categoricals
`PULSE_AGG_NOT_MEANINGFUL_FOR_DECIMAL`	Decimal-typed field; switch to a decimal-aware aggregator
`PULSE_FIELD_DESCRIPTION_LOW_QUALITY`	Edit the schema description; re-import
`PULSE_FEAT_TARGET_LEAKAGE_RISK`	The feature operator references the target column; reorganise the pipeline

The full code-by-code recovery playbook lives in skills/error-code-reference.md and at Troubleshooting.

See also

pulse api process: executes a validated request
pulse api ask: combined predict + execute
Library: pulse.Predict / Ask (Go counterparts)
skills/debugging-with-predict.md: LLM-side iteration recipe

pulse api sample

Audience: CLI users grabbing a quick peek at a few rows from a cohort — for debugging, sanity-checking an import, or seeding a template request.

pulse api sample returns the first N rows from a .pulse file decoded back to a map of field → value. There is no filter, no aggregation, no transformation — just a typed view of raw rows.

LLM agents using MCP: see the pulse_sample MCP tool. It returns the same shape over the MCP transport.

Synopsis

pulse api sample --input PATH [--count N] [--json]

Flags

Flag	Alias	Type	Default	Purpose
`--input`	`-i`	string	(required)	Cohort `.pulse` file path
`--count`	`-n`	int	10	Rows to sample
`--json`		bool	false	Emit the standard envelope

Output (text mode)

Pretty-printed JSON of the row array:

[
  {
    "order_id": 1,
    "region": "west",
    "product": "widget",
    "units": 3,
    "revenue": "29.97",
    "sold_on": "2024-01-04"
  },
  ...
]

Decimal128 values are serialised as strings to preserve precision.
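Consumers should keep that precision when they do arithmetic on these strings rather than converting to float64. The Go sketch below shows one way to do so with math/big; SumDecimals is a hypothetical helper, and the scale of 2 is an assumption borrowed from the revenue field in the inspect example.

```go
package main

import (
	"fmt"
	"math/big"
)

// SumDecimals adds decimal strings exactly using rational arithmetic and
// formats the total with the given number of fractional digits.
func SumDecimals(values []string, scale int) (string, error) {
	total := new(big.Rat)
	for _, v := range values {
		r, ok := new(big.Rat).SetString(v)
		if !ok {
			return "", fmt.Errorf("not a decimal value: %q", v)
		}
		total.Add(total, r)
	}
	return total.FloatString(scale), nil
}

func main() {
	// Revenue strings from the three sample rows of sales.csv.
	sum, err := SumDecimals([]string{"29.97", "19.99", "69.93"}, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(sum) // prints 119.89
}
```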

Output (`--json`)

{
  "format_version": "1.0",
  "data": [ /* row array */ ],
  "errors": [],
  "warnings": []
}

Exit codes

Code	Meaning
0	Success
1	File not found, truncated, or unsupported version

Examples

# 10 rows
pulse api sample --input sales.pulse

# 100 rows, envelope-wrapped
pulse api sample --input sales.pulse --count 100 --json

# Pipe into jq
pulse api sample --input sales.pulse --count 100 | jq '.[] | .revenue'

When `sample` is the wrong tool

For filtered subsets, use pulse api process with a FILTER_* and no aggregation — the result will be one row per matching record.
For distinct values of a single field, use pulse api facet.
For schema-only views (types, descriptions, dictionaries), use pulse cohort inspect.

See also

pulse api facet: distinct values for a single field
Library: pulse.Sample

pulse api facet

Audience: CLI users enumerating distinct values for a single field, a cheap probe for “what are the regions in this cohort?” without building a full filter.

pulse api facet returns the distinct values of one field in a .pulse file. For categorical fields it reads the dictionary directly (no record scan). For non-categorical fields it scans records.

LLM agents using MCP: see the pulse_facet MCP tool.

Synopsis

pulse api facet --input PATH --field NAME [--json]

Flags

Flag	Alias	Type	Default	Purpose
`--input`	`-i`	string	(required)	Cohort `.pulse` file path
`--field`	`-f`	string	(required)	Field name to facet on
`--json`		bool	false	Emit the standard envelope

Output (text mode)

One value per line:

east
north
south
west

Output (`--json`)

{
  "format_version": "1.0",
  "data": ["east", "north", "south", "west"],
  "errors": [],
  "warnings": []
}

Performance notes

Field type	Behaviour
`categorical_u8` / `_u16` / `_u32`	Read directly from the schema’s inline dictionary; O(distinct values), no record scan
Non-categorical	Full scan; values collected into a set, then returned sorted

For columns with very high cardinality on the non-categorical path, expect memory proportional to distinct value count.
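The memory behaviour follows directly from the set-then-sort approach the table describes. This Go sketch shows that general technique on plain strings; it is an illustration rather than Pulse's implementation, and Distinct is a hypothetical name.

```go
package main

import (
	"fmt"
	"sort"
)

// Distinct collects values into a set and returns them sorted. The set
// holds one entry per distinct value, so memory grows with cardinality
// rather than with the number of records scanned.
func Distinct(values []string) []string {
	seen := make(map[string]struct{})
	for _, v := range values {
		seen[v] = struct{}{}
	}
	out := make([]string, 0, len(seen))
	for v := range seen {
		out = append(out, v)
	}
	sort.Strings(out)
	return out
}

func main() {
	regions := []string{"west", "east", "west", "north", "south", "east"}
	fmt.Println(Distinct(regions)) // prints [east north south west]
}
```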

Exit codes

Code	Meaning
0	Success
1	File not found, field name not found, or unsupported version

Examples

# Read categorical dictionary
pulse api facet --input sales.pulse --field region

# JSON envelope
pulse api facet --input sales.pulse --field region --json

# Pipe into another command
# Read line by line so values containing spaces stay intact
pulse api facet --input sales.pulse --field region | while IFS= read -r r; do
    echo "Region: $r"
done

See also

pulse api sample: raw rows preview
Format: Dictionary Blocks (how categorical dictionaries are encoded)
Library: pulse.Facet

pulse manifest

Audience: CLI users (and orchestration agents) discovering Pulse’s self-description — what commands exist, which aggregators are registered, which field types are supported, and what skills the binary ships with.

The manifest is what the bare pulse invocation prints when given --json. It is deterministic and process-wide: it never depends on cohort data or the filesystem.

LLM agents using MCP: the manifest is also available via the pulse_manifest MCP tool. Agents typically call this once per session and cache the result.

Synopsis

pulse --json [--slim]

(There is no pulse manifest subcommand — the manifest is the root command’s --json output.)

Flags

Flag	Type	Default	Purpose
`--json`	bool	false	Emit the manifest as a JSON envelope
`--slim`	bool	false	Drop prose descriptions from the manifest payload (smaller for size-sensitive clients)

Manifest shape

From descriptor/manifest.go:

{
  "format_version": "1.0",
  "data": {
    "commands":   [ /* every CLI leaf with a usage line */ ],
    "operators":  [ /* every aggregator / attribute / filterer / grouper / window / feature */ ],
    "tests":      [ /* every tier-1 statistical test */ ],
    "post_tests": [ /* every tier-2 post-test variant */ ],
    "distributions": [ /* every synth distribution kind */ ],
    "errors":     [ /* every registered error code with a description */ ],
    "mcp_tools":  [ /* every MCP tool name + description */ ],
    "field_types":[ /* every .pulse field type */ ],
    "skills":     [ /* every embedded skill with metadata */ ]
  },
  "errors":   [],
  "warnings": []
}

Every list is sorted deterministically (alphabetical or category + alphabetical). The same Pulse binary always emits the same manifest bytes (modulo --slim).
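A minimal sketch of the "category + alphabetical" ordering, assuming each manifest item carries a category and a name. The `entry` type and the "filter" category string are assumptions for illustration; only "aggregation" appears elsewhere on this page.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical stand-in for one manifest item.
type entry struct {
	Category string
	Name     string
}

// sortEntries orders by category first, then by name, so the result
// does not depend on the order in which items were registered.
func sortEntries(es []entry) {
	sort.Slice(es, func(i, j int) bool {
		if es[i].Category != es[j].Category {
			return es[i].Category < es[j].Category
		}
		return es[i].Name < es[j].Name
	})
}

func main() {
	es := []entry{
		{Category: "filter", Name: "FILTER_RANGE"},
		{Category: "aggregation", Name: "AGG_SUM"},
		{Category: "aggregation", Name: "AGG_AVERAGE"},
	}
	sortEntries(es)
	for _, e := range es {
		fmt.Println(e.Category, e.Name)
	}
}
```

Because the comparison is total over (category, name), any registration order yields the same output order.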

Determinism gates

Several CI tests enforce manifest completeness — see Testing Conventions. Notably:

TestManifestOperatorsComplete — every registered operator appears in the manifest.
TestManifestTestsComplete / TestManifestPostTestsComplete — every registered statistical test appears.
TestManifestDistributionsComplete, TestManifestErrorCodesComplete, TestManifestMCPToolsComplete — same for distributions, error codes, and MCP tools.
TestManifestStreamableMatchesTypes — every operator’s streamable flag mirrors the per-type method.

When to use the manifest

Use case	Reach for
Discover what’s available	`pulse --json`
Confirm a specific operator’s params and emit type	`jq '.data.operators[]'`, narrowed with `select(.name == ...)` as in the Examples section
List embedded skills with their `applies_to`	`jq '.data.skills[]'`
Generate documentation or client stubs	Parse the full manifest once at boot
Quick “is this name a real operator?”	`pulse --json --slim`

Exit codes

Code	Meaning
0	Always (the manifest is in-memory, deterministic, never errors)

Examples

Print the manifest

pulse --json | jq '.data | keys'

Slim variant for embedding in an agent’s system prompt

pulse --json --slim > manifest.slim.json

List every aggregator with its emitted type

pulse --json | jq '.data.operators[] | select(.category == "aggregation") | {name, emits_type}'

Confirm a feature operator’s parameters

pulse --json | jq '.data.operators[] | select(.name == "FEAT_BUCKETIZE")'

How LLMs Use Pulse — the manifest is one of the agent discovery primitives
Library: pulse.Manifest — Go counterpart
Internals: Architecture — why the manifest cannot import service/ or processing/

pulse synth from-schema

Audience: CLI users generating a synthetic .pulse cohort from a declarative spec — for testing, demos, and bootstrapping fixtures.

pulse synth from-schema reads a JSON synth spec (field-by-field distributions, row count, optional pairwise correlations) and writes a deterministic .pulse file. Same (spec, seed) pair produces a byte-identical output.

LLM agents using MCP: see the pulse_synth MCP tool and the synthetic-data skill — it covers spec authoring, the 12 supported distributions, and constraint patterns.

Synopsis

pulse synth from-schema --spec FILE --output FILE
                        [--rows N] [--seed N] [--json]

Flags

Flag	Alias	Type	Default	Purpose
`--spec`	`-s`	string	(required)	Synth spec JSON path
`--output`	`-o`	string	(required)	Output `.pulse` file path
`--rows`		int	from spec	Override `row_count` in the spec
`--seed`		int	0	Deterministic RNG seed
`--json`		bool	false	Emit the standard envelope

Spec shape (sketch)

{
  "row_count": 10000,
  "fields": [
    {"name": "id",      "type": "u64",            "distribution": "monotonic_from", "from": 1},
    {"name": "region",  "type": "categorical_u8", "distribution": "weighted_categorical",
                         "weights": {"east": 0.4, "west": 0.4, "north": 0.1, "south": 0.1}},
    {"name": "revenue", "type": "f64",            "distribution": "lognormal", "mu": 4.0, "sigma": 0.8},
    {"name": "sold_on", "type": "date",           "distribution": "uniform_date",
                         "from": "2024-01-01", "to": "2024-12-31"}
  ]
}

Full spec grammar (constraints, correlations, regex, …) lives in skills/synthetic-data.md and synth/.

Supported distributions

bernoulli, constant, exponential, lognormal, monotonic_from, normal, pareto, poisson, regex, uniform, uniform_date, weighted_categorical.

The full catalog (with parameters) is in skills/synthetic-data.md and pulse --json | jq '.data.distributions'.

Determinism

Same (spec, seed) → byte-identical output. The seed is an int64; default 0. Use a fixed seed for fixtures and a random seed for load-testing variation.
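The seed contract can be illustrated with Go's standard `math/rand`: two generators built from the same int64 seed yield the same sequence. This is a sketch of the general principle; which RNG Pulse uses internally is not stated here, and `draw` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math/rand"
)

// draw returns n pseudo-random values from a generator seeded with seed.
// Hypothetical helper; Pulse's own generator may differ.
func draw(seed int64, n int) []float64 {
	rng := rand.New(rand.NewSource(seed))
	out := make([]float64, n)
	for i := range out {
		out[i] = rng.Float64()
	}
	return out
}

// equal reports whether two sequences match element for element.
func equal(a, b []float64) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("same seed, same sequence:", equal(draw(42, 5), draw(42, 5)))
	fmt.Println("different seed, same sequence:", equal(draw(42, 5), draw(43, 5)))
}
```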

Output

Text mode

Generated 10000 rows -> sales.pulse (rejected 0)

rejected counts rows that failed user-defined constraints. If the rejection rate is too high to make progress, generation aborts with PULSE_SYNTH_CONSTRAINT_INFEASIBLE.
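One way to picture the rejected counter is a rejection loop that gives up once too many candidates fail. The sketch below is an assumption-laden illustration: the `maxRejectRatio` policy, the `generate` function, and the `errInfeasible` value are all hypothetical and do not describe Pulse's real threshold.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// errInfeasible stands in for PULSE_SYNTH_CONSTRAINT_INFEASIBLE (hypothetical).
var errInfeasible = errors.New("constraints reject too many rows")

// generate draws candidates until `want` of them pass accept.
// It aborts once rejections exceed maxRejectRatio*want, an assumed policy.
func generate(seed int64, want int, maxRejectRatio float64, accept func(float64) bool) (rows []float64, rejected int, err error) {
	rng := rand.New(rand.NewSource(seed))
	limit := int(maxRejectRatio * float64(want))
	for len(rows) < want {
		v := rng.Float64() * 100
		if accept(v) {
			rows = append(rows, v)
			continue
		}
		rejected++
		if rejected > limit {
			return nil, rejected, errInfeasible
		}
	}
	return rows, rejected, nil
}

func main() {
	// A mild constraint: some rows are rejected, generation still completes.
	rows, rejected, err := generate(0, 1000, 10, func(v float64) bool { return v >= 10 })
	fmt.Println(len(rows), rejected, err)

	// An impossible constraint: every row is rejected, so generation aborts.
	_, _, err = generate(0, 1000, 10, func(v float64) bool { return v < 0 })
	fmt.Println(err)
}
```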

`--json`

{
  "format_version": "1.0",
  "data": {
    "output_path": "sales.pulse",
    "rows_generated": 10000,
    "rows_rejected": 0,
    "seed": 0
  },
  "errors": [],
  "warnings": []
}

Exit codes

Code	Meaning
0	Success
1	Spec parse error, unknown distribution, infeasible constraints, or output write failure

Common error codes

Code	Cause
`PULSE_SYNTH_DISTRIBUTION_UNKNOWN`	Spec references a distribution name not in the catalog
`PULSE_SYNTH_CONSTRAINT_INFEASIBLE`	Constraints reject too high a fraction of generated rows

Examples

# Build sales.pulse from a spec
pulse synth from-schema --spec sales.spec.json --output sales.pulse --seed 42

# Override row count without editing the spec
pulse synth from-schema --spec sales.spec.json --output sales.pulse --rows 1000

# Programmatic envelope
pulse synth from-schema --spec sales.spec.json --output sales.pulse --json

pulse synth from-profile — generate from a captured profile of an existing cohort
pulse profile create — capture the profile
skills/synthetic-data.md — full spec grammar and distribution table
Library: pulse.Synth

pulse synth from-profile

Audience: CLI users generating a synthetic .pulse cohort whose distributions match a real cohort — typically to share a sanitised replica without exposing the underlying rows.

pulse synth from-profile reads a profile JSON captured by pulse profile create and writes a synthetic .pulse file whose per-field distributions and (optional) pairwise correlations follow the profile. The profile retains no individual rows from the source, only summary statistics.

LLM agents using MCP: see the pulse_synth_from_profile MCP tool and the synthetic-data skill.

Synopsis

pulse synth from-profile --profile FILE --output FILE --rows N
                         [--seed N] [--json]

Flags

Flag	Alias	Type	Default	Purpose
`--profile`	`-p`	string	(required)	Profile JSON path
`--output`	`-o`	string	(required)	Output `.pulse` file path
`--rows`		int	(required)	Rows to generate
`--seed`		int	0	Deterministic RNG seed
`--json`		bool	false	Emit the standard envelope

--rows is required (unlike from-schema, which can pull it from the spec) because the profile does not carry a generation count of its own.

Determinism

Same (profile, seed, rows) triple → byte-identical output. Seeds are int64; default 0.

Profile shape

The profile is a synth.Profile JSON object produced by pulse profile create. It carries per-field type, descriptive statistics, top-K categorical entries (default K = 32), optional pairwise correlations (when --include-correlations was passed at profile-creation time), and a row count.

See pulse profile create for how to capture one, and synth/ for the underlying Go types.

Output

Text mode

Generated 1000 rows -> sales.synth.pulse (rejected 0)

`--json`

Same envelope shape as synth from-schema.

Exit codes

Code	Meaning
0	Success
1	Profile parse error, infeasible constraints, or output write failure

Examples

# Capture once
pulse profile create --input sales.pulse --output sales.profile.json

# Re-generate any number of times with different seeds
pulse synth from-profile --profile sales.profile.json --output sales.s42.pulse --rows 10000 --seed 42
pulse synth from-profile --profile sales.profile.json --output sales.s43.pulse --rows 10000 --seed 43

Limitations

Categorical tails: anything past the captured top-K is replaced with a sentinel “other” bucket sized to its observed weight.
Correlations: pairwise only, and only between numeric fields. The profile capture flag --include-correlations opts in; without it, fields are generated independently.
Decimal and geo fields: regenerated within the same type family but with synthetic value distributions; downstream uses that depend on exact field values (e.g. joinable identifiers) need the schema-driven path instead.
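The categorical-tail behaviour in the first limitation can be sketched as a top-K cut with the remainder folded into one weight. This is an illustration under assumptions: the `weighted` type and `topK` function are hypothetical, and the tie-break rule (alphabetical) is chosen here only to keep the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// weighted pairs a categorical value with its share of the rows.
type weighted struct {
	Value  string
	Weight float64
}

// topK keeps the k most frequent values and folds everything else
// into a single "other" weight. Hypothetical helper, not Pulse's code.
func topK(counts map[string]int, k int) (kept []weighted, other float64) {
	total := 0
	all := make([]weighted, 0, len(counts))
	for v, c := range counts {
		total += c
		all = append(all, weighted{Value: v, Weight: float64(c)})
	}
	if total == 0 {
		return nil, 0
	}
	sort.Slice(all, func(i, j int) bool {
		if all[i].Weight != all[j].Weight {
			return all[i].Weight > all[j].Weight
		}
		return all[i].Value < all[j].Value // assumed tie-break for determinism
	})
	for i := range all {
		all[i].Weight /= float64(total)
		if i < k {
			kept = append(kept, all[i])
		} else {
			other += all[i].Weight
		}
	}
	return kept, other
}

func main() {
	counts := map[string]int{"east": 4, "west": 2, "north": 1, "south": 1}
	kept, other := topK(counts, 2)
	fmt.Println(kept, other)
}
```

Values that fall into the tail lose their identity, which is why exact identifiers do not survive a profile round trip.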

pulse profile create
pulse synth from-schema
skills/synthetic-data.md — the spec / profile grammar

pulse profile create

Audience: CLI users capturing a statistical profile of an existing cohort — typically to feed into pulse synth from-profile.

pulse profile create reads a .pulse file and writes a JSON profile: per-field type, descriptive statistics, top-K categorical entries, optional pairwise correlations. The profile retains no individual rows from the source.

LLM agents using MCP: see the pulse_profile MCP tool.

Synopsis

pulse profile create --input PATH --output PATH
                     [--top-k N] [--include-stats]
                     [--include-correlations] [--correlation-top-k N]
                     [--sample-limit N] [--json]

Flags

Flag	Alias	Type	Default	Purpose
`--input`	`-i`	string	(required)	Source `.pulse` cohort
`--output`	`-o`	string	(required)	Output profile JSON path
`--top-k`		int	32	Top-K categorical entries to retain per field
`--include-stats`		bool	true	Include percentile / std stats
`--include-correlations`		bool	false	Capture pairwise numeric correlations
`--correlation-top-k`		int	16	Cap on retained correlation pairs
`--sample-limit`		int	0 (unlimited)	Cap rows ingested for the profile (0 disables)
`--json`		bool	false	Also print the envelope to stdout

What the profile captures

Field type	What is recorded
Numeric (`u`, `f`, `decimal128`)	Count, min, max, mean, stddev; percentiles if `--include-stats`
Categorical	Top-K most-frequent values + their frequencies; “other” tail weight
`date`	Min, max, count
`nullable_*`	Null count alongside the above

What the profile does NOT capture

Individual rows.
The full categorical dictionary beyond --top-k.
Correlations unless --include-correlations is set.

This is by design — profiles are intended to be safe to share with parties who shouldn’t see the underlying data.

Output

The profile JSON is always written to --output. With --json, the envelope is also written to stdout (typically piped into jq).

Profile schema lives in synth/profile.go and is documented in skills/synthetic-data.md.

Text mode summary

Profiled 50000 rows from sales.pulse -> sales.profile.json

Exit codes

Code	Meaning
0	Success
1	Read error, unsupported field type (`PULSE_PROFILE_FIELD_UNSUPPORTED`), or write failure

Examples

Minimal profile

pulse profile create --input sales.pulse --output sales.profile.json

Rich profile with correlations

pulse profile create --input sales.pulse --output sales.profile.json \
    --include-stats --include-correlations --top-k 64 --correlation-top-k 32

Sample-limited profile for a huge cohort

pulse profile create --input ops.pulse --output ops.profile.json --sample-limit 1000000

Round-trip with synth

pulse profile create --input sales.pulse --output sales.profile.json
pulse synth from-profile --profile sales.profile.json --output sales.synth.pulse --rows 10000 --seed 1
pulse cohort inspect sales.synth.pulse

pulse synth from-profile — the consumer of profile JSON
pulse synth from-schema — the alternative spec-driven path
skills/synthetic-data.md — full profile and spec grammar
Library: pulse.Profile

pulse mcp

Audience: operators wiring Pulse into an MCP-aware AI client (Claude Desktop, Claude Code, generic MCP clients).

pulse mcp runs the Model Context Protocol server over stdio. The AI client launches pulse mcp as a subprocess, speaks MCP over its stdio streams, and shuts it down on session close.

LLM agents using MCP: the agent-side guide is the mcp-integration skill — fetch it via pulse_skills_get for the tool catalog and request shapes. This page is for the human setting the server up.

Synopsis

pulse mcp [--data-dir PATH] [--bind-on-open]

The command reads MCP requests from stdin, writes MCP responses to stdout, and writes a one-line startup notice (and any subsequent diagnostics) to stderr.

Flags

Flag	Type	Default	Purpose
`--data-dir`	string	from `PULSE_DATA_DIR` env var	Cohort base directory
`--bind-on-open`	bool	true	Register session-scoped JSON-schema-bound tool variants on successful `pulse_inspect`

A data directory is required, supplied either through the --data-dir flag or the PULSE_DATA_DIR environment variable. The MCP server fails to start otherwise:

data directory required: set PULSE_DATA_DIR or pass --data-dir

`--bind-on-open`

When a session calls pulse_inspect successfully, the server can register session-scoped tool variants whose JSON Schemas constrain field-name parameters to the cohort’s actual fields. This narrows the LLM’s choices and prevents typos at parameter-binding time.

Default: true. Pass --bind-on-open=false if your client binds tool schemas itself.

The binding logic lives in internal/mcp/schema_bind.go; see skills/mcp-integration.md for the LLM-facing implications.
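To make the idea of a schema-bound tool variant concrete, here is a rough sketch that builds a JSON Schema fragment whose field parameter only admits a cohort's field names. The schema layout, the parameter name `field`, and the function `fieldParamSchema` are assumptions for illustration; the real shape is defined in internal/mcp/schema_bind.go and may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fieldParamSchema returns a JSON Schema fragment in which "field"
// may only take one of the given names. Illustrative shape only.
func fieldParamSchema(fields []string) map[string]any {
	return map[string]any{
		"type": "object",
		"properties": map[string]any{
			"field": map[string]any{
				"type": "string",
				"enum": fields, // restricts the parameter to real field names
			},
		},
		"required": []string{"field"},
	}
}

func main() {
	schema := fieldParamSchema([]string{"region", "revenue", "sold_on"})
	out, err := json.MarshalIndent(schema, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

An `enum` constraint of this kind lets the client reject a misspelled field name before the call ever reaches the server.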

Wiring it into Claude Desktop

~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "pulse": {
      "command": "/usr/local/bin/pulse",
      "args": ["mcp"],
      "env": {
        "PULSE_DATA_DIR": "/var/data/pulse"
      }
    }
  }
}

Restart the client. The Pulse tools (pulse_manifest, pulse_ask, pulse_inspect, pulse_predict, pulse_process, pulse_compose, pulse_sample, pulse_facet, pulse_import, pulse_drop, pulse_imports_list, pulse_examples_search, pulse_examples_get, pulse_errors_lookup, pulse_skills_list, pulse_skills_get) and resources (pulse://*.pulse, pulse-skill://*) appear in the tool/resource list.

Wiring it into Claude Code

~/.claude.json (or per-project .claude.json):

{
  "mcpServers": {
    "pulse": {
      "command": "/usr/local/bin/pulse",
      "args":    ["mcp"],
      "env":     { "PULSE_DATA_DIR": "/var/data/pulse" }
    }
  }
}

The full LLM-side recipe (including resource URIs and the schema binding details) is in skills/mcp-integration.md.

Exit codes

pulse mcp is a long-running process. It exits non-zero only on fatal startup failure (missing data dir, transport error). Once serving, an MCP client controls the lifecycle.

Examples

Foreground run for debugging

PULSE_DATA_DIR=/tmp/pulse-data ./bin/pulse mcp
# Stderr: pulse mcp: serving over stdio (data dir: /tmp/pulse-data, bind-on-open: true)

Disable schema binding

PULSE_DATA_DIR=/tmp/pulse-data ./bin/pulse mcp --bind-on-open=false

Inspect what the server registers

# Manifest exposes the MCP tool list
pulse --json | jq '.data.mcp_tools[]'

How LLMs Use Pulse — the pointer table from this site into the skill pack
skills/mcp-integration.md — LLM-side wiring, tool catalog, resource schemes, schema binding
Deployment — production hardening notes
Troubleshooting — common MCP failure modes

Flag Reference

Audience: CLI users who want one page that lists every flag and every environment variable in scope across the binary.

The per-command pages list each command’s full flag set; this page is the cross-cutting reference for flags that appear on multiple commands and for the environment variables Pulse reads.

LLM agents using MCP: there is no LLM-facing skill for the CLI surface. Agents go via MCP tools (pulse_process, pulse_inspect, …) — see skills/mcp-integration.md.

Global flags

Available on the bare pulse invocation:

Flag	Effect
`--json`	Print the root manifest as JSON (envelope-wrapped)
`--slim`	With `--json`, drop prose descriptions for size-sensitive clients

Both default to off. pulse --json is the discovery entry point — it emits the manifest documented at pulse manifest.

Environment variables

Variable	Used by	Required	Purpose
`PULSE_DATA_DIR`	All commands when no path override is given; required by `pulse mcp`	conditionally	Base directory for cohort files. Relative cohort paths resolve against it.

PULSE_DATA_DIR is the only PULSE_* environment variable today. The Makefile auto-loads a repo-root .env file so you can keep it (and any future env vars) there for development.

When embedding the library, you can bypass the env var entirely by passing pulse.Options{DataDir: "/path"} or pulse.Options{FS: myFs}.

`--json` envelope

Almost every leaf command accepts --json, which switches output from human prose to a structured envelope. The envelope shape is fixed and documented in CLAUDE.md → Output Format Contract:

{
  "format_version": "1.0",
  "data":     { /* operation-specific result */ },
  "errors":   [ /* {"code": "...", "message": "...", "details": {...}} */ ],
  "warnings": [ /* same shape */ ]
}

format_version is currently "1.0". errors and warnings are always arrays (never null) so JSON consumers can index without nullable-check overhead.
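A consumer written in Go might decode the envelope along these lines. The struct names are hypothetical; the JSON keys come from the envelope shape above, and `data` is left as raw JSON because its contents vary per command. The sample payload reuses the facet output from earlier on this site.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// entry mirrors the documented error/warning shape.
type entry struct {
	Code    string         `json:"code"`
	Message string         `json:"message"`
	Details map[string]any `json:"details,omitempty"`
}

// envelope leaves Data undecoded so one type serves every command.
type envelope struct {
	FormatVersion string          `json:"format_version"`
	Data          json.RawMessage `json:"data"`
	Errors        []entry         `json:"errors"`
	Warnings      []entry         `json:"warnings"`
}

// decodeEnvelope parses raw and gates on the documented format version.
func decodeEnvelope(raw []byte) (*envelope, error) {
	var env envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, err
	}
	if env.FormatVersion != "1.0" {
		return nil, fmt.Errorf("unsupported format_version %q", env.FormatVersion)
	}
	return &env, nil
}

func main() {
	raw := []byte(`{"format_version":"1.0","data":["east","north","south","west"],"errors":[],"warnings":[]}`)
	env, err := decodeEnvelope(raw)
	if err != nil {
		panic(err)
	}
	var regions []string
	if err := json.Unmarshal(env.Data, &regions); err != nil {
		panic(err)
	}
	fmt.Println(regions, len(env.Errors)) // prints [east north south west] 0
}
```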

Shared per-command flags

Several flags appear on multiple commands with identical semantics.

`--no-defaults`

Available on: api process, api compose, api ask.

Disables the runtime smart-defaults pass that infers operator Type from the named field’s schema type when the caller omits it. With the flag set, the request is treated as the source of truth and is never silently re-typed. See pulse.New & Options for the underlying library option.

`--stream`

Available on: api process, api compose.

Stream result rows as NDJSON (one row per line) instead of buffering the full result. For compose, each line carries an {"index": N, "row": {...}} shape so consumers know which sub-request produced each row. See Streaming & ProcessStream.

`--strict`

Available on: api predict.

Treat warnings (e.g. low-quality field description) as errors. Useful in CI gates that want the strictest possible validation.

`--full-dict`

Available on: cohort inspect.

Print full categorical dictionaries instead of truncating after 100 entries. Pair with --json for programmatic consumption.

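`--seed` / `--rows`

Available on: synth from-schema, synth from-profile.

--seed sets the deterministic RNG seed on both commands. --rows overrides the spec’s row_count on from-schema and is required on from-profile, where the profile carries no generation count. See the per-command pages.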

Help

Every command supports --help:

pulse --help
pulse api --help
pulse api process --help
pulse mcp --help

--help output is the urfave/cli v3 default — a usage block, description, flag list, and an examples block where applicable.

Cross-references

If you need…	Go to
Per-command synopsis & examples	CLI Tour and each `cli/` page
Library-side equivalents	Library Embedding
MCP-side equivalents	How LLMs Use Pulse
Envelope and error code semantics	Troubleshooting and `skills/error-code-reference.md`

Go API Overview

Audience: Go developers embedding Pulse in a binary or a service.

Pulse is library-first. The CLI in cmd/pulse/ is a thin adapter around the package documented here. If you’re reaching for os/exec to shell out to the binary from Go, stop and use the library directly — you’ll skip a process boundary and gain typed responses.

LLM agents using MCP: there is no LLM-facing skill that covers Go embedding directly. Agents speak MCP; this page is for the programs that host them.

Module path

import "github.com/frankbardon/pulse"

Sub-packages you’ll commonly touch:

Package	Purpose
`github.com/frankbardon/pulse`	Public facade (`Pulse`, `Options`, `Request`, `Response`, `Ask`, …)
`github.com/frankbardon/pulse/types`	Request/response structs, component-type constants (`AGG_*`, …)
`github.com/frankbardon/pulse/io`	Tabular adapter interfaces (`Reader`, `Writer`, `ImportJob`, `ExportJob`, `ConvertJob`)
`github.com/frankbardon/pulse/io/<fmt>`	Per-format readers/writers (`csv`, `tsv`, `ndjson`, `jsonarray`, `parquet`, `arrow`, `excel`)
`github.com/frankbardon/pulse/fs`	`afero`-backed filesystem config (`fs.New`, `fs.Default`, `fs.NewMemMap`)
`github.com/frankbardon/pulse/errors`	Typed `CodedError` system and code constants
`github.com/frankbardon/pulse/descriptor`	Manifest, predict, inspect (no-execute operations)
`github.com/frankbardon/pulse/synth`	Synthetic data generator and profile types
`github.com/frankbardon/pulse/skills`	Embedded skill pack — `skills.List()`, `skills.Get(name)`

The internal/ subtree (internal/cli, internal/mcp, internal/query) is exactly that — internal. Don’t import it.

The facade

Construct a Pulse once per process (or per filesystem boundary) and re-use it:

p, err := pulse.New(pulse.Options{
    DataDir: "/var/data/pulse",
})
if err != nil {
    return err
}

The full Options shape (custom afero.Fs, smart-default toggling) is documented at pulse.New & Options.

Public methods

From pulse.go:

Method	Purpose
`Open(ctx, path) (*Cohort, error)`	Read header + schema, return a typed Cohort handle
`Process(ctx, req) (*Response, error)`	Execute one request
`ProcessStream(ctx, req) (RowIter, error)`	Same, pull-based iterator over result rows
`Compose(ctx, req) ([]*Response, error)`	Execute a batch sequentially
`ComposeParallel(ctx, req, opts) ([]*Response, error)`	Execute a batch in parallel with a worker pool
`Ask(ctx, askReq) (*AskResponse, error)`	Unified entry: predict + (optionally) process, with natural-language query support
`Import(ctx, job) (*ImportReport, error)`	Tabular → `.pulse`
`Export(ctx, job) (*ExportReport, error)`	`.pulse` → tabular
`Convert(ctx, job) (*ConvertReport, error)`	Tabular → tabular, with `.pulse` as the transparent middle
`Inspect(ctx, path) (*InspectResult, error)`	Read header + schema only (no record data)
`Predict(ctx, req) (*PredictResult, error)`	Validate a request without executing
`Sample(ctx, path, n) ([]Record, error)`	Up to n rows
`Facet(ctx, path, field) ([]string, error)`	Distinct values of a field
`Synth(ctx, spec, out, opts) (*SynthResult, error)`	Generate a synthetic cohort
`Profile(ctx, path, opts) (*Profile, error)`	Statistical summary suitable for `from-profile` synthesis
`Manifest(ctx) *Manifest`	Deterministic root self-description
`Fs() afero.Fs`	The underlying filesystem (used by `pulse mcp` and other embedders)

Re-exported type aliases let you write pulse.Request instead of types.Request:

type (
    Request         = types.Request
    Response        = types.Response
    ComposedRequest = types.ComposedRequest
    SynthSpec       = synth.Spec
    Profile         = synth.Profile
    // … and so on
)

Minimum viable embed

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/frankbardon/pulse"
    "github.com/frankbardon/pulse/types"
)

func main() {
    ctx := context.Background()

    p, err := pulse.New(pulse.Options{DataDir: "/var/data/pulse"})
    if err != nil {
        log.Fatal(err)
    }

    resp, err := p.Process(ctx, &pulse.Request{
        Cohort: &types.Cohort{Filename: "sales.pulse"},
        Aggregations: []*types.Aggregation{
            {Type: types.AGG_AVERAGE, Field: "revenue", Label: "avg_revenue"},
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(resp.Data)
}

Where to go from here

pulse.New & Options — full Options reference.
pulse.Ask Unified Entry Point — the one-shot facade the MCP server uses internally.
Custom Filesystems — in-memory testing pattern, custom storage backends.
Streaming & ProcessStream — pull-based iteration, what streams vs what buffers.
Parallel Compose — worker pool, fail-fast, per-request timeout.

pulse.New & Options

Audience: Go embedders constructing a Pulse instance.

pulse.New(pulse.Options{...}) is the single entry point. There is no config file, no init function, no global state. Every option is declared in code (or comes from PULSE_DATA_DIR when the field is left empty).

LLM agents using MCP: the MCP server constructs its own Pulse instance from CLI flags. Agents don’t see this surface.

The Options struct

From pulse.go:

type Options struct {
    // DataDir is the base directory for cohort files.
    // Defaults to PULSE_DATA_DIR if empty and FS is not set.
    DataDir string

    // FS is an optional custom filesystem.
    // When set, DataDir is ignored for filesystem construction.
    FS afero.Fs

    // DisableDefaults turns off the smart-defaults pass that infers
    // operator Type from the named field's schema type when the caller
    // omits it. Defaults to false (defaults enabled). Predict still
    // computes and reports DefaultsApplied independently — this flag
    // governs only what the runtime mutates on the live request.
    DisableDefaults bool
}

Field reference

`DataDir string`

The base directory for .pulse files. Relative cohort paths ({"filename": "data.pulse"}) resolve against this directory.

Source	Result
Non-empty `Options.DataDir`, `FS` nil	Used directly
`FS` non-nil	`DataDir` is ignored, whether or not it is set; the FS is the trust boundary
Empty + `FS` nil	Pulse falls back to `fs.Default()`, which reads `PULSE_DATA_DIR`

Example:

p, err := pulse.New(pulse.Options{DataDir: "/var/data/pulse"})

`FS afero.Fs`

A custom afero.Fs implementation. When set, it fully overrides the filesystem layer — DataDir is unused, and PULSE_DATA_DIR is not consulted. Use this for tests (afero.NewMemMapFs()) or non-local backends (S3-backed afero.Fs, encrypted overlays, …).

Example:

import "github.com/spf13/afero"

p, err := pulse.New(pulse.Options{
    FS: afero.NewMemMapFs(),
})

See Custom Filesystems for in-depth usage and the hermetic-test pattern.

`DisableDefaults bool`

The runtime smart-defaults pass infers an operator’s Type from the named field’s schema type when the caller omits it (e.g. AGG_SUM on a numeric field defaults appropriately; categorical fields default toward AGG_COUNT). Set DisableDefaults = true to require an explicit Type on every aggregation and grouper — useful when you want the request to be source-of-truth and never be silently re-typed.

This option only governs the runtime mutation. predict independently computes and reports DefaultsApplied in its result envelope, so callers can see what would have been inferred even when defaults are disabled.

CLI parity: pulse api process --no-defaults, pulse api compose --no-defaults, pulse api ask --no-defaults.

Defaults at a glance

`Options` configuration	Effective behaviour
`DataDir` and `FS` both empty	Pulse calls `fs.Default()` → reads `PULSE_DATA_DIR` env var. Errors if unset and the operation needs filesystem access.
`DataDir` only	Uses an `afero.NewOsFs()` rooted at `DataDir`.
`FS` only	Uses the provided FS verbatim.
Both	`FS` wins; `DataDir` is ignored.
`DisableDefaults` omitted	Defaults enabled.
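The directory fallback for the no-FS case can be sketched as a small resolver. This is a simplified illustration: `resolveDataDir` is a hypothetical function, the environment lookup is injected so the example stays deterministic, and it fails eagerly whereas the table above says Pulse errors only when an operation needs filesystem access.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveDataDir mimics the documented fallback when no custom FS is given:
// an explicit DataDir wins, otherwise PULSE_DATA_DIR is consulted.
// Hypothetical helper; not part of the Pulse API.
func resolveDataDir(optDataDir string, getenv func(string) string) (string, error) {
	if optDataDir != "" {
		return optDataDir, nil
	}
	if dir := getenv("PULSE_DATA_DIR"); dir != "" {
		return dir, nil
	}
	return "", errors.New("data directory required: set PULSE_DATA_DIR or supply DataDir")
}

func main() {
	env := map[string]string{"PULSE_DATA_DIR": "/var/data/pulse"}
	getenv := func(k string) string { return env[k] }

	dir, _ := resolveDataDir("/srv/cohorts", getenv)
	fmt.Println(dir) // prints /srv/cohorts

	dir, _ = resolveDataDir("", getenv)
	fmt.Println(dir) // prints /var/data/pulse

	_, err := resolveDataDir("", func(string) string { return "" })
	fmt.Println(err != nil) // prints true
}
```

A real program would pass `os.Getenv` as the lookup function.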

Re-using a Pulse instance

Pulse is safe for concurrent use across goroutines once constructed. The internal registries are read-only after New; each Process call constructs fresh stateful operators per request, so multiple goroutines can call Process/ProcessStream/Compose in parallel against the same Pulse.

For batch parallelism, prefer ComposeParallel — it shares the read-only registries and bounds concurrency for you.

Tearing down

There is no explicit Close() method on Pulse. The filesystem is a borrowed handle; if you supply a custom FS, the embedder is responsible for any cleanup that FS requires. Streaming consumers should still call RowIter.Close() so that the underlying readers release their buffers.

pulse.Ask — Unified Entry Point

Audience: Go embedders who want a single call that validates a request and then optionally executes it.

Ask is the one-shot facade. It collapses predict, process, and the natural-language query parser into a single typed call. The MCP server uses this same method internally for the pulse_ask tool.

LLM agents using MCP: the corresponding LLM-facing surface is the pulse_ask MCP tool, documented in skills/mcp-integration.md and skills/request-recipes.md.

When to use Ask vs Process

Goal	Reach for
Validate a request without running it	`Predict` (or `Ask` with `Predict: true`)
Validate then execute in one call	`Ask`
Translate a natural-language string into a request and execute	`Ask` with `Query` set
Execute a request you’ve already validated separately	`Process` (lower overhead)

If you’re already inside a tight loop that validates once and runs many similar requests, prefer Process — Ask does the predict pass on every call.

Request shape

From pulse.go:

type AskRequest struct {
    File      string         `json:"file,omitempty"`
    Request   *types.Request `json:"request,omitempty"`
    Query     string         `json:"query,omitempty"`
    OnInvalid string         `json:"on_invalid,omitempty"`
    Predict   bool           `json:"predict,omitempty"`
}

Field	Meaning
`File`	Cohort path. When set and `Request.Cohort` is nil, Ask synthesises a `Cohort` from the path.
`Request`	Structured `types.Request`. Optional when `Query` is set — the parser fills empty slots.
`Query`	Natural-language query string (“average revenue by region”). Parsed against the cohort’s schema.
`OnInvalid`	`"abort"` (default) returns a `SERVICE_VALIDATION` error on predict-invalid; `"suggest"` returns the response with `Suggestions` populated.
`Predict`	When `true`, skip execution after a successful predict. The “what would happen if I ran this” probe.

Response shape

type AskResponse struct {
    FormatVersion   string                      `json:"format_version"`
    Predict         *descriptor.PredictResult   `json:"predict"`
    Process         *Response                   `json:"process,omitempty"`
    Suggestions     []errors.Fixup              `json:"suggestions,omitempty"`
    QueryResolution *QueryResolution            `json:"query_resolution,omitempty"`
    Errors          []*descriptor.EnvelopeEntry `json:"errors"`
    Warnings        []*descriptor.EnvelopeEntry `json:"warnings"`
}

Predict is always populated.
Process is set only when execution ran.
Suggestions is populated only when predict reported invalid and OnInvalid == "suggest".
QueryResolution is set only when Query was non-empty; it echoes the parser’s matched fields and aggregate confidence in [0, 1].
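For a client that receives an AskResponse as JSON, the population rules above suggest a simple branch on which keys are present. The sketch below is a hedged illustration: the local `askResponse` type mirrors only the JSON tags shown above, nested payloads are kept as raw JSON because their shapes are not documented here, and the outcome labels are invented.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// askResponse mirrors the JSON keys of AskResponse; nested values stay raw.
type askResponse struct {
	FormatVersion   string            `json:"format_version"`
	Predict         json.RawMessage   `json:"predict"`
	Process         json.RawMessage   `json:"process,omitempty"`
	Suggestions     []json.RawMessage `json:"suggestions,omitempty"`
	QueryResolution json.RawMessage   `json:"query_resolution,omitempty"`
}

// classify reports which outcome a response represents, using
// hypothetical labels: "executed", "suggestions", or "not-executed".
func classify(raw []byte) (string, error) {
	var r askResponse
	if err := json.Unmarshal(raw, &r); err != nil {
		return "", err
	}
	switch {
	case len(r.Process) > 0 && string(r.Process) != "null":
		return "executed", nil
	case len(r.Suggestions) > 0:
		return "suggestions", nil
	default:
		return "not-executed", nil
	}
}

func main() {
	// Payload contents are placeholders; only key presence matters here.
	samples := []string{
		`{"format_version":"1.0","predict":{},"process":{},"errors":[],"warnings":[]}`,
		`{"format_version":"1.0","predict":{},"suggestions":[{}],"errors":[],"warnings":[]}`,
		`{"format_version":"1.0","predict":{},"errors":[],"warnings":[]}`,
	}
	for _, s := range samples {
		kind, err := classify([]byte(s))
		if err != nil {
			panic(err)
		}
		fmt.Println(kind)
	}
}
```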

Examples

Structured request, predict-only

resp, err := p.Ask(ctx, &pulse.AskRequest{
    Request: &pulse.Request{
        Cohort: &types.Cohort{Filename: "sales.pulse"},
        Aggregations: []*types.Aggregation{
            {Type: types.AGG_SUM, Field: "revenue", Label: "total"},
        },
    },
    Predict: true,
})

Natural-language query

resp, err := p.Ask(ctx, &pulse.AskRequest{
    File:  "sales.pulse",
    Query: "average revenue by region",
})
if err != nil {
    return err
}
// QueryResolution is populated because Query is non-empty.
fmt.Printf("matched: %v (conf %.2f)\n",
    resp.QueryResolution.MatchedFields,
    resp.QueryResolution.Confidence)

The parser fills the structured request from the query and runs Process. Explicit fields in Request always win on collision — the parser only fills empty slots.

Query plus a partial structured request

resp, err := p.Ask(ctx, &pulse.AskRequest{
    File: "sales.pulse",
    Request: &pulse.Request{
        Filterers: []*types.Filterer{
            {Type: types.FILTER_RANGE, Field: "revenue", Values: []string{"100", "1000"}},
        },
    },
    Query: "average revenue by region",
})

The structured Filterers win; the parser supplies Aggregations and Groups from the query.
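The "explicit wins, parser fills empty slots" rule can be pictured as a slot-by-slot merge. The sketch uses a pared-down, hypothetical `request` type with string slices in place of the real types.Request, and assumes the merge happens per slot (whole list), which the text implies but does not spell out.

```go
package main

import "fmt"

// request is a pared-down, hypothetical stand-in for types.Request.
type request struct {
	Aggregations []string
	Groups       []string
	Filterers    []string
}

// fillEmpty copies a slot from parsed only when explicit left it empty,
// so caller-supplied values always win on collision.
func fillEmpty(explicit, parsed request) request {
	out := explicit
	if len(out.Aggregations) == 0 {
		out.Aggregations = parsed.Aggregations
	}
	if len(out.Groups) == 0 {
		out.Groups = parsed.Groups
	}
	if len(out.Filterers) == 0 {
		out.Filterers = parsed.Filterers
	}
	return out
}

func main() {
	explicit := request{Filterers: []string{"FILTER_RANGE revenue 100..1000"}}
	parsed := request{
		Aggregations: []string{"AGG_AVERAGE revenue"},
		Groups:       []string{"region"},
		Filterers:    []string{"parser-guessed filter"},
	}
	merged := fillEmpty(explicit, parsed)
	fmt.Println(merged.Aggregations, merged.Groups, merged.Filterers)
}
```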

Suggest fixups instead of erroring

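resp, err := p.Ask(ctx, &pulse.AskRequest{
    Request:   req,
    OnInvalid: "suggest",
})
if err != nil {
    return err
}
// Suggestions is populated only when predict reported the request invalid.
for _, fix := range resp.Suggestions {
    fmt.Println(fix.Code, fix.Message, fix.Hint)
}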

Fixup templates live in errors/fixup_metadata.go and are documented per code in skills/error-code-reference.md.

Errors and warnings

AskResponse.Errors and AskResponse.Warnings flatten the descriptor envelope’s entries plus any issues the query parser raised (PULSE_QUERY_UNRESOLVED, PULSE_QUERY_AMBIGUOUS). The arrays are always present (never nil) so JSON consumers can index without null-checks — same shape as the descriptor envelope.
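
The never-nil guarantee matters in Go because encoding/json renders a nil slice as null and an empty slice as []. A minimal sketch of that difference follows. The envelope struct is a hypothetical stand-in, not Pulse's actual response type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope is a hypothetical stand-in for a response that carries an errors array.
type envelope struct {
	Errors []string `json:"errors"`
}

// marshal returns the JSON encoding of e as a string.
func marshal(e envelope) string {
	b, err := json.Marshal(e)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(marshal(envelope{}))                   // nil slice: {"errors":null}
	fmt.Println(marshal(envelope{Errors: []string{}})) // empty slice: {"errors":[]}
}
```

A producer that wants consumers to index without null-checks therefore has to initialise the slice explicitly before marshalling.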

FormatVersion mirrors the descriptor envelope version ("1.0") so callers can gate on a single value across endpoints.

Custom Filesystems

Audience: Go embedders running Pulse in tests (hermetic, no disk), in cloud-storage-backed environments (S3, GCS, Azure Blob via afero), or behind a custom storage layer.

Pulse routes all file I/O through afero.Fs. Pass any afero.Fs-conformant filesystem to pulse.New(pulse.Options{FS: ...}) and Pulse never touches the OS filesystem directly.

LLM agents using MCP: the MCP server’s filesystem is fixed at startup via PULSE_DATA_DIR or --data-dir. Agents don’t swap filesystems mid-session.

In-memory testing pattern

The single most common reason to override the filesystem is hermetic tests. Use fs.NewMemMap() (which wraps afero.NewMemMapFs() with the right config) or pass the afero filesystem directly:

import (
    "testing"

    "github.com/frankbardon/pulse"
    "github.com/spf13/afero"
)

func TestSomething(t *testing.T) {
    p, err := pulse.New(pulse.Options{FS: afero.NewMemMapFs()})
    if err != nil {
        t.Fatal(err)
    }

    // Write a .pulse file into the in-memory FS, then process it.
    // ...
}

The in-memory FS persists for the life of the FS reference. Create a fresh one per test for isolation.

Custom storage backends

Anything that implements afero.Fs works. Common patterns:

S3 / GCS / Azure Blob — via community afero adapters (afero/gcsfs, afero/s3).
Encrypted overlays — wrap a base FS with envelope encryption per file.
Read-only mounts — afero.NewReadOnlyFs(base) for production cohort serving where mutation is by accident, not policy.

Example with a hypothetical S3 wrapper:

import (
    "log"

    "example.com/myorg/aferos3" // hypothetical S3-backed afero.Fs
    "github.com/frankbardon/pulse"
)

func main() {
    s3fs := aferos3.New(aferos3.Config{
        Bucket: "my-pulse-cohorts",
        Region: "us-east-1",
    })
    p, err := pulse.New(pulse.Options{FS: s3fs})
    if err != nil {
        log.Fatal(err)
    }
    _ = p // p reads and writes cohort files from S3 transparently.
}

The fs package

The lower-level constructors live in fs/:

Function	Purpose
`fs.New(opts ...Option) (*fs.Config, error)`	Build a config with `fs.WithFs(...)` / `fs.WithDataDir(...)`
`fs.Default() (*fs.Config, error)`	Read `PULSE_DATA_DIR` from the environment
`fs.NewMemMap() *fs.Config`	In-memory test config

You can also bypass pulse.Options entirely and construct a service from a *fs.Config, but the public facade is the intended entry point. pulse.New(pulse.Options{FS: yourFs}) covers every embedding case.

Path resolution

Pulse resolves a Cohort to a path with this rule (see resolveCohortPath in pulse.go):

if cohort.DataDir != "" → "<DataDir>/<Filename>"
else                    → "<Filename>"

The custom FS is then asked to open that path. For an afero.MemMapFs, an absolute-looking path like /var/data/sales.pulse is just a key in the in-memory map — no need to mirror the OS layout.

What custom filesystems do NOT do

Pulse never falls back to os.Open if the custom FS fails. The custom FS is the only filesystem; if it errors, that error propagates verbatim.
The MCP server (pulse mcp) currently uses afero.NewOsFs() only. Custom filesystems are a library-side capability today.
The Go race detector and go test -race work normally with in-memory filesystems; tests can run with high concurrency without fighting over a real directory.

Streaming & ProcessStream

Audience: Go embedders feeding rows into an HTTP response, an NDJSON pipeline, or any consumer that wants result rows one at a time instead of buffering the full set.

pulse.ProcessStream returns a pull-based iterator. The API is stable regardless of whether the underlying request shape streams inside the engine — non-streamable requests return the same iterator, they just buffer once internally before yielding.

LLM agents using MCP: see skills/request-recipes.md for the MCP-side streaming surface (pulse_process with the streaming option). The Streamable predicate is the same on both surfaces.

The iterator API

type RowIter = service.RowIter

// In service:
type RowIter interface {
    Next(ctx context.Context) (Row, bool, error)
    Close() error
    Metadata() *ResponseMetadata
}

type Row = service.Row // map[string]any

Usage:

iter, err := p.ProcessStream(ctx, req)
if err != nil {
    return err
}
defer iter.Close()

for {
    row, ok, err := iter.Next(ctx)
    if err != nil {
        return err
    }
    if !ok {
        break
    }
    // … emit row …
}

meta := iter.Metadata() // available after drain

Metadata() returns the full ResponseMetadata (total rows, filtered rows, cohort file) once the iterator has been drained.
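
To make the drain loop concrete, here is a minimal sketch that writes a pull-based iterator out as NDJSON. It uses only the standard library. The row and rowIter types are local stand-ins that mirror the shapes shown above, and sliceIter, writeNDJSON, and render are hypothetical helpers, not part of Pulse.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// row and rowIter are local stand-ins mirroring the iterator shape above.
type row = map[string]any

type rowIter interface {
	Next(ctx context.Context) (row, bool, error)
	Close() error
}

// sliceIter is a hypothetical in-memory iterator used only to drive the example.
type sliceIter struct {
	rows []row
	pos  int
}

func (s *sliceIter) Next(ctx context.Context) (row, bool, error) {
	if err := ctx.Err(); err != nil {
		return nil, false, err
	}
	if s.pos >= len(s.rows) {
		return nil, false, nil
	}
	r := s.rows[s.pos]
	s.pos++
	return r, true, nil
}

func (s *sliceIter) Close() error { return nil }

// writeNDJSON drains it and writes one JSON object per line to w.
// It returns the number of rows written.
func writeNDJSON(ctx context.Context, it rowIter, w io.Writer) (int, error) {
	defer it.Close()
	enc := json.NewEncoder(w) // Encode appends a newline after each value
	n := 0
	for {
		r, ok, err := it.Next(ctx)
		if err != nil {
			return n, err
		}
		if !ok {
			return n, nil
		}
		if err := enc.Encode(r); err != nil {
			return n, err
		}
		n++
	}
}

// render drains it into a string; handy for tests.
func render(it rowIter) (string, error) {
	var sb strings.Builder
	_, err := writeNDJSON(context.Background(), it, &sb)
	return sb.String(), err
}

func main() {
	it := &sliceIter{rows: []row{{"region": "east", "total": 10}, {"region": "west", "total": 7}}}
	out, err := render(it)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Because only one row is held at a time, memory use stays flat regardless of how many rows the iterator yields.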

What actually streams

ProcessStream always returns an iterator, but the engine only avoids the buffered intermediate row set for a subset of request shapes. Run pulse api predict (or Predict from the library) and check the Streamable flag in the result:

pred, err := p.Predict(ctx, req)
if err != nil {
    return err
}
if !pred.Streamable {
    for _, reason := range pred.StreamableReasons {
        log.Printf("buffered because: %s", reason)
    }
}

The streaming-eligible request shapes are listed in Performance Notes → Streaming path.

The complement — the request shapes that force the buffered path — is at Performance Notes → Buffered path.

Streamable=false doesn’t mean the iterator is broken; it just means rows materialise inside the engine before Next yields them. The output API is identical either way.

CLI parity

pulse api process --stream writes NDJSON to stdout, one row per line. pulse api compose --stream does the same with an index field per row identifying which sub-request produced it.

Cancellation

Every Next call accepts a context. Cancellation propagates to the underlying reader; rows that are already in flight may still be returned before Next returns (_, false, ctx.Err()). Close() releases any reader resources and is safe to call multiple times.

Backpressure

The iterator is pull-based: the engine produces rows only as fast as the consumer calls Next. For HTTP responders that flush periodically, this means you can stream a multi-GB result set through a constant-memory buffer.

For pipelines that want to fan rows out across goroutines, copy each row into your own struct before processing — Row is map[string]any and the engine may re-use the backing data after Next returns. Treat it as borrowed.
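
The borrowed-row rule can be handled with a shallow copy before handing a row to another goroutine. This is a minimal sketch under the assumption that row values are scalars or strings; cloneRow is a hypothetical helper, and nested reference values would need a deeper copy.

```go
package main

import "fmt"

// cloneRow returns a shallow copy of a row so the caller owns the map.
// Values are copied by assignment only.
func cloneRow(src map[string]any) map[string]any {
	dst := make(map[string]any, len(src))
	for k, v := range src {
		dst[k] = v
	}
	return dst
}

func main() {
	borrowed := map[string]any{"region": "east", "total": 10}
	owned := cloneRow(borrowed)
	borrowed["total"] = 99 // simulates the producer re-using the map
	fmt.Println(owned["total"])
}
```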

Inside the engine

Under the hood, ProcessStream calls one of four orchestrator modes depending on the request shape: single-pass streaming, grouped streaming, two-pass streaming, or the buffered fallback. The choice is made via processing.CanStreamRequest(req, schema), which is the same predicate Predict.Streamable reports — this parity is enforced by TestPredict_Streamable_MatchesRuntime.

If you find a request that predict says is streamable but Next materialises something large, that’s a parity drift and a bug — please report it with the request JSON.

Parallel Compose

Audience: Go embedders running multiple requests concurrently against the same cohort or set of cohorts.

pulse.ComposeParallel fans a ComposedRequest across a bounded worker pool. Workers share the engine’s read-only registries; each Process call constructs fresh stateful operators per request, so concurrent execution is safe.

LLM agents using MCP: the MCP server today exposes pulse_compose as a sequential operation. Parallelism is a library-side capability.

When to use

Goal	Reach for
Single request, single result	`Process`
Single request, pulled as rows	`ProcessStream`
Batch of independent requests, in order, sequential	`Compose`
Batch of independent requests, in parallel, with bounded workers	`ComposeParallel`

Order of results is preserved regardless of completion order — a worker that finishes early is held until its slot’s index is the next to emit. So callers can index responses[i] against req.Requests[i] directly.

ComposeOptions

From service/compose_parallel.go, re-exported as pulse.ComposeOptions:

type ComposeOptions struct {
    // MaxWorkers caps concurrent in-flight Process calls. Zero means
    // runtime.GOMAXPROCS; negatives clamp to 1.
    MaxWorkers int

    // PerRequestTimeout, if positive, derives a context.WithTimeout for
    // each request.
    PerRequestTimeout time.Duration

    // FailFast cancels in-flight siblings on the first request error.
    // Defaults to true. Set false to aggregate all errors instead.
    FailFast bool
}

Field	Default	Notes
`MaxWorkers`	`runtime.GOMAXPROCS(0)`	`0` resolves to GOMAXPROCS; negative values clamp to 1
`PerRequestTimeout`	unlimited	When positive, each worker derives `context.WithTimeout`
`FailFast`	`true`	First error cancels siblings and returns immediately
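
The worker-count rule and the ordered-results guarantee can be illustrated with a small standard-library sketch. It is not Pulse's implementation: instead of holding early finishers until their turn, it has each worker write into its own result slot, which yields the same index alignment. resolveWorkers and runOrdered are hypothetical names.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// resolveWorkers applies the MaxWorkers rule from the table:
// 0 means GOMAXPROCS, negatives clamp to 1.
func resolveWorkers(n int) int {
	switch {
	case n == 0:
		return runtime.GOMAXPROCS(0)
	case n < 0:
		return 1
	default:
		return n
	}
}

// runOrdered runs fn over inputs with a bounded pool and returns results in
// input order, whatever order the workers finish in.
func runOrdered(inputs []int, maxWorkers int, fn func(int) int) []int {
	out := make([]int, len(inputs))
	jobs := make(chan int)
	var wg sync.WaitGroup
	workers := resolveWorkers(maxWorkers)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				out[i] = fn(inputs[i]) // each index is written by exactly one worker
			}
		}()
	}
	for i := range inputs {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	squares := runOrdered([]int{1, 2, 3, 4}, 2, func(v int) int { return v * v })
	fmt.Println(squares)
}
```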

Example

ctx := context.Background()

composed := &pulse.ComposedRequest{
    Requests: []*pulse.Request{req1, req2, req3, req4},
}

resps, err := p.ComposeParallel(ctx, composed, pulse.ComposeOptions{
    MaxWorkers:        4,
    PerRequestTimeout: 30 * time.Second,
    FailFast:          true,
})
if err != nil {
    return err
}

for i, resp := range resps {
    fmt.Printf("request %d: %d rows\n", i, len(resp.Data))
}

FailFast semantics

With FailFast = true (the default):

The first request to return an error cancels the shared context.
In-flight siblings observe cancellation via ctx.Err() and return early.
ComposeParallel returns (nil, theFirstError).

With FailFast = false:

Every request runs to completion (or its own per-request timeout).
Errors are aggregated into a single SERVICE_INTERNAL error whose details map carries failed_indices (a list of slot indices that errored).
Successful slots populate the returned response array; failed slots are nil at their index.
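
The two modes above can be sketched with a shared cancellable context. This is an illustration of the semantics under stated assumptions, not the code in service/compose_parallel.go: task, runAll, and the helper tasks are hypothetical, and the aggregated mode returns a plain index list rather than a SERVICE_INTERNAL error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// task is a hypothetical unit of work that honours context cancellation.
type task func(ctx context.Context) error

var errBoom = errors.New("boom")

func succeed(ctx context.Context) error { return nil }
func fail(ctx context.Context) error    { return errBoom }

// waitForCancel blocks until the shared context is cancelled.
func waitForCancel(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// runAll runs every task concurrently. With failFast, the first error cancels
// the shared context and is returned. Without it, every task runs to
// completion and the indices of the failed tasks are returned.
func runAll(tasks []task, failFast bool) (first error, failed []int) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	errs := make([]error, len(tasks)) // one slot per task, so no locking is needed
	var once sync.Once
	var wg sync.WaitGroup
	for i, t := range tasks {
		wg.Add(1)
		go func(i int, t task) {
			defer wg.Done()
			err := t(ctx)
			if err == nil {
				return
			}
			errs[i] = err
			if failFast {
				once.Do(func() {
					first = err
					cancel() // siblings observe ctx.Err() and return early
				})
			}
		}(i, t)
	}
	wg.Wait()
	if failFast {
		return first, nil
	}
	for i, err := range errs {
		if err != nil {
			failed = append(failed, i)
		}
	}
	return nil, failed
}

func main() {
	first, _ := runAll([]task{succeed, fail, waitForCancel}, true)
	fmt.Println("fail-fast:", first)

	_, failed := runAll([]task{succeed, fail, succeed, fail}, false)
	fmt.Println("aggregated failed indices:", failed)
}
```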

CLI parity

pulse api compose --request batch.json --parallel 4
pulse api compose --request batch.json --parallel 4 --no-fail-fast

--parallel N:

1 (default) → sequential Compose.
0 → runtime.GOMAXPROCS.
> 1 → exactly that many workers.

--no-fail-fast mirrors FailFast = false.

Performance considerations

Each worker performs its own filesystem reads. If your cohort lives on slow remote storage, parallelism amortises latency well; on local SSD the gain is smaller and CPU-bound.
Streaming aggregations are CPU-friendly — ComposeParallel over a pool of streaming requests scales near-linearly to the worker count.
Buffered request shapes (window operators, median, …) hold memory per request. Watch MaxWorkers × per_request_peak_memory.
The internal registries are read-only and shared across workers with no locking; only the per-request operator instances are fresh allocations.

Safety

Pulse is safe for concurrent use after New.
Per-request operator state (running sums, dictionaries, sorted buffers) is allocated fresh inside each Process call.
The afero.Fs you supply must itself be safe for concurrent reads — every shipped backend (OsFs, MemMapFs) is.

.pulse File Format

Audience: anyone reading or writing .pulse files by hand (forensics, custom readers, debugging a truncated file). The Go library handles all of this for you; this page documents the wire format.

The header is fixed-size: 9 bytes, consisting of an 8-byte magic identifier and a 1-byte format version.

LLM agents using MCP: see the cohort-schema-design skill via pulse_skills_get. It speaks in field-type semantics rather than byte layout; this page covers the bytes.

Constants

These live in encoding/header.go:

Name	Value	Purpose
`MagicBytes`	`[]byte{'P','U','L','S','E', 0x00, 0x00, 0x00}`	8-byte identifier; rejects non-Pulse files
`FormatVersion`	`0x01` (today)	Current `.pulse` wire format
`HeaderSize`	`9`	Total header byte count

Byte layout

Offset  Length  Field
------  ------  -----
0       8       Magic: "PULSE\0\0\0"
8       1       Format version (currently 0x01)
9       —       Schema block begins here

That’s the entire fixed header. The schema block immediately follows; see Schema Block.

Version semantics

The format version is single-byte. The reader at encoding.ReadHeader rejects unknown versions with the ENCODING_INVALID error code:

ENCODING_INVALID: unsupported pulse format version
{"version": <byte>}

This is the fail-loud guard against silently mis-decoding a file written by a future binary that introduced a new field type or layout change. A forward-incompatible change bumps the version; the older reader stops at header parse instead of producing wrong rows.

The current value is 0x01. The envelope format_version ("1.0") that all CLI --json output carries is unrelated — it tracks the JSON output schema, not the binary file format.
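
For readers writing their own decoder, the header check can be sketched in a few lines. This is a minimal illustration based on the layout above, not the code in encoding/header.go; checkHeader and its error messages are hypothetical.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

const headerSize = 9

var magic = []byte{'P', 'U', 'L', 'S', 'E', 0x00, 0x00, 0x00}

// checkHeader validates the 9-byte fixed header and returns the format version.
// The error values are illustrative; the real reader reports ENCODING_INVALID.
func checkHeader(b []byte) (byte, error) {
	if len(b) < headerSize {
		return 0, errors.New("truncated header")
	}
	if !bytes.Equal(b[:8], magic) {
		return 0, errors.New("bad magic: not a .pulse file")
	}
	version := b[8]
	if version != 0x01 {
		return 0, fmt.Errorf("unsupported pulse format version: %d", version)
	}
	return version, nil
}

func main() {
	file := append(append([]byte{}, magic...), 0x01)
	v, err := checkHeader(file)
	fmt.Println(v, err)
}
```

Checking the version before touching any schema bytes is what keeps an older reader from mis-decoding a newer file.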

Hexdump sanity check

A freshly-written .pulse file starts with:

00000000  50 55 4c 53 45 00 00 00  01  ..  ..  ..  ..  ..
          |P  U  L  S  E  \0 \0 \0|ver| schema starts here

If running file path/to/data.pulse reports “data” (rather than something plausible) and the first nine bytes don’t match the above, the file is either truncated or corrupted; see Troubleshooting.

What comes next

The schema block follows the header. Read it as documented in Schema Block; it carries per-field descriptors, inline categorical dictionaries, and decimal/H3 metadata. After the schema, fixed-width records start — see Record Layout.

Field Types

Audience: anyone designing a cohort schema, decoding a .pulse file by hand, or trying to understand which type to pick for a column.

Pulse supports 17 field types, each with a fixed type byte, a fixed (or bit-packed) byte size, and well-defined semantics. The full list, mirrored from CLAUDE.md → All 17 field types:

LLM agents using MCP: see the cohort-schema-design skill via pulse_skills_get — it covers nullability, bit-packing trade-offs, and “which type to pick” with MCP-side examples.

The catalog

Type	Byte value	ByteSize	Notes
`u8`	0	1	Unsigned 8-bit integer
`u16`	1	2	Unsigned 16-bit integer
`u32`	2	4	Unsigned 32-bit integer
`u64`	3	8	Unsigned 64-bit integer
`f32`	4	4	32-bit IEEE 754 float
`f64`	5	8	64-bit IEEE 754 float
`nullable_bool`	6	0	Bit-packed tri-state (null/true/false)
`nullable_u4`	7	0	Bit-packed, 4-bit nullable unsigned
`nullable_u8`	8	1	Nullable 8-bit unsigned
`nullable_u16`	9	2	Nullable 16-bit unsigned
`date`	10	4	Date as 32-bit value
`packed_bool`	11	0	Bit-packed boolean
`categorical_u8`	12	1	Categorical with up to 256 dictionary entries
`categorical_u16`	13	2	Categorical with up to 65,536 entries
`categorical_u32`	14	4	Categorical with up to 4,294,967,295 entries
`decimal128`	15	16	Fixed-point exact decimal; per-field `(precision, scale)` ≤ (38, 38)
`nullable_decimal128`	16	16	`decimal128` plus an `INT128_MIN` null sentinel

The Go source-of-truth for this table is encoding/field_type.go; the FieldType enum’s iota order is the byte-value order above.

Type families

Plain integers and floats

u8, u16, u32, u64, f32, f64. Standard little-endian encoding, full range, no null sentinel. Use these when you know the column never carries a missing value.

Nullable integers

nullable_u8, nullable_u16, nullable_u4, nullable_bool. Each reserves one in-band value (or one in-band bit pattern) to mean “null”. For the byte-sized variants the encoding is straightforward; for the sub-byte variants (nullable_u4, nullable_bool, packed_bool) Pulse packs multiple fields into shared bytes — see Record Layout → Bit-packing.

ByteSize() returns 0 for the bit-packed types because they don’t allocate whole bytes of their own; the schema reader uses BitPosition to locate them within shared bytes.

Date

date is a 32-bit count of days since the Unix epoch. The range is ~5.8 million years on either side of 1970 — effectively unbounded for real data.

Categoricals

categorical_u8, categorical_u16, categorical_u32. Each stores its string-to-ID mapping inline as a dictionary block immediately after the field’s schema entry. Pick the smallest variant that fits your cardinality (Pulse’s import path auto-selects during inference).

Dictionary mechanics are documented in Dictionary Blocks.

Decimal128

decimal128 and nullable_decimal128 are 16-byte fixed-point decimal numbers. Each field carries a per-field (precision, scale) pair written into the schema after the description; precision and scale both top out at 38. The related error codes are PULSE_DECIMAL_OVERFLOW and PULSE_DECIMAL_PRECISION_LOSS.

Use these for currency and any other column where IEEE-754 rounding is not acceptable. See the financial-cohorts skill for full semantics including banker’s rounding and divide-by-zero policy.

Unknown type bytes

The schema reader rejects unknown FieldType bytes at parse time with ENCODING_INVALID. This is the same fail-loud strategy as the header version check: a file written by a future binary that introduced a new type fails immediately at schema parse, not later during row decode where the corruption could go unnoticed.

What you can do with each type

Concern	Source
Which aggregators are meaningful on which types	`skills/aggregation-guide.md` (LLM) / api process (CLI)
Decimal arithmetic semantics	`skills/financial-cohorts.md` (LLM)
Categorical dictionary limits	Dictionary Blocks

Schema Block

Audience: anyone decoding a .pulse file by hand or writing a non-Go reader. The schema block follows the 9-byte header and carries one descriptor per column.

This page draws on the byte-layout invariants for .pulse files in CLAUDE.md and on the on-disk format documented in encoding/schema.go.

Top-level shape

u16 field_count
field_record × field_count

Each field_record is variable-width (it includes UTF-8 name and description strings, and may include a categorical dictionary or decimal/H3 metadata). The reader walks them sequentially.

Per-field record

In write order — see WriteSchema / ReadSchema in encoding/schema.go:

#	Field	Size	Encoding
1	type	1 byte	`FieldType` byte (see Field Types)
2	name_length	2 bytes	u16 little-endian
3	name	name_length bytes	UTF-8
4	byte_offset	4 bytes	u32 LE — offset within a record
5	bit_position	1 byte	u8 — bit position within `byte_offset` (bit-packed types only)
6	csv_column_idx	2 bytes	u16 LE — source column index at import time
7	description	2 bytes length + UTF-8	Capped at 1000 bytes (`PULSE_IMPORT_DESCRIPTION_TOO_LONG`)
8	(decimal only) precision	1 byte	`decimal128` and `nullable_decimal128` only
9	(decimal only) scale	1 byte	same
10	(categorical only) dictionary	variable	See Dictionary Blocks

Order matters: every reader walks these in the listed order, so a malformed record stops the parse with ENCODING_INVALID.

Byte offsets and bit positions

byte_offset is the offset of this field’s first byte within a record. For bit-packed types (packed_bool, nullable_bool, nullable_u4), byte_offset plus bit_position together locate the field’s bits within a byte that may be shared with adjacent fields.

For non-packed types, bit_position is always 0.

Record layout mechanics — including the bit-packing rule, record-size computation, and how the encoder packs adjacent sub-byte fields — are in Record Layout.

Conditional trailers

Two trailers attach only to specific field types:

decimal128 / nullable_decimal128 get a (precision, scale) pair (u8, u8). Both ≤ 38.
Categorical types (categorical_u8, categorical_u16, categorical_u32) get a full dictionary block in line — see Dictionary Blocks.

A field with none of the above writes nothing after the description.
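
A non-Go reader would walk one field record roughly as follows. The sketch is written in Go for consistency with the rest of this site and follows the table above. It is not the code in encoding/schema.go: fieldRecord, readField, and parseField are hypothetical names, and dictionary parsing is left out (the sketch only flags that a dictionary follows).

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// fieldRecord holds one schema entry, minus any dictionary block.
type fieldRecord struct {
	Type         byte
	Name         string
	ByteOffset   uint32
	BitPosition  uint8
	CSVColumnIdx uint16
	Description  string
	Precision    uint8 // decimal types only
	Scale        uint8 // decimal types only
	HasDict      bool  // true when a dictionary block follows (not parsed here)
}

// readString reads a u16 little-endian length, then that many UTF-8 bytes.
func readString(r io.Reader) (string, error) {
	var n uint16
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return "", err
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

// readField walks one field record in the documented write order.
func readField(r io.Reader) (fieldRecord, error) {
	var f fieldRecord
	var err error
	if err = binary.Read(r, binary.LittleEndian, &f.Type); err != nil {
		return f, err
	}
	if f.Type > 16 { // the catalog defines type bytes 0 through 16
		return f, fmt.Errorf("unknown field type byte %d", f.Type)
	}
	if f.Name, err = readString(r); err != nil {
		return f, err
	}
	// byte_offset (u32), bit_position (u8), csv_column_idx (u16): 7 packed bytes.
	var fixed struct {
		ByteOffset   uint32
		BitPosition  uint8
		CSVColumnIdx uint16
	}
	if err = binary.Read(r, binary.LittleEndian, &fixed); err != nil {
		return f, err
	}
	f.ByteOffset, f.BitPosition, f.CSVColumnIdx = fixed.ByteOffset, fixed.BitPosition, fixed.CSVColumnIdx
	if f.Description, err = readString(r); err != nil {
		return f, err
	}
	switch f.Type {
	case 15, 16: // decimal128, nullable_decimal128: (precision, scale) trailer
		var ps [2]uint8
		if err = binary.Read(r, binary.LittleEndian, &ps); err != nil {
			return f, err
		}
		f.Precision, f.Scale = ps[0], ps[1]
	case 12, 13, 14: // categorical_u8, categorical_u16, categorical_u32
		f.HasDict = true
	}
	return f, nil
}

// parseField is a convenience wrapper over a byte slice.
func parseField(b []byte) (fieldRecord, error) {
	return readField(bytes.NewReader(b))
}

func main() {
	raw := []byte{
		15,                  // type: decimal128
		3, 0, 'a', 'm', 't', // name_length + name
		4, 0, 0, 0, // byte_offset
		0,    // bit_position
		2, 0, // csv_column_idx
		0, 0, // empty description
		18, 2, // precision, scale
	}
	f, err := parseField(raw)
	fmt.Printf("%+v %v\n", f, err)
}
```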

Field descriptions

The description string is UTF-8 with a 2-byte length prefix. The import path rejects descriptions longer than 1000 bytes (PULSE_IMPORT_DESCRIPTION_TOO_LONG) and warns on low-quality descriptions (empty, under 10 characters, or generic words like "n/a", "tbd", "unknown", "field", "data", "value", "column") — that warning is PULSE_FIELD_DESCRIPTION_LOW_QUALITY, upgraded to an error under --strict.

When the description is empty, pulse cohort inspect synthesises a fallback string (“Categorical field: ” or “Numeric field: ”) with description_source = "synthesized". The original bytes on disk remain empty.

Reader behaviour

encoding.ReadSchema is intentionally strict:

Field count limit comes from the u16 prefix (max 65,535 fields).
Unknown type bytes fail loud (ENCODING_INVALID).
Truncated records fail loud at the first short read.
The reader produces a *encoding.Schema with one encoding.Field per record; Schema.Field(name) looks fields up by name.

After the schema block, record data starts at the file’s first byte past the schema. The record layout is documented in Record Layout.

Dictionary Blocks

Audience: anyone decoding categorical fields, sizing a categorical type during import, or chasing a dictionary-overflow error.

Categorical fields (categorical_u8, categorical_u16, categorical_u32) store their string-to-ID mapping inline, immediately after the field’s schema entry. The dictionary is part of the schema block, not the record data.

LLM agents using MCP: the cohort-schema-design skill covers when to pick which categorical width; the import-best-practices skill covers fail-closed semantics on overflow.

On-disk layout

From encoding/dictionary.go:

u32 count
(u16 strlen + utf8 bytes) × count

Sizes are little-endian. Each entry’s ID is its insertion index (0..count-1); ID lookups during decode use the ID found in the record byte(s) and resolve to the string at that index.
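
The layout above can be decoded with a short loop. This is a minimal sketch for readers implementing their own decoder, not the amortised reader in encoding/dictionary.go; readDictionary is a hypothetical name.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// readDictionary decodes a dictionary block from the start of b. It returns
// the entries (slice index == ID) and the number of bytes consumed.
func readDictionary(b []byte) ([]string, int, error) {
	if len(b) < 4 {
		return nil, 0, errors.New("truncated dictionary count")
	}
	count := binary.LittleEndian.Uint32(b)
	pos := 4
	var entries []string // grown as entries arrive; the count is untrusted input
	for i := uint32(0); i < count; i++ {
		if len(b)-pos < 2 {
			return nil, 0, errors.New("truncated entry length")
		}
		n := int(binary.LittleEndian.Uint16(b[pos:]))
		pos += 2
		if len(b)-pos < n {
			return nil, 0, errors.New("truncated entry bytes")
		}
		entries = append(entries, string(b[pos:pos+n]))
		pos += n
	}
	return entries, pos, nil
}

func main() {
	block := []byte{
		2, 0, 0, 0, // count = 2
		4, 0, 'e', 'a', 's', 't', // ID 0
		4, 0, 'w', 'e', 's', 't', // ID 1
	}
	entries, used, err := readDictionary(block)
	fmt.Println(entries, used, err)
}
```

The consumed-byte count tells the caller where the next field record begins.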

Sizing the type

Type	Max entries	Bytes per record value
`categorical_u8`	256	1
`categorical_u16`	65,536	2
`categorical_u32`	4,294,967,295	4

The import path samples the source (--sample-rows, default 500) to estimate cardinality and picks the smallest width that fits. You can also force a width by editing the schema template (pulse import schema-template SOURCE).
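
The smallest-width rule reduces to a comparison against the caps in the table. The sketch below assumes an exact distinct count is already known, whereas the import path works from a sample; smallestCategorical is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
)

// smallestCategorical returns the narrowest categorical type whose documented
// cap covers the given number of distinct values.
func smallestCategorical(distinct uint64) (string, error) {
	switch {
	case distinct <= 256:
		return "categorical_u8", nil
	case distinct <= 65536:
		return "categorical_u16", nil
	case distinct <= 4294967295:
		return "categorical_u32", nil
	default:
		return "", errors.New("cardinality exceeds every categorical width")
	}
}

func main() {
	for _, n := range []uint64{12, 257, 70000} {
		t, err := smallestCategorical(n)
		fmt.Println(n, t, err)
	}
}
```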

Overflow and unbounded errors

AddWithLimit enforces the per-type cap and returns PULSE_IMPORT_CATEGORICAL_OVERFLOW when the source has more distinct values than the dictionary can hold:

{
  "code": "PULSE_IMPORT_CATEGORICAL_OVERFLOW",
  "message": "categorical dictionary overflow: max 256 entries",
  "details": {"max_entries": 256, "value": "the_257th_distinct_string"}
}

The companion code PULSE_IMPORT_CATEGORICAL_UNBOUNDED fires when the import path detects an effectively unbounded categorical column (the schema declared categorical_u32 and the column still grew past the caller-provided guardrails). Both errors halt the import — fail-closed, no partial output.

Recovery options, in order of preference:

Re-import with a wider categorical type (categorical_u8 → categorical_u16 → categorical_u32).
Drop the categorical encoding (treat the column as a plain string field — but Pulse has no native variable-string type; you’d add a pre-import transform to bucket values).
Pre-filter the source to a smaller distinct set and re-import.

Inspect behaviour

pulse cohort inspect --json reports each categorical field’s dictionary entry count and sample values. By default the inline list is capped at 100 entries (DefaultDictionaryLimit); pass --full-dict to print the full dictionary:

pulse cohort inspect data.pulse --full-dict --json

Both forms include a truncated: true|false flag and a total_entries count for programmatic consumers.

Performance notes

Dictionary reads are amortised: the reader allocates one shared byte buffer for all string payloads, then does one string(...) copy per entry. This avoids the “one allocation per entry” overhead that naively reading length-prefixed strings would produce. The dictionary itself is held in memory for the life of the cohort’s schema parse.

For very large dictionaries, the categorical_u32 path is still O(N) to deserialise; if you find yourself near the 32-bit cap, you almost certainly want a different model (a separate lookup table, or a plain integer column with the strings stored externally).

Record Layout

Audience: anyone hand-decoding row data or implementing a non-Go reader. The schema block ends; record data starts immediately after.

Records are fixed-width. Every row in a cohort occupies the same number of bytes, computed from the schema’s field types. Variable-width data (strings) lives in the schema (as categorical dictionaries) or is not directly supported.

LLM agents using MCP: the record byte layout is an implementation detail the MCP surface hides — there is no LLM-facing skill for it. The MCP tools operate on the inspect / process / sample abstractions.

Computing record size

Record size is the sum of FieldType.ByteSize() over all schema fields, plus the shared bytes that hold bit-packed sub-byte fields. For non-packed types, ByteSize() returns the obvious value (u32 = 4, f64 = 8, decimal128 = 16); for packed types (packed_bool, nullable_bool, nullable_u4), ByteSize() returns 0 and the field shares a byte with adjacent packed fields.

The writer (encoding/record.go) lays out fields in the order they appear in the schema; the reader walks the same order with the per-field ByteOffset and BitPosition recorded in the schema.

Encoding per type

From WriteFieldValue / ReadFieldValue in encoding/record.go:

Type family	Encoding
`u8` / `nullable_u8` / `categorical_u8`	1 byte, unsigned
`u16` / `nullable_u16` / `categorical_u16`	2 bytes, little-endian unsigned
`u32` / `date` / `categorical_u32`	4 bytes, little-endian unsigned
`u64`	8 bytes, little-endian unsigned
`f32`	4 bytes, little-endian IEEE 754
`f64`	8 bytes, little-endian IEEE 754
`decimal128` / `nullable_decimal128`	16 bytes, little-endian two’s-complement integer (scaled by `10^scale`); null sentinel is `INT128_MIN` for the nullable variant
`packed_bool` / `nullable_bool` / `nullable_u4`	Bit-packed — see below

Bit-packing

Sub-byte types share whole bytes with their packed neighbours. The schema records both ByteOffset (the shared byte’s offset) and BitPosition (which bit slot within that byte).

packed_bool — 1 bit (true/false).
nullable_bool — 2 bits (one null bit, one value bit) for the tri-state encoding.
nullable_u4 — 5 bits (one null bit, four value bits) for the nullable 4-bit unsigned encoding.

The writer aligns these into shared bytes from low bit to high bit; adjacent packed fields stack into the same byte until the byte is full, after which a new byte begins. ByteSize() == 0 is the schema reader’s signal that a field type shares bytes — non-zero ByteSize fields never share.
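
Reading a packed slot comes down to a shift and a mask. The sketch below assumes bit 0 is the least significant bit and that a slot never straddles two bytes; both are readings of the description above, not confirmed details of encoding/record.go. It returns the raw slot bits without interpreting which bit marks null. extractBits is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
)

// extractBits returns width bits of rec[byteOffset], starting at bitPos.
// Bit 0 is the least significant bit (an assumption of this sketch).
func extractBits(rec []byte, byteOffset int, bitPos, width uint) (byte, error) {
	if byteOffset < 0 || byteOffset >= len(rec) {
		return 0, errors.New("byte offset out of range")
	}
	if width == 0 || bitPos+width > 8 {
		return 0, errors.New("slot does not fit in one byte")
	}
	mask := byte((uint(1) << width) - 1)
	return (rec[byteOffset] >> bitPos) & mask, nil
}

func main() {
	rec := []byte{0x16} // 0b0001_0110

	b, _ := extractBits(rec, 0, 0, 1)  // a 1-bit packed_bool slot
	nb, _ := extractBits(rec, 0, 1, 2) // a 2-bit nullable_bool slot
	u4, _ := extractBits(rec, 0, 3, 5) // a 5-bit nullable_u4 slot
	fmt.Println(b, nb, u4)
}
```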

Null sentinels

Type	Null encoding
`nullable_u8`	`0xFF`
`nullable_u16`	`0xFFFF`
`nullable_u4`	Dedicated bit pattern within the packed byte
`nullable_bool`	Dedicated bit within the packed byte
`nullable_decimal128`	`INT128_MIN` (`0x8000…0000`)

u8, u16, u32, u64, f32, f64, date, packed_bool, decimal128, and all categoricals are non-nullable: the import path either coerces or rejects rows with missing values (PULSE_IMPORT_ROW_ERROR). Pick the nullable_* variant when you need to preserve the difference between “zero” and “missing”.
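
The decimal encoding and its null sentinel can be decoded with math/big. This is a minimal sketch based on the tables above, not the library's decoder; decodeDecimal128 and formatScaled are hypothetical names, and the output formatting is this sketch's own choice.

```go
package main

import (
	"fmt"
	"math/big"
)

// decodeDecimal128 interprets 16 little-endian bytes as a two's-complement
// integer scaled by 10^scale. For the nullable variant, INT128_MIN decodes
// as null.
func decodeDecimal128(b [16]byte, scale uint8, nullable bool) (text string, isNull bool) {
	// big.Int.SetBytes expects big-endian input, so reverse the bytes first.
	var be [16]byte
	for i := range b {
		be[15-i] = b[i]
	}
	v := new(big.Int).SetBytes(be[:])
	if be[0]&0x80 != 0 { // sign bit set: subtract 2^128 to get the negative value
		v.Sub(v, new(big.Int).Lsh(big.NewInt(1), 128))
	}
	int128Min := new(big.Int).Neg(new(big.Int).Lsh(big.NewInt(1), 127))
	if nullable && v.Cmp(int128Min) == 0 {
		return "", true
	}
	return formatScaled(v, int(scale)), false
}

// formatScaled renders v / 10^scale as a plain decimal string.
func formatScaled(v *big.Int, scale int) string {
	digits := new(big.Int).Abs(v).String()
	if scale > 0 {
		for len(digits) <= scale {
			digits = "0" + digits // left-pad so there is a digit before the point
		}
		cut := len(digits) - scale
		digits = digits[:cut] + "." + digits[cut:]
	}
	if v.Sign() < 0 {
		return "-" + digits
	}
	return digits
}

func main() {
	var raw [16]byte
	raw[0], raw[1] = 0x39, 0x30 // 12345 in little-endian
	text, isNull := decodeDecimal128(raw, 2, false)
	fmt.Println(text, isNull)
}
```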

Reading a record

The Go decoder lives at encoding.Reader / encoding.ReadRecord(*Schema, []byte). A non-Go reader can follow the same recipe:

Compute record size from the schema.
Read record_size bytes.
For each schema field in declaration order:
- If ByteSize() > 0, decode the value at the field’s ByteOffset.
- If ByteSize() == 0, decode the bit slot at (ByteOffset, BitPosition) using the type’s bit-pattern rules.
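
For the whole-byte branch of the recipe, decoding is a matter of reading little-endian values at the recorded offsets. The sketch below covers three representative types with encoding/binary. The helper names and the 14-byte record layout are hypothetical; a real reader would take offsets from the schema.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// readU32 decodes a little-endian u32 at off.
func readU32(rec []byte, off int) uint32 {
	return binary.LittleEndian.Uint32(rec[off:])
}

// readF64 decodes a little-endian IEEE 754 double at off.
func readF64(rec []byte, off int) float64 {
	return math.Float64frombits(binary.LittleEndian.Uint64(rec[off:]))
}

// readNullableU16 decodes a nullable_u16; ok is false for the 0xFFFF sentinel.
func readNullableU16(rec []byte, off int) (v uint16, ok bool) {
	v = binary.LittleEndian.Uint16(rec[off:])
	return v, v != 0xFFFF
}

func main() {
	// Hypothetical 14-byte record: u32 at 0, f64 at 4, nullable_u16 at 12.
	rec := make([]byte, 14)
	binary.LittleEndian.PutUint32(rec[0:], 42)
	binary.LittleEndian.PutUint64(rec[4:], math.Float64bits(1.5))
	binary.LittleEndian.PutUint16(rec[12:], 0xFFFF)

	n, ok := readNullableU16(rec, 12)
	fmt.Println(readU32(rec, 0), readF64(rec, 4), n, ok)
}
```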

Forward compatibility

Records carry no type tag — they’re a packed binary blob whose interpretation comes entirely from the schema block. That’s why the file’s format version (in the header) and unknown field-type bytes (in the schema block) both fail loud at parse time: the records themselves cannot self-correct, so the format gates everything before record data is observed.

MCP Integration

Audience: operators wiring Pulse into an MCP-aware AI client (Claude Desktop, Claude Code, Cursor, Zed, custom hosts), and embedders who want to expose Pulse to an LLM agent.

This page is the human-facing guide: what the server does, how to wire it up, what the LLM sees, and how to debug a misbehaving session. Agent-facing guidance ships inside the binary as the mcp-integration skill — fetch it via pulse_skills_get (or pulse skills show mcp-integration).

What `pulse mcp` is

pulse mcp runs the Pulse library as a Model Context Protocol (MCP) server. The host (Claude Desktop, Claude Code, etc.) launches it as a subprocess, speaks JSON-RPC over its stdio streams, and shuts it down on session close. The LLM sees Pulse as a set of tools (callable functions), resources (browseable URIs), and prompts (canned slash commands).

┌─────────────┐  stdio JSON-RPC  ┌────────────┐  Go calls  ┌─────────────┐
│  AI client  │ ───────────────→ │ pulse mcp  │ ─────────→ │ pulse.Pulse │
│   (host)    │ ←─────────────── │ (this bin) │ ←───────── │  (library)  │
└─────────────┘                  └────────────┘            └─────────────┘
                                       │
                                       └── stderr ─→ host log pane

The server is a thin translator. Every tool wraps a public method on pulse.Pulse; the same code path powers the CLI.

Quickstart

# 1. Build and place on PATH
make build && cp ./bin/pulse /usr/local/bin/

# 2. Pick a data directory
mkdir -p /var/data/pulse

# 3. Wire into your host (see below) and restart it

# 4. From the LLM session, call:
#    pulse_manifest      → cache once
#    pulse_ask           → run analyses

Wiring into a host

Claude Desktop

Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "pulse": {
      "command": "/usr/local/bin/pulse",
      "args": ["mcp"],
      "env": {
        "PULSE_DATA_DIR": "/var/data/pulse"
      }
    }
  }
}

Restart Claude Desktop. Pulse tools appear in the tool picker.

Claude Code

claude mcp add pulse --env PULSE_DATA_DIR=/var/data/pulse -- pulse mcp

Or by hand in ~/.claude.json (or per-project .claude.json):

{
  "mcpServers": {
    "pulse": {
      "command": "/usr/local/bin/pulse",
      "args":    ["mcp"],
      "env":     { "PULSE_DATA_DIR": "/var/data/pulse" }
    }
  }
}

Cursor / Zed / generic stdio hosts

Any host that speaks the MCP stdio transport can launch pulse mcp the same way — provide the binary path, the mcp argument, and the PULSE_DATA_DIR env var.

What the LLM sees

Tool surface

Sixteen tools, registered at server start. Names and order match internal/mcp/mcptools/meta.go.

Tool	Purpose
`pulse_manifest`	Call first. Self-description: commands, operators (with accepted types + streamability), tier-1/tier-2 tests, regressions, synth distributions, error code list, MCP tool list, cohort field types with operator cross-references. Cache once per session.
`pulse_ask`	Preferred entry point. One-shot: optional auto-import → inspect → predict → execute. Accepts `source` (raw file path) + `query` (natural language, beta) or a structured `request`.
`pulse_inspect`	Read `.pulse` header + schema (no record bytes). Side effect: registers session-scoped schema-bound tool variants (see below).
`pulse_predict`	Validate a request against the schema without executing. Returns errors, warnings, applied defaults, streamability reasons.
`pulse_process`	Execute one pre-built request.
`pulse_compose`	Execute a batch of requests against the same cohort in one round trip.
`pulse_sample`	Return up to N rows for preview / diagnostics.
`pulse_facet`	Distinct values for a single field.
`pulse_import`	Convert a tabular source (csv, tsv, ndjson, jsonarray, parquet, arrow, excel) into a managed `.pulse` handle under `imports/`, with TTL-tracked sidecar. Pulse-format inputs pass through.
`pulse_drop`	Delete a managed-import handle and its sidecar.
`pulse_imports_list`	Enumerate managed handles with sidecar metadata (source, format, imported_at, expires_at, ttl, expired flag, pinned flag).
`pulse_examples_search`	Search the embedded request-example library by query, taxonomy tags (ANDed), or category.
`pulse_examples_get`	Fetch one runnable example body by name.
`pulse_errors_lookup`	Per-code Message + Fixup detail (kept out of the manifest for context economy).
`pulse_skills_list`	Embedded skill metadata.
`pulse_skills_get`	Fetch one skill body by name.

Natural-language query is beta. Heuristic parsing only — silent misinterpretation is possible. The LLM should always check the query_resolution and resolved request in the response before trusting results. For production, author a structured request against the cached manifest and skip the query field.

Resources

URI scheme	Yields
`pulse://<path>`	One resource per `.pulse` file under the data directory. Read returns `descriptor.InspectResult` JSON (header + schema only — no record bytes).
`pulse-skill://<name>`	One per embedded skill. Read returns the markdown body.

Resources are registered once at server start. Files added afterwards do not appear until the server restarts. Listing is cheap because the server only reads header bytes.

Prompts

Name	Args	Returns
`pulse-bootstrap`	none	A short instructions block telling the assistant what to call (and in what order) before authoring any request, and where the authoritative references live. Inject at session start.
`pulse-author-request`	`question`	A guided tool-call sequence for translating a natural-language analytical question into a Pulse request: manifest → examples search → ask.

Hosts that surface prompts as slash commands let users trigger these directly.

Recommended session flow

The two-call default for nearly every user request:

pulse_manifest once at session start. No arguments. Cache the payload — it is deterministic for a binary version and carries every fact needed to author a valid request.
pulse_ask for everything else. It collapses import + inspect + predict + execute into one round trip. When the user hands the LLM a raw file:
```
{
  "request": "{\"source\":\"data.csv\",\"query\":\"average revenue by month\"}"
}
```
When the cohort already exists as a managed handle or .pulse file:
```
{
  "request": "{\"cohort\":{\"filename\":\"sales.pulse\"},\"query\":\"top 5 regions by revenue\"}"
}
```
On predict-invalid with on_invalid="suggest", the response carries structured Fixup entries derived from each error code’s metadata so the LLM can repair the request without another round trip.
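
In both payloads above the request value is itself a JSON document encoded as a string, so a client has to marshal twice. A minimal Go sketch of that double encoding, assuming the payload shape shown above (the helper name askArgs is hypothetical):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// askArgs builds the arguments object for pulse_ask in the shape shown above:
// a single "request" key whose value is a JSON document encoded as a string.
func askArgs(inner map[string]any) (string, error) {
	body, err := json.Marshal(inner)
	if err != nil {
		return "", err
	}
	outer, err := json.Marshal(map[string]string{"request": string(body)})
	if err != nil {
		return "", err
	}
	return string(outer), nil
}

func main() {
	args, err := askArgs(map[string]any{
		"source": "data.csv",
		"query":  "average revenue by month",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(args)
}
```

Letting encoding/json do the escaping avoids hand-written backslashes, which are easy to get wrong when the query text contains quotes.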

Reach for the multi-step path (pulse_inspect → pulse_predict → pulse_process) only when:

diagnosing a failed predict and you want the full envelope,
previewing rows (pulse_sample) or value distributions (pulse_facet),
pre-staging a managed handle with a specific name / TTL / pinning (pulse_import),
batching multiple requests in one call (pulse_compose).

Managed imports + TTL

pulse_import lets the LLM hand the server any tabular file and address it from then on as if it were a .pulse.

Convertible formats (csv, tsv, ndjson, jsonarray, parquet, arrow, excel) are imported into $PULSE_DATA_DIR/imports/<handle>.pulse with a sidecar <handle>.pulse.meta.json carrying imported_at, expires_at, ttl_seconds, source path, source format, and row count. result.managed=true.
Pulse passthroughs (.pulse extension) under PULSE_DATA_DIR are not copied — the server returns the relative path verbatim with managed=false. A .pulse outside PULSE_DATA_DIR is copied into the managed pool.

Source path resolution. Relative source paths resolve against PULSE_DATA_DIR. Absolute paths read from the host filesystem through a separate “source fs.”

Import jail. Absolute source paths are confined to a single directory tree (the jail root). Default: the working directory the MCP server was launched from. Paths that escape the jail (including ..) return PULSE_IMPORT_SOURCE_FORBIDDEN. Override via pulse.Options.ImportSourceJailRoot when embedding.
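
The confinement rule can be illustrated with a small path check. This is a sketch of the general technique, not Pulse's implementation: the function name insideJail is hypothetical, the paths are Unix-style, and symlink resolution (which a real jail may need) is left out.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// insideJail reports whether path, after lexical cleaning, lies within root.
// Hypothetical helper: it does not resolve symlinks.
func insideJail(root, path string) bool {
	rel, err := filepath.Rel(filepath.Clean(root), filepath.Clean(path))
	if err != nil {
		return false
	}
	// A relative path that starts with ".." climbs out of the root.
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

func main() {
	root := "/srv/work"
	paths := []string{
		"/srv/work/in/data.csv",
		"/srv/work/../etc/passwd",
		"/srv/workspace/x.csv",
	}
	for _, p := range paths {
		fmt.Println(p, insideJail(root, p))
	}
}
```

Comparing the relative path rather than a string prefix is what keeps a sibling directory such as /srv/workspace from being mistaken for a child of /srv/work.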

Sliding TTL. Default lifetime is 7d (overridable via PULSE_IMPORT_TTL, or per-import via the ttl field — accepts Go duration like "24h", day form like "7d", or "pin" for never-expire). Every subsequent inspect/predict/process/sample/facet/ask against the handle slides expires_at forward. The pool self-sweeps on every pulse_import call — no daemon required. Inspect with pulse_imports_list; evict manually with pulse_drop.
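
The three accepted TTL forms and the sliding behaviour can be sketched in Go as below. The names parseTTL and slide are hypothetical, and the exact validation Pulse applies (for example to zero or negative values) is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseTTL is a hypothetical parser for the three forms described above.
// pinned=true means never expire; otherwise the duration is the lifetime.
func parseTTL(s string) (time.Duration, bool, error) {
	if s == "pin" {
		return 0, true, nil
	}
	// Day form: Go's own duration syntax has no "d" unit, so handle it first.
	if days, ok := strings.CutSuffix(s, "d"); ok {
		n, err := strconv.Atoi(days)
		if err != nil || n <= 0 {
			return 0, false, fmt.Errorf("invalid day-form ttl %q", s)
		}
		return time.Duration(n) * 24 * time.Hour, false, nil
	}
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0, false, err
	}
	return d, false, nil
}

// slide returns the new expires_at after an access at time now.
func slide(now time.Time, ttl time.Duration) time.Time {
	return now.Add(ttl)
}

func main() {
	for _, s := range []string{"24h", "7d", "pin"} {
		d, pinned, err := parseTTL(s)
		fmt.Println(s, d, pinned, err)
	}
	fmt.Println(slide(time.Now(), 7*24*time.Hour).After(time.Now()))
}
```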

Schema-bound enums

After a successful pulse_inspect (or after pulse_ask opens a cohort), the server registers session-scoped variants of the action tools (pulse_process, pulse_predict, pulse_compose, pulse_sample, pulse_facet) whose JSON Schemas embed enum constraints on field-name parameters. The LLM picks field names from a typed list rather than free-texting and discovering on predict that the name was wrong.

What gets constrained on bound pulse_process / pulse_predict / pulse_compose schemas:

Path	Enum
`aggregations[].field`	All cohort field names
`aggregations[].type`	Full aggregator catalogue (`AGG_*`)
`attributes[].field`	Numeric fields only (includes decimal)
`attributes[].type`	Full attribute catalogue (`ATTR_*`)
`filterers[].field`	All cohort field names
`filterers[].type`	Full filterer catalogue (`FILTER_*`)
`groups[].field`	All cohort field names
`groups[].type`	Full grouper catalogue (`GROUP_*`)
`windows[].field`, `windows[].partition_by[]`	All cohort field names
`windows[].order_by[].field`	Numeric and date fields
`windows[].type`	Full window catalogue (`WIN_*`)
`tests[].field`, `tests[].field2`	Numeric fields only
`tests[].split_by` / `rows` / `cols` / `subject_field`	All cohort field names
`tests[].type`	Full test catalogue (`TEST_*`)
`pulse_facet` `field` arg	All cohort field names

Trigger and lifecycle. Binding fires on a successful pulse_inspect, or when pulse_ask opens a cohort. mcp-go auto-fires notifications/tools/list_changed on AddSessionTools; the host refreshes its tool list and picks up the bound schemas on the next list. Bound tools share names with the global tools: the session-scoped variants override the globals for that session.

Limitations.

Multi-file sessions: the latest inspect wins. Track multiple cohorts client-side.
No per-element type ↔ field correlation: JSON Schema can’t easily express “if aggregations[i].type == AGG_SUM then aggregations[i].field must be numeric.” Operator–type compatibility lives in the type property description; strict validation remains pulse_predict’s job.
Transport support: binding requires a session that implements SessionWithTools. SSE / Streamable HTTP transports work; on stdio, binding is a no-op fallback and the global (unbound) schemas remain in effect. The manifest’s accepts_types table is still authoritative, so authoring is not blocked — just less ergonomic.
Empty enums omitted: when the cohort has zero fields in a category (e.g. no geo fields), the enum is omitted entirely rather than emitted as [].
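
To make the enum constraint and the empty-enum rule concrete, here is a sketch of how a bound schema fragment for a field-name parameter could be assembled. The function name fieldEnumSchema is hypothetical and the fragment is simplified; the schemas Pulse actually registers carry more properties.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fieldEnumSchema returns a JSON Schema fragment for a string parameter.
// When names is empty the "enum" key is left out instead of emitted as [].
func fieldEnumSchema(names []string) map[string]any {
	schema := map[string]any{"type": "string"}
	if len(names) > 0 {
		schema["enum"] = names
	}
	return schema
}

func main() {
	for _, names := range [][]string{{"region", "revenue"}, nil} {
		out, err := json.Marshal(fieldEnumSchema(names))
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

Omitting the key matters because an empty enum array would reject every value, whereas a missing enum leaves the parameter unconstrained.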

Disable binding entirely with --bind-on-open=false.

Configuration

Env var	Purpose	Default
`PULSE_DATA_DIR`	Cohort base directory. Required unless `--data-dir` is passed.	(none; the server fails to start when neither is set)
`PULSE_IMPORTS_DIR`	Subdirectory for managed-import handles.	`imports`
`PULSE_IMPORT_TTL`	Default TTL for managed handles. Accepts Go duration (`24h`, `30m`), day form (`7d`, `30d`), or `pin`.	`7d`

Embedders can override per-instance via pulse.Options{DataDir, ImportsDir, ImportTTL, ImportSourceJailRoot, FS, ImportSourceFS, BindOnOpen} — see pulse.go.

Transport caveats

Stdio. The default and only transport pulse mcp ships today. Schema binding is a no-op (see Limitations). Stdout is the JSON-RPC channel; stderr is the log channel — never write structured output to stdout outside the protocol.
SSE / Streamable HTTP. Not exposed by the mcp CLI leaf yet. The underlying mcp-go server supports them; embedders can call mcp.NewWithOptions(p, ...) and serve via mcp-go’s SSE / streamable HTTP entry points directly.

Troubleshooting

Symptom	Cause	Fix
`data directory required: set PULSE_DATA_DIR or pass --data-dir`	Neither env var nor flag set	Pass `PULSE_DATA_DIR` in the host’s `env` block, or `--data-dir` in `args`
Tools don’t appear in the host UI after editing config	Host caches tool list	Restart the host fully (not just the conversation)
`pulse_import` returns `PULSE_IMPORT_SOURCE_FORBIDDEN` for an absolute path	Path escapes the import jail (default = server’s working dir)	Either move the file under the jail, launch the server from a higher-level directory, or set `pulse.Options.ImportSourceJailRoot` when embedding
`pulse_inspect` succeeds but bound enums never fire	Stdio session — binding is a no-op there	Use `pulse_predict` for validation; the manifest’s `accepts_types` lists give the LLM the same information
Tool calls hang	Host wrote non-protocol bytes to the server’s stdin, or server wrote non-protocol bytes to stdout	Check server stderr; restart the session. `pulse mcp` itself only writes a one-line startup notice to stderr at boot
`pulse_ask` with `query` returns nonsense or wrong fields	Natural-language parsing is heuristic and beta	Inspect `query_resolution` in the response. For production, author a structured `request` against the cached manifest

To see what the server registers without launching the host:

pulse manifest --json | jq '.data.mcp_tools[]'
pulse manifest --json | jq '.data.skills[]'

Skill cross-reference for LLM agents

If you are writing a system prompt for an LLM agent that uses Pulse, point it at these skills rather than at this site:

LLM task	Skill
MCP wiring, tool surface, schema binding	`mcp-integration`
Author a `Process` request	`request-recipes`
Compose multiple sub-requests in one call	`compose-requests`
Iterate on a request with `pulse_predict`	`debugging-with-predict`
Look up an error code or warning	`error-code-reference`
Pick an aggregator / filterer	`aggregation-guide`
Pick an attribute (z-score, percentile, formula, …)	`attribute-composition`
Design a grouper	`grouper-design`
Use a window operator (`WIN_*`)	`window-operations`
Use a feature engineer (`FEAT_*`)	`feature-engineering`
Run a statistical test (tier-1 or tier-2)	`statistical-testing`
Fit a regression (OLS, GLM, Bayesian)	`regression-modeling`
Generate synthetic data	`synthetic-data`
Understand a cohort’s schema layout	`cohort-schema-design`
Import a tabular source into `.pulse`	`import-best-practices`
Pick an export format	`export-format-selection`
Work with `decimal128` (currency, precise arithmetic)	`financial-cohorts`
Route a natural-language query to a Pulse request	`query-router-prompt`
Get started end-to-end (LLM walkthrough)	`getting-started`

The agent should call pulse_skills_list once at session start to enumerate the catalog, then pulse_skills_get on demand. The returned text is authoritative; this site does not duplicate it and may lag.

See also

mcp (CLI leaf) — flag reference and exit codes for the server binary
Deployment — production hardening notes
Troubleshooting — non-MCP failure modes

Request Example Library

Pulse ships a searchable, embedded catalogue of runnable request JSON files spanning every operator category. They are checked into the repo under examples/, mounted into the binary at compile time via //go:embed, and surfaced through three peer access paths:

Access path	Best for
`pulse_examples_search` / `pulse_examples_get` (MCP tools)	LLM agents authoring requests against a running Pulse server
`pulse examples search` / `pulse examples show` (CLI)	Developers exploring at a shell
`pulse.ExamplesSearch` / `pulse.ExampleGet` (Go API)	Embedders building higher-level UIs

What the library contains

Every example is a complete types.Request JSON body — the same shape you hand to pulse_process. Each file is annotated with a structured _meta block describing the example. Pulse’s JSON unmarshaller ignores unknown fields by default, so the _meta block is invisible at execution time; the file remains runnable verbatim.

{
  "_meta": {
    "name": "t_test_one_sample",
    "category": "tests",
    "tags": ["hypothesis-test", "t-test", "tier-1-test", "parametric", "one-sample", "streaming-friendly"],
    "operators": ["AGG_AVERAGE", "AGG_COUNT", "TEST_T"],
    "description": "One-sample t-test comparing revenue mean against the hypothesized mu=100."
  },
  "cohort": {...},
  ...
}

Fetching via pulse_examples_get returns the request body with the _meta block already stripped, so you can pass it straight to pulse_process / pulse_predict.
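
If you read an example file straight from examples/ instead of going through pulse_examples_get, the _meta block can be dropped with a few lines of Go. This is a sketch of the idea only; the helper name stripMeta is hypothetical, and the sample document in main is abbreviated rather than a real library entry.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stripMeta removes the top-level "_meta" key and returns the remaining
// request body. Other keys are passed through untouched as raw JSON.
func stripMeta(raw []byte) ([]byte, error) {
	var doc map[string]json.RawMessage
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	delete(doc, "_meta")
	return json.Marshal(doc)
}

func main() {
	raw := []byte(`{"_meta":{"name":"t_test_one_sample"},"cohort":{"filename":"a.pulse"}}`)
	out, err := stripMeta(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Stripping is optional for execution, since unknown fields are ignored, but it keeps stored or logged requests free of annotation noise.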

Searching the library

Three filter dimensions, all optional and combined with AND:

Filter	Behaviour
`query`	Case-insensitive substring across the example’s name, description, and operator list
`tags`	An example must carry every requested tag
`category`	Exact match against the example’s directory (`aggregations`, `attributes`, `features`, `filterers`, `groupers`, `regression`, `tests`, `windows`)
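
The filter semantics in the table can be expressed as a single predicate. The sketch below is illustrative only: the example struct and the matches function are hypothetical stand-ins, not the types returned by pulse.ExamplesSearch.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// example is a hypothetical, trimmed-down view of one library entry.
type example struct {
	Name, Description, Category string
	Operators, Tags             []string
}

func containsFold(haystack, needle string) bool {
	return strings.Contains(strings.ToLower(haystack), strings.ToLower(needle))
}

// matches applies all three filters; an empty filter always passes.
func matches(ex example, query string, tags []string, category string) bool {
	if category != "" && ex.Category != category {
		return false // category is an exact match
	}
	for _, tag := range tags {
		if !slices.Contains(ex.Tags, tag) {
			return false // every requested tag must be present
		}
	}
	if query == "" {
		return true
	}
	if containsFold(ex.Name, query) || containsFold(ex.Description, query) {
		return true
	}
	return slices.ContainsFunc(ex.Operators, func(op string) bool {
		return containsFold(op, query)
	})
}

func main() {
	ex := example{
		Name:        "t_test_one_sample",
		Description: "One-sample t-test comparing revenue mean against the hypothesized mu=100.",
		Category:    "tests",
		Operators:   []string{"AGG_AVERAGE", "AGG_COUNT", "TEST_T"},
		Tags:        []string{"hypothesis-test", "t-test", "one-sample"},
	}
	fmt.Println(matches(ex, "test_t", []string{"t-test"}, "tests"))
	fmt.Println(matches(ex, "", []string{"t-test", "paired"}, ""))
}
```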

CLI

pulse examples search --query welch                       # find Welch-related examples
pulse examples search --tag time-series --tag tier-2-test # AND tag filter
pulse examples search --category tests --json             # JSON envelope
pulse examples show t_test_one_sample                     # print runnable JSON
pulse examples show t_test_one_sample --json              # full record (with _meta)

MCP

// arguments to pulse_examples_search
{"query": "welch"}
{"tags": ["time-series", "tier-2-test"]}
{"category": "features"}

Go API

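// Requires imports: context, encoding/json, fmt, log, and the pulse package.
ctx := context.Background()

p, err := pulse.New(pulse.Options{DataDir: "/data"})
if err != nil {
    log.Fatal(err)
}

// Search: query "welch", restricted to one tag, any category.
hits := p.ExamplesSearch("welch", []string{"experiment-analysis"}, "")
for _, h := range hits {
    fmt.Println(h.Name, "-", h.Description)
}

// Fetch and run:
ex, ok := p.ExampleGet("t_test_one_sample")
if !ok {
    log.Fatal("example not found")
}
var req pulse.Request
if err := json.Unmarshal(ex.Body, &req); err != nil {
    log.Fatal(err)
}
resp, err := p.Process(ctx, &req)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("%+v\n", resp)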

Tag taxonomy

Tags are curated and validated by a CI gate (TestExamples_TagsFromTaxonomy). The taxonomy spans five dimensions:

Dimension	Tags
Domain / use case	`time-series`, `cohort-analysis`, `experiment-analysis`, `correlation-analysis`, `comparison`, `before-after`, `top-n`, `distribution-shape`, `cross-tabulation`, `proportion-analysis`, `trend-detection`, `outlier-detection`, `cardinality-analysis`, `data-quality`, `geo-analysis`, `financial`, `feature-engineering`
Statistical method	`hypothesis-test`, `t-test`, `parametric`, `nonparametric`, `paired`, `one-sample`, `two-sample`, `k-sample`, `repeated-measures`, `post-hoc`, `normality-test`, `homogeneity-test`, `exact-test`
Regression / modeling	`regression`, `ecological`, `ols`, `glm`, `logistic`, `bayesian`, `regularization`, `ridge`, `lasso`, `elasticnet`, `polynomial`, `resampling`, `jackknife`, `selection`, `stepwise`
Pipeline machinery	`tier-1-test`, `tier-2-test`, `composed`, `pre-filter`, `feature-pipeline`, `window-operator`, `streaming-friendly`, `buffered-pipeline`
Risk / edge	`leakage-safe`, `leakage-risk`, `small-sample`

The category (directory name) is not repeated in the tags — _meta.category carries that.

Adding a new example

Write the request JSON under examples/<category>/. Use existing files as shape templates. Keep cohort.data_dir = ".data" and reference one of the fixture cohorts.
Add a _meta block at the top of the file:
- name: snake_case (lowercase words joined by underscores, e.g. t_test_one_sample), unique across the whole library.
- category — must match the parent directory.
- tags — pick 3-6 from the taxonomy above.
- operators — the list of AGG_* / ATTR_* / FILTER_* / GROUP_* / WIN_* / FEAT_* / TEST_* types appearing in the body, alphabetized and deduped.
- description — one-sentence, present-tense summary.
Re-run go test ./examples/... ./descriptor/... to confirm the new file passes:
- TestExamples_AllParseAsRequest
- TestExamples_UniqueNames
- TestExamples_TagsFromTaxonomy
- TestExamples_OperatorsMatchBody
- TestExamples_CategoryMatchesDirectory
- TestManifestExamplesPopulated
The annotation tool at cmd/annotate-examples/ is idempotent and may be re-used; updating its in-source annotations slice and re-running will rewrite the file’s _meta block in canonical form.

Regression Modeling

Pulse exposes regression through a compact, composable surface. Three operators, two orthogonal modifiers, and one upstream feature transform together cover every textbook regression variant. This chapter is the human-facing counterpart to skills/regression-modeling.md; agents should fetch the skill via pulse_skills_get rather than read this page.

Overview

Operator	Engine	Streaming
`REG_OLS`	Ordinary least squares + optional regularization	Streams sufficient statistics (Phase 1 + 2)
`REG_GLM`	Generalized linear model via IRLS	Always buffered (Newton-Raphson refit)
`REG_BAYES_LINEAR`	Bayesian linear regression (conjugate NIG)	Streams sufficient statistics (Phase 4)

Two spec-level modifiers compose with any of the three:

Resample ∈ {jackknife, bootstrap} — replaces analytical SE / p-values with resample-based estimates. Forces buffered.
Selection ∈ {forward, backward, stepwise} — drives AIC- or BIC-based greedy subset search. Requires Criterion. Forces buffered.

One upstream feature operator (FEAT_POLY) extends the linear core to polynomial regression. Per-row attributes (ATTR_REG_FITTED, ATTR_REG_RESIDUAL, ATTR_REG_LEVERAGE) attach per-record diagnostics in the output row stream.

The 13 textbook names → Pulse specs

The Indeed regression taxonomy double-counts (Simple ≡ Linear univariate, Multiple ≡ Multiple Linear) and treats orthogonal wrappers (Jackknife, Stepwise) as families. Pulse does not. The table below maps each textbook name onto the corresponding Pulse spec and links to a runnable example file under examples/regression/.

#	Indeed name	Pulse expression	Example
1	Simple	`REG_OLS` with one predictor	`examples/regression/02_simple_linear.json`
2	Multiple	`REG_OLS` with multiple predictors	`examples/regression/03_multiple_linear.json`
3	Linear	= #1	`examples/regression/02_simple_linear.json`
4	Multiple Linear	= #2	`examples/regression/03_multiple_linear.json`
5	Logistic	`REG_GLM{Family:"binomial", Link:"logit"}`	`examples/regression/04_logistic.json`
6	Ridge	`REG_OLS{Penalty:"l2", Alpha:λ}`	`examples/regression/05_ridge.json`
7	Lasso	`REG_OLS{Penalty:"l1", Alpha:λ}`	`examples/regression/06_lasso.json`
8	Polynomial	`FEAT_POLY{Field:x, Degree:n}` upstream → `REG_OLS`	`examples/regression/07_polynomial.json`
9	Bayesian Linear	`REG_BAYES_LINEAR{Prior:"nig"}`	`examples/regression/08_bayesian_linear.json`
10	Jackknife	any regression with `Resample:"jackknife"`	`examples/regression/09_jackknife.json`
11	Elastic Net	`REG_OLS{Penalty:"elasticnet", Alpha, L1Ratio}`	`examples/regression/10_elasticnet.json`
12	Ecological	`GROUP_*` upstream → `REG_OLS` over group means (composed request)	`examples/regression/01_ecological_fallacy.json`
13	Stepwise	any regression with `Selection:"stepwise", Criterion:"aic"\|"bic"`	`examples/regression/11_stepwise.json`

Streamability matrix

Spec	Streamable	Memory	Notes
`REG_OLS` no penalty	yes	O(p²)	sufficient stats: n, Σx, Σy, XᵀX, Xᵀy, Σy²
`REG_OLS` + l1 / l2 / elasticnet	yes	O(p²)	streaming Gram; regularized solve at finalize
`REG_BAYES_LINEAR` (conjugate NIG)	yes	O(p²)	streaming sufficient stats + closed-form posterior update
`REG_GLM` (binomial / poisson / gamma)	no	O(n·p)	IRLS / Newton requires multiple passes
Any regression with `Resample != ""`	no	O(n·p)	LOO / bootstrap refit
Any regression with `Selection != ""`	no	O(n·p)	refit per candidate subset

pulse_predict reports per-request streamability on PredictResult.Streamable, mirroring the runtime gate.
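
To show why the unpenalized OLS row needs memory independent of the row count, here is a minimal single-predictor sketch that accumulates only the sufficient statistics listed above and solves at the end. It is an illustration of the technique, not Pulse's engine; the type olsStats and its methods are hypothetical.

```go
package main

import "fmt"

// olsStats holds the sufficient statistics for y = b0 + b1*x.
// Memory use does not depend on the number of rows seen.
type olsStats struct {
	n, sx, sy, sxx, sxy, syy float64
}

// add folds one row into the running sums.
func (s *olsStats) add(x, y float64) {
	s.n++
	s.sx += x
	s.sy += y
	s.sxx += x * x
	s.sxy += x * y
	s.syy += y * y
}

// fit solves the 2x2 normal equations. ok is false when there are fewer
// than two rows or x has no variation (singular Gram matrix).
func (s *olsStats) fit() (intercept, slope float64, ok bool) {
	det := s.n*s.sxx - s.sx*s.sx
	if s.n < 2 || det == 0 {
		return 0, 0, false
	}
	slope = (s.n*s.sxy - s.sx*s.sy) / det
	intercept = (s.sy - slope*s.sx) / s.n
	return intercept, slope, true
}

// rss returns the residual sum of squares from the sums alone.
func (s *olsStats) rss(intercept, slope float64) float64 {
	return s.syy - intercept*s.sy - slope*s.sxy
}

func main() {
	var s olsStats
	for x := 0.0; x < 5; x++ {
		s.add(x, 1+2*x) // rows arrive one at a time
	}
	b0, b1, ok := s.fit()
	fmt.Println(b0, b1, ok, s.rss(b0, b1))
}
```

With p predictors the same idea stores a (p+1)×(p+1) Gram matrix, which is where the O(p²) figure in the table comes from.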

Operator reference

`REG_OLS`

Ordinary least squares with optional regularization.

Param	Required	Notes
`target`	yes	Numeric response field.
`predictors`	yes	One or more numeric predictor fields.
`penalty`	no	`""` (default), `"l1"`, `"l2"`, or `"elasticnet"`.
`alpha`	conditional	Required and `> 0` when `penalty != ""`.
`l1_ratio`	conditional	Required and in `[0, 1]` when `penalty == "elasticnet"`.
`max_iters`	no	Coordinate-descent cap (default 1000).
`tol`	no	Convergence tolerance (default `1e-6`).
`resample`	no	`"jackknife"` or `"bootstrap"`. Downgrades streaming.
`selection`	no	`"forward"`, `"backward"`, or `"stepwise"`. Requires `criterion`. Downgrades streaming.

Modifier compatibility: Resample and Selection may be combined; Selection runs first, Resample re-fits the selected subset.
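
The conditional rules in the parameter table (alpha and l1_ratio) can be written as a small client-side pre-check. This is a sketch under assumptions: olsSpec and validatePenalty are hypothetical names, the error text is invented, and pulse_predict remains the authoritative validator.

```go
package main

import (
	"errors"
	"fmt"
)

// olsSpec is a hypothetical, trimmed-down view of the penalty parameters.
type olsSpec struct {
	Penalty string
	Alpha   float64
	L1Ratio float64
}

// validatePenalty mirrors the conditional rules from the table above.
func validatePenalty(s olsSpec) error {
	switch s.Penalty {
	case "":
		return nil // no penalty: alpha and l1_ratio are not required
	case "l1", "l2", "elasticnet":
	default:
		return fmt.Errorf("unknown penalty %q", s.Penalty)
	}
	if s.Alpha <= 0 {
		return errors.New("alpha must be > 0 when penalty is set")
	}
	if s.Penalty == "elasticnet" && (s.L1Ratio < 0 || s.L1Ratio > 1) {
		return errors.New("l1_ratio must be in [0, 1] for elasticnet")
	}
	return nil
}

func main() {
	specs := []olsSpec{
		{},
		{Penalty: "l2"},
		{Penalty: "elasticnet", Alpha: 0.1, L1Ratio: 0.5},
	}
	for _, s := range specs {
		fmt.Printf("%+v -> %v\n", s, validatePenalty(s))
	}
}
```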

Error codes: PROCESSING_REGRESSION_RANK_DEFICIENT, PROCESSING_REGRESSION_SINGULAR_GRAM, PROCESSING_REGRESSION_NO_CONVERGE, PROCESSING_REGRESSION_INSUFFICIENT_DATA, PROCESSING_REGRESSION_APPROXIMATE_SE (warning, l1/elasticnet without resample), PROCESSING_REGRESSION_REGULARIZED_SELECTION (warning, penalty + selection), PROCESSING_CONFIG.

`REG_GLM`

Generalized linear model via iteratively-reweighted least squares.

Param	Required	Notes
`target`	yes	Numeric response.
`predictors`	yes	One or more numeric predictor fields.
`family`	yes	`"binomial"`, `"poisson"`, or `"gamma"`.
`link`	no	Family-specific default when empty (`binomial`→`logit`, `poisson`→`log`, `gamma`→`inverse`).
`max_iters`	no	IRLS iteration cap (default 50).
`tol`	no	Convergence tolerance (default `1e-8`).
`resample`	no	`"jackknife"` or `"bootstrap"`.
`selection`	no	Subset-selection wrapper; requires `criterion`.

Always buffered. Setting penalty / alpha / l1_ratio on a REG_GLM spec is rejected with PROCESSING_CONFIG; regularized GLM is reserved for a later phase.

Error codes: PROCESSING_REGRESSION_INVALID_FAMILY, PROCESSING_REGRESSION_INVALID_LINK, PROCESSING_REGRESSION_NO_CONVERGE, PROCESSING_REGRESSION_INSUFFICIENT_DATA, PROCESSING_CONFIG.

`REG_BAYES_LINEAR`

Bayesian linear regression with a conjugate Normal-Inverse-Gamma prior.

Param	Required	Notes
`target`	yes	Numeric response.
`predictors`	yes	One or more numeric predictor fields.
`prior`	no	Only `"nig"` accepted in v1. Default `"nig"`.
`prior_mu`	no	Length `p+1` mean vector (intercept first); defaults to zero.
`prior_precision`	no	Scalar `ε ≥ 0` on the precision matrix `ε·I`. Default `1e-3`.
`prior_shape`	no	Inverse-gamma shape `a₀`. Default `1e-3`.
`prior_rate`	no	Inverse-gamma rate `b₀`. Default `1e-3`.
`credible_level`	no	Posterior interval mass. Default 0.95.

Modifier compatibility: Resample and Selection are rejected for REG_BAYES_LINEAR at spec validation — the posterior already conveys uncertainty via credible intervals, and stepwise feature selection on a Bayesian model is a posterior-based question the conjugate-NIG engine doesn’t support.

Setting penalty / alpha / l1_ratio / family / link on a Bayes spec is rejected with PROCESSING_CONFIG.

Error codes: PROCESSING_REGRESSION_RANK_DEFICIENT, PROCESSING_REGRESSION_INSUFFICIENT_DATA, PROCESSING_CONFIG.

Modifiers

`Resample`

Layered on top of any base operator (except REG_BAYES_LINEAR).

Value	Behavior
`""`	No resampling. Closed-form / asymptotic standard errors.
`"jackknife"`	Leave-one-out resampling. `SE = sqrt((n−1)/n · Σᵢ (β⁽⁻ⁱ⁾ − β̄)²)`.
`"bootstrap"`	Non-parametric bootstrap. `bootstrap_iters` (default 1000), `rng_seed` (`0` → time-seeded; non-zero → reproducible).

For l1 / elasticnet OLS, setting Resample is the rigorous answer for standard errors: it suppresses the PROCESSING_REGRESSION_APPROXIMATE_SE warning (the SEs are now resample-based, not plug-in over the active set).
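
The jackknife formula in the table can be checked on a toy problem. The sketch below applies it to a leave-one-out slope and to a leave-one-out mean; it illustrates the formula only and is not how Pulse refits models. All function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// jackknifeSE applies SE = sqrt((n-1)/n * sum_i (theta_(-i) - theta_bar)^2),
// where estimate(i) returns the statistic computed with row i left out.
func jackknifeSE(n int, estimate func(skip int) float64) float64 {
	loo := make([]float64, n)
	var mean float64
	for i := range loo {
		loo[i] = estimate(i)
		mean += loo[i]
	}
	mean /= float64(n)
	var ss float64
	for _, v := range loo {
		ss += (v - mean) * (v - mean)
	}
	return math.Sqrt(float64(n-1) / float64(n) * ss)
}

// meanWithout is the sample mean with one row dropped.
func meanWithout(xs []float64, skip int) float64 {
	var sum float64
	for i, x := range xs {
		if i != skip {
			sum += x
		}
	}
	return sum / float64(len(xs)-1)
}

// slopeWithout is the simple-regression slope with one row dropped.
func slopeWithout(xs, ys []float64, skip int) float64 {
	mx, my := meanWithout(xs, skip), meanWithout(ys, skip)
	var sxx, sxy float64
	for i := range xs {
		if i != skip {
			sxx += (xs[i] - mx) * (xs[i] - mx)
			sxy += (xs[i] - mx) * (ys[i] - my)
		}
	}
	return sxy / sxx
}

func main() {
	xs := []float64{1, 2, 3, 4, 5, 6}
	ys := []float64{2.1, 3.9, 6.2, 7.8, 10.1, 12.2}
	se := jackknifeSE(len(xs), func(skip int) float64 {
		return slopeWithout(xs, ys, skip)
	})
	fmt.Println("jackknife SE of slope:", se)
}
```

For the sample mean the jackknife SE coincides with the familiar s/√n, which makes it a convenient sanity check for any implementation.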

`Selection`

Layered on top of any base operator (except REG_BAYES_LINEAR).

Value	Behavior
`""`	No subset selection.
`"forward"`	Start from intercept-only; add the predictor that lowers the criterion most.
`"backward"`	Start from full model; remove the predictor whose absence lowers the criterion most.
`"stepwise"`	Bidirectional sweep; try every add and every remove per cycle.

Requires Criterion ∈ {"aic", "bic"}.

AIC = -2·logL + 2·k. Lighter penalty; may retain weak predictors at moderate n.
BIC = -2·logL + log(n)·k. Heavier per-parameter penalty; rejects noise predictors more reliably at moderate n.
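
The two criteria translate directly into code. A minimal sketch, with hypothetical function names, that also shows where the BIC penalty overtakes the AIC penalty (once log(n) exceeds 2, i.e. from n = 8 upward):

```go
package main

import (
	"fmt"
	"math"
)

// aic returns -2*logL + 2*k for a model with k estimated parameters.
func aic(logL float64, k int) float64 {
	return -2*logL + 2*float64(k)
}

// bic returns -2*logL + log(n)*k for n observations.
func bic(logL float64, k, n int) float64 {
	return -2*logL + math.Log(float64(n))*float64(k)
}

func main() {
	logL, k := -100.0, 3
	for _, n := range []int{5, 8, 50, 5000} {
		fmt.Printf("n=%d aic=%.3f bic=%.3f\n", n, aic(logL, k), bic(logL, k, n))
	}
}
```

Selection compares candidate models on the chosen criterion, so only differences between values matter, not their absolute size.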

SelectedFeatures lists the chosen subset; Coefficients drops non-selected predictors entirely (absence ≠ zero — selection’s contract is stronger). Selection may be combined with Resample: Selection picks the active subset, then Resample replaces SE / p-values on the selected model.

Compositional patterns

Polynomial regression — `FEAT_POLY` + `REG_OLS`

Polynomial regression is linear in the coefficients; the non-linearity lives in the feature space. Use FEAT_POLY upstream to materialize x_2, x_3, …, x_<degree> derived columns, then list them alongside the original x in predictors:

{
  "features": [
    {"type": "FEAT_POLY", "field": "x", "label": "x", "params": {"degree": 3}}
  ],
  "regressions": [
    {"type": "REG_OLS", "name": "polyfit", "target": "y",
     "predictors": ["x", "x_2", "x_3"]}
  ]
}

Degree is gated at [2, 10]. Numerical stability is the caller’s responsibility: x^10 stays within f64 range until |x| approaches 10^30, but the Gram matrix becomes ill-conditioned long before overflow is a concern, because its entries reach up to x^20 and f64 carries only about 16 significant digits. Center or standardize predictors before requesting FEAT_POLY.
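
To see how quickly the derived columns grow, the sketch below mimics the column naming used in the request above (x_2, x_3, ...) and the [2, 10] degree gate. It is a stand-alone illustration; polyColumns is a hypothetical function, not the FEAT_POLY implementation.

```go
package main

import (
	"fmt"
)

// polyColumns returns the derived columns label_2 .. label_<degree> for one
// row value x. The degree gate mirrors the [2, 10] range stated above.
func polyColumns(label string, x float64, degree int) (map[string]float64, error) {
	if degree < 2 || degree > 10 {
		return nil, fmt.Errorf("degree %d outside [2, 10]", degree)
	}
	cols := make(map[string]float64, degree-1)
	pow := x
	for d := 2; d <= degree; d++ {
		pow *= x // running product avoids recomputing each power
		cols[fmt.Sprintf("%s_%d", label, d)] = pow
	}
	return cols, nil
}

func main() {
	// Compare a standardized-scale value with a raw, large-magnitude one.
	for _, x := range []float64{0.5, 500} {
		cols, err := polyColumns("x", x, 10)
		if err != nil {
			panic(err)
		}
		fmt.Printf("x=%g -> x_2=%g x_10=%g\n", x, cols["x_2"], cols["x_10"])
	}
}
```

The spread between the smallest and largest column is what degrades the conditioning of the Gram matrix, which is why rescaling the predictor first pays off.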

Ecological regression — group → regress

“Ecological regression” is a regression fit on aggregated group-level statistics — per-precinct means, per-county sums, per-region rates — rather than individual-level rows. Use pulse_compose with two slots: slot 1 produces per-group means via GROUP_* + AGG_AVERAGE, slot 2 fits REG_OLS over the aggregate output (or, in practice, over a pre-aggregated .pulse file).

The two slots are intentionally independent; Pulse does not pipe slot-1 results into slot-2 as cohort input. Either (a) materialize slot 1’s aggregate as its own .pulse cohort upstream, or (b) treat slot 1 as the audit trail (per-group means visible in the composed response) and run slot 2 over a pre-aggregated fixture.

Caution — the ecological fallacy. A significant group-level slope does not imply an individual-level association. Robinson (1950) showed that ecological correlations and individual correlations can take opposite signs in the same data: a per-state regression of literacy on race might suggest a strong relationship that vanishes (or reverses) at the per-person level. Aggregation collapses within-group variation, leaving only between-group structure that frequently encodes confounders.

When ecological regression is the right tool: aggregate-only data (census output, public-health summary tables); genuinely group-level research questions (“do counties with higher median income have higher turnout?”). When it is the wrong tool: individual-level claims; causal claims. Annotate consumer-facing prose with this caveat; Pulse cannot enforce it.

Robinson, W.S. (1950). “Ecological Correlations and the Behavior of Individuals.” American Sociological Review 15(3): 351–357.

Per-row regression attributes

Three attribute operators emit per-record diagnostics from a fitted regression onto the row stream.

Attribute	Emits per row
`ATTR_REG_FITTED`	`ŷ_i = Xᵢ β` — the model’s prediction at each row.
`ATTR_REG_RESIDUAL`	`y_i − ŷ_i` — the per-row residual.
`ATTR_REG_LEVERAGE`	`h_ii = Xᵢ (XᵀX)⁻¹ Xᵢᵀ` — the i-th diagonal of the hat matrix.

Each attribute references a sibling regression spec by regression_name. See skills/attribute-composition.md for the parameter table.
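
For a model with an intercept and one predictor, the three quantities in the table have closed forms, and the leverage reduces to h_ii = 1/n + (x_i − x̄)² / Σ(x_j − x̄)². The sketch below computes all three for that special case; diagnostics is a hypothetical function and assumes at least two distinct x values.

```go
package main

import "fmt"

// diagnostics fits y = b0 + b1*x by least squares and returns, per row,
// the fitted value, the residual, and the hat-matrix diagonal.
// Assumes len(xs) == len(ys) and that xs has at least two distinct values.
func diagnostics(xs, ys []float64) (fitted, resid, lev []float64) {
	n := float64(len(xs))
	var mx, my float64
	for i := range xs {
		mx += xs[i]
		my += ys[i]
	}
	mx /= n
	my /= n

	var sxx, sxy float64
	for i := range xs {
		dx := xs[i] - mx
		sxx += dx * dx
		sxy += dx * (ys[i] - my)
	}
	slope := sxy / sxx
	intercept := my - slope*mx

	fitted = make([]float64, len(xs))
	resid = make([]float64, len(xs))
	lev = make([]float64, len(xs))
	for i := range xs {
		dx := xs[i] - mx
		fitted[i] = intercept + slope*xs[i]
		resid[i] = ys[i] - fitted[i]
		lev[i] = 1/n + dx*dx/sxx
	}
	return fitted, resid, lev
}

func main() {
	fitted, resid, lev := diagnostics(
		[]float64{1, 2, 3, 4},
		[]float64{2, 4, 5, 8},
	)
	for i := range fitted {
		fmt.Printf("row %d: fitted=%.3f residual=%.3f leverage=%.3f\n",
			i, fitted[i], resid[i], lev[i])
	}
}
```

Two quick checks apply to any such fit: residuals sum to zero when an intercept is present, and leverages sum to the number of fitted coefficients (two here).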

Error codes

Look up full prose via pulse_errors_lookup or pulse errors lookup CODE.

Code	Meaning (one-liner)
`PROCESSING_REGRESSION_NOT_IMPLEMENTED`	Reserved as of Phase 8; no engine returns this today.
`PROCESSING_REGRESSION_RANK_DEFICIENT`	XᵀX is singular; add regularization or drop a predictor.
`PROCESSING_REGRESSION_NO_CONVERGE`	IRLS or coordinate descent failed within `MaxIters`.
`PROCESSING_REGRESSION_SINGULAR_GRAM`	XᵀX non-invertible even after regularization; increase `alpha`.
`PROCESSING_REGRESSION_INVALID_FAMILY`	`REG_GLM` Family outside `{binomial, poisson, gamma}`.
`PROCESSING_REGRESSION_INVALID_LINK`	Link incompatible with the chosen Family.
`PROCESSING_REGRESSION_INSUFFICIENT_DATA`	Filtered set has fewer rows than predictors + 1, or below resample minimum.
`PROCESSING_REGRESSION_APPROXIMATE_SE`	Warning: l1 / elasticnet SE is a plug-in approximation; set `resample` for rigor.
`PROCESSING_REGRESSION_REGULARIZED_SELECTION`	Warning: `penalty != ""` plus `selection != ""` is unusual.
`PROCESSING_CONFIG`	Invalid spec combination (e.g. Bayes + Resample, GLM + Penalty).

Worked examples

Every Indeed name has a runnable JSON file under examples/regression/. Fetch via pulse_examples_get or read directly:

01_ecological_fallacy.json — per-region aggregation + ecological caveat (#12).
02_simple_linear.json — univariate OLS (#1, #3).
03_multiple_linear.json — multivariate OLS (#2, #4).
04_logistic.json — binary classification (#5).
05_ridge.json — l2 penalty (#6).
06_lasso.json — l1 penalty (#7).
07_polynomial.json — FEAT_POLY + OLS (#8).
08_bayesian_linear.json — conjugate NIG (#9).
09_jackknife.json — leave-one-out resampling (#10).
10_elasticnet.json — combined l1 / l2 penalty (#11).
11_stepwise.json — BIC-driven stepwise selection (#13).

Architecture Overview

Source of truth: the canonical architectural contract is CLAUDE.md at the repository root. This chapter restates its design principles for human readers; if the two ever disagree, CLAUDE.md is authoritative.

Pulse is a high-performance, self-describing tabular data processing engine. It ships as a Go library (github.com/frankbardon/pulse) and as a CLI binary (cmd/pulse/). The library is the primary deliverable; the CLI is a thin adapter over it.

Design principles

Library-first. The pulse.go facade (pulse.New, pulse.Options, pulse.Process, pulse.Compose, pulse.Import, pulse.Export, pulse.Convert, pulse.Inspect, pulse.Predict, pulse.Sample, pulse.Facet) is the public API. The CLI calls the library; it never contains business logic.
Self-describing. Every .pulse file carries its schema in the header. The descriptor/ package provides manifest, predict, and inspect operations that expose the system’s capabilities and validate requests without executing them.
Skill-augmented. The skills/ package embeds 19 markdown skill files into the binary via //go:embed. LLM agents (and Nexus, the orchestration layer that consumes Pulse) can call skills.List() and skills.Get(name) at boot time to inject domain-specific guidance into their context.
Nexus relationship. Pulse is a standalone processing engine. Nexus is the upstream orchestration agent that calls Pulse’s library API or CLI. Pulse has no dependency on Nexus. Nexus discovers Pulse’s capabilities via pulse manifest --json and loads skills from the embedded skill pack.

The next chapter, Package Layout, shows where each of these concerns lives in the source tree.

Package Layout

Source of truth: this tree is mirrored from the “Package layout” section of CLAUDE.md. If the project structure changes, that file is updated first; this page follows.

pulse/
├── cmd/
│   └── pulse/              # CLI binary (the only binary)
├── pulse.go                # Public facade — pulse.New, pulse.Options
├── service/                # Orchestration layer; wires processing to encoding
├── processing/             # Aggregators, attributes, filterers, groupers, windows, features
│   ├── window/             # WIN_* operators (LAG, LEAD, RANK, RUNNING_*, EWMA, ...)
│   └── feature/            # FEAT_* pre-filter feature engineers (LOG, SQRT, BUCKETIZE, ...)
├── encoding/               # Dynamic schema + record codec (.pulse binary format)
├── io/                     # Bidirectional tabular <-> .pulse adapters
│   ├── csv/                # CSV reader + writer
│   ├── tsv/                # TSV reader + writer
│   ├── ndjson/             # NDJSON reader + writer
│   ├── jsonarray/          # JSON-array reader + writer (single top-level array of flat objects)
│   ├── jsonshared/         # Value coercion helpers shared by ndjson and jsonarray
│   ├── arrow/              # Arrow IPC / Feather V2 reader + writer; shared Arrow<->Pulse type maps
│   ├── parquet/            # Parquet reader + writer (delegates type maps to io/arrow)
│   └── excel/              # Excel reader + writer (Excelize)
├── fs/                     # afero-based filesystem abstraction + extension hook
├── errors/                 # Typed error codes (CodedError system)
├── types/                  # Request/response structs (JSON-serializable)
├── descriptor/             # Self-description: manifest, predict, inspect, envelope
├── skills/                 # Embedded markdown skill pack (//go:embed)
│   ├── index.json          # Manifest of all bundled skills
│   └── *.md                # Individual skill files with YAML frontmatter
├── synth/                  # Synthetic data generator (from-schema, from-profile)
├── docs/                   # mdBook source for this site (published to GitHub Pages)
└── internal/
    ├── cli/                # CLI internals (descriptor walker, json action)
    └── mcp/                # MCP server: tool + resource handlers wrapping pulse.Pulse
        └── mcptools/       # Leaf metadata package (tool names + descriptions) consumed by descriptor

Adding an Aggregator

Audience: Pulse internals contributors adding a new AGG_* operator.

This page is a step-by-step recipe. The same content lives in CLAUDE.md → Common Claude Code Workflows → Adding a new aggregator; this is the human-readable mirror.

From CLAUDE.md, Common Claude Code Workflows.

1. Declare the type constant

Add the new constant to types/types.go and the slice returned by types.AllAggregationTypes(). Example, for a hypothetical AGG_GINI:

const (
    // ... existing constants ...
    AGG_GINI AggregationType = "AGG_GINI"
)

func AllAggregationTypes() []AggregationType {
    return []AggregationType{
        // ... existing entries, alphabetised ...
        AGG_GINI,
    }
}

The exhaustiveness tests (TestStreamability_AggregationsKnown and friends) will fail until you add the streamability case in step 4.

2. Implement the aggregator and register it

The operator implementation lives in processing/. Write the factory function (newGini(...) returning the aggregator interface) and register it in aggregatorRegistry in processing/registry.go.

If the aggregator can update one row at a time, also implement the OnlineAggregator interface so it joins the streaming Process path. Sort-based or sum-of-deviation aggregators (like AGG_MEDIAN, AGG_ZSCORE) skip this interface and run in the buffered path.
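
To make the online/buffered distinction concrete, here is a minimal sketch of the two shapes. The onlineAggregator interface and both types below are hypothetical stand-ins written for this page; Pulse's real OnlineAggregator signature lives in processing/ and may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// onlineAggregator is a hypothetical row-at-a-time contract.
// Pulse's real OnlineAggregator interface may look different.
type onlineAggregator interface {
	Update(v float64)
	Result() float64
}

// runningMean keeps O(1) state, so it could join a streaming path.
type runningMean struct {
	n   int
	sum float64
}

func (m *runningMean) Update(v float64) { m.n++; m.sum += v }

func (m *runningMean) Result() float64 {
	if m.n == 0 {
		return 0
	}
	return m.sum / float64(m.n)
}

// bufferedMedian must retain every value because the median needs a sort.
type bufferedMedian struct{ vals []float64 }

func (m *bufferedMedian) Add(v float64) { m.vals = append(m.vals, v) }

func (m *bufferedMedian) Result() float64 {
	if len(m.vals) == 0 {
		return 0
	}
	s := append([]float64(nil), m.vals...) // sort a copy, keep input order intact
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

func main() {
	var mean onlineAggregator = &runningMean{}
	median := &bufferedMedian{}
	for _, v := range []float64{4, 1, 7, 2} {
		mean.Update(v)
		median.Add(v)
	}
	fmt.Println(mean.Result(), median.Result()) // prints "3.5 3"
}
```

The mean never holds more than two numbers, while the median's memory grows with the row count; that difference is what the streamability flag in step 4 records.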

3. Tests

Tests come first, even though this recipe lists the implementation as step 2: write them in processing/aggregator_test.go before the implementation, run the suite, confirm they fail informatively, then fill in the implementation until the suite is green. See Testing Conventions.

4. Declare streamability

Add a case for the new type in types/streamability.go:

func (t AggregationType) Streamable() bool {
    switch t {
    // ...
    case AGG_GINI:
        return false // sort-based
    }
}

Add the same row to the table in types/streamability_test.go.

If the aggregator is online, also expect TestRegistryStreamabilityMatchesTypes to compare your OnlineAggregator implementation against the AggregationType.Streamable() return value — they must agree.
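
The idea behind an exhaustiveness gate can be sketched in a few lines. This is an illustration only: the trimmed type list, the two-value streamable helper, and missingCases are assumptions made for the example, not the code behind TestStreamability_AggregationsKnown.

```go
package main

import "fmt"

// AggregationType mirrors the string-typed constant shown in step 1.
type AggregationType string

const (
	AGG_COUNT  AggregationType = "AGG_COUNT"
	AGG_MEDIAN AggregationType = "AGG_MEDIAN"
	AGG_GINI   AggregationType = "AGG_GINI" // the hypothetical new operator
)

// allAggregationTypes is a trimmed stand-in for types.AllAggregationTypes().
func allAggregationTypes() []AggregationType {
	return []AggregationType{AGG_COUNT, AGG_GINI, AGG_MEDIAN}
}

// streamable returns the streamability decision and whether the type has an
// explicit case at all. The second value is an assumption of this sketch;
// the Streamable() method shown above returns only a bool.
func streamable(t AggregationType) (ok, known bool) {
	switch t {
	case AGG_COUNT:
		return true, true
	case AGG_MEDIAN, AGG_GINI:
		return false, true // sort-based
	}
	return false, false
}

// missingCases lists every declared type that lacks a streamability case.
func missingCases() []AggregationType {
	var missing []AggregationType
	for _, t := range allAggregationTypes() {
		if _, known := streamable(t); !known {
			missing = append(missing, t)
		}
	}
	return missing
}

func main() {
	fmt.Println("types missing a streamability case:", len(missingCases()))
}
```

Removing AGG_GINI from the switch makes missingCases return it, which is the failure mode step 1 warns about.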

5. Update the skill pack

Add a section for the new aggregator in skills/aggregation-guide.md. Cover when to use it, what its inputs and outputs look like, and any caveats (sort cost, memory, supported field types).

The CI gate TestSkillsCoverAllComponents parses the skill body for the operator name; the section can live anywhere in the file as long as the name appears.

6. Declare the capability metadata

Add a row to descriptor/capabilities_aggregations.go describing the operator’s params, accepted field types, emitted type, and the streamable hint. TestManifestOperatorsComplete enforces that every registered aggregator has a capability row.

7. CLAUDE.md and registered-component lists

Update CLAUDE.md’s “Current registered components” section with the new aggregator name in the right alphabetised slot. If the operator interacts with categorical fields in a special way, also update descriptor/predict.go’s numericAggregations map.

8. Run the gates

go test ./skills/ -run TestSkillsCoverAllComponents
go test ./descriptor/ -run 'TestManifest|TestPredict'
go test ./processing/ -run TestRegistryStreamability
go test ./...

The full Update Demand row for aggregators says: skill update + capability declaration + CLAUDE.md update + the existing test coverage. All four ride in the same PR. See The Update Demand.

Adding an I/O Format

Audience: internals contributors adding a new bidirectional tabular format (a peer to the existing csv/, tsv/, ndjson/, jsonarray/, arrow/, parquet/, excel/ sub-packages).

From CLAUDE.md, Common Claude Code Workflows.

1. Create the sub-package

Each format is a sub-package under io/. Create io/<format>/<format>.go with both a reader and a writer.

The two interfaces to implement live in io/:

// Reader
type Reader interface {
    ReadHeader() ([]string, error)
    ReadRows(ctx context.Context, fn func(row []string) error) error
    Close() error
}

// Writer
type Writer interface {
    WriteHeader(columns []string) error
    WriteRow(values []string) error
    Close() error
}

If the reader needs schema inference (header sample, then full import), also implement io.ResetReader.Reset() so the import job can rewind after sampling.
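
As a rough illustration of the rewind requirement, the sketch below implements the Reader interface quoted above over an in-memory slice and adds a Reset method. The memReader type, the countRows helper, and the exact Reset signature are assumptions; check io.ResetReader in the repository for the real contract.

```go
package main

import (
	"context"
	"fmt"
)

// Reader is copied from the interface listing above.
type Reader interface {
	ReadHeader() ([]string, error)
	ReadRows(ctx context.Context, fn func(row []string) error) error
	Close() error
}

// memReader serves rows from a slice. It is a hypothetical example type.
type memReader struct {
	header []string
	rows   [][]string
	pos    int
}

func (r *memReader) ReadHeader() ([]string, error) { return r.header, nil }

// ReadRows streams the remaining rows to fn, stopping on cancellation or
// on the first error fn returns.
func (r *memReader) ReadRows(ctx context.Context, fn func(row []string) error) error {
	for r.pos < len(r.rows) {
		if err := ctx.Err(); err != nil {
			return err
		}
		row := r.rows[r.pos]
		r.pos++
		if err := fn(row); err != nil {
			return err
		}
	}
	return nil
}

// Reset rewinds to the first row so a sampling pass can be followed by a
// full import. The signature is assumed for this sketch.
func (r *memReader) Reset() error { r.pos = 0; return nil }

func (r *memReader) Close() error { return nil }

// countRows drains any Reader and reports how many rows it yielded.
func countRows(r Reader) (int, error) {
	n := 0
	err := r.ReadRows(context.Background(), func([]string) error { n++; return nil })
	return n, err
}

func main() {
	r := &memReader{
		header: []string{"id", "price"},
		rows:   [][]string{{"1", "9.5"}, {"2", "3.0"}},
	}
	sampled, _ := countRows(r) // the inference pass consumes the rows
	_ = r.Reset()
	full, _ := countRows(r) // the import pass sees them again
	fmt.Println(sampled, full) // prints "2 2"
}
```

Without the Reset call the second pass would see zero rows, which is why a sampling reader that cannot rewind would silently import an empty file.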

2. Tests

Add io/<format>/<format>_test.go with the standard round-trip checks: write rows, read them back, verify equality. Hermetic tests should use afero.NewMemMapFs() — see Testing Conventions.

3. Wire it into the CLI

The CLI registers per-format leaves in internal/cli/import.go and internal/cli/export.go. Add the format string to:

The switch in makeImportReader(format, ...) in import.go.
The corresponding newWriterForFormat(format, ...) switch in export.go.
The Commands: slice on ImportCommand() and ExportCommand() in the same files (one importFormatCmd("yourformat") / exportFormatCmd("yourformat") line).

The pulse convert leaf auto-detects format from extension via formatFromExt; add the extension mapping if the new format has a canonical file extension.
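
Extension-based detection usually reduces to a small lookup table. The sketch below shows the general shape; formatFromExtSketch and every entry in the map are illustrative guesses, not the table that formatFromExt actually ships with.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extToFormat maps file extensions to format names. The entries are
// illustrative assumptions, not Pulse's real mapping.
var extToFormat = map[string]string{
	".csv":     "csv",
	".tsv":     "tsv",
	".ndjson":  "ndjson",
	".parquet": "parquet",
	".xlsx":    "excel",
}

// formatFromExtSketch returns the format for a path, or false when the
// extension is unknown and the caller has to name the format explicitly.
func formatFromExtSketch(path string) (string, bool) {
	f, ok := extToFormat[strings.ToLower(filepath.Ext(path))]
	return f, ok
}

func main() {
	for _, p := range []string{"sales.CSV", "events.ndjson", "notes.txt"} {
		f, ok := formatFromExtSketch(p)
		fmt.Printf("%s -> format=%q known=%v\n", p, f, ok)
	}
}
```

Returning a second boolean instead of a default format keeps an unknown extension from being silently treated as CSV.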

4. Schema mapping

If the new format has a native type system (Arrow / Parquet do, CSV does not), share the type map with neighbouring formats via the io/arrow package the way Parquet already does. CSV / TSV / NDJSON / JSON-array share io/jsonshared for value coercion.

5. Skill update

Add or update a skill that points users at the new format. If the new format is primarily an export concern, update skills/export-format-selection.md. If it has import-side considerations (schema inference, null markers, type ambiguity), update skills/import-best-practices.md.

If the format adds a CLI flag (e.g. --sheet for Excel), update skills/getting-started.md so TestSkillsCoverAllCliLeaves keeps passing.

6. Convert and orchestration plumbing

Make sure both directions flow through pio.ImportJob and pio.ExportJob. The orchestration layer is format-agnostic; you should not need to touch service/ unless the new format requires special metadata (e.g., Parquet’s per-column statistics).

7. Run the gates

go test ./io/<format>/...
go test ./skills/ -run TestSkillsCoverAll
go test ./...

For format-specific perf, add benchmarks (Benchmark<Format>...) in the sub-package. There’s no required perf gate today, but neighbouring formats have benchmarks you can mirror as a baseline.

Adding a Statistical Test

Audience: internals contributors adding a new TEST_* operator — tier-1 (row-stream) or tier-2 (post-test on the materialised result set).

The recipe mirrors the aggregator and feature recipes; the test-specific moving parts are streamability, the test catalog, and the registered-test capability table.

From CLAUDE.md, “Update Demand” rows for statistical tests and tier-2 post-test variants.

1. Decide tier

Tier 1. Runs against the raw row stream, alongside aggregators. Online-moments tests (TEST_T, TEST_WELCH, TEST_CHISQ, TEST_ANOVA_F) stay in the streaming Process path. Sort-required tests (TEST_KS) force the buffered path.
Tier 2. Runs after the result set is materialised, in req.PostTests. Always buffered.

2. Declare the type constant

Add to types/types.go:

const (
    // ... existing constants ...
    TEST_GINI_TREND TestType = "TEST_GINI_TREND"
)

Add it to types.AllTestTypes().

3. Implement and register

Tests live in processing/test_*.go. Existing examples to mirror:

processing/test_t.go — online tier-1 test.
processing/test_anova.go — tier-1 ANOVA with grouper support.
processing/test_post.go and processing/test_post_more.go — tier-2 post-tests.
processing/test_studentized.go — numerical integration utilities (used by TEST_TUKEY_HSD).

Register the test in processing/test.go (the registry construction calls). For tier-2 variants, declare both the base type and the variant identifier the post-test surface uses.

4. Streamability

Add a case in types/streamability.go for the new TestType:

func (t TestType) Streamable() bool {
    switch t {
    // ...
    case TEST_GINI_TREND:
        return false // sort-based
    }
}

Add the matching row in types/streamability_test.go so TestStreamability_TestsKnown passes.

5. Capability declaration

Add a row to descriptor/capabilities_tests.go:

For a tier-1 test, declare it in the tier-1 catalog (testCapabilities).
For a tier-2 post-test, declare it in postTestCapabilities.

TestManifestTestsComplete and TestManifestPostTestsComplete enforce that the manifest enumerates every registered test.

6. Skill update

Add an entry to skills/statistical-testing.md under “Operator catalog”. Describe the test’s family, inputs, outputs (statistic, p, df, effect size, …), and any preconditions (PULSE_TEST_* error codes it can raise). For tier-2 variants, also document the variant field shape since the post-test API exposes it.

7. Tests

Use the same TDD pattern as for aggregators. The processing package has rich existing test files to model new cases against: processor_test_pipeline_test.go, test_parametric_test.go, test_nonparametric_test.go, test_post_more_test.go. Add hermetic fixtures that exercise the streaming and buffered paths.

8. Error codes

If your test introduces a new failure mode, add a code to errors/codes.go (mirror the existing PULSE_TEST_* family), register its description row in descriptor/capabilities_errors.go, and document recovery in skills/error-code-reference.md. See the Adding an Aggregator recipe for the same pattern at the aggregator layer.

9. CLAUDE.md

Update CLAUDE.md’s “Current registered components → statistical tests” line with the new operator. If the test introduces a new preconditions class (e.g. paired sample, repeated measures), also add a sentence describing it in the parent paragraph.

10. Run the gates

go test ./processing/ -run TestType_Streamable
go test ./types/    -run TestStreamability_TestsKnown
go test ./descriptor/ -run TestManifest
go test ./skills/    -run TestSkillsCoverAll
go test ./...

See The Update Demand for the full row that governs statistical-test changes.

The Update Demand

Source of truth: this chapter is mirrored from the “Update Demand” section of CLAUDE.md. Both files are kept in lock-step; CLAUDE.md is authoritative if they ever diverge (a TestUpdateDemandTableCovers CI gate enforces table coverage against the registries).

Any change to Pulse code, configuration, file format, or public surface MUST update the corresponding skill file(s) and CLAUDE.md in the same PR. This is not a courtesy. It is a non-skippable CI failure if any of the trigger conditions below is met without the corresponding doc update.

Trigger → required update

If you change…	You MUST also update…	Enforced by
A registered aggregator	`skills/aggregation-guide.md` (add or update the section for that aggregator)	`TestSkillsCoverAllComponents`
A registered attribute	`skills/attribute-composition.md`	`TestSkillsCoverAllComponents`
A registered filterer	`skills/aggregation-guide.md` (filtering section)	`TestSkillsCoverAllComponents`
A registered grouper	`skills/grouper-design.md`	`TestSkillsCoverAllComponents`
A registered window operator	`skills/window-operations.md`	`TestSkillsCoverAllWindowTypes`
An error code (added/removed/renamed)	`skills/error-code-reference.md`	`TestSkillsCoverAllErrorCodes`
A CLI leaf (added/removed/flag added)	`CLAUDE.md` “Common Claude Code Workflows” + `skills/getting-started.md` if user-facing	`TestSkillsCoverAllCliLeaves`
A `--json` envelope or `format_version`	`CLAUDE.md` “Output Format Contract”	`TestClaudeMdMentionsFormatVersion`
A `.pulse` file format change (header layout, new field type)	`CLAUDE.md` “Code Conventions” + `skills/cohort-schema-design.md`	`TestClaudeMdMentionsFormatVersion`, `TestSkillsCoverAllFieldTypes`
A new non-skippable CI gate	`CLAUDE.md` (gate listed by name in the relevant section)	`TestClaudeMdMentionsAllNonSkippableGates`
A new architectural decision	`CLAUDE.md` (relevant section) + PRD if applicable	reviewer enforcement
An environment variable	`CLAUDE.md` “Build / Dev / Test Workflow” + `skills/getting-started.md`	`TestClaudeMdMentionsAllEnvVars`
A registered MCP tool (added/removed)	`skills/mcp-integration.md` (Tool surface table) + `internal/mcp/mcptools/meta.go` (name + description)	`TestSkillsCoverAllMCPTools`, `TestManifestMCPToolsComplete`
A new MCP action tool with field-name parameters	`internal/mcp/schema_bind.go` (add a per-tool JSON Schema builder + entry in `Bind`) + `skills/mcp-integration.md` (Schema-bound enums section)	`TestMCPSchemaBinding_RemovesInvalidFields`, `TestMCPSchemaBinding_AllFieldsInFiltererEnum`, `TestMCPSchemaBinding_SampleAndFacetFieldEnum`, `TestMCPSchemaBinding_InspectSucceedsRegistersBindings`, `TestMCPSchemaBinding_BindOnOpenFalse`
A registered feature operator	`skills/feature-engineering.md` (operator catalog) + capability declaration in `descriptor/capabilities_features.go`	`TestSkillsCoverAllComponents`, `TestManifestOperatorsComplete`
A registered synth distribution kind	`skills/synthetic-data.md` (Supported distributions) + capability declaration in `descriptor/capabilities_distributions.go`	`TestSkillsCoverAllSynthDistributions`, `TestManifestDistributionsComplete`
A registered statistical test (`TEST_*`)	`skills/statistical-testing.md` (Operator catalog) + `types/streamability.go` + `types/streamability_test.go` + capability declaration in `descriptor/capabilities_tests.go`	`TestStreamability_TestsKnown`, `TestManifestTestsComplete`
A registered tier-2 post-test variant	Capability declaration in `descriptor/capabilities_tests.go` (`postTestCapabilities`)	`TestManifestPostTestsComplete`
A registered aggregator/attribute/filterer/grouper/window capability metadata	Capability declaration in `descriptor/capabilities_<category>.go` (params, accepts_types, emits_type, streamable_hint)	`TestManifestOperatorsComplete`
A new error code	Description row in `descriptor/capabilities_errors.go` (`errorMetaTable`)	`TestManifestErrorCodesComplete`
An error code’s fixup template	Entry in `errors/fixup_metadata.go` (`codeMetadata`) + `Fixup:` line in `skills/error-code-reference.md` under that code	`TestCodesHaveFixups`, `TestSkillsErrorCodeFixupsDocumented`
A new operator’s streaming capability	`types/streamability.go` (case for the new type) + table in `types/streamability_test.go`	`TestRegistryStreamabilityMatchesTypes`, `TestStreamability_*Known`, `TestManifestStreamableMatchesTypes`
The default operator table	`CLAUDE.md` “Code Conventions → Smart defaults” + `skills/getting-started.md` (“Defaults” section)	`TestDefaults_Applied` + reviewer enforcement
A natural-query parsing route (new grammar shape)	`internal/query/query.go` grammar + `internal/query/query_test.go` fixtures + `skills/query-router-prompt.md` (router prompt grammar) + `skills/request-recipes.md` (target shapes)	`TestNaturalQuery_HeuristicGrammar`

The Update Demand applies recursively to itself: when a new trigger row is added (e.g., a new component category, a new contract), this table MUST be updated in the same PR. TestUpdateDemandTableCovers (non-skippable) parses this table and asserts every registered component category and contract type has a row.

If you find yourself wanting to defer the doc/skill update to “a follow-up PR,” stop. The follow-up PR will not happen, and the next Claude Code session will read a stale CLAUDE.md and produce wrong code. Update in the same PR or do not merge.

Deployment

Audience: operators standing up Pulse as a standalone CLI, an MCP process under an AI client, or an embedded Go library inside a larger binary.

Pulse is a single static Go binary. There is no installer, no config file, and no daemon: every deployment story is some shape of “put the binary somewhere, set PULSE_DATA_DIR, run it”.

LLM agents using MCP: see the mcp-integration skill via pulse_skills_get for the MCP-side wiring details. This page covers the operator side.

Mode 1: Standalone CLI

go install github.com/frankbardon/pulse/cmd/pulse@latest
export PULSE_DATA_DIR=/var/data/pulse
pulse --version

That’s the full install. The CLI tree is mapped in the CLI Tour.

Mode 2: MCP stdio server (Claude Desktop, Claude Code, generic MCP clients)

pulse mcp runs the Model Context Protocol over stdio. AI clients launch the process, speak MCP over its standard streams, and shut it down on session close.

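The full wiring guide is in the mcp-integration skill. Quick reference for Claude Desktop, whose config file is ~/Library/Application Support/Claude/claude_desktop_config.json (the file is strict JSON, so keep the path out of it rather than adding it as a comment):
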
{
  "mcpServers": {
    "pulse": {
      "command": "/usr/local/bin/pulse",
      "args": ["mcp"],
      "env": {
        "PULSE_DATA_DIR": "/var/data/pulse"
      }
    }
  }
}

For Claude Code (~/.claude.json) and other clients the shape is the same — see the mcp-integration skill (pulse skills show mcp-integration) for the canonical recipes.

Flags worth knowing:

Flag	Default	Purpose
`--data-dir`	from `PULSE_DATA_DIR`	Override the cohort base directory
`--bind-on-open`	`true`	Register session-scoped JSON-schema-bound tool variants on successful `pulse_inspect`. Disable for clients that bind tool schemas themselves.

See pulse mcp for the full command page.

Mode 3: Embedded Go library

import "github.com/frankbardon/pulse"

p, err := pulse.New(pulse.Options{
    DataDir: "/var/data/pulse",
})
if err != nil {
    return err // handle the construction error before using p
}

When embedding, you can bypass PULSE_DATA_DIR entirely by passing DataDir (as above) or a custom afero.Fs. See Library Embedding for the full surface.

Production hardening

Filesystem permissions. pulse mcp reads everything under PULSE_DATA_DIR. Treat the directory as the trust boundary — run the process as a user that can only read what it should serve.
Stdio plumbing. The protocol itself runs over stdin and stdout, but MCP clients may also capture the server's stderr. Pulse writes a one-line startup notice (pulse mcp: serving over stdio...) on stderr and never logs request/response payloads, so MCP clients can surface stderr without leaking data.
Resource limits. Streaming aggregations stay memory-bounded; buffered request shapes (window operators, median/percentile, decimal/geo paths) can materialise large intermediate row sets. Use pulse api predict to check Streamable before running an unfamiliar request — see Performance Notes.
No mutating background state. Pulse never writes to a cohort during process/compose. The only write paths are import, export, synth, profile, and cohort filter — explicit by flag.

Upgrades

Drop in a new binary and restart the MCP process (or the calling client). The .pulse file format carries a one-byte version field (currently 0x01); files written by a future binary that introduces a new version will be rejected loudly at parse time, not silently at row decode. See Header Layout.

Performance Notes

Audience: operators sizing a Pulse deployment, and library users debugging memory or latency surprises.

Pulse is built to keep “the streaming path” the default for most analytical requests. When the engine has to leave that path it says so — via the Streamable flag in pulse api predict — and falls back to a buffered execution. This page tells you what stays streaming, what buffers, and how to read predict’s diagnostics.

LLM agents using MCP: there is no direct skill counterpart for this page — debugging-with-predict covers how to drive predict; this page tells operators what predict’s answers imply.

Streaming path: what stays out of memory

The streaming Process path covers four orchestrator modes (from CLAUDE.md → What streams today):

Single-pass streaming. No-group requests with online aggregators (COUNT, SUM, AVG, STDDEV, VARIANCE, RANGE, FREQUENCY, MODE, SKEWNESS, KURTOSIS, DISTINCT_COUNT) on numeric (non-decimal) fields. Row-local attributes (FORMULA, DATE_PART) apply inline.
Grouped streaming. Groupers implementing the streaming key path (GROUP_CATEGORY, GROUP_RANGE, GROUP_ROUNDED) drive per-key online aggregator buckets. Memory is O(distinct_groups × per-aggregator-state).
Two-pass streaming. Two-pass attributes (ATTR_ZSCORE, ATTR_TSCORE, ATTR_NORMALIZED) compute population stats via Welford-Pébaÿ pass 1, then emit per-row values in pass 2.
Streaming features. Every registered FEAT_* operator implements the streaming computer interface and composes with the three modes above.

These paths benefit from three optimisations landed during the streaming refactor (commit cdd72d5): record reuse (the same record buffer flows through the pipeline), zero-allocation decoding into reused buffers, and an mmap reader for .pulse files large enough to benefit from demand paging.

Buffered path: when Pulse has to materialise

pulse api predict reports Streamable=false and lists every buffering reason. The current set, from CLAUDE.md:

AGG_MEDIAN, AGG_PERCENTILE, and AGG_ZSCORE — require sorts or summed deviations.
ATTR_PERCENTILE — sorted view of every value; no streaming algorithm preserves exact rank.
GROUP_QUANTILE, GROUP_DATE — finalize-time work over the full set.
Window operators (WIN_*) — operate on a sorted post-aggregate row set.
Decimal-typed field aggregations — precision-preserving path.
Two-pass attributes combined with features or groups — orchestration matrix not yet extended.
Tier-1 statistical tests combined with groupers, features, or two-pass attributes — same orchestration limit.
Tier-2 post-tests (req.PostTests) — always run after the result set is materialised, regardless of TestType.

Reading predict output

pulse api predict --request request.json --json | jq '.data | {streamable, streamable_reasons}'

{
  "streamable": false,
  "streamable_reasons": [
    "AGG_MEDIAN on field price"
  ]
}

If streamable_reasons is empty and streamable=true, the request executes without buffering. Each reason is a one-line gate that pushed the request to the buffered path; you can drop or substitute the offending operator (e.g., AGG_AVG instead of AGG_MEDIAN) and re-run predict.
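
A Go caller can apply the same check programmatically. The sketch below decodes only the two fields selected by the jq filter above; the predictEnvelope struct and the bufferingReasons helper are hypothetical names, and the real envelope carries more fields than are modelled here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// predictEnvelope models only data.streamable and data.streamable_reasons.
// encoding/json ignores any other fields in the envelope.
type predictEnvelope struct {
	Data struct {
		Streamable        bool     `json:"streamable"`
		StreamableReasons []string `json:"streamable_reasons"`
	} `json:"data"`
}

// bufferingReasons returns nil when the request streams, otherwise the
// reasons that pushed it to the buffered path.
func bufferingReasons(raw []byte) ([]string, error) {
	var env predictEnvelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, err
	}
	if env.Data.Streamable {
		return nil, nil
	}
	return env.Data.StreamableReasons, nil
}

func main() {
	raw := []byte(`{"data":{"streamable":false,"streamable_reasons":["AGG_MEDIAN on field price"]}}`)
	reasons, err := bufferingReasons(raw)
	if err != nil {
		panic(err)
	}
	for _, r := range reasons {
		fmt.Println("buffered because:", r)
	}
}
```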

Memory rules of thumb

Path	Memory profile
Single-pass streaming	Constant — `O(aggregator state)`
Grouped streaming	`O(distinct_groups × per-aggregator state)`
Two-pass streaming	Constant; the cost is two sequential scans of the file (typically served from the OS page cache)
Buffered	`O(filtered_rows × output_width)` for the working set, plus per-operator state
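
The grouped-streaming row is easiest to see in code. In the minimal sketch below (the row and bucket types are hypothetical, and a single AVG stands in for the aggregator set), the rows are visited once and the only state that survives is one bucket per distinct key.

```go
package main

import "fmt"

// bucket is the per-key aggregator state: here, just count and sum.
type bucket struct {
	count int
	sum   float64
}

// row is a hypothetical two-field record: a grouping key and a value.
type row struct {
	key   string
	value float64
}

// groupedAvg visits each row once. Its working memory grows with the number
// of distinct keys, not with the number of rows.
func groupedAvg(rows []row) map[string]float64 {
	buckets := make(map[string]*bucket)
	for _, r := range rows {
		b, ok := buckets[r.key]
		if !ok {
			b = &bucket{}
			buckets[r.key] = b
		}
		b.count++
		b.sum += r.value
	}
	out := make(map[string]float64, len(buckets))
	for k, b := range buckets {
		out[k] = b.sum / float64(b.count)
	}
	return out
}

func main() {
	avg := groupedAvg([]row{{"eu", 10}, {"us", 4}, {"eu", 20}, {"us", 6}})
	fmt.Println(avg["eu"], avg["us"]) // prints "15 5"
}
```

A high-cardinality grouping key therefore costs memory even on the streaming path, which is worth checking before grouping on something like a user ID.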

Concurrency

pulse.ComposeParallel (CLI: pulse api compose --parallel N) fans ComposedRequest slots over a bounded worker pool. Workers share the engine’s read-only registries; each Process call constructs fresh stateful operators per request, so concurrent execution is safe. Defaults: MaxWorkers = GOMAXPROCS, FailFast = true. See Parallel Compose.

When to embed vs shell out

For high-throughput pipelines, embed Pulse directly via the Go library — you avoid one process boundary per request and can stream rows through your own writer with ProcessStream. For ad-hoc analysis, JSON-in/JSON-out via pulse api process --json is faster to write and easier to debug.

Troubleshooting

Audience: operators chasing a specific failure mode in production (file not found, permission errors, MCP transport issues, common error codes).

This page is organised by symptom. For per-code recovery detail (Message + Fixup templates), fetch metadata via the pulse_errors_lookup MCP tool ({"code": "PULSE_XXX"}) or pulse errors lookup CODE on the command line. The error-code-reference skill explains the envelope shape, the DOMAIN_CATEGORY naming convention, and the repair workflow that chains predict-side suggestions into structured fixups.

LLM agents using MCP: call pulse_errors_lookup for per-code detail — code=PULSE_XXX for one code, domain=PULSE to enumerate, query="..." for keyword search. The skill is the orientation; the tool is the catalog. This page focuses on operational symptoms that don’t reduce to a single error code.

“data directory required: set PULSE_DATA_DIR or pass --data-dir”

pulse mcp refuses to start. The MCP leaf is the one place the binary insists on a base directory because it enumerates cohorts at session start.

Fix: export PULSE_DATA_DIR in the client’s MCP config, or pass --data-dir /path/to/data on the command line. The pulse mcp page has the full example.

“file not found” / “no such file or directory”

The cohort path was resolved against the wrong base. Pulse prefers absolute paths; with PULSE_DATA_DIR set, relative paths resolve against it.

Fix: call pulse cohort inspect /absolute/path/data.pulse to verify the file is where you think it is. If you’re running inside pulse mcp, check the data-dir line on stderr at startup.

“permission denied”

Pulse runs as your user; it does not escalate. When deployed as an MCP process under a different user (e.g. via launchd / systemd), the cohort directory and files must be readable by that user.

Fix: check id inside the MCP startup banner on stderr; check the file mode with ls -l; widen the group as needed.

“invalid pulse magic bytes” / “unsupported pulse format version”

The file isn’t a .pulse file, or it was written by a newer binary that introduced a format version this binary does not know. The reader rejects unknown versions at parse time (see Header Layout) so an older binary doesn’t silently mis-decode a newer file.

Fix: verify the file with file path/to/data.pulse and the first nine bytes (hexdump -C). The expected magic is 50 55 4c 53 45 00 00 00 followed by a version byte (0x01 today).

“truncated pulse header”

The file is shorter than nine bytes or was cut off mid-write.

Fix: re-import. If you suspect a partial write, also check whether the writer was killed mid-flush — Pulse writes the header first, then the schema, then the records, so a truncated file usually fails here before any data is observed.
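
The three header symptoms above can be reproduced with a nine-byte check. The sketch below uses the magic and version bytes quoted in this section; checkPrefix, the error variables, and the order of the checks are assumptions, and this is a diagnostic illustration rather than Pulse's header parser.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// Magic bytes and version as quoted in the troubleshooting text above.
var pulseMagic = []byte{0x50, 0x55, 0x4c, 0x53, 0x45, 0x00, 0x00, 0x00}

const supportedVersion = 0x01

var (
	errTruncated = errors.New("truncated pulse header")
	errBadMagic  = errors.New("invalid pulse magic bytes")
	errVersion   = errors.New("unsupported pulse format version")
)

// checkPrefix validates the first nine bytes of a file: eight magic bytes
// followed by one version byte.
func checkPrefix(b []byte) error {
	if len(b) < len(pulseMagic)+1 {
		return errTruncated
	}
	if !bytes.Equal(b[:len(pulseMagic)], pulseMagic) {
		return errBadMagic
	}
	if b[len(pulseMagic)] != supportedVersion {
		return errVersion
	}
	return nil
}

func main() {
	good := append(append([]byte{}, pulseMagic...), 0x01)
	fmt.Println(checkPrefix(good))                // <nil>
	fmt.Println(checkPrefix(good[:5]))            // truncated pulse header
	fmt.Println(checkPrefix([]byte("not-pulse"))) // invalid pulse magic bytes
}
```

Feeding it the first nine bytes of a suspect file (for example, read with os.ReadFile and sliced) tells you which of the three sections applies.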

`SERVICE_VALIDATION` errors

A field name in the request doesn’t exist in the cohort, or an operator targets a field of the wrong type.

Fix: run pulse api predict on the same request — predict diagnoses validation failures without executing. Common cases: typo in field name; numeric aggregation on a categorical field (warning code PULSE_AGG_NOT_MEANINGFUL_FOR_CATEGORICAL); two-pass attribute combined with a feature (currently buffered, not invalid — predict will flag this in streamable_reasons).

`PULSE_IMPORT_*` errors

Import-time failures. The two most common:

PULSE_IMPORT_CATEGORICAL_OVERFLOW — too many distinct values for the chosen categorical width. Either bump the width (categorical_u16 / categorical_u32), drop the categorical encoding, or filter the source before re-importing. See Dictionary Blocks.
PULSE_IMPORT_DESCRIPTION_TOO_LONG — schema field description exceeds 1000 bytes. Trim it.

`PULSE_FIELD_DESCRIPTION_LOW_QUALITY`

A warning by default, an error under --strict. The description is empty, under ten characters, or a generic placeholder ("n/a", "tbd", "unknown", "field", "data", "value", "column").

Fix: edit the description in the schema JSON, re-import with --schema.

MCP “tool not found” / “no tools registered”

An MCP client connects but sees no Pulse tools.

Fix: check the client’s MCP log (Claude Desktop surfaces this in ~/Library/Logs/Claude/). Common causes: pulse binary is not on PATH, the wrong working directory, or PULSE_DATA_DIR is not set in the MCP env block. Re-read pulse mcp.

mmap / file-mapping failures

On very large .pulse files the streaming reader uses memory mapping where available. If your environment forbids mmap (some sandboxed containers, very locked-down macOS configurations), the reader falls back to a buffered read.

Fix: typically transparent. If you suspect a regression, run with verbose Go runtime tracing or compare against a non-mmap file by copying it to /tmp and re-running.

When in doubt: predict, then process

Almost every “why doesn’t this work” question can be answered by running:

pulse api predict --request request.json --json

Predict reads only the header and schema — it never touches record data — and returns the full envelope of errors, warnings, and the streamable flag. If predict says valid:true and process still fails, the bug is in the processing layer, not the request.

Development Setup

Audience: new contributors getting their first PR ready.

This page is the short version. The fuller treatment of the repo’s conventions, CI gates, and Update Demand lives in the Internals section and in CLAUDE.md at the repository root.

Clone

git clone https://github.com/frankbardon/pulse.git
cd pulse

Tooling

Pulse needs only the Go toolchain — there is no Node, Python, or container build. Install Go 1.24+ (see go.mod for the canonical version).

The repo also uses staticcheck for make lint; it is auto-installed on first run via go run.

Common targets

Command	What it does
`make build`	Builds the CLI binary to `bin/pulse` (default goal)
`make test`	Runs `go test ./...`
`make fmt`	Runs `go fmt ./...`
`make vet`	Runs `go vet ./...`
`make lint`	Runs `go vet` then `staticcheck ./...`
`make cover`	Runs tests with coverage; outputs `coverage.out`
`make clean`	Removes `bin/` and `coverage.out`

A .env file at the repo root is auto-loaded and exported, so PULSE_DATA_DIR and any other PULSE_* env vars can live there for local development.

Run the binary you just built

make build
./bin/pulse --version
./bin/pulse --json | head -20

The CLI tree itself is mapped in the CLI Tour.

Where things live

The package layout is documented at Internals → Package Layout. Two pointers worth knowing on day one:

Public facade: pulse.go — every Go embedder API lives here.
CLI internals: internal/cli/ — one file per command group; never put processing logic here.

Read this before writing code

Style Guide
Testing Conventions
Pull Request Process
The Update Demand — what doc/skill updates ride alongside what code changes.

Style Guide

Audience: anyone writing code or docs in the Pulse repository.

This page summarises the conventions enforced by review and by CI. The authoritative source is the “Code Conventions” section of CLAUDE.md; copy that file’s rules when in doubt.

Go style

Standard gofmt / go vet cleanliness — make lint is the gate.
Module path is github.com/frankbardon/pulse. The standard-library io collision is handled by aliasing the project’s package as pio "github.com/frankbardon/pulse/io".
Library-first: business logic lives in library packages, never in cmd/pulse/. The CLI parses flags, calls the library, formats output.
All file I/O routes through the injected afero.Fs — never os.Open/os.ReadFile directly in library code, because that defeats fs.NewMemMap() for tests and the extension hook for custom storage backends.

Naming

Component types use SCREAMING_SNAKE_CASE: AGG_COUNT, ATTR_ZSCORE, FILTER_INCLUDE, GROUP_CATEGORY, WIN_LAG, FEAT_LOG, TEST_T.
Error codes use DOMAIN_CATEGORY format, organised by the six domains listed in CLAUDE.md (ENCODING, PROCESSING, SERVICE, DATA, CLI, PULSE).
Field types use lowercase snake (u8, nullable_bool, categorical_u16, decimal128).

Structural bans

These are enforced by non-skippable CI gates:

Ban	Enforced by
`descriptor/` MUST NOT import `service/` or `processing/`	`TestPredictNoExecutionImports`
`descriptor/` MUST NOT use `fmt.Sprintf` for JSON construction	`TestDescriptorNoFmtSprintf`
Golden files in `descriptor/testdata/` MUST NOT be hand-edited	`TestGoldensNotHandEdited`
No predecessor-project string prefixes (legacy “Orbit” naming) in error codes or constants	`TestNoOrbitReferences`, `TestNoOrbitPrefix`
`CLAUDE.md` MUST mention every `PULSE_*` env var, every non-skippable gate, the current `format_version`	`TestClaudeMd*` family

See the Pull Request Process for how these surface during review.
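
To make the first ban concrete, here is one way an import gate of this kind could be written. It is a sketch under stated assumptions, not the body of TestPredictNoExecutionImports: bannedImports and sampleSrc are hypothetical, the banned paths are derived from the module path given in the Go style section, and a real gate would walk the files on disk instead of a string constant. It uses go/parser in ImportsOnly mode, which reads the import clause and stops.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// bannedPrefixes are the import paths descriptor/ must not depend on.
// Assumed from the module path plus the package names in the table.
var bannedPrefixes = []string{
	"github.com/frankbardon/pulse/service",
	"github.com/frankbardon/pulse/processing",
}

// bannedImports parses only the import clause of src and returns every
// imported path that equals, or sits beneath, a banned prefix.
func bannedImports(filename, src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var hits []string
	for _, spec := range file.Imports {
		path, err := strconv.Unquote(spec.Path.Value)
		if err != nil {
			return nil, err
		}
		for _, banned := range bannedPrefixes {
			if path == banned || strings.HasPrefix(path, banned+"/") {
				hits = append(hits, path)
			}
		}
	}
	return hits, nil
}

// sampleSrc is a made-up file that violates the ban.
const sampleSrc = `package descriptor

import (
	"encoding/json"

	"github.com/frankbardon/pulse/processing"
)
`

func main() {
	hits, err := bannedImports("predict.go", sampleSrc)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("banned imports:", hits)
}
```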

Comments and prose

Public Go symbols carry a godoc-shaped comment opening with the symbol name.
Skill files use YAML frontmatter (name, description, type, applies_to) and are LLM-facing — keep them in MCP voice (tool calls, JSON payloads). The human-facing equivalent is this site; cross-link from each side.
mdBook chapters open with a one-sentence summary and an Audience line. See any of the already-authored chapters in this site for the tone.
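
The frontmatter requirement for skill files can be checked mechanically. The following sketch assumes a leading block delimited by "---" lines with top-level "key: value" entries; it is deliberately not a YAML parser, and missingFrontmatterKeys and the sample skill text are hypothetical. The four required keys come from the bullet above.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredKeys are the frontmatter fields every skill file must carry.
var requiredKeys = []string{"name", "description", "type", "applies_to"}

// missingFrontmatterKeys returns the required keys that are absent from
// the leading "---" delimited block of doc. Only unindented
// "key: value" lines are recognised; nested YAML is skipped.
func missingFrontmatterKeys(doc string) []string {
	lines := strings.Split(doc, "\n")
	if strings.TrimSpace(lines[0]) != "---" {
		// No frontmatter at all: everything is missing.
		return append([]string(nil), requiredKeys...)
	}
	seen := map[string]bool{}
	for _, line := range lines[1:] {
		if strings.TrimSpace(line) == "---" {
			break // end of the frontmatter block
		}
		if strings.HasPrefix(line, " ") || strings.HasPrefix(line, "\t") {
			continue // indented line: part of a nested value
		}
		if key, _, ok := strings.Cut(line, ":"); ok {
			seen[strings.TrimSpace(key)] = true
		}
	}
	var missing []string
	for _, key := range requiredKeys {
		if !seen[key] {
			missing = append(missing, key)
		}
	}
	return missing
}

func main() {
	// A made-up skill file that forgets applies_to.
	skill := "---\nname: example-skill\ndescription: An example.\ntype: reference\n---\n\nBody text.\n"
	fmt.Println("missing:", missingFrontmatterKeys(skill))
}
```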

The Update Demand

The single most important convention: if your code change ships without the corresponding CLAUDE.md and skill updates, CI will fail. The Update Demand chapter is the authoritative table of triggers and the gates that enforce them. Read it before opening a PR that touches a registered surface (new aggregator, new error code, new CLI flag, new field type, …).

Testing Conventions

Audience: contributors writing tests, regenerating goldens, or trying to figure out which CI gate to run locally before pushing.

This page draws on the “CI gates” and “Common Claude Code Workflows” sections of CLAUDE.md.

Style

Table-driven tests are the default. Put cases in a []struct{...} with a name field, run with t.Run(tc.name, func(t *testing.T)).
Hermetic by construction: anything that touches the filesystem uses fs.NewMemMap() so tests don’t depend on disk state.
New code lands with tests in the same PR — TDD first, then implementation. A test that passes without the implementation is suspicious; the test is probably wrong.
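
The table-driven shape described above looks roughly like this. The function under test, domainOf, and the error codes in the cases are hypothetical and exist only to give the table something to exercise. In a real package the test would live in a _test.go file; it is shown in a single file with a small main so the sketch also runs with go run.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// domainOf returns the text before the first underscore, or "" when
// the code contains no underscore. Hypothetical helper.
func domainOf(code string) string {
	domain, _, ok := strings.Cut(code, "_")
	if !ok {
		return ""
	}
	return domain
}

// TestDomainOf keeps every case in one slice of structs with a name
// field and runs each as a subtest, so failures report the case name.
func TestDomainOf(t *testing.T) {
	cases := []struct {
		name string
		in   string
		want string
	}{
		{name: "two segments", in: "DATA_MISSING", want: "DATA"},
		{name: "three segments", in: "CLI_FLAG_UNKNOWN", want: "CLI"},
		{name: "no underscore", in: "DATA", want: ""},
		{name: "empty input", in: "", want: ""},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := domainOf(tc.in); got != tc.want {
				t.Errorf("domainOf(%q) = %q, want %q", tc.in, got, tc.want)
			}
		})
	}
}

// main exists only so the file can also be executed directly.
func main() {
	fmt.Println(domainOf("DATA_MISSING"))
}
```

Adding a case is a one-line change to the table, which is the main reason the style is the default.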

Running tests

# Full suite
go test ./...

# Single package
go test ./processing/...

# Verbose, specific test
go test ./service/... -v -run TestProcess

# Coverage report
make cover

# Fuzz the .pulse header
go test ./encoding/... -fuzz FuzzPulseFileHeader -fuzztime 30s

Non-skippable CI gates

These tests guard structural invariants. When one of them fails, the change under review has broken an invariant; do not edit the test to make it pass. If the invariant itself no longer fits, revisit the convention first. Their full names appear in CLAUDE.md so the TestClaudeMdMentionsAllNonSkippableGates self-check can find them.

Gate	Guards
`TestPredictNoExecutionImports`	`descriptor/predict.go` does not import `service/` or `processing/`
`TestDescriptorNoFmtSprintf`	`descriptor/` never builds JSON via `fmt.Sprintf`
`TestGoldensNotHandEdited`	`descriptor/testdata/*` hashes match the generator
`TestClaudeMdMentionsFormatVersion`	CLAUDE.md references the current envelope `format_version`
`TestClaudeMdMentionsAllEnvVars`	Every `PULSE_*` env var has a CLAUDE.md row
`TestClaudeMdMentionsAllNonSkippableGates`	This very table is the source — CLAUDE.md must list every gate by name
`TestUpdateDemandTableCovers`	The Update Demand table covers every registered component category
`TestPerPackageCoverageFloors`	Package directories exist and meet documented coverage floors
`TestNoOrbitReferences`, `TestNoOrbitPrefix`, `TestNoOrbitPrefixes`	No predecessor-project string prefixes leak in
`TestSkillsCoverAll*`	Skill files mention every registered component, error code, distribution, CLI leaf, field type, MCP tool
`TestSkillsManifestConsistent`	`skills/index.json` matches the `.md` files and frontmatter
`TestSkillsFrontmatter_RequiredFields`	Every skill has `name`, `description`, `type`, `applies_to`
`TestRegistryStreamabilityMatchesTypes`	Aggregator `OnlineAggregator` capability matches `AggregationType.Streamable()`
`TestPredict_Streamable_MatchesRuntime`	`PredictResult.Streamable` mirrors `processing.CanStreamRequest`
`TestStreamability_*Known`	Every `All*Types()` entry has a streamability table row
`TestCanStreamRequest_RegressionMatrix`	Regression matrix on the exported `CanStreamRequest` helper
`TestManifest*Complete`	Manifest enumerates every registered operator, test, distribution, MCP tool, error code
`TestManifestStreamableMatchesTypes`	Manifest `Streamable` flags mirror the type-level methods
`TestCodesHaveFixups`, `TestSkillsErrorCodeFixupsDocumented`	Each error code has a fixup template and the skill row to match
`TestDefaults_Applied`	Smart-default operator-type inference behaves as documented
`TestNaturalQuery_HeuristicGrammar`	The `internal/query` parser fixtures cover its documented shapes

(See CLAUDE.md “CI gates” for the full prose; this table is the quick-reference.)

Running a subset of gates locally

# All descriptor contract gates
go test ./descriptor/ -run 'TestPredictNoExecution|TestDescriptorNoFmtSprintf|TestGoldensNotHandEdited'

# Skill coverage gates
go test ./skills/ -run 'TestSkillsCoverAll|TestSkillsManifestConsistent|TestSkillsFrontmatter'

# CLAUDE.md gates
go test . -run 'TestClaudeMd|TestUpdateDemandTable'

# Predecessor-reference scrub
go test . -run TestNoOrbitReferences

Regenerating golden files

Golden files live in descriptor/testdata/. Each ends with a // golden-hash: <sha256> line; TestGoldensNotHandEdited verifies the hash. After a legitimate change to the generator:

go test ./descriptor/ -run 'Test.*Golden' -update
go test ./descriptor/ -run TestGoldensNotHandEdited   # confirms the new hash sticks

Never hand-edit a golden file — the gate will catch you.
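
The trailer mechanism can be illustrated with a short sketch. It assumes the hash covers every byte that precedes the trailer line and is written as lowercase hex; the source does not state either detail, so treat both as assumptions. The seal and verify helpers are hypothetical and are not the generator or the gate.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

const trailerPrefix = "// golden-hash: "

// seal appends a trailer line whose hash is computed over body.
// body is expected to end with a newline.
func seal(body string) string {
	sum := sha256.Sum256([]byte(body))
	return body + trailerPrefix + hex.EncodeToString(sum[:]) + "\n"
}

// verify reports whether the final line of golden is a trailer whose
// hash matches the bytes that precede it.
func verify(golden string) bool {
	trimmed := strings.TrimSuffix(golden, "\n")
	idx := strings.LastIndex(trimmed, "\n")
	body, last := trimmed[:idx+1], trimmed[idx+1:]
	if !strings.HasPrefix(last, trailerPrefix) {
		return false
	}
	sum := sha256.Sum256([]byte(body))
	return strings.TrimPrefix(last, trailerPrefix) == hex.EncodeToString(sum[:])
}

func main() {
	golden := seal("{\"rows\": 3}\n") // made-up golden content
	fmt.Println("fresh file verifies:", verify(golden))

	// Simulate a hand edit: change the body but keep the old trailer.
	edited := strings.Replace(golden, "3", "4", 1)
	fmt.Println("hand-edited file verifies:", verify(edited))
}
```

Under these assumptions any edit to the body invalidates the trailer, which is why regeneration through the generator is the only route that keeps the gate green.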

Adding a new gate

If your change introduces a structural invariant, add a test for it under the same naming convention (TestX), and add it to the table in CLAUDE.md so TestClaudeMdMentionsAllNonSkippableGates recognises it. The Update Demand lists this as a trigger row.
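
A self-check of the kind TestClaudeMdMentionsAllNonSkippableGates performs can be approximated in a few lines. This sketch is not that test's implementation: missingGates is hypothetical, and the document text is passed in as a string rather than read from CLAUDE.md. It matches on word boundaries rather than plain substrings, because one gate name can be a prefix of another (TestNoOrbitPrefix and TestNoOrbitPrefixes both appear in the table above).

```go
package main

import (
	"fmt"
	"regexp"
)

// missingGates returns the gate names that do not appear in doc as
// whole words. A plain substring search would wrongly accept
// TestNoOrbitPrefix when only TestNoOrbitPrefixes is present.
func missingGates(doc string, gates []string) []string {
	var missing []string
	for _, gate := range gates {
		re := regexp.MustCompile(`\b` + regexp.QuoteMeta(gate) + `\b`)
		if !re.MatchString(doc) {
			missing = append(missing, gate)
		}
	}
	return missing
}

func main() {
	// Stand-in for the contents of CLAUDE.md.
	doc := "Gates: `TestGoldensNotHandEdited`, `TestNoOrbitPrefixes`."
	gates := []string{
		"TestGoldensNotHandEdited",
		"TestNoOrbitPrefix",
		"TestNoOrbitPrefixes",
	}
	fmt.Println("not listed:", missingGates(doc, gates))
}
```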

Pull Request Process

Audience: contributors preparing to open or land a PR.

This page is a checklist. The longer prose lives in CONTRIBUTING.md and the Update Demand chapter.

1. Branch and commit shape

One feature or fix per PR. Keep the diff focused.
Conventional Commits in the subject line: feat(...), fix(...), chore(...), docs(...), perf(...), refactor(...), test(...).
The PR title is usually the lead commit’s subject.
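
A subject line can be checked against this convention with a single regular expression. The sketch below accepts only the seven types listed above, an optional parenthesised scope, and the optional "!" breaking-change marker from the Conventional Commits specification; whether the project accepts other types is not stated, so the list is an assumption. isConventionalSubject and the sample subjects are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// subjectRE matches "type(scope)!: description", where the scope and
// the "!" marker are optional and the description is non-empty.
var subjectRE = regexp.MustCompile(
	`^(feat|fix|chore|docs|perf|refactor|test)(\([^()\s]+\))?!?: \S.*$`)

// isConventionalSubject reports whether subject follows the pattern.
func isConventionalSubject(subject string) bool {
	return subjectRE.MatchString(subject)
}

func main() {
	// Made-up subjects for illustration.
	for _, subject := range []string{
		"feat(io): add a reader",
		"fix: handle empty header",
		"Added some stuff",
		"feat(io) add a reader",
	} {
		fmt.Printf("%-28q %v\n", subject, isConventionalSubject(subject))
	}
}
```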

2. Tests first

A PR that adds a new aggregator, error code, field type, I/O format, statistical test, or skill must include tests in the same PR. The testing-first preference is documented in Testing Conventions. Implementation that lands without tests will be sent back; tests that pass without the implementation are suspicious and probably wrong.

3. The Update Demand

This is the single biggest source of “your PR was bounced” feedback. The full table lives in The Update Demand; the short version is:

Change category	Doc/skill update required in the same PR
Registered aggregator / attribute / filterer / grouper	The matching skill file + the operator capability table
Registered window / feature / synth distribution / statistical test	Same — skill + capability file
Error code (added / removed / renamed)	`errors/codes.go`, `skills/error-code-reference.md`, `descriptor/capabilities_errors.go`
CLI leaf (added or flag added)	`CLAUDE.md` “Common Claude Code Workflows” + `skills/getting-started.md` if user-facing
`--json` envelope change	`CLAUDE.md` “Output Format Contract”
`.pulse` file format change	`CLAUDE.md` “Code Conventions” + `skills/cohort-schema-design.md`
New environment variable	`CLAUDE.md` “Build / Dev / Test Workflow” + `skills/getting-started.md`
New non-skippable CI gate	List it by name in `CLAUDE.md`

If you find yourself wanting to defer the doc update to a follow-up PR, stop. The follow-up PR will not happen, and the next contributor will read stale guidance. Update in the same PR or do not merge.

4. Pre-flight checks

make fmt
make lint
make test

For change-category-specific gates, see Testing → Running a subset of gates locally.

5. Open the PR

Use the bug-report or feature-request template as a starting point if applicable.
Fill in the PR template’s “Summary” and “Test plan” sections.
Link related issues with Closes #N.
Do not push --force to main. Force-pushing your own feature branch is fine before review starts.

6. Review and CI

CI runs the full go test ./... plus the non-skippable gates listed in Testing → Non-skippable CI gates. A failing gate means a structural invariant is broken, not a flaky test; fix the root cause rather than retrying.

When a pre-commit hook or PR check fails, create a new commit with the fix. Do not git commit --amend after a hook failure: a failed pre-commit hook means the new commit was never created, so --amend would rewrite the previous commit, which may already have been pushed.

7. Merge

Squash-merge is the default; the squash message follows Conventional Commits.
Once merged, the deploy workflow rebuilds and publishes this docs site to https://frankbardon.github.io/pulse/.

For changes that introduce a new architectural decision, also update the relevant section of CLAUDE.md and reference the PRD (if one exists) in the PR description.
