Pulse
Pulse is a self-describing, high-performance tabular data processing engine. It ships as a Go library (github.com/frankbardon/pulse) and as a single CLI binary (bin/pulse). Every .pulse file carries its own schema in the header, so consumers (programs, agents, and humans) can discover what a file contains without an external catalog.
The library is the primary deliverable. The CLI is a thin adapter that exposes the same operations on the command line, and an embedded MCP server (pulse mcp) exposes them to LLM agents.
Where to go from here
| If you are… | Start with |
|---|---|
| New to Pulse | Installation → Your First Cohort → CLI Tour |
| Driving Pulse from the shell | Command Line Reference |
| Embedding Pulse in a Go program | Library Embedding |
| Curious about the binary format | .pulse File Format |
| Hacking on Pulse itself | Internals and Contributing |
| Wiring Pulse into an LLM agent | MCP Integration (Pointer), then the in-binary skill pack |
LLM-facing surface
LLM agents do not read this site. Pulse exposes a Model Context Protocol server (pulse mcp) and ships 19 embedded skills under skills/ that LLMs load on demand via the pulse_skills_list and pulse_skills_get tools. The skill voice is MCP-only (tool calls, JSON payloads). This site is the human-facing counterpart — same engine, different idiom.
See How LLMs Use Pulse for a short pointer table.
Source of truth
The authoritative architectural contract for Pulse lives in the repository’s CLAUDE.md. When this site and CLAUDE.md disagree, CLAUDE.md wins; please open an issue.
- Repository: https://github.com/frankbardon/pulse
- Hosted docs: https://frankbardon.github.io/pulse/
Installation
Audience: new users who want a working pulse binary on their PATH.
This page walks through installing Pulse, the prerequisites it needs, and how to verify the install. Pulse is distributed as a single static Go binary; there is no installer, no daemon, and no config file.
LLM agents using MCP: see the getting-started skill via pulse_skills_get. It covers session bootstrap rather than local install.
Prerequisites
| Requirement | Minimum |
|---|---|
| Go toolchain | 1.24 (see go.mod) |
| OS | Linux, macOS, or Windows (anywhere Go cross-compiles) |
| Disk | A few MB for the binary; cohort files live wherever you point PULSE_DATA_DIR |
go.mod is the source of truth for the supported Go version; if it drifts
from this page the go.mod value wins.
Install with go install
The fastest path on a developer machine:
go install github.com/frankbardon/pulse/cmd/pulse@latest
This drops a pulse binary in $(go env GOBIN), or in $(go env GOPATH)/bin when GOBIN is unset (typically ~/go/bin).
Make sure that directory is on your PATH.
Pin a specific release by replacing @latest with a tag:
go install github.com/frankbardon/pulse/cmd/pulse@v0.2.0
Build from source
The same binary, built reproducibly from a checkout:
git clone https://github.com/frankbardon/pulse.git
cd pulse
make build
# Binary at ./bin/pulse
The Makefile is documented in CLAUDE.md → Build / Dev / Test Workflow;
the relevant targets are make build, make test, make lint, and
make cover.
Configure the data directory
Pulse reads and writes .pulse files under a base directory called
PULSE_DATA_DIR. Most commands accept absolute paths and will work without
it, but pulse mcp requires the variable so the MCP server can enumerate
cohorts:
export PULSE_DATA_DIR=/var/data/pulse
The repo Makefile auto-loads a .env file from the repo root, so you can
also drop PULSE_DATA_DIR=... there for local development.
PULSE_DATA_DIR is the only required environment variable, and as noted
above it is mandatory only for pulse mcp. See Flag Reference for the
full list of CLI flags and environment knobs.
Verify
pulse --version
pulse --json | head -20
pulse --json prints the root manifest — the full self-description of
commands, components, field types, and embedded skills. If you see a
top-level format_version: "1.0" envelope, the install is working.
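For scripted installs, the same check can be automated. The sketch below is a minimal Python example that assumes only what this page states: the output of pulse --json is a JSON object with a top-level format_version key. The helper name is_pulse_envelope and the sample payload are illustrative and not part of Pulse.

```python
import json

def is_pulse_envelope(raw: str, expected_version: str = "1.0") -> bool:
    """Return True if raw parses as a JSON object whose format_version matches."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(doc, dict) and doc.get("format_version") == expected_version

# Illustrative stand-in for the output of `pulse --json`; a real manifest is much larger.
sample = '{"format_version": "1.0", "data": {}, "errors": [], "warnings": []}'
print(is_pulse_envelope(sample))
```

In a real script the raw string would come from running the binary, for example with subprocess.run(["pulse", "--json"], capture_output=True, text=True).stdout.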
Where to go next
- New to the file format and vocabulary? Your First Cohort
- Want a quick map of every command? CLI Tour
- Embedding Pulse in a Go program? Go API Overview
- Wiring Pulse into an MCP-aware client? pulse mcp
Your First Cohort
Audience: new CLI users. This is a five-minute tour: import a CSV,
inspect the resulting .pulse file, run an aggregation, and export the
result back.
LLM agents using MCP: the equivalent tour for an agent is the getting-started skill, fetched via pulse_skills_get. That skill speaks in tool calls and JSON payloads; this page speaks in shell commands.
1. Pick a CSV
For this walkthrough we’ll assume a file called sales.csv with columns
like:
order_id,region,product,units,revenue,sold_on
1,west,widget,3,29.97,2024-01-04
2,east,gadget,1,19.99,2024-01-04
3,west,widget,7,69.93,2024-01-05
...
Any CSV with a header row works. Pulse also imports TSV, NDJSON, JSON-array, Parquet, Arrow IPC, and Excel — see Flag Reference for per-format flags.
2. Import to a .pulse file
pulse import csv --input sales.csv --output sales.pulse
Pulse samples up to 500 rows by default to infer a schema (you can change
that with --sample-rows). Each column gets a typed binary representation
and, if it looks like a low-cardinality string, a categorical dictionary.
Want to control the schema explicitly? Generate a template, edit it, and re-import:
# Editable schema template
pulse import schema-template sales.csv > sales.schema.json
# Edit sales.schema.json — set types, add descriptions
# Then import with the schema
pulse import csv --input sales.csv --schema sales.schema.json --output sales.pulse
See Field Types for the type catalog and Dictionary Blocks for how categoricals are encoded.
3. Inspect
The .pulse file is fully self-describing. Read it back:
pulse cohort inspect sales.pulse
Output is a table of fields, their types, and the description string
stored in the header. Add --json for the structured envelope, or
--full-dict to print every categorical entry instead of truncating
after 100.
pulse cohort inspect sales.pulse --json
The envelope is documented in pulse cohort inspect.
4. Validate a request before running it
Pulse separates validation from execution. Write a tiny request file:
{
"cohort": {"filename": "sales.pulse"},
"groups": [{"type": "GROUP_CATEGORY", "field": "region"}],
"aggregations": [
{"type": "AGG_COUNT", "field": "order_id", "label": "orders"},
{"type": "AGG_SUM", "field": "revenue", "label": "total_revenue"}
]
}
Save it as request.json, then check whether it makes sense against the
cohort’s schema:
pulse api predict --request request.json
You’ll see Valid: true, the schema’s field count, and any warnings
(e.g., a numeric aggregation such as AGG_SUM applied to a categorical
field). Predict never reads record data, so it’s safe to iterate on a
request without touching a multi-GB cohort.
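When requests are generated rather than hand-written, it can help to assemble the JSON in code. The following is a small Python sketch that reproduces the request shown above; build_request is a hypothetical helper, and the only keys it emits are the ones that appear in this walkthrough.

```python
import json

def build_request(cohort_file, group_field, aggregations):
    """Assemble a request dict in the shape used in this walkthrough."""
    return {
        "cohort": {"filename": cohort_file},
        "groups": [{"type": "GROUP_CATEGORY", "field": group_field}],
        "aggregations": [
            {"type": agg_type, "field": field, "label": label}
            for agg_type, field, label in aggregations
        ],
    }

request = build_request(
    "sales.pulse",
    "region",
    [("AGG_COUNT", "order_id", "orders"), ("AGG_SUM", "revenue", "total_revenue")],
)
text = json.dumps(request, indent=2)
print(text)  # save this as request.json before running `pulse api predict`
```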
See pulse api predict and the
debugging-with-predict skill for the full predict loop.
5. Execute
pulse api process --request request.json --json
The response is wrapped in the standard envelope (format_version,
data, errors, warnings). data carries the result rows and a
metadata block with total_rows and filtered_rows.
If your result is large, swap --json for --stream to receive rows as
NDJSON, one line at a time — useful for pipelines that don’t want to
buffer the whole result. See Streaming &
ProcessStream for which request shapes
actually stream end-to-end inside the engine vs which buffer.
6. Export
Done with the .pulse file? Export it to whatever format your downstream
tool understands:
pulse export csv --input sales.pulse --output sales.out.csv
pulse export parquet --input sales.pulse --output sales.out.parquet
pulse export excel --input sales.pulse --output sales.out.xlsx
To skip the intermediate .pulse entirely and convert in one shot, use
pulse convert source.csv target.parquet — see the top-level
README for
the full convert recipe.
What you didn’t see
- Compose: batch multiple requests in one call (pulse api compose).
- Ask: natural-language one-shot (pulse api ask).
- Sample / Facet: cheap read-only probes (api sample, api facet).
- Window / Feature / Test operators: pull from the skill pack (window-operations, feature-engineering, statistical-testing) via pulse skills show <name>.
For a full map of the CLI, see the CLI Tour.
CLI Tour
Audience: anyone who wants a map of every pulse subcommand before
diving into per-command details.
This page is a one-liner index of the CLI tree. Each row links to its
detailed chapter where applicable; commands that are minor variants of
each other (per-format import/export leaves) are listed compactly.
LLM agents using MCP: there is no equivalent skill; agents drive Pulse through MCP tools, not the CLI. Start at the getting-started skill instead.
Top-level groups
pulse [--json] [--slim]
├── import Tabular → .pulse (csv, tsv, ndjson, jsonarray, parquet, arrow, excel)
├── export .pulse → tabular (same format set)
├── convert Tabular → tabular, with .pulse as the transparent middle
├── cohort Inspect or filter an existing .pulse file
├── api Processing operations (process, compose, ask, predict, sample, facet)
├── synth Generate synthetic cohorts (from-schema, from-profile)
├── profile Capture a statistical profile of a cohort
├── skills Read the embedded LLM skill pack
└── mcp Run the Model Context Protocol server over stdio
Bare pulse --json prints the self-describing root manifest — commands,
components, field types, and skill metadata in one envelope. Pass
--slim to drop prose descriptions for size-sensitive clients.
API operations
The “processing facade” — these are the operations exposed via the Go library API and the MCP tool set.
| Command | Purpose | Chapter |
|---|---|---|
| pulse api process | Execute one request against a cohort | api process |
| pulse api compose | Execute multiple requests in batch / parallel | api compose |
| pulse api ask | Parse a natural-language query and execute | api ask |
| pulse api predict | Validate a request without executing | api predict |
| pulse api sample | Return up to N rows | api sample |
| pulse api facet | Return distinct values of a field | api facet |
Cohort lifecycle
| Command | Purpose | Chapter |
|---|---|---|
| pulse cohort inspect PATH | Read header + schema (no record data) | cohort inspect |
| pulse cohort filter | Write a filtered subset to a new .pulse | See Internals → Architecture |
Import / export / convert
pulse import <format> and pulse export <format> share the same flag
shape per format (--input, --output, --schema for import).
Supported formats today:
csv · tsv · ndjson · jsonarray · parquet · arrow · excel
Each format has a per-leaf command (e.g. pulse import csv). Run
pulse import --help or pulse export --help for the full list.
pulse convert SOURCE TARGET chains import + export with no
intermediate file unless --keep-pulse PATH is passed. Format is
auto-detected from extensions.
Synthetic data
| Command | Purpose | Chapter |
|---|---|---|
| pulse synth from-schema | Generate from a JSON spec | synth from-schema |
| pulse synth from-profile | Generate from a captured profile | synth from-profile |
| pulse profile create | Capture a profile from an existing cohort | profile create |
Self-description & LLM surface
| Command | Purpose | Chapter |
|---|---|---|
| pulse --json | Root manifest (commands, components, field types, skills) | manifest |
| pulse skills list | List embedded skills with metadata | How LLMs Use Pulse |
| pulse skills show NAME | Print a skill’s full markdown body | same |
| pulse mcp | Serve MCP over stdio | mcp |
Cross-cutting flags
Most leaves accept --json (envelope output), --no-defaults (turn off
smart operator-type inference), and the operation-specific flags
documented per page. Full list: Flag Reference.
The single environment variable to know is PULSE_DATA_DIR — see
Installation.
pulse api process
Audience: CLI users running a single processing request against a cohort.
pulse api process executes one types.Request
against a .pulse file and prints the result. It’s the most-used
leaf in the binary.
LLM agents using MCP: the equivalent surface is the pulse_process MCP tool; see skills/request-recipes.md for request skeletons.
Synopsis
pulse api process --request FILE [--json] [--stream] [--no-defaults]
Flags
| Flag | Alias | Type | Default | Purpose |
|---|---|---|---|---|
| --request | -r | string | (required) | Path to the request JSON file |
| --json | | bool | false | Emit the result wrapped in the JSON envelope |
| --stream | | bool | false | Stream rows as NDJSON (one per line) instead of buffering |
| --no-defaults | | bool | false | Disable smart operator-type inference; require explicit Type on every aggregation and grouper |
--stream and --json select different output shapes, so pass one or the
other: --stream emits one JSON object per line; --json emits the full envelope.
Request file shape
The request file is a types.Request
serialised to JSON. Minimal example:
{
"cohort": {"filename": "sales.pulse"},
"aggregations": [
{"type": "AGG_SUM", "field": "revenue", "label": "total_revenue"}
]
}
The full request grammar — filterers, groupers, attributes, window
operators, features, sort, tests, post-tests — is documented in
types.Request;
the LLM-facing companion is skills/request-recipes.md.
Output
Text mode (default)
Pretty-printed JSON of the Response struct: a data array of
result rows plus a metadata block with total_rows, filtered_rows,
and cohort_file.
--json
The standard envelope:
{
"format_version": "1.0",
"data": {
"data": [ /* result rows */ ],
"metadata": { "total_rows": 1000, "filtered_rows": 800, "cohort_file": "sales.pulse" }
},
"errors": [],
"warnings": []
}
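A consumer typically wants the rows and metadata, or a hard failure when errors is non-empty. The sketch below shows one way to unwrap the envelope in Python. PulseError and unwrap_process are hypothetical names, and the sample row is invented for illustration; only the envelope keys come from this page.

```python
import json

class PulseError(RuntimeError):
    """Raised when an envelope carries a non-empty errors array (hypothetical helper)."""

def unwrap_process(raw: str):
    """Return (rows, metadata) from a `pulse api process --json` envelope."""
    envelope = json.loads(raw)
    if envelope.get("errors"):
        raise PulseError(json.dumps(envelope["errors"]))
    payload = envelope["data"]
    return payload["data"], payload["metadata"]

# Illustrative envelope; the row contents are made up.
sample = """{
  "format_version": "1.0",
  "data": {
    "data": [{"total_revenue": "119.89"}],
    "metadata": {"total_rows": 1000, "filtered_rows": 800, "cohort_file": "sales.pulse"}
  },
  "errors": [],
  "warnings": []
}"""
rows, meta = unwrap_process(sample)
print(len(rows), meta["filtered_rows"], meta["total_rows"])
```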
--stream
NDJSON of result rows, one per line. No envelope, no metadata footer.
Pair with pulse api predict ahead of time to
confirm that streamable is true; shapes that predict reports as buffered
still emit via this path, but they materialise inside the engine first.
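Because --stream emits plain NDJSON, a consumer can process rows one at a time. The following Python sketch parses lines lazily and keeps rows above a threshold. The avg_rev field and the sample values are assumptions chosen to match the examples on this page, and stream_rows is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator

def stream_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-blank NDJSON line, without buffering the whole stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Illustrative rows; with the real CLI, iterate over the process's stdout instead.
sample_lines = [
    '{"region": "west", "avg_rev": 612.5}\n',
    '{"region": "east", "avg_rev": 140.0}\n',
    "\n",
]
high = [row for row in stream_rows(sample_lines) if row["avg_rev"] > 500]
print(high)
```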
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Any error — wrapped in the envelope’s errors array under --json, or printed to stderr otherwise |
Examples
Quick aggregation
cat > req.json <<'EOF'
{
"cohort": {"filename": "sales.pulse"},
"aggregations": [{"type": "AGG_COUNT", "field": "order_id", "label": "n"}]
}
EOF
pulse api process --request req.json
Filter, group, and aggregate
cat > req.json <<'EOF'
{
"cohort": {"filename": "sales.pulse"},
"filterers": [{"type": "FILTER_RANGE", "field": "revenue", "values": ["100", "10000"]}],
"groups": [{"type": "GROUP_CATEGORY", "field": "region"}],
"aggregations": [
{"type": "AGG_COUNT", "field": "order_id", "label": "orders"},
{"type": "AGG_AVERAGE", "field": "revenue", "label": "avg_rev"}
]
}
EOF
pulse api process --request req.json --json
Stream rows into a downstream pipeline
pulse api process --request req.json --stream | \
jq -c 'select(.avg_rev > 500)'
Related
- pulse api compose: batch of requests in one call
- pulse api ask: natural-language one-shot
- pulse api predict: validate without executing
- pulse api sample: quick row preview
- Library: pulse.New & Options, the Go-side equivalent of --no-defaults
- Library: Streaming & ProcessStream, what streams vs what buffers
pulse api compose
Audience: CLI users executing a batch of related requests in one call.
pulse api compose runs multiple types.Request
entries against one or more cohorts. The whole batch is one
ComposedRequest; the engine can run the entries sequentially or in
parallel against a bounded worker pool.
LLM agents using MCP: see the pulse_compose MCP tool and the compose-requests skill.
Synopsis
pulse api compose --request FILE [--json] [--stream]
[--parallel N] [--no-fail-fast]
[--no-defaults]
Flags
| Flag | Alias | Type | Default | Purpose |
|---|---|---|---|---|
| --request | -r | string | (required) | Composed-request JSON path |
| --json | | bool | false | Wrap output in the standard envelope |
| --stream | | bool | false | Stream rows as NDJSON; each line is {"index": N, "row": {...}} |
| --parallel | | int | 1 | Worker count; 0 = GOMAXPROCS, 1 = sequential |
| --no-fail-fast | | bool | false | Aggregate errors across slots instead of cancelling on first failure (parallel mode only) |
| --no-defaults | | bool | false | Disable smart operator-type inference |
Request file shape
{
"requests": [
{ "cohort": {"filename": "sales.pulse"}, "aggregations": [...] },
{ "cohort": {"filename": "sales.pulse"}, "groups": [...] },
{ "cohort": {"filename": "ops.pulse"}, "filterers": [...] }
]
}
Each requests[i] is a full types.Request. Slots are independent —
they may target different cohorts, use different operators, etc.
Output ordering
Responses come back in input order, regardless of --parallel.
A worker that finishes early waits its turn before emitting. So
responses[i] always corresponds to request.requests[i].
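The ordering contract can be illustrated outside Pulse. The Python sketch below is not Pulse's implementation; it only demonstrates the same guarantee with the standard library, where slots finish out of order but results are collected in input order. run_slot and the delays are stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_slot(slot):
    """Stand-in for executing one request; later slots finish sooner on purpose."""
    index, delay = slot
    time.sleep(delay)
    return f"response-{index}"

slots = [(0, 0.03), (1, 0.02), (2, 0.01)]
with ThreadPoolExecutor(max_workers=3) as pool:
    # Executor.map yields results in input order, even though slot 2 finishes first.
    responses = list(pool.map(run_slot, slots))
print(responses)  # ['response-0', 'response-1', 'response-2']
```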
Parallel mode
--parallel N:
- 1 (default): sequential Compose, equivalent to running each request through pulse api process in a loop.
- 0: runtime.GOMAXPROCS workers.
- >1: exactly N workers.
Workers share Pulse’s read-only registries; per-request stateful operators are constructed fresh. See Parallel Compose for full mechanics.
FailFast semantics
With --no-fail-fast unset (the default, fail-fast on):
- The first failing request cancels in-flight siblings.
- The command exits non-zero with the first error.
With --no-fail-fast:
- Every request runs to its own completion (or per-request timeout).
- Errors aggregate into a single SERVICE_INTERNAL error whose details.failed_indices lists the slot indices that failed.
- Successful slots populate the response array; failed slots are null.
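With aggregated errors, a caller usually needs to know which slots to retry. The Python sketch below derives that from the null slots and, separately, from details.failed_indices. The exact layout of the error object (a code key and a details object) is an assumption based on the description above, and both helper names are hypothetical.

```python
def failed_slots(envelope: dict) -> list:
    """Indices of slots whose response is null in a --no-fail-fast envelope."""
    return [i for i, resp in enumerate(envelope.get("data") or []) if resp is None]

def reported_failures(envelope: dict) -> list:
    """Collect details.failed_indices from any error that carries it."""
    indices = []
    for err in envelope.get("errors", []):
        indices.extend((err.get("details") or {}).get("failed_indices", []))
    return sorted(indices)

# Illustrative envelope; the error object layout is assumed, not confirmed.
sample = {
    "format_version": "1.0",
    "data": [{"data": [], "metadata": {}}, None, {"data": [], "metadata": {}}],
    "errors": [{"code": "SERVICE_INTERNAL", "details": {"failed_indices": [1]}}],
    "warnings": [],
}
print(failed_slots(sample), reported_failures(sample))
```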
Output
--json
{
"format_version": "1.0",
"data": [ /* response per slot, in input order */ ],
"errors": [],
"warnings": []
}
--stream
{"index": 0, "row": { ... }}
{"index": 0, "row": { ... }}
{"index": 1, "row": { ... }}
The index field identifies which slot’s request produced each row.
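A downstream consumer can use index to split the interleaved stream back into per-slot results. This is a minimal Python sketch under the line shape shown above; group_by_slot is a hypothetical helper and the row contents are invented.

```python
import json
from collections import defaultdict

def group_by_slot(lines):
    """Map slot index -> list of rows, preserving arrival order within each slot."""
    grouped = defaultdict(list)
    for line in lines:
        if line.strip():
            record = json.loads(line)
            grouped[record["index"]].append(record["row"])
    return dict(grouped)

sample_lines = [
    '{"index": 0, "row": {"n": 1}}',
    '{"index": 0, "row": {"n": 2}}',
    '{"index": 1, "row": {"n": 3}}',
]
print(group_by_slot(sample_lines))
```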
Exit codes
| Code | Meaning |
|---|---|
| 0 | All requests succeeded |
| 1 | One or more requests failed (fail-fast: first error; aggregated: any failure) |
Examples
Sequential batch
pulse api compose --request batch.json --json
Parallel with 4 workers, aggregated errors
pulse api compose --request batch.json --parallel 4 --no-fail-fast --json
Stream a parallel batch into a downstream consumer
pulse api compose --request batch.json --parallel 4 --stream | \
jq -c 'select(.index == 2)'
Related
- pulse api process: single-request leaf
- Library: Parallel Compose, Go-side equivalents
- skills/compose-requests.md (LLM): request composition patterns
pulse api ask
Audience: CLI users running a one-shot natural-language query against a cohort, or any caller who wants “predict + process” in one call.
pulse api ask is the unified entry point. It validates a request
(predict), optionally translates a natural-language query into a
request via the built-in parser, and — on success — executes the
request. The MCP server uses the same library facade internally for
the pulse_ask tool.
LLM agents using MCP: the LLM-side counterpart is the pulse_ask MCP tool. The query-router-prompt skill gives a system-prompt template for routing natural language into Pulse requests.
Synopsis
pulse api ask [--file FILE] [--query "..."] [--request FILE]
[--on-invalid abort|suggest] [--predict]
[--json] [--no-defaults]
You must pass at least one of --query or --request.
Flags
| Flag | Alias | Type | Default | Purpose |
|---|---|---|---|---|
| --file | -f | string | (none) | Cohort .pulse file path |
| --query | -q | string | (none) | Natural-language query string |
| --request | -r | string | (none) | Optional structured request JSON path |
| --on-invalid | | string | "abort" | Predict-invalid behaviour: "abort" returns an error; "suggest" returns the response with suggestions populated |
| --predict | | bool | false | Validate without executing |
| --json | | bool | false | Emit the standard envelope |
| --no-defaults | | bool | false | Disable smart operator-type inference |
How the parser fills the request
When --query is set, the parser reads the cohort’s schema and
synthesises a types.Request slot-by-slot. If --request is also
provided, explicit fields in that request always win on
collision — the parser only fills empty slots.
The parser populates these slots from the query today: Aggregations,
Groups, Filterers, Windows, Sort, Tests. Other slots in the
parsed request are ignored.
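The precedence rule can be expressed as a small merge. The Python sketch below is not the parser's code; it only models the rule described above. The lower-case JSON keys for windows, sort, and tests are assumptions (only aggregations, groups, and filterers appear in request examples on this page), and merge_request is a hypothetical name.

```python
# Slots the parser may fill; the last three key spellings are assumed.
PARSER_SLOTS = ("aggregations", "groups", "filterers", "windows", "sort", "tests")

def merge_request(explicit: dict, parsed: dict) -> dict:
    """Explicit slots win; parser output only fills slots that are empty or missing."""
    merged = dict(explicit)
    for slot in PARSER_SLOTS:
        if not merged.get(slot) and parsed.get(slot):
            merged[slot] = parsed[slot]
    return merged

explicit = {
    "filterers": [{"type": "FILTER_RANGE", "field": "revenue", "values": ["100", "1000"]}]
}
parsed = {
    "groups": [{"type": "GROUP_CATEGORY", "field": "region"}],
    "filterers": [{"type": "FILTER_RANGE", "field": "units", "values": ["1", "5"]}],
    "attributes": ["not a parser slot, so it is dropped"],
}
merged = merge_request(explicit, parsed)
print(sorted(merged))  # ['filterers', 'groups']
```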
Output
Text mode
A human-readable summary:
Query: average revenue by region
Matched fields: [revenue region]
Confidence: 0.92
Resolved request:
{ ...the synthesised types.Request... }
{ ...result rows, if executed... }
--json
Full AskResponse envelope:
{
"format_version": "1.0",
"predict": { /* PredictResult */ },
"process": { /* Response, if executed */ },
"suggestions": [],
"query_resolution": {
"query": "average revenue by region",
"matched_fields": ["revenue", "region"],
"confidence": 0.92
},
"errors": [],
"warnings": []
}
process is omitted when --predict is set or when predict reported
invalid and execution was skipped.
Confidence and unresolved queries
query_resolution.confidence is in [0, 1]. A confidence of 0 means
PULSE_QUERY_UNRESOLVED (the parser found no usable structure) and
lands in errors. Lower-than-1 confidences with at least one matched
field land their reasons in warnings
(PULSE_QUERY_AMBIGUOUS). The query-router-prompt skill describes
the parser’s grammar.
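Read as a decision rule, the paragraph above maps a resolution to at most one code. The Python sketch below encodes only the two cases the text spells out and returns None otherwise; classify_resolution is a hypothetical helper, not a Pulse API.

```python
from typing import Optional

def classify_resolution(confidence: float, matched_fields: list) -> Optional[str]:
    """Return the code implied by a query resolution, or None if no code applies."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence == 0:
        return "PULSE_QUERY_UNRESOLVED"  # reported in errors
    if confidence < 1 and matched_fields:
        return "PULSE_QUERY_AMBIGUOUS"   # reported in warnings
    # Full confidence, or a case the text above does not describe.
    return None

print(classify_resolution(0.92, ["revenue", "region"]))  # PULSE_QUERY_AMBIGUOUS
```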
OnInvalid behaviours
| Value | Behaviour |
|---|---|
| "abort" (default) | Return a SERVICE_VALIDATION error if predict reports invalid |
| "suggest" | Return the response with suggestions populated from errors/fixup_metadata.go |
Use "suggest" when you want fixup hints (e.g., “did you mean field
revenue?”) rather than a hard fail.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Validation failed (abort), parser failed, or process errored |
Examples
Pure natural-language query
pulse api ask --file sales.pulse --query "average revenue by region" --json
Query plus partial structured request
cat > partial.json <<'EOF'
{
"filterers": [{"type": "FILTER_RANGE", "field": "revenue", "values": ["100", "1000"]}]
}
EOF
pulse api ask --file sales.pulse --request partial.json --query "by region" --json
Predict-only probe
pulse api ask --request req.json --predict --json
Suggest fixups instead of erroring
pulse api ask --request typo.json --on-invalid suggest --json
Related
- pulse api predict: standalone validation
- pulse api process: execute a pre-validated request
- Library: pulse.Ask, Go-side counterpart
- skills/query-router-prompt.md: LLM prompt template for routing
- skills/request-recipes.md: canonical request skeletons
pulse cohort inspect
Audience: CLI users reading a .pulse file’s schema without
running a query — the human-side counterpart of the inspect library
method and the pulse_inspect MCP tool. Defined in
internal/cli/cohort.go.
pulse cohort inspect reads only the file’s header and schema; it
never reads record data, so its cost does not grow with the number of
records in the cohort.
LLM agents using MCP: see the cohort-schema-design skill and the pulse_inspect tool.
Synopsis
pulse cohort inspect PATH [--json] [--full-dict]
Flags
| Flag | Type | Default | Purpose |
|---|---|---|---|
| --json | bool | false | Emit the standard envelope |
| --full-dict | bool | false | Print every categorical dictionary entry (default truncates at 100) |
Output (text mode)
Fields: 7
order_id u64 Stable order identifier
region categorical_u8 Sales region label
dictionary: 4 entries
product categorical_u16 Product SKU
dictionary: 240 entries (truncated)
units u32 Units sold per line
revenue decimal128 Line revenue (precision 18, scale 2)
sold_on date Date the order shipped
...
Dictionaries with > 100 entries are flagged (truncated) — pass
--full-dict to print every entry.
Output (--json)
{
"format_version": "1.0",
"data": {
"field_count": 7,
"fields": [
{
"name": "order_id",
"type": "u64",
"byte_offset": 0,
"bit_position": 0,
"description": "Stable order identifier",
"description_source": "schema"
},
{
"name": "region",
"type": "categorical_u8",
"byte_offset": 8,
"bit_position": 0,
"description": "Sales region label",
"description_source": "schema",
"dictionary": {
"total_entries": 4,
"truncated": false,
"entries": ["east", "west", "north", "south"]
}
}
]
},
"errors": [],
"warnings": []
}
Fields with empty descriptions on disk get a synthesised fallback
("Categorical field: <name>" / "Numeric field: <name>"); their
description_source is "synthesized" rather than "schema".
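One practical use of description_source is finding fields that still need a hand-written description. The Python sketch below walks an inspect envelope and lists those fields. The helper name is hypothetical, and the sample envelope is trimmed to the keys it reads.

```python
def fields_needing_descriptions(envelope: dict) -> list:
    """Names of fields whose description was synthesised rather than read from the schema."""
    return [
        field["name"]
        for field in envelope["data"]["fields"]
        if field.get("description_source") == "synthesized"
    ]

# Illustrative envelope with one authored and one synthesised description.
sample = {
    "format_version": "1.0",
    "data": {
        "field_count": 2,
        "fields": [
            {"name": "order_id", "type": "u64",
             "description": "Stable order identifier", "description_source": "schema"},
            {"name": "units", "type": "u32",
             "description": "Numeric field: units", "description_source": "synthesized"},
        ],
    },
    "errors": [],
    "warnings": [],
}
print(fields_needing_descriptions(sample))  # ['units']
```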
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | File not found, truncated, magic-byte mismatch, or unsupported format version |
Examples
# Human-readable inspect
pulse cohort inspect data.pulse
# Full envelope for programmatic consumers
pulse cohort inspect data.pulse --json
# Show all categorical entries
pulse cohort inspect data.pulse --full-dict --json | jq '.data.fields[] | select(.dictionary)'
Related
- Format → Header Layout
- Format → Schema Block
- Format → Dictionary Blocks
- Library: pulse.Inspect — Go counterpart
- skills/cohort-schema-design.md: LLM-facing schema-design skill
pulse api predict
Audience: CLI users validating a request before running it.
pulse api predict validates a types.Request against a .pulse
file’s schema without executing it. It reads only the header and
schema — never record data — so it’s a cheap, safe iteration loop
against arbitrarily large cohorts.
LLM agents using MCP: see the pulse_predict MCP tool and the debugging-with-predict skill. Predict is the LLM’s primary “would this work?” probe.
Synopsis
pulse api predict --request FILE [--json] [--strict]
Flags
| Flag | Alias | Type | Default | Purpose |
|---|---|---|---|---|
| --request | -r | string | (required) | Request JSON path |
| --json | | bool | false | Emit the standard envelope |
| --strict | | bool | false | Treat warnings as errors |
Structural ban
descriptor/predict.go cannot import service/ or processing/.
This is enforced by TestPredictNoExecutionImports. Predict is
guaranteed to never touch the executor.
Output (text mode)
Valid: true
Schema: 7 fields
Warning [PULSE_AGG_NOT_MEANINGFUL_FOR_CATEGORICAL]: AGG_AVG on field region (categorical_u8)
Without --strict, that warning would still let the command exit 0.
With --strict, the warning becomes an error and the command exits
non-zero.
Output (--json)
{
"format_version": "1.0",
"data": {
"valid": true,
"schema_info": {"field_count": 7},
"streamable": false,
"streamable_reasons": [
"AGG_MEDIAN on field price"
],
"request": { /* the request as predict resolved it, with defaults applied */ }
},
"errors": [],
"warnings": [
{"code": "PULSE_AGG_NOT_MEANINGFUL_FOR_CATEGORICAL", "message": "..."}
]
}
streamable reports whether the request will execute on the
streaming Process path; streamable_reasons lists every gate that
forced the buffered path. See Performance
Notes for the full streaming/buffered table.
request echoes the request after defaults have been applied so
you can see what would actually run. To suppress defaults, run with
--no-defaults on the executing leaf (api process,
api compose); predict reports defaults_applied regardless.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Valid (or valid with warnings, in non-strict mode) |
| 1 | Invalid, or --strict with at least one warning |
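The table above reduces to a two-condition rule, which can be handy when mirroring predict's outcome in a wrapper script. This Python sketch models only that rule; predict_exit_code is a hypothetical helper, not part of Pulse.

```python
def predict_exit_code(valid: bool, warning_count: int, strict: bool = False) -> int:
    """Model of the exit-code table: 1 if invalid, or if strict and any warning exists."""
    if not valid:
        return 1
    if strict and warning_count > 0:
        return 1
    return 0

print(predict_exit_code(valid=True, warning_count=1))               # 0
print(predict_exit_code(valid=True, warning_count=1, strict=True))  # 1
```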
Examples
Quick validity check
pulse api predict --request req.json
Programmatic check with envelope
pulse api predict --request req.json --json | \
jq -e '.data.valid == true' >/dev/null && echo "OK"
Strict mode for CI
pulse api predict --request req.json --strict --json
Detect that a request will buffer
pulse api predict --request req.json --json | \
jq '.data | {streamable, streamable_reasons}'
Common warning codes
| Code | What to do |
|---|---|
| PULSE_AGG_NOT_MEANINGFUL_FOR_CATEGORICAL | Use AGG_COUNT / AGG_FREQUENCY instead of AGG_SUM / AGG_AVG on categoricals |
| PULSE_AGG_NOT_MEANINGFUL_FOR_DECIMAL | Decimal-typed field; switch to a decimal-aware aggregator |
| PULSE_FIELD_DESCRIPTION_LOW_QUALITY | Edit the schema description; re-import |
| PULSE_FEAT_TARGET_LEAKAGE_RISK | The feature operator references the target column; reorganise the pipeline |
The full code-by-code recovery playbook lives in
skills/error-code-reference.md and at
Troubleshooting.
Related
- pulse api process: executes a validated request
- pulse api ask: combined predict + execute
- Library: pulse.Predict / Ask, Go counterparts
- skills/debugging-with-predict.md: LLM-side iteration recipe
pulse api sample
Audience: CLI users grabbing a quick peek at a few rows from a cohort — for debugging, sanity-checking an import, or seeding a template request.
pulse api sample returns the first N rows from a .pulse file
decoded back to a map of field → value. There is no filter, no
aggregation, no transformation — just a typed view of raw rows.
LLM agents using MCP: see the pulse_sample MCP tool. It returns the same shape over the MCP transport.
Synopsis
pulse api sample --input PATH [--count N] [--json]
Flags
| Flag | Alias | Type | Default | Purpose |
|---|---|---|---|---|
| --input | -i | string | (required) | Cohort .pulse file path |
| --count | -n | int | 10 | Rows to sample |
| --json | | bool | false | Emit the standard envelope |
Output (text mode)
Pretty-printed JSON of the row array:
[
{
"order_id": 1,
"region": "west",
"product": "widget",
"units": 3,
"revenue": "29.97",
"sold_on": "2024-01-04"
},
...
]
Decimal128 values are serialised as strings to preserve precision.
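Consumers should parse those strings with a decimal type rather than a binary float to keep the precision the format preserved. A minimal Python sketch, using the three revenue values from the walkthrough CSV:

```python
from decimal import Decimal

# Revenue values as they would appear in sampled rows (strings, not numbers).
rows = [{"revenue": "29.97"}, {"revenue": "19.99"}, {"revenue": "69.93"}]

# Summing Decimals avoids binary floating-point rounding.
total = sum((Decimal(row["revenue"]) for row in rows), Decimal("0"))
print(total)  # 119.89
```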
Output (--json)
{
"format_version": "1.0",
"data": [ /* row array */ ],
"errors": [],
"warnings": []
}
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | File not found, truncated, or unsupported version |
Examples
# 10 rows
pulse api sample --input sales.pulse
# 100 rows, envelope-wrapped
pulse api sample --input sales.pulse --count 100 --json
# Pipe into jq
pulse api sample --input sales.pulse --count 100 | jq '.[] | .revenue'
When sample is the wrong tool
- For filtered subsets, use pulse api process with a FILTER_* and no aggregation; the result will be one row per matching record.
- For distinct values of a single field, use pulse api facet.
- For schema-only views (types, descriptions, dictionaries), use pulse cohort inspect.
Related
- pulse api facet: distinct values for a single field
- Library: pulse.Sample
pulse api facet
Audience: CLI users enumerating distinct values for a single field — a cheap probe for “what are the regions in this cohort?” without building a full filter.
pulse api facet returns the distinct values of one field in a
.pulse file. For categorical fields it reads the dictionary
directly (no record scan). For non-categorical fields it scans
records.
LLM agents using MCP: see the pulse_facet MCP tool.
Synopsis
pulse api facet --input PATH --field NAME [--json]
Flags
| Flag | Alias | Type | Default | Purpose |
|---|---|---|---|---|
| --input | -i | string | (required) | Cohort .pulse file path |
| --field | -f | string | (required) | Field name to facet on |
| --json | | bool | false | Emit the standard envelope |
Output (text mode)
One value per line:
east
north
south
west
Output (--json)
{
"format_version": "1.0",
"data": ["east", "north", "south", "west"],
"errors": [],
"warnings": []
}
Performance notes
| Field type | Behaviour |
|---|---|
| categorical_u8 / _u16 / _u32 | Read directly from the schema’s inline dictionary; O(distinct values), no record scan |
| Non-categorical | Full scan; values collected into a set, then returned sorted |
For columns with very high cardinality on the non-categorical path, expect memory proportional to distinct value count.
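The non-categorical path amounts to a scan into a set followed by a sort, which is why memory tracks the number of distinct values rather than the number of records. The Python sketch below illustrates that shape on in-memory rows; it is a conceptual model, not Pulse's implementation, and facet here is a stand-in function.

```python
def facet(rows, field):
    """Collect distinct values of one field into a set, then return them sorted."""
    seen = set()
    for row in rows:  # one pass over every record
        seen.add(row[field])
    return sorted(seen)  # memory is proportional to len(seen), not len(rows)

rows = [{"units": 3}, {"units": 1}, {"units": 7}, {"units": 3}]
print(facet(rows, "units"))  # [1, 3, 7]
```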
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | File not found, field name not found, or unsupported version |
Examples
# Read categorical dictionary
pulse api facet --input sales.pulse --field region
# JSON envelope
pulse api facet --input sales.pulse --field region --json
# Pipe into another command
for r in $(pulse api facet --input sales.pulse --field region); do
echo "Region: $r"
done
Related
- pulse api sample: raw rows preview
- Format: Dictionary Blocks, how categorical dictionaries are encoded
- Library: pulse.Facet
pulse manifest
Audience: CLI users (and orchestration agents) discovering Pulse’s self-description — what commands exist, which aggregators are registered, which field types are supported, and what skills the binary ships with.
The manifest is the bare-pulse invocation with --json. It is
deterministic and process-wide: it never depends on cohort data or
the filesystem.
LLM agents using MCP: the manifest is also available via the pulse_manifest MCP tool. Agents typically call this once per session and cache the result.
Synopsis
pulse --json [--slim]
(There is no pulse manifest subcommand — the manifest is the root
command’s --json output.)
Flags
| Flag | Type | Default | Purpose |
|---|---|---|---|
| --json | bool | false | Emit the manifest as a JSON envelope |
| --slim | bool | false | Drop prose descriptions from the manifest payload (smaller for size-sensitive clients) |
Manifest shape
From descriptor/manifest.go:
{
"format_version": "1.0",
"data": {
"commands": [ /* every CLI leaf with a usage line */ ],
"operators": [ /* every aggregator / attribute / filterer / grouper / window / feature */ ],
"tests": [ /* every tier-1 statistical test */ ],
"post_tests": [ /* every tier-2 post-test variant */ ],
"distributions": [ /* every synth distribution kind */ ],
"errors": [ /* every registered error code with a description */ ],
"mcp_tools": [ /* every MCP tool name + description */ ],
"field_types":[ /* every .pulse field type */ ],
"skills": [ /* every embedded skill with metadata */ ]
},
"errors": [],
"warnings": []
}
Every list is sorted deterministically (alphabetical or category +
alphabetical). The same Pulse binary always emits the same manifest
bytes (modulo --slim).
Determinism gates
Several CI tests enforce manifest completeness — see Testing Conventions. Notably:
- TestManifestOperatorsComplete: every registered operator appears in the manifest.
- TestManifestTestsComplete / TestManifestPostTestsComplete: every registered statistical test appears.
- TestManifestDistributionsComplete, TestManifestErrorCodesComplete, TestManifestMCPToolsComplete: same for distributions, error codes, and MCP tools.
- TestManifestStreamableMatchesTypes: every operator’s streamable flag mirrors the per-type method.
When to use the manifest
| Use case | Reach for |
|---|---|
| Discover what’s available | pulse --json |
| Confirm a specific operator’s params and emit type | `jq '.data.operators[] \| …'` |
| List embedded skills with their applies_to | jq '.data.skills[]' |
| Generate documentation or client stubs | Parse the full manifest once at boot |
| Quick “is this name a real operator?” | `pulse --json --slim \| …` |
Exit codes
| Code | Meaning |
|---|---|
| 0 | Always (the manifest is in-memory, deterministic, never errors) |
Examples
Print the manifest
pulse --json | jq '.data | keys'
Slim variant for embedding in an agent’s system prompt
pulse --json --slim > manifest.slim.json
List every aggregator with its emitted type
pulse --json | jq '.data.operators[] | select(.category == "aggregation") | {name, emits_type}'
Confirm a feature operator’s parameters
pulse --json | jq '.data.operators[] | select(.name == "FEAT_BUCKETIZE")'
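The same lookup can be done from Go by decoding the manifest once and indexing operator names. This is a minimal sketch: the `name` and `category` keys are taken from the jq examples above, `operatorNames` is a hypothetical helper, and the payload in `main` is a hand-written stand-in (including the "feature" category value), not real manifest output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// manifestEnvelope models only the parts of the manifest this sketch needs.
type manifestEnvelope struct {
	Data struct {
		Operators []struct {
			Name     string `json:"name"`
			Category string `json:"category"`
		} `json:"operators"`
	} `json:"data"`
}

// operatorNames decodes manifest JSON and returns the set of operator names.
func operatorNames(raw []byte) (map[string]bool, error) {
	var env manifestEnvelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, fmt.Errorf("decode manifest: %w", err)
	}
	names := make(map[string]bool, len(env.Data.Operators))
	for _, op := range env.Data.Operators {
		names[op.Name] = true
	}
	return names, nil
}

func main() {
	// Stand-in payload; real input would be the stdout of `pulse --json`.
	raw := []byte(`{"format_version":"1.0","data":{"operators":[
		{"name":"AGG_AVERAGE","category":"aggregation"},
		{"name":"FEAT_BUCKETIZE","category":"feature"}]},"errors":[],"warnings":[]}`)
	names, err := operatorNames(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(names["FEAT_BUCKETIZE"], names["AGG_MADE_UP"]) // true false
}
```

Because the manifest is deterministic for a given binary, decoding it once at startup and keeping the set in memory is usually enough.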
Related
- How LLMs Use Pulse — the manifest is one of the agent discovery primitives
- Library: pulse.Manifest — Go counterpart
- Internals: Architecture (why the manifest cannot import service/ or processing/)
pulse synth from-schema
Audience: CLI users generating a synthetic .pulse cohort from a
declarative spec — for testing, demos, and bootstrapping fixtures.
pulse synth from-schema reads a JSON synth spec (field-by-field
distributions, row count, optional pairwise correlations) and writes
a deterministic .pulse file. Same (spec, seed) pair produces a
byte-identical output.
LLM agents using MCP: see the pulse_synth MCP tool and the synthetic-data skill, which covers spec authoring, the 12 supported distributions, and constraint patterns.
Synopsis
pulse synth from-schema --spec FILE --output FILE
[--rows N] [--seed N] [--json]
Flags
| Flag | Alias | Type | Default | Purpose |
|---|---|---|---|---|
| --spec | -s | string | (required) | Synth spec JSON path |
| --output | -o | string | (required) | Output .pulse file path |
| --rows | | int | from spec | Override row_count in the spec |
| --seed | | int | 0 | Deterministic RNG seed |
| --json | | bool | false | Emit the standard envelope |
Spec shape (sketch)
{
"row_count": 10000,
"fields": [
{"name": "id", "type": "u64", "distribution": "monotonic_from", "from": 1},
{"name": "region", "type": "categorical_u8", "distribution": "weighted_categorical",
"weights": {"east": 0.4, "west": 0.4, "north": 0.1, "south": 0.1}},
{"name": "revenue", "type": "f64", "distribution": "lognormal", "mu": 4.0, "sigma": 0.8},
{"name": "sold_on", "type": "date", "distribution": "uniform_date",
"from": "2024-01-01", "to": "2024-12-31"}
]
}
Full spec grammar (constraints, correlations, regex, …) lives in
skills/synthetic-data.md and synth/.
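Specs are plain JSON, so test harnesses can generate them instead of hand-editing files. The sketch below assembles a spec using only the keys shown in the sketch above (`row_count`, `fields`, and the per-field parameters). `buildSpec` and the `field` type are hypothetical names, and nothing here validates the spec against the full grammar.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// field is one entry in the spec's "fields" array. A map is used because
// each distribution takes different parameters.
type field map[string]any

// buildSpec assembles a synth spec as JSON. Hypothetical helper: it only
// mirrors the keys shown in the spec sketch and performs no validation.
func buildSpec(rowCount int, fields ...field) ([]byte, error) {
	spec := map[string]any{
		"row_count": rowCount,
		"fields":    fields,
	}
	return json.Marshal(spec)
}

func main() {
	raw, err := buildSpec(10000,
		field{"name": "id", "type": "u64", "distribution": "monotonic_from", "from": 1},
		field{"name": "revenue", "type": "f64", "distribution": "lognormal", "mu": 4.0, "sigma": 0.8},
	)
	if err != nil {
		panic(err)
	}
	// Write raw to a file and pass its path to --spec.
	fmt.Println(string(raw))
}
```

encoding/json sorts map keys when marshalling, so the generated spec text is stable across runs for the same inputs.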
Supported distributions
bernoulli, constant, exponential, lognormal, monotonic_from,
normal, pareto, poisson, regex, uniform, uniform_date,
weighted_categorical.
The full catalog (with parameters) is in skills/synthetic-data.md
and pulse --json | jq '.data.distributions'.
Determinism
Same (spec, seed) → byte-identical output. The seed is an int64;
default 0. Use a fixed seed for fixtures and a random seed for
load-testing variation.
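The seed contract is the usual property of a seeded pseudo-random generator: the same seed replays the same sequence. The sketch below demonstrates that property with Go's math/rand as a stand-in; it makes no claim about which generator Pulse uses internally, and `sample` and `equal` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sample draws n pseudo-random values from a generator seeded with seed.
func sample(seed int64, n int) []int64 {
	rng := rand.New(rand.NewSource(seed))
	out := make([]int64, n)
	for i := range out {
		out[i] = rng.Int63n(1000)
	}
	return out
}

// equal reports whether two slices hold the same values in the same order.
func equal(a, b []int64) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	// Two independent generators with the same seed replay the same values.
	fmt.Println(equal(sample(42, 5), sample(42, 5))) // true
}
```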
Output
Text mode
Generated 10000 rows -> sales.pulse (rejected 0)
rejected counts rows that failed user-defined constraints. When the
rejection rate is too high to make progress, the command fails with
PULSE_SYNTH_CONSTRAINT_INFEASIBLE.
--json
{
"format_version": "1.0",
"data": {
"output_path": "sales.pulse",
"rows_generated": 10000,
"rows_rejected": 0,
"seed": 0
},
"errors": [],
"warnings": []
}
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Spec parse error, unknown distribution, infeasible constraints, or output write failure |
Common error codes
| Code | Cause |
|---|---|
| PULSE_SYNTH_DISTRIBUTION_UNKNOWN | Spec references a distribution name not in the catalog |
| PULSE_SYNTH_CONSTRAINT_INFEASIBLE | Constraints reject too high a fraction of generated rows |
Examples
# Build sales.pulse from a spec
pulse synth from-schema --spec sales.spec.json --output sales.pulse --seed 42
# Override row count without editing the spec
pulse synth from-schema --spec sales.spec.json --output sales.pulse --rows 1000
# Programmatic envelope
pulse synth from-schema --spec sales.spec.json --output sales.pulse --json
Related
- pulse synth from-profile: generate from a captured profile of an existing cohort
- pulse profile create: capture the profile
- skills/synthetic-data.md: full spec grammar and distribution table
- Library: pulse.Synth
pulse synth from-profile
Audience: CLI users generating a synthetic .pulse cohort whose
distributions match a real cohort — typically to share a sanitised
replica without exposing the underlying rows.
pulse synth from-profile reads a profile JSON captured by
pulse profile create and writes a synthetic
.pulse file whose per-field distributions and (optional) pairwise
correlations follow the profile. The profile retains no individual
rows from the source; only summary statistics.
LLM agents using MCP: see the pulse_synth_from_profile MCP tool and the synthetic-data skill.
Synopsis
pulse synth from-profile --profile FILE --output FILE --rows N
[--seed N] [--json]
Flags
| Flag | Alias | Type | Default | Purpose |
|---|---|---|---|---|
| --profile | -p | string | (required) | Profile JSON path |
| --output | -o | string | (required) | Output .pulse file path |
| --rows | | int | (required) | Rows to generate |
| --seed | | int | 0 | Deterministic RNG seed |
| --json | | bool | false | Emit the standard envelope |
--rows is required (unlike from-schema, which can pull it from
the spec) because the profile does not carry a generation count of
its own.
Determinism
Same (profile, seed, rows) triple → byte-identical output. Seeds
are int64; default 0.
Profile shape
The profile is a synth.Profile JSON object produced by
pulse profile create. It carries per-field type, descriptive
statistics, top-K categorical entries (default K = 32), optional
pairwise correlations (when --include-correlations was passed at
profile-creation time), and a row count.
See pulse profile create for how to capture
one, and synth/ for the underlying Go types.
Output
Text mode
Generated 1000 rows -> sales.synth.pulse (rejected 0)
--json
Same envelope shape as
synth from-schema.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Profile parse error, infeasible constraints, or output write failure |
Examples
# Capture once
pulse profile create --input sales.pulse --output sales.profile.json
# Re-generate any number of times with different seeds
pulse synth from-profile --profile sales.profile.json --output sales.s42.pulse --rows 10000 --seed 42
pulse synth from-profile --profile sales.profile.json --output sales.s43.pulse --rows 10000 --seed 43
Limitations
- Categorical tails: anything past the captured top-K is replaced with a sentinel “other” bucket sized to its observed weight.
- Correlations: pairwise only, and only between numeric fields. The profile capture flag --include-correlations opts in; without it, fields are generated independently.
- Decimal and geo fields: regenerated within the same type family but with synthetic value distributions; downstream uses that depend on exact field values (e.g. joinable identifiers) need the schema-driven path instead.
Related
- pulse profile create
- pulse synth from-schema
- skills/synthetic-data.md: the spec / profile grammar
pulse profile create
Audience: CLI users capturing a statistical profile of an
existing cohort — typically to feed into
pulse synth from-profile.
pulse profile create reads a .pulse file and writes a JSON
profile: per-field type, descriptive statistics, top-K categorical
entries, optional pairwise correlations. The profile retains no
individual rows from the source.
LLM agents using MCP: see the pulse_profile MCP tool.
Synopsis
pulse profile create --input PATH --output PATH
[--top-k N] [--include-stats]
[--include-correlations] [--correlation-top-k N]
[--sample-limit N] [--json]
Flags
| Flag | Alias | Type | Default | Purpose |
|---|---|---|---|---|
| --input | -i | string | (required) | Source .pulse cohort |
| --output | -o | string | (required) | Output profile JSON path |
| --top-k | | int | 32 | Top-K categorical entries to retain per field |
| --include-stats | | bool | true | Include percentile / std stats |
| --include-correlations | | bool | false | Capture pairwise numeric correlations |
| --correlation-top-k | | int | 16 | Cap on retained correlation pairs |
| --sample-limit | | int | 0 (unlimited) | Cap rows ingested for the profile (0 disables) |
| --json | | bool | false | Also print the envelope to stdout |
What the profile captures
| Field type | What is recorded |
|---|---|
| Numeric (u*, f*, decimal128) | Count, min, max, mean, stddev; percentiles if --include-stats |
| Categorical | Top-K most-frequent values + their frequencies; “other” tail weight |
| date | Min, max, count |
| nullable_* | Null count alongside the above |
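The categorical row (top-K values plus an "other" tail weight) can be sketched as follows. This illustrates the idea only and is not the code in synth/; `topK` and `entry` are hypothetical names, and breaking ties by value is an assumption made here so the result is deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// entry pairs a categorical value with its relative frequency.
type entry struct {
	Value string
	Freq  float64
}

// topK returns the k most frequent values with their relative frequencies,
// plus the combined weight of everything outside the top k.
func topK(counts map[string]int, k int) (top []entry, other float64) {
	total := 0
	keys := make([]string, 0, len(counts))
	for v, c := range counts {
		total += c
		keys = append(keys, v)
	}
	if total == 0 {
		return nil, 0
	}
	// Sort by count descending, then by value so ties break deterministically.
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	for i, v := range keys {
		f := float64(counts[v]) / float64(total)
		if i < k {
			top = append(top, entry{v, f})
		} else {
			other += f
		}
	}
	return top, other
}

func main() {
	counts := map[string]int{"east": 40, "west": 40, "north": 10, "south": 10}
	top, other := topK(counts, 2)
	fmt.Println(top, other)
}
```

Only the retained entries and the tail weight leave the function, which is the property that keeps the full dictionary out of the profile.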
What the profile does NOT capture
- Individual rows.
- The full categorical dictionary beyond --top-k.
- Correlations unless --include-correlations is set.
This is by design — profiles are intended to be safe to share with parties who shouldn’t see the underlying data.
Output
The profile JSON is always written to --output. With --json, the
envelope is also written to stdout (typically to be piped into jq).
Profile schema lives in synth/profile.go and is documented in
skills/synthetic-data.md.
Text mode summary
Profiled 50000 rows from sales.pulse -> sales.profile.json
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Read error, unsupported field type (PULSE_PROFILE_FIELD_UNSUPPORTED), or write failure |
Examples
Minimal profile
pulse profile create --input sales.pulse --output sales.profile.json
Rich profile with correlations
pulse profile create --input sales.pulse --output sales.profile.json \
--include-stats --include-correlations --top-k 64 --correlation-top-k 32
Sample-limited profile for a huge cohort
pulse profile create --input ops.pulse --output ops.profile.json --sample-limit 1000000
Round-trip with synth
pulse profile create --input sales.pulse --output sales.profile.json
pulse synth from-profile --profile sales.profile.json --output sales.synth.pulse --rows 10000 --seed 1
pulse cohort inspect sales.synth.pulse
Related
- pulse synth from-profile: the consumer of profile JSON
- pulse synth from-schema: the alternative spec-driven path
- skills/synthetic-data.md: full profile and spec grammar
- Library: pulse.Profile
pulse mcp
Audience: operators wiring Pulse into an MCP-aware AI client (Claude Desktop, Claude Code, generic MCP clients).
pulse mcp runs the Model Context Protocol server over stdio. The AI
client launches pulse mcp as a subprocess, speaks MCP over its
stdio streams, and shuts it down on session close.
LLM agents using MCP: the agent-side guide is the mcp-integration skill; fetch it via pulse_skills_get for the tool catalog and request shapes. This page is for the human setting the server up.
Synopsis
pulse mcp [--data-dir PATH] [--bind-on-open]
The command reads stdin, writes MCP responses on stdout, and writes a one-line startup notice (and any subsequent diagnostics) on stderr.
Flags
| Flag | Type | Default | Purpose |
|---|---|---|---|
| --data-dir | string | from PULSE_DATA_DIR env var | Cohort base directory |
| --bind-on-open | bool | true | Register session-scoped JSON-schema-bound tool variants on successful pulse_inspect |
A data directory is required, supplied either by the --data-dir flag
or by the PULSE_DATA_DIR env var. The MCP server fails to start otherwise:
data directory required: set PULSE_DATA_DIR or pass --data-dir
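The resolution rule can be sketched as a small function. This is an assumption-laden illustration rather than the server's code: `resolveDataDir` is a hypothetical name, and the flag-before-environment precedence is inferred from the flag table, where the environment variable is listed as the flag's default.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveDataDir picks the data directory: an explicit flag value first,
// then the environment value. Both empty is a startup error.
// The precedence order is an assumption inferred from the flag table.
func resolveDataDir(flagValue, envValue string) (string, error) {
	if flagValue != "" {
		return flagValue, nil
	}
	if envValue != "" {
		return envValue, nil
	}
	return "", errors.New("data directory required: set PULSE_DATA_DIR or pass --data-dir")
}

func main() {
	dir, err := resolveDataDir("", "/var/data/pulse")
	fmt.Println(dir, err) // /var/data/pulse <nil>

	_, err = resolveDataDir("", "")
	fmt.Println(err) // data directory required: set PULSE_DATA_DIR or pass --data-dir
}
```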
--bind-on-open
When a session calls pulse_inspect successfully, the server can
register session-scoped tool variants whose JSON Schemas constrain
field-name parameters to the cohort’s actual fields. This narrows
the LLM’s choices and prevents typos at parameter-binding time.
Default: true. Pass --bind-on-open=false if your client binds
tool schemas itself.
The binding logic lives in
internal/mcp/schema_bind.go;
see skills/mcp-integration.md for the LLM-facing implications.
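To make the idea concrete, constraining a field-name parameter usually means emitting a JSON Schema `enum` built from the cohort's field list. The sketch below shows that general technique only. It is not taken from internal/mcp/schema_bind.go; `fieldParamSchema` and the `field` property name are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fieldParamSchema builds a JSON Schema fragment whose "field" property
// accepts only one of the given field names. Hypothetical shape: the real
// bound tool schemas may be structured differently.
func fieldParamSchema(fields []string) map[string]any {
	return map[string]any{
		"type": "object",
		"properties": map[string]any{
			"field": map[string]any{
				"type": "string",
				"enum": fields,
			},
		},
		"required": []string{"field"},
	}
}

func main() {
	schema := fieldParamSchema([]string{"id", "region", "revenue", "sold_on"})
	raw, err := json.MarshalIndent(schema, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```

A client that validates tool arguments against such a schema rejects a misspelled field name before the request reaches the engine.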
Wiring it into Claude Desktop
~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"pulse": {
"command": "/usr/local/bin/pulse",
"args": ["mcp"],
"env": {
"PULSE_DATA_DIR": "/var/data/pulse"
}
}
}
}
Restart the client. The Pulse tools (pulse_manifest, pulse_ask,
pulse_inspect, pulse_predict, pulse_process, pulse_compose,
pulse_sample, pulse_facet, pulse_import, pulse_drop,
pulse_imports_list, pulse_examples_search, pulse_examples_get,
pulse_errors_lookup, pulse_skills_list, pulse_skills_get) and
resources (pulse://*.pulse, pulse-skill://*) appear in the
tool/resource list.
Wiring it into Claude Code
~/.claude.json (or per-project .claude.json):
{
"mcpServers": {
"pulse": {
"command": "/usr/local/bin/pulse",
"args": ["mcp"],
"env": { "PULSE_DATA_DIR": "/var/data/pulse" }
}
}
}
The full LLM-side recipe (including resource URIs and the schema
binding details) is in skills/mcp-integration.md.
Exit codes
pulse mcp is a long-running process. It exits non-zero only on
fatal startup failure (missing data dir, transport error). Once
serving, an MCP client controls the lifecycle.
Examples
Foreground run for debugging
PULSE_DATA_DIR=/tmp/pulse-data ./bin/pulse mcp
# Stderr: pulse mcp: serving over stdio (data dir: /tmp/pulse-data, bind-on-open: true)
Disable schema binding
PULSE_DATA_DIR=/tmp/pulse-data ./bin/pulse mcp --bind-on-open=false
Inspect what the server registers
# Manifest exposes the MCP tool list
pulse --json | jq '.data.mcp_tools[]'
Related
- How LLMs Use Pulse: the pointer table from this site into the skill pack
- skills/mcp-integration.md: LLM-side wiring, tool catalog, resource schemes, schema binding
- Deployment: production hardening notes
- Troubleshooting: common MCP failure modes
Flag Reference
Audience: CLI users who want one page that lists every flag and every environment variable in scope across the binary.
The per-command pages list each command’s full flag set; this page is the cross-cutting reference for flags that appear on multiple commands and for the environment variables Pulse reads.
LLM agents using MCP: there is no LLM-facing skill for the CLI surface. Agents go via MCP tools (pulse_process, pulse_inspect, …); see skills/mcp-integration.md.
Global flags
Available on the bare pulse invocation:
| Flag | Effect |
|---|---|
| --json | Print the root manifest as JSON (envelope-wrapped) |
| --slim | With --json, drop prose descriptions for size-sensitive clients |
Both default to off. pulse --json is the discovery entry point — it
emits the manifest documented at pulse manifest.
Environment variables
| Variable | Used by | Required | Purpose |
|---|---|---|---|
| PULSE_DATA_DIR | All commands when no path override is given; required by pulse mcp | conditionally | Base directory for cohort files. Relative cohort paths resolve against it. |
PULSE_DATA_DIR is the only PULSE_* environment variable today. The
Makefile auto-loads a repo-root .env file so you can keep it (and
any future env vars) there for development.
When embedding the library, you can bypass the env var entirely by
passing pulse.Options{DataDir: "/path"} or pulse.Options{FS: myFs}.
--json envelope
Almost every leaf command accepts --json, which switches output
from human prose to a structured envelope. The envelope shape is
fixed and documented in CLAUDE.md → Output Format Contract:
{
"format_version": "1.0",
"data": { /* operation-specific result */ },
"errors": [ /* {"code": "...", "message": "...", "details": {...}} */ ],
"warnings": [ /* same shape */ ]
}
format_version is currently "1.0". errors and warnings are
always arrays (never null) so JSON consumers can index without
nullable-check overhead.
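Because the envelope shape is fixed, a consumer can decode it once with a generic wrapper and vary only the `data` payload. The following is a minimal sketch using Go generics and the standard library; `envelope`, `entry`, and `decode` are hypothetical names, and rejecting any version other than "1.0" is a design choice of this sketch, not a documented requirement.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// entry mirrors one element of "errors" or "warnings".
type entry struct {
	Code    string         `json:"code"`
	Message string         `json:"message"`
	Details map[string]any `json:"details"`
}

// envelope is the fixed wrapper; T is the operation-specific "data" payload.
type envelope[T any] struct {
	FormatVersion string  `json:"format_version"`
	Data          T       `json:"data"`
	Errors        []entry `json:"errors"`
	Warnings      []entry `json:"warnings"`
}

// decode parses raw into an envelope and rejects unknown format versions.
func decode[T any](raw []byte) (envelope[T], error) {
	var env envelope[T]
	if err := json.Unmarshal(raw, &env); err != nil {
		return env, fmt.Errorf("decode envelope: %w", err)
	}
	if env.FormatVersion != "1.0" {
		return env, fmt.Errorf("unsupported format_version %q", env.FormatVersion)
	}
	return env, nil
}

func main() {
	// Facet output from earlier on this site, used as sample input.
	raw := []byte(`{"format_version":"1.0","data":["east","north","south","west"],"errors":[],"warnings":[]}`)
	env, err := decode[[]string](raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(env.Data, len(env.Errors)) // [east north south west] 0
}
```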
Shared per-command flags
Several flags appear on multiple commands with identical semantics.
--no-defaults
Available on: api process, api compose, api ask.
Disable the runtime smart-defaults pass that infers operator Type
from the named field’s schema type when the caller omits it. Forces
the request to be source-of-truth. See pulse.New &
Options for the underlying library option.
--stream
Available on: api process, api compose.
Stream result rows as NDJSON (one row per line) instead of buffering
the full result. For compose, each line carries an {"index": N, "row": {...}} shape so consumers know which sub-request produced
each row. See Streaming & ProcessStream.
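A consumer of the compose stream reads one JSON object per line and uses `index` to route each row. The sketch below groups rows by sub-request with the standard library. `groupByIndex` and `composeLine` are hypothetical names, and the row keys in `main` are made-up sample data rather than captured output.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// composeLine is the per-line shape described above for compose streams.
type composeLine struct {
	Index int            `json:"index"`
	Row   map[string]any `json:"row"`
}

// groupByIndex reads NDJSON text and buckets rows by sub-request index.
// bufio.Scanner caps lines at 64 KiB by default; very wide rows would
// need a larger buffer via Scanner.Buffer.
func groupByIndex(ndjson string) (map[int][]map[string]any, error) {
	groups := make(map[int][]map[string]any)
	sc := bufio.NewScanner(strings.NewReader(ndjson))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var cl composeLine
		if err := json.Unmarshal([]byte(line), &cl); err != nil {
			return nil, fmt.Errorf("bad NDJSON line %q: %w", line, err)
		}
		groups[cl.Index] = append(groups[cl.Index], cl.Row)
	}
	return groups, sc.Err()
}

func main() {
	// Made-up sample rows; real input would be the command's stdout.
	stream := `{"index":0,"row":{"region":"east","avg_revenue":120.5}}
{"index":1,"row":{"region":"east","count":42}}
{"index":0,"row":{"region":"west","avg_revenue":98.1}}`
	groups, err := groupByIndex(stream)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(groups[0]), len(groups[1])) // 2 1
}
```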
--strict
Available on: api predict.
Treat warnings (e.g. low-quality field description) as errors. Useful in CI gates that want the strictest possible validation.
--full-dict
Available on: cohort inspect.
Print full categorical dictionaries instead of truncating after 100
entries. Pair with --json for programmatic consumption.
--seed / --rows
synth from-schema and synth from-profile both take --seed (the
deterministic RNG seed) and --rows. On from-schema, --rows overrides
the spec’s row count; on from-profile it is required. See the
per-command pages.
Help
Every command supports --help:
pulse --help
pulse api --help
pulse api process --help
pulse mcp --help
--help output is the urfave/cli v3 default — a usage block,
description, flag list, and an examples block where applicable.
Cross-references
| If you need… | Go to |
|---|---|
| Per-command synopsis & examples | CLI Tour and each cli/ page |
| Library-side equivalents | Library Embedding |
| MCP-side equivalents | How LLMs Use Pulse |
| Envelope and error code semantics | Troubleshooting and skills/error-code-reference.md |
Go API Overview
Audience: Go developers embedding Pulse in a binary or a service.
Pulse is library-first. The CLI in cmd/pulse/ is a thin adapter
around the package documented here. If you’re reaching for os/exec
to shell out to the binary from Go, stop and use the library directly
— you’ll skip a process boundary and gain typed responses.
LLM agents using MCP: there is no LLM-facing skill that covers Go embedding directly. Agents speak MCP; this page is for the programs that host them.
Module path
import "github.com/frankbardon/pulse"
Sub-packages you’ll commonly touch:
| Package | Purpose |
|---|---|
| github.com/frankbardon/pulse | Public facade (Pulse, Options, Request, Response, Ask, …) |
| github.com/frankbardon/pulse/types | Request/response structs, component-type constants (AGG_*, …) |
| github.com/frankbardon/pulse/io | Tabular adapter interfaces (Reader, Writer, ImportJob, ExportJob, ConvertJob) |
| github.com/frankbardon/pulse/io/<fmt> | Per-format readers/writers (csv, tsv, ndjson, jsonarray, parquet, arrow, excel) |
| github.com/frankbardon/pulse/fs | afero-backed filesystem config (fs.New, fs.Default, fs.NewMemMap) |
| github.com/frankbardon/pulse/errors | Typed CodedError system and code constants |
| github.com/frankbardon/pulse/descriptor | Manifest, predict, inspect (no-execute operations) |
| github.com/frankbardon/pulse/synth | Synthetic data generator and profile types |
| github.com/frankbardon/pulse/skills | Embedded skill pack: skills.List(), skills.Get(name) |
The internal/ subtree (internal/cli, internal/mcp,
internal/query) is exactly that — internal. Don’t import it.
The facade
Construct a Pulse once per process (or per filesystem boundary) and
re-use it:
p, err := pulse.New(pulse.Options{
DataDir: "/var/data/pulse",
})
if err != nil {
return err
}
The full Options shape (custom afero.Fs, smart-default toggling)
is documented at pulse.New & Options.
Public methods
From pulse.go:
| Method | Purpose |
|---|---|
| Open(ctx, path) (*Cohort, error) | Read header + schema, return a typed Cohort handle |
| Process(ctx, req) (*Response, error) | Execute one request |
| ProcessStream(ctx, req) (RowIter, error) | Same, pull-based iterator over result rows |
| Compose(ctx, req) ([]*Response, error) | Execute a batch sequentially |
| ComposeParallel(ctx, req, opts) ([]*Response, error) | Execute a batch in parallel with a worker pool |
| Ask(ctx, askReq) (*AskResponse, error) | Unified entry: predict + (optionally) process, with natural-language query support |
| Import(ctx, job) (*ImportReport, error) | Tabular → .pulse |
| Export(ctx, job) (*ExportReport, error) | .pulse → tabular |
| Convert(ctx, job) (*ConvertReport, error) | Tabular → tabular, with .pulse as the transparent middle |
| Inspect(ctx, path) (*InspectResult, error) | Read header + schema only (no record data) |
| Predict(ctx, req) (*PredictResult, error) | Validate a request without executing |
| Sample(ctx, path, n) ([]Record, error) | Up to n rows |
| Facet(ctx, path, field) ([]string, error) | Distinct values of a field |
| Synth(ctx, spec, out, opts) (*SynthResult, error) | Generate a synthetic cohort |
| Profile(ctx, path, opts) (*Profile, error) | Statistical summary suitable for from-profile synthesis |
| Manifest(ctx) *Manifest | Deterministic root self-description |
| Fs() afero.Fs | The underlying filesystem (used by pulse mcp and other embedders) |
Re-exported type aliases let you write pulse.Request instead of
types.Request:
type (
Request = types.Request
Response = types.Response
ComposedRequest = types.ComposedRequest
SynthSpec = synth.Spec
Profile = synth.Profile
// … and so on
)
Minimum viable embed
package main
import (
"context"
"fmt"
"log"
"github.com/frankbardon/pulse"
"github.com/frankbardon/pulse/types"
)
func main() {
ctx := context.Background()
p, err := pulse.New(pulse.Options{DataDir: "/var/data/pulse"})
if err != nil {
log.Fatal(err)
}
resp, err := p.Process(ctx, &pulse.Request{
Cohort: &types.Cohort{Filename: "sales.pulse"},
Aggregations: []*types.Aggregation{
{Type: types.AGG_AVERAGE, Field: "revenue", Label: "avg_revenue"},
},
})
if err != nil {
log.Fatal(err)
}
fmt.Println(resp.Data)
}
Where to go from here
- pulse.New & Options: full Options reference.
- pulse.Ask Unified Entry Point: the one-shot facade the MCP server uses internally.
- Custom Filesystems: in-memory testing pattern, custom storage backends.
- Streaming & ProcessStream: pull-based iteration, what streams vs what buffers.
- Parallel Compose: worker pool, fail-fast, per-request timeout.
pulse.New & Options
Audience: Go embedders constructing a Pulse instance.
pulse.New(pulse.Options{...}) is the single entry point. There is no
config file, no init function, no global state. Every option is
declared in code (or comes from PULSE_DATA_DIR when the field is
left empty).
LLM agents using MCP: the MCP server constructs its own Pulse instance from CLI flags. Agents don’t see this surface.
The Options struct
From pulse.go:
type Options struct {
// DataDir is the base directory for cohort files.
// Defaults to PULSE_DATA_DIR if empty and FS is not set.
DataDir string
// FS is an optional custom filesystem.
// When set, DataDir is ignored for filesystem construction.
FS afero.Fs
// DisableDefaults turns off the smart-defaults pass that infers
// operator Type from the named field's schema type when the caller
// omits it. Defaults to false (defaults enabled). Predict still
// computes and reports DefaultsApplied independently — this flag
// governs only what the runtime mutates on the live request.
DisableDefaults bool
}
Field reference
DataDir string
The base directory for .pulse files. Relative cohort paths
({"filename": "data.pulse"}) resolve against this directory.
| Source | Result |
|---|---|
| Non-empty Options.DataDir | Used directly |
| Empty + FS non-nil | DataDir is ignored; the FS is the trust boundary |
| Empty + FS nil | Pulse falls back to fs.Default(), which reads PULSE_DATA_DIR |
Example:
p, err := pulse.New(pulse.Options{DataDir: "/var/data/pulse"})
FS afero.Fs
A custom afero.Fs implementation. When set, it fully overrides the
filesystem layer — DataDir is unused, and PULSE_DATA_DIR is not
consulted. Use this for tests (afero.NewMemMapFs()) or non-local
backends (S3-backed afero.Fs, encrypted overlays, …).
Example:
import "github.com/spf13/afero"
p, err := pulse.New(pulse.Options{
FS: afero.NewMemMapFs(),
})
See Custom Filesystems for in-depth usage and the hermetic-test pattern.
DisableDefaults bool
The runtime smart-defaults pass infers an operator’s Type from the
named field’s schema type when the caller omits it (e.g. AGG_SUM on
a numeric field defaults appropriately; categorical fields default
toward AGG_COUNT). Set DisableDefaults = true to require an
explicit Type on every aggregation and grouper — useful when you
want the request to be source-of-truth and never be silently
re-typed.
This option only governs the runtime mutation. predict independently
computes and reports DefaultsApplied in its result envelope, so
callers can see what would have been inferred even when defaults are
disabled.
CLI parity: pulse api process --no-defaults, pulse api compose --no-defaults, pulse api ask --no-defaults.
Defaults at a glance
| Field omitted from Options | Effective behaviour |
|---|---|
| DataDir and FS both empty | Pulse calls fs.Default() → reads PULSE_DATA_DIR env var. Errors if unset and the operation needs filesystem access. |
| DataDir only | Uses an afero.NewOsFs() rooted at DataDir. |
| FS only | Uses the provided FS verbatim. |
| Both | FS wins; DataDir is ignored. |
| DisableDefaults omitted | Defaults enabled. |
Re-using a Pulse instance
Pulse is safe for concurrent use across goroutines once constructed.
The internal registries are read-only after New; each Process
call constructs fresh stateful operators per request, so multiple
goroutines can call Process/ProcessStream/Compose in parallel
against the same Pulse.
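The concurrency model described above (shared read-only registries, fresh stateful operators per call) is a common Go pattern. The sketch below shows it in miniature with stand-in types. `engine`, `accumulator`, and `process` are hypothetical and do not mirror Pulse's internal types; only the shape of the pattern is the point.

```go
package main

import (
	"fmt"
	"sync"
)

// accumulator is a stateful operator; a fresh one is built per call.
type accumulator interface {
	Add(v float64)
	Result() float64
}

type sum struct{ total float64 }

func (s *sum) Add(v float64)   { s.total += v }
func (s *sum) Result() float64 { return s.total }

// engine holds a registry that is only read after construction, so it can
// be shared across goroutines without locking.
type engine struct {
	registry map[string]func() accumulator
}

func newEngine() *engine {
	return &engine{registry: map[string]func() accumulator{
		"sum": func() accumulator { return &sum{} },
	}}
}

// process builds a fresh accumulator for this call; no state is shared.
func (e *engine) process(op string, values []float64) (float64, error) {
	factory, ok := e.registry[op]
	if !ok {
		return 0, fmt.Errorf("unknown operator %q", op)
	}
	acc := factory()
	for _, v := range values {
		acc.Add(v)
	}
	return acc.Result(), nil
}

func main() {
	e := newEngine()
	results := make([]float64, 4)
	var wg sync.WaitGroup
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each goroutine writes only its own slot.
			r, err := e.process("sum", []float64{1, 2, float64(i)})
			if err != nil {
				panic(err)
			}
			results[i] = r
		}(i)
	}
	wg.Wait()
	fmt.Println(results) // [3 4 5 6]
}
```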
For batch parallelism, prefer
ComposeParallel — it shares the read-only
registries and bounds concurrency for you.
Tearing down
There is no explicit Close() method on Pulse. The filesystem is a
borrowed handle; if you supply a custom FS, the embedder is
responsible for any cleanup that FS requires. Streaming consumers
should still call RowIter.Close() so that the underlying readers
release their buffers.
pulse.Ask — Unified Entry Point
Audience: Go embedders who want a single call that validates a request and then optionally executes it.
Ask is the one-shot facade. It collapses predict, process, and
the natural-language query parser into a single typed call. The
MCP server uses this same method internally for the pulse_ask
tool.
LLM agents using MCP: the corresponding LLM-facing surface is the pulse_ask MCP tool, documented in skills/mcp-integration.md and skills/request-recipes.md.
When to use Ask vs Process
| Goal | Reach for |
|---|---|
| Validate a request without running it | Predict (or Ask with Predict: true on the AskRequest) |
| Validate then execute in one call | Ask |
| Translate a natural-language string into a request and execute | Ask with Query set |
| Execute a request you’ve already validated separately | Process (lower overhead) |
If you’re already inside a tight loop that validates once and runs
many similar requests, prefer Process — Ask does the predict pass
on every call.
Request shape
From pulse.go:
type AskRequest struct {
File string `json:"file,omitempty"`
Request *types.Request `json:"request,omitempty"`
Query string `json:"query,omitempty"`
OnInvalid string `json:"on_invalid,omitempty"`
Predict bool `json:"predict,omitempty"`
}
| Field | Meaning |
|---|---|
| File | Cohort path. When set and Request.Cohort is nil, Ask synthesises a Cohort from the path. |
| Request | Structured types.Request. Optional when Query is set; the parser fills empty slots. |
| Query | Natural-language query string (“average revenue by region”). Parsed against the cohort’s schema. |
| OnInvalid | "abort" (default) returns a SERVICE_VALIDATION error on predict-invalid; "suggest" returns the response with Suggestions populated. |
| Predict | When true, skip execution after a successful predict. The “what would happen if I ran this” probe. |
Response shape
type AskResponse struct {
FormatVersion string `json:"format_version"`
Predict *descriptor.PredictResult `json:"predict"`
Process *Response `json:"process,omitempty"`
Suggestions []errors.Fixup `json:"suggestions,omitempty"`
QueryResolution *QueryResolution `json:"query_resolution,omitempty"`
Errors []*descriptor.EnvelopeEntry `json:"errors"`
Warnings []*descriptor.EnvelopeEntry `json:"warnings"`
}
- Predict is always populated.
- Process is set only when execution ran.
- Suggestions is populated only when predict reported invalid and OnInvalid == "suggest".
- QueryResolution is set only when Query was non-empty; it echoes the parser’s matched fields and aggregate confidence in [0, 1].
Examples
Structured request, predict-only
resp, err := p.Ask(ctx, &pulse.AskRequest{
Request: &pulse.Request{
Cohort: &types.Cohort{Filename: "sales.pulse"},
Aggregations: []*types.Aggregation{
{Type: types.AGG_SUM, Field: "revenue", Label: "total"},
},
},
Predict: true,
})
Natural-language query
resp, err := p.Ask(ctx, &pulse.AskRequest{
    File:  "sales.pulse",
    Query: "average revenue by region",
})
if err != nil {
    log.Fatal(err)
}
// QueryResolution is set here because Query was non-empty.
fmt.Printf("matched: %v (conf %.2f)\n",
    resp.QueryResolution.MatchedFields,
    resp.QueryResolution.Confidence)
The parser fills the structured request from the query and runs
Process. Explicit fields in Request always win on collision —
the parser only fills empty slots.
Query plus a partial structured request
resp, err := p.Ask(ctx, &pulse.AskRequest{
File: "sales.pulse",
Request: &pulse.Request{
Filterers: []*types.Filterer{
{Type: types.FILTER_RANGE, Field: "revenue", Values: []string{"100", "1000"}},
},
},
Query: "average revenue by region",
})
The structured Filterers win; the parser supplies Aggregations
and Groups from the query.
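The "explicit wins, parser fills empty slots" rule amounts to a per-slot merge. The sketch below shows one way to express it with a simplified stand-in struct. `request` and `fillEmpty` are hypothetical, the string slices replace the real typed operator lists, and treating each list as a single slot is an assumption about granularity.

```go
package main

import "fmt"

// request is a simplified stand-in for a structured request.
type request struct {
	Aggregations []string
	Groups       []string
	Filterers    []string
}

// fillEmpty copies a slot from parsed only when explicit left it empty,
// so explicit values always win on collision.
func fillEmpty(explicit, parsed request) request {
	out := explicit
	if len(out.Aggregations) == 0 {
		out.Aggregations = parsed.Aggregations
	}
	if len(out.Groups) == 0 {
		out.Groups = parsed.Groups
	}
	if len(out.Filterers) == 0 {
		out.Filterers = parsed.Filterers
	}
	return out
}

func main() {
	explicit := request{Filterers: []string{"revenue in [100, 1000]"}}
	parsed := request{
		Aggregations: []string{"average(revenue)"},
		Groups:       []string{"region"},
		Filterers:    []string{"parser-derived filter"},
	}
	merged := fillEmpty(explicit, parsed)
	fmt.Println(merged.Aggregations, merged.Groups, merged.Filterers)
}
```

In this run the explicit filter survives and the parser-derived filter is dropped, while the aggregation and group come from the parsed side.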
Suggest fixups instead of erroring
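resp, err := p.Ask(ctx, &pulse.AskRequest{
    Request:   req,
    OnInvalid: "suggest",
})
if err != nil {
    log.Fatal(err)
}
// Suggestions is populated only when predict reported the request invalid.
for _, fix := range resp.Suggestions {
    fmt.Println(fix.Code, fix.Message, fix.Hint)
}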
Fixup templates live in errors/fixup_metadata.go and are documented
per code in skills/error-code-reference.md.
Errors and warnings
AskResponse.Errors and AskResponse.Warnings flatten the descriptor
envelope’s entries plus any issues the query parser raised
(PULSE_QUERY_UNRESOLVED, PULSE_QUERY_AMBIGUOUS). The arrays are
always present (never nil) so JSON consumers can index without
null-checks — same shape as the descriptor envelope.
FormatVersion mirrors the descriptor envelope version ("1.0")
so callers can gate on a single value across endpoints.
Custom Filesystems
Audience: Go embedders running Pulse in tests (hermetic, no disk), in cloud-storage-backed environments (S3, GCS, Azure Blob via afero), or behind a custom storage layer.
Pulse routes all file I/O through afero.Fs. Pass any
afero.Fs-conformant filesystem to pulse.New(pulse.Options{FS: ...}) and Pulse never touches the OS filesystem directly.
LLM agents using MCP: the MCP server’s filesystem is fixed at startup via
PULSE_DATA_DIR or --data-dir. Agents don’t swap filesystems mid-session.
In-memory testing pattern
The single most common reason to override the filesystem is hermetic
tests. Use fs.NewMemMap() (which wraps afero.NewMemMapFs() with
the right config) or pass the afero filesystem directly:
import (
	"testing"

	"github.com/frankbardon/pulse"
	"github.com/spf13/afero"
)

func TestSomething(t *testing.T) {
	p, err := pulse.New(pulse.Options{FS: afero.NewMemMapFs()})
	if err != nil {
		t.Fatal(err)
	}
	// Write a .pulse file into the in-memory FS, then process it with p.
	// ...
}
The in-memory FS persists for the life of the FS reference. Create a fresh one per test for isolation.
Custom storage backends
Anything that implements afero.Fs works. Common patterns:
- S3 / GCS / Azure Blob: via community afero adapters (afero/gcsfs, afero/s3).
- Encrypted overlays: wrap a base FS with envelope encryption per file.
- Read-only mounts: afero.NewReadOnlyFs(base) for production cohort serving where mutation is by accident, not policy.
Example with a hypothetical S3 wrapper:
import (
	"log"

	"example.com/myorg/aferos3" // hypothetical adapter
	"github.com/frankbardon/pulse"
)

func main() {
	s3fs := aferos3.New(aferos3.Config{
		Bucket: "my-pulse-cohorts",
		Region: "us-east-1",
	})
	p, err := pulse.New(pulse.Options{FS: s3fs})
	if err != nil {
		log.Fatal(err)
	}
	_ = p // p reads and writes cohort files from S3 transparently.
}
The fs package
The lower-level constructors live in fs/:
| Function | Purpose |
|---|---|
| fs.New(opts ...Option) (*fs.Config, error) | Build a config with fs.WithFs(...) / fs.WithDataDir(...) |
| fs.Default() (*fs.Config, error) | Read PULSE_DATA_DIR from the environment |
| fs.NewMemMap() *fs.Config | In-memory test config |
You can also bypass pulse.Options entirely and construct a service
from a *fs.Config, but the public facade is the intended entry
point. pulse.New(pulse.Options{FS: yourFs}) covers every embedding
case.
Path resolution
Pulse resolves a Cohort to a path with this rule (see
resolveCohortPath in pulse.go):
if cohort.DataDir != "" → "<DataDir>/<Filename>"
else → "<Filename>"
The custom FS is then asked to open that path. For an
afero.MemMapFs, an absolute-looking path like
/var/data/sales.pulse is just a key in the in-memory map — no need
to mirror the OS layout.
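As a rough illustration of the rule above, the following sketch reproduces its two branches. The cohort struct and resolvePath function are hypothetical stand-ins, not the actual resolveCohortPath source, and joining with a plain "/" is an assumption taken from the pseudocode.

```go
package main

import "fmt"

// cohort is a hypothetical stand-in holding only the two fields the rule reads.
type cohort struct {
	DataDir  string
	Filename string
}

// resolvePath mirrors the documented rule: prefix DataDir when it is set,
// otherwise use Filename unchanged.
func resolvePath(c cohort) string {
	if c.DataDir != "" {
		return c.DataDir + "/" + c.Filename
	}
	return c.Filename
}

func main() {
	fmt.Println(resolvePath(cohort{DataDir: "/var/data", Filename: "sales.pulse"}))
	fmt.Println(resolvePath(cohort{Filename: "sales.pulse"}))
}
```

Whatever string comes out is the key handed to the filesystem, which is why an in-memory FS needs no matching directory tree.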
What custom filesystems do NOT do
- Pulse never falls back to os.Open if the custom FS fails. The custom FS is the only filesystem; if it errors, that error propagates verbatim.
- The MCP server (pulse mcp) currently uses afero.NewOsFs() only. Custom filesystems are a library-side capability today.
- The Go race detector and go test -race work normally with in-memory filesystems; tests can run with high concurrency without fighting over a real directory.
Streaming & ProcessStream
Audience: Go embedders feeding rows into an HTTP response, an NDJSON pipeline, or any consumer that wants result rows one at a time instead of buffering the full set.
pulse.ProcessStream returns a pull-based iterator. The API is
stable regardless of whether the underlying request shape streams
inside the engine — non-streamable requests return the same iterator,
they just buffer once internally before yielding.
LLM agents using MCP: see
skills/request-recipes.md for the MCP-side streaming surface (pulse_process with the streaming option). The Streamable predicate is the same on both surfaces.
The iterator API
type RowIter = service.RowIter
// In service:
type RowIter interface {
Next(ctx context.Context) (Row, bool, error)
Close() error
Metadata() *ResponseMetadata
}
type Row = service.Row // map[string]any
Usage:
iter, err := p.ProcessStream(ctx, req)
if err != nil {
return err
}
defer iter.Close()
for {
row, ok, err := iter.Next(ctx)
if err != nil {
return err
}
if !ok {
break
}
// … emit row …
}
meta := iter.Metadata() // available after drain
Metadata() returns the full ResponseMetadata (total rows,
filtered rows, cohort file) once the iterator has been drained.
What actually streams
ProcessStream always returns an iterator, but the engine only
avoids the buffered intermediate row set for a subset of request
shapes. Run pulse api predict (or Predict from the library) and
check the Streamable flag in the result:
pred, err := p.Predict(ctx, req)
if err != nil {
	return err
}
if !pred.Streamable {
	for _, reason := range pred.StreamableReasons {
		log.Printf("buffered because: %s", reason)
	}
}
The streaming-eligible request shapes are listed in Performance Notes → Streaming path.
The complement — the request shapes that force the buffered path — is at Performance Notes → Buffered path.
Streamable=false doesn’t mean the iterator is broken; it just
means rows materialise inside the engine before Next yields them.
The output API is identical either way.
CLI parity
pulse api process --stream writes NDJSON to stdout, one row per
line. pulse api compose --stream does the same with an index
field per row identifying which sub-request produced it.
Cancellation
Every Next call accepts a context. Cancellation propagates to the
underlying reader; rows that are already in flight may still be
returned before Next returns (_, false, ctx.Err()). Close()
releases any reader resources and is safe to call multiple times.
Backpressure
The iterator is pull-based: the engine produces rows only as fast as
the consumer calls Next. For HTTP responders that flush periodically,
this means you can stream a multi-GB result set through a
constant-memory buffer.
For pipelines that want to fan rows out across goroutines, copy each
row into your own struct before processing — Row is
map[string]any and the engine may re-use the backing data after
Next returns. Treat it as borrowed.
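One way to take ownership of a borrowed row is to copy it before handing it to another goroutine. The sketch below is a minimal shallow copy; the local Row alias and the cloneRow helper are illustrative names, not part of the Pulse API. A shallow copy is only sufficient if the values are scalars or strings. If a value is itself a slice or map that the engine reuses, it would need a deeper copy, and the source does not say which case applies.

```go
package main

import "fmt"

// Row matches the documented shape of a result row.
type Row = map[string]any

// cloneRow makes a shallow copy: a new map holding the same keys and values.
func cloneRow(src Row) Row {
	dst := make(Row, len(src))
	for k, v := range src {
		dst[k] = v
	}
	return dst
}

func main() {
	borrowed := Row{"region": "west", "total": 42.5}
	owned := cloneRow(borrowed)

	// Simulate the engine overwriting its row after Next returns.
	borrowed["total"] = 0.0

	// The copy still holds the values it was created with.
	fmt.Println(owned["region"], owned["total"])
}
```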
Inside the engine
Under the hood, ProcessStream calls one of four orchestrator modes
depending on the request shape: single-pass streaming, grouped
streaming, two-pass streaming, or the buffered fallback. The choice
is made via processing.CanStreamRequest(req, schema), which is the
same predicate Predict.Streamable reports — this parity is
enforced by TestPredict_Streamable_MatchesRuntime.
If you find a request that predict says is streamable but Next
materialises something large, that’s a parity drift and a bug —
please report it with the request JSON.
Parallel Compose
Audience: Go embedders running multiple requests concurrently against the same cohort or set of cohorts.
pulse.ComposeParallel fans a ComposedRequest across a bounded
worker pool. Workers share the engine’s read-only registries; each
Process call constructs fresh stateful operators per request, so
concurrent execution is safe.
LLM agents using MCP: the MCP server today exposes
pulse_compose as a sequential operation. Parallelism is a library-side capability.
When to use
| Goal | Reach for |
|---|---|
| Single request, single result | Process |
| Single request, pulled as rows | ProcessStream |
| Batch of independent requests, in order, sequential | Compose |
| Batch of independent requests, in parallel, with bounded workers | ComposeParallel |
Order of results is preserved regardless of completion order — a
worker that finishes early is held until its slot’s index is the
next to emit. So callers can index responses[i] against
req.Requests[i] directly.
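The slot-indexing guarantee can be pictured with a generic bounded worker pool. This is an illustration of the general technique only, not Pulse's implementation: runOrdered is a hypothetical helper that collects all results and returns them together, whereas the text above describes holding early finishers until their slot is next to emit.

```go
package main

import (
	"fmt"
	"sync"
)

// runOrdered applies fn to every input using at most maxWorkers goroutines.
// Each result is written to the slot matching its input index, so the output
// order never depends on which worker finishes first.
func runOrdered(inputs []int, maxWorkers int, fn func(int) int) []int {
	if maxWorkers < 1 {
		maxWorkers = 1
	}
	results := make([]int, len(inputs))
	sem := make(chan struct{}, maxWorkers) // bounds in-flight work
	var wg sync.WaitGroup

	for i, in := range inputs {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxWorkers tasks are running
		go func(i, in int) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i] = fn(in) // distinct index per goroutine, no lock needed
		}(i, in)
	}
	wg.Wait()
	return results
}

func main() {
	squares := runOrdered([]int{1, 2, 3, 4}, 2, func(n int) int { return n * n })
	fmt.Println(squares)
}
```

Writing by index rather than appending is what lets a caller line results up with inputs without any extra bookkeeping.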
ComposeOptions
From service/compose_parallel.go,
re-exported as pulse.ComposeOptions:
type ComposeOptions struct {
// MaxWorkers caps concurrent in-flight Process calls. Zero means
// runtime.GOMAXPROCS; negatives clamp to 1.
MaxWorkers int
// PerRequestTimeout, if positive, derives a context.WithTimeout for
// each request.
PerRequestTimeout time.Duration
// FailFast cancels in-flight siblings on the first request error.
// Defaults to true. Set false to aggregate all errors instead.
FailFast bool
}
| Field | Default | Notes |
|---|---|---|
| MaxWorkers | runtime.GOMAXPROCS(0) | 0 resolves to GOMAXPROCS; negative values clamp to 1 |
| PerRequestTimeout | unlimited | When positive, each worker derives context.WithTimeout |
| FailFast | true | First error cancels siblings and returns immediately |
Example
ctx := context.Background()
composed := &pulse.ComposedRequest{
Requests: []*pulse.Request{req1, req2, req3, req4},
}
resps, err := p.ComposeParallel(ctx, composed, pulse.ComposeOptions{
MaxWorkers: 4,
PerRequestTimeout: 30 * time.Second,
FailFast: true,
})
if err != nil {
return err
}
for i, resp := range resps {
fmt.Printf("request %d: %d rows\n", i, len(resp.Data))
}
FailFast semantics
With FailFast = true (the default):
- The first request to return an error cancels the shared context.
- In-flight siblings observe cancellation via ctx.Err() and return early.
- ComposeParallel returns (nil, theFirstError).
With FailFast = false:
- Every request runs to completion (or its own per-request timeout).
- Errors are aggregated into a single SERVICE_INTERNAL error whose details map carries failed_indices (a list of slot indices that errored).
- Successful slots populate the returned response array; failed slots are nil at their index.
CLI parity
pulse api compose --request batch.json --parallel 4
pulse api compose --request batch.json --parallel 4 --no-fail-fast
--parallel N:
- 1 (default) → sequential Compose.
- 0 → runtime.GOMAXPROCS.
- > 1 → exactly that many workers.
--no-fail-fast mirrors FailFast = false.
Performance considerations
- Each worker performs its own filesystem reads. If your cohort lives on slow remote storage, parallelism amortises latency well; on local SSD the gain is smaller and CPU-bound.
- Streaming aggregations are CPU-friendly: ComposeParallel over a pool of streaming requests scales near-linearly with the worker count.
- Buffered request shapes (window operators, median, …) hold memory per request. Watch MaxWorkers × per_request_peak_memory.
- The internal registries are read-only and shared across workers with no locking; only the per-request operator instances are fresh allocations.
Safety
- Pulse is safe for concurrent use after New.
- Per-request operator state (running sums, dictionaries, sorted buffers) is allocated fresh inside each Process call.
- The afero.Fs you supply must itself be safe for concurrent reads; every shipped backend (OsFs, MemMapFs) is.
Header Layout
Audience: anyone reading or writing .pulse files by hand (forensics,
custom readers, debugging a truncated file). The Go library handles all of
this for you; this page documents the wire format.
The header is fixed-size: 9 bytes, consisting of an 8-byte magic identifier and a 1-byte format version.
LLM agents using MCP: see the
cohort-schema-design skill via pulse_skills_get. It speaks in field-type semantics rather than byte layout; this page covers the bytes.
Constants
These live in encoding/header.go:
| Name | Value | Purpose |
|---|---|---|
| MagicBytes | []byte{'P','U','L','S','E', 0x00, 0x00, 0x00} | 8-byte identifier; rejects non-Pulse files |
| FormatVersion | 0x01 (today) | Current .pulse wire format |
| HeaderSize | 9 | Total header byte count |
Byte layout
Offset Length Field
------ ------ -----
0 8 Magic: "PULSE\0\0\0"
8 1 Format version (currently 0x01)
9 — Schema block begins here
That’s the entire fixed header. The schema block immediately follows; see Schema Block.
Version semantics
The format version is single-byte. The reader at
encoding.ReadHeader rejects unknown versions with the
ENCODING_INVALID error code:
ENCODING_INVALID: unsupported pulse format version
{"version": <byte>}
This is the fail-loud guard against silently mis-decoding a file written by a future binary that introduced a new field type or layout change. A forward-incompatible change bumps the version; the older reader stops at header parse instead of producing wrong rows.
The current value is 0x01. The envelope format_version ("1.0")
that all CLI --json output carries is unrelated — it tracks the
JSON output schema, not the binary file format.
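For a non-Go or hand-rolled reader, the header check amounts to a few comparisons. The sketch below follows the byte layout documented on this page. The checkHeader name and its plain Go errors are illustrative and do not reproduce encoding.ReadHeader or the ENCODING_INVALID envelope.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// Values taken from the constants table on this page.
var magic = []byte{'P', 'U', 'L', 'S', 'E', 0x00, 0x00, 0x00}

const (
	headerSize    = 9
	formatVersion = 0x01
)

// checkHeader validates the first nine bytes of a file and returns the
// format version byte. Unknown versions are rejected rather than guessed at.
func checkHeader(b []byte) (byte, error) {
	if len(b) < headerSize {
		return 0, errors.New("truncated header")
	}
	if !bytes.Equal(b[:8], magic) {
		return 0, errors.New("bad magic: not a .pulse file")
	}
	if b[8] != formatVersion {
		return 0, fmt.Errorf("unsupported format version 0x%02x", b[8])
	}
	return b[8], nil
}

func main() {
	good := append(append([]byte{}, magic...), 0x01)
	v, err := checkHeader(good)
	fmt.Println(v, err)

	future := append(append([]byte{}, magic...), 0x02)
	_, err = checkHeader(future)
	fmt.Println(err)
}
```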
Hexdump sanity check
A freshly-written .pulse file starts with:
00000000 50 55 4c 53 45 00 00 00 01 .. .. .. .. ..
|P U L S E \0 \0 \0|ver| schema starts here
If file path/to/data.pulse reports “data” (rather than something
plausible) and the first nine bytes don’t match the above, the file is
either truncated or corrupted — see
Troubleshooting.
What comes next
The schema block follows the header. Read it as documented in Schema Block; it carries per-field descriptors, inline categorical dictionaries, and decimal/H3 metadata. After the schema, fixed-width records start — see Record Layout.
Field Types
Audience: anyone designing a cohort schema, decoding a .pulse
file by hand, or trying to understand which type to pick for a column.
Pulse supports 17 field types, each with a fixed type byte, a fixed (or bit-packed) byte size, and well-defined semantics. The full list, mirrored from CLAUDE.md → All 17 field types:
LLM agents using MCP: see the
cohort-schema-design skill via pulse_skills_get; it covers nullability, bit-packing trade-offs, and “which type to pick” with MCP-side examples.
The catalog
| Type | Byte value | ByteSize | Notes |
|---|---|---|---|
| u8 | 0 | 1 | Unsigned 8-bit integer |
| u16 | 1 | 2 | Unsigned 16-bit integer |
| u32 | 2 | 4 | Unsigned 32-bit integer |
| u64 | 3 | 8 | Unsigned 64-bit integer |
| f32 | 4 | 4 | 32-bit IEEE 754 float |
| f64 | 5 | 8 | 64-bit IEEE 754 float |
| nullable_bool | 6 | 0 | Bit-packed tri-state (null/true/false) |
| nullable_u4 | 7 | 0 | Bit-packed, 4-bit nullable unsigned |
| nullable_u8 | 8 | 1 | Nullable 8-bit unsigned |
| nullable_u16 | 9 | 2 | Nullable 16-bit unsigned |
| date | 10 | 4 | Date as 32-bit value |
| packed_bool | 11 | 0 | Bit-packed boolean |
| categorical_u8 | 12 | 1 | Categorical with up to 256 dictionary entries |
| categorical_u16 | 13 | 2 | Categorical with up to 65,536 entries |
| categorical_u32 | 14 | 4 | Categorical with up to 4,294,967,295 entries |
| decimal128 | 15 | 16 | Fixed-point exact decimal; per-field (precision, scale) ≤ (38, 38) |
| nullable_decimal128 | 16 | 16 | decimal128 plus an INT128_MIN null sentinel |
The Go source-of-truth for this table is
encoding/field_type.go;
the FieldType enum’s iota order is the byte-value order above.
Type families
Plain integers and floats
u8, u16, u32, u64, f32, f64. Standard little-endian
encoding, full range, no null sentinel. Use these when you know the
column never carries a missing value.
Nullable integers
nullable_u8, nullable_u16, nullable_u4, nullable_bool. Each
reserves one in-band value (or one in-band bit pattern) to mean
“null”. For the byte-sized variants the encoding is straightforward;
for the sub-byte variants (nullable_u4, nullable_bool,
packed_bool) Pulse packs multiple fields into shared bytes — see
Record Layout → Bit-packing.
ByteSize() returns 0 for the bit-packed types because they don’t
allocate whole bytes of their own; the schema reader uses BitPosition
to locate them within shared bytes.
Date
date is a 32-bit count of days since the Unix epoch. The range is
~5.8 million years on either side of 1970 — effectively unbounded for
real data.
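Turning a stored day count back into a calendar date is a single offset from the epoch. The sketch below assumes the epoch is 1970-01-01 in UTC and takes an int64 so that it does not commit to how the raw 32 bits are interpreted; daysToDate is an illustrative helper, not a Pulse function.

```go
package main

import (
	"fmt"
	"time"
)

// daysToDate converts a day count relative to 1970-01-01 into a UTC date.
func daysToDate(days int64) time.Time {
	epoch := time.Date(1970, time.January, 1, 0, 0, 0, 0, time.UTC)
	return epoch.AddDate(0, 0, int(days))
}

func main() {
	fmt.Println(daysToDate(0).Format("2006-01-02"))     // 1970-01-01
	fmt.Println(daysToDate(19723).Format("2006-01-02")) // 2024-01-01
}
```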
Categoricals
categorical_u8, categorical_u16, categorical_u32. Each stores
its string-to-ID mapping inline as a dictionary block immediately
after the field’s schema entry. Pick the smallest variant that fits
your cardinality (Pulse’s import path auto-selects during inference).
Dictionary mechanics are documented in Dictionary Blocks.
Decimal128
decimal128 and nullable_decimal128 are 16-byte fixed-point decimal
numbers. Each field carries a per-field (precision, scale) pair
written into the schema after the description; precision and scale
both top out at 38 (PULSE_DECIMAL_OVERFLOW, PULSE_DECIMAL_PRECISION_LOSS).
Use these for currency and any other column where IEEE-754 rounding
is not acceptable. See the financial-cohorts skill for full
semantics including banker’s rounding and divide-by-zero policy.
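To make the fixed-point idea concrete, the sketch below decodes a 16-byte value and renders it with a given scale, relying on the encoding stated under Record Layout (little-endian two's-complement integer, scaled by 10^scale). Both helper names are hypothetical, the null sentinel is not handled, and none of this reproduces Pulse's own decimal code or its rounding rules.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// decodeDecimal128 interprets 16 little-endian bytes as a signed
// two's-complement integer.
func decodeDecimal128(raw [16]byte) *big.Int {
	// big.Int.SetBytes expects big-endian input, so reverse the bytes first.
	be := make([]byte, 16)
	for i := range raw {
		be[15-i] = raw[i]
	}
	v := new(big.Int).SetBytes(be)
	// A set top bit means the value is negative: subtract 2^128.
	if be[0]&0x80 != 0 {
		v.Sub(v, new(big.Int).Lsh(big.NewInt(1), 128))
	}
	return v
}

// formatScaled renders unscaled / 10^scale as a decimal string.
func formatScaled(unscaled *big.Int, scale int) string {
	neg := unscaled.Sign() < 0
	digits := new(big.Int).Abs(unscaled).String()
	if scale > 0 {
		// Left-pad so at least one digit remains before the point.
		if len(digits) <= scale {
			digits = strings.Repeat("0", scale-len(digits)+1) + digits
		}
		digits = digits[:len(digits)-scale] + "." + digits[len(digits)-scale:]
	}
	if neg {
		return "-" + digits
	}
	return digits
}

func main() {
	var raw [16]byte
	raw[0], raw[1] = 0x39, 0x30 // 0x3039 = 12345 in little-endian order
	fmt.Println(formatScaled(decodeDecimal128(raw), 2))
}
```

Because the stored value is an integer, 123.45 at scale 2 is held exactly as 12345, which is the property that makes these types suitable for currency.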
Unknown type bytes
The schema reader rejects unknown FieldType bytes at parse time
with ENCODING_INVALID. This is the same fail-loud strategy as the
header version check: a file written by a future binary that
introduced a new type fails immediately at schema parse, not later
during row decode where the corruption could go unnoticed.
What you can do with each type
| Concern | Source |
|---|---|
| Which aggregators are meaningful on which types | skills/aggregation-guide.md (LLM) / api process (CLI) |
| Decimal arithmetic semantics | skills/financial-cohorts.md (LLM) |
| Categorical dictionary limits | Dictionary Blocks |
Schema Block
Audience: anyone decoding a .pulse file by hand or writing a
non-Go reader. The schema block follows the 9-byte
header and carries one descriptor per column.
From CLAUDE.md, byte-layout invariants for
.pulse files, plus the on-disk format documented in encoding/schema.go.
Top-level shape
u16 field_count
field_record × field_count
Each field_record is variable-width (it includes UTF-8 name and
description strings, and may include a categorical dictionary or
decimal/H3 metadata). The reader walks them sequentially.
Per-field record
In write order — see WriteSchema / ReadSchema in
encoding/schema.go:
| # | Field | Size | Encoding |
|---|---|---|---|
| 1 | type | 1 byte | FieldType byte (see Field Types) |
| 2 | name_length | 2 bytes | u16 little-endian |
| 3 | name | name_length bytes | UTF-8 |
| 4 | byte_offset | 4 bytes | u32 LE — offset within a record |
| 5 | bit_position | 1 byte | u8 — bit position within byte_offset (bit-packed types only) |
| 6 | csv_column_idx | 2 bytes | u16 LE — source column index at import time |
| 7 | description | 2 bytes length + UTF-8 | Capped at 1000 bytes (PULSE_IMPORT_DESCRIPTION_TOO_LONG) |
| 8 | (decimal only) precision | 1 byte | decimal128 and nullable_decimal128 only |
| 9 | (decimal only) scale | 1 byte | same |
| 10 | (categorical only) dictionary | variable | See Dictionary Blocks |
Order matters: every reader walks these in the listed order, so a
malformed record stops the parse with ENCODING_INVALID.
Byte offsets and bit positions
byte_offset is the offset of this field’s first byte within a
record. For bit-packed types (packed_bool, nullable_bool,
nullable_u4), byte_offset plus bit_position together locate the
field’s bits within a byte that may be shared with adjacent fields.
For non-packed types, bit_position is always 0.
Record layout mechanics — including the bit-packing rule, record-size computation, and how the encoder packs adjacent sub-byte fields — are in Record Layout.
Conditional trailers
Two trailers attach only to specific field types:
- decimal128 / nullable_decimal128 get a (precision, scale) pair (u8, u8). Both ≤ 38.
- Categorical types (categorical_u8, categorical_u16, categorical_u32) get a full dictionary block inline; see Dictionary Blocks.
A field with none of the above writes nothing after the description.
Field descriptions
The description string is UTF-8 with a 2-byte length prefix. The
import path rejects descriptions longer than 1000 bytes
(PULSE_IMPORT_DESCRIPTION_TOO_LONG) and warns on low-quality
descriptions (empty, under 10 characters, or generic words like
"n/a", "tbd", "unknown", "field", "data", "value",
"column") — that warning is PULSE_FIELD_DESCRIPTION_LOW_QUALITY,
upgraded to an error under --strict.
When the description is empty, pulse cohort inspect synthesises a
fallback string (“Categorical field: …”) and reports description_source = "synthesized". The original
bytes on disk remain empty.
Reader behaviour
encoding.ReadSchema is intentionally strict:
- Field count limit comes from the u16 prefix (max 65,535 fields).
- Unknown type bytes fail loud (ENCODING_INVALID).
- Truncated records fail loud at the first short read.
- The reader produces a *encoding.Schema with one encoding.Field per record; Schema.Field(name) looks fields up by name.
After the schema block, record data starts at the file’s first byte past the schema. The record layout is documented in Record Layout.
Dictionary Blocks
Audience: anyone decoding categorical fields, sizing a categorical type during import, or chasing a dictionary-overflow error.
Categorical fields (categorical_u8, categorical_u16,
categorical_u32) store their string-to-ID mapping inline, immediately
after the field’s schema entry. The dictionary is part of the schema
block, not the record data.
LLM agents using MCP: the
cohort-schema-design skill covers when to pick which categorical width; the import-best-practices skill covers fail-closed semantics on overflow.
On-disk layout
From encoding/dictionary.go:
u32 count
(u16 strlen + utf8 bytes) × count
Sizes are little-endian. Each entry’s ID is its insertion index
(0..count-1); ID lookups during decode use the ID found in the
record byte(s) and resolve to the string at that index.
Sizing the type
| Type | Max entries | Bytes per record value |
|---|---|---|
| categorical_u8 | 256 | 1 |
| categorical_u16 | 65,536 | 2 |
| categorical_u32 | 4,294,967,295 | 4 |
The import path samples the source (--sample-rows, default 500) to
estimate cardinality and picks the smallest width that fits. You can
also force a width by editing the schema template (pulse import schema-template SOURCE).
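The "smallest width that fits" rule reduces to comparing a cardinality against the caps in the table above. The helper below is a hypothetical sketch of that comparison only. How the importer turns a sample into a cardinality estimate, and whether it leaves headroom, is not described here and is not modelled.

```go
package main

import (
	"errors"
	"fmt"
)

// pickCategoricalWidth returns the smallest categorical type whose
// documented entry cap covers the given number of distinct values.
func pickCategoricalWidth(distinct uint64) (string, error) {
	switch {
	case distinct <= 256:
		return "categorical_u8", nil
	case distinct <= 65536:
		return "categorical_u16", nil
	case distinct <= 4294967295:
		return "categorical_u32", nil
	default:
		return "", errors.New("too many distinct values for any categorical type")
	}
}

func main() {
	for _, n := range []uint64{12, 257, 70000} {
		t, err := pickCategoricalWidth(n)
		fmt.Println(n, t, err)
	}
}
```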
Overflow and unbounded errors
AddWithLimit enforces the per-type cap and returns
PULSE_IMPORT_CATEGORICAL_OVERFLOW when the source has more distinct
values than the dictionary can hold:
{
"code": "PULSE_IMPORT_CATEGORICAL_OVERFLOW",
"message": "categorical dictionary overflow: max 256 entries",
"details": {"max_entries": 256, "value": "the_257th_distinct_string"}
}
The companion code PULSE_IMPORT_CATEGORICAL_UNBOUNDED fires when the
import path detects an effectively unbounded categorical column (the
schema declared categorical_u32 and the column still grew past the
caller-provided guardrails). Both errors halt the import — fail-closed,
no partial output.
Recovery options, in order of preference:
- Re-import with a wider categorical type (categorical_u8 → categorical_u16 → categorical_u32).
- Drop the categorical encoding and treat the column as a plain string field. Pulse has no native variable-string type, so this means adding a pre-import transform to bucket values.
- Pre-filter the source to a smaller distinct set and re-import.
Inspect behaviour
pulse cohort inspect --json reports each categorical field’s
dictionary entry count and sample values. By default the inline list
is capped at 100 entries (DefaultDictionaryLimit); pass --full-dict
to print the full dictionary:
pulse cohort inspect data.pulse --full-dict --json
Both forms include a truncated: true|false flag and a total_entries
count for programmatic consumers.
Performance notes
Dictionary reads are amortised: the reader allocates one shared byte
buffer for all string payloads, then does one string(...) copy per
entry. This avoids the “one allocation per entry” overhead that
naively reading length-prefixed strings would produce. The dictionary
itself is held in memory for the life of the cohort’s schema parse.
For very large dictionaries, the categorical_u32 path is still O(N)
to deserialise; if you find yourself near the 32-bit cap, you almost
certainly want a different model (a separate lookup table, or a
plain integer column with the strings stored externally).
Record Layout
Audience: anyone hand-decoding row data or implementing a non-Go reader. The schema block ends; record data starts immediately after.
Records are fixed-width. Every row in a cohort occupies the same number of bytes, computed from the schema’s field types. Variable-width data (strings) lives in the schema (as categorical dictionaries) or is not directly supported.
LLM agents using MCP: the record byte layout is an implementation detail the MCP surface hides — there is no LLM-facing skill for it. The MCP tools operate on the inspect / process / sample abstractions.
Computing record size
Record size is the sum of FieldType.ByteSize() over all schema
fields, plus the shared bytes that hold the bit-packed sub-byte fields.
For non-packed types, ByteSize() returns the obvious value
(u32 = 4, f64 = 8, decimal128 = 16); for packed types
(packed_bool, nullable_bool, nullable_u4), ByteSize() returns
0 and the field shares a byte with adjacent packed fields.
The writer (encoding/record.go) lays out fields in the order they
appear in the schema; the reader walks the same order with the
per-field ByteOffset and BitPosition recorded in the schema.
Encoding per type
From WriteFieldValue / ReadFieldValue in
encoding/record.go:
| Type family | Encoding |
|---|---|
| u8 / nullable_u8 / categorical_u8 | 1 byte, unsigned |
| u16 / nullable_u16 / categorical_u16 | 2 bytes, little-endian unsigned |
| u32 / date / categorical_u32 | 4 bytes, little-endian unsigned |
| u64 | 8 bytes, little-endian unsigned |
| f32 | 4 bytes, little-endian IEEE 754 |
| f64 | 8 bytes, little-endian IEEE 754 |
| decimal128 / nullable_decimal128 | 16 bytes, little-endian two’s-complement integer (scaled by 10^scale); null sentinel is INT128_MIN for the nullable variant |
| packed_bool / nullable_bool / nullable_u4 | Bit-packed (see below) |
Bit-packing
Sub-byte types share whole bytes with their packed neighbours. The
schema records both ByteOffset (the shared byte’s offset) and
BitPosition (which bit slot within that byte).
- packed_bool: 1 bit (true/false).
- nullable_bool: 2 bits (one null bit, one value bit) for the tri-state encoding.
- nullable_u4: 5 bits (one null bit, four value bits) for the nullable 4-bit unsigned encoding.
The writer aligns these into shared bytes from low bit to high bit;
adjacent packed fields stack into the same byte until the byte is
full, after which a new byte begins. ByteSize() == 0 is the schema
reader’s signal that a field type shares bytes — non-zero ByteSize
fields never share.
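Pulling a slot out of a shared byte is a shift and a mask once ByteOffset and BitPosition are known. The sketch below assumes bit positions count from the least significant bit (consistent with the low-to-high packing described above) and that a slot never straddles two bytes. It returns the raw slot bits only, because this page does not state which bit inside a slot is the null bit. The extractBits helper is illustrative.

```go
package main

import "fmt"

// extractBits returns width bits of b starting at bit pos, counting from
// the least significant bit. It assumes pos+width <= 8.
func extractBits(b byte, pos, width uint) byte {
	mask := byte((uint(1) << width) - 1)
	return (b >> pos) & mask
}

func main() {
	// One shared byte holding three packed fields, low bit to high bit:
	//   bit 0      packed_bool   (1 bit)
	//   bits 1-2   nullable_bool (2 bits)
	//   bits 3-7   nullable_u4   (5 bits)
	shared := byte(0b10110_10_1)

	fmt.Println(extractBits(shared, 0, 1)) // packed_bool slot
	fmt.Println(extractBits(shared, 1, 2)) // nullable_bool slot
	fmt.Println(extractBits(shared, 3, 5)) // nullable_u4 slot
}
```

The three widths in this example add up to exactly eight bits, so all three fields fit in one byte.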
Null sentinels
| Type | Null encoding |
|---|---|
| nullable_u8 | 0xFF |
| nullable_u16 | 0xFFFF |
| nullable_u4 | Dedicated bit pattern within the packed byte |
| nullable_bool | Dedicated bit within the packed byte |
| nullable_decimal128 | INT128_MIN (0x8000…0000) |
u32, u64, f32, f64, date, decimal128 (non-nullable), and
all categoricals are non-nullable — the import path either coerces
or rejects rows with missing values (PULSE_IMPORT_ROW_ERROR). Pick
the nullable_* variant when you need to preserve the difference
between “zero” and “missing”.
Reading a record
The Go decoder lives at encoding.Reader /
encoding.ReadRecord(*Schema, []byte). A non-Go reader can follow
the same recipe:
- Compute record size from the schema.
- Read record_size bytes.
- For each schema field in declaration order:
  - If ByteSize() > 0, decode the value at the field’s ByteOffset.
  - If ByteSize() == 0, decode the bit slot at (ByteOffset, BitPosition) using the type’s bit-pattern rules.
Forward compatibility
Records carry no type tag — they’re a packed binary blob whose interpretation comes entirely from the schema block. That’s why the file’s format version (in the header) and unknown field-type bytes (in the schema block) both fail loud at parse time: the records themselves cannot self-correct, so the format gates everything before record data is observed.
MCP Integration
Audience: operators wiring Pulse into an MCP-aware AI client (Claude Desktop, Claude Code, Cursor, Zed, custom hosts), and embedders who want to expose Pulse to an LLM agent.
This page is the human-facing guide: what the server does, how to wire it up, what the LLM sees, and how to debug a misbehaving session. Agent-facing guidance ships inside the binary as the mcp-integration skill — fetch it via pulse_skills_get (or pulse skills show mcp-integration).
What pulse mcp is
pulse mcp runs the Pulse library as a Model Context Protocol (MCP) server. The host (Claude Desktop, Claude Code, etc.) launches it as a subprocess, speaks JSON-RPC over its stdio streams, and shuts it down on session close. The LLM sees Pulse as a set of tools (callable functions), resources (browseable URIs), and prompts (canned slash commands).
┌─────────────┐ stdio JSON-RPC ┌────────────┐ Go calls ┌─────────────┐
│ AI client │ ───────────────→ │ pulse mcp │ ─────────→ │ pulse.Pulse │
│ (host) │ ←─────────────── │ (this bin) │ ←───────── │ (library) │
└─────────────┘ └────────────┘ └─────────────┘
│
└── stderr ─→ host log pane
The server is a thin translator. Every tool wraps a public method on pulse.Pulse; the same code path powers the CLI.
Quickstart
# 1. Build and place on PATH
make build && cp ./bin/pulse /usr/local/bin/
# 2. Pick a data directory
mkdir -p /var/data/pulse
# 3. Wire into your host (see below) and restart it
# 4. From the LLM session, call:
# pulse_manifest → cache once
# pulse_ask → run analyses
Wiring into a host
Claude Desktop
Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
{
"mcpServers": {
"pulse": {
"command": "/usr/local/bin/pulse",
"args": ["mcp"],
"env": {
"PULSE_DATA_DIR": "/var/data/pulse"
}
}
}
}
Restart Claude Desktop. Pulse tools appear in the tool picker.
Claude Code
claude mcp add pulse --env PULSE_DATA_DIR=/var/data/pulse -- pulse mcp
Or by hand in ~/.claude.json (or per-project .claude.json):
{
"mcpServers": {
"pulse": {
"command": "/usr/local/bin/pulse",
"args": ["mcp"],
"env": { "PULSE_DATA_DIR": "/var/data/pulse" }
}
}
}
Cursor / Zed / generic stdio hosts
Any host that speaks the MCP stdio transport can launch pulse mcp the same way — provide the binary path, the mcp argument, and the PULSE_DATA_DIR env var.
What the LLM sees
Tool surface
Sixteen tools, registered at server start. Names and order match internal/mcp/mcptools/meta.go.
| Tool | Purpose |
|---|---|
| pulse_manifest | Call first. Self-description: commands, operators (with accepted types + streamability), tier-1/tier-2 tests, regressions, synth distributions, error code list, MCP tool list, cohort field types with operator cross-references. Cache once per session. |
| pulse_ask | Preferred entry point. One-shot: optional auto-import → inspect → predict → execute. Accepts source (raw file path) + query (natural language, beta) or a structured request. |
| pulse_inspect | Read .pulse header + schema (no record bytes). Side effect: registers session-scoped schema-bound tool variants (see below). |
| pulse_predict | Validate a request against the schema without executing. Returns errors, warnings, applied defaults, streamability reasons. |
| pulse_process | Execute one pre-built request. |
| pulse_compose | Execute a batch of requests against the same cohort in one round trip. |
| pulse_sample | Return up to N rows for preview / diagnostics. |
| pulse_facet | Distinct values for a single field. |
| pulse_import | Convert a tabular source (csv, tsv, ndjson, jsonarray, parquet, arrow, excel) into a managed .pulse handle under imports/, with TTL-tracked sidecar. Pulse-format inputs pass through. |
| pulse_drop | Delete a managed-import handle and its sidecar. |
| pulse_imports_list | Enumerate managed handles with sidecar metadata (source, format, imported_at, expires_at, ttl, expired flag, pinned flag). |
| pulse_examples_search | Search the embedded request-example library by query, taxonomy tags (ANDed), or category. |
| pulse_examples_get | Fetch one runnable example body by name. |
| pulse_errors_lookup | Per-code Message + Fixup detail (kept out of the manifest for context economy). |
| pulse_skills_list | Embedded skill metadata. |
| pulse_skills_get | Fetch one skill body by name. |
Natural-language query is beta. Heuristic parsing only; silent misinterpretation is possible. The LLM should always check the query_resolution and resolved request in the response before trusting results. For production, author a structured request against the cached manifest and skip the query field.
Resources
| URI scheme | Yields |
|---|---|
| pulse://<path> | One resource per .pulse file under the data directory. Read returns descriptor.InspectResult JSON (header + schema only, no record bytes). |
| pulse-skill://<name> | One per embedded skill. Read returns the markdown body. |
Resources are registered once at server start. Files added afterwards do not appear until the server restarts. Listing is cheap because the server only reads header bytes.
Prompts
| Name | Args | Returns |
|---|---|---|
| pulse-bootstrap | none | A short instructions block telling the assistant what to call (and in what order) before authoring any request, and where the authoritative references live. Inject at session start. |
| pulse-author-request | question | A guided tool-call sequence for translating a natural-language analytical question into a Pulse request: manifest → examples search → ask. |
Hosts that surface prompts as slash commands let users trigger these directly.
Recommended session flow
The two-call default for nearly every user request:
- pulse_manifest once at session start. No arguments. Cache the payload: it is deterministic for a binary version and carries every fact needed to author a valid request.
- pulse_ask for everything else. It collapses import + inspect + predict + execute into one round trip.

  When the user hands the LLM a raw file:

  { "request": "{\"source\":\"data.csv\",\"query\":\"average revenue by month\"}" }

  When the cohort already exists as a managed handle or .pulse file:

  { "request": "{\"cohort\":{\"filename\":\"sales.pulse\"},\"query\":\"top 5 regions by revenue\"}" }

  On predict-invalid with on_invalid="suggest", the response carries structured Fixup entries derived from each error code’s metadata so the LLM can repair the request without another round trip.
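In both payloads the request argument is a JSON-encoded string rather than a nested object, so the inner body has to be serialized first and then embedded. A minimal Go sketch of that double encoding follows; the helper name askArguments is hypothetical (not part of Pulse) and it assumes only the payload shape shown above.

```go
package main

import "encoding/json"

// askArguments builds the arguments object for a pulse_ask tool call.
// Hypothetical helper: it mirrors the payload shape in the text and is
// not taken from the Pulse codebase.
func askArguments(inner map[string]any) (string, error) {
	// First pass: serialize the inner request body.
	innerJSON, err := json.Marshal(inner)
	if err != nil {
		return "", err
	}
	// Second pass: embed that JSON text as a plain string value.
	outer, err := json.Marshal(map[string]string{"request": string(innerJSON)})
	if err != nil {
		return "", err
	}
	return string(outer), nil
}
```

Letting encoding/json do both passes avoids hand-escaping the inner quotes, which is the usual source of malformed tool calls.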
Reach for the multi-step path (pulse_inspect → pulse_predict → pulse_process) only when:
- diagnosing a failed predict and you want the full envelope,
- previewing rows (pulse_sample) or value distributions (pulse_facet),
- pre-staging a managed handle with a specific name / TTL / pinning (pulse_import),
- batching multiple requests in one call (pulse_compose).
Managed imports + TTL
pulse_import lets the LLM hand the server any tabular file and address it from then on as if it were a .pulse.
- Convertible formats (csv, tsv, ndjson, jsonarray, parquet, arrow, excel) are imported into $PULSE_DATA_DIR/imports/<handle>.pulse with a sidecar <handle>.pulse.meta.json carrying imported_at, expires_at, ttl_seconds, source path, source format, and row count. The response sets result.managed=true.
- Pulse passthroughs (.pulse extension) under PULSE_DATA_DIR are not copied; the server returns the relative path verbatim with managed=false. A .pulse outside PULSE_DATA_DIR is copied into the managed pool.
Source path resolution. Relative source paths resolve against PULSE_DATA_DIR. Absolute paths read from the host filesystem through a separate “source fs.”
Import jail. Absolute source paths are confined to a single directory tree (the jail root). Default: the working directory the MCP server was launched from. Paths that escape the jail (including ..) return PULSE_IMPORT_SOURCE_FORBIDDEN. Override via pulse.Options.ImportSourceJailRoot when embedding.
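The confinement rule can be illustrated with a lexical check. The sketch below is an assumption about how such a jail might be tested, not Pulse's implementation: insideJail is a hypothetical name, the check is purely textual (it does not resolve symlinks), and the paths in the usage note are made up.

```go
package main

import (
	"path/filepath"
	"strings"
)

// insideJail reports whether path p, once cleaned, stays inside root.
// Illustrative only: symlinks are not resolved, so a real jail would
// need an extra step such as filepath.EvalSymlinks.
func insideJail(root, p string) bool {
	rel, err := filepath.Rel(filepath.Clean(root), filepath.Clean(p))
	if err != nil {
		return false
	}
	// A relative path that is ".." or starts with "../" climbs out of root.
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}
```

With root /srv/work, the path /srv/work/data/a.csv is accepted, while /srv/work/../etc/passwd and the sibling /srv/workspace/a.csv are both rejected. Comparing via filepath.Rel instead of a string prefix is what catches the sibling-directory case.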
Sliding TTL. Default lifetime is 7d (overridable via PULSE_IMPORT_TTL, or per-import via the ttl field — accepts Go duration like "24h", day form like "7d", or "pin" for never-expire). Every subsequent inspect/predict/process/sample/facet/ask against the handle slides expires_at forward. The pool self-sweeps on every pulse_import call — no daemon required. Inspect with pulse_imports_list; evict manually with pulse_drop.
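The three accepted TTL spellings can be parsed with the standard library plus one special case for the day form, which time.ParseDuration does not understand. This is a sketch under stated assumptions: parseTTL and slideExpiry are hypothetical names, and "sliding" is assumed to mean resetting expires_at to the access time plus the TTL.

```go
package main

import (
	"errors"
	"strconv"
	"strings"
	"time"
)

// parseTTL interprets "pin", a day form such as "7d", or a Go duration
// such as "24h". pinned=true means the handle never expires.
func parseTTL(s string) (ttl time.Duration, pinned bool, err error) {
	if s == "pin" {
		return 0, true, nil
	}
	if strings.HasSuffix(s, "d") {
		// Go durations have no day unit, so handle "<n>d" by hand.
		n, convErr := strconv.Atoi(strings.TrimSuffix(s, "d"))
		if convErr != nil || n <= 0 {
			return 0, false, errors.New("invalid day-form TTL: " + s)
		}
		return time.Duration(n) * 24 * time.Hour, false, nil
	}
	ttl, err = time.ParseDuration(s)
	if err != nil {
		return 0, false, err
	}
	if ttl <= 0 {
		return 0, false, errors.New("TTL must be positive: " + s)
	}
	return ttl, false, nil
}

// slideExpiry returns the new expires_at after an access at time now.
// Assumption: each access resets the deadline to now + ttl.
func slideExpiry(now time.Time, ttl time.Duration) time.Time {
	return now.Add(ttl)
}
```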
Schema-bound enums
After a successful pulse_inspect (or after pulse_ask opens a cohort), the server registers session-scoped variants of the action tools (pulse_process, pulse_predict, pulse_compose, pulse_sample, pulse_facet) whose JSON Schemas embed enum constraints on field-name parameters. The LLM picks field names from a typed list rather than free-texting and discovering on predict that the name was wrong.
What gets constrained on bound pulse_process / pulse_predict / pulse_compose schemas:
| Path | Enum |
|---|---|
| aggregations[].field | All cohort field names |
| aggregations[].type | Full aggregator catalogue (AGG_*) |
| attributes[].field | Numeric fields only (includes decimal) |
| attributes[].type | Full attribute catalogue (ATTR_*) |
| filterers[].field | All cohort field names |
| filterers[].type | Full filterer catalogue (FILTER_*) |
| groups[].field | All cohort field names |
| groups[].type | Full grouper catalogue (GROUP_*) |
| windows[].field, windows[].partition_by[] | All cohort field names |
| windows[].order_by[].field | Numeric and date fields |
| windows[].type | Full window catalogue (WIN_*) |
| tests[].field, tests[].field2 | Numeric fields only |
| tests[].split_by / rows / cols / subject_field | All cohort field names |
| tests[].type | Full test catalogue (TEST_*) |
| pulse_facet field arg | All cohort field names |
Trigger and lifecycle. Binding fires on a successful pulse_inspect. mcp-go auto-fires notifications/tools/list_changed on AddSessionTools; the host refreshes its tool list and picks up the bound schemas on the next list. Bound tools share names with the global tools — session-scoped variants override globals for that session.
Limitations.
- Multi-file sessions: the latest inspect wins. Track multiple cohorts client-side.
- No per-element type ↔ field correlation: JSON Schema can’t easily express “if aggregations[i].type == AGG_SUM then aggregations[i].field must be numeric.” Operator–type compatibility lives in the type property description; strict validation remains pulse_predict’s job.
- Transport support: binding requires a session that implements SessionWithTools. SSE / Streamable HTTP transports work; on stdio, binding is a no-op fallback and the global (unbound) schemas remain in effect. The manifest’s accepts_types table is still authoritative, so authoring is not blocked, just less ergonomic.
- Empty enums omitted: when the cohort has zero fields in a category (e.g. no geo fields), the enum is omitted entirely rather than emitted as [].
Disable binding entirely with --bind-on-open=false.
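To make the binding concrete, here is a small Go sketch of how a field-name parameter could be turned into a JSON Schema fragment with an enum, including the empty-enum omission described under Limitations. The function name fieldEnumSchema is hypothetical and the fragment shape is an assumption; the real bound schemas are whatever the server emits.

```go
package main

// fieldEnumSchema returns a JSON Schema fragment for a field-name
// parameter. Hypothetical helper for illustration only.
func fieldEnumSchema(names []string) map[string]any {
	schema := map[string]any{"type": "string"}
	// An empty enum would reject every value, so omit the keyword
	// instead of emitting "enum": [].
	if len(names) > 0 {
		schema["enum"] = names
	}
	return schema
}
```

Marshalling fieldEnumSchema([]string{"region", "revenue"}) with encoding/json yields an object carrying both type and enum; with no names, only type remains and the parameter falls back to free text.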
Configuration
| Env var | Purpose | Default |
|---|---|---|
| PULSE_DATA_DIR | Cohort base directory. Required. | (none; the server fails to start without it) |
| PULSE_IMPORTS_DIR | Subdirectory for managed-import handles. | imports |
| PULSE_IMPORT_TTL | Default TTL for managed handles. Accepts Go duration (24h, 30m), day form (7d, 30d), or pin. | 7d |
Embedders can override per-instance via pulse.Options{DataDir, ImportsDir, ImportTTL, ImportSourceJailRoot, FS, ImportSourceFS, BindOnOpen} — see pulse.go.
Transport caveats
- Stdio. The default and only transport pulse mcp ships today. Schema binding is a no-op (see Limitations). Stdout is the JSON-RPC channel; stderr is the log channel. Never write structured output to stdout outside the protocol.
- SSE / Streamable HTTP. Not exposed by the mcp CLI leaf yet. The underlying mcp-go server supports them; embedders can call mcp.NewWithOptions(p, ...) and serve via mcp-go’s SSE / streamable HTTP entry points directly.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| data directory required: set PULSE_DATA_DIR or pass --data-dir | Neither env var nor flag set | Pass PULSE_DATA_DIR in the host’s env block, or --data-dir in args |
| Tools don’t appear in the host UI after editing config | Host caches tool list | Restart the host fully (not just the conversation) |
| pulse_import returns PULSE_IMPORT_SOURCE_FORBIDDEN for an absolute path | Path escapes the import jail (default = server’s working dir) | Either move the file under the jail, launch the server from a higher-level directory, or set pulse.Options.ImportSourceJailRoot when embedding |
| pulse_inspect succeeds but bound enums never fire | Stdio session; binding is a no-op there | Use pulse_predict for validation; the manifest’s accepts_types lists give the LLM the same information |
| Tool calls hang | Host wrote non-protocol bytes to the server’s stdin, or server wrote non-protocol bytes to stdout | Check server stderr; restart the session. pulse mcp itself only writes a one-line startup notice to stderr at boot |
| pulse_ask with query returns nonsense or wrong fields | Natural-language parsing is heuristic and beta | Inspect query_resolution in the response. For production, author a structured request against the cached manifest |
To see what the server registers without launching the host:
pulse --json | jq '.data.mcp_tools[]'
pulse manifest --json | jq '.data.skills[]'
Skill cross-reference for LLM agents
If you are writing a system prompt for an LLM agent that uses Pulse, point it at these skills rather than at this site:
| LLM task | Skill |
|---|---|
| MCP wiring, tool surface, schema binding | mcp-integration |
| Author a Process request | request-recipes |
| Compose multiple sub-requests in one call | compose-requests |
| Iterate on a request with pulse_predict | debugging-with-predict |
| Look up an error code or warning | error-code-reference |
| Pick an aggregator / filterer | aggregation-guide |
| Pick an attribute (z-score, percentile, formula, …) | attribute-composition |
| Design a grouper | grouper-design |
| Use a window operator (WIN_*) | window-operations |
| Use a feature engineer (FEAT_*) | feature-engineering |
| Run a statistical test (tier-1 or tier-2) | statistical-testing |
| Fit a regression (OLS, GLM, Bayesian) | regression-modeling |
| Generate synthetic data | synthetic-data |
| Understand a cohort’s schema layout | cohort-schema-design |
| Import a tabular source into .pulse | import-best-practices |
| Pick an export format | export-format-selection |
| Work with decimal128 (currency, precise arithmetic) | financial-cohorts |
| Route a natural-language query to a Pulse request | query-router-prompt |
| Get started end-to-end (LLM walkthrough) | getting-started |
The agent should call pulse_skills_list once at session start to enumerate the catalog, then pulse_skills_get on demand. The returned text is authoritative; this site does not duplicate it and may lag.
Related
- mcp (CLI leaf): flag reference and exit codes for the server binary
- Deployment: production hardening notes
- Troubleshooting: non-MCP failure modes
Request Example Library
Pulse ships a searchable, embedded catalogue of runnable request JSON files
spanning every operator category. They are checked into the repo
under examples/, mounted into the binary at compile time via //go:embed,
and surfaced through three peer access paths:
| Access path | Best for |
|---|---|
| pulse_examples_search / pulse_examples_get (MCP tools) | LLM agents authoring requests against a running Pulse server |
| pulse examples search / pulse examples show (CLI) | Developers exploring at a shell |
| pulse.ExamplesSearch / pulse.ExampleGet (Go API) | Embedders building higher-level UIs |
What the library contains
Every example is a complete types.Request JSON body — the same shape you
hand to pulse_process. Each file is annotated with a structured _meta
block describing the example. Pulse’s JSON unmarshaller ignores unknown
fields by default, so the _meta block is invisible at execution time;
the file remains runnable verbatim.
{
  "_meta": {
    "name": "t_test_one_sample",
    "category": "tests",
    "tags": ["hypothesis-test", "t-test", "tier-1-test", "parametric", "one-sample", "streaming-friendly"],
    "operators": ["AGG_AVERAGE", "AGG_COUNT", "TEST_T"],
    "description": "One-sample t-test comparing revenue mean against the hypothesized mu=100."
  },
  "cohort": {...},
  ...
}
Fetching via pulse_examples_get returns the request body with the _meta
block already stripped, so you can pass it straight to
pulse_process / pulse_predict.
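When reading an example file directly from examples/ rather than through pulse_examples_get, the same stripping is easy to reproduce. The sketch below is an assumption about one way to do it, not the library's code; stripMeta is a hypothetical name.

```go
package main

import "encoding/json"

// stripMeta removes the top-level "_meta" key from an example body and
// returns the remaining request JSON. Hypothetical helper.
func stripMeta(body []byte) ([]byte, error) {
	// RawMessage keeps every other field byte-for-byte (compacted on
	// output) without needing to know the request schema.
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(body, &obj); err != nil {
		return nil, err
	}
	delete(obj, "_meta")
	return json.Marshal(obj)
}
```

Because the text notes that unknown fields are ignored at execution time, stripping is optional for running a request; it mainly keeps payloads small when they are passed around.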
Searching the library
Three filter dimensions, all optional and combined with AND:
| Filter | Behaviour |
|---|---|
| query | Case-insensitive substring across the example’s name, description, and operator list |
| tags | An example must carry every requested tag |
| category | Exact match against the example’s directory (aggregations, attributes, features, filterers, groupers, regression, tests, windows) |
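The three filters in the table combine as a plain conjunction. A minimal Go sketch of that matching rule follows; the example struct and the matches function are hypothetical stand-ins for whatever the library uses internally, built only from the _meta fields shown earlier.

```go
package main

import "strings"

// example holds the _meta fields that search looks at.
// Hypothetical type, not the library's own.
type example struct {
	Name, Category, Description string
	Tags, Operators             []string
}

// matches applies the category, tags, and query filters with AND.
// An empty filter matches everything.
func matches(ex example, query string, tags []string, category string) bool {
	if category != "" && ex.Category != category {
		return false
	}
	for _, want := range tags {
		if !hasString(ex.Tags, want) {
			return false // every requested tag must be present
		}
	}
	if query == "" {
		return true
	}
	q := strings.ToLower(query)
	if strings.Contains(strings.ToLower(ex.Name), q) ||
		strings.Contains(strings.ToLower(ex.Description), q) {
		return true
	}
	for _, op := range ex.Operators {
		if strings.Contains(strings.ToLower(op), q) {
			return true
		}
	}
	return false
}

// hasString reports whether list contains s.
func hasString(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}
```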
CLI
pulse examples search --query welch # find Welch-related examples
pulse examples search --tag time-series --tag tier-2-test # AND tag filter
pulse examples search --category tests --json # JSON envelope
pulse examples show t_test_one_sample # print runnable JSON
pulse examples show t_test_one_sample --json # full record (with _meta)
MCP
// arguments to pulse_examples_search
{"query": "welch"}
{"tags": ["time-series", "tier-2-test"]}
{"category": "features"}
Go API
ctx := context.Background()
p, err := pulse.New(pulse.Options{DataDir: "/data"})
if err != nil {
	log.Fatal(err)
}

// Search:
hits := p.ExamplesSearch("welch", []string{"experiment-analysis"}, "")
for _, h := range hits {
	fmt.Println(h.Name, "-", h.Description)
}

// Fetch and run:
if ex, ok := p.ExampleGet("t_test_one_sample"); ok {
	var req pulse.Request
	if err := json.Unmarshal(ex.Body, &req); err != nil {
		log.Fatal(err)
	}
	resp, err := p.Process(ctx, &req)
	if err != nil {
		log.Fatal(err)
	}
	_ = resp // inspect or render the response here
}
Tag taxonomy
Tags are curated and validated by a CI gate (TestExamples_TagsFromTaxonomy).
The taxonomy spans five dimensions:
| Dimension | Tags |
|---|---|
| Domain / use case | time-series, cohort-analysis, experiment-analysis, correlation-analysis, comparison, before-after, top-n, distribution-shape, cross-tabulation, proportion-analysis, trend-detection, outlier-detection, cardinality-analysis, data-quality, geo-analysis, financial, feature-engineering |
| Statistical method | hypothesis-test, t-test, parametric, nonparametric, paired, one-sample, two-sample, k-sample, repeated-measures, post-hoc, normality-test, homogeneity-test, exact-test |
| Regression / modeling | regression, ecological, ols, glm, logistic, bayesian, regularization, ridge, lasso, elasticnet, polynomial, resampling, jackknife, selection, stepwise |
| Pipeline machinery | tier-1-test, tier-2-test, composed, pre-filter, feature-pipeline, window-operator, streaming-friendly, buffered-pipeline |
| Risk / edge | leakage-safe, leakage-risk, small-sample |
The category (directory name) is not repeated in the tags — _meta.category
carries that.
Adding a new example
- Write the request JSON under examples/<category>/. Use existing files as shape templates. Keep cohort.data_dir = ".data" and reference one of the fixture cohorts.
- Add a _meta block at the top of the file:
  - name: snake_case (lowercase words joined by underscores, as in t_test_one_sample), unique across the whole library.
  - category: must match the parent directory.
  - tags: pick 3-6 from the taxonomy above.
  - operators: the list of AGG_* / ATTR_* / FILTER_* / GROUP_* / WIN_* / FEAT_* / TEST_* types appearing in the body, alphabetized and deduped.
  - description: one-sentence, present-tense summary.
- Re-run go test ./examples/... ./descriptor/... to confirm the new file passes:
  - TestExamples_AllParseAsRequest
  - TestExamples_UniqueNames
  - TestExamples_TagsFromTaxonomy
  - TestExamples_OperatorsMatchBody
  - TestExamples_CategoryMatchesDirectory
  - TestManifestExamplesPopulated
- The annotation tool at cmd/annotate-examples/ is idempotent and may be re-used; updating its in-source annotations slice and re-running will rewrite the file’s _meta block in canonical form.
Regression Modeling
Pulse exposes regression through a compact, composable surface. Three operators, two orthogonal modifiers, and one upstream feature transform together cover every textbook regression variant. This chapter is the human-facing counterpart to skills/regression-modeling.md; agents should fetch the skill via pulse_skills_get rather than read this page.
Overview
| Operator | Engine | Streaming |
|---|---|---|
| REG_OLS | Ordinary least squares + optional regularization | Streams sufficient statistics (Phase 1 + 2) |
| REG_GLM | Generalized linear model via IRLS | Always buffered (Newton-Raphson refit) |
| REG_BAYES_LINEAR | Bayesian linear regression (conjugate NIG) | Streams sufficient statistics (Phase 4) |
Two spec-level modifiers compose with any of the three:
- Resample ∈ {jackknife, bootstrap}: replaces analytical SE / p-values with resample-based estimates. Forces buffered.
- Selection ∈ {forward, backward, stepwise}: drives AIC- or BIC-based greedy subset search. Requires Criterion. Forces buffered.
One upstream feature operator (FEAT_POLY) extends the linear core to polynomial regression. Per-row attributes (ATTR_REG_FITTED, ATTR_REG_RESIDUAL, ATTR_REG_LEVERAGE) attach per-record diagnostics in the output row stream.
The 13 textbook names → Pulse specs
The Indeed regression taxonomy double-counts (Simple ≡ Linear univariate, Multiple ≡ Multiple Linear) and treats orthogonal wrappers (Jackknife, Stepwise) as families. Pulse does not. The table below maps each textbook name onto the corresponding Pulse spec and links to a runnable example file under examples/regression/.
| # | Indeed name | Pulse expression | Example |
|---|---|---|---|
| 1 | Simple | REG_OLS with one predictor | examples/regression/02_simple_linear.json |
| 2 | Multiple | REG_OLS with multiple predictors | examples/regression/03_multiple_linear.json |
| 3 | Linear | = #1 | examples/regression/02_simple_linear.json |
| 4 | Multiple Linear | = #2 | examples/regression/03_multiple_linear.json |
| 5 | Logistic | REG_GLM{Family:"binomial", Link:"logit"} | examples/regression/04_logistic.json |
| 6 | Ridge | REG_OLS{Penalty:"l2", Alpha:λ} | examples/regression/05_ridge.json |
| 7 | Lasso | REG_OLS{Penalty:"l1", Alpha:λ} | examples/regression/06_lasso.json |
| 8 | Polynomial | FEAT_POLY{Field:x, Degree:n} upstream → REG_OLS | examples/regression/07_polynomial.json |
| 9 | Bayesian Linear | REG_BAYES_LINEAR{Prior:"nig"} | examples/regression/08_bayesian_linear.json |
| 10 | Jackknife | any regression with Resample:"jackknife" | examples/regression/09_jackknife.json |
| 11 | Elastic Net | REG_OLS{Penalty:"elasticnet", Alpha, L1Ratio} | examples/regression/10_elasticnet.json |
| 12 | Ecological | GROUP_* upstream → REG_OLS over group means (composed request) | examples/regression/01_ecological_fallacy.json |
| 13 | Stepwise | any regression with Selection:"stepwise", Criterion:"aic" or "bic" | examples/regression/11_stepwise.json |
Streamability matrix
| Spec | Streamable | Memory | Notes |
|---|---|---|---|
| REG_OLS no penalty | yes | O(p²) | sufficient stats: n, Σx, Σy, XᵀX, Xᵀy, Σy² |
| REG_OLS + l1 / l2 / elasticnet | yes | O(p²) | streaming Gram; regularized solve at finalize |
| REG_BAYES_LINEAR (conjugate NIG) | yes | O(p²) | streaming sufficient stats + closed-form posterior update |
| REG_GLM (binomial / poisson / gamma) | no | O(n·p) | IRLS / Newton requires multiple passes |
| Any regression with Resample != "" | no | O(n·p) | LOO / bootstrap refit |
| Any regression with Selection != "" | no | O(n·p) | refit per candidate subset |
pulse_predict reports per-request streamability on PredictResult.Streamable, mirroring the runtime gate.
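The matrix reduces to a short predicate: either modifier forces buffering, and among the base operators only REG_GLM is never streamable. The Go sketch below restates the table under that reading; regressionSpec and streamable are hypothetical names, and the real gate lives in pulse_predict and the runtime.

```go
package main

// regressionSpec carries only the fields the streamability matrix
// depends on. Hypothetical type for illustration.
type regressionSpec struct {
	Type      string // "REG_OLS", "REG_GLM", or "REG_BAYES_LINEAR"
	Resample  string // "", "jackknife", or "bootstrap"
	Selection string // "", "forward", "backward", or "stepwise"
}

// streamable mirrors the table: modifiers force the buffered path,
// and REG_GLM is always buffered. Penalties on REG_OLS do not matter.
func streamable(s regressionSpec) bool {
	if s.Resample != "" || s.Selection != "" {
		return false
	}
	switch s.Type {
	case "REG_OLS", "REG_BAYES_LINEAR":
		return true
	default:
		return false
	}
}
```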
Operator reference
REG_OLS
Ordinary least squares with optional regularization.
| Param | Required | Notes |
|---|---|---|
| target | yes | Numeric response field. |
| predictors | yes | One or more numeric predictor fields. |
| penalty | no | "" (default), "l1", "l2", or "elasticnet". |
| alpha | conditional | Required and > 0 when penalty != "". |
| l1_ratio | conditional | Required and in [0, 1] when penalty == "elasticnet". |
| max_iters | no | Coordinate-descent cap (default 1000). |
| tol | no | Convergence tolerance (default 1e-6). |
| resample | no | "jackknife" or "bootstrap". Downgrades streaming. |
| selection | no | "forward", "backward", or "stepwise". Requires criterion. Downgrades streaming. |
Modifier compatibility: Resample and Selection may be combined; Selection runs first, Resample re-fits the selected subset.
Error codes: PROCESSING_REGRESSION_RANK_DEFICIENT, PROCESSING_REGRESSION_SINGULAR_GRAM, PROCESSING_REGRESSION_NO_CONVERGE, PROCESSING_REGRESSION_INSUFFICIENT_DATA, PROCESSING_REGRESSION_APPROXIMATE_SE (warning, l1/elasticnet without resample), PROCESSING_REGRESSION_REGULARIZED_SELECTION (warning, penalty + selection), PROCESSING_CONFIG.
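Unpenalized REG_OLS can stream because the fit depends on the data only through a fixed set of running sums. The sketch below shows the idea for the single-predictor case, where XᵀX and Xᵀy collapse to a handful of scalars; it is an illustration of the technique, not Pulse's engine, and olsStats is a hypothetical type.

```go
package main

// olsStats accumulates the sufficient statistics for a simple
// (one-predictor) least-squares fit. Memory use is constant in n.
type olsStats struct {
	n, sx, sy, sxx, sxy, syy float64
}

// add folds one row into the running sums.
func (s *olsStats) add(x, y float64) {
	s.n++
	s.sx += x
	s.sy += y
	s.sxx += x * x
	s.sxy += x * y
	s.syy += y * y
}

// fit solves the 2x2 normal equations. ok is false when there are
// fewer than two rows or the predictor has no variance.
func (s olsStats) fit() (intercept, slope float64, ok bool) {
	den := s.n*s.sxx - s.sx*s.sx
	if s.n < 2 || den == 0 {
		return 0, 0, false
	}
	slope = (s.n*s.sxy - s.sx*s.sy) / den
	intercept = (s.sy - slope*s.sx) / s.n
	return intercept, slope, true
}

// rss returns the residual sum of squares at the fitted coefficients,
// again using only the accumulated sums.
func (s olsStats) rss(intercept, slope float64) float64 {
	return s.syy - intercept*s.sy - slope*s.sxy
}
```

Feeding the rows (0,1), (1,3), (2,5), (3,7) one at a time recovers intercept 1 and slope 2 without ever holding the rows in memory. The raw-sum form shown here is the simplest to read; a production engine would more likely centre the data or use a numerically stabler update.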
REG_GLM
Generalized linear model via iteratively-reweighted least squares.
| Param | Required | Notes |
|---|---|---|
| target | yes | Numeric response. |
| predictors | yes | One or more numeric predictor fields. |
| family | yes | "binomial", "poisson", or "gamma". |
| link | no | Family-specific default when empty (binomial→logit, poisson→log, gamma→inverse). |
| max_iters | no | IRLS iteration cap (default 50). |
| tol | no | Convergence tolerance (default 1e-8). |
| resample | no | "jackknife" or "bootstrap". |
| selection | no | Subset-selection wrapper; requires criterion. |
Always buffered. Setting penalty / alpha / l1_ratio on a REG_GLM spec is rejected with PROCESSING_CONFIG; regularized GLM is reserved for a later phase.
Error codes: PROCESSING_REGRESSION_INVALID_FAMILY, PROCESSING_REGRESSION_INVALID_LINK, PROCESSING_REGRESSION_NO_CONVERGE, PROCESSING_REGRESSION_INSUFFICIENT_DATA, PROCESSING_CONFIG.
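The reason REG_GLM cannot stream is visible in the fitting loop itself: every iteration recomputes weights from the current coefficients and therefore needs another pass over all rows. Below is a minimal sketch of that loop for the binomial/logit case with one predictor plus intercept. It is a textbook Newton-Raphson iteration written for illustration, with fitLogistic as a hypothetical name; the convergence test on step size is an assumption, and Pulse's own criterion may differ.

```go
package main

import "math"

// fitLogistic fits y ~ Bernoulli(sigmoid(b0 + b1*x)) by Newton-Raphson,
// which for the logit link coincides with IRLS.
func fitLogistic(x, y []float64, maxIters int, tol float64) (b0, b1 float64, converged bool) {
	for iter := 0; iter < maxIters; iter++ {
		var g0, g1, h00, h01, h11 float64
		// One full pass over the data per iteration.
		for i := range x {
			p := 1 / (1 + math.Exp(-(b0 + b1*x[i])))
			w := p * (1 - p) // IRLS weight
			r := y[i] - p    // score contribution
			g0 += r
			g1 += r * x[i]
			h00 += w
			h01 += w * x[i]
			h11 += w * x[i] * x[i]
		}
		det := h00*h11 - h01*h01
		if det == 0 {
			return b0, b1, false // singular information matrix
		}
		// Newton step: solve the 2x2 system (XᵀWX) d = Xᵀ(y − p).
		d0 := (h11*g0 - h01*g1) / det
		d1 := (h00*g1 - h01*g0) / det
		b0 += d0
		b1 += d1
		if math.Abs(d0) < tol && math.Abs(d1) < tol {
			return b0, b1, true
		}
	}
	return b0, b1, false
}
```

Running out of iterations here corresponds to the situation the text reports as PROCESSING_REGRESSION_NO_CONVERGE; perfectly separable data is the classic trigger, because the likelihood keeps improving as the slope grows without bound.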
REG_BAYES_LINEAR
Bayesian linear regression with a conjugate Normal-Inverse-Gamma prior.
| Param | Required | Notes |
|---|---|---|
| target | yes | Numeric response. |
| predictors | yes | One or more numeric predictor fields. |
| prior | no | Only "nig" accepted in v1. Default "nig". |
| prior_mu | no | Length p+1 mean vector (intercept first); defaults to zero. |
| prior_precision | no | Scalar ε ≥ 0 on the precision matrix ε·I. Default 1e-3. |
| prior_shape | no | Inverse-gamma shape a₀. Default 1e-3. |
| prior_rate | no | Inverse-gamma rate b₀. Default 1e-3. |
| credible_level | no | Posterior interval mass. Default 0.95. |
Modifier compatibility: Resample and Selection are rejected for REG_BAYES_LINEAR at spec validation — the posterior already conveys uncertainty via credible intervals, and stepwise feature selection on a Bayesian model is a posterior-based question the conjugate-NIG engine doesn’t support.
Setting penalty / alpha / l1_ratio / family / link on a Bayes spec is rejected with PROCESSING_CONFIG.
Error codes: PROCESSING_REGRESSION_RANK_DEFICIENT, PROCESSING_REGRESSION_INSUFFICIENT_DATA, PROCESSING_CONFIG.
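For reference, the standard conjugate Normal-Inverse-Gamma update is written out below in terms of the parameters in the table. It assumes the usual parameterization, in which the coefficient prior is scaled by the noise variance; whether Pulse uses exactly this scaling is an assumption, so treat the block as the textbook result rather than a statement about the engine. Here X is the n × (p+1) design matrix with the intercept column first, μ₀ is prior_mu, Λ₀ = ε·I with ε = prior_precision, a₀ = prior_shape, and b₀ = prior_rate.

```latex
\begin{aligned}
\beta \mid \sigma^2 &\sim \mathcal{N}\!\left(\mu_0,\ \sigma^2 \Lambda_0^{-1}\right),
\qquad \sigma^2 \sim \text{Inv-Gamma}(a_0, b_0),
\qquad \Lambda_0 = \varepsilon I \\[4pt]
\Lambda_n &= \Lambda_0 + X^{\top} X \\
\mu_n &= \Lambda_n^{-1}\left(\Lambda_0 \mu_0 + X^{\top} y\right) \\
a_n &= a_0 + \tfrac{n}{2} \\
b_n &= b_0 + \tfrac{1}{2}\left(y^{\top} y + \mu_0^{\top} \Lambda_0 \mu_0 - \mu_n^{\top} \Lambda_n \mu_n\right)
\end{aligned}
```

Every data-dependent quantity on the right-hand side (XᵀX, Xᵀy, yᵀy, n) is a running sum, which is why the operator can stream. Under this parameterization the marginal posterior of each coefficient is a Student-t with 2aₙ degrees of freedom, centred at the matching entry of μₙ with squared scale (bₙ/aₙ)·(Λₙ⁻¹)ⱼⱼ, and a credible interval at credible_level is read off that distribution.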
Modifiers
Resample
Layered on top of any base operator (except REG_BAYES_LINEAR).
| Value | Behavior |
|---|---|
| "" | No resampling. Closed-form / asymptotic standard errors. |
| "jackknife" | Leave-one-out resampling. SE = sqrt((n−1)/n · Σᵢ (β⁽⁻ⁱ⁾ − β̄)²). |
| "bootstrap" | Non-parametric bootstrap. bootstrap_iters (default 1000), rng_seed (0 → time-seeded; non-zero → reproducible). |
For l1 / elasticnet OLS, setting Resample is the rigorous answer for standard errors: it suppresses the PROCESSING_REGRESSION_APPROXIMATE_SE warning (the SEs are now resample-based, not plug-in over the active set).
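The jackknife formula in the table translates directly into code. The sketch below applies it to an arbitrary scalar estimator over a slice of observations; in a regression setting the estimator would refit the model and return one coefficient. The names jackknifeSE and mean are hypothetical and the code is illustrative, not Pulse's implementation.

```go
package main

import "math"

// jackknifeSE computes sqrt((n-1)/n * sum_i (theta_(-i) - thetaBar)^2),
// where theta_(-i) is the estimate with observation i left out.
func jackknifeSE(xs []float64, estimate func([]float64) float64) float64 {
	n := len(xs)
	if n < 2 {
		return math.NaN()
	}
	loo := make([]float64, n)
	buf := make([]float64, 0, n-1)
	var sum float64
	for i := range xs {
		// Rebuild the sample without element i, reusing one buffer.
		buf = append(buf[:0], xs[:i]...)
		buf = append(buf, xs[i+1:]...)
		loo[i] = estimate(buf)
		sum += loo[i]
	}
	avg := sum / float64(n)
	var ss float64
	for _, v := range loo {
		d := v - avg
		ss += d * d
	}
	return math.Sqrt(float64(n-1) / float64(n) * ss)
}

// mean is a simple estimator used to exercise jackknifeSE.
func mean(xs []float64) float64 {
	var s float64
	for _, v := range xs {
		s += v
	}
	return s / float64(len(xs))
}
```

For the sample mean the jackknife reproduces the familiar s/√n: on the values 1 through 5 it gives sqrt(0.5) ≈ 0.7071. The n refits in the loop are also why the Resample modifier forces the buffered path.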
Selection
Layered on top of any base operator (except REG_BAYES_LINEAR).
| Value | Behavior |
|---|---|
| "" | No subset selection. |
| "forward" | Start from intercept-only; add the predictor that lowers the criterion most. |
| "backward" | Start from full model; remove the predictor whose absence lowers the criterion most. |
| "stepwise" | Bidirectional sweep; try every add and every remove per cycle. |
Requires Criterion ∈ {"aic", "bic"}.
- AIC = -2·logL + 2·k. Lighter penalty; may retain weak predictors at moderate n.
- BIC = -2·logL + log(n)·k. Heavier per-parameter penalty; rejects noise predictors more reliably at moderate n.
SelectedFeatures lists the chosen subset; Coefficients drops non-selected predictors entirely (absence ≠ zero — selection’s contract is stronger). Selection may be combined with Resample: Selection picks the active subset, then Resample replaces SE / p-values on the selected model.
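The two criteria differ only in the per-parameter penalty, which a short Go sketch makes explicit. The function names aic and bic are illustrative; logL is the maximized log-likelihood, k the number of fitted parameters, and n the row count.

```go
package main

import "math"

// aic returns -2*logL + 2*k.
func aic(logL float64, k int) float64 {
	return -2*logL + 2*float64(k)
}

// bic returns -2*logL + log(n)*k, using the natural logarithm.
func bic(logL float64, k, n int) float64 {
	return -2*logL + math.Log(float64(n))*float64(k)
}
```

With logL = -100 and k = 3, AIC is 206 while BIC at n = 100 is about 213.8. BIC's penalty exceeds AIC's as soon as log(n) > 2, that is from n = 8 upward, which is why BIC drops marginal predictors more readily on any realistically sized cohort.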
Compositional patterns
Polynomial regression — FEAT_POLY + REG_OLS
Polynomial regression is linear in the coefficients; the non-linearity lives in the feature space. Use FEAT_POLY upstream to materialize x_2, x_3, …, x_<degree> derived columns, then list them alongside the original x in predictors:
{
  "features": [
    {"type": "FEAT_POLY", "field": "x", "label": "x", "params": {"degree": 3}}
  ],
  "regressions": [
    {"type": "REG_OLS", "name": "polyfit", "target": "y",
     "predictors": ["x", "x_2", "x_3"]}
  ]
}
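Degree is gated at [2, 10]. Numerical stability is the caller’s responsibility. Overflow itself is remote (x^10 stays within f64 range until |x| approaches 10^30), but precision is lost much sooner: at degree 10 the Gram matrix holds entries up to x^20, so its condition number outgrows f64’s roughly 16 significant digits well before |x| reaches a few hundred. Centre or standardize predictors before requesting FEAT_POLY.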
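What the feature step materializes can be sketched in a few lines of Go. This is an illustration of the column-naming convention in the request above (x_2, x_3, and so on), not the FEAT_POLY implementation; polyColumns is a hypothetical name, and the error text is invented for the example.

```go
package main

import "fmt"

// polyColumns returns derived columns label_2 .. label_<degree>, each
// holding the matching power of xs. Degree is limited to [2, 10] to
// mirror the gate described in the text.
func polyColumns(label string, xs []float64, degree int) (map[string][]float64, error) {
	if degree < 2 || degree > 10 {
		return nil, fmt.Errorf("degree %d outside [2, 10]", degree)
	}
	cols := make(map[string][]float64, degree-1)
	prev := xs // running power, starts at x^1
	for d := 2; d <= degree; d++ {
		col := make([]float64, len(xs))
		for i, x := range xs {
			col[i] = prev[i] * x // x^d = x^(d-1) * x
		}
		cols[fmt.Sprintf("%s_%d", label, d)] = col
		prev = col
	}
	return cols, nil
}
```

For xs = [2, 3] and degree 3 this produces x_2 = [4, 9] and x_3 = [8, 27]; the original x column is not duplicated, which matches the request listing "x" alongside the derived names in predictors.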
Ecological regression — group → regress
“Ecological regression” is a regression fit on aggregated group-level statistics — per-precinct means, per-county sums, per-region rates — rather than individual-level rows. Use pulse_compose with two slots: slot 1 produces per-group means via GROUP_* + AGG_AVERAGE, slot 2 fits REG_OLS over the aggregate output (or, in practice, over a pre-aggregated .pulse file).
The two slots are intentionally independent; Pulse does not pipe slot-1 results into slot-2 as cohort input. Either (a) materialize slot 1’s aggregate as its own .pulse cohort upstream, or (b) treat slot 1 as the audit trail (per-group means visible in the composed response) and run slot 2 over a pre-aggregated fixture.
Caution — the ecological fallacy. A significant group-level slope does not imply an individual-level association. Robinson (1950) showed that ecological correlations and individual correlations can take opposite signs in the same data: a per-state regression of literacy on race might suggest a strong relationship that vanishes (or reverses) at the per-person level. Aggregation collapses within-group variation, leaving only between-group structure that frequently encodes confounders.
When ecological regression is the right tool: aggregate-only data (census output, public-health summary tables); genuinely group-level research questions (“do counties with higher median income have higher turnout?”). When it is the wrong tool: individual-level claims; causal claims. Annotate consumer-facing prose with this caveat; Pulse cannot enforce it.
Robinson, W.S. (1950). “Ecological Correlations and the Behavior of Individuals.” American Sociological Review 15(3): 351–357.
Per-row regression attributes
Three attribute operators emit per-record diagnostics from a fitted regression onto the row stream.
| Attribute | Emits per row |
|---|---|
| ATTR_REG_FITTED | ŷ_i = Xᵢ β, the model’s prediction at each row. |
| ATTR_REG_RESIDUAL | y_i − ŷ_i, the per-row residual. |
| ATTR_REG_LEVERAGE | h_ii = Xᵢ (XᵀX)⁻¹ Xᵢᵀ, the i-th diagonal of the hat matrix. |
Each attribute references a sibling regression spec by regression_name. See skills/attribute-composition.md for the parameter table.
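For a single predictor with intercept, the three quantities have closed forms, and the hat-matrix diagonal reduces to h_ii = 1/n + (x_i − x̄)² / Σ(x_j − x̄)². The Go sketch below computes all three per row under that simplification; rowDiag and diagnostics are hypothetical names used only for illustration.

```go
package main

// rowDiag holds the per-row outputs corresponding to ATTR_REG_FITTED,
// ATTR_REG_RESIDUAL, and ATTR_REG_LEVERAGE.
type rowDiag struct {
	Fitted, Residual, Leverage float64
}

// diagnostics fits y = b0 + b1*x by least squares and returns the
// per-row diagnostics. It returns nil when the fit is undefined.
func diagnostics(x, y []float64) []rowDiag {
	if len(x) < 2 || len(x) != len(y) {
		return nil
	}
	n := float64(len(x))
	var sx, sy float64
	for i := range x {
		sx += x[i]
		sy += y[i]
	}
	mx, my := sx/n, sy/n
	var sxx, sxy float64
	for i := range x {
		dx := x[i] - mx
		sxx += dx * dx
		sxy += dx * (y[i] - my)
	}
	if sxx == 0 {
		return nil // predictor has no variance
	}
	slope := sxy / sxx
	intercept := my - slope*mx
	out := make([]rowDiag, len(x))
	for i := range x {
		dx := x[i] - mx
		fitted := intercept + slope*x[i]
		out[i] = rowDiag{
			Fitted:   fitted,
			Residual: y[i] - fitted,
			Leverage: 1/n + dx*dx/sxx,
		}
	}
	return out
}
```

Two quick sanity properties follow from the algebra: the residuals sum to zero, and the leverages sum to the number of fitted parameters (2 here). Rows far from x̄ carry the largest leverage, which is what makes the attribute useful for spotting influential records.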
Error codes
Look up full prose via pulse_errors_lookup or pulse errors lookup CODE.
| Code | Meaning (one-liner) |
|---|---|
| PROCESSING_REGRESSION_NOT_IMPLEMENTED | Reserved as of Phase 8; no engine returns this today. |
| PROCESSING_REGRESSION_RANK_DEFICIENT | XᵀX is singular; add regularization or drop a predictor. |
| PROCESSING_REGRESSION_NO_CONVERGE | IRLS or coordinate descent failed within MaxIters. |
| PROCESSING_REGRESSION_SINGULAR_GRAM | XᵀX non-invertible even after regularization; increase alpha. |
| PROCESSING_REGRESSION_INVALID_FAMILY | REG_GLM Family outside {binomial, poisson, gamma}. |
| PROCESSING_REGRESSION_INVALID_LINK | Link incompatible with the chosen Family. |
| PROCESSING_REGRESSION_INSUFFICIENT_DATA | Filtered set has fewer rows than predictors + 1, or below resample minimum. |
| PROCESSING_REGRESSION_APPROXIMATE_SE | Warning: l1 / elasticnet SE is a plug-in approximation; set resample for rigor. |
| PROCESSING_REGRESSION_REGULARIZED_SELECTION | Warning: penalty != "" plus selection != "" is unusual. |
| PROCESSING_CONFIG | Invalid spec combination (e.g. Bayes + Resample, GLM + Penalty). |
Worked examples
Every Indeed name has a runnable JSON file under examples/regression/. Fetch via pulse_examples_get or read directly:
- 01_ecological_fallacy.json: per-region aggregation + ecological caveat (#12).
- 02_simple_linear.json: univariate OLS (#1, #3).
- 03_multiple_linear.json: multivariate OLS (#2, #4).
- 04_logistic.json: binary classification (#5).
- 05_ridge.json: l2 penalty (#6).
- 06_lasso.json: l1 penalty (#7).
- 07_polynomial.json: FEAT_POLY + OLS (#8).
- 08_bayesian_linear.json: conjugate NIG (#9).
- 09_jackknife.json: leave-one-out resampling (#10).
- 10_elasticnet.json: combined l1 / l2 penalty (#11).
- 11_stepwise.json: BIC-driven stepwise selection (#13).
Architecture Overview
Source of truth: the canonical architectural contract is CLAUDE.md at the repository root. This chapter restates its design principles for human readers; if the two ever disagree, CLAUDE.md is authoritative.
Pulse is a high-performance, self-describing tabular data processing engine. It
ships as a Go library (github.com/frankbardon/pulse) and as a CLI binary
(cmd/pulse/). The library is the primary deliverable; the CLI is a thin
adapter over it.
Design principles
- Library-first. The pulse.go facade (pulse.New, pulse.Options, pulse.Process, pulse.Compose, pulse.Import, pulse.Export, pulse.Convert, pulse.Inspect, pulse.Predict, pulse.Sample, pulse.Facet) is the public API. The CLI calls the library; it never contains business logic.
- Self-describing. Every .pulse file carries its schema in the header. The descriptor/ package provides manifest, predict, and inspect operations that expose the system’s capabilities and validate requests without executing them.
- Skill-augmented. The skills/ package embeds 19 markdown skill files into the binary via //go:embed. LLM agents (and Nexus, the orchestration layer that consumes Pulse) can call skills.List() and skills.Get(name) at boot time to inject domain-specific guidance into their context.
- Nexus relationship. Pulse is a standalone processing engine. Nexus is the upstream orchestration agent that calls Pulse’s library API or CLI. Pulse has no dependency on Nexus. Nexus discovers Pulse’s capabilities via pulse manifest --json and loads skills from the embedded skill pack.
The next chapter, Package Layout, shows where each of these concerns lives in the source tree.
Package Layout
Source of truth: this tree is mirrored from the “Package layout” section of
CLAUDE.md. If the project structure changes, that file is updated first; this page follows.
pulse/
├── cmd/
│ └── pulse/ # CLI binary (the only binary)
├── pulse.go # Public facade — pulse.New, pulse.Options
├── service/ # Orchestration layer; wires processing to encoding
├── processing/ # Aggregators, attributes, filterers, groupers, windows, features
│ ├── window/ # WIN_* operators (LAG, LEAD, RANK, RUNNING_*, EWMA, ...)
│ └── feature/ # FEAT_* pre-filter feature engineers (LOG, SQRT, BUCKETIZE, ...)
├── encoding/ # Dynamic schema + record codec (.pulse binary format)
├── io/ # Bidirectional tabular <-> .pulse adapters
│ ├── csv/ # CSV reader + writer
│ ├── tsv/ # TSV reader + writer
│ ├── ndjson/ # NDJSON reader + writer
│ ├── jsonarray/ # JSON-array reader + writer (single top-level array of flat objects)
│ ├── jsonshared/ # Value coercion helpers shared by ndjson and jsonarray
│ ├── arrow/ # Arrow IPC / Feather V2 reader + writer; shared Arrow<->Pulse type maps
│ ├── parquet/ # Parquet reader + writer (delegates type maps to io/arrow)
│ └── excel/ # Excel reader + writer (Excelize)
├── fs/ # afero-based filesystem abstraction + extension hook
├── errors/ # Typed error codes (CodedError system)
├── types/ # Request/response structs (JSON-serializable)
├── descriptor/ # Self-description: manifest, predict, inspect, envelope
├── skills/ # Embedded markdown skill pack (//go:embed)
│ ├── index.json # Manifest of all bundled skills
│ └── *.md # Individual skill files with YAML frontmatter
├── synth/ # Synthetic data generator (from-schema, from-profile)
├── docs/ # mdBook source for this site (published to GitHub Pages)
└── internal/
├── cli/ # CLI internals (descriptor walker, json action)
└── mcp/ # MCP server: tool + resource handlers wrapping pulse.Pulse
└── mcptools/ # Leaf metadata package (tool names + descriptions) consumed by descriptor
Adding an Aggregator
Audience: Pulse internals contributors adding a new AGG_*
operator.
This page is a step-by-step recipe. The same content lives in
CLAUDE.md → Common Claude Code Workflows → Adding a new
aggregator;
this is the human-readable mirror.
From CLAUDE.md, Common Claude Code Workflows.
1. Declare the type constant
Add the new constant to types/types.go and the slice returned by
types.AllAggregationTypes(). Example, for a hypothetical AGG_GINI:
const (
	// ... existing constants ...
	AGG_GINI AggregationType = "AGG_GINI"
)

func AllAggregationTypes() []AggregationType {
	return []AggregationType{
		// ... existing entries, alphabetised ...
		AGG_GINI,
	}
}
The exhaustiveness tests (TestStreamability_AggregationsKnown and
friends) will fail until you add the streamability case in step 4.
2. Implement the aggregator and register it
The operator implementation lives in processing/. Write the factory
function (newGini(...) returning the aggregator interface) and
register it in aggregatorRegistry in processing/registry.go.
If the aggregator can update one row at a time, also implement the
OnlineAggregator interface so it joins the streaming Process path.
Sort-based or sum-of-deviation aggregators (like AGG_MEDIAN,
AGG_ZSCORE) skip this interface and run in the buffered path.
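The difference between the two paths can be sketched in a few lines. The names below (`onlineAggregator`, `runningMean`, `bufferedMedian`) are hypothetical stand-ins chosen for illustration; they are not Pulse's actual `OnlineAggregator` contract, whose method set lives in processing/.

```go
package main

import (
	"fmt"
	"sort"
)

// onlineAggregator is a hypothetical stand-in for a one-row-at-a-time
// aggregator; Pulse's real OnlineAggregator interface may differ.
type onlineAggregator interface {
	Update(v float64)
	Result() float64
}

// runningMean keeps O(1) state, so it could join a streaming path.
type runningMean struct {
	n   int
	sum float64
}

func (m *runningMean) Update(v float64) { m.n++; m.sum += v }

func (m *runningMean) Result() float64 {
	if m.n == 0 {
		return 0
	}
	return m.sum / float64(m.n)
}

// bufferedMedian needs every value before it can answer, which is why
// sort-based aggregators stay on a buffered path.
func bufferedMedian(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	mid := len(sorted) / 2
	if len(sorted)%2 == 1 {
		return sorted[mid]
	}
	return (sorted[mid-1] + sorted[mid]) / 2
}

func main() {
	rows := []float64{4, 1, 3, 2, 10}

	var agg onlineAggregator = &runningMean{}
	for _, v := range rows {
		agg.Update(v) // one row at a time; no row is retained
	}
	fmt.Println("mean:", agg.Result())
	fmt.Println("median:", bufferedMedian(rows))
}
```

The mean never holds more than a count and a sum, while the median has to copy and sort the full input before returning anything.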
3. Tests
Tests come first: write them in processing/aggregator_test.go
before the implementation, run the suite, confirm they fail
informatively, then port the implementation until green. See
Testing Conventions.
4. Declare streamability
Add a case for the new type in types/streamability.go:
func (t AggregationType) Streamable() bool {
	switch t {
	// ...
	case AGG_GINI:
		return false // sort-based
	}
}
Add the same row to the table in types/streamability_test.go.
If the aggregator is online, also expect
TestRegistryStreamabilityMatchesTypes to compare your
OnlineAggregator implementation against the
AggregationType.Streamable() return value — they must agree.
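One way to picture that agreement check is a type assertion over the registry. This is a sketch under assumptions: the interfaces, the `entry` struct, and the registry shape are all hypothetical, and the real gate may be organised differently.

```go
package main

import (
	"fmt"
	"sort"
)

// aggregator and onlineAggregator are hypothetical stand-ins for the
// interfaces in processing/; the real method sets may differ.
type aggregator interface{ Result() float64 }

type onlineAggregator interface {
	aggregator
	Update(v float64)
}

type sumAgg struct{ total float64 }

func (s *sumAgg) Update(v float64) { s.total += v }
func (s *sumAgg) Result() float64  { return s.total }

// giniAgg deliberately has no Update method: it is not online.
type giniAgg struct{}

func (giniAgg) Result() float64 { return 0 } // computation elided

// entry pairs a factory with the streamability declared for its type.
type entry struct {
	declared bool
	factory  func() aggregator
}

// mismatches lists every name whose declaration disagrees with what the
// implementation supports, detected with a type assertion.
func mismatches(registry map[string]entry) []string {
	var bad []string
	for name, e := range registry {
		_, online := e.factory().(onlineAggregator)
		if online != e.declared {
			bad = append(bad, name)
		}
	}
	sort.Strings(bad)
	return bad
}

// demoRegistry declares AGG_GINI as streamable by mistake.
func demoRegistry() map[string]entry {
	return map[string]entry{
		"AGG_SUM":  {declared: true, factory: func() aggregator { return &sumAgg{} }},
		"AGG_GINI": {declared: true, factory: func() aggregator { return giniAgg{} }},
	}
}

func main() {
	fmt.Println(mismatches(demoRegistry()))
}
```

Flipping the declared value for `AGG_GINI` to `false` empties the mismatch list, which is the state the gate expects.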
5. Update the skill pack
Add a section for the new aggregator in
skills/aggregation-guide.md. Cover when to use it, what its inputs
and outputs look like, and any caveats (sort cost, memory, supported
field types).
The CI gate TestSkillsCoverAllComponents parses the skill body for
the operator name; the section can live anywhere in the file as long
as the name appears.
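A name-presence check of this kind can be approximated as below. `missingFromSkill` is a hypothetical helper, and plain substring matching is an assumption; the real gate's matching rules are not described in this chapter.

```go
package main

import (
	"fmt"
	"strings"
)

// missingFromSkill returns the operator names that never appear in the
// skill body. Hypothetical helper: substring matching is an assumption,
// and it would also accept AGG_SUM inside a longer name like AGG_SUMMARY.
func missingFromSkill(skillBody string, operators []string) []string {
	var missing []string
	for _, op := range operators {
		if !strings.Contains(skillBody, op) {
			missing = append(missing, op)
		}
	}
	return missing
}

func main() {
	body := "## AGG_COUNT\nCounts rows.\n\n## AGG_SUM\nSums a numeric field.\n"
	ops := []string{"AGG_COUNT", "AGG_SUM", "AGG_GINI"}
	fmt.Println(missingFromSkill(body, ops))
}
```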
6. Declare the capability metadata
Add a row to descriptor/capabilities_aggregations.go describing the
operator’s params, accepted field types, emitted type, and the
streamable hint. TestManifestOperatorsComplete enforces that every
registered aggregator has a capability row.
7. CLAUDE.md and registered-component lists
Update CLAUDE.md’s “Current registered components” section with the
new aggregator name in the right alphabetised slot. If the operator
interacts with categorical fields in a special way, also update
descriptor/predict.go’s numericAggregations map.
8. Run the gates
go test ./skills/ -run TestSkillsCoverAllComponents
go test ./descriptor/ -run 'TestManifest|TestPredict'
go test ./processing/ -run TestRegistryStreamability
go test ./...
The full Update Demand row for aggregators says: skill update + capability declaration + CLAUDE.md update + the existing test coverage. All four ride in the same PR. See The Update Demand.
Adding an I/O Format
Audience: internals contributors adding a new bidirectional
tabular format (a peer to the existing csv/, tsv/, ndjson/,
jsonarray/, arrow/, parquet/, excel/ sub-packages).
From CLAUDE.md, Common Claude Code Workflows.
1. Create the sub-package
Each format is a sub-package under io/. Create
io/<format>/<format>.go with both a reader and a writer.
The two interfaces to implement live in io/:
// Reader
type Reader interface {
	ReadHeader() ([]string, error)
	ReadRows(ctx context.Context, fn func(row []string) error) error
	Close() error
}

// Writer
type Writer interface {
	WriteHeader(columns []string) error
	WriteRow(values []string) error
	Close() error
}
If the reader needs schema inference (header sample, then full
import), also implement io.ResetReader.Reset() so the import job
can rewind after sampling.
2. Tests
Add io/<format>/<format>_test.go with the standard round-trip
checks: write rows, read them back, verify equality. Hermetic tests
should use afero.NewMemMapFs() — see Testing
Conventions.
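As a rough illustration of the round-trip shape, the sketch below implements the Reader and Writer method sets from step 1 for a made-up pipe-delimited format. Everything here is an assumption for illustration: `toyWriter`, `toyReader`, and `roundTrip` are hypothetical, and an in-memory `bytes.Buffer` stands in for `afero.NewMemMapFs()` so the example needs only the standard library. Note that `io` below is the standard-library package, not Pulse's `pio`.

```go
package main

import (
	"bufio"
	"bytes"
	"context"
	"fmt"
	"io"
	"reflect"
	"strings"
)

// toyWriter and toyReader implement the method sets listed in step 1 for
// a made-up pipe-delimited format with no quoting or escaping.
type toyWriter struct{ w io.Writer }

func (t *toyWriter) WriteHeader(columns []string) error { return t.WriteRow(columns) }

func (t *toyWriter) WriteRow(values []string) error {
	_, err := fmt.Fprintln(t.w, strings.Join(values, "|"))
	return err
}

func (t *toyWriter) Close() error { return nil }

type toyReader struct{ sc *bufio.Scanner }

func (t *toyReader) ReadHeader() ([]string, error) {
	if !t.sc.Scan() {
		if err := t.sc.Err(); err != nil {
			return nil, err
		}
		return nil, io.ErrUnexpectedEOF // empty input: no header line
	}
	return strings.Split(t.sc.Text(), "|"), nil
}

func (t *toyReader) ReadRows(ctx context.Context, fn func(row []string) error) error {
	for t.sc.Scan() {
		if err := ctx.Err(); err != nil {
			return err
		}
		if err := fn(strings.Split(t.sc.Text(), "|")); err != nil {
			return err
		}
	}
	return t.sc.Err()
}

func (t *toyReader) Close() error { return nil }

// roundTrip writes header and rows to memory, then reads them back.
func roundTrip(header []string, rows [][]string) ([]string, [][]string, error) {
	var buf bytes.Buffer
	w := &toyWriter{w: &buf}
	if err := w.WriteHeader(header); err != nil {
		return nil, nil, err
	}
	for _, row := range rows {
		if err := w.WriteRow(row); err != nil {
			return nil, nil, err
		}
	}
	if err := w.Close(); err != nil {
		return nil, nil, err
	}

	r := &toyReader{sc: bufio.NewScanner(&buf)}
	defer r.Close()
	gotHeader, err := r.ReadHeader()
	if err != nil {
		return nil, nil, err
	}
	var gotRows [][]string
	err = r.ReadRows(context.Background(), func(row []string) error {
		gotRows = append(gotRows, row)
		return nil
	})
	return gotHeader, gotRows, err
}

func main() {
	header := []string{"id", "city"}
	rows := [][]string{{"1", "Oslo"}, {"2", "Lima"}}
	gotHeader, gotRows, err := roundTrip(header, rows)
	if err != nil {
		panic(err)
	}
	fmt.Println(reflect.DeepEqual(header, gotHeader) && reflect.DeepEqual(rows, gotRows))
}
```

A real format test would add the cases the toy format cannot represent, such as values containing the delimiter or embedded newlines.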
3. Wire it into the CLI
The CLI registers per-format leaves in internal/cli/import.go and
internal/cli/export.go. Add the format string to:
- The switch in makeImportReader(format, ...) in import.go.
- The corresponding newWriterForFormat(format, ...) switch in export.go.
- The Commands: slice on ImportCommand() and ExportCommand() in the same files (one importFormatCmd("yourformat") / exportFormatCmd("yourformat") line).
The pulse convert leaf auto-detects format from extension via
formatFromExt; add the extension mapping if the new format has a
canonical file extension.
4. Schema mapping
If the new format has a native type system (Arrow / Parquet do, CSV
does not), share the type map with neighbouring formats via the
io/arrow package the way Parquet already does. NDJSON and JSON-array
share io/jsonshared for value coercion.
5. Skill update
Add or update a skill that points users at the new format. If the
new format is primarily an export concern, update
skills/export-format-selection.md. If it has import-side
considerations (schema inference, null markers, type ambiguity),
update skills/import-best-practices.md.
If the format adds a CLI flag (e.g. --sheet for Excel), update
skills/getting-started.md so TestSkillsCoverAllCliLeaves keeps
passing.
6. Convert and orchestration plumbing
Make sure both directions flow through pio.ImportJob and
pio.ExportJob. The orchestration layer is format-agnostic; you
should not need to touch service/ unless the new format requires
special metadata (e.g., Parquet’s per-column statistics).
7. Run the gates
go test ./io/<format>/...
go test ./skills/ -run TestSkillsCoverAll
go test ./...
For format-specific perf, add benchmarks (Benchmark<Format>...) in
the sub-package. There’s no required perf gate today, but neighbouring
formats have benchmarks you can mirror as a baseline.
Adding a Statistical Test
Audience: internals contributors adding a new TEST_* operator —
tier-1 (row-stream) or tier-2 (post-test on the materialised result
set).
The recipe mirrors the aggregator and feature recipes; the test-specific moving parts are streamability, the test catalog, and the registered-test capability table.
From CLAUDE.md, “Update Demand” rows for statistical tests and tier-2 post-test variants.
1. Decide tier
- Tier 1. Runs against the raw row stream, alongside aggregators. Online-moments tests (TEST_T, TEST_WELCH, TEST_CHISQ, TEST_ANOVA_F) stay in the streaming Process path. Sort-required tests (TEST_KS) force the buffered path.
- Tier 2. Runs after the result set is materialised, in req.PostTests. Always buffered.
2. Declare the type constant
Add to types/types.go:
const (
	// ... existing constants ...
	TEST_GINI_TREND TestType = "TEST_GINI_TREND"
)
Add it to types.AllTestTypes().
3. Implement and register
Tests live in processing/test_*.go. Existing examples to mirror:
- processing/test_t.go: online tier-1 test.
- processing/test_anova.go: tier-1 ANOVA with grouper support.
- processing/test_post.go and processing/test_post_more.go: tier-2 post-tests.
- processing/test_studentized.go: numerical integration utilities (used by TEST_TUKEY_HSD).
Register the test in processing/test.go (the registry construction
calls). For tier-2 variants, declare both the base type and the
variant identifier the post-test surface uses.
4. Streamability
Add a case in types/streamability.go for the new TestType:
func (t TestType) Streamable() bool {
	switch t {
	// ...
	case TEST_GINI_TREND:
		return false // sort-based
	}
}
Add the matching row in types/streamability_test.go so
TestStreamability_TestsKnown passes.
5. Capability declaration
Add a row to descriptor/capabilities_tests.go:
- For a tier-1 test, declare it in the tier-1 catalog (testCapabilities).
- For a tier-2 post-test, declare it in postTestCapabilities.
TestManifestTestsComplete and TestManifestPostTestsComplete
enforce that the manifest enumerates every registered test.
6. Skill update
Add an entry to skills/statistical-testing.md under “Operator
catalog”. Describe the test’s family, inputs, outputs (statistic, p,
df, effect size, …), and any preconditions (PULSE_TEST_* error
codes it can raise). For tier-2 variants, also document the variant
field shape since the post-test API exposes it.
7. Tests
Use the same TDD pattern as for aggregators. The processing package
has rich existing test files to model new cases against:
processor_test_pipeline_test.go, test_parametric_test.go,
test_nonparametric_test.go, test_post_more_test.go. Add hermetic
fixtures that exercise the streaming and buffered paths.
8. Error codes
If your test introduces a new failure mode, add a code to
errors/codes.go (mirror the existing PULSE_TEST_* family),
register its description row in descriptor/capabilities_errors.go,
and document recovery in skills/error-code-reference.md. See the
Adding an Aggregator recipe for the same
pattern at the aggregator layer.
9. CLAUDE.md
Update CLAUDE.md’s “Current registered components → statistical tests” line with the new operator. If the test introduces a new preconditions class (e.g. paired sample, repeated measures), also add a sentence describing it in the parent paragraph.
10. Run the gates
go test ./processing/ -run TestType_Streamable
go test ./types/ -run TestStreamability_TestsKnown
go test ./descriptor/ -run TestManifest
go test ./skills/ -run TestSkillsCoverAll
go test ./...
See The Update Demand for the full row that governs statistical-test changes.
The Update Demand
Source of truth: this chapter is mirrored from the “Update Demand” section of CLAUDE.md. Both files are kept in lock-step; CLAUDE.md is authoritative if they ever diverge (a TestUpdateDemandTableCovers CI gate enforces table coverage against the registries).
Any change to Pulse code, configuration, file format, or public surface MUST update the corresponding skill file(s) and CLAUDE.md in the same PR. This is not a courtesy. It is a non-skippable CI failure if any of the trigger conditions below is met without the corresponding doc update.
Trigger → required update
| If you change… | You MUST also update… | Enforced by |
|---|---|---|
| A registered aggregator | skills/aggregation-guide.md (add or update the section for that aggregator) | TestSkillsCoverAllComponents |
| A registered attribute | skills/attribute-composition.md | TestSkillsCoverAllComponents |
| A registered filterer | skills/aggregation-guide.md (filtering section) | TestSkillsCoverAllComponents |
| A registered grouper | skills/grouper-design.md | TestSkillsCoverAllComponents |
| A registered window operator | skills/window-operations.md | TestSkillsCoverAllWindowTypes |
| An error code (added/removed/renamed) | skills/error-code-reference.md | TestSkillsCoverAllErrorCodes |
| A CLI leaf (added/removed/flag added) | CLAUDE.md “Common Claude Code Workflows” + skills/getting-started.md if user-facing | TestSkillsCoverAllCliLeaves |
| A --json envelope or format_version | CLAUDE.md “Output Format Contract” | TestClaudeMdMentionsFormatVersion |
| A .pulse file format change (header layout, new field type) | CLAUDE.md “Code Conventions” + skills/cohort-schema-design.md | TestClaudeMdMentionsFormatVersion, TestSkillsCoverAllFieldTypes |
| A new non-skippable CI gate | CLAUDE.md (gate listed by name in the relevant section) | TestClaudeMdMentionsAllNonSkippableGates |
| A new architectural decision | CLAUDE.md (relevant section) + PRD if applicable | reviewer enforcement |
| An environment variable | CLAUDE.md “Build / Dev / Test Workflow” + skills/getting-started.md | TestClaudeMdMentionsAllEnvVars |
| A registered MCP tool (added/removed) | skills/mcp-integration.md (Tool surface table) + internal/mcp/mcptools/meta.go (name + description) | TestSkillsCoverAllMCPTools, TestManifestMCPToolsComplete |
| A new MCP action tool with field-name parameters | internal/mcp/schema_bind.go (add a per-tool JSON Schema builder + entry in Bind) + skills/mcp-integration.md (Schema-bound enums section) | TestMCPSchemaBinding_RemovesInvalidFields, TestMCPSchemaBinding_AllFieldsInFiltererEnum, TestMCPSchemaBinding_SampleAndFacetFieldEnum, TestMCPSchemaBinding_InspectSucceedsRegistersBindings, TestMCPSchemaBinding_BindOnOpenFalse |
| A registered feature operator | skills/feature-engineering.md (operator catalog) + capability declaration in descriptor/capabilities_features.go | TestSkillsCoverAllComponents, TestManifestOperatorsComplete |
| A registered synth distribution kind | skills/synthetic-data.md (Supported distributions) + capability declaration in descriptor/capabilities_distributions.go | TestSkillsCoverAllSynthDistributions, TestManifestDistributionsComplete |
| A registered statistical test (TEST_*) | skills/statistical-testing.md (Operator catalog) + types/streamability.go + types/streamability_test.go + capability declaration in descriptor/capabilities_tests.go | TestStreamability_TestsKnown, TestManifestTestsComplete |
| A registered tier-2 post-test variant | Capability declaration in descriptor/capabilities_tests.go (postTestCapabilities) | TestManifestPostTestsComplete |
| A registered aggregator/attribute/filterer/grouper/window capability metadata | Capability declaration in descriptor/capabilities_<category>.go (params, accepts_types, emits_type, streamable_hint) | TestManifestOperatorsComplete |
| A new error code | Description row in descriptor/capabilities_errors.go (errorMetaTable) | TestManifestErrorCodesComplete |
| An error code’s fixup template | Entry in errors/fixup_metadata.go (codeMetadata) + **Fixup**: line in skills/error-code-reference.md under that code | TestCodesHaveFixups, TestSkillsErrorCodeFixupsDocumented |
| A new operator’s streaming capability | types/streamability.go (case for the new type) + table in types/streamability_test.go | TestRegistryStreamabilityMatchesTypes, TestStreamability_*Known, TestManifestStreamableMatchesTypes |
| The default operator table | CLAUDE.md “Code Conventions → Smart defaults” + skills/getting-started.md (“Defaults” section) | TestDefaults_Applied + reviewer enforcement |
| A natural-query parsing route (new grammar shape) | internal/query/query.go grammar + internal/query/query_test.go fixtures + skills/query-router-prompt.md (router prompt grammar) + skills/request-recipes.md (target shapes) | TestNaturalQuery_HeuristicGrammar |
The Update Demand applies recursively to itself: when a new trigger row is added (e.g., a new component category, a new contract), this table MUST be updated in the same PR. TestUpdateDemandTableCovers (non-skippable) parses this table and asserts every registered component category and contract type has a row.
If you find yourself wanting to defer the doc/skill update to “a follow-up PR,” stop. The follow-up PR will not happen, and the next Claude Code session will read a stale CLAUDE.md and produce wrong code. Update in the same PR or do not merge.
Deployment
Audience: operators standing up Pulse as a standalone CLI, an MCP process under an AI client, or an embedded Go library inside a larger binary.
Pulse is a single static Go binary. There is no install command, no
config file, and no daemon — every deployment story is some shape of
“put the binary somewhere, set PULSE_DATA_DIR, run it”.
LLM agents using MCP: see the mcp-integration skill via pulse_skills_get for the MCP-side wiring details. This page covers the operator side.
Mode 1: Standalone CLI
go install github.com/frankbardon/pulse/cmd/pulse@latest
export PULSE_DATA_DIR=/var/data/pulse
pulse --version
That’s the full install. The CLI tree is mapped in the CLI Tour.
Mode 2: MCP stdio server (Claude Desktop, Claude Code, generic MCP clients)
pulse mcp runs the Model Context Protocol over stdio. AI clients
launch the process, speak MCP over its standard streams, and shut it
down on session close.
The full wiring guide is in the mcp-integration skill. Quick
reference for Claude Desktop:
// ~/Library/Application Support/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "pulse": {
      "command": "/usr/local/bin/pulse",
      "args": ["mcp"],
      "env": {
        "PULSE_DATA_DIR": "/var/data/pulse"
      }
    }
  }
}
For Claude Code (~/.claude.json) and other clients the shape is the
same — see the mcp-integration skill (pulse skills show mcp-integration) for the canonical recipes.
Flags worth knowing:
| Flag | Default | Purpose |
|---|---|---|
| --data-dir | from PULSE_DATA_DIR | Override the cohort base directory |
| --bind-on-open | true | Register session-scoped JSON-schema-bound tool variants on successful pulse_inspect. Disable for clients that bind tool schemas themselves. |
See pulse mcp for the full command page.
Mode 3: Embedded Go library
import "github.com/frankbardon/pulse"
p, err := pulse.New(pulse.Options{
	DataDir: "/var/data/pulse",
})
When embedding, you can bypass PULSE_DATA_DIR entirely by passing
DataDir (as above) or a custom afero.Fs. See Library
Embedding for the full surface.
Production hardening
- Filesystem permissions. pulse mcp reads everything under PULSE_DATA_DIR. Treat the directory as the trust boundary: run the process as a user that can only read what it should serve.
- Stdio plumbing. MCP transports stderr too. Pulse writes a one-line startup notice (pulse mcp: serving over stdio...) on stderr and never logs request/response payloads, so MCP clients can surface stderr without leaking data.
- Resource limits. Streaming aggregations stay memory-bounded; buffered request shapes (window operators, median/percentile, decimal/geo paths) can materialise large intermediate row sets. Use pulse api predict to check Streamable before running an unfamiliar request (see Performance Notes).
- No mutating background state. Pulse never writes to a cohort during process/compose. The only write paths are import, export, synth, profile, and cohort filter, all explicit by flag.
Upgrades
Drop in a new binary and restart the MCP process (or the calling
client). The .pulse file format carries a one-byte version field
(currently 0x01); files written by a future binary that introduces
a new version are rejected loudly at parse time rather than failing
silently at row decode. See Header Layout.
Performance Notes
Audience: operators sizing a Pulse deployment, and library users debugging memory or latency surprises.
Pulse is built to keep “the streaming path” the default for most
analytical requests. When the engine has to leave that path it says so
— via the Streamable flag in
pulse api predict — and falls back to a
buffered execution. This page tells you what stays streaming, what
buffers, and how to read predict’s diagnostics.
LLM agents using MCP: there is no direct skill counterpart for this page. debugging-with-predict covers how to drive predict; this page tells operators what predict’s answers imply.
Streaming path: what stays out of memory
The streaming Process path covers four orchestrator modes (from
CLAUDE.md → What streams today):
- Single-pass streaming. No-group requests with online aggregators (COUNT, SUM, AVG, STDDEV, VARIANCE, RANGE, FREQUENCY, MODE, SKEWNESS, KURTOSIS, DISTINCT_COUNT) on numeric (non-decimal) fields. Row-local attributes (FORMULA, DATE_PART) apply inline.
- Grouped streaming. Groupers implementing the streaming key path (GROUP_CATEGORY, GROUP_RANGE, GROUP_ROUNDED) drive per-key online aggregator buckets. Memory is O(distinct_groups × per-aggregator-state).
- Two-pass streaming. Two-pass attributes (ATTR_ZSCORE, ATTR_TSCORE, ATTR_NORMALIZED) compute population stats via Welford-Pébaÿ pass 1, then emit per-row values in pass 2.
- Streaming features. Every registered FEAT_* operator implements the streaming computer interface and composes with the three modes above.
These paths benefit from three optimisations landed during the streaming
refactor (commit cdd72d5): record reuse (the same record buffer flows
through the pipeline), zero-allocation decoding into reused buffers,
and an mmap reader for .pulse files large enough to benefit from
demand paging.
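The two-pass mode can be sketched with the classic single-stream Welford update. This is an illustration under assumptions, not Pulse's implementation: the `welford` type and `zscores` function are hypothetical, the Pébaÿ merge step for combining partial states is omitted, and the choice of the population (rather than sample) standard deviation is a guess.

```go
package main

import (
	"fmt"
	"math"
)

// welford accumulates count, mean, and the sum of squared deviations
// (m2) one value at a time, without retaining the values.
type welford struct {
	n    int
	mean float64
	m2   float64
}

func (w *welford) add(x float64) {
	w.n++
	delta := x - w.mean
	w.mean += delta / float64(w.n)
	w.m2 += delta * (x - w.mean)
}

// stddev returns the population standard deviation. Whether Pulse uses
// the population or the sample form is an assumption of this sketch.
func (w *welford) stddev() float64 {
	if w.n == 0 {
		return 0
	}
	return math.Sqrt(w.m2 / float64(w.n))
}

// zscores scans the source twice: pass 1 gathers the statistics, pass 2
// emits one z-score per row. scan must replay the same rows each time.
func zscores(scan func(yield func(float64))) []float64 {
	var w welford
	scan(w.add) // pass 1: statistics only
	sd := w.stddev()

	var out []float64
	scan(func(x float64) { // pass 2: per-row output
		if sd == 0 {
			out = append(out, 0)
			return
		}
		out = append(out, (x-w.mean)/sd)
	})
	return out
}

func main() {
	data := []float64{2, 4, 4, 4, 5, 5, 7, 9}
	scan := func(yield func(float64)) {
		for _, x := range data {
			yield(x)
		}
	}
	fmt.Println(zscores(scan))
}
```

Between the two passes the only retained state is three numbers, which is why this mode stays memory-bounded even though it reads the input twice.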
Buffered path: when Pulse has to materialise
pulse api predict reports Streamable=false and lists every
buffering reason. The current set, from CLAUDE.md:
- AGG_MEDIAN, AGG_PERCENTILE, and AGG_ZSCORE: require sorts or summed deviations.
- ATTR_PERCENTILE: sorted view of every value; no streaming algorithm preserves exact rank.
- GROUP_QUANTILE, GROUP_DATE: finalize-time work over the full set.
- Window operators (WIN_*): operate on a sorted post-aggregate row set.
- Decimal-typed field aggregations: precision-preserving path.
- Two-pass attributes combined with features or groups: orchestration matrix not yet extended.
- Tier-1 statistical tests combined with groupers, features, or two-pass attributes: same orchestration limit.
- Tier-2 post-tests (req.PostTests): always run after the result set is materialised, regardless of TestType.
Reading predict output
pulse api predict --request request.json --json | jq '.data | {streamable, streamable_reasons}'
{
  "streamable": false,
  "streamable_reasons": [
    "AGG_MEDIAN on field price"
  ]
}
If streamable_reasons is empty and streamable=true, the request
executes without buffering. Each reason is a one-line gate that pushed
the request to the buffered path; you can drop or substitute the
offending operator (e.g., AGG_AVG instead of AGG_MEDIAN) and
re-run predict.
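A Go caller can read the same two fields without jq. The sketch below models only what the jq filter above selects (`data.streamable` and `data.streamable_reasons`); the `predictEnvelope` type and `bufferingReasons` helper are hypothetical names, and the rest of the envelope is ignored.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// predictEnvelope models only the two fields selected by the jq filter;
// every other part of the --json envelope is ignored.
type predictEnvelope struct {
	Data struct {
		Streamable        bool     `json:"streamable"`
		StreamableReasons []string `json:"streamable_reasons"`
	} `json:"data"`
}

// bufferingReasons returns nil when the request stays on the streaming
// path, and the reported reasons otherwise.
func bufferingReasons(raw []byte) ([]string, error) {
	var env predictEnvelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, fmt.Errorf("decode predict envelope: %w", err)
	}
	if env.Data.Streamable {
		return nil, nil
	}
	return env.Data.StreamableReasons, nil
}

func main() {
	raw := []byte(`{"data":{"streamable":false,"streamable_reasons":["AGG_MEDIAN on field price"]}}`)
	reasons, err := bufferingReasons(raw)
	if err != nil {
		panic(err)
	}
	for _, r := range reasons {
		fmt.Println("buffers because:", r)
	}
}
```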
Memory rules of thumb
| Path | Memory profile |
|---|---|
| Single-pass streaming | Constant — O(aggregator state) |
| Grouped streaming | O(distinct_groups × per-aggregator state) |
| Two-pass streaming | Constant; the cost is two full scans of the file (the second is typically served from the OS page cache) |
| Buffered | O(filtered_rows × output_width) for the working set, plus per-operator state |
Concurrency
pulse.ComposeParallel (CLI: pulse api compose --parallel N)
fans ComposedRequest slots over a bounded worker pool. Workers share
the engine’s read-only registries; each Process call constructs
fresh stateful operators per request, so concurrent execution is
safe. Defaults: MaxWorkers = GOMAXPROCS, FailFast = true. See
Parallel Compose.
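The bounded-pool and fail-fast behaviour can be pictured with standard-library primitives. This is a generic sketch, not `pulse.ComposeParallel`: `runBounded`, `runDemo`, and `errBoom` are hypothetical names, and the real engine may schedule and report errors differently.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"runtime"
	"sync"
)

var errBoom = errors.New("slot 1 failed")

// runBounded is a hypothetical helper. It runs the tasks on at most
// maxWorkers goroutines and, when failFast is set, cancels the shared
// context after the first error so later tasks are skipped.
func runBounded(ctx context.Context, maxWorkers int, failFast bool, tasks []func(context.Context) error) []error {
	if maxWorkers <= 0 {
		maxWorkers = runtime.GOMAXPROCS(0)
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	errs := make([]error, len(tasks)) // one slot per task; no shared writes
	sem := make(chan struct{}, maxWorkers)
	var wg sync.WaitGroup

	for i, task := range tasks {
		sem <- struct{}{} // blocks while maxWorkers tasks are in flight
		wg.Add(1)
		go func(i int, task func(context.Context) error) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := ctx.Err(); err != nil {
				errs[i] = err // skipped: an earlier task already failed
				return
			}
			if err := task(ctx); err != nil {
				errs[i] = err
				if failFast {
					cancel()
				}
			}
		}(i, task)
	}
	wg.Wait()
	return errs
}

// runDemo uses a single worker so the outcome is deterministic.
func runDemo() []error {
	tasks := []func(context.Context) error{
		func(context.Context) error { return nil },
		func(context.Context) error { return errBoom },
		func(context.Context) error { return nil },
	}
	return runBounded(context.Background(), 1, true, tasks)
}

func main() {
	for i, err := range runDemo() {
		fmt.Printf("slot %d: %v\n", i, err)
	}
}
```

Each task writes only to its own error slot, so no mutex is needed; that mirrors the point above that per-request state is what makes concurrent execution safe.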
When to embed vs shell out
For high-throughput pipelines, embed Pulse directly via the Go library
— you avoid one process boundary per request and can stream rows
through your own writer with ProcessStream. For ad-hoc analysis,
JSON-in/JSON-out via pulse api process --json is faster to write
and easier to debug.
Troubleshooting
Audience: operators chasing a specific failure mode in production (file not found, permission errors, MCP transport issues, common error codes).
This page is organised by symptom. For per-code recovery detail
(Message + Fixup templates), fetch metadata via the
pulse_errors_lookup MCP tool ({"code": "PULSE_XXX"}) or
pulse errors lookup CODE on the command line. The
error-code-reference skill explains the envelope shape, the
DOMAIN_CATEGORY naming convention, and the repair workflow that
chains predict-side suggestions into structured fixups.
LLM agents using MCP: call pulse_errors_lookup for per-code detail: code=PULSE_XXX for one code, domain=PULSE to enumerate, query="..." for keyword search. The skill is the orientation; the tool is the catalog. This page focuses on operational symptoms that don’t reduce to a single error code.
“data directory required: set PULSE_DATA_DIR or pass --data-dir”
pulse mcp refuses to start. The MCP leaf is the one place the
binary insists on a base directory because it enumerates cohorts at
session start.
Fix: export PULSE_DATA_DIR in the client’s MCP config, or pass
--data-dir /path/to/data on the command line. The
pulse mcp page has the full example.
“file not found” / “no such file or directory”
The cohort path was resolved against the wrong base. Pulse prefers
absolute paths; with PULSE_DATA_DIR set, relative paths resolve
against it.
Fix: call pulse cohort inspect /absolute/path/data.pulse to
verify the file is where you think it is. If you’re running inside
pulse mcp, check the data-dir line on stderr at startup.
“permission denied”
Pulse runs as your user; it does not escalate. When deployed as an
MCP process under a different user (e.g. via launchd / systemd),
the cohort directory and files must be readable by that user.
Fix: check id inside the MCP startup banner on stderr; check
the file mode with ls -l; widen the group as needed.
“invalid pulse magic bytes” / “unsupported pulse format version”
The file isn’t a .pulse file, or it was written by a newer binary that
introduced a new format version. The reader rejects unknown versions
at parse time (see Header Layout) so an older binary doesn’t silently
mis-decode a newer file.
Fix: verify the file with file path/to/data.pulse and the first
nine bytes (hexdump -C). The expected magic is 50 55 4c 53 45 00 00 00
followed by a version byte (0x01 today).
“truncated pulse header”
The file is shorter than nine bytes or was cut off mid-write.
Fix: re-import. If you suspect a partial write, also check whether the writer was killed mid-flush — Pulse writes the header first, then the schema, then the records, so a truncated file usually fails here before any data is observed.
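The three header symptoms above can be told apart by looking at the first nine bytes. The sketch below is a diagnostic aid under assumptions, not Pulse's parser: `checkHeader` and the error variables are hypothetical, and the order of the checks is a guess. Only the magic bytes and the 0x01 version come from this page.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// pulseMagic is the eight-byte magic quoted above: "PULSE" plus three NULs.
var pulseMagic = []byte{0x50, 0x55, 0x4c, 0x53, 0x45, 0x00, 0x00, 0x00}

const supportedVersion = 0x01

var (
	errTruncated = errors.New("truncated pulse header")
	errMagic     = errors.New("invalid pulse magic bytes")
	errVersion   = errors.New("unsupported pulse format version")
)

// checkHeader inspects only the first nine bytes: eight magic bytes and
// one version byte. The check order is an assumption of this sketch.
func checkHeader(b []byte) error {
	if len(b) < len(pulseMagic)+1 {
		return errTruncated
	}
	if !bytes.Equal(b[:len(pulseMagic)], pulseMagic) {
		return errMagic
	}
	if v := b[len(pulseMagic)]; v != supportedVersion {
		return fmt.Errorf("%w: 0x%02x", errVersion, v)
	}
	return nil
}

func main() {
	good := append(append([]byte{}, pulseMagic...), 0x01)
	future := append(append([]byte{}, pulseMagic...), 0x02)

	fmt.Println(checkHeader(good))
	fmt.Println(checkHeader(future))
	fmt.Println(checkHeader([]byte("id,city\n1,Oslo\n")))
	fmt.Println(checkHeader(good[:4]))
}
```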
SERVICE_VALIDATION errors
A field name in the request doesn’t exist in the cohort, or an operator targets a field of the wrong type.
Fix: run pulse api predict on the same
request — predict diagnoses validation failures without executing.
Common cases: typo in field name; numeric aggregation on a
categorical field (warning code
PULSE_AGG_NOT_MEANINGFUL_FOR_CATEGORICAL); two-pass attribute
combined with a feature (currently buffered, not invalid — predict
will flag this in streamable_reasons).
PULSE_IMPORT_* errors
Import-time failures. The two most common:
- PULSE_IMPORT_CATEGORICAL_OVERFLOW: too many distinct values for the chosen categorical width. Either bump the width (categorical_u16 / categorical_u32), drop the categorical encoding, or filter the source before re-importing. See Dictionary Blocks.
- PULSE_IMPORT_DESCRIPTION_TOO_LONG: schema field description exceeds 1000 bytes. Trim it.
PULSE_FIELD_DESCRIPTION_LOW_QUALITY
A warning by default, an error under --strict. The description is
empty, under ten characters, or a generic placeholder ("n/a",
"tbd", "unknown", "field", "data", "value", "column").
Fix: edit the description in the schema JSON, re-import with
--schema.
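To pre-screen descriptions before importing, the three rules above can be approximated as follows. `descriptionProblem` is a hypothetical helper; trimming whitespace, ignoring case, and counting bytes rather than characters are assumptions, so Pulse's own check may classify edge cases differently.

```go
package main

import (
	"fmt"
	"strings"
)

// placeholders lists the generic descriptions named in this section.
var placeholders = map[string]bool{
	"n/a": true, "tbd": true, "unknown": true, "field": true,
	"data": true, "value": true, "column": true,
}

// descriptionProblem returns "" for an acceptable description, or a
// short reason. Trimming, case-folding, and byte counting are
// assumptions of this sketch.
func descriptionProblem(description string) string {
	d := strings.TrimSpace(description)
	switch {
	case d == "":
		return "empty"
	case placeholders[strings.ToLower(d)]:
		return "generic placeholder"
	case len(d) < 10:
		return "under ten characters"
	}
	return ""
}

func main() {
	for _, d := range []string{"", "TBD", "price", "Unit price in USD cents"} {
		fmt.Printf("%q -> %q\n", d, descriptionProblem(d))
	}
}
```

The placeholder check runs before the length check only so the reason is more specific; every listed placeholder is also shorter than ten characters.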
MCP “tool not found” / “no tools registered”
An MCP client connects but sees no Pulse tools.
Fix: check the client’s MCP log (Claude Desktop surfaces this in
~/Library/Logs/Claude/). Common causes: pulse binary is not on
PATH, the wrong working directory, or PULSE_DATA_DIR is not set in
the MCP env block. Re-read pulse mcp.
mmap / file-mapping failures
On very large .pulse files the streaming reader uses memory mapping
where available. If your environment forbids mmap (some sandboxed
containers, very locked-down macOS configurations), the reader falls
back to a buffered read.
Fix: typically transparent. If you suspect a regression, run with
verbose Go runtime tracing or compare against a non-mmap file by
copying it to /tmp and re-running.
When in doubt: predict, then process
Almost every “why doesn’t this work” question is answerable by
pulse api predict --request request.json --json
Predict reads only the header and schema — it never touches record
data — and returns the full envelope of errors, warnings, and the
streamable flag. If predict says valid:true and process still
fails, the bug is in the processing layer, not the request.
Development Setup
Audience: new contributors getting their first PR ready.
This page is the short version. The fuller treatment of the repo’s
conventions, CI gates, and Update Demand lives in the Internals
section and in
CLAUDE.md
at the repository root.
Clone
git clone https://github.com/frankbardon/pulse.git
cd pulse
Tooling
Pulse needs only the Go toolchain — there is no Node, Python, or
container build. Install Go 1.24+ (see go.mod for the canonical
version).
The repo also uses staticcheck for make lint; it is auto-installed on
first run via go run.
Common targets
| Command | What it does |
|---|---|
| make build | Builds the CLI binary to bin/pulse (default goal) |
| make test | Runs go test ./... |
| make fmt | Runs go fmt ./... |
| make vet | Runs go vet ./... |
| make lint | Runs go vet then staticcheck ./... |
| make cover | Runs tests with coverage; outputs coverage.out |
| make clean | Removes bin/ and coverage.out |
A .env file at the repo root is auto-loaded and exported, so
PULSE_DATA_DIR and any other PULSE_* env vars can live there for
local development.
Run the binary you just built
make build
./bin/pulse --version
./bin/pulse --json | head -20
The CLI tree itself is mapped in the CLI Tour.
Where things live
The package layout is documented at Internals → Package Layout. Two pointers worth knowing on day one:
- Public facade: pulse.go. Every Go embedder API lives here.
- CLI internals: internal/cli/. One file per command group; never put processing logic here.
Read this before writing code
- Style Guide
- Testing Conventions
- Pull Request Process
- The Update Demand — what doc/skill updates ride alongside what code changes.
Style Guide
Audience: anyone writing code or docs in the Pulse repository.
This page summarises the conventions enforced by review and by CI. The
authoritative source is the “Code Conventions” section of
CLAUDE.md;
copy that file’s rules when in doubt.
Go style
- Standard gofmt / go vet cleanliness; make lint is the gate.
- Module path is github.com/frankbardon/pulse. The standard-library io collision is handled by aliasing the project’s package as pio "github.com/frankbardon/pulse/io".
- Library-first: business logic lives in library packages, never in cmd/pulse/. The CLI parses flags, calls the library, formats output.
- All file I/O routes through the injected afero.Fs. Never call os.Open / os.ReadFile directly in library code, because that defeats fs.NewMemMap() for tests and the extension hook for custom storage backends.
Naming
- Component types use SCREAMING_SNAKE_CASE: AGG_COUNT, ATTR_ZSCORE, FILTER_INCLUDE, GROUP_CATEGORY, WIN_LAG, FEAT_LOG, TEST_T.
- Error codes use DOMAIN_CATEGORY format, organised by the six domains listed in CLAUDE.md (ENCODING, PROCESSING, SERVICE, DATA, CLI, PULSE).
- Field types use lowercase snake (u8, nullable_bool, categorical_u16, decimal128).
Structural bans
These are enforced by non-skippable CI gates:
| Ban | Enforced by |
|---|---|
| descriptor/ MUST NOT import service/ or processing/ | TestPredictNoExecutionImports |
| descriptor/ MUST NOT use fmt.Sprintf for JSON construction | TestDescriptorNoFmtSprintf |
| Golden files in descriptor/testdata/ MUST NOT be hand-edited | TestGoldensNotHandEdited |
| No predecessor-project string prefixes (legacy “Orbit” naming) in error codes or constants | TestNoOrbitReferences, TestNoOrbitPrefix |
| CLAUDE.md MUST mention every PULSE_* env var, every non-skippable gate, the current format_version | TestClaudeMd* family |
See the Pull Request Process for how these surface during review.
Comments and prose
- Public Go symbols carry a godoc-shaped comment opening with the symbol name.
- Skill files use YAML frontmatter (name, description, type, applies_to) and are LLM-facing; keep them in MCP voice (tool calls, JSON payloads). The human-facing equivalent is this site; cross-link from each side.
- mdBook chapters open with a one-sentence summary and an Audience line. See any of the already-authored chapters in this site for the tone.
The Update Demand
The single most important convention: if your code change ships without
the corresponding CLAUDE.md and skill updates, CI will fail. The
Update Demand chapter is the
authoritative table of triggers and the gates that enforce them. Read
it before opening a PR that touches a registered surface (new
aggregator, new error code, new CLI flag, new field type, …).
Testing Conventions
Audience: contributors writing tests, regenerating goldens, or trying to figure out which CI gate to run locally before pushing.
From CLAUDE.md, CI gates and Common Claude Code Workflows.
Style
- Table-driven tests are the default. Put cases in a []struct{...} with a name field, run with t.Run(tc.name, func(t *testing.T)).
- Hermetic by construction: anything that touches the filesystem uses fs.NewMemMap() so tests don’t depend on disk state.
- New code lands with tests in the same PR: TDD first, then implementation. A test that passes without the implementation is suspicious; the test is probably wrong.
Running tests
# Full suite
go test ./...
# Single package
go test ./processing/...
# Verbose, specific test
go test ./service/... -v -run TestProcess
# Coverage report
make cover
# Fuzz the .pulse header
go test ./encoding/... -fuzz FuzzPulseFileHeader -fuzztime 30s
Non-skippable CI gates
These tests guard structural invariants. If one of them fails, the
underlying conventions (not the test) are what need re-thinking.
Their full names appear in CLAUDE.md so the
TestClaudeMdMentionsAllNonSkippableGates self-check can find them.
| Gate | Guards |
|---|---|
| TestPredictNoExecutionImports | descriptor/predict.go does not import service/ or processing/ |
| TestDescriptorNoFmtSprintf | descriptor/ never builds JSON via fmt.Sprintf |
| TestGoldensNotHandEdited | descriptor/testdata/* hashes match the generator |
| TestClaudeMdMentionsFormatVersion | CLAUDE.md references the current envelope format_version |
| TestClaudeMdMentionsAllEnvVars | Every PULSE_* env var has a CLAUDE.md row |
| TestClaudeMdMentionsAllNonSkippableGates | This very table is the source: CLAUDE.md must list every gate by name |
| TestUpdateDemandTableCovers | The Update Demand table covers every registered component category |
| TestPerPackageCoverageFloors | Package directories exist and meet documented coverage floors |
| TestNoOrbitReferences, TestNoOrbitPrefix, TestNoOrbitPrefixes | No predecessor-project string prefixes leak in |
| TestSkillsCoverAll* | Skill files mention every registered component, error code, distribution, CLI leaf, field type, MCP tool |
| TestSkillsManifestConsistent | skills/index.json matches the .md files and frontmatter |
| TestSkillsFrontmatter_RequiredFields | Every skill has name, description, type, applies_to |
| TestRegistryStreamabilityMatchesTypes | Aggregator OnlineAggregator capability matches AggregationType.Streamable() |
| TestPredict_Streamable_MatchesRuntime | PredictResult.Streamable mirrors processing.CanStreamRequest |
| TestStreamability_*Known | Every All*Types() entry has a streamability table row |
| TestCanStreamRequest_RegressionMatrix | Regression matrix on the exported CanStreamRequest helper |
| TestManifest*Complete | Manifest enumerates every registered operator, test, distribution, MCP tool, error code |
| TestManifestStreamableMatchesTypes | Manifest Streamable flags mirror the type-level methods |
| TestCodesHaveFixups, TestSkillsErrorCodeFixupsDocumented | Each error code has a fixup template and the skill row to match |
| TestDefaults_Applied | Smart-default operator-type inference behaves as documented |
| TestNaturalQuery_HeuristicGrammar | The internal/query parser fixtures cover its documented shapes |
(See CLAUDE.md “CI gates” for the full prose; this table is the
quick-reference.)
Running a subset of gates locally
# All descriptor contract gates
go test ./descriptor/ -run 'TestPredictNoExecution|TestDescriptorNoFmtSprintf|TestGoldensNotHandEdited'
# Skill coverage gates
go test ./skills/ -run 'TestSkillsCoverAll|TestSkillsManifestConsistent|TestSkillsFrontmatter'
# CLAUDE.md gates
go test . -run 'TestClaudeMd|TestUpdateDemandTable'
# Predecessor-reference scrub
go test . -run TestNoOrbitReferences
Regenerating golden files
Golden files live in descriptor/testdata/. Each ends with a
// golden-hash: <sha256> line; TestGoldensNotHandEdited verifies
the hash. After a legitimate change to the generator:
go test ./descriptor/ -run 'Test.*Golden' -update
go test ./descriptor/ -run TestGoldensNotHandEdited # confirms the new hash sticks
Never hand-edit a golden file — the gate will catch you.
Adding a new gate
If your change introduces a structural invariant, add a test for it
under the same naming convention (TestX), and add it to the table in
CLAUDE.md so TestClaudeMdMentionsAllNonSkippableGates recognises
it. The Update Demand lists this as a
trigger row.
Pull Request Process
Audience: contributors preparing to open or land a PR.
This page is a checklist. The longer prose lives in
CONTRIBUTING.md
and the Update Demand chapter.
1. Branch and commit shape
- One feature or fix per PR. Keep the diff focused.
- Conventional Commits in the subject line: feat(...), fix(...), chore(...), docs(...), perf(...), refactor(...), test(...).
- The PR title is usually the lead commit’s subject.
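For a quick local check of a subject line, a regular expression over the seven types listed above is usually enough. The sketch below is an illustration, not a tool that ships with Pulse: validSubject is a hypothetical helper, and treating the scope as optional and allowing a "!" marker follow the general Conventional Commits convention and are not stated in this chapter.

```go
package main

import (
	"fmt"
	"regexp"
)

// subjectRE accepts "<type>(<scope>): <summary>" with the types listed in
// this chapter. The optional scope and "!" marker are assumptions.
var subjectRE = regexp.MustCompile(`^(feat|fix|chore|docs|perf|refactor|test)(\([^()]+\))?!?: \S.*$`)

// validSubject reports whether s looks like a Conventional Commits subject.
func validSubject(s string) bool {
	return subjectRE.MatchString(s)
}

func main() {
	for _, s := range []string{
		"feat(cli): add a flag",
		"fix: handle empty header",
		"Update stuff",
	} {
		fmt.Printf("%-28q valid=%v\n", s, validSubject(s))
	}
}
```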
2. Tests first
A PR that adds a new aggregator, error code, field type, I/O format, statistical test, or skill must include tests in the same PR. The testing-first preference is documented in Testing Conventions. Implementation that lands without tests will be sent back; tests that pass without the implementation are suspicious and probably wrong.
3. The Update Demand
The single biggest source of “your PR was bounced” feedback. The full table lives in The Update Demand; the cliff-notes are:
| Change category | Doc/skill update required in the same PR |
|---|---|
| Registered aggregator / attribute / filterer / grouper | The matching skill file + the operator capability table |
| Registered window / feature / synth distribution / statistical test | Same — skill + capability file |
| Error code (added / removed / renamed) | errors/codes.go, skills/error-code-reference.md, descriptor/capabilities_errors.go |
| CLI leaf (added or flag added) | CLAUDE.md “Common Claude Code Workflows” + skills/getting-started.md if user-facing |
| --json envelope change | CLAUDE.md “Output Format Contract” |
| .pulse file format change | CLAUDE.md “Code Conventions” + skills/cohort-schema-design.md |
| New environment variable | CLAUDE.md “Build / Dev / Test Workflow” + skills/getting-started.md |
| New non-skippable CI gate | List it by name in CLAUDE.md |
If you find yourself wanting to defer the doc update to a follow-up PR, stop. The follow-up PR will not happen, and the next contributor will read stale guidance. Update in the same PR or do not merge.
4. Pre-flight checks
make fmt
make lint
make test
For change-category-specific gates, see Testing → Running a subset of gates locally.
5. Open the PR
- Use the bug-report or feature-request template as a starting point if applicable.
- Fill in the PR template’s “Summary” and “Test plan” sections.
- Link related issues with Closes #N.
- Do not push --force to main. Force-pushing your own feature branch is fine before review starts.
6. Review and CI
CI runs the full go test ./... plus the non-skippable gates listed in
Testing → Non-skippable CI gates.
A failing gate means a structural invariant is broken, not a flaky
test; fix the root cause rather than retrying.
When a pre-commit hook or PR check fails, create a new commit with
the fix. Do not git commit --amend after a hook failure: a failed
pre-commit hook means no commit was created, so --amend would rewrite
the previous commit, which may already have been pushed.
7. Merge
- Squash-merge is the default; the squash message follows Conventional Commits.
- Once merged, the deploy workflow rebuilds and publishes this docs site to https://frankbardon.github.io/pulse/.
For changes that introduce a new architectural decision, also update
the relevant section of CLAUDE.md and reference the PRD (if one
exists) in the PR description.