GitHub Agentic Workflows

Forecast Command Specification

Version: 0.1.0
Status: Experimental Draft
Latest Version: forecast-specification
Editor: GitHub Agentic Workflows Team

! Experimental: This specification describes a feature that is under active development. The command interface, output schema, and algorithmic parameters are subject to change without notice. Do not depend on this interface in production workflows.


This specification defines the gh aw forecast command for the GitHub Agentic Workflows (gh-aw) project. The command performs historical sampling of completed agentic workflow runs and applies a Monte Carlo simulation engine to project future Effective Token (ET) consumption over a configurable time horizon. The specification covers workflow discovery (local and remote modes), data sampling via the GitHub Actions API, the Poisson–bootstrap Monte Carlo projection algorithm, episode-level analysis, and both console-table and machine-readable JSON output formats. Implementations conforming to this specification provide operators with probabilistic token-consumption forecasts suitable for capacity planning, cost estimation, and budget governance.


This section describes the status of this document at the time of publication. This is an Experimental Draft specification and may be updated, replaced, or made obsolete by other documents at any time. The feature it describes is experimental and not yet subject to the stability guarantees that apply to other gh-aw commands.

This document is governed by the GitHub Agentic Workflows project specifications process.

Feedback should be filed as GitHub issues against the github/gh-aw repository.


  1. Introduction
  2. Conformance
  3. Terminology
  4. Command Interface
  5. Workflow Discovery
  6. Data Sampling
  7. Monte Carlo Projection Engine
  8. Episode Analysis
  9. Output Formats
  10. Error Handling
  11. Implementation Requirements
  12. Compliance Testing
  13. Sync Notes
  14. Appendices
  15. References
  16. Change Log

The gh aw forecast command addresses the operational need to predict future Large Language Model (LLM) token expenditure for agentic workflows managed by gh-aw. Token consumption is a primary cost driver for agentic systems; the ability to project future usage from historical observations enables:

  • Capacity Planning: Anticipating token demand before budget thresholds are reached.
  • Cost Governance: Providing P10/P50/P90 confidence intervals for financial planning.
  • Workflow Comparison: Ranking workflows by projected token cost across a shared time period.
  • Experiment Evaluation: Measuring the token impact of A/B experiment variants.

The command combines empirical bootstrapping of historical token observations with a Poisson-distributed run-count model to produce statistically sound projections without requiring parametric distribution assumptions on token usage.

This specification covers:

  • Command-line interface: flags, positional arguments, and invocation modes
  • Workflow discovery in local (.github/workflows/) and remote (--repo) modes
  • Historical run sampling and per-run metric derivation
  • The Monte Carlo simulation algorithm producing P10, P50, P90 percentile estimates
  • Episode grouping and episode-level metric computation
  • Console table output format
  • Machine-readable JSON output schema (--json)
  • Error conditions and graceful-degradation behavior

This specification does NOT cover:

  • The Effective Tokens (ET) computation algorithm (defined in the Effective Tokens Specification)
  • The aw_info.json artifact schema
  • A/B experiment frontmatter schema (defined in the A/B Experiments Specification)
  • Billing, pricing, or financial modeling beyond token projections
  • Streaming or real-time token consumption reporting

A conforming gh aw forecast implementation MUST be designed for:

  • Empirical Accuracy: Projections derived from observed historical data rather than assumed distributions.
  • Probabilistic Reporting: P10/P50/P90 uncertainty bounds communicated to callers.
  • Graceful Degradation: Missing data (no runs, no artifacts, no frontmatter) MUST produce partial results rather than failures.
  • Dual Modes: Both local-repository and remote-repository operation without requiring a checkout.
  • Interoperability: JSON output schema stable enough for machine consumption by downstream tooling.

A conforming forecast implementation is one that satisfies all MUST, REQUIRED, and SHALL requirements in this specification.

A partially conforming forecast implementation is one that satisfies all MUST requirements in Sections 4, 5, 6, and 7 but MAY lack support for optional features such as episode analysis (Section 8), experiment variant reporting, or verbose diagnostics.

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

This specification defines three cumulative conformance levels:

  • Level 1 (Required): Command invocation, workflow discovery, historical data sampling, and Monte Carlo projection with console output.
  • Level 2 (Standard): JSON output (--json), episode analysis, remote-repository mode (--repo), and experiment variant reporting.
  • Level 3 (Complete): All optional features including --verbose diagnostics, concurrency limit reporting, and frontmatter metadata enrichment.

A normalized unit of LLM token consumption defined in the Effective Tokens Specification. ET accounts for token class weights and model multipliers to produce a single comparable scalar across heterogeneous LLM invocations.

A single execution of a GitHub Actions workflow. A run has a unique numeric run ID, an event type, a status (completed, in_progress, queued), a conclusion (success, failure, cancelled, etc.), and a head commit SHA.

The time interval [now − days, now] used to bound the set of completed runs eligible for sampling. Controlled by the --days flag.

The subset of completed workflow runs within the historical window selected for metric derivation. The maximum sample size per workflow is controlled by the --sample flag.

A single independent simulation that draws stochastic values for run count, per-run token usage, and per-run success, combining them to produce one projected Effective Token total for the projection period.

The future time interval for which token consumption is projected. Controlled by the --period flag; either one calendar week (week) or one calendar month (month).

The rate of workflow runs observed in the historical window, extrapolated to the projection period length:

observed_runs_per_period = (sampled_run_count / history_days) × period_days

Where period_days is 7 for week and 30 for month.

A logical grouping of one or more workflow runs that collectively represent a single task attempt. Episodes are identified by grouping runs sharing the same headSha and headBranch, or by workflow_dispatch/workflow_call linkage where available.

The effective throughput rate: the expected number of successful runs per projection period, computed as the product of the observed run frequency and the historical success rate:

yield = observed_runs_per_period × success_rate

Where success_rate = successful_run_count / total_sampled_run_count.
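The two rate definitions above can be sketched as follows (a non-normative Python illustration; the function names are ours):

```python
def observed_runs_per_period(sampled_run_count: int, history_days: int, period: str) -> float:
    """Extrapolate the historical run rate to the projection period (7 or 30 days)."""
    period_days = 7 if period == "week" else 30
    return sampled_run_count / history_days * period_days


def yield_per_period(runs_per_period: float, successful: int, total: int) -> float:
    """Expected successful runs per projection period; zero when the sample is empty."""
    success_rate = successful / total if total > 0 else 0.0
    return runs_per_period * success_rate
```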

An empirical resampling technique where individual observations are drawn with replacement from the observed sample. Used in Section 7 to model per-run token usage without parametric distribution assumptions.

A .lock.yml file located in .github/workflows/ that declares a compiled agentic workflow and its associated metadata. Lock files are the authoritative source of workflow identifiers in local mode.


gh aw forecast [workflow_id...] [flags]
Argument    | Type                | Required | Description
workflow_id | string (repeatable) | No       | Zero or more workflow identifiers to forecast. If omitted, all discovered agentic workflows are forecasted.

Workflow identifiers MUST be matched case-insensitively against:

  1. The workflow display name
  2. The workflow file-path basename (without extension)

If a provided workflow_id does not match any discovered workflow, the implementation MUST emit an error message identifying the unmatched identifier and MUST exit with a non-zero status code.

Flag      | Type   | Default | Description
--days    | int    | 30      | Length of the historical sampling window in days. Permitted values: 7, 30.
--period  | string | "month" | Projection period length. Permitted values: "week", "month".
--sample  | int    | 100     | Maximum number of completed runs to sample per workflow. MUST be ≥ 1.
--max-age | int    | 90      | Maximum age in days for historical runs eligible for sampling. Implementations SHOULD discard runs older than this bound unless the caller overrides it. MUST be ≥ 1.
--repo    | string | (none)  | Target a repository other than the current working directory, in owner/repo format. Enables remote mode.
--json    | bool   | false   | Emit machine-readable JSON output instead of console tables.
--verbose | bool   | false   | Emit verbose diagnostic output to stderr during processing.

Implementations MUST validate all flag values before beginning any API calls or file system operations:

  • R-CLI-001: If --days is not one of {7, 30}, the implementation MUST exit with a non-zero status and an error message specifying the permitted values.
  • R-CLI-002: If --period is not one of {"week", "month"}, the implementation MUST exit with a non-zero status and an error message specifying the permitted values.
  • R-CLI-003: If --sample is less than 1, the implementation MUST exit with a non-zero status.
  • R-CLI-004: If --repo is provided, it MUST match the pattern owner/repo (two non-empty components separated by /). An invalid format MUST produce a non-zero exit with a descriptive error.
  • R-CLI-005: If --max-age is provided and is less than 1, the implementation MUST exit with a non-zero status and a descriptive error.
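A sketch of the R-CLI-001 through R-CLI-005 checks (a hypothetical helper, not the normative interface; real implementations would wire these into their flag parser):

```python
import re


def validate_flags(days, period, sample, repo=None, max_age=None):
    """Return a list of R-CLI violations; an empty list means the flags are valid."""
    errors = []
    if days not in (7, 30):                                          # R-CLI-001
        errors.append("--days must be one of {7, 30}")
    if period not in ("week", "month"):                              # R-CLI-002
        errors.append('--period must be "week" or "month"')
    if sample < 1:                                                   # R-CLI-003
        errors.append("--sample must be >= 1")
    if repo is not None and not re.fullmatch(r"[^/]+/[^/]+", repo):  # R-CLI-004
        errors.append("--repo must be in owner/repo format")
    if max_age is not None and max_age < 1:                          # R-CLI-005
        errors.append("--max-age must be >= 1")
    return errors
```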
Code | Meaning
0    | Forecast completed successfully.
1    | Usage error (invalid flags, unmatched workflow IDs).
2    | GitHub API authentication failure.
3    | No workflows discovered.
# Forecast all agentic workflows in the current repository for the next month
gh aw forecast
# Forecast two specific workflows and compare
gh aw forecast ci-doctor daily-planner
# Use a 7-day window and project over the next week
gh aw forecast --period week --days 7
# Emit machine-readable JSON
gh aw forecast --json
# Forecast workflows in a remote repository
gh aw forecast --repo owner/repo
# Forecast a specific workflow in a remote repository
gh aw forecast --repo owner/repo ci-doctor
# Ignore historical runs older than 90 days (default)
gh aw forecast --max-age 90

The forecast command operates in one of two discovery modes, determined by the presence of the --repo flag:

  • Local Mode: --repo is absent; workflows are discovered from the current repository’s .github/workflows/ directory.
  • Remote Mode: --repo is present; workflows are discovered via the GitHub Actions API.

In local mode, the implementation MUST:

  1. R-DISC-001: Enumerate all files matching *.lock.yml within .github/workflows/ of the current working repository.
  2. R-DISC-002: Parse each lock file to extract the workflow identifier and display name.
  3. R-DISC-003: If the .github/workflows/ directory does not exist or contains no lock files, the implementation MUST emit an informational message and exit with code 3.

The implementation MAY additionally read frontmatter metadata from corresponding workflow source files to enrich per-workflow records with:

  • Active trigger types (active_triggers)
  • Concurrency configuration (concurrency_limit)
  • A/B experiment variant declarations (experiment_variants)

Frontmatter enrichment is OPTIONAL; absence of a corresponding source file MUST NOT prevent discovery or projection of the workflow.

In remote mode (when --repo owner/repo is specified), the implementation MUST:

  1. R-DISC-010: Call the GitHub Actions API (GET /repos/{owner}/{repo}/actions/workflows) to enumerate workflows in the target repository. If workflow discovery hits a primary or secondary GitHub API rate limit, the implementation SHOULD back off and retry before failing.
  2. R-DISC-011: Filter the returned workflows to those identified as agentic (e.g., by inspecting file-path conventions, labels, or other implementation-defined heuristics).
  3. R-DISC-012: Match any caller-supplied workflow_id positional arguments against workflow display names and file-path basenames using case-insensitive string comparison.
  4. R-DISC-013: If rate-limit exhaustion occurs but at least one caller-supplied workflow identifier has already been resolved, the implementation MUST continue with that subset as a partial result set and MUST emit a warning identifying the degraded discovery mode.

In remote mode, frontmatter metadata (triggers, concurrency, experiment variants) is UNAVAILABLE because the workflow source files are not accessible. The implementation MUST degrade gracefully: fields that depend on frontmatter MUST be omitted from output or reported as their zero/empty values rather than causing an error.

Workflow ID matching MUST be case-insensitive. A caller-supplied identifier matches a discovered workflow if and only if it equals (ignoring case) either:

  • The workflow’s display name, OR
  • The basename of the workflow’s file path (without file extension)

Matching MUST be performed after discovery is complete; partial prefix matches are NOT sufficient for conformance.
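The matching rule can be sketched as below (non-normative; stripping everything after the first dot in the basename is an assumption about how "without extension" applies to compound extensions such as .lock.yml):

```python
import os


def matches_workflow(supplied_id: str, display_name: str, file_path: str) -> bool:
    """Exact, case-insensitive match against the display name or the file
    basename; prefix matches are deliberately rejected."""
    basename = os.path.basename(file_path).split(".", 1)[0]
    wanted = supplied_id.lower()
    return wanted == display_name.lower() or wanted == basename.lower()
```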


For each discovered workflow (or each workflow in the filtered set), the implementation MUST perform the following sampling procedure:

  1. R-SAMP-001: Query completed workflow runs within the historical window using the equivalent of gh run list --workflow <id> --status completed --limit <sample> --created >=<cutoff>.
  2. R-SAMP-002: Limit the returned run set to at most --sample runs.
  3. R-SAMP-003: Implementations SHOULD discard historical runs older than 90 days by default, even when a broader sampling window is requested, and SHOULD expose this bound through a --max-age flag so operators can opt in to older samples when needed.
  4. R-SAMP-004: For each run in the sample, derive the per-run metrics defined in Section 6.2.
  5. R-SAMP-005: Record the count of runs with a successful conclusion separately from the total sampled count.

If the historical window yields zero completed runs for a workflow, the implementation MUST:

  • R-SAMP-006: Return nil (or a sentinel empty result) for that workflow’s Monte Carlo projection.
  • R-SAMP-007: Include the workflow in output with sampled_runs: 0 and all projection fields set to zero.
  • R-SAMP-008: SHOULD emit a warning indicating that no historical data is available for the workflow.

For each sampled run, the implementation MUST derive:

Metric           | Source                   | Description
effective_tokens | aw_info.json artifact    | Total ET for this run as defined in the Effective Tokens Specification.
duration_seconds | Run start/end timestamps | Wall-clock duration of the run in seconds.
success          | Run conclusion field     | true if conclusion is "success", false otherwise.

Effective token counts are obtained from locally-cached run summaries when available. The gh aw logs command stores a run_summary.json file for each processed run under {output_dir}/run-{run_id}/. During forecasting the implementation:

  • R-SAMP-010: MUST attempt to load the cached run_summary.json for each sampled run using the default logs output directory (.github/aw/logs).
  • R-SAMP-011: MUST extract the TotalEffectiveTokens field from the cached TokenUsage summary when present.
  • R-SAMP-012: If no cached summary exists or the ET field is zero, the run’s ET contribution MUST be treated as zero and the run MUST still be counted in sampled_runs. The implementation SHOULD log a debug-level warning.

This lightweight approach avoids re-downloading artifacts while still providing accurate ET observations for runs that have already been processed locally by gh aw logs.
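The cached-summary lookup (R-SAMP-010 through R-SAMP-012) might look like the following sketch. The exact JSON nesting (a TokenUsage object containing TotalEffectiveTokens) is an assumption inferred from the field names in this section:

```python
import json
from pathlib import Path


def cached_effective_tokens(output_dir: str, run_id: int) -> float:
    """Load TotalEffectiveTokens from a cached run_summary.json; return 0 when
    the summary is missing or unreadable (R-SAMP-012)."""
    path = Path(output_dir) / f"run-{run_id}" / "run_summary.json"
    try:
        summary = json.loads(path.read_text())
    except (OSError, ValueError):
        return 0.0
    return float(summary.get("TokenUsage", {}).get("TotalEffectiveTokens", 0))
```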

Duration MUST be computed as:

duration_seconds = run.updated_at − run.started_at

Both timestamps MUST be sourced from the GitHub Actions API run object. If either timestamp is zero or unavailable, the run’s duration contribution SHOULD be treated as zero.

After sampling, the implementation MUST compute:

observed_runs_per_period = (sampled_run_count / history_days) × period_days

Where:

  • history_days is the value of --days
  • period_days is 7 for "week" and 30 for "month"

The Monte Carlo engine runs 10,000 independent simulation trials per workflow to produce a probability distribution over projected Effective Token consumption in the next projection period. The engine models three independent sources of uncertainty per trial.

Implementations MUST use exactly 10,000 trials. The trial count is a normative requirement to ensure consistency of P10/P50/P90 estimates across implementations.

Each trial draws independently from three stochastic components:

The number of runs in the projection period is modeled as a Poisson random variable with rate parameter:

λ = observed_runs_per_period

The implementation MUST use:

  • Knuth’s exact algorithm when λ ≤ 15:

    L ← e^(−λ)
    k ← 0; p ← 1
    repeat:
        k ← k + 1
        p ← p × Uniform(0, 1)
    until p ≤ L
    return k − 1
  • Normal approximation when λ > 15:

    k ← round(Normal(μ=λ, σ=sqrt(λ)))
    k ← max(0, k)
  • R-MC-001: For λ = 0, the implementation MUST return a projected token total of 0 for that trial without invoking either algorithm.

  • R-FC-060: Implementations MUST use λ = 15 as the crossover threshold: Knuth’s exact algorithm for λ ≤ 15, and Normal approximation only for λ > 15. Implementations MUST NOT raise this threshold above 15 without a specification revision, because the documented error and comparability assumptions are calibrated to this crossover.
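A direct translation of the run-count sampler (non-normative sketch; the function name is ours):

```python
import math
import random


def sample_run_count(lam: float, rng: random.Random) -> int:
    """Poisson draw per Section 7.2.1: zero short-circuit (R-MC-001), Knuth's
    exact algorithm up to the lambda = 15 crossover, normal approximation
    above it (R-FC-060)."""
    if lam <= 0:
        return 0
    if lam <= 15:
        # Knuth: multiply uniforms until the product drops below e^(-lambda)
        limit = math.exp(-lam)
        k, p = 0, 1.0
        while p > limit:
            k += 1
            p *= rng.random()
        return k - 1
    # Normal approximation, rounded and clamped at zero
    return max(0, round(rng.gauss(lam, math.sqrt(lam))))
```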

7.2.2 Per-Run Token Usage (Bootstrap Resampling)


Token usage per run is modeled empirically using bootstrap resampling:

  • R-MC-010: For each run in a trial, the implementation MUST draw one observation uniformly at random with replacement from the set of historical ET observations in the sample.
  • R-MC-011: If the sample contains zero ET observations (all runs had missing artifacts), the per-run token draw MUST return 0.

This non-parametric approach preserves the empirical distribution of token usage, including multi-modal distributions and heavy tails, without imposing a parametric form.

Whether a given run in the trial succeeds is modeled as a Bernoulli draw:

P(success) = success_rate = successful_run_count / total_sampled_run_count
  • R-MC-020: Each run in a trial MUST independently draw from Bernoulli(success_rate).
  • R-MC-021: Only successful runs contribute their token draw to the trial’s projected total. Failed runs contribute zero tokens to the projection.
  • R-MC-022: If total_sampled_run_count = 0, success_rate MUST be treated as 0. The implementation MUST return a zero projection for all trials.

For a given trial with k drawn runs:

trial_tokens = Σ_{i=1}^{k} (success_i × token_draw_i)

Where:

  • success_i is 1 if the Bernoulli draw for run i succeeds, 0 otherwise
  • token_draw_i is the bootstrapped ET observation for run i
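Combining the three components, one trial can be sketched as follows (non-normative; Knuth's Poisson sampler is inlined here for small λ, and the full crossover rule of Section 7.2.1 would apply in practice):

```python
import math
import random


def trial_tokens(et_observations, success_rate, lam, rng):
    """One Monte Carlo trial: Poisson run count, then a bootstrap ET draw
    (R-MC-010/011) gated by a Bernoulli success draw (R-MC-020/021) per run."""
    if lam <= 0:
        return 0.0  # R-MC-001
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    k -= 1
    total = 0.0
    for _ in range(k):
        if rng.random() < success_rate:  # success_i ~ Bernoulli(success_rate)
            # token_draw_i: bootstrap with replacement; 0 for an empty sample
            total += rng.choice(et_observations) if et_observations else 0.0
    return total
```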

After completing all 10,000 trials, the implementation MUST compute and report:

Statistic                       | Definition
mean_projected_effective_tokens | Arithmetic mean of all trial totals
std_dev_effective_tokens        | Population or sample standard deviation of all trial totals
p10_projected_effective_tokens  | 10th percentile of trial totals (lower bound of the 80% CI)
p50_projected_effective_tokens  | 50th percentile of trial totals (median projection)
p90_projected_effective_tokens  | 90th percentile of trial totals (upper bound of the 80% CI)

Percentile computation MUST use the nearest-rank method or an equivalent method that produces results consistent with a 10,000-element sorted array.
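The nearest-rank method can be sketched as (non-normative):

```python
import math


def nearest_rank_percentile(sorted_totals, pct):
    """Nearest-rank percentile over sorted trial totals, e.g. pct=0.10 for P10.
    The rank is 1-based: the smallest rank covering pct of the sample."""
    rank = max(1, math.ceil(pct * len(sorted_totals)))
    return sorted_totals[rank - 1]
```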

The projected_effective_tokens top-level field MUST equal p50_projected_effective_tokens.

If no historical runs are available for a workflow, the implementation MUST return a nil (empty/zero) projection for that workflow. Nil projections MUST be represented in JSON output as zero values for all numeric Monte Carlo fields. The implementation MUST NOT run trials when the sample is empty.


An episode is a logical grouping of one or more workflow runs that collectively represent a single task attempt. Episode analysis computes per-episode metrics to reveal how many runs, on average, are required to complete a task successfully.

The implementation MUST group sampled runs into episodes using the episode-construction engine (the buildEpisodeData and classifyEpisode routines):

  • R-EP-001: Runs sharing the same headSha and headBranch MUST be grouped into the same episode.
  • R-EP-002: Runs linked by workflow_dispatch or workflow_call relationships (reconstructed from cached run summaries) SHOULD be merged into the triggering run’s episode.

During forecasting, full artifact data may not be available for all sampled runs. When cached summary data is unavailable:

  • R-EP-010: workflow_dispatch/workflow_call linkage MUST be omitted from episode construction.
  • R-EP-011: The resulting sampled_episodes count MUST be treated as a lower-bound estimate. Implementations MUST communicate this limitation in output (e.g., via a note in console output or a boolean episode_count_is_lower_bound field in JSON).

For orchestrator workflows that primarily receive workflow_call triggers, the episode count underestimate may be significant. Implementations SHOULD emit a warning when the dominant trigger type is workflow_call or workflow_dispatch.
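A sketch of the grouping step under these constraints (non-normative; run records are shown as plain dictionaries keyed by the API field names):

```python
from collections import defaultdict


def group_into_episodes(runs):
    """R-EP-001 grouping: runs sharing (headSha, headBranch) form one episode.
    workflow_dispatch/workflow_call linkage (R-EP-002) is omitted here, so the
    resulting count is the lower-bound estimate described in R-EP-011."""
    episodes = defaultdict(list)
    for run in runs:
        episodes[(run["headSha"], run["headBranch"])].append(run)
    return dict(episodes)
```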

For each workflow, the implementation MUST compute:

Metric                           | Definition
sampled_episodes                 | Count of distinct episodes identified in the sample
runs_per_episode                 | sampled_run_count / sampled_episodes
avg_effective_tokens_per_episode | Mean ET summed across all runs within each episode
observed_episodes_per_period     | (sampled_episodes / history_days) × period_days

The implementation MUST display the episode analysis table in console output when any workflow in the result set has runs_per_episode > 1.0. The table SHOULD be omitted when all workflows have runs_per_episode = 1.0 (one run per episode is the baseline and adds no additional information).


When --json is not specified, the implementation MUST render a formatted console table to stderr with the following columns:

Column           | Description
Workflow         | Workflow display name or identifier
Sampled Runs     | Count of completed runs included in the sample
Success Rate     | Fraction of sampled runs concluding with success, formatted as a percentage; N/A when no runs were sampled
Yield/Period     | Effective throughput rate (success_rate × observed_runs_per_period), formatted to one decimal place
Avg ET           | avg_effective_tokens formatted as K/M abbreviations (e.g. 12.5K, 1.20M); - when zero
Proj. ET (P50)   | Median projected effective tokens from Monte Carlo (P50), formatted as K/M abbreviations
80% CI (P10–P90) | Confidence interval range p10–p90, both formatted as K/M abbreviations
Triggers         | Comma-separated list of active trigger event names from frontmatter (up to 3, remainder shown as +N)
  • R-OUT-001: Column widths MUST be auto-fitted to the widest value in each column.
  • R-OUT-002: ET values MUST be formatted as K/M abbreviations (e.g. 12.5K, 1.20M); raw integer values of zero MUST be rendered as -.
  • R-OUT-003: Rows MUST be sorted by Monte Carlo P50 projected effective tokens in descending order; when Monte Carlo data is unavailable, sort by projected_effective_tokens.
  • R-OUT-004: A workflow with zero sampled runs MUST appear in the table with - in projection columns and N/A in rate columns.
  • R-OUT-005: When episode analysis is applicable (Section 8.4), a second table with episode metrics MUST be printed below the main table, separated by a blank line.
Workflow       Sampled Runs  Success Rate  Yield/Period  Avg ET  Proj. ET (P50)  80% CI (P10–P90)  Triggers
ci-doctor      42            92%           35.4          12.5K   480.0K          430.0K–535.0K     pull_request, workflow_dispatch
daily-planner  18            89%           14.4          8.2K    131.0K          105.0K–158.0K     schedule
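The K/M abbreviation rule (R-OUT-002) can be sketched as below. The decimal-place choices (one for K, two for M) are inferred from the examples in this section and are an assumption, not normative text:

```python
def format_et(value: float) -> str:
    """Render ET values per R-OUT-002: '-' for zero, K/M abbreviations otherwise."""
    if value == 0:
        return "-"
    if value >= 1_000_000:
        return f"{value / 1_000_000:.2f}M"
    if value >= 1_000:
        return f"{value / 1_000:.1f}K"
    return f"{value:g}"
```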

When --json is specified, the implementation MUST emit a single JSON object to stdout conforming to the following schema. The implementation MUST NOT emit any other content (banners, progress indicators, or table output) to stdout. Diagnostic messages MAY be emitted to stderr.

{
  "period": "<string>",
  "as_of": "<RFC 3339 timestamp>",
  "workflows": [ <WorkflowForecast>, ... ]
}
Field     | Type   | Required | Description
period    | string | MUST     | Projection period: "week" or "month".
as_of     | string | MUST     | ISO 8601 / RFC 3339 UTC timestamp at which the forecast was computed.
workflows | array  | MUST     | Ordered array of per-workflow forecast objects. MUST be sorted by projected_effective_tokens (P50) descending.
{
  "workflow_id": "<string>",
  "period": "<string>",
  "sampled_runs": <integer>,
  "history_days": <integer>,
  "observed_runs_per_period": <number>,
  "success_rate": <number>,
  "yield": <number>,
  "avg_effective_tokens": <number>,
  "avg_duration_seconds": <number>,
  "projected_effective_tokens": <number>,
  "active_triggers": [ "<string>", ... ],
  "concurrency_limit": <integer>,
  "monte_carlo": { <MonteCarlo> },
  "episode_analysis": { <EpisodeAnalysis> },
  "experiment_variants": [ <ExperimentVariant>, ... ]
}
Field                      | Type             | Required | Description
workflow_id                | string           | MUST     | Workflow identifier as used in discovery.
period                     | string           | MUST     | Mirrors the root period field.
sampled_runs               | integer          | MUST     | Number of runs included in the sample.
history_days               | integer          | MUST     | Value of --days used for this forecast.
observed_runs_per_period   | number           | MUST     | Extrapolated run rate for the projection period.
success_rate               | number           | MUST     | Fraction of sampled runs that concluded successfully, in [0.0, 1.0].
yield                      | number           | MUST     | Effective throughput rate: success_rate × observed_runs_per_period.
avg_effective_tokens       | number           | MUST     | Mean ET per sampled run. 0 when no ET data is available.
avg_duration_seconds       | number           | MUST     | Mean wall-clock duration per sampled run in seconds.
projected_effective_tokens | number           | MUST     | P50 Monte Carlo projection. Equals monte_carlo.p50_projected_effective_tokens.
active_triggers            | array of strings | SHOULD   | Trigger event types from workflow frontmatter. Empty array when frontmatter is unavailable.
concurrency_limit          | integer          | SHOULD   | Concurrency group limit from frontmatter. 0 indicates unlimited or unavailable.
monte_carlo                | object           | MUST     | Monte Carlo simulation results. See Section 9.2.3.
episode_analysis           | object           | SHOULD   | Episode analysis results. See Section 9.2.4.
experiment_variants        | array            | MAY      | A/B experiment variant breakdown. See Section 9.2.5. Empty array when frontmatter is unavailable or no experiments are configured.
{
  "iterations": 10000,
  "mean_projected_effective_tokens": <number>,
  "std_dev_effective_tokens": <number>,
  "p10_projected_effective_tokens": <number>,
  "p50_projected_effective_tokens": <number>,
  "p90_projected_effective_tokens": <number>
}
Field                           | Type    | Required | Description
iterations                      | integer | MUST     | 10000 for any non-empty sample; 0 when sampled_runs = 0 (see below).
mean_projected_effective_tokens | number  | MUST     | Arithmetic mean of trial totals.
std_dev_effective_tokens        | number  | MUST     | Standard deviation of trial totals.
p10_projected_effective_tokens  | number  | MUST     | 10th percentile of trial totals.
p50_projected_effective_tokens  | number  | MUST     | 50th percentile (median) of trial totals.
p90_projected_effective_tokens  | number  | MUST     | 90th percentile of trial totals.

When sampled_runs = 0, all numeric fields in this object MUST be 0 and iterations MUST be 0.

{
  "sampled_episodes": <integer>,
  "runs_per_episode": <number>,
  "avg_effective_tokens_per_episode": <number>,
  "observed_episodes_per_period": <number>
}
Field                            | Type    | Required | Description
sampled_episodes                 | integer | MUST     | Distinct episode count. Lower-bound estimate when artifact linkage is unavailable.
runs_per_episode                 | number  | MUST     | Mean runs per episode.
avg_effective_tokens_per_episode | number  | MUST     | Mean ET per episode.
observed_episodes_per_period     | number  | MUST     | Extrapolated episode rate for the projection period.
{
  "experiment_name": "<string>",
  "variant": "<string>",
  "run_count": <integer>,
  "fraction": <number>
}
Field           | Type    | Required | Description
experiment_name | string  | MUST     | Name of the A/B experiment from frontmatter.
variant         | string  | MUST     | Variant identifier (e.g., "control", "treatment").
run_count       | integer | MUST     | Number of sampled runs assigned to this variant.
fraction        | number  | MUST     | run_count / sampled_runs for this workflow; in [0.0, 1.0].
{
  "period": "month",
  "as_of": "2026-05-10T22:00:00Z",
  "workflows": [
    {
      "workflow_id": "ci-doctor",
      "period": "month",
      "sampled_runs": 42,
      "history_days": 30,
      "observed_runs_per_period": 38.5,
      "success_rate": 0.92,
      "yield": 35.42,
      "avg_effective_tokens": 12500,
      "avg_duration_seconds": 145.3,
      "projected_effective_tokens": 480000,
      "active_triggers": ["pull_request", "workflow_dispatch"],
      "concurrency_limit": 0,
      "monte_carlo": {
        "iterations": 10000,
        "mean_projected_effective_tokens": 481250,
        "std_dev_effective_tokens": 32000.5,
        "p10_projected_effective_tokens": 430000,
        "p50_projected_effective_tokens": 480000,
        "p90_projected_effective_tokens": 535000
      },
      "episode_analysis": {
        "sampled_episodes": 40,
        "runs_per_episode": 1.05,
        "avg_effective_tokens_per_episode": 13100,
        "observed_episodes_per_period": 36.7
      },
      "experiment_variants": [
        {
          "experiment_name": "model-selection",
          "variant": "control",
          "run_count": 21,
          "fraction": 0.5
        },
        {
          "experiment_name": "model-selection",
          "variant": "treatment",
          "run_count": 21,
          "fraction": 0.5
        }
      ]
    }
  ]
}
  • R-OUT-010: In both console and JSON output, workflows MUST be ordered by projected_effective_tokens (P50 value) in descending order.
  • R-OUT-011: Workflows with zero projected tokens MUST appear after all workflows with non-zero projections.
  • R-OUT-012: Among workflows with equal projected tokens, the ordering SHOULD be deterministic (e.g., alphabetical by workflow ID).
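The ordering rules can be sketched with a single sort key (non-normative; the alphabetical tiebreak is one of the deterministic orderings permitted by R-OUT-012):

```python
def order_workflows(forecasts):
    """R-OUT-010..012: descending by P50 projection, ties broken alphabetically
    by workflow_id. Zero-projection workflows naturally sort last (R-OUT-011)."""
    return sorted(forecasts,
                  key=lambda f: (-f["projected_effective_tokens"], f["workflow_id"]))
```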

If the GitHub API returns an authentication error (HTTP 401 or 403):

  • R-ERR-001: The implementation MUST emit a descriptive error message to stderr indicating the authentication failure and guidance on re-authenticating with gh auth login.
  • R-ERR-002: The implementation MUST exit with code 2.

If the GitHub API returns a rate-limit response (HTTP 429 or an X-RateLimit-Remaining: 0 header):

  • R-ERR-010: The implementation SHOULD retry the request after the period indicated by the X-RateLimit-Reset header.
  • R-ERR-011: The implementation MUST emit a warning to stderr when entering a rate-limit wait state.
  • R-ERR-012: If retry is not feasible, the implementation MUST exit with a non-zero status and a message indicating the rate limit condition.

When one or more workflows in the discovery set encounter individual errors (e.g., artifact download failure, API timeout for a specific workflow):

  • R-ERR-020: The implementation MUST continue processing the remaining workflows rather than aborting the entire forecast.
  • R-ERR-021: Workflows that encountered individual errors MUST appear in output with sampled_runs: 0 and all projection fields zeroed.
  • R-ERR-022: The implementation MUST emit a warning to stderr for each workflow that encountered an individual error.

If workflow discovery yields zero workflows:

  • R-ERR-030: The implementation MUST emit a message to stderr indicating that no agentic workflows were found and describing the discovery mode used.
  • R-ERR-031: The implementation MUST exit with code 3.

When --verbose is specified, the implementation SHOULD emit the following additional diagnostic information to stderr:

  • The list of discovered workflows and their identifiers
  • The number of runs fetched per workflow
  • The number of runs with valid ET data versus missing artifacts
  • The computed λ (Poisson rate) for each workflow
  • Timing information for API calls and simulation execution

  • R-IMPL-001: The Monte Carlo engine MUST use a cryptographically seeded pseudorandom number generator (PRNG). Implementations MUST NOT use a fixed seed unless in test mode.
  • R-IMPL-002: The PRNG MUST be seeded independently per forecast invocation to ensure different results on repeated calls.
  • R-IMPL-010: The 10,000-trial simulation for a single workflow MUST complete within 500 milliseconds on a single CPU core with a sample size of 100 runs.
  • R-IMPL-011: Multiple workflows SHOULD be forecasted concurrently where the runtime environment supports parallelism.
  • R-IMPL-012: API calls for data sampling SHOULD be made concurrently across workflows, subject to GitHub API rate limit constraints.
  • R-IMPL-020: Given a fixed sample and fixed PRNG seed (in test mode), the Monte Carlo output MUST be reproducible. This requirement applies to test and validation scenarios only; production invocations MUST use random seeds (R-IMPL-001).
  • R-IMPL-030: All intermediate ET computations MUST use 64-bit floating-point arithmetic (IEEE 754 double precision).
  • R-IMPL-031: JSON serialization of numeric fields MUST NOT produce non-finite values (NaN, +Inf, -Inf). If a computation produces a non-finite value, it MUST be replaced with 0 and a warning MUST be emitted.
  • R-IMPL-032: Implementations MUST NOT round projected ET values in intermediate computations; rounding for display purposes MUST occur only at serialization time.

Because the forecast command is marked Experimental:

  • R-IMPL-040: The implementation MUST emit a warning to stderr on every invocation indicating the experimental status of the command unless --json is specified (JSON callers are assumed to be automated pipelines that handle warnings separately).
  • R-IMPL-041: The JSON output schema MAY have new fields added in minor versions without notice. Callers MUST treat unknown fields as ignorable.

  • T-FC-001: Invocation with invalid --days value exits non-zero with descriptive error.
  • T-FC-002: Invocation with invalid --period value exits non-zero with descriptive error.
  • T-FC-003: Invocation with --sample < 1 exits non-zero.
  • T-FC-004: Invocation with invalid --repo format exits non-zero.
  • T-FC-005: Unmatched workflow_id positional argument exits non-zero with identification of the unmatched value.
  • T-FC-010: Local mode: discovers workflows from .github/workflows/*.lock.yml.
  • T-FC-011: Local mode: no lock files found exits with code 3.
  • T-FC-012: Remote mode: calls GitHub Actions API and matches workflow IDs case-insensitively.
  • T-FC-013: Remote mode: missing frontmatter fields default to zero/empty without error.
  • T-FC-030: Remote mode: on GitHub API rate-limit exhaustion during workflow discovery, the implementation backs off and emits a warning before continuing with caller-supplied workflow IDs as partial results.
  • T-FC-020: Sampling respects --sample limit.
  • T-FC-021: Sampling respects --days historical window cutoff.
  • T-FC-022: Run with missing aw_info.json artifact contributes zero ET and is still counted in sampled_runs.
  • T-FC-023: Workflow with zero sampled runs produces nil projection with zero fields.
  • T-FC-031: With λ ≤ 15, Knuth’s algorithm is used for Poisson draw (verifiable by seeded PRNG in test mode).
  • T-FC-032: With λ > 15, Normal approximation is used; drawn value is non-negative.
  • T-FC-033: With λ = 0, projected tokens is exactly 0 for all trials.
  • T-FC-034: Bootstrap resampling draws with replacement from historical ET observations.
  • T-FC-035: Only successful Bernoulli draws contribute ET to the trial total.
  • T-FC-036: 10,000 trials are executed per workflow.
  • T-FC-037: P10 ≤ P50 ≤ P90 for all non-zero projections.
  • T-FC-038: projected_effective_tokens equals p50_projected_effective_tokens.
  • T-FC-039: Boundary crossover: λ = 15 uses Knuth’s exact branch.
  • T-FC-040: Boundary crossover: λ > 15 uses Normal approximation branch.
  • T-FC-041: Runs sharing headSha and headBranch are grouped into the same episode.
  • T-FC-042: runs_per_episode equals sampled_run_count / sampled_episodes.
  • T-FC-043: Episode table is printed in console output when any workflow has runs_per_episode > 1.
  • T-FC-044: Episode table is suppressed when all workflows have runs_per_episode = 1.0.
  • T-FC-050: Console output contains all required columns.
  • T-FC-051: JSON output is valid JSON conforming to the schema in Section 9.2.
  • T-FC-052: JSON as_of field is a valid RFC 3339 UTC timestamp.
  • T-FC-053: JSON workflows array is sorted by projected_effective_tokens descending.
  • T-FC-054: No stdout output (other than JSON) when --json is specified.
  • T-FC-055: Experimental warning emitted to stderr unless --json is specified.
Requirement | Test ID | Level | Status
Flag validation | T-FC-001–005 | 1 | Required
Local workflow discovery | T-FC-010–011 | 1 | Required
Remote workflow discovery | T-FC-012–013 | 2 | Required
Remote discovery rate-limit backoff and partial results | T-FC-030 | 2 | Required
Data sampling with limit and window | T-FC-020–021 | 1 | Required
Missing artifact graceful handling | T-FC-022 | 1 | Required
Nil projection for empty sample | T-FC-023 | 1 | Required
Knuth Poisson algorithm (λ ≤ 15) | T-FC-031 | 1 | Required
Normal approximation (λ > 15) | T-FC-032 | 1 | Required
Zero-λ projection | T-FC-033 | 1 | Required
Bootstrap resampling | T-FC-034 | 1 | Required
Bernoulli success filtering | T-FC-035 | 1 | Required
10,000 trial count | T-FC-036 | 1 | Required
Percentile ordering | T-FC-037 | 1 | Required
P50 field consistency | T-FC-038 | 1 | Required
λ crossover threshold enforcement | T-FC-039–040 | 1 | Required
Episode grouping | T-FC-041–042 | 2 | Required
Episode table display logic | T-FC-043–044 | 2 | Required
Console output columns | T-FC-050 | 1 | Required
JSON schema conformance | T-FC-051–054 | 2 | Required
Experimental status warning | T-FC-055 | 1 | Required

This section maps normative forecast requirements to implementation files.

Normative Area | Implementation File(s)
Monte Carlo engine (Poisson/Bootstrap/Bernoulli) | pkg/cli/forecast_montecarlo.go
Forecast command orchestration and output fields | pkg/cli/forecast.go, pkg/cli/forecast_command.go
Workflow discovery, rate-limit backoff, and run sampling | pkg/cli/forecast.go
Forecast compliance tests (including rate-limit backoff and λ thresholds) | pkg/cli/forecast_montecarlo_test.go

Sync procedure:

  1. Update this specification when changing projection algorithms or thresholds.
  2. Update corresponding Go implementation/tests in the files above in the same change.
  3. Re-run forecast tests to verify normative parity.

Appendix A: Worked Example

A workflow named ci-doctor has the following historical sample over 30 days:

  • 42 completed runs
  • 5 runs missing aw_info.json (treated as 0 ET)
  • ET observations (for the 37 runs with artifacts): range from 8,000 to 18,000, mean ≈ 12,500
  • 38 successful runs (yield = 38/42 ≈ 0.905)
  • Projection period: month (30 days)
observed_runs_per_period = (42 / 30) × 30 = 42.0
λ = 42.0

Since λ > 15, Normal approximation is used: Normal(μ=42, σ=√42 ≈ 6.48).

Draw k ~ round(Normal(42, 6.48)) = 44 (example).

For each of the 44 runs:

  1. Draw success: Bernoulli(0.905) → say 40 succeed.
  2. For each of the 40 successful runs, draw one ET observation from the 37-item historical pool (bootstrap).
  3. Sum the 40 ET draws.

One trial might yield: 40 × 12,200 (average draw) ≈ 488,000 ET.

Sorted trial totals (example summary):

P10 ≈ 415,000 (10th percentile — lower bound of 80% CI)
P50 ≈ 479,000 (median — headline projection)
P90 ≈ 545,000 (90th percentile — upper bound of 80% CI)
mean ≈ 481,000
std_dev ≈ 40,000
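The summary percentiles are read off the sorted trial totals. A minimal nearest-rank percentile helper might look like the following; the interpolation method is an assumption, as this specification does not mandate one.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the value at percentile p (0-100) of totals using a
// nearest-rank index into a sorted copy. By construction this preserves the
// P10 <= P50 <= P90 ordering invariant (T-FC-037).
func percentile(totals []float64, p float64) float64 {
	s := append([]float64(nil), totals...) // sort a copy, leave input intact
	sort.Float64s(s)
	idx := int(p / 100 * float64(len(s)-1))
	return s[idx]
}

func main() {
	totals := []float64{415000, 430000, 479000, 500000, 545000}
	fmt.Println(percentile(totals, 10), percentile(totals, 50), percentile(totals, 90))
}
```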

Appendix B: Poisson Algorithm Selection Rationale

Knuth’s exact Poisson algorithm is used for small λ (≤ 15) because it produces exact integer draws from the Poisson distribution without bias. For large λ, the Poisson distribution converges to a Normal distribution (N(λ, λ)), making the Normal approximation computationally efficient and sufficiently accurate.

The threshold of λ = 15 is chosen as the crossover point where Normal approximation error is below 1% for the tails relevant to P10/P90 computation. Implementations MAY lower this threshold (e.g., to λ = 30) for greater accuracy at a minor performance cost.

Appendix C: Bootstrap Resampling Rationale

Traditional projection models assume a parametric distribution (e.g., log-normal) for per-run token usage. Agentic workflow token usage is frequently multi-modal (e.g., simple tasks versus complex multi-step tasks) and exhibits heavy tails due to recursive sub-agent chains. Bootstrap resampling avoids distributional misspecification by directly sampling from the empirical distribution, preserving these characteristics faithfully. The tradeoff is that projections are bounded by observed extremes; extrapolation beyond observed maximum ET requires explicit assumption and is out of scope for this specification.

Appendix D: Episode Count Lower-Bound Semantics

For orchestrator workflows that primarily use workflow_call or workflow_dispatch triggers, episodes are initiated by calls from another workflow rather than directly by GitHub events. These cross-workflow links are embedded in aw_info.json artifacts and are unavailable during forecasting when artifacts cannot be retrieved. As a result, each received workflow_call is counted as a separate episode, inflating the episode count and hiding the cross-workflow linkage between runs. This means runs_per_episode may appear closer to 1.0 than its true value: sampled_episodes is an upper-bound estimate of the true episode count, and runs_per_episode is correspondingly a lower-bound estimate. Callers MUST treat these values accordingly in this scenario and SHOULD note this limitation in any capacity-planning documents.
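The grouping rule tested by T-FC-041 (runs sharing headSha and headBranch form one episode) can be sketched as follows; the struct and field names are illustrative assumptions that mirror, but are not taken from, the GitHub Actions run payload.

```go
package main

import "fmt"

// run carries only the two fields used as the episode grouping key.
type run struct {
	headSha    string
	headBranch string
}

// countEpisodes groups runs by (headSha, headBranch). Without the
// cross-workflow links in aw_info.json, workflow_call-triggered runs land
// in distinct groups, so the derived runs_per_episode is a lower bound.
func countEpisodes(runs []run) int {
	seen := map[[2]string]bool{}
	for _, r := range runs {
		seen[[2]string{r.headSha, r.headBranch}] = true
	}
	return len(seen)
}

func main() {
	runs := []run{
		{"abc123", "main"}, {"abc123", "main"}, // same episode
		{"def456", "main"}, // second episode
	}
	episodes := countEpisodes(runs)
	fmt.Printf("episodes=%d runs_per_episode=%.1f\n",
		episodes, float64(len(runs))/float64(episodes))
}
```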

  • Credential scope: The forecast command accesses the GitHub Actions API using the credentials of the gh CLI. Token permissions MUST include actions:read for the target repository. Callers SHOULD use the minimum necessary scope.
  • Artifact content: The aw_info.json artifact MAY contain sensitive information such as prompt fragments embedded in ET metadata. Implementations MUST NOT log artifact payloads at verbosity levels accessible to non-administrative users.
  • Remote repository access: When --repo targets a repository the caller does not own, the caller MUST have explicit read access. The implementation MUST NOT attempt to bypass or circumvent repository access controls.
  • JSON output: The JSON output schema exposes token consumption patterns that MAY reveal information about system architecture and model configuration. JSON output SHOULD be treated as internal operational data and not exposed publicly.

  • [KNUTH-TAOCP] Knuth, D.E., “The Art of Computer Programming, Volume 2: Seminumerical Algorithms”, 3rd edition. Section 3.4.1 (Poisson distribution generation algorithm).
  • [BOOTSTRAP] Efron, B. and Tibshirani, R., “An Introduction to the Bootstrap”, Chapman & Hall, 1993.
  • [GH-ACTIONS-API] GitHub, “GitHub Actions REST API Reference”. https://docs.github.com/en/rest/actions

  • Initial specification for gh aw forecast command
  • Defined command interface: flags --days, --period, --sample, --repo, --json, --verbose
  • Defined local and remote workflow discovery modes
  • Defined data sampling procedure and per-run metric derivation
  • Defined Monte Carlo projection engine with Poisson + bootstrap algorithm
  • Defined episode analysis with lower-bound semantics for orchestrator workflows
  • Defined console table output format
  • Defined JSON output schema (Sections 9.2.1–9.2.6)
  • Defined error handling and exit codes
  • Defined compliance test suite (T-FC-001 through T-FC-055)
  • Added appendices: worked example, algorithm rationale, security considerations

Copyright 2026 GitHub Agentic Workflows Team. All rights reserved.