A/B Experiments Specification

Version: 1.0.0
Status: Draft
Latest Version: experiments-specification
Editors: gh-aw maintainers

Abstract

This specification defines the A/B experiment system for GitHub Agentic Workflows (gh-aw). It covers the experiments: frontmatter schema, variant selection algorithms, state persistence backends, expression and template integration, activation job structure, audit CLI integration, and statistical analysis requirements. Conforming implementations provide operators with a zero-infrastructure mechanism to conduct controlled experiments on agentic workflow behavior using only workflow frontmatter declarations, without any external service dependency.

This document consolidates and supersedes the normative sections of ADR-29534, ADR-29618, ADR-29628, ADR-29985, and ADR-29996. It also incorporates corrective requirements identified during an expert review of the implementation in May 2026.

Status of This Document

This is a Draft specification. It may be updated, replaced, or made obsolete at any time. A future revision will promote this document to Candidate Recommendation once the reference implementation (gh-aw v1.x) satisfies all conformance requirements below.

Promotion from Draft to Candidate Recommendation requires all of the following:

Reference implementation completeness: 100% of normative requirements in §§4–12 are implemented in gh-aw and mapped to concrete implementation files (tracking issue: #31983).
Compliance coverage: At least 95% of normative requirements have automated tests, and all MUST/MUST NOT requirements have at least one passing automated test (tracking issue: #31983).
CI stability window: The experiments-related test suite passes on the default branch for 30 consecutive days with no unresolved regression in variant selection, persistence, or reporting behavior (tracking issue: #31983).
Interoperability evidence: At least two production workflows using experiments: run for a minimum of 500 total assignments each with valid assignment artifacts and reproducible audit output (tracking issue: #31983).
Review sign-off: Written approval from at least two gh-aw maintainers that Sections 10–14 are complete, internally consistent, and suitable for Candidate Recommendation publication (tracking issue: #31983).

Sync

Who reviews: The experiments specification editors (gh-aw maintainers) perform the primary review; one release owner for the current minor version performs final sign-off.
When: Review occurs on the first business day of each month and during every minor-release cut.
What triggers an immediate sync update:
1. Any change to experiments: schema fields or validation behavior (§4)
2. Any change to variant selection, gating, or persistence logic (§§5–7)
3. Any change to audit/reporting output contracts (§§10–11)
4. Any incident postmortem that identifies spec/implementation drift

When a trigger occurs, spec updates SHOULD be merged in the same PR as the implementation change or in a linked follow-up PR within 3 business days.

Feedback should be filed as GitHub issues against the github/gh-aw repository with the experiments label.

Introduction
Conformance
Definitions
Frontmatter Schema
Variant Selection Algorithms
Date-Range Gating
State Persistence
Expression and Template Integration
Activation Job Structure
Audit CLI Integration
Statistical Analysis and Reporting
Simultaneous Experiments and Interaction Effects
Security Considerations
Compliance Testing
References
Appendices
Change Log

1. Introduction

1.1 Purpose

Agentic workflows compiled by gh-aw use a frontmatter-driven configuration model. Teams running these workflows need a first-class mechanism to test different prompt variants (tone, verbosity, persona, feature flags embedded in the prompt) across successive workflow runs. Without such a mechanism, variant testing is ad-hoc, untracked, and statistically unbalanced.

This specification defines a self-contained A/B experiment system that requires no external service, no manual coordination, and no changes outside the workflow frontmatter.

1.2 Scope

This specification covers:

The experiments: frontmatter schema and its two syntactic forms (bare-array, rich-object).
Variant selection algorithms: balanced round-robin (least-used), weighted random, and date-gated fallback.
State persistence backends: git-branch (repo) and GitHub Actions cache (cache).
Expression and Handlebars template integration in the compiled workflow prompt.
The activation job structure generated by the compiler.
The gh aw audit CLI filtering interface for experiment-annotated runs.
Requirements for statistical analysis and reporting workflows that consume experiment artifacts.

This specification does not cover:

The internal compiler architecture beyond what is observable at the compiled YAML boundary.
External analytics dashboards or third-party experiment platforms.
Multi-armed bandit or adaptive allocation algorithms (considered future work).

1.3 Design Goals

Zero external dependencies — all state is stored within the repository or GitHub Actions infrastructure.
Declarative — the complete experiment configuration lives in the workflow frontmatter.
Backward compatible — adding experiments: to an existing workflow MUST NOT break any existing compiled output; removing it MUST restore the original output exactly.
Statistically sound — the default selection algorithm guarantees approximate variant balance in the minimum number of runs.
Observable — every run produces a durable artifact recording the variant assignment, and OTEL attributes propagate assignments to distributed-tracing backends automatically.

2. Conformance

2.1 Requirements Notation

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.

2.2 Conformance Classes

This specification defines three conformance classes:

Class	Requirements
Level 1 — Basic	Satisfies all MUST/MUST NOT requirements in §4, §5, §8, and §9
Level 2 — Standard	Level 1 plus §6 (date gating), §7 (state persistence), §10 (audit CLI)
Level 3 — Complete	Level 2 plus §11 (statistical analysis and reporting) and §12 (simultaneous experiments)

An implementation is considered non-conformant if it fails any MUST or MUST NOT requirement at the level it claims to implement.

2.3 Normative vs. Informative Content

Sections containing numbered requirements (e.g., “R-SCHEMA-001”) are normative. Notes, rationale blocks, and appendices are informative and carry no conformance weight.

3. Definitions

Term	Definition
Experiment	A named A/B test declared in workflow frontmatter, associating an identifier with two or more variant strings.
Variant	A named string value representing one treatment arm in an experiment.
Control variant	The first variant in the declared `variants` array; used as baseline and as fallback during date gating.
Invocation counter	A per-experiment, per-variant integer stored in `state.json` that records the cumulative number of times a variant has been selected.
state.json	The JSON file that stores invocation counters and per-run assignment history for all experiments in a single workflow.
Run record	An entry in the `state.json` `runs` array recording the run ID, timestamp, and variant assignments for one workflow run.
Sanitized workflow ID	The workflow basename (without `.md`) with hyphens removed and lowercased, used as a cache/branch key component.
Activation job	The `activation` GitHub Actions job generated by the compiler that picks variants and exposes them to downstream jobs.
Experiment artifact	A GitHub Actions artifact named `{sanitizedID}-experiment` uploaded by the activation job and containing `state.json` and `assignments.json`.
assignments.json	A file in the experiment artifact containing only the current run’s variant assignments as a flat JSON object.

4. Frontmatter Schema

4.1 Field Declaration

R-SCHEMA-001: Workflow frontmatter MAY include an experiments field. Its absence MUST produce no change in compiled output.

R-SCHEMA-002: The value of experiments MUST be a YAML map. Non-map values MUST be rejected at compile time with a descriptive error.

R-SCHEMA-003: Every key in the experiments map, except the reserved storage key (§7.1), MUST be an experiment name that matches the regular expression ^[a-zA-Z_][a-zA-Z0-9_]*$. Keys that do not match MUST be silently skipped with a compile-time warning emitted to stderr.

Note (informative): The identifier pattern ensures experiment names can be used as GitHub Actions step output names and embedded in ${{ experiments.<name> }} expressions without bracket notation.

4.2 Bare-Array Form

R-SCHEMA-004: Each experiment value MAY be declared as a YAML sequence of two or more strings:

experiments:
  prompt_style: [concise, detailed]

R-SCHEMA-005: A bare-array value with fewer than two entries MUST NOT be accepted; the compiler MUST emit a compile-time error.

4.3 Rich Object Form

R-SCHEMA-006: Each experiment value MAY alternatively be declared as a YAML object with a required variants field and optional metadata fields. The two forms MUST be accepted in the same experiments map without conflict.

R-SCHEMA-007: The variants field MUST be an array of two or more non-empty strings. The same minimum-two-variants constraint from R-SCHEMA-005 applies.

R-SCHEMA-008: The following optional fields are defined for the object form:

Field	Type	Description
`description`	string	Human-readable explanation of what the experiment tests.
`hypothesis`	string	Null and alternative hypothesis statements.
`metric`	string	Primary metric name to observe (e.g., `effective_tokens`).
`secondary_metrics`	string[]	Additional metrics to collect.
`guardrail_metrics`	object[]	Thresholds that must not degrade (see §4.4).
`min_samples`	integer ≥ 1	Minimum runs per variant before analysis is reliable. Defaults to 20.
`weight`	integer[]	Per-variant probability weights (see §5.2).
`issue`	integer ≥ 1	GitHub issue number tracking this experiment.
`start_date`	string (YYYY-MM-DD)	Experiment is inactive before this date (see §6).
`end_date`	string (YYYY-MM-DD)	Experiment is inactive after this date (see §6).
`analysis_type`	string enum	Statistical test for automated reporting (see §11.2).
`tags`	string[]	Free-form labels for dashboard filtering.
`notify`	object	Significance-alert destination (see §4.5).

R-SCHEMA-009: The weight, issue, min_samples, start_date, end_date, analysis_type, tags, and notify fields carry no effect on variant assignment outside their documented subsections. description, hypothesis, metric, secondary_metrics, and tags are purely informative at runtime.

R-SCHEMA-010: Implementations MUST NOT introduce additional properties in the object form without a corresponding schema update; the compiler MUST reject unknown keys under strict mode.

4.4 Guardrail Metrics

R-SCHEMA-011: Each entry in guardrail_metrics MUST be an object with exactly two string fields: name and threshold. The threshold MUST match the pattern ^(>=|<=|==|>|<)-?\d+(\.\d+)?$ (e.g., >=0.95, ==0, <=0.05).

R-SCHEMA-012: Guardrail evaluation is INFORMATIVE at the schema level — the compiler does not enforce guardrails at compile time. Reporting tooling (§11) MUST evaluate each guardrail and include pass/fail status in its output.

4.5 Notify Object

R-SCHEMA-013: The notify object MUST contain only the keys discussion and/or issue, each of which MUST be a positive integer (minimum 1). Unknown keys in notify MUST be rejected by schema validation.

R-SCHEMA-014: When notify.issue is set and the reporting workflow posts a comment to that issue, the compiled workflow MUST declare permissions: issues: write. Implementations that generate reporting workflows MUST automatically add this permission when notify.issue is present in any experiment configuration within the scope of that workflow.

Note (informative): Failure to include issues: write causes comment posting to silently fail with a 403 response. This was identified as a defect in the daily-experiment-report workflow (May 2026 review).

5. Variant Selection Algorithms

5.1 Balanced Round-Robin (Least-Used)

R-SELECT-001: When weight is absent or invalid (§5.2), implementations MUST select the variant with the lowest cumulative invocation count stored in state.json.

R-SELECT-002: When two or more variants share the lowest count — including the initial state where all counts are zero — implementations MUST break ties by selecting uniformly at random from the tied variants. No variant MUST be systematically favoured by position.

R-SELECT-003: After selecting a variant via round-robin, implementations MUST increment the invocation counter for that variant in state.json before persisting state.

Note (informative): Round-robin guarantees that over K×N runs each variant appears approximately N times, achieving balance in far fewer runs than random selection. The random tie-breaking on the first run ensures no variant is systematically advantaged.

5.2 Weighted Random Selection

R-SELECT-004: When weight is provided and its length equals the length of variants, implementations MUST use weighted random selection: each variant is chosen with probability proportional to its weight value.

R-SELECT-005: When all weight values are zero, implementations MUST return the control variant (first entry in variants) without erroring.

R-SELECT-006: Weighted random selection MUST increment the invocation counter for the selected variant before persisting state.

Note (normative correction): ADR-29618 Rule 9 incorrectly stated that weighted selection “MUST NOT increment any variant counter.” This rule is hereby superseded. Counter increments for weighted selection are required to enable min_samples progress tracking and accurate per-run history. The reference implementation (pick_experiment.cjs) already implements this correct behavior by calling recordVariant unconditionally after both selection paths.

R-SELECT-007: When weight is provided but its length does not equal the length of variants, implementations MUST treat weight as absent and fall back to round-robin selection (R-SELECT-001).

Note (statistical, informative): Standard power calculations assume balanced allocations. When weights are non-uniform (e.g., [70, 30]), the effective sample size is reduced. The min_samples target should be interpreted as the minimum required for the smaller group. For a 70/30 split, experimenters should set min_samples to the desired count for the 30% arm and expect the 70% arm to accumulate proportionally more observations.

5.3 Variant Exposure

R-SELECT-008: Implementations MUST expose each selected variant as a named step output steps.pick-experiment.outputs.<experiment-name> and MUST also set a combined JSON step output steps.pick-experiment.outputs.experiments containing all variant assignments as a serialized JSON object.

R-SELECT-009: Experiment names MUST be sorted alphabetically when building the experiments JSON output to produce deterministic, reproducible output across runs with identical state.

6. Date-Range Gating

R-DATE-001: When start_date is provided and the current date (UTC, YYYY-MM-DD format) is strictly before start_date, implementations MUST return the control variant without incrementing any counter.

R-DATE-002: When end_date is provided and the current date (UTC, YYYY-MM-DD format) is strictly after end_date, implementations MUST return the control variant without incrementing any counter.

R-DATE-003: Date comparison MUST use UTC date. Local timezone offsets MUST NOT affect the result.

R-DATE-004: When both start_date and end_date are provided and the current UTC date is within the inclusive range [start_date, end_date], the experiment is active and normal variant selection (§5) applies.

R-DATE-005: If start_date or end_date do not match the YYYY-MM-DD pattern, implementations SHOULD treat them as absent (ignore silently) rather than hard-failing, to preserve forward compatibility.

7. State Persistence

7.1 Storage Configuration

R-STORE-001: The experiments: map MUST support a reserved storage key whose value is one of "repo" (default) or "cache". Any other value MUST produce a compile-time warning and fall back to "repo".

R-STORE-002: When storage is absent, implementations MUST behave as if storage: repo was specified.

R-STORE-003: The storage key MUST NOT be treated as an experiment name; it MUST be excluded from experiment configuration extraction.

7.2 `state.json` Format

R-STORE-004: The state.json file MUST be a valid JSON object with the following top-level structure:

{
  "counts": {
    "<experiment_name>": {
      "<variant>": <integer>
    }
  },
  "runs": [
    {
      "run_id": "<string>",
      "timestamp": "<ISO-8601 UTC string>",
      "assignments": {
        "<experiment_name>": "<variant>"
      }
    }
  ]
}

R-STORE-005: The runs array MUST be pruned to at most 512 entries (keeping the most recent) to prevent unbounded growth.

R-STORE-006: When loading a state.json that has no runs field (legacy format), implementations MUST initialize runs to an empty array and continue normally.

R-STORE-007: When at least one experiment is assigned on a run, implementations MUST append one run record to state.runs before persisting. Each record MUST contain:

run_id: the value of GITHUB_RUN_ID, or "" when absent.
timestamp: an ISO-8601 UTC timestamp of the selection moment.
assignments: an object mapping each assigned experiment name to its selected variant.

R-STORE-008: When no experiments are assigned (e.g., all experiments are outside their date window), implementations MUST NOT append a run record or rewrite state.json.

7.3 `repo` Storage Mode

R-STORE-REPO-001: When storage: repo is active, the activation job MUST load experiment state by fetching state.json from the git branch named experiments/{sanitizedWorkflowID} via the GitHub REST API (GET /repos/{owner}/{repo}/contents/{path}).

R-STORE-REPO-002: A 404 response (branch or file does not exist) MUST be treated as an empty initial state; the activation job MUST NOT fail.

R-STORE-REPO-003: After the activation job completes, a dedicated push_experiments_state job MUST be generated. This job MUST:

Download the experiment artifact from the current run.
Commit the updated state.json and assignments.json to the experiments git branch.
Declare permissions: contents: write.
Be listed as a dependency of the conclusion job to ensure state is persisted before the workflow terminates.

R-STORE-REPO-004: The commit SHOULD be made via the GitHub GraphQL createCommitOnBranch mutation (producing a verified, signed commit). A plain git push MAY be used as a fallback when the GraphQL mutation is unavailable.

R-STORE-REPO-005: The push step SHOULD implement retry logic with exponential backoff (minimum 3 attempts, base delay ≥ 1 second) to handle transient API failures and concurrent push conflicts.

Note (race condition, informative): When two workflow runs start concurrently, both will read the same state.json from the branch before either has committed its update. Both runs will therefore select the same least-used variant. The retry logic in R-STORE-REPO-005 handles write conflicts at push time but does not prevent duplicate variant selections at read time. On low-frequency workflows (daily cron) this is effectively never a problem. On high-frequency workflows (hourly or per-commit), experimenters should account for a small probability of temporarily imbalanced runs. A future revision of this specification MAY address this with an optimistic-concurrency guard at the fetch step.

7.4 `cache` Storage Mode

R-STORE-CACHE-001: When storage: cache is explicitly set, the activation job MUST restore experiment state from GitHub Actions cache using a key of the form experiments-{sanitizedWorkflowID}-{GITHUB_RUN_ID} and a restore-key prefix experiments-{sanitizedWorkflowID}-.

R-STORE-CACHE-002: The activation job MUST save experiment state back to cache after variant selection using if: always().

R-STORE-CACHE-003: When storage: cache is active, no push_experiments_state job SHALL be generated.

R-STORE-CACHE-004: Implementations MUST NOT require contents: write permission when storage: cache is configured.

Note (informative): GitHub Actions cache has a 7-day inactivity eviction policy. State accumulated during an experiment may be silently lost over holidays or between infrequent runs. For this reason repo is the default storage mode. Use cache only when contents: write cannot be granted to the workflow.

7.5 Experiment Artifact

R-STORE-ARTIFACT-001: The activation job MUST upload the experiment state directory as a GitHub Actions artifact named {sanitizedWorkflowID}-experiment (or experiment for workflow_call triggers) with if: always() and a retention period of at least 30 days.

R-STORE-ARTIFACT-002: When assignments.json exists in the state directory, it MUST be included in the artifact alongside state.json.

8. Expression and Template Integration

8.1 Compiler Expression Rewriting

R-EXPR-001: The compiler MUST rewrite every ${{ experiments.<name> }} expression in the frontmatter or prompt source to steps.pick-experiment.outputs.<name> during the expression extraction phase, so the runtime value is injected by the GitHub Actions expression engine.

R-EXPR-002: Each experiment MUST be mapped to an environment variable named GH_AW_EXPERIMENTS_<NAME> (uppercased) that resolves to steps.pick-experiment.outputs.<name>. This environment variable MUST be set in every workflow step that performs prompt interpolation or template substitution.

8.2 Handlebars Template Integration

R-EXPR-003: Implementations MUST substitute __GH_AW_EXPERIMENTS_<NAME>__ placeholders in the raw prompt text before Handlebars template rendering, so that {{#if experiments.<name> == "value" }} conditionals evaluate the actual runtime variant.

R-EXPR-004: Implementations MUST NOT pass raw __GH_AW_EXPERIMENTS_*__ placeholders to the Handlebars rendering engine; all substitutions MUST occur in a prior step.

R-EXPR-005: The isTruthy helper used in Handlebars conditionals MUST treat the string "no" as falsy, in addition to the standard falsy values "", "false", "0", undefined, and null. This enables yes/no flag experiments where {{#if experiments.feature }} evaluates to false when the no variant is active.

Note (informative): The "no" falsy behavior is a deliberate design choice that enables simple boolean-flag experiments (feature: [yes, no]). It differs from standard JavaScript truthiness and should be clearly documented for contributors.

9. Activation Job Structure

R-JOB-001: When the experiments field is present in the frontmatter, the compiled activation job MUST include the experiment steps defined in §9.1 or §9.2 as appropriate.

R-JOB-002: Implementations MUST NOT inject experiment steps into workflows that do not declare the experiments frontmatter field.

R-JOB-003: The activation job MUST expose a needs.activation.outputs.experiments output containing the full JSON variant assignment object so that downstream jobs can reference it via needs.activation.outputs.experiments.

9.1 `cache` Storage Step Order

When storage: cache, the activation job MUST include the following steps in order:

Restore experiment state — actions/cache/restore with the workflow-specific key.
Pick experiment variants — pick_experiment.cjs via actions/github-script.
Save experiment state — actions/cache/save with if: always().
Upload experiment artifact — actions/upload-artifact with if: always().

9.2 `repo` Storage Step Order

When storage: repo (default), the activation job MUST include the following steps in order:

Restore experiment state from git — load_experiment_state_from_repo.cjs via actions/github-script.
Pick experiment variants — pick_experiment.cjs via actions/github-script.
Upload experiment artifact — actions/upload-artifact with if: always().

A separate push_experiments_state job (R-STORE-REPO-003) commits the updated state after the activation job completes.

9.3 OTEL Resource Attributes

R-JOB-004: After variant selection, when at least one experiment is assigned, pick_experiment.cjs MUST call core.exportVariable("OTEL_RESOURCE_ATTRIBUTES", …) with key-value pairs of the form experiment.<name>=<variant>, comma-separated when multiple experiments are active.

R-JOB-005: When OTEL_RESOURCE_ATTRIBUTES is already set, implementations MUST append the experiment attributes to the existing value with a comma separator rather than overwriting it.

R-JOB-006: When no experiments are assigned, implementations MUST NOT modify OTEL_RESOURCE_ATTRIBUTES.

10. Audit CLI Integration

10.1 Filter Flags

R-AUDIT-001: The gh aw audit command MUST accept an --experiment <name> flag that filters runs to those with a variant assignment for the named experiment.

R-AUDIT-002: The gh aw audit command MUST accept a --variant <value> flag that, when combined with --experiment, further restricts results to runs assigned that exact variant value.

R-AUDIT-003: --variant used without --experiment MUST cause a non-zero exit code with an error message that includes a suggestion to add --experiment.

R-AUDIT-004: When a run is skipped by the filter, an informational message MUST be emitted to stderr identifying the run ID, the experiment name, and (when applicable) the required variant.

10.2 Run Overview Display

R-AUDIT-005: The run Overview section MUST include an Experiment field when the run’s experiment artifact contains one or more assignments.

R-AUDIT-006: The experiment label MUST be formatted as a comma-separated, alphabetically sorted list of name=variant pairs (e.g., caveman=yes, style=concise).

R-AUDIT-007: The Experiment field MUST be omitted from console and JSON output when no experiment assignments are present (omitempty semantics).

10.3 Per-Run Assignment Lookup

R-AUDIT-008: When state.runs is non-empty and the last record’s assignments map is non-empty, the audit reporter MUST use that record’s assignments directly as the current-run experiment data.

R-AUDIT-009: When state.runs is empty, absent, or the last record’s assignments map is empty, the audit reporter MUST fall back to the max-count heuristic: the variant with the highest cumulative count is assumed to have been selected on the most recent run; ties are broken by sorted variant order.

10.4 Filter Application

R-AUDIT-010: Implementations MUST apply the experiment/variant filter before calling any report-rendering code. A filtered-out run MUST return nil, not an error.

R-AUDIT-011: Implementations MUST apply the filter in both the cached-summary path and the fresh-processing path for consistent behavior.

R-AUDIT-012: Implementations SHOULD extract experiment data at most once per AuditWorkflowRun invocation to avoid redundant artifact reads.

R-AUDIT-013: When neither --experiment nor --variant is set, implementations MUST NOT read the experiment artifact solely for filtering purposes.

11. Statistical Analysis and Reporting

This section applies to the Level 3 — Complete conformance class (§2.2) and to any automated workflow that reports on experiment outcomes.

11.1 Per-Run Assignment Source

R-STAT-001: Reporting tools that consume state.json files MUST derive per-run variant assignments from the state.runs array when it is present and non-empty.

R-STAT-002: Reporting tools MUST NOT use the cumulative-count delta inference method (comparing consecutive snapshots) as the primary assignment source when state.runs is available. The delta method MAY be used as a fallback for legacy state files with no runs array.

Note (informative): The delta method is fragile — it fails when multiple runs complete between downloaded snapshots, when runs are cancelled before the experiment step, or when state.json is fetched from different points in the artifact history. The runs array, introduced in v1.1.0 (ADR-29985), provides exact, auditable per-run assignment records.

11.2 Statistical Tests

R-STAT-003: When analysis_type is declared for an experiment, reporting tools SHOULD use the specified test for significance analysis:

`analysis_type` value	Test to apply
`t_test`	Welch’s two-sample t-test (does not assume equal variance)
`mann_whitney`	Mann-Whitney U non-parametric rank test
`proportion_test`	Two-proportion z-test
`bayesian_ab`	Bayesian A/B analysis (posterior probability of superiority)

R-STAT-004: When analysis_type is absent, reporting tools SHOULD default to the two-proportion z-test for binary outcomes (success/failure) and Welch’s t-test for continuous metrics (e.g., duration).

11.3 Multiple Comparison Correction

R-STAT-005: When an experiment declares K ≥ 3 variants and reporting tools perform pairwise comparisons against the control, the significance threshold SHOULD be adjusted using the Bonferroni correction: α_adjusted = 0.05 / (K − 1).

Note (informative): Without correction, the probability of at least one false positive across K−1 pairwise tests at α = 0.05 is approximately 1 − (1 − 0.05)^(K−1). For K = 3 this is ~9.75%; for K = 5 it exceeds 18%. The Bonferroni correction is conservative but simple. The Holm-Bonferroni step-down procedure is a less conservative alternative.

R-STAT-006: When a multiple-comparison correction is applied, reporting tools MUST state the correction method and the adjusted α threshold in the report output.

11.4 Minimum Sample Size Gate

R-STAT-007: Reporting tools MUST NOT issue a PROMOTE recommendation for any variant until all variants in the experiment have accumulated at least min_samples runs (or 20 if min_samples is not declared). When any variant is below threshold, the recommendation MUST be EXTEND.

R-STAT-008: When weights are non-uniform (§5.2), the min_samples target applies to the smallest expected group. For a weight: [70, 30] experiment with min_samples: 30, the control arm is not eligible for analysis until the 30% arm has at least 30 observations, even if the 70% arm has accumulated many more.

11.5 Guardrail Evaluation

R-STAT-009: Reporting tools that evaluate guardrail_metrics MUST emit a GUARDRAIL_FAILED status for any variant that violates a declared threshold, and MUST override the recommendation to ABANDON regardless of the primary-metric p-value.

R-STAT-010: Multi-variant experiments MUST show guardrail pass/fail status per variant, not aggregated across the experiment.

11.6 Reporting Workflow Permissions

R-STAT-011: Any automated workflow that posts comments to issues (e.g., via notify.issue or step-based issue comment creation) MUST declare permissions: issues: write in its frontmatter.

R-STAT-012: Any automated workflow that posts discussions MUST declare permissions: discussions: write.

12. Simultaneous Experiments and Interaction Effects

R-MULTI-001: Each experiment in the experiments map MUST be assigned independently. The selection algorithm for one experiment MUST NOT depend on the selected variant of any other experiment.

R-MULTI-002: Implementations SHOULD NOT run more than three experiments simultaneously in a single workflow. When more than three experiments are active, a compile-time warning SHOULD be emitted.

Note (statistical, informative): When two or more experiments are active simultaneously, observed differences in outcome metrics can be caused by either experiment individually or by their interaction (i.e., a specific combination of variant values). This violation of the Stable Unit Treatment Value Assumption (SUTVA) inflates the risk of misattribution. For example, if prompt_style=concise and emoji_density=heavy are both active, it is impossible to determine from pairwise analysis alone whether a change in output quality was caused by verbosity, emoji use, or the combination. Experimenters who need to measure interactions MUST use a full factorial design and ensure sufficient sample size for all K₁ × K₂ × … cell combinations.

R-MULTI-003: Reporting tools MUST note in their output when multiple experiments were simultaneously active on runs included in the analysis window, to alert reviewers to potential confounding.

R-MULTI-004: Experiments that change the engine: frontmatter key MUST NOT be implemented within a single workflow file. Engine-switching experiments MUST use separate compiled workflow files (one per variant), which can then be compared via their respective GitHub Actions run metrics.

R-MULTI-005: When two or more experiments are simultaneously active in the same analysis window, reporting tools MUST detect and bound interaction risk by preserving the full assignment vector per run and evaluating whether each observed combination cell has sufficient sample coverage. If interaction effects cannot be bounded (for example, sparse cells below min_samples), the report MUST emit an explicit interaction-risk status and MUST NOT recommend PROMOTE for affected variants.

12.1 Conflict Resolution Norms

A conflict occurs when two or more simultaneously active experiments would assign incompatible configurations to the same workflow run. This subsection defines normative behavior for each storage mode.

R-CONFLICT-001 (general): When two experiments assign variants that together produce a logically invalid workflow configuration (e.g., two engine: variants via separate experiment keys), the compiler MUST reject the workflow at compile time with a descriptive error. Runtime conflict detection is NOT a substitute for compile-time validation.

12.1.1 Conflict Resolution for `repo` Storage Mode

R-CONFLICT-REPO-001: Under repo storage, each experiment’s variant selection reads and writes an independent key in state.json. There is no shared mutable state between experiments at the selection layer. Variant assignments for experiment A MUST NOT block or override variant assignments for experiment B, even when both experiments are active on the same run.

R-CONFLICT-REPO-002: When a concurrent write conflict is detected at push time (e.g., a non-fast-forward rejection from the GitHub API), the push step MUST retry with the merged state from both runs. The retry MUST NOT discard either run’s assignment record.

R-CONFLICT-REPO-003: If two concurrent runs select the same least-used variant for the same experiment (a read-time race), both selections are considered valid. The run records MUST reflect each run’s independently selected variant. No conflict error is raised for this condition.

12.1.2 Conflict Resolution for `cache` Storage Mode

R-CONFLICT-CACHE-001: Under cache storage, GitHub Actions cache is eventually consistent across concurrent runs. When two runs attempt to save conflicting cache entries under the same key, GitHub Actions will store one entry and silently drop the other. Implementations MUST treat this as an acceptable data loss (see §7.4 informative note on cache eviction) and MUST NOT treat a missing cache restore as an error condition.

R-CONFLICT-CACHE-002: Because cache storage does not provide atomic read-modify-write semantics, implementations using cache mode MUST document to users that high-concurrency workflows may experience elevated variant imbalance compared to repo mode.

12.1.3 Conflict Resolution for Mixed Storage Mode

R-CONFLICT-MIX-001: All experiments within a single workflow MUST share the same storage mode. Mixed-mode configurations (some experiments in repo, others in cache) are NOT SUPPORTED and MUST produce a compile-time error.

R-CONFLICT-MIX-002: This restriction exists because the storage key is a single top-level field in the experiments map that applies uniformly to all experiments in that map. Workflow authors who require different storage modes for different experiments MUST split them into separate workflow files.

13. Security Considerations

13.1 State File Integrity

The experiment state is stored in a git branch (repo mode) or GitHub Actions cache (cache mode). Both backends are protected by repository access controls. However:

Any user with write access to the repository can modify state.json on the experiments branch, potentially manipulating variant counters or forging run records.
Implementers that require tamper-evident state SHOULD use signed commits via the GitHub GraphQL createCommitOnBranch mutation (R-STORE-REPO-004).

13.2 Prompt Injection via Variant Values

Variant strings declared in frontmatter are static strings set by the workflow author. They are not derived from user-supplied input and therefore do not introduce prompt injection risk at the frontmatter level. Workflow authors MUST NOT use runtime user input (e.g., issue titles, PR bodies) as variant values.

13.3 OTEL Attribute Leakage

Experiment assignments exported as OTEL resource attributes (§9.3) may be visible in distributed-tracing backends. Variant names and experiment names SHOULD NOT embed sensitive information.

13.4 Permission Minimization

The repo storage mode requires contents: write. Workflows SHOULD limit all other permissions to read to minimize the blast radius of a compromised token.
Reporting workflows that post comments require issues: write or discussions: write (§11.6). These permissions SHOULD be granted only to the specific reporting workflow, not to the experiment-running workflow itself.

14. Compliance Testing

14.1 Test Suite Requirements

Conformance at each level is verified by the following test categories.

14.1.1 Schema Tests (Level 1)

Test ID	Requirement	Description
T-SCHEMA-001	R-SCHEMA-005	Reject bare-array with fewer than 2 variants
T-SCHEMA-002	R-SCHEMA-003	Skip and warn on invalid experiment name
T-SCHEMA-003	R-SCHEMA-007	Reject object form with `variants` containing < 2 entries
T-SCHEMA-004	R-SCHEMA-011	Reject guardrail with invalid threshold pattern
T-SCHEMA-005	R-SCHEMA-013	Reject `notify` object with unknown keys
T-SCHEMA-006	R-SCHEMA-001	Compile workflow without `experiments:` field — output unchanged

14.1.2 Variant Selection Tests (Level 1)

Test ID	Requirement	Description
T-SELECT-001	R-SELECT-001	Round-robin: select variant with lowest count
T-SELECT-002	R-SELECT-002	Round-robin: random tie-breaking on first run
T-SELECT-003	R-SELECT-003	Round-robin: counter incremented after selection
T-SELECT-004	R-SELECT-004	Weighted: selection probability proportional to weights
T-SELECT-005	R-SELECT-006	Weighted: counter incremented after selection
T-SELECT-006	R-SELECT-005	Weighted: all-zero weights return control variant
T-SELECT-007	R-SELECT-007	Weighted: mismatched length falls back to round-robin

14.1.3 Expression Integration Tests (Level 1)

Test ID	Requirement	Description
T-EXPR-001	R-EXPR-001	`${{ experiments.x }}` rewritten to step output reference
T-EXPR-002	R-EXPR-003	Placeholder substituted before Handlebars rendering
T-EXPR-003	R-EXPR-005	`"no"` treated as falsy in `isTruthy`
T-EXPR-004	R-EXPR-004	Raw placeholder not passed to Handlebars engine

14.1.4 State Persistence Tests (Level 2)

Test ID	Requirement	Description
T-STORE-001	R-STORE-REPO-002	Empty state on first run (404 branch)
T-STORE-002	R-STORE-004	Valid `state.json` structure written after run
T-STORE-003	R-STORE-007	Run record appended with correct fields
T-STORE-004	R-STORE-005	`runs` pruned to ≤ 512 entries
T-STORE-005	R-STORE-006	Legacy state (no `runs` field) initialized to empty array
T-STORE-006	R-STORE-CACHE-004	No `contents: write` required for cache mode

14.1.5 Audit CLI Tests (Level 2)

Test ID	Requirement	Description
T-AUDIT-001	R-AUDIT-003	`--variant` without `--experiment` returns non-zero exit
T-AUDIT-002	R-AUDIT-008	Assignment read from `state.runs` when available
T-AUDIT-003	R-AUDIT-009	Fallback to max-count heuristic for legacy state
T-AUDIT-004	R-AUDIT-005	Overview `Experiment` field present when assignments exist
T-AUDIT-005	R-AUDIT-007	`Experiment` field omitted when no assignments

14.1.6 Statistical Reporting Tests (Level 3)

Test ID	Requirement	Description
T-STAT-001	R-STAT-001	Assignments derived from `state.runs`, not delta inference
T-STAT-002	R-STAT-005	Bonferroni correction applied for K ≥ 3 variants
T-STAT-003	R-STAT-007	PROMOTE withheld until all variants reach `min_samples`
T-STAT-004	R-STAT-009	GUARDRAIL_FAILED forces ABANDON recommendation
T-STAT-005	R-STAT-011	Reporting workflow declares `issues: write`

14.2 Compliance Checklist

Requirement	Test ID	Level	Status
R-SCHEMA-001	T-SCHEMA-006	1	Required
R-SCHEMA-005	T-SCHEMA-001	1	Required
R-SELECT-001	T-SELECT-001	1	Required
R-SELECT-002	T-SELECT-002	1	Required
R-SELECT-003	T-SELECT-003	1	Required
R-SELECT-006	T-SELECT-005	1	Required
R-EXPR-001	T-EXPR-001	1	Required
R-EXPR-005	T-EXPR-003	1	Required
R-STORE-002	—	2	Required
R-STORE-REPO-002	T-STORE-001	2	Required
R-STORE-007	T-STORE-003	2	Required
R-AUDIT-003	T-AUDIT-001	2	Required
R-AUDIT-008	T-AUDIT-002	2	Required
R-STAT-001	T-STAT-001	3	Required
R-STAT-005	T-STAT-002	3	Recommended
R-STAT-007	T-STAT-003	3	Required
R-STAT-011	T-STAT-005	3	Required

15. References

Normative References

[RFC 2119] Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels”, RFC 2119, March 1997. https://www.ietf.org/rfc/rfc2119.txt
[ADR-29534] gh-aw maintainers, “Frontmatter A/B Experiments with Balanced Variant Selection”, 2026-05-01. docs/adr/29534-frontmatter-ab-experiments-variant-selection.md
[ADR-29618] gh-aw maintainers, “Rich Experiment Metadata Schema Extension with Weighted Selection and Date Gating”, 2026-05-01. docs/adr/29618-rich-experiment-metadata-schema-extension.md (normative sections superseded by §5.2 of this document)
[ADR-29628] gh-aw maintainers, “Add --experiment and --variant Filter Flags to gh aw audit”, 2026-05-01. docs/adr/29628-experiment-variant-filter-flags-for-audit.md
[ADR-29985] gh-aw maintainers, “Experiment Per-Run State, OTEL Integration, and Schema Extensions”, 2026-05-03. docs/adr/29985-experiment-per-run-state-otel-integration-and-schema-extensions.md
[ADR-29996] gh-aw maintainers, “Experiment State Storage — Git Branch as Default, Cache as Fallback”, 2026-05-03. docs/adr/29996-experiment-state-git-branch-storage.md

Informative References

[SUTVA] Rubin, D. B., “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies”, Journal of Educational Psychology, 66(5):688–701, 1974. (Stable Unit Treatment Value Assumption)
[BONFERRONI] Dunn, O. J., “Multiple Comparisons Among Means”, Journal of the American Statistical Association, 56(293):52–64, 1961.
[WELCH-TTEST] Welch, B. L., “The Generalization of Student’s Problem When Several Different Population Variances are Involved”, Biometrika, 34(1/2):28–35, 1947.
[GitHub Actions Cache] GitHub Docs, “Caching dependencies to speed up workflows”. https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows

Appendices

Appendix A: Full Object-Form Example

---
on:
  schedule: daily on weekdays
engine: copilot
permissions:
  contents: read
  pull-requests: read

experiments:
  storage: repo
  prompt_style:
    variants: [concise, detailed, step_by_step]
    description: "Test whether verbosity level affects output quality"
    hypothesis: "H0: no change in effective_tokens. H1: concise reduces by >=15%"
    metric: effective_tokens
    secondary_metrics: [duration_ms, discussion_word_count]
    guardrail_metrics:
      - name: success_rate
        threshold: ">=0.95"
      - name: empty_output_rate
        threshold: "==0"
    weight: [40, 40, 20]
    min_samples: 30
    start_date: "2026-05-01"
    end_date: "2026-08-01"
    issue: 1234
    analysis_type: t_test
    tags: [cost, prompting, verbosity]
    notify:
      issue: 1234
---

Summarize the pull requests merged today.

{{#if experiments.prompt_style == "concise" }}
Write a maximum of 5 bullet points.
{{#else if experiments.prompt_style == "detailed" }}
Write a structured report with sections for new features, bug fixes, refactors, and docs.
{{#else}}
Write a numbered step-by-step walkthrough of each change with rationale.
{{#endif}}

Appendix A2: Weighted Variant Selection — Worked Example

This appendix walks through the probability math for a three-variant weighted experiment to illustrate how the weight array maps to selection probability, how counters are updated, and how balance is maintained over many runs.

A2.1 Scenario Setup

An experiment named response_tone has three variants with non-uniform weights:

experiments:
  storage: repo
  response_tone:
    variants: [formal, casual, neutral]
    weight: [20, 50, 30]

The weight values are relative proportions, not absolute percentages. The implementation normalizes them to compute probabilities:

total_weight = 20 + 50 + 30 = 100

P(formal)  = 20 / 100 = 0.20  (20%)
P(casual)  = 50 / 100 = 0.50  (50%)
P(neutral) = 30 / 100 = 0.30  (30%)

For a 10-run experiment sequence, the expected variant distribution is:

Variant	Weight	Expected runs (of 10)
formal	20	2
casual	50	5
neutral	30	3

A2.2 Selection Algorithm (Weighted Random)

The weighted algorithm draws a uniform random number r ∈ [0, 1) and maps it to a variant via cumulative weight:

Cumulative ranges:
  [0.00, 0.20)  →  formal
  [0.20, 0.70)  →  casual
  [0.70, 1.00)  →  neutral

Example draws:

r	Selected variant
0.11	formal
0.45	casual
0.72	neutral
0.19	formal
0.68	casual

A2.3 Counter Updates

After each run, the counter for the selected variant is incremented in state.json. After 10 runs with the distribution above, a typical counts object is:

{
  "counts": {
    "response_tone": {
      "formal":  2,
      "casual":  5,
      "neutral": 3
    }
  }
}

Per R-SELECT-006, the weighted algorithm MUST increment invocation counters after every selection. This allows the audit CLI and reporting workflows to verify that observed variant frequencies approximate the declared weights over time.

A2.4 Long-Run Balance Verification

Over N runs, the observed frequency for variant v should converge to weight[v] / total_weight by the Law of Large Numbers. Reporting workflows SHOULD flag experiments where any variant’s observed frequency deviates from its target weight by more than ±10 percentage points over at least 30 runs, as this may indicate a misconfigured weight array or a bug in the selection implementation.

For the example above, after 100 runs:

Variant	Expected runs	Acceptable range (±10 pp)
formal	20	10 – 30
casual	50	40 – 60
neutral	30	20 – 40

A2.5 Contrast with Balanced Round-Robin

The balanced (least-used) algorithm ignores weights and selects the least-run variant deterministically. Use weighted when you intentionally want unequal traffic allocation (e.g., to expose fewer users to an experimental variant while still gathering comparative data). Use balanced when you want equal allocation and maximum statistical efficiency per total run count.

Appendix B: `state.json` Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["counts"],
  "properties": {
    "counts": {
      "type": "object",
      "additionalProperties": {
        "type": "object",
        "additionalProperties": { "type": "integer", "minimum": 0 }
      }
    },
    "runs": {
      "type": "array",
      "maxItems": 512,
      "items": {
        "type": "object",
        "required": ["run_id", "timestamp", "assignments"],
        "properties": {
          "run_id": { "type": "string" },
          "timestamp": { "type": "string", "format": "date-time" },
          "assignments": {
            "type": "object",
            "additionalProperties": { "type": "string" }
          }
        }
      }
    }
  }
}

Appendix C: Sample Size Reference

For a two-proportion test with 80% statistical power and α = 0.05 (two-tailed), the approximate minimum runs per variant are:

Minimum Detectable Effect (pp)	Runs per variant
5	~620
10	~160
15	~70
20	~40
30	~20

Note for weighted experiments: When weight is non-uniform, apply these figures to the smaller group. For a 70/30 split aiming to detect a 10 pp effect, you need ~160 runs in the 30% arm (≈ 533 total runs).

Appendix D: Known Limitations

Read-time race condition: Concurrent runs with repo storage may read stale state and select the same variant. See R-STORE-REPO-005 and the informative note in §7.3.
Interaction effects: Running multiple experiments simultaneously can produce unattributable results. See §12 and R-MULTI-002.
Engine-switching experiments: Changing the engine: key requires separate workflow files; see R-MULTI-004.
analysis_type advisory only: Reporting workflows that do not implement all four statistical tests will fall back to defaults. The field documents intent; it does not enforce a specific computation path.
State branch growth: The experiments git branch grows monotonically. Operators MAY prune old commits from the experiments branch without affecting the current state.

Sync Follow-ups (May 2026 Expert Review)

This appendix itemizes corrective follow-ups referenced in the abstract.

FR-001 (implemented via R-SELECT-006): Weighted selection increments invocation counters after each selection.
FR-002 (implemented via R-STAT-001/R-STAT-002): Reporting uses state.runs assignment records instead of count-delta inference.
FR-003 (implemented via R-STAT-011/R-STAT-012): Reporting workflows that write issues/discussions declare explicit write permissions.
FR-004 (implemented via R-MULTI-005): Concurrent-experiment interaction effects are explicitly detected and bounded before promotion decisions.
FR-005 (implemented via daily-experiment-report workflow update): Reporting guidance now includes a factorial-interaction helper for K₁×K₂ cell-level significance output and sparse-cell risk surfacing.
FR-006 (implemented via compiler diagnostics): The compiler now emits a warning when more than one experiment is active and weighted traffic is configured, indicating potential sparse interaction cells.

Change Log

Version 1.1.0 (Draft) — 2026-05-12

Added: Daily reporting helper guidance for factorial K₁×K₂ interaction cell significance output.
Added: Compiler warning requirement for sparse interaction-cell risk when multiple experiments and weighted traffic are configured.
Updated: Sync Follow-ups appendix to replace v1.1.0 TODOs with implemented corrective items.

Version 1.0.1 (Draft) — 2026-05-07

Added: R-MULTI-005 requiring interaction-risk detection/bounding for simultaneous experiments.
Added: Sync Follow-ups appendix with itemized May 2026 expert-review corrective items and owned TODOs.

Version 1.0.0 (Draft) — 2026-05-03

Initial publication consolidating ADR-29534, ADR-29618, ADR-29628, ADR-29985, and ADR-29996.
Correction: R-SELECT-006 supersedes ADR-29618 Rule 9 — weighted selection MUST increment invocation counters (was incorrectly stated as MUST NOT; the reference implementation already implements the correct behavior).
Added: R-STAT-001/R-STAT-002 — reporting tools MUST use state.runs for per-run assignment lookup, not the fragile delta-count inference method.
Added: R-STAT-005/R-STAT-006 — Bonferroni correction SHOULD be applied for K ≥ 3 variants to control family-wise error rate.
Added: R-STAT-008 — min_samples applies to the smallest expected group when weights are non-uniform.
Added: R-STAT-011/R-STAT-012 — reporting workflows MUST declare issues: write / discussions: write when posting comments.
Added: R-MULTI-002/R-MULTI-003 — warning for > 3 simultaneous experiments; interaction effects must be noted in reports.
Added: §13 Security Considerations — state integrity, prompt injection, OTEL leakage, permission minimization.
Added: Appendix C (sample size reference) and Appendix D (known limitations).
Informative note: storage: cache default changed to storage: repo in ADR-29996; any documentation or issue templates that still refer to “cache-based” assignment should be updated.

A/B Experiments Specification

A/B Experiments Specification

Abstract

Status of This Document

Sync

Table of Contents

1. Introduction

1.1 Purpose

1.2 Scope

1.3 Design Goals

2. Conformance

2.1 Requirements Notation

2.2 Conformance Classes

2.3 Normative vs. Informative Content

3. Definitions

4. Frontmatter Schema

4.1 Field Declaration

4.2 Bare-Array Form

4.3 Rich Object Form

4.4 Guardrail Metrics

4.5 Notify Object

5. Variant Selection Algorithms

5.1 Balanced Round-Robin (Least-Used)

5.2 Weighted Random Selection

5.3 Variant Exposure

6. Date-Range Gating

7. State Persistence

7.1 Storage Configuration

7.2 state.json Format

7.3 repo Storage Mode

7.4 cache Storage Mode

7.5 Experiment Artifact

8. Expression and Template Integration

8.1 Compiler Expression Rewriting

8.2 Handlebars Template Integration

9. Activation Job Structure

9.1 cache Storage Step Order

9.2 repo Storage Step Order

9.3 OTEL Resource Attributes

10. Audit CLI Integration

10.1 Filter Flags

10.2 Run Overview Display

10.3 Per-Run Assignment Lookup

10.4 Filter Application

11. Statistical Analysis and Reporting

11.1 Per-Run Assignment Source

11.2 Statistical Tests

11.3 Multiple Comparison Correction

11.4 Minimum Sample Size Gate

11.5 Guardrail Evaluation

11.6 Reporting Workflow Permissions

12. Simultaneous Experiments and Interaction Effects

12.1 Conflict Resolution Norms

12.1.1 Conflict Resolution for repo Storage Mode

12.1.2 Conflict Resolution for cache Storage Mode

12.1.3 Conflict Resolution for Mixed Storage Mode

13. Security Considerations

13.1 State File Integrity

13.2 Prompt Injection via Variant Values

13.3 OTEL Attribute Leakage

13.4 Permission Minimization

14. Compliance Testing

14.1 Test Suite Requirements

14.1.1 Schema Tests (Level 1)

14.1.2 Variant Selection Tests (Level 1)

14.1.3 Expression Integration Tests (Level 1)

14.1.4 State Persistence Tests (Level 2)

14.1.5 Audit CLI Tests (Level 2)

14.1.6 Statistical Reporting Tests (Level 3)

14.2 Compliance Checklist

15. References

Normative References

Informative References

Appendices

Appendix A: Full Object-Form Example

Appendix A2: Weighted Variant Selection — Worked Example

A2.1 Scenario Setup

A2.2 Selection Algorithm (Weighted Random)

A2.3 Counter Updates

A2.4 Long-Run Balance Verification

7.2 `state.json` Format

7.3 `repo` Storage Mode

7.4 `cache` Storage Mode

9.1 `cache` Storage Step Order

9.2 `repo` Storage Step Order

12.1.1 Conflict Resolution for `repo` Storage Mode

12.1.2 Conflict Resolution for `cache` Storage Mode

Appendix B: `state.json` Schema