GitHub Agentic Workflows

Blog

Agent of the Day – June 1, 2026

Architectural drift is quiet and cumulative. A file grows past 600 lines. A function absorbs one more responsibility. An import cycle sneaks in between two packages that “just need to share a little logic.” None of it trips a CI gate, no test turns red, and six months later a new engineer opens that directory and wonders how it got this bad. The Architecture Guardian workflow exists precisely to interrupt that pattern before it becomes load-bearing.

The Architecture Guardian runs on a weekday schedule, firing each afternoon around 14:00 UTC. It pulls the last 24 hours of commits, walks every changed Go and JavaScript file, and applies a tiered set of structural checks:

  • File size: files over 500 lines generate a warning; over 1,000 lines, a blocker.
  • Function length: any function exceeding 80 lines is flagged.
  • Export count: more than 10 exports from a single file draws scrutiny.
  • Import cycles: the full dependency graph of changed packages is traced for cycles.
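
As a rough illustration of how the first three tiers could be expressed, here is a minimal Python sketch. The thresholds come from the list above; the FileStats type, the check_file function, and the "warning"/"info" severities for function length and export count are assumptions made for this example, not the workflow's actual implementation (which lives in a natural-language workflow spec, not code).

```python
from dataclasses import dataclass

# Thresholds taken from the list above; the constant names are illustrative.
FILE_WARN_LINES = 500
FILE_BLOCK_LINES = 1000
MAX_FUNCTION_LINES = 80
MAX_EXPORTS = 10


@dataclass
class FileStats:
    """Hypothetical per-file measurements gathered for one changed file."""
    path: str
    line_count: int
    function_lengths: dict  # function name -> length in lines
    export_count: int


def check_file(stats: FileStats) -> list:
    """Return (severity, message) findings for one changed file."""
    findings = []
    if stats.line_count > FILE_BLOCK_LINES:
        findings.append(("blocker", f"{stats.path}: {stats.line_count} lines"))
    elif stats.line_count > FILE_WARN_LINES:
        findings.append(("warning", f"{stats.path}: {stats.line_count} lines"))
    for name, length in stats.function_lengths.items():
        if length > MAX_FUNCTION_LINES:
            # Severity is an assumption: the source only says "flagged".
            findings.append(("warning", f"{stats.path}: {name} is {length} lines"))
    if stats.export_count > MAX_EXPORTS:
        # Treated as informational because the source calls it a softer signal.
        findings.append(("info", f"{stats.path}: {stats.export_count} exports"))
    return findings
```

A file at exactly 500 lines produces no finding in this sketch, since the text says "over 500".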

When violations surface, the workflow doesn’t just log and move on. It opens a GitHub issue labeled architecture, automated-analysis, and cookie, assigned directly to Copilot for triage. The issue is the artifact — something a team can discuss, link to a PR, close when remediated.

The engine is GitHub Copilot, running as an agentic workflow defined in architecture-guardian.md. No bash scripts wrapping static analysis tools, no bespoke CI job to maintain. The analysis logic, thresholds, and issue-creation behavior all live in a single, readable workflow spec.

Run 26766995181 completed on June 1, 2026 at 16:18 UTC, five minutes and forty seconds after it started. The agent worked through three turns with claude-sonnet-4.6 via GitHub Copilot, made 10 GitHub API calls, and consumed 125,356 tokens. Once prompt caching is included, the effective token count is 1,206,982, nearly ten times that figure. Caching is doing real work here.

The verdict: no violations. Every changed file over the past 24 hours fell within the configured thresholds. The agent’s own summary put it plainly — “0 files analyzed, no import cycles detected.” Nothing to open, nothing to assign.

That outcome is worth pausing on. A clean run isn’t a null result; it’s confirmation. The codebase was touched, the guardian looked, and the boundaries held. Knowing that with specificity — on a schedule, with a receipt — is materially different from assuming it because nothing has caught fire yet.

The 500-line warning and 1,000-line blocker aren’t arbitrary. Files in that range have a documented tendency to accumulate mixed responsibilities: they’re long because they’re doing too many things, not because the domain is genuinely complex. The 80-line function limit enforces a similar discipline. It’s not a style preference; it’s a forcing function for decomposition.

Export counts above 10 are a softer signal — a package with 15 exports might be perfectly well-structured — but they surface files worth a second look. Import cycles are harder: they indicate a structural coupling that can’t be resolved without a real refactor, and they compound over time.
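
Cycle detection itself is a standard graph problem. The sketch below uses a depth-first search with three node states; it assumes the dependency graph has already been extracted into a plain dictionary mapping each package to the packages it imports. The find_cycle name and the dictionary input are illustrative, and nothing here is taken from the workflow's own tooling.

```python
from typing import Optional


def find_cycle(graph: dict) -> Optional[list]:
    """Return one import cycle as a list of packages, or None if there is none."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {pkg: WHITE for pkg in graph}
    path = []

    def visit(pkg: str) -> Optional[list]:
        color[pkg] = GRAY
        path.append(pkg)
        for dep in graph.get(pkg, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so the path loops back to it.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[pkg] = BLACK
        return None

    for pkg in list(graph):
        if color[pkg] == WHITE:
            cycle = visit(pkg)
            if cycle:
                return cycle
    return None
```

Returning the offending path, rather than a bare yes or no, matters for the issue that gets filed: the reader needs to see which packages form the loop before deciding where to cut it.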

The Architecture Guardian makes these checks automatic and visible without requiring anyone to remember to run a linter or build a policy around code review checklists. The standards are encoded in the workflow. The workflow runs whether or not anyone’s thinking about it.

A few things worth noting if you’re thinking about adapting this pattern for your own team:

Scheduling matters. A weekday check around 14:00 UTC catches most violations before they’re a day old. Violations that linger for a week become rationalizations.

Issue creation is the accountability loop. Logging a warning to stdout is easy to ignore. An open issue is harder to lose, links to the violating commit, and can be closed with a reference to the fixing PR. That chain is the point.

Clean runs are data. The June 1 run found nothing. That’s not a failure of the workflow — it’s the workflow confirming steady-state health. Over time, a history of clean runs punctuated by occasional issues tells you something real about your team’s structural discipline.

Token efficiency scales. 1.2 million effective tokens for a daily architectural scan, amortized across a codebase’s active lifetime, is not expensive. The cost of a missed import cycle or a 2,000-line God file is.


The Architecture Guardian is one of the workflows available in github/gh-aw. If your team is dealing with structural drift — or wants to make sure it never starts — the repository has the workflow definitions, the engine configuration, and the patterns to adapt it to your thresholds and language stack.

Weekly Update – June 1, 2026

It’s been a busy week in github/gh-aw! Five releases landed between May 28 and May 31, capped off by v0.77.4 — one of the biggest releases in recent memory. Here’s everything that shipped.

v0.77.4 was published on May 31 and packs in a ton of new capability.

  • Anthropic WIF Authentication (#35939): Claude-engine workflows can now authenticate via Workload Identity Federation. No more long-lived API key secrets stored in your repo — WIF handles it securely.

  • copilot-sdk Engine (#35936): A new frontmatter option, engine: copilot-sdk, gives workflows direct access to the Copilot SDK runtime, opening up new integration patterns.

  • aw.yml Manifest: Includes, Skills & Agents (#35778): Your repository manifest now supports includes, skills, and agents keys so you can compose and share workflow components across repos.

  • Per-Workflow 24-Hour Effective-Token Guardrail (#36042): A configurable token guardrail prevents runaway agent costs with enterprise-grade defaults and handy ET shorthand support.

  • search_commits in GitHub MCP Search Toolset (#36115): Agents can now search commits directly via the GitHub MCP search toolset.

  • New Skills: copilot-review and go-codemod (#36111, #36034): Two new skills help agents plan and address PR review feedback, and implement Go codemods for the gh aw fix command.

  • Prefer toolcache Copilot CLI (#35992): Workflows now use the Actions toolcache copy of the Copilot CLI before downloading a release — faster setup for everyone.
  • Reusable workflow timeout (#36107): timeout-minutes is now correctly passed through reusable workflow callers.
  • Threat-detection hardening (#36113): Missing prompt artifacts no longer block safe-output execution.
  • on.needs YAML strip (#35965): Processed on.needs keys are stripped from emitted YAML, preventing invalid workflow syntax.

v0.77.3 on May 29th brought sandbox improvements and better initialization:

  • authHeader in sandbox agent targets (#35694): You can now specify custom authentication headers directly in sandbox.agent.targets frontmatter.
  • gh aw init creates the Agentic Workflows custom agent (#35773): Running gh aw init now scaffolds a GitHub Copilot custom agent for Agentic Workflows right out of the box.
  • Stricter schema validation for workflow_call/workflow_dispatch (#35788): Unknown input keys are now rejected at compile time.

Agent of the Week: api-consumption-report

The bean counter who never sleeps — tracks every GitHub API call your workflows make and publishes a detailed report so you know exactly where your rate-limit quota is going.

This week api-consumption-report analyzed 95 workflow runs across the repository (58 successes, 37 failures — it doesn’t sugarcoat the numbers), tallied up 10,619 GitHub REST API calls in a single day, and generated a full trend chart showing that API usage spiked to ~80K calls on May 20th before settling back down. It also uploaded five charts as release assets — a trend line, a heatmap, a per-workflow breakdown, a “burners” donut chart, and a workflow-level trend — then published the whole package as a GitHub Discussion for everyone to browse.

Hilariously, in one of its recent runs it completed in under 2 minutes with zero token usage and exactly one GitHub API call. Turns out that was the run where the cache hadn’t warmed yet — it took a look around, shrugged, and went home early.

Usage tip: Schedule this workflow weekly to catch runaway API consumption before you hit rate limits — the per-workflow breakdown makes it easy to spot which agent is hogging the quota.
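
To make the per-workflow breakdown concrete, here is a small Python sketch of the aggregation step. It assumes the call data is already available as (workflow name, call count) pairs; the top_consumers function and that input shape are hypothetical and say nothing about how api-consumption-report actually collects or stores its numbers.

```python
from collections import Counter


def top_consumers(call_log: list, limit: int = 3) -> list:
    """Sum API calls per workflow and return the heaviest consumers first.

    call_log is a list of (workflow_name, call_count) pairs, one per run.
    """
    totals = Counter()
    for workflow, calls in call_log:
        totals[workflow] += calls
    return totals.most_common(limit)
```

Feeding it one entry per workflow run and reading the first row of the result is enough to answer the "who is hogging the quota" question.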

View the workflow on GitHub

Upgrade to v0.77.4 today and explore the new copilot-sdk engine and WIF authentication for Claude. As always, feedback and contributions are welcome at github/gh-aw.

Agent of the Day – May 29, 2026

By the time an issue makes it into your backlog, someone already spent time writing it. The least you can do is make sure it gets read by the right person quickly. In practice, that rarely happens — unlabeled issues pile up, the search experience degrades, and the right engineer finds out about a relevant bug two sprints too late. Labeling sounds simple. Doing it consistently, at scale, without burning anyone’s afternoon, is the actual challenge.

That’s exactly the problem the Auto-Triage Issues workflow in gh-aw was built to solve.


Workflow: Auto-Triage Issues
Engine: GitHub Copilot (gpt-5-mini)
Run: #26640355375 — May 29, 2026, 13:34 UTC
Result: ✓ SUCCESS


Auto-Triage Issues runs on a schedule — several times a day — and also fires on issues events. Each pass, it reads through unlabeled GitHub issues, reasons about their content, and applies labels with a stated confidence level and rationale. No human in the loop. No queue to drain manually.

The agent runs behind an enabled squid-proxy firewall, with outbound access scoped to github.com and approved defaults. That constraint is intentional: triage doesn’t need the open internet, and limiting the blast radius of any agent is good practice regardless of what it’s doing.

Today’s midday run is a useful case study in how the workflow behaves under varying load.


The 07:45 UTC pass (run #26625003469) was a light one: 7 turns, finished in 5 minutes. A handful of issues to consider, quick classification, done. That’s what a steady-state workload looks like.

By 13:34 UTC, the picture was different. The agent completed 28 turns over 10 minutes — four times the conversational depth, twice the elapsed time. Same workflow, same model, same success result. The difference was the volume and complexity of what was waiting in the queue.

This matters because it shows the system isn’t just running a fixed script. The agent works through each issue, reasons about it, and the turn count reflects real cognitive work being done. A heavier inbox produces a longer run, not a failure or a time-out.


Two issues received labels during the midday run:

| Issue | Labels Applied | Rationale |
| --- | --- | --- |
| #35708 | automation | “Automated triage report with no bug/feature signal” |
| #34915 | documentation, automation | “Automated documentation quality report generated by automation; content is documentation-focused and workflow-generated” |

Both calls were high-confidence. Issue #34915 is a good example of the multi-label path: the agent identified that the issue was both workflow-generated and documentation-focused, and applied both labels rather than forcing a single category. That kind of nuanced classification is where static regex-based approaches tend to fall short.


At the end of each run, the workflow doesn’t just apply labels and exit quietly. It creates — or updates — a GitHub Discussion titled [Auto-Triage Report] 2026-05-29, containing a Markdown table that summarizes every issue it classified: the issue number, the labels applied, confidence level, and the agent’s reasoning.

That report serves two purposes. First, it’s auditable — a reviewer can open the Discussion and see exactly what the agent decided and why, without digging through logs. Second, it creates a natural place for human override: if a classification looks wrong, the context is right there to inform a correction.
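
For a sense of what producing such a report involves, here is a minimal Python sketch that renders classifications into a Markdown table with the four columns described above. The Classification type and render_report function are assumptions made for illustration; the actual report is written by the agent, and its exact column layout may differ.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    """One triage decision; field names are illustrative."""
    issue: int
    labels: list
    confidence: str
    rationale: str


def render_report(rows: list) -> str:
    """Render triage decisions as a Markdown table, one row per issue."""
    lines = [
        "| Issue | Labels Applied | Confidence | Rationale |",
        "| --- | --- | --- | --- |",
    ]
    for row in rows:
        labels = ", ".join(row.labels)
        # Escape pipes so free-text rationale cannot break the table layout.
        rationale = row.rationale.replace("|", "\\|")
        lines.append(f"| #{row.issue} | {labels} | {row.confidence} | {rationale} |")
    return "\n".join(lines)
```

Keeping the rationale in the same row as the labels is what makes the human override cheap: the reviewer sees the decision and the reasoning side by side.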

Transparency in automated triage isn’t optional. Reviewers need to trust the output before they’ll stop second-guessing it.


The model choice here is deliberate. gpt-5-mini is fast and cost-effective for classification tasks where the signal is textual and the label set is bounded. You don’t need a heavyweight model to tell the difference between a documentation report and a bug report. Reserving larger models for tasks that actually need them — planning, synthesis, code generation — keeps the system efficient across a full day of scheduled runs.


If your repository is drowning in unlabeled issues, Auto-Triage is a pattern worth adopting. The workflow lives in github/gh-aw, alongside the rest of the agentic workflow library. The firewall configuration, the Discussion report pattern, and the label confidence output are all ready to fork and adapt.

Triage shouldn’t be a task anyone has to remember to do. It should just happen — correctly, consistently, and with a paper trail.

Agent of the Day – May 28, 2026

Every codebase accumulates sediment. A helper function that made sense six months ago. A wrapper that lost its reason to exist after a refactor. Nobody deletes it on purpose — it just lingers. In Go, that lingering costs you: extra surface area to maintain, test coverage for code that does nothing new, and cognitive overhead for every engineer who reads the file.

The Dead Code Removal Agent is a scheduled GitHub Actions workflow that runs daily on the gh-aw repository. Its job is simple: find unused code, verify nothing breaks, and open a pull request. No human intervention required until review time.

On May 27, 2026, the agent completed run #100. Not a fanfare moment — just another daily run doing exactly what it was built to do. It finished in 11.4 minutes across 5 turns, consumed 14.6M effective tokens, and used 12 GitHub Actions minutes.

The target this time was NewValidationErrorWithLocation in pkg/workflow/workflow_errors.go. The function was a constructor wrapper around WorkflowValidationError — originally a convenience, but over time it became redundant as callers could initialize the struct directly. The agent identified it, confirmed it had no remaining callers, and started working.

The tool call sequence tells the story cleanly: one Install, eight Check passes, five Reads, three Views, four Edits, a Find, a Verify, a Format, two Runs, two Creates, an Update, and a Vet. That’s methodical, not mechanical. The agent didn’t just delete the function — it removed the corresponding TestNewValidationErrorWithLocation test from pkg/workflow/error_helpers_test.go and updated compiler_error_formatting_test.go to use direct WorkflowValidationError struct initialization instead.

Verification was thorough. Before opening the PR, the agent ran go build ./..., go vet ./..., go vet -tags=integration ./..., and make fmt. Everything passed. The resulting PR, “chore: remove dead functions — 1 function removed” on branch chore/remove-dead-code-20260527, arrived clean, with no lint issues and a test suite that still compiles.
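
If you wanted to reproduce that gate in a script of your own, a minimal Python sketch might look like the following. The four commands are the ones named above; the first_failure function, the injectable run parameter, and the stop-at-first-failure behavior are assumptions for this example, not a description of how the agent sequences its tool calls.

```python
import subprocess
from typing import Callable, Optional, Sequence

# The four commands named in the text, in the order they are listed.
CHECKS = [
    ["go", "build", "./..."],
    ["go", "vet", "./..."],
    ["go", "vet", "-tags=integration", "./..."],
    ["make", "fmt"],
]


def _run(cmd: Sequence[str]) -> int:
    """Run one command in the current directory and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def first_failure(
    checks: Sequence[Sequence[str]] = CHECKS,
    run: Callable[[Sequence[str]], int] = _run,
) -> Optional[Sequence[str]]:
    """Run checks in order; return the first failing command, or None if all pass."""
    for cmd in checks:
        if run(cmd) != 0:
            return cmd
    return None
```

A PR would only be opened when first_failure() returns None. Passing a different run callable makes the gate easy to exercise without a Go toolchain installed.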

Zoom out a week and the picture gets more interesting. Across five runs in the last seven days, the agent logged:

  • 35.5 minutes total duration
  • 38.9M effective tokens
  • 38 GitHub Actions minutes
  • 21 turns across all five runs
  • 5 out of 5 high-confidence episodes

Run classification across that window: two normal runs, one risky, one failure, one in-progress. The failure and the risky classification matter as much as the successes. The agent doesn’t always find something safe to remove, and when it can’t complete cleanly, it doesn’t force a PR. That restraint is a feature, not a gap.

Dead code removal is well-suited to an agent for a specific reason: the feedback loop is entirely mechanical. Does it build? Does go vet pass? Does the test suite still run? Those questions have definitive answers. The agent never has to speculate about intent — it just has to be rigorous about verification, which it is.

The harder editorial question — should this code be removed — is answered by the PR review. The agent does the investigation and the grunt work. Engineers do the judgment call. That division feels right.

There’s also something useful about the daily cadence. A function doesn’t become dead overnight. But catching it the morning after the last caller disappears, rather than six months later during a refactor, is the difference between a one-line deletion and an archaeology project.

If you’re curious about how the Dead Code Removal Agent is built, or if you want to run something similar against your own Go codebase, the workflow lives at github/gh-aw. The patterns here — schedule-triggered agents, structured verification steps, PR-as-output — are composable. Start there.

Run #100 was just another Wednesday. That’s the point.

Agent of the Day – May 27, 2026

Every day, 236 agentic workflows run inside the gh-aw repository. Most complete quietly. A few fail in patterns worth tracking. And once a week, one workflow reads the entire fleet, scores it, and writes up what it found. That workflow is the Agent Performance Analyzer, and its run on May 27, 2026 produced the clearest signal in months.

Agent of the Day: Agent Performance Analyzer — Meta-Orchestrator

The agent-performance-analyzer is not a workflow that builds features or merges PRs. Its job is to watch everything else. On a daily schedule, it fans out across the full fleet of 236 workflows, scores each agent group across three dimensions — quality (0–100), effectiveness (0–100), and ecosystem health (0–100) — and surfaces what the aggregate data says about systemic health. Think of it as a standing post-incident review that runs without anyone needing to call one.

Run #26515287616, logged on May 27, ran for 10.7 minutes and processed 12.2 million effective tokens. Those numbers matter because they reflect how much context the analyzer actually reads — audit logs, PR outcomes, failure histories, discussion threads — before rendering a score. This is not a lightweight health check.

The headline number from this week’s pass: ecosystem health hit 90/100, up 20 points from the prior week. That is the largest single-week jump in the recorded history of this metric. It is also a number that demands interpretation, not celebration. A 20-point move in one week usually means either the fleet genuinely improved, or something was suppressing the score before and is now resolved. The weekly Discussion #35220 breaks down the contributing factors: most of the lift came from a recovery in the copilot-swe-agent merge rate, which reached 67% for the week, up 6 percentage points from the prior week, with 6 merges on May 27 alone. Merge rate is an imperfect proxy for workflow effectiveness, but 67% across a fleet this size is a meaningful signal.

The top performers bear out that story. Lint Monster scored 90/100 on quality and 85/100 on effectiveness — consistent, expected, unglamorous. copilot-swe-agent followed at 88/100 quality and 84/100 effectiveness. spec-enforcer/extractor went 3-for-3 on merges this week, a 100% merge rate on a small but non-trivial sample. These are the parts of the fleet holding their line.

Quality, though, is flat. 74/100 for the fourth consecutive week. A plateau at week four is no longer noise. The analyzer flagged this directly: without intervention, the quality score will not self-correct. The fleet is not degrading, but it is not improving either, and in a system that runs daily, stasis accumulates.

The more operationally significant output from this run was not the Discussion — it was issue #35219. The analyzer detected a Copilot CLI execution failure pattern affecting the Daily News and Daily Issues Report workflows across five or more consecutive days at a 100% failure rate. A workflow failing once is noise. Failing every day for a week is infrastructure. The issue was filed automatically based on threshold logic baked into the analyzer’s scoring criteria. No human had to notice the pattern.
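
The kind of threshold logic described here can be sketched in a few lines of Python. The five-day minimum and the 100% failure rate come from the paragraph above; the function names, the list-of-daily-rates input, and the exact rule are assumptions, since the analyzer's real criteria are not spelled out in this post.

```python
def trailing_failure_streak(daily_failure_rates: list) -> int:
    """Count how many of the most recent days in a row had a 100% failure rate.

    daily_failure_rates is ordered oldest to newest, each value in [0.0, 1.0].
    """
    streak = 0
    for rate in reversed(daily_failure_rates):
        if rate < 1.0:
            break
        streak += 1
    return streak


def should_file_issue(daily_failure_rates: list, min_days: int = 5) -> bool:
    """File an issue once the trailing all-failure streak reaches min_days."""
    return trailing_failure_streak(daily_failure_rates) >= min_days
```

Counting only the trailing streak means a single successful day resets the clock, which keeps a one-off failure from ever looking like infrastructure.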

Three other systemic issues surfaced in Discussion #35220. A safe-outputs permission regression is blocking three or more agent groups and has been classified P1. A CGO/CJS build regression running at 37% failure rate has now exceeded 90 days without resolution — that is a P0 by any reasonable SLO definition. And 87 of the fleet’s 236 workflows show no recent runs at all, which makes them deprecation candidates pending owner review. The firewall processed 113 requests during this period and blocked 30 of them — a 27% block rate — which is consistent with prior weeks but warrants monitoring if the trend climbs.

The value of a meta-orchestrator is not that it prevents incidents. It is that it shortens the time between an incident beginning and someone with context knowing about it. Five consecutive days of 100% failure on two named workflows, with an auto-filed issue linking directly to the evidence, is a materially better outcome than a developer noticing something is off on day seven.


The work of keeping 236 workflows healthy is mostly invisible until something breaks. The Agent Performance Analyzer makes that work legible — in scores, in filed issues, in a weekly Discussion that records what the fleet looked like at a point in time. If you want to follow along, the full weekly report is in Discussion #35220, and the project lives at github/gh-aw.