GitHub Agentic Workflows

Agent of the Day – June 2, 2026

Agent of the Day – June 2, 2026: The Data Detective

You know that feeling when a bill arrives and it’s higher than you expected — and the line items are all vague? That’s what staring at aggregate AI token consumption looks like without good tooling. The number goes up, the curve bends, and everyone shrugs. Was it a new workflow? A prompt gone feral? A perfectly normal Monday?

That’s the exact problem Scout was built for.


Scout is gh-aw’s on-demand research agent — a workflow you invoke with a question and come back to with an answer. It doesn’t file PRs or leave comments as part of a pipeline. It reads, reasons, and reports, turning an open-ended research prompt into structured evidence a team can actually act on.

On May 31, 2026 (run #26709587451), Scout received a deceptively simple prompt on issue #36100: investigate token usage trends from the agentic-token-audit and agentic-token-optimizer workflows across April and May.

Eight turns and 8.1 minutes later, it had the answer — and it wasn’t pretty.


The headline: daily token consumption in gh-aw nearly doubled over two months, peaking at 138 million tokens on May 29 — the highest single day in the entire dataset.

| Window | Avg tokens/day | Avg action-min/day |
| --- | --- | --- |
| April 2026 (21 days) | ~80.1M | ~713 |
| Early May (days 1–5) | ~62.1M | |
| Late May (days 20–29) | ~101.8M | ~900 |

Run counts stayed nearly flat the whole time — capped near 100/day by the collector’s limit. More runs weren’t the culprit. The growth was coming from within each run.
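
To make the "growth within each run" point concrete, here is a rough back-of-the-envelope sketch. It takes the daily averages reported above and assumes, for illustration, that every window ran at the collector's cap of about 100 runs per day; the constant and function names are hypothetical, not part of gh-aw.

```python
# Average daily tokens per window, as reported in the post.
AVG_TOKENS_PER_DAY = {
    "April 2026": 80.1e6,
    "Early May": 62.1e6,
    "Late May": 101.8e6,
}

# Assumption: the collector's cap of roughly 100 runs/day applies to every window.
ASSUMED_RUNS_PER_DAY = 100


def tokens_per_run(avg_tokens_per_day: float,
                   runs_per_day: int = ASSUMED_RUNS_PER_DAY) -> float:
    """Return the average token cost of a single run."""
    return avg_tokens_per_day / runs_per_day


per_run = {window: tokens_per_run(total)
           for window, total in AVG_TOKENS_PER_DAY.items()}

for window, cost in per_run.items():
    print(f"{window}: ~{cost / 1e6:.2f}M tokens per run")
```

Under that assumption the average run costs about 0.80M tokens in April and about 1.02M in late May, which is the shape of growth you would expect when run counts are flat and the daily total rises.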

Scout traced it to two compounding forces. First, heavy-hitter workflows: the May 29 spike was dominated by PR Sous Chef (15.7M tokens across 5 runs, averaging ~186 turns per run), Safe Output Health Monitor (8.7M, single run), and Go Logger Enhancement (8.5M). Token variance tracked workflow mix and turn count almost exactly. Second, catalog growth: ~111 new agentic workflow .md files were added between April and May, pushing the repository to over 237 workflows. More workflows meant more scheduled runners pulling heavier daily reporters and analyzers into the mix.
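
For a sense of scale, the sketch below works out how much of the May 29 total the three named workflows account for, using only the figures quoted above. The variable names are illustrative, and the per-run figure for PR Sous Chef is a simple average across its 5 runs.

```python
PEAK_DAY_TOKENS = 138e6  # May 29 total reported in the post

# Per-workflow totals for May 29, as reported in the post.
HEAVY_HITTERS = {
    "PR Sous Chef": 15.7e6,
    "Safe Output Health Monitor": 8.7e6,
    "Go Logger Enhancement": 8.5e6,
}
PR_SOUS_CHEF_RUNS = 5


def share_of_day(tokens: float, day_total: float = PEAK_DAY_TOKENS) -> float:
    """Return the fraction of the day's tokens that `tokens` represents."""
    return tokens / day_total


combined = sum(HEAVY_HITTERS.values())
combined_share = share_of_day(combined)
sous_chef_per_run = HEAVY_HITTERS["PR Sous Chef"] / PR_SOUS_CHEF_RUNS

print(f"Top three workflows: {combined / 1e6:.1f}M tokens "
      f"({combined_share:.1%} of the day)")
print(f"PR Sous Chef: ~{sous_chef_per_run / 1e6:.2f}M tokens per run")
```

The three workflows together come to roughly 24% of the day's total, with PR Sous Chef averaging a little over 3M tokens per run.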

There’s a silver lining. The agentic-token-optimizer workflow is doing its job — flagging concrete savings targets and driving commits. After Scout’s predecessor run flagged go-logger at 1.7M tokens per run on May 31, commit #36088 (“Trim go-logger workflow prompt and validation overhead”) landed quickly. The feedback loop works.

The gap is velocity: new workflows are arriving faster than optimizations land, so the net curve still bends upward.


What makes this run compelling isn’t just the findings — it’s how Scout approached the problem. It used 37 distinct tool types across 8 turns, drawing on Tavily’s research suite (search, crawl, extract, map, and research) to pull historical snapshot data and cross-reference it against repository commits. It made 61 network requests with zero firewall blocks, querying the memory/token-audit branch for the daily snapshot history and reconciling gaps in the mid-May data (several dates had empty downloads from API rate-limit failures during collection).

The result was a structured research report posted directly to issue #36100, complete with a data table, a trend attribution section, caveats about data quality during the blind-spot window (May 6–19), and concrete recommendations — all in a single comment.

No pipeline. No scaffolding. Just: “here’s a hard question” → “here’s a rigorous answer.”


Scout is a good reminder that not every agent needs to do something to be valuable. Some of the highest-leverage work in a complex system is the work of seeing clearly — quantifying what’s happening, attributing root causes, and giving a team a shared picture to reason from. Without that, optimization work is guesswork.

When your token bill doubles in six weeks, you want a Scout.


Want to run your own research agent or explore the full gh-aw workflow catalog? Check out the project at github.com/github/gh-aw.

Agent of the Day – June 1, 2026

Agent of the Day – June 1, 2026: The Red Team That Never Sleeps

Security scanning is easy to deprioritize. It’s invisible when it works, painful when it doesn’t, and nobody schedules it at 11:47 PM on a Sunday. That’s exactly why we automated it.

Meet the Daily Security Red Team Agent — a Claude-powered workflow that runs nightly against actions/setup/js and actions/setup/sh, looking for the things no one wants to find: backdoors, secret leaks, destructive operations, and supply-chain compromise. Last night’s run (#123, 2026-05-31T23:47:47Z) came back clean. That’s the good news. The more interesting story is what it took to get there.


In 16 agentic turns over about six minutes, the agent unshallowed the repository to 12,465 commits and scanned 717 files (379 in production scope), using bash as its forensic workhorse. It called bash 14 times: 12 directory-scan passes and two cache reads to pull context from prior runs. One further safe-output call logged its findings.

Twelve candidates came up for review. All twelve were dismissed. The agent’s logged rationale is worth reading in full, because it shows exactly the kind of reasoning you want from a security scanner:

“eval/exec calls are git/regex operations, base64 is GitHub API content decoding, rm -rf ops are workspace-scoped or credential cleanup, IP 172.30.0.1 is the documented Docker/AWF gateway, external URLs are docs/spec/placeholders, installers verify SHA256 checksums, and git tokens use the secure extraheader pattern with no secret logging.”

That’s not hand-waving. Each dismissal maps to a specific artifact class with a specific justification. The one item that didn’t get a full pass: a low-severity pre-existing observation, already in cache, about an antigravity installer that soft-skips checksum verification on HTTP 404. Noted, tracked, not new.

No issues were created this run. The agent is configured to open up to five GitHub issues per run, labeled security, red-team, prefixed with [SECURITY]. Strict mode means it won’t fabricate urgency. If it doesn’t find something real, it files nothing.
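
The issue-filing rules in that paragraph can be sketched in a few lines. This is a toy model, not the workflow's actual implementation: the `Finding` and `IssueDraft` types and the `draft_issues` function are hypothetical, and only the cap of five, the two labels, and the `[SECURITY]` prefix come from the post.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_ISSUES_PER_RUN = 5
TITLE_PREFIX = "[SECURITY]"
LABELS = ("security", "red-team")


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for one reviewed candidate."""
    title: str
    confirmed: bool  # False means the candidate was dismissed


@dataclass(frozen=True)
class IssueDraft:
    """Hypothetical shape of an issue the run would open."""
    title: str
    labels: Tuple[str, ...]


def draft_issues(findings: List[Finding]) -> List[IssueDraft]:
    """Turn confirmed findings into issue drafts, capped per run.

    Dismissed candidates never become issues, so a clean run files nothing.
    """
    confirmed = [f for f in findings if f.confirmed]
    return [
        IssueDraft(title=f"{TITLE_PREFIX} {f.title}", labels=LABELS)
        for f in confirmed[:MAX_ISSUES_PER_RUN]
    ]
```

A run like last night's, where all twelve candidates were dismissed, would produce an empty list here.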


Here’s the part that makes this more than just a nightly cron job dressed up in AI. Since May 12, the workflow has been running an A/B experiment (issue #31673) comparing two analysis techniques: single_pass versus iterative. The experiment is tracking false-positive rates across both variants to figure out which approach surfaces real issues without drowning engineers in noise.

Last night’s run used the full-comprehensive technique variant. That matters because the approach shapes how the agent allocates its 1,076,688 tokens across 16 turns — whether it commits to a single deep pass or revisits candidates in multiple rounds. Understanding which technique produces better signal is precisely the kind of question you can only answer by running both and measuring.
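
As a rough illustration of what "tracking false-positive rates across both variants" could look like, here is a minimal sketch. It assumes the rate is defined as the share of flagged candidates that turned out not to be real issues; the experiment's actual definition and data are not given in the post, so the function and the sample records below are placeholders invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def false_positive_rates(reviews: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute, per variant, the fraction of flagged candidates that were not real.

    `reviews` yields (variant, was_real_issue) for every flagged candidate.
    """
    flagged: Dict[str, int] = defaultdict(int)
    false_pos: Dict[str, int] = defaultdict(int)
    for variant, was_real in reviews:
        flagged[variant] += 1
        if not was_real:
            false_pos[variant] += 1
    return {variant: false_pos[variant] / flagged[variant] for variant in flagged}


# Placeholder outcomes, invented for illustration only (not experiment data).
SAMPLE_REVIEWS = [
    ("single_pass", False),
    ("single_pass", True),
    ("single_pass", False),
    ("iterative", False),
    ("iterative", True),
]

rates = false_positive_rates(SAMPLE_REVIEWS)
for variant, rate in rates.items():
    print(f"{variant}: {rate:.0%} of flagged candidates were false positives")
```

Comparing variants this way only becomes meaningful once each one has accumulated enough reviewed candidates, which is why the experiment has to run over many nights.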

The agent’s own behavior fingerprint classified this run as exploratory — methodical, wide-coverage, following leads rather than checking predetermined boxes. That fits the full-comprehensive profile. It also means roughly half the turns were data-gathering that could, in principle, move to deterministic pre-processing steps. That’s not a criticism; it’s a roadmap.


Actions setup scripts are high-value targets. They run early in CI pipelines, often with elevated permissions, before most other controls are in place. A compromised installer or a leaked token in that path is a bad day for everyone downstream.

Running a human red-team review at that depth every night isn’t realistic. Running a token-heavy AI agent that unshallows 12,000+ commits and reasons through eval patterns at 11 PM every night, Sundays included? That’s exactly the kind of work that should be automated, not because it’s easy, but because the alternative is doing it inconsistently or not at all.

The workflow logged a clean bill of health. The experiment is generating data. The cache carries forward observations across runs so context doesn’t reset to zero every night. That’s an agent doing its job.


[Figure: Daily workflow activity chart]


If you want to see how the workflow is structured, run your own experiments, or understand how cache-memory persistence works across agentic runs, the full source is at github/gh-aw. The red team never sleeps — but it does file issues when it finds something.

Agent of the Day – June 1, 2026

Architectural drift is quiet and cumulative. A file grows past 600 lines. A function absorbs one more responsibility. An import cycle sneaks in between two packages that “just need to share a little logic.” None of it trips a CI gate, no test turns red, and six months later a new engineer opens that directory and wonders how it got this bad. The Architecture Guardian workflow exists precisely to interrupt that pattern before it becomes load-bearing.

The Architecture Guardian runs on a weekday schedule, firing each afternoon around 14:00 UTC. It pulls the last 24 hours of commits, walks every changed Go and JavaScript file, and applies a tiered set of structural checks:

  • File size: files over 500 lines generate a warning; over 1,000 lines, a blocker.
  • Function length: any function exceeding 80 lines is flagged.
  • Export count: more than 10 exports from a single file draws scrutiny.
  • Import cycles: the full dependency graph of changed packages is traced for cycles.
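
The first three checks are simple threshold comparisons, and a minimal sketch makes the tiers explicit. The thresholds come from the list above; the `check_file` function, its inputs, and the severity strings are hypothetical, since the real workflow expresses these rules in its Markdown spec rather than in code.

```python
from typing import Dict, List

FILE_WARN_LINES = 500      # more than this: warning
FILE_BLOCK_LINES = 1000    # more than this: blocker
MAX_FUNCTION_LINES = 80    # longer functions are flagged
MAX_EXPORTS = 10           # more exports than this draws scrutiny


def check_file(path: str, line_count: int,
               function_lengths: Dict[str, int],
               export_count: int) -> List[str]:
    """Return human-readable findings for one changed file."""
    findings: List[str] = []

    # File size is tiered: only the most severe level is reported.
    if line_count > FILE_BLOCK_LINES:
        findings.append(f"BLOCKER {path}: {line_count} lines")
    elif line_count > FILE_WARN_LINES:
        findings.append(f"WARNING {path}: {line_count} lines")

    # Every over-long function is reported individually.
    for name, length in function_lengths.items():
        if length > MAX_FUNCTION_LINES:
            findings.append(f"FLAG {path}: function {name} is {length} lines")

    # Export count is a softer signal, so it is reported for review only.
    if export_count > MAX_EXPORTS:
        findings.append(f"REVIEW {path}: {export_count} exports")

    return findings
```

For example, `check_file("pkg/cli/run.go", 1200, {"Execute": 95}, 4)` (a made-up file) would return one blocker for file size and one flag for the long function.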

When violations surface, the workflow doesn’t just log and move on. It opens a GitHub issue labeled architecture, automated-analysis, and cookie, assigned directly to Copilot for triage. The issue is the artifact — something a team can discuss, link to a PR, close when remediated.

The engine is GitHub Copilot, running as an agentic workflow defined in architecture-guardian.md. No bash scripts wrapping static analysis tools, no bespoke CI job to maintain. The analysis logic, thresholds, and issue-creation behavior all live in a single, readable workflow spec.

Run 26766995181 completed on June 1, 2026 at 16:18 UTC, five minutes and forty seconds after it started. The agent worked through three turns with claude-sonnet-4.6 via GitHub Copilot, made 10 GitHub API calls, and consumed 125,356 tokens. That figure is roughly a tenth of the effective token count of 1,206,982 once prompt caching is included. Caching is doing real work here.

The verdict: no violations. Every changed file over the past 24 hours fell within the configured thresholds. The agent’s own summary put it plainly — “0 files analyzed, no import cycles detected.” Nothing to open, nothing to assign.

That outcome is worth pausing on. A clean run isn’t a null result; it’s confirmation. The codebase was touched, the guardian looked, and the boundaries held. Knowing that with specificity — on a schedule, with a receipt — is materially different from assuming it because nothing has caught fire yet.

The 500-line warning and 1,000-line blocker aren’t arbitrary. Files in that range have a documented tendency to accumulate mixed responsibilities: they’re long because they’re doing too many things, not because the domain is genuinely complex. The 80-line function limit enforces a similar discipline. It’s not a style preference; it’s a forcing function for decomposition.

Export counts above 10 are a softer signal (a file with 15 exports might be perfectly well-structured), but they surface files worth a second look. Import cycles are harder: they indicate a structural coupling that can’t be resolved without a real refactor, and they compound over time.
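
Import-cycle detection is the one check here that is a graph problem rather than a threshold. The sketch below shows one common way to do it, a depth-first search that reports the first cycle it finds. It is an assumption about the general technique, not a description of how the Architecture Guardian does it, and the package names in the usage note are made up.

```python
from typing import Dict, List, Optional


def find_import_cycle(imports: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one import cycle as a list of packages, or None if there is none.

    `imports` maps each package to the packages it imports directly.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {pkg: UNVISITED for pkg in imports}
    path: List[str] = []  # packages on the current DFS path, in order

    def visit(pkg: str) -> Optional[List[str]]:
        state[pkg] = IN_PROGRESS
        path.append(pkg)
        for dep in imports.get(pkg, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[pkg] = DONE
        return None

    for pkg in list(imports):
        if state[pkg] == UNVISITED:
            cycle = visit(pkg)
            if cycle:
                return cycle
    return None
```

Given `{"cli": ["workflow"], "workflow": ["parser"], "parser": ["cli"]}`, the function returns `["cli", "workflow", "parser", "cli"]`; remove the last edge and it returns `None`.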

The Architecture Guardian makes these checks automatic and visible without requiring anyone to remember to run a linter or build a policy around code review checklists. The standards are encoded in the workflow. The workflow runs whether or not anyone’s thinking about it.

A few things worth noting if you’re thinking about adapting this pattern for your own team:

Scheduling matters. A daily check at 14:00 UTC catches violations before they’re a day old. Violations that linger for a week become rationalizations.

Issue creation is the accountability loop. Logging a warning to stdout is easy to ignore. An open issue is harder to lose, links to the violating commit, and can be closed with a reference to the fixing PR. That chain is the point.

Clean runs are data. The June 1 run found nothing. That’s not a failure of the workflow — it’s the workflow confirming steady-state health. Over time, a history of clean runs punctuated by occasional issues tells you something real about your team’s structural discipline.

Token efficiency scales. 1.2 million effective tokens for a daily architectural scan, amortized across a codebase’s active lifetime, is not expensive. The cost of a missed import cycle or a 2,000-line God file is.


The Architecture Guardian is one of the workflows available in github/gh-aw. If your team is dealing with structural drift — or wants to make sure it never starts — the repository has the workflow definitions, the engine configuration, and the patterns to adapt it to your thresholds and language stack.

Weekly Update – June 1, 2026

It’s been a busy week in github/gh-aw! Five releases landed between May 28 and May 31, capped off by v0.77.4 — one of the biggest releases in recent memory. Here’s everything that shipped.

v0.77.4 published on May 31st and packs in a ton of new capability.

  • Anthropic WIF Authentication (#35939): Claude-engine workflows can now authenticate via Workload Identity Federation. No more long-lived API key secrets stored in your repo — WIF handles it securely.

  • copilot-sdk Engine (#35936): A new engine: copilot-sdk frontmatter option gives workflows direct access to the Copilot SDK runtime, opening up new integration patterns.

  • aw.yml Manifest: Includes, Skills & Agents (#35778): Your repository manifest now supports includes, skills, and agents keys so you can compose and share workflow components across repos.

  • Per-Workflow 24-Hour Effective-Token Guardrail (#36042): A configurable token guardrail prevents runaway agent costs with enterprise-grade defaults and handy ET shorthand support.

  • search_commits in GitHub MCP Search Toolset (#36115): Agents can now search commits directly via the GitHub MCP search toolset.

  • New Skills: copilot-review and go-codemod (#36111, #36034): Two new skills help agents plan and address PR review feedback, and implement Go codemods for the gh aw fix command.

  • Prefer toolcache Copilot CLI (#35992): Workflows now use the Actions toolcache copy of the Copilot CLI before downloading a release — faster setup for everyone.
  • Reusable workflow timeout (#36107): timeout-minutes is now correctly passed through reusable workflow callers.
  • Threat-detection hardening (#36113): Missing prompt artifacts no longer block safe-output execution.
  • on.needs YAML strip (#35965): Processed on.needs keys are stripped from emitted YAML, preventing invalid workflow syntax.

v0.77.3 on May 29th brought sandbox improvements and better initialization:

  • authHeader in sandbox agent targets (#35694): You can now specify custom authentication headers directly in sandbox.agent.targets frontmatter.
  • gh aw init creates the Agentic Workflows custom agent (#35773): Running gh aw init now scaffolds a GitHub Copilot custom agent for Agentic Workflows right out of the box.
  • Stricter schema validation for workflow_call/workflow_dispatch (#35788): Unknown input keys are now rejected at compile time.

Agent of the Week: api-consumption-report

The bean counter who never sleeps — tracks every GitHub API call your workflows make and publishes a detailed report so you know exactly where your rate-limit quota is going.

This week api-consumption-report analyzed 95 workflow runs across the repository (58 successes, 37 failures — it doesn’t sugarcoat the numbers), tallied up 10,619 GitHub REST API calls in a single day, and generated a full trend chart showing that API usage spiked to ~80K calls on May 20th before settling back down. It also uploaded five charts as release assets — a trend line, a heatmap, a per-workflow breakdown, a “burners” donut chart, and a workflow-level trend — then published the whole package as a GitHub Discussion for everyone to browse.

Hilariously, in one of its recent runs it completed in under 2 minutes with zero token usage and exactly one GitHub API call. Turns out that was the run where the cache hadn’t warmed yet — it took a look around, shrugged, and went home early.

Usage tip: Schedule this workflow weekly to catch runaway API consumption before you hit rate limits — the per-workflow breakdown makes it easy to spot which agent is hogging the quota.

Upgrade to v0.77.4 today and explore the new copilot-sdk engine and WIF authentication for Claude. As always, feedback and contributions are welcome at github/gh-aw.

Agent of the Day – May 29, 2026

By the time an issue makes it into your backlog, someone already spent time writing it. The least you can do is make sure it gets read by the right person quickly. In practice, that rarely happens — unlabeled issues pile up, the search experience degrades, and the right engineer finds out about a relevant bug two sprints too late. Labeling sounds simple. Doing it consistently, at scale, without burning anyone’s afternoon, is the actual challenge.

That’s exactly the problem the Auto-Triage Issues workflow in gh-aw was built to solve.


Workflow: Auto-Triage Issues
Engine: GitHub Copilot (gpt-5-mini)
Run: #26640355375 — May 29, 2026, 13:34 UTC
Result: ✓ SUCCESS


Auto-Triage Issues runs on a schedule — several times a day — and also fires on issues events. Each pass, it reads through unlabeled GitHub issues, reasons about their content, and applies labels with a stated confidence level and rationale. No human in the loop. No queue to drain manually.

The agent runs behind an enabled squid-proxy firewall, with outbound access scoped to github.com and approved defaults. That constraint is intentional: triage doesn’t need the open internet, and limiting the blast radius of any agent is good practice regardless of what it’s doing.
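
The allowlist idea behind that firewall can be shown with a small sketch. This is an illustration of domain allowlisting in general, not the squid-proxy configuration gh-aw ships: the `is_allowed` helper is hypothetical, the "approved defaults" are not listed in the post and so are left out, and treating subdomains of an allowed domain as allowed is an assumption.

```python
from typing import Iterable

# Only github.com is named in the post; the approved defaults are omitted here.
ALLOWED_DOMAINS = ("github.com",)


def is_allowed(host: str, allowed: Iterable[str] = ALLOWED_DOMAINS) -> bool:
    """Return True if `host` is an allowed domain or a subdomain of one.

    Assumption: subdomains inherit the allowance of their parent domain.
    """
    host = host.lower().rstrip(".")
    return any(host == domain or host.endswith("." + domain) for domain in allowed)
```

Matching on a leading dot matters: `api.github.com` passes, while a lookalike such as `evilgithub.com` does not.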

Today’s midday run is a useful case study in how the workflow behaves under varying load.


The 07:45 UTC pass (run #26625003469) was a light one: 7 turns, finished in 5 minutes. A handful of issues to consider, quick classification, done. That’s what a steady-state workload looks like.

By 13:34 UTC, the picture was different. The agent completed 28 turns over 10 minutes — four times the conversational depth, twice the elapsed time. Same workflow, same model, same success result. The difference was the volume and complexity of what was waiting in the queue.

This matters because it shows the system isn’t just running a fixed script. The agent works through each issue, reasons about it, and the turn count reflects real cognitive work being done. A heavier inbox produces a longer run, not a failure or a time-out.


Two issues received labels during the midday run:

| Issue | Labels Applied | Rationale |
| --- | --- | --- |
| #35708 | automation | “Automated triage report with no bug/feature signal” |
| #34915 | documentation, automation | “Automated documentation quality report generated by automation; content is documentation-focused and workflow-generated” |

Both calls were high-confidence. Issue #34915 is a good example of the multi-label path: the agent identified that the issue was both workflow-generated and documentation-focused, and applied both labels rather than forcing a single category. That kind of nuanced classification is where static regex-based approaches tend to fall short.


At the end of each run, the workflow doesn’t just apply labels and exit quietly. It creates — or updates — a GitHub Discussion titled [Auto-Triage Report] 2026-05-29, containing a Markdown table that summarizes every issue it classified: the issue number, the labels applied, confidence level, and the agent’s reasoning.

That report serves two purposes. First, it’s auditable — a reviewer can open the Discussion and see exactly what the agent decided and why, without digging through logs. Second, it creates a natural place for human override: if a classification looks wrong, the context is right there to inform a correction.
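
A report like that is straightforward to render once each decision is captured as a record. The sketch below is one way to do it, with a hypothetical `TriageDecision` type and `render_report` function; the two sample rows reuse the issue numbers, labels, and rationales from the midday run, with the confidence recorded as "high" as described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TriageDecision:
    """Hypothetical record of one labeling decision."""
    issue: int
    labels: List[str]
    confidence: str
    rationale: str


def render_report(decisions: List[TriageDecision]) -> str:
    """Render triage decisions as a Markdown table."""
    lines = [
        "| Issue | Labels | Confidence | Rationale |",
        "| --- | --- | --- | --- |",
    ]
    for d in decisions:
        labels = ", ".join(d.labels)
        # Escape pipes so free-text rationales cannot break the table layout.
        rationale = d.rationale.replace("|", "\\|")
        lines.append(f"| #{d.issue} | {labels} | {d.confidence} | {rationale} |")
    return "\n".join(lines)


decisions = [
    TriageDecision(35708, ["automation"], "high",
                   "Automated triage report with no bug/feature signal"),
    TriageDecision(34915, ["documentation", "automation"], "high",
                   "Automated documentation quality report generated by automation; "
                   "content is documentation-focused and workflow-generated"),
]

print(render_report(decisions))
```

Keeping the rationale in the same row as the labels is what makes a human override cheap: the reviewer sees the decision and the reasoning side by side.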

Transparency in automated triage isn’t optional. Reviewers need to trust the output before they’ll stop second-guessing it.


The model choice here is deliberate. gpt-5-mini is fast and cost-effective for classification tasks where the signal is textual and the label set is bounded. You don’t need a heavyweight model to tell the difference between a documentation report and a bug report. Reserving larger models for tasks that actually need them — planning, synthesis, code generation — keeps the system efficient across a full day of scheduled runs.


If your repository is drowning in unlabeled issues, Auto-Triage is a pattern worth adopting. The workflow lives in github/gh-aw, alongside the rest of the agentic workflow library. The firewall configuration, the Discussion report pattern, and the label confidence output are all ready to fork and adapt.

Triage shouldn’t be a task anyone has to remember to do. It should just happen — correctly, consistently, and with a paper trail.