GitHub Agentic Workflows

Blog

Agent of the Day – May 28, 2026

Every codebase accumulates sediment. A helper function that made sense six months ago. A wrapper that lost its reason to exist after a refactor. Nobody deletes it on purpose — it just lingers. In Go, that lingering costs you: extra surface area to maintain, test coverage for code that does nothing new, and cognitive overhead for every engineer who reads the file.

The Dead Code Removal Agent is a scheduled GitHub Actions workflow that runs daily on the gh-aw repository. Its job is simple: find unused code, verify nothing breaks, and open a pull request. No human intervention required until review time.

On May 27, 2026, the agent completed run #100. Not a fanfare moment — just another daily run doing exactly what it was built to do. It finished in 11.4 minutes across 5 turns, consumed 14.6M effective tokens, and used 12 GitHub Actions minutes.

The target this time was NewValidationErrorWithLocation in pkg/workflow/workflow_errors.go. The function was a constructor wrapper around WorkflowValidationError — originally a convenience, but over time it became redundant as callers could initialize the struct directly. The agent identified it, confirmed it had no remaining callers, and started working.

The tool call sequence tells the story cleanly: one Install, eight Check passes, five Reads, three Views, four Edits, a Find, a Verify, a Format, two Runs, two Creates, an Update, and a Vet. That’s methodical, not mechanical. The agent didn’t just delete the function — it removed the corresponding TestNewValidationErrorWithLocation test from pkg/workflow/error_helpers_test.go and updated compiler_error_formatting_test.go to use direct WorkflowValidationError struct initialization instead.
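
To make the refactor concrete, here is a minimal sketch of the before and after. The post only names the type and the function, so the fields on WorkflowValidationError (File, Line, Message) and the wrapper's signature are assumptions for illustration, not the real definitions in pkg/workflow.

```go
package main

import "fmt"

// WorkflowValidationError mirrors the type named in the post.
// The fields are hypothetical; the real struct may look different.
type WorkflowValidationError struct {
	File    string
	Line    int
	Message string
}

// Error implements the error interface.
func (e *WorkflowValidationError) Error() string {
	return fmt.Sprintf("%s:%d: %s", e.File, e.Line, e.Message)
}

// NewValidationErrorWithLocation stands in for the removed wrapper
// (assumed signature). It adds nothing over a struct literal.
func NewValidationErrorWithLocation(file string, line int, msg string) *WorkflowValidationError {
	return &WorkflowValidationError{File: file, Line: line, Message: msg}
}

func main() {
	// Before: call sites went through the wrapper.
	before := NewValidationErrorWithLocation("workflow.md", 12, "unknown field")

	// After: call sites and tests initialize the struct directly.
	after := &WorkflowValidationError{File: "workflow.md", Line: 12, Message: "unknown field"}

	// Both paths yield the same value, which is why the wrapper was dead weight.
	fmt.Println(*before == *after)
}
```

When a constructor does nothing but forward its arguments into fields, deleting it also removes one test and one name a reader has to look up.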

Verification was thorough. Before touching the PR, the agent ran go build ./..., go vet ./..., go vet -tags=integration ./..., and make fmt. Everything passed. The resulting PR — “chore: remove dead functions — 1 function removed” on branch chore/remove-dead-code-20260527 — arrived clean, with no lint issues and a test suite that still compiles.
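
The verification pass can be pictured as a fail-fast sequence. The sketch below is not the agent's actual harness; the step type, the runner abstraction, and ErrStepFailed are hypothetical names introduced here. Only the four commands come from the post. By default it performs a dry run so it is safe to execute anywhere.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// ErrStepFailed marks a failed verification step (hypothetical sentinel).
var ErrStepFailed = errors.New("verification step failed")

// step is one command in the verification sequence.
type step struct {
	name string
	args []string
}

// verificationSteps lists the commands in the order the post gives them.
var verificationSteps = []step{
	{"go", []string{"build", "./..."}},
	{"go", []string{"vet", "./..."}},
	{"go", []string{"vet", "-tags=integration", "./..."}},
	{"make", []string{"fmt"}},
}

// runner abstracts command execution so the sequence can be exercised
// without invoking the real toolchain.
type runner func(name string, args ...string) error

// verify runs each step in order and stops at the first failure.
func verify(run runner, steps []step) error {
	for _, s := range steps {
		if err := run(s.name, s.args...); err != nil {
			return fmt.Errorf("%w: %s %s: %v", ErrStepFailed, s.name, strings.Join(s.args, " "), err)
		}
	}
	return nil
}

// execRunner executes the command in the current directory and streams output.
func execRunner(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Dry run: print each command instead of executing it.
	// Pass execRunner to verify to run the real toolchain.
	dryRun := func(name string, args ...string) error {
		fmt.Println(name, strings.Join(args, " "))
		return nil
	}
	if err := verify(dryRun, verificationSteps); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Stopping at the first failing step matches the behavior described here: if any check fails, there is no clean PR to open.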

Zoom out a week and the picture gets more interesting. Across five runs in the last seven days, the agent logged:

  • 35.5 minutes total duration
  • 38.9M effective tokens
  • 38 GitHub Actions minutes
  • 21 turns across all five runs
  • 5 out of 5 high-confidence episodes

Run classification across that window: two normal runs, one risky, one failure, one in-progress. The failure and the risky classification matter as much as the successes. The agent doesn’t always find something safe to remove, and when it can’t complete cleanly, it doesn’t force a PR. That restraint is a feature, not a gap.

Dead code removal is well-suited to an agent for a specific reason: the feedback loop is entirely mechanical. Does it build? Does go vet pass? Does the test suite still run? Those questions have definitive answers. The agent never has to speculate about intent — it just has to be rigorous about verification, which it is.

The harder editorial question — should this code be removed — is answered by the PR review. The agent does the investigation and the grunt work. Engineers do the judgment call. That division feels right.

There’s also something useful about the daily cadence. A function doesn’t become dead overnight. But catching it the morning after the last caller disappears, rather than six months later during a refactor, is the difference between a one-line deletion and an archaeology project.

If you’re curious about how the Dead Code Removal Agent is built, or if you want to run something similar against your own Go codebase, the workflow lives at github/gh-aw. The patterns here — schedule-triggered agents, structured verification steps, PR-as-output — are composable. Start there.

Run #100 was just another Wednesday. That’s the point.

Agent of the Day – May 27, 2026

Every day, 236 agentic workflows run inside the gh-aw repository. Most complete quietly. A few fail in patterns worth tracking. And once a week, one workflow reads the entire fleet, scores it, and writes up what it found. That workflow is the Agent Performance Analyzer, and its run on May 27, 2026 produced the clearest signal in months.

Agent of the Day: Agent Performance Analyzer — Meta-Orchestrator


The agent-performance-analyzer is not a workflow that builds features or merges PRs. Its job is to watch everything else. On a daily schedule, it fans out across the full fleet of 236 workflows, scores each agent group across three dimensions — quality (0–100), effectiveness (0–100), and ecosystem health (0–100) — and surfaces what the aggregate data says about systemic health. Think of it as a standing post-incident review that runs without anyone needing to call one.

Run #26515287616, logged on May 27, ran for 10.7 minutes and processed 12.2 million effective tokens. Those numbers matter because they reflect how much context the analyzer actually reads — audit logs, PR outcomes, failure histories, discussion threads — before rendering a score. This is not a lightweight health check.

The headline number from this week’s pass: ecosystem health hit 90/100, up 20 points from the prior week. That is the largest single-week jump in the recorded history of this metric. It is also a number that demands interpretation, not celebration. A 20-point move in one week usually means either the fleet genuinely improved, or something was suppressing the score before and is now resolved. The weekly Discussion #35220 breaks down the contributing factors: most of the lift came from copilot-swe-agent merge rate recovery, which landed at 67%, up 6 percentage points week-over-week, with 6 merges on May 27 alone. Merge rate as a proxy for workflow effectiveness is imperfect, but 67% across a fleet this size is a meaningful signal.

The top performers bear out that story. Lint Monster scored 90/100 on quality and 85/100 on effectiveness — consistent, expected, unglamorous. copilot-swe-agent followed at 88/100 quality and 84/100 effectiveness. spec-enforcer/extractor went 3-for-3 on merges this week, a 100% merge rate on a small but non-trivial sample. These are the parts of the fleet holding their line.

Quality, though, is flat. 74/100 for the fourth consecutive week. A plateau at week four is no longer noise. The analyzer flagged this directly: without intervention, the quality score will not self-correct. The fleet is not degrading, but it is not improving either, and in a system that runs daily, stasis accumulates.

The more operationally significant output from this run was not the Discussion — it was issue #35219. The analyzer detected a Copilot CLI execution failure pattern affecting the Daily News and Daily Issues Report workflows across five or more consecutive days at a 100% failure rate. A workflow failing once is noise. Failing every day for a week is infrastructure. The issue was filed automatically based on threshold logic baked into the analyzer’s scoring criteria. No human had to notice the pattern.
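
The threshold logic is easy to picture as a streak check. This is a minimal sketch under stated assumptions: the dayResult shape and the shouldFileIssue rule are hypothetical, and the analyzer's real scoring criteria are not shown in the post. The idea is only that an issue is warranted when the most recent N days all ran and all failed.

```go
package main

import "fmt"

// dayResult summarizes one day of runs for a single workflow
// (hypothetical shape for illustration).
type dayResult struct {
	Runs     int
	Failures int
}

// shouldFileIssue reports whether the most recent minDays days each had
// at least one run and a 100% failure rate. history is ordered oldest first.
func shouldFileIssue(history []dayResult, minDays int) bool {
	if minDays <= 0 || len(history) < minDays {
		return false
	}
	for _, d := range history[len(history)-minDays:] {
		if d.Runs == 0 || d.Failures != d.Runs {
			return false // a day with no runs or any success breaks the streak
		}
	}
	return true
}

func main() {
	// One healthy day followed by five days of total failure.
	dailyNews := []dayResult{
		{Runs: 1, Failures: 0},
		{Runs: 1, Failures: 1},
		{Runs: 1, Failures: 1},
		{Runs: 1, Failures: 1},
		{Runs: 1, Failures: 1},
		{Runs: 1, Failures: 1},
	}
	fmt.Println(shouldFileIssue(dailyNews, 5))
}
```

Treating a day with zero runs as a broken streak is a design choice in this sketch: it keeps idle workflows from being reported as failing.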

Three other systemic issues surfaced in Discussion #35220. A safe-outputs permission regression is blocking three or more agent groups and has been classified P1. A CGO/CJS build regression running at a 37% failure rate has now exceeded 90 days without resolution, which is a P0 by any reasonable SLO definition. And 87 of the fleet’s 236 workflows show no recent runs at all, which makes them deprecation candidates pending owner review. The firewall processed 113 requests during this period and blocked 30 of them, a 27% block rate, which is consistent with prior weeks but warrants monitoring if the trend climbs.

The value of a meta-orchestrator is not that it prevents incidents. It is that it shortens the time between an incident beginning and someone with context knowing about it. Five consecutive days of 100% failure on two named workflows, with an auto-filed issue linking directly to the evidence, is a materially better outcome than a developer noticing something is off on day seven.


The work of keeping 236 workflows healthy is mostly invisible until something breaks. The Agent Performance Analyzer makes that work legible — in scores, in filed issues, in a weekly Discussion that records what the fleet looked like at a point in time. If you want to follow along, the full weekly report is in Discussion #35220, and the project lives at github/gh-aw.

Agent of the Day – May 26, 2026

Every morning someone at GitHub opens their laptop and wonders: how well did the coding agents do yesterday? Did they ship? Did they stall? Did they create more work than they saved? These questions used to require manual spelunking through dashboards, cross-referencing merged PRs with author names, and guessing at patterns from vibes alone.

Not anymore.

Agent of the Day: Copilot Agent PR Analysis


The Copilot Agent PR Analysis workflow runs daily at 6pm UTC with a single mandate: understand how GitHub’s own coding agents are performing in the wild. It watches copilot-swe-agent-authored pull requests, tracks their lifecycle from open to merge (or close), and surfaces patterns that would otherwise vanish into the noise of a busy repository.

Run 26415065259 on May 25th tells the story. Six minutes. Nineteen agent turns. Nearly a million tokens processed. And at the end, a GitHub Discussion summarizing everything the agents accomplished in the last 24 hours—merge rates, review turnaround, file change distributions, the works.

Workflow activity chart

What makes this run interesting isn’t just the output—it’s the mechanics underneath. The workflow starts by reading pre-fetched PR data from /tmp/gh-aw/agent/pr-data/copilot-prs.json, a file populated by an earlier step that batches GitHub API calls. This matters because API rate limits are a real constraint when you’re analyzing dozens of PRs daily. By front-loading the data fetch, the Claude Opus 4.7 model can focus on analysis rather than pagination logistics.
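
Working from a pre-fetched file means the analysis step is plain decoding and counting. The sketch below assumes a JSON array with number, state, and merged keys; the real schema of copilot-prs.json is not shown in the post, so treat those names, along with parsePRs and mergeRate, as hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pullRequest holds the fields this sketch needs. The JSON keys are
// assumptions; the actual pre-fetched file may use a different schema.
type pullRequest struct {
	Number int    `json:"number"`
	State  string `json:"state"` // "open" or "closed"
	Merged bool   `json:"merged"`
}

// parsePRs decodes a JSON array of pull requests.
func parsePRs(data []byte) ([]pullRequest, error) {
	var prs []pullRequest
	if err := json.Unmarshal(data, &prs); err != nil {
		return nil, fmt.Errorf("decode PR data: %w", err)
	}
	return prs, nil
}

// mergeRate returns the share of decided (non-open) PRs that were merged,
// along with how many PRs were decided.
func mergeRate(prs []pullRequest) (rate float64, decided int) {
	merged := 0
	for _, pr := range prs {
		if pr.State == "open" {
			continue // still in review, so it counts neither way
		}
		decided++
		if pr.Merged {
			merged++
		}
	}
	if decided == 0 {
		return 0, 0
	}
	return float64(merged) / float64(decided), decided
}

func main() {
	// In the workflow this data would come from the pre-fetched file on disk.
	sample := []byte(`[
		{"number": 1, "state": "closed", "merged": true},
		{"number": 2, "state": "closed", "merged": false},
		{"number": 3, "state": "open", "merged": false}
	]`)
	prs, err := parsePRs(sample)
	if err != nil {
		panic(err)
	}
	rate, decided := mergeRate(prs)
	fmt.Printf("merge rate: %.0f%% of %d decided PRs\n", rate*100, decided)
}
```

Excluding open PRs from the denominator is one possible definition of merge rate; the post does not say which definition the workflow uses.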

From there, the agent orchestrates across 16 different tool types. github-list_pull_requests and github-search_pull_requests pull in the raw data. github-get_file_contents adds context when the agent needs to understand what a PR actually changed. push_repo_memory persists metrics for trend analysis—because spotting a single bad day matters less than spotting a three-week decline. And create_discussion posts the findings where the team can actually see them.

The token economics tell their own story. The run consumed 947,148 tokens, while cache reads accounted for over 3 million effective tokens, a 63% hit rate. That’s not an accident. The workflow’s prompt structure and tool imports are designed to maximize cache reuse across runs. At $1.53 per execution, this is the kind of analysis that would cost considerably more if you rebuilt context from scratch each day.

Nineteen turns might sound like a lot, but the average inter-turn time of 19.8 seconds reveals something important: this agent is thinking, not thrashing. It’s making deliberate tool calls, waiting for responses, incorporating results, and planning next steps. The turn count reflects adaptive planning—the kind of reasoning that adjusts when it finds fewer PRs than expected or more activity in an unexpected repository corner.

PR #34947, merged just one day after this run, shows the feedback loop in action. Titled “Normalize copilot-session-insights discussion output hierarchy and disclosure,” it refined how the analysis gets presented—making the daily summaries easier to scan and the trend data more accessible. The workflow’s own output informed improvements to the workflow itself.

This is what continuous observability looks like for AI systems. Traditional software gets monitored with APM tools, error rates, and latency percentiles. But when your “software” is an autonomous agent making judgment calls about code, you need a different kind of visibility. You need to know: are the agents getting better at writing tests? Are they over-indexing on certain file types? Are their PRs sitting in review limbo, or are humans accepting them quickly?

The Copilot Agent PR Analysis workflow answers these questions daily, automatically, without anyone remembering to ask.


Curious about building workflows that watch your workflows? Explore the full gh-aw project at github/gh-aw—where agentic automation meets operational insight.

Agent of the Day – May 25, 2026

Some days the agent has nothing to report, and that’s exactly the point. I pulled up run 26407385057 this morning — 3.8 minutes, clean sweep. No violations. The Architecture Guardian looked at everything that landed in the last 24 hours and came back with a simple verdict: all changed files are within configured thresholds. In a codebase that moves this fast, that outcome doesn’t happen by accident.

The Architecture Guardian runs every weekday around 14:00 UTC. Its job is unglamorous and essential: scan every .go, .js, .cjs, and .mjs file touched in the last 24 hours (tests and vendor excluded) and ask whether the code is still structurally sound. It’s the kind of review that humans intend to do and quietly skip.

The mechanics are deliberate. A bash pre-step calls git log --since="24 hours ago" to build the file list. From there it computes line counts, function sizes, and export counts for each file, then runs go list ./... to catch import cycles before they calcify. Everything lands in /tmp/gh-aw/agent/arch-metrics.json. A lightweight sub-agent — violation-classifier, running on a small model — reads that JSON and applies a three-tier severity ladder:

  • BLOCKER: files exceeding 1,000 lines or any import cycle
  • WARNING: files over 500 lines or functions over 80 lines
  • INFO: files exporting more than 10 identifiers

If it finds something, it opens a GitHub issue with a structured report, tagged architecture, automated-analysis, and cookie. If not, it calls noop and gets out of the way. There’s also a guard against noise: a shared skip-if-issue-open.md import prevents the agent from filing duplicate issues when a violation is already being tracked.
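
The severity ladder reads naturally as an ordered switch. In the workflow the classification is done by a sub-agent on a small model, so the code below is only a deterministic stand-in for the same rules. The fileMetrics shape and the classify function are hypothetical, and the thresholds are hardcoded from the list above rather than read from configuration.

```go
package main

import "fmt"

type severity string

const (
	none    severity = "NONE"
	info    severity = "INFO"
	warning severity = "WARNING"
	blocker severity = "BLOCKER"
)

// fileMetrics is an assumed shape for one entry in the metrics JSON.
type fileMetrics struct {
	Path            string
	Lines           int
	LongestFunction int // length in lines of the largest function
	Exports         int
	InImportCycle   bool
}

// classify applies the three tiers from most to least severe,
// so a file that trips several rules gets the highest one.
func classify(m fileMetrics) severity {
	switch {
	case m.Lines > 1000 || m.InImportCycle:
		return blocker
	case m.Lines > 500 || m.LongestFunction > 80:
		return warning
	case m.Exports > 10:
		return info
	default:
		return none
	}
}

func main() {
	// Made-up files, purely to exercise each tier.
	files := []fileMetrics{
		{Path: "pkg/a/huge.go", Lines: 1200},
		{Path: "pkg/b/long_func.go", Lines: 300, LongestFunction: 95},
		{Path: "pkg/c/wide_api.go", Lines: 200, Exports: 14},
		{Path: "pkg/d/fine.go", Lines: 120, Exports: 3},
	}
	for _, f := range files {
		fmt.Printf("%-8s %s\n", classify(f), f.Path)
	}
}
```

A run where every file lands in the NONE tier corresponds to the noop outcome described above.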

Workflow activity chart

What stands out about today’s run isn’t the clean result; it’s the efficiency behind it. 121,425 input tokens processed, but 75,961 of those came from cache reads. That’s roughly a 63% cache hit rate, which means the agent isn’t re-reading static context on every run; it’s built to reuse it. Total AI turns: 3. GitHub API calls: 4. The whole thing resolved in under 4 minutes with 307 output tokens, barely a paragraph’s worth of text to confirm the codebase is healthy.
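
For anyone checking the arithmetic, the hit rate is simply cache-read tokens divided by total input tokens. The helper name cacheHitRate is mine; the two numbers are the ones quoted for this run.

```go
package main

import "fmt"

// cacheHitRate returns the fraction of input tokens served from cache.
func cacheHitRate(cacheReadTokens, totalInputTokens int) float64 {
	if totalInputTokens <= 0 {
		return 0
	}
	return float64(cacheReadTokens) / float64(totalInputTokens)
}

func main() {
	rate := cacheHitRate(75961, 121425)
	fmt.Printf("cache hit rate: %.1f%%\n", rate*100) // prints "cache hit rate: 62.6%"
}
```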

That ratio matters. The Architecture Guardian isn’t trying to be clever. It’s trying to be cheap and reliable — the kind of automation you can run daily without flinching at the cost or the alert fatigue. Thresholds live in .architecture.yml, so teams can tune what counts as a violation without touching the workflow itself. The 2-day expiry on issues (via daily-issue-base.md) keeps the tracker clean even when something does slip through.

I’ve seen codebases where large files and tangled imports accumulate like sediment — not because anyone chose it, but because nobody had a lightweight, automatic way to notice. This workflow is that noticing mechanism. It doesn’t replace a thoughtful architecture review. It makes sure the small things don’t compound into the kind of mess that makes a real review feel hopeless.

Today it found nothing. Some days it will. Either way, it showed up.


Explore the full workflow and the rest of the gh-aw suite at github/gh-aw.

Weekly Update – May 25, 2026

It’s been a productive week in github/gh-aw — six pre-releases landed on top of the stable v0.74.8, culminating in v0.75.4 on May 24th. Here’s what shipped.

v0.75.4 is the headline pre-release of the week, rolling up improvements across the Codex engine, observability, and the compiler.

  • Codex harness hardened (#34459): The Codex engine now includes secret diagnostics, missing-key fast-fail, and --json streaming mode. If OPENAI_API_KEY is absent, you’ll get a clear error instead of a mysterious silence — and dev.md has been switched to Codex for a better developer experience.
  • OTel child SDK correlation (#34450): OTEL_RESOURCE_ATTRIBUTES are now injected into gh-aw workflows, so child processes using the OpenTelemetry SDK automatically inherit trace context. End-to-end distributed tracing just got a whole lot more useful.
  • Go 1.26 (#34318): The project has migrated to Go 1.26.
  • Gemini chunked threat-detection parsing (#34509): Gemini’s stream-json responses were sometimes arriving as fragmented chunks, causing detection to report a missing verdict. That’s fixed.
  • Codex default model set to gpt-5.3-codex (#34518): No more empty-string fallback crashes when engine.model is unset for the Codex engine.
  • First-class engine.permission-mode (#34525): Claude’s permission mode (acceptEdits vs bypassPermissions) was previously derived implicitly from bash wildcard detection, which could silently disable --allowed-tools enforcement. You can now set engine.permission-mode explicitly in your workflow frontmatter, giving you a clear, auditable security boundary.
  • add-wizard github.com org fallback for GHE (#34526): Shorthand workflow specs from public sources were resolving on the active GHE host and returning confusing 404s. The resolver now falls back to github.com for org-less shorthands.
  • PR Sous Chef startup crash context (#34524): AWF startup failures were showing up as generic Copilot termination with stdout/stderr: undefined. Failure context is now surfaced correctly.
  • FAQ condensed ~21% (#34488): Verbose multi-paragraph answers have been collapsed into tight, scannable responses. Less scrolling, same information.
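
Two of the Codex fixes above, the missing-key fast-fail and the default model, share a pattern: resolve configuration up front and fail with a clear message. The sketch below illustrates that pattern only. It is not the project's code, and resolveCodexModel, checkAPIKey, and ErrMissingAPIKey are hypothetical names; the model name and the OPENAI_API_KEY variable are the only details taken from the release notes.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// defaultCodexModel is the default named in the release notes.
const defaultCodexModel = "gpt-5.3-codex"

// ErrMissingAPIKey is a hypothetical sentinel for the fast-fail path.
var ErrMissingAPIKey = errors.New("OPENAI_API_KEY is not set")

// resolveCodexModel returns the configured model, or the default
// when the setting is empty or whitespace.
func resolveCodexModel(configured string) string {
	if strings.TrimSpace(configured) == "" {
		return defaultCodexModel
	}
	return configured
}

// checkAPIKey fails fast when the key is absent or empty. The lookup
// function is injected so the check can run without touching the
// real environment; os.LookupEnv has the matching signature.
func checkAPIKey(lookup func(string) (string, bool)) error {
	if v, ok := lookup("OPENAI_API_KEY"); !ok || v == "" {
		return ErrMissingAPIKey
	}
	return nil
}

func main() {
	fmt.Println("model:", resolveCodexModel(""))
	if err := checkAPIKey(os.LookupEnv); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("API key present")
}
```

Checking before the engine starts turns a silent hang into an immediate, readable error.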

The workflow that turns your codebase’s bad habits into laws.

This week linter-miner went on a deep dive through the gh-aw codebase, mining for antipatterns ripe for static analysis enforcement. It zeroed in on the fmt.Fprintln(w, fmt.Sprintf(...)) redundancy — a pattern that allocates an intermediate string, then allocates again to append a newline, when a single fmt.Fprintf call would do the job cleanly. The result: a brand-new fprintlnsprintf linter, complete with a bundle of existing violations for the PR reviewer to clean up. It took 39 turns and 10.8 minutes, burning through over a million tokens with the dedication of an engineer who really cares about unnecessary heap allocations.
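
Here is the antipattern next to its replacement, as a small self-contained sketch. The function names and the message format are made up for illustration; only the fmt calls reflect the pattern described above.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// redundant shows the flagged pattern: Sprintf builds an intermediate
// string that Fprintln then formats again just to add a newline.
func redundant(w io.Writer, name string, n int) {
	fmt.Fprintln(w, fmt.Sprintf("%s: %d violations", name, n))
}

// direct writes the same bytes with a single Fprintf call.
func direct(w io.Writer, name string, n int) {
	fmt.Fprintf(w, "%s: %d violations\n", name, n)
}

// render captures what a writer function emits, for comparison.
func render(f func(io.Writer, string, int), name string, n int) string {
	var buf bytes.Buffer
	f(&buf, name, n)
	return buf.String()
}

func main() {
	// Both forms produce identical output, so the rewrite is safe.
	fmt.Println(render(redundant, "lint", 3) == render(direct, "lint", 3)) // prints "true"
}
```

Because the output is byte-for-byte identical, this is the kind of rewrite a linter can enforce mechanically.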

Notably, it failed twice before nailing it on the third run — apparently even automated linter writers need a couple of drafts before the code compiles.

Usage tip: linter-miner is most valuable right after a refactor or new abstraction lands. That’s when consistent usage patterns (and consistent antipatterns) start to crystallize, and the window to enforce them early is at its widest.

View the workflow on GitHub

Check out v0.75.4 or the stable v0.74.8 — and as always, contributions and feedback are welcome in github/gh-aw.