DataOps
DataOps combines deterministic data extraction with agentic analysis: shell commands in `steps:` reliably collect and prepare data (fast, cacheable, reproducible), then the AI agent reads the results and generates insights. Use this pattern for data aggregation, report generation, trend analysis, and auditing.
The DataOps Pattern
Basic Structure
    ---
    on:
      schedule: daily
      workflow_dispatch:

    steps:
      - name: Collect data
        run: |
          # Deterministic data extraction
          gh api ... > /tmp/gh-aw/data.json

    safe-outputs:
      create-discussion:
        category: "reports"
    ---

    # Analysis Workflow

    Analyze the data at `/tmp/gh-aw/data.json` and create a summary report.

Example: PR Activity Summary
This workflow collects statistics from recent pull requests and generates a weekly summary:
    ---
    name: Weekly PR Summary
    description: Summarizes pull request activity from the last week
    on:
      schedule: weekly
      workflow_dispatch:

    permissions:
      contents: read
      pull-requests: read

    engine: copilot
    strict: true

    network:
      allowed:
        - defaults
        - github

    safe-outputs:
      create-discussion:
        title-prefix: "[weekly-summary] "
        category: "announcements"
        max: 1
        close-older-discussions: true

    tools:
      bash: ["*"]

    steps:
      - name: Fetch recent pull requests
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          mkdir -p /tmp/gh-aw/pr-data

          # Fetch last 100 PRs with key metadata
          gh pr list \
            --repo "${{ github.repository }}" \
            --state all \
            --limit 100 \
            --json number,title,state,author,createdAt,mergedAt,closedAt,additions,deletions,changedFiles,labels \
            > /tmp/gh-aw/pr-data/recent-prs.json

          echo "Fetched $(jq 'length' /tmp/gh-aw/pr-data/recent-prs.json) PRs"

      - name: Compute summary statistics
        run: |
          cd /tmp/gh-aw/pr-data

          # Generate statistics summary
          jq '{
            total: length,
            merged: [.[] | select(.state == "MERGED")] | length,
            open: [.[] | select(.state == "OPEN")] | length,
            closed: [.[] | select(.state == "CLOSED")] | length,
            total_additions: [.[].additions] | add,
            total_deletions: [.[].deletions] | add,
            total_files_changed: [.[].changedFiles] | add,
            authors: [.[].author.login] | unique | length,
            top_authors: ([.[].author.login] | group_by(.) | map({author: .[0], count: length}) | sort_by(-.count) | .[0:5])
          }' recent-prs.json > stats.json

          echo "Statistics computed:"
          cat stats.json

    timeout-minutes: 10
    ---

    # Weekly Pull Request Summary

    Analyze the prepared data:

    - `/tmp/gh-aw/pr-data/recent-prs.json` - Last 100 PRs with full metadata
    - `/tmp/gh-aw/pr-data/stats.json` - Pre-computed statistics

    Create a discussion summarizing: total PRs, merge rate, code changes (+/- lines), top contributors, and any notable trends. Keep it concise and factual.

Data Caching
For workflows that run frequently or process large datasets, use caching to avoid redundant API calls:
    ---
    cache:
      - key: pr-data-${{ github.run_id }}
        path: /tmp/gh-aw/pr-data
        restore-keys: |
          pr-data-

    steps:
      - name: Check cache and fetch only new data
        run: |
          if [ -f /tmp/gh-aw/pr-data/recent-prs.json ]; then
            echo "Using cached data"
          else
            gh pr list --limit 100 --json ... > /tmp/gh-aw/pr-data/recent-prs.json
          fi
    ---

Advanced: Multi-Source Data
Combine data from multiple sources before analysis:
    ---
    steps:
      - name: Fetch PR data
        run: gh pr list --json ... > /tmp/gh-aw/prs.json

      - name: Fetch issue data
        run: gh issue list --json ... > /tmp/gh-aw/issues.json

      - name: Fetch workflow runs
        run: gh run list --json ... > /tmp/gh-aw/runs.json

      - name: Combine into unified dataset
        run: |
          jq -s '{prs: .[0], issues: .[1], runs: .[2]}' \
            /tmp/gh-aw/prs.json \
            /tmp/gh-aw/issues.json \
            /tmp/gh-aw/runs.json \
            > /tmp/gh-aw/combined.json
    ---

    # Repository Health Report

    Analyze the combined data at `/tmp/gh-aw/combined.json` covering:

    - Pull request velocity and review times
    - Issue response rates and resolution times
    - CI/CD success rates and flaky tests

Best Practices
- Keep steps deterministic - Same inputs should produce the same outputs; avoid randomness or time-dependent logic.
- Pre-compute aggregations - Use `jq`, `awk`, or Python to compute statistics upfront, reducing agent token usage.
- Structure data clearly - Output JSON with clear field names; include a summary file alongside raw data.
- Document data locations - Tell the agent where to find the data and what format to expect.
- Use safe outputs - Discussions are ideal for reports (they support threading and reactions).
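To illustrate the pre-compute and summary-file practices, here is a minimal Python sketch that mirrors the `jq` statistics step from the PR Activity Summary example. The function names `compute_stats` and `write_stats` and the `merge_rate` field are assumptions made for this sketch and are not part of the workflow above. The input is assumed to have the shape produced by `gh pr list --json state,author,additions,deletions,changedFiles`.

```python
import json
from collections import Counter
from pathlib import Path


def compute_stats(prs):
    """Summarize a list of PR records (assumed shape: `gh pr list --json ...`)."""
    states = Counter(pr["state"] for pr in prs)
    authors = Counter(pr["author"]["login"] for pr in prs)
    # Sort by descending count, then by login, so ties resolve the same way
    # on every run (keeps the step deterministic).
    top_authors = sorted(authors.items(), key=lambda kv: (-kv[1], kv[0]))[:5]
    merged = states.get("MERGED", 0)
    return {
        "total": len(prs),
        "merged": merged,
        "open": states.get("OPEN", 0),
        "closed": states.get("CLOSED", 0),
        # Hypothetical extra field: saves the agent from deriving it.
        "merge_rate": round(merged / len(prs), 3) if prs else 0.0,
        "total_additions": sum(pr.get("additions", 0) for pr in prs),
        "total_deletions": sum(pr.get("deletions", 0) for pr in prs),
        "total_files_changed": sum(pr.get("changedFiles", 0) for pr in prs),
        "authors": len(authors),
        "top_authors": [{"author": a, "count": c} for a, c in top_authors],
    }


def write_stats(raw_path, stats_path):
    """Read the raw PR JSON and write a summary file alongside it."""
    prs = json.loads(Path(raw_path).read_text())
    stats = compute_stats(prs)
    # sort_keys gives identical output for identical input.
    Path(stats_path).write_text(json.dumps(stats, indent=2, sort_keys=True) + "\n")
    return stats
```

A step could call `write_stats("/tmp/gh-aw/pr-data/recent-prs.json", "/tmp/gh-aw/pr-data/stats.json")` after the fetch step, so the agent reads the small summary first and opens the raw file only when it needs detail.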
Additional Resources
- Steps Reference - Shell step configuration
- Safe Outputs Reference - Validated GitHub operations
- Cache Memory - Caching data between runs
- DailyOps - Scheduled improvement workflows