# DataOps
DataOps combines deterministic data extraction with agentic analysis. Shell commands in `steps:` collect and prepare data; the AI agent in the markdown body then analyzes the results and produces safe outputs such as discussions or comments.
## When to Use DataOps

- **Data aggregation** - Collect metrics from APIs, logs, or repositories
- **Report generation** - Analyze data and produce human-readable summaries
- **Trend analysis** - Process historical data and identify patterns (see the sketch after this list)
- **Auditing** - Gather evidence and generate audit reports
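For trend analysis in particular, the extraction step can pre-bucket history so the agent only reads a short time series (the general pattern is described in the next section). A minimal sketch; the output path and monthly bucketing are illustrative choices, not gh-aw conventions:

```bash
# Hypothetical trend-analysis extraction: bucket recent workflow runs by month
# so the agent reads a short time series instead of raw run records.
gh run list --limit 200 --json conclusion,createdAt \
  | jq 'group_by(.createdAt[0:7])                  # group by "YYYY-MM" prefix
        | map({
            month: .[0].createdAt[0:7],
            runs: length,
            failures: [.[] | select(.conclusion == "failure")] | length
          })' \
  > /tmp/gh-aw/runs-by-month.json
```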
## The DataOps Pattern

### Separation of Concerns

DataOps separates two distinct phases:
1. **Deterministic extraction** (`steps:`) - Shell commands that reliably fetch, filter, and structure data. These run before the agent and produce predictable, reproducible results.
2. **Agentic analysis** (markdown body) - The AI agent reads the prepared data, interprets patterns, and generates insights. The agent has access to the data files created by the steps.
This separation ensures data collection is fast, reliable, and cacheable, while the AI focuses on interpretation and communication.
### Basic Structure

```markdown
---
on:
  schedule: daily
  workflow_dispatch:

steps:
  - name: Collect data
    run: |
      # Deterministic data extraction
      gh api ... > /tmp/gh-aw/data.json

safe-outputs:
  create-discussion:
    category: "reports"
---

# Analysis Workflow

Analyze the data at `/tmp/gh-aw/data.json` and create a summary report.
```

## Example: PR Activity Summary
This workflow collects statistics from recent pull requests and generates a weekly summary:
````markdown
---
name: Weekly PR Summary
description: Summarizes pull request activity from the last week
on:
  schedule: weekly
  workflow_dispatch:

permissions:
  contents: read
  pull-requests: read

engine: copilot
strict: true

network:
  allowed:
    - defaults
    - github

safe-outputs:
  create-discussion:
    title-prefix: "[weekly-summary] "
    category: "announcements"
    max: 1
    close-older-discussions: true

tools:
  bash: ["*"]

steps:
  - name: Fetch recent pull requests
    env:
      GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    run: |
      mkdir -p /tmp/gh-aw/pr-data

      # Fetch last 100 PRs with key metadata
      gh pr list \
        --repo "${{ github.repository }}" \
        --state all \
        --limit 100 \
        --json number,title,state,author,createdAt,mergedAt,closedAt,additions,deletions,changedFiles,labels \
        > /tmp/gh-aw/pr-data/recent-prs.json

      echo "Fetched $(jq 'length' /tmp/gh-aw/pr-data/recent-prs.json) PRs"

  - name: Compute summary statistics
    run: |
      cd /tmp/gh-aw/pr-data

      # Generate statistics summary
      jq '{
        total: length,
        merged: [.[] | select(.state == "MERGED")] | length,
        open: [.[] | select(.state == "OPEN")] | length,
        closed: [.[] | select(.state == "CLOSED")] | length,
        total_additions: [.[].additions] | add,
        total_deletions: [.[].deletions] | add,
        total_files_changed: [.[].changedFiles] | add,
        authors: [.[].author.login] | unique | length,
        top_authors: ([.[].author.login] | group_by(.) | map({author: .[0], count: length}) | sort_by(-.count) | .[0:5])
      }' recent-prs.json > stats.json

      echo "Statistics computed:"
      cat stats.json

timeout-minutes: 10
---

# Weekly Pull Request Summary

Generate a summary of pull request activity for the repository.

## Available Data

The following data has been prepared for your analysis:

- `/tmp/gh-aw/pr-data/recent-prs.json` - Last 100 PRs with full metadata
- `/tmp/gh-aw/pr-data/stats.json` - Pre-computed statistics

## Your Task

1. **Read the prepared data** from the files above
2. **Analyze the statistics** to identify:
   - Overall activity levels
   - Merge rate and velocity
   - Most active contributors
   - Code churn (additions vs deletions)
3. **Generate a summary report** as a GitHub discussion with:
   - Key metrics in a clear format
   - Notable trends or observations
   - Top contributors acknowledgment

## Report Format

Create a discussion with this structure:

```markdown
# Weekly PR Summary - [Date Range]

## Key Metrics

- **Total PRs**: X
- **Merged**: X (Y%)
- **Open**: X
- **Code Changes**: +X / -Y lines across Z files

## Top Contributors

1. @author1 - X PRs
2. @author2 - Y PRs
...

## Observations

[Brief insights about activity patterns]
```

Keep the report concise and factual. Focus on the numbers and let them tell the story.
````

## Data Caching
For workflows that run frequently or process large datasets, use caching to avoid redundant API calls:
```yaml
---
cache:
  - key: pr-data-${{ github.run_id }}
    path: /tmp/gh-aw/pr-data
    restore-keys: |
      pr-data-

steps:
  - name: Check cache and fetch only new data
    run: |
      if [ -f /tmp/gh-aw/pr-data/recent-prs.json ]; then
        echo "Using cached data"
      else
        gh pr list --limit 100 --json ... > /tmp/gh-aw/pr-data/recent-prs.json
      fi
---
```

## Advanced: Multi-Source Data
Combine data from multiple sources before analysis:
```markdown
---
steps:
  - name: Fetch PR data
    run: gh pr list --json ... > /tmp/gh-aw/prs.json

  - name: Fetch issue data
    run: gh issue list --json ... > /tmp/gh-aw/issues.json

  - name: Fetch workflow runs
    run: gh run list --json ... > /tmp/gh-aw/runs.json

  - name: Combine into unified dataset
    run: |
      jq -s '{prs: .[0], issues: .[1], runs: .[2]}' \
        /tmp/gh-aw/prs.json \
        /tmp/gh-aw/issues.json \
        /tmp/gh-aw/runs.json \
        > /tmp/gh-aw/combined.json
---

# Repository Health Report

Analyze the combined data at `/tmp/gh-aw/combined.json` covering:

- Pull request velocity and review times
- Issue response rates and resolution times
- CI/CD success rates and flaky tests
```

## Best Practices
**Keep steps deterministic** - Avoid randomness or time-dependent logic in steps. The same inputs should produce the same outputs.
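As a rough illustration (the day-granular window, the `--search` filter, and the `window.json` path are assumptions for this sketch, not gh-aw conventions), a step can pin its query window and sort its output so that reruns over the same window produce identical files:

```bash
# Hypothetical deterministic extraction: the query window is pinned to whole
# days and recorded in the output, and results are sorted by PR number, so
# re-running the step over the same window yields identical files.
SINCE=$(date -u -d '7 days ago' +%Y-%m-%d)

gh pr list \
  --repo "${{ github.repository }}" \
  --search "created:>=$SINCE" \
  --state all \
  --json number,title,state,createdAt \
  | jq --arg since "$SINCE" '{since: $since, prs: sort_by(.number)}' \
  > /tmp/gh-aw/pr-data/window.json
```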
**Pre-compute aggregations** - Use jq, awk, or Python in steps to compute statistics. This reduces agent token usage and improves reliability.
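For instance, a step can distill the raw PR dump into a few figures before the agent runs. A minimal sketch, assuming the `recent-prs.json` file produced in the example above; the `aggregates.json` name and the chosen metrics are illustrative:

```bash
# Hypothetical aggregation step: reduce raw PR records to a handful of numbers
# with jq so the agent reads a small summary rather than the full dump.
# Assumes the input file contains at least one PR (avoids division by zero).
jq '{
  merge_rate_pct: (([.[] | select(.state == "MERGED")] | length) * 100 / length | round),
  avg_additions: ([.[].additions] | add / length | round),
  labeled_prs: [.[] | select(.labels | length > 0)] | length
}' /tmp/gh-aw/pr-data/recent-prs.json > /tmp/gh-aw/pr-data/aggregates.json
```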
**Structure data clearly** - Output JSON with clear field names. Include a summary file alongside raw data.

**Document data locations** - Tell the agent exactly where to find the prepared data and what format to expect.

**Use safe outputs** - Always use `safe-outputs` for agent actions. Discussions are ideal for reports since they support threading and reactions.
## Additional Resources

- Steps Reference - Shell step configuration
- Safe Outputs Reference - Validated GitHub operations
- Cache Configuration - Caching data between runs
- DailyOps - Scheduled improvement workflows