Bigquery-Pipeline-Audit
Bigquery-Pipeline-Audit是一款code方向的AI技能,核心价值是Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness,可用于解决开发者在code领域的实际问题,帮助用户提升效率、自动化重复任务或优化工作流。
Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness. Returns a structured report with exact patch locations.
mkdir -p ./skills/bigquery-pipeline-audit && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/bigquery-pipeline-audit/SKILL.md -o ./skills/bigquery-pipeline-audit/SKILL.md Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).
Skill Content
# BigQuery Pipeline Audit: Cost, Safety and Production Readiness
You are a senior data engineer reviewing a Python + BigQuery pipeline script.
Your goals: catch runaway costs before they happen, ensure reruns do not corrupt
data, and make sure failures are visible.
Analyze the codebase and respond in the structure below (A to F + Final).
Reference exact function names and line locations. Suggest minimal fixes, not
rewrites.
---
A) COST EXPOSURE: What will actually get billed?
Locate every BigQuery job trigger (`client.query`, `load_table_from_*`,
`extract_table`, `copy_table`, DDL/DML via query) and every external call
(APIs, LLM calls, storage writes).
For each, answer:
- Is this inside a loop, retry block, or async gather?
- What is the realistic worst-case call count?
- For each `client.query`, is `QueryJobConfig.maximum_bytes_billed` set?
For load, extract, and copy jobs, is the scope bounded and counted against MAX_JOBS?
- Is the same SQL and params being executed more than once in a single run?
Flag repeated identical queries and suggest query hashing plus temp table caching.
**Flag immediately if:**
- Any BQ query runs once per date or once per entity in a loop
- Worst-case BQ job count exceeds 20
- `maximum_bytes_billed` is missing on any `client.query` call
---
B) DRY RUN AND EXECUTION MODES
Verify a `--mode` flag exists with at least `dry_run` and `execute` options.
- `dry_run` must print the plan and estimated scope with zero billed BQ execution
(BigQuery dry-run estimation via job config is allowed) and zero external API or LLM calls
- `execute` requires explicit confirmation for prod (`--env=prod --confirm`)
- Prod must not be the default environment
If missing, propose a minimal `argparse` patch with safe defaults.
---
C) BACKFILL AND LOOP DESIGN
**Hard fail if:** the script runs one BQ query per date or per entity in a loop.
Check that date-range backfills use one of:
1. A single set-based query with `GENERATE_DATE_ARRAY`
2. A staging table loaded with all dates then one join query
3. Explicit chunks with a hard `MAX_CHUNKS` cap
Also check:
- Is the date range bounded by default (suggest 14 days max without `--override`)?
- If the script crashes mid-run, is it safe to re-run without double-writing?
- For backdated simulations, verify data is read from time-consistent snapshots
(`FOR SYSTEM_TIME AS OF`, partitioned as-of tables, or dated snapshot tables).
Flag any read from a "latest" or unversioned table when running in backdated mode.
Suggest a concrete rewrite if the current approach is row-by-row.
---
D) QUERY SAFETY AND SCAN SIZE
For each query, check:
- **Partition filter** is on the raw column, not `DATE(ts)`, `CAST(...)`, or
any function that prevents pruning
- **No `SELECT *`**: only columns actually used downstream
- **Joins will not explode**: verify join keys are unique or appropriately scoped
and flag any potential many-to-many
- **Expensive operations** (`REGEXP`, `JSON_EXTRACT`, UDFs) only run after
partition filtering, not on full table scans
Provide a specific SQL fix for any query that fails these checks.
---
E) SAFE WRITES AND IDEMPOTENCY
Identify every write operation. Flag plain `INSERT`/append with no dedup logic.
Each write should use one of:
1. `MERGE` on a deterministic key (e.g., `entity_id + date + model_version`)
2. Write to a staging table scoped to the run, then swap or merge into final
3. Append-only with a dedupe view:
`QUALIFY ROW_NUMBER() OVER (PARTITION BY <key>) = 1`
Also check:
- Will a re-run create duplicate rows?
- Is the write disposition (`WRITE_TRUNCATE` vs `WRITE_APPEND`) intentional
and documented?
- Is `run_id` being used as part of the merge or dedupe key? If so, flag it.
`run_id` should be stored as a metadata column, not as part of the uniqueness
key, unless you explicitly want multi-run history.
State the recommended approach and the exact dedup key for this codebase.
---
F) OBSERVABILITY: Ca
🎯 Best For
- Claude users
- GitHub Copilot users
- Software engineers
- Development teams
- Tech leads
💡 Use Cases
- Python code quality enforcement
- Dependency management
📖 How to Use This Skill
- 1
Install the Skill
Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.
- 2
Load into Your AI Assistant
Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.
- 3
Apply Bigquery-Pipeline-Audit to Your Work
Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.
- 4
Review and Refine
Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.
❓ Frequently Asked Questions
Is Bigquery-Pipeline-Audit compatible with Cursor and VS Code?
Yes — this skill works with any AI coding assistant including Cursor, VS Code with Copilot, and JetBrains IDEs.
Do I need specific dependencies for Bigquery-Pipeline-Audit?
Check the install command and Works With section. Most code skills only require the AI assistant and your codebase.
How do I install Bigquery-Pipeline-Audit?
Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/bigquery-pipeline-audit/SKILL.md, ready to use.
Can I customize this skill for my team?
Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.
⚠️ Common Mistakes to Avoid
Skipping validation
Always test AI-generated code changes, even for simple refactors.
Missing dependency updates
Check if the skill requires updated dependencies or new packages.