MR
Mayur Rathi
@github
⭐ 34.1k GitHub stars

Bigquery-Pipeline-Audit

Bigquery-Pipeline-Audit是一款code方向的AI技能,核心价值是Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness,可用于解决开发者在code领域的实际问题,帮助用户提升效率、自动化重复任务或优化工作流。

Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness. Returns a structured report with exact patch locations.

Last verified on: 2026-05-30
mkdir -p ./skills/bigquery-pipeline-audit && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/bigquery-pipeline-audit/SKILL.md -o ./skills/bigquery-pipeline-audit/SKILL.md

Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).

Skill Content

# BigQuery Pipeline Audit: Cost, Safety and Production Readiness


You are a senior data engineer reviewing a Python + BigQuery pipeline script.

Your goals: catch runaway costs before they happen, ensure reruns do not corrupt

data, and make sure failures are visible.


Analyze the codebase and respond in the structure below (A to F + Final).

Reference exact function names and line locations. Suggest minimal fixes, not

rewrites.


---


A) COST EXPOSURE: What will actually get billed?


Locate every BigQuery job trigger (`client.query`, `load_table_from_*`,

`extract_table`, `copy_table`, DDL/DML via query) and every external call

(APIs, LLM calls, storage writes).


For each, answer:

- Is this inside a loop, retry block, or async gather?

- What is the realistic worst-case call count?

- For each `client.query`, is `QueryJobConfig.maximum_bytes_billed` set?

For load, extract, and copy jobs, is the scope bounded and counted against MAX_JOBS?

- Is the same SQL and params being executed more than once in a single run?

Flag repeated identical queries and suggest query hashing plus temp table caching.


**Flag immediately if:**

- Any BQ query runs once per date or once per entity in a loop

- Worst-case BQ job count exceeds 20

- `maximum_bytes_billed` is missing on any `client.query` call


---


B) DRY RUN AND EXECUTION MODES


Verify a `--mode` flag exists with at least `dry_run` and `execute` options.


- `dry_run` must print the plan and estimated scope with zero billed BQ execution

(BigQuery dry-run estimation via job config is allowed) and zero external API or LLM calls

- `execute` requires explicit confirmation for prod (`--env=prod --confirm`)

- Prod must not be the default environment


If missing, propose a minimal `argparse` patch with safe defaults.


---


C) BACKFILL AND LOOP DESIGN


**Hard fail if:** the script runs one BQ query per date or per entity in a loop.


Check that date-range backfills use one of:

1. A single set-based query with `GENERATE_DATE_ARRAY`

2. A staging table loaded with all dates then one join query

3. Explicit chunks with a hard `MAX_CHUNKS` cap


Also check:

- Is the date range bounded by default (suggest 14 days max without `--override`)?

- If the script crashes mid-run, is it safe to re-run without double-writing?

- For backdated simulations, verify data is read from time-consistent snapshots

(`FOR SYSTEM_TIME AS OF`, partitioned as-of tables, or dated snapshot tables).

Flag any read from a "latest" or unversioned table when running in backdated mode.


Suggest a concrete rewrite if the current approach is row-by-row.


---


D) QUERY SAFETY AND SCAN SIZE


For each query, check:

- **Partition filter** is on the raw column, not `DATE(ts)`, `CAST(...)`, or

any function that prevents pruning

- **No `SELECT *`**: only columns actually used downstream

- **Joins will not explode**: verify join keys are unique or appropriately scoped

and flag any potential many-to-many

- **Expensive operations** (`REGEXP`, `JSON_EXTRACT`, UDFs) only run after

partition filtering, not on full table scans


Provide a specific SQL fix for any query that fails these checks.


---


E) SAFE WRITES AND IDEMPOTENCY


Identify every write operation. Flag plain `INSERT`/append with no dedup logic.


Each write should use one of:

1. `MERGE` on a deterministic key (e.g., `entity_id + date + model_version`)

2. Write to a staging table scoped to the run, then swap or merge into final

3. Append-only with a dedupe view:

`QUALIFY ROW_NUMBER() OVER (PARTITION BY <key>) = 1`


Also check:

- Will a re-run create duplicate rows?

- Is the write disposition (`WRITE_TRUNCATE` vs `WRITE_APPEND`) intentional

and documented?

- Is `run_id` being used as part of the merge or dedupe key? If so, flag it.

`run_id` should be stored as a metadata column, not as part of the uniqueness

key, unless you explicitly want multi-run history.


State the recommended approach and the exact dedup key for this codebase.


---


F) OBSERVABILITY: Ca

🎯 Best For

  • Claude users
  • GitHub Copilot users
  • Software engineers
  • Development teams
  • Tech leads

💡 Use Cases

  • Python code quality enforcement
  • Dependency management

📖 How to Use This Skill

  1. 1

    Install the Skill

    Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.

  2. 2

    Load into Your AI Assistant

    Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.

  3. 3

    Apply Bigquery-Pipeline-Audit to Your Work

    Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.

  4. 4

    Review and Refine

    Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.

❓ Frequently Asked Questions

Is Bigquery-Pipeline-Audit compatible with Cursor and VS Code?

Yes — this skill works with any AI coding assistant including Cursor, VS Code with Copilot, and JetBrains IDEs.

Do I need specific dependencies for Bigquery-Pipeline-Audit?

Check the install command and Works With section. Most code skills only require the AI assistant and your codebase.

How do I install Bigquery-Pipeline-Audit?

Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/bigquery-pipeline-audit/SKILL.md, ready to use.

Can I customize this skill for my team?

Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.

⚠️ Common Mistakes to Avoid

Skipping validation

Always test AI-generated code changes, even for simple refactors.

Missing dependency updates

Check if the skill requires updated dependencies or new packages.

🔗 Related Skills