Mayur Rathi

⭐ 34.1k GitHub stars

Bigquery-Pipeline-Audit

Bigquery-Pipeline-Audit is an code AI skill with a core value of Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness. It helps developers solve real-world problems in the code domain, boosting efficiency, automating repetitive tasks, and optimizing workflows.

Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness. Returns a structured report with exact patch locations.

Last verified on: 2026-07-14

Quick Facts

Category code

Works With Claude, GitHub Copilot

Source github/awesome-copilot

Stars ⭐ 34.1k

Last Verified 2026-07-14

Risk Level Low

mkdir -p ./skills/bigquery-pipeline-audit && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/bigquery-pipeline-audit/SKILL.md -o ./skills/bigquery-pipeline-audit/SKILL.md

Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).

Skill Content

# BigQuery Pipeline Audit: Cost, Safety and Production Readiness

You are a senior data engineer reviewing a Python + BigQuery pipeline script.

Your goals: catch runaway costs before they happen, ensure reruns do not corrupt

data, and make sure failures are visible.

Analyze the codebase and respond in the structure below (A to F + Final).

Reference exact function names and line locations. Suggest minimal fixes, not

rewrites.

---

A) COST EXPOSURE: What will actually get billed?

Locate every BigQuery job trigger (`client.query`, `load_table_from_*`,

`extract_table`, `copy_table`, DDL/DML via query) and every external call

(APIs, LLM calls, storage writes).

For each, answer:

- Is this inside a loop, retry block, or async gather?

- What is the realistic worst-case call count?

- For each `client.query`, is `QueryJobConfig.maximum_bytes_billed` set?

For load, extract, and copy jobs, is the scope bounded and counted against MAX_JOBS?

- Is the same SQL and params being executed more than once in a single run?

Flag repeated identical queries and suggest query hashing plus temp table caching.

**Flag immediately if:**

- Any BQ query runs once per date or once per entity in a loop

- Worst-case BQ job count exceeds 20

- `maximum_bytes_billed` is missing on any `client.query` call

---

B) DRY RUN AND EXECUTION MODES

Verify a `--mode` flag exists with at least `dry_run` and `execute` options.

- `dry_run` must print the plan and estimated scope with zero billed BQ execution

(BigQuery dry-run estimation via job config is allowed) and zero external API or LLM calls

- `execute` requires explicit confirmation for prod (`--env=prod --confirm`)

- Prod must not be the default environment

If missing, propose a minimal `argparse` patch with safe defaults.

---

C) BACKFILL AND LOOP DESIGN

**Hard fail if:** the script runs one BQ query per date or per entity in a loop.

Check that date-range backfills use one of:

1. A single set-based query with `GENERATE_DATE_ARRAY`

2. A staging table loaded with all dates then one join query

3. Explicit chunks with a hard `MAX_CHUNKS` cap

Also check:

- Is the date range bounded by default (suggest 14 days max without `--override`)?

- If the script crashes mid-run, is it safe to re-run without double-writing?

- For backdated simulations, verify data is read from time-consistent snapshots

(`FOR SYSTEM_TIME AS OF`, partitioned as-of tables, or dated snapshot tables).

Flag any read from a "latest" or unversioned table when running in backdated mode.

Suggest a concrete rewrite if the current approach is row-by-row.

---

D) QUERY SAFETY AND SCAN SIZE

For each query, check:

- **Partition filter** is on the raw column, not `DATE(ts)`, `CAST(...)`, or

any function that prevents pruning

- **No `SELECT *`**: only columns actually used downstream

- **Joins will not explode**: verify join keys are unique or appropriately scoped

and flag any potential many-to-many

- **Expensive operations** (`REGEXP`, `JSON_EXTRACT`, UDFs) only run after

partition filtering, not on full table scans

Provide a specific SQL fix for any query that fails these checks.

---

E) SAFE WRITES AND IDEMPOTENCY

Identify every write operation. Flag plain `INSERT`/append with no dedup logic.

Each write should use one of:

1. `MERGE` on a deterministic key (e.g., `entity_id + date + model_version`)

2. Write to a staging table scoped to the run, then swap or merge into final

3. Append-only with a dedupe view:

`QUALIFY ROW_NUMBER() OVER (PARTITION BY <key>) = 1`

Also check:

- Will a re-run create duplicate rows?

- Is the write disposition (`WRITE_TRUNCATE` vs `WRITE_APPEND`) intentional

and documented?

- Is `run_id` being used as part of the merge or dedupe key? If so, flag it.

`run_id` should be stored as a metadata column, not as part of the uniqueness

key, unless you explicitly want multi-run history.

State the recommended approach and the exact dedup key for this codebase.

---

F) OBSERVABILITY: Ca

🎯 Best For

Claude users
GitHub Copilot users
Software engineers
Development teams
Tech leads

💡 Use Cases

Python code quality enforcement
Dependency management

📖 How to Use This Skill

1
Install the Skill

Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.
2
Load into Your AI Assistant

Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.
3
Apply Bigquery-Pipeline-Audit to Your Work

Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.
4
Review and Refine

Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.

❓ Frequently Asked Questions

Is Bigquery-Pipeline-Audit compatible with Cursor and VS Code?

Yes — this skill works with any AI coding assistant including Cursor, VS Code with Copilot, and JetBrains IDEs.

Do I need specific dependencies for Bigquery-Pipeline-Audit?

Check the install command and Works With section. Most code skills only require the AI assistant and your codebase.

How do I install Bigquery-Pipeline-Audit?

Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/bigquery-pipeline-audit/SKILL.md, ready to use.

Can I customize this skill for my team?

Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.

⚠️ Common Mistakes to Avoid

Skipping validation

Always test AI-generated code changes, even for simple refactors.

Missing dependency updates

Check if the skill requires updated dependencies or new packages.

🔗 Related Skills

apify-integration-expert Apify-Integration-Expert dataverse-python-authentication-security Dataverse-Python-Authentication-Security dataverse-python-best-practices Dataverse-Python-Best-Practices dataverse-python-pandas-integration Dataverse-Python-Pandas-Integration dataverse-python-performance-optimization Dataverse-Python-Performance-Optimization dataverse-python-quickstart Dataverse-Python-Quickstart