Mayur Rathi

⭐ 34.1k GitHub stars

Datanalysis-Credit-Risk

Datanalysis-Credit-Risk is a data-focused AI skill that provides a credit risk data cleaning and variable screening pipeline for pre-loan modeling. It helps developers assess raw credit data quality, analyze missing values, and screen variables, automating repetitive cleaning work.

Credit risk data cleaning and variable screening pipeline for pre-loan modeling. Use when working with raw credit data that needs quality assessment, missing value analysis, or variable selection bef

Last verified on: 2026-07-14

Quick Facts

Category: data
Works With: Claude, GitHub Copilot
Source: github/awesome-copilot
Stars: ⭐ 34.1k
Last Verified: 2026-07-14
Risk Level: Low

mkdir -p ./skills/datanalysis-credit-risk && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/datanalysis-credit-risk/SKILL.md -o ./skills/datanalysis-credit-risk/SKILL.md

Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).

Skill Content

# Data Cleaning and Variable Screening

Quick Start

```bash
# Run the complete data cleaning pipeline
python ".github/skills/datanalysis-credit-risk/scripts/example.py"
```

Complete Process Description

The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:

1. **Get Data** - Load and format raw data

2. **Organization Sample Analysis** - Compute the sample count and bad sample rate for each organization

3. **Separate OOS Data** - Separate out-of-sample (OOS) samples from modeling samples

4. **Filter Abnormal Months** - Remove months with insufficient bad sample count or total sample count

5. **Calculate Missing Rate** - Calculate overall and organization-level missing rates for each feature

6. **Drop High Missing Rate Features** - Remove features with overall missing rate exceeding threshold

7. **Drop Low IV Features** - Remove features with overall IV too low or IV too low in too many organizations

8. **Drop High PSI Features** - Remove features with unstable PSI

9. **Null Importance Denoising** - Remove noise features using label permutation method

10. **Drop High Correlation Features** - Remove high correlation features based on original gain

11. **Export Report** - Generate Excel report containing details and statistics of all steps
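The "executed independently without deleting the original data" idea can be sketched as a small step runner. This is a minimal illustration, not the skill's actual code: `run_steps`, the step names, and the toy statistics are all hypothetical. Each step only returns the feature names it would remove, and a log records the outcome of every step so it can feed a report later.

```python
def run_steps(features, steps):
    """Run screening steps in order and log what each one drops.

    `steps` is a list of (name, fn) pairs; each fn receives the surviving
    feature names and returns the names it wants removed. The underlying
    data is never modified; only the list of surviving names shrinks.
    """
    surviving = list(features)
    log = []
    for name, step in steps:
        # Ignore any names a step returns that are not currently surviving.
        to_drop = set(step(surviving)) & set(surviving)
        surviving = [f for f in surviving if f not in to_drop]
        log.append({"step": name, "dropped": sorted(to_drop), "remaining": len(surviving)})
    return surviving, log


# Toy per-feature statistics (made-up numbers, for illustration only).
missing_rate = {"age": 0.02, "income": 0.75, "tenure": 0.10}
iv = {"age": 0.25, "income": 0.30, "tenure": 0.01}

steps = [
    ("drop_high_missing", lambda feats: [f for f in feats if missing_rate[f] > 0.6]),
    ("drop_low_iv", lambda feats: [f for f in feats if iv[f] < 0.1]),
]
surviving, log = run_steps(["age", "income", "tenure"], steps)
```

Keeping the per-step log separate from the data makes it straightforward to write one report row per step.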

Core Functions

| Function | Purpose | Module |
|------|------|----------|
| `get_dataset()` | Load and format data | references.func |
| `org_analysis()` | Organization sample analysis | references.func |
| `missing_check()` | Calculate missing rate | references.func |
| `drop_abnormal_ym()` | Filter abnormal months | references.analysis |
| `drop_highmiss_features()` | Drop high missing rate features | references.analysis |
| `drop_lowiv_features()` | Drop low IV features | references.analysis |
| `drop_highpsi_features()` | Drop high PSI features | references.analysis |
| `drop_highnoise_features()` | Null Importance denoising | references.analysis |
| `drop_highcorr_features()` | Drop high correlation features | references.analysis |
| `iv_distribution_by_org()` | IV distribution statistics | references.analysis |
| `psi_distribution_by_org()` | PSI distribution statistics | references.analysis |
| `value_ratio_distribution_by_org()` | Value ratio distribution statistics | references.analysis |
| `export_cleaning_report()` | Export cleaning report | references.analysis |

Parameter Description

Data Loading Parameters

- `DATA_PATH`: Data file path (Parquet format recommended)

- `DATE_COL`: Date column name

- `Y_COL`: Label column name

- `ORG_COL`: Organization column name

- `KEY_COLS`: Primary key column name list

OOS Organization Configuration

- `OOS_ORGS`: Out-of-sample organization list
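As a rough illustration of how `ORG_COL` and `OOS_ORGS` might be used to separate out-of-sample organizations, here is a minimal pandas sketch. The helper `split_oos`, the column names, and the organization codes are hypothetical and are not taken from the skill's scripts.

```python
import pandas as pd


def split_oos(df, org_col, oos_orgs):
    """Split rows into (modeling, oos) frames by organization membership."""
    is_oos = df[org_col].isin(oos_orgs)
    # Copies are returned so the original frame stays untouched.
    return df[~is_oos].copy(), df[is_oos].copy()


# Hypothetical configuration values; replace with your own names.
ORG_COL = "org"
OOS_ORGS = ["org_c"]

demo = pd.DataFrame({
    "org": ["org_a", "org_b", "org_c", "org_c"],
    "bad_flag": [0, 1, 0, 1],
})
modeling, oos = split_oos(demo, ORG_COL, OOS_ORGS)
```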

Abnormal Month Filtering Parameters

- `min_ym_bad_sample`: Minimum bad sample count per month (default 10)

- `min_ym_sample`: Minimum total sample count per month (default 500)
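The two thresholds above can be read as a per-month check: a month is abnormal if it has too few bad samples or too few samples overall. The sketch below shows one way to express that with pandas. The helper names and column names are hypothetical, the label is assumed to be 1 for bad samples, and "insufficient" is assumed to mean strictly below the minimum. The demo lowers both thresholds so that a tiny frame is enough.

```python
import pandas as pd


def find_abnormal_months(df, date_col, y_col, min_ym_bad_sample=10, min_ym_sample=500):
    """Return the year-month keys that fail either minimum (hypothetical helper)."""
    # Derive a year-month key such as "2024-01" from the date column.
    ym = pd.to_datetime(df[date_col]).dt.strftime("%Y-%m")
    stats = df.groupby(ym)[y_col].agg(total="count", bad="sum")
    abnormal = stats[(stats["bad"] < min_ym_bad_sample) | (stats["total"] < min_ym_sample)]
    return sorted(abnormal.index)


def drop_months(df, date_col, months):
    """Return a filtered copy; the input frame is left as-is."""
    ym = pd.to_datetime(df[date_col]).dt.strftime("%Y-%m")
    return df[~ym.isin(months)].copy()


demo = pd.DataFrame({
    "apply_date": ["2024-01-05"] * 6 + ["2024-02-10"] * 3 + ["2024-03-15"] * 6,
    "bad_flag": [1, 1, 0, 0, 0, 0] + [1, 0, 0] + [0, 0, 0, 0, 0, 0],
})
# February has too few samples; March has no bad samples.
bad_months = find_abnormal_months(demo, "apply_date", "bad_flag",
                                  min_ym_bad_sample=1, min_ym_sample=5)
kept = drop_months(demo, "apply_date", bad_months)
```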

Missing Rate Parameters

- `missing_ratio`: Overall missing rate threshold (default 0.6)
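To make the overall and organization-level missing rates concrete, here is a minimal pandas sketch. The function names and demo columns are hypothetical and do not reflect the signatures of `missing_check()` or `drop_highmiss_features()`. Only the overall rate is compared with `missing_ratio`, matching the step description above.

```python
import pandas as pd


def missing_rates(df, features, org_col):
    """Return (overall, by_org) missing rates for the given feature columns."""
    is_missing = df[features].isna()
    overall = is_missing.mean()                       # Series indexed by feature
    by_org = is_missing.groupby(df[org_col]).mean()   # rows = orgs, cols = features
    return overall, by_org


def high_missing_features(overall, missing_ratio=0.6):
    """Features whose overall missing rate exceeds the threshold."""
    return sorted(overall[overall > missing_ratio].index)


demo = pd.DataFrame({
    "org": ["A", "A", "B", "B", "B"],
    "f1": [1.0, None, None, None, None],   # 4 of 5 values missing
    "f2": [1.0, 2.0, None, 4.0, 5.0],      # 1 of 5 values missing
})
overall, by_org = missing_rates(demo, ["f1", "f2"], "org")
to_drop = high_missing_features(overall)
```

The organization-level table is useful for spotting a feature that looks fine overall but is almost entirely missing for one organization.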

IV Parameters

- `overall_iv_threshold`: Overall IV threshold (default 0.1)

- `org_iv_threshold`: Single organization IV threshold (default 0.1)

- `max_org_threshold`: Maximum tolerated low IV organization count (default 2)
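For readers unfamiliar with IV, the sketch below computes the standard information value, the sum over bins of (bad share minus good share) times the log of their ratio, and then applies one plausible reading of the three parameters above. The quantile binning, the `eps` smoothing, the function names, and the exact drop rule (overall IV below the threshold, or more than `max_org_threshold` organizations with low IV) are assumptions; the skill's own implementation may differ. The feature is assumed to contain no missing values.

```python
import numpy as np


def information_value(x, y, n_bins=5, eps=1e-6):
    """IV of a numeric feature x against a binary label y (1 = bad)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    # Interior quantile cut points; np.unique removes duplicate edges.
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1]))
    bins = np.digitize(x, edges)
    iv = 0.0
    for b in np.unique(bins):
        mask = bins == b
        # Share of all bad / good samples that fall in this bin, floored at eps.
        bad = max(y[mask].sum() / max(y.sum(), 1), eps)
        good = max((1 - y[mask]).sum() / max((1 - y).sum(), 1), eps)
        iv += (bad - good) * np.log(bad / good)
    return float(iv)


def is_low_iv(overall_iv, org_ivs, overall_iv_threshold=0.1,
              org_iv_threshold=0.1, max_org_threshold=2):
    """Assumed rule: low overall IV, or too many organizations with low IV."""
    low_orgs = sum(iv < org_iv_threshold for iv in org_ivs.values())
    return overall_iv < overall_iv_threshold or low_orgs > max_org_threshold


y = np.array([0] * 50 + [1] * 50)
x_strong = np.arange(100)               # separates good from bad almost perfectly
x_weak = np.tile([0, 1, 2, 3, 4], 20)   # same distribution for good and bad

iv_strong = information_value(x_strong, y)
iv_weak = information_value(x_weak, y)
```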

PSI Parameters

- `psi_threshold`: PSI threshold (default 0.1)

- `max_months_ratio`: Maximum unstable month ratio (default 1/3)

- `max_orgs`: Maximum unstable organization count (default 6)
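The PSI between a baseline sample and a later sample is the sum over bins of (actual share minus expected share) times the log of their ratio. The sketch below computes it with NumPy and shows one way `psi_threshold` and `max_months_ratio` could combine. Treating an organization as unstable when the share of months above the PSI threshold exceeds `max_months_ratio` is an assumption about the skill's rule, and the function names are hypothetical. `max_orgs` would then cap how many such organizations a feature may have.

```python
import numpy as np


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index of `actual` relative to `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from the baseline sample only.
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1]))
    n = len(edges) + 1
    e_pct = np.bincount(np.digitize(expected, edges), minlength=n) / len(expected)
    a_pct = np.bincount(np.digitize(actual, edges), minlength=n) / len(actual)
    # Floor empty bins at eps so the logarithm stays finite.
    e_pct = np.clip(e_pct, eps, None)
    a_pct = np.clip(a_pct, eps, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


def is_unstable_org(monthly_psi, psi_threshold=0.1, max_months_ratio=1 / 3):
    """Assumed rule: too large a share of months exceeds the PSI threshold."""
    flags = [value > psi_threshold for value in monthly_psi]
    return sum(flags) / len(flags) > max_months_ratio


baseline = np.arange(100.0)
psi_same = psi(baseline, baseline)          # identical distributions
psi_shifted = psi(baseline, baseline + 50)  # strongly shifted distribution
```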

Null Importance Parameters

- `n_estimators`: Number of trees (default 100)

- `max_depth`: Maximum tree depth (default 5)

- `gain_threshold`: Gain difference threshold (default 50)
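The label permutation idea behind null importance can be sketched without a tree model: measure each feature's importance on the real labels, measure it again on shuffled labels, and treat features whose real importance barely exceeds the shuffled baseline as noise. To keep the example runnable with NumPy alone, it swaps tree gain for absolute Pearson correlation, which is a stand-in and not what the parameters above describe. With a tree model you would pass its gain-based importance as `importance_fn`, where `n_estimators`, `max_depth`, and a gain-scale threshold such as 50 would apply. All names here are hypothetical.

```python
import numpy as np


def abs_corr_importance(X, y):
    """Stand-in importance: |Pearson correlation| of each column with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)


def null_importance_gap(X, y, importance_fn=abs_corr_importance, n_rounds=20, seed=0):
    """Actual importance minus mean importance under permuted labels."""
    rng = np.random.default_rng(seed)
    actual = importance_fn(X, y)
    # Shuffling y breaks any real relationship between features and label.
    null = np.array([importance_fn(X, rng.permutation(y)) for _ in range(n_rounds)])
    return actual - null.mean(axis=0)


rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=500).astype(float)
X = np.column_stack([
    y + 0.1 * rng.normal(size=500),  # informative column
    rng.normal(size=500),            # pure noise column
])
gap = null_importance_gap(X, y)
# Hypothetical threshold on the correlation scale, not the gain scale.
noise_mask = gap < 0.1
```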

High Correlation Parameters

- `max_corr`: Correlation threshold (default 0.9)

- `top_n_keep`: Keep top N features by original gain ranking (default 20)
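One common way to use gain when removing correlated features is a greedy pass: visit features from highest to lowest gain and keep a feature only if it is not too correlated with anything already kept. The sketch below illustrates that approach and is an assumption about the method, not a description of `drop_highcorr_features()`. The function name, demo data, and gain values are hypothetical, and `top_n_keep` is left out because its exact semantics are not spelled out above.

```python
import pandas as pd


def select_uncorrelated(df, gain, max_corr=0.9):
    """Greedy selection: higher-gain features win ties in correlated groups."""
    corr = df[list(gain)].corr().abs()
    kept, dropped = [], []
    for feat in sorted(gain, key=gain.get, reverse=True):
        if any(corr.loc[feat, k] > max_corr for k in kept):
            dropped.append(feat)
        else:
            kept.append(feat)
    return kept, dropped


demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [3, 5, 7, 9, 11, 13],   # exact linear function of "a"
    "c": [3, 1, 4, 1, 5, 9],     # only moderately correlated with "a"
})
gain = {"a": 120.0, "b": 80.0, "c": 40.0}   # made-up gain values
kept, dropped = select_uncorrelated(demo, gain)
```

Because "a" has the higher gain, it is kept and its near-duplicate "b" is dropped, while "c" survives.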

Output Report

The generated Excel report contains the following sheets:

1. **汇总** (Summary) - Summary information of all steps, including operation results and conditions

2. **机构样本统计** (Organization Sample Statistics) - Sample co

🎯 Best For

Claude users
GitHub Copilot users
Data professionals
Analytics teams
Researchers

💡 Use Cases

Data pipeline auditing
Query optimization

📖 How to Use This Skill

1. Install the Skill

Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.

2. Load into Your AI Assistant

Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.

3. Apply Datanalysis-Credit-Risk to Your Work

Provide context for your task: paste source material, describe your audience, or share existing work to guide the AI.

4. Review and Refine

Edit the AI output for accuracy, tone, and completeness. Add human insight where the AI lacks context.

❓ Frequently Asked Questions

How do I install Datanalysis-Credit-Risk?

Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/datanalysis-credit-risk/SKILL.md, ready to use.

Can I customize this skill for my team?

Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.

⚠️ Common Mistakes to Avoid

Ignoring data quality

AI analysis inherits all data quality issues — profile your data first.

🔗 Related Skills

acreadiness-policy: Acreadiness-Policy
adr-generator: ADR Generator
ai-ready: Ai-Ready
azure-pricing: Azure-Pricing
create-architectural-decision-record: Create-Architectural-Decision-Record
create-specification: Create-Specification