Mayur Rathi
@github
⭐ 34.1k GitHub stars

Datanalysis-Credit-Risk

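Datanalysis-Credit-Risk is an AI skill for data work. Its core value is a credit risk data cleaning and variable screening pipeline for pre-loan modeling. It addresses practical problems developers face in the data domain and helps users work more efficiently, automate repetitive tasks, and streamline workflows.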

Credit risk data cleaning and variable screening pipeline for pre-loan modeling. Use when working with raw credit data that needs quality assessment, missing value analysis, or variable selection bef

Last verified on: 2026-05-30
mkdir -p ./skills/datanalysis-credit-risk && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/datanalysis-credit-risk/SKILL.md -o ./skills/datanalysis-credit-risk/SKILL.md

Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).

Skill Content

# Data Cleaning and Variable Screening


Quick Start


```bash
# Run the complete data cleaning pipeline
python ".github/skills/datanalysis-credit-risk/scripts/example.py"
```

Complete Process Description


The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:


1. **Get Data** - Load and format raw data

2. **Organization Sample Analysis** - Statistics of sample count and bad sample rate for each organization

3. **Separate OOS Data** - Separate out-of-sample (OOS) samples from modeling samples

4. **Filter Abnormal Months** - Remove months with insufficient bad sample count or total sample count

5. **Calculate Missing Rate** - Calculate overall and organization-level missing rates for each feature

6. **Drop High Missing Rate Features** - Remove features with overall missing rate exceeding threshold

7. **Drop Low IV Features** - Remove features with overall IV too low or IV too low in too many organizations

8. **Drop High PSI Features** - Remove features with unstable PSI

9. **Null Importance Denoising** - Remove noise features using label permutation method

10. **Drop High Correlation Features** - Remove high correlation features based on original gain

11. **Export Report** - Generate Excel report containing details and statistics of all steps


Core Functions


| Function | Purpose | Module |
|------|------|----------|
| `get_dataset()` | Load and format data | references.func |
| `org_analysis()` | Organization sample analysis | references.func |
| `missing_check()` | Calculate missing rate | references.func |
| `drop_abnormal_ym()` | Filter abnormal months | references.analysis |
| `drop_highmiss_features()` | Drop high missing rate features | references.analysis |
| `drop_lowiv_features()` | Drop low IV features | references.analysis |
| `drop_highpsi_features()` | Drop high PSI features | references.analysis |
| `drop_highnoise_features()` | Null Importance denoising | references.analysis |
| `drop_highcorr_features()` | Drop high correlation features | references.analysis |
| `iv_distribution_by_org()` | IV distribution statistics | references.analysis |
| `psi_distribution_by_org()` | PSI distribution statistics | references.analysis |
| `value_ratio_distribution_by_org()` | Value ratio distribution statistics | references.analysis |
| `export_cleaning_report()` | Export cleaning report | references.analysis |


Parameter Description


Data Loading Parameters

- `DATA_PATH`: Data file path (Parquet format is preferred)

- `DATE_COL`: Date column name

- `Y_COL`: Label column name

- `ORG_COL`: Organization column name

- `KEY_COLS`: Primary key column name list


OOS Organization Configuration

- `OOS_ORGS`: Out-of-sample organization list
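As a rough illustration of what separating OOS organizations means, the sketch below splits a list of records by organization. It is plain Python with no dependencies and is not the skill's own code: the `split_oos` helper, the `"org"` column name, and the sample rows are all hypothetical.

```python
# Hypothetical stand-ins for ORG_COL and OOS_ORGS; real values depend on your data.
ORG_COL = "org"
OOS_ORGS = ["org_c"]

rows = [
    {"org": "org_a", "y": 0},
    {"org": "org_b", "y": 1},
    {"org": "org_c", "y": 0},
    {"org": "org_a", "y": 1},
]

def split_oos(rows, org_col, oos_orgs):
    """Return (modeling_rows, oos_rows) without mutating the input list."""
    oos = set(oos_orgs)
    modeling = [r for r in rows if r[org_col] not in oos]
    held_out = [r for r in rows if r[org_col] in oos]
    return modeling, held_out

modeling_rows, oos_rows = split_oos(rows, ORG_COL, OOS_ORGS)
print(len(modeling_rows), len(oos_rows))
```

The original `rows` list is left intact, in keeping with the pipeline's stated principle of not deleting the original data.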


Abnormal Month Filtering Parameters

- `min_ym_bad_sample`: Minimum bad sample count per month (default 10)

- `min_ym_sample`: Minimum total sample count per month (default 500)
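The two thresholds above can be read as a per-month check. The following is a minimal dependency-free sketch, not the implementation of `drop_abnormal_ym()`: it assumes dates are ISO strings such as `2024-01-15`, that the label is 1 for a bad sample, and that a month is abnormal when either count falls strictly below its minimum. The `find_abnormal_months` helper and the synthetic rows are hypothetical.

```python
from collections import defaultdict

def year_month(date_str):
    """Turn "2024-01-15" into "202401" (assumes ISO-formatted date strings)."""
    return date_str[:7].replace("-", "")

def find_abnormal_months(rows, date_col, y_col,
                         min_ym_bad_sample=10, min_ym_sample=500):
    """Return the set of YYYYMM strings that fail either minimum."""
    total = defaultdict(int)
    bad = defaultdict(int)
    for r in rows:
        ym = year_month(r[date_col])
        total[ym] += 1
        bad[ym] += r[y_col]
    return {ym for ym in total
            if total[ym] < min_ym_sample or bad[ym] < min_ym_bad_sample}

# Synthetic data: January passes, February has too few bad samples,
# March has too few samples overall.
rows = (
    [{"date": "2024-01-10", "y": 1}] * 20 + [{"date": "2024-01-10", "y": 0}] * 580
    + [{"date": "2024-02-10", "y": 1}] * 5 + [{"date": "2024-02-10", "y": 0}] * 595
    + [{"date": "2024-03-10", "y": 1}] * 15 + [{"date": "2024-03-10", "y": 0}] * 85
)

abnormal = find_abnormal_months(rows, "date", "y")
kept = [r for r in rows if year_month(r["date"]) not in abnormal]
print(sorted(abnormal), len(kept))
```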


Missing Rate Parameters

- `missing_ratio`: Overall missing rate threshold (default 0.6)
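To make the threshold concrete, here is a small sketch that computes overall and organization-level missing rates and flags features above `missing_ratio`. It is an assumption-laden illustration rather than the code behind `missing_check()` or `drop_highmiss_features()`: missing values are represented as `None`, the comparison is assumed to be strictly greater than the threshold, and the helper names and sample rows are hypothetical.

```python
def missing_rates(rows, features):
    """Fraction of rows in which each feature is None."""
    n = len(rows)
    return {f: sum(r.get(f) is None for r in rows) / n for f in features}

def high_missing(rates, missing_ratio=0.6):
    """Features whose missing rate exceeds the threshold (assumed strict >)."""
    return sorted(f for f, rate in rates.items() if rate > missing_ratio)

rows = [
    {"org": "a", "f1": 1.0, "f2": None},
    {"org": "a", "f1": None, "f2": None},
    {"org": "b", "f1": 3.0, "f2": None},
    {"org": "b", "f1": 4.0, "f2": 2.0},
]
features = ["f1", "f2"]

overall = missing_rates(rows, features)
by_org = {
    org: missing_rates([r for r in rows if r["org"] == org], features)
    for org in ("a", "b")
}
to_drop = high_missing(overall)
print(overall, to_drop)
```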


IV Parameters

- `overall_iv_threshold`: Overall IV threshold (default 0.1)

- `org_iv_threshold`: Single organization IV threshold (default 0.1)

- `max_org_threshold`: Maximum tolerated low IV organization count (default 2)
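For readers unfamiliar with IV (information value), the sketch below computes it for a feature that has already been binned, then applies one plausible reading of the three parameters above. This is an assumption, not the logic of `drop_lowiv_features()`: the binning step is omitted, the label is assumed to be 1 for bad, a small `eps` guards against empty bins, and the drop rule (overall IV below threshold, or more than `max_org_threshold` organizations with low IV) is a guess at the intended semantics. Both helper names are hypothetical.

```python
import math

def information_value(bins, labels, eps=1e-6):
    """IV of a pre-binned feature against a 0/1 label (1 = bad)."""
    good, bad = {}, {}
    for b, y in zip(bins, labels):
        target = bad if y == 1 else good
        target[b] = target.get(b, 0) + 1
    n_good, n_bad = sum(good.values()), sum(bad.values())
    if n_good == 0 or n_bad == 0:
        return 0.0  # IV is undefined without both classes; treat as uninformative
    iv = 0.0
    for b in set(good) | set(bad):
        pg = max(good.get(b, 0) / n_good, eps)  # share of good samples in this bin
        pb = max(bad.get(b, 0) / n_bad, eps)    # share of bad samples in this bin
        iv += (pg - pb) * math.log(pg / pb)
    return iv

def should_drop_low_iv(overall_iv, org_ivs, overall_iv_threshold=0.1,
                       org_iv_threshold=0.1, max_org_threshold=2):
    """Assumed rule: drop on low overall IV or too many low-IV organizations."""
    low_orgs = sum(iv < org_iv_threshold for iv in org_ivs.values())
    return overall_iv < overall_iv_threshold or low_orgs > max_org_threshold

# "low" bin is mostly good, "high" bin is mostly bad.
bins = ["low"] * 50 + ["high"] * 50
labels = [0] * 40 + [1] * 10 + [0] * 10 + [1] * 40
iv = information_value(bins, labels)
print(round(iv, 4))
```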


PSI Parameters

- `psi_threshold`: PSI threshold (default 0.1)

- `max_months_ratio`: Maximum unstable month ratio (default 1/3)

- `max_orgs`: Maximum unstable organization count (default 6)
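PSI (population stability index) compares a feature's bin distribution in one period against a baseline. The sketch below shows the standard formula on pre-binned values and one possible reading of `psi_threshold` and `max_months_ratio`. It is not the code of `drop_highpsi_features()`: the choice of baseline month, the strict comparisons, and the helper names `psi` and `unstable_month_ratio` are assumptions, and the `max_orgs` rule is left out because the source does not spell it out.

```python
import math
from collections import Counter

def psi(expected_bins, actual_bins, eps=1e-6):
    """Population stability index between two samples of bin labels."""
    e_cnt, a_cnt = Counter(expected_bins), Counter(actual_bins)
    e_n, a_n = len(expected_bins), len(actual_bins)
    total = 0.0
    for b in set(e_cnt) | set(a_cnt):
        e = max(e_cnt[b] / e_n, eps)  # baseline share of this bin
        a = max(a_cnt[b] / a_n, eps)  # comparison share of this bin
        total += (a - e) * math.log(a / e)
    return total

def unstable_month_ratio(monthly_psi, psi_threshold=0.1):
    """Share of months whose PSI exceeds the threshold."""
    return sum(v > psi_threshold for v in monthly_psi.values()) / len(monthly_psi)

base = ["a"] * 50 + ["b"] * 50
shifted = ["a"] * 80 + ["b"] * 20
shift_psi = psi(base, shifted)

# Hypothetical per-month PSI values for one feature in one organization.
monthly = {"202402": 0.02, "202403": 0.15, "202404": 0.22, "202405": 0.03}
ratio = unstable_month_ratio(monthly)
is_unstable = ratio > 1 / 3  # compare against max_months_ratio
print(round(shift_psi, 4), ratio, is_unstable)
```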


Null Importance Parameters

- `n_estimators`: Number of trees (default 100)

- `max_depth`: Maximum tree depth (default 5)

- `gain_threshold`: Gain difference threshold (default 50)
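The label permutation idea behind null importance can be shown without a tree model. The sketch below swaps in a much simpler importance score (the absolute gap between the feature's mean in each class) in place of tree gain, so `n_estimators`, `max_depth`, and the default `gain_threshold` of 50 do not carry over. It is an illustration of the technique only, not the behavior of `drop_highnoise_features()`; `importance` and `null_importance_gap` are hypothetical names.

```python
import random
from statistics import mean

def importance(values, labels):
    """Stand-in score: absolute gap between the feature's mean in each class."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return abs(mean(pos) - mean(neg))

def null_importance_gap(values, labels, n_runs=50, seed=0):
    """Actual score minus the mean score obtained with shuffled labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null_scores = []
    for _ in range(n_runs):
        rng.shuffle(shuffled)  # break any real link between feature and label
        null_scores.append(importance(values, shuffled))
    return importance(values, labels) - mean(null_scores)

rng = random.Random(42)
labels = [i % 2 for i in range(200)]
signal = [y * 2.0 + rng.gauss(0, 1) for y in labels]  # depends on the label
noise = [rng.gauss(0, 1) for _ in labels]             # independent of the label

signal_gap = null_importance_gap(signal, labels)
noise_gap = null_importance_gap(noise, labels)
print(signal_gap > noise_gap)
```

A feature whose gap stays near zero scores about as well against random labels as against the real ones, which is the signal that it is likely noise.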


High Correlation Parameters

- `max_corr`: Correlation threshold (default 0.9)

- `top_n_keep`: Keep top N features by original gain ranking (default 20)
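One common way to use a gain ranking when pruning correlated features is a greedy pass: visit features from highest to lowest gain and keep each one only if it is not too correlated with anything already kept. The sketch below assumes that approach with Pearson correlation. It may differ from what `drop_highcorr_features()` does, and it leaves out `top_n_keep` because the source does not describe how that parameter interacts with the correlation check. The helper names, columns, and gain values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(vx * vy)

def drop_correlated(columns, gain, max_corr=0.9):
    """Greedy pass in descending gain order; returns (kept, dropped)."""
    kept, dropped = [], []
    for name in sorted(columns, key=lambda f: gain[f], reverse=True):
        if any(abs(pearson(columns[name], columns[k])) > max_corr for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped

columns = {
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "income_x2": [2.0, 4.1, 6.0, 8.2, 10.0],  # nearly a multiple of income
    "age": [30.0, 22.0, 41.0, 35.0, 28.0],
}
gain = {"income": 120.0, "income_x2": 80.0, "age": 60.0}

kept, dropped = drop_correlated(columns, gain)
print(kept, dropped)
```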


Output Report


The generated Excel report contains the following sheets:


1. **汇总** (Summary) - Summary information of all steps, including operation results and conditions

2. **机构样本统计** (Organization Sample Statistics) - Sample co

🎯 Best For

  • Claude users
  • GitHub Copilot users
  • Data professionals
  • Analytics teams
  • Researchers

💡 Use Cases

  • Data pipeline auditing
  • Query optimization

📖 How to Use This Skill

  1. Install the Skill

    Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.

  2. Load into Your AI Assistant

    Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.

  3. Apply Datanalysis-Credit-Risk to Your Work

    Provide context for your task: paste source material, describe your audience, or share existing work to guide the AI.

  4. Review and Refine

    Edit the AI output for accuracy, tone, and completeness. Add human insight where the AI lacks context.

❓ Frequently Asked Questions

How do I install Datanalysis-Credit-Risk?

Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/datanalysis-credit-risk/SKILL.md, ready to use.

Can I customize this skill for my team?

Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.

⚠️ Common Mistakes to Avoid

Ignoring data quality

AI analysis inherits all data quality issues — profile your data first.
