Datanalysis-Credit-Risk
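Datanalysis-Credit-Risk is an AI skill for data work. It provides a credit risk data cleaning and variable screening pipeline for pre-loan modeling, and is meant to help developers solve practical data problems, work more efficiently, automate repetitive tasks, and streamline their workflows.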
Credit risk data cleaning and variable screening pipeline for pre-loan modeling. Use when working with raw credit data that needs quality assessment, missing value analysis, or variable selection bef
mkdir -p ./skills/datanalysis-credit-risk && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/datanalysis-credit-risk/SKILL.md -o ./skills/datanalysis-credit-risk/SKILL.md
Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).
Skill Content
# Data Cleaning and Variable Screening
Quick Start
# Run the complete data cleaning pipeline
python ".github/skills/datanalysis-credit-risk/scripts/example.py"
Complete Process Description
The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:
1. **Get Data** - Load and format raw data
2. **Organization Sample Analysis** - Statistics of sample count and bad sample rate for each organization
3. **Separate OOS Data** - Separate out-of-sample (OOS) samples from modeling samples
4. **Filter Abnormal Months** - Remove months with insufficient bad sample count or total sample count
5. **Calculate Missing Rate** - Calculate overall and organization-level missing rates for each feature
6. **Drop High Missing Rate Features** - Remove features with overall missing rate exceeding threshold
7. **Drop Low IV Features** - Remove features whose overall Information Value (IV) is too low, or whose IV is too low in too many organizations
8. **Drop High PSI Features** - Remove features whose Population Stability Index (PSI) indicates an unstable distribution
9. **Null Importance Denoising** - Remove noise features using label permutation method
10. **Drop High Correlation Features** - Remove high correlation features based on original gain
11. **Export Report** - Generate Excel report containing details and statistics of all steps
Core Functions
| Function | Purpose | Module |
|------|------|----------|
| `get_dataset()` | Load and format data | references.func |
| `org_analysis()` | Organization sample analysis | references.func |
| `missing_check()` | Calculate missing rate | references.func |
| `drop_abnormal_ym()` | Filter abnormal months | references.analysis |
| `drop_highmiss_features()` | Drop high missing rate features | references.analysis |
| `drop_lowiv_features()` | Drop low IV features | references.analysis |
| `drop_highpsi_features()` | Drop high PSI features | references.analysis |
| `drop_highnoise_features()` | Null Importance denoising | references.analysis |
| `drop_highcorr_features()` | Drop high correlation features | references.analysis |
| `iv_distribution_by_org()` | IV distribution statistics | references.analysis |
| `psi_distribution_by_org()` | PSI distribution statistics | references.analysis |
| `value_ratio_distribution_by_org()` | Value ratio distribution statistics | references.analysis |
| `export_cleaning_report()` | Export cleaning report | references.analysis |
Parameter Description
Data Loading Parameters
- `DATA_PATH`: Data file path (Parquet format is recommended)
- `DATE_COL`: Date column name
- `Y_COL`: Label column name
- `ORG_COL`: Organization column name
- `KEY_COLS`: Primary key column name list
OOS Organization Configuration
- `OOS_ORGS`: Out-of-sample organization list
Abnormal Month Filtering Parameters
- `min_ym_bad_sample`: Minimum bad sample count per month (default 10)
- `min_ym_sample`: Minimum total sample count per month (default 500)
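The two thresholds above can be illustrated with a minimal sketch. It is not the skill's `drop_abnormal_ym()` implementation: the function name `find_abnormal_months`, the `(year_month, label)` row layout, and the use of strict `<` comparisons are all assumptions made for illustration.

```python
from collections import defaultdict


def find_abnormal_months(rows, min_ym_bad_sample=10, min_ym_sample=500):
    """Return the sorted months that fail either minimum.

    rows: iterable of (year_month, label) pairs, where label 1 = bad sample.
    Hypothetical helper; the comparison direction (< vs <=) is an assumption.
    """
    total = defaultdict(int)
    bad = defaultdict(int)
    for ym, label in rows:
        total[ym] += 1
        bad[ym] += int(label == 1)
    return sorted(
        ym for ym in total
        if bad[ym] < min_ym_bad_sample or total[ym] < min_ym_sample
    )


# Toy data: one healthy month, one with too few bad samples, one too small.
rows = (
    [("2024-01", 1)] * 12 + [("2024-01", 0)] * 600
    + [("2024-02", 1)] * 3 + [("2024-02", 0)] * 700
    + [("2024-03", 1)] * 15 + [("2024-03", 0)] * 100
)

abnormal = find_abnormal_months(rows)
abnormal_set = set(abnormal)
# Build a filtered copy; the original rows are left untouched.
kept = [row for row in rows if row[0] not in abnormal_set]
print(abnormal, len(kept))
```

Filtering into a new list rather than mutating `rows` mirrors the pipeline's stated rule that steps do not delete the original data.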
Missing Rate Parameters
- `missing_ratio`: Overall missing rate threshold (default 0.6)
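As a rough sketch of how an overall and an organization-level missing rate could be computed and compared against `missing_ratio`: the helper names, the list-of-dicts record layout, and treating only `None` as missing are assumptions here, not the behavior of `missing_check()` or `drop_highmiss_features()`.

```python
def missing_rates(records, features):
    """Fraction of records in which each feature is None (assumed missing marker)."""
    n = len(records)
    return {f: sum(r.get(f) is None for r in records) / n for f in features}


def high_missing_features(records, features, missing_ratio=0.6):
    """Features whose overall missing rate exceeds the threshold (strict >, assumed)."""
    rates = missing_rates(records, features)
    return [f for f in features if rates[f] > missing_ratio]


# Toy records with a hypothetical organization column named "org".
records = [
    {"org": "A", "f1": 1.0, "f2": None},
    {"org": "A", "f1": None, "f2": None},
    {"org": "B", "f1": 3.0, "f2": None},
    {"org": "B", "f1": 4.0, "f2": 0.5},
]
features = ["f1", "f2"]

overall = missing_rates(records, features)
by_org = {
    org: missing_rates([r for r in records if r["org"] == org], features)
    for org in ("A", "B")
}
to_drop = high_missing_features(records, features)
print(overall, by_org, to_drop)
```

In this toy data `f2` is missing in 3 of 4 records (0.75), which exceeds the default 0.6, while `f1` (0.25) stays.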
IV Parameters
- `overall_iv_threshold`: Overall IV threshold (default 0.1)
- `org_iv_threshold`: Single organization IV threshold (default 0.1)
- `max_org_threshold`: Maximum tolerated low IV organization count (default 2)
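For readers unfamiliar with IV, the following sketch computes Information Value for a pre-binned feature and applies a drop rule built from the three parameters above. The function names, the additive smoothing constant `eps`, and the reading of `max_org_threshold` as "drop when strictly more than this many organizations fall below `org_iv_threshold`" are assumptions; the skill's `drop_lowiv_features()` may bin and decide differently.

```python
import math


def information_value(bins, labels, eps=0.5):
    """IV of a pre-binned feature against a binary label (1 = bad, 0 = good).

    Uses additive smoothing (eps per cell) so empty cells do not produce log(0).
    """
    good, bad = {}, {}
    for b, y in zip(bins, labels):
        target = bad if y == 1 else good
        target[b] = target.get(b, 0) + 1
    all_bins = set(good) | set(bad)
    good_total = sum(good.values()) + eps * len(all_bins)
    bad_total = sum(bad.values()) + eps * len(all_bins)
    iv = 0.0
    for b in all_bins:
        g = (good.get(b, 0) + eps) / good_total  # share of goods in bin b
        d = (bad.get(b, 0) + eps) / bad_total    # share of bads in bin b
        iv += (g - d) * math.log(g / d)
    return iv


def should_drop_low_iv(overall_iv, org_ivs, overall_iv_threshold=0.1,
                       org_iv_threshold=0.1, max_org_threshold=2):
    """Assumed rule: drop if overall IV is low, or too many orgs have low IV."""
    low_orgs = sum(iv < org_iv_threshold for iv in org_ivs.values())
    return overall_iv < overall_iv_threshold or low_orgs > max_org_threshold


labels = [0, 1, 0, 1, 0, 1, 0, 1]
no_signal = information_value(["lo", "hi"] * 4, ["x"] and labels) if False else \
    information_value(["lo", "lo", "hi", "hi", "lo", "lo", "hi", "hi"], labels)
separating = information_value(
    ["lo", "hi", "lo", "hi", "lo", "hi", "lo", "hi"], labels
)
print(no_signal, separating)
```

A feature whose bins carry the same mix of good and bad samples has IV near zero, while a feature that separates the two classes scores well above the default threshold of 0.1.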
PSI Parameters
- `psi_threshold`: PSI threshold (default 0.1)
- `max_months_ratio`: Maximum unstable month ratio (default 1/3)
- `max_orgs`: Maximum unstable organization count (default 6)
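The PSI check can be sketched as follows, assuming the standard PSI formula over matching bins and a per-feature list of monthly PSI values. The helper names, the `eps` floor, and the choice of baseline are assumptions; how `drop_highpsi_features()` chooses its reference period and combines months with `max_orgs` is not described in this section, so the organization-level count is left out.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Both arguments are per-bin counts in the same bin order.
    Proportions are floored at eps to avoid log(0).
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        value += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return value


def is_unstable(monthly_psi, psi_threshold=0.1, max_months_ratio=1 / 3):
    """Assumed rule: unstable if the share of months above the PSI threshold
    exceeds max_months_ratio."""
    unstable_months = sum(v > psi_threshold for v in monthly_psi)
    return unstable_months / len(monthly_psi) > max_months_ratio


same_shape = psi([100, 100, 100], [50, 50, 50])  # identical proportions
shifted = psi([100, 100], [150, 50])             # mass moved to the first bin
print(same_shape, shifted)
```

With the defaults, a feature that exceeds the PSI threshold in exactly one of three months is still treated as stable under this sketch, because the ratio must be strictly greater than 1/3.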
Null Importance Parameters
- `n_estimators`: Number of trees (default 100)
- `max_depth`: Maximum tree depth (default 5)
- `gain_threshold`: Gain difference threshold (default 50)
High Correlation Parameters
- `max_corr`: Correlation threshold (default 0.9)
- `top_n_keep`: Keep top N features by original gain ranking (default 20)
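One common way to remove correlated features while preferring those with higher gain is a greedy pass, sketched below. This is an illustration under assumptions, not the skill's `drop_highcorr_features()`: the function names and toy data are hypothetical, Pearson correlation is assumed, and `top_n_keep` is deliberately not modeled because its exact semantics are not spelled out here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(columns, gain, max_corr=0.9):
    """Visit features by descending gain; keep a feature only if its absolute
    correlation with every already-kept feature is at most max_corr."""
    kept = []
    for name in sorted(columns, key=lambda f: gain[f], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= max_corr for k in kept):
            kept.append(name)
    dropped = [f for f in columns if f not in kept]
    return kept, dropped


# Hypothetical features and gains, for illustration only.
columns = {
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "income_x2": [2.0, 4.0, 6.0, 8.0, 10.1],  # nearly a multiple of income
    "age": [5.0, 1.0, 4.0, 2.0, 3.0],
}
gain = {"income": 120.0, "income_x2": 80.0, "age": 40.0}

kept, dropped = drop_correlated(columns, gain)
print(kept, dropped)
```

Ordering by gain means that within a correlated pair, the feature the model found more useful survives.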
Output Report
The generated Excel report contains the following sheets:
1. **汇总** (Summary) - Summary information of all steps, including operation results and conditions
2. **机构样本统计** (Organization Sample Statistics) - Sample co
🎯 Best For
- Claude users
- GitHub Copilot users
- Data professionals
- Analytics teams
- Researchers
💡 Use Cases
- Data pipeline auditing
- Query optimization
📖 How to Use This Skill
1. **Install the Skill** - Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.
2. **Load into Your AI Assistant** - Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.
3. **Apply Datanalysis-Credit-Risk to Your Work** - Provide context for your task: paste source material, describe your audience, or share existing work to guide the AI.
4. **Review and Refine** - Edit the AI output for accuracy, tone, and completeness. Add human insight where the AI lacks context.
❓ Frequently Asked Questions
How do I install Datanalysis-Credit-Risk?
Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/datanalysis-credit-risk/SKILL.md, ready to use.
Can I customize this skill for my team?
Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.
⚠️ Common Mistakes to Avoid
Ignoring data quality
AI analysis inherits all data quality issues — profile your data first.