datanalysis-credit-risk

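Author: github

Credit risk data cleaning and variable screening pipeline for pre-loan modeling. It runs 11 independent steps covering data loading, abnormal period filtering, missing rate analysis, removal of low-IV and high-PSI variables, null importance denoising, and correlation-based feature elimination. It supports organization-level analysis, handles modeling samples and out-of-sample (OOS) samples separately, and offers multi-process acceleration for IV and PSI calculation. It generates a comprehensive Excel report with 15 sheets.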

npx skills add https://github.com/github/awesome-copilot --skill datanalysis-credit-risk

Data Cleaning and Variable Screening

Quick Start

# Run the complete data cleaning pipeline
python ".github/skills/datanalysis-credit-risk/scripts/example.py"

Complete Process Description

The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:

  1. Get Data - Load and format raw data
  2. Organization Sample Analysis - Statistics of sample count and bad sample rate for each organization
  3. Separate OOS Data - Separate out-of-sample (OOS) samples from modeling samples
  4. Filter Abnormal Months - Remove months with insufficient bad sample count or total sample count
  5. Calculate Missing Rate - Calculate overall and organization-level missing rates for each feature
  6. Drop High Missing Rate Features - Remove features with overall missing rate exceeding threshold
  7. Drop Low IV Features - Remove features with overall IV too low or IV too low in too many organizations
  8. Drop High PSI Features - Remove features with unstable PSI
  9. Null Importance Denoising - Remove noise features using label permutation method
  10. Drop High Correlation Features - Remove high correlation features based on original gain
  11. Export Report - Generate Excel report containing details and statistics of all steps

Core Functions

| Function | Purpose | Module |
| --- | --- | --- |
| get_dataset() | Load and format data | references.func |
| org_analysis() | Organization sample analysis | references.func |
| missing_check() | Calculate missing rate | references.func |
| drop_abnormal_ym() | Filter abnormal months | references.analysis |
| drop_highmiss_features() | Drop high missing rate features | references.analysis |
| drop_lowiv_features() | Drop low IV features | references.analysis |
| drop_highpsi_features() | Drop high PSI features | references.analysis |
| drop_highnoise_features() | Null Importance denoising | references.analysis |
| drop_highcorr_features() | Drop high correlation features | references.analysis |
| iv_distribution_by_org() | IV distribution statistics | references.analysis |
| psi_distribution_by_org() | PSI distribution statistics | references.analysis |
| value_ratio_distribution_by_org() | Value ratio distribution statistics | references.analysis |
| export_cleaning_report() | Export cleaning report | references.analysis |

Parameter Description

Data Loading Parameters

  • DATA_PATH: Data file path (Parquet format recommended)
  • DATE_COL: Date column name
  • Y_COL: Label column name
  • ORG_COL: Organization column name
  • KEY_COLS: Primary key column name list

OOS Organization Configuration

  • OOS_ORGS: Out-of-sample organization list
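
As a rough illustration of how OOS_ORGS might drive the OOS separation step, the sketch below splits rows by organization. It is a minimal, dependency-free sketch, not the skill's own code: `split_oos`, the column name `org`, and the organization names are all hypothetical.

```python
# Hypothetical configuration values; only the name OOS_ORGS comes from the skill.
ORG_COL = "org"
OOS_ORGS = ["org_c"]

def split_oos(rows, org_col, oos_orgs):
    """Split rows into (modeling, oos) by organization membership.

    The input list is not modified, so the original data stays available.
    """
    oos_set = set(oos_orgs)
    modeling = [r for r in rows if r[org_col] not in oos_set]
    oos = [r for r in rows if r[org_col] in oos_set]
    return modeling, oos

rows = [
    {"org": "org_a", "y": 0},
    {"org": "org_b", "y": 1},
    {"org": "org_c", "y": 0},
    {"org": "org_c", "y": 1},
]
modeling_rows, oos_rows = split_oos(rows, ORG_COL, OOS_ORGS)
```

Returning two new lists rather than filtering in place mirrors the pipeline's stated goal of never deleting the original data.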

Abnormal Month Filtering Parameters

  • min_ym_bad_sample: Minimum bad sample count per month (default 10)
  • min_ym_sample: Minimum total sample count per month (default 500)
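
The two thresholds can be read as a per-month rule: a month is kept only if it has at least min_ym_bad_sample bad samples and at least min_ym_sample samples in total. The sketch below assumes that reading, assumes label 1 marks a bad sample, and uses a hypothetical helper `find_abnormal_months`; it is not the implementation of `drop_abnormal_ym()`.

```python
from collections import Counter

def find_abnormal_months(rows, min_ym_bad_sample=10, min_ym_sample=500):
    """Return the sorted months that fail either count threshold.

    rows: iterable of (year_month, label) pairs; label 1 = bad sample (assumed).
    """
    total = Counter()
    bad = Counter()
    for ym, label in rows:
        total[ym] += 1
        bad[ym] += int(label == 1)
    return sorted(
        ym for ym in total
        if bad[ym] < min_ym_bad_sample or total[ym] < min_ym_sample
    )

# Toy data: 2024-01 passes, 2024-02 has too few bad samples,
# 2024-03 has too few samples overall.
rows = (
    [("2024-01", 1)] * 20 + [("2024-01", 0)] * 580
    + [("2024-02", 1)] * 5 + [("2024-02", 0)] * 595
    + [("2024-03", 1)] * 15 + [("2024-03", 0)] * 100
)
abnormal = find_abnormal_months(rows)
kept = [r for r in rows if r[0] not in set(abnormal)]
```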

Missing Rate Parameters

  • missing_ratio: Overall missing rate threshold (default 0.6)
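
To make the threshold concrete, the following sketch computes an overall missing rate per feature and flags those whose rate exceeds missing_ratio. It assumes missing values are represented as None and uses hypothetical helpers and feature names; the skill's `missing_check()` and `drop_highmiss_features()` may behave differently.

```python
def missing_rates(records, features):
    """Fraction of records in which each feature is None (or absent)."""
    n = len(records)
    return {f: sum(r.get(f) is None for r in records) / n for f in features}

def high_missing_features(rates, missing_ratio=0.6):
    """Features whose overall missing rate exceeds the threshold."""
    return sorted(f for f, rate in rates.items() if rate > missing_ratio)

records = [
    {"age": 30, "income": None, "score": None},
    {"age": 41, "income": None, "score": 0.7},
    {"age": None, "income": None, "score": 0.2},
    {"age": 25, "income": 5200, "score": None},
]
rates = missing_rates(records, ["age", "income", "score"])
to_drop = high_missing_features(rates)
```

Here `income` is missing in 3 of 4 records (0.75), so it is the only feature above the default 0.6 threshold.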

IV Parameters

  • overall_iv_threshold: Overall IV threshold (default 0.1)
  • org_iv_threshold: Single organization IV threshold (default 0.1)
  • max_org_threshold: Maximum tolerated low IV organization count (default 2)
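
For readers unfamiliar with IV, the sketch below computes Information Value from pre-binned good/bad counts using the standard formula, the sum over bins of (good share minus bad share) times ln(good share / bad share), and then applies one plausible reading of the three parameters: drop a feature if its overall IV is below overall_iv_threshold, or if more than max_org_threshold organizations have IV below org_iv_threshold. The helper names, the `eps` guard, and that drop rule are assumptions, not the behavior of `drop_lowiv_features()`.

```python
import math

def information_value(bins, eps=1e-6):
    """IV from pre-binned counts; bins is a list of (good_count, bad_count)."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    iv = 0.0
    for good, bad in bins:
        # eps avoids log(0) when a bin contains no good or no bad samples
        p_good = max(good / total_good, eps)
        p_bad = max(bad / total_bad, eps)
        iv += (p_good - p_bad) * math.log(p_good / p_bad)
    return iv

def should_drop_low_iv(overall_iv, org_ivs, overall_iv_threshold=0.1,
                       org_iv_threshold=0.1, max_org_threshold=2):
    """Assumed rule: weak overall, or weak in too many organizations."""
    low_orgs = [org for org, iv in org_ivs.items() if iv < org_iv_threshold]
    return overall_iv < overall_iv_threshold or len(low_orgs) > max_org_threshold

iv = information_value([(400, 20), (300, 30), (200, 50)])
drop = should_drop_low_iv(iv, {"org_a": 0.2, "org_b": 0.05, "org_c": 0.3})
```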

PSI Parameters

  • psi_threshold: PSI threshold (default 0.1)
  • max_months_ratio: Maximum unstable month ratio (default 1/3)
  • max_orgs: Maximum unstable organization count (default 6)
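
The sketch below shows the standard PSI formula on binned counts, followed by one possible way the three parameters could combine. The combination rule is an assumption: an organization counts as unstable when the share of its months with PSI above psi_threshold exceeds max_months_ratio, and the feature is dropped when more than max_orgs organizations are unstable. The source does not spell out this logic, and both helper names are hypothetical.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # eps avoids log(0) for empty bins
        p_e = max(e / e_total, eps)
        p_a = max(a / a_total, eps)
        value += (p_a - p_e) * math.log(p_a / p_e)
    return value

def is_unstable_feature(psi_by_org, psi_threshold=0.1,
                        max_months_ratio=1 / 3, max_orgs=6):
    """Assumed rule: too many organizations with too many unstable months."""
    unstable_orgs = 0
    for monthly in psi_by_org.values():
        bad_months = sum(v > psi_threshold for v in monthly)
        if monthly and bad_months / len(monthly) > max_months_ratio:
            unstable_orgs += 1
    return unstable_orgs > max_orgs

same = psi([100, 100, 100], [100, 100, 100])
shifted = psi([100, 200, 700], [300, 300, 400])
monthly_psi = {"org_a": [0.02, 0.3, 0.4], "org_b": [0.01, 0.02, 0.03]}
```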

Null Importance Parameters

  • n_estimators: Number of trees (default 100)
  • max_depth: Maximum tree depth (default 5)
  • gain_threshold: Gain difference threshold (default 50)
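
The idea behind null importance can be sketched without a tree model: score each feature against the real labels, score it again against shuffled labels several times, and keep it only if the real score beats the shuffled average by a margin. The sketch below swaps in absolute Pearson correlation for tree gain so that it runs on the standard library alone; n_estimators and max_depth therefore have no counterpart here, and the `threshold` is on the correlation scale, not the gain scale implied by gain_threshold. All names are hypothetical.

```python
import math
import random

def abs_corr(xs, ys):
    """Absolute Pearson correlation; stands in for model gain in this sketch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return abs(cov / (sx * sy)) if sx and sy else 0.0

def null_importance_filter(features, labels, threshold, n_runs=20, seed=0):
    """Keep features whose real score exceeds the mean shuffled-label score."""
    rng = random.Random(seed)
    null_scores = {name: [] for name in features}
    for _ in range(n_runs):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # label permutation breaks any real signal
        for name, values in features.items():
            null_scores[name].append(abs_corr(values, shuffled))
    kept = []
    for name, values in features.items():
        actual = abs_corr(values, labels)
        null_mean = sum(null_scores[name]) / n_runs
        if actual - null_mean > threshold:
            kept.append(name)
    return kept

labels = [0, 1] * 20
features = {
    "signal": labels[:],           # identical to the label
    "noise": [0, 0, 1, 1] * 10,    # uncorrelated with the label by construction
}
kept = null_importance_filter(features, labels, threshold=0.1)
```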

High Correlation Parameters

  • max_corr: Correlation threshold (default 0.9)
  • top_n_keep: Keep top N features by original gain ranking (default 20)
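
One common way to combine a correlation threshold with a gain ranking is a greedy pass: visit features from highest to lowest gain and drop any feature whose absolute correlation with an already-kept feature exceeds max_corr. The sketch below assumes that approach and does not model top_n_keep, whose exact semantics the source does not specify. The helper names, feature names, and gain values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def drop_correlated(features, gains, max_corr=0.9):
    """Greedy pass in descending gain order; returns (kept, dropped)."""
    kept, dropped = [], []
    for name in sorted(features, key=lambda f: gains[f], reverse=True):
        if any(abs(pearson(features[name], features[k])) > max_corr for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped

features = {
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "income_x2": [2.0, 4.0, 6.0, 8.0, 10.1],  # nearly a multiple of income
    "age": [5.0, 1.0, 4.0, 2.0, 3.0],
}
gains = {"income": 120.0, "income_x2": 80.0, "age": 60.0}
kept, dropped = drop_correlated(features, gains)
```

Because the pass runs in gain order, the higher-gain member of each correlated pair is the one that survives.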

Output Report

The generated Excel report contains the following sheets:

  1. 汇总 - Summary information of all steps, including operation results and conditions
  2. 机构样本统计 - Sample count and bad sample rate for each organization
  3. 分离OOS数据 - OOS sample and modeling sample counts
  4. Step4-异常月份处理 - Abnormal months that were removed
  5. 缺失率明细 - Overall and organization-level missing rates for each feature
  6. Step5-有值率分布统计 - Distribution of features in different value ratio ranges
  7. Step6-高缺失率处理 - High missing rate features that were removed
  8. Step7-IV明细 - IV values of each feature in each organization and overall
  9. Step7-IV处理 - Features that do not meet IV conditions and low IV organizations
  10. Step7-IV分布统计 - Distribution of features in different IV ranges
  11. Step8-PSI明细 - PSI values of each feature in each organization each month
  12. Step8-PSI处理 - Features that do not meet PSI conditions and unstable organizations
  13. Step8-PSI分布统计 - Distribution of features in different PSI ranges
  14. Step9-null importance处理 - Noise features that were removed
  15. Step10-高相关性剔除 - High correlation features that were removed

Features

  • Interactive Input: Parameters can be entered before each step runs, with default values supported
  • Independent Execution: Each step runs independently without deleting the original data, which makes comparative analysis easier
  • Complete Report: Generates a complete Excel report containing details, statistics, and distributions
  • Multi-process Support: IV and PSI calculations support multi-process acceleration
  • Organization-level Analysis: Support organization-level statistics and modeling/OOS distinction
