Three Models, One Story: Closing the Explainability Gap in Lending Decisions
Three models, five scores, seven reasons? No thanks. Every decision needs one clear story.
Model stacks are common: a bureau score, an internal ML model, and a third-party score often deliver real lift. But the explanation story gets worse with every added layer. Borrowers get vague adverse action letters. Similar files receive different reasons. Internal teams try to reverse-engineer feature weights into borrower language. That is not explainability. It is risk.
What good looks like
Every decision yields a single, clean narrative plus a short, ranked set of reason codes that match what was actually scored. The narrative is generated at the moment of decision and stored with model versions, policy versions, and thresholds. When examiners ask, you pull the exact narrative and supporting evidence from the system of record. You are not reconstructing the why after the fact.
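To make this concrete, here is a minimal sketch of what such a stored decision record could look like, assuming a Python service. The names (DecisionRecord, ReasonCode, and their fields) are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class ReasonCode:
    code: str            # e.g. "RC-02" (illustrative code)
    borrower_text: str   # pre-approved, borrower-facing phrasing
    rank: int            # 1 = primary driver

@dataclass(frozen=True)
class DecisionRecord:
    application_id: str
    decision: str                     # "approve" or "decline"
    narrative: str                    # one sentence, written at decision time
    reasons: tuple[ReasonCode, ...]   # short, ranked list
    model_versions: dict[str, str]    # e.g. {"bureau": "2024.1", "gbm": "3.4.0"}
    policy_version: str
    thresholds: dict[str, float]
    decided_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

Everything an examiner will ask for sits in one row: the story, the ranked reasons, and the exact versions and thresholds that produced them.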
Real-world example
A regional bank ran a bureau score, a gradient-boosting model, and a third-party fraud score. Denials were defensible, yet letters were inconsistent across near-identical files. We introduced a reason-code dictionary with borrower-facing phrasing, mapped all model outputs to that dictionary, and added a deterministic selector that chose one primary driver and up to two secondary reasons when models disagreed. A “decision narrative” service wrote a single sentence plus the top three reasons into the decision log on every approval or decline. The audit file became a query, not a hunt. The next review closed without findings on adverse action notices because the borrower explanation matched the model evidence and the policy language.
Framework: the one-story layer
Unify the language. Build one reason-code dictionary with clear borrower phrasing. Map each model’s features and outputs to those codes.
Select the driver. Create rules that pick one primary driver and up to two secondary reasons when models disagree; a minimal selector sketch follows this list. Avoid treating raw feature importances as borrower-facing reasons.
Log at decision time. Emit one sentence plus ranked reasons with each decision. Store model IDs, policy IDs, thresholds, and the narrative together.
Standardize letters. Tie narratives to pre-approved templates so the language is consistent and accurate.
Shadow, then switch. Run for 30 days in shadow mode. Compare new narratives to current letters on stratified samples. Tune selection rules before full cutover.
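Here is a minimal sketch of the selection and narrative step, assuming each model's raw output has already been mapped into a shared reason-code dictionary upstream. The dictionary contents, the sum-the-contributions rule, and the fixed tie-break order are all illustrative choices; the point is that the selector is deterministic and speaks only in pre-approved phrasing.

```python
# Illustrative reason-code dictionary: code -> borrower-facing phrasing.
REASON_DICTIONARY = {
    "RC-01": "serious delinquency reported on one or more accounts",
    "RC-02": "proportion of balances to credit limits is too high",
    "RC-03": "length of credit history is insufficient",
    "RC-04": "income insufficient for the amount of credit requested",
}

# Fixed tie-break order keeps selection deterministic when weights are equal.
CODE_PRIORITY = ["RC-01", "RC-04", "RC-02", "RC-03"]

def select_reasons(mapped: dict[str, dict[str, float]], max_reasons: int = 3) -> list[str]:
    """Pick one primary driver plus up to two secondary reasons.

    `mapped` is {model_name: {reason_code: contribution}}, produced by the
    per-model mapping step, so every code is already in the dictionary.
    Contributions for the same code are summed across models; when models
    disagree, the code carrying the most total weight wins, and exact ties
    fall back to the fixed priority order.
    """
    totals: dict[str, float] = {}
    for per_model in mapped.values():
        for code, weight in per_model.items():
            totals[code] = totals.get(code, 0.0) + weight
    ranked = sorted(totals, key=lambda c: (-totals[c], CODE_PRIORITY.index(c)))
    return ranked[:max_reasons]

def write_narrative(decision: str, codes: list[str]) -> str:
    """One sentence built only from pre-approved dictionary phrasing."""
    return f"Application {decision}: primary driver was {REASON_DICTIONARY[codes[0]]}."
```

In practice the selector runs once per decision, and its output feeds both the letter template and the decision record sketched earlier, so the letter, the log, and the audit evidence all quote the same codes.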
Metrics that matter
Reason clarity rate: percent of decisions where a single driver qualifies as dominant under your selection rules.
Consistency on like-for-like files: percent of matched files that share the same primary reason (see the sketch after this list).
Letter rework rate: percent of notices requiring manual edits before delivery.
Retrievability: time to assemble evidence for a 50-file sample from the system of record.
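A rough sketch of how two of these could be computed straight from the decision log, assuming records shaped like the DecisionRecord above; the `letter_edited` flag and the upstream pairing of like-for-like files are assumptions for illustration.

```python
def consistency_rate(matched_pairs) -> float:
    """Share of matched file pairs whose decisions cite the same primary reason.

    `matched_pairs` is an iterable of (DecisionRecord, DecisionRecord) tuples,
    paired upstream by whatever like-for-like matching the review team uses.
    """
    pairs = list(matched_pairs)
    if not pairs:
        return 0.0
    same = sum(1 for a, b in pairs if a.reasons[0].code == b.reasons[0].code)
    return same / len(pairs)

def letter_rework_rate(records) -> float:
    """Share of notices that needed manual edits before delivery.

    Assumes each record carries a `letter_edited` flag set by the fulfillment
    step; that flag is an illustrative addition, not part of the record above.
    """
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for r in records if getattr(r, "letter_edited", False)) / len(records)
```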
Pitfalls to avoid
Feature-importance worship. Importances are diagnostics, not borrower explanations.
After-the-fact patching. If you cannot generate the one-sentence narrative at decision time, without caveats or manual cleanup, you are not production-ready.
Language drift. Multiple templates with different phrasings create fairness and accuracy risk.
A note on compliance fit
Regulation B expects specific, principal reasons in adverse action notices and cautions that listing more than four reasons is not helpful. The CFPB has also clarified that using AI or complex models does not change the obligation to give specific, accurate reasons. Supervisory guidance on model risk management expects controlled implementation and use, which is exactly what the one-story layer enforces in practice. Build for that bar, not around it.
Bottom line
Use as many models as you need. If the combined output cannot support a one-sentence explanation and a short, specific set of reasons at decision time, you are building liability, not advantage.
References
· CFPB, Regulation B, 12 CFR § 1002.9 (adverse action) and official commentary on providing specific, principal reasons and on the number of reasons disclosed.
· CFPB, Circular 2023-03 and accompanying newsroom statement on adverse action explanations when AI or complex models are used.