Ante-hoc (Inherently Interpretable) Models

Ante-hoc methods build interpretability directly into the model architecture, making explanations an inherent part of the prediction process rather than a post-hoc addition. Historically, this approach dates back to the earliest days of AI, when expert systems in the 1970s and 1980s were designed with rule-based reasoning that was inherently transparent. The evolution of ante-hoc methods accelerated in the 2010s as researchers developed techniques to make inherently interpretable models competitive with black-box alternatives. This approach is fundamentally different from post-hoc methods because the explanation is the model itself, guaranteeing perfect fidelity. According to a survey by Abusitta et al. (2024) in Expert Systems with Applications, ante-hoc methods account for approximately 23% of XAI deployments, primarily in regulated industries where explanation guarantees are required.

The case for ante-hoc models has been strengthened by research showing that interpretable models often match black-box performance. Cynthia Rudin's work demonstrated that for tabular data in healthcare and criminal justice, well-designed interpretable models achieve within 1-2% accuracy of deep learning approaches while providing complete transparency. This finding challenges the assumption that accuracy must be sacrificed for interpretability, a topic explored further in the Evaluation & Frameworks page.

Linear Models

Linear regression and logistic regression provide direct interpretability through coefficients that quantify feature importance. Each coefficient represents the expected change in output for a unit change in the corresponding feature, holding other features constant. The linearity assumption enables straightforward interpretation of feature effects, because the prediction is simply a weighted sum of features.

Advantages: Complete transparency, statistical significance tests | Limitations: Cannot capture complex nonlinear relationships
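
As a minimal sketch of this interpretation (using synthetic data, scikit-learn, and hypothetical feature names), the coefficients of a fitted logistic regression can be read directly as log-odds effects:

```python
# Minimal sketch: reading logistic regression coefficients as feature effects.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X = StandardScaler().fit_transform(X)          # standardize so coefficients are comparable
model = LogisticRegression().fit(X, y)

for name, coef in zip([f"x{i}" for i in range(4)], model.coef_[0]):
    # Each coefficient is the change in log-odds for a one-unit (here, one standard
    # deviation) increase in the feature, holding the other features constant.
    print(f"{name}: {coef:+.3f} log-odds, odds ratio {np.exp(coef):.2f}")
```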

Decision Trees

Decision trees partition the feature space through a series of binary splits, creating interpretable rule-based paths from root to leaf. Each prediction can be explained by tracing the path through the tree and identifying which conditions led to the final decision. Shallow trees are highly interpretable, though deeper trees become increasingly difficult for humans to understand.

Advantages: Visual representation, handles mixed data types | Limitations: Instability, tendency to overfit without pruning
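
A minimal scikit-learn sketch: a shallow tree fitted to the Iris data can be printed as an explicit set of IF-THEN paths (feature names shortened for readability), and any single prediction is explained by the root-to-leaf path it follows:

```python
# Minimal sketch: exposing the rule structure of a shallow decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text prints the full IF-THEN structure of the fitted tree.
print(export_text(tree, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]))
```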

Generalized Additive Models (GAMs)

GAMs extend linear models by allowing nonlinear transformations of individual features while maintaining additivity: f(x) = g1(x1) + g2(x2) + ... + gn(xn). Each component function can be visualized independently, enabling understanding of individual feature effects while capturing nonlinear patterns. Explainable Boosting Machines (EBMs) are a modern GAM variant that achieves accuracy comparable to black-box models (Lou et al., 2013).

Advantages: Nonlinear feature effects, maintained interpretability | Limitations: Cannot model feature interactions without explicit terms
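
A short sketch using InterpretML's Explainable Boosting Machine (assuming the interpret package is installed; the dataset is illustrative). Each learned shape function can be inspected on its own, which is what the additive structure buys:

```python
# Hedged sketch of an EBM, a GAM variant from the InterpretML library.
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ebm = ExplainableBoostingClassifier()          # learns one shape function per feature
ebm.fit(X_train, y_train)

# The global explanation holds the per-feature shape functions and importances;
# in a notebook, render it with: from interpret import show; show(global_exp)
global_exp = ebm.explain_global()
```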

Rule-Based Systems

Rule-based learning algorithms generate explicit IF-THEN rules that can be directly understood by humans. Methods like RIPPER and decision lists produce ordered rule sets where each rule provides a transparent justification for classifications. Modern approaches like Bayesian Rule Lists provide principled ways to learn compact rule sets with uncertainty quantification.

Advantages: Human-readable rules, domain knowledge integration | Limitations: May require extensive feature engineering

Post-hoc Local Explanations

Local explanation methods provide interpretations for individual predictions, answering the question "Why did the model make this specific prediction?" These techniques approximate complex model behavior in the neighborhood of a particular instance, which is especially relevant for decision justification in individual cases. According to Vimbi et al. (2024), local explanations are preferred in 78% of clinical AI applications because healthcare decisions require justification at the individual patient level rather than general model behavior.

The two dominant local explanation methods, LIME and SHAP, take fundamentally different approaches. LIME approximates the model locally with a simple surrogate, while SHAP computes exact feature contributions based on game theory. Salih et al. (2024) found that SHAP provides 15% higher consistency across repeated explanations, but LIME runs approximately 3x faster, making method choice dependent on the specific deployment requirements. For applications requiring real-time explanations, such as fraud detection alerts (see Applications), LIME's speed advantage often outweighs SHAP's theoretical guarantees.

LIME (Local Interpretable Model-agnostic Explanations)

LIME, introduced by Ribeiro et al. (2016), generates local explanations by fitting a simple interpretable model (typically linear) in the neighborhood of the prediction being explained. The method works by: (1) perturbing the input instance to create synthetic neighbors, (2) obtaining model predictions for these neighbors, (3) weighting neighbors by proximity to the original instance, and (4) fitting an interpretable surrogate model to approximate local behavior.

For text classification, LIME identifies which words most influenced the prediction. For images, it identifies which superpixels were most important. The method is model-agnostic and can explain any classifier, making it widely applicable across domains.

Original paper: DOI: 10.1145/2939672.2939778 | Implementation: GitHub
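
A minimal sketch of the tabular LIME workflow with the lime package (the model, feature names, and class names here are illustrative):

```python
# Minimal sketch: LIME explanation for a single tabular prediction.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"],
    class_names=["setosa", "versicolor", "virginica"],
    mode="classification",
)
# Perturb around one instance, weight neighbors by proximity, fit a local linear surrogate.
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(exp.as_list())     # (feature condition, local weight) pairs
```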

SHAP (SHapley Additive exPlanations)

SHAP, developed by Lundberg & Lee (2017), provides theoretically grounded feature attributions based on Shapley values from cooperative game theory. Each feature is assigned an importance value representing its contribution to the prediction, with the guarantee that contributions sum to the difference between the prediction and the average prediction.

SHAP unifies several existing attribution methods (LIME, DeepLIFT, layer-wise relevance propagation) under a common framework and is the only additive feature attribution method satisfying three desirable properties: local accuracy, missingness, and consistency. Tree SHAP provides exact Shapley values for tree-based models in polynomial time.

Original paper: arXiv:1705.07874 | Implementation: GitHub
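
A minimal sketch with the shap package, using TreeSHAP on a gradient-boosted model (the model choice and plot are illustrative):

```python
# Minimal sketch: exact Shapley values for a tree ensemble via TreeSHAP.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)          # exact Shapley values for tree ensembles
shap_values = explainer.shap_values(X)

# Local view: contributions for one row sum to (prediction - expected_value).
print(shap_values[0])
# Global view: mean |SHAP| per feature ranks overall importance.
shap.summary_plot(shap_values, X)
```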

Anchors

Anchors, also developed by Ribeiro et al. (2018), provide rule-based explanations with coverage guarantees. An anchor is a sufficient condition for a prediction, meaning that if the anchor conditions hold, the prediction will remain the same with high probability regardless of other feature values. This addresses a limitation of LIME, which only provides local linear approximations without coverage guarantees.

For example, an anchor for a sentiment classifier might be: "If the review contains 'excellent' and 'highly recommend', the prediction is positive." The anchor approach is particularly useful when users need to understand which conditions are sufficient for a prediction.

Original paper: AAAI 2018
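
One possible sketch uses the alibi library's AnchorTabular; the API details below are assumptions based on alibi's documented interface and should be checked against the current release:

```python
# Hedged sketch: anchor (sufficient-condition) explanation for one prediction.
from alibi.explainers import AnchorTabular
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = AnchorTabular(model.predict,
                          ["sepal_len", "sepal_wid", "petal_len", "petal_wid"])
explainer.fit(X)                                # learns how to discretize/perturb features
explanation = explainer.explain(X[0], threshold=0.95)

# The anchor is a conjunction of conditions under which the prediction stays the same
# with probability >= threshold, regardless of the other feature values.
print(" AND ".join(explanation.anchor), "| precision:", explanation.precision)
```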

Post-hoc Global Explanations

Global explanation methods describe overall model behavior, answering the question "How does the model generally make predictions?" These techniques provide insights into which features are most important across the entire dataset and how the model responds to feature changes. Global explanations are particularly valuable for model auditing and debugging because they reveal systematic patterns that local explanations cannot detect. For example, a model might consistently rely on an unexpected feature, a pattern that only becomes apparent through global analysis.

Research demonstrates that global and local explanations serve complementary purposes: global methods identify which features matter overall, whereas local methods explain individual decisions, so practitioners often combine both approaches. For example, SHAP summary plots provide global feature importance rankings, while force plots explain specific predictions. The distinction between global and local is therefore not a choice between alternatives but a question of which level of analysis suits the use case. See the Evaluation page for metrics comparing global vs. local explanation quality.

Permutation Feature Importance

Permutation importance measures how much model performance degrades when a feature's values are randomly shuffled, breaking the relationship between the feature and the target. Features whose permutation causes large performance drops are considered important. This method is model-agnostic and accounts for feature interactions, unlike coefficient-based importance in linear models.

Advantages: Model-agnostic, accounts for interactions | Limitations: Correlated features may share importance
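
A minimal sketch with scikit-learn's permutation_importance (model and dataset are illustrative); importance is measured on held-out data as the average score drop over repeated shuffles:

```python
# Minimal sketch: permutation feature importance on a held-out split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature n_repeats times and measure how much the test score degrades.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print(ranking[:5], result.importances_mean[ranking[:5]])
```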

Partial Dependence Plots (PDP)

PDPs show the marginal effect of one or two features on the predicted outcome of a machine learning model. The partial dependence function marginalizes over all other features, showing the average prediction as the feature of interest varies. This reveals the functional relationship between features and predictions.

For a house price model, a PDP might show that predicted price increases linearly with square footage up to 3,000 sq ft, then plateaus. PDPs are intuitive and widely used but can be misleading when features are correlated (Apley & Zhu, 2020).

Original reference: Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics.
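
A minimal sketch using scikit-learn's PartialDependenceDisplay (scikit-learn 1.0+; the dataset and chosen features are illustrative):

```python
# Minimal sketch: partial dependence of predicted house value on two features.
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Sweep each chosen feature over its range while averaging predictions over the data.
PartialDependenceDisplay.from_estimator(model, X, features=["MedInc", "AveRooms"])
plt.show()
```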

Model Extraction (Knowledge Distillation)

Model extraction creates a simpler, interpretable model that mimics the behavior of a complex black-box model. The complex model serves as a "teacher" that labels data, and a simpler "student" model (decision tree, rule list) is trained to match these predictions. The student model provides an interpretable approximation of the teacher's decision boundary.

This approach trades some fidelity for interpretability, and the quality of explanations depends on how well the surrogate model approximates the original.

Advantages: Produces truly interpretable model | Limitations: Approximation may miss important behaviors
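
A minimal sketch of the distillation idea, with a random forest standing in as the black-box teacher and a shallow decision tree as the interpretable student:

```python
# Minimal sketch: distilling a black-box "teacher" into an interpretable "student" tree.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
teacher = RandomForestClassifier(random_state=0).fit(X, y)

# Train the student on the teacher's predictions, not on the true labels.
student = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, teacher.predict(X))

# Fidelity: how often the surrogate reproduces the teacher's decisions.
print("fidelity:", accuracy_score(teacher.predict(X), student.predict(X)))
```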

Model-Agnostic Methods

Model-agnostic methods work with any machine learning model by treating it as a black box and using only input-output relationships. Because they need no access to internal weights or architecture, they offer maximum flexibility across model families, from traditional machine learning to deep neural networks. This flexibility comes at a cost: model-agnostic methods must approximate model behavior through perturbation or sampling, whereas model-specific methods can directly inspect internal computations. This trade-off between generality and fidelity is the central consideration when selecting an explanation approach; model-agnostic methods are generally preferred when models change frequently or when the same explanation infrastructure must support diverse architectures.

Method | Type | Description | Key Strength
LIME | Local | Fits local linear surrogate model around predictions | Works with any data type (tabular, text, images)
SHAP | Local/Global | Shapley value-based feature attributions | Theoretical guarantees, consistent attributions
PDP | Global | Shows marginal effect of features on prediction | Intuitive visualization of feature effects
ICE (Individual Conditional Expectation) | Local/Global | Shows individual response curves per instance | Reveals heterogeneous effects hidden by PDP
ALE (Accumulated Local Effects) | Global | Unbiased feature effect estimation | Handles correlated features better than PDP
Anchors | Local | Sufficient conditions for predictions | Coverage guarantees, rule-based
Counterfactual Explanations | Local | Minimal changes to flip prediction | Actionable, contrastive explanations

Model-Specific Methods

Model-specific methods leverage the internal structure and computations of particular model architectures, which often makes their explanations more accurate and computationally efficient than model-agnostic approaches. For example, attention visualization directly reveals what a transformer model "looks at," whereas LIME must approximate this through perturbation. The trade-off is thus between flexibility (model-agnostic) and fidelity (model-specific): research demonstrates that model-specific methods achieve 15-25% higher fidelity scores in controlled evaluations, so practitioners should prefer them when one is available for their architecture.

Attention Visualization

Transformer-based models compute attention weights indicating which input elements the model focuses on when making predictions. Visualizing attention patterns reveals how the model processes sequences, showing which tokens or image patches receive the most "attention" during prediction. While intuitive, recent research questions whether attention weights faithfully represent model reasoning (Jain & Wallace, 2019).

Applicable to: Transformers, BERT, GPT, Vision Transformers
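
A short sketch of extracting attention weights from a Hugging Face transformer (the model name and the averaging over heads are illustrative choices):

```python
# Hedged sketch: inspecting the last layer's attention pattern in BERT.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The movie was surprisingly good", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len).
last_layer = outputs.attentions[-1][0]          # (heads, seq, seq)
avg_attention = last_layer.mean(dim=0)          # average over heads
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)
print(avg_attention[0])                         # how much [CLS] attends to each token
```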

Grad-CAM (Gradient-weighted Class Activation Mapping)

Grad-CAM produces visual explanations for CNN predictions by using the gradient information flowing into the final convolutional layer. It highlights regions of the input image that are most important for predicting a specific class. The method works by computing the gradient of the target class score with respect to feature maps and using these gradients as importance weights.

Grad-CAM has been widely adopted for explaining image classification models, particularly in medical imaging where visual explanations help clinicians understand AI predictions.

Original paper: Selvaraju et al. (2017)
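
A hedged sketch of the Grad-CAM computation for a torchvision ResNet; the hooked layer, random weights, and dummy input are illustrative rather than a reference implementation:

```python
# Hedged sketch: Grad-CAM via forward/backward hooks on the last conv block.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()           # random weights; load pretrained in practice
activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["value"] = out.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["value"] = grad_out[0].detach()

model.layer4.register_forward_hook(fwd_hook)     # last conv block keeps spatial layout
model.layer4.register_full_backward_hook(bwd_hook)

image = torch.randn(1, 3, 224, 224)              # stand-in for a preprocessed image
scores = model(image)
scores[0, scores.argmax()].backward()            # gradient of the top class score

weights = gradients["value"].mean(dim=(2, 3), keepdim=True)   # pool gradients per channel
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalize to [0, 1]
print(cam.shape)                                 # heatmap to overlay on the input image
```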

Layer-wise Relevance Propagation (LRP)

LRP propagates the prediction backward through the network, decomposing the output into relevance scores for each input feature. The propagation follows conservation rules ensuring that relevance is preserved across layers. This provides a principled way to attribute predictions to input features based on the network's internal computations.

Applicable to: Feedforward neural networks, CNNs

DeepLIFT

DeepLIFT (Deep Learning Important FeaTures) assigns contribution scores to inputs by comparing their activation to a "reference" activation. Unlike gradient-based methods that can give zero attribution to saturated units, DeepLIFT compares differences in activation, providing more informative attributions when gradients are zero or near-zero.

Original paper: Shrikumar et al. (2017)
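
A short sketch with Captum's DeepLift on a toy network (the model and the all-zero reference are illustrative):

```python
# Hedged sketch: DeepLIFT attributions relative to a reference input via Captum.
import torch
import torch.nn as nn
from captum.attr import DeepLift

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
x = torch.randn(1, 4)
baseline = torch.zeros(1, 4)                     # the "reference" activation

dl = DeepLift(model)
# Contributions are differences from the reference, so saturated units still get credit.
attributions = dl.attribute(x, baselines=baseline, target=0)
print(attributions)
```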

Integrated Gradients

Integrated Gradients attributes predictions by accumulating gradients along a path from a baseline input to the actual input. This method satisfies two key axioms: sensitivity (if a feature matters, it receives non-zero attribution) and implementation invariance (equivalent networks give identical attributions). It provides theoretically grounded attributions for any differentiable model.

Original paper: Sundararajan et al. (2017)
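
A short sketch with Captum's IntegratedGradients on a toy network (the model and zero baseline are illustrative):

```python
# Hedged sketch: Integrated Gradients along the baseline-to-input path via Captum.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
x = torch.randn(1, 4)
baseline = torch.zeros(1, 4)                     # path start; often an all-zero input

ig = IntegratedGradients(model)
# Accumulate gradients along the straight-line path from the baseline to x.
attributions, delta = ig.attribute(x, baselines=baseline, target=0,
                                   return_convergence_delta=True)
print(attributions, delta)                       # delta near 0 indicates a good approximation
```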

Example-Based Explanations

Example-based explanations justify predictions by pointing to training examples or generating illustrative instances. This approach leverages the human tendency to reason by comparing cases, making explanations intuitive for non-technical users: rather than describing abstract feature contributions, example-based methods show concrete instances that the user can examine directly. Research demonstrates that example-based explanations achieve 25% higher user comprehension scores than feature attribution methods in user studies, with counterfactual explanations proving particularly effective because they provide actionable information. Building on case-based reasoning traditions from expert systems, modern example-based methods combine principled selection criteria with deep learning representations.

Prototypes and Criticisms

Prototype-based explanations identify representative examples from the training data that characterize each class or cluster. Criticisms are examples that are not well represented by prototypes. Together, they provide a summary of the data distribution and can explain predictions by showing similar training examples.

MMD-critic (Kim et al., 2016) is a principled approach for selecting prototypes and criticisms that maximize coverage of the data distribution.

Intuition: "This image was classified as a dog because it is similar to these training examples..."

Counterfactual Explanations

Counterfactual explanations describe the smallest change to input features that would change the prediction. They answer "What would need to be different for the outcome to change?" For a loan rejection, a counterfactual might be: "If your income were $5,000 higher, your loan would be approved."

Counterfactuals are particularly valuable for actionable explanations, telling users exactly what they could change to achieve a different outcome. The challenge lies in generating realistic, sparse counterfactuals that represent achievable changes (Wachter et al., 2017).

Key application: Recourse in automated decision-making
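
A hedged sketch of a Wachter-style counterfactual search for a differentiable model: gradient descent on a prediction loss plus an L1 proximity penalty. The model, weights, and input are toy placeholders, not a real lending model:

```python
# Hedged sketch: counterfactual search by minimizing (f(x') - target)^2 + lam * ||x' - x||_1.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid()).eval()
x = torch.tensor([[0.2, 0.5, 0.1]])             # stand-in for the original (rejected) input
target, lam = 1.0, 0.1                           # desired output and sparsity weight

x_cf = x.clone().requires_grad_(True)
optimizer = torch.optim.Adam([x_cf], lr=0.05)
for _ in range(300):
    optimizer.zero_grad()
    prediction_loss = (model(x_cf) - target) ** 2
    proximity_loss = lam * (x_cf - x).abs().sum()   # L1 keeps the suggested change sparse
    (prediction_loss + proximity_loss).sum().backward()
    optimizer.step()

print("change needed:", (x_cf - x).detach())     # minimal edits that flip the outcome
```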

Adversarial Examples

While originally studied as a robustness concern, adversarial examples provide insights into model behavior by revealing decision boundaries. Small perturbations that change predictions expose model sensitivities and can help users understand what features the model relies on. Understanding adversarial vulnerabilities is crucial for deploying XAI in safety-critical applications.

Insight: Reveals model sensitivities and potential failure modes
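
A minimal sketch of the fast gradient sign method (FGSM) on a toy network, illustrating how a small gradient-aligned perturbation probes the decision boundary (model and input are placeholders):

```python
# Hedged sketch: FGSM perturbation that nudges an input across the decision boundary.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(1, 4, requires_grad=True)
y = torch.tensor([0])                            # the currently predicted class
loss = loss_fn(model(x), y)
loss.backward()

epsilon = 0.1
x_adv = x + epsilon * x.grad.sign()              # tiny step that maximally increases the loss
print(model(x).argmax(1), model(x_adv).argmax(1))   # the prediction may flip
```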

Influential Instances

Influence functions identify which training examples were most influential in determining a particular prediction. By computing how the prediction would change if a training example were removed, these methods trace predictions back to their training data origins. This is valuable for debugging models and understanding data-driven decisions.

Original paper: Koh & Liang (2017)

Method Comparison

Different XAI methods have distinct strengths and are suited for different use cases. The following table summarizes key characteristics to guide method selection:

Method | Scope | Model Dependency | Computational Cost | Best For
LIME | Local | Agnostic | Moderate (sampling) | Quick individual explanations
SHAP | Local/Global | Agnostic (with fast tree variants) | High (exact) / Low (TreeSHAP) | Rigorous feature attribution
Grad-CAM | Local | CNN-specific | Low | Image classification explanations
Attention | Local | Transformer-specific | Very Low (computed during inference) | Sequence/NLP model insights
Counterfactuals | Local | Agnostic | High (optimization) | Actionable recourse explanations
PDPs | Global | Agnostic | Moderate | Understanding feature effects
Decision Trees | Global | Inherent | Very Low | Fully transparent models

Selecting XAI Methods

Method selection depends on several factors: (1) whether global understanding or individual explanations are needed, (2) the model architecture being explained, (3) computational resources available, (4) the target audience's technical sophistication, and (5) regulatory requirements for explanation type. In practice, combining multiple methods provides complementary insights into model behavior (Hassija et al., 2023).

Recent Developments (2024-2025)

XAI technique development has accelerated in recent years, driven by the need to explain increasingly complex models such as large language models and vision transformers. Explanation methods must now scale to models with billions of parameters while remaining computationally tractable, so recent research focuses on efficient approximations and hierarchical explanations. Traditional methods like LIME and SHAP remain dominant, but adapted versions for specific architectures are emerging; attention-based explanations for transformers, for instance, provide near-real-time interpretability that perturbation-based methods cannot match. Together, these developments suggest a maturing field with increasingly specialized tools for different model types.

SHAP vs. LIME: Empirical Comparisons

A 2025 perspective paper in Advanced Intelligent Systems provides a detailed empirical comparison of SHAP and LIME. The analysis found that SHAP provides more consistent explanations across repeated runs (due to its theoretical grounding in Shapley values), while LIME can be faster for real-time applications. For tree-based models, TreeSHAP computes exact Shapley values in O(TLD²) time (where T is the number of trees, L the maximum number of leaves, and D the maximum depth), making it practical for production deployment with thousands of features.

The integration of XAI with large language models represents a frontier research area. LLMs can now generate natural language explanations of model predictions, translating technical SHAP values into human-readable narratives. This "explanation generation" paradigm complements traditional attribution methods by providing explanations tailored to user expertise levels.

Leading Research Teams

Institution | Key Researchers | Focus
University of Washington | Marco Tulio Ribeiro, Carlos Guestrin | LIME, Anchors, model-agnostic explanations
Microsoft Research | Scott Lundberg | SHAP, tree ensemble interpretability
Duke University | Cynthia Rudin | Inherently interpretable models, rule learning
Google DeepMind | Been Kim | Concept-based explanations, TCAV
Fraunhofer HHI | Klaus-Robert Müller | LRP, neural network interpretability

Key Journals

External Resources

Open-Source Implementations

  • SHAP (Microsoft) - Shapley value explanations with TreeSHAP, DeepSHAP, KernelSHAP
  • LIME (UW/UCI) - Local interpretable model-agnostic explanations
  • Captum (Meta) - PyTorch interpretability library with Integrated Gradients, DeepLIFT
  • InterpretML (Microsoft) - Unified toolkit including Explainable Boosting Machines
  • AI Explainability 360 (IBM) - Contrastive explanations and rule-based methods
  • iNNvestigate (Fraunhofer HHI) - LRP and Deep Taylor Decomposition

Research & Standards