Feature Importance


Feature Importance (or Variable Importance or Feature Attribution) refers to a set of techniques and mathematical frameworks used in machine learning, statistics, and their applications to quantify the contribution of input variables (features) to a model's output or to the underlying data-generating process. It is a component of Explainable Artificial Intelligence (XAI) and Interpretable Machine Learning (IML) and is used for model improvement, scientific inference, and feature selection.[1][2]


There is no singular ground-truth value of feature importance; rather, importance is a context-dependent score that varies based on the scope of the explanation, the model (or data) of interest, and the ultimate purpose of the analysis.[1] Because models can be complex, model behavior can be summarized in many different human-interpretable ways. In the presence of correlated features, the task of assigning credit becomes a multi-way trade-off in which researchers must choose between different sets of axioms that favor different principles (e.g., fidelity to the internal logic of the model or to the underlying nature of the data).

In an online survey (n=266),[2] 35.3% of researchers responded that they primarily use feature importance to get insights about data. Another 18.4% cited its primary use as justifying the model, and an equal 18.4% used it for debugging and improving the model (27.8% just wanted to see survey results).

History and Theoretical Foundations

The progression of the field of feature importance and the changes in focus reflect the broader shift in the culture of data science from parametric modeling and inference toward algorithmic modeling and prediction.[3] Feature importance continues to be an active area of research.[4]

Early Development and Correlation (1920s–1970s)

The earliest forms of feature importance assessed the strength of relationships between pairs of variables in animal biology or human psychology using methods such as Francis Galton's correlation coefficient.[5] The formal quest to determine variable importance continued with Sewall Wright's development of path analysis in 1921.[6] Wright sought to understand causal influences in complex systems, and his method of path analysis determined the correlative influences along direct paths by decomposing correlation and partial correlation coefficients into their path-based components. For decades, the dominant approach to variable importance was the inspection of standardized regression coefficients (regression weights). These weights, however, were unstable in the presence of multicollinearity (i.e., high correlation between predictor variables). This led to a problem where the importance assigned to a variable depended on the order in which it was entered into a sequential regression model.[4]

In 1960, Hoffman proposed a method (relative weights) to handle these correlations,[7] which was later critiqued and refined by Darlington in 1968 and by Green in 1978.[8][9] During this period, researchers primarily focused on partitioning the coefficient of determination (R²) among predictors.

The Averaging Movement (1980s–2000s)

To address the arbitrary nature of sequential entry, William Kruskal proposed a solution in 1987: averaging relative importance over all possible orderings of the independent variables.[10] This ensured that no single variable was unfairly penalized or elevated by its position in the model. This approach was formalized as the LMG method, named after Lindeman, Merenda, and Gold.[11] In 2005, Feldman introduced the Proportional Marginal Value Decomposition (PMVD), which added an "Exclusion" property—ensuring that a regressor with a true coefficient of zero receives a zero share of importance asymptotically.[12] Here, variable importance began to be linked with Shapley values,[13] an important development for the next era.

The Random Forest and XAI Era (2001–Present)

The year 2001 marked a shift towards machine learning with Leo Breiman’s introduction of Random Forests. Breiman moved away from model parameters (coefficients) and linear importance by introducing "permutation importance" (Mean Decrease in Accuracy; a related impurity-based measure is the Mean Decrease in Gini), which assessed nonlinear importance by measuring the drop in model performance when a feature's values were randomly shuffled.[14] Simultaneously, Lipovetsky and Conklin (2001) applied the Shapley value from cooperative game theory to regression, providing a consistent method for variance attribution in the presence of multicollinearity.[15]

Gradient-based feature importance methods emerged from early sensitivity analysis in neural networks, where partial derivatives of a model’s output with respect to its inputs were used to quantify how small input changes affect predictions. Since the explosion in prominence of deep learning, gradient-based feature attribution has increased in popularity. Simple input gradients and gradient/input methods were among the earliest and most widely used due to their computational efficiency, but they were later criticized for instability and noise, especially in deep, non-linear models. To address these issues, more principled approaches were introduced, most notably Integrated Gradients,[16] which average gradients along a path from a baseline input to the actual input and satisfy desirable axioms such as sensitivity and implementation invariance. Closely related methods include DeepLIFT, which propagates contribution scores relative to a reference activation, and Layer-wise Relevance Propagation (LRP), which redistributes prediction scores backward through the network. Variants such as SmoothGrad further improve robustness by averaging gradients over noisy perturbations.

In 2017, Scott Lundberg and Su-In Lee introduced SHAP (SHapley Additive exPlanations). SHAP unified various local attribution methods (like LIME and DeepLIFT) under the umbrella of Shapley values, providing the first mathematically consistent framework for local feature importance in any machine learning model.[17] This became one of the most cited machine learning papers of all time.[2]

Taxonomy and Classification of Methods

There are a few ways that researchers tend to classify and group different methods.

Classification by Scope

Some researchers divide feature importance methods into four distinct settings based on two axes: Global vs. Local and Data vs. Model.[18] [19] [20]

  • Global-Model Importance: Explains how a trained model behaves across the entire dataset. It identifies which features the model generally relies on for its predictions.[18]
  • Global-Data Importance: Explains the true relationships in the underlying phenomenon. It seeks to identify the intrinsic predictive power of features within the population, regardless of a specific model's choices.[1]
  • Local-Model Importance: Explains why a specific prediction was made for a single instance. It quantifies the influence of each feature on the model's output for that instance.[18]
  • Local-Data Importance: Explains the role of a feature for a specific individual in the real world (e.g., why a specific patient developed a disease), focusing on the causal or statistical dependencies for that point.[18]

Methods like SHAP and LIME are local-model. Global-model importance metrics include permutation importance[14] and SAGE.[20] Global-data methods include MCI[1] and UMFI.[21]

Classification by Correlation Treatment: The Marginal to Conditional Continuum

Feature importance methods often differ in how they treat correlation between features: if all features were mutually independent, many of these methods would give identical results. Thus, some researchers classify methods based on how they assign credit to correlated features.[22][4][23]

To distinguish marginal and conditional methods, suppose that the response variable Y is fully determined by the first predictor variable X1. Further suppose that a second predictor X2 is fully determined by X1 (for example, a coarsened, non-invertible transformation of it), while a third predictor X3 is independent of both.

  • Conditional Feature Importance: Evaluates a feature by conditioning on the values of all other features. This approach respects the dependence structure of the data and measures the unique information a feature provides that is not already captured by other variables. As such, purely conditional methods would assign all the importance to X1 while giving zero importance to X2 and X3. Methods in this category include conditional permutation importance,[22] Leave-One-Covariate-Out, and partial correlation.
  • Marginal Feature Importance: Relies on associations between the response and predictors, regardless of multicollinearity. Purely marginal methods assign high importance to both X1 and X2, since each is strongly associated with the response on its own. Examples include correlation, marginal contribution feature importance,[1] and ultra-marginal feature importance.[21]

Methods such as SHAP and permutation importance are somewhere in between the two extremes as importance is shared among correlated features.
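The contrast above can be made concrete numerically. In the sketch below (an illustrative setup, not a method from the cited literature), X2 = sign(X1) is fully determined by X1, X3 is independent noise, and Y = X1; a marginal score (squared correlation) rewards both X1 and X2, while a conditional, LOCO-style score (drop in R² when a feature is removed) gives everything to X1.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
x2 = np.sign(x1)                     # fully determined by x1, but not invertible
x3 = rng.normal(size=2000)           # independent of everything else
X = np.column_stack([x1, x2, x3])
y = x1                               # the response is fully determined by x1

# Marginal view: squared correlation of each feature with the response.
marginal = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(3)])

# Conditional view: drop in R^2 when the feature is left out (LOCO-style).
def fit_r2(cols):
    Xs = X[:, cols]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ coef
    return 1.0 - resid @ resid / (y @ y)

full = fit_r2([0, 1, 2])
conditional = np.array([full - fit_r2([k for k in range(3) if k != j])
                        for j in range(3)])
# marginal rewards both x1 and x2; conditional assigns importance only to x1.
```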

Classification by Mechanism: Gradient vs. Non-Gradient

The technical implementation of the importance measure often dictates its applicability to different model architectures.

  • Gradient-Based: Utilizes the derivatives (gradients) of the model's output with respect to its input features. These methods are typically model-specific and apply to differentiable models such as neural networks.[24] Examples: Saliency Maps, Integrated Gradients, Grad-CAM, DeepLIFT.[24]
  • Non-Gradient Based: Treats the model as a "black box" and relies on perturbations, shuffling, or submodel training. These methods are typically model-agnostic and applicable to any algorithm.[24] Examples: Permutation Importance, KernelSHAP, LOCO, MCI.[1]

Classification by Purpose

The choice of method can be driven by the end-user's objective.

  • Model Explanation: The goal is to understand the "logic" of a black-box model to ensure safety, fairness, and reliability. Methods like SHAP, LIME, and Accumulated local effects are standard here.[17]
  • Data Explanation (Scientific Inference): The goal is to learn about the real world. Researchers prioritize methods that handle redundancy and correlation in a way that reflects the true underlying relationships (e.g., MCI, UMFI).[1][21]
  • Model Optimization (Feature Selection): The goal is to improve the model's performance by removing irrelevant or redundant features. Techniques like Recursive Feature Elimination (RFE) use importance scores as a selection criterion.[25]

Other Classifications

Achen (1982) introduced a classification of linear regression-based feature importance methods: "dispersion importance" (explained variance), "level importance" (impact on the mean), and "theoretical importance" (the change in response for a given change in regressor).[26]

Axiomatic Foundations of Feature Importance

To move beyond heuristic rankings, researchers use axioms to define what a "fair" or "valid" importance score should look like. These axioms provide the mathematical justification for selecting one method over another.

The Shapley Axioms (S1–S4)

Shapley values are the unique solution that satisfies four core game-theoretic axioms, which describe how to fairly distribute the "total gain" of a model's prediction among the participating features.[27]

  1. Efficiency (Local Accuracy): The sum of the importance scores φᵢ over all features must equal the difference between the model's prediction for an instance and the expected prediction: Σᵢ φᵢ = f(x) − E[f(X)].[17]
  2. Symmetry: If two features i and j contribute exactly the same value to every possible subset of other features, they must receive the same importance score: φᵢ = φⱼ.[17]
  3. Dummy (Null Player): If a feature contributes nothing to the value function for any subset of features (v(S ∪ {i}) = v(S) for all S), its importance score must be zero. This is crucial for identifying irrelevant features.[17]
  4. Additivity (Linearity): If the value function is the sum of two functions, v = v₁ + v₂, then the importance scores must be the sum of the scores calculated for each function: φᵢ(v) = φᵢ(v₁) + φᵢ(v₂).[17]
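These axioms can be verified directly on a small cooperative game. The following brute-force sketch (all names illustrative) computes exact Shapley values via the subset-weighted formula and exhibits efficiency, symmetry, and the dummy axiom.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley value of each player: the weighted average of its
    marginal contribution value(S + {i}) - value(S) over all subsets S."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(set(S) | {i}) - value(set(S)))
        phi[i] = total
    return phi

# Toy "game": x1 and x2 each add 10 to the prediction; x3 is a dummy feature.
def v(S):
    return 10.0 * ('x1' in S) + 10.0 * ('x2' in S)

phi = shapley_values(['x1', 'x2', 'x3'], v)
# Efficiency: scores sum to v(full) - v(empty) = 20; symmetry: phi(x1) = phi(x2);
# dummy: phi(x3) = 0.
```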

Data-Driven Axioms for MCI and UMFI

For scientific discovery, the Shapley axioms are sometimes criticized because they average contributions, which can lead to diluted importance for correlated features.[1] Some researchers proposed alternative axioms for true-to-the-data methods.

  • Marginal Contribution: The importance of a feature must be at least as high as the gain it provides when added to the set of all other features.[1]
  • Elimination: Removing other features from the feature set can only decrease (or leave unchanged) the importance of a remaining feature. It cannot increase it.[1]
  • IRI & SD (Invariance under Redundant Information and Symmetry under Duplication): Adding a redundant feature should not change the importance of preexisting features, and identical features should receive equal importance.[21][1]
  • Blood Relation: A feature should have non-zero importance if and only if the feature is blood related (associated) with the response in the ground-truth causal graph.[21]

The Inconsistency Theorem

It is mathematically impossible for a single feature importance score to satisfy certain intuitive properties simultaneously—such as being consistent between local and global settings while also being robust to all types of feature dependencies (like colliders).[18] This suggests that users must prioritize specific axioms based on their task; for example, if one values local accuracy (efficiency), they might have to sacrifice robustness to perfect correlation.[27]

Methods and Algorithms

Shapley Values and SHAP Variants

SHAP (SHapley Additive exPlanations) interprets the model prediction as a "game" where feature values are the "players".[2]

  • KernelSHAP: A model-agnostic approximation that uses a weighted linear regression (the "Shapley kernel") to estimate Shapley values. Its main limitation is computational speed, as it requires many model evaluations.[2]
  • TreeSHAP: An algorithm specifically designed for tree ensembles (XGBoost, LightGBM, Random Forest). It computes exact Shapley values in polynomial time by traversing the tree structure. However, "path-dependent" TreeSHAP can sometimes produce unintuitive results because it changes the value function to rely on conditional expectations.[2]
  • DeepSHAP: Combines SHAP values with the DeepLIFT algorithm to provide fast attributions for neural networks.[24]

Permutation Importance (PI)

Permutation importance defines the importance of a feature as the increase in model error after shuffling that feature's values in the test set.[14]

  • Intuition: If the model relies on a feature, shuffling its values destroys the relationship, causing the error to spike.[2]
  • Pros: Easy to understand; does not require model retraining; captures both main effects and interactions.[2]
  • Cons: Vulnerable to correlated features; the permuted features can force the input off the training distribution.[22]
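A minimal implementation of this definition (the function names and the toy model below are illustrative) needs only a fitted prediction function and a held-out set:

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Importance of each feature = mean increase in MSE after shuffling it."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict(X) - y) ** 2)          # baseline test error
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        errs = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])       # destroy feature j's link to y
            errs.append(np.mean((predict(Xp) - y) ** 2))
        importances[j] = np.mean(errs) - baseline      # error increase over baseline
    return importances

# Toy "trained model" that truly uses only the first feature:
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0]
imp = permutation_importance(lambda X_: 2.0 * X_[:, 0], X, y)
# Only the first feature's importance is large; the unused features score zero.
```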

Conditional Permutation Importance (CPI)

Strobl et al. (2008) introduced CPI to address the bias of PI toward correlated features. CPI permutes a feature only within "blocks" defined by the values of the other features associated with it. This ensures that the shuffled values stay "local" to the original data distribution.[22]

  • Algorithm Details: The party package implementation in R uses p-values from independence tests to select which features to condition on. If the p-value is below a threshold, the feature is included in the conditioning set.[22]
  • Limitations: High sample sizes can lead to "greedy" conditioning, where almost all features are selected for the blocks, making the permutation less effective. Newer implementations like permimp aim to be less sensitive to these sample-size effects.[28]
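The core idea, shuffling within blocks so permuted values respect the dependence structure, can be sketched as follows (a simplified illustration, not the party or permimp algorithm):

```python
import numpy as np

def conditional_permute(x, z, n_bins=10, seed=0):
    """Shuffle x only within quantile bins of the conditioning variable z."""
    rng = np.random.default_rng(seed)
    x_perm = x.copy()
    edges = np.quantile(z, np.linspace(0, 1, n_bins + 1))
    labels = np.clip(np.digitize(z, edges[1:-1]), 0, n_bins - 1)
    for b in range(n_bins):
        idx = np.where(labels == b)[0]
        x_perm[idx] = rng.permutation(x[idx])          # shuffle inside the block
    return x_perm

rng = np.random.default_rng(2)
z = rng.normal(size=2000)
x = z + 0.1 * rng.normal(size=2000)                    # x strongly correlated with z
x_cond = conditional_permute(x, z)                     # block-wise shuffle
x_marg = rng.permutation(x)                            # ordinary marginal shuffle

# Conditional permutation preserves the x-z dependence far better than a
# marginal shuffle, keeping the permuted data close to the joint distribution.
r_cond = np.corrcoef(x_cond, z)[0, 1]
r_marg = np.corrcoef(x_marg, z)[0, 1]
```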

Leave-One-Covariate-Out (LOCO)

In LOCO, to find the importance of a feature, one trains two models: one with all features and one without that feature. The importance is the difference in their predictive risk (e.g., test loss).[27]

  • Comparison with Shapley: While Shapley values average marginal contributions across all submodels, LOCO looks only at the "top" of the subset lattice (the full model versus the model with one feature removed). Research by Verdinelli and Wasserman (2023) suggests that for many statistical purposes, a normalized version of LOCO is more reliable and easier to interpret than Shapley values.[27]
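A LOCO sketch with an ordinary least-squares model (an illustrative setup) looks like this:

```python
import numpy as np

def ols_predict(X_train, y_train, X_test):
    """Fit least squares on the training split, predict on the test split."""
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_test @ coef

def loco(X_train, y_train, X_test, y_test):
    """LOCO score of feature j = test loss without j minus full-model test loss."""
    full_loss = np.mean((ols_predict(X_train, y_train, X_test) - y_test) ** 2)
    scores = []
    for j in range(X_train.shape[1]):
        keep = [k for k in range(X_train.shape[1]) if k != j]
        pred = ols_predict(X_train[:, keep], y_train, X_test[:, keep])
        scores.append(np.mean((pred - y_test) ** 2) - full_loss)
    return np.array(scores)

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=400)
scores = loco(X[:200], y[:200], X[200:], y[200:])
# Dropping the truly relevant first feature raises the test loss the most.
```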

Marginal Contribution Feature Importance (MCI)

MCI is specifically designed for "data explanation". Unlike Shapley, which averages contributions, MCI identifies the maximum contribution a feature can make to any possible subset.[1]

  • Why use MCI?: In systems with high redundancy (e.g., measuring multiple similar metabolites in a biological pathway), Shapley values for each metabolite will approach zero as the number of redundant features increases. MCI remains robust, assigning high importance to any feature that could provide high predictive power in some context.[1]
  • Limitations: Computationally expensive and can miss correlated interactions.[21]
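A brute-force sketch of this idea (using in-sample R² of a least-squares fit as an illustrative value function; the original paper defines the evaluation function more carefully) shows why MCI is robust to duplication:

```python
import numpy as np
from itertools import combinations

def r2(X, y, subset):
    """In-sample R^2 of a least-squares fit on a feature subset."""
    if not subset:
        return 0.0
    Xs = X[:, list(subset)]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ coef
    return 1.0 - resid @ resid / (y @ y)

def mci(X, y):
    """MCI of feature i = its maximum gain over all subsets of other features."""
    d = X.shape[1]
    scores = []
    for i in range(d):
        others = [j for j in range(d) if j != i]
        best = 0.0
        for k in range(d):
            for S in combinations(others, k):
                best = max(best, r2(X, y, set(S) | {i}) - r2(X, y, S))
        scores.append(best)
    return np.array(scores)

rng = np.random.default_rng(4)
x1 = rng.normal(size=300)
X = np.column_stack([x1, x1.copy(), rng.normal(size=300)])  # column 1 duplicates column 0
y = x1
scores = mci(X, y)
# Both duplicated features keep full importance; the independent one stays near zero.
```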

Integrated Gradients (IG)

Integrated Gradients is a leading gradient-based method for deep networks. It addresses the "saturation" problem of simple saliency maps (where gradients can become zero even for important features) by integrating the gradients along a path from a "baseline" (e.g., an all-black image) to the actual input.[29]
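The path integral can be approximated with a simple Riemann sum. The sketch below (illustrative; deep-learning frameworks compute the gradient automatically) applies IG to a small function with a hand-coded gradient and checks the completeness property, i.e., attributions summing to f(x) − f(baseline):

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of IG along the straight path
    baseline -> x: (x - baseline) * average gradient along the path."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# f(x) = x0 * x1 + x2^2, with its analytic gradient.
f = lambda x: x[0] * x[1] + x[2] ** 2
grad_f = lambda x: np.array([x[1], x[0], 2.0 * x[2]])

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
attr = integrated_gradients(grad_f, x, baseline)
# Completeness: the attributions sum to f(x) - f(baseline).
```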

Recursive Feature Elimination

RFE is a "wrapper" method that iteratively prunes the feature set.[25]

The algorithm proceeds in four steps:

  1. Initialization: Train the model (e.g., SVM, Random Forest) on the full set of features.[30]
  2. Ranking: Use the model's internal importance measure (e.g., weights for SVM, MDI for RF) to rank the features.[30]
  3. Elimination: Remove the least important feature (or a fraction of the least important features).[30]
  4. Iteration: Repeat the process on the remaining subset until the desired number of features is reached or performance begins to drop.[30]

RFE is particularly effective because it accounts for feature interactions that might be missed by simple filter methods (like Pearson correlation). However, it is computationally expensive as it requires retraining the model in each iteration.[31]
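A compact sketch of the RFE loop (using absolute least-squares coefficients on standardized features as an illustrative internal ranking; real implementations plug in the model's own importance measure):

```python
import numpy as np

def rfe(X, y, n_keep):
    """Iteratively drop the feature with the smallest standardized coefficient."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        Xs = X[:, remaining]
        Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)       # standardize for fair ranking
        coef, *_ = np.linalg.lstsq(Xs, y - y.mean(), rcond=None)
        weakest = remaining[int(np.argmin(np.abs(coef)))]  # least important feature
        remaining.remove(weakest)                          # elimination step
    return remaining

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 5))
y = 4.0 * X[:, 1] + 2.0 * X[:, 3] + 0.1 * rng.normal(size=500)
selected = rfe(X, y, n_keep=2)
# The two truly predictive features (columns 1 and 3) survive the pruning.
```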

Applications

See also

References
