How are methods in XAI evaluated?

Mohammad Amin Dadgar
5 min read · Jan 4, 2023

After investigating Explainable AI (XAI) methods, I wondered how XAI methods themselves are evaluated. So after reading multiple articles, here's a small summary of what I've found.

Happy faces representing the enthusiasm of the AI team after having the right AI model explanations (src: unsplash.com)

One of the hot topics in XAI is the evaluation of its methods. It has been hard to evaluate the explanations that XAI methods produce, and as a result, there are far fewer articles investigating the methods themselves.

To evaluate these methods, we first need to introduce some metrics used in this field. Measures like accuracy, recall, or F1 are suitable only for evaluating the performance of the AI model itself, so they are not directly applicable to evaluating XAI methods. The first part of this article therefore introduces some XAI metrics from different research papers. In the second part, we will show how evaluation methods work.

1. Explainable AI Evaluation Metrics

There are many metrics, each investigating one or more aspects of an XAI method. I’ve listed some here.

Trustworthiness

Is my explainable AI trustworthy? Does it produce the same results in different runs? Are the explanations it suggests right? These are some questions to ask yourself to find out whether a method is trustworthy or not. As far as I know, this is a subjective metric, meaning it is hard to use a number to show how trustworthy a method is (it has not yet been measured quantitatively).

Note: The notion of trust is investigated in much greater depth in an article by Jacovi et al. [1].

Fidelity

The term fidelity, as far as I understood, belongs to XAI methods that use a surrogate model (LIME, KernelSHAP, …, which learn a transparent model to approximate the black-box model). It investigates whether the surrogate model's performance is close to that of the black-box model. The closer the two performances are, the higher this metric will be.
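To make this concrete, here is a minimal sketch of measuring fidelity as the agreement between a surrogate and a black-box model. It assumes a scikit-learn setup and a simple global linear surrogate; it illustrates the idea, not the exact procedure of LIME or KernelSHAP.

```python
# A minimal fidelity sketch, assuming a scikit-learn setup; an illustration
# of the idea, not the exact procedure of LIME or KernelSHAP.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# The "black box" we want to explain.
black_box = RandomForestClassifier(random_state=0).fit(X, y)

# A simple global linear surrogate, trained on the black box's predictions
# rather than on the true labels.
surrogate = LogisticRegression(max_iter=1000).fit(X, black_box.predict(X))

# Fidelity: fraction of samples on which surrogate and black box agree.
fidelity = np.mean(surrogate.predict(X) == black_box.predict(X))
print(f"Fidelity: {fidelity:.3f}")
```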

Causality

Causality is another metric; it represents the reasons behind a prediction. For example, generating a text to show why the method predicts a specific output for a given input. Counterfactual examples can also show the cause of why a specific output did not happen.

For example, in a loan-granting application: if the applicant had a higher salary, we could give him the loan, so the cause of the rejection is the low salary. This metric is also subjective, and I haven't yet found an article that measures it quantitatively.
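As an illustration of the counterfactual idea, here is a toy sketch for the loan example. The model, the features, and the candidate values are all hypothetical; dedicated counterfactual libraries do this far more carefully.

```python
# A toy counterfactual search for the loan example above. The model, the
# features (salary, debt) and the candidate values are all hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [salary in k$, debt in k$]; label: 1 = loan approved.
X = np.array([[30, 20], [40, 10], [60, 15], [80, 5], [50, 30], [90, 10]])
y = np.array([0, 0, 1, 1, 0, 1])
model = LogisticRegression(max_iter=1000).fit(X, y)

applicant = [35, 18]                        # this applicant is rejected
print("Decision:", model.predict([applicant])[0])

# Brute-force search: what salary increase would flip the decision?
for salary in range(35, 120, 5):
    if model.predict([[salary, applicant[1]]])[0] == 1:
        print(f"Counterfactual: with a salary of {salary}k the loan is approved")
        break
```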

Faithfulness

Are the explanations right, in the sense that they do not omit the true reasons behind the prediction?

As you can see, there is a similarity between this metric and trustworthiness, in that they both concern whether the explanations are true. The difference is that trustworthiness considers the relationship between humans and the method, while faithfulness examines the method itself. The faithfulness metric is again subjective and cannot yet be measured quantitatively.

Confidence

Confidence is a metric that is mostly subjective. It can be defined either as the confidence of the explanations produced by an XAI method, or as the confidence humans would have in the AI model after seeing the explanations.

Metrics Discussion

After reading different articles to find out what XAI metrics are, I understood that these metrics are closely tied to the fields of social sciences and psychology, because they concern how the explanations of XAI methods come across to humans. So to make them quantitative, the science of human understanding must be investigated first.

2. Evaluation Methods

Having an idea of what XAI evaluation metrics are from the first section of this article, it is easier to understand the evaluation methods themselves, which we present here.

In total, I separate evaluation methods into two categories: manual evaluation methods, in which mainly surveys and questionnaires are used, and automated evaluation methods, in which the outputs of XAI methods are investigated directly.

Manual Evaluation Methods

In this type of evaluation, XAI methods are compared via surveys or questionnaires. Comparing XAI method outputs using humans is easy to set up but expensive. For example, the outputs of an XAI method for multiple inputs are shown to different people, who are asked to complete a checklist or questionnaire.

Different articles have used this kind of evaluation, since several metrics can be derived from the results. For example, goodness, satisfaction, trust, validity, curiosity, etc. can be assessed from the answers to the questions [2, 3, 4]; a minimal sketch of aggregating such answers is shown below.
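Here is a minimal sketch of turning questionnaire answers into per-metric scores. The 1–5 Likert scale, the questions, and the grouping are illustrative only and not taken from the cited surveys [2, 3, 4].

```python
# A minimal sketch of aggregating questionnaire answers into per-metric
# scores. The 1-5 Likert scale, questions, and grouping are illustrative
# only, not taken from the cited surveys.
import numpy as np

# rows = participants, columns = questions (answers on a 1-5 scale)
responses = np.array([
    [4, 5, 3, 4],
    [3, 4, 4, 5],
    [5, 5, 4, 3],
])

# Hypothetical mapping of each question to the metric it probes.
question_to_metric = {0: "trust", 1: "trust", 2: "satisfaction", 3: "satisfaction"}

for metric in ["trust", "satisfaction"]:
    cols = [q for q, m in question_to_metric.items() if m == metric]
    print(f"{metric}: {responses[:, cols].mean():.2f}")
```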

Automated Evaluation Methods

There are many methods for evaluating XAI methods [5, 6, 7], ranging from generating synthetic data to check whether the feature rankings are correct (assuming the XAI method produces a feature ranking), to more robust methods with a stronger statistical grounding. A minimal sketch of the synthetic-data idea follows.
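This sketch assumes a scikit-learn setup and is not the procedure of any specific cited paper: it generates data where only two features influence the label and then checks whether the explainer ranks them on top.

```python
# A minimal sketch of the synthetic-data idea, assuming a scikit-learn
# setup; not the procedure of any specific cited paper. Only features
# 0 and 1 influence the label, so a good explainer should rank them first.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(int)     # only features 0 and 1 matter

model = RandomForestClassifier(random_state=0).fit(X, y)

# Here the "explanation" is the model's own global feature importance;
# a local explainer such as LIME or SHAP could be plugged in instead.
ranking = np.argsort(model.feature_importances_)[::-1]
print("Feature ranking:", ranking)
print("Top-2 features correct:", set(ranking[:2]) == {0, 1})
```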

Among the statistically grounded approaches, I'd like to present the one I find most complete for some XAI methods: the LEAF framework [7] (Local Explanation evAluation Framework). LEAF explores four aspects of the explanations of a feature-importance XAI method: local fidelity, local concordance, reiteration similarity, and prescriptivity. The first two compare the performance of the surrogate model with that of the original model, using all features and only the top K features of the surrogate, respectively. Reiteration similarity runs the XAI method multiple times and examines how consistent the results are (using box plots). The last one, prescriptivity, examines how close the decision boundary of the surrogate model is to that of the original model; the closer the boundaries, the closer the prescriptivity value is to 1.
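As an illustration of the reiteration-similarity idea (my own sketch, not LEAF's implementation), the snippet below uses permutation importance with different random seeds as a stand-in for a stochastic local explainer such as LIME, and compares the top-K feature sets across runs with Jaccard similarity.

```python
# A minimal sketch of the reiteration-similarity idea; not LEAF's exact
# implementation. Permutation importance with different random seeds stands
# in for a stochastic local explainer (e.g. LIME); we compare the top-K
# feature sets across runs with Jaccard similarity.
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

K = 3
top_k_sets = []
for seed in range(5):                       # five "reiterations" of the explainer
    result = permutation_importance(model, X, y, n_repeats=5, random_state=seed)
    top_k_sets.append(set(np.argsort(result.importances_mean)[-K:]))

# Pairwise Jaccard similarity between the top-K sets of different runs.
scores = [len(a & b) / len(a | b) for a, b in combinations(top_k_sets, 2)]
print(f"Mean reiteration similarity: {np.mean(scores):.2f}")
```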

Last words

Evaluating XAI methods is normally not easy, as it is connected to different sciences [8], but what we can do is investigate deeply enough to find the evaluation most suitable for our method. With that, we can compare XAI methods more fairly, and this can drive the community toward methods that better explain AI models.

As always, if you have any questions or want to discuss, don't hesitate to leave a comment here or reach out on my social accounts.

References

[1] “Formalizing Trust in Artificial Intelligence: Prerequisites, Causes and Goals of Human Trust in AI”, A. Jacovi, A. Marasović, T. Miller, Y. Goldberg, FAccT 2021

[2] “Explainable Artificial Intelligence: Evaluating the Objective and Subjective Impacts of xAI on Human-Agent Interaction”, A. Silva, M. Schrum, E. Hedlund-Botti, 2022, International Journal of Human–Computer Interaction

[3] “Metrics for Explainable AI: Challenges and Prospects”, Hoffman, Robert R. and Mueller, Shane T. and Klein, Gary and Litman, Jordan, 2018, arXiv

[4] “Roadmap of Designing Cognitive Metrics for Explainable Artificial Intelligence (XAI)”, Hsiao, Janet Hui-wen and Ngai, Hilary Hei Ting and Qiu, Luyu and Yang, Yi and Cao, Caleb Chen, 2021, arXiv

[5] “XAI Evaluation: Evaluating Black-Box Model Explanations for Prediction”, Y. Zhang, F. Xu, J. Zou, O. L. Petrosian and K. V. Krinkin, 2021 II International Conference on Neural Networks and Neurotechnologies (NeuroNT), pp. 13–16

[6] “A Benchmark for Interpretability Methods in Deep Neural Networks”, Hooker, Sara and Erhan, Dumitru and Kindermans, Pieter-Jan and Kim, Been, arXiv, 2018

[7] “To trust or not to trust an explanation: using LEAF to evaluate local linear XAI methods” E. G. Amparore, A. Perotti, P. Bajardi, 2021, CoRR

[8] “Explanation in artificial intelligence: Insights from the social sciences” T. Miller, Artif. Intell., vol. 267, pp. 1–38, Feb. 2019, doi: 10.1016/J.ARTINT.2018.07.007.


Mohammad Amin Dadgar

CS Artificial Intelligence Master's student and data analyst at TogetherCrew. My LinkedIn profile: https://www.linkedin.com/in/mramin22/