
Faithful or Deceptive? Evaluating the Faithfulness of Natural Language Explanations

Explaining how neural models make predictions is important, but current methods like saliency maps and counterfactuals can sometimes mislead us. They don’t always provide accurate insights into how the model actually works.

Researchers from the University of Copenhagen, Denmark, University College London, UK, and the University of Oxford, UK, conducted a study titled Faithfulness Tests for Natural Language Explanations to evaluate the faithfulness of natural language explanations (NLEs).

The team included Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, Jakob Grue Simonsen, and Isabelle Augenstein.

The team devised two tests. The first uses a counterfactual input editor that inserts words into the input so that the model's prediction changes, then checks whether the NLEs mention those inserted words; in many cases they did not. The second reconstructs inputs from the reasons stated in the NLEs and checks whether the reconstructed inputs lead to the same predictions as the originals. Such tests are crucial for assessing the faithfulness of emerging NLE models and for developing trustworthy explanations.

The Faithfulness Tests

The Counterfactual Test

Do natural language explanation (NLE) models accurately reflect the reasons behind counterfactual predictions? Counterfactual explanations are often sought after by humans to understand why one event occurred instead of another.

In the field of machine learning (ML), interventions can be made on the input or representation space to generate counterfactual explanations. In this study, the researchers focus on interventions that insert tokens into the input to create a new instance that yields a different prediction. The goal is to determine whether the NLEs generated by the model reflect these inserted tokens.


To accomplish this, the researchers define an intervention function that generates a set of words (W) to be inserted into the original input. The resulting modified input should lead to a different prediction. The NLE is expected to include at least one word from W that corresponds to the counterfactual prediction. The researchers provide examples of such interventions in the appendix.

To generate the input edits (W), the researchers propose a neural model editor. During training, tokens in the input are masked, and the editor predicts the masked tokens using the model’s predicted label. The inference is performed by searching for different positions to insert candidate tokens. The training objective is to minimize the cross-entropy loss for generating the inserts.
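The masking step can be sketched as follows. This is a simplified illustration: the paper's editor fine-tunes a pretrained model, while `mask_consecutive_tokens`, its parameters, and the `<mask>` placeholder here are illustrative stand-ins.

```python
import random

def mask_consecutive_tokens(tokens, max_span=3, seed=0):
    """Replace a random number of consecutive tokens with a single
    mask placeholder; the editor is trained to regenerate the masked
    span conditioned on the model's predicted label."""
    rng = random.Random(seed)
    span = rng.randint(1, min(max_span, len(tokens)))
    start = rng.randint(0, len(tokens) - span)
    masked = tokens[:start] + ["<mask>"] + tokens[start + span:]
    return masked, tokens[start:start + span]

masked, target = mask_consecutive_tokens(
    ["a", "man", "is", "playing", "guitar"], seed=0)
```

At inference time, the editor instead tries candidate insertions at different positions and keeps those that flip the model's prediction.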

The researchers measure the unfaithfulness of NLEs as the percentage of test-set instances for which the editor finds counterfactual interventions that are not reflected in the NLEs. The measure relies on syntactic matching, so paraphrases of the inserted tokens could appear in the NLEs instead; a subset of NLEs is therefore manually verified.
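A minimal sketch of how such a percentage could be computed, assuming each instance carries the editor's inserted words and the resulting counterfactual NLE (the function name, data layout, and whitespace tokenization are illustrative simplifications):

```python
def unfaithfulness_rate(instances):
    """Share of instances where the editor's insertion flipped the
    prediction, yet none of the inserted words appear in the
    counterfactual NLE (syntactic match, as in the automatic metric)."""
    def missing(inserted_words, nle):
        nle_tokens = {t.lower().strip(".,!?") for t in nle.split()}
        return not any(w.lower() in nle_tokens for w in inserted_words)

    flagged = sum(missing(words, nle) for words, nle in instances)
    return flagged / len(instances)

examples = [
    (["empty"], "The street is empty, so no one is walking."),  # reflected
    (["sleeping"], "A man playing guitar is making music."),    # not reflected
]
rate = unfaithfulness_rate(examples)  # 1 of 2 instances flagged
```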

It’s important to note that this metric only evaluates the faithfulness of NLEs regarding counterfactual predictions and relies on the performance of the editor. However, if the editor fails to find significant counterfactual reasons that are not reflected in the NLEs, it can be considered evidence of the NLEs’ faithfulness.

The Input Reconstruction Test

Do the reasons provided in a natural language explanation (NLE) lead to the same prediction as the original input?

The concept of sufficiency is used to evaluate the faithfulness of explanations. If the reasons in an explanation are sufficient for the model to make the same prediction as on the original input, the explanation is considered faithful. This concept has been applied to saliency explanations, where the mapping between tokens and saliency scores allows for easy construction of the reasons.

However, for NLEs, which lack this direct mapping, automated extraction of reasons is challenging. In this study, task-dependent automated agents called Rs are proposed to extract the reasons. Rs are built for the e-SNLI and ComVE datasets due to the structure of the NLEs and dataset characteristics. However, constructing an R for the CoS-E dataset was not possible.

For e-SNLI, many NLEs follow specific templates, and a list of these templates covering a large portion of the NLEs is provided. In the test, the reconstructed premise and hypothesis are taken from these templates. Only sentences containing a subject and a verb are considered. If the NLE for the original input is faithful, the prediction for the reconstructed input should be the same as the original.
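Assuming templates along these lines (the paper's template list is longer, and the patterns below are invented for illustration), the reconstruction step for e-SNLI might look like:

```python
import re

# Hypothetical simplification of the paper's template list: many
# e-SNLI explanations follow patterns from which a reconstructed
# premise/hypothesis pair can be read off directly.
TEMPLATES = [
    re.compile(r"^(?P<premise>.+?) implies (?P<hypothesis>.+?)\.?$"),
    re.compile(r"^just because (?P<premise>.+?) does not mean (?P<hypothesis>.+?)\.?$",
               re.IGNORECASE),
]

def reconstruct_input(nle):
    """Return a (premise, hypothesis) pair if the NLE matches a known
    template, else None (the real test also requires each extracted
    sentence to contain a subject and a verb)."""
    for pattern in TEMPLATES:
        m = pattern.match(nle.strip())
        if m:
            return m.group("premise"), m.group("hypothesis")
    return None

pair = reconstruct_input("A man playing guitar implies a man is making music.")
```

If the original NLE is faithful, feeding the reconstructed pair back to the model should yield the original prediction.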

In the ComVE task, the goal is to identify the sentence that contradicts common sense. If the generated NLE is faithful, replacing the correct sentence with the NLE should result in the same prediction.
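The ComVE reconstruction described above can be sketched as a simple swap; the function name and the "a"/"b" labeling of the sentence pair are illustrative assumptions.

```python
def reconstruct_comve_input(sentence_a, sentence_b, predicted_nonsense, nle):
    """Keep the sentence the model flagged as against common sense and
    replace the other sentence with the generated NLE; a faithful NLE
    should leave the model's choice unchanged."""
    if predicted_nonsense == "a":
        return sentence_a, nle
    return nle, sentence_b

pair = reconstruct_comve_input(
    "He put an elephant in the fridge.",
    "He put a turkey in the fridge.",
    "a",
    "An elephant is too big to fit in a fridge.")
```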

Overall, the study investigates whether the reasons provided in NLEs are sufficient to lead to the same prediction as the original input. Automated agents are used to extract the reasons for specific datasets, and if the reconstructed inputs based on these reasons yield the same predictions, it suggests the faithfulness of the NLEs.


Experiments – Investigating Setup Variations and Conditioning Strategies

The study explores four setups for natural language explanation (NLE) models based on whether they use multi-task or single-task objectives and whether the generation of NLEs is conditioned on the predicted label or not.

The setups are denoted as MT (multi-task) or ST (single-task), and Ra (rationalizing models) or Re (reasoning models). The models used for prediction and NLE generation are based on the T5-base model.


Both the prediction model and the editor model are trained for 20 epochs, with evaluation at each epoch to select the checkpoints with the highest success rate. The Adam optimizer is used with a learning rate of 1e-4. During training, the editor masks a random number of consecutive tokens, and during inference, candidate insertions are generated for random positions.


A manual evaluation is conducted by annotating the first 100 test instances for each model. The evaluation follows a similar approach to related work and confirms that paraphrases of the inserted words do not appear in the evaluated instances, supporting the trustworthiness of the automatic metric used in the study.


For the counterfactual test, a random baseline is introduced to serve as a comparison. Random adjectives are inserted before nouns and random adverbs before verbs. Positions for insertion are randomly selected, and candidate words are chosen from a complete list of adjectives and adverbs available in WordNet. Nouns and verbs in the text are identified using spaCy.
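The random baseline can be sketched as below. Note the toy stand-ins: a short adjective list in place of WordNet's full inventory, and a fixed noun set in place of spaCy's part-of-speech tagging (both are assumptions for illustration).

```python
import random

ADJECTIVES = ["red", "quiet", "sudden", "tiny"]   # stand-in for WordNet
NOUNS = {"man", "dog", "park", "guitar"}          # stand-in for spaCy POS tags

def random_adjective_insertion(tokens, seed=0):
    """Insert a random adjective immediately before a randomly chosen
    noun, mirroring the random baseline for the counterfactual test."""
    rng = random.Random(seed)
    noun_positions = [i for i, t in enumerate(tokens) if t in NOUNS]
    if not noun_positions:
        return tokens
    pos = rng.choice(noun_positions)
    return tokens[:pos] + [rng.choice(ADJECTIVES)] + tokens[pos:]

edited = random_adjective_insertion(["a", "man", "walks", "a", "dog"])
```

The baseline for adverbs before verbs works the same way with a verb set and adverb list.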

  • Three datasets with NLEs are utilized: e-SNLI, CoS-E, and ComVE.
  • e-SNLI focuses on entailment, CoS-E deals with commonsense question answering, and ComVE involves commonsense reasoning.


Counterfactual Test

The study makes two key observations.

Firstly, the random baseline tends to find words that are less often found in the corresponding NLE compared to the counterfactual editor. This could be because the randomly selected words are rare in the dataset compared to the words the editor learns to insert.

Secondly, the counterfactual editor is more effective at finding words that change the model’s prediction, resulting in a higher overall percentage of unfaithful instances. The insertions made by the editor lead to counterfactual predictions for a significant number of instances.

The combined results of the random baseline and the editor show high percentages of unfaithfulness to the counterfactual, ranging from 37.04% to 59.04% across different datasets and models. However, it’s important to note that these percentages should not be interpreted as a comprehensive estimate of unfaithfulness since the test is not exhaustive.

The Input Reconstruction Test

The test revealed that inputs could be reconstructed for a significant number of instances: up to 4,487 of the 10,000 test instances in e-SNLI, and all test instances in ComVE. However, a substantial percentage of NLEs were found to be unfaithful, reaching up to 14% for e-SNLI and 40% for ComVE. Examples can be found in Table 1 (row 2) and Table 6. Interestingly, this test identified more unfaithful NLEs for ComVE than for e-SNLI, highlighting the importance of diverse faithfulness tests.

In terms of model performance, all four model types showed similar faithfulness results across datasets, with no consistent ranking. This contradicts the hypothesis that certain configurations, such as ST-Re being more faithful than MT-Re, would consistently hold true. Additionally, Re models tended to be less faithful than Ra models in most cases.

Tests for Saliency Maps

The faithfulness and utility of explanations have been extensively explored for saliency maps, which measure the importance of tokens in a model’s decision-making. Various metrics, such as comprehensiveness and sufficiency, have been proposed to evaluate the faithfulness of saliency maps by assessing the impact of removing important tokens or manipulating them adversarially. However, saliency maps can be manipulated to conceal biases and do not directly apply to natural language explanations (NLEs), which can include text not present in the input.
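For reference, the comprehensiveness and sufficiency metrics mentioned above are commonly computed as simple differences in the predicted class probability; the sketch below is illustrative, and the variable names are not from the paper.

```python
def comprehensiveness(prob_full, prob_without_salient):
    """Probability drop for the predicted class after deleting the
    tokens a saliency map marks as important; a larger drop suggests
    the map identified tokens the model truly relied on."""
    return prob_full - prob_without_salient

def sufficiency(prob_full, prob_salient_only):
    """Probability drop when keeping only the important tokens; a
    smaller drop means the selected tokens suffice for the prediction."""
    return prob_full - prob_salient_only

comp = comprehensiveness(0.9, 0.3)   # large drop: salient tokens mattered
suff = sufficiency(0.9, 0.85)        # small drop: salient tokens suffice
```

These metrics presuppose a token-to-score mapping, which is exactly what NLEs lack, hence the need for the tests proposed in this study.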

In this study, the authors propose diagnostic tests specifically designed to evaluate the faithfulness of NLE models. These tests aim to address the unique challenges posed by NLEs, where explanations can go beyond the input. By developing diagnostic methods tailored to NLEs, the study aims to provide a framework for evaluating the faithfulness of NLE models, contributing to the understanding and improvement of these explanations.

Overall, the study highlights the need for diagnostic tests that can assess the faithfulness of NLE models, as existing approaches developed for saliency maps are not directly applicable. The proposed tests aim to fill this gap and provide a specific evaluation framework for NLEs.

Tests for NLEs

Previous work has focused on assessing the plausibility and utility of natural language explanations (NLEs). Some studies have explored the benefits of additional context in NLEs for model predictions, while others have measured the utility of NLEs in simulating a model’s output. There is limited research on sanity tests for the faithfulness of NLEs, with only one study proposing two pass/fail tests. In contrast, the current study introduces complementary tests that provide quantitative metrics to evaluate the faithfulness of NLEs. These tests aim to enhance the understanding and assessment of NLEs by offering additional insights into their reliability.


Limitations Of NLEs

Although the tests provide valuable insights into the faithfulness of NLEs, they have some limitations. Firstly, NLEs may not be expected to capture all of the reasons behind a model's prediction; an NLE that fails to mention the reasons for a counterfactual prediction may therefore still be a faithful reflection of other relevant factors. Additionally, both the random baseline and the counterfactual editor can generate incoherent text. Future research should explore methods to generate semantically coherent insertion candidates that reveal unfaithful NLEs.

The second test relies on task-dependent heuristics, which may not apply to all tasks. The proposed reconstruction functions for the e-SNLI and ComVE datasets are based on manual rules, but such rules were not feasible for the CoS-E dataset. To address this limitation, future research could explore automated reconstruction functions that utilize machine learning models. These models would be trained to generate reconstructed inputs based on the generated NLEs, with a small number of annotated instances provided for training. This approach would enable the development of machine learning models capable of generating reconstructed inputs for different datasets, enhancing the applicability and scalability of the tests.


The study presented two tests that assess the faithfulness of natural language explanation (NLE) models. The results indicate that all four high-level setups of NLE models are susceptible to generating unfaithful explanations. This underscores the importance of establishing proof of faithfulness for NLE models.

The tests introduced in the study provide valuable tools to evaluate the faithfulness of emerging NLE models. Furthermore, the findings encourage the development of complementary tests to comprehensively evaluate the faithfulness of NLEs. By promoting robust evaluations, such work can advance the understanding and reliability of NLE models in providing faithful explanations.
