Modern machine-learning models such as neural networks are often called “black boxes” because they are so complex that even the researchers who design them cannot fully understand how they make predictions.
To provide some insight, scientists use explanation methods that describe individual model decisions. For example, such a method might highlight the words in a movie review that influenced the model’s judgment that the review was positive.
These explanations are useless if people cannot understand them, and they can be even worse if people misunderstand them. MIT researchers have developed a mathematical framework to formally quantify and evaluate how understandable the explanations of a machine-learning model are. The framework can reveal insights about model behavior that would be missed if a researcher tried to understand the whole model from only a handful of individual explanations.
Yilun Zhou, an electrical engineering and computer science graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL), is the lead author of a paper presenting the framework.
Zhou’s co-authors are Marco Tulio Ribeiro, a senior researcher at Microsoft Research, and senior author Julie Shah, a professor of aeronautics and astronautics and the director of the Interactive Robotics Group in CSAIL. The research will be presented at the Conference of the North American Chapter of the Association for Computational Linguistics.
Understanding local explanations
One way to understand a machine-learning model is to find another model that mimics its predictions but uses transparent reasoning patterns. Unfortunately, recent neural network models are so complex that this approach often fails. Researchers instead resort to local explanations that focus on individual inputs. These explanations often highlight words in the text to indicate their importance to one prediction made by the model.
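As a rough illustration of what a word-level local explanation can look like, the sketch below uses leave-one-out attribution: each word's score is the drop in the predicted probability of a positive review when that word is removed. The toy lexicon, the weights, and the function names are all hypothetical, and this is not necessarily the explanation method the researchers used.

```python
import math

# Hypothetical toy lexicon standing in for a real (neural) sentiment model.
WEIGHTS = {"charming": 1.5, "flawless": 2.0, "boring": -1.8}

def predict_positive(words):
    """Toy model: logistic function of the summed word weights."""
    total = sum(WEIGHTS.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-total))

def leave_one_out(words):
    """Score each word by how much P(positive) drops when it is removed."""
    base = predict_positive(words)
    return [
        (w, base - predict_positive(words[:i] + words[i + 1:]))
        for i, w in enumerate(words)
    ]

review = "a charming but boring film".split()
attributions = dict(leave_one_out(review))
for word, score in attributions.items():
    print(f"{word:10s} {score:+.3f}")
```

Words with positive scores pushed this one prediction toward "positive" and words with negative scores pushed it away. The scores say nothing, on their own, about how the model treats those words in other reviews.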
People implicitly generalize these local explanations to overall model behavior. Someone might see that a local explanation method highlighted positive words (such as “memorable,” “flawless,” or “charming”) as the most influential when the model decided that a movie review had a positive sentiment. They are then likely to assume that all positive words contribute positively to the model’s predictions, Zhou says. However, this may not always be true.
The researchers developed a framework called ExSum, short for explanation summary, that formalizes these types of claims into rules that can be tested using quantifiable metrics. ExSum evaluates a rule on an entire dataset, not just on the single instance for which it was constructed.
Using a graphical user interface, a person writes rules that can then be modified, tuned, and evaluated. For example, when studying a model that learns to classify movie reviews as positive or negative, one could write a rule stating that negation words contribute negatively, meaning that words such as “not,” “no,” and “nothing” push the model toward a negative sentiment.
ExSum lets the user see how well a rule holds up using three metrics: coverage, validity, and sharpness. Coverage measures how broadly applicable the rule is across the entire dataset. Validity is the percentage of individual examples that agree with the rule. Sharpness describes how precise the rule is; a highly valid rule might be so generic that it is not useful for understanding the model.
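To make the three metrics concrete, here is a minimal sketch under simplifying assumptions. A rule is modeled as a word filter plus a claimed range of attribution scores. Coverage and validity follow the descriptions above, while sharpness is approximated by how few words overall fall inside the claimed range, which is only a stand-in for the paper's formal definition. The `Rule` class, the function names, and the scores are invented for illustration and are not the ExSum API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# One attributed word: (word, attribution score in [-1, 1]).
Unit = Tuple[str, float]

@dataclass
class Rule:
    applies: Callable[[str], bool]  # which words the rule makes a claim about
    low: float                      # claimed lower bound on the attribution
    high: float                     # claimed upper bound on the attribution

    def holds(self, score: float) -> bool:
        return self.low <= score <= self.high

def coverage(rule: Rule, units: List[Unit]) -> float:
    """Fraction of all words that the rule applies to."""
    return sum(rule.applies(w) for w, _ in units) / len(units)

def validity(rule: Rule, units: List[Unit]) -> float:
    """Among covered words, fraction whose score lies in the claimed range."""
    covered = [s for w, s in units if rule.applies(w)]
    if not covered:
        return 0.0
    return sum(rule.holds(s) for s in covered) / len(covered)

def sharpness(rule: Rule, units: List[Unit]) -> float:
    """Simplified proxy: 1 minus the share of ALL scores inside the range.

    A range that almost every word satisfies says little, so it scores low.
    """
    return 1.0 - sum(rule.holds(s) for _, s in units) / len(units)

# Invented toy scores, pooled over a whole dataset rather than one review.
units = [("not", -0.6), ("no", -0.4), ("nothing", 0.1), ("charming", 0.7),
         ("film", 0.0), ("flawless", 0.8), ("plot", -0.05), ("memorable", 0.5)]

# Rule: negation words have clearly negative attribution.
negation = Rule(applies=lambda w: w in {"not", "no", "nothing"},
                low=-1.0, high=-0.1)

print(coverage(negation, units), validity(negation, units),
      sharpness(negation, units))
```

Widening the claimed range to cover every possible score would push validity to 1 but sharpness to 0, which is the trade-off the paragraph above describes.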
Testing assumptions
A researcher who wants a deeper understanding of how her model behaves can use ExSum to test specific assumptions, Zhou says.
If she suspects that her model discriminates by gender, she could create rules stating that male pronouns make a positive contribution and female pronouns make a negative contribution. If these rules have high validity, they hold across the dataset and the model is likely biased.
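A check of this kind could be sketched as follows. The pronoun lists, the attribution scores, and the `sign_validity` helper are hypothetical, and the numbers are invented purely to show the computation.

```python
# Invented (pronoun, attribution) pairs pooled across a dataset.
units = [("he", 0.3), ("his", 0.2), ("him", -0.1),
         ("she", -0.4), ("her", -0.2), ("hers", -0.3)]

MALE = {"he", "him", "his"}
FEMALE = {"she", "her", "hers"}

def sign_validity(words, expect_positive, units):
    """Fraction of matching pronouns whose attribution has the claimed sign."""
    scores = [s for w, s in units if w in words]
    if not scores:
        return 0.0
    agree = sum((s > 0) if expect_positive else (s < 0) for s in scores)
    return agree / len(scores)

# Rule 1: male pronouns contribute positively.
male_validity = sign_validity(MALE, True, units)
# Rule 2: female pronouns contribute negatively.
female_validity = sign_validity(FEMALE, False, units)

print(f"male rule validity:   {male_validity:.2f}")
print(f"female rule validity: {female_validity:.2f}")
```

If both rules held for most pronoun occurrences in real data, that would be evidence worth investigating, though not proof of bias by itself.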
ExSum can also reveal unexpected information about a model’s behavior. For example, when evaluating the movie review classifier, the researchers found that negative words had a greater impact on the model’s decisions than positive words. Zhou suggests this could be because reviewers try to be polite rather than blunt when criticizing a film.
Extending the framework
So far, the researchers have focused on feature attribution methods, which describe the individual features a model used to make a decision. Zhou plans to build on this work by extending the notion of understanding to other criteria and other forms of explanation, such as counterfactual explanations, which indicate how to modify an input in order to change the model’s prediction.
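For readers unfamiliar with the term, a counterfactual explanation can be illustrated with a brute-force search for a single word substitution that flips a toy model's label. The lexicon, the weights, and the function names below are hypothetical, and the search strategy is only one simple way to generate such an explanation.

```python
import math

# Hypothetical toy lexicon standing in for a real sentiment model.
WEIGHTS = {"charming": 1.5, "flawless": 2.0, "boring": -1.8, "dull": -1.2}

def predict_positive(words):
    """Toy model: logistic function of the summed word weights."""
    total = sum(WEIGHTS.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-total))

def single_word_counterfactual(words, replacements):
    """Return (index, new_word, new_text) for the first one-word swap
    that flips the predicted label, or None if no such swap exists."""
    original_label = predict_positive(words) >= 0.5
    for i in range(len(words)):
        for new_word in replacements:
            candidate = words[:i] + [new_word] + words[i + 1:]
            if (predict_positive(candidate) >= 0.5) != original_label:
                return i, new_word, candidate
    return None

review = "a boring film".split()
result = single_word_counterfactual(review, ["charming"])
print(result)
```

Here the explanation reads as "replacing the word at the returned position with the new word would have changed the prediction," which is a different kind of statement from a per-word attribution score.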
He also wants to improve the framework and user interface so that people can create rules more quickly. Writing rules can take hours. Some human involvement is necessary because humans must ultimately be able to understand the explanations, but AI assistance could speed up the process.
As he considers the future of ExSum, Zhou hopes the work will help change the way researchers think about machine-learning model explanations.

