The world of artificial intelligence (AI) is constantly advancing, bringing revolutionary changes to sectors from medicine to manufacturing. Among the most promising developments are vision-language models (VLMs), sophisticated systems trained to understand images and text together. Their ability to connect visual information with linguistic descriptions opens the door to numerous applications, including advanced diagnostic tools and automated quality control systems. However, recent research, including work conducted at the Massachusetts Institute of Technology (MIT), sheds light on a critical weakness of these models: their inability to correctly process negation. Words like "no," "not," or "without" can lead to completely erroneous interpretations, with potentially catastrophic consequences in sensitive applications.
Unexpected failures of artificial intelligence: The problem with negation
Imagine a scenario in a radiology department. A doctor analyzing a patient's chest X-ray notices tissue swelling but observes that the heart is not enlarged. To speed up the diagnosis and find similar documented cases, the radiologist might rely on a vision-language model. If the AI system misinterprets the query and retrieves cases of patients who have both swollen tissue and an enlarged heart, the resulting diagnosis could be drastically different: the combination of swollen tissue and an enlarged heart strongly suggests a cardiac problem, whereas swollen tissue without an enlarged heart points to a range of other possible causes. Such a misinterpretation, caused by a misunderstanding of negation, can steer the diagnostic process in entirely the wrong direction.
Researchers from MIT, in collaboration with colleagues from other institutions, have systematically investigated this problem. Their findings indicate that vision-language models are distinctly prone to errors in real-world situations when faced with negation words. Kumail Alhamoud, a graduate student at MIT and lead author of the study, emphasizes: "These negation words can have a very significant impact, and if we blindly use these models, we can face catastrophic consequences." This warning is not limited to medical diagnostics; it extends to all high-stakes applications where decisions are based on information generated by these AI systems, from autonomous vehicles to quality control in industrial plants.
How do visual-language models work and where does the 'short circuit' occur?
Vision-language models (VLMs) are sophisticated machine learning systems trained on vast datasets of images paired with textual descriptions. During training, the models learn to encode both images and text into numerical representations known as vector embeddings, with the goal of producing similar vectors for an image and its corresponding description. VLMs typically use two separate encoders, one for images and one for text, which are optimized jointly so that their output vectors are as close as possible for semantically related image-text pairs.
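To make the dual-encoder setup concrete, here is a minimal sketch that scores an image against two candidate captions with an off-the-shelf CLIP model from Hugging Face Transformers. It is a generic illustration of the architecture described above, not one of the models evaluated in the study; the model name and the example image URL are standard public examples.

```python
# Minimal sketch of a dual-encoder VLM (CLIP) scoring image-text pairs.
# Requires the transformers, torch, Pillow, and requests packages.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Public example image (two cats on a couch) used in the CLIP documentation.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

captions = ["a photo of two cats", "a photo of a dog"]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# The text encoder embeds each caption, the image encoder embeds the image;
# logits_per_image holds the scaled cosine similarities between the two.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, prob in zip(captions, probs[0].tolist()):
    print(f"{prob:.3f}  {caption}")
```

Because both encoders map into the same embedding space, retrieval and caption matching reduce to comparing these vectors, which is exactly where a misread negation can silently flip the ranking.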
The problem with negation stems from the very nature of the data on which these models are trained. "Image descriptions mostly express what is in the images – they are positive labels. And that's actually the whole problem. Nobody looks at a picture of a dog jumping over a fence and describes it with 'a dog jumping over a fence, without a helicopter'," explains Marzyeh Ghassemi, an associate professor at MIT and senior author of the research. Since training datasets predominantly contain affirmative descriptions, VLMs simply do not have enough opportunities to learn to recognize and correctly interpret negation. The lack of examples where it is explicitly stated what *is not* present in the image leads to models developing a kind of "affirmation bias."
Testing the limits of understanding: How models failed the negation test
To investigate this problem further, scientists designed two specific benchmark tasks aimed at testing the ability of VLMs to understand negation. In the first task, they used a large language model (LLM) to generate new descriptions for existing images. The LLM was asked to think about related objects that are *not present* in the image and include them in the description. They then tested the VLMs by giving them queries with negative words, asking them to retrieve images that contain certain objects but not others. For example, a model might be tasked with finding images with a cat but without a dog.
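The scoring of such a retrieval test can be sketched roughly as follows. The embedding functions and ground-truth object labels below are hypothetical placeholders standing in for any dual-encoder VLM and annotated dataset; this is not the study's released evaluation code.

```python
# Illustrative scoring of a negation retrieval query such as "a cat but no dog".
# embed_text and embed_images are placeholders for any dual-encoder VLM;
# the object labels are hypothetical ground truth used only to compute the score.
from dataclasses import dataclass

import numpy as np


@dataclass
class LabeledImage:
    path: str
    objects: set  # ground-truth objects visible in the image


def rank_images(query, images, embed_text, embed_images, top_k=10):
    """Rank images by cosine similarity to the text query (highest first)."""
    q = embed_text([query])[0]                           # shape (d,)
    feats = embed_images([img.path for img in images])   # shape (n, d)
    q = q / np.linalg.norm(q)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    order = np.argsort(-(feats @ q))[:top_k]
    return [images[i] for i in order]


def negation_precision(retrieved, wanted="cat", excluded="dog"):
    """Fraction of retrieved images that contain `wanted` and truly lack `excluded`."""
    hits = [img for img in retrieved if wanted in img.objects and excluded not in img.objects]
    return len(hits) / max(len(retrieved), 1)

# Usage: top = rank_images("a photo with a cat but no dog", dataset, embed_text, embed_images)
#        print(negation_precision(top))  # an affirmation-biased model tends to return cat-and-dog images
```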
The second task consisted of multiple-choice questions. The VLM was shown an image and had to choose the most appropriate description from a series of very similar options. These descriptions differed only in details: some added a reference to an object not appearing in the image, while others negated an object that was clearly visible. The results were sobering. Models often failed on both tasks. In image retrieval, performance dropped by almost 25% when queries contained negations. On the multiple-choice questions, the best models achieved an accuracy of only about 39%, while some models performed at or even below the level of random guessing.
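In a dual-encoder VLM, answering such a multiple-choice question amounts to picking the caption whose embedding scores highest against the image, which is roughly what the sketch below does; the embedding functions and the example format are assumptions for illustration.

```python
# Sketch of the multiple-choice evaluation: the model "answers" by picking the
# caption whose embedding is most similar to the image embedding.
# embed_image and embed_text are placeholders for any dual-encoder VLM.
import numpy as np


def choose_caption(image_path, candidate_captions, embed_image, embed_text):
    img = embed_image(image_path)                      # shape (d,)
    caps = embed_text(candidate_captions)              # shape (k, d)
    img = img / np.linalg.norm(img)
    caps = caps / np.linalg.norm(caps, axis=1, keepdims=True)
    return int(np.argmax(caps @ img))                  # index of the best-scoring caption


def accuracy(examples, embed_image, embed_text):
    """examples: list of (image_path, candidate_captions, correct_index) triples."""
    correct = sum(
        choose_caption(path, caps, embed_image, embed_text) == answer
        for path, caps, answer in examples
    )
    return correct / len(examples)
```

If the text encoder largely ignores "no" or "without", a caption that wrongly negates a visible object embeds almost identically to the correct one, which is consistent with accuracies near random guessing.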
One of the key reasons for such failures lies in the aforementioned "affirmation bias." VLMs tend to ignore negation words and focus exclusively on the objects mentioned in the query, regardless of whether those objects are affirmed or negated. "This doesn't just happen with words like 'no' and 'not'. No matter how you express negation or exclusion, the models will simply ignore it," points out Alhamoud. This weakness proved consistent across all tested vision-language models, including some of the most well-known and widely used in the industry.
The search for a solution: New datasets and future directions
Faced with this challenge, the researchers did not stop at identifying the problem. As a first step toward a solution, they developed new datasets that explicitly include negation words. Starting from an existing dataset of 10 million image-text pairs, they prompted a large language model to propose related captions that specify what is excluded from each image, producing new descriptions enriched with negations. Particular attention was paid to making these synthetically generated captions sound natural, so that VLMs trained on them would not later fail on the more complex, human-written descriptions they encounter in the real world.
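Such an augmentation step might look roughly like the sketch below. The `call_llm` function is a placeholder for whatever LLM API is available, and the prompt wording is an assumption for illustration, not the prompt used in the study.

```python
# Illustrative caption augmentation: ask an LLM to add a plausible negated
# object to an existing caption. call_llm is a hypothetical function that sends
# a prompt to any large language model and returns its text response.

PROMPT_TEMPLATE = (
    "Original caption: {caption}\n"
    "Name one object that is related to this scene but is NOT in the image, "
    "then rewrite the caption so it naturally mentions that the object is absent. "
    "Return only the rewritten caption."
)


def augment_with_negation(caption: str, call_llm) -> str:
    """Return a caption enriched with a natural-sounding negation."""
    return call_llm(PROMPT_TEMPLATE.format(caption=caption))

# Hypothetical example:
# augment_with_negation("a dog jumping over a fence", call_llm)
# -> "a dog jumping over a fence, with no people around"
```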
After creating these enriched datasets, the team fine-tuned existing VLMs on them. The results were encouraging: fine-tuning with the new data improved performance across the board. The models' ability to retrieve images based on queries with negation improved by approximately 10%, while accuracy on the multiple-choice question answering task increased by about 30%.
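Fine-tuning a dual-encoder VLM on such data typically reuses the same contrastive objective it was pretrained with. The sketch below shows what one training step could look like, with the open-source CLIP model from Hugging Face Transformers as a stand-in; the model choice, learning rate, and data handling are assumptions, not the study's actual training setup.

```python
# Minimal sketch of contrastive fine-tuning on negation-enriched captions,
# using open-source CLIP as a stand-in for the models fine-tuned in the study.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)


def training_step(images, negation_captions):
    """One step: images is a list of PIL images paired with captions that include negations."""
    inputs = processor(text=negation_captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    outputs = model(**inputs, return_loss=True)  # symmetric image-text contrastive loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```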
"Our solution is not perfect. We are just re-describing datasets, which is a form of data augmentation. We haven't even touched how these models work, but we hope this is a signal that this is a solvable problem and that others can take our solution and improve it," Alhamoud modestly comments. Nevertheless, this progress shows that the problem is not insurmountable and that targeted data enrichment can bring significant improvements.
Broader implications and the necessity of caution
The findings of this research, which will be presented at the prestigious Conference on Computer Vision and Pattern Recognition, have far-reaching implications. They serve as an important warning to users and developers of vision-language models: if something as fundamental as understanding negation is impaired, it raises questions about the reliability of these systems in many existing applications. Professor Ghassemi emphasizes: "This is a technical paper, but there are larger questions to consider. If something as basic as negation is broken, we should not be using large vision-language models in many of the ways we currently use them, without intensive evaluation."
Therefore, it is crucial that potential users of these technologies are aware of this, perhaps previously overlooked, shortcoming. Before implementing VLMs in high-risk environments, it is necessary to conduct thorough testing, including scenarios with negations, to assess their actual reliability. This problem is not limited to specific words like "no" or "not"; it concerns the general ability of models to understand absence, exclusion, or opposition.
Future research could focus on deeper changes in the architecture of the models themselves. One possible direction is to train VLMs to process textual and visual information in a way that allows them to better capture semantic nuances, including negation. This might involve developing more sophisticated attention mechanisms or new loss functions that explicitly penalize misinterpretation of negations during training. Furthermore, the development of additional, specialized datasets tailored to specific application areas such as healthcare could further enhance the performance and safety of these powerful tools. While vision-language models undoubtedly offer enormous potential, ensuring their robust and reliable functioning, especially with respect to negation, remains a key challenge for the scientific community.
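As one purely illustrative example of what "explicitly penalizing misinterpretation of negations" could mean, the sketch below adds a margin penalty that pushes an image closer to its true caption than to a contradicting caption in which a visible object is negated. This is a speculative construction for the sake of concreteness, not a method proposed in the paper.

```python
# Purely illustrative negation-aware margin penalty (not from the paper).
# For each image we assume a true caption and a contradicting caption in which
# an object that is actually visible has been negated.
import torch
import torch.nn.functional as F


def negation_margin_loss(image_emb, true_cap_emb, negated_cap_emb, margin=0.2):
    """Require the image to be closer (in cosine similarity) to its true caption
    than to the contradicting, negated caption by at least `margin`."""
    image_emb = F.normalize(image_emb, dim=-1)
    true_cap_emb = F.normalize(true_cap_emb, dim=-1)
    negated_cap_emb = F.normalize(negated_cap_emb, dim=-1)
    sim_true = (image_emb * true_cap_emb).sum(dim=-1)
    sim_neg = (image_emb * negated_cap_emb).sum(dim=-1)
    return F.relu(margin - (sim_true - sim_neg)).mean()
```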
Source: Massachusetts Institute of Technology
Creation time: 15 May, 2025