Fatal AI blind spot: Visual-language models don't understand 'no', study finds

Visual-language models (VLMs) are revolutionizing technology, but an MIT study reveals a critical flaw: a fundamental misunderstanding of negation. This weakness can lead to catastrophic errors in medicine and other high-risk AI applications, as the models ignore words like 'no' or 'not', calling their reliability into question.

Photo by: Domagoj Skledar / own archive

The world of artificial intelligence (AI) is constantly advancing, bringing revolutionary changes to various sectors, from medicine to manufacturing. Among the most promising achievements are the so-called visual-language models (VLMs), sophisticated systems trained to simultaneously understand images and text. Their ability to connect visual information with linguistic descriptions opens the door to numerous applications, including advanced diagnostic tools and automated quality control systems. However, recent research, including that conducted at the Massachusetts Institute of Technology (MIT), sheds light on a critical weakness of these models: their inability to correctly process negations. Words like "no," "not," or "without" can lead to completely erroneous interpretations, which can have catastrophic consequences in sensitive areas of application.


Unexpected failures of artificial intelligence: The problem with negation


Imagine a scenario in a radiology office. A doctor is analyzing a patient's chest X-ray. They notice tissue swelling but observe that the heart is not enlarged. In an effort to speed up the diagnosis and find similar recorded cases, the radiologist might rely on a visual-language model. If the artificial intelligence system misinterprets the query and searches for cases of patients who have both swollen tissue and an enlarged heart, the initial diagnosis could be drastically different. Specifically, the combination of swollen tissue and an enlarged heart strongly suggests heart problems, whereas the absence of an enlarged heart, despite swollen tissue, opens up the possibility of a range of other potential causes. Such an error in interpretation, caused by a misunderstanding of negation, can lead the diagnostic process in a completely wrong direction.


Researchers from MIT, in collaboration with colleagues from other institutions, have systematically investigated this problem. Their findings indicate that visual-language models show a distinct propensity for errors in real-world situations when faced with negative words. Kumail Alhamoud, a graduate student at MIT and lead author of the study, emphasizes: "These negative words can have a very significant impact, and if we blindly use these models, we can face catastrophic consequences." This warning is not limited to medical diagnostics; it extends to all high-risk applications where decisions are based on information generated by these AI systems, from autonomous vehicles to quality control in industrial plants.


How do visual-language models work and where does the 'short circuit' occur?


Visual-language models (VLMs) are sophisticated machine learning systems trained on vast datasets containing images and their corresponding textual descriptions. Through the training process, models learn to encode both images and text into numerical representations, known as vector embeddings. The goal is for the model to learn to generate similar vectors for an image and its corresponding description. VLMs typically use two separate encoders: one for processing images and another for processing text. These encoders are optimized simultaneously so that their output vectors are as similar as possible for semantically related image-text pairs.
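
To make the embedding-matching idea concrete, here is a minimal sketch that scores two candidate captions against a single image with an off-the-shelf CLIP model from the Hugging Face transformers library. It is only an illustration: the model checkpoint, the image file, and the captions are assumptions, and CLIP merely stands in for a generic visual-language model of the kind described above.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative sketch, not the study's code: an off-the-shelf CLIP model
# stands in for a generic visual-language model with separate encoders.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chest_xray.png")  # hypothetical input image
captions = [
    "swollen tissue and an enlarged heart",
    "swollen tissue but no enlarged heart",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image[0]  # one similarity score per caption

# A negation-blind model often assigns these two captions nearly identical
# scores, because the word "no" barely changes the text embedding.
print(scores.softmax(dim=-1))
```

Both encoders map their inputs into the same vector space, so a caption's score is simply the similarity between its text embedding and the image embedding.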


The problem with negation stems from the very nature of the data on which these models are trained. "Image descriptions mostly express what is in the images – they are positive labels. And that's actually the whole problem. Nobody looks at a picture of a dog jumping over a fence and describes it with 'a dog jumping over a fence, without a helicopter'," explains Marzyeh Ghassemi, an associate professor at MIT and senior author of the research. Since training datasets predominantly contain affirmative descriptions, VLMs simply do not have enough opportunities to learn to recognize and correctly interpret negation. The lack of examples where it is explicitly stated what *is not* present in the image leads to models developing a kind of "affirmation bias."


Testing the limits of understanding: How models failed the negation test


To investigate this problem further, scientists designed two specific benchmark tasks aimed at testing the ability of VLMs to understand negation. In the first task, they used a large language model (LLM) to generate new descriptions for existing images. The LLM was asked to think about related objects that are *not present* in the image and include them in the description. They then tested the VLMs by giving them queries with negative words, asking them to retrieve images that contain certain objects but not others. For example, a model might be tasked with finding images with a cat but without a dog.
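
The retrieval setup can be sketched roughly as follows; this is an illustration of the task under assumed inputs (model checkpoint, image list), not the authors' benchmark code.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative sketch of image retrieval with a negated query, not the study's code.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_images(query, images):
    """Return image indices sorted from best to worst match for the query."""
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_text[0]  # similarity of the query to each image
    return sims.argsort(descending=True).tolist()

# A negation-blind model tends to rank images containing BOTH a cat and a dog
# near the top for this query, because "without a dog" is effectively ignored.
# order = rank_images("a photo of a cat without a dog", list_of_pil_images)
```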


The second task consisted of multiple-choice questions. The VLM was shown an image and had to choose the most appropriate description from a series of very similar options. These descriptions differed only in details – some added a reference to an object not appearing in the image, while others negated an object that was clearly visible. The results were devastating. The models often failed on both tasks: in image retrieval, performance dropped by almost 25% when queries contained negations, and on the multiple-choice questions the best models achieved an accuracy of only about 39%, while some performed at, or even below, the level of random guessing.


One of the key reasons for such failures lies in the aforementioned "affirmation bias." VLMs tend to ignore negative words and focus exclusively on the objects mentioned in the query, regardless of whether those objects are affirmed or negated. "This doesn't just happen with words like 'no' and 'not'. No matter how you express negation or exclusion, the models will simply ignore it," points out Alhamoud. This weakness proved consistent across all tested visual-language models, including some of the most well-known and widely used in the industry.


The search for a solution: New datasets and future directions


Faced with this challenge, the researchers did not stop at merely identifying the problem. As a first step towards a solution, they developed new datasets that explicitly include negative words. Starting from an existing dataset of 10 million image-text pairs, they used a large language model to suggest related descriptions specifying what is excluded from each image, which produced new captions enriched with negations. Particular attention was paid to making these synthetically generated descriptions sound natural, so that VLMs trained on such data would not later fail when faced with more complex, human-written descriptions in the real world.
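
The general idea of this caption enrichment can be sketched as follows; the prompt wording and the call_llm helper are hypothetical stand-ins, not the prompt or pipeline actually used in the study.

```python
# Hypothetical sketch of caption enrichment: ask an LLM to name a plausible
# but absent object and rewrite the caption so the absence is stated explicitly.
NEGATION_PROMPT = (
    "Here is an image caption: '{caption}'.\n"
    "Name one object that plausibly belongs to this kind of scene but is NOT "
    "mentioned, then rewrite the caption so it naturally states that this "
    "object is absent."
)

def enrich_caption(caption, call_llm):
    # call_llm is a stand-in for whatever text-generation API is available.
    return call_llm(NEGATION_PROMPT.format(caption=caption))

# Example: enrich_caption("a dog jumping over a fence", call_llm) might return
# something like "a dog jumping over a fence, with no ball in sight".
```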


After creating these enriched datasets, the team fine-tuned existing VLMs on them. The results were encouraging: fine-tuning with the new data led to performance improvements across the board. The models' ability to retrieve images based on queries containing negation improved by approximately 10%, while accuracy on the multiple-choice question answering task increased by an impressive 30%.
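
For a sense of what such fine-tuning involves, here is a minimal contrastive-training sketch in the CLIP style. The checkpoint, the augmented_batches variable, the batch format, and the hyperparameters are assumptions made for illustration, not the authors' training setup.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

# Minimal CLIP-style contrastive fine-tuning sketch, not the study's code.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
model.train()

def finetune_step(images, captions):
    # images: list of PIL images; captions: matching negation-enriched strings.
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    outputs = model(**inputs)
    # Matching image-caption pairs sit on the diagonal of the similarity matrix.
    labels = torch.arange(len(captions))
    loss = (F.cross_entropy(outputs.logits_per_image, labels)
            + F.cross_entropy(outputs.logits_per_text, labels)) / 2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Assumed usage: augmented_batches yields (images, captions) tuples built from
# the negation-enriched dataset.
# for images, captions in augmented_batches:
#     finetune_step(images, captions)
```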


"Our solution is not perfect. We are just re-describing datasets, which is a form of data augmentation. We haven't even touched how these models work, but we hope this is a signal that this is a solvable problem and that others can take our solution and improve it," Alhamoud modestly comments. Nevertheless, this progress shows that the problem is not insurmountable and that targeted data enrichment can bring significant improvements.


Broader implications and the necessity of caution


The findings of this research, which will be presented at the prestigious Conference on Computer Vision and Pattern Recognition, have far-reaching implications. They serve as an important warning to users and developers of visual-language models. If something as fundamental as understanding negation is impaired, it raises questions about the reliability of these systems in many existing applications. Professor Ghassemi emphasizes: "This is a technical paper, but there are larger questions to consider. If something as basic as negation is broken, we should not be using large visual-language models in many of the ways we currently use them – without intensive evaluation."


Therefore, it is crucial that potential users of these technologies are aware of this, perhaps previously overlooked, shortcoming. Before implementing VLMs in high-risk environments, it is necessary to conduct thorough testing, including scenarios with negations, to assess their actual reliability. This problem is not limited to specific words like "no" or "not"; it concerns the general ability of models to understand absence, exclusion, or opposition.


Future research could focus on deeper changes in the architecture of the models themselves. One possible direction is to train VLMs to process textual and visual information in a way that would allow them to better understand semantic nuances, including negation. This might involve developing more sophisticated attention mechanisms or new loss functions that would explicitly penalize misinterpretation of negations during training. Furthermore, the development of additional, specialized datasets, tailored to specific application areas such as healthcare, could further enhance the performance and safety of these powerful tools. While visual-language models undoubtedly offer enormous potential, ensuring their robust and reliable functioning, especially in the context of understanding negation, remains a key challenge for the scientific community.

Source: Massachusetts Institute of Technology


Created: 15 May 2025

Science & tech desk

