
Lack of transparency in AI training datasets: how the new tool can improve model accuracy and reduce data bias

A recent MIT study reveals serious shortcomings in the transparency of the data used to train large language models. A new tool helps researchers better understand data sources, which can reduce the risk of bias and improve model accuracy.

Critical lack of transparency in training datasets for large language models
Researchers have developed a tool that allows artificial intelligence experts to more easily select data that best fits their models, potentially increasing model accuracy and reducing bias.

In training powerful language models, researchers rely on extensive data collections that encompass diverse information gathered from thousands of websites. However, as these datasets are combined and reused in different collections, crucial details about their origins are often lost or become unclear.

This lack of information not only raises legal and ethical concerns but can also negatively impact model performance. For example, if a dataset is misclassified, a researcher training a model for a specific task might unintentionally use data that is not suitable for that purpose.

Moreover, data from unknown sources may contain biases that lead to unfair predictions when the model is used in real-world situations, such as credit scoring or customer service interactions.

To increase data transparency, a multidisciplinary team of researchers from MIT and other institutions conducted a systematic review of more than 1,800 text datasets hosted on popular websites. They found that more than 70 percent of these datasets lacked critical licensing information, while around 50 percent contained errors in their documentation.

Development of tools for greater data transparency
The researchers developed a tool called the Data Provenance Explorer that lets practitioners review and assess where a dataset comes from. The tool generates an overview of a dataset's creators, sources, licenses, and permitted uses, which can support more responsible use of AI technologies.

Data Provenance Explorer not only helps in selecting appropriate datasets for specific tasks but also allows users to download cards with detailed information about datasets, facilitating the understanding of risks and limitations associated with the data used.
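
To make the idea of a dataset card concrete, here is a minimal Python sketch of how such a summary might be assembled from provenance metadata. The `ProvenanceRecord` class, its field names, and the `render_card` function are hypothetical illustrations, not the actual schema or interface of the Data Provenance Explorer, and the example dataset is invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical metadata for one dataset; not the tool's real schema."""
    name: str
    creators: list[str]
    sources: list[str]
    license: Optional[str] = None  # None means no license information was found
    allowed_uses: list[str] = field(default_factory=list)


def render_card(record: ProvenanceRecord) -> str:
    """Return a short plain-text summary card for one dataset."""
    lines = [
        f"Dataset: {record.name}",
        f"Creators: {', '.join(record.creators) or 'unknown'}",
        f"Sources: {', '.join(record.sources) or 'unknown'}",
        f"License: {record.license or 'unspecified'}",
        f"Allowed uses: {', '.join(record.allowed_uses) or 'unspecified'}",
    ]
    return "\n".join(lines)


# Invented example: a dataset whose license information is missing.
example = ProvenanceRecord(
    name="example-qa-corpus",
    creators=["Example University"],
    sources=["example.org"],
)
card = render_card(example)
print(card)
```

In this sketch, missing fields are surfaced explicitly as "unspecified" rather than silently dropped, so a reader of the card can see which gaps remain before using the data.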

Risks of bias and unethical use
The study also found that almost all dataset creators are based in developed countries, which may limit how well models trained on their data perform in other regions. For example, a Turkish-language dataset created mainly by researchers in the US and China may miss culturally important aspects, which could reduce model accuracy in a Turkish context.

The researchers also noted a significant increase in restrictions placed on datasets created in 2023 and 2024, which suggests growing concern in the academic community that their data might be misused for commercial purposes.

Challenges and future directions for research
To make this information available without manual review, the Data Provenance Explorer lets users sort and filter datasets by various criteria. Users can also download a summary of each dataset's characteristics, a step forward in understanding the data used to train AI models.
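
The kind of sorting and filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`license`, `commercial_use`, `year`), and the `filter_datasets` helper are hypothetical and do not reflect the tool's real data or interface.

```python
# Invented metadata records; field names are assumptions for illustration.
datasets = [
    {"name": "corpus-a", "license": "CC-BY-4.0", "commercial_use": True, "year": 2021},
    {"name": "corpus-b", "license": None, "commercial_use": False, "year": 2023},
    {"name": "corpus-c", "license": "CC-BY-NC-4.0", "commercial_use": False, "year": 2024},
    {"name": "corpus-d", "license": "Apache-2.0", "commercial_use": True, "year": 2023},
]


def filter_datasets(records, *, require_license=True, commercial_only=False):
    """Keep only records that satisfy the requested licensing criteria."""
    kept = []
    for record in records:
        if require_license and record["license"] is None:
            continue  # skip datasets with no license information
        if commercial_only and not record["commercial_use"]:
            continue  # skip datasets that do not permit commercial use
        kept.append(record)
    return kept


# Filter for commercially usable datasets, then sort newest first.
usable = sorted(
    filter_datasets(datasets, commercial_only=True),
    key=lambda record: record["year"],
    reverse=True,
)
names = [record["name"] for record in usable]

# Share of records with no license information at all.
missing_share = sum(r["license"] is None for r in datasets) / len(datasets)

print(names)
print(f"{missing_share:.0%} of records lack license information")
```

Treating a missing license as a reason to exclude a dataset by default is a conservative design choice in this sketch; a real workflow might instead flag such datasets for manual follow-up.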

In the future, the researchers plan to expand their analysis to multimodal data, including video and audio, and to study how the terms of use on websites that serve as data sources are reflected in the datasets built from them. They also intend to work with regulators on the specific copyright and ethical issues raised by fine-tuning data.

MIT’s research highlights the need for data transparency, laying the foundation for a more ethical and legally compliant development of artificial intelligence in the future.


Creation time: 31 August, 2024

Science & tech desk

