Large language models (LLMs) represent the pinnacle of today's artificial intelligence technology, demonstrating an extraordinary ability to understand and generate text. Their skill in textual reasoning allows them to grasp the context of documents and provide logical, coherent answers. However, despite this sophistication, these same models often encounter insurmountable obstacles when faced with the simplest mathematical or logical problems. The paradox lies in the fact that textual reasoning, their fundamental strength, is often an inadequate tool for solving computational or algorithmic tasks.
Although some advanced LLMs, such as GPT-4, are capable of generating programming code in languages like Python to solve symbolic queries, a key challenge remains: the models do not always know when it is appropriate to use code instead of text, nor what type of code would be most effective for a given problem. It seems that these powerful language models need a kind of "coach" or "mentor" to guide them toward the optimal problem-solving technique. This is precisely where an innovative solution from the Massachusetts Institute of Technology (MIT) comes into play.
An intelligent assistant for language models
Researchers at MIT have developed a system called CodeSteer, an intelligent assistant designed to guide a large language model through the process of choosing between generating text and programming code until it arrives at the correct solution to a query. CodeSteer, which is itself a smaller, specialized language model, works by automatically generating a series of prompts to iteratively guide the work of a larger, more powerful LLM. After each step, CodeSteer analyzes the model's current and previous responses and provides guidance for correcting or improving the solution, continuing this process until it deems the answer correct and complete.
This approach has proven to be extremely successful. The research found that supplementing a larger LLM with the CodeSteer system increased its accuracy on symbolic tasks by more than 30 percent. The tested tasks included a wide range of problems, from multiplying numbers and solving Sudoku puzzles to logical tasks like stacking blocks. Notably, the system also enabled less sophisticated models to outperform more advanced models that have enhanced reasoning abilities but lack such external guidance.
This advancement has the potential to drastically improve the problem-solving capabilities of LLMs, especially for complex tasks that are extremely difficult to solve with textual reasoning alone. Examples of such tasks include generating paths for robots in uncertain environments or optimizing shipment schedules within a complex international supply chain.
"We are witnessing a race to develop ever-better models capable of everything, but we have taken a complementary approach," said Chuchu Fan, an associate professor of aeronautics and astronautics (AeroAstro) and a principal investigator in MIT's Laboratory for Information and Decision Systems (LIDS). "Researchers have spent years developing effective technologies and tools for solving problems in many domains. Our goal is to enable LLMs to choose the right tools and methods and leverage the expertise of others to enhance their own capabilities."
The scientific paper on this research, which will be presented at the International Conference on Machine Learning, includes, alongside Professor Fan, LIDS graduate student Yongchao Chen, AeroAstro graduate student Yilun Hao, University of Illinois Urbana-Champaign graduate student Yueying Liu, and MIT-IBM Watson AI Lab scientist Yang Zhang.
How does the "coach" for an LLM work?
To understand the problem that CodeSteer solves, one only needs to ask an LLM a simple question: which number is larger, 9.11 or 9.9? Using textual reasoning, the model will often give the wrong answer. However, if instructed to use programming code for the answer, it will generate and execute a simple Python script to compare the two numbers and arrive at the correct solution without any problem.
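The contrast is easy to demonstrate. Below is a minimal sketch of the kind of script an LLM might produce when steered toward code for this query; textual reasoning can stumble on the digits ("11" looks bigger than "9"), while a direct numeric comparison cannot:

```python
# Comparing 9.11 and 9.9 numerically rather than textually.
# Digit-by-digit "reasoning" can mislead (11 > 9), but a float
# comparison is unambiguous: 9.11 < 9.9.
a, b = 9.11, 9.9
larger = a if a > b else b
print(f"The larger number is {larger}")  # prints 9.9
```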
Because they were initially trained to understand and predict human language, LLMs are more inclined to answer queries using text, even when code would be significantly more effective. Although they have learned to generate code through the process of fine-tuning, they often generate an incorrect or less efficient version of the required code.
Instead of trying to retrain powerful LLMs like GPT-4 or Claude to improve these capabilities, which is an extremely expensive and complex process, the MIT researchers opted for a more refined solution. They fine-tuned a smaller, "lighter" language model to serve as a guide for the larger model, steering it between text and code. Fine-tuning the smaller model does not change the fundamental architecture of the larger LLM, thus eliminating the risk of impairing its other, already perfected abilities.
"We also found inspiration in people. In sports, a coach may not be better than the team's star player, but they can still provide useful advice to guide the athlete. This guidance method also works for LLMs," explains Yongchao Chen.
This "coach," CodeSteer, works in tandem with the larger LLM. It first reviews the query and determines whether text or code is more appropriate for solving the problem and what type of code would be best. It then generates a specific prompt for the larger LLM, instructing it to use a particular coding method or textual reasoning. The larger model follows this instruction, generates a response, and sends it back to CodeSteer for verification. If the answer is incorrect, CodeSteer continues to generate new prompts, encouraging the LLM to try different approaches that might solve the problem. This could include, for example, incorporating a search algorithm or a specific constraint into the Python code, until a correct result is achieved.
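The loop described above can be sketched in Python. This is only an illustration of the guide-generate-verify cycle, under invented assumptions: the function names (`choose_method`, `query_llm`, `verify`) and their logic are hypothetical stand-ins, not the actual CodeSteer implementation:

```python
# Hypothetical sketch of an iterative guidance loop in the spirit of
# CodeSteer. All names and stub logic here are illustrative assumptions.

def choose_method(query: str) -> str:
    """Stand-in 'coach' decision: prefer code when the query is numeric."""
    return "code" if any(ch.isdigit() for ch in query) else "text"

def query_llm(query: str, method: str, attempt: int) -> str:
    """Stub for the larger LLM; tags each answer with method and attempt."""
    return f"answer-{method}-{attempt}"

def verify(answer: str) -> bool:
    """Stub checker: accepts only a sufficiently refined answer."""
    return answer.endswith("-2")

def guided_solve(query: str, max_rounds: int = 5) -> str:
    method = choose_method(query)
    answer = ""
    for attempt in range(max_rounds):
        answer = query_llm(query, method, attempt)
        if verify(answer):       # coach deems the answer correct: stop
            return answer
        method = "code"          # otherwise steer toward another approach
    return answer                # best effort after max_rounds

print(guided_solve("Which is larger, 9.11 or 9.9?"))
```

The essential structure is that the smaller model never answers the query itself; it only decides the method, inspects each response, and re-prompts until its check passes.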
"We found that the larger LLM will often try to be 'lazy' and use shorter, less effective code that will not perform the correct symbolic calculation. We designed CodeSteer to avoid this phenomenon," adds Chen. To ensure quality, the system also includes a "symbolic checker" that assesses the complexity of the generated code and sends a signal to CodeSteer if the code is too simple or inefficient. In addition, the researchers have incorporated a self-checking mechanism for answers, which prompts the LLM to generate additional code to calculate the answer and thus confirm its correctness.
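One way to picture the "symbolic checker" is as a crude complexity gate that flags code too simple to be doing real symbolic work, such as a hard-coded answer. The AST-node heuristic and threshold below are invented for this sketch and are not taken from the paper:

```python
# Illustrative "symbolic checker": flag generated code whose syntax
# tree is suspiciously small. The node-count heuristic and the
# threshold of 15 are assumptions made for this sketch only.
import ast

def too_simple(code: str, min_nodes: int = 15) -> bool:
    """Return True if the code's AST has fewer than min_nodes nodes,
    a rough sign of 'lazy' code that skips the actual computation."""
    tree = ast.parse(code)
    return sum(1 for _ in ast.walk(tree)) < min_nodes

lazy = "print(9.9)"                            # hard-coded answer
honest = "a, b = 9.11, 9.9\nprint(max(a, b))"  # actually computes
print(too_simple(lazy), too_simple(honest))    # prints True False
```

In the real system, such a signal would prompt CodeSteer to push the larger LLM toward more substantive code rather than accepting the shortcut.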
Tackling complex tasks and creating new benchmarks
During the development of the CodeSteer system, the research team faced an unexpected challenge: the lack of appropriate datasets for fine-tuning and testing the model. Most existing benchmarks did not specify whether a particular query could be best solved with text or code. Consequently, the researchers had to create their own resource.
They collected a corpus of 37 complex symbolic tasks, including spatial reasoning, mathematics, logical reasoning about order, and optimization, and based on this, they built their own dataset called SymBench. They implemented a fine-tuning approach that uses SymBench to maximize CodeSteer's performance.
In experiments, CodeSteer outperformed all nine baseline methods it was compared against and raised the average accuracy from 53.3% to an impressive 86.4%. It showed similar performance even on tasks it had never seen before, as well as across different types of large language models. Furthermore, a general-purpose model enhanced with CodeSteer can achieve higher accuracy than state-of-the-art models specifically designed for complex reasoning and planning, at significantly lower computational cost.
"Our method utilizes the LLM's own capabilities. By extending the LLM with the ability to cleverly use coding, we can take a model that is already very powerful and further improve its performance," points out Chen.
Experts outside the MIT team have also recognized the importance of this achievement. Jinsung Yoon, a scientist at Google Cloud AI, who was not involved in the work, commented: "The authors present an elegant solution to a key challenge of using tools in LLMs. This simple yet impactful method allows state-of-the-art LLMs to achieve significant performance improvements without the need for direct fine-tuning."
A similar opinion is shared by Chi Wang, a senior scientist at Google DeepMind, who also did not participate in the research. "Their success in training a smaller, specialized model to strategically guide larger, advanced models is particularly impactful. This intelligent collaboration among different AI 'agents' paves the way for more robust and versatile applications in complex real-world scenarios."
Looking to the future, the researchers plan to further optimize CodeSteer to speed up its iterative prompting process. Additionally, they are exploring how to effectively fine-tune a single model that would have the intrinsic ability to switch between textual reasoning and code generation, rather than relying on a separate assistant. This research, supported in part by the U.S. Office of Naval Research and the MIT-IBM Watson AI Lab, represents a significant step toward creating a more versatile and reliable artificial intelligence.