MIT’s Sandook shows how data centers can get more performance without new equipment
MIT researchers have presented Sandook, a software solution that could help data centers extract noticeably more performance from existing SSD devices without buying additional hardware. The approach targets one of the most expensive and least visible problems of modern digital infrastructure: large data storage systems, even when they are technically sound and networked for shared use, often operate below their real potential. According to MIT News, the system distributes workloads across multiple storage devices in real time while reducing the effects of slowdowns caused by differences among the SSDs themselves, by conflicts between reading and writing, and by the internal process known as “garbage collection.” The researchers claim that this approach delivers a tangible increase in speed on real-world tasks and, in some scenarios, almost doubles performance compared with conventional static workload-distribution methods.
Why the problem matters for data centers
The operation of data centers today no longer relies only on processors and networks. Data access speed has become equally important, especially in training artificial intelligence models, running databases, processing large volumes of user content, and storing files that are constantly read and updated. In practice, several SSD devices are therefore often connected into a shared pool accessed by multiple applications. Such device “pooling” has a clear economic logic: not every application needs to have its own full-capacity disk if the resource can be shared over the network. The problem, however, is that not all SSDs respond the same way under the same load, so one slower or temporarily congested device can lower the overall performance of the entire set. It is precisely this gap between nominal capacity and actually achieved speed that represents the space in which Sandook is trying to make a difference.
In its announcement, MIT emphasizes that in existing environments a significant part of device capacity is still not being used efficiently, even when the devices are formally combined for greater utilization. In other words, the mere fact that SSDs are connected into a shared system does not mean that a data center will automatically get an optimal result. If the disks were purchased in different periods, from different manufacturers, with different degrees of wear and different capacities, their behavior under load necessarily differs. When the internal processes of the SSDs themselves are added on top, it becomes clear why traditional, evenly split task distribution is often not enough.
Three sources of slowdown the system is trying to contain
According to the research description, Sandook was developed to simultaneously address three main sources of performance variability. The first is the differences among the SSD devices themselves. In real data centers, equipment is not always bought all at once, nor is it necessarily from the same series or the same manufacturer. Over time, some disks become more worn, some operate under heavier load, and some have different technical characteristics. That means that even when an administrator assigns formally the same task to every device, the final result will not be the same.
The second problem comes from simultaneous reading and writing on the same SSD. When a device has to write new data, it often first has to erase part of the existing blocks. That process can slow down read operations taking place on the same device at the same time. In environments where applications require predictable latency, such interference can be very costly. The third source of slowdown is “garbage collection,” the internal process of gathering and removing outdated data in order to free up space. This process, as the authors point out, is activated at intervals that the data center operator cannot directly control, and when it starts, it can abruptly slow down disk operation.
It is precisely this combination of short-term and long-term causes of performance drops that makes the problem especially unpleasant. Some slowdowns appear suddenly and last briefly, while others develop over months through device wear. If a management system observes only one cause, it can easily overlook the other. That is why the researchers claim that Sandook’s advantage lies in the fact that it does not try to treat only one symptom, but instead observes the storage stack’s behavior as a whole.
A two-layer architecture: global picture and local response
The central technical idea of the system is a two-layer management architecture. At the top is a global scheduler that sees the broader picture of the entire device set and decides which SSD will receive which tasks. At the lower level are local schedulers on individual machines that can react very quickly when a device starts lagging behind or suddenly becomes congested. This is meant to combine what is often difficult to reconcile in large systems: strategic planning at the level of the entire data center and immediate operational response to a problem that appears within a fraction of a second.
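The division of labor described above can be sketched in a few lines of Python. Everything here — the class names, the greedy placement rule, the fallback policy — is a hypothetical illustration of the two-layer idea, not Sandook’s actual implementation:

```python
class GlobalScheduler:
    """Cluster-wide layer: places applications on SSDs using long-term
    device profiles (illustrative sketch, not Sandook's real logic)."""

    def __init__(self, profiles):
        # profiles: ssd_id -> expected throughput, e.g. from history
        self.profiles = profiles

    def place(self):
        # Greedy rule for the sketch: pick the device with most headroom.
        return max(self.profiles, key=self.profiles.get)


class LocalScheduler:
    """Per-machine layer: reacts immediately when a device stalls by
    diverting queued requests to the least-loaded healthy SSD."""

    def __init__(self, ssd_ids):
        self.queues = {ssd: [] for ssd in ssd_ids}
        self.stalled = set()  # devices currently flagged as lagging

    def submit(self, ssd, request):
        # If the intended device is stalled, fall back to a healthy one
        # instead of queuing behind the slowdown.
        if ssd in self.stalled:
            healthy = [s for s in self.queues if s not in self.stalled]
            ssd = min(healthy, key=lambda s: len(self.queues[s]))
        self.queues[ssd].append(request)
        return ssd
```

The point of the split is that the global layer makes slow, profile-driven decisions, while the local layer overrides them only when a device misbehaves — the reconciliation of strategy and reflex that the article describes.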
MIT states that Sandook reduces interference between reading and writing by rotating the SSDs that an individual application uses for these two types of operations. This reduces the probability that reading and writing will collide on the same device at the same moment. In addition, the system profiles the usual behavior of each SSD so that it can recognize when a specific device is likely slowing down because of garbage collection. When it detects such a situation, it redirects part of the load to other devices until the affected SSD stabilizes. The point of the approach is not to completely “switch off” the problematic disk, but to temporarily reduce its burden and then gradually return it to full operation when it proves that it can once again handle more work.
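A rough sketch of the two mechanisms just described — rotating which SSDs serve reads versus writes, and detecting likely garbage collection from latency spikes against a device’s own baseline — might look as follows. The window size, spike threshold, and rotation rule are all assumptions made for illustration, not details from the paper:

```python
from collections import deque


class DeviceProfiler:
    """Tracks one SSD's recent read latency; a sample far above the
    device's own baseline is treated as a likely garbage-collection
    episode (hypothetical heuristic for illustration)."""

    def __init__(self, window=100, spike_factor=3.0):
        self.samples = deque(maxlen=window)
        self.spike_factor = spike_factor

    def record(self, latency_us):
        self.samples.append(latency_us)

    def baseline(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def likely_in_gc(self, latency_us):
        base = self.baseline()
        return base > 0 and latency_us > self.spike_factor * base


def pick_targets(devices, in_gc, epoch):
    """Each epoch, rotate which SSDs handle reads and which handle
    writes, so the two operation types rarely meet on the same device,
    and skip any SSD currently flagged as collecting garbage."""
    active = [d for d in devices if d not in in_gc]
    offset = epoch % len(active) if active else 0
    rotated = active[offset:] + active[:offset]
    half = max(1, len(active) // 2)
    return rotated[:half], rotated[half:]  # (readers, writers)
```

A flagged device is simply left out of the rotation for a while and folded back in once its latency returns to baseline, mirroring the “temporarily reduce its burden” behavior the article attributes to Sandook.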
Such a model is especially important because different types of variability take place on different time scales. Garbage collection can cause a sudden drop in performance, while device wear creates slower, cumulative slowdown. The global controller can take the longer-term device profile into account, while the local scheduler can react to an immediate stall. In theory, it is precisely this combination that gives the system the flexibility that simpler distribution models do not have.
Test results: from databases to AI model training
The researchers tested Sandook on a set of 10 SSDs and observed system behavior across four types of tasks: database operation, machine learning model training, image compression, and storage of user data. According to MIT’s announcement, the increase in throughput per application ranged from 12 to 94 percent compared with static methods, while overall utilization of SSD capacity rose by 23 percent. The authors also state that the system enabled the SSDs to reach 95 percent of their theoretical maximum performance, and did so without specialized hardware or application-specific adaptations.
These figures deserve careful reading. They do not mean that every data center will automatically get twice the performance in all scenarios, but that under test conditions, on tasks resembling real workloads, the software approach to smarter work distribution delivered very measurable results. This matters because, in practice, infrastructure investments are often viewed through the purchase of new equipment. Sandook suggests that at least part of the gains can also be achieved at the level of managing existing resources, which is especially important for operators in a period of rising energy costs and pressure for sustainability.
Less waste, more utilization
One of the most striking emphases in MIT’s announcement is not only technical, but also economic and environmental. The paper’s lead author, Gohar Irfan Chaudhry, warned that problems in computing infrastructure are too often solved by simply adding more resources, even though that is not sustainable in the long run. Such an approach means more money spent, more materials consumed, and a shorter effective lifespan for expensive equipment that has already been produced. In that sense, Sandook fits into a broader trend of technological solutions that do not necessarily require a new generation of devices, but instead try to extract the maximum from existing systems before replacement is considered.
For the data center industry, this is not a marginal topic. SSDs are fast, but they are also expensive, and at large scales even relatively small improvements in utilization can mean savings measured in significant amounts. When that is combined with the fact that modern data centers already carry a large part of the burden of the digital economy, from internet services to generative artificial intelligence, it becomes clear why every increase in efficiency is interesting from both a business and a regulatory perspective. Buying less new equipment does not mean only lower capital costs, but can also mean a smaller carbon footprint over the lifecycle of the infrastructure.
Without specialized hardware, but not without serious context
An important element of the work is also the claim that no specialized hardware is required to apply the approach. That increases the practical appeal of the solution because many studies remain confined to the laboratory precisely because they require a special type of equipment or expensive modifications to existing infrastructure. At the same time, the available data show that Sandook was developed and evaluated in a serious technical environment. The project’s publicly available GitHub page states that the experiments used Samsung PM1725a and Western Digital DC SN200 NVMe SSDs, a 100 GbE Mellanox ConnectX-6 network card, Intel Xeon E5-2680 v4 processors, and Ubuntu 23.04 with Linux kernel 6.5. Such details do not mean that the solution is reserved only for an identical configuration, but they do show that this is not an abstract simulation without contact with real infrastructure requirements.
The project’s publicly released repository also indicates that the researchers want to bring the solution closer to the community of systems and networking experts, rather than keeping it only at the level of a conference paper. This is also relevant because data centers often look for technologies that can be introduced and tested gradually, rather than only ideas that look good on a chart. Openness of implementation does not guarantee commercial adoption, but it makes technical verification and comparison with other approaches easier.
Conference validation and broader professional context
The paper, titled “Unleashing the Potential of Datacenter SSDs by Taming Performance Variability,” was accepted for presentation at the USENIX NSDI 2026 symposium, one of the more important international gatherings dedicated to the design and implementation of networked and distributed systems. According to the conference’s official website, NSDI 2026 will be held from May 4 to 6, 2026, in Renton, Washington. The mere fact that the paper was accepted does not mean the technology is already an industry standard, but it does mean it passed a relevant expert selection process within the community that deals with internet, cloud, and large-scale computing infrastructure.
The story gains additional weight from a reaction outside the author team. MIT relays a statement by Josh Fried, a software engineer at Google and incoming professor at the University of Pennsylvania, who did not participate in the research. He assesses that flash storage is a key technology of modern data centers, but that sharing that resource among workloads with very different requirements remains an open problem. In his assessment, this work noticeably moves the boundary forward with a practical, deployment-ready solution, bringing flash storage closer to its full potential in production clouds. Such statements are not in themselves proof of success, but they show that the topic has broader resonance within the profession.
What comes next
The researchers announced that in future work they want to use new protocols available on newer SSDs that give operators greater control over data placement. In addition, they want to take advantage of workload predictability in artificial intelligence systems in order to further increase the efficiency of SSD operation. That is a logical direction of development because AI workloads, with large datasets and intensive exchanges between storage and computing resources, are increasingly shaping data center infrastructure. If it turns out that such predictability can be turned into even smarter storage management, Sandook or similar systems could gain an even broader field of application.
According to MIT, the research was funded in part by the U.S. National Science Foundation, DARPA, and the Semiconductor Research Corporation. At a time when the artificial intelligence and cloud infrastructure industry is looking for ways to withstand growing demand without endlessly expanding the hardware base, works like this attract attention precisely because they offer a different answer: not necessarily more machines, but smarter use of those already running.
Sources:
- MIT News – original article on the Sandook system, the research authors, test results, and the planned presentation of the paper (link)
- USENIX NSDI 2026 – official page of the paper “Unleashing the Potential of Datacenter SSDs by Taming Performance Variability” with the list of authors and conference context (link)
- USENIX NSDI 2026 – official conference page with the dates and location of the symposium (link)
- Sandook GitHub project – publicly available repository with technical data on the test environment and system implementation (link)