Enterprise storage systems solve this problem by having writes go to 8GB or more of NVRAM and then get consolidated and flushed to the SSDs. I wish consumer grade systems used a similar system.
Very interesting indeed. They mention a very simple rule of thumb (not new to this work AIUI, but still worthwhile) that suggests arranging data into blocks that will all be discarded in bulk at the same time. Doing this is generally already enough to make a dent in write amplification.
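The rule of thumb above can be demonstrated with a toy simulation. This is my own sketch, not code from the paper: a tiny FTL model with greedy garbage collection, where a skewed update workload either mixes hot and cold pages in the same erase blocks or separates them into two write streams (the "discarded together" grouping). All sizes and the 90/10 skew are arbitrary assumptions for illustration.

```python
import random

def simulate_waf(separate_streams, n_blocks=32, blk_pages=64, fill=0.85,
                 n_updates=60_000, seed=42):
    """Toy FTL: logical pages live in erase blocks; greedy GC relocates
    still-valid pages. Returns the write amplification factor (WAF)."""
    rng = random.Random(seed)
    n_logical = int(n_blocks * blk_pages * fill)
    hot_cut = n_logical // 10            # 10% "hot" pages get 90% of updates
    seq = 0                              # global write sequence number
    latest = {}                          # logical page -> newest seq
    where = {}                           # logical page -> block (None = buffered)
    valid = [set() for _ in range(n_blocks)]
    free = set(range(n_blocks))
    buffers = {0: [], 1: []}             # per-stream write buffers
    stats = {"user": 0, "gc": 0}

    def stream(lp):                      # death-time grouping on/off
        return 1 if separate_streams and lp < hot_cut else 0

    def write(lp, kind):
        nonlocal seq
        seq += 1
        latest[lp] = seq
        if where.get(lp) is not None:    # invalidate the old on-flash copy
            valid[where[lp]].discard(lp)
            where[lp] = None
        stats[kind] += 1
        buffers[stream(lp)].append((lp, seq))

    def gc():                            # greedy: fewest valid pages wins
        victim = min((b for b in range(n_blocks) if b not in free),
                     key=lambda b: len(valid[b]))
        for lp in list(valid[victim]):   # relocations are extra writes
            write(lp, "gc")
        valid[victim].clear()
        free.add(victim)

    def flush(s):
        while len(buffers[s]) >= blk_pages:
            live = [e for e in buffers[s] if latest[e[0]] == e[1]]
            chunk, buffers[s] = live[:blk_pages], live[blk_pages:]
            if not chunk:
                return
            if not free:
                gc()
            b = free.pop()
            for lp, _ in chunk:
                where[lp] = b
                valid[b].add(lp)

    def step(lp):
        write(lp, "user")
        for s in buffers:
            flush(s)

    for lp in range(n_logical):          # initial fill
        step(lp)
    stats["user"] = stats["gc"] = 0      # measure steady-state updates only
    for _ in range(n_updates):           # 90/10 skewed update workload
        lp = (rng.randrange(hot_cut) if rng.random() < 0.9
              else rng.randrange(hot_cut, n_logical))
        step(lp)
    return (stats["user"] + stats["gc"]) / stats["user"]

print("mixed    WAF:", round(simulate_waf(False), 2))
print("separate WAF:", round(simulate_waf(True), 2))
```

With separation, blocks in the hot stream become almost entirely invalid before GC touches them, so there is little to relocate; mixed placement leaves a few valid cold pages in every victim, and those get copied over and over.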
This seems to miss a reference to Zoned XFS, which is the Linux file system that actually looked into this kind of data placement at the file system layer. The paper includes numbers using RocksDB: https://dl.acm.org/doi/10.1145/3725783.3764399
Thanks for pointing this out. I’ll add the reference to the arXiv version later.
In our paper, we only evaluated with regular XFS (see Section 10.3, “What happens if a filesystem is used?” in the arXiv version), but evaluating Zoned XFS would definitely be interesting as well.
The paper presents a thorough analysis of write amplification and slowdown/wear with large databases (800 GB) on a single machine. The databases are MySQL and PostgreSQL.
As already commented, this can lead to an optimized storage table format for greater performance. Nice!
I would expect that a similar analysis could be done for SQLite, maybe with a different dataset and a single write thread.
Thanks! I have not tested SQLite myself, but it would definitely be worthwhile to evaluate as well. SQLite would likely suffer from write amplification in a similar way as MySQL or PostgreSQL, since it is also a page-based DBMS with in-place updates, regardless of the single-writer design.
The degree of the resulting write amplification depends on several factors, including the fill factor, write skewness, and the write rate relative to the SSD characteristics. We discuss this in more detail in Section 10.2, “When should the DBMS care about WAF?” in the extended arXiv version.
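One back-of-envelope way to see why the fill factor matters so much (my own sketch, not a formula from the paper): if GC victim blocks are on average a fraction u valid, then erasing a block of P pages relocates u·P pages to reclaim only (1−u)·P free ones, so WAF = P / ((1−u)·P) = 1 / (1−u).

```python
def waf_estimate(u):
    """WAF when GC victim blocks are a fraction `u` valid:
    each erased block of P pages relocates u*P pages to reclaim (1-u)*P,
    so total writes per user write come out to 1 / (1 - u)."""
    assert 0 <= u < 1
    return 1.0 / (1.0 - u)

# Fuller victims (high fill factor, little skew) hurt non-linearly:
for u in (0.5, 0.8, 0.9):
    print(f"victim {u:.0%} valid -> WAF {waf_estimate(u):.1f}")
```

The non-linearity is the key point: going from 50% to 90%-valid victims does not double the WAF, it quintuples it, which is why over-provisioning and skew-aware placement pay off so disproportionately.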
SMR hard drives have very different rules about how you should access them vs. conventional hard drives or SSDs. I wonder how much optimizing for SMR drives (big sequential writes) would also optimize for other drive types.
The zoned-storage people (of whom the shingled folks were a subset) seemed pretty OK with the FDP (Flexible Data Placement, TP4146b) scheme that finally, finally, finally got hammered out for NVMe 2.1 (August 2024). It was also designed to satisfy the open-channel flash people as well.
It's a fairly simple concept that gives you some write affinity: when writing, you declare that this write should be associated with other writes carrying the same FDP placement identifier, a form of tagging.
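FDP proper is driven through NVMe directives, but Linux has long exposed a related per-file mechanism in the same spirit: write-lifetime hints (fcntl F_SET_RW_HINT, since kernel 4.13), which kernels and drivers can map onto streams-style placement. A hedged sketch of the tagging idea; the raw constant values come from linux/fcntl.h, and whether the hint does anything depends on your kernel, filesystem, and device:

```python
import fcntl
import struct
import tempfile

# Constants from linux/fcntl.h (F_LINUX_SPECIFIC_BASE = 1024).
F_SET_RW_HINT = 1024 + 12
RWH_WRITE_LIFE_SHORT = 2      # e.g. WAL / frequently rewritten pages
RWH_WRITE_LIFE_EXTREME = 5    # e.g. cold, append-mostly data

def tag_lifetime(fd, hint):
    """Attach a write-lifetime hint to an open file descriptor.
    Returns False if the kernel/filesystem does not support rw hints."""
    try:
        fcntl.fcntl(fd, F_SET_RW_HINT, struct.pack("Q", hint))
        return True
    except OSError:
        return False

# Tag a scratch file as holding short-lived data:
with tempfile.NamedTemporaryFile() as f:
    supported = tag_lifetime(f.fileno(), RWH_WRITE_LIFE_SHORT)
    print("rw-hint supported:", supported)
```

The analogy to FDP tagging is that all writes sharing a hint class can be steered to the same physical region, so they tend to be invalidated together, which is exactly the affinity described above.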
I'm not fully convinced this really is as good as what the open-channel flash people wanted. But drive manufacturers were never voluntarily going to give up their really complex Flash Translation Layers. They all want to be value-add, to have their expensive fancy controllers keeping the market from commoditizing down to using NAND directly. But FDP does show some very real promise, and it can have huge read/write-affinity bonuses!
I note that the SSDFS filesystem is still out there being improved and maintained, a file system that tries to take advantage of all this. I'm not sure whether it has made the jump to FDP or is still on the older, much more ornery, never-quite-loved ZNS specification. I'd love to give it a try, but FDP and ZNS drives are not easy to get hold of: they require asking very nicely, and when I last checked, they required purchasing very expensive enterprise SSDs that cost a ton but had pretty so-so performance figures. That was a couple of years ago now.
https://www.phoronix.com/news/Linux-SSDFS-NVMe-ZNS-SSDs
https://news.ycombinator.com/item?id=34939248
The paper here is wonderful & beautiful. FDP should make this kind of thing so much easier; it should remove so many of the downsides of drive usage mentioned here. If only it were available. I'd really love it if drive reviewers would look at and comment on the feature matrix drives have, and comment on FDP, but generally it feels like there's no ask, little pull, and thus no push for an obvious, basically zero-cost improvement that makes everything vastly better. Alas. Can't wait. Hopefully drive prices are better by 2031 & FDP is finally available. Fingers crossed.
Speaking of zoned SSDs (ZNS or FDP), are any of these available today without having to ‘call sales’? I wanted to experiment with this maybe 2 years back and there was nothing.
This paper gives a really nice end-to-end treatment of an entire problem domain that is usually taken piecemeal. Almost all of the techniques mentioned are already used in databases in some form. It won't lead to new database types but it provides a framework for thinking about the write amplification problem.
Not every database architecture will be able to easily take advantage of all these techniques. Some designs are much more easily optimizable than others.
To add to that: some of the techniques are well known to storage experts, but not yet widespread among database engineers.
The paper does a great job of explaining the effects on database systems. Great work!
I felt fooled after clicking the link and seeing this PDF downloading (i.e., literally writing to my SSD), until I realized that this is the point.
The extended version is available on arXiv if you’d like more details: https://arxiv.org/pdf/2603.09927
The appendix includes additional details and FAQ-style answers that did not fit into the VLDB version.
That they got this to work on regular commodity SSDs (from multiple vendors) is very impressive.
There is also this paper on SQLite/mobile storage and zoned devices that may be relevant in this context: https://www.usenix.org/system/files/atc24-hwang.pdf
> Storage nerd @ Google
Vendors, are you listening?
(And the software would/could be so much better... If this were available to play with)