
Real life workloads allow more efficient data granularity and enable very large 固态硬盘 capacities

Luca Bert | September 2023

High-capacity SSDs (i.e., 30TB+) bring a new set of challenges. The two most relevant are:

  1. High-capacity SSDs are enabled by high-density NAND, such as QLC (quad-level cell NAND, storing 4 bits per cell), which poses more challenges than TLC NAND (triple-level cell, storing 3 bits per cell).
  2. SSD capacity growth commands an equivalent growth of local DRAM for the map, which has traditionally been kept at a ratio of 1:1000 (DRAM to storage capacity).

We are now at the point where the 1:1000 ratio is no longer sustainable. But do we really need it? Why not a 1:4000 ratio? Or 1:8000? Those would reduce the DRAM requirement by 4x or 8x, respectively. What is stopping us from doing that?

This blog explores the thought process behind this approach and tries to chart a path forward for high-capacity SSDs.

First, why does DRAM need to scale at a 1:1000 ratio to NAND capacity? The SSD needs to map the logical block addresses (LBAs) coming from the system to NAND pages and needs to keep a live copy of the whole map so it knows where data can be written to or read back. The LBA size is 4KB and a map address is typically 32 bits (4 bytes), so we need one 4-byte entry for every 4KB LBA; hence the 1:1000 ratio. Note that very large capacities need a bit more than this but, for simplicity, we'll stick to this ratio as it makes the reasoning simpler and won't materially change the outcome.
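As a sanity check on that arithmetic, here is a small Python sketch (a hypothetical illustration, not actual SSD firmware logic) of how map-DRAM size scales with capacity and map granularity:

```python
def map_dram_bytes(capacity_bytes: int, iu_bytes: int, entry_bytes: int = 4) -> int:
    """DRAM needed for the logical-to-physical map: one entry per mapped unit."""
    return (capacity_bytes // iu_bytes) * entry_bytes

TiB, GiB = 2**40, 2**30

# 32 TiB drive mapped at 4KB granularity: 32 GiB of map DRAM (the ~1:1000 ratio)
print(map_dram_bytes(32 * TiB, 4 * 1024) // GiB)    # 32

# Same drive mapped at 16KB granularity: a 4x reduction, to 8 GiB
print(map_dram_bytes(32 * TiB, 16 * 1024) // GiB)   # 8
```

Quadrupling the map granularity divides the DRAM requirement by four, which is exactly the lever discussed in the rest of this post.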

Having one map entry for each LBA is the most effective granularity, as it allows the system to write (i.e., create a map entry) at the lowest possible granularity. This is typically benchmarked as 4KB random write, which is commonly used to measure and compare SSD write performance and endurance.

However, this may not be tenable in the long run. What if, instead, we had one map entry every 4 LBAs? Or every 8, 16, 32+ LBAs? If we use one map entry every 4 LBAs (i.e., one entry every 16KB) we may save DRAM, but what happens when the system wants to write 4KB? Since entries now cover 16KB, the SSD needs to read the 16KB page, modify the 4KB being written, and write back the entire 16KB page. This impacts performance ("read 16KB, modify 4KB, write back 16KB" instead of just "write 4KB") but, most importantly, it impacts endurance (the system writes 4KB but the SSD ends up writing 16KB to NAND), reducing SSD life by a factor of 4. This is worrisome on QLC technology, which has a much more challenging endurance profile. If there is one thing that cannot be wasted with QLC, it is endurance!
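The read-modify-write penalty above can be modeled in a few lines (a simplified sketch that assumes IU-aligned writes and ignores other WAF sources such as garbage collection):

```python
IU = 16 * 1024  # one map entry every 4 LBAs = 16KB

def rmw_waf(host_write_bytes: int) -> float:
    """NAND bytes written / host bytes written for an IU-aligned write.
    Any write smaller than the IU still rewrites a whole IU on NAND."""
    ius_touched = -(-host_write_bytes // IU)  # ceiling division
    return ius_touched * IU / host_write_bytes

print(rmw_waf(4 * 1024))   # 4.0: host writes 4KB, NAND absorbs 16KB
print(rmw_waf(16 * 1024))  # 1.0: a full-IU write carries no penalty
```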

So, the common reasoning is that the map granularity (or indirection unit, "IU", in more formal terms) cannot be changed, otherwise SSD life (endurance) would severely decline.

While all of the above is correct, do systems really write data at 4KB granularity? And how often? One can surely buy a system just to run FIO with a 4KB random-write profile but, realistically, people do not use systems that way. They buy them to run applications, databases, file systems, object stores, and so on. Do any of those use 4KB writes?

We decided to measure it. We selected a set of different application benchmarks, from TPC-H (data analytics) to YCSB (cloud operations), running on a variety of databases (Microsoft® SQL Server®, RocksDB, Apache Cassandra®), a variety of file systems (EXT4, XFS) and, in some cases, full software-defined storage solutions such as Red Hat® Ceph® Storage, and measured how many 4KB writes are issued and what contribution they make to Write Amplification, i.e., the extra writes that reduce device life.

Before going into the details of the analysis we need to discuss why write size matters when endurance is at stake.

A 4KB write will create a "write 16K to modify 4K" pattern and thus a 4x Write Amplification Factor ("WAF"). But what if we have an 8K write? Assuming it falls inside a single IU, it becomes "write 16K to modify 8K", so WAF=2. A bit better. What if we write 16K? It may not contribute to WAF at all, since it is "write 16K to modify 16K". Hence, only small writes contribute significantly to WAF.

There is also the subtlety that writes may not be aligned to the IU, so some misalignment always contributes to WAF, but that contribution also decreases rapidly with size.
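This size-and-alignment trend can be reproduced with a small model (an illustrative sketch that only counts IU rewrites, assuming a 16KB IU):

```python
IU = 16 * 1024

def induced_waf(size: int, offset: int = 0) -> float:
    """16K-IU-induced WAF for a write of `size` bytes starting at byte
    `offset`: every IU the write touches must be rewritten in full."""
    first_iu = offset // IU
    last_iu = (offset + size - 1) // IU
    return (last_iu - first_iu + 1) * IU / size

print(induced_waf(4 * 1024))           # 4.0    (aligned 4KB write)
print(induced_waf(256 * 1024))         # 1.0    (aligned 256KB write)
print(induced_waf(256 * 1024, 4096))   # 1.0625 (unaligned 256KB straddles 17 IUs)
```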

The chart below shows this trend:

Figure 1: 16K IU induced WAF, showing larger IOs have a smaller impact

 

Large writes have minimal WAF impact. A 256KB write, for example, may have no impact (WAF=1x) if aligned, or a minimal one (WAF=1.06x) if not. Much better than the scary 4x brought by a 4KB write!

We then need to profile all writes coming to the SSD and look at their alignment within an IU to compute the WAF contribution of each of them; and the bigger the write, the better. To this end, we instrumented systems to trace IOs under several benchmarks. We collected samples for 20 minutes (generally between 100 and 300 million samples per benchmark) and then post-processed them to look at size and IU alignment, and to add each IO's contribution to WAF.
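The post-processing step can be sketched as follows (hypothetical code operating on a toy trace; the actual traces and tooling are not published):

```python
def measured_waf(write_trace, iu=16 * 1024):
    """Aggregate WAF over a trace of (offset_bytes, size_bytes) writes:
    total NAND bytes rewritten divided by total host bytes written."""
    host_bytes = nand_bytes = 0
    for offset, size in write_trace:
        first_iu, last_iu = offset // iu, (offset + size - 1) // iu
        host_bytes += size
        nand_bytes += (last_iu - first_iu + 1) * iu
    return nand_bytes / host_bytes

# toy trace: one small write amid larger, IU-friendly ones
trace = [(0, 4096), (16384, 262144), (0, 16384)]
print(round(measured_waf(trace), 3))  # 1.043
```

The same aggregation run over hundreds of millions of real samples yields the "measured" WAF column discussed below.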

The table below shows how many IOs fall into each size bucket:

Figure 2: Real IU-induced WAF data from benchmarks (by IO count)

 

As shown, most writes fall either in the small 4-8KB (bad) buckets or in the 256KB+ (good) buckets.

If we apply the WAF chart above and assume that all these IOs are unaligned, we get what is reported in the "worst case" column: most WAF values fall in the 1.x range, some in the 2.x range, and very few in the 3.x range. Much better than the expected 4x, but not good enough to make this viable.

However, not all IOs are misaligned. Why would they be? Why would a modern file system create structures that are misaligned at such small granularities? Answer: they don't.

We measured each of the 100+ million IOs for each benchmark and post-processed them to determine how they align with a 16KB IU. The result is in the last column, "measured" WAF. It is generally below 5%, i.e., WAF <= 1.05. This means one can increase the IU size by 400%, building large SSDs out of QLC NAND and existing, smaller DRAM technologies, at a life cost below 5% rather than the 400% postulated! These are astonishing results.

One may argue: "There are a lot of small writes at 4KB and 8KB, and they do have a 400% or 200% individual WAF contribution. Shouldn't the aggregated WAF be much higher because of such small but numerous IO contributions?" True, there are a lot of them, but they are small, so their payload is minimal in terms of volume. In the table above, a 4KB write counts as a single write just as a single 256KB write does, but the latter carries 64x the amount of data of the former.

If we weight the table above by IO volume (i.e., accounting for each IO's size and the data it moves) rather than by IO count, we arrive at the following:
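The difference between counting IOs and weighting them by volume can be illustrated numerically (made-up bucket figures for illustration, not the measured data):

```python
# (io_size_bytes, io_count, per-IO WAF at a 16KB IU) -- hypothetical buckets
buckets = [(4 * 1024, 1_000_000, 4.0), (256 * 1024, 1_000_000, 1.0)]

# naive average by IO count
waf_by_count = sum(n * w for _, n, w in buckets) / sum(n for _, n, _ in buckets)

# weighted by bytes actually moved
host_bytes = sum(s * n for s, n, _ in buckets)
nand_bytes = sum(s * n * w for s, n, w in buckets)
waf_by_volume = nand_bytes / host_bytes

print(waf_by_count)             # 2.5:   by count, small IOs look dominant
print(round(waf_by_volume, 3))  # 1.046: by volume, large IOs dominate
```

Even with equal IO counts, the 256KB bucket carries 64x the data, so it dominates the volume-weighted result.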

Figure 3: Real IU-induced WAF data from benchmarks (by volume)

 

As we can see, the color grading of the heavier IOs now skews to the right, meaning that large IOs move an overwhelming amount of data and hence the overall WAF contribution is small.

One last thing to note is that not all SSD workloads are suitable for this approach. The last row, for example, represents the metadata portion of a Ceph storage node, which performs very small IOs, causing a high WAF of 2.35x. Large-IU drives are not a good fit for metadata alone. However, if we mix data and metadata in Ceph (a common approach with NVMe SSDs), the size and amount of data trumps the size and amount of metadata, so the combined WAF is minimally affected.

Our testing shows that, for real applications and the most common benchmarks, moving to a 16K IU is a viable approach. The next step is convincing the industry to stop benchmarking SSDs with 4K random writes in FIO, which has never been realistic and, at this point, is detrimental to evolution.

Impact of different IU sizes

The most obvious follow-up question is: why a 16KB IU size? Why not 32KB or 64KB, and does it even matter?

This is a very fair question that requires specific investigation and should be turned into a more specific question: what is the impact of different IU sizes for any given benchmark?

Since we already have traces that are independent of IU size, we only need to run them through the appropriate model and look at the impact.
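Conceptually, the sweep is just re-running the same trace through the WAF model with different IU sizes (a sketch using a toy trace, not the measured data):

```python
def trace_waf(write_trace, iu):
    """Total NAND bytes / total host bytes for a given IU size."""
    host = nand = 0
    for offset, size in write_trace:
        first_iu, last_iu = offset // iu, (offset + size - 1) // iu
        host += size
        nand += (last_iu - first_iu + 1) * iu
    return nand / host

trace = [(0, 4096), (16384, 262144)]  # toy trace of (offset, size) writes
for iu_kb in (4, 16, 32, 64):
    print(f"{iu_kb:>2}KB IU -> WAF {trace_waf(trace, iu_kb * 1024):.2f}")
```

The same trace produces a progressively worse WAF as the IU grows, which is the shape of the curves in Figure 4.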

Figure 4 shows the impact of IU size on WAF:

Figure 4: Impact of IU size on WAF

 

A few results stand out from the chart:

  • IU size matters, and WAF degrades as the IU grows. There is no intrinsically good or bad solution; everyone has to make their own trade-offs based on their needs and goals.
  • The WAF degradation is not nearly as bad as what may be feared, as we have seen in many cases above. Even in the worst case of a 64KB IU with the most aggressive benchmark, it is less than 2x, not the 16x one might fear.
  • Metadata, as noted before, is always a bad fit for large IUs, and the bigger the IU, the worse it gets.
  • JESD219, the industry-standard profile for benchmarking WAF, is not good but acceptable at a 4KB IU, with an extra 3% WAF that is generally tolerable; it becomes unusable at larger IUs, with a case in point of almost 9x at a 64K IU.

DMTS - System Architecture

Luca Bert

Luca is a Distinguished Member of SSD System Architecture with over 30 years of enterprise storage experience. His focus is mainly on innovative features and their use in systems to further SSD value. He holds a Master's Degree in Solid State Physics from the University of Torino (Italy).