Enterprise versus Client SSD

A growing number of enterprise datacentres that require high data throughput and low transaction latency, and that were previously reliant on Hard Disk Drives (HDD) in their servers, are now running into performance bottlenecks. As a result, they are looking to Solid-State Drives (SSD) as a viable storage solution to increase their datacentre performance, efficiency and reliability, and to lower overall operating expenses (OpEx).

To begin to understand the differences between SSD classes, we have to distinguish the two key components of an SSD – the Flash Storage Controller (or simply called the SSD controller) and the non-volatile NAND Flash memory used to store data.

In today’s market, SSD and NAND Flash memory consumption are split into three main groups:
  • Consumer devices (tablets, cameras, mobile phones)
  • Client systems (netbooks, notebooks, Ultrabooks, AIO and desktop personal computers) and embedded/industrial systems (gaming kiosks, purpose-built systems, digital signage)
  • Enterprise computing platforms (HPC, datacentre servers)

Choosing the right SSD storage device for enterprise datacentres can be a long and arduous process of learning and qualifying a multitude of different SSD vendors and product types, as not all SSDs and NAND Flash memory are created equal.

SSDs are manufactured to be easily deployable as a replacement for or complement to rotational magnetic platter-based Hard Disk Drives (HDDs) and are available in a number of different form factors, including 2.5". They feature communication protocols/interfaces including Serial ATA (SATA), Serial Attached SCSI (SAS) and, more recently, PCIe to transfer data to and from the Central Processing Unit (CPU) of a server.

Being easily deployable, however, does not guarantee that every SSD will be suitable in the long term for the enterprise application for which it was selected. The cost of choosing the wrong SSD can quickly negate any initial cost savings and performance benefits if the drives wear out prematurely due to excessive writes, deliver far lower sustained write performance over their expected lifetime, or introduce additional latency into the storage array and therefore require early field replacement.

We will discuss the three main qualities that distinguish enterprise-class from client-class SSDs to assist in making the right purchasing decision when the time comes to replace or add further storage to an enterprise datacentre.

Performance

SSDs can deliver incredibly high read and write performance for both sequential and random data requests from the CPU through the use of multi-channel architecture and parallel access from the SSD’s Controller to the NAND Flash chips.

A typical datacentre scenario involves processing large volumes of random company data: collaboration on technical CAD drawings, analysis of seismic data (e.g. Big Data) or access to worldwide customer data for banking transactions (e.g. OLTP). The storage devices must respond with the lowest possible latency, even when a large number of clients need access to the same item of data simultaneously, with no degradation in response time. User experience depends on low latencies, which in turn increases user productivity.

A client application, by contrast, typically involves only a single user or application, with a higher tolerable delta between the minimum and maximum response time (or latency) for any user or system action.

Complex storage arrays that use SSDs (e.g. Network Attached Storage, Direct Attached Storage or Storage Area Network) are also adversely affected by mismatched drive performance, which can wreak havoc on array latency, sustained performance and, ultimately, the quality of service perceived by users.

Unlike client SSDs, Kingston’s enterprise-class SSDs are optimised not only for peak performance in the first few seconds of access but also, by using a larger over-provisioned (OP) area, for higher sustained steady-state performance over longer periods of time. More information on specific drives can be found on the Kingston website under Enterprise SSDs.1

This guarantees that the storage array performance stays consistent with the organisation’s expected Quality of Service (QoS) requirement during peak traffic loads.
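As an illustration of how steady-state QoS is often assessed, the sketch below (in Python, using hypothetical latency samples) summarises average and tail latency percentiles, the figures that QoS targets typically track; it is a minimal example, not a benchmark methodology.

```python
# Illustrative sketch: evaluating latency consistency (QoS) from a set of
# I/O latency samples, e.g. collected during a steady-state benchmark run.
# The sample values below are hypothetical, not measurements of any drive.

def percentile(samples_ms, pct):
    """Return the pct-th percentile (0-100) of a list of latency samples."""
    ordered = sorted(samples_ms)
    index = min(len(ordered) - 1, int(round(pct / 100.0 * (len(ordered) - 1))))
    return ordered[index]

def qos_report(samples_ms):
    """Summarise average and tail latency, the values QoS targets usually track."""
    avg = sum(samples_ms) / len(samples_ms)
    return {
        "avg_ms": round(avg, 3),
        "p99_ms": percentile(samples_ms, 99),
        "p99.9_ms": percentile(samples_ms, 99.9),
        "max_ms": max(samples_ms),
    }

if __name__ == "__main__":
    # A drive with a low average but a long tail may still violate a
    # latency-sensitive QoS target during peak load.
    samples = [0.2] * 990 + [5.0] * 9 + [50.0]
    print(qos_report(samples))
```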

Reliability

NAND Flash memory has a number of inherent issues, the two most important being a finite life expectancy, as NAND Flash cells wear out with repeated writes, and a naturally occurring error rate.

During the production process for NAND Flash, each NAND Flash die cut from silicon wafers is tested and characterised with a raw Bit Error Rate (BER or RBER).

The BER is the rate at which bit errors naturally occur in NAND Flash without the benefit of Error Correction Code (ECC). The SSD controller corrects these errors on the fly using advanced ECC (typically called BCH ECC, Strong ECC or LDPC error correction by different SSD controller manufacturers) without disrupting user or system access.

The SSD controller’s ability to correct these bit errors is quantified by the Uncorrectable Bit Error Ratio (UBER), “a metric for data corruption rate equal to the number of data errors per bit read after applying any specified error-correction method”.1

As defined and standardised by JEDEC, the industry standards association, in 2010 in the documents JESD218A: Solid State Drive (SSD) Requirements and Endurance Test Method and JESD219: Solid State Drive (SSD) Endurance Workloads, enterprise-class SSDs differ from client-class SSDs in a number of ways including, but not limited to, their ability to support heavier write workloads and more extreme environmental conditions, and to recover from a higher BER.23

Application Class | Workload (see JESD219) | Active Use (power on) | Retention Use (power off) | UBER Requirement
Client            | Client                 | 40°C, 8 hrs/day       | 30°C, 1 year              | ≤10⁻¹⁵
Enterprise        | Enterprise             | 55°C, 24 hrs/day      | 40°C, 3 months            | ≤10⁻¹⁶

Table 1 - JESD218A:Solid State Drive (SSD) Requirements and Endurance Test Method
Copyright JEDEC. Reproduced with permission by JEDEC.

Using the JEDEC-proposed UBER requirements for enterprise versus client SSDs, an enterprise-class SSD is expected to experience no more than 1 unrecoverable bit error for every 10 quadrillion bits processed (~1.11 Petabytes), compared to a client SSD at 1 bit error for every 1 quadrillion bits (~0.11 Petabytes).
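As a rough illustration of what those ratios mean in practice, the sketch below (Python, with a hypothetical 1 PB read volume) converts a UBER specification into an expected number of unrecoverable bit errors.

```python
# Illustrative sketch: translating a UBER specification into the expected
# number of unrecoverable bit errors for a given volume of data read.
# UBER = unrecoverable errors / bits read, so expected errors = bits_read * UBER.

def expected_unrecoverable_errors(terabytes_read, uber):
    bits_read = terabytes_read * 1e12 * 8   # decimal terabytes to bits
    return bits_read * uber

# Hypothetical example: 1 PB (1000 TB) read over the drive's deployment.
for label, uber in (("client (<=1e-15)", 1e-15), ("enterprise (<=1e-16)", 1e-16)):
    errors = expected_unrecoverable_errors(1000, uber)
    print(f"{label}: ~{errors:.1f} expected unrecoverable bit errors per 1 PB read")
```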

Kingston’s enterprise SSDs also add technologies that allow the recovery of corrupted blocks of data using parity data stored in other NAND dies (similar to RAID across drives, this allows specific blocks to be rebuilt from the parity data stored in other blocks).

To complement the redundant data block recovery technologies built into Kingston enterprise SSDs, periodic checkpoint creation, Cyclic Redundancy Check (CRC) and ECC error correction are also implemented in an end-to-end internal protection scheme to guarantee the integrity of data from the host through the flash and back to the host. End-to-end data protection means that data received from the host is checked for integrity during its storage into the SSD’s internal cache and when written or read back from the NAND storage areas.
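As a simplified illustration of the CRC portion of such a scheme (a minimal sketch of the general idea, not Kingston’s internal implementation), the Python example below attaches a CRC32 checksum when data is accepted and re-verifies it when the data is read back.

```python
import zlib

# Illustrative sketch of CRC-based end-to-end integrity checking: a checksum is
# computed when data enters the device and re-checked when it is read back.
# This models the general concept only, not Kingston's internal protection scheme.

def protect(data: bytes):
    """Attach a CRC32 checksum to a block of host data before it is stored."""
    return data, zlib.crc32(data)

def verify(data: bytes, stored_crc: int) -> bool:
    """Re-check a stored block against its checksum when it is read back."""
    return zlib.crc32(data) == stored_crc

block, crc = protect(b"host data written to the SSD")
assert verify(block, crc)                      # clean read path
corrupted = b"host data written to the SSX"    # simulate a silent corruption
assert not verify(corrupted, crc)              # corruption is detected, not returned
```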

Similar to an enterprise-class SSD's enhanced ECC protection against bit errors, SSDs may also contain physical circuitry for power loss detection that manages power storage capacitors on the SSD. Hardware power fail support monitors incoming power to the SSD and, during a surprise power loss, provides temporary power to the SSD circuitry using tantalum capacitors so that any outstanding internally or externally issued writes can be completed before the SSD powers down. Power fail protection circuitry is usually required for applications where data loss is not recoverable.

Power fail protection may also be implemented in the SSD firmware through frequent flushing of data in the SSD controller’s cache areas (e.g. its Flash Translation Layer table) to the NAND storage. This does not guarantee that no data will be lost during a power loss event, but it minimises the impact of unsafe power shutdowns. Firmware power fail protection also makes it unlikely that the SSD will become inoperable after encountering an unsafe shutdown.

In many situations, the use of software-defined storage or server clustering may reduce the need for hardware-based power fail support, as any data is replicated onto a separate and independent storage device on one or more different servers. Web-scale datacentres often dispense with hardware power fail support, using software-defined storage to, in effect, RAID whole servers and store redundant copies of the same data.

Endurance

All NAND Flash memory contained in Flash storage devices degrades in its ability to reliably store bits of data with every program or erase (P/E) cycle of a NAND Flash memory cell, until the NAND Flash blocks can no longer reliably store data. At that point, the degraded or bad block is removed from the user-addressable storage pool and its logical block address (LBA) is mapped to a new physical address on the NAND Flash storage array. A replacement block is drawn from the spare block pool that forms part of the over-provisioned (OP) storage on the SSD.
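The Python sketch below is a deliberately simplified, hypothetical model of this remapping: the logical block address keeps pointing at valid data while the worn physical block is retired and replaced from the over-provisioned spare pool. Real flash translation layers are considerably more sophisticated.

```python
# Simplified sketch of logical-to-physical remapping when a block wears out.
# All names and sizes here are hypothetical; real FTLs track far more state
# (wear counters, valid-page maps, garbage collection, etc.).

class BlockMap:
    def __init__(self, physical_blocks, spare_blocks):
        self.lba_to_physical = {lba: lba for lba in range(physical_blocks)}
        self.spares = list(range(physical_blocks, physical_blocks + spare_blocks))
        self.retired = set()

    def retire_block(self, lba):
        """Move an LBA off a degraded physical block onto a spare block."""
        if not self.spares:
            raise RuntimeError("spare pool exhausted: drive should be retired")
        worn = self.lba_to_physical[lba]
        self.retired.add(worn)
        self.lba_to_physical[lba] = self.spares.pop(0)

m = BlockMap(physical_blocks=4, spare_blocks=2)
m.retire_block(2)             # block backing LBA 2 exceeded its P/E budget
print(m.lba_to_physical)      # LBA 2 now maps to the first spare physical block
```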

As a cell is repeatedly programmed and erased, the BER also increases with accumulated P/E cycles, and therefore a complex set of management techniques must be implemented in the enterprise SSD controller to maintain the cells’ ability to reliably store data over the expected life of the SSD.4

The P/E endurance of a given NAND Flash memory can vary substantially depending on the current lithography manufacturing process and type of NAND Flash produced.

NAND flash memory type            | TLC              | MLC              | SLC
Architecture                      | 3 bits per cell  | 2 bits per cell  | 1 bit per cell
Capacity                          | Highest capacity | High capacity    | Lowest capacity
Endurance (P/E)                   | Lowest endurance | Medium endurance | High endurance
Cost                              | $                | $$               | $$$$
Approx. NAND Bit Error Rate (BER) | 10⁴              | 10⁷              | 10⁹

Table 2 – NAND flash memory types 56

Enterprise SSDs also differ from client SSDs in their duty cycle. An enterprise-class SSD must be able to withstand heavy read or write activity in scenarios typical of a datacentre server that requires access to data 24 hours a day, every day of the week, whereas a client-class SSD is typically only fully utilised for 8 hours each day during the working week. Enterprise SSDs therefore have a 24x7 duty cycle, compared to client SSDs with a 20/80 duty cycle (20% of the time active, 80% in idle or sleep mode during computer usage).

Understanding the write endurance of any application or SSD can be complex, which is why the JEDEC committee also proposed an endurance measurement metric using the Terabytes Written (TBW) value to indicate the amount of host data that can be written to the SSD before the NAND Flash contained in the SSD becomes an unreliable storage medium and the drive should be retired.

Using the JESD218A testing methods and JESD219 enterprise-class workloads proposed by JEDEC, it becomes an easier task to interpret an SSD manufacturer's endurance calculations via TBW and extrapolate a more understandable endurance measure that can be applied to any datacentre.

As noted in JESD218 and JESD219, different application-class workloads can also suffer from a Write Amplification Factor (WAF) an order of magnitude higher than the actual writes submitted by the host, which can easily lead to unmanageable NAND Flash wear, a higher NAND Flash BER from excessive writes over time, and slower performance from invalid pages widely distributed across the SSD.
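As an illustration of how TBW, daily host writes and WAF interact, the sketch below uses hypothetical figures (a 1,600 TBW rating and 1 TB of host writes per day) and treats the TBW rating as a simple write budget eroded by write amplification; actual TBW ratings are defined against the JESD219 workload, so this is an estimate, not a specification.

```python
# Illustrative endurance arithmetic: how long a drive's rated TBW lasts for a
# given daily host write load and write amplification factor (WAF). All figures
# below are hypothetical examples, not specifications of any particular drive.

def expected_lifetime_years(tbw_rating_tb, host_writes_tb_per_day, waf):
    """Simplified model: NAND wears by host writes * WAF, so the host-visible
    write budget shrinks to rating / WAF."""
    effective_host_tbw = tbw_rating_tb / waf
    return effective_host_tbw / host_writes_tb_per_day / 365.0

# Hypothetical example: a 1,600 TBW drive receiving 1 TB of host writes per day.
for waf in (1.5, 3.0, 10.0):
    years = expected_lifetime_years(1600, 1.0, waf)
    print(f"WAF {waf:>4}: ~{years:.1f} years before the write budget is reached")
```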

While TBW is an important part of the enterprise versus client discussion, it is only a NAND Flash-level endurance prediction model; the Mean Time Between Failures (MTBF) should also be considered, as a component-level endurance and reliability prediction model based on the reliability of the components used in the device. An enterprise-class SSD's components are expected to last longer and to work harder at managing the voltages across all NAND Flash memory over the SSD's life expectancy. All enterprise SSDs should be rated at a minimum of one million hours MTBF, which translates to over 114 years! Kingston specifies its SSDs very conservatively, and it is not uncommon to see higher MTBF specifications on SSDs; 1 million hours is more than a sufficient starting point for enterprise SSDs.
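The arithmetic behind those MTBF figures is straightforward, as the short sketch below shows; note that MTBF is a fleet-level statistical measure rather than a lifetime promise for any single drive.

```python
# The arithmetic behind the MTBF figures quoted above. An MTBF of 1,000,000
# hours does not mean one drive lasts 114 years; across a large population it
# corresponds to roughly one failure per 1,000,000 accumulated device-hours.

HOURS_PER_YEAR = 24 * 365

mtbf_hours = 1_000_000
print(mtbf_hours / HOURS_PER_YEAR)     # ~114.2 years per expected failure

# Approximate annualised failure rate (AFR) for a 24x7 duty cycle:
afr = HOURS_PER_YEAR / mtbf_hours
print(f"AFR ~ {afr:.2%} of drives expected to fail per year")
```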

S.M.A.R.T. monitoring and reporting on enterprise-class SSDs allows the device to be easily queried for its remaining life expectancy based on the current Write Amplification Factor (WAF) and wear level. Pre-failure predictive warnings for events such as a loss of power, bit errors on the physical interface or uneven wear distribution are often also supported. The Kingston SSD Manager utility can be downloaded from the Kingston website and used to view a drive’s status.
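For scripted, vendor-neutral access to S.M.A.R.T. data, one common approach is to call the open-source smartctl utility (smartmontools), as the hedged Python sketch below does; the device path is an assumption, and the exact attributes reported vary by drive, vendor and interface.

```python
import subprocess

# Hedged example: reading S.M.A.R.T. attributes with the open-source smartctl
# utility (smartmontools) on Linux. The device path below is an assumption, and
# the attribute names returned (wear level, lifetime writes, etc.) vary by
# vendor and interface; Kingston SSD Manager is the vendor tool mentioned above.

def read_smart_attributes(device="/dev/sda"):
    result = subprocess.run(
        ["smartctl", "-A", device],      # -A prints the vendor attribute table
        capture_output=True, text=True, check=False,
    )
    return result.stdout

if __name__ == "__main__":
    print(read_smart_attributes())
```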

Client-class SSDs may only feature the minimum S.M.A.R.T. output for monitoring the SSD during standard use or post failure.

Depending on the application class and capacity of the SSD, an increased reserve of NAND Flash memory can also be allocated as over-provisioned (OP) spare capacity. The OP capacity is hidden from user and operating system access and can be used as a temporary write buffer for higher sustained performance and as a source of replacement blocks for defective Flash memory cells over the life of the SSD, enhancing its reliability and endurance through a greater number of spare blocks.
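As a quick illustration of how OP is commonly quantified, the sketch below computes an over-provisioning percentage from example raw and user capacities; the figures are illustrative and do not describe the layout of any particular drive.

```python
# Illustrative over-provisioning arithmetic: OP is commonly expressed as the
# share of raw NAND capacity held back from the user-addressable space.
# The capacities below are example figures, not any specific drive's layout.

def over_provisioning_pct(raw_capacity_gb, user_capacity_gb):
    return (raw_capacity_gb - user_capacity_gb) / user_capacity_gb * 100.0

# e.g. 1024 GB of raw NAND exposed as a 960 GB drive versus an 800 GB drive
for user_gb in (960, 800):
    print(f"{user_gb} GB user capacity: ~{over_provisioning_pct(1024, user_gb):.1f}% OP")
```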

Conclusion

There are distinctive differences between enterprise and client-class SSDs ranging from their NAND Flash memory program and erase endurance to their complex management techniques to suit different application class workloads.

Understanding these differences in application classes as they pertain to performance, reliability and endurance can be an effective tool in minimising and managing the risk of disruptive downtime in the demanding and often mission-critical enterprise environment. For more questions, contact your Kingston representative or use the Ask An Expert or Tech Support Chat features on Kingston.com.
