A growing number of enterprise datacenters that require high data throughput and low transaction latency, and that previously relied on Hard Disk Drives (HDDs) in their servers, are now running into performance bottlenecks. They are looking to Solid-State Drives (SSDs) as a viable storage solution to increase datacenter performance, efficiency and reliability, and to lower overall operating expenses (OpEx).
To begin to understand the differences between SSD classes, we have to distinguish the two key components of an SSD: the Flash Storage Controller (or simply the SSD controller) and the non-volatile NAND Flash memory used to store data.
In today’s market, SSD and NAND Flash memory consumption are split into three main groups:
- Consumer devices (tablets, cameras, mobile phones)
- Client systems (netbook, notebook, Ultrabook, AIO, desktop personal computers) and embedded/industrial systems (gaming kiosk, purpose-built system, digital signage)
- Enterprise computing platforms (HPC, datacenter servers)
Choosing the right SSD storage device for enterprise datacenters can be a long and arduous process of learning and qualifying a multitude of different SSD vendors and product types as not all SSDs and NAND Flash memory are created equal.
SSDs are manufactured to be easily deployable as replacements for, or complements to, rotating magnetic platter-based Hard Disk Drives (HDDs). They are available in a number of form factors, including 2.5", and with communication protocols/interfaces including Serial ATA (SATA), Serial Attached SCSI (SAS) and, more recently, PCIe to transfer data to and from the Central Processing Unit (CPU) of a server.
Being easily deployable, however, does not guarantee that every SSD will be suitable in the long term for the enterprise application it was selected for. The cost of choosing the wrong SSD can often negate any initial cost savings and performance benefits when the SSDs wear out prematurely due to excessive writes, achieve far lower sustained write performance over their expected lifetime, or introduce additional latency in the storage array, and thus require early field replacement.
We will discuss the three main qualities that distinguish enterprise class SSDs from client class SSDs, to assist in making the right purchasing decision when the time comes to replace or add storage in an enterprise datacenter.
SSDs can deliver incredibly high read and write performance for both sequential and random data requests from the CPU through the use of multi-channel architecture and parallel access from the SSD’s Controller to the NAND Flash chips.
In a typical datacenter scenario involving processing millions of bytes of random company data, including collaboration on technical CAD drawings, seismic data for analysis (e.g., Big Data) or accessing worldwide customer data for banking transactions (e.g., OLTP), the storage devices must be accessible with the least amount of latency and can involve a large number of clients needing access to the same piece of data simultaneously with no degradation in response time. User experience is based upon having low latencies, which increase user productivity.
A client application will only involve a single user or application access with a higher tolerable delta between the minimum and maximum response time (or latency) on any user or system actions.
Complex storage arrays using SSDs (e.g., Network Attached Storage, Direct Attached Storage or Storage Area Network) are also adversely affected by mismatched performance, which can wreak havoc on the array's latency, sustained performance and, ultimately, quality of service as perceived by users.
Unlike client SSDs, Kingston’s enterprise class SSDs are optimized not only for peak performance in the first few seconds of access; by using a larger over-provisioned (OP) area, they also offer higher sustained steady-state performance over longer periods of time. More information on specific drives can be found on the Kingston web site under Enterprise SSDs.
This guarantees that the storage array performance stays consistent with the organization’s expected Quality of Service (QoS) requirement during peak traffic loads.
NAND Flash memory has a number of inherent issues. The two most important are a finite life expectancy, as NAND Flash cells wear out during repeated writes, and a naturally occurring error rate.
During the production process of NAND Flash, each NAND Flash die cut from silicon wafers is tested and characterized with a raw Bit Error Rate (BER or RBER).
The BER defines the rate at which bit errors naturally occur in NAND Flash without the benefit of Error Correction Code (ECC). The SSD controller corrects these errors on the fly using advanced ECC (typically called BCH ECC, Strong ECC or LDPC error correction by different SSD controller manufacturers) without disrupting user or system access.
The SSD controller’s ability to correct these bit errors is expressed by the Uncorrectable Bit Error Ratio (UBER), “a metric for data corruption rate equal to the number of data errors per bit read after applying any specified error-correction method”.
As defined and standardized by the industry standards association JEDEC in 2010, in documents JESD218A: Solid State Drive (SSD) Requirements and Endurance Test Method and JESD219: Solid State Drive (SSD) Endurance Workloads, enterprise class SSDs differ from client class SSDs in a number of ways, including, but not limited to, their ability to support heavier write workloads, more extreme environmental conditions and recovery from a higher BER.
|Application Class||Workload (see JESD219)||Active Use (power on)||Retention Use (power off)||UBER Requirement|
Table 1 - JESD218A: Solid State Drive (SSD) Requirements and Endurance Test Method
Copyright JEDEC. Reproduced with permission by JEDEC.
Using the JEDEC UBER requirements for enterprise versus client SSDs, an enterprise class SSD is expected to experience no more than 1 unrecoverable bit error for every 10 quadrillion bits processed (10^16 bits, about 1.25 petabytes or ~1.11 pebibytes), compared to 1 unrecoverable bit error for every 1 quadrillion bits processed (10^15 bits, about 0.125 petabytes or ~0.11 pebibytes) for a client SSD.
Kingston’s enterprise SSDs also add technologies that allow corrupted blocks of data to be recovered using parity data stored in other NAND dies (similar to RAIDing drives, a specific block can be rebuilt from the parity data stored in other blocks).
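The data volumes quoted for each UBER level follow from simple unit conversion. The sketch below is illustrative only; the function name `data_per_error` is hypothetical, and it assumes one error per 10^n bits as the reading of a 10^-n UBER. It shows that the ~1.11 and ~0.11 figures correspond to binary units (2^50 bytes), while the decimal equivalents are 1.25 and 0.125 petabytes.

```python
def data_per_error(uber_exponent: int) -> dict:
    """Data volume read per expected uncorrectable bit error.

    uber_exponent: n for a UBER of 10^-n (hypothetical helper, not a JEDEC API).
    """
    bits = 10 ** uber_exponent          # bits read per expected error
    nbytes = bits / 8                   # convert bits to bytes
    return {
        "bits": bits,
        "petabytes": nbytes / 10**15,   # decimal PB
        "pebibytes": nbytes / 2**50,    # binary PiB
    }

enterprise = data_per_error(16)  # UBER of 10^-16
client = data_per_error(15)      # UBER of 10^-15

print(f"enterprise: {enterprise['petabytes']:.3f} PB, {enterprise['pebibytes']:.2f} PiB")
print(f"client:     {client['petabytes']:.3f} PB, {client['pebibytes']:.2f} PiB")
```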
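As a rough illustration of parity-based block recovery, the following sketch uses simple XOR parity across blocks, in the manner of RAID 5. This is an assumption made for the example: the actual parity scheme used inside any particular SSD controller is not described here, and the names `xor_blocks`, `data` and `parity` are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks, imagined as living on three different NAND dies.
data = [b"die0data", b"die1data", b"die2data"]

# The parity block would be stored on a fourth die.
parity = xor_blocks(data)

# Suppose the block on die 1 becomes unreadable: XOR of the surviving
# blocks and the parity block reproduces the lost block.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered == data[1])
```

XOR parity can rebuild exactly one missing block per stripe; tolerating more simultaneous failures would need a stronger code.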
To complement the redundant data block recovery technologies built into Kingston enterprise SSDs, periodic checkpoint creation, Cyclic Redundancy Check (CRC) and ECC error correction are also implemented in an End-to-End internal protection scheme to guarantee the integrity of data from the host through the flash and back to the host. End-to-End data protection means that data that is received from the host is checked for integrity during its storage into the SSD’s internal cache and when written or read back from the NAND storage areas.
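The integrity check at each hop can be pictured as attaching a checksum when data arrives and re-verifying it at every later stage. The sketch below is a minimal illustration using CRC-32 from Python's standard `zlib` module; the function names `protect` and `verify` are hypothetical, and real controllers may use different CRC polynomials and hardware implementations.

```python
import zlib

def protect(payload: bytes) -> tuple[bytes, int]:
    """Attach a CRC-32 to data as it arrives from the host."""
    return payload, zlib.crc32(payload)

def verify(payload: bytes, crc: int) -> bool:
    """Re-check the CRC at a later stage (cache, NAND write, NAND read)."""
    return zlib.crc32(payload) == crc

payload, crc = protect(b"host sector data")
print(verify(payload, crc))                  # data intact

corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]   # flip one bit
print(verify(corrupted, crc))                # corruption detected
```

A CRC only detects corruption; correcting it is left to ECC or to the parity-based recovery described earlier.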
Similar to an enterprise class SSD's enhanced ECC protection against bit errors, SSDs may also contain physical circuitry for power loss detection that manages power storage capacitors on the SSD. Hardware Power Fail support monitors incoming power to the SSD and, during a surprise power loss, provides temporary power to the SSD circuitry using tantalum capacitors so that any outstanding internally or externally issued writes can complete before the SSD powers down. Power Fail protection circuitry is usually required for applications where data loss is not recoverable.
Power Fail protection may also be implemented in the SSD firmware through frequent flushing of data in the SSD controller’s cache areas (e.g., its Flash Translation Layer table) to the NAND storage. This does not guarantee that no data will be lost during a power loss event, but it tries to minimize the impact of unsafe power shutdowns. Firmware Power Fail protection also makes it less likely that the SSD becomes inoperable after an unsafe shutdown.
In many situations, the use of Software Defined Storage or server clustering may reduce the need for hardware-based Power Fail support as any data is replicated onto a separate and independent storage device on a different server or servers. Web-scale datacenters often dispense with Power Fail support using Software Defined Storage to, in effect, RAID servers to store redundant copies of the same data.
All NAND Flash memory contained in Flash storage devices degrades in its ability to reliably store bits of data with every program or erase (P/E) cycle of a NAND Flash memory cell, until the NAND Flash blocks can no longer reliably store data. At that point, the degraded or bad block is removed from the user-addressable storage pool and the logical block address (LBA) is moved to a new physical address on the NAND Flash storage array. A new storage block replaces the bad one using the Spare Block pool that is part of the Over-Provisioned (OP) storage on the SSD.
As the cell is constantly programmed or erased, the BER also increases linearly. For this reason, a complex set of management techniques must be implemented in the enterprise SSD controller to manage each cell's capability to reliably store data over the expected life of the SSD.
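The remapping of a bad block to a spare can be sketched with a toy mapping table. This is a deliberately simplified model, not how any specific controller works: it assumes one logical address per physical block, and the class name `BlockMap` and its methods are hypothetical.

```python
class BlockMap:
    """Toy logical-to-physical map with a spare block pool."""

    def __init__(self, user_blocks: int, spare_blocks: int) -> None:
        # Initially each logical block maps to the same-numbered physical block.
        self.lba_to_phys = {lba: lba for lba in range(user_blocks)}
        # Spare blocks sit beyond the user-addressable range (the OP area).
        self.spares = list(range(user_blocks, user_blocks + spare_blocks))
        self.retired: list[int] = []

    def retire(self, lba: int) -> int:
        """Retire the physical block behind `lba` and remap it to a spare."""
        if not self.spares:
            raise RuntimeError("spare pool exhausted: drive should be retired")
        bad = self.lba_to_phys[lba]
        self.retired.append(bad)
        self.lba_to_phys[lba] = self.spares.pop(0)
        return bad

bmap = BlockMap(user_blocks=4, spare_blocks=2)
bmap.retire(1)
print(bmap.lba_to_phys)   # logical block 1 now points at a spare
print(bmap.spares)        # one spare left
```

The host keeps using the same logical address throughout; only the physical location changes, which is why a larger spare pool extends the usable life of the drive.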
The P/E endurance of a given NAND Flash memory can vary substantially depending on the current lithography manufacturing process and type of NAND Flash produced.
|NAND flash memory type||TLC||MLC||SLC|
|Architecture||3 bits per cell||2 bits per cell||1 bit per cell|
|Capacity||Highest capacity||High capacity||Lowest capacity|
|Endurance (P/E)||Lowest endurance||Medium endurance||High endurance|
|Approx NAND Bit Error Rate (BER)||10^-4||10^-7||10^-9|
Enterprise SSDs also differ from client SSDs in their duty cycle. An enterprise class SSD must be able to withstand heavy read or write activity in scenarios typical of a datacenter server, where data must be accessible 24 hours a day, every day of the week. A client class SSD, by comparison, is typically only fully utilized for 8 hours a day. Enterprise SSDs have a 24x7 duty cycle, compared to client SSDs with a 20/80 duty cycle (20% of the time active, 80% in idle or sleep mode during computer usage).
Understanding the write endurance of any application or SSD can be complex, which is why the JEDEC committee also proposed an endurance measurement metric using the TeraBytes Written (TBW) value to indicate the amount of raw Host data that can be written to the SSD before the NAND Flash contained in the SSD becomes an unreliable storage medium and the drive should be retired.
Using the JEDEC proposed JESD218A testing methods and JESD219 enterprise class workloads, it becomes an easier task to interpret an SSD manufacturer’s endurance calculations via TBW and extrapolate a more understandable endurance measure that can be applied to any datacenter.
As noted in documents JESD218 and JESD219, different application class workloads can also suffer from a Write Amplification Factor (WAF) that makes NAND writes orders of magnitude higher than the writes actually submitted by the host. This can easily lead to unmanageable NAND Flash wear, higher NAND Flash BER from excessive writes over time, and slower performance from invalid pages widely distributed across the SSD.
While TBW is an important topic in the discussion of enterprise versus client class SSDs, TBW is only a NAND Flash level endurance prediction model. The Mean Time Between Failures (MTBF) should be treated as a component level endurance and reliability prediction model, based on the reliability of the components used in the device. Enterprise class SSD components are expected to last longer and to work harder at managing the voltages across all NAND Flash memory over the SSD's life expectancy. All enterprise SSDs should be rated for at least one million hours MTBF, which translates to over 114 years! Kingston specs its SSDs very conservatively and it is not uncommon to see higher MTBF specifications on SSDs; it is important to note that 1 million hours is more than a sufficient starting point for enterprise SSDs.
S.M.A.R.T. monitoring and reporting on enterprise class SSDs allows the device to be easily queried pre-failure for life expectancy based on the current write amplification factor (WAF) and wear level. Pre-failure predictive warnings for failure events such as a loss of power, bit errors occurring on the physical interface or uneven wear distribution are often also supported. The Kingston SSD Manager utility can be downloaded from the Kingston web site and used to view a drive’s status.
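The effect of write amplification on endurance can be approximated with a widely used rule of thumb: NAND writes equal host writes multiplied by WAF, so the host-visible TBW is roughly capacity times rated P/E cycles divided by WAF. The sketch below uses that approximation; it is not the JESD218A test method, and the function names and the example figures (1 TB capacity, 3,000 P/E cycles) are assumptions chosen for illustration.

```python
def nand_writes_tb(host_writes_tb: float, waf: float) -> float:
    """Data actually written to NAND for a given amount of host writes."""
    return host_writes_tb * waf

def estimated_tbw(capacity_tb: float, pe_cycles: int, waf: float) -> float:
    """Rule-of-thumb host TBW before the NAND reaches its rated P/E cycles."""
    return capacity_tb * pe_cycles / waf

# Hypothetical drive: 1 TB of NAND rated for 3,000 P/E cycles.
for waf in (1.0, 3.0, 10.0):
    tbw = estimated_tbw(capacity_tb=1.0, pe_cycles=3000, waf=waf)
    print(f"WAF {waf:>4}: about {tbw:.0f} TB of host writes")
```

Under this approximation, the same NAND delivers ten times less host-visible endurance at a WAF of 10 than at a WAF of 1, which is why workload characteristics matter as much as the NAND type.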
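The 114-year figure is a straight conversion of hours to years, and it describes a population statistic rather than the expected life of one drive. The sketch below reproduces the conversion and, as an added assumption not taken from the text above, derives an annualized failure rate (AFR) using the common constant-failure-rate (exponential) model; the function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

def mtbf_years(mtbf_hours: float) -> float:
    """Convert an MTBF rating in hours to years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR

def annualized_failure_rate(mtbf_hours: float) -> float:
    """AFR for a 24x7 duty cycle, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

print(f"{mtbf_years(1_000_000):.1f} years")
print(f"{annualized_failure_rate(1_000_000):.2%} expected failures per year")
```

Read this way, a one million hour MTBF suggests that somewhat under 1% of a large fleet of continuously running drives would be expected to fail in a given year, under the stated model.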
Client class SSDs may only feature the minimum S.M.A.R.T. output for monitoring the SSD during standard use or post-failure.
Depending on the application class and capacity of the SSD, an increased reserve capacity of NAND Flash memory can also be allocated as an over-provisioned (OP) spare capacity. The OP capacity is hidden from user and operating system access and can be utilized as a temporary write buffer for higher sustained performance and as a replacement of defective Flash memory cells during the life-expectancy of the SSD to enhance the reliability and endurance of the SSD (with greater numbers of Spare Blocks).
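Over-provisioning is commonly expressed as the hidden capacity relative to the user-visible capacity. The sketch below uses that common convention; the function name `op_percent` and the example capacities are illustrative assumptions rather than figures for any specific drive.

```python
def op_percent(physical_gb: float, user_gb: float) -> float:
    """Over-provisioning as a percentage of user-visible capacity."""
    if user_gb <= 0 or physical_gb < user_gb:
        raise ValueError("expected 0 < user_gb <= physical_gb")
    return 100 * (physical_gb - user_gb) / user_gb

# Hypothetical drives built from the same 512 GB of raw NAND.
print(f"{op_percent(512, 480):.1f}% OP")  # lighter over-provisioning
print(f"{op_percent(512, 400):.1f}% OP")  # heavier over-provisioning
```

Exposing less of the same raw NAND to the host leaves more spare blocks for write buffering and bad block replacement, at the cost of usable capacity.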
There are distinctive differences between enterprise and client class SSDs ranging from their NAND Flash memory Program and Erase endurance to their complex management techniques to suit different application class workloads.
Understanding these differences in application classes as they pertain to performance, reliability and endurance can be an effective tool in minimizing and managing the risk of disruptive downtime in the demanding and often mission-critical enterprise environment. For more questions, contact your Kingston representative or use the Ask An Expert or Tech Support Chat features on Kingston.com.
Uncorrectable bit-error rate (UBER) JEDEC dictionary, JEDEC Committee
JESD218A: Solid State Drive (SSD) Requirements and Endurance Test Method, JEDEC Committee
The Bleak Future of NAND Flash Memory, University of California
Characterization and Error-Correcting Codes for TLC Flash Memories, University of California
NAND Flash Qualification Guideline, California Institute of Technology