White Paper

Oracle NoSQL Database and IBM's High IOPS Adapters Offer Cost-effective, Extreme Performance for Big Data

Abstract

This paper describes the benefits of storing Oracle NoSQL Database data on Fusion’s ioDrive2 products. Oracle and Fusion-io partnered to test, validate, and deliver extremely high-performance big data solutions for real-time applications. The superior performance of Fusion’s ioDrive2 complements the scalability, reliability, and simplicity of Oracle NoSQL Database, dramatically improving throughput and response times for serving key-value data. The combination of Oracle NoSQL Database and ioMemory provides a compelling and cost-effective solution in a variety of scenarios. Testing showed that using an ioDrive2 for data delivered nearly 30 times more operations per second than a 300GB 10k SAS disk on a 95 percent read and 5 percent update workload, and about ten times more operations per second on a 50 percent read and 50 percent update workload. Equally impressive, the ioDrive2 cut average insert latency by a factor of more than seven on the 95/5 workload, and average read latency by a factor of more than 58 on the 50/50 workload.

What is Big Data? 

Big Data is an informal term that encompasses all sorts of data, including Web logs, sensor data, tweets, blogs, user reviews, and SMS messages. It is characterized by: high volume of hundreds of terabytes or more; wide data variety with no inherent structure (one row looks very different from another); and high velocity, on the order of hundreds of thousands of operations per second. Often, big data is processed using purpose-built software designed to address a specific data processing requirement. This category of big data processing solutions is generally referred to as NoSQL (not SQL or Not Only SQL). Although it is possible to process big data using traditional SQL-based products and solutions, NoSQL databases provide a more cost-effective and horizontally scalable alternative. NoSQL databases complement SQL-based solutions, providing significant new business advantages to the enterprise. Recently, there has been a huge surge of interest in big data processing solutions. As enterprises have embraced big data processing for business benefit, open source and commercial vendors have responded by providing a variety of solutions aimed at addressing specific big data processing needs.

In October 2011, Oracle announced a suite of complementary products and technologies that provide a complete and comprehensive solution to address the big data processing needs of the market. Big data processing falls into two major categories: interactive processing and batch processing. In most big data processing applications, both kinds of data processing are required. Oracle NoSQL Database (NoSQL DB for short), also released in October 2011, is a scalable, highly available key-value store that can be used to acquire and manage vast amounts of interactive information.

About Oracle NoSQL Database 

Oracle NoSQL Database is a highly available, linearly scalable, high-performance key-value database server. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

Oracle NoSQL Database provides a very simple data model to the application developer. Each row is identified by a unique key, and also has a value, of arbitrary length, which is interpreted by the application. The application can manipulate (insert, delete, update, read) a single row in a transaction. The application can also perform an iterative, non-transactional scan of all the rows in the database. The simplicity of this data model and access provides tremendous flexibility and performance benefits over an SQL-based solution for big data processing.
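The data model described above can be illustrated with a minimal sketch. The class below is a hypothetical in-memory stand-in written for this paper's discussion; the class and method names are assumptions and are not the Oracle NoSQL Database API. It shows only the shape of the model: unique keys, opaque values, single-row operations, and a scan over all rows.

```python
from typing import Dict, Iterator, Optional, Tuple


class ToyKeyValueStore:
    """Hypothetical in-memory key-value store; NOT the Oracle NoSQL Database API."""

    def __init__(self) -> None:
        self._rows: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # Insert or update a single row; the store never interprets the value.
        self._rows[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Read a single row by its unique key (None if absent).
        return self._rows.get(key)

    def delete(self, key: str) -> bool:
        # Remove a single row and report whether it existed.
        return self._rows.pop(key, None) is not None

    def scan(self) -> Iterator[Tuple[str, bytes]]:
        # Iterate over a copy of all rows, one (key, value) pair at a time.
        yield from list(self._rows.items())
```

Because values are opaque byte strings, one row can hold a JSON document and the next a binary sensor reading without any schema change.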

As mentioned earlier, big data is characterized by variety, volume and velocity. The key-value paradigm permits the application to manage any kind of data: one row can be structurally very different from another row. The volume of data managed might change dramatically from one day to the next. For example, if e-commerce transactions are being managed in Oracle NoSQL Database, the volume of transactions and data can increase more than ten-fold during a busy shopping season, such as the weeks before the Christmas holiday. The data management system needs to scale easily to handle the change in workload without compromising performance. Similarly, high throughput and low response time are critical in many big data processing applications such as e-commerce, targeted advertising, and any application that provides interactive access to the customer.

NoSQL DB is a sharded system in which each shard manages a subset of the data. Typically, a shard is composed of three independent nodes to provide high availability. One of the nodes in the shard is designated as the master, meaning it can serve read as well as write requests. Changes to data on the master node are continually propagated to the other nodes (the replicas) in the shard in order to keep the replicas up-to-date. Replicas can serve read requests; if the master node fails, one of the surviving replicas is elected as the new master, and processing continues without any interruption in database activity. Figure 1 illustrates the architecture of a typical NoSQL Database configuration with two clients. Note that the number of clients can vary, depending on application requirements.
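The sharding and failover behavior described above can be sketched roughly as follows. This is an illustration only: the hash-based key placement, the node names, and the "first surviving replica wins" election are assumptions made for the example, not a description of how NoSQL DB actually places keys or elects a master.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Shard:
    nodes: List[str]   # hypothetical node names
    master: str = ""   # node that currently accepts writes

    def __post_init__(self) -> None:
        if not self.master:
            self.master = self.nodes[0]

    def node_for_write(self) -> str:
        # Only the master serves write requests.
        return self.master

    def nodes_for_read(self) -> List[str]:
        # The master and every replica can serve reads.
        return list(self.nodes)

    def fail(self, node: str) -> None:
        # Drop a failed node; if it was the master, promote a survivor.
        self.nodes.remove(node)
        if node == self.master:
            self.master = self.nodes[0]  # stand-in for a real election


def shard_for_key(key: str, shards: List[Shard]) -> Shard:
    # Assumed placement rule: hash the key, then pick a shard by modulo.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:4], "big") % len(shards)]
```

A client would call `shard_for_key` once per request, then send writes to `node_for_write()` and reads to any entry of `nodes_for_read()`.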

Figure 1: NoSQL Database system architecture

Each node (master or replica) uses Berkeley DB Java Edition HA as the underlying data manager. Berkeley DB Java Edition uses a log-structured storage format to store the records and indices in the database. Log-structured storage is naturally optimized for write performance and can deliver extremely high write throughput. Through a combination of clever optimizations and effective use of memory, Berkeley DB Java Edition delivers excellent read performance as well.
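To see why log-structured storage favors writes, consider the toy sketch below. It is not Berkeley DB Java Edition's on-disk format; the record layout and the in-memory dictionary index are assumptions chosen to keep the example short. Every write is appended at the tail of the log (sequential I/O), while a read must jump to wherever the latest version of the record happens to live (random I/O).

```python
import io
import struct
from typing import Dict, Optional


class AppendOnlyLog:
    """Toy log-structured store: appends go to the tail, an index maps key -> offset."""

    HEADER = struct.Struct(">II")  # key length, value length (assumed layout)

    def __init__(self) -> None:
        self._log = io.BytesIO()            # stands in for a log file on disk
        self._index: Dict[bytes, int] = {}  # latest offset of each key

    def put(self, key: bytes, value: bytes) -> None:
        # Sequential write: always append at the end of the log.
        offset = self._log.seek(0, io.SEEK_END)
        self._log.write(self.HEADER.pack(len(key), len(value)) + key + value)
        # Point the index at the new version; the old one becomes garbage.
        self._index[key] = offset

    def get(self, key: bytes) -> Optional[bytes]:
        offset = self._index.get(key)
        if offset is None:
            return None
        # Random read: jump to the record's position in the log.
        self._log.seek(offset)
        key_len, value_len = self.HEADER.unpack(self._log.read(self.HEADER.size))
        self._log.seek(key_len, io.SEEK_CUR)  # skip over the stored key
        return self._log.read(value_len)

    def log_size(self) -> int:
        return len(self._log.getvalue())
```

Note that an update does not overwrite the old record in place; the log simply grows, which is why real log-structured stores also need a cleaner that this sketch omits.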

 

Big Data—The Problems with Conventional Technology 

Transactional semantics, high availability, scalable throughput and predictable latency are “must-have” requirements for the interactive (or “real-time”) big data processing for which Oracle NoSQL Database is designed. For example, a retail e-commerce application must respond to user requests in under one or two seconds to ensure high user retention. Similarly, an in-home health care application must have the ability to capture and monitor data from multiple sensors, while processing and responding to critical medical events reliably and predictably without data loss.

A common technique to ensure high throughput and low latency is to store all the information in memory. Due to the high and unpredictable volumes of data, however, an in-memory solution is not cost-effective for big data processing. Typically, big data solutions store the vast majority of the information on disk, and use memory for caching the most frequently accessed subsets of data. The performance of storing and retrieving data from disk often limits the throughput and response time achievable by the system. In particular, the number of input-output operations per second (IOPS) that a disk can deliver will dictate the performance characteristics of the system.

Modern “spinning” disks are able to deliver fast sequential access, but poor sustained random performance of approximately 100 IOPS. Most often, the requirements of a NoSQL database application far exceed the capacity of a single disk. Consequently, high performance solutions often use multiple disks per machine in order to get additional I/O bandwidth. This can work adequately for smaller data sets, but as the volume of data to be processed increases, applications require external arrays and the cost of hardware and maintenance to scale systems quickly becomes impractical.
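The spindle-count problem can be made concrete with some back-of-the-envelope arithmetic. The sketch below takes the figure of roughly 100 random IOPS per disk from the text; the target rates and the `ios_per_op` parameter (how many random disk I/Os an average operation costs after caching) are hypothetical inputs, not measurements.

```python
import math


def disks_needed(target_ops_per_sec: float,
                 ios_per_op: float = 1.0,
                 iops_per_disk: float = 100.0) -> int:
    """Spindles required if each operation costs `ios_per_op` random I/Os."""
    return math.ceil(target_ops_per_sec * ios_per_op / iops_per_disk)


# Hypothetical target: 10,000 operations/sec, one random I/O per operation.
print(disks_needed(10_000))                  # 100
# Same target when caching absorbs half of the I/Os.
print(disks_needed(10_000, ios_per_op=0.5))  # 50
```

Even with a generous cache, the number of spindles grows linearly with the target rate, which is the scaling cost described above.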

The Flash Memory Solution 

Fusion’s ioMemory platform delivers the microsecond-latency access that interactive big data applications need to maintain “real-time” response times across tens of terabytes of capacity, something that in-memory databases cannot practically do. At the same time, ioMemory provides persistent storage and a level of I/O performance that disk arrays cannot achieve without racks of hardware and high-bandwidth network infrastructure. Oracle and Fusion-io have partnered to test ioMemory’s benefits to the Oracle NoSQL Database.

About the Tests 

The tests were run on a single shard consisting of three nodes. Each node was a Sun Fire X4170 M2 configured with two Intel 2.93GHz 6-Core Xeon X5670 processors, 72GB of DRAM, a 300GB 10k SAS hard disk, and a 1.2TB ioDrive2. The machines were configured with Oracle Linux Server release 5.7 and a pre-release version of NoSQL Database 2.0.

The test driver consisted of a single Yahoo! Cloud Serving Benchmark (YCSB) client. The YCSB software was modified to use a larger key space for better distribution of keys when scaling up to large data sets. Tests were conducted on an Oracle system that was not tuned for flash. There were three sets of tests:

  1. Pure insert: Insert 100 million records, with an average key size of 13 bytes and an average value size of 1108 bytes.
  2. 50/50 R/W: Ten million operations consisting of a 50% read and 50% update mix, using the 100 million record store created by the insert test.
  3. 95/5 R/W: Ten million operations consisting of a 95% read and 5% update mix, again using the 100 million record store created by the insert test.

The above tests were run using both the SAS hard disk and ioDrive2. Throughput and latency were measured by the YCSB client during these tests and are summarized in the tables presented below.
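Two aspects of this setup are easy to check with a short sketch: how large the data set is relative to each node's memory, and what a read/update mix looks like as a stream of operations. The record count, key and value sizes, and the 72GB of DRAM come from the test description; treating 72GB as decimal gigabytes, ignoring storage overhead, and the simple random generator (which is not YCSB's own) are assumptions.

```python
import random
from collections import Counter

RECORDS = 100_000_000
AVG_KEY_BYTES = 13
AVG_VALUE_BYTES = 1108
DRAM_BYTES_PER_NODE = 72 * 10**9  # 72GB, assumed to mean decimal gigabytes


def raw_data_bytes(records: int = RECORDS,
                   key_bytes: int = AVG_KEY_BYTES,
                   value_bytes: int = AVG_VALUE_BYTES) -> int:
    # Raw payload only; excludes any storage or index overhead.
    return records * (key_bytes + value_bytes)


def operation_mix(total_ops: int, read_fraction: float, seed: int = 0) -> Counter:
    # Draw each operation independently: read with probability read_fraction.
    rng = random.Random(seed)
    return Counter("read" if rng.random() < read_fraction else "update"
                   for _ in range(total_ops))


print(raw_data_bytes() / 10**9)                # 112.1 (GB of raw keys and values)
print(raw_data_bytes() > DRAM_BYTES_PER_NODE)  # True
```

At roughly 112GB of raw data against 72GB of DRAM per node, the full data set cannot be cached in memory, so a share of reads has to be served from the storage device under test.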

Test Results

 

 

Metric                          300GB SAS Disk   ioDrive2   Improvement
Throughput (operations/sec)     23,308           24,150     3.60%
Average insert latency (msec)   5.07             4.96       2.20%
Average read latency (msec)     N/A              N/A        N/A

Table 1: Pure insert test—insert 100 million 1108 byte records

 

 

Metric                          300GB SAS Disk   ioDrive2   Improvement
Throughput (operations/sec)     3,342            33,693     908%
Average insert latency (msec)   36.88            6.42       574%
Average read latency (msec)     35.6             0.61       5836%

Table 2: 50/50 read/update mix. 100 million 1108 byte records in the database

 

 

Metric                          300GB SAS Disk   ioDrive2   Improvement
Throughput (operations/sec)     3,583            106,616    2975%
Average insert latency (msec)   34.57            4.79       721%
Average read latency (msec)     33.16            0.91       3643%

Table 3: 95/5 read/update mix. 100 million 1108 byte records in the database
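For readers who prefer plain speedup factors to percentages, the figures in Tables 1 to 3 can be recomputed as simple ratios. The sketch below uses only the measured values from the tables; the dictionary layout and the `speedup` helper are illustrative names introduced here.

```python
# (SAS disk, ioDrive2) pairs copied from Tables 1 to 3.
RESULTS = {
    "pure insert": {"throughput": (23308, 24150),
                    "insert_latency": (5.07, 4.96)},
    "50/50": {"throughput": (3342, 33693),
              "insert_latency": (36.88, 6.42),
              "read_latency": (35.6, 0.61)},
    "95/5": {"throughput": (3583, 106616),
             "insert_latency": (34.57, 4.79),
             "read_latency": (33.16, 0.91)},
}


def speedup(metric: str, sas: float, flash: float) -> float:
    # Throughput: higher is better. Latency: lower is better.
    return flash / sas if metric == "throughput" else sas / flash


for test, metrics in RESULTS.items():
    for metric, (sas, flash) in metrics.items():
        print(f"{test:12s} {metric:15s} {speedup(metric, sas, flash):6.2f}x")
```

Expressed this way, ioDrive2 throughput is about 10 times that of the SAS disk on the 50/50 mix and just under 30 times on the 95/5 mix, while the pure insert test shows a ratio close to 1.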

Interpreting the Results 

For the pure insert scenario, the performance of disk and ioDrive2 is similar. This similarity is not surprising, since the underlying log-structured storage architecture for Oracle NoSQL Database is optimized for write operations on hard disks. However, we see a dramatic difference in the read/update mix tests. Read operations require random I/Os (seeks) on conventional disks; consequently, both throughput and latency suffer. With ioDrive2, by contrast, the cost of random I/O and sequential I/O is almost identical, so every I/O operation in a sequence is equally fast. In the 95/5 test, ioDrive2 throughput is nearly 30 times (nearly 3,000%) that of the disk. Notice that the throughput advantage of ioDrive2 grows as the ratio of reads to writes increases. This happens because the benefits of log-structured storage have less of an impact when the relative proportion of writes to reads is small.

Oracle NoSQL Database and ioDrive2—A Winning Combination

From these performance tests, it is clear that ioDrive2 provides dramatic improvements in performance for interactive big data applications. Disk drives simply cannot achieve the number of IOPS that an ioDrive2 can. The superior performance of Oracle NoSQL Database using ioDrive2 is essential for many mission-critical applications such as e-retail, online advertising, home health care monitoring, financial services, and security and surveillance. Though the capital cost of flash storage-based technology is higher, a system using disk-based storage that delivers comparable performance will need a large number of disk spindles to deliver the required throughput, and may not be able to deliver the required latency at all. Further, the operational costs of flash-based technology, including the amount of hardware required, power consumption, and cooling, are much lower than those of comparable disk-based solutions. Finally, there are intangible benefits of deploying a very high performance, low-latency, and reliable NoSQL application, including customer and user loyalty and trust, and competitive advantage.

Oracle NoSQL Database with Fusion ioDrive2 provides an enterprise-grade, highly reliable, highly scalable, high performance, and low-latency solution for the most demanding big data applications today.