From Accumulo to Cassandra to Mongo, Flash Makes Big Data More Efficient

Posted: 07 Nov 2013   By: Matt Kennedy

In big data, flash isn't just about raw performance: it's about efficiency. Good performance is an emergent property of an efficient system. An architecture built solely on DRAM for performance and spinning disk for storage will suffer from the inefficiencies of constantly migrating data between the two. Big data platforms built on Fusion-io flash can replace spinning-disk storage entirely and reduce DRAM footprints, or use flash alongside disk as a cache layer. Either strategy can improve efficiency, drive consolidation, and reduce operational costs.

We have worked with customers in several big data environments, from Accumulo and Cassandra to MongoDB, HBase, and Greenplum, to increase performance density and decrease DRAM footprints.

Making Sqrrl Accumulo Fly

In Fusion-io’s Federal business, we run into many big data applications. NoSQL databases are very popular, and Accumulo is gaining traction in many agencies. To learn how we could help, we teamed up with Sqrrl, the experts in Accumulo architecture, applications, and security.

Many NoSQL databases have relied heavily on DRAM for performance, and Accumulo is no different in this regard. The challenge is that DRAM capacities are not dense enough for the active datasets in most NoSQL deployments. The denser the DRAM module, the higher the price, and at the high end of the spectrum the price curve turns into a sharp hockey stick where 32GB modules come into play. For this reason, it is not uncommon to see engineers specifying servers with 8GB and 16GB modules, which still restricts most servers to 128 or 256 gigabytes of DRAM (for example, sixteen DIMM slots populated with 8GB or 16GB modules).

An attractive feature of Accumulo is its scalability: it's easy to add more servers to grow the cluster's DRAM footprint. But is that cost effective? And what happens when your active dataset grows beyond the 256 gigabytes of DRAM in each system? The servers may carry 12 terabytes of hard disk capacity, but a single trip to the disk drives costs the CPU several milliseconds, which can dramatically impact the latency of a transaction.

In our recent white paper about Accumulo, we explore another way to scale while maintaining predictable performance. We worked with Sqrrl to tune Accumulo to take advantage of ioMemory. Our testing showed how Fusion ioMemory lets customers put terabytes of flash memory in place of standard disk drives and remove the need to overprovision DRAM as a cache. We also found that moving to ioMemory for primary storage brings a significant performance gain, up to 10x, for most workloads. The flip side of performance is consolidation: with ioMemory, instead of requiring racks of Accumulo tablet servers, you can get the same performance with only a few. Consider the difference in footprint when your application requires 50,000 transactions per second.
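As a rough illustration of what that tuning direction looks like, the sketch below uses the Accumulo Java client to dial back DRAM-resident caches once tablets live on flash. It is a minimal sketch, not the configuration from the white paper: the instance name, ZooKeeper address, credentials, and property values are all assumptions, and some of these properties may instead need to be set in accumulo-site.xml and require a tablet server restart.

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class TuneAccumuloForFlash {
    public static void main(String[] args) throws Exception {
        // Connect to a hypothetical Accumulo instance (names and credentials are illustrative).
        Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181")
                .getConnector("root", new PasswordToken("secret"));

        // With ioMemory as primary storage, cache misses are still fast, so the
        // DRAM-resident block and index caches can stay modest instead of being
        // overprovisioned to hide disk latency. Values shown are placeholders.
        conn.instanceOperations().setProperty("tserver.cache.data.size", "256M");
        conn.instanceOperations().setProperty("tserver.cache.index.size", "128M");

        // A smaller in-memory map also shortens minor-compaction pauses when the
        // flushed files land on flash rather than spinning disk.
        conn.instanceOperations().setProperty("tserver.memory.maps.max", "2G");
    }
}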

Download “Fusion-io and Sqrrl Make Accumulo Hypersonic” for the details.

Add Flash in Lieu of DRAM for NoSQL Variants Cassandra and MongoDB

Cassandra, the leading NoSQL database for big data workloads, thrives on flash. In our testing, Cassandra running on ioMemory handled larger, more random workloads with fewer cluster nodes than is possible with any other persistent medium.

Fusion ioMemory increases performance density in Cassandra nodes. This allows Cassandra architects to design clusters with less dependence on DRAM, while reducing cluster sprawl. Cassandra’s flexible configuration options make it easy for architects to design systems that use flash for performance-critical column families and disk capacity for archival column families.
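Where the data actually lands is a node-level decision (for example, pointing data_file_directories in cassandra.yaml at an ioMemory mount, or symlinking an archival keyspace's data directory onto disk). At the schema level, a flash-backed table can also shed the DRAM row cache, since reads off ioMemory are already fast. The sketch below is illustrative only, assuming the DataStax Java driver and Cassandra 2.0-era CQL; the contact point, keyspace, and table are made up.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraFlashSchema {
    public static void main(String[] args) {
        // Contact point is a placeholder for a node whose data directories sit on ioMemory.
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.10").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS metrics "
                + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");

        // Key cache only: with flash serving row reads directly, a large DRAM row
        // cache buys little, which keeps the heap and the node count down.
        session.execute("CREATE TABLE IF NOT EXISTS metrics.events ("
                + "sensor_id text, ts timestamp, value double, "
                + "PRIMARY KEY (sensor_id, ts)) "
                + "WITH caching = 'keys_only'");

        cluster.close();
    }
}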

Spotify serves millions of active users and ensures they have the most responsive experience possible by running its Cassandra clusters on ioMemory. Read "Accelerate Cassandra Without the Cluster Crawl" to get the details.

We have found that ioMemory enables MongoDB systems to scale out to more nodes more efficiently. ioMemory presents NoSQL databases like MongoDB with a new tier of persistent memory that meets their performance requirements without forcing them to scale out or pay premium prices for DRAM. By using ioMemory as the primary storage for a MongoDB database, requests are served directly from a persistent memory tier, providing an 11X–18X performance improvement over the same classes of workloads served from disk. That means read latencies as low as 2ms across the entire database, not just the working set in the memory map.
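The sketch below shows the shape of the measurement behind numbers like those: a loop of point reads timed individually, so that documents outside the memory-mapped working set are included. It assumes the 2.x-era MongoDB Java driver and a pre-loaded collection; the host, database, collection, and _id values are illustrative.

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class MongoReadLatency {
    public static void main(String[] args) throws Exception {
        // Placeholder host for a mongod whose dbpath sits on an ioMemory device.
        MongoClient client = new MongoClient("10.0.0.20");
        DBCollection events = client.getDB("telemetry").getCollection("events");

        // Time a handful of point lookups by _id; with flash behind the memory map,
        // reads that miss RAM should still come back in the low milliseconds.
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            events.findOne(new BasicDBObject("_id", i));
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println("read " + i + ": " + micros + " us");
        }
        client.close();
    }
}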

Check out "More Flash, Less RAM, Faster Mongo," "MongIOPS: Your Favorite Datastore, Only Faster," or "Fusion Power for MongoDB" to learn more.

Efficient Cluster Design for HBase

Apache HBase, the scale-out NoSQL database that runs on Hadoop, is typically deployed on clusters of commodity hardware, which can present design challenges and operational headaches. Fusion’s ioMemory platform offers an alternative to the conventional HBase cluster architecture. As a persistent memory tier, ioMemory eliminates the performance deficiency of the database’s persistent storage. This allows HBase architects to more effectively design clusters based on application requirements other than just DRAM capacity.

In our white paper, "Using HBase with ioMemory," we explain ways to improve HBase performance, including moving the database's working set to ioMemory to eliminate the dependence on more expensive, less efficient DRAM, and using ioMemory to keep Java VM garbage collection from degrading performance and cluster stability.
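One concrete way to pursue both ideas, offered here as an illustration rather than as the paper's configuration, is HBase's file-backed BucketCache: a large second-level block cache kept on an ioMemory device and off the JVM heap, so garbage collection has far less to manage. In practice these properties go in hbase-site.xml on every RegionServer; the sketch below just shows them programmatically, and the path and sizes are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HBaseFlashCacheConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Keep the on-heap LRU block cache small so the garbage collector has less to do...
        conf.set("hfile.block.cache.size", "0.15");

        // ...and put a much larger second-level BucketCache on a file backed by ioMemory.
        conf.set("hbase.bucketcache.ioengine", "file:/mnt/iomemory/hbase-bucketcache");
        conf.set("hbase.bucketcache.size", "65536"); // capacity in MB; placeholder value

        // Dump the resulting settings; in a real cluster they belong in hbase-site.xml.
        conf.writeXml(System.out);
    }
}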

Greenplum on Flash

Before big data was trendy, engineers used databases for analytics. Pivotal's Greenplum looks and feels like a database, but under the hood it functions much differently. Greenplum doesn't require much in the way of CPU horsepower. What it really needs is fast storage.

Hard disks and standard SSDs, however, are not fast enough. We performed internal testing of Greenplum on ioMemory and found that we could achieve scan rates of more than 28 GB/s with just four segment servers. To get that kind of performance from hard drives, you would need an entire rack of hardware. Imagine what you could accomplish if your Greenplum servers were four times faster, and one server could do the work of four. Find out how in our white paper, "Achieve Fast and Linearly Scalable Greenplum Performance with ioMemory."
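Because Greenplum speaks the PostgreSQL wire protocol, a scan-bound workload is easy to drive from any client; the sketch below times a full-table aggregate over JDBC, the kind of query where storage bandwidth, not CPU, sets the pace. The host, database, credentials, and table name are illustrative, and the query is not the benchmark from our testing.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class GreenplumScanTiming {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for a Greenplum master host.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://gp-master:5432/analytics", "gpadmin", "secret");

        long start = System.nanoTime();
        long rows = 0;
        // count(*) forces the segments to scan the table, so elapsed time tracks scan rate.
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT count(*) FROM clickstream")) {
            if (rs.next()) {
                rows = rs.getLong(1);
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("scanned %d rows in %.1f seconds%n", rows, seconds);
        conn.close();
    }
}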

From Accumulo to Cassandra, MongoDB, HBase and Greenplum, Fusion-io can help you accelerate your big data applications. See our big data solutions page to learn more.




Matt Kennedy

Big Data Solutions Architect