Once upon a time, storage was storage and analytics lived somewhere else – far removed from the storage universe. But the world has changed and the advent of big data has brought the two close together, as shown by the Veritas 2017 Data Genomics Survey.
By analyzing 31 billion files globally, it found that data repositories have grown by nearly 50 percent annually. This is largely driven by the proliferation of new apps and emerging technologies such as Artificial Intelligence (AI) and the Internet of Things (IoT) that leverage massive data sets. If data continues to grow at this rate and companies do not find more efficient ways to store and manage their data, organizations worldwide may soon be confronted with storage expenses topping billions of dollars.
Hence many organizations are looking to big data and analytics applications to lessen their woes. Storage vendors are also arriving at the party. Instead of bolting on big data and analytics tools from other vendors, some are building such functionality into their storage platforms. There are many examples to choose from. Here is a sampling.
MapR has developed a converged data platform which is used by the likes of SAP and Ericsson. It is the basis for creating a global data fabric on which data, analytics and operational apps are deployed. Known as MapR-XD, it is said to scale to trillions of files and withstand peak activity scenarios. Built on top of MapR-XD are add-on products that focus on specific areas of big data. The platform supports Hadoop and includes a NoSQL database management system and a query engine for SQL analytics at exabyte scale.
Data Dynamics StorageX 8.0 introduces integrated file system analytics and file-to-object conversion for cloud tiering. Using the integrated analytics, users analyze file systems and metadata to identify file management opportunities. These analytics feed directly into automated file movement policies so users can take action quickly.
“This enables you to visualize file storage infrastructure relationships and their correlations from namespace, file server, shares, exports, volumes, folders or S3 buckets to individual files,” said Cuong Le, Senior Vice President Field Operations, Data Dynamics. “Automated data movement policies facilitate the transfer or migration of source files to S3 object storage and traditional file storage resources.”
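The idea of file analytics feeding a movement policy can be illustrated with a minimal sketch. This is not how StorageX works internally; the `FileRecord` type, the function names and the 180-day idle threshold are all assumptions made for illustration. It scans a directory tree, collects metadata, and selects files that have not been accessed recently as candidates for tiering to object storage.

```python
import os
import time
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical metadata record for one scanned file."""
    path: str
    size_bytes: int
    last_access: float  # seconds since the epoch


def scan_tree(root):
    """Walk a directory tree and collect basic metadata for every file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)
            records.append(FileRecord(full_path, info.st_size, info.st_atime))
    return records


def select_cold_files(records, max_idle_days=180, now=None):
    """Return files whose last access is older than the idle threshold.

    The threshold is an assumed policy parameter, not a vendor default.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400
    return [r for r in records if r.last_access < cutoff]
```

A real policy engine would then hand the selected files to a mover that copies them to an S3 bucket or another file server; that step is left out here because it depends entirely on the target system.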
Most IBM storage systems include a function called Easy Tier that uses analytics to decide where to place data in a hierarchy of storage technologies with different performance and cost points. Data moves up and down the hierarchy based on the density of I/Os to that data in comparison with the overall load on the system. One family of systems implementing this technology is the IBM Storwize family. Since Storwize systems can also attach external storage systems, the technology extends to moving data among those storage systems. For example, a new all-flash Storwize system can be used to automatically move the most active data onto flash while the remaining data stays on existing disk storage systems.
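To make the notion of placement by I/O density concrete, here is a minimal sketch. It is not IBM's Easy Tier algorithm, whose internals the article does not describe; the `Extent` type, the greedy ranking and the two tier names are assumptions for illustration. Each chunk of data is ranked by I/Os per gigabyte, and the densest chunks are placed on flash until the flash capacity is used up.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    """Hypothetical unit of data that can be moved between tiers."""
    extent_id: str
    size_gb: float
    io_count: int  # I/Os observed during the measurement window

    @property
    def io_density(self):
        # I/Os per gigabyte: small, busy extents rank highest.
        return self.io_count / self.size_gb


def plan_placement(extents, flash_capacity_gb):
    """Greedily assign the highest-density extents to flash.

    Extents that do not fit in the remaining flash capacity stay on disk.
    """
    placement = {}
    remaining = flash_capacity_gb
    for extent in sorted(extents, key=lambda e: e.io_density, reverse=True):
        if extent.size_gb <= remaining:
            placement[extent.extent_id] = "flash"
            remaining -= extent.size_gb
        else:
            placement[extent.extent_id] = "disk"
    return placement
```

Re-running the plan at regular intervals with fresh I/O counts is what would make data move up and down the hierarchy as activity changes.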
Big data and analytics technologies can also be deployed in storage management software to track trends across multiple storage systems, call out when best practices are not being followed, predict events requiring action, and take appropriate action. Examples of this are IBM Spectrum Control on-premises storage management software and its SaaS management counterpart, IBM Spectrum Control Storage Insights. In both cases, these tools gather information from storage systems, network components, and server data and use analytics to make recommendations and to start to diagnose data center issues.
“Deploying these technologies allows for precise control based on detailed I/O measurements within a single storage system or several attached systems,” said Clodoaldo Barrera, Chief Technical Strategist, IBM Storage.
ClearSky Data delivers on-demand primary storage with offsite backup and DR as a single hybrid cloud service. Users pay for their data once, never have to replicate, and can access their protected data anywhere it's needed – on-prem or in the cloud. Users pay only for capacity that they use. Its all-flash edge is capable of providing high performance automatically without the need to segregate hot, warm and cold capacity and manage those as separate entities.
Dell EMC Isilon is purpose-built for storing, managing, analyzing and protecting file-based unstructured data. Typical use cases include archives, file shares, home directories, media content, High Performance Computing (HPC), video surveillance, and data analytics such as Hadoop and Splunk. The Isilon product line includes all-flash, hybrid, and archive storage systems. All are powered by the Isilon OneFS operating system and can be combined into a single cluster. OneFS includes support for the Hadoop Distributed File System (HDFS).
Dell EMC ECS is software-defined object storage that targets service providers, the surveillance market, healthcare and life sciences, big data analytics, the Internet of Things (IoT), and object storage. It provides up to 7.8 PB of capacity per rack. As well as metadata search, it comes with Hadoop analytics tools built in.
Scality RING software for storing big data uses common file system protocols, as well as the object protocol AWS S3. An emerging use case is archival storage of analytics data for cost-effective long-term storage of petabytes of big data. The AWS S3 protocol also enables access from Hadoop and similar applications via Hadoop Compatible File System (HCFS) protocols such as S3A, which are supported by Scality's S3 API. This enables Scality to provide long-term big data archival solutions for IoT cloud storage.
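As a rough illustration of how a Hadoop client is pointed at an S3-compatible store through S3A, the sketch below generates a `core-site.xml` fragment. The property names (`fs.s3a.endpoint`, `fs.s3a.access.key`, `fs.s3a.secret.key`, `fs.s3a.path.style.access`) are standard Hadoop S3A settings; the endpoint URL, credentials and helper function name are placeholders, and nothing here is specific to Scality's own documentation.

```python
import xml.etree.ElementTree as ET


def build_s3a_core_site(endpoint, access_key, secret_key, path_style=True):
    """Return a core-site.xml fragment configuring the Hadoop S3A connector.

    The endpoint and credentials are supplied by the caller; the values
    used in any example call are placeholders, not real settings.
    """
    properties = {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # On-premises S3-compatible stores often need path-style requests.
        "fs.s3a.path.style.access": str(path_style).lower(),
    }
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")
```

With such a configuration in place, Hadoop tools address the object store with `s3a://` URIs, for example `hadoop fs -ls s3a://example-bucket/` (the bucket name is hypothetical). In practice, credentials are better kept in a credential provider than in plain XML.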
Scality's Zenko Multi-Cloud Controller is aimed at smaller-capacity edge storage, such as sensors, network devices, telemetry and appliances. Zenko provides a platform for intelligent tagging of metadata, attribute-based search, and policy-based workflows that move this edge data to either public or private clouds.
Some vendors are integrating analytics directly into the structure of the file system. A good example is Qumulo, which incorporates metadata into the file system to enhance search and indexing operations.
Rather than build its own analytics, Nasuni takes advantage of existing analytics and big data platforms. Its global file system lives within cloud object storage but presents itself to end users and applications via physical or virtual edge appliances that only cache active data. This approach makes it easy for customers to present all or part of their file systems to the analytics platform or service of their choice.
For example, an organization may want to analyze petabytes of image data. It can deploy Amazon Rekognition, the Azure Face API or IBM Watson as it sees fit, and all will integrate with Nasuni. It can test the image recognition capabilities of each by spinning up Nasuni virtual edge appliances in the compute layer of each provider and configuring Nasuni to cache the test data and present it to each analytics platform as a local share. Nasuni automatically handles the transfer of files. It also makes massive migrations of large data sets or translations to the object store protocols of each analytics provider unnecessary.
“With Nasuni, analyzing big data is as simple as spinning up an edge appliance and pointing the analytics service to one of its file shares,” said John Capello, Vice President of Product Strategy, Nasuni. “It is powered by Nasuni UniFS, the Nasuni hybrid cloud platform.”
Reduxio offers a global search feature and point-in-time restore, and lets users set policies from the dashboard. The interface also provides instant restore of virtual machine images. Data can be recovered in seconds without the burden of scheduled snapshots or data backups.
Veritas Cloud Storage is a software-defined storage solution designed for massive amounts of unstructured data. It enables users to apply analytics, machine learning and classification technologies to stored data. Building on the Veritas 360 Data Management platform, Veritas Cloud Storage enables massive scalability: organizations can scale to petabytes, storing and managing billions of files with the ability to handle a quintillion objects.
Enterprises use its analytics capabilities to fuel their business and ensure compliance. For example, with the EU’s new General Data Protection Regulation (GDPR) coming into force in May 2018, Veritas Cloud Storage can be used to scan all stored data so sensitive information is properly tagged, managed and protected.
“Using such intelligence is critical to effectively extracting value from the ever-growing volume of data in a cost-effective manner,” said Amita Potnis, an analyst at IDC.
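The scan-and-tag step described for GDPR compliance can be sketched in a few lines. This is a deliberately simple, pattern-based illustration and not the classification engine Veritas ships; the two regular expressions, the tag names and the function names are assumptions, and real classifiers use far richer rules.

```python
import re

# Illustrative patterns only; production classifiers are much more thorough.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def classify_text(text):
    """Return the sorted list of sensitive-data tags found in the text."""
    return sorted(tag for tag, pattern in PATTERNS.items() if pattern.search(text))


def tag_documents(documents):
    """Map each document name to the tags detected in its contents.

    `documents` is a dict of name -> text, standing in for stored objects.
    """
    return {name: classify_text(body) for name, body in documents.items()}
```

The resulting tags could then drive retention, access or deletion policies, which is where the management and protection steps mentioned above would come in.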
DataDirect Networks (DDN) has gone the route of combining its Storage Fusion Architecture (SFA) with analytics platforms such as SAS. SFA is designed to deliver peak performance by combining high-end processors, buses and memory with an optimized RAID engine and DDN data management algorithms. The company claims that users report a 400 to 500 percent performance improvement in their SAS, Informatica, and other key workflows.
Cray is well established as a provider of supercomputing systems and high-performance computing (HPC). It has also recently picked up Seagate’s ClusterStor HPC array business (formerly Xyratex). These arrays use the Lustre parallel file system. They are deployed in artificial intelligence (AI), machine learning and deep learning applications, where datasets continue to grow, and are used in conjunction with various Cray supercomputing applications such as its Urika analytics platform.
Cloudera Data Science Workbench aims to handle a problem common to analytics and data science applications: the underlying storage struggles to scale sufficiently to deal with the volume of data that needs to be analyzed. Cloudera Data Science Workbench is said to enable fast, easy, and secure self-service data science for the enterprise.
Hedvig adds intelligence and data portability to the task of moving apps across clouds. The Hedvig Distributed Storage Platform provides multi-workload, multi-cloud and multi-tier capabilities. Its programmable data management tier enables data to span hybrid- and multi-cloud architectures. Recent enhancements deal with data locality, availability, and replication features across any public cloud.
NetApp StorageGRID is an object storage platform for storing and managing unstructured data at scale. It enables more fluid and secure data movement for cloud applications built for AWS S3, and it includes policy-based data placement. When it is used with NetApp’s Cloud Sync Service, users can take advantage of cloud analytics services, such as Amazon Elastic MapReduce (EMR), by converting and copying data into a cloud-native format.
Quantum’s StorNext platform now combines a scalable, high-performance file system with data management, policy setting, intelligent tiering and workflow analysis. It addresses use cases such as genomics research, video post-production and video surveillance that put strain on storage infrastructure. Those doing complex analysis in these fields need high-speed access to data, and Quantum StorNext provides it.
Tintri offers predictive analytics and machine learning for the enterprise cloud. Its cloud-based predictive analytics, powered by Elasticsearch, are said to be able to crunch numbers from 480,000 VMs in less than a second. This enables users to model storage and compute needs based on application behavior. Tintri Analytics is a cloud-based SaaS offering. It can dig into up to three years or more of real-time VM- and container-level data, including historical and current performance and resource usage statistics on every application. Administrators can formulate various what-if scenarios to assess the impact of changes before they are implemented.
The products covered above are no more than a sampling. Every week, more storage vendors introduce the latest analytics and big data bells and whistles for their traditional storage offerings. It’s quite possible that analytics will come standard with storage offerings in the very near future.