Brother, Can You Spare a Terabyte?

Thursday Sep 28th 2006 by Jennifer Schiff

The San Diego Supercomputer Center (SDSC) is making available more than 400 TB of disk space and even more archival tape space for academic and scientific data in search of a good home. With 18 PB of archival storage, it just didn't seem nice not to share (for a modest fee, of course).

SDSC was founded in 1985 with a $170 million grant from the National Science Foundation. Since then, SDSC has served the supercomputing needs of more than 10,000 researchers at 300 academic, government and industrial institutions in the United States and around the world.

Today, operating out of the University of California, San Diego, SDSC continues to be a strategic resource to science, industry, and academia and provides a vast array of tools and expertise to the communities it serves.

Among the new services SDSC is offering the academic, scientific and digital preservation communities is the ability to store terabytes of data — experiments, collections, and more — at SDSC's Data Central storage repository.

Long-Term Storage

Richard Moore, SDSC's division director for Production Systems, notes that SDSC "has been one of the primary NSF-funded supercomputer centers for 20 years. Over our lifetime, we've always had a focus on data-intensive applications, applications that require a lot of memory or require large IO as part of their code. As a result, we built a pretty powerful data infrastructure with a large amount of disk and a large amount of archival systems. The primary reason we built those systems was to support the users of our supercomputers, to support their storage needs."

But soon Moore and his team saw a new need emerging, a need for long-term reliable storage of data collections, collections that might require 10 or 20 or 40 terabytes of archival storage space.

"We saw that we had an infrastructure that could be adapted to fill what we viewed as an emerging need to store particularly large-scale, long-term digital collections," says Moore. And the thought of being able to host and store digital collections was an exciting one to Moore and his SDSC colleagues. "It's a different kind of service than we were originally chartered for by the NSF, but it's leveraging a lot of the expertise and infrastructure that we already have."

And the community was asking for help, adds Moore. "We are already working with the Library of Congress, the California Digital Library and the National Archives and Records Administration," he reports. "And these organizations have a very critical need for long-term preservation. We want to support them, and we see long-term preservation storage as an important area."

Indeed, SDSC has been doubling its volume of stored archival data every 14 months, as the need to store data long term has increased. As a result, SDSC has more than tripled its archival storage capacity, from six petabytes to more than 18 petabytes, with about five times more bandwidth.
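To put the 14-month doubling figure in more familiar terms, here is a rough sketch that converts it into a yearly growth factor and a time to triple. It assumes smooth exponential growth, which the article does not state; only the doubling period comes from the text, and the function names are illustrative.

```python
import math

DOUBLING_MONTHS = 14  # doubling period quoted in the article


def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Multiplicative growth over `months`, assuming smooth exponential growth."""
    return 2 ** (months / doubling_months)


def months_to_multiply(factor, doubling_months=DOUBLING_MONTHS):
    """Months needed for the stored volume to grow by `factor` (same assumption)."""
    return doubling_months * math.log2(factor)


if __name__ == "__main__":
    print(f"Growth per year: x{growth_factor(12):.2f}")
    print(f"Months to triple: {months_to_multiply(3):.1f}")
```

Under that assumption, the stored volume grows by roughly 1.8 times per year and would take about 22 months to triple. This describes the volume of stored data, not the installed capacity figures quoted above.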

Data Central

SDSC's newest storage initiative is Data Central and the Data Allocations program. SDSC's total storage capacity is 2 PB of raw disk plus 18 PB of archival tape. Of that, SDSC has reserved 400 TB of disk space and a significant fraction of its archival tape storage for members of the U.S. academic research community wishing to participate in the Data Allocations program.

And why should institutions consider storing their valuable data collections at SDSC?

"In terms of some of the advantages, we have a large production-level infrastructure that a group can leverage," says Moore. "We have a 24/7 staff that monitors the system, people that are on call. So there's this whole infrastructure that's in place that can be leveraged. That's one big advantage. We're also at scale to be able to host large collections — and do it over a long period of time."

Another key advantage is that SDSC can easily handle data migration.

"When individuals need to store data now, they have to go and buy equipment to store that data," says Moore. "And then they have to hire somebody to administer that equipment. And they have to find a machine room where they can put all this stuff. And they have to get redundant power into that machine room, and cooling. And then three or four years from now, they're going to have to figure out how to migrate to the next generation."

Or they can just send their data to SDSC's Data Central, where an expert staff will maintain and migrate data as needed, 24/7, in a safe, reliable storage environment — where they can still have 24/7 access to it.

Storing a TB at SDSC

"It's one thing to call up your local vendor and say, how much does a terabyte of disk space cost? Or how much would some sort of archival system cost?" says Moore. "It's another to face some of the longer-term and associated costs that aren't so obvious," what consultants and solution providers call the total cost of ownership (TCO).

"One of the things that we're doing in our cost structure is looking at annualized costs," explains Moore. "We're not going to say you can store a terabyte here for X dollars. We're going to say you can store a terabyte here for a year for X dollars. And the reason for looking at annualized costs is that we intend for this to be a sustainable effort, and there are ongoing costs, including media migration," that need to be factored in.

Data Central Data/Storage Resources

With a capacity of more than 20 petabytes of tape and disk storage, SDSC offers a wide variety of storage resources specially designed for high-performance users.

Disk Resources
Hardware: SATA and SAN/Fibre Channel disk

Capacity: 400 TB (available for allocations)

Software: SRB, GridFTP, a variety of RDBMS, SSH, HTTP

Tape Resources
Capacity: 18 PB (total)

Hardware: 6 STK Powderhorn silos, 64 IBM 3592B tape drives, and two IBM P690 nodes

Software: SRB, GridFTP, HSI

Archival System
The centralized, long-term data storage system at SDSC is the High Performance Storage System (HPSS). SDSC manages one of the world's largest production installations of HPSS, which has the capacity to store 18 PB of data on archival tape. HPSS transparently uses an associated 100 TB disk cache to accelerate read and write operations.

SRB (Storage Resource Broker) is data management software developed at SDSC. It provides easy access to SDSC's disk and tape resources and presents them as a single file hierarchy. SRB can also be used for remote data management and access.

"In our field, media migrations every few years are not unusual at all, whether that's tape or disk," says Moore. And there are many elements involved, which can add up over a year or many years.

"It takes labor to run these systems," he says. "It takes servers to drive the disks and allow people to access [their data]. It takes networks. It takes a machine room. It takes utilities. It takes maintenance on all those systems. It takes media. So there are a number of elements that are rolled into our costs.

"Right now the best available cost estimate, when we look at our total cost of ownership for storing on disk, which is accessible all the time, is about $1,500 per terabyte per year," says Moore. The cost for archival tape is considerably less, about $500 per terabyte per year, with retrieval time (or latency) in minutes.

Still, compared with the cost of buying arrays and paying personnel to manage them, $500 to $1,500 per terabyte per year seems cheap; SDSC charges only enough to cover its costs. And Moore anticipates that the annual cost of both disk and tape will drop substantially as the density of media increases. "I'm very hopeful that these fixed costs that constitute the rest of the total cost of ownership will also scale [down]," he says.
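As a back-of-the-envelope illustration of the annualized pricing, the sketch below applies the two quoted rates to the collection sizes mentioned earlier (10, 20 and 40 TB). It assumes the rates stay flat for the whole period, which is a simplification, since Moore expects them to fall; the function and constant names are illustrative.

```python
DISK_USD_PER_TB_YEAR = 1500  # always-accessible disk, rate quoted in the article
TAPE_USD_PER_TB_YEAR = 500   # archival tape, retrieval latency in minutes


def storage_cost(terabytes, years, usd_per_tb_year):
    """Total cost in dollars, assuming a flat per-terabyte-per-year rate."""
    return terabytes * years * usd_per_tb_year


if __name__ == "__main__":
    for tb in (10, 20, 40):  # collection sizes mentioned in the article
        disk = storage_cost(tb, 1, DISK_USD_PER_TB_YEAR)
        tape = storage_cost(tb, 1, TAPE_USD_PER_TB_YEAR)
        print(f"{tb} TB for one year: disk ${disk:,}, tape ${tape:,}")
```

At these rates a 40 TB collection would cost $60,000 a year on disk or $20,000 a year on tape.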

Shared Storage

SDSC is already storing data and hosting digital collections for a number of academic and institutional customers and is in talks with many more about helping them with their long-term storage needs.

This past summer, SDSC signed a Memorandum of Understanding with the National Center for Atmospheric Research to make available 100 TB of archival storage at each facility for replication of each other's data. And that storage space is scheduled to increase each year by 50 TB, reaching 300 TB by 2010. By having data stored offsite in a safe location, both institutions are helping to ensure the preservation of their digital assets for future generations.
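The replication schedule can be tabulated with a short sketch. It assumes the 100 TB starting allocation applies to 2006 (the year of this article) and that there is exactly one 50 TB increase per calendar year; the article states neither detail explicitly, and the function name is illustrative.

```python
def replication_schedule(start_year=2006, start_tb=100, step_tb=50, end_year=2010):
    """Replicated storage per facility by year (start year is an assumption)."""
    return {
        year: start_tb + step_tb * (year - start_year)
        for year in range(start_year, end_year + 1)
    }


if __name__ == "__main__":
    for year, tb in replication_schedule().items():
        print(f"{year}: {tb} TB")
```

Under those assumptions the schedule reaches 300 TB in 2010, which matches the figure in the agreement.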

Whether SDSC will have enough storage to satisfy its institutional customers' needs — or customers willing to pay the price — doesn't concern Moore. "We are not a commercial venture," he states. "Our focus is on large-scale collections and nonprofit university scientific researchers as well as the digital preservation community."

As SDSC opens up its Data Allocations program to the greater academic and digital preservation communities this fall, Moore and his team will find out how great the need for outsourced or shared storage is, and what price institutions are willing to pay to store their data offsite long term. That could also provide useful information for storage service providers. Stay tuned...

