The amount of data we want to collect, archive, and search is staggering. Metadata lets us quickly find the data files we are interested in. However, storing and searching the metadata alone has become a "big data" problem, and one important aspect of that problem is where you store the metadata.
The craze of "big data" is upon us, and the world is adjusting to what it means, how to use it, and how to build systems for it. Feeding the big data beast is a sea of sensors. For example, there are video cameras everywhere: outside stores, inside stores, at intersections, on helicopters, on dashboards, and in the cell phones people carry. There are road sensors, sensors in our cars, sensors all around wooded areas, and sensors along bridges. There are domain-specific sensors, such as those for the power grid, the oil and gas industry, hospitals, ISPs, websites, weather, oceans, the military, and on and on.
All of this data has a common thread—the need for metadata.
Metadata is simply data about data. For example, metadata can include where a sensor is located (GPS coordinates), the time period of a specific recording, the direction the sensor was pointing, the firmware of the sensor, the model of the sensor, and more.
You can also "tag" files with new metadata that is usually found by post-processing the data. In the case of a camera, these metadata tags could be time stamps at which something interesting happens (perhaps along with a note describing the event itself). Other metadata tags could be pointers to related sources of information, such as other cameras or weather data.
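To make the idea of descriptive metadata plus post-processing tags concrete, here is a minimal sketch of what one recording's metadata might look like. The class names, field names, and JSON layout are all assumptions chosen for illustration; they are not a standard format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Tag:
    """A post-processing tag (hypothetical layout)."""
    timestamp: str  # when the interesting event occurs in the recording
    note: str       # free-text description of the event
    related: list = field(default_factory=list)  # pointers to related sources


@dataclass
class RecordingMetadata:
    """Descriptive metadata for one recording (field names are assumptions)."""
    latitude: float
    longitude: float
    start_time: str
    end_time: str
    heading_degrees: float
    firmware: str
    model: str
    tags: list = field(default_factory=list)

    def add_tag(self, timestamp, note, related=()):
        """Attach a tag found by post-processing the raw data."""
        self.tags.append(Tag(timestamp, note, list(related)))

    def to_json(self):
        """Serialize the record, including nested tags, to a JSON string."""
        return json.dumps(asdict(self), sort_keys=True)
```

A record like this could be stored in a database or alongside the file itself, which is the question the rest of this article is about.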
It's obvious that the usefulness of metadata depends upon its quality. If the metadata isn't accurate, then using the associated raw data will degrade the resulting analysis, if not cause it to fail outright. Some of the metadata has to be created by people and cannot be automated, so there is always the possibility of error.
Understanding which metadata is important for a specific data file and how to make it useful for researchers is an extremely important question. It's also probably a question that has not only a technological solution but also sociological and psychological ones.
But one seemingly simple question has a huge impact on the use of metadata: where do you store the metadata?
A Place for Your Stuff
When I originally investigated the question of where to place metadata, I explored two options. The first option is to put the metadata in a central location for all data. The second option is to put the metadata with the data itself.
The first option is one that is used by many search or archive systems. The idea is fairly simple—gather metadata about a specific file and store it, typically in a database. Then you can search the database as you like, hopefully finding the files that contain the information in which you are interested (assuming that the metadata is correct—but that's another story).
One of the outputs from the search should be the location of the file(s) of interest, meaning the fully qualified name, which contains the full path to the file. Then you can copy the file(s) into some sort of working storage and have at it.
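As a rough sketch of the centralized approach, the following uses Python's built-in sqlite3 module to hold a small metadata catalog and return full paths from a search. The table name, column names, and function names are assumptions made for illustration; a production archive would use its own schema and database.

```python
import sqlite3


def create_catalog(conn):
    """Create the (hypothetical) central metadata table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS recordings (
               path TEXT PRIMARY KEY,   -- fully qualified file name
               sensor_model TEXT,
               latitude REAL,
               longitude REAL,
               start_time TEXT,         -- ISO 8601, so strings sort by time
               end_time TEXT)"""
    )


def add_recording(conn, path, sensor_model, latitude, longitude,
                  start_time, end_time):
    """Insert or update the metadata for one file."""
    conn.execute(
        "INSERT OR REPLACE INTO recordings VALUES (?, ?, ?, ?, ?, ?)",
        (path, sensor_model, latitude, longitude, start_time, end_time),
    )


def find_recordings(conn, sensor_model, after):
    """Return full paths of files from one sensor model starting at or after a time."""
    rows = conn.execute(
        "SELECT path FROM recordings "
        "WHERE sensor_model = ? AND start_time >= ? ORDER BY path",
        (sensor_model, after),
    )
    return [row[0] for row in rows]
```

The paths returned by find_recordings are what you would hand to a copy tool to stage the files into working storage.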
The dangers of centralizing the metadata mostly involve keeping the mapping between the metadata and the files correct. For example, you need a mechanism to update the centralized metadata server when the metadata for the various files is updated. Ideally, this update mechanism should be fairly fast; otherwise, the search data can be out of date. How you define "fast" is up to you based on your users and usage model.
In this update mechanism there is a buried problem. What happens if the database and the files are no longer in sync? For example, what happens if a file is moved so that its full path in the database is no longer valid?
The result is obvious: the database is no longer valid, at least concerning that file. Hopefully, the update mechanism can tell the database that the file has moved and either create new metadata for the new location or update the existing metadata to reflect it. In either case, the database stays wrong for that file until the update runs, so the length of the update window matters.
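One simple way to notice that the database and the files have drifted apart is to check periodically whether every path in the catalog still exists. The sketch below assumes the catalog can hand you a list of paths; the function name is hypothetical, and it only detects missing files, it does not work out where they moved to.

```python
import os
import tempfile


def find_stale_paths(catalog_paths):
    """Return the catalog paths that no longer point at a regular file."""
    return [path for path in catalog_paths if not os.path.isfile(path)]


if __name__ == "__main__":
    # Small demonstration: one file that exists and one that has "moved".
    with tempfile.TemporaryDirectory() as directory:
        present = os.path.join(directory, "a.dat")
        open(present, "w").close()
        moved = os.path.join(directory, "moved.dat")
        print(find_stale_paths([present, moved]))
```

How often you run a check like this is another form of the update window: the longer the interval, the longer a stale path can sit in search results.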
A third aspect that needs attention is the integrity of the database itself. You will need to protect the database with backups, copies, or something similar. Don't forget that the database is used primarily for reads, so you need to pay attention to its size and to the rate of read errors on the underlying drives. Consumer SATA drives are commonly specified at one unrecoverable read error per 10^14 bits, which works out to about 12.5TB read, so an index built from them, as some manufacturers do, will hit read errors sooner than you might expect. If you built the storage with RAID controllers, a read error forces a rebuild, and further errors during the rebuild could cause more problems.
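How quickly read errors show up depends on the drive's specified unrecoverable read error rate and on how much data you read. As a back-of-the-envelope sketch, the function below assumes a spec of one error per 10^14 bits (a figure commonly quoted for consumer SATA drives) and assumes errors are independent per bit, which is a simplification. The function name and default are illustrative.

```python
import math


def probability_of_read_error(bytes_read, bits_per_error=1e14):
    """Estimate the chance of at least one unrecoverable read error.

    Computes 1 - (1 - 1/bits_per_error) ** bits_read in a numerically
    stable way, treating each bit read as an independent trial.
    """
    bits_read = bytes_read * 8
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))


if __name__ == "__main__":
    for label, size in (("100GB", 100e9), ("1TB", 1e12), ("12.5TB", 12.5e12)):
        print(label, round(probability_of_read_error(size), 4))
```

Plugging in your own drives' specified error rate and your expected read volume gives a feel for how often the index storage will need to recover from a read error.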
The second option, storing the metadata with the data, is a very desirable approach because now you have to worry about the data integrity of just one file system and not two. If the file moves, the metadata moves with it. You can add metadata to the files at any time because it stays with the files.
Ideally, if we had good tools for copying and moving metadata with files, you could easily copy or move the data files somewhere else. For example, if you copied the file to some work storage, the metadata would need to come with it. This also means that you could update the metadata of the file and then copy it back, taking the updated metadata with it.
One way to achieve this is by using extended attributes (xattrs). Many file systems support extended attributes, and there are ways for users to add metadata to a file via xattrs and to read it back. Some file systems impose limits on extended attributes, such as the amount of data that can be stored, but others do not. Regardless, being able to store metadata along with the data is a very attractive proposition.
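As a minimal sketch of the idea, Python's os module exposes extended attributes on Linux through os.setxattr, os.getxattr, and os.listxattr. The helper names below (set_tags, get_tags) and the tag names are assumptions for illustration. The sketch stores tags in the "user." namespace, and it will raise an OSError on file systems that do not support user extended attributes.

```python
import os
import tempfile

USER_PREFIX = "user."  # Linux namespace for user-defined attributes


def set_tags(path, tags):
    """Store each key/value pair as a user.* extended attribute on path."""
    for key, value in tags.items():
        os.setxattr(path, USER_PREFIX + key, value.encode("utf-8"))


def get_tags(path):
    """Read back all user.* extended attributes on path as a dict."""
    tags = {}
    for name in os.listxattr(path):
        if name.startswith(USER_PREFIX):
            value = os.getxattr(path, name)
            tags[name[len(USER_PREFIX):]] = value.decode("utf-8")
    return tags


if __name__ == "__main__":
    try:
        with tempfile.NamedTemporaryFile(dir=".") as handle:
            set_tags(handle.name, {"sensor_model": "cam-x"})
            print(get_tags(handle.name))
    except (OSError, AttributeError) as err:
        # Not every platform or file system supports user xattrs.
        print("extended attributes not available here:", err)
```

Keep in mind that whether the attributes survive a copy or move depends on the tool you use, which is exactly why good tools for moving metadata with files matter.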