The advent of technologies like virtualization and cloud computing, as well as advancements in IT hardware, has changed the way we manage data in our organizations. Backup and archive once served fairly similar purposes, but as time goes by, they diverge more and more.
In the simplest terms, both backup and archive are ways to put data someplace so you can find it later if you need to. But the devil is always in the details, and the biggest change over the years when it comes to backup vs. archive has occurred in the “if you need to” part.
Today the primary purpose of backups and replicas is high availability and disaster recovery (HA/DR). In an HA/DR scenario, there’s been a failure, you’ve lost access to data, and you must get up and running as soon as possible. Modern environments allow this and even allow switching over to a mirrored data set kept on premises, something impossible or at least prohibitively expensive a few decades ago.
Thanks to more reliable servers, storage, and networking, backup today is largely measured in terms of Recovery Point Objectives (RPOs) and Recovery Time Objectives (RTOs): how much recent data you can afford to lose, and how long you can afford to be down. Setting your RPOs and RTOs generally depends on the work you do, how frequently data changes, how much downtime can be tolerated, and related considerations. In high-volume enterprises the RPO may be “seconds ago” and the RTO may be “immediately.” In a typical mid-sized business where people work on some office files, do some data entry, and send some emails, the RPO may be close of business yesterday and the RTO a few hours.
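To make the two objectives concrete, here is a minimal Python sketch of how a single failure could be checked against them. The `RecoveryObjectives` class, the `meets_objectives` function, and the sample dates are all hypothetical names and values chosen for illustration; they do not come from any particular backup product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable age of the newest recoverable copy
    rto: timedelta  # maximum tolerable time to restore service


def meets_objectives(objectives, last_backup, failure_time, restore_duration):
    """Return (rpo_ok, rto_ok) for one hypothetical failure."""
    # Everything written after the last backup is lost when the failure hits.
    data_loss_window = failure_time - last_backup
    rpo_ok = data_loss_window <= objectives.rpo
    rto_ok = restore_duration <= objectives.rto
    return rpo_ok, rto_ok


# Assumed mid-sized business: RPO of one day, RTO of four hours.
office = RecoveryObjectives(rpo=timedelta(hours=24), rto=timedelta(hours=4))
last_backup = datetime(2024, 3, 4, 18, 0)  # close of business the day before
failure = datetime(2024, 3, 5, 11, 30)     # failure late the next morning
result = meets_objectives(office, last_backup, failure, timedelta(hours=3))
print(result)  # (True, True)
```

In this example 17.5 hours of work are at risk, which fits a 24-hour RPO, and a three-hour restore fits a four-hour RTO. A "seconds ago" RPO would fail the same check, which is why such environments lean on replication rather than nightly backups.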
Even in the case of major data corruption or an attack requiring you to roll back to a point in time before the evil event occurred, you truly need only 30 to at most 90 days of backups. This is a major shift compared to what we did in the past: back up everything all the time and keep it forever, or until we ran out of space.
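A rolling retention window like this is simple to express in code. The sketch below is only an illustration under assumed inputs: `split_by_retention` is a hypothetical helper, and the weekly schedule and dates are made up.

```python
from datetime import date, timedelta


def split_by_retention(backup_dates, today, retention_days=90):
    """Split backup dates into (keep, expire) using a rolling window."""
    cutoff = today - timedelta(days=retention_days)
    keep = sorted(d for d in backup_dates if d >= cutoff)
    expire = sorted(d for d in backup_dates if d < cutoff)
    return keep, expire


today = date(2024, 6, 30)
# Hypothetical weekly backups going back 20 weeks.
backups = [today - timedelta(weeks=n) for n in range(20)]
keep, expire = split_by_retention(backups, today, retention_days=90)
print(len(keep), len(expire))  # 13 7
```

Backups up to 84 days old fall inside the 90-day window and are kept; the seven older ones are candidates for expiry.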
The need for archive has also changed, and in a completely different direction. If the use case for backup is now HA/DR, the use case for archive is data governance and management over the long haul: the marathon versus the sprint. This is quite different from the past, when archiving was likely done to ease space on primary systems, thereby reining in the cost of maintaining all that old data in perpetuity. Not only have those capacity concerns largely gone by the wayside with the scale and affordability of offsite cloud storage, but we now also have additional reasons for retaining inactive and historical data.
Often the first reason is regulatory and/or legal. We are under new obligations to keep data, guard it, and sometimes produce or delete it. It may be a government mandate such as the European Union’s GDPR, or it may be litigation that requires an e-discovery process, but accountable, verifiable data governance is one of the most important tasks an archive fulfills.
Another relatively new use case for old data is business intelligence and strategic analysis. Actively deriving insight from data volumes helps an organization make more informed decisions and predictions, stay ahead in its market, and benefit its customers.
Once again, you-know-who is in the details, and today’s archive is only worthwhile if we can use it. To use it we need to be able to search, sort, classify, tag, find, access, read, fold, spindle, and mutilate the data we put in it. This is again much different from a backup, which looks like a big blob of data in a proprietary format that can’t be accessed by anything besides the application used to create the blob. The blob exists if you need it, it’s secure, it’s complete, and you don’t necessarily need to know anything about it beyond whether it will fulfill the HA/DR objectives.
The archive objectives, on the other hand, have grown far more complex: ensuring the constant accessibility and usability of archived data, forever. It’s not a vault for dead files anymore; what we want is an active archive. Since the very purpose of archiving has changed, archive solutions have largely been reinvented.
Fine-grained lifecycle management includes, for example, the ability to permanently delete data the moment it’s no longer needed or the moment we’re told to do so. It includes the ability to verify it has not been tampered with or accessed inappropriately. It includes the ability to locate and retrieve all data that contains a certain word or phrase, or was created on a certain date by a certain person or department. It includes the ability for an organization to set its own policies dictating which files are to be archived and when, and what’s to become of them.
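A toy in-memory model can show how these capabilities fit together. Everything in the Python sketch below is an assumption made for illustration: the `ArchivedItem` record, its fields, and the three helper functions are hypothetical, and a real archive would add access auditing, secure erasure, and an index rather than scanning every item.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass
class ArchivedItem:
    name: str
    author: str
    department: str
    created: date
    delete_after: date  # disposal date set by the organization's policy
    content: bytes
    sha256: str = ""    # fingerprint recorded once, at ingest

    def __post_init__(self):
        if not self.sha256:
            self.sha256 = hashlib.sha256(self.content).hexdigest()


def is_untampered(item):
    """Recompute the hash and compare it with the one stored at ingest."""
    return hashlib.sha256(item.content).hexdigest() == item.sha256


def search(items, phrase=None, author=None, created=None):
    """Return items matching every criterion that was supplied."""
    needle = phrase.lower().encode() if phrase else None
    return [
        i for i in items
        if (needle is None or needle in i.content.lower())
        and (author is None or i.author == author)
        and (created is None or i.created == created)
    ]


def purge_expired(items, today):
    """Drop items whose disposal date has passed (list removal only)."""
    return [i for i in items if i.delete_after > today]


archive = [
    ArchivedItem("q3-report.txt", "dana", "finance",
                 date(2015, 10, 1), date(2025, 10, 1), b"Quarterly revenue summary"),
    ArchivedItem("memo.txt", "lee", "legal",
                 date(2012, 2, 14), date(2019, 2, 14), b"Retention memo"),
]
hits = [i.name for i in search(archive, phrase="revenue")]
archive[1].content = b"Retention memo (edited)"  # simulate tampering
tampered = [i.name for i in archive if not is_untampered(i)]
remaining = [i.name for i in purge_expired(archive, today=date(2024, 1, 1))]
print(hits, tampered, remaining)
```

The point of the sketch is that each capability depends on metadata captured at ingest: the hash makes tampering detectable, the author and date fields make targeted retrieval possible, and the disposal date lets policy drive deletion.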
While searching an archive for the file(s) you want is not a hot, new technology per se, bear in mind our IT infrastructures have changed. We are talking about millions and millions or perhaps billions of files that span local storage to remote storage to one or more cloud services. We may be talking about dozens of file types aside from the basic documents, spreadsheets, presentations, PDFs, photos, audio files, video files, CAD/CAM files, and email. We may be talking about files that are ten-plus years old, created on software we no longer use, by employees we no longer employ.
It’s the nature of archives to be messy and disorganized, which bears on search and classification. We need to make order out of chaos and extract value, intelligence, and contextual meaning from those archived files. The complicated sprawl and volume of this data mean the management tools have to be more sophisticated.
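As a very rough picture of a first classification pass, the Python sketch below tags files by extension and counts what it could not place. The `CATEGORIES` mapping, the `classify` function, and the sample paths are hypothetical; real tools would inspect content and metadata rather than trust file names.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumed, deliberately incomplete mapping from extension to category.
CATEGORIES = {
    ".docx": "document", ".pdf": "document",
    ".xlsx": "spreadsheet", ".pptx": "presentation",
    ".jpg": "image", ".wav": "audio", ".mp4": "video",
    ".dwg": "cad", ".eml": "email",
}


def classify(path):
    """Tag a file by its extension; unknown formats stay unclassified."""
    suffix = PurePosixPath(path).suffix.lower()
    return CATEGORIES.get(suffix, "unclassified")


paths = [
    "finance/2011/budget.XLSX",
    "legal/contract-final.pdf",
    "eng/bracket.dwg",
    "old/notes.wp5",            # format from software no longer in use
    "cloud/bucket-a/intro.mp4",
]
tags = {p: classify(p) for p in paths}
summary = Counter(tags.values())
print(summary["unclassified"])  # 1
```

Even this trivial pass surfaces the awkward cases, such as the legacy file nothing recognizes, which is where more sophisticated tooling earns its keep.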
This is probably the main difference between backup and archive today: in nearly every regard, backup has become quite simple, while archive has become devilishly complex.