Where do backups and old-school archiving of formatted records (structured data) fall short in today’s big data world? What are the common and new use cases for an unstructured big data archive? How can we manage the scale of overwhelming unstructured data growth? And how do we wring all the value out of that data? Watch industry analyst Mike Matchett’s short video interview with Aparavi CTO Rod Christensen.
Dave Littman: Hi, Dave Littman, Truth in IT. Welcome to today’s video webcast, “Myth: Backup Is All I Need For Archiving – Busted!” sponsored by Aparavi. So, without further ado, let me pass things over to Mike Matchett. Mike, take it away.
Mike Matchett: Thanks, Dave. I’m here, Mike Matchett, with Rod Christensen from Aparavi, and we’re going to talk about archive as a use case and why backups just don’t make good archives. And part of the reason I think, going into this, that archives are important today really has to do with getting more value out of your data, doing better things with it, mining that massive big data, but there’s a lot more to the story, and with that, let me introduce Rod. Welcome, Rod.
Rod Christensen: Thanks, Mike. I appreciate it.
Mike: So, Rod, you’re the CTO of Aparavi. What is Aparavi, just in a thumbnail?
Rod: Aparavi is a data storage company. We specialize in what we call Active Archive, and Active Archive is actually a twist on standard archiving principles, but it has some additional features and actually tries to make use of your data in multiple ways.
Mike: So, I think we’ve got the right person today to explore this myth of backups not really providing the use cases of archive anymore. Before we even get started, when we say archive, what are we really talking about fundamentally? What is an archive versus anything else in the world?
Rod: Well, an archive is actually the long-term storage of data over time, and with current things like backup, when you try to use your backups as archives, essentially what you’re doing is just storing a bunch of bits. The bits go there to die, that’s the end of it, that’s their only use. With Active Archive, we try to enable you to actually use your data for multiple purposes after the data has been stored.
Mike: Some of that purpose, I think we’ll get to that, includes looking at things and getting rid of the data faster. And some of that purpose is keeping the data longer than you might keep a backup, and I’m sure we’re going to come back to talking about that. But where I want to start, Rod, is this idea of unstructured data today. I know backups tend to look at an entire system, take the whole set of bits at a low level, and put them up as a complete snapshot, if you will, somewhere else. But we’ve gotten now to the point where we’ve got all this unstructured data, and it’s not really useful to take a petabyte of unstructured data, make a big mass out of it, and put it somewhere, because you can’t use it. What was sort of your first genius idea of Aparavi that had to happen in this market?
Rod: Well, one of the things that we took a look at in designing the architecture: we started at the end on what we wanted to do with the data, what we wanted to enable the user to do with the data at the end of the day. For example, e-discovery and actually utilizing that data over the long term for different purposes. That drives a lot of the decisions moving forward on how you actually set up the system. Because let’s face it, an archive and a backup is just moving data from one place to another place. It’s what you do with the data, how you prepare it for the long term that really enables a lot of the features that we call Active Archive.
Mike: Some people are going to come into this thinking, “They’re both copies of data. I should be able to do similar things with the copied data. If I’m already backing it up, then I’ve got my copy and I should be able to use that copy for everything.” That’s not exactly the case, right? What do we have to do to actually make use of the data then?
Rod: Well, the thing about it is how we store the objects at the end, how we store the data. With traditional backup, when you’re trying to use backup as archive, you just have this huge blob of data. You don’t know what it is. There’s no way to classify it, no way to index it, no way to search it; you have no idea what it is. So, what ends up happening is you just keep adding to that big data pool, which basically goes there to die because you have no idea what you actually have. With the Active Archive and the way we store objects as individual objects, we can do much finer-grained lifecycle management over the long term, so we know when to delete things. We can delete things and remove things from the archive based on classification or content; we know what’s in the files, we know what type of file you actually have.
So, therefore, we can do things like searching on it, we can very quickly go out and locate in your archives all the files that have a certain keyword or a certain phrase or something along those lines. So, that’s really multi-purposing data. So, can you imagine three years from now, you want to find all the documents that reference my name? You can do that very, very quickly with the Active Archive and the archive architecture we’ve set up. With backup and these big blobs of data that you have, that’s virtually impossible.
Mike: And so, if I’m hearing you right, you have to do things at a file level or at granular level, which is a file level generally in unstructured data. Tell me a little bit also about the problems of timeline—we sort of hinted at it earlier. What happens to that data over time and why does storing it one way or in a different way matter?
Rod: Well, let’s take backup as an example, which people are trying to use as archives today. You back up your whole system for regulatory requirements, and you’re keeping that data for X number of years. That data is actually stored as a big blob, and essentially you can’t get rid of anything within that big blob. Now, with Aparavi, we try to classify data, and depending on its type and its classification, we can remove things at different rates. For example, a document without personal identifying information in it may only be required to be kept for a year, where stuff with that identifying information or patient information may be required to be kept for very, very long periods.
So, we treat each individual object, each individual file, and manage its life cycle instead of this whole big backup. That also reduces legal exposure, because if you’re keeping stuff longer than you need to, those are responsive documents that you have to provide to attorneys on request. So, getting rid of the data is just as important as saving it.
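The classification-driven retention Rod describes can be sketched in a few lines. This is purely an illustration: the categories and retention periods below are made-up examples, not Aparavi’s actual policy engine, and certainly not legal guidance.

```python
# Illustrative sketch of per-file retention driven by classification:
# documents with personal/patient information are kept far longer than
# ordinary files, and anything past its retention window may be pruned.
# Categories and periods are invented examples, not real policy.
from datetime import date, timedelta

RETENTION_DAYS = {
    "plain_document": 365,        # ordinary docs: one year
    "pii": 365 * 7,               # personally identifying info: seven years
    "patient_record": 365 * 30,   # medical records: decades
}

def is_expired(classification: str, archived_on: date, today: date) -> bool:
    """True when a file's retention window has elapsed and it may be deleted."""
    keep = timedelta(days=RETENTION_DAYS[classification])
    return today - archived_on > keep

today = date(2025, 1, 1)
print(is_expired("plain_document", date(2023, 1, 1), today))  # True: past one year
print(is_expired("patient_record", date(2023, 1, 1), today))  # False: decades to go
```

The point of the sketch is simply that retention becomes a per-file decision rather than one blanket rule for a whole backup image.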
Mike: Right. So, that’s kind of both ends of that problem. So, just a couple more things about the problems with using a backup as an archive, if I’m thinking about this right. A lot of people today use cloud storage for their backups, secondary storage. And we take these, as you said, blobs of data, and we push them up into the cloud. S3 is cheap, Glacier is even cheaper. Why is that a bad strategy when it comes to archives? Why isn’t that cloud store working for that?
Rod: That’s not sustainable and not scalable over the long term. I mean, maybe 5% of the system image that you’ve put up into the cloud is actually required to be maintained for longer periods. Ninety-five percent of it is just garbage, but you can’t prune out that 95%, so you end up storing the whole big blob for years and years and years when all you need is 5% of it. So, guess what? That means your costs significantly increase over time.
Mike: Right. So, you know, if only 5% of it actually has to be kept, that’s roughly a 20x cost savings you could get by just focusing on the data you’re supposed to have.
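Mike’s arithmetic is easy to sanity-check. The per-gigabyte price below is a hypothetical placeholder, not a real quote from any cloud provider:

```python
# Back-of-the-envelope cost comparison: storing a whole backup blob vs.
# only the ~5% of files actually under retention requirements.
# The $/GB-month price is a made-up placeholder for illustration.

def monthly_cost(gb: float, price_per_gb: float = 0.023) -> float:
    """Monthly object-storage cost for `gb` gigabytes at a flat rate."""
    return gb * price_per_gb

blob_gb = 10_000                 # a 10 TB backup image kept for compliance
required_gb = blob_gb * 0.05     # only 5% is actually required

full = monthly_cost(blob_gb)
trimmed = monthly_cost(required_gb)
print(f"full blob: ${full:,.2f}/mo")
print(f"trimmed:   ${trimmed:,.2f}/mo ({full / trimmed:.0f}x cheaper)")
```

Whatever the actual rate, the ratio is what matters: keep 5% of the bytes and you pay roughly 5% of the bill.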
Yeah. All right. And then, just flipping that coin: putting it up in the cloud is one thing, getting it back is a whole other. I can almost guess where this story is going. I’ve got a large blob of data up there. If I’ve got a backup and I want to get something out of it, I’ve got to pull the whole damn thing down.
Rod: You got to pull the whole thing down, yeah. To just examine one file out of that blob, you have to pull the whole blob down to get to that one file.
Mike: Right. And that’s not just cost and bandwidth, but time, effort… I think you mentioned indexing before, and I think we’re going to come back to this. If you don’t know what’s in the system image, you have to store the whole system image before you can even find out what’s in it, right? And so part of that archiving use case is being able to know what you have before you even go look for it.
Rod: Right. Once again, with a blob, you have no idea what the content is; the backup system doesn’t even know what the content is. All it knows is that here’s this big blob of data that it has to store for years and years and years. With the Active Archive, we try to recognize via categorization what categories each file actually fits into, and then we also provide content index searching. So, we know all the words within a document before we actually throw it into the cloud. The really nice part is, you can actually search your documents locally even though they’re still residing in the cloud. So, we can find all instances of a file that contains a particular phrase without bringing anything down from the cloud. The only time you need to pay those egress fees is when you actually want to recover the document on site.
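The local content index Rod describes can be illustrated with a minimal inverted index: words are recorded at archive time, so later searches are answered locally and nothing has to be pulled back from cloud storage. The file names and implementation here are invented for illustration; a real system like Aparavi’s is far more elaborate.

```python
# Minimal sketch of a local content index: words from each document are
# indexed before the file ships to the cloud, so searches run locally.
import re
from collections import defaultdict

index: dict[str, set[str]] = defaultdict(set)   # word -> set of file paths

def index_document(path: str, text: str) -> None:
    """Record every word of `text` against `path` at archive time."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        index[word].add(path)

def search(phrase: str) -> set[str]:
    """Files containing every word of `phrase`, answered from the local index."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words:
        return set()
    hits = index[words[0]].copy()
    for w in words[1:]:
        hits &= index[w]    # intersect: file must contain all the words
    return hits

index_document("reports/q1.docx", "Quarterly review with Mike Matchett")
index_document("notes/todo.txt", "call Mike about the archive demo")
print(search("mike"))  # both files match, with no cloud egress needed
```

Only an eventual restore of a matched document would touch the cloud copy, which is where the egress fees Rod mentions come in.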
Mike: All right. So, just looking at some of the fundamental choices and tradeoffs involved in going to an archive if all I’ve got is backups, somebody might say, “When I take a snapshot of my system and push the whole blob up there, it’s actually pretty optimized these days. It’s going to be a pretty fast way to do things even if it costs me more. If I have to go through file by file, make copies, and process that, that’s probably going to take more time. So, am I really going to lose out here? Am I going to have an archive window problem, like I have a backup window problem today, because my archive window is bigger?” How’s that? How’s that for a question?
Rod: Well, it is actually a very interesting question. But if you look at what you’re actually backing up and trying to store in that blob, it’s the whole system, because that’s the easiest thing to do. You have no way to classify what you’re actually trying to store in a backup. So, you usually end up throwing the whole system up there, a whole image, which is extremely fast for getting it off the system. But transferring terabytes up to the cloud for the initial backup of a disk image is very, very inefficient.
Now, it gets more efficient once time goes on, the second backup of that image, because it’s a differential or incremental. Now, if you look at the way we do it, we’re not trying to back up the Windows operating system. We’re not trying to back up program files. Those are really not relevant for an archive. You don’t need to keep Word 6.0 for the next 30 years. It’s just not viable. You don’t need it.
What you actually need is your user documents, the documents that you would be responsive for in a legal query or that you want to do e-discovery on. You don’t need to do e-discovery on Excel, the program itself. What’s the point? So, if you look at how much data is actually being shoved up into the cloud by a backup system, it’s huge. But if we can narrow that down in an archive to a much smaller set, even though it’s file by file, it actually is more efficient as far as time and space go to get it up to the cloud in the first place.
Mike: Right. And then I think a corollary to that, if I’m looking ahead is rather than have that large blob of data to manage and store, if I’m managing those objects intelligently in a finer-grained way, file by file, I must get some advantages there too.
Rod: Yeah, right. Now, the first archive that you do is going to be your biggest archive because we consider that the base-level archive. From then on, it’s incremental or differential forever, so we don’t actually send all the documents up over and over and over again. So, it’s only the documents that have changed that are then added to the archives the second time. So, actually the second time is much more efficient than trying to do it by image.
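The “base archive, then incremental forever” idea Rod outlines can be sketched by tracking content hashes and sending only the files that changed since the last run. This is a generic illustration of the technique, not Aparavi’s implementation.

```python
# Sketch of incremental-forever archiving: on each run, only files whose
# content hash differs from the last recorded hash get sent up again.
import hashlib

last_seen: dict[str, str] = {}   # path -> content hash from the previous run

def changed_files(files: dict[str, bytes]) -> list[str]:
    """Return paths that are new or changed since the last run,
    updating the recorded hashes as a side effect."""
    to_send = []
    for path, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if last_seen.get(path) != digest:
            to_send.append(path)
            last_seen[path] = digest
    return to_send

# First run: everything goes up -- the base archive.
print(changed_files({"a.doc": b"v1", "b.doc": b"v1"}))   # ['a.doc', 'b.doc']
# Second run: only the modified file is sent.
print(changed_files({"a.doc": b"v2", "b.doc": b"v1"}))   # ['a.doc']
```

The first run carries the full cost; every run after that transfers only the delta, which is why the second archive is so much cheaper than re-shipping an image.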
Mike: And then I think we had hinted before, also, if you’re looking at things like, “I gotta get rid of some data sooner,” I can actually trim my data sets at that file level, whereas I can’t really do that at a backup level; I’m kind of constrained.
I had another question for you about the time frames involved, so let’s just spin into that. Backups generally are done primarily for disaster recovery, or even availability, rolling over when I’ve got problems. And I wanna have a low RPO, and I generally only want to keep so many days of backups online that are going to be useful. I mean, if I have a virus or something, I want to restore, but I don’t want to go back two years; I’ve lost two years of work. I want to go back to some small window. When I think of archives and the use cases for those things, like e-discovery, you mentioned, and some other stuff, those could be years long. So, what does that do for some of the decisions you guys have made with Aparavi? How does that change?
Rod: Well, we talk about backup windows and recovery point objectives, recovery time objectives, and those all really meld into what the future looks like. And if you look at the use case for a backup, a backup is actually morphing into disaster recovery and high availability: you want to recover a server within a minute or two or an hour, and let’s face it, image-based backups and those kinds of technologies are very, very good for that HADR stuff. If you have to go out and recover documents from a backup three years ago, you’re in a lot of trouble. You’ve got some real issues.
Mike: Do those files even work three years later, right? I mean, they’re up there, I don’t even know if I can put them on the system three years later.
Rod: So, what ends up happening is your backup high availability disaster recovery solution should only cover 30, 60, 90 days. That should be kept locally on-site, because that’s what you want: you want to be able to recover things very, very quickly. The archive is more geared towards the long term, so we don’t actually archive files that have changed within the last 15 or 20 days. What you’re really trying to do is store this stuff for the long term. If you have to recover a full system from an archive, you’re in trouble, you’ve got issues. So, that’s where the two of them really work very well together. Now, when you’re doing things like high availability and failover, that solution gets even better, because then you just fail over and continue on with your archive.
Now let’s talk about backup windows for just a second. Since you’re not actually relying on the archive to recover a full system, backup windows really don’t apply. If it takes us two or three days to get a couple of terabytes of data into an archive up in S3 (it doesn’t take that long, but if it did, say on a slow connection), that’s okay, because you’re not relying on that as your primary backup source.
Mike: When I’ve talked to some other backup space vendors recently, they’re really moving away from taking a copy of the data and putting it somewhere, to more of a replication approach, keeping this sort of stream of changes and journaling going along. And I guess what we’re saying is, for archive, you need a copy of the data. For backup, it’s becoming less important to have a copy that’s off there. For a lot of backup use cases, you still need a long-term backup in case you’ve corrupted your data, but a lot of those use cases are changing too, so that’s interesting. So, let me turn a little bit and ask you one more question about this searchability thing, because I was really curious about that. I know people say, “I can do some archiving with backups today; I can process some of the data.” You mentioned the indexing. Is content search something you see customers really asking for in their archives? And what are some of the interesting things they’re doing with it?
Rod: Absolutely. Yeah, content searching is extremely important, because that’s what makes the difference between a blob of data and actually making use of the data over the long term. Because if you don’t know what you have, if you can’t find what you have in these millions of files that are in the archive, and maybe billions of versions of billions of different objects, what are you going to do with it? You know, what’s the point? So, unless you can provide a method to actually get to them very quickly, repurpose the data, be able to search it to find out what you actually have, and classify it into certain classifications of data so you can manage it, what’s the point of the whole exercise? You’d just be throwing blobs up in the cloud as a cost of doing business, throwing it up there just because somebody said you could.
Mike: I can tell you that there are people who do that, throw their blobs of backup up there, and then they have a billion backup files and don’t know what’s in there either. So, that problem just compounds, right?
Rod: And you throw them for seven years because somebody tells you to keep them for seven years.
Mike: Right. I’ve got 32-digit backup file names; I have no idea what they are three years later in any case. So that’s kind of cool. I know we don’t have a lot more time in this session, but Aparavi is kind of new. How are you guys doing? How’s your traction in the market? How are people using it and feeling about it?
Rod: I think very good. We’ve gotten quite a bit of recognition. We’re about nine months old, I think. Came out of the gate about nine months ago, and we’re really getting some traction in the market with industry recognition. So that’s really gratifying, because somebody actually sees the value of our solution, which, hey, that’s what I get paid to do. The company is doing very, very well. I’m looking for an engineer, so if there are any engineers out there…
Mike: Do I get a recruiting bonus if I find some for you here on this call?
Rod: Well, you know, this is a marketing video, so I’m going to turn it into a recruiting video too.
Mike: No, that’s great. And I know you’ve got some recognition there recently from Gartner. Some other folks have come along and said, “Hey, this is one of those cool new technologies for the year. People should really take a look at it,” and that’s outstanding. Any final advice for folks who are considering this idea of actually having a useful archive and getting off some of their backup reliance for the wrong reasons?
Rod: Well, I think that if you rethink what you’re actually trying to accomplish with your data, throwing it up to the cloud in big blobs just doesn’t work anymore. There’s scalability issues and long-term sustainability issues that we need to think about. So, really rethink what you’re trying to accomplish.
Mike: If I have one thing in my mind going away from this, it’s that you do need a copy of your data. You don’t need 20 copies of your data; you need one good copy that’s protected, that’s in cost-efficient storage for the long haul, and that’s fine-grained for all sorts of active use cases, thus the Active part of Active Archive, and a backup’s just not gonna cut it. That blob image, for a number of reasons, is not going to cut it anymore. All right, so it looks like some questions are coming in from the audience. One person has asked: where does Aparavi’s technology sit? What are the components, and how much of it is running on-site versus in the cloud, I think is what they’re really asking.
Rod: Okay, great question. So, we’re offered as software as a service. So, we have a platform that actually sits up in Amazon, and it’s interesting, that’s what the customers actually talk to. To manage all your systems, you talk to aparavi.com, actually platform.aparavi.com. You log in, then what we call the point of presence, which is an appliance and an agent, sits on your local system or local site. Those two components are communicating with our web app, the platform up in the cloud. So, everything is managed remotely from AWS, policies are then pushed down over onto your local site. You never actually have to talk to your local appliance or agent. The platform will do it for you.
Mike: All right. Okay. Here’s another question that’s interesting.
Rod: That’s not a very good answer.
Mike: Here’s another question. No, I’ve got another question here. So, this person wants to know what the lift and shift for adopting this is, because it sounded like that initial cataloging of a lot of their data would take some time. Is this a big investment just to try it out, or something that can be layered on the side? What’s involved in doing a pilot project?
Rod: To do a proof of concept pilot project is very easy. You just sign up for an account on Aparavi, direct it to the data that you want to actually store or protect, install the appliance, install the agent, and you’re good to go. We can protect as little as 100 KB. If all you want to do is check it out on a very small system with a couple hundred documents, that’s great. The nice part is that after the proof of concept is successful, you can go ahead and start adding new stuff into the store, protecting more and more data, more and more servers, etc.
Mike: And I think this is a related question I was noticing coming in, obviously about the cost of pricing of this as a software as a service solution. Is it scaling with the amount of data that you are indexing or the amount of data you’re storing or amount of data you’re retrieving? How do you guys go forward on the model?
Rod: Okay, so we bill basically on amount of data protected. In other words, if you have a directory on your server that contains a terabyte of data, no matter how many copies you make of that data, how many archives you actually have to do on that data, you get billed on the one terabyte, that’s how much you’re protecting. That makes the cost extremely predictable because you know how much data you’re protecting.
Mike: Yeah, that sounds pretty cool, and that sounds like a great way to price out how to do something globally without having it get out of hand.
Rod: Yeah, exactly.
Mike: And here’s a question. What kinds of systems are you really covering at the end of the day? Is this just for VMware? I know they listed a whole bunch of things. What’s your catalog of coverage?
Rod: Any Windows and Linux system basically. I mean, we can run within a virtual machine, we can run on a physical machine, it doesn’t really matter. So, as long as it’s Windows or Linux based.
Mike: Alright, that’s great. There’s a bunch more questions here but I think that’s all the time we have today, Rod. Again, thanks for answering these questions, and back to you, Dave.
Rod: Thank you, Mike.
Dave: All right guys, great job, excellent work. Thank you to Mike Matchett, Principal Analyst with Small World Big Data and Rod Christensen, CTO with Aparavi. You know, we just want to thank you again for coming. We want you to keep an eye out and stay tuned for additional videos and webcasts and giveaways that we’ll be hosting here at Truth in IT. We certainly appreciate your participation and your feedback. So, please let us know how you think we did, and for now, we want to wish you all a great day. Thanks again, and we’ll see you soon.