Unstructured data has changed dramatically over the past decade. It is bigger not only in capacity but also in quantity: there are simply more files to deal with than ever. On top of that growth, unstructured data is more critical to the organization than it was in the past. The problem is that unstructured data protection methods haven’t changed much in the last decade.
In this ChalkTalk Video, George Crump, lead analyst of Storage Switzerland, and Rod Christensen, Co-Founder and CTO at Aparavi, talk through the problems with the current unstructured data protection methods and how Aparavi solves these challenges with its new architecture.
Transcript: Unstructured Data Protection Needs a New Architecture
George: Unstructured data continues to be a problem in the data center. It’s not a surprise to most IT professionals that it requires a lot of capacity, but I think what’s taken a lot of them off guard is the rate of growth of that capacity. And to be honest with you, it’s just gonna get worse. We’re just at the beginning of this problem. Joining me on the whiteboard to discuss this, I’ve invited Rod from Aparavi. Rod, thanks for joining us.
Rod: Thank you.
George: Rod, now what we see is if this is a file system or a file server, and these are files, what we’ve seen over the past ten or so years is the number of files that are in there has just multiplied exponentially. In the old legacy way of backing up unstructured data, you would literally go in and kinda check each file to see if it needed to be backed up. And I think what semi-modern applications are trying to do is just scoop these up as one big image and then dump it off to secondary storage. But there are problems with that method too, right?
Rod: Yes, there really are, George. The really big issue that you start running into is when you scoop this all up as an image, you’re managing that as a single object, a single image. There’s no way to crack that image. Yes, you can probably mount that image as a file system, but then to remove data or actually copy individual files actually just moves the problem one step out, from here out to here to try and manage the data.
George: Yeah. So all the workarounds just create a higher level of complexity, right?
Rod: Yes, yeah.
George: Now, what’s interesting is, and we’re gonna go through you guys’ architecture, you guys are doing this in really an entirely different way, right? Can you talk a little bit about that?
Rod: What we do actually, is we walk the file system and we find all changed files. After the initial snapshot is done, we actually copy that off to the secondary storage. Once that is done, we do a comparison to find out what the differences are between the file systems.
Rod: It’s very, very fast and very, very efficient the way that we’re doing it. With that, we can very quickly recognize what files actually need to be copied. After the first snapshot, after the first copy, we never copy an unchanged file again.
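To illustrate the "walk the file system and find changed files" step Rod describes, here is a minimal sketch. It assumes a change can be detected by comparing each file's size and modification time against the previous scan; the transcript does not say how Aparavi actually detects changes, and the function names `scan`, `changed_files`, and `deleted_files` are hypothetical.

```python
import os

def scan(root):
    """Walk root and record (size, mtime_ns) for every file found."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            index[os.path.relpath(path, root)] = (st.st_size, st.st_mtime_ns)
    return index

def changed_files(previous, current):
    """Paths that are new, or whose size/mtime differ from the previous scan."""
    return sorted(p for p, sig in current.items() if previous.get(p) != sig)

def deleted_files(previous, current):
    """Paths present in the previous scan but missing from the current one."""
    return sorted(p for p in previous if p not in current)
```

Only the paths returned by `changed_files` would need to be copied to secondary storage; everything else was already copied in an earlier pass.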
George: How do you guys store this architecturally, both on-prem and in the cloud?
Rod: Okay, so the first thing…I’ll draw it out here.
Rod: The first thing we have is your file server, which has an agent installed, okay? And let’s say it has the disk drive sitting here or the storage that you actually want to protect. The first thing it does is we have an appliance up here, it sends initially the full snapshot up here to the appliance…
George: So that’s a full copy of data.
Rod: A full copy of all the files. From then on, this guy recognizes exactly what has changed on the file system and only sends those changes to the appliance. Now when we get up to the cloud part, the appliance here actually does exactly the same thing. It knows what files have already been sent to the cloud, and it never sends them again. So it’s actually data driven from here over to here. So, once we actually make the copy from here over to here, then this guy recognizes what’s changed in that and then mirrors it up to the cloud.
George: Okay, very good. So then the cloud kind of becomes your DR copy. Do I have to be concerned about growth of this? Am I recreating the problem here?
Rod: No, you’re not, because one of the things that we do is we get to set the number of snapshots here that we maintain. So keep in mind that it’s always going to be bigger than primary storage, because… well, actually not, because we have compression and dedupe in there as well. So it will actually be about the same size, maybe smaller, plus or minus. But the thing is that with the number of snapshots, so if you set the number of snapshots to five, for example, we have one complete copy here, which is compressed, encrypted, and deduped. Then the next four versions are only the changes that occur here.
Rod: So, depending on your data turnover rate, you can actually compute how much storage it’s going to take on this guy. Same thing for over here. We maintain one full copy over here and once that full copy is established, then we send the differences over here. By comparison, in a typical image backup process, you’d actually be sending five different disk copies up to the cloud here. So if this is one terabyte, you’re actually going to have five terabytes up here. But if you have a 10% turnover rate on your data, that would be, what, 100 gig? Okay, so up here, you’re gonna have the 1 terabyte for the base image plus 4 times 100 gig, which would be 1.4 terabytes. So that is the comparison.
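The comparison Rod works through on the whiteboard can be written as a small calculation. This is a sketch under simplifying assumptions: it treats 1 terabyte as 1,000 gigabytes, assumes every snapshot after the first stores the same fraction of changed data, and ignores compression and dedupe. The function names are hypothetical.

```python
def incremental_storage_tb(primary_tb, snapshots, turnover):
    """One full copy plus (snapshots - 1) sets of changed files."""
    return primary_tb + (snapshots - 1) * primary_tb * turnover

def image_storage_tb(primary_tb, snapshots):
    """A full image of primary storage for every retained snapshot."""
    return primary_tb * snapshots

# Rod's example: 1 TB of primary data, 5 snapshots, 10% turnover.
file_based = incremental_storage_tb(1.0, 5, 0.10)  # 1 TB + 4 x 0.1 TB
image_based = image_storage_tb(1.0, 5)             # 5 x 1 TB
print(f"file-based: {file_based:.1f} TB, image-based: {image_based:.1f} TB")
```

With those inputs the file-based approach needs roughly 1.4 terabytes against 5 terabytes for five full images, which matches the figures in the discussion.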
George: Okay, so it dramatically reduces the amount of the capacity we’re talking about.
George: So then how do I control growth of this particular appliance?
Rod: Okay, so each file out here, what we do is we classify it and we have a retention period for it. So let’s say your retention period is, let’s say three years, so a copy of this will be maintained on this appliance for three years, and then it’s deleted. Same thing up here. So no longer do you have to keep these images up here forever and ever and ever; we can actually remove information files that have retained out that we no longer need up here individually. Once again, with an image backup, you can’t do that, you can’t just remove a file out of the middle of an image.
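Per-file retention, as opposed to whole-image retention, can be sketched as follows. This is only an illustration: the `StoredFile` record, its fields, and `split_expired` are hypothetical names, and the transcript does not describe how Aparavi tracks retention internally.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class StoredFile:
    path: str
    stored_on: date       # when this copy was written to the appliance or cloud
    retention_days: int   # retention period assigned by the file's classification

def split_expired(catalog, today):
    """Separate files still under retention from those that have aged out."""
    keep, expired = [], []
    for f in catalog:
        if today >= f.stored_on + timedelta(days=f.retention_days):
            expired.append(f)
        else:
            keep.append(f)
    return keep, expired
```

Because each file carries its own retention period, expired files can be removed one at a time, which is the operation Rod says an image backup cannot support.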
George: Well, and that becomes really important with regulations like GDPR, where you’re gonna have to be much more specific on files. Correct?
Rod: That’s huge. With the right to be forgotten, you have to be able to remove data out of the middle of a backup set or an image. With images, it’s darn near impossible.
George: You just can’t do it. Can you have different retentions in these two locations?
Rod: Yes, you can. This can be retained for three years, and this is the last five snapshots.
George: Okay. So this could be relatively small…
Rod: Yeah, it can be relatively small. Once again, if we have five snapshots, it’s going to be 1.4 terabytes over here, one terabyte over here. But you may allow this to grow up to 150 snapshots or 3000 snapshots.
George: Okay. What’s the installation experience like for the user? I mean, you know, obviously, they already have a legacy backup solution. This is going to probably save them a lot of both time and money and hopefully better protection. What’s the installation look like for them?
Rod: Very simple. You just download an agent, put the agent on, set up your appliance. There’s actually another piece up here that we call the platform server. The platform server is hosted by Aparavi themselves. You can host it, if you want to, in a corporate data center. But the web app server is actually the main command and control of all these things. This is where you go, this is where you connect with your browser and say, “Okay, I wanna set certain policies,” and things like that. When the user actually sets the policies up here, they’re automatically pushed down to here. So once you have your corporation or your company set up and you set up your appliance, it automatically inherits the policies from the platform. When you connect a new agent, you don’t have to do anything except install the agent; it automatically pulls down all the policies that need to be implemented.
George: And so then, I would also think that that makes scale significantly easier, right? If I’m a multi-site corporation, or something like that, I can distribute those policies pretty easily.
Rod: Yes, yeah. It’s automatic, because when you set a policy, there are certain hierarchies that you can set the policy on. You can set it way up at your organizational level, you can set it up at the appliance level or each individual agent. Now, we don’t have the concept of jobs or anything like that, like traditional backup. They’re all policies. So if data falls within a policy, it’s recognized and then acted upon within that policy.
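The policy hierarchy Rod mentions (organization, appliance, agent) could be resolved along these lines. The sketch assumes that a setting made at a more specific level overrides the same setting inherited from a broader level; the transcript does not state the precedence rule, and the policy keys shown in the usage below are made up for illustration.

```python
def effective_policy(organization, appliance=None, agent=None):
    """Merge policy settings from broadest to most specific level.

    Assumption: later (more specific) levels override earlier ones.
    """
    merged = dict(organization)
    for level in (appliance, agent):
        if level:
            merged.update(level)
    return merged

# Hypothetical settings: the agent inherits everything it does not override.
org = {"retention_years": 3, "snapshots": 5}
agent = {"snapshots": 150}
policy = effective_policy(org, agent=agent)
```

Here a newly installed agent with no settings of its own would simply receive the organization-level values, which corresponds to the "install the agent and it pulls down the policies" behavior described earlier.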
George: So one of the big concerns that we hear a lot, and it kind of ebbs and flows, is ransomware. And of course, probably ransomware can attack anytime during the day, whatever. And we suggest that people make frequent backups, which is particularly hard with image backups, because you got to do that image all the time. Sounds like you guys have a good alternative to that.
Rod: We do. Actually, while we’re running through these files, doing it file by file, we can recognize when a certain percentage of files have changed, so like all your JPEGs and PNGs, and all that kind of stuff. If there’s a massive amount of changes, you can set the threshold like at 25%. If more than 25% of your JPEGs or DOCXs or whatever have changed, it’ll automatically start sending alerts from the web app server. You get an SMS text on your phone, and then it will stop and allow further action to say, “Okay, this is all right, I’m not infected, go ahead and do it.”
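The threshold check Rod describes can be sketched as a per-file-type change ratio. This is an illustration only: it groups files by extension and flags any type where more than the threshold fraction has changed. The function name `suspicious_types` is hypothetical, and the alerting and pause-for-approval steps are left out.

```python
import os
from collections import Counter

def suspicious_types(all_paths, changed_paths, threshold=0.25):
    """Return file extensions where more than `threshold` of the files changed."""
    total = Counter(os.path.splitext(p)[1].lower() for p in all_paths)
    changed = Counter(os.path.splitext(p)[1].lower() for p in changed_paths)
    return sorted(
        ext for ext, n in changed.items()
        if total[ext] and n / total[ext] > threshold
    )
```

A non-empty result would be the trigger for pausing the backup and alerting an administrator, rather than copying what may be encrypted files over good versions.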
George: Great. Okay. And then how frequently can I run the backup job?
Rod: As frequently as every minute.
George: Okay, so I can capture the changes almost in real time?
Rod: Yeah. That’s what, I think, we have the checkpoints for, then the snapshots, and then the archives.
George: So those are Aparavi terms. So explain to me what each one of those are, where they fall into the architecture.
Rod: Okay. So everything is file by file; it’s really not image based. It still is file by file. A checkpoint is actually stored on the local system for the oops moment where you deleted a file or something like that. Typically, the default policy we set up is for a checkpoint every hour. So that means that if you delete a document or you want to go back to the last version, it’s all locally done here. It’s very, very quick. Snapshots are typically a roll up of what goes up to the appliance. Snapshots typically, by default, are done once a day.
George: Okay. And that’s going to protect me from, say, a media failure or even a file server failure?
Rod: Yes, right.
George: Okay, great.
Rod: Then once a night, basically—you can set it up for once a night or however often—it actually syncs the data from over here up to the cloud. And those are what we call the archives. This is the long-term stuff.
George: Gotcha. Well, Rod, thanks very much for detailing this for us.
Rod: Very good.
George: There you have it. If you’re looking to really protect unstructured data in a different way and move out of the legacy methodologies, Aparavi has a really good architecture to consider.