In today’s installment of “What people are saying about Aparavi,” we have a valuable webinar on data protection requirements with our VP of Business Development Jon Calmes and Storage Switzerland Analyst George Crump.
In “The Three New Requirements of Unstructured Data Protection – Can Your Backup Deliver?” George and Jon discuss the new problems caused by growth in unstructured data and why it needs new solutions.
George has written extensively on unstructured data protection, and we highly recommend taking a listen. He says unstructured data is frequently the most important information asset an organization can have, so traditional/legacy backup approaches aren’t good enough anymore.
Transcript: The 3 New Requirements of Unstructured Data Protection
George: Hello and welcome. I’m George Crump, Lead Analyst with Storage Switzerland. Thank you for joining us today. Today we’re talking about the three new requirements of unstructured data protection and giving you some tools to assess whether or not your current backup solution delivers the goods. So we’re going to talk about the problem in general that people are dealing with, some of the challenges that we’re seeing with legacy solutions, and what you need to be looking for when you start to look for a data protection solution that’s really going to be able to handle this. Attendees are able to receive our latest eBook that really dives deep on this subject, called “Modernized and Unstructured Data Protection.” That can be found in the attachments and links section as part of the webinar. Joining me as part of the conversation is Jonathan Calmes. He is the Vice President of Business Development at Aparavi. Jon, thanks for joining us today.
Jon: George, thanks for having me. I’m definitely excited about being a part of this discussion.
George: I’m glad to have you. So before we get too far into the presentation, you want to give the folks just maybe a quick 30 seconds on you and what Aparavi is up to?
Jon: Yeah, absolutely. So I’m a veteran, as much as I can say, in the traditional backup and disaster recovery space, and I felt a sea change in data protection, along with the other leaders here at Aparavi. We are now focusing specifically on managing long-term data retention, the warm active archive of unstructured data across multiple cloud architectures and hybrid cloud infrastructures, really trying to get ahead of the wave that we see coming. I think enterprises have already seen it today, right? But we are trying to get ahead of it for the rest of the market.
George: Okay, great. And then for those who don’t know me, I’m George Crump, the Founder and Lead Analyst of Storage Switzerland. We’ve been doing Storage Switzerland for a while; we’re now in our 12th year. Prior to that, I was CTO at a large storage integrator. One of my responsibilities was to manage tech support. We took first and second call for products like NetWorker, NetBackup, and Commvault. And if you ever want to know what the worst job in the industry is, it’s probably managing tech support for products when you don’t even have the software code for them. So a lot of what we learned there goes into a lot of the processes that we’ve developed, and I think you’ll see a lot of that come through today. So, Jon, before we talk about what’s wrong, let’s just frame up the problem. And I was very hesitant to do yet another “unstructured data is growing out of control” slide, because it is. But I think everybody kind of knows it.
But I think the key difference, and I think what takes people off guard, is it’s not really that unstructured data is growing; I think everybody will acknowledge that. But it’s really how it’s growing, right? This is not because we’ve got another billion people on the planet and they’re all creating really big PowerPoint presentations. This is being driven partly by some of that, but also by machines and things like that. And so now we have what I call a quantity of files issue. You know, in the old days when we would do a backup assessment, I would be on the lookout for servers that had maybe a million files on them. Now, we’re starting to see customers approach a billion files, and that just fundamentally changes the way you do a lot of things, and one of those things would be data protection, right?
“I think what takes people off guard is it’s not really that unstructured data is growing… But it’s really how it’s growing.”
Jon: Yeah, exactly. I can’t tell you how many times, working with backup companies, large organizations were hitting this type of experience and saying, “I’m not meeting my backup windows with product X or product Y any longer. It’s just sitting there and it’s scanning for a week before it’s doing anything.” And that’s because the way in which they are working with the file system is unrealistic at this kind of billions-of-files scale.
George: Yeah. And I think that the other kind of sea change we see here is the size of the unstructured data. This is kind of a weird one. I didn’t know how to really put this into a bullet, but it’s not that there are fewer files that are measured in kilobytes. In fact, you can probably make a case that there are even more now, thanks to the Internet of Things and things like that. But there are a lot more files that are now measured in megabytes and gigabytes. And that really creates a challenge too, because you’ve got this unbelievably mixed workload in a single data center, where you could be dealing with a half a million or a couple million 3K files. And then the next job, so to speak, is backing up maybe 100,000 2-gig files. And so that really puts a different type of stressor on the backup application. And then, Jon, I think the other big one that you and I talked about on a video that we did recently was: this is just the beginning, right? We’re going to look back five years from now and think of this as the good old days, right?
Jon: Yeah, exactly right. The good old days when using backup can still work for your archive and your long-term retention. Exactly. That growth, I think by 2020, unstructured data specifically, according to some leading analysts, is supposed to be like 90% of all data worldwide, and it’s going to be in the zettabytes or the yottabytes. There is a massive rate of growth that is happening now at an enterprise scale. Absolutely.
George: Yeah. I mean, the example that I always give is four years ago, not even that, three years ago, I didn’t know I needed to know what my heart rate was every minute of every day. Now, apparently, I can’t live without it. Right, Jon? It just is this ongoing thing that is going to get worse and worse as we put more data, more sensors on different things. So I think that really becomes a challenge as we move forward. The other thing is that this data, unstructured data in particular, is now just increasing in importance. And what we’re looking at here is, if you go back to the early 2000s type of era, most of this data was created by users, and their typical backup strategy was, “Well, if it’s important, copy it to a file server, and we’ll back it up there.” The thing is, it’s fundamentally changed. There are businesses that run solely on unstructured data. And then, Jon, there are a lot of businesses that make decisions based on this data. And if they don’t have it, essentially they can’t operate, right?
Jon: Yeah. I think with the rise of machine learning and AI, a lot of the new, especially in engineering and manufacturing…Elon Musk, for example, could predict that Tesla Model 3 production is going to go from 3,000 a week to 5,000 a week based on some of that machine data, and his stock still goes up, even though it still isn’t profitable, because the business trusts the data more than the cost, I think, at this point. So you can say, “Look, here’s the path of increase. Here’s the machine data that backs that up.” And that’s obviously a macro scale. But the other thing is we are now more regulated in data than ever before. You look at things like GDPR, and even data that we previously didn’t think held value now holds risk. And that’s, I think, going to be a growing concern here stateside in the next few years as well.
“We are now more regulated in data than ever before. You look at things like GDPR, and even data that we previously didn’t think held value now holds risk.”
George: Exactly. I also think that another big one that we see is the ability to monetize data, and I’m not necessarily talking about selling it in a Facebook/Cambridge kind of situation, but about legitimate reasons to actually sell data. You know, we worked with several sports leagues where they’ve taken old archives, where maybe it was radio only, digitized those, and then made the playback of those available for sale. So if you want to listen to a baseball game from the 1930s, you can subscribe to the 1930s baseball games, which is kind of interesting if you’re a baseball guy. And so there are ways to, if you will, make money off of this data. And the other thing is there are also new threats. We talk a lot about ransomware and ransomware is…I don’t know if it necessarily specifically targets unstructured data, but certainly, Jon, unstructured data is a victim of ransomware in many cases, right?
Jon: I think that’s where it starts, in emails and things like that, that people are opening attachments unknowingly outside of malicious intent. But, yeah, the ransomware issue is a big deal, as we all know. Once it gets in, it can traverse network paths, it can get all throughout this really, really quickly, and a lot of the backup software out there and a lot of the archive software out there is going to say, “Oh, look, this file changed.” And they’re going to end up copying bad data onto your secondary and tertiary storage systems. And from that point on, you’re pretty much up a creek.
George: Exactly. And so I think that changes the way we look at protection, both in terms of frequency and analysis of data as we do the protection and things like that. So it just opens up whole different things. And then the other thing we see is this increase in demand for retention of data. For some of these monetization reasons and to do decision-making, organizations are wanting to keep data longer. The term I hear a lot is “forever,” and forever is a really, really long time. And so the ability to retain information and manage that correctly is really important. Of course, the more data you have tends to lead to more accurate decisions and more monetization.
And then, Jon, of course, GDPR happened, right? And so it kind of changes everything. One of the things I like to point out is that GDPR is really a world problem. This is not a unique-to-Europe situation. I’m on record as saying that, frankly, if you read the GDPR regulations, they’re not crazy. You don’t read these things and go, “Geez, what did they do?” A lot of what GDPR is asking an organization to do really is just good data management and good data protection, right?
Jon: Absolutely. I think there are some things in there that people are going to feel are unreasonable, like, “Hey, forget me, right, after X period of time, forget me.” And yes, to your next point there, GDPR breaks traditional backup, and even though we say traditional, it breaks modern backup too. If you’re looking at your image-based systems, you need to intimately know what you have everywhere in order to comply. And that is a daunting task for legacy, traditional, and modern backup companies in general. And anyone with a data protection engine older than GDPR is in trouble.
“Anyone with a data protection engine older than GDPR is in trouble.”
George: I think that’s a fair assessment. I think that a project we were working on really hits that middle point, that we see GDPR-compliant companies starting to use it as a competitive advantage. At this particular company, after we had done the plan, the CMO (I don’t normally deal with the CMO, but we were dealing with an end user company) walked into the room and handed me a press release that they were getting ready to put out that basically said, “We’ve done all this work and we’re fully compliant. But we also want our U.S. and Canadian customers to know that we think they’re just as important as our European customers. And they’re going to get the same rights and privileges as our European customers,” with the hidden implication of, “Why doesn’t your competition do this?”
And so I think that this whole GDPR thing really just sparks this data privacy conversation I’ve had with what I would describe as non-IT people. When everybody started getting emails at the end of May, trying to understand why everybody is letting them know that their data privacy plan had changed, that just sparked a whole different level of conversation that I don’t think anybody had before. And now I think over time, people are going to be much more in tune to that.
So I think that kind of sums up where we are. And so let’s talk about the current state of the art, if you will, in unstructured data protection. And there’s really two methods. Method one is this file by file technique. And, Jon, there are some good things here. You had mentioned the time it takes to scan the file system or walk the file system, the term we’ll use a lot, and there are some advantages here. It does give us some granularity. We do know about each individual file. And that, in some ways, is helpful as we talk about things like GDPR and stuff, right?
Jon: Exactly. When you’re looking at long-term data, you need that high level of granularity to be able to actually manage the data in and of itself. Having that high level of granularity is what is required to start to comply with some of the regulations coming about. And one of the things I wanted to say before is, when we did some survey data (and there are a lot of people who took survey data on GDPR preparedness), the techs were like “Meh,” because I think they saw all the inherent problems and were like, “Look, unless we rip everything out, there’s no way we’re going to comply right away. So let’s just not worry about it.” But the business side of the house was like, “No, you need to figure this out.” So the executives had a high level of fear around it, whereas the actual boots on the ground (we’ll say the sysadmins, the senior sysadmins, etc.) were kind of like, “Okay.” So our response to that is, “Well, use your executives’ fear to fight for budget for better data protection and archival engines.”
George: Right. Because the right engine really helps here. And I think the other challenge we see with file by file backup is, first of all, it is still a series of sequential jobs. And I am concerned about, for example, the use of tapes here. If John Smith calls you up one day and says, “I’d like to be forgotten,” I don’t know how you remove John Smith from a sequential job stored on a sequential form of media that’s eight terabytes large.
Jon: And where one file relies on another file or another file or another file in a systemic incremental backup, you can’t just go in and start removing things out of the chain.
George: And then, of course, you touched on this earlier, but the downside with a file by file backup is, especially when we’re talking about millions and millions of files, it is really slow, right? And it’s not so much that the first backup is slow, you could probably live with that, but even subsequent backups are slow. There are no, or limited, mechanisms in place to avoid rechecking certain files and things like that. So all of that really creates a challenge.
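To make the cost George describes concrete, here is a minimal Python sketch of a file by file incremental scan. It is an illustration only, not how any product named in this webinar works. The function name `scan_changed` and the idea of keeping an index of size and modification time per path are assumptions made for the example. Note that every file is still visited on every run, even when nothing has changed, which is exactly why the approach slows down as file counts grow.

```python
import os


def scan_changed(root, index):
    """Walk `root` and return the paths that are new or changed.

    `index` maps path -> (size, mtime_ns) from the previous run and is
    updated in place. Hypothetical helper for illustration: every file
    is still stat()-ed on every run, changed or not.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            signature = (st.st_size, st.st_mtime_ns)
            # A file counts as changed if it is new or its signature differs.
            if index.get(path) != signature:
                changed.append(path)
                index[path] = signature
    return sorted(changed)
```

The first call with an empty index returns every file; later calls return only new or modified files, but the walk itself still touches the whole tree.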
And then the other method, sort of the opposite end of the equation, is this concept of image-based backup. So, Jon, I would say it was maybe five years ago that we started to see a pretty big change in the market to this direction, specifically to take care of the fact that we had these file servers out there with millions and potentially billions of files. And essentially what we do with an image backup is we just ignore the fact that there’s a file structure there at all, and we’re really just protecting things at essentially a one and zero level. It’s just at the block and volume level, right?
Jon: Yeah, exactly. I missed part of your question. The phone got jumped up there. So usually when you say, “Right?” I just say, “Exactly.” It’s how I operate.
George: That’s the problem. So the big challenge with image backup is it really doesn’t provide any granularity. So it’s the inverse, really, of file by file backup. And then it’s almost impossible, really, to manage or categorize the unstructured data within that image because we’ve lost that granularity now.
Jon: We like to use the term “opaque images.” That means you can’t peer into the image because it’s in a proprietary, compressed, encrypted, de-duped format, and at best you’ll get a date range. And you’ll know, “Okay, these are the things I selected to be in that image.” But you have no tools to manage the data inside of that, right? You might have data inside that image that could be removed. It could be archived out, it could be deleted, but you can’t do that because, obviously, you can’t break open that image and start to manipulate it.
George: Yeah. And a question came in that I want to go ahead and address now, and it also gives me a chance to remind people to ask questions. There’s a Q&A box down at the bottom. You just click on that and type in your questions and we’ll answer them either in-line or at the end. We have a good chunk of time saved for questions. So a question came in (I don’t think it actually was a question, I guess it was more of a statement): “But image backups can do individual file recoveries.” And that’s absolutely true. I mean, take nothing away from the guys who leverage image-based backups; they can do very fast incremental updates of those images. And they can actually recover individual files from the images. But you have to look at how they’re doing that. They’re doing that typically through either mounting or peering into the image.
And there’s a big difference, for example, when we’re talking about something like a right to be forgotten. There’s a big difference between recovering and essentially copying something out of the image, versus just actually deleting something within the image. Because, as Jon had said earlier, all those components are interconnected and, like a house of cards, you can’t pull that middle card out and expect everything to work together. Right.
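The house of cards point can be shown with a toy model. The Python sketch below assumes a volume is a list of blocks and each incremental image is a mapping from block number to new contents; real image formats are far more involved, and the name `restore_image` is hypothetical. Reading data out only requires replaying the chain, but dropping a link to get rid of something inside it silently changes other blocks as well.

```python
def restore_image(full, increments):
    """Rebuild the latest volume by replaying increments over a full image.

    `full` is a list of blocks; each increment maps block index -> new
    block contents. Toy model for illustration only.
    """
    volume = list(full)
    for inc in increments:
        for block_no, data in inc.items():
            volume[block_no] = data
    return volume


full = [b"A0", b"B0", b"C0", b"D0"]
inc1 = {1: b"B1", 2: b"C1"}
inc2 = {2: b"C2"}

# Recovering (copying out) works by replaying the whole chain.
latest = restore_image(full, [inc1, inc2])

# Dropping the middle link to "forget" something stored in it
# also reverts block 1 to its stale contents.
broken = restore_image(full, [inc2])
```

Here `latest` holds the current block 1, while `broken` quietly falls back to the old one, so deleting inside the chain is a different operation from recovering out of it.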
So that’s sort of the two basic ways that unstructured data is backed up today. You either do a file by file backup or you do an image-based backup. Jon, in my experience, I actually see vendors offer both, and you need to—or the backup administrator has to—really select between sort of an either/or, correct?
Jon: Yeah, I think there’s a handful of vendors out there, appliance-based vendors typically, that are doing some sort of hybrid offer for snapshots and for file by file. But that’s an interesting way to solve the problem, to say, “Yeah, we’ll master all of it.” You talked about a baseball example. I was a pitcher in college, and Sparky Anderson would hang around the field, which was phenomenal for us. We got a big kick out of it. And I came in as a freshman thinking that I was going to throw eight different pitches and batters were going to be all confused. And the reality is I could only throw one of those perfectly because I was too busy trying to figure out the other seven. So when you start to try and do too many things all in one backup engine, you become a jack of all trades and a master of none.
“When you start to try and do too many things all in one backup engine, you become a jack of all trades and a master of none.”
George: I totally agree. So let’s talk about…I think we’ve set the stage that unstructured data has fundamentally changed. And the current methods, either A or B, file by file or image-based backups, really are wanting, if you will, in protecting this onslaught of unstructured data that’s here now and is just going to get worse. So let’s talk about what we think needs to happen here. So I think the number one thing, Jon, is we don’t really want to give up those rapid, granular backups. I mean, you’re the Vice President of Business Development. You can’t go to market and say, “Hey, we invented this really, really slow way to back up data granularly.” That’s going to be a really hard sell.
Jon: I would say so, absolutely. You’re like, “Okay, I’m just going to go back to the days of TapeWare,” or something like that. And we’re just going to go back 20 years to go forward. There definitely needs to be, and is, a better way to handle the large file sets and the frequency at which you’re backing those up.
George: And I think if you look at sort of the threats and regulations on the horizon, we just can’t give that up. Ransomware and just, frankly, just general DR requirements and, of course, we’ve got the whole GDPR thing that we kind of hit pretty hard. Just all this requires speed, but also granularity. And I think what Jon was hitting there was sort of a journaling-like technique to get the best of both worlds, a fast, especially on updates, file by file backup. So I think that really becomes a key thing to look for as you’re looking at evaluating your current backup solution and potentially looking at a new one, is when it comes to your unstructured dataset, can it really give you the best of both worlds? Can it give you a fast backup while at the same time giving you very specific file granularity? And so those become key.
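As a rough illustration of the journaling-like idea George mentions, the Python sketch below works from a change journal instead of walking the file system. The journal format (sequence number, operation, path) and the function name `paths_to_back_up` are assumptions made for this example; the webinar does not describe how any particular product implements this.

```python
def paths_to_back_up(journal, last_seq):
    """Work out what changed since `last_seq` from a change journal.

    `journal` is a list of (seq, op, path) tuples in ascending `seq`
    order, where `op` is "write" or "delete". Hypothetical format.
    Returns (paths to copy, paths to mark deleted, newest seq seen).
    """
    pending = {}
    newest = last_seq
    for seq, op, path in journal:
        if seq <= last_seq:
            continue  # already handled by an earlier backup run
        pending[path] = op  # the latest operation on a path wins
        newest = max(newest, seq)
    to_copy = sorted(p for p, op in pending.items() if op == "write")
    to_remove = sorted(p for p, op in pending.items() if op == "delete")
    return to_copy, to_remove, newest
```

The work done is proportional to the number of changes since the last run rather than to the total number of files, while the result is still a per-file list, which is the best of both worlds being described.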
The next thing is data intelligence. I think what’s interesting is that the file by file systems are kind of guilty of this, Jon. They have the granularity but they’re just kind of stupid about it. They might know the path, they might know the date modified. Some, though most don’t, might know the date accessed. But they really don’t have a lot of detail about those files, right?
Jon: Yeah, it’s pretty common. I think that the more intelligence you add early on, obviously, the more overhead you’re going to end up with. So I think that there’s been this kind of mantra—and you talked about it, right? The speed. Everybody wants something to go faster. There’s never a scenario in IT where they don’t want something high quality and fast. And in light of that, they try, “Okay, if we have to walk the entire file system in a sequential manner, let’s do it as quickly as possible.” And the way to do that is not to add any type of intelligence to when you’re indexing those files.
George: And then I think the other thing that’s lacking, in that second bullet there, is the concept of being able to tag and do custom tagging to files so that you know the source or purpose of those files. The other thing is that, as you start to add this intelligence, it allows you to start thinking about reducing the amount of primary and even secondary storage capacity you use, because you know what’s there and you know what has been accessed. You know that this never gets accessed, or when it gets accessed. That kind of thing.
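A small Python sketch of what a more intelligent per-file catalog could look like follows. The `FileRecord` fields, the tag values, and the helper names are all hypothetical, chosen only to illustrate custom tagging and last-access tracking; timestamps are plain epoch seconds.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """One catalog entry per file. Field names are illustrative only."""
    path: str
    size: int
    modified: float  # epoch seconds
    accessed: float  # epoch seconds
    tags: set = field(default_factory=set)


def cold_files(records, now, max_idle_days):
    """Return records not accessed within the last `max_idle_days`."""
    cutoff = now - max_idle_days * 86400
    return [r for r in records if r.accessed < cutoff]


def with_tag(records, tag):
    """Return records carrying a given custom tag, e.g. a source or purpose."""
    return [r for r in records if tag in r.tags]
```

With a catalog like this, a question such as “what has not been touched in a year?” becomes a lookup rather than a file system walk, and candidates for moving off primary storage fall out of the same data.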
And I think, Jon, that last one is kind of important. That’s actually a quote from something I say all the time. “Look, if you can’t find the data, you don’t have it stored.” The fact that you actually do have it stored really doesn’t matter because you’ve got to be able to find it. And so the ability to not only find data, but find it quickly, really becomes a key requirement, especially as we delve into the future of all this unstructured data.
“The ability to not only find data, but find it quickly, really becomes a key requirement, especially as we delve into the future of all this unstructured data.”
Jon: Couldn’t agree more. If you don’t know what you have, you have nothing at all. That’s another way to say that.
George: Exactly. I had one slide, GDPR happened, and I was going to have another that said the cloud happened. What I advise people here is when you’re looking at how to use the cloud, especially as part of data protection, data retention, and data management, make sure that you’re approaching it like you’re playing chess and not checkers. And by checkers, I mean that a lot of vendors will say they have cloud support and it’s minimal—and I’m being kind. And they’re just trying to essentially get the checkbox. And there’s other things you can do with it. And so what I typically say happened when I talk about a minimal cloud support is all they’re doing is essentially mirroring everything to the cloud. Which is okay, because at least you got a DR copy and all that kind of stuff. God help you, if you actually have to pull it out of the cloud and pay all the egress charges, but it is there.
What you really want to do is kind of think of the cloud a little differently. Can I use the cloud as part of my retention strategy? Can I reduce the on-premises storage capacity, and really at both ends? What we see happening a lot, Jon, is the growth in primary storage actually being dwarfed by the growth in secondary storage, because secondary storage now is used for so many purposes. And so that ability to reduce both sides of that equation becomes critical.
Jon: I think modern research shows that for every terabyte of data on premises, or every terabyte of data that you aggregate as a source, you’re looking at like 5 to 10 terabytes of secondary storage as a result. And that’s where the whole copy data management world came from. So, yeah, I think that, aside from using your backup data for other things, which we’re big fans of around here at Aparavi, the key is actually the intelligent reduction of storage based on known data or known variables. When you know what you have, you know what you can get rid of.
George: I think that last bullet kind of sums it up. What you want to look for is a solution that can do more than just mirror to the cloud. I mean, mirroring to the cloud is probably a fine first step, but you want to be able to apply some intelligence here and leverage the cloud to reduce your on-prem cost. I mean, if you’re just duplicating your on-prem cost in the cloud, it really doesn’t help you much. So to kind of summarize here, first, I really want to get across that unstructured data just fundamentally changed. The user home directories and user-created data are certainly still critical, and probably in fact more critical than ever. It’s certainly larger than it’s ever been, but it’s being dwarfed.
What we’re learning very quickly is machines are much better at creating data than humans ever will be. And so managing that becomes real critical. And I think that fundamentally, you just need to take a new approach to protection of unstructured data. You need to think of…unstructured data often was sort of an afterthought. Now, it needs to be really top of mind. The frequency of protection is critical, thanks to things like ransomware. And then the granularity of protection is critical thanks to regulations like GDPR. So you’ve got this perfect storm occurring that is really requiring you to rethink things.
“Unstructured data often was sort of an afterthought. Now, it needs to be really top of mind.”
And I think the other big one is the capacities required by on-premises systems really need to be reduced. And the cloud now is very cost effective and is an ideal place to retain this data for a long time. And so having that, at least as an option, I think makes sense for a lot of organizations. So, Jon, you guys are kind of a thought leader. You guys at Aparavi—and of course you personally—are really thought leaders in this space. You really created a company specifically to solve this problem. So why don’t I turn it over to you and let you walk through what you guys are doing and what the long-term vision is.
Jon: I appreciate that, George. Our long-term vision is to solve all the problems you just talked about. In a perfect world, that’s exactly what we would be doing. So a little bit about Aparavi. “Aparavi” comes from a Latin word as a form of operari, which is where we get the word prepare. So literally translated, it is “prepare,” “make ready,” and “equip.” We say it different here stateside than we probably should in true Latin, but we are here to prepare organizations for multi-cloud active archive, and we do so with built-in protection.
So just a quick “about us”: we’re in Silicon Beach, which is the greater Santa Monica area. And we are solely focused on the problem of protection, retention, and archive of unstructured data. The way our team is structured: we’ve got a leadership team. Adrian Knapp: he’s got over 25 years in storage. Rod Christensen: this is his fourth or so protection engine. If any of you guys here have been around for a while, Rod is the gentleman who wrote Yosemite TapeWare. He spent 10 years at Computer Associates on the Arcserve D2D platform. He was at NovaStor Corporation, where Adrian and I were introduced to him, for quite some time, molding products together for the new cloud era.
And really Aparavi was in part his brainchild, knowing all the sins of backup. We set out to build an engine that could start to address those. Myself, I’ve got over 10 years in data protection at startup companies, quite a few here. Victoria Grey is our CMO. She’s got over 25 years in storage and data protection. George and Vicki go way back to the Legato days together. And she was most recently at Nexsan. And now we’re lucky enough to have her on our team. And then Jay Hill, our VP of product, has over 10 years in cloud data management and archives, mainly over at Informatica.
So as you can see here, we have a ton of leadership focused solely on storage and data protection and data management. So I’m not going to explain this slide in depth, guys. This is something that we’ve talked about this entire time, but these are some statistics about unstructured data. Probably what is beneficial to note here is a finding from Enterprise Strategy Group, which was that 85% of archive data is accessed very frequently. The statistic there is actually that enterprises with 1,000 to 5,000 employees are looking for archive data on an almost weekly basis. And 80% of the data has been created within the last two years. So the natural question is, “Why is it in archive?” And I think George answered that already. We need to get data off primary to save cost and to free up capacity. Because data is growing so quickly, we need to get it out to archive and put it out to pasture quicker than we ever have before.
So one of the things that we talked about a lot here is backup. And we really think that backup isn’t the solution in whole. The backups that we’re dealing with today, the engines were built for structured data as the primary data source. So they weren’t built around granularity. They weren’t built to try and learn anything on an individual file by file basis. They were looking at databases specifically, because historically that’s where the most important data had lived, in our databases.
They were designed around on-premises. They were designed around tape with proprietary formats, locked-to-data formats, .MBKs, .CMVTs, whatever it may be. But one of my favorite business stories is I had a customer using a competitor, and to set up an Amazon S3 object storage, you had to add a new tape library first, because they have no other way of recognizing an off-site type of storage repository at the time. It was pretty entertaining.
All in all, they’re not cloud optimized, and what this results in is complex, expensive, and broken backup engines that have really suffered from the add-on economy of “Okay, let’s add a new plugin for this. Let’s add a connector for this. We’ll add a gateway for this.” And when you do that, everything is overly complex. You start to need a master’s in backup to run these systems. That causes massive expense, and ultimately, that complexity leads to the brokenness of “I don’t know what I have.” And as George mentioned, if you don’t know what you have, you didn’t back it up in the first place.
So Aparavi has a better way. We have a multi-cloud protection and retention engine that allows you to dynamically select cloud storage locations. The way that we are writing data in the cloud—and I’m not going to give away a ton of the secret sauce. We’d love to have you guys on a one-on-one demo with one of our technical team members—but the way we’re placing data in cloud allows us to actually dynamically move data to and from different locations. So you could have an original file in Amazon and you could have increments in Google or Azure, what have you. We truly are multi-cloud from that standpoint.
We also built the solution to be logical about removing, grooming, or pruning data. So because we are leveraging some of the benefits of the file by file technology, we know down to a byte level what we have. So the moment an individual file or individual increment is able to be removed, our software can automate the destruction or removal of that data, the deletion of that data out of your secondary, tertiary storage repositories, and then even all the way back down on premises if you want us to.
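The pruning rule Jon describes can be illustrated with a minimal Python sketch. This is not Aparavi's implementation: the `StoredObject` record, its fields, and the rule that only superseded versions are eligible for removal are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class StoredObject:
    """Hypothetical record for one stored file version or increment."""
    path: str             # file the object belongs to
    stored_on: date       # when this version or increment was written
    retention_days: int   # assumed per-object retention period
    superseded: bool      # True once a newer version of the file exists


def prunable(objects, today):
    """Return the objects whose retention has expired.

    Assumption: the newest version of a file is never pruned, so only
    superseded objects are eligible.
    """
    return [
        obj for obj in objects
        if obj.superseded
        and today >= obj.stored_on + timedelta(days=obj.retention_days)
    ]
```

Because the decision is made per object rather than per backup set, a single expired increment can be removed without touching anything else stored alongside it.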
Alongside that, we have a compliant storage optimization, which allows you to… we can call it almost a search and destroy. So in a GDPR scenario, by policy or on-demand, we can destroy that sensitive data and then also provide an audit log showing that it has indeed been removed. Along with the multi-cloud protection and retention, it wouldn’t make sense if we didn’t have multi-cloud recovery and retrieval. So we do allow you to recover to a specific point in time, similar to snapshot-based technology that allows point-in-time recovery.
What’s really beneficial about this is you don’t need to know where your data is; you just need to know when it was. And our software is going to manage the retrieval of that, or the recovery of that, depending on what’s going on in your infrastructure from those multiple different cloud locations. This is really important when you’re talking about 7 to 10 years of retention, turnover within IT, new leadership coming in and saying, “Amazon is dead. It’s all about Azure,” or things of that nature where you as a sysadmin or IT have to start to move and manipulate data around.
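The "know when, not where" idea can be sketched in a few lines of Python. This assumes a simple catalog of file versions, each tagged with a capture time and a storage location; the `Version` record and `resolve_point_in_time` function are hypothetical names and do not describe Aparavi's actual index.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Version:
    """Hypothetical catalog entry for one captured version of a file."""
    path: str
    captured_at: datetime
    location: str  # free-form label such as "amazon" or "azure"


def resolve_point_in_time(catalog, when):
    """For each path, pick the newest version captured at or before `when`.

    The caller supplies only a timestamp; the location of each chosen
    version comes back from the catalog.
    """
    chosen = {}
    for version in catalog:
        if version.captured_at > when:
            continue  # captured after the requested point in time
        best = chosen.get(version.path)
        if best is None or version.captured_at > best.captured_at:
            chosen[version.path] = version
    return chosen
```

A file that did not yet exist at the requested time is simply absent from the result, and versions of different files can come back from different locations in the same call.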
The big benefit here, guys, between multi-cloud protection, retention, multi-cloud recovery and retrieval is we have no requirements for you to actually pull the data back down on-prem before you put it back up to a new cloud. There is no reason to say, “Okay, I’m using cloud for storage economics. Oh, but if I wanna go to a new cloud, I have to pay all the egress fees in the world to bring the data back down on-prem. Oh, by the way, I have to have on-prem storage that can fit all my data.” It’s just not realistic. So we are true inter-cloud connective.
And the last thing as we talk about proprietary formats, Aparavi has a documented data format. So on our website in our resources section, you can actually get the documentation for our data format. This can be used to build connections between other backup tools. So in 10 years, if Aparavi has changed data formats, which is common… If any of you guys have been in data protection for a while, you know that as new, more efficient technologies come out, companies will change their data formats to comply, and oftentimes old versions of data can’t be read by new versions. So this documentation will completely eliminate that. And we also are in the process of standardizing and open-source-publishing a reader that will be available in public domain as long as open-source is around.
So to kind of manage all of those powerful tools, we built a data-centric retention policy, and this allows for both content and metadata classification. So as data is being ingested into the system, we can classify it based on its metadata, but we also index it based on its content as well. And so this allows you to set data policies based on regions, based on data types, based on business importance. So all finance data needs to go to a five-nines cloud, whereas if it’s just generic documents, things like that, we’re gonna send it to somewhere a bit cheaper, for example.
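A rough Python sketch of classification-driven placement, under stated assumptions: the keyword list, the extension rule, the label names, and the two target names are invented for illustration and are not Aparavi's predefined classifications or policy syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical view of a file at ingest time."""
    name: str
    extension: str  # metadata
    text: str       # extracted content


# Assumed rules: content keywords and a metadata check.
FINANCE_KEYWORDS = {"invoice", "payroll", "ledger"}
SPREADSHEET_EXTENSIONS = {".xlsx", ".csv"}

# Assumed policy: ordered (label, target) pairs; first match wins.
POLICY = [("finance", "high-availability-store")]
DEFAULT_TARGET = "low-cost-store"


def classify(record):
    """Return the set of labels derived from metadata and content."""
    labels = set()
    if record.extension.lower() in SPREADSHEET_EXTENSIONS:
        labels.add("spreadsheet")  # metadata-based label
    lowered = record.text.lower()
    if any(word in lowered for word in FINANCE_KEYWORDS):
        labels.add("finance")      # content-based label
    return labels


def target_for(record):
    """Choose a storage target from the first policy rule that matches."""
    labels = classify(record)
    for label, target in POLICY:
        if label in labels:
            return target
    return DEFAULT_TARGET
```

Keeping classification separate from the policy table means new rules, for example by region or business importance, can be added without changing how files are labelled.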
Along with that, we talked about search and knowing what you have. So our advanced archive search allows you to search through content itself as well as the classification that you may have applied. So we have a basic set of predefined classifications that follow best practices, finance data and things like that, accounting. But we also allow you to specify your own. So in media and entertainment, for example, a production house needs to know every film that they have that has Meg Ryan show up in it.
And that’s because in the future when different milestones happen with Meg Ryan, for example, they need to be able to discover that as quickly as possible to begin to formulate information around it. Again, or in a GDPR scenario, if, “Hey, I did business with you five years ago. Forget all my data.” You’ve got to be able to discover that. So we can discover it by your file name, the actual content. If “George Crump” shows up in the actual document somewhere, we’re able to discover that for you guys.
Along with that, we have storage analytics built in. So this allows you to track usage and trends along your storage so you can see what data is growing where, how your cloud storage is being utilized, what data is being saved by pruning. We also added in a monitor there for percent change. So we talked about ransomware, the percentage of the files that you have that have changed. That is a good indicator that something is going on. So you can actually set an “if this, then that” threshold. So if 80% of your data changes on an individual server, for example, you can actually stop the motion of data out to cloud to ensure that you stop the copying of bad data.
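The percent-change threshold can be sketched as follows. This is an illustrative Python example, not Aparavi's monitor: it assumes each scan produces a mapping from file path to content hash, and the function names and the choice to count deleted files as changed are assumptions.

```python
def percent_changed(previous, current):
    """Percentage of previously seen files whose hash differs now.

    `previous` and `current` map file path -> content hash. A file that
    has disappeared counts as changed (an assumption of this sketch).
    """
    if not previous:
        return 0.0
    changed = sum(
        1 for path, old_hash in previous.items()
        if current.get(path) != old_hash
    )
    return 100.0 * changed / len(previous)


def should_pause_upload(previous, current, threshold_percent=80.0):
    """Return True when the change rate meets or exceeds the threshold."""
    return percent_changed(previous, current) >= threshold_percent
```

With five files tracked, four changed hashes gives 80% and the upload would be paused; a single changed file gives 20% and the upload proceeds.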
We talked a second ago about this, but full auditing, reporting, logging is available, and you can export those as CSVs. We also have a full RESTful API that gives you access to everything. And then one of the great things about our policy engine is that it is a global, fully automated policy engine, meaning, as you introduce new machines, as soon as they check in to the Aparavi platform, which is a web-based HTML platform, it will automatically inherit, adopt, and execute the policy that has already been set from the infrastructure. So it makes scaling really straightforward.
So here’s a quick look at the architecture. And this is where we’re going to start to address some of the granularity and rapidity that George was talking about. We have three storage operations that we’re capable of doing. You’re able to do those as a…you can do some of them, you can do all of them. It doesn’t necessarily matter to us, but it matters to you from a business outcome standpoint.
We have a web-based platform that’s actually hosted by Aparavi. So it’s a software as a service-delivered solution. So you don’t actually have to go out and stand up a web server to host our platform. That capability does exist. If you want to do that on your own, you absolutely can. But on-prem at the location, you’re going to deal with our software appliance, which is bring your own hardware, guys. We don’t want to tell you where you can put storage or what limitations you have. So we fully support all supported versions of Windows, as well as common distributions of Linux, both on the source and on our software appliance. The source itself, guys, we can do CDP-style checkpoints there. So if you want to run a policy that says, “Grab all new and changed data every one minute,” we can give you that type of granularity.
And what it’s going to do is use the local direct-attached storage or network-attached storage on that individual server as a temporary recovery cache. What will happen next is you’re going to say, “Okay, every hour or maybe every day,” depending on your recovery points, you’re going to say, “Do a file by file Aparavi snapshot to our software appliance.” What it will then do is, once we have verified that the data is on the appliance and it is on the storage on that appliance, it’s going to go ahead and remove all of those CDP checkpoints because, again, we have the data. We verified that your data is there. It no longer needs to be on the source. So we clean up behind ourselves.
And we do the very same thing with archives. So when you define your archive or your long-term data retention policy, you define your cloud providers within that, you get to define how many versions of snapshots you actually want to keep on that appliance. So, again, we’re cleaning up behind ourselves so the data can be sent out to that cloud location, and then you start to remove the snapshots once everything’s out and verified.
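The "copy, verify, then clean up" pattern across the three tiers can be sketched in Python. This is a simplified illustration rather than Aparavi's engine: plain dictionaries stand in for the source cache, the appliance, and the cloud, SHA-256 comparison stands in for verification, and the `promote` function and its `keep` parameter are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content hash used to verify a copy."""
    return hashlib.sha256(data).hexdigest()


def promote(source, destination, keep=0):
    """Copy every item to `destination`, verify it, then clean up `source`.

    `source` and `destination` map item name -> bytes, ordered oldest
    first (Python dicts preserve insertion order). The newest `keep`
    items stay in `source`; older items are removed only after their
    copy has been verified. Returns the names removed from `source`.
    """
    names = list(source)
    cutoff = max(0, len(names) - keep)  # items before this index may go
    removed = []
    for index, name in enumerate(names):
        data = source[name]
        destination[name] = data
        verified = digest(destination[name]) == digest(data)
        if verified and index < cutoff:
            del source[name]
            removed.append(name)
    return removed
```

Calling `promote(checkpoints, appliance)` empties the source cache once everything is verified on the appliance, while `promote(appliance, cloud, keep=2)` sends everything to the cloud but leaves the two newest snapshots on the appliance.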
There’s a question here that I wanna quickly answer. There are two of them. “What clouds do you support?” I’ve got a slide for that, don’t worry. And then the question is, “Do you store data on-premises, as well as in the cloud?” And yeah, we absolutely do. So we have the capability for checkpoints, snapshots, and even archives to be held on-prem. So if you guys are leveraging a headquarters location data center, you can define the path right out of our solution for archives for where you want those to go. Or if you’re using some of the private cloud object storage that is S3 compliant, your Caringos, your Cloudians, your Scality, etc., you can even target those using our generic S3 creation.
So here’s the slide on the ecosystem of things we support, operating systems we support. And from an operating system standpoint, we follow the Windows guidelines there. So if it’s supported by Windows under their standard support agreements, it’s supported by us. Linux, including Ubuntu, Red Hat, Fedora, Debian, all of those are supported, including from the software appliance standpoint. So those can live on Linux. They can live on Windows. It’s up to you guys. From a cloud support standpoint, we support Google Cloud, and we support Amazon AWS S3. We support Azure, and we also just certified against IBM Cloud as well.
But one of the beauties here is this isn’t the end-all-be-all from here. Any organization that is using the Amazon S3 storage API, we actually have support for you to select a generic S3 object store. And you can define your credentials, your buckets, or we can create the bucket for you but you define your access code, secret key, and we’re able to create those for you inside of the platform directly. There is really no limitation here. We didn’t want to say we’re giving preference to certain clouds over others. From a certification standpoint, we’re certified with Wasabi, Caringo, Scality, Cloudian, as well as the IBM Cloud. You guys will see a press release of our certification with IBM here shortly.
And a quick use case here, this was an engineering firm. This engineering firm is using a Veeam tool. They’re using images. And from a forecast standpoint, you know, they saw their data was growing and how their data was growing, and by doing about a 60-day proof of concept with Aparavi on one of their machines, they figured out they were able to predict a 50% reduction in storage across the first 7 years and then after that a 75% reduction.
And this is by using Veeam for what it does best, which is availability and continuity of your virtual machines and then relinquishing control of the data itself to Aparavi and allowing Aparavi to use the way that we’re putting data in the cloud as well as the way we’re able to remove data out of cloud the moment its retention policy expires. We greatly reduce that storage requirement there on that end. So just as a quick overview, I’m going to leave this slide up here. Aparavi is a software as a service based solution. You’re able to grow as needed all the way down to the gigabyte level, if you want to start small, or you can grab a larger terabyte-based plan.
From a secondary storage growth standpoint, through our pruning, we’re able to reduce storage by up to 75%. Huge cost savings. From a location standpoint, we’re built for a hybrid or multi-cloud infrastructure. So we’re able to allow you to store data on premises, to store data in the cloud, to use a combination of those, as well as be able to thread data through multiple clouds. And our open data format is designed to give you not only the storage independence, but we also want to make sure that you can recover data, and we also want to really end that software and cloud vendor lock-in. We don’t believe that you should be locked in to Aparavi just because you bought us 10 years ago or you should be locked into Amazon S3 because that’s how you started in cloud.
So that’s what I have for you. There’s a couple of little blurbs here, one from George. I’m not going to go through and read those directly. But Aparavi is a startup as we talked about. We are a company that is designed solely with active archive in mind, meaning you know what you have in your cloud storage, you’re reducing your primary location, and you can go back and get that data without having to wait days for tapes to be indexed, things like that. But we’ve got a couple questions here that I wanted to address. George, do you wanna kind of prioritize those? I know we’ve only got a couple more minutes here.
George: Yeah. Moving forward to that, let me just tell folks, you can ask some more questions. If we don’t get to them, we’ll answer them offline. Also, the attachments section has more detail on really everything we talked about, including a couple of ChalkTalk videos with both Jon and me in them. So feel free to take advantage of those. And so, yeah, let’s go ahead and knock through a couple of these questions here. Let’s start with this one, Jon. “What operating systems does Aparavi support?”
Jon: Yeah, so just to go over that again, we’re supporting the currently supported Windows server and workstation infrastructures there. We support virtual machines as well. At present, we do that via best practices at the actual guest level itself. So to provide that granularity in that index, we actually place a transparent agent on the guest itself in order to manage that. Veeam actually works similarly when you start to work with their cloud solutions, so they run an agent individually. It’s just a lot more efficient to do it that way. And then from a Linux standpoint, we support all common distributions of Linux.
George: Okay. Another one here, and I think you kind of hit this one already. But just to clarify, “Can I use my own storage instead of cloud?” And I guess you support anybody that has object storage, that has S3 support. It looks like it’s something that you can work with, right?
Jon: Yeah. If you want to present your own storage as object storage, you can absolutely use our interface to create that if it’s private cloud. It just needs to be S3 (Simple Storage Service) API compliant with V4 authentication. That’s the kicker there. And we’ll work with you guys depending on what you’re doing to get those configured if necessary. But as I mentioned, you can actually also define your own network paths so you don’t have to go out to cloud per se.
George: Good. Jon, why don’t you go ahead and prioritize a couple more that are in the queue there. And while Jon is doing that, I’ll put up on the screen for everybody contact information, Twitter feeds and handles, as well as where to find us on LinkedIn and YouTube. So if you have any needs for more information there, and again, I want to remind you about the attachments section. You can get more detail there. So let’s see. Go ahead, Jon. You want to go and just take one?
Jon: Yeah, absolutely. I think one that we haven’t answered yet throughout the presentation was, “Can users do their own recovery with the user recovery interface?” That is 100% up to the organization that’s administering it. So we’re fully multi-tier and multi-tenant capable. So if you want to give an individual user, maybe a power user or a user at a remote site, for example, the ability to do retrievals and recoveries from within the actual web-based UI, you would be able to grant that through our system. They would log in with their own login and password and have access to just their own individual data to be able to do that retrieval or recovery. So absolutely, we do support that.
George: Okay, and another question came in just real quick is, “Will the slide deck be available after the webinar?” Yeah, we’ll upload the PDF as soon as we’re done. And you can come back in the same link. You don’t need to reregister, and you don’t even need to listen to our presentation again unless you want to, and then right on the attachments there, there’s a link to click. Jon, why don’t you take one more, and then I’ll go ahead and wrap things up.
Jon: Yeah. Last one is, “Do we do trials?” Yeah. We do proof of concepts, absolutely. We can do 30 days and it’s actually extendable, if necessary. We handle those. No limitations, no fee on that whatsoever.
And then, “How do we bill?” This is a fun one. So we actually have a pricing page on our website where we’re fully transparent, but we bill based on source data protected. So you’re only being charged once for your data, despite the fact that we’re doing three different levels of protection. We are not brokering any of the clouds. We’re not making money on the cloud storage, so you need to bring your own cloud. But we bill on a per gig, month to month basis, on a pay as you grow plan. That’s kind of the most expensive price point. Or you can do an annual plan if you know like, “Look, I’ve got 10 terabytes of source data. I’ve got 50 or 100 terabytes of source data that I need to protect.” We would bill based on that, and that way we’re not charging you based on how efficient de-dupe is or how many versions of files you guys keep. That doesn’t make sense. So it’s solely based on what you have.
George: Okay, great. Well, Jon, thanks for your time today. It was a good conversation. I also want to thank everybody for tuning in. Great questions. And I also want to remind you that the attachments are there for your use. There’s also in the upper right-hand corner of your player, there’s a little email list button. If there’s somebody, a colleague or friend in another company maybe, that you think might benefit from seeing this, please email them a link. They probably would rather get an email from you than from me. So go ahead and send it and let them know it was good.
Last thing is before you leave the presentation, there’s a section there to provide some feedback. It’s a five-star rating system. You can even type in a nice comment if you want to. One-star being not so good, five-star being great. So we appreciate five stars. Jon, again, thanks for joining us today on the webcast.
Jon: Absolutely. Thanks for your time, George.
George: No problem at all. Again, thank you all for tuning in. For now, though, I’m George Crump, Leading Analyst for Storage Switzerland. Thank you for joining us today.