Think backups will have you covered if you need to recover your data for legal or compliance purposes? Think again!
In our recent TruthinIt webinar with Mike Matchett, Principal Analyst of Small World Big Data, our CTO Rod Christensen explains how to reduce risk when it comes to compliance and get the most out of unstructured data archives. Learn how Aparavi’s open data format can enable e-discovery solutions to read data without having to retrieve it from multiple storage locations.
Dave: Hey, Dave Littman. Welcome to today’s webcast, “Myth: I’ll Just Use My Backups for E-Discovery—Busted!” We are going to tell you today why you should not use your backups for e-discovery. And in just a second, I’m gonna be bringing on Mike Matchett, who is CEO and Principal Analyst with Small World Big Data, as well as Rod Christensen, who is CTO with Aparavi.
But before we do that, a few housekeeping tips. Today’s webcast will go just about 20, 25 minutes. You’ll notice there is a Q&A panel beneath the video player, so just enter your questions, we’ll get to them at the end. On the bottom of the video player, there are two controls I want you to know about. There is a volume button that you’ll see, and I’m not sure which corner it is, but we’ll point that out. If we can point that out, that’s great. And then we have a closed caption, so let’s point that out, and if you want, you can toggle those captions on or off. So if you can’t get audio where you are, if you can’t listen, you can play the captions and show those. So as you know, we’re doing an Amazon gift card giveaway. We’re not going to disrupt our conversation, so we’re going to wait until the end, and we’re going to put up a button; we’re going to just overlay it here on the video, so click that. It will take you to the giveaway, and you’ll find out instantly whether or not you’ve won. You can also enter our next webcast for another giveaway and another chance to win. So that’s pretty much it. Please keep your questions coming. And for now, let me hand things over to Mike and Rod. Mike.
Mike: Thanks, Dave. Hi, Rod. Welcome.
Rod: Hi, Mike. Thank you.
Mike: So we were talking about the uses of backups, and what are good uses and what are bad uses. And I’ve noticed in the industry, when I talk to a lot of people, that they use backups for a lot of different things. And some of them are appropriate, some of them may have been appropriate 20 years ago, but today, it’s just not going to cut it. And one of those things that I’ve noticed that people are trying to do with it is their legal protection, their e-discovery, their compliance, and their regulation kinds of controls, the data governance that they have to provide. And they say, “You know, if I have to have a copy, I’ll just point to my backup. If I have to recover something for e-discovery, I’ll just point to my backup and reload it.” And that can’t possibly work today for a large number of reasons. So let’s dive into that a little bit. From your perspective, maybe you just start by telling me a little bit about what you think you should do for e-discovery. What is controlled in that use case that means that backups aren’t good and we should be looking for other solutions?
Rod: Well, one of the things that really is important to e-discovery is, right up front, you should know exactly what you have. It’s very important when you actually undertake the e-discovery process that you know, actually, what you’re looking for or approximately where you are going to find it. I mean, when companies were using backup and specifically tape for responsive documents for e-discovery, you’ve got a real problem because you may have hundreds of tapes to look through to see if the document you want is there. How do you get at that data? So the first thing you have to do, really, to have a successful e-discovery process is knowing what you have upfront and being able to identify what you have before you actually start looking for it.
Mike: We talked about today’s IT environments not only being larger, lots more data, of course, its data problems, a lot more systems, but distributed, data is living in a large number of places. So knowing what you have, and then I would even posit, knowing where it is in that environment.
Rod: Yeah. Best practices for servers these days is you don’t set up one single server running 22 different services to run your company. That’s just not the way you do things. Usually, things are on virtual machines, you have a purpose-designed server, you’ll have a SQL server, you’ll have a RES server, you’ll have your web server, your financial server, and they’re all doing one task, because you really don’t want to mix up your servers with multiple tasks, because then it’s impossible to load balance things and actually move your workflow around. So when you start doing things like that, it may seem like it’s easier to do e-discovery because things are segregated.
However, the problem is that you have data all over the place. Now, you have many, many, many servers to actually look at instead of one back-up tape. The way it was back in the ’80s, you had one back-up tape of your server, you recovered it with some other server, took a look for what you needed. That’s impossible now. Now, you have tapes all over the place with data spread all over the place, and just trying to find something that is responsive is a real big issue.
Mike: All right, so we’re talking the first problem then really being about knowing what you have, knowing where it is. I think the next problem probably is how would I get a backup out of a backup unless I restore a backup and maybe the latest backup at some point in time copy, right? And I have to have that in a stored environment. The problem, I think, even gets worse, though, because now we do incrementals and differentials and stuff. So what am I trying to even restore?
Rod: Yeah. That adds a completely different level of complexity there because now you have to take your base back-up tape, restore your latest differential and all the incrementals on top of that. So now, you’re not just dealing with one tape to recover one point in time in a server, now you’re actually across multiple tapes, maybe five, six, seven, maybe 30 tapes to actually get that full data set back. And then, the problem is you don’t even know if what you’re looking for is in that set. You need actually to restore the set and try and find it.
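The restore chain Rod describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Backup` record, the `restore_chain` function, the integer day timeline, and the tape labels are all hypothetical, and the sketch assumes each incremental captures changes since the previous backup of any kind, which not every backup product does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int    # day the backup was taken (hypothetical integer timeline)
    kind: str   # "full", "differential", or "incremental"
    tape: str   # label of the tape holding this backup


def restore_chain(catalog, target_day):
    """Return the backups, in restore order, needed to rebuild the state as of target_day."""
    usable = sorted((b for b in catalog if b.day <= target_day), key=lambda b: b.day)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup on or before the target day")
    base = fulls[-1]                                   # most recent full backup
    after_base = [b for b in usable if b.day > base.day]
    diffs = [b for b in after_base if b.kind == "differential"]
    chain = [base]
    start = base.day
    if diffs:                                          # latest differential replaces earlier incrementals
        chain.append(diffs[-1])
        start = diffs[-1].day
    # every incremental taken after the base or differential must also be applied
    chain.extend(b for b in after_base if b.kind == "incremental" and b.day > start)
    return chain


catalog = [
    Backup(0, "full", "T01"),
    Backup(1, "incremental", "T02"),
    Backup(2, "incremental", "T03"),
    Backup(3, "differential", "T04"),
    Backup(4, "incremental", "T05"),
    Backup(5, "incremental", "T06"),
]
print([b.tape for b in restore_chain(catalog, target_day=5)])
```

Even in this toy catalog, one point in time on one server needs four tapes, and nothing in the chain says whether the document being sought is on any of them.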
Mike: Right, because I think if we take these last three problems and we coalesce them together and realize if I’ve got an e-discovery target where I’m going to try to go find a particular set of data that matches a certain pattern or certain phone number, social security number, name, whatever that thing is, I’m kind of hunting through a big pile of hay for that needle. And if I’m relying on backups, I have to restore a lot of backups sometimes to find it, right? I mean, if I’ve got distributed systems, I got thousands of VMs, I don’t even know if I could restore all that.
Rod: Right. It’s an exponential problem, and it’s only gonna get worse.
Mike: And we didn’t even talk about the fact that backups are a point-in-time copy. I don’t know when in time I need to go research something, so I might have to get multiple backups for the same system to make sure I’ve covered the landscape, right?
Rod: Right. And really, in order to be fully compliant with an e-discovery and return all your responsive documents, you actually have to restore every doc or every tape and every incremental and differential on that tape to see if it was there, because one week it may have been there and the next week it was gone. So you can’t just look at, “Okay, I’ve got a yearly tape, let’s see if it’s there.” That just doesn’t work.
Mike: So I think it’s pretty clear that backups aren’t gonna work for at least the e-discovery use case. We talked before about a lot of other use cases that backups just aren’t suited for anymore, and you can extrapolate. So what does work for e-discovery? What should someone start to be thinking about doing to really be able to serve their e-discovery needs?
Rod: Well, the first thing, Mike, I mentioned previously. The first thing you really need to do is figure out what you have upfront. If you wait until you actually need it, it’s way, way, way too late. You’ve got a huge task in front of you to try and figure that out, as we’ve discovered with backup. So one of the things that Aparavi does is as it’s actually storing these files into the archive, it actually figures out what you’ve got. It looks at the words, the content, tries to figure out the numbers in the content, tries to look for social security numbers and phone numbers and addresses and things like that to try and come up with a classification for that document. The classification for that document could be legal, it could be confidential, all sorts of different configurations and combinations that you can make to try and recognize what you have in the data stored.
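As a rough illustration of classifying a document at ingest time, here is a minimal Python sketch. It is not Aparavi's classifier: the patterns, the keyword lists, and the `classify` function are assumptions made up for this example, and real detectors for social security or phone numbers need much more validation than a single regular expression.

```python
import re

# Illustrative patterns only (assumed US-style formats).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

# Hypothetical keyword lists that map vocabulary to a classification label.
KEYWORDS = {
    "legal": {"subpoena", "plaintiff", "defendant", "litigation"},
    "confidential": {"confidential", "proprietary"},
}


def classify(text):
    """Return the set of classification labels that apply to the text."""
    labels = set()
    for label, pattern in PATTERNS.items():
        if pattern.search(text):
            labels.add(label)
    words = set(re.findall(r"[a-z]+", text.lower()))
    for label, vocab in KEYWORDS.items():
        if words & vocab:          # any keyword present in the document
            labels.add(label)
    return labels


sample = "CONFIDENTIAL: employee SSN 123-45-6789, call 555-867-5309"
print(sorted(classify(sample)))
```

The labels are computed once, when the file is stored, so a later request such as "all files containing social security numbers" becomes a lookup rather than a restore.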
Then, once it’s all classified, you can actually do a lot of things with it. When you classify it, you can search for those files by classification. So, “Give me all the files that have social security numbers in them,” for example. Now, that’s great. But Aparavi actually takes it one step further. The next step is actually storing the words that are in the document, all the key pieces of information about the document, so that we know what the content of that document is even though it’s out on an archive somewhere, even though it’s in the cloud. So we can actually perform searches against a social security number or a phone number or an address or a name or something like that, and it’ll tell you, “This is what’s in the document. Is this the one you want to recover? Is this the one you want to bring back and take a look at?” Not only that, but it gives you the context of what you’re searching for too. So if you have what looks like a social security number, then it tells you words before that and words after that so you can actually see, “Well, yes. This is actually a responsive document because I’ve got some context of what I’m searching for.”
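The "words before and words after" idea can be shown with a small keyword-in-context sketch in Python. The function name `matches_with_context`, the three-word window, and the sample sentence are assumptions for illustration; the source does not say how much context the product returns.

```python
import re


def matches_with_context(text, pattern, window=3):
    """Return (match, words_before, words_after) for each regex match in text."""
    results = []
    for m in re.finditer(pattern, text):
        words_before = text[:m.start()].split()
        words_after = text[m.end():].split()
        before = words_before[-window:] if window > 0 else []
        after = words_after[:window] if window > 0 else []
        results.append((m.group(), before, after))
    return results


doc = "Payroll record for Jane Doe with SSN 123-45-6789 effective January 2019."
hits = matches_with_context(doc, r"\b\d{3}-\d{2}-\d{4}\b")
for match, before, after in hits:
    print(" ".join(before), "[" + match + "]", " ".join(after))
```

Seeing the neighboring words lets a reviewer decide whether a hit is responsive before anything is pulled back from the archive.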
Mike: So when I’m looking for that needle, the idea that I can search on a pattern, I can search logically, I can search words around the things I’m looking for, really comes into its own in terms of cutting right to the chase on that?
Rod: Right. Think of Google Search for your archives. The nice part about it is that you don’t have to bring that data back from the cloud. If you’re storing it in the cloud, one of the biggest problems in the cloud is the egress fees, bringing back stuff to actually figure is this what you want? It has the same problems as backup, by the way, except it can be a lot more costly if you have to bring several terabytes of data back on site just to find out, “Hey, that’s not what I needed.” So what we do is we store those indexes and the content, the word content, the indexes on the local system, so that when we do need to go back and get it, we would find a document that you actually want to recover, we consult that local database to do the searching. And then when you want to recover it, then it pulls the document down from the cloud.
Mike: All right. So let’s just step back a little bit. I think you’re wandering into architecture and I want to understand this. So Aparavi is archiving this data or indexing the data. The data is not necessarily all living on site, it’s getting archived. This is data that’s distributed, living across different places, but you’ve got this efficiency—I don’t want to call it a caching layer—but the metadata, if you will, for the indexing is being kept locally.
Rod: Right. We recognize each and every word in a document, not only English, but all the different languages. So what we do is we actually take a look at a document, we pull all the words out, we index them, and so we know exactly what words are in a document. We store those locally, what documents have which words, so then you can perform local searches against it rather than consulting the document in the cloud.
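A word-to-document index kept on the local system, with the documents themselves stored remotely, might look roughly like the Python sketch below. `LocalIndex` and `RemoteArchive` are hypothetical stand-ins written for this example, not Aparavi components; the fetch counter exists only to show that searching never touches the remote store.

```python
import re
from collections import defaultdict


class LocalIndex:
    """Inverted index held on site: word -> ids of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # \w+ matches Unicode word characters, so this is not limited to English
        for word in re.findall(r"\w+", text.lower()):
            self._postings[word].add(doc_id)

    def search(self, query):
        """Return ids of documents that contain every word in the query."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        result = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            result &= self._postings.get(word, set())
        return result


class RemoteArchive:
    """Stand-in for cloud storage; counts fetches to show when egress would occur."""

    def __init__(self):
        self._blobs = {}
        self.fetches = 0

    def put(self, doc_id, text):
        self._blobs[doc_id] = text

    def get(self, doc_id):
        self.fetches += 1
        return self._blobs[doc_id]


docs = {
    "doc-1": "Contract renewal terms for the Denver office",
    "doc-2": "Quarterly sales summary",
    "doc-3": "Renewal reminder: contract expires in March",
}
index, archive = LocalIndex(), RemoteArchive()
for doc_id, text in docs.items():
    index.add(doc_id, text)
    archive.put(doc_id, text)

hits = index.search("contract renewal")
print(sorted(hits), "remote fetches so far:", archive.fetches)
recovered = archive.get(sorted(hits)[0])   # only the chosen document is pulled back
print("remote fetches after recovery:", archive.fetches)
```

The design point is the split: the search cost stays local, and a remote read happens only for a document someone has already decided to recover.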
Mike: And then the normal archiving function takes those documents according to what policies and whatever, and puts them up into different tiers of storage, but you know what’s in that document before you’ve even done that, right?
Mike: And you’re able to track versions and do all sorts of great things like de-duplication and all sorts of stuff like that on your side?
Rod: Right. So the local system, actually, keeps track of everything it needs to in order to recover the document and do searches and finds against that document. The documents themselves are actually stored up in the cloud or in an archive data center or wherever you want to put them. But for the most part, they’re stored off site. That’s the purpose of an archive.
Mike: Yeah. It makes it a cost capacity decision separate from the usefulness and the efficiency of doing the e-discovery or the other analytics you might want to do on that.
Rod: Doing e-discovery on your archives, since everything is kept locally, the indexes are kept locally, is a no-cost operation.
Mike: That’s awesome. We also talked a little bit about some of the security things that are going on around this. If I’ve got my corporate data and I’m trying to put it somewhere, tell me a little bit about the security, the roles, the protections that someone might have about getting just at everything. Because, obviously, if you give me an index of all the words of all your documents in it, there’s probably some sensitive stuff in there.
Rod: Yeah. What we call the clients or the server to actually be backed up, it never sends data out unencoded. It never leaves that physical perimeter of the client itself without being completely encrypted, and only the client really knows the key of how it’s encrypted. So it’s actually the clients that participate in these searches. And the appliances… no data ever goes into the Aparavi system, no data ever crosses our boundary. It’s in your server, your data center, and the cloud. Now, when we store it in the cloud, everything is fully encrypted with AES encryption. So even if somebody did manage to hack your S3 account or your Google account or something like that, all they’re going to see is a bunch of numbers. Metadata is encrypted; it’s going to be just a bunch of garbage to them. So security within Aparavi… when we did the architecture, we made the choice not to actually send data through the Aparavi system. It’s between the appliance and the onsite data center to the cloud itself.
Mike: Awesome. And yet, when we put things in the archive, there’s some access that… I mean, let me step back. If I use some other vendor’s systems to take data and tier it to the cloud and it gets encrypted and put up in the cloud, and somehow I don’t want to pay that license fee or gotta pull out of that or something, I can be really messed up, because that data up there has to come back through that vendor system in order to be useful. I think when we were talking with you guys, you have kind of an open data approach to how that data on the backend can be accessed and stored that, I think, a lot of people would find attractive. So maybe you can explain that a little bit.
Rod: So we have, actually, two things. We have open data access and open data format. The first thing is open data format, which completely documents the format of what we’re actually storing in the cloud. It doesn’t give you access to it because you actually need your security keys in order to get to it. But once you have your security keys, you can actually read the data directly yourself. On top of that, we provide what we call the open data access layer. And what the open data access layer does is it essentially presents a volume that you can mount with some commands, and those volumes use the open data format in the archives to access the data that’s in the archive itself. So it looks like you have a live disk there, even though it’s sitting in our archive. The best part of it is it can go across multiple clouds. So if this file was on this cloud and this file was on another cloud, it will stitch everything back together and give you one unified view of your file system at that point in time.
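The "stitching" of files from several clouds into one view at a point in time can be pictured with the Python sketch below. It does not reflect the actual open data format, which the conversation does not describe; the per-cloud manifests, the integer timestamps, and the `unified_view` function are all assumptions for illustration.

```python
def unified_view(manifests, as_of):
    """Merge per-cloud manifests into one path -> (cloud, version_time) view as of a point in time."""
    view = {}
    for cloud, entries in manifests.items():
        for path, version_time in entries:
            if version_time > as_of:
                continue                       # version did not exist yet at that time
            current = view.get(path)
            if current is None or version_time > current[1]:
                view[path] = (cloud, version_time)   # keep the newest eligible version
    return view


# Hypothetical manifests: each cloud lists (path, version timestamp) pairs.
manifests = {
    "cloud_a": [("/finance/q1.xlsx", 10), ("/hr/policy.docx", 12)],
    "cloud_b": [("/finance/q1.xlsx", 20), ("/legal/nda.pdf", 15)],
}
for path, (cloud, version_time) in sorted(unified_view(manifests, as_of=18).items()):
    print(path, "->", cloud, "version", version_time)
```

A mounted volume built on a view like this would show one file tree, while each read is routed to whichever cloud holds the chosen version.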
Mike: We certainly didn’t want to call you guys a multi-cloud file system company, but you got that functionality on that backend with the archive now, letting you make, again, a cost capacity decision to where you wanna invest to store the data without having to worry about the accessibility of it.
Rod: Right. It’s actually pretty cool the way it turned out. I’m really happy about it, because what that allows you to do is kick up an EC2 instance up in the cloud, mount a file system, and actually start doing e-discoveries without ever bringing anything back down to onsite. So no longer do you have those egress fees. All you’re paying for is how much compute power you’re actually using to go through the archives itself.
Mike: And while not locking it into some proprietary format that someone has to be afraid that they put petabytes in their archives and then someday something goes wrong and they don’t have access to any of it…which is, I think, just a key feature there. So we were talking a little bit about performance of this and how it performs and so on. Maybe you can give me an idea of just what the scale that Aparavi Archive Solution can grow to. I mean, what can we get to in terms of size? Because, there’s little corporations, there’s big corporations, and there’s monster data sets. What are we talking about?
Rod: Well, it’s all a distributed system. So with that distribution comes scale. So depending on what your data set looks like, you may need one appliance, you may need 20 appliances. It really depends on distributing the workload and your datasets across multiple machines to actually scale through. Now, since Aparavi itself is not in the data path, you’re talking directly from your client, so it’s your site, your data center over to the cloud. It really is based on your network connection between that and the cloud of actually moving the data over itself. Consider the Aparavi SaaS platforms, what you’re actually talking to most of the time is the center of control of controlling everything on the backends and getting these things to do the pieces together to do what they need to do, and that’s a very low energy task there. So the main heavy lifting of actually doing this is moving the data from a client system up to the cloud.
Mike: And kind of an initial one-time ingestion that you can start small and move big as you go?
Rod: It can take a little while. But you know what? That’s okay. It doesn’t really matter how long it takes. I can’t believe I’ve just said that. Thanks a lot, Dave.
Mike: And then, when we are looking at getting started with putting in an archive from Aparavi, is that something where I have to take a certain minimum chunk size and put it up there to get going? Is there a one-month ingestion period? What has to happen to start to get useful with this?
Rod: We do have a free trial period of 30 days, and you can use the product for 30 days. We do have monthly billings and we have annual subscriptions that you can have, and it’s based on your usage, the amount of data that you protect. That’s key, because no matter how many copies you have or which cloud provider you are using or whether you’ve got five copies of it or 20 copies of it, you pay the same rate. It’s all based on the amount of data that’s protected.
Mike: So when you say “data protected,” if I have one file and for some reason I have 17 copies of it, because that’s what some surveys say people have when they finally get data on there, you’ll actually just charge people for the one copy that you’ve indexed and not the 17 copies that might be stored in various places?
Rod: That’s correct. So you get charged once for the file, no matter how many copies you make.
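Charging by data protected rather than by copies stored can be illustrated with a short Python sketch. The source says only that a file is charged once however many copies exist; recognizing copies by a SHA-256 content hash, the `protected_bytes` function, and the sample locations are assumptions made for this example.

```python
import hashlib


def protected_bytes(stored_copies):
    """Sum file sizes once per distinct content, ignoring how many copies exist."""
    seen = {}
    for _location, content in stored_copies:
        digest = hashlib.sha256(content).hexdigest()
        seen.setdefault(digest, len(content))   # first sighting sets the size; repeats add nothing
    return sum(seen.values())


copies = [
    ("cloud_a/report.pdf", b"quarterly report"),
    ("cloud_b/report.pdf", b"quarterly report"),
    ("onsite/report-copy.pdf", b"quarterly report"),
    ("cloud_a/memo.txt", b"memo"),
]
raw = sum(len(content) for _location, content in copies)
print("bytes stored across all copies:", raw)
print("bytes counted as protected:", protected_bytes(copies))
```

Under this model, adding another cloud or another copy changes the storage footprint but not the protected total.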
Mike: I think that’s another reason why backups aren’t going to work for this use case, because a backup’s just a large image, and you’ve gotta pay for the whole image, and now we can work at the file level, a file granularity level with access. And speaking of the file level kind of question, one of the things that I was curious about was where you go when you have to age files out. Or if I have to bring new files in. Because that’s kind of a legal thing, too. I have legal hold on some documents and I have to… What do you guys do to support policies in such a potentially largely scalable system?
Rod: That’s a great question. You know, in the backup industry, when you’ve written to sequential storage tape, one of the big headaches that IT is trying to work through right now is GDPR, the right to be forgotten. Well, when you start to talk about the right to be forgotten, and there’s no legal precedent on this yet, does that mean you actually have to recover every tape, delete the file that refers to somebody in Duluth, Iowa, and remove it and then make a copy of that tape again? It gets even worse when you start talking about images, because you can’t really delete something out of the middle of an image. When you’re doing disk images, it just doesn’t work. So the backup industry has come up with this “delete on restore,” which I haven’t even got a clue how that’s actually gonna be implemented. But that means you actually recover from the tape, and then before it actually finishes recovery, it deletes it off the system like it never existed. Well, we’ll see how that works out.
Mike: Yeah. Because it’s actually there, right? I mean…
Rod: It’s still there. Yeah.
Mike: I could recover it from the tape if I am malicious or not thinking. Delete on recovery doesn’t sound infallible.
Rod: Yeah. It’s going to be very interesting to see how this all works out legally. But with the Aparavi solution, basically, you can type in on the search page my name, Rod Christensen. It will show me every document across every archive, across every system in the center of control of Aparavi. Every document, you can say, “Select all, delete.” That’s all there is to it. And it will also send you an email report confirming that you’ve deleted it, so therefore, you can actually be compliant with the law that says that you must confirm the deletion of the data.
Mike: All right. And I have to ask you though, because that sounds like a lot of power. And with great power, comes great responsibility.
Mike: If you are looking at petabytes of data, who can actually have that kind of control over it and delete things globally like that?
Rod: It can be the one big super user for a corporation, or it could be delegated to sub-authorities within the org. We have a complete hierarchical rights system that allows you to do certain things. So you can do backups or you can do stores, you can do copies, you can do archives, whatever you want to do, but you can’t recover data, or you may even not be able to delete data. It has kind of like the tracker at the grocery store. When somebody pushes the wrong key, they actually have to get a manager to go open the register again. Same thing.
Mike: Okay. Awesome. I think that’s all we really have time to explore into this use case. I’d love to talk to you some more about Aparavi at some point, Rod.
Rod: I would love to come back. It’s great talking to you.
Mike: All right. I’m excited, actually. There is a great deal of noise in the marketplace around compliance, GDPR, regulation, privacy, legal, e-discovery, and all the rest of that, and the landscape is changing fast because of the scope, scale, multicloud, regionalization laws are changing, and large companies are invading our privacy like no one should know about what happens. So I think there’s a lot more to happen here. I’m glad to talk to companies like Aparavi, and I can’t wait to find out more. So back to you, Dave.
Dave: Okay, great. Thanks, Mike. Thanks, Rod. Great job. Hey, two questions came in that we have time for. There were actually quite a few questions, but these are the two that I thought were pretty cool. So let me bounce these off of you real quick. So the first one is kind of basic, right? How do you get the data into Aparavi?
Rod: You sign up for an account, very easy, on the aparavi.com website. You’ll be sent security credentials then. After you receive your email with all your login information and a couple of download links, one for the appliance and one for the clients that you need to install, install the appliance, install the agents, and you’re all ready to go.
Dave: Okay. Great. And this other question came in that we thought was kind of unique. I don’t know whether or not this is possible, but there is an MSP who asked this question, if it’s possible potentially to offer archiving as a service, as this could be a multi-tenant kind of an architecture.
Rod: Yes. We do have a multi-tenant, multi-tier architecture where an MSP can actually sign up with us and then they can create their own clients on their behalf and manage them, or they can leave it up to the client themselves to manage. It’s really totally dependent on the business model that the MSP has.
Dave: Okay. Cool. Fabulous. Great questions. Thank you for asking those and, Rod, thank you very much for coming to speak with us today. And, Mike, thank you for your expertise here, as well, and thanks everybody for joining. Let’s put up that giveaway button. And while we do that, we’ll say thanks again to Rod from Aparavi Software, Mike Matchett with Small World Big Data. I’m Dave Littman, Truth in IT. Thanks again. Make it a great day. Good luck with the giveaway. And if you didn’t win, come back. We’re doing these all the time. So thanks again and make it a great day.
Rod: All right. Take care, guys.