Exponential data growth costs organizations a fortune to hold and manage and creates chaos and governance exposure. But it also creates incredible opportunity to utilize data for insight and profit. Current data management solutions are part of the problem – not the solution.

Join Darryl Richardson, Chief Evangelist at Aparavi, as he discusses the data challenges, best practices, and opportunities when faced with data growth, data governance, and data compliance demands.


Kirstie Jeffries: Hi, everyone. Thank you for joining us today. My name is Kirstie Jeffries, and I’m on Aparavi’s marketing team. Today, our Chief Product Evangelist, Darryl Richardson, will be talking about data governance and use cases for the legal industry. This will be an on-demand webinar, so we won’t be doing a live Q&A, but if you have any questions throughout the presentation, contact us via our website, aparavi.com. Darryl has many years of experience in compliance and governance, so we’re excited to have him share his insights with you all. Thanks, Darryl.

Darryl Richardson: Thanks, Kirstie, and it was good that we all had the time to get together today. So let’s talk a little bit about data governance and specifically the legal use cases for the platform here from Aparavi. So first thing, of course, what we want to do is understand the many challenges that are sitting out there in the typical enterprise today. So, being in the business for quite a while, I’ve heard a lot of these testimonials from people. And the number one challenge that they face is unmanageable data. Data is everywhere. The typical enterprise data set holds about 385 terabytes of data. And the data locations are very disparate, all over the place. We’ve got on-prem file servers, we’ve got cloud storage, we’ve got instant messaging and Teams and SharePoint. So how do you effectively manage all of these disparate locations?

Second thing from the regulatory world has always been, “How do I handle my risk-averse data or my compliance data?” So, you have to understand the different markets and the different verticals that individual businesses might be in. Health care, federal agencies, the banking and financial institutions, and even the federal government have their regulations as well. But you start looking at internal compliance with enterprises, and every internal organization has their own compliance rules that they may need to define themselves.

That’s always been a challenge. Intelligence applications or platforms, as we might start hearing the term, are now becoming a thing out in the IT world. A platform is supposed to encompass many different pillars of data and manage the lifecycle through from the beginning to the end. So having the intelligence behind the data gathering and collection of all this information, you should also have a tool that could actually provide actionable, what we call intelligence, actionable intelligence, and that would be like your cut, copy, paste, or move functions based on what you now know about your information.

So let’s take a look at how this massive data problem actually came into existence. If you look at some things here, I always like the analogy here with this pyramid. If you look at the wisdom level of this pyramid, this is the information where it all starts. So, on day one, I get my laptop, I get a new desktop, or whatever. And I start putting my common data files in this location, and that I have wisdom about because I know what it is. I know how I can get a hold of it. And I always know that that data is sitting there for me.

But as we start moving down the pyramid, you see that the data space is actually getting wider and wider. So a year from now, the data that you had a lot of wisdom about, now just becomes data that you’re knowledgeable about. It’s like, I’m looking for this one specific file. I don’t remember the name, but I know it had this content in it. Maybe I can search through the content and find that file. So when you do a search in your typical search tools, you might find 30 different results. And you just scroll down the list and you find it. As we start getting years and years under our belt of having these data sources that we put data into, and now we’re starting to use other data sources like cloud data spaces, or things get transferred between applications like Slack or Teams, and data file sharing is now becoming social.

Now we know that we have a lot of information, right? So the information layer is now this accumulation of all this data in all these disparate locations, and I now have to have effective search tools to find and manage that information. And then ultimately, in an enterprise or in large organizations, they just have what’s now known as the data layer. And the data layer is essentially all the data within the enterprise, about 50% of which people have very little or no knowledge about. There have been studies out in the real world showing that, in a typical enterprise data set, more than 50% of the data is what’s considered dark data, and everybody here has probably heard that term. But dark data is simply data that nobody understands, nor can they tell you much about it.

If you look at this little girl here, and hopefully she’s not mad at me for putting her here because she looks very angry, but her room’s a mess, right? But if I ask that little girl to go find that pink pillow or her jacket, she would probably find it pretty quickly because she’s the one who put it there and she knows where everything is. So you’ve got a few people in the enterprise, they can put their hands on data pretty quickly. Of course, there’s that 50% that nobody knows about that you just literally have no idea what it is. But if I were to tell her to go and find me something, she could find it, but I’d have to ask her to do that.

So the organizational perspective would say, “Put all your clothes in the drawer. Put your blanket and your pillows on your bed.” So if I needed to go into a general perspective and find her jacket, I would know she put it in the closet, or if I was looking for a pillow, I would know it’s on the bed. Organizing the information is going to be paramount to managing the lifecycle of information. So let’s look at the problem if we just leave it alone. The biggest challenge that enterprises have today is, “What do I have in this massive data sprawl?”

So I could say that there are people in the organization that know what 15% of this data is. If I’m in a boat, in an ocean, in Antarctica, or somewhere around the Arctic Circle, there’s icebergs, right? I always like using the iceberg analogy. If I’m in a boat, and I’m coming up to an iceberg, I can see what relates to about 15% of this iceberg. I look at that as the data that I know about, I can put my fingers on it. I know what this data is. And I know I’m storing it in an effective way.

So as I get closer and closer to this iceberg, we can start looking down below the water level, and we can see the rest of the iceberg here. And this is data that we would consider that has very little value or none at all to the organization. So this encompasses roughly 50%-ish of your data. And then, of course, if we get all the way up to the iceberg, we’re going to see as far down as we can see, which is about 33% to 35% of the iceberg, and then it just goes dark.

I mean, I can’t see how truly large this iceberg is. And I know it’s pretty big, because icebergs in general are…three quarters of the iceberg is under the water. So I can only see another 30% under the water because it’s clear, and then it goes dark. So that 50% of the data is data that I have zero intelligence about; I have no clue. And quite frankly, I’m not even going to be worried about it, because nobody’s ever asking me to go into it. So what are we doing?

We look at how much this data is actually costing us. If we look at companies that have made trillions, literally trillions of dollars, with data, you have Google, and Facebook, Microsoft, all these companies, their most valuable asset is their data. Now, something back in 2017 was always a great quote. And this happened while I was a specialist, but somebody said, “Look, oil is no longer the most valuable asset. The most valuable asset is data for a company.”

If you look at what Google does, Google is a data company. If they didn’t have their data, they would cease to exist. IDC has recently come out with a new analogy, which says that data is no longer an asset, but it is a necessity. So IDC compares data to water. It’s the lifeblood of an organization. So now, without water, nothing would exist, right?

Companies look at their data and know that, without it, they would not exist today. It doesn’t matter what company it is: without the data to manage their information, they don’t have a company. Another question is, what is the risk to my organization? If I’ve got 50% of my enterprise data set that I know nothing about, I probably should know a little bit more about where all my risk-averse data is. And I can pretty much be guaranteed that the 50% of my data that I don’t know about holds an immense amount of risk that I need to address and I need to start looking at sooner or later. I’m going to have to.

If I don’t act on it, I’m going to keep seeing this data growth, which averages about 20% year over year. This growth of data is going to become my Achilles heel. And I’m going to have a compounded problem if I don’t start addressing it, which brings me to the total cost of inaction. How much is this going to cost me next year? Or how much is this going to cost me in five years? Because if my data growth is 20%, year over year, I’m going to run out of storage pretty soon, and I’m going to have to buy a new storage device. And pretty soon, my storage devices are going to run out of their maintenance contract; I’m going to have to buy a new storage device and renew another three-year maintenance contract. And all I’m going to do is simply take the data from this old device, and I’m going to move it to this new device.

And then account for three years of growth, which is like 60% now, and now I need a bigger device to handle more data because I’m not doing anything to solve my data challenges. If not now, when are we going to do this? So that’s what I always ask people.
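
The storage arithmetic above can be sketched in a few lines. This is only a rough illustration, assuming the 20% year-over-year figure quoted in the talk compounds once per year; the function name `projected_size` and the 100 TB starting footprint are hypothetical. Under that assumption, three years of growth comes to roughly 73% rather than a flat 60%, and five years to roughly 149%.

```python
def projected_size(current_tb: float, annual_growth: float, years: int) -> float:
    """Return the projected storage size, assuming growth compounds once per year."""
    return current_tb * (1 + annual_growth) ** years


if __name__ == "__main__":
    start_tb = 100.0  # hypothetical starting footprint, not a figure from the talk
    for years in (1, 3, 5):
        size = projected_size(start_tb, 0.20, years)
        growth_pct = (size / start_tb - 1) * 100
        print(f"After {years} year(s): {size:.1f} TB ({growth_pct:.1f}% growth)")
```

The point of the sketch is that the replacement device has to be sized for the compounded figure, not the simple sum of three annual increases.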

Let’s take a look at some of the internal personnel that could manage the data better with proper data management platforms. The CCO, this is your compliance officer, obviously, she’s concerned with the compliance aspects of data. So, if she’s in healthcare, she’s worried about PHI or HIPAA data, ICD medical codes, you know, there’s all kinds of health care-related regulated data. If you look at the finance world, from the top banks down to insurance, even local banks, they’re all concerned about the SEC rules, the Sarbanes-Oxleys, the Leach rules. I mean, there’s just a ton of different financial regulations.

Sarbanes-Oxley requires you to actually produce data within a certain time. And if you don’t, there’s penalties or fines. And then, of course, if you look at global policies, GDPR and the Privacy Act are coming out. CCPA within California is an ever-evolving law. There are 16 new provisions that are coming out that need to be added to the new law. How does this person maintain the regulated data in this large, previously unmanaged data set?

Let’s take a look at the next level, which is your chief information or your security officer. He’s going to be concerned about preventing leaks of the data or inadvertently sending some regulated data type to somebody. And also concerned with obviously identifying the risk-averse information and making sure that data is secured in areas within the network or in cloud locations, that he’s controlling the user access and limiting the access of what people can do. Some people may need to access the data, but they shouldn’t be able to delete information, or they shouldn’t be able to copy the information out or delete stuff for sure. So he’s worried about this kind of information.

The next guy here is somebody that we don’t really hear a lot about, because there really hasn’t been a way for this guy to actually make use of the data, because technically the data is sprawled so many places, there’s not a single place for him to actually make money from it. So the line-of-business guy is looking at this thing and saying, “Hey, you know what? We’re collecting all the metadata of a file, and we’re collecting the contents, and we’re keeping all of this data in a certain location. Why can’t I leverage my tools that are in front of me to make money from this information?”

There are ML and AI tools out there that are actually allowing you to mine data warehouses and data lakes. So if you’re collecting information from your enterprise into a single searchable location, then you’re most likely creating what can be considered a data lake or a data warehouse. So why not give your different sales teams, marketing teams, HR teams, why not give them the ability to mine this information to help them along their journey within the company to make use of the data? If it’s just sitting there, why not make money off it? That’s my opinion here.

And then, of course, the ever popular chief information officer. He’s concerned about reducing the storage footprint, consolidation of data centers or consolidation of applications and trying to reduce that monetary footprint within the information department, the IT department that he’s managing. So he wants to make sure that that data is highly available with the five nines of availability. You know, that’s something that he’s also going to be prioritizing, applications and doing all these things.

But the big thing here is that he’s managing a budget. And if you continue to add 20% to your storage year over year, his budget is not increasing to manage that much growth. So in certain ways, you could leverage tools that are sitting out there today to make money from it. That helps not only from the line-of-business perspective, but it also makes you, reporting to the CIO, look like somebody who really cares about this company and wants to make sure that we’re running as efficiently as possible. And the only way you’re going to do this is to identify the data that you can simply delete because it has zero value to your organization.

Let’s take a look at a couple of use cases now. So a large bank in New York, and it wasn’t the one in the label here. This is just a picture. But it’s a highly regulated industry: SEC, FINRA, Sarbanes-Oxley, and Dodd-Frank is the rule that actually specifies the time to respond. But then you go into the environment itself. I mean, there’s literally billions of data files that need to be classified and protected based on the content. How are you going to do this from a manual perspective? Well, if you look at the human element here, the human element is the bottleneck. An effective human can look at data at about one file per minute, which is 60 files an hour. Times that by 8 hours a day, times that by five days, so you’ve got a 40-hour work week for these guys. I mean, they can only look at a couple hundred thousand files a year. Now how are they going to go through a billion files?

Doing the math, if you had one single person to go through this many files, it’ll literally take them like 3,000 years. It’s just not going to happen. So automation is going to be the key in a platform. The bank itself was out of compliance. They knew they were out of compliance, and they were facing very large fines from the SEC. And obviously, the reputational damage could be equally financially burdensome, right?
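
The manual-review arithmetic can be checked with a short sketch. It assumes the throughput quoted in the talk (one file per minute, a 40-hour work week) plus a hypothetical 52 working weeks a year with no time off; the function names are illustrative only. Under those assumptions one reviewer covers about 125,000 files a year, and one billion files works out to roughly 8,000 person-years, so the conclusion that manual review cannot keep up holds whichever rough figure is used.

```python
FILES_PER_HOUR = 60   # one file per minute, as quoted in the talk
HOURS_PER_WEEK = 40   # 8 hours a day, 5 days a week
WEEKS_PER_YEAR = 52   # assumption: no holidays or time off


def files_per_reviewer_year() -> int:
    """Files a single reviewer can look at in one year at the quoted pace."""
    return FILES_PER_HOUR * HOURS_PER_WEEK * WEEKS_PER_YEAR


def years_to_review(total_files: int, reviewers: int = 1) -> float:
    """Calendar years needed for the given number of reviewers to cover every file."""
    return total_files / (files_per_reviewer_year() * reviewers)


if __name__ == "__main__":
    print(f"Files per reviewer per year: {files_per_reviewer_year():,}")
    print(f"One reviewer, 1 billion files: {years_to_review(1_000_000_000):,.0f} years")
    print(f"1,000 reviewers, 1 billion files: {years_to_review(1_000_000_000, 1000):.1f} years")
```

Even with a thousand reviewers the backlog takes about eight years, before accounting for new data arriving in the meantime.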

So, multiple data centers. The data sprawl was massive. They actually had more than 20% year-over-year growth because they literally had a no-delete-anything policy. Everything brought into the system was kept forever. I mean, the risk that’s related to keeping data forever is a huge problem in itself. In synopsis, billions of data files needed to be managed and protected. Data that had no value to the organization or wasn’t tagged with a classification needed to be deleted, and data centers needed to be consolidated. So this was the challenge that they had. A platform, per se, would allow you to manage all the different storage arrays in one single place. Hundreds of thousands of users could be managed in a certain location, a combination of emails and file shares.

There was some structured data, but predominantly, 90% was unstructured, and the estimated size of the environment was 56 petabytes. I mean, it’s a massive amount of data. So obviously, the solution would be to add the platform to manage the data and ensure the proper data management and compliance rules are being followed. The data was scanned and classified, and then all classified or risk-averse data, based on classification rules, was tagged and easily identifiable.

The classification-tagged data was moved to cloud locations for cheaper storage, with long-term retention times added to it. Getting this data off your primary storage is always going to be something that your storage admin feels is going to help your data protection, because your backups are then scheduled to protect a smaller amount of data. If you get it up into a cloud location, apply the proper retention, and delete it when it’s ready to be deleted, this completely automated platform should have the tools that do this for you.

So ultimately, the objective was to remove 20% of the legacy storage and hardware that were sitting in the data centers. So it’s a great solution for them.

Let’s take a look at another situation. If you understand the legal challenge that sits out there, the biggest challenge for legal is finding that needle in a haystack. So if you’re either a defense counsel or you’re the prosecuting counsel or the plaintiff, the one thing you’re looking for every time you do a legal search is that one file, that one file that’s going to incriminate the other side, that either makes or breaks the case. So finding the needle in the haystack is imperative.

Now I have seen estimates that the average cost to review a single document is between $8 and $12 for legal processing to handle. Say, on a single box, I have 100,000 files in a typical enterprise data lawsuit. I mean, just my legal costs alone to review the information are somewhere between $800,000 and $1.2 million, depending on the services that you’re getting provided, like auto redaction or reviewing processes. You’ve got software to pay for. And you have to understand that time is of the essence when you’re looking at the legal challenges. Any time you can save time, you’re going to save money, and you’re also going to be able to meet your deadlines.
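
The review-cost estimate is a simple multiplication, sketched here with the per-document range quoted in the talk ($8 to $12). The function name `review_cost_range` and the 10,000-document comparison are hypothetical, added only to show how the bill scales with the number of documents that reach review.

```python
def review_cost_range(documents: int,
                      low_per_doc: float = 8.0,
                      high_per_doc: float = 12.0) -> tuple:
    """Return the (low, high) review cost in dollars for a number of documents."""
    return documents * low_per_doc, documents * high_per_doc


if __name__ == "__main__":
    for count in (100_000, 10_000):
        low, high = review_cost_range(count)
        print(f"{count:,} documents: ${low:,.0f} to ${high:,.0f}")
```

Because the cost is linear in the document count, every document culled before review saves its full per-document fee.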

The biggest challenge that you might find within the typical eDiscovery process is the collection of the information. Collecting and indexing takes about 50% of the time. And then you go into this other process. So let’s take a look at that, actually. If we look at the legal process itself, on the left, we might have 10 terabytes of total data that we start with. So we’re going to define a case in some eDiscovery tool. And that case is going to have the custodial ownership of the information, it’s going to have the keywords that we’re searching for, it might have some sort of a date range, you know, it has to be in between these two dates, for all these keywords from these custodians. So those are your typical tapes.
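
A case definition of the kind described above (custodians, keywords, and a date range) can be sketched as a simple filter. This is an illustration only: the `Document` and `CaseDefinition` types, their field names, and the `collect` helper are all hypothetical and do not reflect any particular eDiscovery product, and the keyword test is a plain case-insensitive substring match rather than an indexed search.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    name: str
    custodian: str   # the person who owns the file
    modified: date   # last-modified date taken from file metadata
    text: str        # extracted file content


@dataclass(frozen=True)
class CaseDefinition:
    custodians: frozenset
    keywords: tuple
    start: date
    end: date

    def matches(self, doc: Document) -> bool:
        """True when the document meets all three criteria of the case."""
        if doc.custodian not in self.custodians:
            return False
        if not (self.start <= doc.modified <= self.end):
            return False
        body = doc.text.lower()
        return any(keyword.lower() in body for keyword in self.keywords)


def collect(case: CaseDefinition, documents: list) -> list:
    """Return only the documents that fall inside the case definition."""
    return [doc for doc in documents if case.matches(doc)]
```

A document has to belong to a named custodian, fall inside the date range, and contain at least one keyword; anything else is left out of the collection.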

Once you do that search, 500 gigabytes of data might be that raw data that you have to collect and bring into an eDiscovery tool. And then you go into your culling process. The culling process says, “Let’s bring it into the eDiscovery tool, and let’s start filtering out what our low-hanging fruit is, where we say, we know this is not relevant.” So then we would go in and start selecting processes and filters to where this data is now culled down to, say, 50 gigs.

And then that 50 gigs of data is the data that we’re going to start reviewing and tagging for relevancy, or non-relevant, or maybe it needs to be redacted or whatever. So when you get to the reviewing stage after you’ve filtered out the 50 gigs of data, you might only have five gigs of data that are actually relevant to the case. And then you can even further look through that information and say, “Look, this data has some personal information, or it’s intellectual property. I’m not obligated to have to return this unless it’s requested.”

Ultimately, the process is to produce one gig of data, which is then collected and processed and then produced to the opposing counsel. This whole legal process follows what’s called the Electronic Discovery Reference Model, and it’s your typical legal data collection process. So if you don’t know what that is, just look it up. It’s from Duke Law. It’s very informative, and there are nine steps to complete, so it helps to have a platform that supports this.
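
The reduction described across these stages (10 terabytes down to one gigabyte) can be laid out as a small funnel calculation. The sizes are the example figures from the talk; the stage labels, the `funnel_report` helper, and the decimal conversion of 10 TB to 10,000 GB are assumptions made for illustration.

```python
# Example sizes from the talk, in gigabytes (10 TB taken as 10,000 GB).
STAGES = [
    ("Total data", 10_000.0),
    ("Collected after search", 500.0),
    ("After culling", 50.0),
    ("Relevant after review", 5.0),
    ("Produced to opposing counsel", 1.0),
]


def funnel_report(stages):
    """Return (name, size_gb, share_of_previous_stage, share_of_total) per stage."""
    total = stages[0][1]
    previous = total
    rows = []
    for name, size_gb in stages:
        rows.append((name, size_gb, size_gb / previous, size_gb / total))
        previous = size_gb
    return rows


if __name__ == "__main__":
    for name, size_gb, of_previous, of_total in funnel_report(STAGES):
        print(f"{name:<30} {size_gb:>9,.0f} GB  "
              f"{of_previous:>7.1%} of previous  {of_total:>8.2%} of total")
```

With these figures the produced set is about 0.01% of the starting data, which is why the early search and culling steps dominate the cost of everything that follows.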

If you look at the platform itself, the heavy lifting is being done for you already, because we’re already collecting the metadata and the indexable content of all these files. That collection process is 50% of your time: you collect it, and then you index it. These are all processes that take a lot of time, so your searching and your culling are made easier through advanced search features, and then through reporting where you can eliminate columns that you don’t really need, because they fall out of the scope of the actual case requirements.

So once we start determining our sources, that helps to define where the search is going to be. Then we go through filtering data, and then we go through the culling process. And then we determine what of the culled data is actual evidence, and then produce only the relevant information and no red herrings out there.

I’ve been in this field for quite a while, so a lot of people ask me what should I do from here. So I would say there are a couple of things that I’ve noticed that have been asked. Having a built-in compliance or governance application that handles the complete data lifecycle from beginning to end is paramount to your success in managing an enterprise data set.

Second thing, obviously, is to have a proper records schedule in place. Even if you’re not handling regulated data, there are states that have their individual rules and regulations that might require you to keep certain amounts of data or delete data. So having a proper record schedule defined is going to be paramount as well. Securing and protecting your risk data.

So everybody’s heard of GDPR. If you do business in one of the 60 countries that have individual GDPR policies or their own privacy law, you’re going to have to be able to adhere to these policies. California data privacy is the same way. If you do business in California, you have to understand that law and have processes in place that help with that. And that brings me to the classification need. I need to make sure that my classification is complete, and I don’t have to rely on my end users to define these policies.

If we’re going to collect all the data, then why not give the end user or customer the option to keep the index so you can search through all of your enterprise data in one location? A lot of applications out there just collect the data, and then they delete it. And then they fill up a new cache with new data, but they don’t ever keep that indexable content. So why not give you the option? And of course, determine where the lowest-cost, most effective storage is, right? So being agnostic to the targets, where you’re going to need to store the information, is going to also be paramount.

So if we look at the Aparavi platform, this is kind of the wheel of what a platform should be. Optimize the server class by vertex. Let’s go through really quickly here. So automation and efficiency are going to be the key to optimize your data management workflow. Have policies that are smart. Reducing your costs, obviously: taking many applications and rolling them into a single license is going to be very helpful. Let’s look at the classify and intelligence aspect. Have a simple-to-use search tool that uses Boolean logic and classifies data, so you can look at your classified tags, determine where your risk-averse data is, and provide analytics on this information.

Obviously, the most important thing here is the protection aspect. This is an open data lake. So, make sure you’re protecting the information through your threat protection or your AV tools, agnostic control of your data sources and targets, and all these things are going to be important. So you can protect the information against inadvertent deletion or moving the data somewhere else that is going to compromise the security of the information.

And then of course, simplicity at scale. Have it scale to the largest environment we can, right? So having the open platform allows third party manufacturers to build in integration. No vendor lock-in is a huge thing. We’re not going to be married to a single cloud vendor. We want you guys to have the option to put the data where you need to put it, and then have analytics and everything built into the interface when you log in. So I can have my cup of coffee in the morning and get deep analytics on all the information that I have.

So if we look at the process with simplicity and scale, we’ve got all of our data sources on the left, move it into Aparavi’s aggregator, which is a cloud-based aggregator, or it can be living in a VM environment, and then simply apply intelligent policies to move the data to any of the storage destinations for long-term keeping. I have to be able to secure this information as we go. Offering the access and the machine learning tools is also going to be imperative, so making sure people can access the data is also a very important aspect.

So what is the platform? The feature set here, obviously, is something that I have been a big part of by adding a lot of these features and functionality. Providing a full index of the contents and metadata, so it’s easily accessible. A single platform to get rid of five different layers or five different pillars of applications that you might have today. Having automation in place, so you can take the human element out, and it manages data from start to finish. You know, create actionable intelligence: I’m giving you all the information you need, and I’m going to allow you to make decisions based on that and move the data to where you need to or take action on it. Obviously, when I log in, I want to provide data analytics. What’s the age of my data? What are the simple analytics that I can provide you? How much of my data is classified?

So these widgets should be customizable for you. And every time you log in, you see your own dashboard. Reporting is another thing: have this reporting be very easy for you to understand. Have pre-made reports that are already there and have the ability to create your own reports as well. Looking at classifications, these classification policies must be complete. Aparavi’s platform has 140 different global policies that are complete. All you have to do is enable them and hit save, and, boom, you’re classifying data from then on.

Obviously, having multiple sources of data, so not locking you into just storage or NAS, but also hit those endpoints, the laptops and desktops or mobile users, right? Have that data brought into the platform as well. So we can start securing all of the risk in the organization, as well as social media content through third party integrations like Slack or Teams or some of these other applications that you need to collect the data from, and be cloud agnostic. I don’t want to choose the cloud storage for you; I want you to choose the cloud storage.

So I’m going to leave you guys with a phrase here that only Aparavi puts you in control with knowledge and creates a new advantage, a single platform that does all of these things that are necessary for data lifecycle management.

And with that, I thank you for joining us. It was a pleasure. I love talking about this stuff. If you have any questions, don’t hesitate to reach out to our sales team. Kirstie, back to you.

Kirstie: Thank you so much Darryl, and thanks to all of you for listening. Check us out at aparavi.com for more information or if you want to contact us, and stay tuned for more webinars. Have a great rest of your day!