There are two related problems regarding data management that organizations always face: managing and governing their structured data, and managing and governing their unstructured data.
Understanding the different types of data your company is storing is essential to developing an effective data management strategy. However, many people I encounter do not understand the difference between structured, semi-structured, and unstructured data, even with examples, or why each requires a different approach to data governance. In this post, we’ll dive into what unstructured data is and how it compares with structured and semi-structured data.
Structured data is the easiest to explain but the most challenging to search through. Structured data is data that would be inside a database or some sort of data management application. These applications can track the usage and activity and provide versioning back to the beginning of the file’s existence if managed from the start.
Database-type applications such as SQL databases, MongoDB, and Caché, to name a few of the popular ones, collect data through various data entry points like a GUI or web‐based portal. Data is added to the fields on the user interface and then inserted into various columns and rows in the database. Most websites or data entry applications will collect data into these various database formats.
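To make the form-to-database flow concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `signups` table, its columns, and the sample form values are all made up for illustration; a real application would use its own schema and database engine.

```python
import sqlite3

# Hypothetical form submission, e.g. from a web-based portal.
form_fields = {"name": "Ada Lovelace", "email": "ada@example.com", "plan": "trial"}

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE signups (id INTEGER PRIMARY KEY, name TEXT, email TEXT, plan TEXT)"
)

# Each form field maps to one column; each submission becomes one row.
conn.execute(
    "INSERT INTO signups (name, email, plan) VALUES (:name, :email, :plan)",
    form_fields,
)
conn.commit()

# Because every row has the same columns, a query can target a single field.
rows = conn.execute(
    "SELECT id, name, plan FROM signups WHERE email = ?", ("ada@example.com",)
).fetchall()
print(rows)
```

The point of the sketch is the fixed shape: every record has the same named columns, which is what lets the last query ask about one field directly.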
Now let’s look at unstructured data. Unstructured data makes up the majority of enterprise data, well over 80% in fact. The rate of data growth has been astounding. This data is not usable in a traditional database application, since single-field entry is normally the mechanism for adding data to the rows and columns. Unstructured data types are vast; there are applications that can process over 1,000 unstructured data formats.
Examples of unstructured data types include office documents, text files, image files, PDFs, log files, and application data files like .ini or .dll. A typical user will create and process primarily unstructured data. This is the data that Aparavi is going after.
To protect any sensitive data or PII that exists in unstructured data, the first step is to understand what comprises those types of data. The following are some of the most common types of sensitive data found in unstructured data.
PII (personally identifiable information) is any data that can be used to distinguish one person from another and can be used to de-anonymize previously anonymous data. This includes Social Security numbers, bank account numbers, passport information, healthcare information, and driver’s license information. A list of PII examples can be found in this guide by the Department of Homeland Security.
PHI (protected health information) is any data about health status, or the provision of or payment for health care, that is created or collected by a Covered Entity (or a Business Associate of a Covered Entity) and can be linked to an individual. This includes health records, lab test results, and medical bills. Demographic information is also considered PHI under HIPAA Rules, as are common identifiers such as insurance details and birthdates, when linked with health information.
All cardholder data is subject to the PCI DSS standards, including cardholder name, service code, card expiration date, magnetic stripe data, card verification code, and authentication data like PINs.
Protected under the California Consumer Privacy Act (CCPA) and New York SHIELD Act, biometric data includes fingerprints, facial recognition, retina scans, voice recognition, and any physical and behavioral characteristics that can be used to digitally identify a person to grant access to systems, programs, or devices. A study on biometrics in the workplace reported that 62% of organizations use some form of biometric authentication.
Consumer behavior data, which is subject to CCPA regulations and laws in various states, is any data that pertains to personal information that could identify or be linked to a person or that person’s household. This includes internet browsing history, geolocation data, and any information regarding a consumer’s interaction with an internet website, application, or advertisement.
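As a rough illustration of how one of the categories above might be located inside free text, the sketch below scans a string for values that look like formatted U.S. Social Security numbers. The pattern, the function names, and the sample note are assumptions made for this example; real classification tools combine many patterns with validation and context, and a simple regular expression like this one will produce both false positives and misses.

```python
import re

# Assumed pattern: U.S. Social Security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_candidates(text: str) -> list[str]:
    """Return every substring that looks like a formatted SSN."""
    return SSN_PATTERN.findall(text)


def mask(value: str) -> str:
    """Hide all but the last four digits so results can be reviewed safely."""
    return "***-**-" + value[-4:]


# Made-up text standing in for the contents of a document or email.
sample = "Employee note: SSN 078-05-1120 on file. Call 555-0100 to confirm."
hits = find_ssn_candidates(sample)
print([mask(h) for h in hits])
```

Masking the match before printing is a deliberate choice here: a scan that writes the full values into its own report would create yet another copy of the sensitive data.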
Now that we understand structured vs. unstructured data, note that some data is considered semi-structured. Semi‐structured data is, as its name suggests, a mix of structured and unstructured data. An example would be an on‐prem Exchange Server. Exchange stores all email and attachment data within its database. However, an email can be easily moved or duplicated from your email client by simply dragging it to the desktop. This creates an .msg file that includes all attachment data. Attachments can also be opened within the client and saved to your local file share or desktop. Aparavi can process this type of data as well, provided the data has been exported from the structured environment.
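A single email shows the mix quite well: the headers are named fields, while the body is free text. The sketch below uses Python's standard `email` package on a made-up plain-text message. Note the assumption: this is the generic internet message format, not Outlook's binary .msg format, which the standard library does not read.

```python
from email import message_from_string

# Made-up message in plain internet message format (not an Outlook .msg file).
raw_message = """\
From: alice@example.com
To: bob@example.com
Subject: Q3 contract draft
Date: Mon, 05 Apr 2021 09:30:00 +0000

Hi Bob, the draft is attached. Let me know what you think.
"""

msg = message_from_string(raw_message)

# Structured part: named header fields that can be queried like columns.
headers = {name: msg[name] for name in ("From", "To", "Subject", "Date")}

# Unstructured part: free text with no fixed length or schema.
body = msg.get_payload().strip()

print(headers["Subject"])
print(body)
```

The headers could be loaded into a table and sorted or filtered; the body cannot be handled that way without first analyzing its contents.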
Before you can properly analyze your data, you need to know what’s in it. You almost certainly have a large quantity of both structured and unstructured data in your organization, so how can you tell which is which?
Structured data is so named because all of the data in the set follows rules. These rules give the data structure and allow us to easily search and sort it. A good example of structured data is the set of values in an Excel sheet. Each cell contains a string of data that must conform to Excel’s rules, and each cell is identified by a column and row code. We could ask Excel what’s in cell B7, and we’ll get a specific piece of data.
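The "ask for cell B7" idea can be sketched in a few lines. The example below assumes the sheet has been exported as CSV and uses made-up values; the `cell` helper is hypothetical and only handles single-letter columns.

```python
import csv
import io

# Hypothetical sheet exported as CSV: column A holds names, column B holds amounts.
sheet_csv = (
    "name,amount\n"
    "rent,1200\n"
    "power,90\n"
    "water,45\n"
    "internet,60\n"
    "phone,35\n"
    "insurance,110\n"
)
grid = list(csv.reader(io.StringIO(sheet_csv)))


def cell(grid: list[list[str]], address: str) -> str:
    """Look up a spreadsheet-style address such as 'B7' (single-letter columns only)."""
    column = ord(address[0].upper()) - ord("A")  # 'A' -> 0, 'B' -> 1
    row = int(address[1:]) - 1                   # row 1 is index 0
    return grid[row][column]


print(cell(grid, "B7"))
```

Because every value sits at a predictable row and column, the lookup needs no knowledge of what the data means. That predictability is exactly what unstructured data lacks.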
On the other hand, unstructured data doesn’t play by any rules. For instance, consider the text in an email. An email may have no text at all, or it could contain a whole novel.
Unstructured data is most commonly accessed by the same program that created it. If you want to search your Gmail inbox, you go into Gmail and use its search tool. This means that much of your unstructured data goes unseen by data management software, and this is a serious problem for your business.
When data gets locked into a single environment, unable to be accessed by certain people or only accessible through certain platforms, it’s in what we call a data silo. The problem with data silos is that they present risks to your business, since you often won’t know what’s actually in each silo. Furthermore, silos frequently create redundant data, which could pose a security risk. But unstructured data isn’t the only way silos form.
Structured data can also be siloed off if it’s not easily accessible. While it’s easier to search and identify data from structured files, access permissions often keep the doors to the silo locked shut.
Although both forms of data can end up in silos, unstructured data is more likely to do so. Furthermore, unstructured data loves to hide in the dark. Since it’s often only accessible with a specific program, your average search tool or data management platform just isn’t going to find it. Data of any kind can become dark data, lurking in the shadows of your organization.
Dark data may very well be worse than a data silo. In a sense, it is already a silo of its own. You can’t see dark data because you don’t know where it is, and even if you find a rogue file, you won’t know what’s in it. Since unstructured data readily evades detection, it tends to remain dark. You can’t derive insights from data you don’t know about, but you certainly can suffer the consequences of dark data.
Many companies discover data breaches well after the fact. Just recently, Mobikwik’s customers discovered their own data for sale on deep web markets. Mobikwik had no idea anything had happened, and still denies responsibility, but the breach seems to be from months ago. When your data is dark, you can’t keep an eye on it and you might only find out about it in the worst of circumstances.
The Aparavi Platform processes unstructured data types like office files, text files, PDFs, etc. We can also index any type of file that has selectable text and make it easy to search through and classify those files for purposes of compliance, cost savings, storage consolidation, and more. Selectable text is any text for which you can open a file and drag your mouse cursor over the text to highlight or select. Files that do not have selectable text but have images of text (such as a scanned document) would require an OCR (optical character recognition) application to process the image text data.
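To give a feel for what indexing selectable text makes possible, here is a generic inverted-index sketch. It is not a description of how the Aparavi Platform works internally; the file names, the text, and the `build_index` and `search` helpers are all invented for illustration, and the text is assumed to have been extracted from each file already.

```python
import re
from collections import defaultdict

# Hypothetical corpus: file name -> text already extracted from the file.
documents = {
    "offer_letter.docx": "Salary and start date for the new hire",
    "invoice_0042.pdf": "Invoice total due on receipt",
    "notes.txt": "Follow up on the invoice and the start date",
}


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of files that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the files that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_index(documents)
print(sorted(search(index, "invoice")))
print(sorted(search(index, "start date")))
```

Once the text is indexed, a search no longer depends on the program that created each file. A scanned document would contribute nothing to this index until OCR turned its image into text.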
As unstructured data makes up the majority of most companies’ data sets and is growing at uncontrollable rates, Aparavi focuses on helping you take control of your unstructured data. Our Platform helps you classify, protect, and optimize your data, regardless of its location.
Data intelligence takes your data and provides the information you need to truly leverage your data’s value and make intelligent decisions on your unstructured data sets. Understanding what you have is the key to getting the most out of your data. Our mission is to provide you with the tools you need to protect, analyze, and process data effectively. This enables you to adhere to data privacy regulations, defensibly delete ROT data, make informed decisions, simplify operations, and save money on your data management. To learn more, contact Aparavi or get started today.