When I talk to potential customers at trade shows and learn about their environments and the challenges they face, two data governance problems always come up: managing their massive quantity of structured data and managing their unstructured data. Recently I had a discussion during a session around this topic. Many people within the tech industry do not understand the difference between the two, or why they call for such different approaches to data governance. Let me use this post to dive into the question of unstructured data vs. structured data and semi-structured data.
Understand the difference
Structured data is the easiest to explain but the most challenging to search for governed data fields. Structured data is data that lives inside a database or some other data management application. These applications can track usage and activity and, if the data has been managed from the start, provide versioning back to the beginning of the file’s existence. Database applications such as SQL, Mongo, and Caché, to name a few of the popular ones, collect data through entry points like a GUI or web-based portal. Data is entered into fields on the UI and then inserted into the database’s columns and rows. Most websites and data entry applications collect data into one of these database formats.
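To make the "fields into columns and rows" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers table, its columns, and the function names are hypothetical, chosen only for illustration; a real data entry application would have its own schema.

```python
import sqlite3


def build_customer_db():
    """Create an in-memory database with one hypothetical 'customers' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, name TEXT, email TEXT, ssn TEXT)"
    )
    return conn


def add_customer(conn, name, email, ssn):
    """Simulate a form submission: each UI field maps to exactly one column."""
    conn.execute(
        "INSERT INTO customers (name, email, ssn) VALUES (?, ?, ?)",
        (name, email, ssn),
    )
    conn.commit()


def find_by_email(conn, email):
    """Look up rows by a single, well-defined column."""
    cur = conn.execute(
        "SELECT name, email FROM customers WHERE email = ?", (email,)
    )
    return cur.fetchall()
```

Every value has a named column and a type, which is what makes the data "structured": the application, not the user, decides where each piece of information lives.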
Now let’s look at unstructured data. Unstructured data makes up a much larger percentage of organizational data: well over 80% of an enterprise’s total data footprint. This data is not usable by a traditional database application, since single-field entry is normally the mechanism for adding data to rows and columns. The range of unstructured data types is vast; there are applications that can process over 1,000 unstructured data formats. Examples include office documents, text files, image files, PDFs, log files, and application data files like .ini or .dll. A typical user creates and processes primarily unstructured data. This is the data that Aparavi is going after. Read our post on fascinating data growth statistics.
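As a rough illustration of how varied these files are, the sketch below groups file names into the categories listed above using nothing more than the file extension. The extension-to-category mapping is an assumption made for this example and is far from complete; real tools inspect file contents rather than trusting the name.

```python
from collections import Counter
from pathlib import Path

# Hypothetical, deliberately small mapping of extensions to categories.
CATEGORIES = {
    ".docx": "office document",
    ".xlsx": "office document",
    ".pptx": "office document",
    ".txt": "text file",
    ".png": "image file",
    ".jpg": "image file",
    ".pdf": "PDF",
    ".log": "log file",
    ".ini": "application data",
    ".dll": "application data",
}


def categorize(filename):
    """Return the category for a file name, or 'other' if the extension is unknown."""
    return CATEGORIES.get(Path(filename).suffix.lower(), "other")


def summarize(filenames):
    """Count how many files fall into each category."""
    return Counter(categorize(name) for name in filenames)
```

Running summarize over a typical user's home directory listing would show most files landing in these unstructured categories.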
Between the two sits semi-structured data, which is a mix of both. An example is an on-prem Exchange Server. Exchange stores all email and attachment data within the Exchange database. However, an email can easily be copied out by dragging it from your email client to your desktop. This creates an .msg file that includes all attachment data. Attachments can also be opened within the client and saved to your local file share or desktop. Aparavi can process this type of data, provided the data has been exported from the structured environment.
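An email shows the mix nicely: the headers behave like structured fields, while the body and attachments are free-form. The sketch below uses Python's standard email package on a plain MIME message built in code. This is an assumption made to keep the example self-contained: an Outlook .msg file is a different, binary format that the standard library does not read, so this only illustrates the idea rather than how exported Exchange data is actually parsed.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample():
    """Build a hypothetical message with one attachment, for demonstration only."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Q3 report"
    msg.set_content("Hi Bob, the report is attached.")
    msg.add_attachment(
        b"placeholder bytes",
        maintype="application",
        subtype="pdf",
        filename="report.pdf",
    )
    return msg


def split_message(raw_bytes):
    """Separate the structured part (headers) from the unstructured parts."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # Headers: fixed names, predictable values, much like database columns.
    headers = {key: str(msg[key]) for key in ("From", "To", "Subject")}
    # Body: free text with no schema.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    # Attachments: arbitrary files carried along with the message.
    attachments = [part.get_filename() for part in msg.iter_attachments()]
    return headers, body, attachments
```

Calling split_message(build_sample().as_bytes()) returns the three header fields, the body text, and the list of attachment file names as separate pieces.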
How will Aparavi applications impact these data types?
The Aparavi Platform processes unstructured data types like office files, text files, PDFs, etc. We can index any type of file that has selectable text, meaning text that you can highlight by opening the file and dragging your mouse cursor over it. These data types can be indexed and maintained on the Aparavi Platform. Files without selectable text that instead contain images of text can also be processed, but this requires an OCR (Optical Character Recognition) application to extract the text from the image. This is something we will be looking into in upcoming versions of our applications. Managing the massive amount of unstructured data is where Aparavi applications will help an organization manage its data more effectively.
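For readers wondering what indexing selectable text means in practice, one common approach is an inverted index: a lookup from each word to the files that contain it. The sketch below is a generic illustration of that technique, not a description of how the Aparavi Platform is implemented; the function names and the simple word-splitting rule are assumptions, and it presumes the text has already been extracted from each file.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of file names whose text contains it.

    documents: dict of file name -> already-extracted text.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the file names that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

A file that contains only an image of text would contribute nothing to this index until an OCR step has turned the image into text.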
The future of Aparavi lies in data intelligence. Data intelligence will take data and provide the information customers need to leverage the power of Aparavi and make intelligent decisions about their unstructured data sets. Understanding what you have is the key. Content indexing, classification, and making data actionable are the major issues that impact customers today. Our mission is to provide customers with the tools needed to protect, analyze, and process data effectively, to adhere to compliance and governance rules, and ultimately to remove the data when it is time to remove it.