For years IT visionaries have been telling us, and showing us in charts, how much more text data is available to us than structured rowset data. I am sick of hearing this. If you don’t know by now that your organization is swimming in text data, reading yet another article about how much you have is not going to make a difference. We all know that businesses are capturing and creating massive amounts of text data from web sites, Twitter streams, Facebook posts, documents, notes and emails. The list could go on forever, and by the time you read this post your organization has probably accumulated yet another massive amount.
The problem with a message that tells us what we already know (over and over again) is that it is the wrong message. The message developers, business analysts, IT managers and data scientists should be hearing over and over again is this:
Start treating your text as text data and unlock the hidden potential!
The One-Trick Pony
Unfortunately, those of us who have spent a lifetime creating business apps for a living are sometimes one-trick ponies. Regardless of the platform (web, desktop, mobile), regardless of front-end or server-side development, our one trick is to somehow normalize the data into a series of database tables. MySQL, SQL Server or Oracle, it doesn’t matter which platform; we know and love our databases. We seem to think in terms of rows, columns and relations. Text is not relational, not in the normal sense. Have you noticed that your organization’s emails, contracts, support agreements, social feeds (don’t forget your competitors) and that copy of Moby Dick you were planning on reading do not look anything like the data stored in your databases? It is time to change how we think about and how we handle our text data. I am not talking about storage of text; I am talking about digging into text and unlocking its hidden potential. It’s time to learn new tricks, understand how to work with text data and bring business value to the organization.
We think in rowsets after years of working with rowset business data
We know and love our rowset data to a fault. When working with text data, most of us will focus on shoehorning the text into a set of database columns and tables. “Well, here is what we are going to do. We will use a stored proc and save the raw text in a SQL Server Text or varbinary field, and we will have some columns to help us find the right document.” This approach will work if you just want to store and retrieve the text. It is not a great solution if you want to unlock valuable information such as document similarity, categorization, taxonomy, tagging, keywords or semantic analysis. I used to be one of those developers, with thoughts like “Let’s get it into a database and go from there”. Years ago, if a client had asked me to create a custom search application across text data, I would have started thinking about how I could put the text in SQL Server and map user queries to full-text search queries. Sure, it could work, but I know better now. It is time we start to use text-based development techniques for our text data.
Text is not rowset data. It usually does not look like rowset data. Yet many of us desperately want to treat it like rowset data. There are ways to convert text to rowset data in a relational database, but you might lose the valuable part of working with text data if you do this. Text data should be treated and managed as text. Unlocking the valuable insights in your text data can result in a wealth of business value. These additional insights, along with your rowset data, can drive intelligent solutions, chatbots and other intelligent applications. Unlocking your text data can help you make better-informed decisions, produce actionable results and create features to feed to machine learning algorithms.
The solution – learn and use text development and analytics techniques
If you want to get more out of your text data, you should start looking at the data as text and start using text-based development and analytics approaches. With a little knowledge you will be able to unlock a wealth of information from your text data. It is a different mindset from the typical database development process. You will need to read up and learn new techniques such as Bag of Words, sentiment analysis, language models, and search and ranking, to name a few. With these text tools you can start gaining insights and providing business value from your text data.
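To make one of those techniques concrete, here is a minimal Bag of Words sketch in Python, using only the standard library. It counts the words in each document and then compares documents by cosine similarity, which is one common way (not the only one) to estimate document similarity. The function names, the simple regex tokenizer and the three sample "tickets" are all assumptions invented for illustration; real projects would usually reach for an established text library and handle things like stop words and stemming.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    shared = set(bag_a) & set(bag_b)
    dot = sum(bag_a[w] * bag_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, invented for illustration only.
docs = {
    "ticket_1": "The invoice total is wrong and the customer wants a refund",
    "ticket_2": "Customer asks for a refund because the invoice total is wrong",
    "ticket_3": "Password reset email never arrived",
}

bags = {name: bag_of_words(text) for name, text in docs.items()}

# The two billing tickets share most of their words; the password ticket shares none.
print("ticket_1 vs ticket_2:", cosine_similarity(bags["ticket_1"], bags["ticket_2"]))
print("ticket_1 vs ticket_3:", cosine_similarity(bags["ticket_1"], bags["ticket_3"]))
```

Notice that nothing here involves tables, columns or joins. The documents stay as text, and the similarity score falls out of the words themselves.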
Learning to work with text data using text development and analytics techniques is no different than learning how to process rowset data. The internet has great examples for whatever language you work with (though .NET seems to be light on text mining toolsets). I have worked with text using C#, C, Java and R. It’s only a matter of time before I work with text using Python. There are many client-based libraries as well as cloud-based services that you can leverage, so you do not need to start from scratch.
Get started!
It’s time to start treating text data as text, discovering valuable insights and providing critical business value for your client or organization. If you have text data and want to unlock its potential, drop me an email. I would love to chat with you about the hidden potential in your text data, as well as how to give your organization a strategic advantage and improve business processes.