Posted January 3, 2019 in Cloud

The Problem of Data in Machine Learning

The term ‘machine learning’ conjures up visions of complex networks of autonomous machines, crunching away on who knows what. This perception leads many to believe that the difficult tasks in the machine learning world are understanding intense mathematics, mastering emerging applications, and learning hard-to-understand programming paradigms. Those tasks are indeed difficult, but for anyone trying to truly master the science of machine learning, more “mundane” issues are usually the biggest show stoppers and the place where many people get stuck. Two of these issues are data and testing. In this post I will focus mostly on data, but will also touch on why testing is very important.

This will come as no surprise to the many experienced Data Warehousing and Business Intelligence (DW/BI) professionals out there, but I’ve encountered it so many times that it must still be a secret to some. Whether you are buying commercial applications or building your own internal machine learning department, the problem of data still exists. In either case, the data needed for these machine learning applications resides in systems of record or systems of engagement.

So you have the data. What could be the issue?

#1 Security/Risk/Compliance

In general terms, compliance, risk, and security tend to be the issues. These can all be dealt with via proper controls. In enterprises, however, obtaining the necessary approvals to use the data under these controls can take a long time. Some data may need to pass through a second system to protect privacy and other sensitive information stored in the source systems.

The results from your machine learning pipeline may need to be referenced back to those obfuscated fields from the source systems. This brings us back to the first mention of testing. These systems need to be sophisticated enough to obfuscate or encrypt data without losing the context or uniqueness of those fields, since losing either would render them useless for machine learning. The results may also need to be mapped back to a human-readable form. We have now introduced a step in the machine learning workflow that has nothing to do with machine learning, but is critical in maintaining the security and compliance of your organization.
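To make the obfuscation requirement concrete, here is a minimal sketch in Python, assuming keyed hashing (HMAC-SHA256) as the obfuscation method. This is one illustrative technique, not a description of any particular product or of a complete compliance control. The names `pseudonymize`, `obfuscate_column`, and `SECRET_KEY` are hypothetical, and in practice the key and the lookup table would have to live inside the protected system.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key belongs in a secrets store.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize(value: str, key: bytes) -> str:
    """Return a deterministic token: the same input always yields the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def obfuscate_column(values, key):
    """Tokenize a column and build the lookup needed to map results back."""
    lookup = {}
    tokens = []
    for value in values:
        token = pseudonymize(value, key)
        lookup[token] = value  # kept only on the secure side of the boundary
        tokens.append(token)
    return tokens, lookup


customers = ["alice@example.com", "bob@example.com", "alice@example.com"]
tokens, lookup = obfuscate_column(customers, SECRET_KEY)

# Uniqueness is preserved: repeated customers share a token, distinct ones do not.
same_customer = tokens[0] == tokens[2]
different_customer = tokens[0] != tokens[1]

# A model result keyed by token can be mapped back to a human-readable value.
readable = lookup[tokens[1]]
```

Because the tokens are deterministic, joins and group-bys on the obfuscated field still behave as they would on the original, which is exactly the property that needs to be tested before the data reaches the machine learning pipeline.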

#2 Data Management

The second issue that we encounter is the lack of data. Wait, what? I just said that the data needed exists in systems of record or engagement. Well, the retention policy for engagement data is notoriously fickle in many environments. Tracking conversations, customer interactions, and machine interactions generates a large volume of data, so many enterprises delete these records over time. In addition, the data from these systems needs to be formatted in a consistent manner for machines to read. I have encountered a number of instances where teams have attempted to read dialog information and have spent the majority of their time trying to understand the format of the data. Nothing frustrates data scientists more than being handed an open dump of free-format data to parse through for insights. But all is not lost on this front. Newer systems are much better at utilizing markup tags and other methods to help define data, and stream processing tools have come a long way in a short period, making this data easier to digest. However, ensure that your standard operating procedures do not delete this information. You will need enough historical data to help with learning, and if it doesn’t exist, your timelines will be pushed.

I could go on forever on this topic, but the last point I will address is the state of data in the enterprise. Real-world data is extremely messy. Though it is coming from your systems of record, be prepared for missing data, incorrect data, sparse data extracts, formatting errors, incorrect labels, and more. Before you say “that’s not my enterprise”, know that I have worked with extracts from many of the world’s top ERP, CRM, and BW systems, and they have all had these issues. Over time we have seen everything from comment fields in databases that have overwritten important data fields, to primary keys that are not unique, to sparse data where required data is supposed to be. These are things that have left us shaking our collective heads in disbelief.
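A quick profiling pass over an extract can surface several of the problems listed above before any modeling starts. The following is a minimal sketch in Python, assuming the extract has already been loaded as a list of dictionaries. The function `profile_rows` and the field names `order_id`, `customer`, and `amount` are hypothetical and used only for illustration.

```python
from collections import Counter


def profile_rows(rows, key_field, required_fields):
    """Count duplicate primary keys and missing required values in an extract."""
    key_counts = Counter(row.get(key_field) for row in rows)
    duplicate_keys = sorted(
        key for key, count in key_counts.items() if key is not None and count > 1
    )

    missing = {field: 0 for field in required_fields}
    for row in rows:
        for field in required_fields:
            value = row.get(field)
            # Treat None and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[field] += 1

    return {
        "row_count": len(rows),
        "duplicate_keys": duplicate_keys,
        "missing": missing,
    }


# Made-up sample rows showing a repeated key and two missing required values.
rows = [
    {"order_id": 1, "customer": "A", "amount": "10.0"},
    {"order_id": 2, "customer": "", "amount": "5.5"},
    {"order_id": 2, "customer": "B", "amount": None},
]
report = profile_rows(rows, "order_id", ["customer", "amount"])
```

For the sample rows, the report shows that key `2` appears more than once and that `customer` and `amount` are each missing in one row. A check like this will not fix the data, but it tells you what you are dealing with before a job fails halfway through.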

In short, the reason for this is usually that some corruption occurred and patches were implemented that resulted in this mess. As a first step, many folks use tools to dump tables into a Hadoop cluster. Messy data can cause your jobs to fail or, more importantly, cause you to misread outliers and recommend that your company hire more people during the low season. Bad example, but you get my point: “Garbage in = Garbage out”.
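As a small illustration of the outlier point, the sketch below flags suspicious values using the modified z-score, which is based on the median and the median absolute deviation (MAD) rather than the mean. This is one common technique that I am assuming for the example, not the only option. The function name `flag_outliers` and the sample numbers are made up; the constants 0.6745 and 3.5 are the conventional ones for this score.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values are identical; the score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical weekly order counts with one corrupted entry.
weekly_orders = [100, 102, 98, 101, 99, 1000]
suspect = flag_outliers(weekly_orders)
```

Here only the value `1000` is flagged. Whether a flagged value is a data error or a real event still needs a human decision, but catching it early keeps it from quietly driving a forecast.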

A successful machine learning ecosystem should recognize the importance of gathering, presenting, and maintaining the necessary data to feed the machine learning pipeline. Many organizations stumble through this process, resulting in frustrated data scientists and machine learning teams. Many more make data management a job for the machine learning specialists, which should not be the case. I have interviewed and hired many machine learning experts from other organizations who left because a lack of data kept them from doing their jobs. The data is there, but they cannot get to it to get started.
