Pipeline Blog

Data Ingestion and Normalization – Machine Learning Accelerates the Process

If you have ever looked through 20 years of inline inspection tally sheets, you will understand why it takes a machine learning technique (e.g., random forest or Bayesian methods) to ingest and normalize them into a database effectively. Attempted manually, it would be a monumental task, not to mention the risk of errors. By training a machine learning classifier on enough log data, however, this task becomes a perfect example of where data science can drastically improve integrity management practices.

Let’s start at the beginning.

The scenario we just described is typical for pipeline operators. They have been running PIGs for decades. During that time, they have used a variety of tool vendors, data standards, and even PIG technologies, as advancements have enabled operators to measure features that were previously missed. As a result, operators have thousands of tally sheets from many different tool vendors, each in a different format. Layer on further complexities such as different tabs, different column layouts, and, more critically, inconsistent naming of the key data fields needed to identify threats to the pipeline, and it becomes exceedingly difficult to use this data accurately for its intended purpose: predicting and preventing pipeline failures.

Some operators have invested in robust database systems to handle this task. Many, however, still rely on rudimentary systems, which become very expensive once the resources they require are factored in. Before they can identify anomalies requiring attention, engineers are burdened with the tedious and time-consuming task of filtering datasets using criteria from the tool vendor or from internal standards. They then have to analyze the selected data in simple tools like Microsoft Excel, which lack the formulas and methods the engineer wishes to use.

How does data science solve the issue of data ingestion and analysis, and what’s the benefit?

First, data ingestion can be handled using a standard, out-of-the-box machine learning technique. This is the easier part. The difficulty is in gathering the “truth” data needed to train the classifier. We have developed tools to support our machine learning efforts that collect “truth” data in much the same way a CAPTCHA works (the verification questions that ask you to select pictures containing certain objects so that a web app knows you aren’t a robot). Up to this point, individuals at the pipeline operator have mapped the columns on the ingested spreadsheets to our Alias model, thus creating our “truth” data. For example, “depth (%)” from operator A and “depth (percent)” from operator B both become “depth” for all operators. Eventually, we expect to have seen nearly every variant, at which point the classifier will be effectively complete. The cloud allows us to do this without sharing operator data.
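To make the column mapping idea concrete, here is a minimal sketch in Python. It is not our production classifier: it swaps in a simple nearest-match lookup on character trigrams instead of a random forest or Bayesian model, and everything in it (the TRUTH table, the function names, the 0.4 threshold) is a hypothetical stand-in. Only the two “depth” headers come from the example above.

```python
from collections import Counter

# Hypothetical "truth" data: raw column headers that a person has already
# mapped to a canonical alias. Only the two "depth" rows come from the post;
# the rest are invented for illustration.
TRUTH = {
    "depth (%)": "depth",
    "depth (percent)": "depth",
    "length (mm)": "length",
    "feature length": "length",
    "comments": "comment",
    "comment": "comment",
}


def trigrams(text):
    """Return the multiset of character trigrams of a lowercased, padded header."""
    padded = f"  {text.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def similarity(a, b):
    """Dice coefficient between two trigram multisets (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    overlap = sum((ta & tb).values())
    total = sum(ta.values()) + sum(tb.values())
    return 2 * overlap / total if total else 0.0


def classify_header(header, truth=TRUTH, threshold=0.4):
    """Map a raw header to the alias of its most similar labeled example.

    Returns None when nothing is similar enough, so that a person can
    label the header and add it to the truth data.
    """
    best = max(truth, key=lambda known: similarity(header, known))
    if similarity(header, best) < threshold:
        return None
    return truth[best]


if __name__ == "__main__":
    for raw in ["Depth (%)", "Depth [percent]", "wall thickness"]:
        print(raw, "->", classify_header(raw))
```

Headers that fall below the threshold come back as None, which is the point where a person would be asked to supply the mapping, much like the CAPTCHA-style labeling described above.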

Second, to understand the more complex threats identified by our Pattern Detection and Interacting Threat algorithms, it’s necessary to correctly classify features under an Alias structure. We utilize the industry Alias classification, category, and type structure. Again, data science is critical here. For example, it can detect patterns in the data, extract the word “dent” from the comment field, tag the record, and update its alias to reflect this while maintaining the original user classification. The identified feature can now be used in our Corrosion within Dent Interacting Threat algorithm (yes, it’s a mouthful).
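The “dent” example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our actual algorithm: the FeatureRecord fields, the tag_dent function, and the regular expression are all hypothetical, and a real system would learn from many more patterns than a single keyword.

```python
import re
from dataclasses import dataclass, field

# Match "dent", "dents", or "dented" as whole words, ignoring case.
# The word boundaries keep words such as "identified" from matching.
DENT_PATTERN = re.compile(r"\bdent(?:s|ed)?\b", re.IGNORECASE)


@dataclass
class FeatureRecord:
    """Hypothetical tally sheet row; field names are assumptions."""
    original_classification: str  # as reported by the vendor or user; never modified
    comment: str
    alias_type: str = "unclassified"
    tags: set = field(default_factory=set)


def tag_dent(record: FeatureRecord) -> FeatureRecord:
    """Tag the record and update its alias if the comment mentions a dent."""
    if DENT_PATTERN.search(record.comment):
        record.tags.add("dent")
        record.alias_type = "dent"
    return record


if __name__ == "__main__":
    row = tag_dent(FeatureRecord("metal loss", "Corrosion located in DENT at 6 o'clock"))
    print(row.alias_type, row.tags, row.original_classification)
```

Keeping original_classification untouched means the vendor’s or user’s original call is always available for audit, while downstream algorithms work from the alias and tags.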

Finally, the accounting profession figured out long ago that while Excel is a wonderful tool, it’s unable to provide consistency when used at scale. While the challenge of normalizing data within integrity management is daunting, it’s foundational to everything downstream. Data science can help get you there accurately and extremely fast. For example, one of our customers was able to ingest and normalize 845 tally sheets spanning 20+ years into a hybrid PODS database in 1.9 hours, or roughly eight seconds per sheet.

We’re sure that there are similar stories from other operators. We are eager to share and learn more, so please reach out.