Natural-Language-Processing Techniques for Oil and Gas Drilling Data

Topics: Data and information management

Recent advances in search, machine learning, and natural-language processing have made it possible to extract structured information from free text, providing a new and largely untapped source of insight for well and reservoir planning. However, major challenges are involved in applying these techniques to data that are messy or that lack a labeled training set. This paper presents a method to compare the distribution of hypothesized and realized risks to oil wells described in two data sets that contain free-text descriptions of risks.


In the oil and gas industry, risk identification and risk assessment are critical. This holds particularly true during the drilling stages, which cannot begin before a risk assessment is conducted. While these risk assessments are typically conducted in a group setting, the project drilling engineer usually has a predetermined list of risks and likelihood scores that are the focus of the conversation.

One problem with this approach is that drilling engineers are inherently biased by personal experience, which can affect their view of how likely an event is to happen. For example, a project drilling engineer who recently encountered well-control issues will likely overestimate the chance of future well-control issues. On the other hand, an engineer who has never encountered a well-control issue may unintentionally omit that risk from the risk assessment altogether.

Using historical data as a barometer could help the drilling engineer overcome these issues, though doing so requires a unified view of both prior risk assessments and prior issues encountered. Chevron maintains both data sets in disparate systems. The Risk Assessment database contains descriptions of risks from historical risk assessments, and the Well Operations database contains descriptions of unexpected events and associated unexpected-event codes, which categorize the unexpected events.

Using both data sets, the authors created a system that allows a project drilling engineer to enter a risk in natural language. The system then returns drilling codes related to this risk, produces statistics showing how often these types of events have happened in the past, and predicts the likelihood of the problem occurring in certain fields.


Well Operations Database. The Well Operations data set contains free-text descriptions of unexpected events that occur during the lifetime of the well. As events occur on the well, engineers create log entries describing the events and categorizing them. Each unexpected event includes a free-text description and two unexpected-event codes, Type and Type Detail.

The unexpected-event codes Type and Type Detail provide category and subcategory classifications for the event. This project focused only on five combinations of the Type and Type Detail labels, so Type and Type Detail were concatenated into a single label for each event. All other labels were grouped into a sixth category, Other, which includes the majority of the instances.
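The label construction described above can be sketched in a few lines of Python. This is only an illustration: the article does not list the five Type/Type Detail combinations that were used, so the pairs in `TARGET_LABELS`, the separator, and the sample events are hypothetical.

```python
from collections import Counter

# Hypothetical (Type, Type Detail) pairs of interest. The article does not
# name the five combinations it used, so these are placeholders.
TARGET_LABELS = {
    ("Stuck Pipe", "Differential"),
    ("Lost Circulation", "Partial"),
}


def combine_label(event_type: str, type_detail: str) -> str:
    """Concatenate Type and Type Detail, or fall back to 'Other'."""
    if (event_type, type_detail) in TARGET_LABELS:
        return f"{event_type} / {type_detail}"
    return "Other"


# Made-up events, for illustration only.
events = [
    ("Stuck Pipe", "Differential"),
    ("Rig Equipment", "Top Drive"),
    ("Lost Circulation", "Partial"),
    ("Weather", "Wind"),
    ("Rig Equipment", "Pump"),
]
labels = [combine_label(t, d) for t, d in events]
label_counts = Counter(labels)  # 'Other' dominates, as in the article
```

Collapsing everything outside the target set into one class keeps the label space small, at the cost of a heavily skewed class distribution.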

Risk Assessment Database. The Risk Assessment data set contains all of the risks anticipated for a well or set of wells in free-text form. The text usually consists of short phrases containing technical jargon.

Unlike the Well Operations data set, the Risk Assessment data set is not labeled with unexpected-event codes. For a set of approximately 1,400 risk-assessment instances, the codes were extracted automatically using handwritten rules; this set is referred to as the Risk Assessment Auto data set. To validate the results on the Risk Assessment data set, a random sample of approximately 700 instances was labeled by hand; this set is referred to as the Risk Assessment Gold data set.

Comparison. Although the Well Operations and Risk Assessment data sets were created for the same wells by people from the same organization on the same topics of unexpected events, significant differences exist between the two.

First, the vocabulary and style used in the Risk Assessment data set differ from the vocabulary and style used in the Well Operations data set. The Risk Assessment data set is cleaner and more formal than the Well Operations data set. Drilling engineers usually create the risk-assessment data, while rig crews create the Well Operations data. The Risk Assessment data set is standardized, while the Well Operations descriptions are inconsistent in vocabulary, structure, and spelling.

Second, the distribution of event types in the Well Operations data set does not match the distribution in the Risk Assessment data set. The Risk Assessment data set is a list of potential events, so costly events such as stuck pipe or lost circulation are more common, while the Well Operations data set is a list of events that actually occurred, so the majority are common events.


An application was created that receives a free-text risk assessment from an engineer and displays relevant events from past operations. Historical data are searched from the Well Operations data set to determine how many wells were drilled and how many were labeled with the problem described in the input risk assessment. This allows users to determine the likelihood of the risk occurring and helps the engineer make accurate predictions.
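As a rough sketch of the historical lookup, the likelihood can be estimated as the share of drilled wells that recorded at least one event with the predicted label. The function name, the dictionary layout, and the sample wells are assumptions made for illustration; the article does not describe how the application stores or queries its data.

```python
from typing import Dict, Set


def historical_frequency(wells_to_labels: Dict[str, Set[str]], label: str) -> float:
    """Fraction of drilled wells with at least one event carrying `label`."""
    if not wells_to_labels:
        return 0.0
    affected = sum(1 for labels in wells_to_labels.values() if label in labels)
    return affected / len(wells_to_labels)


# Hypothetical history: well identifier -> event labels seen on that well.
history = {
    "well-A": {"Stuck Pipe", "Other"},
    "well-B": {"Other"},
    "well-C": {"Lost Circulation"},
    "well-D": {"Stuck Pipe"},
}
rate = historical_frequency(history, "Stuck Pipe")  # 2 of 4 wells
```

Counting wells rather than individual events keeps a single troublesome well with many log entries from inflating the estimate.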

To compare the input risk assessment with the historical data, the risk assessments are enriched with the unexpected-event-code labels from the Well Operations data set. Once both data sets are labeled, users are able to compare a new risk assessment to the historical data and to determine the accuracy of the risk-assessment predictions by comparing the distributions of the Risk Assessment database’s risk assessments to the Well Operations database’s unexpected events for particular wells and well groups.
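One simple way to compare the two label distributions for a well or well group is sketched below. The article does not say which comparison measure was used; total variation distance is chosen here purely as an illustrative assumption, and the sample labels are made up.

```python
from collections import Counter
from typing import Dict, Iterable


def distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Normalize label counts into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Half the L1 distance between two distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical labels for one well group.
predicted = ["Stuck Pipe", "Stuck Pipe", "Lost Circulation", "Other"]  # risk assessments
realized = ["Other", "Other", "Other", "Stuck Pipe"]                   # unexpected events

gap = total_variation(distribution(predicted), distribution(realized))
```

A large gap suggests that the risks anticipated for the group were a poor match for the events that actually occurred.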

To obtain the unexpected-event-code labels for the risk assessments, a statistical classifier is trained on the labeled data in the Well Operations data set. The Well Operations instances are split into 60% training, 20% development, and 20% test sets, and the Risk Assessment database descriptions are treated as an unlabeled test set. A series of preprocessing functions is applied to the free-text descriptions: The text is made lowercase, numbers and punctuation are removed, n-gram features are extracted, and features that occur fewer than five times in the Well Operations training set are removed. The remaining features are used to convert each instance into a sparse-feature vector.
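The preprocessing steps above can be sketched with the Python standard library alone. Several details are assumptions, not the authors' choices: the article does not give the n-gram order (unigrams and bigrams are used here), the tokenizer (whitespace splitting), or how the split was randomized, and the regular expression assumes English text in ASCII.

```python
import random
import re
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def preprocess(text: str) -> List[str]:
    """Lowercase, replace digits and punctuation with spaces, split on whitespace."""
    return re.sub(r"[^a-z\s]", " ", text.lower()).split()


def ngrams(tokens: Sequence[str], max_n: int = 2) -> List[str]:
    """All n-grams of order 1..max_n, joined with spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(train_texts: Sequence[str], min_count: int = 5) -> Dict[str, int]:
    """Keep features seen at least `min_count` times in the training texts."""
    counts: Counter = Counter()
    for text in train_texts:
        counts.update(ngrams(preprocess(text)))
    kept = sorted(f for f, c in counts.items() if c >= min_count)
    return {feature: index for index, feature in enumerate(kept)}


def vectorize(text: str, vocab: Dict[str, int]) -> Dict[int, int]:
    """Sparse vector as {feature index: count}; unknown features are dropped."""
    vec: Counter = Counter()
    for feature in ngrams(preprocess(text)):
        if feature in vocab:
            vec[vocab[feature]] += 1
    return dict(vec)


def split_60_20_20(items: Sequence, seed: int = 0) -> Tuple[list, list, list]:
    """Shuffle, then split into 60% training, 20% development, 20% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_dev = len(shuffled) * 2 // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])
```

Because the vocabulary is built from the training split only, vocabulary that appears only in the Risk Assessment descriptions is silently dropped at vectorization time, which is one way the style mismatch between the two data sets can hurt the classifier.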

To address the differences in style and vocabulary between the Risk Assessment and Well Operations data sets, a set of labeled instances was automatically extracted from the Risk Assessment data set. A series of simple queries was written that captures unambiguous unexpected-event-code matches. These instances were added to the Well Operations training set, and the combined Well Operations/Risk Assessment Auto data set was used to train the classification model.
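A minimal sketch of this kind of rule-based labeling follows. The patterns, label names, and sample texts are hypothetical, since the article does not reproduce the queries. Here a rule fires only when exactly one label matches, which is one possible reading of "unambiguous."

```python
import re
from typing import List, Optional, Tuple

# Hypothetical rules: each pattern maps to one unexpected-event label.
RULES = [
    (re.compile(r"\bstuck pipe\b", re.IGNORECASE), "Stuck Pipe"),
    (re.compile(r"\blost (circulation|returns)\b", re.IGNORECASE), "Lost Circulation"),
]


def auto_label(text: str) -> Optional[str]:
    """Return a label only when exactly one rule matches; otherwise None."""
    matches = {label for pattern, label in RULES if pattern.search(text)}
    return matches.pop() if len(matches) == 1 else None


# Made-up risk-assessment phrases.
risks = [
    "Risk of stuck pipe in reactive shale",
    "Stuck pipe following lost circulation",  # two rules match: skipped
    "Wellbore instability",                   # no rule matches: skipped
]
risk_auto: List[Tuple[str, str]] = [
    (text, label) for text in risks
    if (label := auto_label(text)) is not None
]

# Hypothetical labeled Well Operations training instances.
well_ops_train = [("pipe stuck while tripping out", "Stuck Pipe")]
combined_train = well_ops_train + risk_auto
```

Skipping ambiguous and unmatched texts trades coverage for precision, so the automatically labeled set stays small but adds examples written in the Risk Assessment style.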


The classifier is evaluated by running the model against both the held-out Well Operations test set and the Risk Assessment Gold data set. The best results were found when the class weights were rebalanced and the training data were supplemented with the Risk Assessment Auto data set.

Rebalancing the class weights was only slightly helpful for the Well Operations test set but was very helpful for the Risk Assessment Gold data set.
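The article does not specify how the class weights were rebalanced. A common heuristic, shown here only as an assumption, weights each class inversely to its frequency so that rare classes contribute as much to the training loss as the dominant Other class. The label counts below are made up.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight for class c = n_samples / (n_classes * count of c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical, heavily skewed training labels.
train_labels = ["Other"] * 8 + ["Stuck Pipe"] * 2
weights = balanced_class_weights(train_labels)
# Other -> 10 / (2 * 8) = 0.625; Stuck Pipe -> 10 / (2 * 2) = 2.5
```

With these weights, each class carries the same total weight in the loss, which may explain why rebalancing matters more on the Risk Assessment Gold data set, where rare but costly events make up a larger share of the instances.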

Supplementing the Well Operations training data with the automatically labeled Risk Assessment instances improved the results on the Risk Assessment Gold data set even more than rebalancing the class weights.

The large improvement caused by the supplemental Risk Assessment Auto data set is particularly interesting because the distribution of labels in the Risk Assessment Auto data set differed significantly from both the Well Operations and the Risk Assessment distributions. This improvement demonstrates that a small investment in additional data (handwritten rules to obtain the Risk Assessment Auto data) can yield substantial improvements.

The results made it possible to build an application through which drilling engineers can better predict risks to wells by viewing the historical risk assessments, the unexpected problems encountered, and a unified view of the two.


Natural language is the primary means of human-to-human communication, but it poses problems for automated analysis. In drilling operations, enormous amounts of historical data are captured in this format, often stored as free-text descriptions of events. These historical data can be very useful if they can be mined and presented to engineers who are planning a similar drilling operation. This paper presents some techniques to navigate between and connect independently created free-text databases and shows how to supplement unstructured data with labels so that these data can be compared with and used alongside structured data. These natural-language-processing techniques allow unstructured data to be searched, organized, and mined, so that engineers can leverage the underlying insights without having to read through entire databases.

This article, written by Special Publications Editor Adam Wilson, contains highlights of paper SPE 181015, “Natural-Language-Processing Techniques on Oil and Gas Drilling Data,” by M. Antoniak, J. Dalgliesh, SPE, and M. Verkruyse, Maana, and J. Lo, Chevron, prepared for the 2016 SPE Intelligent Energy International Conference and Exhibition, Aberdeen, 6–8 September. The paper has not been peer reviewed.

01 October 2017

Volume: 69 | Issue: 10