Technique Blends Dimensionless Numbers and Data Mining To Predict Recovery Factors

Fig. 1: A scatter plot for dimensionless numbers according to K-means clustering. ORF=oil recovery factor; Npc=capillary number; Ng=gravity number; Dn=density number; Rl=aspect ratio.


Using attributes from a database of 395 deepwater Gulf of Mexico oil fields, a set of dimensionless numbers is calculated that helps in scaling attributes for all the oil fields. On the basis of these dimensionless numbers, various data-mining techniques are used to classify the oil fields. Subsequently, partial-least-square (PLS) regression is used to relate the dimensionless numbers to the recovery factor. This study shows that dimensionless numbers, together with data-mining techniques, can predict field behavior in terms of recovery factor for sparse data sets.


The digitization of information and the rise of inexpensive sensor technologies have ushered in a new era of computing in which acquired data are used to reveal hidden patterns and trends. This method of computing is very efficient in solving inverse problems where parameters affecting system characteristics are not completely known. Hydrocarbon reservoirs provide a classic case of a natural system where engineers have limited control over the design of the system that they work with; thus, they have to rely on indirect measurements to determine properties of the reservoir and use these properties to predict future trends. Performance prediction is usually accomplished with either analytical material-balance equations or numerical reservoir simulation. However, both methods use a bottom-up work flow, which suffers from a drawback: the need for accurate representation of subsurface geology. Data mining, on the other hand, provides an alternative top-down intelligent-reservoir-modeling approach, which uses measured reservoir properties as the basis for modeling.

Using publicly available information, including information on geology, geophysics, reserves, production, and infrastructure, the complete paper applies various data-mining and predictive-analytics algorithms to estimate recovery factor. Classical reservoir engineering assumes that recovery factor is dependent on rock properties, fluid properties, geological structures, and mode of production. Instead of using traditional deterministic methods, such as material balance or numerical simulation, this study uses data-driven analytics to estimate the recovery factor.

Data-Mining Methodology

The Bureau of Ocean Energy Management has published geological and geophysical properties of 1,300 deepwater fields in the Gulf of Mexico, of which 633 are depleted. Classifying fields with gas/oil ratios less than 9,700 scf/STB as oil producers results in 395 oil fields and 905 gas fields, indicating that the Gulf of Mexico is mainly a gas-prone basin. Gas volumes are converted to barrels of oil equivalent and are used in the calculations. Data for deepwater oil fields are selected for application of the data-mining process. To be successful, any data-mining technique relies on three qualities: clean data, a well-defined target to predict, and good validation to avoid overfitting.
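The oil/gas split described above can be sketched in a few lines. The 9,700 scf/STB cutoff comes from the text; the field records, the function names, and the 6,000 scf/BOE conversion factor are illustrative assumptions, because the article does not state which conversion the authors used.

```python
GOR_CUTOFF_SCF_PER_STB = 9_700   # threshold stated in the article
SCF_PER_BOE = 6_000              # ASSUMED conversion factor; not given in the article


def classify_field(oil_stb, gas_scf):
    """Label a field 'oil' or 'gas' from its gas/oil ratio (scf/STB)."""
    if oil_stb <= 0:
        return "gas"             # no liquid volume: treat as a gas field
    gor = gas_scf / oil_stb
    return "oil" if gor < GOR_CUTOFF_SCF_PER_STB else "gas"


def total_boe(oil_stb, gas_scf):
    """Combine oil and gas volumes into barrels of oil equivalent."""
    return oil_stb + gas_scf / SCF_PER_BOE


# Hypothetical fields: (name, oil in STB, gas in scf)
fields = [
    ("A", 1_000_000, 1_200_000_000),   # GOR = 1,200 scf/STB
    ("B", 50_000, 2_000_000_000),      # GOR = 40,000 scf/STB
]
labels = {name: classify_field(oil, gas) for name, oil, gas in fields}
```

With these made-up volumes, field A falls below the cutoff and field B above it.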

Step 1: Prior Knowledge and Objective Question. The Gulf of Mexico oilfield data set consists of 84 attributes. The most critical attributes are grouped in four classes: geological, reserves and production, petrophysical, and pressure/volume/temperature (PVT) and reservoir. While most of the geological attributes are the result of seismic interpretation, the engineering attributes (i.e., reserves, petrophysical, and PVT and reservoir) are reported by operators to the federal government.

The objective of the data mining is to classify fields and predict recovery factor using historical data from deepwater Gulf of Mexico oil fields.

Step 2: Data Preparation and Exploratory Data Analysis. Data Preparation. It is important to prepare data to suit data-mining algorithms. For example, decision trees can handle missing values while principal-component analysis (PCA) will not process missing values; hence, data types (e.g., numeric, binomial, nominal) for each attribute need to be checked and low-quality entries need to be removed. Reducing the number of attributes, without creating a significant loss in the performance of the model, is known as feature selection. This can be achieved by ranking attributes in order of their sensitivity toward the objective function. The complete paper used dimensionless numbers for scaling and dimensionality reduction of deepwater Gulf of Mexico oilfield data sets. Dimensionless numbers yield scaled attributes on the basis of which reservoir performance from a wide variety of fields can be compared.
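A minimal sketch of the two preparation steps mentioned above, dropping incomplete records and ranking attributes against the target. The article does not say how sensitivity was measured, so absolute Pearson correlation is used here as an assumed stand-in; the attribute names and values are hypothetical.

```python
import math


def drop_incomplete(rows):
    """Keep only records with no missing (None) values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0               # a constant column carries no information
    return sxy / math.sqrt(sxx * syy)


def rank_attributes(rows, target):
    """Rank attributes by |correlation| with the target, strongest first."""
    y = [r[target] for r in rows]
    names = [k for k in rows[0] if k != target]
    scores = {k: abs(pearson([r[k] for r in rows], y)) for k in names}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical records; "rf" stands for recovery factor.
rows = [
    {"porosity": 0.28, "depth_ft": 9000, "rf": 0.31},
    {"porosity": 0.31, "depth_ft": 12000, "rf": 0.36},
    {"porosity": None, "depth_ft": 10000, "rf": 0.25},   # missing value
    {"porosity": 0.25, "depth_ft": 11000, "rf": 0.24},
    {"porosity": 0.33, "depth_ft": 9500, "rf": 0.40},
]
clean = drop_incomplete(rows)
ranking = rank_attributes(clean, "rf")
```

Correlation only captures linear sensitivity; a tree-based importance measure or a sensitivity study would be reasonable alternatives.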

Exploratory Data Analysis. A scatter plot of reservoir and fluid properties illustrates the complexity of the basin where various geological, reservoir, completions, and operating constraints interact, resulting in definite behavior of fields in terms of production. This behavior is quantified by physical quantities such as production rates of liquids and gases. A wide variation in reported recovery factors is seen for reservoirs having comparable porosity, permeability, sand thickness, and fluid properties. While data are highly multicollinear and scattered, data-mining algorithms are likely to show patterns and clusters for reservoirs with similar characteristics.

Step 3: Generation of Dimensionless Numbers. Dimensionless numbers provide a way to scale data from different reservoirs and, therefore, act as scaling variables for comparing the performance of fields with different characteristics. These numbers deliver insight into the relative importance of driving forces such as viscous, gravity, and capillary forces for fluid flow in porous media. Four dimensionless numbers based on forces controlling the displacement process (i.e., gravity, viscous, capillary, and dispersion) are used in this study.
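The article does not reproduce the formulas for its dimensionless numbers. As a rough illustration of the idea only, the sketch below computes two groups in commonly used textbook forms: a capillary number as the ratio of viscous to capillary forces, and an effective aspect ratio that scales reservoir geometry by permeability anisotropy. Both definitions and all input values are assumptions and may differ from those in the complete paper.

```python
import math


def capillary_number(viscosity_pa_s, velocity_m_s, ift_n_m):
    """Viscous-to-capillary force ratio, mu * u / sigma (SI units).

    ASSUMED textbook form; the paper's Npc may be defined differently.
    """
    return viscosity_pa_s * velocity_m_s / ift_n_m


def effective_aspect_ratio(length, height, k_vertical, k_horizontal):
    """(L / H) * sqrt(kv / kh), with consistent length and permeability units.

    ASSUMED form of an aspect-ratio scaling group.
    """
    return (length / height) * math.sqrt(k_vertical / k_horizontal)


# Hypothetical reservoir values, chosen only for illustration.
nc = capillary_number(viscosity_pa_s=1.0e-3, velocity_m_s=1.0e-5, ift_n_m=0.03)
rl = effective_aspect_ratio(length=1000.0, height=20.0,
                            k_vertical=10.0, k_horizontal=1000.0)
```

Because each group is a ratio of like quantities, the result does not depend on the unit system as long as the inputs are consistent, which is what allows fields of very different size to be compared.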

Step 4: K-Means Clustering and Predictive Modeling. After generation of dimensionless numbers, data-mining algorithms are used to mine for hidden patterns and generate correlations from the multidimensional data set. The objective of data mining in this step is to cluster fields on the basis of characteristic attributes to describe behavior of recovery factor among the 395 oil fields. A K-means distance-based clustering technique was used to obtain clusters on the basis of dimensionless numbers (Fig. 1 above). Fig. 1 shows that higher capillary number (Npc) leads to higher gravity number (Ng), while aspect ratio (Rl) shows a nonlinear relationship with density number (Dn). It also shows distinct clusters for the Npc and Ng relationship with recovery factor, while clusters overlap in the scatter plot for Rl and Dn.
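For readers unfamiliar with K-means, a minimal sketch of the standard Lloyd iteration follows. The article does not state the number of clusters, the scaling applied, or the initialization used, so the choice of two clusters, the seeded random start, and the sample points (imagined as log-scaled pairs of two dimensionless numbers) are all assumptions.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels) for a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from the data
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:                        # keep old centroid if cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:
            break                              # assignments stable: converged
        labels = new_labels
    return centroids, labels


# Hypothetical, well-separated points (two features per field).
points = [(-7.0, -1.0), (-6.8, -0.9), (-7.1, -1.2),
          (-4.0, 1.0), (-3.9, 1.1), (-4.2, 0.8)]
centroids, labels = kmeans(points, k=2)
```

K-means is distance based, so features on very different scales are usually standardized or log-transformed first; results can also depend on the initial centroids, which is why several restarts are common in practice.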

To determine the correlation between recovery factor and the aforementioned dimensionless numbers, PLS regression is applied to the dimensionless numbers. This technique was chosen because of the multicollinearity of the dimensionless data points. PLS regression identifies the latent factors that account for most of the variation in the response, similar to PCA. It incorporates dimension reduction by linearly extracting relatively few latent factors that are most useful in modeling the response. PLS regression finds the predictors (dimensionless numbers) that are relevant to the target variable (recovery factor). This is in contrast to PCA, in which the principal components explain only the predictors. This study uses univariate PLS regression because there is only one target variable, recovery factor, with Dn being the next important variable influencing the model for prediction of recovery factor. By identifying factors that affect recovery-factor calculations, it is possible to narrow data-acquisition plans to collect critical information for expensive deepwater assets that can help in making informed decisions and mitigating risk.
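Univariate PLS regression as described above can be sketched with the classical NIPALS procedure: each latent factor is the direction in the predictors with the greatest covariance with the response, after which both are deflated. This is a generic illustration, not the authors' implementation; the function names, the number of components, and the data are assumptions.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pls1_fit(X, y, n_components):
    """Fit univariate PLS (PLS1) with the NIPALS algorithm."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]   # centered copy
    yc = [v - y_mean for v in y]
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y.
        w = [_dot([row[j] for row in Xc], yc) for j in range(p)]
        norm = math.sqrt(_dot(w, w))
        if norm < 1e-12:
            break                                  # nothing left to explain
        w = [v / norm for v in w]
        t = [_dot(row, w) for row in Xc]           # latent scores
        tt = _dot(t, t)
        load = [_dot([row[j] for row in Xc], t) / tt for j in range(p)]
        q = _dot(yc, t) / tt                       # regression of y on the scores
        # Deflate X and y so the next factor captures new variation.
        Xc = [[row[j] - ti * load[j] for j in range(p)] for row, ti in zip(Xc, t)]
        yc = [v - q * ti for v, ti in zip(yc, t)]
        components.append((w, load, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict the response for one sample by replaying the deflation."""
    x_mean, y_mean, components = model
    xc = [v - m for v, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in components:
        t = _dot(xc, w)
        y_hat += q * t
        xc = [v - t * l for v, l in zip(xc, load)]
    return y_hat


# Hypothetical, nearly collinear predictors with an exactly linear response.
X = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
y = [0.10 + 0.02 * a + 0.03 * b for a, b in X]
model = pls1_fit(X, y, n_components=2)
prediction = pls1_predict(model, [2.5, 5.0])
```

With as many factors as predictors, PLS reproduces the ordinary least-squares fit; in practice fewer factors are kept, and the number is typically chosen by validation on held-out data to avoid overfitting.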


This paper shows the integration of dimensionless numbers with data-mining techniques for successful classification of reservoir performance from fields exhibiting a wide assortment of reservoir and geological properties. K-means clustering on dimensionless numbers was able to classify oil fields. Dimensionless numbers helped in reducing a multidimensional data set to characteristic features that can be used to predict recovery factor. Through application of PLS regression, a correlation-coefficient value of 0.76 was obtained for 395 oil fields in the deepwater Gulf of Mexico, which suggests an acceptable accuracy for the predictions. Higher coefficients of Dn and Rl suggest strong influence of reservoir-fluid characteristics and reservoir geometry, respectively, in estimation of the recovery factor.

This article, written by Special Publications Editor Adam Wilson, contains highlights of paper SPE 181024, “Recovery-Factor Prediction for Deepwater Gulf of Mexico Oil Fields by Integration of Dimensionless Numbers With Data-Mining Techniques,” by Priyank Srivastava and Xingru Wu, SPE, University of Oklahoma, and Amin Amirlatifi, SPE, Mississippi State University, prepared for the 2016 SPE Intelligent Energy International Conference and Exhibition, Aberdeen, 6–8 September. The paper has not been peer reviewed.


01 October 2017

Volume: 69 | Issue: 10