A Behind-the-Scenes Look: Managing the Mountain of Data for the BP Statistical Review of World Energy
The BP Statistical Review of World Energy is an institution in the energy world. The Stats Review (as we call it) has appeared annually for more than 65 years. Its publication in June, accompanied by an analysis of developments in world energy markets by BP’s chief economist, currently Spencer Dale, is watched closely by energy analysts around the world.
My team and I at Heriot-Watt University’s Centre for Energy Economics Research & Policy (CEERP) have worked with BP’s Economics team since 2007, helping to put together the review by assembling and managing the mountain of data that go into it. Here I discuss how it is done and what goes on behind the scenes.
The Stats Review: Then and Now
The format of the review is, at its core, quite similar to that of the very first one, The Oil Industry in 1951: Statistical Review, put together in April 1952 by William Jamieson, who worked in the central planning department of the Anglo-Iranian Oil Company. That review, like today’s, reported data on reserves, production, consumption, and trade for individual countries and regions, covering the entire world. And it appeared just a few months after the end of calendar year 1951, the year surveyed in the report.
Today’s review has expanded greatly in coverage. All major commercial fuels and sources are covered: oil, natural gas, coal, nuclear, hydroelectricity, and electricity generation from renewable resources. [Not covered are firewood and other traditional biomass, and electricity generation that does not go into the grid or is not measured, such as solar-powered parking meters.] But like the original review, the statistics cover the entire world, reporting both individual countries and groups of countries.
And also, like the original publication, today’s review appears very soon after the end of the calendar year surveyed. Published in early June, it is the first major, comprehensive survey of the previous year’s world energy production and consumption.
A Mountain of Data—and a Few Months
These two features—comprehensive, global, country-level coverage and the review of the preceding calendar year—are the key parameters that define the task.
The statistics are built from the bottom up. For fuels and sources like natural gas and electricity, the database requires figures on a country-level, annual basis. But for oil and coal, composition matters. For example, the oil consumption data are assembled from the product level, and oil consumption reported for a particular country can be based on consumption of 10–20 different products ranging from LPG to residual fuel oil to bitumen.
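As a rough illustration of what "bottom up" means here, the following Python sketch sums product-level records into one oil consumption figure per country. The country names, products, and quantities are invented for the example and are not taken from the review's database; a real aggregation would also have to put every product into a common unit first.

```python
from collections import defaultdict

# Hypothetical product-level records: (country, product, thousand tonnes).
# All figures are invented for illustration only.
records = [
    ("Country X", "LPG", 120.0),
    ("Country X", "Gasoline", 950.0),
    ("Country X", "Residual fuel oil", 310.0),
    ("Country Y", "Gasoline", 400.0),
    ("Country Y", "Bitumen", 55.0),
]


def total_by_country(rows):
    """Sum product-level consumption into one figure per country."""
    totals = defaultdict(float)
    for country, _product, quantity in rows:
        totals[country] += quantity
    return dict(totals)


country_totals = total_by_country(records)
print(country_totals)
```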
The scale of the task is massive, and the time available to assemble the data is very short indeed. While some countries and organizations do publish data for the preceding calendar year in January or February, these are the exceptions. Most of our data-gathering and -processing has to take place in March through May. It is a very busy time in the office!
… and Thousands of Data Sources
Most of the data underlying the review are assembled from published sources or data gathered specifically by BP for the publication. This is very much by design. The idea is that the review presents an objective picture of global energy markets based on openly available data and official publications. Of course, sometimes questions can be raised about such official data—national figures on oil reserves are probably the most famous example. But if there is a single official source, or multiple sources all agree, then that will normally be the basis for what is published in the review.
Some of the data gathered by BP take the form of questionnaires or “Returns” sent to contacts in individual countries. Most commonly, the contacts are in fact official sources—a ministry of energy or a central statistical office. The returns are distributed by BP in January and come back to us in numbers for processing, which starts in March. The returns are very helpful for us, not only because they come back quickly but also because they come back in a standard format.
But the majority of data points are derived directly or indirectly from openly published sources. Official sources are an important category and include ministries and regulators, as well as bodies that cover multiple countries; Eurostat is an example of the latter.
Company reports are also a major source of data. Often an energy market may be dominated by a single national producer (e.g., a national oil company), an electricity generator that runs the country’s only nuclear plant, or a company that runs a country’s natural gas distribution network.
These sources generally follow different publishing formats and standards. It is also essential for our work that we track exact original sources so that we can identify not simply the original publication series (“Country X: Monthly Energy Statistics”) but the exact issue (“Country X: Monthly Energy Statistics, March 2017”). The result is a database which now identifies several thousand different sources.
Not surprisingly, the original published data are often not in the format needed. A monthly publication with the most recent data may not have coverage as complete as the annual publications, which appear with a lag of a year or two, because the monthly publication misses small producers or sellers. In this case, we need to adjust the monthly data so that they are comparable to the annual data that we used for earlier years.
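One simple way such an adjustment could work is to scale the latest monthly-based total by the ratio of annual to monthly figures in a year for which both exist. The sketch below assumes that approach; the function name and the numbers are hypothetical, and the actual adjustment used for the review may differ.

```python
def coverage_adjusted(monthly_latest, monthly_overlap, annual_overlap):
    """Scale the latest monthly-based total by the annual/monthly ratio
    observed in a year covered by both publications (an assumed method)."""
    if monthly_overlap <= 0:
        raise ValueError("overlap-year monthly total must be positive")
    coverage_factor = annual_overlap / monthly_overlap
    return monthly_latest * coverage_factor


# Hypothetical figures: in the overlap year the monthly series summed to 950
# while the fuller annual publication reported 1000.
adjusted = coverage_adjusted(monthly_latest=969.0,
                             monthly_overlap=950.0,
                             annual_overlap=1000.0)
print(round(adjusted, 1))
```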
Or, we have an official “flash estimate” for the most recent year’s growth in production, but no figure for the actual level. In this case, we will often apply the estimate’s growth rate to the most recent officially reported level.
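The arithmetic is simply the previous level multiplied by one plus the growth rate. A minimal Python sketch, with a hypothetical function name and made-up numbers:

```python
def level_from_growth(previous_level, growth_rate_percent):
    """Apply a percentage growth estimate to the last reported level."""
    return previous_level * (1.0 + growth_rate_percent / 100.0)


# Hypothetical example: last official level of 200 units and a flash
# estimate of 2.5% growth for the most recent year.
estimated_level = level_from_growth(200.0, 2.5)
print(estimated_level)
```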
Or, the original data are published in a country where the fiscal year is different from the calendar year—the most important examples are India, Pakistan, and Australia. This means we need to reassemble calendar year data from the monthly components.
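The reassembly can be sketched in Python as follows, assuming monthly figures are available and keyed by year and month. The data layout, the function name, and the values are assumptions made for the example; the April-to-March fiscal year matches India's.

```python
def calendar_year_total(monthly, year):
    """Sum January to December of `year` from a {(year, month): value} dict."""
    keys = [(year, month) for month in range(1, 13)]
    missing = [key for key in keys if key not in monthly]
    if missing:
        raise KeyError(f"missing months: {missing}")
    return sum(monthly[key] for key in keys)


# Two hypothetical fiscal years running April to March:
# every month of FY 2015/16 is 10.0 and every month of FY 2016/17 is 12.0.
monthly = {}
for month in range(4, 13):
    monthly[(2015, month)] = 10.0
    monthly[(2016, month)] = 12.0
for month in range(1, 4):
    monthly[(2016, month)] = 10.0
    monthly[(2017, month)] = 12.0

# Calendar 2016 takes three months from FY 2015/16 and nine from FY 2016/17.
total_2016 = calendar_year_total(monthly, 2016)
print(total_2016)
```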
And sometimes we have little published data to go on, especially for small developing economies. In such cases, we can use indirect methods. For example, if most coal is used for electricity generation, and we have data on electricity but not on coal consumption, we might use thermal generation to estimate coal consumption.
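A back-of-the-envelope version of the coal example might look like the Python sketch below. It assumes a plant efficiency and a coal calorific value, both of which are hypothetical inputs chosen for round numbers; the text does not say which values or method are used for the review.

```python
GJ_PER_TWH = 3.6e6  # 1 TWh = 3.6 PJ = 3.6 million GJ


def coal_from_generation(coal_generation_twh, plant_efficiency,
                         calorific_value_gj_per_tonne):
    """Estimate coal burned (million tonnes) from coal-fired generation."""
    if not 0.0 < plant_efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    # Fuel energy needed to produce the electricity at the given efficiency.
    fuel_input_gj = coal_generation_twh * GJ_PER_TWH / plant_efficiency
    tonnes = fuel_input_gj / calorific_value_gj_per_tonne
    return tonnes / 1e6


# Hypothetical inputs: 10 TWh of coal-fired generation, 36% efficiency,
# and coal with an energy content of 25 GJ per tonne.
coal_mt = coal_from_generation(10.0, 0.36, 25.0)
print(round(coal_mt, 2))
```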
Or, if few data are available, we might use estimates of economic growth to infer a figure for energy use. In this last case, however, the figure would be used in the review as part of a regional aggregate that typically also includes data derived from official data—oil consumption in “Other Africa” in the report is an example.
The days of storing all data in Excel spreadsheets are long past. The data are now stored in a relational database that supports SQL (Structured Query Language). It currently has over 1 million entries; about 300,000 of these are in active use, and the remainder are there either for reference or because they are outdated entries from previous reviews. Using a relational database has many advantages, not the least of which is imposing structure and robustness. Certain types of errors that plague spreadsheet work (mistakes in formulas, linking to the wrong cells, or typos in names) are much harder to make in a relational database, or may even be impossible.
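To show how a relational database can rule out a typo in a name, here is a small sketch using Python's built-in sqlite3 module. The table and column names are invented for the example and say nothing about the review's actual schema or database engine; the point is only that a foreign-key constraint rejects a country name that is not in the reference table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE country (name TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE observation (
        country TEXT NOT NULL REFERENCES country(name),
        year    INTEGER NOT NULL,
        value   REAL NOT NULL,
        PRIMARY KEY (country, year)
    )
""")
conn.execute("INSERT INTO country VALUES ('Norway')")


def try_insert(country, year, value):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO observation VALUES (?, ?, ?)",
                     (country, year, value))
        return True
    except sqlite3.IntegrityError:
        return False


correct_accepted = try_insert("Norway", 2016, 1.0)
misspelled_accepted = try_insert("Norwya", 2016, 1.0)  # typo is rejected
print(correct_accepted, misspelled_accepted)
```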
For example, a key feature of the database is that it must accommodate any kind of physical or energy unit that appears in a source publication. All conversions are done internally by the database, so there is only one place where the formula needs to be right. If we were working solely in Excel, we would be doing conversions by hand many hundreds of times, and getting it right (and uniformly right, so that the exact same conversion factors are used every single time) would be extremely challenging. With a relational database, it is easy… once it has been programmed. The startup phase in 2007, when we switched to a relational database, was challenging.
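The idea of keeping every conversion factor in one place can be sketched in a few lines of Python. The unit list below is a small illustrative subset, and the function and table names are hypothetical; the review's own conversion factors and implementation are not given in this article.

```python
# Single table of factors: how many joules one of each unit represents.
JOULES_PER_UNIT = {
    "J": 1.0,
    "GJ": 1e9,
    "kWh": 3.6e6,
    "toe": 41.868e9,  # tonne of oil equivalent, commonly defined as 41.868 GJ
}


def convert(value, from_unit, to_unit):
    """Convert between energy units via joules, using the one shared table."""
    return value * JOULES_PER_UNIT[from_unit] / JOULES_PER_UNIT[to_unit]


print(convert(1.0, "toe", "GJ"))
print(convert(1000.0, "kWh", "GJ"))
```

Because every conversion passes through the same table, correcting a factor there corrects it everywhere, which is the property the paragraph above describes.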
There are actually two different sorts of databases inside the review. The main database covers production, consumption, reserves, etc., and is on a country basis. But the second-ever review in 1952 introduced coverage of trade between countries and regions, and we also maintain a database of trade in oil and natural gas.
Trade is special for two reasons. Firstly, the complexity of the database is an order of magnitude greater because we have to track trade flows for each pair of countries. Secondly, we can, and usually do, have two official sources for a trade flow: what Country A says it exports to Country B, and what Country B says it imports from Country A. This is called “mirror trade” and is well known in the economics literature on international trade. Often the two numbers are close, but they can be very different, sometimes wildly so. We assemble trade flows using an algorithm that gives priority to the import-reporting country, on the grounds that the importing country in principle knows for sure what arrived; if an export shipment was diverted mid-route, say, the export reporter might not be aware of it. But we can and do override this when our judgment is that the export reporter is the more reliable.
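The priority rule can be expressed as a short Python sketch. The function name and the `prefer_exporter` flag are hypothetical, and falling back to the other report when the preferred one is missing is an assumption of this example rather than something stated about the review's algorithm.

```python
def reconcile_flow(import_report, export_report, prefer_exporter=False):
    """Choose one figure for a flow from an exporter to an importer.

    Either report may be None if that country published nothing.
    By default the importer's figure wins; `prefer_exporter=True`
    models a manual override for a more reliable export reporter.
    """
    if prefer_exporter:
        first, second = export_report, import_report
    else:
        first, second = import_report, export_report
    # Assumed fallback: use the other side's figure if the preferred is missing.
    return first if first is not None else second


# Hypothetical mirror figures for one flow: the importer reports 95,
# the exporter reports 100.
print(reconcile_flow(95.0, 100.0))
print(reconcile_flow(95.0, 100.0, prefer_exporter=True))
print(reconcile_flow(None, 100.0))
```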
Onward and Forward
What I have not mentioned yet is the hard work of data checking and assembly carried out by my team and Spencer’s team at BP. Readers of this magazine are data-oriented and it is probably enough for me to say that you should take your most fiddly, data-oriented project… and multiply by the biggest plausible number… and then add an immovable deadline. But it is fascinating work, and, in its own way, a lot of fun, not least because it is a great pleasure to work with two talented teams of smart energy economists.
Perhaps the work of the teams is best summed up by the comments that John Underhill, our chief scientist at Heriot-Watt University, made when this year’s review was published:
“Production of BP’s world-leading annual publication benchmark on oil, gas, and energy trends is something which Heriot-Watt is justifiably proud to be part of and associated with. The university’s research team draws upon and integrates its extensive and deep-rooted knowledge of the energy industry with its economics research expertise to great effect. The combination of Heriot-Watt University’s innovative approach with the analytical output from the BP Statistical Review aligns well with our global vision to help understand the resilience of energy resources for communities around the world.”
Mark Schaffer is the director of the Centre for Energy Economics Research & Policy, professor of economics, principal investigator, and head of the Heriot-Watt University Statistical Review Project. His fields of research include transition and emerging economies, labor markets, applied econometrics, economic history, quantitative criminology, and energy economics. Schaffer is also a Research Fellow of the Centre for Economic Policy Research in London, a Fellow of the Royal Society of Edinburgh and the IZA Institute for the Study of Labor Economics. He has worked as a consultant for organizations such as the World Bank, the International Monetary Fund, the European Bank for Reconstruction and Development, the United Nations, and the Department for International Development of the UK Government. Schaffer graduated magna cum laude from Harvard University, and holds degrees in economics including an MA from Stanford University and an MSc and PhD from the London School of Economics.