Why is impactful data hard to find, and what does it mean for the future?

04.03.21 | BY Simon Mahony

Many investors have in their heads a model of the commercial world that feels natural to them. In this model the basic units are the listed companies: there are Exxon, Shell, and BP in oil; P&G, Nestle, and Colgate in consumer staples; and so on. Working with these entities every day, it can be easy to forget that they are often financial abstractions sitting on top of a world made up of their operating subsidiaries. This is true not only for obvious conglomerates and holding companies like Berkshire Hathaway, but also for many large companies that have multiple product lines and offerings. While investors may spend most of their time in the financial world, customers, suppliers, and employees primarily interact with the operating world.

This abstract-sounding idea explains one of the most common reasons why it is hard for investors to find really impactful data. Most data comes into existence at the behest of people who care about the operating world – it could be regulators who want to ensure transparency (resulting in, e.g., a database of government tender offers), advertisers looking to reach specific targets (e.g. location data), or employees looking to evaluate their next career move (e.g. Glassdoor-style datasets). As such, datasets are usually cut up to fit the operating world, covering particular products, regions, or company types that do not necessarily map neatly onto the listed companies. The result is that you are often left with datasets that cover only a small part of what you care about as an investor.

We can illustrate this with an extreme example: imagine you had access to all of the consumer transactions in Spain, such that you could know with complete certainty and in real time what every consumer-facing business in Spain was making. How useful would that data actually be? We can begin with the most obvious application and ask where it makes intuitive sense for this data to be predictive of revenues, starting with the IBEX 35, a reasonable-ish proxy for the largest companies in Spain. We can first remove companies where there is no obvious relationship between our data and their revenues: we can throw out the commodities companies, most financials, and the likes of steel company Acerinox, infrastructure operator Ferrovial, pharma company Grifols, and wind energy company Siemens Gamesa. In fact, after this exercise we are left with Inditex (the parent company of Zara), IAG (the parent company of British Airways and Iberia), Melia Hotels, and Telefonica. Of these, the two most likely candidates are Melia and Inditex: the former has 38% of its rooms in Spain, and the latter generates 16.8% of its sales in Spain. Without question, being able to forecast those component parts with 100% accuracy would be an edge well worth having, but we can see that even in this perfect-world hypothetical we would be some way from simply knowing the revenues in advance.
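To make the arithmetic concrete, here is a minimal sketch in Python. It assumes total revenue growth is a share-weighted sum of growth in the known part (Spain) and the unknown part (everything else), and that the known part is forecast perfectly. The function name and the 5-point uncertainty figure are hypothetical, chosen only for illustration, and Melia's 38% share of rooms is used as a rough stand-in for its revenue share.

```python
def residual_growth_uncertainty(known_share: float, sigma_unknown: float) -> float:
    """Uncertainty left in total revenue growth when one share is known exactly.

    Assumes: total growth = known_share * g_known + (1 - known_share) * g_unknown.
    If g_known is observed perfectly, only the second term remains uncertain.
    """
    if not 0.0 <= known_share <= 1.0:
        raise ValueError("known_share must be between 0 and 1")
    return (1.0 - known_share) * sigma_unknown


# Hypothetical: 5 percentage points of uncertainty on growth outside Spain.
SIGMA_UNKNOWN = 0.05

# Shares taken from the text (Melia's is a share of rooms, used as a proxy).
for name, share in [("Inditex", 0.168), ("Melia", 0.38)]:
    remaining = residual_growth_uncertainty(share, SIGMA_UNKNOWN)
    fraction_left = remaining / SIGMA_UNKNOWN
    print(f"{name}: {fraction_left:.1%} of the uncertainty remains")
```

Under these assumptions, perfect knowledge of Spain removes only as much uncertainty as Spain's share of the business, which is why even the perfect dataset leaves most of the revenue question open.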

Not every dataset/investment pairing has this problem – particularly when the data covers important regions like the US and China – but now that the lowest-hanging data fruit has been harvested, more often than not some version of this problem explains the difficulty of finding impactful data in a ready-to-consume form. Nor is this a bad thing: investing is a game of relative advantage, and there is plenty of advantage to be seized here by those willing to try. There are broadly three ways in which an investor can deal with the situation described above: (1) combine datasets to get insights that are impactful, (2) look for smaller, less diversified companies where it is easier to get a good data/investment fit, or (3) attempt to neutralise the areas with no edge.

The first approach embraces the complexity and seeks to benefit from it, but it demands some sophistication on the part of the investor, who must have the capability and the resources to acquire multiple datasets and combine them in ways that produce useful insights. Sometimes this is as simple as combining similar datasets in different regions – for a CPG company whose biggest markets are the USA and mainland China we might need to purchase two point-of-sale datasets – while at other times we may be combining entirely different datasets. But even in the simple cases, combining them can be non-trivial from a data engineering point of view and therefore expensive. Common problems include incompatible entity identifiers (e.g. Coke vs. The Coca Cola Company vs. KO vs. US1912161007), non-matching time periods (e.g. calendar vs. fiscal), and different structures (e.g. different product categories or age intervals). Less obvious is the increased analytical burden that comes with more data sources: to be in a position to generate useful insights, the researcher needs to take the time not only to understand each dataset but also to work out how they should be integrated with and weighed against each other. The result of all this is that complexity and resource demands tend to increase not linearly but closer to exponentially with each new dataset added.
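As a rough illustration of the identifier problem, the sketch below maps the different names for one company onto a single canonical key before summing two point-of-sale datasets. The alias table, the dataset rows, and the sales figures are all made up for the example; a real pipeline would also need to reconcile time periods and product categories, which this sketch ignores.

```python
from collections import defaultdict

# Hypothetical alias table: every label a vendor might use -> one canonical ID.
ALIASES = {
    "coke": "KO",
    "the coca cola company": "KO",
    "ko": "KO",
    "us1912161007": "KO",
}


def canonical_id(raw_name: str) -> str:
    """Normalise a vendor-specific label to the canonical ID, or fail loudly."""
    key = raw_name.strip().lower()
    if key not in ALIASES:
        raise KeyError(f"unmapped entity: {raw_name!r}")
    return ALIASES[key]


def combine(*datasets):
    """Sum sales per (canonical ID, period) across any number of datasets.

    Each dataset is a list of (entity_label, period, sales) tuples.
    """
    totals = defaultdict(float)
    for rows in datasets:
        for entity, period, sales in rows:
            totals[(canonical_id(entity), period)] += sales
    return dict(totals)


# Illustrative rows only: two vendors labelling the same company differently.
us_pos = [("Coke", "2020Q4", 120.0)]
china_pos = [("US1912161007", "2020Q4", 45.0)]

combined = combine(us_pos, china_pos)
print(combined)
```

Raising an error on unmapped names, rather than silently dropping them, is a deliberate choice here: missing coverage is exactly the kind of gap that is easy to overlook when datasets are merged.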

The second approach tries to avoid the resource intensity of the first without having to concede a data advantage. It typically means looking for companies with narrow regional or product exposure, and the best place to look for these is further down the cap scale, where companies on average are less diversified. As a rule, it is likely to be easier to find straightforward data/investment fits for companies with more concentrated businesses. In the example above looking at Spanish transactional data, we could look at smaller-cap Spanish companies, for example local retailers like DIA from the IBEX Small Cap, which has roughly two-thirds of its revenues in Spain and where prima facie we would expect our transaction data to be more impactful. The obvious problem with this approach is that many investors either cannot or will not have their investment universes restricted in this way.

The third approach is the one usually employed by systematic funds. Imagine we have excellent insight into Nestle’s bottled water business and we know it is going to beat expectations, but we know nothing about the other businesses. If we buy Nestle, we have to accept the risk from the unknown performance of the other businesses and, if only done once, this risk can be tough to stomach. However, this approach becomes a lot more appealing when we can trade the same edge multiple times: over many bets, the risk from the unknowns tends to cancel out (as half the time you’re on the right side and half the time you’re on the wrong side) and what remains is the edge from the bottled water business. This is a sound approach in theory, but to work it requires a large number of iterations, which is a luxury the discretionary investor may not have. In practice, for a discretionary investor, this approach needs to be evaluated on a case-by-case basis, where we assess the magnitude and volatility of the unknowns as well as the trading frequency of the investor.
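The cancelling-out argument can be sketched with a simple calculation. The sketch assumes every bet carries the same expected edge and that the noise from the unknown businesses is independent from bet to bet, in which case the noise in the average return shrinks with the square root of the number of bets. The function name and the 1% edge and 10% noise figures are hypothetical.

```python
import math


def edge_to_noise_ratio(edge: float, noise_sd: float, n_bets: int) -> float:
    """Ratio of the average edge to the noise in the average return.

    Assumes each bet has the same expected edge and independent noise with
    standard deviation noise_sd, so the noise in the average over n_bets
    is noise_sd / sqrt(n_bets).
    """
    if n_bets < 1:
        raise ValueError("n_bets must be at least 1")
    return edge / (noise_sd / math.sqrt(n_bets))


EDGE = 0.01      # hypothetical: 1% expected gain per bet from the known business
NOISE_SD = 0.10  # hypothetical: 10% standard deviation from the unknown businesses

for n in (1, 25, 100, 400):
    ratio = edge_to_noise_ratio(EDGE, NOISE_SD, n)
    print(f"{n:>3} bets: edge/noise = {ratio:.2f}")
```

Under these assumptions the ratio grows only with the square root of the number of bets, so quadrupling the number of bets merely doubles it. That is one way to see why an investor who trades infrequently may not get enough iterations for the unknowns to wash out.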

In our experience, most investors eventually pursue the first option and introduce elements of the second and third when it makes sense to do so for reasons of resource efficiency. Once we understand this, we can start to ask what the implications are for what funds will look like in the future and what investors will need to do if they want to keep up. Our view is that the need to combine datasets for insights is a key driver of the growing importance of research infrastructure. Infrastructure in this sense describes the capability to find, evaluate, ingest, engineer, combine, and research new datasets – it involves not just the technology elements but also the human capital, in the form of the knowledge and understanding of the people who operate it. The value of good infrastructure is that it allows researchers to generate insights more effectively and with less time and cost, factors which of course become much more important as the volume and complexity of the data handled by the fund grow.

Finally, the broader implications of this are worth addressing. One point is that infrastructure tends to be costly and slow to build, and over time this is likely to increase barriers to entry for new funds, which would need to invest upfront to match infrastructure that incumbent funds have built over years. Further, we are already seeing, and expect to continue to see, the growing importance of infrastructure resulting in more stable fund performance and a clearer winners-stay-winners dynamic, as better data infrastructure helps performance, which enables further investment in better infrastructure, which enables better performance, and so on.