S05E06: Mining Unstructured Data for the Intangible

My guest in this episode is Kai Wu, CEO and founder of Sparkline Capital.

Kai is a pioneer in the measurement of intangible value. Using machine learning, he tackles unstructured data sources like patent filings, earnings transcripts, LinkedIn network connections, and GitHub code repositories to try to measure value across the four key pillars of Brand, Intellectual Property, Network, and Human Capital.

We discuss why intangibles are important, how they differ from the traditional factor zoo, the opportunities and risks of unstructured data, and how even big data can have small data problems within it.

Finally, we discuss Kai’s most recent applications of his research to the world of crypto.

Please enjoy my conversation with Kai Wu.

Transcript

Corey Hoffstein 00:00

Okay, Kai, ready to go? Let’s do it. All right, 3, 2, 1, let’s go. Hello and welcome, everyone. I’m Corey Hoffstein, and this is Flirting with Models, the podcast that pulls back the curtain to discover the human factor behind the quantitative strategy.

Narrator 00:21

Corey Hoffstein is the co-founder and chief investment officer of Newfound Research. Due to industry regulations, he will not discuss any of Newfound Research’s funds on this podcast. All opinions expressed by podcast participants are solely their own opinion and do not reflect the opinion of Newfound Research. This podcast is for informational purposes only and should not be relied upon as a basis for investment decisions. Clients of Newfound Research may maintain positions in securities discussed in this podcast. For more information, visit thinknewfound.com.

Corey Hoffstein 00:53

If you enjoy this podcast, we’d greatly appreciate it if you could leave us a rating or review on your favorite podcast platform. And check out our sponsor this season. It’s, well, it’s me. People ask me all the time, Corey, what do you actually do? Well, back in 2008, I co-founded Newfound Research. We’re a quantitative investment and research firm dedicated to helping investors proactively navigate the risks of investing through more holistic diversification. Whether through the funds we manage, the exchange-traded products we power, or the total portfolio solutions we construct, like the Structural Alpha Model Portfolio Series, we offer a variety of solutions to financial advisors and institutions. Check us out at www.thinknewfound.com. And now, on with the show. My guest in this episode is Kai Wu, CEO and founder of Sparkline Capital. Kai is a pioneer in the measurement of intangible value. Using machine learning, he tackles unstructured data sources like patent filings, earnings transcripts, LinkedIn network connections, and GitHub code repositories to try to measure value across the four key pillars of brand, intellectual property, network effects, and human capital. We discuss why intangibles are important, how they differ from the traditional factor zoo, the opportunities and risks of unstructured data, and how even big data can have small data problems within it. Finally, we discuss Kai’s most recent applications of his research in the world of crypto. Please enjoy my conversation with Kai Wu. Kai Wu, welcome to the program. This has been a long time in the making. You have been writing some really fascinating research pieces over the last couple of years, and I’m ecstatic we got to connect here and are getting the opportunity to dive into those. So thank you for joining me. Pleasure is mine. So why don’t we start off, I think as always, with perhaps your background for listeners who haven’t gotten a chance to dive into your research yet?

Kai Wu 03:05

Sure. So my first job out of college was as a quantitative researcher for Jeremy Grantham at GMO. My focus was on bubbles, asset allocation, and portfolio construction. After a few years, I left to pursue more entrepreneurial endeavors. This led me in 2014 to build algorithmic trading strategies in cryptocurrencies, and then later to join a former GMO colleague in launching a hedge fund called Kaleidoscope Capital, which I helped build from scratch to a few hundred million. And then from there, I left to start Sparkline Capital, where we manage an ETF and some private funds. Our focus is on using machine learning and unstructured data to quantify the booming intangible economy.

Corey Hoffstein 03:49

So I know that since the 2010s, there’s just been a tremendous amount of hype around the use of machine learning, alternative data, and unstructured data. But a lot of that seems to have died down as of late, or maybe it reached peak fury in 2017, 2018. Why do you think more firms haven’t been successful with the application of those techniques?

Kai Wu 04:14

Yeah, look, first of all, I do think that there is actually a tremendous amount of value in alternative data and machine learning. And there are examples of several groups who have managed to successfully apply one or the other or both. But your point is well taken that we haven’t yet seen broad adoption, that the hype around alternative data changing the investment industry has not come to fruition. Most firms use it as maybe a small part of their investment process, if at all. Of course, to temper expectations, only 5 to 15% of active assets are even managed by quants at all. So I guess in that view, it’s not super surprising. But I think the problem is that taking advantage of machine learning really requires a rather large investment. Alternative data and the infrastructure required to support it can be very expensive. And even worse, the prohibitive item here really is getting the right people to run it. Machine learning is complicated and has many pitfalls. It’s also a relatively new field, so the pool of experienced folks is pretty small. I actually wrote a paper in mid-2019 called Investment Management in the Machine Learning Age. In this paper, I outlined three ways to apply machine learning to the industry. The first is unstructured data: the use of machine learning to bring unstructured data into the investment process. Two is data mining. This is the idea of taking hundreds if not thousands of features or signals, factors, alphas, whatever you want to call them, and allocating capital across them, deciding which ones you want to invest in and which you want to ignore. And then third is risk models. In the quant world, we’ve seen the most effort applied to the second use case, in other words, trying to figure out how to allocate capital across these thousands of features. This has actually had significant success, but mostly at higher frequencies. 
So high-frequency trading, stat arb. But for capacity reasons, obviously, most capital is managed at lower frequencies, so it doesn’t matter as much for the average investor. And then the problem is that at the lower frequencies, we have sort of a small data problem. For example, every decade there are only 10 annual filings, and these are often serially correlated, so the true dimensionality is actually quite a bit smaller. That being said, machine learning models such as random forests and boosting do provide incremental value, even if some of the more complex neural network and deep learning models are not super useful at this horizon. In particular, they can be helpful because they allow us to capture nonlinearities and interactions, and they are more robust to overfitting. But of course, this comes with trade-offs, such as in terms of interpretability. I think we haven’t seen as much innovation on the risk model front. This is an underappreciated dimension. Quants use risk factor models such as Barra in US equities. The way Barra works, for those who don’t know, is that it has a few dozen industry factors, like tech and consumer discretionary, and a few dozen style factors: value, growth, etc. The Barra model has been largely unchanged since becoming the industry standard several decades ago. And I think the biggest weakness of the model is actually its reliance on the GICS industry classifications. With GICS, these are binary definitions. There are 11 different sectors, so firms like Tesla can’t be both tech and auto. They’re also very static. So if a company like Amazon starts investing in a new business line like AWS, that doesn’t get incorporated into the risk model. We’ve actually shown that natural language processing models can be used to create superior text-based industry definitions that can capture the greater richness and nuance of the business landscape. 
So in this framework, for example, Tesla will be considered similar to both GM and Ford, and then also to Apple. And since these data exist as matrices, they can be concatenated directly, out of the box, into the factor matrices used for optimization. And then the final area, which I think has the most potential but has yet to be fully realized, is this idea of unstructured data. The best way to define unstructured data is by opposition to structured data. Structured data is the information you find in Excel spreadsheets and SQL databases. It’s price, volume, and financial ratios like P/E ratios. Unstructured data, on the other hand, is everything else. It’s text, images, audio, video, any other source of information. Unstructured data is 80% of outstanding data, and it has grown exponentially. It’s doubling every one to two years. Importantly, it’s also being created faster than it can be structured, meaning that the 80% figure is an underestimate, because as we move forward through time, it’s only set to increase. And of course, it’s not just quantity, right? Unstructured data can also contain a lot of valuable information about companies. At Sparkline, we look at LinkedIn to measure human capital. We look at Glassdoor to measure culture, patents for innovation, Twitter for brand. And for the most part, investors are not using this data, at least in a systematic way. So while we’ve seen some unstructured data be adopted, such as news sentiment becoming popular, I think it’s really only scratching the surface of what this data can offer.
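To make the idea of text-based industry definitions concrete, here is a minimal sketch in Python. It is not Sparkline’s model: it assumes simple TF-IDF weighting and cosine similarity, and the one-line company descriptions are invented for illustration rather than taken from real filings. The point is only that text produces a graded similarity matrix, so one company can sit close to several peer groups at once instead of being forced into a single sector.

```python
import math
from collections import Counter

# Hypothetical one-line business descriptions (illustrative only, not real filings).
DESCRIPTIONS = {
    "Tesla": "electric vehicles cars batteries software autonomous driving",
    "Ford": "cars trucks vehicles manufacturing dealers",
    "Apple": "phones software devices chips services",
}


def tfidf_vectors(docs):
    """Return a sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[name] = {
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(docs):
    """Return company names and the full pairwise similarity matrix."""
    vectors = tfidf_vectors(docs)
    names = list(docs)
    matrix = [[cosine(vectors[a], vectors[b]) for b in names] for a in names]
    return names, matrix


names, matrix = similarity_matrix(DESCRIPTIONS)
for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```

In this toy data, Tesla has nonzero similarity to both Ford and Apple, while Ford and Apple share no terms. Because the result is already a matrix, it could in principle be placed alongside other factor exposures, which is the property Kai points to.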

Corey Hoffstein 09:25

Now, you started Sparkline back in 2018 with the goal of applying machine learning techniques to unstructured data to help quantify this concept of intangibles pretty specifically. Can you walk us through your thesis as to, first, what you mean by intangibles, and then ultimately why you think this approach has an edge?

Kai Wu 09:45

As I mentioned, I started my career working for Jeremy Grantham, and I’ve been a value investor ever since. But look, it’s no secret that the value factor has not thrived for the past decade. The last time I checked, value would have to outperform growth by 300% just to get back to trend. So you have some investors claiming that value is dead, and others arguing that we just have to keep the faith and things will turn around. But trillions of dollars are managed in value strategies. This is a huge question in markets. I’ve thought a lot about it, obviously, and my conclusion has been that value is not dead. It just needs to be reformed. The father of value investing, Ben Graham, wrote Security Analysis in the 1930s, when the world was very different. The big companies were railroads and industrial firms, and buying stocks below book value was a reliable way to make money. Fast forward to today, and we have Google and Apple, which don’t use tangible capital to generate earnings. They rely on intangibles. We have these four pillars at Sparkline: intellectual property, brand, human capital, and network effects. These are the pillars on which most firms today rely. Our research has shown that intangible capital has grown from basically 0% to 60 to 80% of the capital stock of the S&P 500. Meanwhile, the efficacy of traditional value metrics like trailing earnings or book value has declined. Baruch Lev and Feng Gu, in their excellent book The End of Accounting, show that the R-squared of using book value and earnings to explain market caps cross-sectionally used to be 90% in 1950, and it had fallen to around 50% by 2010, and this was 10 years ago. So look, I’m not the first person to argue that value investors need to incorporate intangible assets into their assessment of corporate value. 
But as far as I can tell, at Sparkline we are the first firm to use machine learning and unstructured data to measure this value. For example, we use LinkedIn to track the flow of human capital from company to company, or Twitter to measure the brand perception of firms. These datasets require using machine learning to take the unstructured data and form it into factors, which we can then use to triangulate each of these four pillars. So basically, we have two big insights at the firm. First, the economy is becoming increasingly intangible, but investors and accountants are failing to adapt. Second, unstructured data is exploding, and it contains valuable insight on the intangible economy that can be unlocked using machine learning. By combining these two insights, we hope to help investors access the opportunities in these undervalued intangible assets.

Corey Hoffstein 12:19

So as you alluded to, there has been a lot of ink spilled on this topic on the importance of including intangibles into value measures going forward. Could you talk a little bit maybe about how other firms have tackled this problem? And what makes your approach unique in comparison?

Kai Wu 12:38

Yeah, good question. So look, there are probably now a dozen or so researchers who have written about how to incorporate intangibles into measures of book value. While they each have slightly different approaches, the common theme is that they all rely on accounting data to measure intangible assets. To be more specific, they focus on two particular line items in the accounting statements. First is R&D, or research and development, and second is SG&A: selling, general, and administrative expenses. SG&A is kind of a catch-all that captures many things, one of which is sales and marketing expenses. The idea is that R&D and SG&A are expensed rather than capitalized, and this creates a problem. For example, if I were to spend $10 million building a factory to manufacture a new drug that I developed, that capex is capitalized. That goes on my balance sheet. On the other hand, $10 million of R&D to develop the drug that will then be manufactured is considered a cost. It comes out of net income. This inconsistency means that investments in intangible capital are considered not an asset but an expense. So led by Baruch Lev, whom we mentioned just a second ago, a lot of different researchers have now decided to treat intangible investments the same way they do tangible investments, in other words, to build balance sheet assets for intellectual property and brand. What they then do is say, okay, we had price to book; why not take price to book plus capitalized R&D and SG&A? When you do that, you end up with this slightly more comprehensive version of a value factor. Each paper is a little bit different, but they find that this adds somewhere between one and four points of excess return each year. The problem for us, though, is that value is still in a deep drawdown, notwithstanding these adjustments. Instead of needing to go up, say, 300%, now it needs to go up 200%, let’s say. 
So look, while these are very sensible adjustments, they are not a panacea. And I think the limitations are twofold. First, there is a pretty weak relationship between the input cost and the output value for any intangible investment. The goal of accounting is to capture historic cost, but the ex post value of an intangible investment is very uncertain. Our $10 million spent on this new cancer drug could be worth a billion dollars or could be worth zero to the market. This new drug could go viral or could flop. So that’s the first problem. The second one is that accounting statements basically ignore the other two intangible pillars. All CEOs claim that their people are their greatest assets, but the only disclosure they put into their 10-Ks is headcount, which of course makes no distinction between the quality of employees or what functions they’re hired to do. And then finally, network effects. Uber’s main asset is its external network of drivers, which doesn’t show up on its balance sheet. So when all is said and done, this means that we are forced to go beyond accounting data. We believe that by using unstructured data, we can actually measure the output, as opposed to the input, of the R&D investment, as well as the quality of human capital, network effects, and brand. And this allows us to transcend some of these limitations.
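The accounting-based adjustment Kai describes (capitalizing R&D and SG&A, then adding the result to book value) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any particular paper’s method: it assumes a perpetual-inventory accumulation, and the 20% amortization rate and all dollar figures are hypothetical.

```python
def capitalize(expenses, amortization_rate):
    """Accumulate a stock of intangible capital from a history of expenses.

    Each year the existing stock decays by `amortization_rate` and the
    new expense is added (a perpetual-inventory assumption).
    """
    stock = 0.0
    for expense in expenses:
        stock = stock * (1.0 - amortization_rate) + expense
    return stock


def adjusted_price_to_book(market_cap, book_value, intangible_stock):
    """Price to book with capitalized intangibles added to book value."""
    return market_cap / (book_value + intangible_stock)


# Hypothetical firm: $10M of R&D per year for three years, 20% amortization.
rd_history = [10.0, 10.0, 10.0]
rd_stock = capitalize(rd_history, amortization_rate=0.20)

market_cap = 1000.0  # hypothetical, in $M
book_value = 100.0   # hypothetical reported book value, in $M

print(round(rd_stock, 2))
print(round(market_cap / book_value, 2))
print(round(adjusted_price_to_book(market_cap, book_value, rd_stock), 2))
```

Note what the sketch does and does not capture. It turns the input cost into a balance sheet asset, so an R&D-heavy firm looks cheaper on the adjusted ratio. It says nothing about whether the $10 million produced anything valuable, which is the first limitation Kai raises.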

Corey Hoffstein 15:42

As factors got more and more popular in the last decade, we saw an absolute explosion in the number of characteristics that were used to describe the factor zoo, where I think there were ultimately, towards the end of the 2010s, hundreds of different characteristics being proposed as potential factors. I’m curious: in your research, do you find that intangibles are really a unique factor when compared to the span of the entire zoo that’s already out there?

Kai Wu 16:11

The first thing to test would be to ask, what about the basic accounting factors, the ones I just mentioned a second ago? Are they unique? There was actually a paper I like, by some folks at the Rotterdam School of Management. I think it was called The Intangible Premium. What they did was say, let’s create accounting-based intangible factors, so capitalized R&D and SG&A, fine, and then test to see if intangible-intensive companies outperform, adjusting for your Fama-French or Carhart factors like size, value, profitability, investment, etc. And they do find that, indeed, it is profitable net of all these things. In fact, I think the investment factor drops out of the equation once these intangibles are added. So that’s kind of a simpler example. Of course, what I’m doing is even a step beyond this, right? I’m using machine learning to form unstructured data into text-based factors representing the intangible pillars. If you think about the factors in the factor zoo, they’re primarily defined using the same few academic databases: CRSP, Compustat, maybe Worldscope. And these are, of course, all structured data. That’s one of the reasons, by the way, why you see so many false positives in the literature: you have thousands of PhDs spending lots of money and decades of time mining the same data. But look, it’s all coming from the same well. We’re all just kind of redefining the same things in slightly different ways. Not that unique. So simply by virtue of using a completely different underlying data set, in this case unstructured data, I think that these intangible factors will, kind of by definition, be quite unique. While there might be hundreds of different species in the factor zoo, they all kind of come from the same phylum or whatever. 
So by just being a little bit different, I think we can dodge a lot of the problems that we’ve seen in the proliferation of the traditional factor zoo.

Corey Hoffstein 18:02

At the risk of maybe asking a question that’s just nuance without meaning: when you talk about the concept of intangibles versus what you do, where you use machine learning on unstructured data to extract intangibles, do you see what you’re doing really as a factor? Or do you see it as an ability to add alpha on top of the existing intangibles factor?

Kai Wu 18:30

That’s an interesting question. I think it gets down to the idea of: are these intangible factors a risk premium, a risk factor? Or are they a mispricing, like an alpha? And I think the explanation is a bit of both. It is the case, for example, on the R&D front, that these are sunk costs. If I put my 10 million bucks into cancer research and it doesn’t pan out, well, that’s most likely going to be worth zero. Same with some of the other intangibles. In many ways, it can be boom or bust. I think I talked about the example of Meta, which is kind of one of the cheapest of the FANG stocks, or, Netflix aside, let’s say of the tech giants. Part of the reason is that they’re investing 10 billion plus in the metaverse every year, and investors are like, is this just science fiction and fantasy? Is this a real thing? Maybe it’ll turn out in 10 years that the metaverse is huge and Facebook owns it, or maybe not. But that sort of bet is pretty challenging for investors to stomach, and as a result it leads to these firms being punished with regard to their P/E. So that would be the risk-based explanation. The alpha explanation, I think, is even simpler, which is to say that these intangibles are hard to measure. Most of the world, all these passive investors and active investors as well, are very focused on accounting data. They’re also very short-term focused, on trailing quarterly earnings. And to the extent that, say, long-term deep scientific research may have a huge impact on a 10-year horizon but is a cost over a one-quarter horizon, maybe it’s the case that these things are overlooked or perhaps underappreciated by the market. So I think either of these two things is probably a valid explanation, and probably they’re both, to some extent, applicable.

Corey Hoffstein 20:07

In a lot of your research, you talk about the four key pillars as being brand, intellectual property, networks, and human capital. Do you find that these pillars are truly distinct and independent from one another? Or do they sort of bleed into each other in terms of how they’re able to explain the cross section of returns?

Kai Wu 20:31

So I would start by asking the question: is there a fundamental distinction between these four pillars? I think the answer is yes. I guess the easiest way to demonstrate that is just by thinking about which companies might be represented by each. Obviously, all companies are a mix of the four pillars, as well as a fifth pillar, by the way, which is tangible assets. So Google, for example, has some human capital, some brand, network effects, etc. But usually we can take a company and pretty quickly discern the primary pillar upon which it generates growth and earnings. So let’s go through each of them. On the IP side, I like the examples of Nvidia and Moderna. They rely on IP moats to stay ahead of their competitors. On the brand side, Nike and Harley-Davidson come to mind. On human capital, it’s Google, but also Goldman Sachs. And then on network effects, we have Twitter, but then we also have AT&T, to go back to the OG. Obviously, this points to a second layer, which is an industry bias. Tangible value is the most important for your old economy sectors: your real estate, utilities, materials, energy, financials. But these old economy sectors, these asset-heavy businesses, now only comprise about 20% of the market cap of the S&P 500. The other 80% is primarily driven by intangible assets. So for example, intellectual property is most important for technology and healthcare firms. Human capital is important for technology and healthcare, but also for communications and financials. Brand is, of course, most important for consumer-facing sectors like consumer discretionary and staples. And then finally, network effects matter most for communications, followed by technology. The final thing we can do, by the way, is run the correlations across these pillars. 
So if we take the five pillars, the four intangibles plus tangible, and compute the average pairwise correlation, we find it’s about 10%. I think the highest correlation is between IP and human capital. This makes sense because, of course, it’s human ingenuity that drives innovation. But aside from that, the correlations are really quite modest across these various pillars.
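The average pairwise correlation Kai mentions is a simple calculation: compute the correlation for every pair of pillars and take the mean. Here is a minimal Python sketch. The three short series are made-up numbers chosen so the arithmetic is easy to check by hand; they are not Sparkline’s pillar data, and the roughly 10% figure above comes from Kai, not from this code.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_pairwise_correlation(series):
    """Mean Pearson correlation over every distinct pair of named series."""
    pairs = list(combinations(series, 2))
    total = sum(pearson(series[a], series[b]) for a, b in pairs)
    return total / len(pairs)


# Hypothetical pillar scores (illustrative only).
toy_pillars = {
    "pillar_a": [1.0, 2.0, 3.0],
    "pillar_b": [2.0, 4.0, 6.0],  # moves perfectly with pillar_a
    "pillar_c": [3.0, 2.0, 1.0],  # moves perfectly against both
}

print(round(average_pairwise_correlation(toy_pillars), 4))
```

With five pillars there are ten distinct pairs, so a single high pair, such as IP and human capital, can coexist with a low average.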

Corey Hoffstein 22:41

So my pushback to all of this would be that we know, with the benefit of hindsight, which intangibles ultimately became important over time. We know that network effects became really important in the 2000s. We know that human capital became really important to support intellectual property growth as the US economy evolved into a more service-based economy rather than a goods-based economy. So my big skeptical question would be: how is this not all just a big, sophisticated exercise in hindsight bias? How do we know which intangibles are actually going to be important going forward?

Kai Wu 23:21

Yeah, that’s a fair question. Look, I think the four pillars are pretty categorical. While we may not have known, say, a couple of decades ago how important each might be, whether it would be brand or IP that would ultimately drive the economy, I think it would have been fair to say, if you go back even just to that point in time and look at what researchers and economists were talking about: yeah, brand is going to be something that helps create moats; human capital is important. So I would say that these are pretty universal and timeless concepts. But I think I would agree with you on the actual metrics we use to quantify each of these four pillars. That is a bit more subjective. For example, even though we’ve always known that brands are important, we may not have known, without the benefit of hindsight, that Twitter data, or social media in general, would have been a good way to measure it. But again, as you step back, this is the fundamental problem of all quant strategies. It’s why we focus a lot less on the backtest and more on the first-principles reason why brand, human capital, network effects, and IP should be drivers of value. And keep in mind, even if you think cross-sectionally: in the US, it has turned out to be the case that intangibles have become, say, 60 to 80% of the capital stock. But if you go to other countries, say in Europe or some of the emerging countries, it hasn’t really played out that way. It doesn’t mean we should ignore it. We should still look at it, but it’ll matter a bit less. And that’s, I think, the best we can do, which is to say we believe that, all else equal, companies with strong brands and talented teams should outperform. It has nothing to do with the data or what a backtest might show us. 
It just stands to reason that this is the case. And so we define metrics that we hope will, at least directionally, help us quantify this. That’s kind of the best we can do at this point.

Corey Hoffstein 25:09

In terms of the quantification, we’ve mentioned a couple of times now that your focus is predominantly on creating structure out of unstructured data. We’ve danced around maybe a definition of unstructured data. Can you talk about exactly what unstructured data is, and both the opportunities and risks that come along with using it?

Kai Wu 25:28

Sure. So I think the opportunities are pretty profound. Look, the data is quite massive, and it contains potentially a lot of valuable insight on both intangible assets and any other company characteristics that one might care about. It’s also harder to access for the average investor without a machine learning toolkit, which means less competition. This is, of course, in contrast to traditional factors like price-to-book ratios, which at this point any half-decent quant with Python and, you know, a Compustat subscription can spin up in a couple of hours. The risks, I think, are interesting. They stem from the fact that unstructured data really is still the Wild West. With a few exceptions, like 10-Ks and patents, most unstructured data has really only been created in the past decades. This makes it very hard to run super long backtests, which is why we have to rely on our intuition to guide our research. The other challenge, I think, is that the available datasets are rapidly expanding. So as a researcher, you have to stay on top of the current trends. For example, let’s say we want to use social media to measure brand perception. Great, well, we’re going to need to follow where the users go over time. As we know, it was MySpace, and that gave way to Facebook, and then Twitter and Instagram and TikTok. This dynamism means that, as researchers, we really need to be on top of our game. And it’s the reason why, by the way, unstructured data is so interesting in the first place. As opposed to traditional accounting data, which has been very static and hasn’t adapted to the rise of the intangible economy, unstructured data has. But you do have to acknowledge that it is going to require a lot more work. You know, no pain, no gain.

Corey Hoffstein 27:08

Michael Mauboussin outlined four traditional classifications of edges. There’s the behavioral edge, where you’re taking advantage of the misbehavior of others; the analytical edge, where you have the same information as everyone else, but your analysis of that information is better; an informational edge, where you truly identify unique information, important features that your competitors don’t; and then finally, a technical edge, where perhaps you can execute more quickly or efficiently than your competitors. Where do you think the edge lies with unstructured data?

Kai Wu 27:44

It’s probably a combination of all four. But if we had to choose one, I’d go with analytical. All the data we have is publicly available. Anyone can go to the US Patent and Trademark Office website and download the patents. Anyone can go to Glassdoor and read reviews on various employers. The challenge really is just processing this huge amount of information. The data is quite large, and it requires a significant investment in technological infrastructure to extract, store, and process. The bigger challenge from there, I think, is just making sense of all of it. Look, if you were to take the thousands of words in, say, a 10-K and throw them into a linear regression, you would get nothing useful. Investors need to use specialized natural language processing (NLP) models in order to derive meaningful insights from this sort of data. But even with these tools, you can’t simply brute force it. Unstructured data is super high dimensional. Again, a single 10-K can have thousands of unique words. Simply trying to data mine to find the patterns that, say, correlate with future returns doesn’t work. Instead, an analyst needs to use their fundamental intuition to guide their exploration of the data. For example, we might have a hunch that the use of swear words in earnings calls is a bad sign. To test this is actually very simple. The challenge, of course, was knowing where to look in the first place. Using intuition to guide us allows us to significantly narrow down the search space so that it becomes tractable for our models. So I guess the edge here is that the analysis requires the intersection of two skill sets: machine learning and investment intuition. Both are, at least to some degree, uncommon, and the combination is, of course, even less prevalent.
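Kai’s point that a hunch like this is simple to test once you know where to look can be illustrated with a very small Python sketch. Everything here is an assumption made for illustration: the three-word lexicon, the per-1,000-words normalization, and the sample sentence are all hypothetical, and a real test would go on to compare this feature against subsequent returns.

```python
import re

# Hypothetical, deliberately tiny lexicon; a real one would be curated.
SWEAR_WORDS = {"damn", "hell", "crap"}


def swear_rate(transcript):
    """Swear words per 1,000 words in a transcript (0.0 for empty text)."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in SWEAR_WORDS)
    return 1000.0 * hits / len(tokens)


# Invented sentence standing in for an earnings call excerpt.
sample_call = "Damn, this quarter was hell. Revenue grew."
print(round(swear_rate(sample_call), 1))
```

The feature itself is one narrow, pre-specified column per call, which is the contrast with feeding thousands of raw word counts into a regression and hoping a pattern appears.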

Corey Hoffstein 29:27

You mentioned earlier on how quickly the unstructured data universe is growing. It’s exponential, to say the least. I mean, when it comes to the different data sources you could look at, you could look at 10-Ks, you could look at earnings call transcripts, Twitter sentiment, LinkedIn network information, GitHub activity, YouTube videos, podcast audio or podcast transcripts. And that’s just off the top of my head. So my question to you would be: how do you ultimately decide what sources have signal and what sources are ultimately going to be redundant or just noise?

Kai Wu 30:05

Well, all the alpha is obviously in Flirting with Models podcast transcripts, right? Absolutely. It starts with fundamental intuition. We use machine learning as a tool to test, implement, and scale the ideas. As I mentioned, our research process is guided by the four pillars: brand, human capital, IP, and network effects. Our only job is to find as many metrics and data sources as possible to proxy these values. Because our research is guided by fundamental rationale, we really only spend time on data sources for which there might be a logical reason why they could help us measure one of these four pillars. Once we do have a new dataset, we will train our NLP and machine learning models on the data and build factors, but the approach tends to be quite different for each dataset. So for example, in order to determine which companies are attracting and retaining the best talent, we actually ended up using the Google PageRank algorithm, the algorithm they use for Google search. And the reason why is because we wanted to get around the circularity of this idea that, look, Palantir has high status because it’s able to pull talent from, say, Facebook, but Facebook is high status because it attracts talent from many other high-status firms such as Palantir. Fortunately, the PageRank algorithm was kind of perfectly suited to do this. It was only once we had this metric that we then bothered to go test whether it was indeed predictive of future returns. Another, very different example would be on culture. We started with this Organizational Culture Profile (OCP) framework, which actually comes from the psychology and management literature. The framework has seven dimensions, such as innovation, teamwork, and integrity. We then use word embeddings, a simple NLP model, on Glassdoor reviews to assess culture on each of these seven dimensions. We then build sort of Myers-Briggs-style profiles for each company.
And then we could confirm from there that strong cultures do indeed predict future stock market returns. While there are many different metrics and different ways of creating and testing them, they’re all guided by this common thread: we start with fundamental intuition in service of these four intangible pillars.
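
To illustrate the circularity Kai describes, here is a minimal sketch of PageRank applied to employee moves between firms. It is not Sparkline's implementation: the firm names, the job-change counts, and the function `talent_pagerank` are all hypothetical, and the damping factor of 0.85 is simply the conventional default. Each departure is treated like a hyperlink, so the firm losing an employee "votes" for the firm gaining them.

```python
from collections import defaultdict

def talent_pagerank(moves, damping=0.85, iterations=100):
    """Score firms by talent inflow with a plain power-iteration PageRank.

    `moves` is a list of (from_firm, to_firm, count) tuples, meaning `count`
    employees left `from_firm` to join `to_firm`.
    """
    firms = sorted({f for src, dst, _ in moves for f in (src, dst)})
    out_total = defaultdict(float)
    for src, _dst, count in moves:
        out_total[src] += count

    n = len(firms)
    rank = {f: 1.0 / n for f in firms}
    for _ in range(iterations):
        # Firms with no recorded departures spread their score evenly.
        dangling = sum(rank[f] for f in firms if out_total[f] == 0)
        new_rank = {f: (1 - damping) / n + damping * dangling / n for f in firms}
        # A firm passes its score to destinations in proportion to departures.
        for src, dst, count in moves:
            new_rank[dst] += damping * rank[src] * count / out_total[src]
        rank = new_rank
    return rank

# Hypothetical job-change counts, invented purely for illustration.
moves = [
    ("FirmA", "FirmB", 30),
    ("FirmB", "FirmA", 10),
    ("FirmC", "FirmA", 20),
    ("FirmC", "FirmB", 20),
    ("FirmA", "FirmC", 2),
]
scores = talent_pagerank(moves)
for firm, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{firm}: {score:.3f}")
```

Because a firm's score depends on the scores of the firms it hires from, the iteration resolves the "high status because it hires from high-status firms" loop without needing a starting definition of status.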

Corey Hoffstein 32:15

For all the big data out there, evolution and innovation seem to create a small data problem. You mentioned earlier, for example, Amazon launching their AWS platform. And I would presume that terms like cloud computing, which are well defined today, wouldn’t have even really existed 15 years ago. How do you tackle this emergence of new information within unstructured data?

Kai Wu 32:44

I actually recently wrote a paper exactly on this topic. It was called Investing in Innovation; you can go to my website and see it. The theme is disruptive innovation funds like ours, which are focused on themes like AI, cloud computing, as you mentioned, and genetics. The problem, of course, is that these funds have only existed for less than a decade, which of course coincides with a huge bull market, especially for growth stocks. So we wanted to ask the question instead: how would innovation investing have performed over several decades? For example, what themes would have been considered disruptive in 1999? What about 1989? So in order to do this, we actually ended up using patent data from the US Patent and Trademark Office. The first patent was issued in 1790; it was actually signed by George Washington himself. So what we can do with this dataset is go all the way back to the mists of time, allowing us to observe the arc of innovation over two centuries. We use machine learning on patent abstracts to create clusters of similar technologies, and we then see how they evolve through time. It’s super fun: we can see the rise and fall of innovation in railroads, electricity, automobiles, circuitry, and now the internet. We then apply a Google Trends-style algorithm to see which technologies are trending at each point in time. We find that technologies trend, as opposed to mean revert. In other words, once a few initial breakthroughs happen, it creates a virtuous cycle, attracting other innovators into the space; they build on these foundations, and so on and so forth. So we find that simply by looking at which technologies are gaining traction, we can forecast the path of future innovation. And then what we do next is backtest the performance of a strategy that follows these technology cycles.
So to your example of cloud computing, the strategy started to see cloud computing trend in the data in the early 2010s. And once it was identified, the strategy went out and said, all right, let’s go buy companies with exposure to this theme. And of course, that would have played out very well the past decade. But, you know, cloud computing continues to grow. Eventually, it’ll probably flatten out and give way to the next set of innovations, at which point the model will rotate out of cloud and into the next big technology revolution.
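
One simple way to picture the "which technologies are trending" step is to compare each cluster's share of all patents now against its share a few years earlier. The sketch below is a stand-in under stated assumptions, not the algorithm from the paper: the function `trending_clusters`, the five-year lookback, the 1.5x growth threshold, and the patent counts are all hypothetical.

```python
def trending_clusters(counts, year, lookback=5, min_growth=1.5):
    """Flag clusters whose share of all patents grew by `min_growth`x or more.

    `counts[year][cluster]` is the number of patents assigned to a cluster.
    Clusters with no patents in the earlier year are skipped, since their
    growth ratio is undefined.
    """
    def shares(y):
        total = sum(counts[y].values())
        return {cluster: n / total for cluster, n in counts[y].items()}

    now, before = shares(year), shares(year - lookback)
    trending = {}
    for cluster, share in now.items():
        past = before.get(cluster, 0.0)
        if past > 0 and share / past >= min_growth:
            trending[cluster] = share / past
    return trending

# Hypothetical patent counts per technology cluster, for illustration only.
counts = {
    2005: {"cloud": 10, "railroads": 40, "circuitry": 50},
    2010: {"cloud": 40, "railroads": 20, "circuitry": 60},
}
trending = trending_clusters(counts, year=2010)
print(trending)
```

Using shares rather than raw counts keeps a cluster from looking like it is trending merely because total patent volume rose.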

Corey Hoffstein 34:56

As more and more firms adopt NLP tools to rapidly trade news releases and earnings transcripts, how do you outrun the adversarial issue where CEOs may now get coached against using specific words and phrases, or coached to use specific words and phrases?

Kai Wu 35:15

I love this question. Look, investing is like poker; it’s a game-theoretic endeavor. One of my favorite papers is actually called How to Talk When a Machine Is Listening, and it has a really interesting finding. So there’s this dictionary called the Loughran-McDonald dictionary. It consists of a bunch of lists of positive and negative keywords. And the key is that it’s adapted to the finance industry; it was created by two finance professors solely for the purpose of classifying financial jargon. It was published in 2011 and quickly became widely used in natural language processing applications in finance. The paper found that companies started to avoid using the negative Loughran-McDonald words in their 10-Ks and 10-Qs soon after this dictionary was published. So you kind of see a kink in the data, which is evidence that, yeah, this is a very real thing: as investors, and not just quants, by the way, attempt to make sense of unstructured data, CEOs will in general try to manipulate the narrative to their advantage. So the way we deal with this is we define three buckets of data with varying levels of susceptibility to such manipulation. The first is company communications. So this is your 10-Ks, earnings calls, press releases, anything coming directly from the mouthpiece of the company. Second is third-party information. So news media, blogs, sell-side research, company reviews; I mentioned Glassdoor. And then the third is ground truth. So I would put human capital and patents in this category. A good example is to go back to our culture thing. We wrote a paper called Measuring Culture, and we started off by showing the famous slide about how Enron, whose leaders went to jail for fraud, proudly displayed the value of integrity in its office lobby. Look, CEOs invariably just love talking about how great their culture is.
But this has no correlation with the true culture of a company. So to get around this problem, we don’t look at the CEO interviews. Instead, we look to the opinion of the rank-and-file employees. These are the opinions which, on a day-to-day basis, constitute the culture of a firm. And again, for this we just use Glassdoor. The website allows individual employees or former employees to review their employers, and we find this data is a much more reliable source. In particular, we find that it’s not the quantitative star rating that matters, but the information contained in the free-form text associated with each of these reviews, which gives us interesting clues to the facets of each company’s culture. A similar example would be that all CEOs just love talking about how they’re embracing innovation and digital transformation. But talk is cheap. So instead, we look at job postings and LinkedIn to see if companies are truly hiring talent in these areas. Right? It’s easy to say you’re investing in innovation. But do you actually then go out and spend the extra money to hire top graduates, like computer vision PhDs from Carnegie Mellon? Is it actually going to be the case that your employees have skill sets such as TensorFlow and PyTorch on their resumes? You know, are you really investing in AI?
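
The dictionary approach discussed above can be sketched in a few lines, which also shows why it is easy to game. The word list here is a tiny hypothetical stand-in, not the actual Loughran-McDonald list, and `negative_share` and the two sample sentences are invented for illustration.

```python
import re

# Tiny stand-in word list; NOT the real Loughran-McDonald negative list.
NEGATIVE_WORDS = {"loss", "decline", "impairment", "litigation", "adverse"}

def negative_share(text, negative_words=NEGATIVE_WORDS):
    """Return the fraction of word tokens in `text` found in the negative list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in negative_words)
    return hits / len(tokens)

blunt = "We reported a loss and an impairment this quarter."
coached = "Results were softer than planned this quarter."

print(negative_share(blunt))    # two of nine tokens are on the list
print(negative_share(coached))  # no listed words, so the score is zero
```

A fixed word list scores the second sentence as neutral even though it carries similar news, which is the kind of avoidance the paper documents and one reason to lean on data the company does not author.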

Corey Hoffstein 38:28

A lot of machine learning models require data to learn on, obviously, but to actually learn on that data can require a form of labeling. And with investing, that labeling process strikes me as potentially a subjective endeavor, where both context and horizon matter, and belief might play a role. And you can’t necessarily just look at forward or future returns to determine if something had a positive or negative impact, because there could be just dozens of confounding variables that ended up resulting in those future returns. I’m curious how you tackle this labeling issue.

Kai Wu 39:09

I think the way we apply machine learning is kind of different. It goes back to the point that I made earlier that we start with fundamental intuition, and then use machine learning as a tool to interrogate our fundamental hypotheses. So due to this process, our work is actually very subjective, inasmuch as we have to use our brains to define what sorts of data to look at, what models to use, what metrics to look for. So in this way, our process is much more similar to that used by a fundamental investor, where we have fundamental hypotheses and we use data to answer them, in opposition to how many quants approach machine learning. The standard approach is to start off with raw data, and then to use machine learning to find interesting patterns. So we turn this on its head. We start instead with fundamental theories, and then test them with data. I think this helps reduce many of the pitfalls of overfitting on high-dimensional datasets to, say, features of noisy stock market returns. It also means that our results are a lot more transparent, as opposed to being a black box. But the trade-off is that it instead relies on the researcher having the right instinct. This is why we’re so focused on establishing the first-principles rationale for why intangible assets matter and communicating it in as transparent a way as possible in our writings.

Corey Hoffstein 40:28

So now for something entirely different. You recently published a paper applying many of these concepts to crypto markets, where I kind of have to laugh because at least in my experience, most crypto projects are just purely intangible anyway. So talk us through your approach here. And what you discovered in taking these traditional market applications and trying to bring them to crypto.

Kai Wu 40:53

Yeah, you’re exactly right. Whereas in equities, let’s say 20% of value is tangible, in crypto it’s 0%. So if anything, this framework is potentially more interesting in crypto, which is really the feedback we got from investors. We’ve been writing now for a while about intangible value, and many people read the papers I’ve written and said, wow, this is really cool, this works in equities, but I’d love to see how it might work in Web3 and crypto. Given the massive growth of the sector, a lot of people are interested in investing in it. But they want a value lens for the market; they don’t want to just be chasing narratives and trying to find the next big Dogecoin or something like that. So it turns out that porting our model into crypto was actually pretty seamless. Brand and human capital matter just as much for Web3 as for Web2 organizations. So we were really able to just apply the framework wholesale with no modifications. The big difference in crypto is that the data sources are quite different. But because Web3 is being built in the open, in many ways crypto is actually an even more attractive area to apply this framework. So we focused on three different datasets. First, we use blockchain data. By definition, we can see the history of a blockchain all the way back through time; it is publicly available, and it is immutable. This allows us to form metrics for the adoption of a protocol. For example, we can calculate the number of daily active users, or the dollar volume of transactions, over any arbitrary time period. And this, of course, maps back to our pillar of network effects. Second, we use GitHub. The really cool thing about crypto is that it’s all built on open source principles, which of course is key for its composability. We see the source code of thousands of crypto projects today, as well as yesterday, and at each point back in time to inception.
So this allows us to form metrics for human capital and intellectual property. For example, we can see the number of repo changes as a proxy for iteration over a period of time. Or we can look at the growth of the developer community over the years. Finally, we have social media data. While social media is of course important for all firms, it is especially important for Web3 projects, which are digitally native and involve the coordination of online communities across the globe. We can look at datasets such as Twitter, Reddit, Telegram, and Discord to track the growth of these online communities and brands. So now, with these measures of fundamental value in place, we then compare them to the price you pay. In other words, for each million dollars invested, how many users, engineers, and followers do I acquire? We then backtest a simple quant value strategy in this asset class and find that it does really well. Now of course, our goal is to avoid getting too hung up on backtests. I think what makes us confident in the strategy and gets us so excited about it is that, look, this is an inefficient frontier asset class, and very few other investors, if any, are approaching it with a systematic value lens. So it just stands to reason that there might be some alpha here.
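
The "how much fundamental traction per million dollars" comparison can be sketched as a simple rank-based value score. This is an illustration under stated assumptions, not the strategy from the paper: the token names, every number, the equal weighting of the three metrics, and the function `value_scores` are hypothetical.

```python
def value_scores(projects):
    """Average rank of each project on users, developers, and followers
    per million dollars of market cap. A higher score means cheaper."""
    metrics = ["daily_active_users", "developers", "followers"]
    per_million = {
        name: {m: p[m] / (p["market_cap_usd"] / 1e6) for m in metrics}
        for name, p in projects.items()
    }
    scores = {name: 0.0 for name in projects}
    for m in metrics:
        # Rank 1 is the least traction per dollar, rank N the most.
        ordered = sorted(projects, key=lambda name: per_million[name][m])
        for rank, name in enumerate(ordered, start=1):
            scores[name] += rank / len(metrics)
    return scores

# Hypothetical tokens and figures, invented purely for illustration.
projects = {
    "TokenX": {"market_cap_usd": 100e6, "daily_active_users": 50_000,
               "developers": 40, "followers": 200_000},
    "TokenY": {"market_cap_usd": 1_000e6, "daily_active_users": 100_000,
               "developers": 100, "followers": 1_000_000},
    "TokenZ": {"market_cap_usd": 50e6, "daily_active_users": 10_000,
               "developers": 30, "followers": 20_000},
}
scores = value_scores(projects)
cheapest = max(scores, key=scores.get)
print(cheapest, scores)
```

Ranking each metric before averaging keeps one very large raw number, such as follower counts, from dominating the score.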

Corey Hoffstein 44:08

My intuition, and it may be entirely wrong, would have been, after reading all your papers and talking to you about the four pillars of intangibles, that they would really only matter if they ultimately lead to cash flow at the end of the day. In other words, they’re sort of alternative metrics for measuring perhaps growth potential or economic moats for a business. Given that most crypto tokens confer absolutely zero rights to current or future cash flow, why do you think the approach still applies?

Kai Wu 44:43

You’re right that in general, the token economics are a bit different from those of equities, which I guess is sort of the point, because many of these projects are using tokens as a method of financing their growth, but they want to avoid technically calling them equity securities from a regulatory standpoint. That doesn’t diminish the actual value of these tokens, though. Let’s take the example of Ethereum. ETH is a utility token; it is required if you want to use the Ethereum network. Therefore, the value of ETH is a function of the demand for the Ethereum network. This logic applies to any other token. Whether it’s a video game, a decentralized exchange, or an L1 blockchain like Ethereum, the value of tokens will be a function of demand for the underlying project. So our framework attempts to establish the fundamental traction of these underlying projects. In this way, we’re actually much more similar to venture capitalists. We think about these projects as early-stage startups, like your internet firms of the early ’90s. They may not have monetized their projects or their users or whatever yet. But if they have a lot of users, a robust development community, and a strong brand, it certainly does bode well for their ability to flourish ultimately, which of course would somehow filter down to the token investors profiting.

Corey Hoffstein 46:03

Earlier in the conversation, when you talked about the four different pillars of intangibles, you were able to identify businesses that really resonated with each of those pillars. Do you find the same occurs within crypto, that you can point to specific projects or tokens that squarely fall within one pillar?

Kai Wu 46:23

Yeah, the same thing happens here. So for example, like I mentioned, Dogecoin, which is a joke. Its main value is in its brand. A lot of people think it’s funny; they like it, and it’s kind of fun to play with. So its primary pillar is brand. And then on the infrastructure side, you have things like Filecoin or other decentralized storage, which tends to rely most on the IP front. And then your exchanges, decentralized exchanges, let’s say, are a network effect play. So it’s similar to how, for NYSE and CME, their value is derived from the fact that you have many buyers and sellers who want to aggregate liquidity on their platform. Same thing for Uniswap and Sushi. So yeah, it’s very much the same concept here. And what we’re trying to look for, as with equities, is firms where you have a bit of everything. What we’ve discovered is that simply having one pillar is generally insufficient for success. I always give the example of Wozniak without Jobs. Great, you have technology and IP, but you really need marketing as well; you need brand. So what we’re looking for is crypto organizations, stocks, whatever it is, the asset class doesn’t matter, that are strong on all the intangible pillars, or on as many as possible.

Corey Hoffstein 47:35

Well, Kai, we’ve found ourselves at the end of the podcast here. And the question I am asking everyone this season is to look back upon their career and consider: what was the luckiest break that you had?

Kai Wu 47:51

I would have to say getting a job at GMO straight out of college. I got a call from GMO, from their HR person, for an interview, and I literally had to Google what they did. Now, keep in mind that I don’t come from a business background. My mom’s an artist, my dad’s a doctor. I went to college and was originally a poli sci or government major. I got into economics simply because I liked the quantitative angle as a way of exploring social sciences. And then I graduated into the teeth of the financial crisis. So of course bubbles and crises were interesting to me, but I had never really heard of a lot of these investment firms. So it was just kind of through sheer luck that my resume got passed around and somehow ended up on the desk of the CEO over there at the time, and that I got my first internship, which then turned into a full-time offer. And I think that set me on a very interesting path for the rest of my career. I didn’t know what value investing was. I had never read Security Analysis. I showed up at GMO, and Ed Chancellor, one of my mentors, gave me Security Analysis and told me to read it. And I actually sat there and spent all that time reading about railroads and public and private utilities and all these sorts of things. And it was kind of an eye-opening experience for me, since I had basically no experience in finance or even accounting or investing. So that set me on an interesting track. And then, of course, the value angle and quantitative investing: these are all things I would never have even dreamed about being my career for the next 10-plus years. But here we are.

Corey Hoffstein 49:24

Well, listeners, I cannot urge you enough to go to Sparkline Capital’s website and check out some of the research that Kai and his team have published. I think not only is it some of the most interesting research I’ve read over the last couple of years, but I will also give you credit, Kai: it is some of the most beautifully formatted research, which I truly appreciate for its aesthetic quality. So I would urge everyone to go check that out. Kai, I can’t thank you enough. This has been absolutely fantastic. Thanks a lot.