S02E09: All About Alternative Data – Flirting with Models

In this episode I am joined by Katherine Glass-Hardenbergh, Associate Portfolio Manager at Acadian Asset Management.

In her role, Katherine focuses heavily on the application of alternative data in Acadian’s fundamentally-driven, systematic investment process.

Purported as being one of the leading frontiers of quant finance, there is plenty of hype around alternative data. Katherine brings refreshing transparency to our conversation, speaking just as candidly about the hurdles in alternative data as the opportunities.

We discuss everything from what alternative data is, where it comes from, interesting examples in the ever-expanding landscape, some of the practical challenges of working with alternative data, and the many potential applications for use within the investment industry.

Katherine provides insight into the world of alternative data that only someone deep in the weeds could. If you’ve ever been curious as to the real-world application of alternative data, this is definitely the episode for you.

I hope you enjoy our conversation.

Transcript

Corey Hoffstein 00:00

Are you ready? Sure. Okay. 3, 2, 1. And let’s have some fun. Hello and welcome, everyone. I’m Corey Hoffstein, and this is Flirting with Models, the podcast that pulls back the curtain to discover the human factor behind the quantitative strategy.

Narrator 00:24

Corey Hoffstein is the co-founder and chief investment officer of Newfound Research. Due to industry regulations, he will not discuss any of Newfound Research’s funds on this podcast. All opinions expressed by podcast participants are solely their own opinion and do not reflect the opinion of Newfound Research. This podcast is for informational purposes only and should not be relied upon as a basis for investment decisions. Clients of Newfound Research may maintain positions in securities discussed in this podcast. For more information, visit thinknewfound.com.

Corey Hoffstein 00:55

In this episode, I am joined by Katherine Glass-Hardenbergh, Associate Portfolio Manager at Acadian Asset Management. In her role, Katherine focuses heavily on the application of alternative data in Acadian’s fundamentally driven, systematic investment process. Purported as being one of the leading frontiers of quant finance, there is plenty of hype around alternative data. Katherine brings refreshing transparency to our conversation, speaking just as candidly about the hurdles in alternative data as the opportunities. We discuss everything from what alternative data is, where it comes from, interesting examples in the ever-expanding landscape, some of the practical challenges of working with alternative data, and the many potential applications for use within the investment industry. Katherine provides insight into the world of alternative data that only someone deep in the weeds could. If you’ve ever been curious as to the real-world application of alternative data, this is definitely the episode for you. I hope you enjoy our conversation. Katherine, thank you for joining me today.

Katherine Glass-Hardenbergh 02:08

Nice to be on the show.

Corey Hoffstein 02:10

Katherine, I want to start off with a little bit of background for the listeners that maybe haven’t heard of you before or heard of Acadian before. Can you provide some background and lay some context for the rest of this conversation? Who are you? Where do you come from? How’d you get into the field?

Katherine Glass-Hardenbergh 02:23

Sure. So I am a member of the global quantitative equity research team here at Acadian. I previously have a background working on the sell side in both prime brokerage and electronic trading, and kind of got into the field originally from my undergraduate education, where I studied both finance and engineering. So I was really looking to understand the financial markets better and looking to solve problems, and it’s been a really nice fit so far where I’ve found myself now, working in quantitative research. To give you a little bit of background as well about Acadian: we’re a multifactor, quantitative equity firm, managing somewhere around 95 billion in AUM, completely fundamentally based but also fully systematic. So what that generally means from a research team perspective is we’re looking to build models that predict stock returns. But we’re going to do that with a strong focus on understanding the intuition and the economic rationale behind everything that we build and why we believe it should predict returns.

Corey Hoffstein 03:26

I think this is going to be a really fun conversation, because it’s going to be a little bit far afield, maybe a little bit on the edge of where most quantitative research is today. We’re going to talk all about alternative data, which is an area that’s had a lot of hype around it in the last couple of years, with a lot of vendors coming out with these new alternative datasets. But I just want to set the table here and have you maybe set a foundation of what you think makes something alternative data. What’s it look like? Where does it come from? What sort of puts it in that category versus more traditional datasets that quants are used to working with?

Katherine Glass-Hardenbergh 04:05

So I generally like to give alternative data a fairly liberal definition, where it’s anything that quants don’t traditionally work with. So it’s not your pricing and volume data coming from the markets. It’s not your traditional financial statement information. So where it can go from here is we’re starting to kind of see people say, well, how alternative is it on the alternative data scale? So you may see a lot of the really eye-catching ones in the news being the really alternative ones: satellite imagery turned into quantitative signals, capturing web traffic of what folks are up to on the internet, even geolocation. So you may have heard from some news articles that as you walk around with your cell phone, chances are you’ve got an application that’s tracking where you’re going. Well, can we use that information from millions of different individuals as they navigate around throughout their day to gain additional information? And then you’ve got, on the less alternative side of things, using earnings call information: can I apply natural language processing to get information from that? Or getting information from news articles, or even, one of the least alternative, survey data: less traditional, but still somewhat on that alternative scale. I think another question you asked is, what does that alternative data look like? Where does it come from? And I think what you’ll see with a lot of this is it often starts out somewhat unstructured. Maybe it’s text coming from a news article, maybe it’s an image from satellites. But it doesn’t always stay that way. We can’t feed an image into a model, so it ends up becoming fairly structured data at some point in the process. As for where it comes from, there are two places. One of them is publicly available information. As we’ve seen this explosion of data out in the world, we’re all online, we’re receiving things in our email, and we’re collecting data at large scale. As we’re able to collect it, process it and map it to companies, now we’re able to start using it.
So a couple of examples that we’ve already kind of briefly mentioned are collecting news data, for example, and interpreting that; collecting the transcripts from earnings calls; and going online and scraping the web and maybe collecting reviews about products, or reviews that companies have from their employees, whether they like working there or not. The other place that I think it’s particularly interesting that alternative data comes from is what we’ll term exhaust data. That’s data that’s been generated and collected for some other reason, but folks in the investment management industry have realized that it might be interesting for predicting stock returns as well. So for example, most of us have an application on our phone that we use to track the weather. Well, chances are that you give that weather app access to your location, so if you live in Boston but are traveling to New York, you get the weather in New York, not in Boston. Well, that weather application now has access to where your location is, and they collect that for lots of different users. That’s potential data that can be used. Another place that we’ll see it coming from very often is the advertising industry. It’s really interesting to be able to track where people are on the web and what they’re up to, in order to sell them products. But maybe we can use that to look at consumer behaviors as well, again bridging that gap into investment management and what we can do with some of this data.

Corey Hoffstein 07:35

That exhaust data is really interesting to me. It seems like a natural way for a business to try to generate additional streams of revenue from an asset they’ve already gathered and maybe weren’t utilizing before, or don’t have an ability to exploit naturally within their own business. It does strike me as interesting as well, though, with the current conversation in the world going on about privacy rights of individuals. It does strike me that with some of this alternative data, if individuals knew it was being collected about them and sold, it might make them uncomfortable. How do you feel about some of these datasets with regards to individual privacy?

Katherine Glass-Hardenbergh 08:15

I think that’s something that myself and the industry as a whole are really conscious of and really want to be careful of. So when we’re looking at any data set, we’re immediately thinking: is there personally identifiable information, PII, in this data set? Let’s make sure that it’s not there. Let’s make sure that it’s aggregated up to a level where we’re not able to disaggregate individual users. Is there, as well, material nonpublic information we want to be really careful of? Where’s this data coming from? Is this vendor allowed to collect it? Are they allowed to sell it? All of that comes before we’re even going to consider looking at it as a potential source of information. So it’s certainly something that, on that exhaust data side, we want to be careful of, and we’re really asking a lot of questions when we are talking to these vendors.
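Editor’s sketch: the idea of aggregating data "up to a level where we’re not able to disaggregate individual users" can be illustrated with a simple small-cell suppression rule. This is only a hedged illustration, not Acadian’s or any vendor’s actual process. The record layout, the function name `aggregate_visits`, and the `MIN_USERS` threshold are all hypothetical, and a minimum group size on its own does not guarantee anonymity.

```python
from collections import defaultdict

# Hypothetical threshold; the conversation does not name a specific number.
MIN_USERS = 50


def aggregate_visits(records, min_users=MIN_USERS):
    """Collapse per-user visit records into per-(store, day) user counts.

    records: iterable of (user_id, store_id, day) tuples.
    Returns {(store_id, day): distinct_user_count}, dropping any cell
    that contains fewer than min_users distinct users.
    """
    users_by_cell = defaultdict(set)
    for user_id, store_id, day in records:
        users_by_cell[(store_id, day)].add(user_id)

    # Only the count survives; user identifiers never leave this function.
    return {
        cell: len(users)
        for cell, users in users_by_cell.items()
        if len(users) >= min_users
    }
```

The output keeps only counts per store and day, so a downstream researcher sees foot traffic without ever handling an individual’s identifier.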

Corey Hoffstein 09:05

I want to talk a little bit about the opportunity with alternative data. It’s certainly been, as I mentioned, an area that there’s been a lot of emphasis around. There’s a growing number of vendors, and conferences certainly have a growing number of speaking slots available for talking about alternative data and the potential role it’ll play in the investment industry going forward. But as someone who is actually working with alternative data day to day, how do you think about the opportunity? How large of an opportunity is it to potentially reshape the results we can achieve when investing?

Katherine Glass-Hardenbergh 09:44

So the way we look at it is probably more of an evolution, less of a revolution. I don’t personally think that alternative data is ever going to take over all of the traditional information that we get from financial statements or from pricing, for example. There’s really going to be a very large role for that continuing going forward. Where alternative data comes in, I think, is to augment and continue to add to some of that. And one of the challenges with some of this alternative data is that we don’t necessarily know beforehand what’s going to work and what won’t. A lot of it probably won’t quite work; it won’t necessarily add that additional value that you can’t get from looking at a financial statement. So it’s kind of like an option value, and we’re trying to maximize that option value by picking which projects are most likely to pay off. The ways that we think alternative data is going to help add to some of that traditional data that we already have are, I’d say, fourfold. The first is getting data earlier than you might traditionally get it. So before earnings come out, is some of this data helpful, maybe for predicting sales ahead of time, if I’m getting, for example, consumer credit card information? The second is: am I going to get maybe an additional level of detail that I’m not going to get from financial statements? So maybe I’m going to get additional information about sales within each of the different stores that a company has. Maybe I’m going to be able to look further into the sentiment of those customers buying the product. Maybe I can look into customer segmentation and understand what types of customers are buying this product. Am I seeing any trends within that, that might be able to provide a little bit of extra clarity below that top-line sales number that you get quarterly? Third is you’re going to be able to see potentially a cleaner, more accurate picture.
So a lot of us have probably heard about the research showing that there are more companies that beat earnings by a penny than miss earnings by a penny. Can I use some of that alternative data to get a cleaner picture than I might otherwise get? Another example that there’s been a lot of interest in the industry for is using satellite imagery for some of the emerging economies, China in particular, to get at some of that information that might be a little bit harder to get from traditional sources. The final one, and I think this one’s probably the most uncertain as to whether we’ll be able to find information out of it, is what I’d call completely orthogonal information: something that you’re really not going to capture in financial statements at all. An example here is potentially employment data. So if you can learn something about the employees at a firm, what can that maybe tell you about that firm going forward?

Corey Hoffstein 12:35

So in practice, when you think about using this data, it sounds to me like at least three of those four categories were along the lines of, and the word you used, I think, was evolution, refinement of prior signals or prior data. Can you talk a little bit about what it actually looks like when you’re trying to incorporate this alternative data into an existing investment process? So by that, what I mean is, are we really just trying to get a better value signal? Are we trying to sharpen it through more details? Are we trying to create a differentiated signal? Or are there really opportunities, potentially, to even create an entirely new alpha signal? How do you think about that? How do you try to incorporate it into the investment process?

Katherine Glass-Hardenbergh 13:21

So as a multifactor quant, we’re never looking to make a completely new model, an alternative-data-driven model, for example. We’re going to be adding it on, as you said, to a lot of the existing information, and really hoping that that’ll help us augment and complement a lot of what’s out there. So what that’s usually going to look like is it’s going to be something in the form of a new factor that can complement some of the information we already have in there. And we’re really looking at what’s that idea? How is this information going to be additive? Is it adding something along the lines of value? Is it adding around a quality theme? Is it adding around a growth theme for a company? We’re kind of focusing in on thinking about that piece of it.

Corey Hoffstein 14:09

How do you think about measuring the potential success of incorporating this alternative data? When you’re trying to sharpen these signals, is it through just the traditional quant tools of measuring impact on return? Are you using analysis about whether it helps you with, like the example you used, better forecasting? Are you just measuring whether you’re able to more accurately forecast earnings? How do you think about actually measuring the success of incorporating this data?

Katherine Glass-Hardenbergh 14:38

So I think there’s two pieces to it. The first one is around prioritizing what to look at first. And here we’re thinking about what the breadth and coverage of this data is going to look like. How many stocks can I potentially improve my signal for, and am I going to be able to get enough history to do a back test and even get some information on it? From there, we’re measuring pretty much like we might any other factor or bit of information as we’re improving that model. Does it stand up on its own? Does it provide some insight? And when I add it to the model, does it continue to improve and provide some insight? Or take, for example, a case where I’ve used an alternative data source and created this new signal, and it looks great on its own, but I put it into the model, and it’s just replicated value signals. Well, that’s not very helpful; it was overly complicated. I can just get that information straight from financial statements. I can look at a price-to-book metric, for example. In which case, that idea in its current form isn’t necessarily going to go somewhere.

Corey Hoffstein 15:40

You mentioned sort of the availability of the data. Quants typically rely on both the breadth and the depth so that we can derive some statistical significance. One of the things I’ve noticed with some of these alternative data sets is that they just don’t go back that far. To the point you mentioned, you have individuals walking around with cell phones in their pockets that might be tracking their GPS location. Well, the reality is that data set can only go back 10 or 15 years, and it might have an incredible breadth across individuals, but there’s only so much history that we’re going to be able to be provided there. And so you might end up with this sort of very niche coverage. How do you think about developing statistical confidence when the data that you have can be very highly limited?

Katherine Glass-Hardenbergh 16:27

Yeah. And that, I think, is one of the key challenges with a lot of this alternative data, and it really makes it hard to adopt a lot of it. The way that we’ll look to deal with it is, first and foremost, we’re always going to start with the idea behind it. What is the economic rationale? Why do we think it will work? And from there, to start to build some of that confidence, we’re going to look for consistency and robustness in that signal. So even if I do only have a few years of data, do I see that it worked last year, and the year before, and the year before that? If it’s only worked in one of those three years, it’s harder to say there’s something there. Can I look for some consistency across industries, for example? Can I look for some consistency across geographic regions? Does it work in the US as well as in Europe? And we’re going to use some of those ideas to try to back up that initial economic intuition. So we’re probably not going to get the kind of significance that we’d expect from a lot of our traditional factors. But maybe we can build up enough where we see it’s moving in the right direction to give us some confidence.

Corey Hoffstein 17:39

You started your answer by saying that’s one of the big problems you face with using this data, which leads me to ask: what are some of the other big problems that you face when working with alternative data?

Katherine Glass-Hardenbergh 17:50

There’s actually a lot of challenges with it, which I think, if you can get through the challenges, means that there’s some pretty interesting opportunities. But it means it’s going to take a lot more time as you’re looking at it. Two of the other big challenges, I’d say, that we have besides that coverage are the quality of the data itself and prioritizing datasets. So on the prioritization side, we’ve talked a little bit about how the payoff is uncertain. There are, by the last metric I’ve seen, probably close to 1,000, if not over 1,000, different alternative data vendors and data sets out there. If you’re curious about checking a couple of those out, one free website I like to monitor is alternativedata.org. Well, if you’ve got 1,000 different data vendors to go through, how do you know which ones you should test? How do you know which ones are going to pay out? And if some of them look like they’re doing the same thing, how do you pick one vendor over another? That kind of leads into the second big challenge, which is the quality of it. It’s hard to know before you undertake a research project exactly how good this data is. When you’re looking at it, and you actually get it in house, now you’ve got to assess what’s the accuracy of it, and how do I even determine if this data is accurate? Is this data representative? So a couple of examples: let’s say I’m looking at when people post employment profiles online. Chances are a lot of that data is skewed to the more highly educated workforce. Well, for my investment thesis, is that going to work? Or am I missing out on too much of the workforce by getting the data from that source? Let’s say I’m looking at that geolocation information. It might give me a view into a store’s brick-and-mortar sales, but it’s going to completely miss the online segment. Is that okay? Or does that really mess with my investment thesis? Similarly, what if the data set is regional? Maybe it’s US only, maybe it’s Japan only or China only.
But a lot of companies out there are completely global. Will capturing just one country, which is pretty common in a lot of these datasets, be enough? Another one is line of business. Let’s say I’m looking at cars being sold. Well, what if I can capture a lot of the retail side? What about the commercial? Does this car company have a lot of different lines of business that I’m not necessarily capturing? These are all things that are hard to figure out. So you bring that data in house and you really start to look at it. Once you figure out if the data is representative, what about mapping the data? You ultimately need to get it back to some kind of financial instrument. So you may see the word Apple show up. Well, is it talking about the fruit, or is it talking about the company? Then there’s handling companies that don’t exist anymore, and really thinking through a lot of those challenges to get it into a workable state for quants. And I think one of the biggest things that people often overlook to begin with in the quality of some of these alternative datasets is the point-in-time aspect. As a quant, we’re looking to start with that idea and then construct a back test and see if it worked going back in time. Well, am I able to get that data as I would have known it back in time, when I was making that investment decision? That’s what point-in-time information is. In alternative data, I have multiple things to consider with it. Are the mappings point in time? Is the data point in time? Was it reported point in time? A great example here is patent data. With patent data, I may be able to know point in time when that patent was issued and when it became available to the general public to know that patent existed. But was it mapped point in time? Am I able to map those patents to companies that no longer exist? And do I have, in particular, the ownership point in time?
So for example, if a vendor can tell me how many patents existed in 2010 that Nike now owns, that’s very different from how many patents Nike owned in 2010. So you really need to get through a lot of those quality questions on a data set to be able to move forward and get to the process that we use to test new factors and improve our model with some of the more traditional stuff. That’s what really makes it a challenge working with alternative data.
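Editor’s sketch: the patent example above is the difference between a point-in-time query and a look-ahead query. The snippet below is a minimal illustration under assumed data: the `Ownership` record layout, the function names, and the company names `AcmeCo` and `OtherCo` are all hypothetical and do not describe any real vendor’s schema or any real company’s patents.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Ownership:
    patent_id: str
    owner: str
    issued: date               # when the patent became public
    owned_from: date           # start of this owner's holding period
    owned_to: Optional[date]   # None means the owner still holds it


def owned_as_of(records, owner, as_of):
    """Point-in-time view: patents the owner actually held on as_of."""
    return {
        r.patent_id
        for r in records
        if r.owner == owner
        and r.issued <= as_of
        and r.owned_from <= as_of
        and (r.owned_to is None or as_of < r.owned_to)
    }


def existed_then_owned_now(records, owner, as_of):
    """Look-ahead view: patents issued by as_of that the owner holds today."""
    return {
        r.patent_id
        for r in records
        if r.owner == owner and r.issued <= as_of and r.owned_to is None
    }


# Hypothetical history: P1 changes hands in 2012, P2 is sold in 2015.
RECORDS = [
    Ownership("P1", "OtherCo", date(2005, 1, 1), date(2005, 1, 1), date(2012, 1, 1)),
    Ownership("P1", "AcmeCo", date(2005, 1, 1), date(2012, 1, 1), None),
    Ownership("P2", "AcmeCo", date(2008, 1, 1), date(2008, 1, 1), date(2015, 1, 1)),
    Ownership("P3", "AcmeCo", date(2009, 1, 1), date(2009, 1, 1), None),
]
```

Querying `AcmeCo` as of mid-2010, the point-in-time view returns P2 and P3, while the look-ahead view returns P1 and P3. A back test built on the second view would credit the company with a patent it had not yet acquired and miss one it later sold.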

Corey Hoffstein 22:15

One of the things you said to me when we were preparing for this conversation a couple of weeks ago really stuck with me. You mentioned alternativedata.org, and I went on and checked it out, and I was incredibly impressed by the breadth and the unique nature of the offerings. But one of the things you mentioned was that one of the problems you face is not just getting the data historically, but also making sure you get the data going forward, because so many of these data vendors I’ve never even heard of, and you have no idea whether they’re going to continue going forward. How do you think about that aspect of the problem, which is you might find a great data set, but you don’t know if the vendor is even going to stay in business?

Katherine Glass-Hardenbergh 22:54

Yeah. And that’s another piece when we’re looking to prioritize what projects we’re going to look at: thinking about the vendor themselves. So when we’re evaluating vendors, we love it if they’re a very established vendor, where we think they’re going to be around and we think they’re going to continue in this line of business selling to asset managers. But you can’t always get that, especially with some of the more interesting data sets. So is it a vendor, perhaps, that has another line of business that will continue to see them through, versus only showing up for the asset management industry? In that case, it’s hard to say; maybe they lose interest, they leave, they go somewhere else, and that vendor doesn’t continue. And it’s also a little bit of a question of, if I do find a signal, is it going to be good enough if maybe it is only around for a few years and that vendor shuts down? Am I able to maybe find that information somewhere else, should they not continue on? And it becomes part of our prioritization process and the questions we ask ourselves in a series of trade-offs. Maybe we’re a little nervous about the vendor, but if they’ve got great coverage and history, maybe that’s an okay trade-off, and we’ll prioritize looking at this particular data set next.

Corey Hoffstein 24:02

To the point you just made around maybe the vendor only being around for a couple of years, but that being good enough to use that data: earlier in our conversation, it sounded like the opportunity for alternative data was really more in the analysis of the data. If we think of edges as being informational or analytical, this was more on the analytical side: new data that could help sharpen our view and our perspective and hopefully get more correct answers. But how much do you look at alternative data as potentially being a bit of an arms race, where the first person to incorporate a new data set could potentially find some interesting edge that might eventually get arbitraged away as more people include alternative data in their process?

Katherine Glass-Hardenbergh 24:44

So I think initially, alternative data very much started as an arms race. You had folks out there hiring what are called data scouts to go out and just find access to this information in order to get it into their process as quickly as possible, and just having access was the key. I think what we’ve seen, as it’s become still very new but a little bit more mature within the industry, is that now it’s moved on to the analysis side of it. If alternativedata.org is available to everybody in the public, all of those datasets are very widely known, and anyone can call up the vendor and get a subscription. So it’s less about the access now, and more about what you do with the data. So from our perspective, what we’re generally looking for is data that’s a little bit more raw in form. We don’t want the vendor to do all of the work for us and hand us a signal that says, from 1 to 100, 100 is good, 1 is bad, invest on this, trust us. Because they can go and sell that to 20 other asset managers, that signal is not going to stay around for very long; it’s going to very quickly get priced in. Can we get at that underlying data and really form our own unique thesis, do some of our own unique analysis, to find our edge in alternative data?

Corey Hoffstein 25:58

So alternative data, as we’re talking about seems to really be a byproduct of the big data world we live in today. There’s all this data that’s being constantly generated, you mentioned this idea of exhaust data. But you did mention offhand to me in a prior conversation. And I don’t know if you’ll recall saying this, but I took note of it that big data can be rather small in nature. And I was hoping you might be willing to expand on what you meant by that.

Katherine Glass-Hardenbergh 26:23

Sure. And it’s very much the idea that you might have gigabytes or terabytes of data, but once you distill it down, you end up with something that quickly becomes very niche, something that’s not necessarily broadly applicable, maybe even as small as a single monthly number out of these terabytes of information. So I think what might be helpful here is kind of talking about an example. Something that’s absolutely massive is satellite imagery. I can image the entire globe on a very regular basis, and we continue to see more and more satellites launched into space to get more imagery more frequently. Huge data. Well, what can I do with it? Let’s say we start out with an idea that we’re going to measure economic activity. I’m able to see the entire globe, so maybe I can create some type of a country signal; I can get ahead on estimating GDP, for example. So then I say, well, what am I going to look for in these images to estimate GDP? Maybe I’ll look for industrial production, heavy-industry type information. How can I find heavy industry? Well, maybe I can look at the heat that is generated in manufacturing and see some of these thermal hotspots. So what shows up? Maybe I see steel production, I see natural gas flares. I end up with a lot of potential false positives that I’ve got to figure out how to filter out of the information as well. Now I’ve got to say, well, how do I measure this information? Do I count the number of hotspots I see being produced at a steel plant? Do I look at the frequency or the size of these hotspots? Is it going to correspond to the plant being on? Or will it correspond more to the throughput of that plant? I’ve gone a little bit in depth here, so let’s take a step back: does that even capture, let’s say I’m looking at steel, all of the steelmaking activities out there?
Well, it turns out, for example, there are multiple types of steel manufacturing processes. You’ve got old-school blast furnaces that put off a lot of heat, so you can really pick them up with this idea. But you’ve got newer technology, electric arc, which is a lot more efficient and doesn’t necessarily put off as much heat. So all of a sudden my idea is really shrinking, and I’ve got to zoom back out again. What else can I look at? Do I measure stockpiles of materials getting used? Maybe I can measure the growth of industrial sites if they’re being built, but then I won’t really capture what’s happening at existing industrial sites. Maybe I need to expand beyond that heavy industry idea and look at crops. Maybe looking at the health of crops and the size of crop production would give me a leg up in some countries that have a lot of agricultural exports. Maybe I look at tracking activity, shipping activity from one port to another. I’ve suddenly gone from this grand idea of looking at imagery across the entire world to realizing that I’m very niche. I have to measure everything very differently and very carefully, and layer on one small little signal at a time, where maybe I’ve been able to measure the output of steel factories, or I’ve been able to measure shipping activity of a particular commodity. But that really big, grand idea doesn’t have nearly the coverage that I initially expected without really doing a lot of that niche, individual work to add on to it. So that’s kind of what I’m getting at when I think of big data becoming very small: there are a lot of cases where you start out with a lot of information, but by the time you distill it, your coverage has become quite small.

Corey Hoffstein 30:02

As you’re talking about it, it strikes me that some of these datasets might be highly specific to an individual industry or even sub-industry, and that you might be working on a dataset that might sharpen your signal for just one particular type of company, which leaves the remainder of your portfolio unenhanced to a certain degree. You now have more accuracy, perhaps, on miners, or more accuracy on some other type of company, but everything else in the universe is untouched. Does that represent a risk in any way, that you are somehow working with one data set on one specific type of company and you’re able to generate some signal there, but you haven’t worked on the rest of the portfolio at all? Or can you sort of piecemeal, incrementally improve these different sectors and industries in the way in which you’re looking at them?

Katherine Glass-Hardenbergh 30:54

So you can certainly piecemeal improve. For us, with that very niche coverage, the question becomes: is it worth it? Our stock universe is tens of thousands of stocks. If I’m really only able to get, let’s say, the mining companies, and I get a coverage of 20, for a multifactor quant it’s going to be hard to justify even doing the research into something that we know is only going to have a coverage of 20 names. That data set might be extremely valuable and helpful for a fundamental investor that focuses on investing in mining companies. And it might be helpful for a quant, but it’s going to be a lot harder for us, in the prioritization queue, to put that above something that has broader coverage and is going to make a bigger incremental impact on our process, versus a more fundamental guy that maybe is only looking at a couple of names.

Corey Hoffstein 31:49

So sticking with that idea of a prioritization queue: there are tons of new alternative data sets coming out all the time. Maybe we can take this a little bit more from the theoretical to the practical and talk about how you and the team actually think about creating that prioritization queue. As these new data sets come out, how do you organize the research team to try to tackle them and really maximize the impact that you can have?

Katherine Glass-Hardenbergh 32:18

So the way that we’re thinking about it is that alternative data, first and foremost, is just one piece of our research agenda. We’re always looking to be improving our model, and we’re looking at multiple facets to do that. Alternative data is going to come in a little bit more on the uncertain, optionality value side of it. But being a fairly large firm with a pretty well-staffed research team, we’ve got the resources to go more in depth and to take on some of these projects. Specific complexities get added when we do some of these alternative data projects. So as we started talking a little bit about that prioritization queue, that’s the first piece of it, where you get a little bit extra that you need to think about: which projects to take on. I mentioned before that we first and foremost are always going to start out with, what’s the idea? What’s that research thesis that we’re looking to get information from in picking one of these projects? With the next couple of questions we ask ourselves, we’re going to weigh the pros and cons and be more liable to pick the datasets that have more pros. Do we trust that vendor? We’ve talked a little bit about that. Do we trust the accuracy and the quality of the data that we can get out of it? And there’s information we want before we actually take a data set in house and start to analyze it: what does that coverage look like? What is that scope? Is it something that only applies to a single industry? Does it apply a little bit more broadly? Is there enough history that we think we could do a pretty good backtest? Another piece of it is, what’s the effort that we’re going to have to put in? One of the things that we think about is, what work are we going to let the data vendor do, and what work do we want to do? I briefly mentioned, for example, that we don’t like finished signals; we want to have more of the raw data to be able to do some of that analysis. 
But we might be a lot more willing to let the data vendor be in charge of getting the data, doing the collection, doing the curating, doing the mapping, for example, as opposed to bringing that effort in house. Is that really something where we want to specialize? Or is it easier to let the data vendor do some of that and to really bring our expertise in the analysis piece to it? So we’re going to consider that, especially given we don’t know if this is going to work out yet, so we don’t know how much effort we want to put into something. The final piece of it is, what is the uniqueness of this dataset? And in thinking about uniqueness, we’re thinking about whether it overlaps with traditional information that I have access to, but also how it looks relative to other data sources. So a great example is, let’s say I’m trying to predict sales before earnings information comes out. I could use satellite data to count the cars in Walmart’s parking lots and Target’s parking lots, etc., each month, and if I’m seeing an increase or decrease in the number of cars, maybe that’ll tell me something about the direction of their sales. But I could also get that information from geolocation: how many individuals are visiting those stores? And maybe that dataset gets a little bit more coverage as well, because it gets stores that people don’t often drive to; it gets city dwellers, for example. Potentially, though, it’s a little bit more noisy, because is it really that accurate? Do I know if you’re in the office above the Starbucks or in the Starbucks? But maybe I can also get that same intuition about sales from credit card receipt data. That’s gonna give me a really direct measure of how much money people are spending at stores. So we’re thinking about the different places we could get at that research thesis, that we’d like to predict sales ahead of time, and what data source is going to give us the best angle at looking at it. 
Once we actually pick that data set project that we’re going to look at, it very quickly starts to look like a regular factor project, like we’d put any new data, new information, new idea through our process. First, does that hold up on its own in a univariate sense? Is it in some way predicting returns or sales, whatever that may be? And if it does, how does it hold up in the rest of our model, in that multivariate sense? Is it still adding value once I add it to everything that’s in the model? What is its correlation to other information in the model? Is it providing some of that unique information? And throughout that process, one of the questions we’re constantly asking ourselves is, how much time do I put into immediately jumping in and testing my signal, so that if it doesn’t work I can move on, versus first cleaning that data and understanding it better? Because if it’s far too noisy, it’s garbage in, garbage out: no matter how good my idea is, if I jump straight to testing the signal, I’m not going to see something there. So we’re considering that trade-off of effort in the beginning to clean the data, versus quickly moving through as many ideas as possible and jumping straight to testing the signal.
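The univariate-then-multivariate check described in this answer can be illustrated with a minimal sketch. It is not Acadian’s process: the numbers are made-up toy data, the names `new_signal`, `existing_factor`, and `forward_returns` are hypothetical, and a single existing factor with a simple linear regression stands in for "everything that’s in the model".

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def residualize(y, x):
    """Remove from y the part explained by a one-variable linear regression on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - my - beta * (a - mx) for a, b in zip(x, y)]

# Hypothetical toy cross-section of six stocks (illustrative values only).
existing_factor = [0.1, -0.2, 0.3, 0.0, -0.1, 0.4]
new_signal = [0.2, -0.1, 0.1, 0.3, -0.3, 0.5]
forward_returns = [0.03, -0.01, 0.02, 0.04, -0.02, 0.05]

# Step 1: does the signal hold up on its own (univariate)?
univariate_corr = pearson(new_signal, forward_returns)

# Step 2: how much does it overlap with what the model already has?
overlap = pearson(new_signal, existing_factor)

# Step 3: does it still relate to returns once the existing factor is stripped
# out of both the signal and the returns (a partial correlation)?
incremental_corr = pearson(
    residualize(new_signal, existing_factor),
    residualize(forward_returns, existing_factor),
)

print(univariate_corr, overlap, incremental_corr)
```

A signal with a high univariate correlation but a near-zero incremental correlation would mostly be repeating information the model already has, which is the uniqueness question raised earlier in the answer.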

Corey Hoffstein 37:08

You mentioned something really early in your answer there that just sort of hit me like lightning. I don’t know why I hadn’t thought of it before. But you mentioned this idea of actually trusting the vendor. And one of the things that came to mind as you’re talking about some of these unique datasets is that this idea of trust and uniqueness might actually be on a bit of a seesaw, because the more unique the data, the harder it is for you to actually independently verify that the data is even accurately measuring what it claims to measure. With so much of the traditional data that we typically work with as quants, the well-structured price and fundamental data, there are highly reputable data sources, and you can typically have a couple of different data sources to verify them. And even then, there are still errors in the data. Talk to me about this trust issue when someone’s claiming to be giving you GPS information or satellite information. How do you gain the trust that it’s actually up to date? How do you gain the trust that it’s not just complete noise that they’re feeding you, that it is what they say it is?

Katherine Glass-Hardenbergh 38:12

It’s certainly very difficult. With, for example, this geolocation data in particular, there are a lot of pieces that go into it. If they’re going to provide this information to you, they’re getting this location. Well, how accurate is that location? Is it plus or minus 100 yards, or plus or minus a mile? If it’s plus or minus a mile, you’re not going to get much out of it; stores aren’t that big. It doesn’t capture, for example, for city dwellers, how high off the ground I am if I’ve got skyscrapers. It assumes as well that the data vendor has gone in and figured out where all of these different stores are. Where are the Walmarts versus the Targets versus the Home Depots? Did they capture them? Did they say what those GPS coordinates were around each of these stores? Did they miss some of the stores? Did they misclassify some? If a store changes, are they up to date on the fact that this is no longer a Walmart in this location and, I don’t know, a gym has opened up in the old space? It’s certainly not an easy task on some of this really unique data to sort through. We can get at a little bit of it by trying to get to that more raw-level data, by speaking with the vendor about their process and seeing how they go about it. How do they measure accuracy from their point of view? How are they continuously looking to improve their data collection? But at the end of the day, it’s only a snapshot into it, and you’re not necessarily ever going to get that same amount of trust and accuracy as you would get, as you mentioned, from some of these fundamental data sources that still have data errors.
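The point about location accuracy can be made concrete with a small sketch. It assumes, purely for illustration, that each ping comes with a horizontal accuracy radius in meters and that a store can be approximated by a circle around a center point; the function names and the coordinates below are hypothetical, and real vendor data would typically use store polygons and would also have to deal with the height problem mentioned above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def confidently_inside(ping_lat, ping_lon, accuracy_m,
                       store_lat, store_lon, store_radius_m):
    """True only if the ping's whole uncertainty circle fits inside the store circle."""
    distance = haversine_m(ping_lat, ping_lon, store_lat, store_lon)
    return distance + accuracy_m <= store_radius_m

# Hypothetical store approximated as a 75 m circle (coordinates are made up).
store = (42.3601, -71.0589, 75.0)

# A ping at the store's center with 10 m accuracy can be attributed to the store.
print(confidently_inside(42.3601, -71.0589, 10.0, *store))    # True

# The same ping with roughly one mile (1,609 m) of uncertainty cannot.
print(confidently_inside(42.3601, -71.0589, 1609.0, *store))  # False
```

Under this strict rule, pings with mile-scale uncertainty never count as visits, which is one way to see why such data yields little store-level signal.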

Corey Hoffstein 39:50

So alternative data, being so large, so unique, and potentially having some nonlinear relationships with whatever type of variable you’re trying to capture, typically gets mentioned in the same breath as machine learning. Typically, in the industry, you have panels on alternative data and machine learning as sort of being the edges of quant finance today. If we were to actually remove machine learning from the equation and say you can’t use machine learning anymore, how large do you think the opportunity remains with alternative data, just using the traditional, more statistically driven techniques that quants have historically used?

Katherine Glass-Hardenbergh 40:29

So I think the opportunity probably shrinks quite a bit, especially when you’re looking at some of these broad datasets where we’re able to bring a lot of manual processes to scale. I usually think of machine learning in two senses. One is that it’s an enabler: it helps us make data more accessible and more at scale. Take natural language processing or image processing. It’s really taking a lot of those human capabilities, where you or I can go look at an image, or we could read the transcript from an earnings call or listen to the call ourselves and make a judgment call. With machine learning, I can start to bring that into the purview of the machines, and I can really start to do that at scale. You or I could by no means, even if we tried, listen to every single earnings call that happens, but a machine certainly can. A machine can also scrape the web and visit different websites and gather that data a lot more quickly than you or I could. You or I could manually go on Amazon and collect the prices of different products; we could look for sales to understand if companies were discounting their products or not. But it’s going to be a whole lot faster, and we’re gonna get a lot larger scale of that information, by using machine learning to help us there. Machine learning also helps us with the mapping challenge. We can read the names of companies and manually map them to tradable securities, but again, we can only do that at a small scale; machine learning helps enable that much larger, broader scale. So that’s one piece of it: if you take away all machine learning, you in essence slow things down a lot. The other piece of it, which you probably lose quite a bit of, is the ability to extract some of those nonlinear relationships. Go back to very traditional statistical techniques, and you can only get so much information out of the data if there are a lot of complex relationships that you can’t pick out just by looking at it, for example.
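To give a feel for the mapping challenge mentioned in this answer, here is a minimal sketch of mapping free-text company names to tickers. It deliberately uses plain fuzzy string matching from Python’s standard library, not machine learning, so it is only a baseline. The tiny `SECURITY_MASTER` table, the `normalize` helper, and the 0.8 similarity cutoff are all assumptions made for the example.

```python
from difflib import get_close_matches

# Hypothetical, tiny security master: normalized company name -> ticker.
SECURITY_MASTER = {
    "walmart inc": "WMT",
    "target corp": "TGT",
    "home depot inc": "HD",
}

def normalize(name):
    """Lowercase, replace commas and periods with spaces, collapse whitespace."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())

def map_to_ticker(raw_name, cutoff=0.8):
    """Return the ticker of the closest known name, or None if nothing is close enough."""
    matches = get_close_matches(
        normalize(raw_name), list(SECURITY_MASTER), n=1, cutoff=cutoff
    )
    return SECURITY_MASTER[matches[0]] if matches else None

print(map_to_ticker("Wal-Mart Inc."))         # WMT
print(map_to_ticker("The Home Depot, Inc."))  # HD
print(map_to_ticker("Acme Gym LLC"))          # None
```

Returning `None` for names with no close match, rather than forcing a best guess, leaves unmatched names for manual review, which is the small-scale manual step the answer describes.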

Corey Hoffstein 42:22

So still connecting the dots here with the more traditional factors: empirical and academic evidence tends to suggest that a lot of your traditional value, quality, and momentum type factors work a lot better in the small cap space, because they’re either under-covered or there are actual liquidity and implementation costs that create the sort of limits to arbitrage. Do you find the same effect to be true with alternative data?

Katherine Glass-Hardenbergh 42:49

So I think there’s a really interesting caveat to alternative data that almost makes it that the opposite could be true. And the reason is data availability. Bigger companies tend to leave a much larger data footprint, so it’s a lot easier for alternative data sources to pick them up and to get some pretty good coverage. Bigger companies are more likely to have a lot of different consumer transactions, to have an international trade footprint, to have facilities or stores that satellite imagery or geolocation could pick up. They’re more likely to have a presence on social media, to have people talking about them on social media, expressing their sentiment and opinions. They’re more likely to be mentioned in the news. Small caps, on the other hand, just by virtue of being so much smaller, aren’t necessarily going to have that kind of coverage. My theory would be that if they had great coverage, you might see a lot of this stuff work better in small caps. But if you just don’t have the data, you’re not going to be able to get the type of signal out of some of these sources as you might in larger companies, where you do have a lot better coverage. So it’s kind of an interesting dichotomy where it switches from a lot of traditional academic literature because of that data availability piece.

Corey Hoffstein 44:06

As someone who is super plugged in to the alternative data space, and probably hears a lot about what’s going on in the industry, what are some of the other interesting ways that you’ve heard of other firms trying to use alternative data?

Katherine Glass-Hardenbergh 44:20

So by nature of being a multifactor, quantitative shop, we’re generally looking for slower signals with much higher breadth. But you really see alternative data sources running the gamut of being available to other types of investors. There are people that are looking for a very high breadth but fast signal, a stat arb firm, for example, really being able to jump in on some of that sentiment, understanding social media feeds, understanding news, and trading intraday on that information. You’ve got the opposite side of it being a very low breadth but fast signal, where an event-driven shop can use information. For example, one of the cool things I’ve heard is people tracking corporate jet movements to try to predict M&A activity, watching where one executive is flying to another if I know where company headquarters are located. Or looking at viewing habits of folks: what are people watching on Netflix? What’s the latest blockbuster that’s come out on HBO, etc.? That tends to be a little bit more niche, but very impactful on a company. Perhaps some of my favorite examples are in that low breadth, slow signal space as well. So you’ll see activist investors, private equity, venture capital, real estate, sometimes being really interested in some of these very narrow datasets; maybe it only covers one or two stocks, but that’s just fine. So real estate developers, or somebody that invests in REITs, looking at foot traffic, for example, at airports or shopping malls, understanding where people are moving around and spending their time. There was a lot of interest for a while, before Uber and Lyft went public, in keeping tabs on which company was winning out in the rideshare world, who was getting market share from whom, and using alternative data to get an insight into that. We’re also seeing some of this start to spread over to the fixed income world, where people are saying, can I use some of this data to get a heads up on credit as well? 
So lots of different applications that are out there for alternative data, not necessarily just the stuff that we’re looking for.

Corey Hoffstein 46:30

So that’s a really interesting other angle of all this. I know we’ve spent a lot of time in this conversation talking about the application of alternative data towards more traditional factor systematic equity. It seems like there might be an opportunity in credit, commodities, currencies, macro; you’re mentioning event driven. Would it be mischaracterizing for me to say that you think alternative data really has a strong breadth of application across a large number of investment styles?

Katherine Glass-Hardenbergh 47:01

Yeah, I’d certainly say it does. And not all alternative data is for everybody. As a result, different people are going to be naturally drawn to, and find better use cases for, different types of datasets. And they’re not necessarily going to have the same exact prioritization queues as a firm like Acadian might.

Corey Hoffstein 47:18

To invert the idea a little bit, are there any investment styles that you think are particularly poorly suited for alternative data?

Katherine Glass-Hardenbergh 47:26

Immediately coming up, I’m having trouble thinking of an example that could be poorly suited, because there are a lot of different vendors and there are a lot of types of data out there. And I think one of the keys that you’ll see some of these data vendors get at is that, as a quantitative shop, we’re looking to do a lot of the analysis in house, but these vendors cater to everybody. And so they’ll also offer to do a lot of that analysis, to really make their data available no matter your level of desired output. So you’ll see a lot of these vendors creating dashboards where they’ve already completely aggregated up and analyzed the data, and they’re gonna give you the output that you can interpret, to really try to make it available to many different use cases.

Corey Hoffstein 48:12

All right, Katherine, last question of the podcast for you. It’s the same question I’m asking everyone at the end of the season, and it’s supposed to get at a little bit of your investment personality. If I said to you that today you had to liquidate all of your personal investments and could only invest in one thing for the rest of your life, where that one thing can be an individual asset class, a portfolio, or an investment style, so it could be an active strategy, but you have to stick with it, and it’s got to be the one thing you invest in for the rest of your life: what is it, and why?

Katherine Glass-Hardenbergh 48:53

Can I invest in a very broad, diversified set of equities across the globe?

Corey Hoffstein 49:01

As long as you tell me why.

Katherine Glass-Hardenbergh 49:03

Sure. So I’m gonna go with that for my personal investments, because I’m only going to be able to invest in it for the rest of my life. In order to find alpha and to pick a very specific asset class or a specific security, I’d really want to be able to do my research, being a quant, and feel confident in it. And for the rest of my life, I don’t know how much time I’m going to have to put into constantly doing research, constantly improving, keeping up with the changing and evolving landscape. So I’m gonna hedge my bets, take advantage of diversity, and look to get the general return throughout my life.

Corey Hoffstein 49:37

Katherine, it’s been really great having you on. This has been super educational for me, and I know it will be for the listeners. So thank you.

Katherine Glass-Hardenbergh 49:44

Thank you as well, Corey. I appreciate speaking with you today.