Essays in Predictive and Causal Machine Learning
Type
doctoral thesis
Date Issued
2022-02-21
Author(s)
Abstract
This dissertation consists of three chapters on topics in predictive and causal machine learning. Common to all chapters is the synthesis of classical econometric methods and novel machine learning algorithms. The thesis thereby provides new insights into applications of machine learning for predictive tasks and for causal inference.
The first chapter investigates the estimation of heterogeneous causal effects using machine learning. We focus on the meta-learning framework, in which the estimation of the causal parameter is decomposed into separate prediction tasks. Using synthetic and empirical simulations, we study the finite-sample performance of meta-learners based on the Random Forest algorithm under different implementations using sample-splitting and cross-fitting procedures. The results imply that sample-splitting reduces bias in large samples but increases variance, whereas cross-fitting keeps the bias low and restores full-sample efficiency. In small samples, by contrast, full-sample estimation is preferable when using machine learning. Additionally, we provide guidelines for applying meta-learners in empirical studies depending on data characteristics such as treatment shares and sample size.
The second chapter considers the estimation of ordered choice models using machine learning. As in the first chapter, we focus on the Random Forest algorithm and develop a new machine learning estimator for models with an ordered categorical outcome variable. The proposed Ordered Forest flexibly estimates the conditional ordered choice probabilities while taking the ordering information explicitly into account. In contrast to common machine learning estimators, it is suited not only for prediction tasks but also for estimating marginal effects and conducting statistical inference, which provides interpretability comparable to that of classical econometric estimators. We conduct an extensive simulation study and find good predictive performance, particularly in settings with nonlinearities and multicollinearity. Furthermore, we demonstrate the estimation of marginal effects and their standard errors in an empirical application.
The third chapter presents an empirical application based on the estimation of causal effects using machine learning. As in the previous two chapters, we rely on the Random Forest method and consider its causal variant, the Modified Causal Forest. Following the rise of online dating, we study the effect of sport activity on partner choice by exploiting a unique dataset from an online dating platform. In particular, we estimate the causal effect of sport frequency on contact chances, controlling for a large set of observable user characteristics. We find that for male users, doing sport on a weekly basis increases the probability of receiving a first message by more than 50% compared with no sport activity. We find no such evidence for female users. Moreover, the results indicate heterogeneity: for male users, the effect increases with income.
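The cross-fitting procedure mentioned in the summary of the first chapter can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes a T-learner as the meta-learner (the abstract does not name specific meta-learners) and swaps the Random Forest for a toy group-mean learner so that the example runs with the standard library alone. All function names are hypothetical. The point it shows is that every observation receives an effect estimate from a model that was not trained on it, while all observations are still used.

```python
import random
from statistics import mean


def fit_group_means(rows):
    """Toy base learner (stand-in for a Random Forest, an assumption):
    mean outcome within each (covariate, treatment) cell."""
    cells = {}
    for x, d, y in rows:
        cells.setdefault((x, d), []).append(y)
    return {key: mean(values) for key, values in cells.items()}


def predict_effect(model, x):
    """T-learner effect: predicted outcome under treatment minus control."""
    return model[(x, 1)] - model[(x, 0)]


def cross_fit_effects(rows, n_folds=2, seed=0):
    """Return one out-of-fold effect estimate per observation.

    rows: list of (x, d, y) tuples with covariate x, treatment d in {0, 1},
    and outcome y. Each fold is predicted by a model fit on the other folds.
    """
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]

    effects = [None] * len(rows)
    for held_out in folds:
        held = set(held_out)
        train = [rows[i] for i in idx if i not in held]
        model = fit_group_means(train)      # fit on the other folds only
        for i in held_out:                  # predict on the held-out fold
            effects[i] = predict_effect(model, rows[i][0])
    return effects
```

Plain sample-splitting corresponds to running a single iteration of the loop, which leaves part of the sample without estimates. Cross-fitting rotates the roles of the folds so that the full sample contributes.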
Language
English
Keywords
Ökonometrie
Kausalanalyse
Maschinelles Lernen
EDIS-5206
Econometrics
Machine Learning
Causal Inference
HSG Classification
not classified
HSG Profile Area
None
Publisher
Universität St. Gallen
Publisher place
St.Gallen
Official URL
Subject(s)
Division(s)
Eprints ID
265914
File(s)