Essays in Predictive and Causal Machine Learning

Okasa, Gabriel

Type

doctoral thesis

Date Issued

2022-02-21

Author(s)

Okasa, Gabriel

Abstract

This dissertation consists of three chapters devoted to topics in predictive and causal machine learning. Common to all chapters is the synthesis of classical econometric methods and novel machine learning algorithms. Hence, this doctoral thesis provides new insights into applications of machine learning for predictive tasks and for causal inference.
The first chapter investigates the estimation of heterogeneous causal effects using machine learning. We focus on the meta-learning framework, where the estimation of the causal parameter is decomposed into separate prediction tasks. Using synthetic and empirical simulations, we study the finite-sample performance of meta-learners based on the Random Forest algorithm under different implementations using sample-splitting and cross-fitting procedures. The results imply that sample-splitting is beneficial in large samples for bias reduction but leads to an increase in variance, whereas cross-fitting keeps the bias low and successfully restores the efficiency of the full sample size. In contrast, full-sample estimation is preferable in small samples when using machine learning. Additionally, we provide guidelines for applications of meta-learners in empirical studies depending on particular data characteristics such as treatment shares and sample size.
The second chapter considers the estimation of ordered choice models using machine learning. As in the first chapter, we focus on the Random Forest algorithm and develop a new machine learning estimator for models with an ordered categorical outcome variable. The proposed Ordered Forest flexibly estimates the conditional ordered choice probabilities while taking the ordering information explicitly into account. In contrast to common machine learning estimators, it is not only suited for prediction tasks but also enables the estimation of marginal effects and statistical inference, which provides additional interpretability as in classical econometric estimators. We conduct an extensive simulation study and find good predictive performance, particularly in settings with nonlinearities and multicollinearity. Furthermore, we demonstrate the estimation of marginal effects and their standard errors in an empirical application.
The third chapter presents an empirical application based on the estimation of causal effects using machine learning. As in the previous two chapters, we rely on the Random Forest method and consider its causal variant, the Modified Causal Forest. Following the rise of online dating, we study the effect of sport activity on partner choice by exploiting a unique dataset from an online dating platform. In particular, we estimate the causal effect of sport frequency on contact chances, controlling for a large set of observable user characteristics. We find that for male users, doing sport on a weekly basis increases the probability of receiving a first message by more than 50% in comparison to no sport activity. In contrast, we do not find such evidence for female users. Moreover, the results indicate heterogeneity: for male users, the effect increases with higher income.

Language

English

Keywords

Ökonometrie

Kausalanalyse

Maschinelles Lernen

EDIS-5206

Econometrics

Machine Learning

Causal Inference

HSG Classification

not classified

HSG Profile Area

None

Publisher

Universität St. Gallen

Publisher place

St.Gallen

Official URL

https://nbn-resolving.org/urn:nbn:ch:bel-2264283

URL

https://www.alexandria.unisg.ch/handle/20.500.14171/108976

Subject(s)

economics

Division(s)

SEW - Swiss Institute...

Eprints ID

265914

File(s)

Dis5206.pdf (7.65 MB)
