Introduction

GlobalSearchRegression Build Status

Abstract

GlobalSearchRegression is both the world-fastest all-subset-regression command (a widespread tool for automatic model/feature selection) and a first-step to develop a coeherent framework to merge Machine Learning and Econometric algorithms.

Written in Julia, it is a High Performance Computing version of the Stata-gsreg command (get the original code here). In a multicore personal computer (we use a Threadripper 1950x build for benchmarks), it runs up-to 100 times faster than the original Stata-code and up-to 10 times faster than well-known R-alternatives (pdredge).

Notwithstanding, GlobalSearchRegression main focus is not only on execution-times but also on progressively combining Machine Learning algorithms with Econometric diagnosis tools into a friendly Graphical User Interface (GUI) to simplify embarrassingly parallel quantitative-research.

In a Machine Learning environment (e.g. problems focusing on predictive analysis / forecasting accuracy) there is an increasing universe of “training/test” algorithms (many of them showing very interesting performance in Julia) to compare alternative results and find-out a suitable model.

However, problems focusing on causal inference require five important econometric features: 1) Parsimony (to avoid very large atheoretical models); 2) Interpretability (for causal inference, rejecting “intuition-loss” transformation and/or complex combinations); 3) Across-models sensitivity analysis (uncertainty is the only certainty; parameter distributions are preferred against “best-model” unique results); 4) Robustness to time series and panel data information (preventing the use of raw bootstrapping or random subsample selection for training and test sets); and 5) advanced residual properties (e.g. going beyond the i.i.d assumption and looking for additional panel structure properties -for each model being evaluated-, which force a departure from many traditional machine learning algorithms).

For all these reasons, researchers increasingly prefer advanced all-subset-regression approaches, choosing among alternative models by means of in-sample and/or out-of-sample criteria, model averaging results, bayesian priors for theoretical bounds on covariates coefficients and different residual constraints. While still unfeasible for large problems (choosing among hundreds of covariates), hardware and software innovations allow researchers to implement this approach in many different scientific projects, choosing among one billion models in a few hours using standard personal computers.

Installation

GlobalSearchRegression requires Julia 1.0.1 (or newer releases) to be previously installed in your computer. Then, start Julia and type "]" (without double quotes) to open the package manager.

julia> ]
pkg>

After that, just install GlobalSearchRegression by typing "add GlobalSearchRegression"

pkg> add GlobalSearchRegression

Optionally, some users could also find interesting to install CSV and DataFrames packages to allow for additional I/O functionalities.

pkg> add CSV DataFrames

Basic Usage

To run the simplest analysis just type:

julia> using GlobalSearchRegression, DelimitedFiles
julia> dataname = readdlm("path_to_your_data/your_data.csv", ',', header=true)

and

julia> gsreg("your_dependent_variable your_explanatory_variable_1 your_explanatory_variable_2 your_explanatory_variable_3 your_explanatory_variable_4", dataname)

or

julia> gsreg("your_dependent_variable *", data)

It performs an Ordinary Least Squares - all subset regression (OLS-ASR) approach to choose the best model among 2<sup>n</sup>-1 alternatives (in terms of in-sample accuracy, using the adjusted R<sup>2</sup>), where:

Advanced usage

Alternative data input

Databases can also be handled with CSV/DataFrames packages. To do so, remember to install them by using the add command in the Julia's package manager. Once it is done, just type:

julia> using GlobalSearchRegression, CSV, DataFrames
julia> data = CSV.read("path_to_your_data/your_data.csv")
julia> gsreg("y *", data)

Alternative GUM syntax

The general unrestricted model (GUM; the gsreg function first argument) can be written in many different ways, looking for a smooth transition for R and Stata users.

# Stata like
julia> gsreg("y x1 x2 x3", data)

# R like
julia> gsreg("y ~ x1 + x2 + x3", data)
julia> gsreg("y ~ x1 + x2 + x3", data=data)

# Strings separated with comma
julia> gsreg("y,x1,x2,x3", data)

# Array of strings
julia> gsreg(["y", "x1", "x2", "x3"], data)

# Using wildcards
julia> gsreg("y *", data)
julia> gsreg("y x*", data)
julia> gsreg("y x1 z*", data)
julia> gsreg("y ~ x*", data)
julia> gsreg("y ~ .", data)

Additional options

GlobalSearchRegression advanced properties include almost all Stata-GSREG options but also additional features. Overall, our Julia's version has the following options:

Full-syntax example

This is a full-syntax example, assuming Julia 1.0.1 (or newer version), GlobalSearchRegression and DataFrames are already installed in a quad-core personal computer.

# The first four lines are used to simulate data with random variables
julia> using DataFrames
julia> data = DataFrame(Array{Union{Missing,Float64}}(randn(100,16)))
julia> headers = [ :y ; [ Symbol("x$i") for i = 1:size(data,2) - 1 ] ]
julia> names!(data, headers)
# The following two lines enable multicore calculations
julia> using Distributed
julia> addprocs(4)
# Next line defines the working directory (where output results will be saved), for example:
julia> cd("c:\\")  # in Windows, or
julia> cd("/home/")  # in Linux
# Final two lines are used to perform all-subset-regression
julia> using GlobalSearchRegression
julia> gsreg("y x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15", data,
    intercept=true,
    outsample=10,
    criteria=[:r2adj, :bic, :aic, :aicc, :cp, :rmse, :rmseout, :sse],
    ttest=true,
    method="precise",
    vectoroperation=true,
    modelavg=true,
    residualtest=true,
    time=:x1,
    csv="output.csv",
    parallel=4,
    orderresults=false)

Credits

The GSReg module, which perform regression analysis, was written primarily by Demian Panigo, Valentín Mari and Adán Mauri Ungaro. The GlobalSearchRegression.jl module was inpired by GSReg for Stata, written by Pablo Gluzmann and Demian Panigo.