GlobalSearchRegression.jl: Building bridges between Machine Learning and Econometrics in Fat-Data scenarios

The aim of this paper is twofold. The first is to describe a novel research project designed to build bridges between the machine learning and econometric worlds (ModelSelection.jl). The second is to introduce the main characteristics and comparative performance of the first Julia-native all-subset regression algorithm, included in GlobalSearchRegression.jl (v1.0.5). Like other available alternatives, this algorithm allows researchers to obtain the best model specification among all possible covariate combinations - in terms of user-defined information criteria - but up to 3165 and 197 times faster than the Stata and R alternatives, respectively.

Notwithstanding, the advantage of having thousands or millions of features to deal with complex phenomena has stimulated an unprecedented number of methodological - and technological - improvements to manage the 'curse of dimensionality' [11].
In Economics, this process has followed a dual approach, with machine learning (ML) and econometric (EC) algorithms emerging for different purposes: the former for prediction/forecasting (focusing on y) and the latter for estimation/causal inference (interested in β). Alternatively, the same distinction can be expressed in Diebold's terms as non-causal vs. causal prediction (see [7]), where ML algorithms are designed to reduce prediction sampling risks - i.e. learning through cross-validation techniques - and EC methods to identify unbiased multivariate relationships - i.e. avoiding consistency issues through residual and coefficient tests for model selection.
In turn, ML feature selection algorithms can be classified into three different families - Filters, Wrappers and Embedded - depending on whether they use some classifier/response-variable information or not, and on how variable selection is interleaved with the learning process; see [14]. Similarly, most EC dimensionality-reduction approaches can be classified into three different groups - Exhaustive, General-to-Specific and Specific-to-General - depending on the search pattern; see [22]. Despite significant improvements in recent econometric developments - like the PCGIVE/Autometrics or Retina algorithms, which combine some ML and EC characteristics - available alternatives fail to fully exploit cross-validation and model-averaging capabilities; see [19], [31], [15], [18], [29], [25], and [13].
Conversely, newer ML algorithms - like Convolutional Neural Networks or Bootstrap-based LASSO, which improve complex nonlinear adjustment or model selection under regularization schemes - have achieved unprecedented forecast accuracy, but disregard model interpretability and/or parameter-estimation issues, i.e. they omit residual and coefficient tests; see [12]. Following Varian's [33] advice about ML and EC complementarities - i.e. merging algorithms from different families to reduce both sampling and model uncertainty - we are developing a novel multi-layer, multi-algorithm methodology combining two reinforcing paradigms: the LSE "testimation" approach - to obtain information about residual properties, see [2] - and the Bayesian-like "double model averaging" - across different covariates and subsamples, see [26]. This methodology includes five complementary layers, handling cross-section, time-series and panel data: 1) Preprocessing: outlier detection, missing-value identification, seasonal adjustment and normalization/standardization functions; 2) Feature extraction: creation of logs, squares, inverses and interactions from selected variables; 3) Feature pre-selection: using filter and embedded ML algorithms such as CFS, variance-threshold and LASSO functions; 4) Final feature selection: a modified all-subset regression approach, including residual tests and model-averaging capabilities; 5) Post-estimation fine-tuning: coefficient re-evaluation through cross-validation techniques and model averaging across different k-fold results.
The objective of this paper is to introduce the main characteristics and comparative performance of a key layer of our methodology: the modified all-subset regression algorithm included in GlobalSearchRegression.jl (v1.0.5). Like other available alternatives (such as MuMIn-pdredge in R, or the GSREG Stata command), this Julia algorithm allows researchers to obtain the best model specification among all possible covariate/feature combinations - in terms of user-defined information criteria - but up to 3165 times faster than Stata and 197 times faster than R.

Package's main features
Written in Julia, GlobalSearchRegression is a parallel (and improved) version of the Stata-GSREG all-subset regression command (the original code is available here). The package structure is quite simple, as shown in figure 1. Through the gsreg function of the interface.jl internal package, users set the database to be used, the general unrestricted model - the GUM, which defines the search space - and additional options for model selection. With this information, and with the complementary supporting functions and definitions provided by strings.jl (error messages), utils.jl (equation formatting, combinatorial analysis, database manipulation, sorting of results, etc.) and gsreg_results.jl (the structure in which estimation results are saved), the core.jl package performs the all-subset regression algorithm described in the pseudocode below, producing the following outputs: 1) a matrix - optionally exported to a csv file through utils.jl - including regression coefficients, selection criteria, number of observations and (optionally) t-tests, residual tests, averaging weights and out-of-sample metrics for every alternative model; 2) a text file - also displayed on screen - containing the best model specification and (optionally) model-averaging results in terms of the user-selected information criteria; see multiple examples in runtest.jl. The closing steps of this pseudocode read:
4. Export results using the gsreg_results.jl structure
5. Export summary results to a txt file
6. Optionally send the results structure to utils.jl to export a csv file
end module core.jl
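As an illustration of this workflow, a minimal call to the gsreg entry point might look like the sketch below. The option names follow the interface described above but should be read as assumptions, not a verbatim API reference.

```julia
# Illustrative sketch of the gsreg entry point; option names are assumptions
# based on the interface described in the text, not an exact API reference.
using GlobalSearchRegression, DataFrames

n = 100
data = DataFrame(y = rand(n), x1 = rand(n), x2 = rand(n), x3 = rand(n))

# GUM "y x1 x2 x3": search over all 2^3 - 1 = 7 covariate subsets and
# rank them by the user-selected information criteria.
result = gsreg("y x1 x2 x3", data,
               intercept = true,          # include a constant in every model
               criteria = [:aic, :bic],   # user-defined selection criteria
               ttest = true,              # report t-tests for each model
               csv = "allsubsets.csv")    # export the full results matrix
```

The returned structure corresponds to the gsreg_results.jl object described above, holding one row per estimated model.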

Comparative performance against R and STATA
For experimental databases of random variables with few covariates - up to 15 explanatory variables - our Julia algorithm only provides significant time improvements on standard personal computers (e.g. 4 cores), being up to twice as fast as R and 4 times as fast as Stata. On high-end desktop computers or HPC nodes, there is almost no difference among the best results obtained for each alternative.
However, for databases with 20 or more covariates, our Julia all-subset regression code is always faster, irrespective of the number of observations or threads: up to 3165 times faster than Stata and 197 times faster than R. (The parallel version of Stata-GSREG is still under development; a preliminary version is available upon request from the authors. All these tests are available here.) Execution-time differences increase exponentially with the number of covariates and decrease slightly with the number of available observations. Moreover, the Stata and R alternatives become unfeasible for databases with 25 or more covariates. For a confirmatory analysis, figure 2 presents execution-time kernel densities for the 20-covariate, 100-observation, 32-thread case, obtained from 300 independent (non-cached) runs. The kernel-density results show that execution-time differences are not an artifact of noisy runs: our Julia algorithm is consistently faster than the R and Stata alternatives.
A detailed speed-up analysis is also available for the 25-covariate case. Figure 3 shows that our Julia all-subset regression algorithm scales almost linearly for large databases, as long as the number of threads does not exceed the number of physical cores. With small databases, Amdahl's law [6] inputs change: parallel tasks become lighter and speed-up efficiency degrades steadily with additional threads, because the marginal overhead of creating a larger environment is not compensated by the parallelism gains obtained from those threads. Notwithstanding, when only physical cores are used, speed-up efficiency is always above 45%, with an average of 84% for 2, 4, 8 and 16 threads.
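The efficiency pattern described above follows directly from Amdahl's law; a small helper makes the degradation explicit (a sketch with an assumed serial fraction, not a measurement from the paper's benchmarks):

```julia
# Amdahl's law: theoretical speed-up of a program whose serial fraction is s
# when it is executed on n threads.
amdahl_speedup(s, n) = 1 / (s + (1 - s) / n)

# Parallel efficiency: achieved speed-up relative to the ideal n-fold one.
efficiency(s, n) = amdahl_speedup(s, n) / n

# With an (assumed) 5% serial portion, 16 threads give roughly a 9.1x
# speed-up, i.e. about 57% efficiency: each extra thread pays less.
amdahl_speedup(0.05, 16)   # ≈ 9.14
efficiency(0.05, 16)       # ≈ 0.57
```

When tasks become lighter (small databases), s effectively grows, which is exactly the efficiency loss reported above.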
Why is GlobalSearchRegression.jl faster than existing alternatives?

The Julia platform
Beyond the GlobalSearchRegression.jl-specific features examined below, our all-subset regression algorithm benefits from the well-known efficiency of the Julia language for High Performance Computing tasks. Julia's JIT compilation allows packages to run faster than those executed in interpreted or byte-compiled languages (like R or Stata-Mata). Indeed, basic functions running in Julia can be up to 251 and 368 times faster than the same functions in R and Stata, respectively, with average speed-ups of 75 and 90; see Table 2.

Parallel strategy and memory setup
It is well known that multicore architectures can be used to speed up execution times through two main paradigms: data and task parallelism [35]. The preferred strategy critically depends on: (1) the database structure; (2) the algorithm's serial portion; and (3) inter-thread communication costs.
The first issue is tall vs. fat data. While tall databases are more suitable for data parallelism [8], fat structures improve the relative performance of task parallelism (because, in feature selection problems, the number of tasks increases exponentially with the number of covariates/columns; see [20]).
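The exponential growth of the task count can be made concrete: in an all-subset regression, every non-empty subset of the covariates defines one candidate model, so p covariates generate 2^p - 1 tasks.

```julia
# Number of candidate models in an all-subset regression with p covariates:
# each covariate is either included or excluded, minus the empty model.
n_models(p) = 2^p - 1

n_models(15)   # 32767: still tractable on a standard PC
n_models(25)   # 33554431: the range where R and Stata become unfeasible
```

Adding one covariate roughly doubles the workload, which is why fat data favors splitting work by task rather than by rows.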
Additionally, the choice between these paradigms must take into account Amdahl's law for the specific algorithm to be used. Task parallelism is usually better for econometric and machine learning algorithms that require specific serial optimization paths (e.g. arima-arfima models), while data parallelism performs better for linear-algebra solutions (e.g. OLS-family estimators; see [24]).
Finally, there is the problem of intercommunication costs. Data parallelism generally involves intense inter-thread communication. While task-parallelism communication costs depend on load-balancing choices (i.e. dynamic vs. static), they are usually lower than those of data parallelism [30].
For feature selection problems, the pros and cons of these strategies often determine that available cores should be used for task parallelism: databases are fat, data-mining algorithms can include large serial portions, and intercommunication costs can be huge on large multicore architectures. These reasons explain why all available all-subset regression packages use task parallelism to speed up execution times.
As for the memory setup, there are also alternative methodologies. First, it is necessary to choose between static and dynamic memory allocation [32]. To improve speed-up, static "once-and-for-all" allocations are often preferred (even at the cost of higher average memory utilization). Second, a shared-memory strategy must be chosen. Depending on both object size and CPU architecture - cache size and its distribution among cores - it may be optimal to use large shared arrays or, alternatively, smaller core-specific objects. Splitting output matrices to work with smaller non-shared arrays can be useful for cache-optimization purposes, but it can also entail additional communication costs and higher memory requirements [5]. In practice, feature selection algorithms usually prefer shared arrays for high-performance computing.
GlobalSearchRegression.jl's execution-time advantages have been obtained by combining: a) task parallelism; b) static load balancing; c) coarse-grained granularity; d) static memory allocation; and e) an efficient shared-array implementation. While some of these characteristics are shared with the R and Stata alternatives (MuMIn-pdredge and GSREG, respectively), our static load-balancing algorithm outperforms the round-robin R scheduling (implemented by the clusterApply function included in the parallel R package; see [16]), and the shared-array strategy significantly improves I/O performance relative to Stata. By construction, pure static load balancing in GlobalSearchRegression.jl avoids round-robin communication costs and execution gaps. This advantage overcompensates the minimal load asymmetries associated with any static scheduling. In turn, using efficient shared arrays to store all-subset regression results allows us to outperform the Stata-GSREG methodology, which heavily relies on slower I/O disk operations (because multiple instances must be launched to enable task parallelism and, therefore, shared arrays become unfeasible).
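A minimal sketch of how these ingredients fit together - task parallelism over covariate subsets, static scheduling, and a single preallocated result array written without locking - might look as follows. Function and variable names are assumptions for illustration, not GlobalSearchRegression.jl internals.

```julia
# Illustrative sketch: task-parallel all-subset OLS with a statically
# chunked loop and a preallocated result vector shared by all threads.
using LinearAlgebra, Base.Threads

function all_subsets_ols(y::Vector{Float64}, X::Matrix{Float64})
    p = size(X, 2)
    nmodels = 2^p - 1                      # one task per non-empty subset
    rss = Vector{Float64}(undef, nmodels)  # allocated once, one slot per model
    @threads for m in 1:nmodels            # static chunking over the task range
        cols = [j for j in 1:p if (m >> (j - 1)) & 1 == 1]  # decode subset m
        Xm = X[:, cols]
        β = qr(Xm) \ y                     # QR-based OLS
        rss[m] = sum(abs2, y .- Xm * β)    # each task writes only its own slot
    end
    return rss
end
```

Because every task writes to a distinct slot of rss, no synchronization is needed; this is the property that makes the shared-array strategy cheap relative to launching separate Stata instances with disk I/O.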

OLS estimation
Efficient Ordinary Least Squares (OLS) algorithms rely on linear-algebra operations (matrix decomposition, matrix inversion, etc.). The traditional (X'X)^(-1)X'y operation can be time-expensive and can yield unstable solutions under certain conditions [27]. A preferred method is the QR decomposition developed by Francis [21] and Kublanovskaya [28]. The QR factorization allows us to decompose any full-rank N × p matrix X as X = QR, where Q is an N × p matrix with Q'Q = I, and R is a p × p upper triangular matrix.
The QR decomposition is fast and provides numerically stable solutions even for ill-conditioned (nearly rank-deficient) matrices. Alternative factorizations are either slower (SVD decomposition) or potentially unstable (Cholesky decomposition) [3].
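The contrast between the two routes can be sketched in a few lines of Julia (a toy example; the design matrix is purely illustrative):

```julia
# Toy comparison: QR-based OLS vs. the textbook normal equations.
# QR avoids forming X'X, whose condition number is the square of X's.
using LinearAlgebra

X = [ones(6) collect(1.0:6.0)]        # N = 6, p = 2 design matrix
y = 2.0 .+ 3.0 .* collect(1.0:6.0)    # exact linear relation y = 2 + 3x

β_qr     = qr(X) \ y                  # stable: solves R*β = Q'y
β_normal = inv(X' * X) * (X' * y)     # unstable when X is ill-conditioned

# Both recover β = (2, 3) here; they diverge as cond(X) grows.
```

In well-conditioned cases like this one the two estimates agree; the QR route keeps that agreement as the columns of X become nearly collinear.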
GlobalSearchRegression.jl's OLS estimation through QR decompositions outperforms existing Julia alternatives (like GLM.jl), allowing for large speed-ups compared with the R lm and Stata regress commands. Table 3 shows execution-time differences for a 200-covariate, 1000-observation multivariate linear regression.
It should be noticed that these execution times are compilation-free, because all-subset regression algorithms must perform thousands to millions of regressions for which compilation time is absent (it only affects the first regression).

Conclusion
"There are a number of areas where there would be opportunities for fruitful collaboration between econometrics and machine learning [... and] the most important area for collaboration involves causal inference" [33]. As Hal Varian emphasizes, there is a need for mutual collaboration between machine learning developers/practitioners and econometricians. In this paper we describe a research project aimed at building bridges between the machine learning and econometric worlds (ModelSelection.jl) and introduce the main characteristics of its first outcome: a Julia-native all-subset regression algorithm (GlobalSearchRegression.jl) which runs up to 3165 and 197 times faster than existing Stata and R alternatives, respectively.
Throughout the paper, it has been shown that these execution-time gains are explained by the multiple efficient strategies combined in GlobalSearchRegression.jl (i.e. task parallelism, static load balancing, coarse-grained granularity, static memory allocation, an efficient shared-array implementation, and OLS estimation using QR decompositions). However, the main 'explanatory variable' is the impressive speed-up in atomic operations obtained by using the Julia language.
Notwithstanding, the increasing availability of Big - and, more challenging, Fat - data forces us to go beyond pure all-subset regression approaches and combine them with machine learning feature selection algorithms. Work in progress includes a new hybrid package merging LASSO capabilities, all-subset regression robustness and k-fold cross-validation strengths.