On Bayesian Feature Selection Procedure Applied to Regression Problem with HDD
Keywords:
High-dimensional data, Feature selection procedure, Regression model
Abstract
High-dimensional data (HDD) are data in which the number of features, p, is very large while only a few samples, n, are available. A regression problem concerns how the response, y, depends simultaneously on a set of features x. Often only a few of the x’s explain y, while the rest have little or no influence on it. Moreover, most existing methodology for entering the x’s into a regression model assumes p <= n.
This study investigates a recently introduced methodology, Bayesian feature ranking (BFR), in terms of how well the resulting regression model fits the data when the x’s are high-dimensional and y is continuous. The proposed methodology implements a modified forward selection (MFS) procedure on features ranked via the BFR, with different noise levels v infused on y. MFS via BFR admits the top-ranked features into the model first and then adds further features sequentially, in increments of Delta = 5. As a baseline, the MFS procedure is also run on unranked features, and the derived models are evaluated by R2, a statistic for model fit. Results showed that, in both simulated and real datasets, MFS via BFR consistently gave higher R2 than the baseline MFS, implying that the model built from BFR-ranked features of x describes y much better than the model built from unranked features of x.
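The sequential step described above can be illustrated with a minimal sketch. It is not the study's implementation. The abstract does not specify how BFR computes its ranking, so the function `stand_in_ranking` below is a hypothetical placeholder that orders features by absolute marginal correlation with y. The names `forward_path` and `r_squared`, the simulated data, and the cap on model size are likewise assumptions made for illustration. Only the increment Delta = 5, the noise level v, the unranked baseline, and the use of R2 come from the text.

```python
import numpy as np


def r_squared(X, y):
    """In-sample R2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def forward_path(X, y, order, delta=5, max_features=None):
    """Add features in the given order, delta at a time, recording R2."""
    n = X.shape[0]
    if max_features is None:
        # Assumption: keep the model smaller than n, since an OLS fit
        # with n or more columns reproduces y exactly.
        max_features = n - 2
    path = []
    for k in range(delta, max_features + 1, delta):
        path.append((k, r_squared(X[:, order[:k]], y)))
    return path


def stand_in_ranking(X, y):
    """Hypothetical placeholder for BFR: rank by |marginal correlation|."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-score)


# Simulated HDD setting (assumed values): p much larger than n,
# five informative features, noise level v infused on y.
rng = np.random.default_rng(0)
n, p, v = 40, 500, 0.5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[50, 150, 250, 350, 450]] = 3.0
y = X @ beta + v * rng.standard_normal(n)

ranked = forward_path(X, y, stand_in_ranking(X, y))  # MFS on ranked features
unranked = forward_path(X, y, np.arange(p))          # baseline: original order

for (k, r2_ranked), (_, r2_unranked) in zip(ranked, unranked):
    print(k, round(r2_ranked, 3), round(r2_unranked, 3))
```

Each printed row gives the model size k, the R2 of the ranked path, and the R2 of the unranked baseline. Because every model in a path nests the previous one, R2 cannot decrease as k grows, so the comparison of interest is between the two paths at the same k.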