Data Sources & Features for Stock Market Prediction

Overview

A major contribution of the Kumbure et al. (2022) review is its systematic catalog of predictor variables used in stock market forecasting research. The review identified 2,173 unique input variables across 138 papers, far more than any individual study considers. Evidence from multiple studies consistently shows that feature selection affects prediction accuracy more than algorithm choice: choosing the right inputs matters more than choosing the right model. Understanding data sources and feature engineering is therefore essential for practical implementation.

Feature Categories at a Glance

Technical indicators appear in over 80% of reviewed studies, making them the dominant feature category. This reflects both data availability and the field's roots in technical analysis—due to their simplicity, technical indicators can be computed by any researcher with access to price data. Simple moving averages (SMA), relative strength index (RSI), and momentum measures are the most frequently used indicators because they have decades of established interpretation. Fundamental data and macroeconomic variables appear in approximately 30% of studies, typically for longer-horizon predictions. Alternative data, particularly sentiment from news and social media, represents a growing trend in post-2015 research.

The choice of input features depends on the prediction horizon and target. Short-term predictions (intraday to weekly) predominantly use technical indicators and high-frequency price data, medium-term predictions (monthly) often incorporate fundamental ratios and earnings data, and long-term forecasts may include macroeconomic variables such as GDP growth and interest rates. The reviewed literature suggests that practitioners should match features to the specific prediction task rather than applying generic feature sets. Unlike other machine learning domains where feature sets are relatively stable, financial prediction requires continuous adaptation as market regimes change.

Technical Indicators

Technical indicators transform raw price and volume data into signals intended to capture market momentum, trend, and volatility. According to the Kumbure et al. (2022) review, simple moving average (SMA) and exponential moving average (EMA) are the most commonly used, followed by RSI, MACD, and Bollinger Bands. These indicators derive from decades of trading practice and provide standardized transformations recognized across the industry. Unlike raw price data, technical indicators normalize information across different stocks and time periods, which enables models to learn more generalizable patterns. Compared to fundamental data that updates quarterly, technical indicators can be computed in real-time from streaming price feeds.

| Indicator | Type | Calculation Basis | Signal Interpretation |
| --- | --- | --- | --- |
| Simple Moving Average (SMA) | Trend | Average price over N periods | Price above SMA: bullish; below: bearish |
| Exponential Moving Average (EMA) | Trend | Weighted average favoring recent prices | More responsive to recent price changes |
| Relative Strength Index (RSI) | Momentum | Ratio of up moves to down moves | Above 70: overbought; below 30: oversold |
| MACD | Momentum/Trend | Difference between fast and slow EMA | Signal line crossovers indicate trend change |
| Bollinger Bands | Volatility | SMA with standard deviation bands | Price at bands suggests mean reversion |
| Stochastic Oscillator | Momentum | Position within recent price range | Identifies overbought/oversold conditions |
| On-Balance Volume (OBV) | Volume | Cumulative volume based on price direction | Volume confirms or diverges from trend |
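As a concrete reference, the three most common indicators above can be computed directly from a closing-price series. This is a minimal pure-Python sketch (the function names and the single-value RSI variant are illustrative; production code would typically use a library such as pandas or TA-Lib):

```python
def sma(prices, n):
    """Simple moving average: mean of the trailing n closes (None until enough data)."""
    return [
        sum(prices[i - n + 1 : i + 1]) / n if i >= n - 1 else None
        for i in range(len(prices))
    ]

def ema(prices, n):
    """Exponential moving average with the conventional smoothing factor 2 / (n + 1)."""
    alpha = 2 / (n + 1)
    out = [prices[0]]
    for price in prices[1:]:
        out.append(alpha * price + (1 - alpha) * out[-1])
    return out

def rsi(prices, n=14):
    """Relative strength index over the first n price changes:
    100 - 100 / (1 + RS), where RS is average up move / average down move."""
    changes = [b - a for a, b in zip(prices, prices[1:])]
    avg_gain = sum(max(c, 0.0) for c in changes[:n]) / n
    avg_loss = sum(max(-c, 0.0) for c in changes[:n]) / n
    if avg_loss == 0:
        return 100.0  # all moves were up: maximally overbought reading
    return 100 - 100 / (1 + avg_gain / avg_loss)
```

The EMA's recursive form is what makes it "more responsive to recent price changes" than the SMA: each new price enters with weight alpha immediately rather than waiting to dominate a trailing window.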

Research findings from multiple studies indicate that combining indicators typically outperforms single-indicator approaches: studies using 5-10 carefully selected indicators achieved 3-7% higher accuracy than those using raw prices alone, because different indicators capture different market dynamics. However, adding too many indicators introduces multicollinearity and increases overfitting risk, so dimensionality reduction or regularization techniques are essential when using large indicator sets. For discussion of algorithm selection for high-dimensional features, see the ML Techniques page. For performance metrics, see Performance Evaluation.
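To illustrate the multicollinearity point, a simple filter can drop one member of any near-duplicate indicator pair before modeling. The greedy rule and the 0.95 threshold below are illustrative choices, not a prescription from the review:

```python
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def drop_collinear(features, threshold=0.95):
    """Greedily drop the later-listed feature of any pair whose
    absolute correlation exceeds the threshold."""
    dropped = set()
    names = list(features)
    for a, b in combinations(names, 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson(features[a], features[b])) > threshold:
            dropped.add(b)
    return [name for name in names if name not in dropped]
```

Greedy pairwise pruning is the crudest approach; regularization (e.g., elastic net) handles correlated indicators more gracefully by shrinking them jointly.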

Fundamental Data

Fundamental analysis examines company financials to assess intrinsic value. The Kumbure et al. (2022) review found that fundamental data appears in approximately 30% of studies, typically for medium- to long-term predictions. The literature indicates that fundamental factors explain significant cross-sectional variation in stock returns. Common fundamental variables include valuation ratios, profitability metrics, and balance sheet items.

| Category | Variables | Prediction Relevance |
| --- | --- | --- |
| Valuation Ratios | P/E, P/B, P/S, EV/EBITDA, dividend yield | Long-term returns; mean reversion patterns |
| Profitability | ROE, ROA, profit margin, EBITDA margin | Quality factors; earnings persistence |
| Growth Metrics | Revenue growth, EPS growth, book value growth | Momentum in fundamentals |
| Leverage | Debt/equity, interest coverage, current ratio | Financial distress prediction |
| Earnings Quality | Accruals, cash flow/earnings ratio | Earnings sustainability |

Fundamental data presents unique challenges for ML models. Financial statements are released quarterly, creating sparse time series compared to daily prices, which is why models using fundamental data typically focus on longer prediction horizons. Earnings surprises (differences between actual and expected results) often drive short-term price movements, but expectations data requires analyst forecast databases. Research from several sources indicates that studies combining fundamental and technical features often outperform those using either category alone, suggesting complementary information content: fundamental data captures company-specific value signals while technical data captures market timing.
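One practical consequence of the quarterly release schedule is the need for an as-of join that attaches the most recently published fundamentals to each daily observation, so the model never sees figures before their release date (look-ahead bias). A minimal sketch, with function name and data layout assumed for illustration (pandas users would reach for `merge_asof`):

```python
from bisect import bisect_right
from datetime import date

def asof_join(daily_dates, report_dates, report_values):
    """For each daily observation, return the fundamental value from the
    most recent report released on or before that date (None before the
    first report). report_dates must be sorted ascending."""
    out = []
    for d in daily_dates:
        i = bisect_right(report_dates, d)  # count of reports published by date d
        out.append(report_values[i - 1] if i > 0 else None)
    return out
```

The same pattern applies to any slow-moving feature joined onto a fast price series; the key discipline is indexing by publication date, not by the fiscal period the report covers.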

Macroeconomic Variables

Macroeconomic indicators capture the broader economic environment affecting equity markets. The review documents use of interest rates, GDP growth, inflation, unemployment, and exchange rates as predictors. Because these variables affect all stocks to some degree, they are particularly relevant for market-level predictions (e.g., S&P 500 index) rather than individual stocks. Due to their economy-wide impact, macro factors explain systematic risk that cannot be diversified away.

| Variable | Data Frequency | Market Relationship |
| --- | --- | --- |
| Interest Rates | Daily (Fed funds, Treasury yields) | Higher rates typically pressure equity valuations |
| GDP Growth | Quarterly | Economic expansion supports corporate earnings |
| Inflation (CPI) | Monthly | Moderate inflation positive; high inflation negative |
| Unemployment | Monthly | Labor market strength indicates economic health |
| Exchange Rates | Daily | Affects multinational company earnings |

The low frequency of macroeconomic data (monthly or quarterly) presents challenges for daily prediction models. In practice, some studies address this with "nowcasting" techniques that estimate current macro conditions from higher-frequency proxies, for instance using weekly jobless claims to infer monthly employment conditions. Others incorporate macroeconomic variables only for longer-horizon predictions where the temporal resolution matches. The review finds that models incorporating macro factors show improved performance during economic regime changes but may underperform during stable periods; adaptive models that reweight features as market conditions change therefore tend to achieve more consistent performance.
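As a toy illustration of the proxy idea, a weekly series can be collapsed into monthly estimates by averaging the weeks within each month; real nowcasting models are considerably more sophisticated (bridge equations, mixed-frequency regressions), so treat this purely as a sketch of the frequency-alignment step:

```python
def monthly_proxy(weekly_values, weeks_per_month=4):
    """Crude nowcast: average each month's weekly readings (e.g., jobless
    claims) to approximate the slower-moving official monthly series."""
    monthly = []
    for i in range(0, len(weekly_values), weeks_per_month):
        chunk = weekly_values[i : i + weeks_per_month]
        monthly.append(sum(chunk) / len(chunk))  # partial final month uses its own length
    return monthly
```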

Alternative Data Sources

Alternative data refers to non-traditional information sources that may provide predictive signals. The Kumbure et al. (2022) review documents growing use of textual data from news and social media, representing a major trend in post-2015 research. NLP techniques extract sentiment scores that capture market psychology not reflected in price data alone. This is important because investor sentiment can move prices before fundamental changes materialize—for instance, negative news about a company often drives price declines before any financial impact appears in earnings reports.

The Rise of Sentiment Analysis

Studies using news sentiment achieved average accuracy improvements of 3-5% over price-only models, according to meta-analyses cited in the review. Twitter and StockTwits provide higher-frequency sentiment signals, while news articles offer more substantive content for analysis. The challenge lies in processing unstructured text at scale and distinguishing signal from noise. Advanced NLP models, particularly transformer-based architectures, have improved sentiment extraction accuracy since 2018.

| Data Source | Signal Type | Processing Method | Challenges |
| --- | --- | --- | --- |
| News Articles | Sentiment, event detection | NLP, named entity recognition | Volume, relevance filtering |
| Social Media (Twitter) | Real-time sentiment | Sentiment classification, topic modeling | Noise, bot activity, sarcasm |
| Google Trends | Search interest | Query volume normalization | Query selection, lag effects |
| Options Markets | Implied volatility, put/call ratios | Direct numerical features | Complex market structure |
| Satellite Imagery | Economic activity proxies | Computer vision, change detection | Cost, specialized expertise |
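A lexicon-based scorer is the simplest version of the sentiment pipeline described above. The word lists below are tiny hypothetical stand-ins; practical systems use curated financial lexicons (e.g., Loughran-McDonald) or the transformer models noted earlier:

```python
# Hypothetical mini-lexicon for illustration only.
POSITIVE = {"beat", "growth", "upgrade", "strong", "record"}
NEGATIVE = {"miss", "decline", "downgrade", "weak", "lawsuit"}

def sentiment_score(text):
    """Net sentiment in [-1, 1]: (positive hits - negative hits) / total hits,
    or 0.0 when no lexicon words appear."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total
```

Even this toy version exposes the challenges from the table: sarcasm, negation ("did not beat"), and irrelevant matches all fool bag-of-words counting, which is why transformer-based scoring has become the norm.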

The integration of alternative data creates both opportunities and challenges. The potential for alpha generation attracts significant industry investment, but data costs and processing requirements can be substantial, putting smaller research teams at a disadvantage. Privacy regulations increasingly constrain certain data sources, particularly consumer behavior tracking. Evidence from various studies indicates that alternative data provides the greatest edge at shorter prediction horizons, where traditional data may not yet reflect new information; hedge funds with alternative data capabilities consequently focus on intraday and daily horizons rather than weekly or monthly predictions. Retail investors without access to premium data sources can still compete using publicly available sentiment from social media and news.

Feature Selection Methods

With 2,173 documented variables, feature selection is critical for practical model development. According to the review, there are several approaches: domain-knowledge-driven selection (choosing indicators with theoretical justification), filter methods (ranking features by correlation or mutual information), wrapper methods (evaluating feature subsets via model performance), and embedded methods (algorithms with built-in selection like LASSO or random forest importance). Each approach has trade-offs: domain knowledge is interpretable but may miss novel patterns, while automated methods are comprehensive but risk overfitting to historical data.

| Method | Approach | Advantages | Limitations |
| --- | --- | --- | --- |
| Domain Knowledge | Expert selection based on theory | Interpretable, avoids spurious correlations | May miss novel patterns |
| Correlation/MI Filtering | Rank features by target relationship | Fast, scales to many features | Ignores feature interactions |
| Recursive Feature Elimination | Iteratively remove least important | Accounts for model behavior | Computationally expensive |
| LASSO/Elastic Net | L1 regularization shrinks to zero | Automatic, handles multicollinearity | Linear assumptions |
| Random Forest Importance | Measure prediction contribution | Nonlinear, interpretable rankings | Biased toward high-cardinality features |
| Genetic Algorithms | Evolutionary search for feature subsets | Global optimization | Expensive, risk of overfitting |
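As a minimal example of a filter method from the table, candidate features can be ranked by absolute Pearson correlation with the prediction target and the top k retained. Names and the toy scoring rule are illustrative, and this approach ignores feature interactions, as the table notes:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def filter_select(features, target, k=2):
    """Filter method: rank features by |correlation| with the target,
    keep the top k names."""
    scores = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Mutual information replaces the correlation score when nonlinear feature-target relationships matter; wrapper and embedded methods then refine the surviving candidates.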

The review notes that studies employing systematic feature selection consistently outperform those using all available features: meta-analysis across 50+ studies found that models with 10-30 well-chosen predictors achieved 5-8% higher accuracy than those using every candidate variable. Financial data contains more noise than signal, and irrelevant features actively degrade performance, so reducing dimensionality from hundreds of raw features to a few dozen well-chosen predictors (for example, 20 carefully selected technical indicators rather than 200 raw inputs) typically improves both accuracy and interpretability. The "curse of dimensionality" is particularly acute in financial prediction: unlike image classification, where millions of examples are available, a financial model may have only a few thousand training samples, making feature parsimony critical. Embedded methods such as LASSO often match the selection quality of exhaustive wrapper methods at roughly a tenth of the computational cost, an important trade-off for practitioners with limited resources, although wrapper methods may find feature combinations that embedded methods miss. For discussion of how different algorithms handle high-dimensional features, see the ML Techniques page.

Recent Developments (2024-2025)

Data innovation continues to accelerate in financial ML. Large language models now enable more sophisticated text analysis, including reasoning about financial news rather than simple sentiment classification, a fundamental shift from bag-of-words sentiment to semantic understanding. Graph-structured data capturing supply chain and sector relationships provides new feature types, allowing models to account for how shocks propagate through interconnected companies. Synthetic data generation through GANs addresses limited sample sizes in financial applications, effectively augmenting training data for rare market events.

These recent advances (2024-2025) demonstrate the rapid evolution of feature engineering for financial ML. Evidence from multiple studies indicates that multimodal approaches—combining price, text, and alternative data—consistently outperform single-source models. Due to advances in NLP, text-based features now achieve comparable predictive power to traditional technical indicators. Therefore, modern systems typically incorporate both structured numerical data and unstructured text, because each captures different market dynamics.

Leading Research Teams

Data and feature engineering research spans quantitative finance and computer science:

| Institution | Key Researchers | Focus |
| --- | --- | --- |
| LUT University | Mahinda Kumbure, Pasi Luukka | Feature selection, fuzzy systems for finance |
| NYU Stern | Lasse Pedersen, Robert Engle | Factor models, alternative data |
| Chicago Booth | Eugene Fama, John Cochrane | Asset pricing, factor discovery |
| Imperial College | Marcin Kacperczyk | Machine learning in asset pricing |
| Yale School of Management | Bryan Kelly | Financial ML, deep learning for asset pricing |

Key Journals

Data and feature engineering research appears across finance and computer science venues. For algorithm selection to handle different feature types, see the ML Techniques page. For backtesting and performance evaluation, see the Performance Evaluation page.

External Resources

Authoritative Data Sources