Data Sources & Features for Stock Market Prediction

Overview

A major contribution of the Kumbure et al. (2022) review is its systematic catalog of predictor variables used in stock market forecasting research. The review identified 2,173 unique input variables across 138 papers, far more than any individual study considers. Evidence from multiple studies consistently shows that feature selection affects prediction accuracy more than algorithm choice: choosing the right inputs matters more than choosing the right model. Understanding data sources and feature engineering is therefore essential for practical implementation.

Feature Categories at a Glance

Technical indicators appear in over 80% of reviewed studies, making them the dominant feature category. This reflects both data availability and the field's roots in technical analysis—due to their simplicity, technical indicators can be computed by any researcher with access to price data. Simple moving averages (SMA), relative strength index (RSI), and momentum measures are the most frequently used indicators because they have decades of established interpretation. Fundamental data and macroeconomic variables appear in approximately 30% of studies, typically for longer-horizon predictions. Alternative data, particularly sentiment from news and social media, represents a growing trend in post-2015 research.

The choice of input features depends on the prediction horizon and target. Short-term predictions (intraday to weekly) predominantly use technical indicators and high-frequency price data, medium-term predictions (monthly) often incorporate fundamental ratios and earnings data, and long-term forecasts may include macroeconomic variables such as GDP growth and interest rates. The reviewed literature suggests that practitioners should match features to the specific prediction task rather than applying generic feature sets. Unlike other machine learning domains where feature sets are relatively stable, financial prediction requires continuous adaptation as market regimes change.

Technical Indicators

Technical indicators transform raw price and volume data into signals intended to capture market momentum, trend, and volatility. According to the Kumbure et al. (2022) review, simple moving average (SMA) and exponential moving average (EMA) are the most commonly used, followed by RSI, MACD, and Bollinger Bands. These indicators derive from decades of trading practice and provide standardized transformations recognized across the industry. Unlike raw price data, technical indicators normalize information across different stocks and time periods, which enables models to learn more generalizable patterns. Compared to fundamental data that updates quarterly, technical indicators can be computed in real-time from streaming price feeds.

| Indicator | Type | Calculation Basis | Signal Interpretation |
| --- | --- | --- | --- |
| Simple Moving Average (SMA) | Trend | Average price over N periods | Price above SMA: bullish; below: bearish |
| Exponential Moving Average (EMA) | Trend | Weighted average favoring recent prices | More responsive to recent price changes |
| Relative Strength Index (RSI) | Momentum | Ratio of up moves to down moves | Above 70: overbought; below 30: oversold |
| MACD | Momentum/Trend | Difference between fast and slow EMA | Signal line crossovers indicate trend change |
| Bollinger Bands | Volatility | SMA with standard deviation bands | Price at bands suggests mean reversion |
| Stochastic Oscillator | Momentum | Position within recent price range | Identifies overbought/oversold conditions |
| On-Balance Volume (OBV) | Volume | Cumulative volume based on price direction | Volume confirms or diverges from trend |
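As a concrete reference, the three most common indicators above can be computed directly from a closing-price series. This is a minimal pure-Python sketch (the function names and the single-value RSI variant are illustrative; production code would typically use a library such as pandas or TA-Lib):

```python
def sma(prices, n):
    """Simple moving average: mean of the trailing n closes (None until enough data)."""
    return [
        sum(prices[i - n + 1 : i + 1]) / n if i >= n - 1 else None
        for i in range(len(prices))
    ]

def ema(prices, n):
    """Exponential moving average with the conventional smoothing factor 2 / (n + 1)."""
    alpha = 2 / (n + 1)
    out = [prices[0]]
    for price in prices[1:]:
        out.append(alpha * price + (1 - alpha) * out[-1])
    return out

def rsi(prices, n=14):
    """Relative strength index over the first n price changes:
    100 - 100 / (1 + RS), where RS is average up move / average down move."""
    changes = [b - a for a, b in zip(prices, prices[1:])]
    avg_gain = sum(max(c, 0.0) for c in changes[:n]) / n
    avg_loss = sum(max(-c, 0.0) for c in changes[:n]) / n
    if avg_loss == 0:
        return 100.0  # all moves were up: maximally overbought reading
    return 100 - 100 / (1 + avg_gain / avg_loss)
```

The EMA's recursive form is what makes it "more responsive to recent price changes" than the SMA: each new price enters with weight alpha immediately rather than waiting to dominate a trailing window.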

Research findings from multiple studies indicate that combining indicators typically outperforms single-indicator approaches: studies using 5-10 carefully selected indicators achieved 3-7% higher accuracy than those using raw prices alone, because different indicators capture different market dynamics. However, adding too many indicators introduces multicollinearity and increases overfitting risk, so dimensionality reduction or regularization techniques are essential when using large indicator sets. For discussion of algorithm selection for high-dimensional features, see the ML Techniques page. For performance metrics, see Performance Evaluation.
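To illustrate the multicollinearity point, a simple filter can drop one member of any near-duplicate indicator pair before modeling. The greedy rule and the 0.95 threshold below are illustrative choices, not a prescription from the review:

```python
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def drop_collinear(features, threshold=0.95):
    """Greedily drop the later-listed feature of any pair whose
    absolute correlation exceeds the threshold."""
    dropped = set()
    names = list(features)
    for a, b in combinations(names, 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson(features[a], features[b])) > threshold:
            dropped.add(b)
    return [name for name in names if name not in dropped]
```

Greedy pairwise pruning is the crudest approach; regularization (e.g., elastic net) handles correlated indicators more gracefully by shrinking them jointly.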

Fundamental Data

Fundamental analysis examines company financials to assess intrinsic value. The Kumbure et al. (2022) review found that fundamental data appears in approximately 30% of studies, typically for medium- to long-term predictions. The literature indicates that fundamental factors explain significant cross-sectional variation in stock returns. Common fundamental variables include valuation ratios, profitability metrics, and balance sheet items.

| Category | Variables | Prediction Relevance |
| --- | --- | --- |
| Valuation Ratios | P/E, P/B, P/S, EV/EBITDA, dividend yield | Long-term returns; mean reversion patterns |
| Profitability | ROE, ROA, profit margin, EBITDA margin | Quality factors; earnings persistence |
| Growth Metrics | Revenue growth, EPS growth, book value growth | Momentum in fundamentals |
| Leverage | Debt/equity, interest coverage, current ratio | Financial distress prediction |
| Earnings Quality | Accruals, cash flow/earnings ratio | Earnings sustainability |

Fundamental data presents unique challenges for ML models. Financial statements are released quarterly, creating sparse time series compared to daily prices, which is why models using fundamental data typically focus on longer prediction horizons. Earnings surprises (differences between actual and expected results) often drive short-term price movements, but expectations data requires analyst forecast databases. Research from several sources indicates that studies combining fundamental and technical features often outperform those using either category alone, suggesting complementary information content: fundamental data captures company-specific value signals while technical data captures market timing.
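One practical consequence of the quarterly release schedule is the need for an as-of join that attaches the most recently published fundamentals to each daily observation, so the model never sees figures before their release date (look-ahead bias). A minimal sketch, with function name and data layout assumed for illustration (pandas users would reach for `merge_asof`):

```python
from bisect import bisect_right
from datetime import date

def asof_join(daily_dates, report_dates, report_values):
    """For each daily observation, return the fundamental value from the
    most recent report released on or before that date (None before the
    first report). report_dates must be sorted ascending."""
    out = []
    for d in daily_dates:
        i = bisect_right(report_dates, d)  # count of reports published by date d
        out.append(report_values[i - 1] if i > 0 else None)
    return out
```

The same pattern applies to any slow-moving feature joined onto a fast price series; the key discipline is indexing by publication date, not by the fiscal period the report covers.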

Macroeconomic Variables

Macroeconomic indicators capture the broader economic environment affecting equity markets. The review documents use of interest rates, GDP growth, inflation, unemployment, and exchange rates as predictors. Because these variables affect all stocks to some degree, they are particularly relevant for market-level predictions (e.g., S&P 500 index) rather than individual stocks. Due to their economy-wide impact, macro factors explain systematic risk that cannot be diversified away.

| Variable | Data Frequency | Market Relationship |
| --- | --- | --- |
| Interest Rates | Daily (Fed funds, Treasury yields) | Higher rates typically pressure equity valuations |
| GDP Growth | Quarterly | Economic expansion supports corporate earnings |
| Inflation (CPI) | Monthly | Moderate inflation positive; high inflation negative |
| Unemployment | Monthly | Labor market strength indicates economic health |
| Exchange Rates | Daily | Affects multinational company earnings |

The low frequency of macroeconomic data (monthly or quarterly) presents challenges for daily prediction models. In practice, some studies address this with "nowcasting" techniques that estimate current macro conditions from higher-frequency proxies, for instance using weekly jobless claims to infer monthly employment conditions. Others incorporate macroeconomic variables only for longer-horizon predictions where the temporal resolution matches. The review finds that models incorporating macro factors show improved performance during economic regime changes but may underperform during stable periods; adaptive models that reweight features as market conditions change therefore tend to achieve more consistent performance.
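As a toy illustration of the proxy idea, a weekly series can be collapsed into monthly estimates by averaging the weeks within each month; real nowcasting models are considerably more sophisticated (bridge equations, mixed-frequency regressions), so treat this purely as a sketch of the frequency-alignment step:

```python
def monthly_proxy(weekly_values, weeks_per_month=4):
    """Crude nowcast: average each month's weekly readings (e.g., jobless
    claims) to approximate the slower-moving official monthly series."""
    monthly = []
    for i in range(0, len(weekly_values), weeks_per_month):
        chunk = weekly_values[i : i + weeks_per_month]
        monthly.append(sum(chunk) / len(chunk))  # partial final month uses its own length
    return monthly
```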

Alternative Data Sources

Alternative data refers to non-traditional information sources that may provide predictive signals. The Kumbure et al. (2022) review documents growing use of textual data from news and social media, representing a major trend in post-2015 research. NLP techniques extract sentiment scores that capture market psychology not reflected in price data alone. This is important because investor sentiment can move prices before fundamental changes materialize—for instance, negative news about a company often drives price declines before any financial impact appears in earnings reports.

The Rise of Sentiment Analysis

Studies using news sentiment achieved average accuracy improvements of 3-5% over price-only models, according to meta-analyses cited in the review. Twitter and StockTwits provide higher-frequency sentiment signals, while news articles offer more substantive content for analysis. The challenge lies in processing unstructured text at scale and distinguishing signal from noise. Advanced NLP models, particularly transformer-based architectures, have improved sentiment extraction accuracy since 2018.

| Data Source | Signal Type | Processing Method | Challenges |
| --- | --- | --- | --- |
| News Articles | Sentiment, event detection | NLP, named entity recognition | Volume, relevance filtering |
| Social Media (Twitter) | Real-time sentiment | Sentiment classification, topic modeling | Noise, bot activity, sarcasm |
| Google Trends | Search interest | Query volume normalization | Query selection, lag effects |
| Options Markets | Implied volatility, put/call ratios | Direct numerical features | Complex market structure |
| Satellite Imagery | Economic activity proxies | Computer vision, change detection | Cost, specialized expertise |
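A lexicon-based scorer is the simplest version of the sentiment pipeline described above. The word lists below are tiny hypothetical stand-ins; practical systems use curated financial lexicons (e.g., Loughran-McDonald) or the transformer models noted earlier:

```python
# Hypothetical mini-lexicon for illustration only.
POSITIVE = {"beat", "growth", "upgrade", "strong", "record"}
NEGATIVE = {"miss", "decline", "downgrade", "weak", "lawsuit"}

def sentiment_score(text):
    """Net sentiment in [-1, 1]: (positive hits - negative hits) / total hits,
    or 0.0 when no lexicon words appear."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total
```

Even this toy version exposes the challenges from the table: sarcasm, negation ("did not beat"), and irrelevant matches all fool bag-of-words counting, which is why transformer-based scoring has become the norm.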

The integration of alternative data creates both opportunities and challenges. The potential for alpha generation attracts significant industry investment, but data costs and processing requirements can be substantial, putting smaller research teams at a disadvantage. Privacy regulations increasingly constrain certain data sources, particularly consumer behavior tracking. Evidence from various studies indicates that alternative data provides the greatest edge at shorter prediction horizons, where traditional data may not yet reflect new information; hedge funds with alternative data capabilities consequently focus on intraday and daily horizons rather than weekly or monthly predictions. Retail investors without access to premium data sources can still compete using publicly available sentiment from social media and news.

Feature Selection Methods

With 2,173 documented variables, feature selection is critical for practical model development. According to the review, there are several approaches: domain-knowledge-driven selection (choosing indicators with theoretical justification), filter methods (ranking features by correlation or mutual information), wrapper methods (evaluating feature subsets via model performance), and embedded methods (algorithms with built-in selection like LASSO or random forest importance). Each approach has trade-offs: domain knowledge is interpretable but may miss novel patterns, while automated methods are comprehensive but risk overfitting to historical data.

| Method | Approach | Advantages | Limitations |
| --- | --- | --- | --- |
| Domain Knowledge | Expert selection based on theory | Interpretable, avoids spurious correlations | May miss novel patterns |
| Correlation/MI Filtering | Rank features by target relationship | Fast, scales to many features | Ignores feature interactions |
| Recursive Feature Elimination | Iteratively remove least important | Accounts for model behavior | Computationally expensive |
| LASSO/Elastic Net | L1 regularization shrinks to zero | Automatic, handles multicollinearity | Linear assumptions |
| Random Forest Importance | Measure prediction contribution | Nonlinear, interpretable rankings | Biased toward high-cardinality features |
| Genetic Algorithms | Evolutionary search for feature subsets | Global optimization | Expensive, risk of overfitting |
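As a minimal example of a filter method from the table, candidate features can be ranked by absolute Pearson correlation with the prediction target and the top k retained. Names and the toy scoring rule are illustrative, and this approach ignores feature interactions, as the table notes:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def filter_select(features, target, k=2):
    """Filter method: rank features by |correlation| with the target,
    keep the top k names."""
    scores = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Mutual information replaces the correlation score when nonlinear feature-target relationships matter; wrapper and embedded methods then refine the surviving candidates.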

The review notes that studies employing systematic feature selection consistently outperform those using all available features: meta-analysis across 50+ studies found that models with 10-30 well-chosen predictors achieved 5-8% higher accuracy than those using every candidate variable. Financial data contains more noise than signal, and irrelevant features actively degrade performance, so reducing dimensionality from hundreds of raw features to a few dozen well-chosen predictors (for example, 20 carefully selected technical indicators rather than 200 raw inputs) typically improves both accuracy and interpretability. The "curse of dimensionality" is particularly acute in financial prediction: unlike image classification, where millions of examples are available, a financial model may have only a few thousand training samples, making feature parsimony critical. Embedded methods such as LASSO often match the selection quality of exhaustive wrapper methods at roughly a tenth of the computational cost, an important trade-off for practitioners with limited resources, although wrapper methods may find feature combinations that embedded methods miss. For discussion of how different algorithms handle high-dimensional features, see the ML Techniques page.

Recent Developments (2024-2025)

Data innovation continues to accelerate in financial ML. Large language models now enable more sophisticated text analysis, including reasoning about financial news rather than simple sentiment classification, a fundamental shift from bag-of-words sentiment to semantic understanding. Graph-structured data capturing supply chain and sector relationships provides new feature types, allowing models to account for how shocks propagate through interconnected companies. Synthetic data generation through GANs addresses limited sample sizes in financial applications, effectively augmenting training data for rare market events.

These recent advances (2024-2025) demonstrate the rapid evolution of feature engineering for financial ML. Evidence from multiple studies indicates that multimodal approaches—combining price, text, and alternative data—consistently outperform single-source models. Due to advances in NLP, text-based features now achieve comparable predictive power to traditional technical indicators. Therefore, modern systems typically incorporate both structured numerical data and unstructured text, because each captures different market dynamics.

Leading Research Teams

Data and feature engineering research spans quantitative finance and computer science:

| Institution | Key Researchers | Focus |
| --- | --- | --- |
| LUT University | Mahinda Kumbure, Pasi Luukka | Feature selection, fuzzy systems for finance |
| NYU Stern | Lasse Pedersen, Robert Engle | Factor models, alternative data |
| Chicago Booth | Eugene Fama, John Cochrane | Asset pricing, factor discovery |
| Imperial College | Marcin Kacperczyk | Machine learning in asset pricing |
| Yale School of Management | Bryan Kelly | Financial ML, deep learning for asset pricing |

Key Journals

Data and feature engineering research appears across finance and computer science venues. For algorithm selection to handle different feature types, see the ML Techniques page. For backtesting and performance evaluation, see the Performance Evaluation page.

External Resources

Authoritative Data Sources