Data Sources & Features for Stock Market Prediction
Overview
A major contribution of the Kumbure et al. (2022) review is its systematic catalog of predictor variables used in stock market forecasting research. The review identified 2,173 unique input variables across 138 papers, far more than most individual studies consider. Evidence from multiple studies consistently shows that feature selection affects prediction accuracy more than algorithm choice: choosing the right inputs matters more than choosing the right model. Understanding data sources and feature engineering is therefore essential for practical implementation.
Feature Categories at a Glance
Technical indicators appear in over 80% of reviewed studies, making them the dominant feature category. This reflects both data availability and the field's roots in technical analysis—due to their simplicity, technical indicators can be computed by any researcher with access to price data. Simple moving averages (SMA), relative strength index (RSI), and momentum measures are the most frequently used indicators because they have decades of established interpretation. Fundamental data and macroeconomic variables appear in approximately 30% of studies, typically for longer-horizon predictions. Alternative data, particularly sentiment from news and social media, represents a growing trend in post-2015 research.
The choice of input features depends on the prediction horizon and target. Short-term predictions (intraday to weekly) predominantly use technical indicators and high-frequency price data, whereas medium-term predictions (monthly) often incorporate fundamental ratios and earnings data. In contrast, long-term forecasts may include macroeconomic variables such as GDP growth and interest rates. The reviewed literature suggests that practitioners should match feature selection to their specific prediction task rather than applying generic feature sets. Unlike other machine learning domains where feature sets are relatively stable, financial prediction requires continuous adaptation as market regimes change.
Technical Indicators
Technical indicators transform raw price and volume data into signals intended to capture market momentum, trend, and volatility. According to the Kumbure et al. (2022) review, simple moving average (SMA) and exponential moving average (EMA) are the most commonly used, followed by RSI, MACD, and Bollinger Bands. These indicators derive from decades of trading practice and provide standardized transformations recognized across the industry. Unlike raw price data, technical indicators normalize information across different stocks and time periods, which enables models to learn more generalizable patterns. Compared to fundamental data that updates quarterly, technical indicators can be computed in real-time from streaming price feeds.
| Indicator | Type | Calculation Basis | Signal Interpretation |
|---|---|---|---|
| Simple Moving Average (SMA) | Trend | Average price over N periods | Price above SMA: bullish; below: bearish |
| Exponential Moving Average (EMA) | Trend | Weighted average favoring recent prices | More responsive to recent price changes |
| Relative Strength Index (RSI) | Momentum | Ratio of up moves to down moves | Above 70: overbought; below 30: oversold |
| MACD | Momentum/Trend | Difference between fast and slow EMA | Signal line crossovers indicate trend change |
| Bollinger Bands | Volatility | SMA with standard deviation bands | Price at bands suggests mean reversion |
| Stochastic Oscillator | Momentum | Position within recent price range | Identifies overbought/oversold conditions |
| On-Balance Volume (OBV) | Volume | Cumulative volume based on price direction | Volume confirms or diverges from trend |
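To make the table concrete, the moving-average and momentum indicators can be computed directly from a closing-price series. The following is a minimal numpy sketch using standard textbook formulas (the price series is synthetic, generated only for illustration):

```python
import numpy as np

def sma(prices, n):
    """Simple moving average: mean price over a window of n periods."""
    return np.convolve(prices, np.ones(n) / n, mode="valid")

def ema(prices, n):
    """Exponential moving average with smoothing factor 2/(n+1),
    weighting recent prices more heavily."""
    alpha = 2.0 / (n + 1)
    out = np.empty(len(prices), dtype=float)
    out[0] = prices[0]
    for t in range(1, len(prices)):
        out[t] = alpha * prices[t] + (1 - alpha) * out[t - 1]
    return out

def rsi(prices, n=14):
    """Relative Strength Index: 100 - 100/(1 + avg gain / avg loss)."""
    deltas = np.diff(prices)
    gains = np.where(deltas > 0, deltas, 0.0)
    losses = np.where(deltas < 0, -deltas, 0.0)
    avg_gain = np.convolve(gains, np.ones(n) / n, mode="valid")
    avg_loss = np.convolve(losses, np.ones(n) / n, mode="valid")
    rs = avg_gain / np.where(avg_loss == 0, np.nan, avg_loss)
    return 100.0 - 100.0 / (1.0 + rs)

rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(0, 1, 250))  # synthetic daily closes
print(sma(prices, 20)[-1], ema(prices, 20)[-1], rsi(prices)[-1])
```

As the table's interpretation column suggests, the outputs feed simple signals: price above the 20-day SMA reads as bullish, RSI above 70 as overbought.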
Research findings from multiple studies indicate that combining multiple indicators typically outperforms single-indicator approaches. Studies using 5-10 carefully selected indicators achieved 3-7% higher accuracy than those using raw prices alone—this is because different indicators capture different market dynamics. However, adding too many indicators introduces multicollinearity and increases overfitting risk. Therefore, dimensionality reduction or regularization techniques are essential when using large indicator sets. For discussion of algorithm selection to handle high-dimensional features, see the ML Techniques page. For performance metrics, see Performance Evaluation.
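Since large indicator sets introduce the multicollinearity risk noted above, a common pre-screening step is to drop one member of each highly correlated feature pair before modeling. A hedged numpy sketch follows; the 0.9 threshold and the synthetic feature matrix are illustrative choices, not values from the review:

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Greedily keep columns left to right, dropping any column whose
    absolute correlation with an already-kept column exceeds threshold.
    Returns the indices of the retained columns."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return kept

rng = np.random.default_rng(1)
base = rng.normal(size=(500, 1))
# Three candidate features: one signal, one near-duplicate of it, one independent
X = np.hstack([base,
               base + 0.01 * rng.normal(size=(500, 1)),
               rng.normal(size=(500, 1))])
print(drop_correlated(X))  # near-duplicate column dropped: [0, 2]
```

In practice this kind of filter would run over the full indicator set (e.g., several moving averages of similar windows, which are almost perfectly correlated) before any regularized model is fit.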
Fundamental Data
Fundamental analysis examines company financials to assess intrinsic value. The Kumbure et al. (2022) review found that fundamental data appears in approximately 30% of studies, typically for medium to long-term predictions. The literature indicates that fundamental factors explain significant cross-sectional variation in stock returns. Common fundamental variables include valuation ratios, profitability metrics, and balance sheet items.
| Category | Variables | Prediction Relevance |
|---|---|---|
| Valuation Ratios | P/E, P/B, P/S, EV/EBITDA, dividend yield | Long-term returns; mean reversion patterns |
| Profitability | ROE, ROA, profit margin, EBITDA margin | Quality factors; earnings persistence |
| Growth Metrics | Revenue growth, EPS growth, book value growth | Momentum in fundamentals |
| Leverage | Debt/equity, interest coverage, current ratio | Financial distress prediction |
| Earnings Quality | Accruals, cash flow/earnings ratio | Earnings sustainability |
Fundamental data presents unique challenges for ML models. Financial statements are released quarterly, creating sparse time series compared to daily prices—this is why models using fundamental data typically focus on longer prediction horizons. Earnings surprises (differences between actual and expected results) often drive short-term price movements, but expectations data requires analyst forecast databases. Research from several sources indicates that studies combining fundamental and technical features often outperform those using either category alone, suggesting complementary information content. Specifically, fundamental data captures company-specific value signals while technical data captures market timing.
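The sparsity problem described above is typically handled by carrying each quarterly figure forward until the next release, so the daily model only ever sees data that was already public. A minimal pandas sketch, with release dates and EPS values invented for illustration:

```python
import pandas as pd

# Hypothetical quarterly earnings-per-share figures, indexed by release date
fundamentals = pd.DataFrame(
    {"eps": [1.10, 1.25, 1.18]},
    index=pd.to_datetime(["2024-01-31", "2024-04-30", "2024-07-31"]),
)

# Daily trading calendar (business days); all three release dates fall on
# business days here, so no rows are lost in the reindex
daily = pd.date_range("2024-01-31", "2024-08-30", freq="B")

# Reindex to daily frequency, forward-filling the last known value to
# avoid look-ahead bias
daily_eps = fundamentals.reindex(daily).ffill()
print(daily_eps.loc["2024-03-15", "eps"])  # still the January figure: 1.10
print(daily_eps.loc["2024-05-01", "eps"])  # updated after the April release: 1.25
```

The same alignment step is what makes it possible to combine fundamental features with daily technical indicators in a single model, as the studies cited above do.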
Macroeconomic Variables
Macroeconomic indicators capture the broader economic environment affecting equity markets. The review documents use of interest rates, GDP growth, inflation, unemployment, and exchange rates as predictors. Because these variables affect all stocks to some degree, they are particularly relevant for market-level predictions (e.g., S&P 500 index) rather than individual stocks. Due to their economy-wide impact, macro factors explain systematic risk that cannot be diversified away.
| Variable | Data Frequency | Market Relationship |
|---|---|---|
| Interest Rates | Daily (Fed funds, Treasury yields) | Higher rates typically pressure equity valuations |
| GDP Growth | Quarterly | Economic expansion supports corporate earnings |
| Inflation (CPI) | Monthly | Moderate inflation positive; high inflation negative |
| Unemployment | Monthly | Labor market strength indicates economic health |
| Exchange Rates | Daily | Affects multinational company earnings |
The low frequency of macroeconomic data (monthly or quarterly) presents challenges for daily prediction models. In practice, some studies address this by using "nowcasting" techniques that estimate current macro conditions from higher-frequency proxies—for instance, using weekly jobless claims to infer monthly employment conditions. Alternatively, others incorporate macroeconomic variables only for longer-horizon predictions where the temporal resolution matches. Evidence from the review finds that models incorporating macro factors show improved performance during economic regime changes but may underperform during stable periods. As a result, adaptive models that adjust their feature weighting based on market conditions tend to achieve more consistent performance.
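One concrete way to mix frequencies without look-ahead bias is an as-of merge: each trading day receives the most recent macro reading that had already been published. A pandas sketch, with CPI values and publication dates made up for the example:

```python
import pandas as pd

# Hypothetical monthly CPI readings, keyed by their publication date
cpi = pd.DataFrame({
    "release_date": pd.to_datetime(["2024-02-13", "2024-03-12", "2024-04-10"]),
    "cpi_yoy": [3.1, 3.2, 3.5],
})

# Daily observations to which the macro feature is attached
prices = pd.DataFrame({
    "date": pd.date_range("2024-03-01", "2024-04-15", freq="B"),
})

# merge_asof (backward direction) attaches, to each trading day, the most
# recent CPI figure already *published*, never a future release
merged = pd.merge_asof(prices, cpi, left_on="date", right_on="release_date")
print(merged.loc[merged["date"] == "2024-03-11", "cpi_yoy"].item())  # 3.1
print(merged.loc[merged["date"] == "2024-03-13", "cpi_yoy"].item())  # 3.2
```

Note that the merge keys are publication dates, not reference periods: a CPI figure describing February only becomes usable in mid-March, which is exactly the lag the text above warns about.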
Alternative Data Sources
Alternative data refers to non-traditional information sources that may provide predictive signals. The Kumbure et al. (2022) review documents growing use of textual data from news and social media, representing a major trend in post-2015 research. NLP techniques extract sentiment scores that capture market psychology not reflected in price data alone. This is important because investor sentiment can move prices before fundamental changes materialize—for instance, negative news about a company often drives price declines before any financial impact appears in earnings reports.
The Rise of Sentiment Analysis
Studies using news sentiment achieved average accuracy improvements of 3-5% over price-only models, according to meta-analyses cited in the review. Twitter and StockTwits provide higher-frequency sentiment signals, while news articles offer more substantive content for analysis. The challenge lies in processing unstructured text at scale and distinguishing signal from noise. Advanced NLP models, particularly transformer-based architectures, have improved sentiment extraction accuracy since 2018.
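At the simplest end of this spectrum sits lexicon-based scoring: count positive and negative terms and normalize. The tiny word lists below are invented for the example; production systems use much richer lexicons (e.g., the Loughran-McDonald financial word lists) or, as noted above, transformer-based models:

```python
import re

# Toy sentiment lexicon; real financial lexicons contain thousands of terms
POSITIVE = {"beat", "growth", "upgrade", "record", "strong"}
NEGATIVE = {"miss", "decline", "downgrade", "loss", "weak"}

def sentiment_score(headline: str) -> float:
    """Score in [-1, 1]: (positive hits - negative hits) / total hits;
    0.0 when no lexicon terms match."""
    words = re.findall(r"[a-z]+", headline.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

print(sentiment_score("Company posts record quarterly growth, analysts upgrade"))  # 1.0
print(sentiment_score("Earnings miss triggers sharp decline"))                     # -1.0
```

The resulting score is then used as one numeric feature alongside price-based inputs; the noise, sarcasm, and bot-activity challenges in the table below are precisely what this naive approach fails to handle.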
| Data Source | Signal Type | Processing Method | Challenges |
|---|---|---|---|
| News Articles | Sentiment, event detection | NLP, named entity recognition | Volume, relevance filtering |
| Social Media (Twitter) | Real-time sentiment | Sentiment classification, topic modeling | Noise, bot activity, sarcasm |
| Google Trends | Search interest | Query volume normalization | Query selection, lag effects |
| Options Markets | Implied volatility, put/call ratios | Direct numerical features | Complex market structure |
| Satellite Imagery | Economic activity proxies | Computer vision, change detection | Cost, specialized expertise |
The integration of alternative data creates both opportunities and challenges. The potential for alpha generation attracts significant industry investment, but data costs and processing requirements can be substantial—which means that smaller research teams may be at a disadvantage. Privacy regulations increasingly constrain certain data sources, particularly consumer behavior tracking. Evidence from various studies indicates that alternative data provides the greatest edge for shorter prediction horizons where traditional data may not yet reflect new information. Consequently, hedge funds with alternative data capabilities focus on intraday and daily horizons rather than weekly or monthly predictions. On the other hand, retail investors without access to premium data sources can still compete using publicly available sentiment from social media and news.
Feature Selection Methods
With 2,173 documented variables, feature selection is critical for practical model development. According to the review, there are several approaches: domain-knowledge-driven selection (choosing indicators with theoretical justification), filter methods (ranking features by correlation or mutual information), wrapper methods (evaluating feature subsets via model performance), and embedded methods (algorithms with built-in selection like LASSO or random forest importance). Each approach has trade-offs: domain knowledge is interpretable but may miss novel patterns, while automated methods are comprehensive but risk overfitting to historical data.
| Method | Approach | Advantages | Limitations |
|---|---|---|---|
| Domain Knowledge | Expert selection based on theory | Interpretable, avoids spurious correlations | May miss novel patterns |
| Correlation/MI Filtering | Rank features by target relationship | Fast, scales to many features | Ignores feature interactions |
| Recursive Feature Elimination | Iteratively remove least important | Accounts for model behavior | Computationally expensive |
| LASSO/Elastic Net | L1 regularization shrinks to zero | Automatic, handles multicollinearity | Linear assumptions |
| Random Forest Importance | Measure prediction contribution | Nonlinear, interpretable rankings | Biased toward high-cardinality |
| Genetic Algorithms | Evolutionary search for feature subsets | Global optimization | Expensive, risk of overfitting |
The review notes that studies employing systematic feature selection consistently outperform those using all available features: in a meta-analysis spanning more than 50 studies, models with 10-30 well-chosen predictors achieved 5-8% higher accuracy than those using the full feature set. Financial data contains more noise than signal, so irrelevant features actively degrade performance, and reducing dimensionality from hundreds of candidates to a few dozen well-chosen predictors often improves both accuracy and interpretability. The "curse of dimensionality" is particularly acute in financial prediction: unlike image classification, where millions of examples are available, a financial dataset may offer only a few thousand training samples, making feature parsimony critical. Compared to wrapper methods that evaluate feature subsets exhaustively, embedded methods such as LASSO achieve similar selection quality at roughly a tenth of the computational cost, an important trade-off for practitioners with limited resources, although wrapper methods may find feature combinations that embedded methods miss. For discussion of how different algorithms handle high-dimensional features, see the ML Techniques page.
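The filter approach from the table above can be sketched in a few lines: rank candidate features by absolute Pearson correlation with next-period returns and keep the top k. This is purely illustrative, using synthetic data in which only two features carry signal:

```python
import numpy as np

def filter_select(X, y, k):
    """Rank features by absolute Pearson correlation with the target
    and return the indices of the top k."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(42)
n, p = 1000, 50
X = rng.normal(size=(n, p))
# Synthetic target: only features 0 and 1 are informative, the rest are noise
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)
selected = filter_select(X, y, k=2)
print(sorted(selected.tolist()))  # the two informative features: [0, 1]
```

The table's caveat applies directly: because each feature is scored in isolation, this filter would miss a feature that only predicts returns in interaction with another, which is where wrapper and embedded methods earn their extra cost.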
Recent Developments (2024-2025)
Data innovation continues to accelerate in financial ML. Large language models now enable more sophisticated text analysis, including reasoning about financial news rather than simple sentiment classification. This represents a fundamental shift from bag-of-words sentiment to semantic understanding. Graph-structured data capturing supply chain and sector relationships provides new feature types—therefore, models can now account for how shocks propagate through interconnected companies. Synthetic data generation through GANs addresses limited sample sizes in financial applications, effectively augmenting training data for rare market events.
Key recent publications on data and features include:
- Financial applications of machine learning: A literature review (Expert Systems with Applications, 2023) - Comprehensive survey of feature engineering approaches
- FinGPT: Open-Source Financial Large Language Models (arXiv, 2024) - LLM-based analysis of earnings calls and SEC filings
- Graph neural networks for stock market prediction (Knowledge-Based Systems, 2024) - Knowledge graphs encoding company relationships
- Explainable AI for financial prediction (Finance Research Letters, 2024) - Feature importance and interpretability methods
- Multi-model ML framework for daily stock price prediction (Big Data and Cognitive Computing, 2025) - 36 technical indicator features across Apple, Tesla, NVIDIA
- Hybrid ML models for long-term stock forecasting (Journal of Risk and Financial Management, 2025) - LSTM-CNN integration with technical indicators
- Evaluating ML models for stock market forecasting (SAGE Global Business Review, 2025) - Comparative algorithm performance analysis
These recent advances (2024-2025) demonstrate the rapid evolution of feature engineering for financial ML. Evidence from multiple studies indicates that multimodal approaches—combining price, text, and alternative data—consistently outperform single-source models. Due to advances in NLP, text-based features now achieve comparable predictive power to traditional technical indicators. Therefore, modern systems typically incorporate both structured numerical data and unstructured text, because each captures different market dynamics.
Leading Research Teams
Data and feature engineering research spans quantitative finance and computer science:
| Institution | Key Researchers | Focus |
|---|---|---|
| LUT University | Mahinda Kumbure, Pasi Luukka | Feature selection, fuzzy systems for finance |
| NYU Stern | Lasse Pedersen, Robert Engle | Factor models, alternative data |
| Chicago Booth | Eugene Fama, John Cochrane | Asset pricing, factor discovery |
| Imperial College | Marcin Kacperczyk | Machine learning in asset pricing |
| Yale School of Management | Bryan Kelly | Financial ML, deep learning for asset pricing |
Key Journals
Data and feature engineering research appears across finance and computer science venues. For algorithm selection to handle different feature types, see the ML Techniques page. For backtesting and performance evaluation, see the Performance Evaluation page.
- Expert Systems with Applications - Feature selection methods for finance
- Journal of Financial Data Science - Alternative data and feature engineering
- Quantitative Finance - Factor models and data analysis
External Resources
Authoritative Data Sources
- arXiv Statistical Finance - Preprints on financial data analysis
- ACM Digital Library - Information Retrieval - Text mining for finance
- PubMed Central - Open-access research on quantitative methods
- Kaggle Financial Datasets - Public datasets for research
- Federal Reserve Economic Data (FRED) - Macroeconomic time series
- NBER Data - National Bureau of Economic Research datasets
- Wharton Research Data Services - Comprehensive financial databases