Detailed Workflow

1

Data Acquisition & Preprocessing

Data Source: yfinance API for S&P500 constituent daily data

  • Fetch daily OHLCV (open, high, low, close, volume) data for 2010-2025
  • Adjust prices for stock splits and dividends
  • Clean the data and handle outliers
  • Store the results in DuckDB and as Parquet files
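
The split adjustment in this step can be illustrated with a minimal sketch. It assumes raw closes and a list of split events are already in memory; the function name `back_adjust_for_splits` and the prices are made up for illustration. yfinance can also return adjusted prices directly (for example through its `auto_adjust` option), so a manual pass like this may not be needed in practice.

```python
from datetime import date


def back_adjust_for_splits(closes, splits):
    """Return split-adjusted closes.

    closes: list of (date, raw_close), sorted ascending by date.
    splits: list of (ex_date, ratio); ratio=4.0 means a 4-for-1 split.
    Prices before each ex-date are divided by the ratio so the series
    stays comparable across the split.
    """
    adjusted = []
    for day, close in closes:
        factor = 1.0
        for ex_date, ratio in splits:
            if day < ex_date:
                factor *= ratio
        adjusted.append((day, close / factor))
    return adjusted


# Illustrative (made-up) prices around a 4-for-1 split.
raw = [
    (date(2020, 8, 27), 500.0),
    (date(2020, 8, 28), 499.0),
    (date(2020, 8, 31), 129.0),  # first session after the split
]
adjusted = back_adjust_for_splits(raw, [(date(2020, 8, 31), 4.0)])
```

Without the adjustment, the drop from 499.0 to 129.0 would look like a large negative return to every downstream factor.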
2

Factor Calculation

Factor Types: Alpha101 + TA-Lib + Custom Factors

  • Alpha101: 101 classic quantitative factors (WorldQuant)
  • TA-Lib: 50-80 technical indicator factors (RSI, MACD, Bollinger Bands, etc.)
  • Custom Factors: 5 factors optimized for S&P500
  • Total of 160+ factors
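
As a concrete example of one factor, the sketch below computes Alpha#101 from the WorldQuant list, defined as (close - open) / ((high - low) + 0.001). The tickers and bar values are hypothetical, and a real pipeline would compute this over full price tables rather than a small dictionary.

```python
def alpha101(open_, high, low, close):
    """Alpha#101: intraday move scaled by the day's range.

    The 0.001 in the denominator avoids division by zero on days
    where high == low.
    """
    return (close - open_) / ((high - low) + 0.001)


# Hypothetical one-day bars for two placeholder tickers.
bars = {
    "AAA": {"open": 10.0, "high": 11.0, "low": 9.0, "close": 10.5},
    "BBB": {"open": 20.0, "high": 20.0, "low": 20.0, "close": 20.0},
}
factors = {
    ticker: alpha101(b["open"], b["high"], b["low"], b["close"])
    for ticker, b in bars.items()
}
```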
3

Factor Processing & Enhancement

Processing Method: MAD winsorization, a standard industry practice

  • Clip outliers using the MAD (Median Absolute Deviation), which is robust to extreme values
  • Leave the rest of each factor's distribution unchanged to avoid information loss
  • Process each date's cross-section separately to avoid look-ahead bias
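
A minimal sketch of per-date MAD winsorization follows. The clipping width of 3 scaled MADs and the 1.4826 scaling constant are common conventions assumed here, not values stated above, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import median


def mad_winsorize(values, n_mads=3.0):
    """Clip values to median +/- n_mads * 1.4826 * MAD.

    If the MAD is zero, every value is clipped to the median.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    half_width = n_mads * 1.4826 * mad
    lo, hi = med - half_width, med + half_width
    return [min(max(v, lo), hi) for v in values]


def winsorize_by_date(rows, n_mads=3.0):
    """rows: list of (date, ticker, value). Each date is clipped on its own,
    so no statistic is computed from other (in particular later) dates."""
    by_date = defaultdict(list)
    for day, ticker, value in rows:
        by_date[day].append((ticker, value))
    out = []
    for day, items in by_date.items():
        clipped = mad_winsorize([v for _, v in items], n_mads)
        out.extend((day, t, c) for (t, _), c in zip(items, clipped))
    return out


rows = [
    ("2024-01-02", "AAA", 1.0), ("2024-01-02", "BBB", 2.0),
    ("2024-01-02", "CCC", 3.0), ("2024-01-02", "DDD", 4.0),
    ("2024-01-02", "EEE", 100.0),  # outlier on this date only
    ("2024-01-03", "AAA", 50.0), ("2024-01-03", "BBB", 51.0),
    ("2024-01-03", "CCC", 52.0),
]
clean = winsorize_by_date(rows)
```

Only the 100.0 on the first date is pulled in toward the median. The second date is untouched, because its bounds come from its own cross-section.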
4

Model Training

Model: LightGBM Ranker (Learning to Rank)

  • Label: Cross-sectional rank of each stock's forward 5-day return (less noisy than next-day returns)
  • Loss Function: LambdaRank (NDCG optimization)
  • Training Set: 2010-2018 (9 years)
  • Validation Set: 2019-2021 (3 years, for early stopping)
  • Test Set: 2022-2025 (for final evaluation)
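
The label and the date split can be sketched as below. This is an illustration under assumptions, not the project's code: the helper names are hypothetical, and binning returns into five integer grades is an assumed choice, made because LightGBM's lambdarank objective expects integer relevance labels grouped per query (here, per trading date).

```python
def forward_return(closes, horizon=5):
    """closes: one stock's closes in date order.

    Returns close[t + horizon] / close[t] - 1 for each t, or None where
    the future price is not yet known.
    """
    n = len(closes)
    return [
        closes[t + horizon] / closes[t] - 1.0 if t + horizon < n else None
        for t in range(n)
    ]


def relevance_grades(returns_by_ticker, n_grades=5):
    """Bin one date's forward returns into integer grades 0..n_grades-1.

    Higher grade = higher forward return within that date's cross-section.
    """
    ranked = sorted(returns_by_ticker, key=returns_by_ticker.get)
    n = len(ranked)
    return {ticker: i * n_grades // n for i, ticker in enumerate(ranked)}


def split_name(year):
    """Map a calendar year to the train / validation / test split above."""
    if 2010 <= year <= 2018:
        return "train"
    if 2019 <= year <= 2021:
        return "valid"
    if 2022 <= year <= 2025:
        return "test"
    return None


fwd = forward_return([100.0, 101.0, 102.0, 103.0, 104.0, 110.0, 105.0])
grades = relevance_grades(
    {"AAA": -0.02, "BBB": 0.01, "CCC": 0.03, "DDD": 0.05, "EEE": 0.10}
)
```

The last five entries of `fwd` are None, so those rows cannot be labeled and would be dropped before training.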
5

Prediction Generation

Prediction Content: Future return ranking score for each stock

  • Generate predictions for all stocks on the latest date
  • Output prediction scores (a higher score means a higher expected return rank)
  • Save prediction results for portfolio optimization
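
A small sketch of this step, assuming the model's scores are available as (date, ticker, score) tuples with ISO-formatted dates. The function names, the CSV layout, and the sample values are hypothetical.

```python
import csv


def latest_scores(predictions):
    """Return (latest_date, [(ticker, score), ...]) sorted by score, best first.

    predictions: list of (date_str, ticker, score); ISO date strings
    compare correctly as plain strings.
    """
    latest = max(day for day, _, _ in predictions)
    rows = [(t, s) for day, t, s in predictions if day == latest]
    rows.sort(key=lambda r: r[1], reverse=True)
    return latest, rows


def save_scores(path, day, rows):
    """Write one date's scores to a CSV file for the portfolio step."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["date", "ticker", "score"])
        for ticker, score in rows:
            writer.writerow([day, ticker, score])


predictions = [
    ("2025-01-02", "AAA", 0.10), ("2025-01-02", "BBB", 0.30),
    ("2025-01-03", "AAA", 0.25), ("2025-01-03", "BBB", -0.05),
    ("2025-01-03", "CCC", 0.40),
]
day, rows = latest_scores(predictions)
# save_scores("predictions.csv", day, rows) would persist them to disk.
```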
6

Portfolio Optimization

Strategy: TopK Dropout Strategy

  • Select the 20 stocks with the highest prediction scores
  • Rebalance all 20 positions daily (full rebalance)
  • Equal-weight allocation (5% per stock)
  • Generate daily weight files
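
The selection and weighting rule above can be sketched in a few lines. The function name and the synthetic scores are hypothetical, and tie handling (ties keep their input order) is an assumption.

```python
def topk_equal_weights(scores, k=20):
    """Pick the k highest-scoring tickers and give each an equal weight.

    scores: dict of ticker -> prediction score. With k=20 each selected
    stock receives 1/20 = 5% of the portfolio.
    """
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    if not top:
        return {}
    weight = 1.0 / len(top)
    return {ticker: weight for ticker in top}


# Synthetic scores for 25 placeholder tickers T00..T24.
scores = {f"T{i:02d}": i / 100 for i in range(25)}
weights = topk_equal_weights(scores)
```

Because the whole portfolio is rebuilt every day, the weights depend only on that day's scores and not on the previous holdings.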
7

Real-time Trade Execution

Trading Platform: Interactive Brokers (IBKR)

  • Read daily weight files
  • Get current positions
  • Calculate trade orders (buy/sell)
  • Execute trades via IBKR API
  • Monitor order status and fills