Detailed Workflow
1
Data Acquisition & Preprocessing
Data Source: yfinance API for S&P500 constituent daily data
- Fetch OHLCV data (2010-2025)
- Adjust prices for stock splits and dividends
- Data cleaning and outlier handling
- Store in DuckDB and Parquet format
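
The split adjustment mentioned above can be illustrated with a minimal sketch. It assumes raw (unadjusted) closes and a list of split events. The function name `back_adjust` and the sample prices are hypothetical and are not taken from the project. Libraries such as yfinance can also return already-adjusted prices, so this only shows the arithmetic.

```python
from datetime import date

def back_adjust(closes, splits):
    """Back-adjust raw closes for stock splits.

    closes: list of (date, raw_close), sorted ascending by date.
    splits: list of (ex_date, ratio), e.g. ratio=4.0 for a 4-for-1 split.
    Prices strictly before each ex-date are divided by that split's ratio,
    so the series is continuous across the split.
    """
    adjusted = []
    for day, price in closes:
        factor = 1.0
        for ex_date, ratio in splits:
            if day < ex_date:
                factor *= ratio
        adjusted.append((day, price / factor))
    return adjusted

# Hypothetical prices around a 4-for-1 split effective 2020-08-31.
raw = [
    (date(2020, 8, 27), 500.0),
    (date(2020, 8, 28), 499.0),
    (date(2020, 8, 31), 129.0),  # first session after the split
]
adj = back_adjust(raw, [(date(2020, 8, 31), 4.0)])
```

After adjustment the pre-split closes are on the same scale as the post-split ones, which keeps return-based factors from seeing a spurious 75% drop on the split date.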
2
Factor Calculation
Factor Types: Alpha101 + TA-Lib + Custom Factors
- Alpha101: 101 classic quantitative factors (WorldQuant)
- TA-Lib: 50-80 technical indicator factors (RSI, MACD, Bollinger Bands, etc.)
- Custom Factors: 5 factors optimized for S&P500
- Total: roughly 160+ factors (101 + 50-80 + 5)
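
As a small illustration of what a single factor looks like, the sketch below computes Alpha#101 from the WorldQuant set, defined as (close - open) / ((high - low) + 0.001), for one day's bars. The tickers and prices are made up, and the project's own factor code may be organized quite differently.

```python
def alpha101(open_, high, low, close):
    """Alpha#101: intraday move scaled by the day's range.

    The 0.001 term guards against division by zero when high == low.
    """
    return (close - open_) / ((high - low) + 0.001)

# Hypothetical daily bars: ticker -> (open, high, low, close)
bars = {
    "AAA": (100.0, 103.0, 99.0, 102.0),  # closed above the open
    "BBB": (50.0, 50.5, 48.5, 49.0),     # closed below the open
}
factors = {ticker: alpha101(*ohlc) for ticker, ohlc in bars.items()}
```

A positive value means the stock closed above its open relative to the day's range, and a negative value means it closed below.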
3
Factor Processing & Enhancement
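Processing Method: MAD winsorization (a common industry practice)
- Clip outliers to bounds based on MAD (Median Absolute Deviation), which is more robust than mean/standard-deviation bounds
- Clip extreme values rather than dropping them, so the bulk of the factor distribution is unchanged and no observations are lost
- Process each date's cross-section separately to avoid look-ahead bias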
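
A minimal sketch of per-date MAD winsorization follows. The cutoff of 3 scaled MADs and the 1.4826 consistency constant are common defaults assumed here, not values stated by the project, and the function names are hypothetical.

```python
from statistics import median

def mad_winsorize(values, n_mad=3.0, scale=1.4826):
    """Clip values to median +/- n_mad * scale * MAD.

    n_mad and scale are assumed defaults; 1.4826 makes MAD comparable
    to a standard deviation under normality.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)  # degenerate cross-section: nothing to clip
    half_width = n_mad * scale * mad
    lo, hi = med - half_width, med + half_width
    return [min(max(v, lo), hi) for v in values]

def winsorize_by_date(panel, n_mad=3.0):
    """panel: {date: {ticker: factor_value}}.

    Each date is processed on its own, so the bounds for a given day
    never depend on data from any other day.
    """
    out = {}
    for day, cross_section in panel.items():
        tickers = list(cross_section)
        clipped = mad_winsorize([cross_section[t] for t in tickers], n_mad)
        out[day] = dict(zip(tickers, clipped))
    return out

# Hypothetical factor values: one extreme value on the first date only.
panel = {
    "2024-01-02": {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0, "E": 100.0},
    "2024-01-03": {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0, "E": 5.0},
}
cleaned = winsorize_by_date(panel)
```

On the first date the value 100.0 is pulled in to the upper bound (3 + 3 × 1.4826 × 1 = 7.4478), while the second date is left unchanged.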
4
Model Training
Model: LightGBM Ranker (Learning to Rank)
- Label: Cross-sectional rank of the forward 5-day return (less noisy than next-day returns)
- Loss Function: LambdaRank (NDCG optimization)
- Training Set: 2010-2018 (9 years)
- Validation Set: 2019-2021 (3 years, for early stopping)
- Test Set: 2022-2025 (for final evaluation)
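
The label construction and the chronological split can be sketched as below. Because ranking objectives such as LambdaRank generally work with integer relevance grades grouped by query (here, one query per trading date), the sketch buckets each date's forward returns into grades. The bucket count of 5 and all function names are assumptions for illustration, not the project's actual settings.

```python
def forward_return(closes, horizon=5):
    """Forward return over `horizon` days for one stock.

    closes: closes sorted ascending by date.
    The last `horizon` entries have no future price and are None.
    """
    n = len(closes)
    return [
        closes[i + horizon] / closes[i] - 1.0 if i + horizon < n else None
        for i in range(n)
    ]

def relevance_grades(returns_by_ticker, n_bins=5):
    """Rank one date's cross-section and map ranks to integer grades.

    Grades run from 0 (worst) to n_bins - 1 (best); n_bins=5 is assumed.
    """
    ordered = sorted(returns_by_ticker, key=returns_by_ticker.get)
    n = len(ordered)
    return {t: (rank * n_bins) // n for rank, t in enumerate(ordered)}

def split_name(year):
    """Assign a calendar year to the train/validation/test split."""
    if 2010 <= year <= 2018:
        return "train"
    if 2019 <= year <= 2021:
        return "valid"
    if 2022 <= year <= 2025:
        return "test"
    return None

# Hypothetical data for illustration.
fr = forward_return([100.0, 101.0, 102.0, 103.0, 104.0, 110.0, 99.0])
grades = relevance_grades(
    {"A": 0.03, "B": -0.01, "C": 0.00, "D": 0.05, "E": -0.02}
)
```

Rows whose forward return is None (the final five days of the sample) have no label and would be dropped before training.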
5
Prediction Generation
Prediction Content: Future return ranking score for each stock
- Generate predictions for all stocks on the latest date
- Output prediction scores (higher = better expected relative ranking, not a return forecast)
- Save prediction results for portfolio optimization
6
Portfolio Optimization
Strategy: TopK Dropout Strategy
- Select the 20 stocks with the highest prediction scores
- Rebalance all 20 positions daily to the current top 20 (full rebalance)
- Equal weight allocation (5% per stock)
- Generate daily weight files
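
The selection and weighting steps above can be sketched as follows. The function name `topk_equal_weights` and the sample scores are hypothetical. Breaking ties by ticker symbol is an assumption added to keep the output deterministic.

```python
def topk_equal_weights(scores, k=20):
    """Pick the k highest-scoring tickers and weight them equally.

    scores: {ticker: prediction_score}, higher is better.
    Ties are broken alphabetically by ticker (an assumption).
    """
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    top = ranked[:k]
    if not top:
        return {}
    weight = 1.0 / len(top)
    return {ticker: weight for ticker in top}

# Hypothetical scores for 25 tickers, T00 (lowest) to T24 (highest).
scores = {f"T{i:02d}": i / 100 for i in range(25)}
weights = topk_equal_weights(scores, k=20)
```

With 20 names selected, each weight is 1/20 = 5%, matching the allocation described above. The resulting dictionary is the kind of content a daily weight file would hold.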
7
Real-time Trade Execution
Trading Platform: Interactive Brokers (IBKR)
- Read daily weight files
- Get current positions
- Calculate trade orders (buy/sell)
- Execute trades via IBKR API
- Monitor order status and fills
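
The order calculation step can be sketched as a pure function that compares target weights with current holdings. It assumes whole-share sizing rounded down and a known account equity. The function name and inputs are hypothetical, and no IBKR API calls are shown: reading positions and submitting the orders would be done by the project's own IBKR client code.

```python
def compute_orders(target_weights, positions, prices, equity):
    """Return {ticker: share_delta}; positive = buy, negative = sell.

    target_weights: {ticker: weight} from the daily weight file.
    positions: {ticker: shares currently held}.
    prices: {ticker: latest price} for every ticker involved.
    equity: total account value used to size positions.
    Target shares are rounded down to whole shares (an assumption).
    """
    orders = {}
    for ticker in sorted(set(target_weights) | set(positions)):
        target_value = target_weights.get(ticker, 0.0) * equity
        target_shares = int(target_value / prices[ticker])
        delta = target_shares - positions.get(ticker, 0)
        if delta != 0:
            orders[ticker] = delta
    return orders

# Hypothetical account: $100,000 equity, two target names, one stale holding.
orders = compute_orders(
    target_weights={"AAA": 0.5, "BBB": 0.5},
    positions={"AAA": 100, "CCC": 50},
    prices={"AAA": 200.0, "BBB": 125.0, "CCC": 40.0},
    equity=100_000,
)
```

Holdings that no longer appear in the weight file (CCC here) get a target of zero shares and are sold in full.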