Introduction

In this new 5-part series of articles, I will be developing real HFT alphas using the research pipeline discussed in the previous article on HFT Alpha Research. In the first article (today’s article), we will present a basic set of features which explain a large portion of the overall variance. In the next article, we will expand this selection and use more advanced features engineered from full-depth orderbook data. In the 3rd article, we will expand our universe from BTC, ETH & SOL to the top 30 assets by rolling 90d volume rank, and develop a cross-sectional factor model for the HFT timeframe. In the 4th article, we will use this factor model to test additional features, and we will also perform feature selection to form our final feature set. In the final article, we will test various models for forecasting forward returns.

We use a very large dataset, 1 year of second-level tick data, to perform this analysis, and we provide code to replicate the entire process at home. The final forecast aims to be worthy of being considered a starting point for market making and development. More than that, we aim to show readers how to research a set of alphas, with the theory behind the alphas explained in detail.

If you have not already read the prior article on methodology, please read that first.

By the end of this article, we will have found 7 features for the 5s timeframe and 5 features for the 15s timeframe. See the lists at the end for the final feature set.
The Data

Our analysis focuses solely on Binance USD-M Futures for USDT pairs. This is the most liquid market for trading digital assets. We will use data from 2025-01-01 to 2025-12-31. Our data is tick level and encompasses every orderbook update and every trade that occurred over that period. We use BTCUSDT, ETHUSDT, and SOLUSDT for this analysis. In the later articles we will expand the universe but shrink the date range to keep things computationally feasible, since full-depth data is massive.

Data is where it all starts, and having a high-quality data provider is one of the most important things to consider when doing HFT analysis. Most HFT firms collect their data internally using custom data scrapers, so that the data has latency statistics representative of their actual setup. In lieu of this, we will be using Tardis, which provides a high-quality institutional dataset of tick data for digital asset markets. We use the below endpoints:
We pre-process the trade and quote data into an OHLCV dataset made of 5s bars, where volume is derived from trades, and open, high, low, and close are derived from the quote mid-price. We also use quotes to get the top-of-book best bid/ask, which will be one of the features in today's article. Please note that the quote dataset from Tardis is generated from the book deltas feed and not from the quote feed.

We then take our OHLCV dataset and calculate the close-to-close mid-price returns. From here we shift these returns to create 5s forward returns. Additionally, we generate a disjoint 15s return which runs from t+5s to t+15s, so it covers the 10 seconds that follow the 5s return window and does not overlap it. This lets us separate effects on the 5s timeframe from the 15s timeframe without needing to use a markout analysis.

The Alphas

We will be testing the below alphas: ...
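Returning to the pre-processing described in the data section, the bar construction and forward-return targets can be sketched roughly as follows. This is a minimal illustration in pandas, not the exact pipeline used for the article. The function names (`build_bars`, `add_forward_returns`), the column names (`bid_price`, `ask_price`, `amount`), the right-closed bar convention, and the forward-filling of empty bars are all assumptions made for the example. It also assumes both inputs are indexed by a sorted exchange timestamp.

```python
import numpy as np
import pandas as pd

BAR = pd.Timedelta(seconds=5)  # bar length used in the article


def build_bars(quotes: pd.DataFrame, trades: pd.DataFrame) -> pd.DataFrame:
    """Build 5s bars: OHLC from the quote mid-price, volume from trades.

    Assumed inputs (hypothetical column names):
      quotes: DatetimeIndex, columns 'bid_price' and 'ask_price'
      trades: DatetimeIndex, column 'amount'
    """
    mid = (quotes["bid_price"] + quotes["ask_price"]) / 2.0

    # Right-closed, right-labelled bars: the bar stamped t covers (t - 5s, t],
    # so every value in the row is known at time t.
    bars = mid.resample(BAR, closed="right", label="right").ohlc()

    # Assumption: a bar with no quote update carries the last known mid forward.
    bars["close"] = bars["close"].ffill()
    for col in ("open", "high", "low"):
        bars[col] = bars[col].fillna(bars["close"])

    volume = trades["amount"].resample(BAR, closed="right", label="right").sum()
    bars["volume"] = volume.reindex(bars.index, fill_value=0.0)
    return bars


def add_forward_returns(bars: pd.DataFrame) -> pd.DataFrame:
    """Add close-to-close returns and the two forward-looking targets."""
    out = bars.copy()
    close = out["close"]

    # Return realised over the bar ending at t (from t - 5s to t).
    out["ret"] = close / close.shift(1) - 1.0

    # 5s forward return: the next bar's return, shifted back to time t.
    out["fwd_ret_5s"] = out["ret"].shift(-1)

    # Disjoint 15s return: from t + 5s (one bar ahead) to t + 15s (three bars
    # ahead), so it does not overlap the 5s forward return.
    out["fwd_ret_15s"] = close.shift(-3) / close.shift(-1) - 1.0
    return out
```

Because the two targets cover non-overlapping windows, a feature that only predicts the first 5 seconds should show little relationship with `fwd_ret_15s`. The last rows of each target are NaN by construction and should be dropped before fitting anything.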