MarketAlert – Real-Time Market & Crypto News, Analysis & Alerts
Trading Strategies

Statistical Arbitrage Through Cointegrated Stocks (Final): Data Analysis with Specialized Database

Last updated: February 28, 2026 11:40 pm
Published: 7 hours ago

At the beginning of 2025, we accepted a challenge to develop a statistical arbitrage framework for the average retail trader, that lonely warrior armed with only a consumer notebook, a regular internet connection, and small trading capital. Since then, we have published ten articles describing our research and experiments, the successes and failures of our attempts to beat the market with math.

All of these articles were written with the goal of being a gentle introduction to the topic of statistical arbitrage from a trader’s perspective. Together, they are meant to compose the minimum required for a regular person with a high-school math background and basic developer skills to start understanding and implementing statistical arbitrage trading strategies.

Correlation, cointegration, and stationarity tests were covered right at the start of the series, since these topics are foundational. There, we talked about the Engle-Granger and Johansen cointegration tests, along with the Augmented Dickey-Fuller ("ADF") test and the Kwiatkowski-Phillips-Schmidt-Shin ("KPSS") stationarity test. By knowing their purpose, interpretation, and meaning, we could start building our pairs and basket portfolios right from the beginning. The use of open-source, professional-grade Python statistical libraries like statsmodels spared us from going deep into the math behind these tests and allowed us to keep our focus on their high-level application.

Obviously, when we talk about statistical arbitrage, we talk about data analysis, and therefore about databases. As Metatrader 5 has an integrated SQLite database, we got you covered there, too. In Part 3 of this series, we discussed the database setup, proposed an initial database schema, and showed how it can be kept up to date by means of an MQL5 Service. The introduction of this Service was the first step we took to decouple our data analysis from our trading environment. We will be finishing the decoupling in this final article.

Once we had the database set up, along with a stable cointegration and stationarity testing pipeline, the next logical step was to start building the portfolios. So in the next three articles, we proposed a minimal screening and scoring system based on a hierarchy of objective criteria such as cointegration strength, the stability of the portfolio weights, and the half-life of mean reversion. The hypothesis we used for our screening was cointegration between high-liquidity microprocessor-industry stocks from the Nasdaq, but its principles are valid for any other cointegrated portfolio.
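To make the half-life criterion concrete, here is a minimal sketch (my own illustration, not the series' exact scoring code) that estimates the half-life of mean reversion by regressing the spread change on the lagged spread level:

```python
import numpy as np

def half_life(spread):
    """Estimate the half-life of mean reversion (in bars) from an OLS fit
    of the spread change on the lagged spread level."""
    spread = np.asarray(spread, dtype=float)
    lagged = spread[:-1]
    delta = np.diff(spread)
    # OLS with intercept: delta = alpha + beta * lagged + noise
    X = np.column_stack([np.ones_like(lagged), lagged])
    beta = np.linalg.lstsq(X, delta, rcond=None)[0][1]
    # beta < 0 for a mean-reverting series; half-life = -ln(2) / beta
    return -np.log(2) / beta

# Synthetic AR(1) spread with phi = 0.9 (true half-life around 6.6 bars)
rng = np.random.default_rng(42)
s = np.zeros(5000)
for t in range(1, 5000):
    s[t] = 0.9 * s[t - 1] + rng.normal()

print(f"Estimated half-life: {half_life(s):.1f} bars")
```

A short half-life means the spread snaps back quickly, which is what the scoring system rewards.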

Finally, in the last three articles, we discussed some broadly used methods for live trade monitoring, like the Rolling Windows Eigenvector Comparison (RWEC) and the In-Sample/Out-of-Sample ADF (IS/OOS ADF), which we used in tandem to detect the need for portfolio weight rebalancing, and the Chow test and Cumulative Sum of Squares (CUSUM) evaluation for detecting structural breaks in cointegrated relationships.

The research and experiments we summarized in those ten articles were instrumental in revealing that the entry barrier for the average individual retail trader in the statistical arbitrage space has fallen significantly in the last ten to fifteen years. By shifting the focus from trading speed, that is, above all, order-execution speed, to the discovery of an almost endless supply of opportunities in unexpected relationships between symbols from different markets and on different timeframe combinations, we confirmed the initial supposition we had when we accepted this challenge: statistical arbitrage strategies are perfectly accessible to the average retail trader these days. Since these strategies avoid the prohibitive costs of the HFT space, retail traders can still find attractive opportunities for operating stat arb strategies in 2026.

The only missing piece in the toolkit we described is a specialized database to remove the data analysis burden from SQLite, leaving our lovely Metatrader 5 built-in database with the transaction-processing tasks for which it was designed. The interactions with the EA, our so-called 'single source of truth' for market entries and exits, should remain with SQLite, but the data analysis that informs our screening/scoring system and our live trade monitoring should be handled by a specialized system. This last step of our introduction to statistical arbitrage for retail traders makes our data analysis setup ready for real-time monitoring of big datasets, and it also future-proofs the setup, as it will accommodate the growth of our operations without requiring investment in expensive hardware.

Processing power relevance to statistical arbitrage

Any trading activity requires data. At a minimum, one needs the asset's current price to enter a buy or sell operation. Usually, we'll want a bit more data; at least some price history is useful to get an insight into how the asset price has moved over time. Soon, we start calculating maxima and minima, ranges, averages, average true ranges, volume-weighted averages, and price history views. With more data, we can better understand how the price arrived at its current value. With enough data, maybe we can also speculate on where the price may go in the following hours, days, or months. We may use the asset price data to build and read candle patterns, and we may use it to create more or less complex indicators to ease our price history visualization. It is not uncommon to find ourselves sourcing dozens of months of historical price data for a single asset.

When it comes to pairs trading, we need at least double the data. We need to calculate the mean spread continuously, and we need yet more historical data if we want to check whether the pair relationship is strong enough and still holds. Maybe we want to calculate a dynamic standard deviation threshold based on the volatility. Thus, besides requiring more data, we also require more computational power for loading the data for both symbols and calculating the spread, the mean, the standard deviation threshold, and now, the volatility in real time.
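As a sketch of what that continuous computation involves for a single pair, here is an illustrative rolling z-score with a volatility-aware threshold. The data, names, and hedge ratio are all made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Two synthetic, related price series standing in for a real pair
a = pd.Series(100 + np.cumsum(rng.normal(0, 0.5, 1000)))
b = a * 0.5 + rng.normal(0, 0.5, 1000)

hedge_ratio = 2.0   # assumed, e.g. from a cointegration regression
window = 100        # rolling lookback in bars

spread = a - hedge_ratio * b
mean = spread.rolling(window).mean()
# dynamic threshold: rolling standard deviation tracks current volatility
std = spread.rolling(window).std()
zscore = (spread - mean) / std

# e.g. flag bars where the spread deviates more than 2 sigma from its mean
signal = zscore.abs() > 2.0
print(f"Bars flagged: {int(signal.sum())}")
```

All of this has to be refreshed on every new bar for every pair we monitor, which is where the computational cost starts to add up.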

The data and computational power required by pairs trading strategies are usually a fraction of those required by statistical arbitrage strategies. We have baskets composed of an arbitrary number of stocks for which we must calculate the cointegration vector and the respective portfolio weights; run a Rolling Windows Eigenvector Comparison to check the portfolio weights stability; run a Chow Test with a Cumulative Sum of Squares to anticipate structural breaks; and we must run all this computation in real time for live trade monitoring. Eventually, we need to run all this computation for several baskets simultaneously; that is, we need to request price data for dozens of symbols and process it as soon as possible to be able to react in time to alerts, breaks, and market disruptions.

When Jim Simons started running statistical arbitrage strategies in the late Eighties, the computational power required to process the huge amount of data they were sourcing on a daily basis was very expensive. The cost of equipment and of specialized personnel to write the software and keep the operation running was prohibitive for any retail trader, even a wealthy one. The high entry barrier for retail traders was not high-speed optical fiber connections or server colocation. High-frequency trading (HFT) was not yet a thing then.

The estimated cost of an RDBMS in the late Eighties would be around fourteen to fifteen thousand dollars for a perpetual license of Oracle, Ingres, or IBM. Corrected for inflation, this would amount to something like a hundred thousand dollars nowadays. We are not taking into account the personnel and the hardware; that is only the approximate cost of the software license. Statistical arbitrage was the private domain of institutional traders, mainly due to the processing power required for data analysis and the high cost of the tools used for it. But this changed dramatically in the following years, as we'll see below.

In a standard MetaQuotes demo account, we have approximately 9000 unique symbols and more than 20 timeframes.

Fig. 1. Screenshot showing the number of symbols available in a standard MetaQuotes demo account

Since statistical arbitrage is about finding relationships between assets' prices, ideally we would check all possible combinations of all available assets in all available timeframes. But looking for relationships in all possible combinations would result in a combinatorial explosion, so we select our pairs and baskets according to a scoring system.
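To put a number on that explosion, a quick back-of-the-envelope calculation with the standard library, using roughly the figures above (9000 symbols, 20 timeframes):

```python
from math import comb

n_symbols = 9000
n_timeframes = 20

pairs = comb(n_symbols, 2)        # unordered pairs of symbols
baskets_of_4 = comb(n_symbols, 4) # unordered baskets of four symbols

print(f"Pairs per timeframe: {pairs:,}")                     # 40,495,500
print(f"Pairs across timeframes: {pairs * n_timeframes:,}")  # ~810 million
print(f"Baskets of 4 per timeframe: {baskets_of_4:,}")       # in the trillions
```

Even testing only pairs on a single timeframe means tens of millions of cointegration tests, which is why a scoring system to prune the search space is unavoidable.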

Until now, we’ve been using SQLite only

For convenience, until now we've been using the Metatrader 5 built-in SQLite as our database. But don't get me wrong: along with the convenience comes the most used database in the world, a brilliant and trustworthy piece of software, battle-tested in the field and extremely reliable. Besides being part of Metatrader 5, thus avoiding additional installation and configuration, there is a native set of SQLite database functions to ease working with it inside the platform. Also, a native graphical user interface (GUI) is available inside the MetaEditor for querying and managing the Metatrader 5 built-in SQLite databases.

Fig. 2. Screenshot showing MetaEditor GUI interface to SQLite

These features made SQLite a perfect choice to start our statistical arbitrage framework for the average retail trader. It is reliable, scalable, built-in, has native MQL5 functions and a MetaEditor GUI, and is user-friendly inside Metatrader 5. But we can do better without getting rid of SQLite: we can use it for what it was designed for, and use a specialized database for our data analysis.

SQLite is an OLTP (Online Transaction Processing) system, a system designed to process small transactions very fast. SQLite does it pretty well, but when we talk about a specialized database for data analysis, we are talking about an OLAP (Online Analytical Processing) system, that is, a system designed to handle complex and math-heavy analysis over historical data, usually big datasets. Although their functions may overlap at some points (we can use an OLAP system to store and query transactional daily data, and we can use an OLTP system to analyse historical data, as we have been doing here), they are different systems.

For the differences between them to become visible, we would need to run our data analysis at full scale. Until now, we have been analysing a tiny fraction of the available history in our examples. Our lookback periods have usually been around two to three years at most; we never analysed ten or twenty years of historical data. We have been checking for cointegration in baskets of three or four symbols, never more than that. Also, we have limited our screening and scoring system to a single timeframe (H4), according to our initial cointegration hypothesis for the semiconductor industry.

However, if you are serious about your statistical arbitrage endeavour, even as a humble average retail trader, at some point you will face the need to analyse larger history periods in search of patterns beyond cointegration; or you may want to test baskets of five, maybe ten assets in many combinations across several timeframes, eventually including non-financial data (like freight fees or commodity production volumes, for example); and this analysis will probably be the basis of your live trade monitoring. Can you see where this is going?

As you start involving more symbols, more timeframes, and more history, you suddenly have your own big data to process as fast as possible. This kind of growth is the natural path in statistical arbitrage. It is expected that your data processing needs evolve like this, and you should be ready to explore specialized tools that can cope with the requirement to process more data. The good news is that in 2025 you do not need to sell the land to buy the shovel, so to speak. While some time ago you would have had to buy an expensive software license to get access to a high-end OLAP system, today you can have a high-quality system that is not only free, but also open-source: DuckDB.

DuckDB: a free and open-source OLAP system

There is a common misconception that statistical arbitrage trading strategies are a kind of high-frequency trading (HFT), or at least, dependent on high-speed connections and low-latency order routing. As a consequence, many people think statistical arbitrage has been dominated by institutional traders because of this supposed dependency on speed. This is only half-true.

Although many stat arb strategies are meant to be operated on lower timeframes (usually below 5m), this is not a requirement. It is perfectly normal to find correlations, cointegrations, and other statistical relationships on higher timeframes. Speed is important, and it is the core of many strategies, but it was not the only reason institutional players monopolized this trading domain until a few years ago. The main reason was the high cost of the computational power required for data analytics and its operational complexity.

To bring some perspective, a quick internet search shows that, just fifteen years ago, "the gap between a retail trader and a hedge fund was a 'Moat of Capital' built on proprietary OLAP hardware and software." In the mid-2000s, besides the high financial cost, there was a high technical entry barrier. That is, after investing something like $100,000 to $250,000 just for the software licenses that would allow you to query data in an OLAP system, in some cases you would also have to cope with a rack of specialized servers and an annual maintenance fee of around 20%. The technical entry barrier also involved pre-processing the OLAP cubes, since those old systems relied on multidimensional OLAP, and using Multidimensional Expressions (MDX) to manipulate and query these multidimensional aggregations. With those corporate-level systems, you could not use your good old SQL to scan a raw CSV or Parquet file, as we'll see below.

The gap between a retail trader and a hedge fund has narrowed a lot over the past fifteen years. While speed is still a concern and HFT remains forbidden territory for retail traders, we can now have a full-featured OLAP system running on our laptops. It doesn't matter whether we are running our analysis on a Mac, Windows, or Linux machine; we can use DuckDB by simply downloading a ~50MB executable and start running our big-dataset analysis like a pro in no time.

Data format compatibility

Arguably, one of DuckDB’s main features is the ability to talk to (and listen to) the most used databases, data frame engines, cloud storage, and file formats in the data analysis domain. This is achieved by means of extensions. Once you have a client installed, you can check the available extensions. DuckDB has an extension system maintained by the core team and complemented by community contributions. It allows us to read from and write to:

* dataframe engines like Pandas, Apache Arrow, NumPy, and Polars;

* databases like MySQL, Postgres, and SQLite;

* cloud storage services like Amazon S3, Google Cloud Storage, and Azure Blob Storage;

* CSV, JSON, Parquet, and Iceberg file formats.

At the time of writing, I’m not aware of any other open-source system with this level of flexibility.

Operational simplicity

DuckDB brings to our pipeline the same SQLite operational simplicity. Like SQLite, DuckDB is also an in-process, zero management, embedded database that not only makes our data analysis easier with increased performance, but also helps with interoperability between different data sources and file formats, making it easy to aggregate data from several distinct sources, as well as to move data from local to cloud and vice-versa.

OLAP vs. OLTP difference in practice

To better understand the difference between the SQLite OLTP system and DuckDB OLAP system, let’s see them operating over typical trading data of different magnitudes. For this purpose, we have a very simple three-step Python script that will generate some synthetic data for a couple of stock tickers (symbols) and run a common aggregation over it, calculating the average price per ticker. This will help us to see the architectural difference in action.

There are DuckDB client packages for several languages. To have it available in Python, we can just call pip:

pip install duckdb

It has no external dependencies, except on Windows, where it requires the C++ redistributable package of runtime library components.

Then we can just import it into our scripts like any other Python package.

import sqlite3
import duckdb
import pandas as pd
import numpy as np
import time

To illustrate the differences between SQLite and DuckDB for data analysis, we'll use a simplified benchmark on synthetic data. We start by generating one million rows of synthetic price and volume data for eight tickers: one million one-second "ticks" starting from the very first second of 2023 and totalling about 11.5 days. Note that this is barely half a month of data. In the real world, we should expect data orders of magnitude bigger than this.

def generate_data(filename="finance_data.csv"):
    print("Generating synthetic data...")
    tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA', 'NVDA', 'META', 'NFLX']
    n_rows = 1_000_000
    df = pd.DataFrame({
        'timestamp': pd.date_range(start='2023-01-01', periods=n_rows, freq='s'),
        'ticker': np.random.choice(tickers, n_rows),
        'price': np.random.uniform(100, 500, n_rows),
        'volume': np.random.randint(1, 1000, n_rows)
    })
    df.to_csv(filename, index=False)
    print(f"Created {filename} with {n_rows} rows.")

This will generate a one-million-row dataframe like this:

Table 1. Sample of the synthetic tabular data for benchmarking SQLite x DuckDB

After saving this tabular data as CSV, we ask an in-memory SQLite to calculate the average price for each ticker.

def benchmark_sqlite(filename):
    conn = sqlite3.connect(":memory:")
    cursor = conn.cursor()
    df = pd.read_csv(filename)
    df.to_sql("prices", conn, index=False)
    start_time = time.perf_counter()
    query = "SELECT ticker, AVG(price) FROM prices GROUP BY ticker"
    cursor.execute(query)
    results = cursor.fetchall()
    end_time = time.perf_counter()
    conn.close()
    return end_time - start_time

We submit the same CSV file to DuckDB to run the same calculation, but this time we do not need to load the data because DuckDB can query the CSV directly.

def benchmark_duckdb(filename):
    conn = duckdb.connect(database=':memory:')
    start_time = time.perf_counter()
    query = f"SELECT ticker, AVG(price) FROM '{filename}' GROUP BY ticker"
    results = conn.execute(query).fetchall()
    end_time = time.perf_counter()
    return end_time - start_time

if __name__ == "__main__":
    csv_file = "finance_data.csv"
    generate_data(csv_file)
    print("\nStarting Benchmarks...")
    sqlite_time = benchmark_sqlite(csv_file)
    print(f"SQLite execution time: {sqlite_time:.4f} seconds")
    duckdb_time = benchmark_duckdb(csv_file)
    print(f"DuckDB execution time: {duckdb_time:.4f} seconds")
    speedup = sqlite_time / duckdb_time
    print(f"\nDuckDB was {speedup:.1f}x faster than SQLite for this query.")

This script is attached here as bench.py. By running it for one million rows, you should generate a ~50MB CSV file. In a low-end notebook with only two cores, I got the following relation between SQLite time and DuckDB time.

Generating synthetic data...
Created finance_data.csv with 1000000 rows.

Starting Benchmarks...
SQLite execution time: 1.3408 seconds
DuckDB execution time: 1.3442 seconds

DuckDB was 1.0x faster than SQLite for this query.

Both systems were practically identical at this data scale. When we scaled the data ten times, running the same script for ten million rows and generating a ~500MB CSV file, we started to see the difference.

Generating synthetic data...
Created finance_data.csv with 10000000 rows.

Starting Benchmarks...
SQLite execution time: 18.0329 seconds
DuckDB execution time: 9.0836 seconds

DuckDB was 2.0x faster than SQLite for this query.

The rounded 2.0x figure is just a coincidence. What is certain, after running this script many times on three different machines, is that your results will show absolute execution times different from those here. Of course, these results are highly dependent on your machine specs, in particular the number of CPU cores. So do not focus on the absolute values. Pay attention to the relative values, the proportion between SQLite and DuckDB execution times as you change the volume of the data. I invite you to try changing it to one hundred million rows.

What happens here is that SQLite has zero startup time: when we ask it to process one million rows, it starts processing immediately. DuckDB, on the other hand, has a query optimizer and a vectorized execution engine that take some milliseconds (or seconds, on these low-end two-core machines) to spin up. Also, as a columnar storage system, DuckDB reads only the columns required to calculate the average price per ticker, that is, 'ticker' and 'price', while SQLite, as a row-oriented system, loads each full row with 'timestamp', 'ticker', 'price', and 'volume'. DuckDB physically ignores the unneeded columns on disk. So when we go from one million to ten million rows, DuckDB is reading significantly less data than SQLite.

Finally, SQLite processes one row at a time and is usually single-threaded for a single query, while DuckDB uses so-called vectorized execution, which sends data in batches to the CPU and, at the same time, uses all available cores for full parallelism. Again, as our data volume increases, so does the performance gap between the two systems. To be fair, and to see SQLite winning, we could ask it to query for a specific trade ID among those ten million rows; it would leave DuckDB in the dust, because that is SQLite's domain, the kind of workload it was designed for.

It is worth noting that here we are using a CSV file, but DuckDB’s ability to query a Parquet file or its own native .duckdb format would widen the gap even further. However, data analysis performance and overall speed are just one of the benefits we get by bringing a specialized OLAP system to our statistical arbitrage system. Another great benefit is the native temporal join.

ASOF JOIN

In statistical arbitrage, time is the primary key. That's not a figure of speech; it's literal. If you look at the SQLite database schema we've been developing in this series, you'll note that most of the tables have a timestamp as the primary key. Since most of our data are time series, this is a natural choice, making timestamps a kind of natural key of the domain, not only in statistical arbitrage but in trading as a whole.

Given this fact, suppose we need to use the two tables below to answer the following question: what was symbol A's price when symbol B was traded? Since symbol A's price is stored in the 'market_data' table and symbol B's trade timestamp is in the 'trade' table, we would need to perform a join between these two tables.

Fig. 3. Diagram showing the market_data and trade tables from schema-0.5

As you probably know, a join clause in SQL requires matching fields in the involved tables, otherwise we would get zero rows as a result. In this case, the matching field we are looking for is the timestamp. Our query would look something like this:

SELECT
    t.ticket,
    t.side,
    t.price AS trade_price,
    m.price_open,
    m.price_high,
    m.price_low,
    m.price_close,
    m.timeframe
FROM trade t
JOIN market_data m
    ON t.tstamp = m.tstamp
    AND t.symbol_id = m.symbol_id
WHERE t.tstamp = 1708531200  -- The timestamp of our reference trade
    AND m.timeframe = 'M1';  -- We need the timeframe because of the market_data composite primary key

The problem is that for this query to work, we would need a market data timestamp (m.tstamp) equal to the trade timestamp (t.tstamp) at the exact same second. In the example above, we would need a quote with the February 21, 2024, 4:00:00 PM UTC datetime. Exactly 4:00:00 PM UTC. As you may have already noted, it is unrealistic to expect trades and quotes to happen at the exact same second. If our system depended on queries like these, it would be a very fragile trading system, because most, if not all, queries would fail. Moreover, this is not what we want to know in practice. What we want to know is symbol A's last quote when symbol B was traded.

This can be addressed with sub-queries, or by using an ORDER BY with a LIMIT clause to find the most recent quote before or at the time of the trade.

SELECT t.*, m.*
FROM trade t
JOIN market_data m
    ON m.symbol_id = t.symbol_id
WHERE t.tstamp = 1708531200
    AND m.timeframe = 'M1'
    AND m.tstamp <= t.tstamp
ORDER BY m.tstamp DESC
LIMIT 1;

This standard SQL approach works in almost any database, including SQLite, but it has some drawbacks. The most critical one, again, is performance. To find the closest price by looking for the maximum timestamp less than or equal to the trade time, we are manually telling the database to sort all historical records for a symbol and pick the single most recent one, and this is expensive. For every single trade, the database must perform a search and a sort operation. If we are joining 1,000 trades against 1,000,000 quotes, which is a common scenario in data analysis, this can become extremely slow.

DuckDB and other specialized OLAP systems address this issue with the so-called ASOF JOIN, a temporal join designed for time series analysis in columnar databases, that allows us to join tables where timestamps don't match exactly. To show you the difference between making temporal joins with standard SQL sub-queries and using the specialized DuckDB ASOF JOIN, we have another simple benchmark script that will compare the performance of:

SQLite query using sub-query and LIMIT

sql_query = """
SELECT SUM(t.quantity * (
    SELECT q.bid_price
    FROM quotes q
    WHERE q.ticker = t.ticker
      AND q.timestamp <= t.timestamp
    ORDER BY q.timestamp DESC
    LIMIT 1
))
FROM trades t
"""

This script is attached here as bench_asof.py. Running it on a low-end machine, which is our target hardware, you should find a result similar to this.

— Generating 1,000,000 trades and 100,000 quotes —
— Running Engine-Only Benchmark (Computing SUM) —
DuckDB Time: 1.1637s
SQLite Time: 2.1329s

[WINNER]: DuckDB is 1.8x faster at this scale.

Again, as we increase the volume of the data, scaling it ten times, the gap increases as well. Remember, I'm deliberately running this benchmark on a low-end two-core machine. This choice is purposeful, to stay aligned with our main goal of developing this stat arb framework for the average retail trader, but at the same time it limits the native parallelism of DuckDB, which by default will use all available cores. Therefore, if you run your analysis on a modern desktop or notebook, you'll see the DuckDB times drop dramatically.

— Generating 10,000,000 trades and 1,000,000 quotes —
— Running Engine-Only Benchmark (Computing SUM) —
DuckDB Time: 8.6416s
SQLite Time: 22.2207s

[WINNER]: DuckDB is 2.6x faster at this scale.

The availability of ASOF JOINs is a game-changer for retail traders competing in the statistical arbitrage arena. Consider that the benchmark above uses a very simple example. As we add more features to the query, for example ask price, volume, or volatility, a standard SQL query without DuckDB's ASOF JOIN becomes hard to debug and maintain, and performance suffers even more, because for every single trade the database has to perform a separate lookup in the quotes table.

On the other hand, an OLAP system promotes temporal joins to first-class citizens and uses a specialized algorithm designed specifically for this point-in-time logic. The query is elegant and computationally efficient. DuckDB doesn't do a nested loop; this is where it applies the vectorized execution we mentioned above. It processes "chunks" of data through the CPU cache and can perform these joins on millions of rows in milliseconds on a standard laptop. And because DuckDB is columnar, it only reads the timestamp, symbol, and price columns rather than the entire dataset.

The vectorized execution really shines in the calculation of the Rolling Windows Eigenvector Comparison (RWEC) for portfolio rebalancing, which we described in Part 8 of this series.

RWEC

DuckDB has built-in array functions that can significantly improve the performance of our RWEC calculation, especially as our database grows. This is the Python method we previously used to calculate the cosine similarity between consecutive vectors:

def vector_similarity(self, vectors_df):
    """Compute cosine similarity between consecutive vectors"""
    similarities = []
    for i in range(1, len(vectors_df)):
        vec1 = vectors_df.iloc[i-1].values
        vec2 = vectors_df.iloc[i].values
        cos_sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
        angle_deg = np.degrees(np.arccos(np.clip(cos_sim, -1, 1)))
        similarities.append({
            'date': vectors_df.index[i],
            'cosine_similarity': cos_sim,
            'angle_degrees': angle_deg,
            'stable': angle_deg < 30
        })
    return pd.DataFrame(similarities).set_index('date')

There is a direct replacement for this Python logic among the DuckDB array functions: array_cosine_similarity(v1, v2). Note that to calculate the cosine similarity, we use the dot product of the two vectors divided by the product of their magnitudes, which is the exact mathematical definition of cosine similarity. To replicate our vector_similarity method in DuckDB, we can use the window function LAG to compare the current vector with the previous one, so DuckDB can process these array operations using SIMD (Single Instruction, Multiple Data) instructions, making it much faster than a Python loop as our repository of historical data and trades grows into a large dataset.

We must also take into account that DuckDB supports out-of-core execution, that is, it can process data larger than the available RAM, whereas our current pandas approach is limited by memory. To see the difference between row-based Python iteration and an OLAP system's vectorized engine, we can run a third benchmark, comparing the native Python/NumPy loop (used in our attached rwec.py script) against the DuckDB array_cosine_similarity function.

import duckdb
import pandas as pd
import numpy as np
import time

def generate_eigen_dataset(n_rows=1_000_000, vec_dim=2):
    """Generates a large set of synthetic eigenvectors for testing."""
    print(f"— Generating {n_rows:,} eigenvectors (dim={vec_dim}) —")
    data = np.random.randn(n_rows, vec_dim)
    norms = np.linalg.norm(data, axis=1, keepdims=True)
    normalized_vecs = data / norms
    df = pd.DataFrame({
        'date': pd.date_range(start='2000-01-01', periods=n_rows, freq='h'),
        'vec': list(normalized_vecs)
    })
    return df

In statistical arbitrage, the cointegration vector represents the hedge ratio between assets. An eigenvector like [1, -2] represents the exact same relationship as [10, -20]. They both describe a 1:2 ratio. If our rolling window calculation returns [1, -2] in one period and [10, -20] in the next, the relationship hasn't actually changed. The normalization ensures that our benchmark measures the direction (the hedge ratio) rather than the arbitrary scale of the output from the Johansen test.
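A quick numeric check of that scale-invariance claim, using plain NumPy (the vectors are illustrative, not real Johansen output):

```python
import numpy as np

def cosine(v1, v2):
    # Dot product divided by the product of magnitudes: the definition used throughout.
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = np.array([1.0, -2.0])
b = np.array([10.0, -20.0])  # same 1:2 hedge ratio, ten times the scale
c = np.array([1.0, -2.5])    # a genuinely different hedge ratio

sim_ab = cosine(a, b)
angle_ac = np.degrees(np.arccos(np.clip(cosine(a, c), -1, 1)))
print(sim_ab)    # 1.0 up to floating-point rounding: direction unchanged
print(angle_ac)  # a few degrees: the ratio actually moved
```

The cosine of [1, -2] against [10, -20] is exactly 1 (angle 0°), so a rescaled Johansen output never registers as a regime change, while a real shift in the hedge ratio produces a non-zero angle.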

def benchmark_rwec(df):
    con = duckdb.connect(":memory:")
    con.register("eigen_table", df)
    print("\n— Running RWEC Benchmark (Cosine Similarity + Angle) —")
    duck_query = """
        SELECT AVG(DEGREES(ACOS(inner_sim)))
        FROM (
            SELECT LEAST(GREATEST(array_cosine_similarity(
                vec::DOUBLE[2],
                LAG(vec::DOUBLE[2]) OVER (ORDER BY date)
            ), -1), 1) as inner_sim
            FROM eigen_table
        )
        WHERE inner_sim IS NOT NULL
    """
    start = time.perf_counter()
    duck_res = con.execute(duck_query).fetchone()[0]
    duck_time = time.perf_counter() - start
    print(f"DuckDB Time: {duck_time:.4f}s (Result Avg Angle: {duck_res:.2f}°)")

    start = time.perf_counter()
    similarities = []
    vecs = np.stack(df['vec'].values)
    for i in range(1, len(vecs)):
        vec1 = vecs[i-1]
        vec2 = vecs[i]
        cos_sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
        angle_deg = np.degrees(np.arccos(np.clip(cos_sim, -1, 1)))
        similarities.append(angle_deg)
    py_res = np.mean(similarities)
    py_time = time.perf_counter() - start
    print(f"Python Loop Time: {py_time:.4f}s (Result Avg Angle: {py_res:.2f}°)")
    print(f"\n[WINNER]: DuckDB is {py_time/duck_time:.1f}x faster for RWEC logic.")

if __name__ == "__main__":
    eigen_df = generate_eigen_dataset(n_rows=500_000)
    benchmark_rwec(eigen_df)

This script is attached here as bench_rwec.py. Running it on a low-end machine should produce results similar to these:

— Generating 500,000 eigenvectors (dim=2) —

— Running RWEC Benchmark (Cosine Similarity + Angle) —
DuckDB Time: 0.9603s (Result Avg Angle: 90.01°)
Python Loop Time: 21.9560s (Result Avg Angle: 90.01°)

[WINNER]: DuckDB is 22.9x faster for RWEC logic.

Unlike the other two benchmarks we saw above, here the difference is massive right from the start, even without scaling the input data beyond 500,000 two-dimensional eigenvectors. The array_cosine_similarity function replaces both the manual np.dot and the normalization (np.linalg.norm) we used in Python, and eliminating the Python loop (for i in range) avoids the overhead of the interpreter jumping back and forth between Python code and the underlying C libraries for each row. Also, as noted above, DuckDB processes these arrays in chunks: it doesn't calculate one angle at a time but in parallel batches, which is why we see such large performance improvements over our previous Python/SQLite algorithm.
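As an aside, it is the per-row interpreter overhead, not NumPy's math itself, that dominates the loop's cost. The same pairwise computation can be expressed as a single vectorized NumPy pass, which runs far faster than the explicit loop, though unlike DuckDB it still holds everything in RAM. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.standard_normal((500_000, 2))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors, as in the benchmark

# One vectorized pass: row-wise dot products of consecutive unit vectors.
cos_sims = np.einsum("ij,ij->i", vecs[:-1], vecs[1:])
angles = np.degrees(np.arccos(np.clip(cos_sims, -1, 1)))
print(f"Avg angle: {angles.mean():.2f}°")  # close to 90° for random 2-D directions
```

For random directions in the plane the average angle is 90°, which is why both engines in the benchmark above report an average near 90.01°.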

The above works for two-dimensional eigenvectors, that is, for two symbols. For baskets of three or more symbols, we can simply pass the vector dimension as a variable, since the ::DOUBLE[n] cast must match the actual number of assets in our cointegration vector. We can use a Python f-string to inject the dimension (vec_dim) directly into the query:

vec_dim = len(df['vec'].iloc[0])
duck_query = f"""
    SELECT AVG(DEGREES(ACOS(inner_sim)))
    FROM (
        SELECT LEAST(GREATEST(array_cosine_similarity(
            vec::DOUBLE[{vec_dim}],
            LAG(vec::DOUBLE[{vec_dim}]) OVER (ORDER BY date)
        ), -1), 1) as inner_sim
        FROM eigen_table
    )
    WHERE inner_sim IS NOT NULL
"""

This is important because DuckDB's array_cosine_similarity requires both input arrays to have the same fixed length. If we try to compare a DOUBLE[2] to a DOUBLE[3], the engine will throw a binder error.

Conclusion

In this final article of our introductory series on statistical arbitrage through cointegrated stocks, we suggested adding a specialized database for data analysis to our pipeline: the free, open-source, and institutional-grade DuckDB OLAP system.

We pointed out that until a few years ago, the most relevant entry barrier for retail traders in the statistical arbitrage space was computational power for real-time data analysis, not execution speed. To support this point, we presented some estimates of the cost of similar systems around the year 2010, along with a brief description of the technical complexities involved in their implementation.

Finally, to show how our real-time data analysis step could benefit from the OLAP system adoption, we presented three benchmarks comparing the performance of DuckDB and SQLite for the most used calculations in our system: a simple aggregation for average price calculation, a two-table join for historical quotes comparison, and an RWEC calculation using native DuckDB functions instead of a Python nested loop.

The benchmark results on a low-end machine range from 2x to 23x faster for the DuckDB calculations, with the gap widening as the volume of sample data increases. This confirms the main reason we suggest adding this specialized system to our pipeline: to make it ready for growth and future-proof for large datasets.

Read more on mql5.com

This news is powered by mql5.com


© Market Alert News. All Rights Reserved.