
In the previous article, we examined in detail the theoretical aspects of the hybrid trading system StockFormer, which combines predictive coding and reinforcement learning algorithms to forecast market trends and the dynamics of financial assets. StockFormer is a hybrid framework that brings together several key technologies and approaches to address complex challenges in financial markets. Its core feature is the use of three modified Transformer branches, each responsible for capturing different aspects of market dynamics. The first branch extracts hidden interdependencies between assets, while the second and third focus on short-term and long-term forecasting, enabling the system to account for both current and future market trends.
The integration of these branches is achieved through a cascade of attention mechanisms built on multi-head blocks, which improves the model's ability to detect latent patterns in the data. As a result, the system can not only analyze and predict trends based on historical data but also take into account dynamic relationships between various assets. This is particularly important for developing trading strategies capable of adapting to rapidly changing market conditions.
The original visualization of the StockFormer framework is provided below.
In the practical section of the previous article, we implemented the algorithms of the Diversified Multi-Head Attention (DMH-Attn) module, which serves as the foundation for enhancing the standard attention mechanism in the Transformer model. DMH-Attn significantly improves the efficiency of detecting diverse patterns and interdependencies in financial time series, which is especially valuable when working with noisy and highly volatile data.
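To make the idea behind DMH-Attn easier to follow, here is a deliberately simplified numpy sketch. It is not the article's MQL5 implementation: attention is computed per head on a slice of the feature channels, and, in the spirit of the "diversified" design, each head gets its own small feed-forward block instead of one shared FFN. All weights are random placeholders, and the slicing scheme is a simplification of the real module.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diversified_mh_attention(x, heads, rng):
    """Toy DMH-Attn: per-head attention over a channel slice, followed by
    a separate small FFN per head (the 'diversified' part). Weights are
    random stand-ins for trained parameters."""
    seq, dim = x.shape
    d_head = dim // heads
    out = np.empty_like(x)
    for h in range(heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        q = x[:, sl] @ rng.standard_normal((d_head, d_head))
        k = x[:, sl] @ rng.standard_normal((d_head, d_head))
        v = x[:, sl] @ rng.standard_normal((d_head, d_head))
        attn = softmax(q @ k.T / np.sqrt(d_head)) @ v
        # per-head FFN: each head refines only its own channel slice
        w1 = rng.standard_normal((d_head, 2 * d_head))
        w2 = rng.standard_normal((2 * d_head, d_head))
        out[:, sl] = np.maximum(attn @ w1, 0.0) @ w2
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8))   # 10 bars, 8 features per bar
y = diversified_mh_attention(x, heads=4, rng=rng)
print(y.shape)  # (10, 8)
```

The point of the per-head FFNs is that different heads can specialize in different patterns of the noisy series instead of being averaged through one shared projection.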
In this article, we will continue the work by focusing on the architecture of different parts of the model and the mechanisms of their interaction in creating a unified state space. Additionally, we will examine the process of training the decision-making Agent’s trading policy.
Predictive Coding Models
We begin with predictive coding models. The authors of the StockFormer framework proposed using three predictive models. One is designed to identify dependencies within the data describing the dynamics of the analyzed financial assets. The other two are trained to forecast the upcoming movements of the multimodal time series under study, each with a different planning horizon.
All three models are based on the Encoder-Decoder Transformer architecture, utilizing modified DMH-Attn modules. In our implementation, the Encoder and Decoder will be created as separate models.
Dependency Search Models
The architecture of the dependency search models for time series of financial assets is defined in the method CreateRelationDescriptions.
bool CreateRelationDescriptions(CArrayObj *&encoder, CArrayObj *&decoder)
  {
   CLayerDescription *descr;
//---
   if(!encoder)
     {
      encoder = new CArrayObj();
      if(!encoder)
         return false;
     }
   if(!decoder)
     {
      decoder = new CArrayObj();
      if(!decoder)
         return false;
     }
The method’s parameters include pointers to two dynamic arrays, into which we must pass the architecture descriptions of the Encoder and Decoder. Inside the method, we check the validity of the received pointers and, if necessary, create new instances of the dynamic array objects.
For the first layer of the Encoder, we use a fully connected layer of sufficient size to accept all tensor data from the raw input.
Recall that the Encoder receives historical data across the full depth of the analyzed history.
   encoder.Clear();
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronBaseOCL;
   int prev_count = descr.count = (HistoryBars * BarDescr);
   descr.activation = None;
   descr.optimization = ADAM;
   if(!encoder.Add(descr))
     {
      delete descr;
      return false;
     }
The raw data originates from the trading terminal. As one might expect, the multimodal time series data, comprising indicators and possibly multiple financial instruments, belongs to different distributions. Therefore, we first preprocess the input data using a batch normalization layer.
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronBatchNormOCL;
   descr.count = prev_count;
   descr.batch = 1e4;
   descr.activation = None;
   descr.optimization = ADAM;
   if(!encoder.Add(descr))
     {
      delete descr;
      return false;
     }
The StockFormer authors suggest randomly masking up to 50% of the input data during training of the dependency search models. The model must reconstruct the masked data based on the remaining information. In our Encoder, this masking is handled by a Dropout layer.
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronDropoutOCL;
   descr.count = prev_count;
   descr.batch = 1e4;
   descr.activation = None;
   descr.optimization = ADAM;
   descr.probability = 0.5f;
   if(!encoder.Add(descr))
     {
      delete descr;
      return false;
     }
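The masking idea itself is simple and worth seeing in isolation. The numpy sketch below is a hypothetical helper, not the library's Dropout layer: roughly half of the input values are zeroed at random, and during training the model is asked to reconstruct exactly those zeroed positions from the surviving ones.

```python
import numpy as np

def mask_inputs(x, p=0.5, rng=None):
    """Zero out roughly a fraction p of the input values. Returns the
    masked tensor and the keep-mask; the reconstruction loss would be
    computed on the masked (dropped) positions."""
    rng = rng or np.random.default_rng()
    keep = rng.random(x.shape) >= p
    return x * keep, keep

rng = np.random.default_rng(7)
x = rng.standard_normal((48, 12))        # 48 bars x 12 features
masked, keep = mask_inputs(x, 0.5, rng)
print(round(keep.mean(), 2))             # roughly half of the values survive
```

Note that a real Dropout layer also rescales the kept values by 1/(1-p); here it is used purely as a masking mechanism, so the scaling detail is omitted.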
Following this, we add a learnable positional encoding layer.
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronLearnabledPE;
   descr.count = prev_count;
   descr.batch = 1e4;
   descr.activation = None;
   descr.optimization = ADAM;
   if(!encoder.Add(descr))
     {
      delete descr;
      return false;
     }
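Conceptually, a learnable positional encoding is just one trainable offset per element of the flattened input, added to the data. The sketch below illustrates only the forward pass (the gradient update is omitted); the sequence sizes used here are made-up placeholders, not the EA's actual HistoryBars and BarDescr values.

```python
import numpy as np

class LearnablePE:
    """Minimal learnable positional encoding: a trainable offset per
    position, added to the input. In training these offsets would be
    updated by backpropagation like any other weight."""
    def __init__(self, length, rng):
        self.pe = 0.02 * rng.standard_normal(length)   # trainable parameters

    def forward(self, x):
        return x + self.pe

rng = np.random.default_rng(0)
pe = LearnablePE(20 * 9, rng)            # e.g. 20 bars x 9 features (illustrative)
x = rng.standard_normal(20 * 9)
y = pe.forward(x)
print(y.shape)  # (180,)
```

Unlike fixed sinusoidal encodings, these offsets adapt to whatever positional structure the training data actually exhibits.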
The Encoder concludes with a diversified multi-head attention module consisting of three nested layers.
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronDMHAttention;
   descr.window = BarDescr;
   descr.window_out = 32;
   descr.count = HistoryBars;
   descr.step = 4;
   descr.layers = 3;
   descr.batch = 1e4;
   descr.activation = None;
   descr.optimization = ADAM;
   if(!encoder.Add(descr))
     {
      delete descr;
      return false;
     }
The input to the Decoder in the dependency search model is the same multimodal time series, with identical masking and positional encoding applied. Thus, most of the Encoder and Decoder architectures are identical. The key difference is that we replace the diversified multi-head attention module with a cross-attention module, which aligns the data streams of the Decoder and Encoder.
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronCrossDMHAttention;
     {
      int temp[] = {BarDescr, BarDescr};
      if(ArrayCopy(descr.windows, temp) < (int)temp.Size())
         return false;
     }
   descr.window_out = 32;
     {
      int temp[] = {prev_count / descr.windows[0], HistoryBars};
      if(ArrayCopy(descr.units, temp) < (int)temp.Size())
         return false;
     }
   descr.step = 4;
   descr.layers = 3;
   descr.batch = 1e4;
   descr.activation = None;
   descr.optimization = ADAM;
   if(!decoder.Add(descr))
     {
      delete descr;
      return false;
     }
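The essence of cross-attention, stripped of the diversified multi-head machinery, is that queries come from one stream (the Decoder) while keys and values come from the other (the Encoder output), so every Decoder position gathers information from the whole encoded sequence. A single-head numpy sketch with random placeholder weights:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec, enc, d_k, rng):
    """Single-head cross-attention sketch: queries from the Decoder
    stream, keys/values from the Encoder output. Projection weights are
    random stand-ins for trained parameters."""
    wq = rng.standard_normal((dec.shape[1], d_k))
    wk = rng.standard_normal((enc.shape[1], d_k))
    wv = rng.standard_normal((enc.shape[1], d_k))
    scores = (dec @ wq) @ (enc @ wk).T / np.sqrt(d_k)
    return softmax(scores) @ (enc @ wv)

rng = np.random.default_rng(3)
decoder_in = rng.standard_normal((5, 9))    # 5 Decoder positions, 9 features
encoder_out = rng.standard_normal((20, 9))  # 20 encoded bars
out = cross_attention(decoder_in, encoder_out, d_k=32, rng=rng)
print(out.shape)   # (5, 32): one context vector per Decoder position
```

This also explains the two sequence lengths passed to descr.units above: the first is the Decoder's own sequence, the second is the Encoder sequence being attended to.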
Since the Decoder's output will be compared against the original input data, we finalize the model with a reverse normalization layer.
   prev_count = descr.units[0] * descr.windows[0];
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronRevInDenormOCL;
   descr.count = prev_count;
   descr.layers = 1;
   descr.activation = None;
   descr.optimization = ADAM;
   if(!decoder.Add(descr))
     {
      delete descr;
      return false;
     }
//---
   return true;
  }
Prediction Models
Both prediction models, despite having different planning horizons, share the same architecture, which is defined in the method CreatePredictionDescriptions. It is worth noting that the Encoder is designed to receive the same multimodal time series previously analyzed by the dependency search model. Therefore, we fully reuse the Encoder architecture, with the exception of the Dropout layer, since input masking is not applied during the training of prediction models.
The Decoder of the prediction model receives as input only the feature vector of the last bar, whose values are passed through a fully connected layer.
   decoder.Clear();
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronBaseOCL;
   prev_count = descr.count = (BarDescr);
   descr.activation = None;
   descr.optimization = ADAM;
   if(!decoder.Add(descr))
     {
      delete descr;
      return false;
     }
As in the models described earlier, this is followed by a batch normalization layer, which we use for the initial preprocessing of raw input data.
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronBatchNormOCL;
   descr.count = prev_count;
   descr.batch = 1e4;
   descr.activation = None;
   descr.optimization = ADAM;
   if(!decoder.Add(descr))
     {
      delete descr;
      return false;
     }
In this article, we focus on training the model to analyze historical data for a single financial instrument. Given this, having only a single-bar description vector in the input data minimizes the effectiveness of positional encoding. For this reason, we omit it here. However, when analyzing multiple financial instruments, it is recommended to add positional encoding to the input data.
Next comes a three-layer diversified multi-head cross-attention module, which uses the corresponding Encoder's output as its second source of information.
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronCrossDMHAttention;
     {
      int temp[] = {BarDescr, BarDescr};
      if(ArrayCopy(descr.windows, temp) < (int)temp.Size())
         return false;
     }
   descr.window_out = 32;
     {
      int temp[] = {1, HistoryBars};
      if(ArrayCopy(descr.units, temp) < (int)temp.Size())
         return false;
     }
   descr.step = 4;
   descr.layers = 3;
   descr.batch = 1e4;
   descr.activation = None;
   descr.optimization = ADAM;
   if(!decoder.Add(descr))
     {
      delete descr;
      return false;
     }
At the model's output, we add a fully connected projection layer without an activation function.
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronBaseOCL;
   descr.count = BarDescr;
   descr.activation = None;
   descr.optimization = ADAM;
   if(!decoder.Add(descr))
     {
      delete descr;
      return false;
     }
//---
   return true;
  }
Two important points should be emphasized here. First, unlike traditional models that predict expected values of the continuation of the analyzed time series, the authors of the StockFormer framework propose predicting change coefficients of the indicators. This means that the size of the output vector matches the input tensor of the Decoder, regardless of the planning horizon. Such an approach allows us to eliminate the reverse normalization layer at the Decoder's output. Moreover, in this prediction setup, reverse normalization becomes redundant, since change coefficients and raw indicators belong to different distributions.
Second, regarding the use of a fully connected layer at the Decoder's output. As mentioned earlier, we are analyzing a multimodal time series of a single financial instrument, so we expect the unitary sequences under analysis to exhibit varying degrees of correlation, and their change coefficients must be aligned with each other. A fully connected layer is appropriate in this case. If, however, you plan to perform parallel analysis of multiple financial instruments, it is advisable to replace the fully connected layer with a convolutional one, enabling independent prediction of change coefficients for each asset.
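To make the "change coefficients" target concrete, here is one plausible definition as a numpy sketch: the relative change of each feature over the planning horizon. The exact formula used by the StockFormer authors may differ; the key property illustrated here is that the target vector has the same size as the per-bar feature vector, whatever the horizon.

```python
import numpy as np

def change_coefficients(last_bar, future_bar):
    """Hypothetical prediction target: relative change of each feature
    over the planning horizon. Output size equals the per-bar feature
    count regardless of how far ahead future_bar lies."""
    denom = np.where(last_bar == 0.0, 1.0, np.abs(last_bar))
    return (future_bar - last_bar) / denom

last = np.array([2.0, 4.0, -5.0])
future_short = np.array([2.2, 3.0, -5.0])   # one step ahead
future_long = np.array([3.0, 2.0, -4.0])    # many steps ahead
print(change_coefficients(last, future_short))  # same shape as last ...
print(change_coefficients(last, future_long))   # ... for any horizon
```

Because both short- and long-horizon targets live in the same normalized "relative change" space, the two prediction Decoders can share an identical architecture.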
This concludes our review of the predictive coding model architectures. A full description of their design can be found in the appendix.
Training Predictive Coding Models
In the StockFormer framework, the training of predictive coding models is implemented as a dedicated stage. After reviewing the architectures of the predictive models, we now turn to constructing an Expert Advisor for their training. The EA's base methods are largely borrowed from similar programs discussed in previous articles of this series. Therefore, in this article, we will focus primarily on the direct training algorithm, organized in the Train method.
First, we will do a little preparatory work. Here, we form a probability vector for selecting trajectories from the experience replay buffer, assigning higher probabilities to those with maximum profitability. In this way, we bias the training process toward profitable runs, filling it with positive examples.
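The weighting idea can be sketched in a few lines. The function below is a conceptual Python stand-in for the library's GetProbTrajectories, not its actual implementation: a softmax over the trajectories' total returns, so the most profitable passes are replayed most often while unprofitable ones are never excluded entirely.

```python
import numpy as np

def trajectory_probabilities(returns, temperature=1.0):
    """Sketch of profit-biased trajectory sampling: a softmax over
    standardized trajectory returns (the library's exact weighting
    may differ)."""
    r = np.asarray(returns, dtype=float)
    r = (r - r.mean()) / (r.std() + 1e-8)      # scale-invariant ranking
    e = np.exp(r / temperature)
    return e / e.sum()

returns = [120.0, -40.0, 15.0, 300.0]
p = trajectory_probabilities(returns)
print(p.argmax())   # 3: the most profitable trajectory is sampled most often
```

Lowering the temperature sharpens the bias toward the best trajectories; raising it moves sampling back toward uniform.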
void Train(void)
  {
   vector probability = GetProbTrajectories(Buffer, 0.9);
   vector result, target, state;
   matrix predict;
   bool Stop = false;
//---
   uint ticks = GetTickCount();
At this stage, we also declare the necessary local variables used to store intermediate data during training. After completing the preparation, we initiate the training iteration loop. The total number of iterations is defined in the EA's external parameters.
   for(int iter = 0; (iter < Iterations && !IsStopped() && !Stop); iter++)
     {
      int tr = SampleTrajectory(probability);
      int i = (int)((MathRand() * MathRand() / MathPow(32767, 2)) * (Buffer[tr].Total - 2 - NForecast));
      if(i <= 0)
        {
         iter--;
         continue;
        }
      ...
      if(GetTickCount() - ticks > 500)
        {
         double percent = double(iter) * 100.0 / (Iterations);
         string str = StringFormat("%-14s %6.2f%% -> Error %.8f\n", "Relate", percent, RelateDecoder.getRecentAverageError());
         str += StringFormat("%-14s %6.2f%% -> Error %.8f\n", "Short", percent, ShortDecoder.getRecentAverageError());
         str += StringFormat("%-14s %6.2f%% -> Error %.8f\n", "Long", percent, LongDecoder.getRecentAverageError());
         Comment(str);
         ticks = GetTickCount();
        }
     }
Upon completion of all training iterations, we clear the comments field on the chart (previously used to display training updates).
   Comment("");
//---
   PrintFormat("%s -> %d -> %-15s %.7f", __FUNCTION__, __LINE__, "Relate", RelateDecoder.getRecentAverageError());
   PrintFormat("%s -> %d -> %-15s %.7f", __FUNCTION__, __LINE__, "Short", ShortDecoder.getRecentAverageError());
   PrintFormat("%s -> %d -> %-15s %.7f", __FUNCTION__, __LINE__, "Long", LongDecoder.getRecentAverageError());
   ExpertRemove();
  }
We print the results in the journal and initiate the termination of the EA operation.
The full source code of the predictive model training EA can be found in the attachment (file: "...\MQL5\Experts\StockFormer\Study1.mq5").
Finally, it should be noted that during model training for this article, we used the same input data structure as in previous works. Importantly, predictive model training relies solely on environment states that are independent of the Agent’s actions. Therefore, training can be launched using a pre-collected dataset. We now move on to the next stage of our work.
Policy Training
While the predictive models are being trained, we turn to the next stage – training the Agent behavior policy.
Model Architecture
We begin by preparing the architectures of the models used in this stage, as defined in the CreateDescriptions method. It is important to note that in the StockFormer framework, both the Actor and the Critic take as input the outputs of the predictive models, which are combined into a unified subspace by a cascade of attention modules. In our library, a single model can consume at most two data sources, so we split the attention cascade into two separate models. The first model aligns the data from the two planning horizons. The authors recommend using the long-term planning data as the main stream, as it is less sensitive to noise.
The architecture of the two-horizon alignment model is straightforward. Here we create two layers:
   long_short.Clear();
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronBaseOCL;
   int prev_count = descr.count = (BarDescr);
   descr.activation = None;
   descr.optimization = ADAM;
   if(!long_short.Add(descr))
     {
      delete descr;
      return false;
     }
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronCrossDMHAttention;
     {
      int temp[] = {BarDescr, BarDescr};
      if(ArrayCopy(descr.windows, temp) < (int)temp.Size())
         return false;
     }
   descr.window_out = 32;
     {
      int temp[] = {1, 1};
      if(ArrayCopy(descr.units, temp) < (int)temp.Size())
         return false;
     }
   descr.step = 4;
   descr.layers = 3;
   descr.batch = 1e4;
   descr.activation = None;
   descr.optimization = ADAM;
   if(!long_short.Add(descr))
     {
      delete descr;
      return false;
     }
No normalization layer is used here, as the model input is the output of previously trained prediction models, not raw data.
The results of the two-horizon alignment are then enriched with information about the current environment state, obtained from the Encoder of the dependency search model applied to the input data.
Recall that the dependency search model was trained to reconstruct masked portions of the input data. At this stage, we expect each unitary time series to have a predictive state representation formed on the basis of the other unitary sequences. The Encoder output is therefore a denoised tensor of the environment state, in which outliers that do not fit the model's expectations are compensated by statistical values derived from the other sequences.
The architecture of the model that enriches predictions with environment state information closely mirrors the two-horizon alignment model. The only difference is that we change the sequence length of the second data source.
   predict_relate.Clear();
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronBaseOCL;
   prev_count = descr.count = (BarDescr);
   descr.activation = None;
   descr.optimization = ADAM;
   if(!predict_relate.Add(descr))
     {
      delete descr;
      return false;
     }
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronCrossDMHAttention;
     {
      int temp[] = {BarDescr, BarDescr};
      if(ArrayCopy(descr.windows, temp) < (int)temp.Size())
         return false;
     }
   descr.window_out = 32;
     {
      int temp[] = {1, HistoryBars};
      if(ArrayCopy(descr.units, temp) < (int)temp.Size())
         return false;
     }
   descr.step = 4;
   descr.layers = 3;
   descr.batch = 1e4;
   descr.activation = None;
   descr.optimization = ADAM;
   if(!predict_relate.Add(descr))
     {
      delete descr;
      return false;
     }
After constructing the attention cascade that combines the outputs of the three predictive models into a unified subspace, we proceed to build the Actor. The input to the Actor model is the output of the attention cascade.
   actor.Clear();
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronBaseOCL;
   prev_count = descr.count = (BarDescr);
   descr.activation = None;
   descr.optimization = ADAM;
   if(!actor.Add(descr))
     {
      delete descr;
      return false;
     }
The predictive expectations are combined with account state information.
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronConcatenate;
   descr.count = LatentCount;
   descr.window = prev_count;
   descr.step = AccountDescr;
   descr.activation = LReLU;
   descr.optimization = ADAM;
   if(!actor.Add(descr))
     {
      delete descr;
      return false;
     }
This combined information is passed through a decision-making block implemented as an MLP with a stochastic output head.
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronBaseOCL;
   descr.count = LatentCount;
   descr.activation = LReLU;
   descr.optimization = ADAM;
   descr.probability = Rho;
   if(!actor.Add(descr))
     {
      delete descr;
      return false;
     }
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronBaseOCL;
   descr.count = 2 * NActions;
   descr.activation = None;
   descr.optimization = ADAM;
   descr.probability = Rho;
   if(!actor.Add(descr))
     {
      delete descr;
      return false;
     }
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronVAEOCL;
   descr.count = NActions;
   descr.optimization = ADAM;
   if(!actor.Add(descr))
     {
      delete descr;
      return false;
     }
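The VAE-style layer at the end of this block implements the reparameterization trick: the preceding fully connected layer emits 2 * NActions values, and the stochastic head samples actions from the resulting distributions while keeping the sampling differentiable. In the numpy sketch below the second half is interpreted as log-variances; the exact parameterization used by the library is an assumption.

```python
import numpy as np

def stochastic_head(params, rng):
    """Reparameterization trick: params holds 2*n values, read here as
    n means followed by n log-variances (parameterization assumed);
    the sample is mu + sigma * eps with eps ~ N(0, 1)."""
    n = params.size // 2
    mu, log_var = params[:n], params[n:]
    eps = rng.standard_normal(n)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(5)
# with a vanishingly small variance the sample collapses to the mean
actions = stochastic_head(np.array([0.3, 0.7, -100.0, -100.0]), rng)
print(np.round(actions, 4))   # close to [0.3, 0.7]
```

During training the variance keeps exploration alive; as the policy becomes confident, the learned variances can shrink and the head behaves almost deterministically.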
At the model's output, trade parameters for each direction are adjusted using a convolutional layer with a sigmoid activation function.
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronConvOCL;
   descr.count = NActions / 3;
   descr.window = 3;
   descr.step = 3;
   descr.window_out = 3;
   descr.activation = SIGMOID;
   descr.optimization = ADAM;
   descr.probability = Rho;
   if(!actor.Add(descr))
     {
      delete descr;
      return false;
     }
The Critic has a similar architecture, but instead of account state, it analyzes the Agent actions. Its output does not use a stochastic head. The full architecture of all models is available in the appendix.
Policy Training Procedure
Once the model architectures are defined, we organize the training algorithms. The second stage involves finding the optimal Agent behavior strategy to maximize returns while minimizing risk.
As before, the training method begins with preparation. We generate a probability vector for selecting trajectories from the experience replay buffer based on their performance and declare the necessary local variables.
void Train(void)
  {
   vector probability = GetProbTrajectories(Buffer, 0.9);
   vector result, target, state;
   bool Stop = false;
//---
   uint ticks = GetTickCount();
We then enter the training loop, with the number of iterations set by the EA’s external parameters.
   for(int iter = 0; (iter < Iterations && !IsStopped() && !Stop); iter++)
     {
      int tr = SampleTrajectory(probability);
      int i = (int)((MathRand() * MathRand() / MathPow(32767, 2)) * (Buffer[tr].Total - 2 - NForecast));
      if(i <= 0)
        {
         iter--;
         continue;
        }
      ...
      if(GetTickCount() - ticks > 500)
        {
         double percent = double(iter) * 100.0 / (Iterations);
         string str = StringFormat("%-14s %6.2f%% -> Error %.8f\n", "Actor", percent, Actor.getRecentAverageError());
         str += StringFormat("%-14s %6.2f%% -> Error %.8f\n", "Critic", percent, Critic.getRecentAverageError());
         Comment(str);
         ticks = GetTickCount();
        }
     }
Upon completion of all training iterations, we clear the chart comments, log the results in the journal, and initiate program termination, just as in the first training stage.
   Comment("");
//---
   PrintFormat("%s -> %d -> %-15s %.7f", __FUNCTION__, __LINE__, "Actor", Actor.getRecentAverageError());
   PrintFormat("%s -> %d -> %-15s %.7f", __FUNCTION__, __LINE__, "Critic", Critic.getRecentAverageError());
   ExpertRemove();
  }
It should be noted that algorithm adjustments affected not only the model training Expert Advisors but also the environment interaction EAs. These adjustments largely mirror the Actor's feed-forward pass described above, so we will not go into their detailed logic here; I encourage you to explore the implementations independently. The full source code for all programs used in this article is included in the attachment.
Testing
We have completed the extensive implementation of the StockFormer framework using MQL5 and have reached the final stage of our work – training the models and evaluating their performance on real historical data.
As previously mentioned, the initial stage of training the predictive models utilized a dataset collected in earlier studies. This dataset comprises EURUSD historical data for the entire year of 2023, on the H1 timeframe. All indicator parameters were set to their default values.
During predictive model training, we use only historical data describing the environment state, which is independent of the Agent's behavior. This allows us to train the models without updating the training dataset. The training process continues until the errors stabilize within a narrow range.
The second training stage – optimizing the Actor’s behavior policy – is performed iteratively, with periodic updates to the training dataset to reflect the current policy.
We evaluate the performance of the trained model using the MetaTrader 5 Strategy Tester on historical data from January 2024. This period immediately follows the training dataset period. The results are presented below.
During the testing period, the model executed 15 trades, 10 of which closed in profit, a success rate of over 66%. Quite a good result. Notably, the average profitable trade is four times larger than the average loss, which produces a clear upward trend in the balance chart.
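A quick back-of-envelope check shows why this combination of statistics is attractive. Taking the average loss as 1 unit (so the average win is 4 units, per the reported ratio), the expectancy per trade works out as follows:

```python
# Expectancy check for the reported test statistics:
# 15 trades, 10 winners, average win = 4x average loss (loss = 1 unit).
wins, total = 10, 15
payoff_ratio = 4.0
win_rate = wins / total
expectancy = win_rate * payoff_ratio - (1 - win_rate) * 1.0
print(round(win_rate * 100, 1))   # 66.7 percent of trades profitable
print(round(expectancy, 2))       # 2.33 units of expected profit per trade
```

A positive expectancy of over two loss-units per trade is well clear of break-even, although a 15-trade sample is far too small to be statistically conclusive.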
Conclusion
Across these two articles, we explored the StockFormer framework, which offers an innovative approach to training trading strategies for financial markets. StockFormer combines predictive coding with reinforcement learning, enabling the development of flexible policies that capture dynamic dependencies among multiple assets and forecast their behavior both in the short and long term.
The three-branch predictive coding structure in StockFormer allows the extraction of latent representations reflecting short-term trends, long-term changes, and inter-asset relationships. Integration of these representations is achieved via a cascade of multi-head attention modules, creating a unified state space for optimizing trading decisions.
In the practical part, we implemented the key components of the framework in MQL5, trained the models, and tested them on real historical data. The experimental results confirm the effectiveness of the proposed approaches. Nevertheless, applying these models in live trading requires training on a larger historical dataset and comprehensive further testing.
References
Programs used in the article

