In this article, we design and train our neural network, and I explain the basic principles of neural networks along the way.
This is the second article in my series on developing neural network-based pattern recognition indicators for MetaTrader5. Here, you can find an introduction to the series. In the first article, I describe the process of collecting data for neural network model training using an MT5 script. In the third article, I describe how you can use your model as an MT5 indicator with the DT-Box-Inference tool. In the following article, you can find instructions on installing and configuring DT-Box-Inference and the necessary MT5 files. Finally, here you can find the source code of DT-Box-Inference.
All the Python code used in this article can be found in the notebook DT-ModelTraining.ipynb, which you can run in Google Colab. I recommend that you follow along with the examples as you read to gain a better understanding of the process. To do this, open the notebook in your Colab environment and upload the test and training data files into the files section:
If you haven’t used Google Colab before, here you can find an overview with some examples.
Data Normalization
Before we dive into the training process, let’s first discuss the normalization of the data we have collected. Normalization is the process of transforming data (in our case, 42-bar datasets) into a uniform scale. This ensures that no single pattern (or dataset) dominates the learning process due to differences in scale.
Let me demonstrate this visually: let’s take five random patterns from our dataset of class 1 and display them on a chart. Here is how the first pattern looks:
But here is how all five look on the same scale:
Clearly, to make all datasets (patterns) appear similar, we need to somehow transform the data to remove differences in scale. The primary methods for normalization and transformation of data, especially in fields such as finance, include the following:
Percentage Change or Returns (Log Returns): This method calculates the change in value relative to a previous value, often expressed as a percentage. Log returns, in particular, are useful for analyzing continuous compounding and offer the advantage of additivity over time.
Min-Max Scaling: This technique scales data to a specific range (typically [0, 1] or [-1, 1]) by adjusting values relative to their minimum and maximum values in the dataset. It ensures that all features have a uniform scale, preventing any single feature from dominating due to its range.
Standardization (Z-score Normalization): This method transforms data to have a mean of 0 and a standard deviation of 1, centering and scaling the data to a standard distribution. It is useful for features with different units or magnitudes, making them comparable.
Indexing Against a Base: This technique involves expressing data relative to a baseline value (e.g., an initial value or a specific reference point). It shows how the data has changed over time or relative to the baseline, making it useful for tracking changes or comparing trends.
In order to choose the best approach, let’s apply all of them to our training data and compare the results visually:
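Here is a minimal sketch of the four transforms, applied per pattern; p is assumed to be a one-dimensional NumPy array of 42 prices, and the actual notebook cell may differ in details:

```python
import numpy as np

def log_returns(p):
    # Change relative to the previous value (output is one element shorter)
    return np.diff(np.log(p))

def min_max(p):
    # Scale the pattern into the [0, 1] range
    return (p - p.min()) / (p.max() - p.min())

def z_score(p):
    # Standardization: mean 0, standard deviation 1
    return (p - p.mean()) / p.std()

def index_to_base(p):
    # Express every value relative to the first value of the pattern
    return p / p[0]
```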
Based on this visual representation, it becomes clear that only the standardization method transforms all our patterns onto a uniform scale, while the other methods still leave spikes and irregularities. Therefore, before using pattern data in the training process, we will normalize the data using this approach. Moreover, when we run our trained model in prediction mode (i.e., when we feed it new data), we will need to perform the same normalization.
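A sketch of how that per-pattern standardization might look in vectorized form; the variable names x_train_raw and x_test_raw are illustrative, not taken from the notebook:

```python
import numpy as np

def normalize(data):
    # Standardize each pattern (row) independently: mean 0, std 1
    data = np.asarray(data, dtype=np.float32)
    mean = data.mean(axis=1, keepdims=True)
    std = data.std(axis=1, keepdims=True)
    return (data - mean) / std

x_train = normalize(x_train_raw)  # shape: (n_train_patterns, 42)
x_test = normalize(x_test_raw)    # shape: (n_test_patterns, 42)
```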
Let’s now apply normalization to all our data—both training and test—and visualize it:
In this picture, you can see all the patterns normalized to a uniform scale. ‘Class 1’ labels our desired patterns, while ‘Class 0’ labels the non-pattern datasets. The training data will be used for the neural network training process, and the test data will be used for accuracy validation.
I prefer to visualize the data before every training session to ensure that the model is being trained on the exact data I intended. When experimenting with different datasets, applying pad/cut techniques, augmentations, and normalization methods, it is easy to lose track, especially with large datasets. Incorporating full or partial dataset visualizations between steps can help prevent errors and save considerable time.
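A quick helper like the following can be dropped between pipeline steps; this is a sketch, assuming x_train and y_train are NumPy arrays holding the patterns and their 0/1 labels:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_sample(data, labels, n=10, title=""):
    # Overlay the first n patterns of each class as a visual sanity check
    plt.figure(figsize=(10, 4))
    for cls in (0, 1):
        plt.subplot(1, 2, cls + 1)
        for p in data[labels == cls][:n]:
            plt.plot(p, linewidth=0.8)
        plt.title(f"{title} class {cls}")
    plt.tight_layout()
    plt.show()

plot_sample(x_train, y_train, title="Training data,")
```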
Designing the Model
Now that our dataset is ready, we can proceed to design our network. The image below illustrates the architecture of a simple neural network that we will use in our initial experiment:
It consists of several layers, each made up of neurons. These neurons are the basic units of a neural network and are inspired by the neurons in the human brain. Each neuron functions by receiving an input, processing it through a function determined by a set of parameters, and then producing an output.
These parameters are known as weights and biases, and they are what the neural network optimizes (or learns) during training. They are denoted as ‘w’ and ‘b’ in the picture above. Weights control the strength of the connections between neurons, determining how much influence a particular input or neuron has on the next layer. Biases allow each neuron to shift its function output, providing additional flexibility and enabling the network to model more complex functions.
The function that determines the output of a neuron is called an activation function. Activation functions introduce non-linear properties to the network, enabling the model to learn and represent more complex patterns. Commonly used activation functions include sigmoid, rectified linear unit (ReLU), hyperbolic tangent (tanh), and softmax. Within a single layer, all neurons typically use the same activation function, and it is common—but not mandatory—to use the same activation function across different layers as well. Sometimes, varying activation functions across layers is necessary, depending on the specific task or network architecture.
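For reference, three of the activation functions mentioned above are defined as follows:

$$\mathrm{sigmoid}(x)=\frac{1}{1+e^{-x}},\qquad \mathrm{ReLU}(x)=\max(0,x),\qquad \tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$$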
Each neuron in a layer is connected to neurons in the previous and next layers (except for those in the input and output layers). The output of one layer serves as the input for the next layer. In a basic neural network, such as the feedforward fully connected network shown above, each layer consists of neurons connected to every neuron in the subsequent layer.
The common structure of a neural network includes an input layer, one or more hidden layers, and an output layer. However, there are variations that deviate from this structure, such as neural networks without a hidden layer for basic binary classification tasks or architectures without an output layer for feature extraction. If a neural network contains many hidden layers, it is referred to as a deep neural network.
The picture above shows the architecture of the neural network we will build for our example. It has an input layer with 42 neurons, followed by a hidden layer with 50 neurons, followed by an output layer with a single neuron. In total, it has 2201 trainable parameters (2150 weights and 51 biases).
Now, let’s build this network using the Keras framework. In just one statement, we can define the number of layers, the number of neurons in each layer, the activation functions, and other model hyperparameters:
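A minimal sketch of that statement, assuming training_len equals 42 (the notebook cell may differ in details):

```python
from tensorflow import keras

training_len = 42  # number of bars in each pattern

model = keras.Sequential([
    keras.layers.Input(shape=(training_len,)),    # input layer: 42 values
    keras.layers.Dense(50, activation="relu"),    # hidden layer: 50 neurons
    keras.layers.Dense(1, activation="sigmoid"),  # output layer: 1 neuron
])

model.summary()  # should report 2,201 trainable parameters
```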
This model is a densely connected neural network consisting of three layers. It begins with an input layer designed to accept data of a length specified by the training_len variable (42 data points in our example). Following the input layer, there is a hidden layer with 50 neurons, each using the ReLU activation function for non-linear processing. Finally, the output layer employs a sigmoid activation function, making the model suitable for binary classification tasks—distinguishing between patterns and non-patterns.
Training the Model
The iterative process of finding the optimal weights and biases that yield the best output from a neural network is essentially what we call learning in the context of neural networks. This process is almost identical to optimizing strategy input parameters in MetaTrader5, with one crucial exception—neural network parameters are optimized in a defined direction. In other words, for each parameter, we know whether we need to increase or decrease it in order to minimize the loss function.
The loss function takes a model’s prediction and the actual label of a training sample as input and calculates the loss value, which indicates how close the prediction is to the actual result. There are several commonly used loss functions, such as binary cross-entropy, categorical cross-entropy, and Poisson loss, and they are all designed to be differentiable. For every neural network parameter (2201 in our case), a partial derivative of the loss function is calculated with respect to that particular parameter. This derivative gives the rate of change of the loss function with respect to changes in that parameter.
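For example, binary cross-entropy, the loss we will use later for our model, is defined over N samples as

$$\mathcal{L}=-\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)\,\right]$$

where $y_i$ is the actual label (0 or 1) and $\hat{y}_i$ is the model’s prediction for sample $i$.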
You may remember that not every function can be differentiated, let alone complex ones composed of multiple components. Nearly all modern deep learning frameworks (TensorFlow, PyTorch, JAX) use a technique called automatic differentiation. It allows for the efficient computation of derivatives and, unlike numerical or symbolic differentiation, does not suffer from rounding errors or an exponential increase in complexity with additional variables. If you need to remind yourself how rusty your calculus is, here is an article describing the magic of automatic differentiation.
The vector of these partial derivatives is called the gradient. During the optimization of network parameters, when the algorithm feeds the neural network with the training dataset, the actual values of the gradients are calculated and used to update the corresponding parameters. This method is known as gradient descent and is employed by all modern neural network optimization algorithms.
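In its simplest form, each parameter $w$ is updated by a small step against its gradient, where the learning rate $\eta$ controls the step size:

$$w \leftarrow w - \eta\,\frac{\partial \mathcal{L}}{\partial w}$$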
Fortunately, deep learning frameworks such as TensorFlow, Keras, and PyTorch automate all these steps. All you need to do is choose an optimization algorithm and run it. Let’s now compile our model:
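A sketch of the compile call for the model defined above (the notebook cell may differ in details):

```python
model.compile(
    optimizer="adam",            # adaptive moment estimation
    loss="binary_crossentropy",  # loss function for binary classification
    metrics=["accuracy"],        # reported for monitoring only
)
```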
As the optimization algorithm, we use Adam (Adaptive Moment Estimation), which extends gradient descent by computing adaptive learning rates for each parameter. It is often considered the default choice for optimization in neural network frameworks such as TensorFlow and Keras due to its robust performance across a wide variety of machine learning tasks.
As the loss function, we will use binary cross-entropy. For a binary classification model, binary cross-entropy (also known as log loss) is a commonly used loss function. It calculates the loss for each individual prediction and then averages these losses over all samples (patterns) in the data.
For training, we use two parameters: epochs and batch_size. Let’s start with 40 epochs, i.e., the whole training dataset will be used 40 times to adjust the weights and biases of our neural network, and a batch size of 10, which means that the optimization algorithm will adjust the weights and biases after processing every 10 patterns of the training data.
As the model performance metric, we will use accuracy, the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total number of observations. Accuracy is particularly useful when the number of observations in each class is similar. However, metrics like accuracy (or precision, recall, etc.) do not influence the calculation of weights and biases during training. The optimization process in training a neural network is governed by the loss function; the metrics are used only to evaluate the model’s performance and give insight into how well it is learning and generalizing to new data.
Let’s now start the training process:
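A sketch of the training call; x_train, y_train, x_test, and y_test are assumed to hold the normalized patterns and their labels:

```python
history = model.fit(
    x_train, y_train,
    epochs=40,                         # pass over the whole training set 40 times
    batch_size=10,                     # update parameters after every 10 patterns
    validation_data=(x_test, y_test),  # evaluate on the test set after each epoch
)
```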
Understanding Training Results, Overfitting, and Choosing the Best Model
After the training process is completed, the results will be presented as follows:
Here, for each epoch, you will see the loss function value and the accuracy value for both the training and validation sets. In the case of binary cross-entropy, the loss value represents the average of the individual losses calculated from each pattern’s prediction compared to its actual label. Note that for other loss functions, the actual loss value may be calculated differently.
On the chart, we plot the changes in the loss function value for the training dataset against the changes in the loss function for the test dataset. Such a chart is an invaluable tool for understanding the dynamics of the model’s optimization process and for preventing model overfitting.
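Keras collects these values in the history object returned by fit, so such a chart takes only a few lines to draw (a sketch, assuming the fit call above):

```python
import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("binary cross-entropy")
plt.legend()
plt.show()
```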
In this particular case, we observe that the loss function value, calculated at the end of each epoch for the training dataset, continuously decreases and tends toward zero. Conversely, for the test dataset, the loss decreases during the first 10 epochs and then plateaus—a classic indicator of overfitting. This suggests that the model has learned the details and noise in the training data to such an extent that it negatively impacts the model’s performance on new data. You can think of it as if the model has memorized the training data rather than learned to generalize from it.
The critical observation here is the point at which the validation loss stops decreasing and remains constant or improves only minimally, despite further decreases in the training loss. This is generally the best model to save and use, as it represents the point where the model generalizes best to unseen data. In our case, it occurs just before the training loss begins to diverge further downwards while the validation loss stabilizes.
It may also seem counterintuitive to use a model after only 5-10 epochs of training, especially when we tend to think that more training must be better. However, the goal of training a model isn’t merely to achieve a low training loss but to ensure the model generalizes well to new, unseen data.
To automatically handle such scenarios, implementing an early stopping mechanism can be very effective. Early stopping monitors the validation loss and stops training when it no longer decreases for a set number of epochs, called a “patience” period. This approach helps in saving the model at its optimal state in terms of generalization to unseen data.
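One possible configuration, consistent with the run described below (the patience value and other details are illustrative, not taken from the notebook):

```python
from tensorflow import keras

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=10,                 # stop after 10 epochs without improvement
    restore_best_weights=True,   # roll back to the best epoch's parameters
)

history = model.fit(
    x_train, y_train,
    epochs=40,
    batch_size=10,
    validation_data=(x_test, y_test),
    callbacks=[early_stopping],
)
```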
If you run the optimization process with an early stopping callback, as the example above demonstrates, you will observe results like the following:
The optimization stopped after 22 epochs, and the model parameters from the 12th epoch have been restored. Note that, typically, the model parameters at the end of optimization correspond to those from the last completed epoch. Therefore, if you need model parameters from a different epoch, you must use special callbacks, such as early stopping. There are also callbacks that allow you to save the best-performing model as the optimization progresses. However, for most tasks related to price pattern recognition, early stopping is the best tool.
More Hidden Layers
We tried a simple model with one hidden layer containing 50 neurons, roughly the size of our pattern. Now, let’s increase the complexity to see its impact on performance. While there is no strict rule that applies universally, there are some guidelines and best practices that can help us determine the number of neurons in each layer.
For a simple problem, fewer neurons and layers can often suffice. For complex problems, more neurons and additional layers may be necessary to capture the complexity of the data and the relationships among features.
The number of neurons in the first hidden layer is often larger than the size of the input layer if the problem is complex but can be smaller or roughly the same for simpler problems. It’s common to see architectures where the number of neurons decreases with depth (e.g., 128 in the first hidden layer, 64 in the second). This pyramid shape can help in forming more abstract representations at each level, condensing information as data passes through the network.
Alternatively, keeping the same number of neurons in each hidden layer is also common, particularly in problems where maintaining information throughout the network is beneficial.
So let’s design the model in the following way:
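A sketch of such a model; the dropout rate of 0.3 is an assumption, and the notebook may use a different value:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(training_len,)),
    keras.layers.Dense(100, activation="relu"),   # first hidden layer
    keras.layers.Dropout(0.3),                    # zero out 30% of outputs at random
    keras.layers.Dense(50, activation="relu"),    # second hidden layer
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation="sigmoid"),  # output layer
])
```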
Now we have two hidden layers of 100 and 50 neurons and two dropout layers. Dropout is a technique used during training to prevent overfitting. The method involves temporarily “dropping out” a random subset of neuron outputs (i.e., setting them to zero) during the forward pass, effectively removing those neurons and their connections from the network during training. This way, the network is forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. A typical dropout rate is between 0.2 and 0.5, and dropout layers are often placed between the hidden layers.
You could think of the dropout technique in training neural networks as being randomly distracted by something like Instagram while studying. Just as these distractions force you to refocus and ensure that you really understand the material (because you can’t rely on a continuous, uninterrupted train of thought), dropout prevents the neural network from depending on any single set of neurons. By ‘forgetting’ parts of the data intermittently, the network must adapt to generalize well, rather than memorizing specifics that might not be relevant outside the training set.
Now let’s run the same training process and compare results:
The training loss now looks a bit choppy, precisely because of the dropout layers we added. As you can see, the training and validation loss curves are now very close: the validation loss reaches its minimum at epoch 23, with 100% accuracy, a training loss below 0.01, and a validation loss of about 0.01. This model looks like a very good candidate to be deployed as a MetaTrader5 indicator.
You can further experiment with different architectures. Doing that, though, you need to keep in mind that a more complex network will tend to simply ‘memorize’ your training set rather than learn the general features of your patterns.
Saving the Model
The early stopping callback restores the parameters from the best epoch, and in the notebook this restored model is then saved to a file in your Colab environment. Download this model to your computer and prepare to see its performance as a MetaTrader5 indicator.
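If you prefer to save (or re-save) the model manually, a single call is enough; the filename here is illustrative:

```python
model.save("dt_model.keras")  # hypothetical filename; saves in the native Keras format
```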
Step Two Complete
In this article, we explored the process of designing the architecture of a neural network and training it on the pattern data collected in MetaTrader5. The Colab notebook I prepared contains all the examples discussed in this article, but you are encouraged to further experiment with different architectures, optimization algorithms, and performance metrics. Once you have selected the best model, download it to your computer to use as a MetaTrader5 indicator.
If you like this article, please let me know your thoughts – pavel@pavelchigirev.com