MarketAlert – Real-Time Market & Crypto News, Analysis & Alerts
A hybrid self attentive linearized phrase structured transformer based RNN for financial sentence analysis with sentence level explainability – Scientific Reports

Last updated: July 4, 2025 4:00 pm
Published: 10 months ago

Despite these advances, several limitations remain. Fatouros et al.14 report a constrained dataset duration that may affect model performance across different periods, a short dataset unable to fully capture the complexities of financial markets, and generative-AI challenges such as model collapse when relying heavily on synthetic training data, which leads to limited or repetitive model outputs. Lutz et al.15 lack comprehensive testing on diverse datasets for model validation, which is critical for understanding the complex interplay between news sentiment and market movements. The lack of explainability in many advanced models, especially Deep Learning (DL) ones, poses a further limitation, since it becomes difficult to understand the rationale behind certain predictions42. Furthermore, the substantial computational resources required to train and maintain sophisticated models can be a limiting factor43. Real-time prediction is also hampered by time lags between the release of financial news and its impact on the market44,45. Models trained on data from one financial market or region may not perform well in different contexts because of varying economic conditions and market behaviors, highlighting issues with transferability. There is also a risk of overfitting to historical data, particularly when the dataset is limited, leading to poor generalization to new data46. In addition, most methods explain static predictions, which is inadequate for financial time series where sentiment persistence matters (e.g., sustained "inflation fears" versus one-off mentions), and existing tools do not map explanations to FINRA's Model Validation Guidelines (https://www.finra.org/rules-guidance/rulebooks/finra-rules), which require "traceable logic for material predictions".
Leippold shows that keyword-based models fail against financial adversarial examples such as "growth" vs. "groWth" (Unicode spoofing), a vulnerability that extends to explanation methods relying on surface tokens47. Additionally, financial news and reports often contain human biases, which models can inadvertently learn, resulting in skewed predictions. These additional limitations underscore the complexity of financial markets and the challenges involved in developing robust prediction models. The proposed LPS model addresses them by tracking sentiment carriers across sentences via dependency parsing, by formatting Anchors rules as MiFID II-compliant decision logs (https://www.esma.europa.eu/publications-and-data/interactive-single-rulebook/mifid-ii) and by extending FinBERT's vocabulary to detect glyph attacks.

This section details datasets used in this experiment, each model’s theoretical foundation, operational principles, evaluation metrics and relevance to the research objectives, focusing on their application in financial sentiment analysis. To uncover patterns in the proposed models’ predictions and understand how these patterns influenced the decisions with proper explainability, the following scenarios were examined: when the classifier accurately classified a title as a negative statement, when the classifier accurately classified a title as a positive/neutral statement, when the classifier inaccurately classified a title as a negative statement and when the classifier inaccurately classified a title as a positive/neutral statement.

The Financial Phrasebank and SEntFiN datasets used in this work are publicly available. To create IMBSEntFiN, we first averaged the real classes and then altered the class weights for each sentence in the SEntFiN dataset, refining it into an imbalanced classification format. The dependent variable was inserted into the dataset manually. Following this, the dataset was divided into training and testing sets, with 80% of the samples used for training. Once the model finished the classification task, its predictions were appended to the test data as a new column. Figure 1 presents the most frequent words in the financial-sentence datasets, displayed in word cloud format.
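The 80/20 split and the appended prediction column described above can be sketched as follows; this is a minimal illustration with a hypothetical DataFrame, and the column names are assumptions rather than the paper's actual schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the labeled financial-sentence dataset.
df = pd.DataFrame({
    "sentence": [f"headline {i}" for i in range(100)],
    "label": [i % 3 for i in range(100)],  # e.g. negative / neutral / positive
})

# 80% of the samples go toward training, as in the text.
train_df, test_df = train_test_split(df, train_size=0.8, random_state=42)

# After classification, predictions are appended to the test data
# as a new column (dummy values stand in for model output here).
test_df = test_df.copy()
test_df["prediction"] = test_df["label"]  # placeholder for model output
```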

Due to the informal and unstructured nature of financial sentence data, preprocessing was necessary to ensure the accuracy and reliability of our study. Our extensive data-cleaning procedure comprised the following essential stages:

TF-IDF embeddings were produced with scikit-learn's TfidfVectorizer, which weights words by their relevance within individual sentences and their scarcity across the entire dataset. It also down-weights frequently used terms, freeing our models to focus on more informative ones.
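As a minimal sketch of this step (using hypothetical example headlines, not the paper's data), the TF-IDF vectorization can look like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy financial headlines; the actual corpora are Financial Phrasebank / SEntFiN.
sentences = [
    "shares rise on strong quarterly profit",
    "profit warning sends shares lower",
    "central bank holds rates steady",
]

# TfidfVectorizer down-weights terms that appear in many sentences
# and up-weights terms distinctive to a sentence.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)  # sparse matrix: sentences x vocabulary
```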

One of Word2Vec's strengths is its ability to capture semantic similarities between words, which enhances the sophistication of text data analysis. Using neural networks, it represents words as dense vectors in a continuous vector space. The financial sentences were tokenized with the NLTK library's tokenizer, which breaks sentences down into individual words. A Word2Vec model was then trained with the Gensim library, with the vector size, window size and skip-gram mode configured. Full sentences were encoded as vectors using two techniques: average vectorization and sum vectorization.
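The average and sum vectorization techniques mentioned above can be sketched as follows; the 4-dimensional vectors here are hypothetical stand-ins for what a trained Gensim Word2Vec model would supply:

```python
import numpy as np

# Hypothetical word vectors standing in for a trained Word2Vec model.
embeddings = {
    "profit":  np.array([0.9, 0.1, 0.0, 0.2]),
    "falls":   np.array([-0.7, 0.3, 0.1, 0.0]),
    "sharply": np.array([-0.2, 0.5, 0.4, 0.1]),
}

def sentence_vector(tokens, mode="average"):
    """Encode a sentence by averaging or summing its word vectors."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(4)
    stacked = np.vstack(vecs)
    return stacked.mean(axis=0) if mode == "average" else stacked.sum(axis=0)

avg = sentence_vector(["profit", "falls", "sharply"])
total = sentence_vector(["profit", "falls", "sharply"], mode="sum")
```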

GloVe is a potent word embedding method that builds dense vector representations from a corpus's co-occurrence statistics, capturing the semantic associations between words. We trained a GloVe model customized for our dataset using the glove library. As part of the preprocessing, the sentences were tokenized into individual words with the NLTK package's tokenizer. A co-occurrence matrix was then produced, using a context window to record word connections. The GloVe model was trained to generate embeddings that capture words' global semantic features as well as their local context. By mapping each word in the dataset to its vector representation, the model could then identify significant patterns and connections in the text.
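The co-occurrence statistic that GloVe factorizes can be sketched in a few lines; this is an illustrative counter under an assumed symmetric window, not the glove library's internal implementation:

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Count symmetric word co-occurrences within a context window,
    the statistic GloVe factorizes into embeddings."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    counts[(w, tokens[j])] += 1.0
    return counts

sents = [["rates", "rise", "again"], ["markets", "fall", "as", "rates", "rise"]]
counts = cooccurrence(sents)
```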

FastText is a sophisticated word embedding method that improves on conventional approaches by adding subword information, which makes it useful for handling rare or out-of-vocabulary terms. The FastText model from the Gensim package was used to train dataset-specific embeddings. The procedure started by tokenizing sentences into individual words with the NLTK library. The FastText model was then trained to produce word embeddings by decomposing words into character-level n-grams and learning their representations. This lets the model capture both word-level semantics and subword structures, making it resilient when assessing text in many languages and handling unseen words. After training, embeddings were available for every word in the dataset, demonstrating the model's efficacy in capturing morphological and semantic information.
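The character-level n-gram decomposition at the heart of FastText can be sketched as below; the boundary markers and 3-to-5 n-gram range follow FastText's usual convention, though the exact settings in the paper are not stated:

```python
def char_ngrams(word, n_min=3, n_max=5):
    """FastText-style subword units: character n-grams of the word
    wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

grams = char_ngrams("rates")  # subwords of "<rates>"
```

Because an unseen word still shares n-grams with known words, FastText can assemble a vector for it from subword embeddings.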

We used pre-trained transformer models from the Hugging Face Transformers library to obtain contextual embeddings for the text data in our study. Three different transformer models, each contributing distinct capabilities, were used. The "distilbert-base-uncased" model, known for its efficiency and portability, was chosen because it works well in situations with limited computing resources; it creates context-aware word embeddings that take both the left and right context of each word into account. To convey finance-specific subtleties in text, we employed the "yiyanghkust/finbert-tone" model, which was created for financial sentiment research and firmly grounded our sentiment analysis efforts. We also included "sentence-transformers/all-MiniLM-L6-v2", which maps whole sentences into a dense, fixed-dimensional vector space while preserving semantic information. Although each transformer's code implementation had a similar structure, the model selection added variety to our testing and allowed us to investigate how contextual embeddings affected our text classification task. We used each model's associated tokenizer to tokenize and process the text data, applying truncation and padding to guarantee constant input lengths. For best computational performance, the tokenized data was then processed on the GPU. We also extracted the hidden states linked to the "[CLS]" token, which frequently captures the overall context of the text.
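The fixed-length input constraint described above (truncation and padding, with the [CLS] position reserved at index 0) can be sketched without the library; real Hugging Face tokenizers handle this internally via `truncation=True` and `padding="max_length"`, and the token ids below are conventional BERT values used only for illustration:

```python
CLS_ID, PAD_ID = 101, 0  # conventional BERT special-token ids (assumption)

def pad_or_truncate(token_ids, max_len=8):
    """Prepend [CLS], then truncate or pad to a constant length so
    that batches have a uniform shape."""
    ids = [CLS_ID] + token_ids
    ids = ids[:max_len]
    ids += [PAD_ID] * (max_len - len(ids))
    return ids

short = pad_or_truncate([2054, 2003])           # padded up to max_len
long = pad_or_truncate(list(range(2000, 2020)))  # truncated down to max_len
```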

This subsection discusses various ML classifiers. Additionally, it introduces xFiTRNN, our proposed novel financial sentiment classification framework: an advanced hybrid transformer- and attention-based RNN model used in our tests. This section details each model's theoretical foundation, operational principles and relevance to the research objectives.

Ada Boost Classifier: AdaBoost (Adaptive Boosting) is a powerful ensemble method for enhancing the performance of weak classifiers. It operates by iteratively training a sequence of weak learners, typically decision trees, on reweighted versions of the training data. Initially, all data points are assigned equal weights. A weak classifier $G_m$ is trained in each iteration $m$ and its error rate is defined by Eq. (1):

$$\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i \, I\big(y_i \neq G_m(x_i)\big)}{\sum_{i=1}^{N} w_i}$$

where $I(\cdot)$ is the indicator function, $y_i$ are the true labels, $x_i$ are the training examples and $w_i$ are the weights. The weight of the classifier is then calculated by Eq. (2):

$$\alpha_m = \ln\!\left(\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}\right)$$

which reflects its accuracy. Subsequently, the weights of misclassified instances are increased using Eq. (3):

$$w_i \leftarrow w_i \exp\big(\alpha_m \, I(y_i \neq G_m(x_i))\big)$$

focusing the next classifier on harder cases. The final prediction is made by a weighted majority vote of the weak classifiers, Eq. (4):

$$G(x) = \operatorname{sign}\!\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)$$

This iterative boosting approach allows the model to effectively capture complex patterns in financial sentiment, leading to more accurate predictions.
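One round of the AdaBoost weight update (Eqs. (1)-(3)) can be traced numerically on a toy example; this is a standalone illustration of the arithmetic, not scikit-learn's implementation:

```python
import numpy as np

# Five samples, one weak learner's predictions, uniform initial weights.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # one mistake (sample at index 1)
w = np.ones(5) / 5

miss = (y_true != y_pred).astype(float)
err = np.sum(w * miss) / np.sum(w)      # Eq. (1): weighted error rate
alpha = np.log((1 - err) / err)         # Eq. (2): classifier weight
w = w * np.exp(alpha * miss)            # Eq. (3): boost the misclassified sample
w = w / w.sum()                         # renormalize the distribution
```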

Extra Trees Classifier: Extremely Randomized Trees, often referred to as the Extra Trees Classifier, is an ensemble learning method that outputs the mode of the classes (classification) predicted by a large number of individually trained decision trees. Each tree is trained on the whole dataset using random splits at its nodes: instead of searching for the best split among all available features at each node, it randomly picks a subset of features and finds the best split among them. Equation (5) gives the decision function for a node $n$ splitting on feature $j$:

$$d_n(x) = \mathbb{1}\big[x_j < \theta_j\big], \quad j \in \{1, \dots, d\}$$

where $x$ represents the input vector, $d$ is the number of features and $\theta_j$ are the threshold values for the splits. The output of the ExtraTreesClassifier for an input $x$ is determined by majority voting over all the individual trees, Eq. (6):

$$\hat{y}(x) = \operatorname{mode}\big\{h_k(x)\big\}_{k=1}^{K}$$

where $K$ is the total number of trees. This approach reduces variance and helps capture intricate patterns in financial sentiment data, leading to robust and accurate sentiment predictions.
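In scikit-learn, the classifier is used as below; the synthetic data is a stand-in for the paper's sentence features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for TF-IDF sentiment features (not the paper's data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# K = 100 extremely randomized trees, aggregated by majority vote.
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)
```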

LDA: Linear Discriminant Analysis (LDA) is a statistical method used to find the linear combination of features that best separates two or more classes. It assumes that the data in each class is normally distributed and that all classes share a common covariance matrix. For a dataset comprising classes $c = 1, \dots, C$, the objective of LDA is to maximize the ratio of between-class variance to within-class variance. Equations (7) and (8) define the within-class scatter matrix $S_W$ and the between-class scatter matrix $S_B$:

$$S_W = \sum_{c=1}^{C} \sum_{x_i \in c} (x_i - \mu_c)(x_i - \mu_c)^{T}$$

$$S_B = \sum_{c=1}^{C} N_c \, (\mu_c - \mu)(\mu_c - \mu)^{T}$$

where $N_c$ is the number of samples in class $c$, $\mu_c$ is the mean vector of class $c$ and $\mu$ is the overall mean vector of the dataset. The optimal linear discriminants are found by solving the generalized eigenvalue problem defined in Eq. (9):

$$S_B \, w = \lambda \, S_W \, w$$

where $\lambda$ are the eigenvalues and $w$ are the eigenvectors. The resulting linear discriminants are then used to project the data into a lower-dimensional space for classification. In the context of financial sentiment analysis, LDA effectively reduces dimensionality while preserving class separability, allowing accurate classification of sentiment in financial texts.
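The dimensionality reduction described above is visible in scikit-learn's API: with $C$ classes, LDA projects onto at most $C-1$ discriminants. A small sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Three-class synthetic data standing in for sentence features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

lda = LinearDiscriminantAnalysis()
Z = lda.fit_transform(X, y)  # projected onto at most C - 1 = 2 discriminants
```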

QDA: Quadratic Discriminant Analysis (QDA) is a classification technique in which non-linear decision boundaries are possible because each class is modeled with its own covariance matrix $\Sigma_c$, in contrast to Linear Discriminant Analysis (LDA), which operates under the assumption that all classes share a single covariance matrix. For a given input $x$, the posterior probability for class $c$ is calculated using Bayes' theorem, Eq. (10):

$$P(c \mid x) = \frac{\pi_c \, f_c(x)}{\sum_{c'} \pi_{c'} \, f_{c'}(x)}$$

where $f_c(x)$ is the class-conditional density function given by the multivariate normal distribution, Eq. (11):

$$f_c(x) = \frac{1}{(2\pi)^{d/2}\,\lvert \Sigma_c \rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_c)^{T} \Sigma_c^{-1} (x - \mu_c)\right)$$

with $d$ being the number of features, $\mu_c$ the mean vector and $\Sigma_c$ the covariance matrix of class $c$. The decision rule assigns $x$ to the class with the highest posterior probability, Eq. (12):

$$\hat{y} = \arg\max_{c} \; P(c \mid x)$$

QDA captures the complex relationships and variations in sentiment by leveraging class-specific covariance structures, leading to more nuanced and accurate sentiment classification.
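The class-specific covariance assumption can be seen directly in scikit-learn: QDA fits one $\Sigma_c$ per class. A sketch on deliberately heteroscedastic synthetic data:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two classes with very different spreads: the setting where QDA's
# per-class covariance pays off over LDA's shared covariance.
X0 = rng.normal(0.0, 1.0, size=(100, 2))
X1 = rng.normal(2.0, 0.3, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

qda = QuadraticDiscriminantAnalysis(store_covariance=True)
qda.fit(X, y)  # qda.covariance_ holds one matrix per class
```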

XGB Classifier: The XGB Classifier is an implementation of the Extreme Gradient Boosting technique, combining the predictions of several weak learners, specifically decision trees. It builds an ensemble of trees sequentially, with each new tree trying to correct the mistakes of the previous ones. The model iteratively adds trees so as to minimize the objective function at step $t$, Eq. (13):

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} L\big(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$$

where $L$ is a differentiable loss function (e.g., logistic loss for binary classification), $f_t(x_i)$ is the prediction of the $t$-th tree and $\Omega$ is a regularization term controlling the complexity of the model, with $T$ being the number of leaves and $w_j$ the leaf weights, Eq. (14):

$$\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$

Every time a new tree is fitted, the gradient and hessian of the loss function are used to enhance the model's predictions. The final prediction $\hat{y}_i$ is the sum over all trees, Eq. (15):

$$\hat{y}_i = \sum_{t=1}^{K} f_t(x_i)$$

The XGB Classifier efficiently captures complex patterns and interactions within the data, leading to high-performing and accurate sentiment predictions.
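The per-sample gradient and hessian that each new tree is fitted to have simple closed forms for logistic loss; the sketch below shows that arithmetic in isolation, not XGBoost's internal code:

```python
import numpy as np

def logistic_grad_hess(y_true, raw_score):
    """Per-sample gradient and hessian of logistic loss with respect to
    the raw (pre-sigmoid) score, as used by gradient/hessian boosting."""
    p = 1.0 / (1.0 + np.exp(-raw_score))  # predicted probability
    grad = p - y_true                     # first derivative of the loss
    hess = p * (1.0 - p)                  # second derivative of the loss
    return grad, hess

# At raw score 0 the model predicts p = 0.5 for both samples.
g, h = logistic_grad_hess(np.array([1.0, 0.0]), np.array([0.0, 0.0]))
```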

Gradient Boosting Classifier: The Gradient Boosting Classifier is a potent ensemble learning technique that builds an additive model in a forward stagewise fashion, successively fitting new models to correct the mistakes of the previous ones. The goal is to reduce the loss function $L(y, F(x))$, where $F(x)$ represents the model's prediction, over the given dataset. At each step $m$, a fresh weak learner $h_m$ is trained to fit the negative gradient of the loss function, which is the residual error of the current model, Eq. (16):

$$r_{im} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}}$$

The model is then updated as Eq. (17):

$$F_m(x) = F_{m-1}(x) + \nu \, h_m(x)$$

where $\nu$ is the learning rate. Commonly, decision trees are used as weak learners and the final prediction is given by Eq. (18):

$$F_M(x) = F_0(x) + \sum_{m=1}^{M} \nu \, h_m(x)$$

The Gradient Boosting Classifier effectively captures intricate patterns and trends in sentiment data by iteratively focusing on and correcting the hardest-to-predict instances, leading to highly accurate sentiment classification.
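The stagewise update of Eqs. (16)-(18) can be implemented in a few lines for squared loss, where the negative gradient is just the residual; this is a from-scratch teaching sketch on a toy regression target, not scikit-learn's GradientBoostingClassifier:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Minimal forward-stagewise boosting for squared loss: each shallow
# tree is fit to the residuals (the negative gradient), Eqs. (16)-(18).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

F = np.zeros(200)   # F_0 = 0
nu = 0.1            # learning rate (nu in Eq. (17))
for _ in range(50):
    residual = y - F                                   # negative gradient
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    F += nu * tree.predict(X)                          # F_m = F_{m-1} + nu*h_m

mse = np.mean((y - F) ** 2)  # should be far below the initial error
```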

A variety of DL models, each with distinct architectural features, were used to thoroughly assess financial sentiment categorization. For optimum performance, we tuned the hyperparameters of these models using Keras Tuner. The search space comprised GRU layer units between 128 and 768, dense layer units between 64 and 512, dropout rates between 0.1 and 0.5 and a logarithmically sampled range of learning rates. We found the optimal settings for each model using Keras Tuner's Random Search technique, which greatly improved their performance. This methodical investigation, together with thorough text representations and tuned hyperparameters, yielded insight into how well different DL architectures performed. Our first model, the "1-Dense Layered NN", used a transformer for feature extraction; a single dense layer with 512 units and ReLU activation captured high-level representations, and this architecture's simplicity set a benchmark for performance comparison. Building on this framework, the "2-Dense Layered NN" and "3-Dense Layered NN" added successive dense layers with decreasing units (256 and 128; and 512, 256 and 128) after global averaging of the transformer's outputs, gradually refining the feature representations. To improve model resilience, dropout regularization was applied after the first dense layer. To investigate the subtleties further, we developed the "BiGRU + 3 Hidden Dense Layers" model, which incorporated a BiGRU layer, a bidirectional variant of the RNN, enabling the network to recognize sequential relationships in the input data; three more dense layers (512, 256 and 128 units) refined the features after the BiGRU layer, and dropout was used to improve generalization. Finally, we investigated a hybrid architecture with the "BiGRU + CNN" model, which combined the advantages of convolutional layers and a BiGRU layer: convolutional layers with 64 filters and varying kernel sizes gave feature extraction a spatial view, and the extracted features were then further processed by two dense layers (128 and 64 units).
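The Random Search step can be sketched as plain uniform sampling from the stated search space; Keras Tuner additionally trains and scores each sampled trial, which is omitted here:

```python
import random

# Ranges taken from the text; the learning-rate bounds are not stated
# in the source, so they are omitted from this sketch.
SEARCH_SPACE = {
    "gru_units":   (128, 768),
    "dense_units": (64, 512),
    "dropout":     (0.1, 0.5),
}

def sample_trial(rng):
    """Draw one hyperparameter configuration uniformly from each range."""
    return {
        "gru_units":   rng.randint(*SEARCH_SPACE["gru_units"]),
        "dense_units": rng.randint(*SEARCH_SPACE["dense_units"]),
        "dropout":     rng.uniform(*SEARCH_SPACE["dropout"]),
    }

rng = random.Random(42)
trials = [sample_trial(rng) for _ in range(10)]  # one dict per trial
```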

The proposed xFiTRNN model offers an innovative architecture for financial sentiment analysis, as depicted in Fig. 2. It leverages contextual embeddings from the pre-trained FinBERT transformer to capture the intricate language used in financial texts. The architecture commences with input layers accommodating sequences of up to 256 tokens, ensuring comprehensive coverage of the input data. For training, the input data is shuffled, batched and arranged into a TensorFlow dataset; a batch size of 16 is chosen to enable effective training. To prepare the labels for multiclass classification, they are one-hot encoded, and the dataset is divided into training and validation sets to ensure thorough examination. Keras Tuner plays a key part in optimizing the FiTRNN model by methodically examining a well-defined search space of hyperparameters. The search space includes the number of BiGRU layer units, which ranges from 128 to 512 and balances model complexity against performance. Furthermore, three dense layers are tuned with units ranging from 128 to 768, 64 to 512 and 32 to 256, respectively, which affects the model's capacity and computing needs. The dropout rate varies between 0.1 and 0.5 to avoid overfitting, while the learning rate is explored logarithmically to maximize convergence speed. Random Search samples different hyperparameter combinations during tuning, an effective way of exploring the space without an exhaustive search. Based on validation loss, the tuner determines the optimal hyperparameter settings (Fig. 2) after several trials, guaranteeing an ideal trade-off between computing efficiency and performance.

The core architecture of the FiTRNN model integrates pre-trained contextual embeddings with BiGRU and self-attention mechanisms, featuring the following enhancements:

While our model was being trained, the learning rate was adjusted by a learning-rate scheduler callback. Starting from an initial learning rate, the function applies exponential decay, which regulates the pace at which the learning rate declines across epochs: the initial rate is multiplied by the exponent of a negative constant $k = 0.1009$ times the epoch number, i.e. $\eta_e = \eta_0 \exp(-0.1009\,e)$. The learning rate thus decays exponentially as training progresses. As seen in Fig. 3, this method aids training-process optimization by adjusting the learning rate to enhance model convergence and performance across epochs. A thorough assessment of the xFiTRNN model's performance is shown in Figs. 4 and 5: the training and validation accuracy versus epoch curve depicts the model's learning progress and convergence, while the loss versus epoch curve shows the model's training and validation loss across successive epochs. Together, these visualizations provide information on the convergence, classification accuracy and optimization process of the xFiTRNN model. The architecture effectively harnesses the strengths of contextual embeddings from pre-trained transformers, the sequential modeling capabilities of BiGRU and the focusing power of self-attention mechanisms. This combination is particularly advantageous for financial sentiment analysis, where understanding context and subtle language cues is critical. The model demonstrates superior performance compared to traditional approaches, offering a robust tool for analyzing sentiments in financial texts; its ability to capture complex patterns and nuances makes it a valuable contribution to the field of sentiment analysis in finance.
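The schedule itself is a one-liner; the initial learning rate below is an assumed example value, since the source only states the decay constant:

```python
import math

K = 0.1009  # decay constant from the schedule described in the text

def lr_schedule(epoch, initial_lr=1e-3):
    """Exponential decay: multiply the initial rate by exp(-K * epoch)."""
    return initial_lr * math.exp(-K * epoch)

lrs = [lr_schedule(e) for e in range(5)]  # strictly decreasing sequence
```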

The xFiTRNN model is a hybrid architecture designed for financial sentiment analysis, integrating transformer-based contextual encoding with recurrent neural processing to capture both global context and sequential dependencies in financial texts. Its novel contributions lie in the incorporation of a linearized phrase structure (LPS) to leverage syntactic information and a dual explainability framework for transparent predictions. Figure 2 provides a schematic overview, with detailed configurations available in the supplementary material.

To explain our proposed xFiTRNN model, we performed tests with the following xAI methods.

LIME is a model-agnostic explainability technique that captures a model's behavior locally around a prediction. Building on the notion that complicated models behave approximately linearly at a local scale, it perturbs examples close to the instance being explained in order to train a simple, interpretable linear model. The explanation for a sample $x$ is defined as the solution of Eq. (19):

$$\xi(x) = \arg\min_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g)$$

where $G$ represents the class of interpretable models, $g$ is the explanation model and $\Omega(g)$ measures its complexity. $\mathcal{L}(f, g, \pi_x)$ evaluates how well $g$ approximates the model $f$ locally, with the proximity measure $\pi_x$ guiding sample generation. To remain model-agnostic, the method draws samples based on the proximity $\pi_x$. The optimization process involves perturbing samples around $x$, recovering them in their original representation and using the model's predictions as labels for the explanation model. The approach typically uses sparse linear models, square loss and an exponential kernel for proximity.
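The LIME recipe for text can be sketched from scratch (binary word-presence perturbations, an exponential proximity kernel, and a weighted ridge regression as the linear surrogate $g$); the black-box classifier below is a hypothetical stand-in, and the lime package implements the full method properly:

```python
import numpy as np
from sklearn.linear_model import Ridge

def black_box(masks):
    """Hypothetical classifier: score rises mainly when word 0 is kept."""
    return 0.8 * masks[:, 0] + 0.1 * masks[:, 1]

tokens = ["profit", "today", "slightly"]
rng = np.random.default_rng(0)

# Perturb: randomly drop words from the instance (the all-ones mask).
masks = rng.integers(0, 2, size=(500, len(tokens))).astype(float)
preds = black_box(masks)  # black-box predictions label the surrogate

# Exponential kernel on the distance to the original instance (pi_x).
dist = np.sum(1 - masks, axis=1)
weights = np.exp(-(dist ** 2) / 2.0)

# Fit the local linear surrogate; its coefficients are the explanation.
surrogate = Ridge(alpha=1.0).fit(masks, preds, sample_weight=weights)
importance = dict(zip(tokens, surrogate.coef_))
```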

Anchors is an interpretable machine learning technique that provides high-precision rules, called anchors: conditions that sufficiently "anchor" a prediction such that changes to the remaining feature values do not affect the outcome. For a given instance $x$, an anchor is a rule set $A = \{(j, v_j)\}$ where $j$ is a feature and $v_j$ is its value. The precision of an anchor $A$ for instance $x$ is given by Eq. (20):

$$\mathrm{prec}(A) = \mathbb{E}_{\mathcal{D}(z \mid A)}\big[\mathbb{1}\big(f(z) = f(x)\big)\big]$$

where $f$ is the model, $\mathbb{1}$ is the indicator function and $z \sim \mathcal{D}(z \mid A)$ denotes perturbed samples that satisfy the anchor conditions. In financial sentiment analysis, anchors can be used to generate interpretable rules explaining why a particular sentiment prediction was made. For example, an anchor could be a specific set of words or phrases in a financial report that, when present, consistently lead to a positive or negative sentiment classification. This approach ensures that the explanations are both precise and interpretable, helping analysts trust and understand model predictions.
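Eq. (20) can be estimated by Monte Carlo: perturb everything except the anchored features and count how often the prediction is preserved. The model below is a hypothetical toy, not the anchors library's algorithm:

```python
import numpy as np

def anchor_precision(model, x, anchor_idx, n_samples=1000, seed=0):
    """Estimate Eq. (20): among perturbed samples z that keep the
    anchored features fixed to x's values, the fraction for which the
    model's prediction matches its prediction on x."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=(n_samples, x.shape[0]))
    z[:, anchor_idx] = x[anchor_idx]  # enforce the anchor conditions
    return np.mean(model(z) == model(x[None, :]))

# Hypothetical model: predicts 'negative' (1) whenever feature 0 is on.
model = lambda Z: (Z[:, 0] == 1).astype(int)
x = np.array([1, 0, 1, 0])

# Anchoring feature 0 fully determines this model, so precision is 1.
prec = anchor_precision(model, x, anchor_idx=[0])
```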

We evaluated both our experimental and proposed models using the following metrics.

Accuracy quantifies the ratio of correctly identified occurrences and is calculated as Eq. (21):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$ refers to the number of true positives, $TN$ to the number of true negatives, $FP$ to the number of false positives and $FN$ to the number of false negatives.

Precision indicates how correct the positive predictions are, i.e. the proportion of predicted positive cases that turn out to be positive. Precision is defined by Eq. (22):

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall indicates how well the model recognizes every relevant occurrence. Recall is calculated as Eq. (23):

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

F1-score offers a balanced metric that takes both Recall (R) and Precision (P) into consideration. It is very helpful when there is an imbalance in the class distribution, as in our second dataset. F1-score is defined by Eq. (24):

$$F1 = \frac{2 \cdot P \cdot R}{P + R}$$

The AUC is a performance metric that evaluates a model's ability to distinguish between classes. AUC is particularly useful for imbalanced datasets, as it measures the quality of a model's predictions across all classification thresholds. The AUC score is derived from the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR). AUC is calculated as Eq. (25):

$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR} \; d(\mathrm{FPR})$$

Moreover, for a comprehensive evaluation of the model's performance across all financial sentiment classes, macro-averaged versions of these metrics are calculated:

$$\mathrm{Macro\text{-}P} = \frac{1}{C}\sum_{c=1}^{C} P_c, \qquad \mathrm{Macro\text{-}R} = \frac{1}{C}\sum_{c=1}^{C} R_c, \qquad \mathrm{Macro\text{-}F1} = \frac{1}{C}\sum_{c=1}^{C} F1_c$$

where $C$ represents the total number of financial sentiment classes.
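Eqs. (21)-(24) reduce to a few lines of arithmetic on the confusion-matrix counts; the counts below are illustrative, not results from the paper:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 (Eqs. (21)-(24))
    from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts: 100 test samples in total.
acc, p, r, f1 = classification_metrics(tp=40, tn=30, fp=10, fn=20)
```

For the macro-averaged variants, the same computation is repeated per class (treating that class as positive) and the per-class scores are averaged.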

Read more on Nature
