Detecting illicit transactions in bitcoin: a wavelet-temporal graph transformer approach for anti-money laundering - Scientific Reports - MarketAlert – Real-Time Market & Crypto News, Analysis & Alerts

Evaluate model on test set and report accuracy, precision, recall, and F1 score.

We conduct our experiments on the publicly available Elliptic Bitcoin dataset, a widely recognized benchmark for anti-money laundering (AML) tasks in blockchain analytics. This dataset represents a directed temporal transaction graph, where each node corresponds to a unique Bitcoin transaction and each edge indicates the flow of funds from one transaction to another. Every node is described by a 166-dimensional feature vector composed of statistical summaries and centrality measures, as well as a timestamp that captures the temporal progression over 49 discrete time steps.

Each transaction is labeled as either illicit, licit, or unknown, with the latter excluded from supervised learning. Among the 203,769 total transactions, only 45,576 are labeled, with approximately 21% of them identified as illicit. This significant class imbalance reflects the skewness of illicit behavior in real-world financial systems and presents a considerable challenge for detection models. This dataset is summarized in Table 1, which reports key statistics including the number of transactions, edges, feature dimension, temporal length, and the class distribution.

For graph construction, we first filter out all transactions labeled as unknown and strictly align the transaction identifiers across the feature, class, and edge files. Edges are retained only if both endpoint transactions appear in the filtered node set with known labels, which keeps the node and edge spaces consistent. We treat the graph as directed, preserving the direction of money flow, and remove duplicate edges and invalid edges that reference missing nodes. Self-loops are not added in preprocessing, so the GNN layers handle neighborhood aggregation without artificial self-connections. No additional pruning, such as degree-based filtering or temporal subsampling, is applied, so the structure of the labeled transaction graph is otherwise preserved. Finally, we apply a stratified 80%/10%/10% split on the labeled nodes for training, validation, and testing, respectively. Fixing this order of operations keeps the pipeline reproducible and is intended to prevent information leakage.
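The preprocessing order described above can be illustrated with a minimal, dependency-free sketch. The function names, the dictionary-based inputs, and the rounding rule for split sizes are assumptions made for illustration; they are not taken from the authors' code.

```python
import random
from collections import defaultdict


def build_labeled_graph(labels, edges):
    """labels: dict tx_id -> 'illicit' | 'licit' | 'unknown'; edges: iterable of (src, dst)."""
    # Keep only transactions with a known label.
    nodes = {tx for tx, y in labels.items() if y != "unknown"}
    kept, seen = [], set()
    for src, dst in edges:
        # Drop edges that touch filtered or missing nodes, and exact duplicates.
        if src in nodes and dst in nodes and (src, dst) not in seen:
            seen.add((src, dst))
            kept.append((src, dst))  # direction src -> dst is preserved
    return nodes, kept


def stratified_split(labels, nodes, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split labeled nodes per class so each part keeps the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for tx in sorted(nodes):
        by_class[labels[tx]].append(tx)
    train, val, test = [], [], []
    for cls in sorted(by_class):
        members = by_class[cls]
        rng.shuffle(members)
        n_train = int(round(fractions[0] * len(members)))
        n_val = int(round(fractions[1] * len(members)))
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]  # remainder goes to the test part
    return train, val, test
```

Splitting per class rather than over the pooled node list is what keeps the illicit fraction roughly equal across the three parts, which matters when one class is rare.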

Figure 2 illustrates the temporal evolution of node labels in the dataset. The left panel shows the proportion of licit versus illicit transactions across time steps, highlighting the high class imbalance and irregular illicit activity bursts. The right panel displays the total number of labeled transactions per time step, indicating non-uniform transaction volumes over time.

In addition to temporal dynamics, we further analyze the data in the frequency domain. Figure 3 presents a zoomed-in comparison (frequency range 0-0.05) of the averaged spectral density of illicit versus licit transactions. The results reveal that illicit transactions consistently exhibit slightly stronger low-frequency components, suggesting more persistent temporal correlations compared to licit transactions. This observation provides direct spectral evidence that laundering-related behaviors manifest distinguishable frequency signatures, thereby motivating our wavelet-based design.
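As a rough illustration of this kind of spectral comparison, the sketch below computes a one-sided periodogram with a plain discrete Fourier transform and measures how much power falls at or below a cutoff frequency. The choice of input signal (any per-class time series), the mean removal, and the helper names are assumptions; the source does not specify how the spectral density was estimated.

```python
import cmath
import math


def periodogram(signal):
    """One-sided periodogram of a real-valued sequence (mean removed)."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    freqs, power = [], []
    for k in range(n // 2 + 1):
        # k-th DFT coefficient, computed directly from the definition.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        freqs.append(k / n)  # frequency in cycles per sample
        power.append(abs(coeff) ** 2 / n)
    return freqs, power


def low_frequency_share(signal, cutoff=0.05):
    """Fraction of one-sided spectral power at frequencies <= cutoff."""
    freqs, power = periodogram(signal)
    total = sum(power)
    if total == 0:
        return 0.0
    return sum(p for f, p in zip(freqs, power) if f <= cutoff) / total
```

A slowly varying series yields a share close to 1, while a rapidly alternating series yields a share close to 0, which is the sense in which stronger low-frequency components indicate more persistent temporal correlation.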

We implement all experiments using PyTorch Geometric on a single NVIDIA A100 GPU. Our ChronoWave-GNN model employs a 3-layer temporal-aware graph transformer architecture, where the input includes raw transaction features, level-2 Haar wavelet coefficients, and 8-dimensional sinusoidal time encodings. Each layer applies dropout with a rate of 0.4 to improve generalization. The model is optimized using AdamW with a learning rate of 0.005 and weight decay. We train for a maximum of 200 epochs with cosine learning rate scheduling and employ early stopping based on validation F1-score (patience = 20). Label smoothing with a factor of 0.1 is introduced to mitigate overconfidence under class imbalance. Evaluation is conducted using Accuracy, Precision, Recall, and F1-score, averaged over five random seeds. A summary of model hyperparameters is presented in Table 2.
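To make two of the input components concrete, the following sketch shows a level-2 Haar discrete wavelet transform and an 8-dimensional sinusoidal time encoding in plain Python. Which sequence the wavelet transform is applied to, the padding rule, and the encoding base of 10000 (borrowed from the standard Transformer positional encoding) are assumptions, not details confirmed by the source.

```python
import math


def haar_step(x):
    """One Haar analysis step on an even-length sequence."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_dwt(x, level=2):
    """Return [cA_L, cD_L, ..., cD_1] for an orthonormal Haar DWT."""
    x = list(x)
    # Assumption: pad by repeating the last value until the length
    # is divisible by 2**level.
    while len(x) % (2 ** level):
        x.append(x[-1])
    approx, details = x, []
    for _ in range(level):
        approx, d = haar_step(approx)
        details.insert(0, d)  # coarsest detail coefficients come first
    return [approx] + details


def sinusoidal_time_encoding(t, dim=8, base=10000.0):
    """Interleaved sin/cos encoding of a scalar time step."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / base ** (2 * i / dim)
        enc += [math.sin(t * freq), math.cos(t * freq)]
    return enc
```

For example, `haar_dwt([1, 2, 3, 4, 5, 6])` pads the input to length 8 and returns a length-2 approximation, a length-2 level-2 detail band, and a length-4 level-1 detail band; these could then be concatenated with the raw features and `sinusoidal_time_encoding(t)` for the node's time step.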

To validate the effectiveness of our approach, we conduct comprehensive comparisons against established graph learning baselines. These include GAT, GraphSAGE, T-GCN, TGAT, DySAT, and GraphMLP. All models are trained on the Elliptic dataset using the same splits and evaluation metrics. Table 3 presents the averaged results over five runs. ChronoWave-GNN surpasses all baselines across all metrics, confirming its superior ability to capture multiscale spatiotemporal dependencies in dynamic financial transaction graphs. This highlights the advantage of jointly modeling wavelet-based frequency signatures with time-sensitive graph attention.

To further illustrate the representational power of ChronoWave-GNN, we visualize the learned node embeddings using UMAP. Since UMAP is a non-linear dimensionality reduction method and may not faithfully preserve distances in the original embedding space, we treat it only as an intuitive illustration rather than definitive evidence of class separability. As shown in Fig. 4, illicit transactions are more distinctly clustered apart from licit ones in the latent space, highlighting the model’s ability to learn semantically meaningful, class-discriminative features.

To provide a more rigorous assessment, we additionally report quantitative separability analyses in the original embedding space. Specifically, we compute Silhouette scores under both Euclidean and cosine distance metrics, as well as intra-/inter-class cosine similarities. Figure 5 summarizes these results: embeddings learned by ChronoWave-GNN achieve consistently high intra-class similarity (around 0.78) and low inter-class similarity (around 0.07), together with positive Silhouette scores. These results confirm that the learned representations indeed capture meaningful class distinctions beyond what is suggested by UMAP projections.
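The intra-/inter-class cosine similarity statistic can be sketched as a mean over all node pairs. This is an illustrative implementation with assumed function names; since it is quadratic in the number of nodes, a real analysis would likely subsample embeddings or use a vectorized library.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_cosine_similarity(embeddings, labels):
    """Mean pairwise cosine similarity within and across classes."""
    intra, inter = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        # Same label -> intra-class pair, different label -> inter-class pair.
        (intra if labels[i] == labels[j] else inter).append(sim)
    return sum(intra) / len(intra), sum(inter) / len(inter)
```

A large gap between the two returned means is what the text describes as high intra-class and low inter-class similarity.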

To assess whether ChronoWave-GNN generalizes beyond a single benchmark, we further evaluate it on two heterogeneous datasets in addition to Elliptic: a synthetic banking-transfer corpus generated by IBM AMLSim and a large-scale Ethereum transaction graph centered on phishing accounts.

The AMLSim dataset is produced by a multi-agent simulator developed by IBM for the study of anti-money laundering (AML) scenarios. It generates synthetic yet realistic banking transactions that embed a variety of known laundering patterns such as structuring and layering. The example dataset includes three key components: accounts (metadata about each bank account, including fraud indicators), transactions (directed money transfers with sender and receiver information), and alerts (transactions flagged under AML rules). We construct a directed temporal graph where nodes represent accounts, edges correspond to transfers, and timestamps capture the chronological flow of activities. This setting reflects realistic AML pipelines in which fraudulent accounts must be distinguished from legitimate ones.

The Ethereum phishing dataset, by contrast, is derived from a real blockchain environment. Starting from phishing addresses reported on Etherscan, a second-order breadth-first search crawl was conducted to expand the network, resulting in a transaction graph containing nearly three million nodes and over thirteen million edges. Each node corresponds to an Ethereum address, with a binary label indicating whether it is a phishing account, while each edge records a transfer event annotated with both the transaction amount and timestamp. This large-scale graph is particularly challenging due to its sparsity, skewed label distribution, and heterogeneous activity patterns across addresses.
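A second-order breadth-first crawl of this kind can be sketched as a depth-limited BFS from the seed addresses. The `neighbors` callable is a hypothetical stand-in for whatever lookup returns the counterpart addresses of an address; the source does not describe the crawler's interface.

```python
from collections import deque


def bfs_crawl(seeds, neighbors, max_hops=2):
    """Expand from seed addresses up to max_hops; return node depths and edges."""
    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    edges = set()
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # nodes on the frontier are kept but not expanded
        for nb in neighbors(node):
            edges.add((node, nb))
            if nb not in depth:
                depth[nb] = depth[node] + 1
                queue.append(nb)
    return depth, edges
```

With `max_hops=2`, the crawl keeps the seeds, their direct counterparts, and the counterparts of those counterparts, which is why the resulting graph grows so quickly from a modest seed list.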

As shown in Table 4, ChronoWave-GNN achieves consistently strong performance across Elliptic, Ethereum, and AMLSim. These results demonstrate that the proposed model is not confined to a single benchmark but can effectively adapt to diverse blockchain and financial transaction environments, thereby confirming its robustness and practical applicability for real-world AML and fraud detection tasks.

To better understand the individual contributions of each architectural component in ChronoWave-GNN, we conduct an ablation study by removing or modifying specific modules and observing the resulting changes in model performance. We consider several variants: removing the temporal encoding component (denoted as w/o Time Encoding), eliminating wavelet-based frequency augmentation (w/o Wavelet Features), disabling label smoothing (w/o Label Smoothing), training the model without dropout regularization (w/o Dropout), replacing the full architecture with a standard 2-layer Graph Attention Network (Simplified GAT Baseline), as well as two complementary non-GNN approaches: a GNN+LSTM hybrid combining graph embeddings with sequential modeling, and a Gradient Boosted Decision Tree (GBDT) classifier trained on graph-derived features. Furthermore, to examine the sensitivity of the wavelet decomposition step, we also vary the depth of the discrete wavelet transform (DWT) from level-1 to level-4, the choice of wavelet basis, and alternative temporal encoding schemes.

Table 5 shows that removing any component leads to measurable degradation, confirming the necessity of each module. Dropout proves most critical for generalization, while wavelet-based augmentation is indispensable for enhancing discriminative power. The non-GNN reference baselines achieve reasonable performance but remain below the full ChronoWave-GNN, indicating that while sequential or tree-based approaches capture part of the underlying dynamics, the integration of temporal, frequency, and relational graph information provides the most consistent improvements. The wavelet-level ablation further shows that level-2 decomposition strikes the best balance: level-1 underfits by missing long-term structures, while deeper levels (3 or 4) lose fine-grained temporal cues, leading to marginal degradation. Regarding wavelet basis, compact orthogonal filters (Haar, Symlet-4) slightly outperform longer filters while being computationally efficient. For temporal encoding, sinusoidal embeddings yield the most stable and accurate results, with Time2Vec competitive but less stable, and simple linear projection lagging behind. Overall, these findings verify that the superior performance of ChronoWave-GNN arises from the complementary integration of its temporal, frequency, and graph components.

We group test transactions into behavior types using simple on-graph heuristics (in-/out-degree and temporal dispersion of in-neighbors): rapid_funneling (high in-degree with concentrated neighbor timestamps), long_layering (higher out-degree with dispersed timestamps), and a generic other bucket. Figure 6 reports the misclassification rate by behavior type (mean ± 95% Wilson CI across five runs). Residual errors concentrate on other cases, whose interval is relatively narrow, indicating stable but generic failure modes. In contrast, rapid_funneling exhibits a lower mean error yet a wide interval due to limited support, suggesting that conclusions for this specialized pattern are more uncertain. The long_layering category contains no test samples in this split and is therefore omitted from the figure.
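The bucketing heuristic and the Wilson interval can be sketched as follows. The degree and dispersion thresholds are hypothetical placeholders (the source does not report the cut-offs used), whereas the interval follows the standard Wilson score formula for a binomial proportion.

```python
import math
from statistics import pstdev


def behavior_type(in_degree, out_degree, in_neighbor_times,
                  deg_thresh=5, spread_thresh=1.0):
    """Assign a behavior bucket; both thresholds are illustrative assumptions."""
    spread = pstdev(in_neighbor_times) if len(in_neighbor_times) > 1 else 0.0
    if in_degree >= deg_thresh and spread <= spread_thresh:
        return "rapid_funneling"
    if out_degree >= deg_thresh and spread > spread_thresh:
        return "long_layering"
    return "other"


def wilson_interval(errors, n, z=1.96):
    """95% Wilson score interval for an error rate of errors / n."""
    if n == 0:
        return 0.0, 1.0
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

Because the interval width shrinks roughly with the square root of the sample count, a bucket with few test samples produces a much wider interval than a well-populated one at a similar error rate.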

Qualitative inspection of misclassified other cases reveals boundary-like profiles whose transactional amounts, counterpart diversity, and timing closely mimic licit activity while containing weak illicit cues, leading to neighborhood over-smoothing. For rapid_funneling, over-confident false negatives typically arise when rapid in-flows are followed by a single low-activity out-edge that partially masks the funnel signature within the local 2-hop neighborhood. These patterns align with the aggregated statistics in Figure 6 and indicate where future feature design and sampling could further reduce errors.

We explicitly model transaction semantics on edges and integrate them into temporal attention. For each directed edge, we construct an attribute vector comprising temporal gaps, standardized amount-derived signals for source and target together with their differences and a stabilized ratio proxy, one-hot role tags indicating sender/receiver sharing (SS/RR/SR/RS), and interaction affinity (TX-type match). Numeric edge attributes are standardized with statistics fitted on the training window only, and per-window subgraphs are concatenated block-diagonally to disallow cross-window edges. Attention logits are conditioned on edge semantics by augmenting the key/value projections, so that message passing dynamically modulates the contribution of edge information rather than treating it as static regularization.
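For orientation, one common way to condition attention on edge attributes is the form used by the TransformerConv layer in PyTorch Geometric, where a projected edge vector is added to both the key and the value. The version below is a sketch under that assumption: the symbols $h_i$ (node representation), $e_{ij}$ (edge attribute vector), $d$ (head dimension), and the projection matrices $W_q, W_k, W_v, W_e, W_s$ are notation introduced here, and the authors' exact formulation may differ.

```latex
\alpha_{ij} = \operatorname{softmax}_{j \in \mathcal{N}(i)}
  \left( \frac{(W_q h_i)^{\top} \left( W_k h_j + W_e e_{ij} \right)}{\sqrt{d}} \right),
\qquad
h_i' = W_s h_i + \sum_{j \in \mathcal{N}(i)} \alpha_{ij} \left( W_v h_j + W_e e_{ij} \right)
```

Because $W_e e_{ij}$ enters the logit as well as the aggregated message, the edge attributes influence both how much weight a neighbor receives and what content it contributes.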

To assess the utility of these features, we performed paired comparisons between a node-only baseline and the edge-augmented model under identical seeds and chronological splits. Results are summarized in Table 6. The edge-enabled variant yields small but consistent improvements across accuracy, precision, recall, and F1 (all 0.3-0.5% higher on average). Confidence intervals exclude zero and p-values are below 0.05, indicating that the gains are statistically significant, albeit modest in magnitude. This suggests that edge semantics are indeed leveraged by the temporal attention mechanism, providing complementary relational cues beyond node attributes. Importantly, the absence of performance degradation confirms that our integration strategy avoids overfitting to noisy edge patterns, supporting the generality and robustness of the approach.
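A paired comparison over matched seeds can be sketched with a paired t-interval on the per-seed differences. The source does not state which test produced the reported confidence intervals and p-values, so the use of a paired t-statistic here is an assumption, and the scores in the usage note are made-up placeholders rather than results from the paper.

```python
import math
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}


def paired_comparison(baseline, augmented):
    """Mean paired difference, its 95% CI, and the paired t statistic."""
    diffs = [a - b for a, b in zip(augmented, baseline)]
    n = len(diffs)
    m = mean(diffs)
    se = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    t_crit = T_CRIT_95[n - 1]
    return m, (m - t_crit * se, m + t_crit * se), m / se
```

With placeholder F1 scores such as `[0.900, 0.902, 0.898, 0.901, 0.899]` for the node-only model and `[0.904, 0.905, 0.903, 0.904, 0.904]` for the edge-augmented one, the interval lies entirely above zero, which is the pattern the text describes as a small but statistically significant gain.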

To further investigate the internal learning behavior of ChronoWave-GNN, we analyze the distribution of bias parameters in the first TransformerConv layer across early training epochs. As illustrated in Fig. 7, the bias values for all attention subcomponents, including lin_key, lin_query, lin_value, and lin_skip, exhibit only minimal changes from Epoch 1 to Epoch 2. This stability suggests that the model’s representational power is primarily derived from the learned attention weights and the dynamic interaction between node features and time encodings, rather than from shifts in bias. In temporal graph transformers, bias terms therefore play a secondary role compared to temporal and structural attention mechanisms.

To complement this stability analysis, we also provide attention interpretability visualizations. As shown in Fig. 8, the left panel reports the distribution of attention weights versus temporal gaps on test edges, while the right panel highlights the top-k incoming neighbors of a representative illicit node. These results show that ChronoWave-GNN assigns disproportionately high weights to temporally concentrated inflows and emphasizes suspicious multi-hop substructures, consistent with known laundering dynamics.

Although a straightforward approach is to concatenate wavelet features with the original node attributes, such static augmentation fails to exploit their temporal dynamics. To assess the effectiveness of our proposed dynamic fusion, we compare three strategies: ChronoWave-GNN with time-aware wavelet fusion, a Concat baseline with static concatenation, and a NoWavelet baseline without wavelet features.

The results in Table 7 show that ChronoWave-GNN consistently outperforms both alternatives across all metrics. The improvements are not only larger in magnitude but also more stable across runs. Boxplot analysis (Fig. 9) further illustrates that ChronoWave-GNN achieves higher medians and lower variance, confirming the robustness of the dynamic integration mechanism.

To verify that wavelet features are actively utilized during inference, we examine the temporal evolution of attention weights and gate activations associated with the wavelet branch, as shown in Fig. 10a.

Robust anti-money laundering (AML) systems must generalize to future activity rather than rely on random splits. We therefore adopt a chronological, rolling-window protocol to explicitly evaluate robustness under temporal domain shift. In each fold, training and validation use only past transactions, while the test window lies strictly in the future, preventing any leakage of temporal information. Feature standardization and categorical alignment are fitted on the training window and applied to later windows; subgraphs are constructed per window and concatenated in a block-diagonal manner to disallow cross-window edges. This protocol mirrors production constraints where models must score never-seen, forward-evolving streams.
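The rolling-window protocol can be sketched as follows. The window sizes, the stride, and the helper names are illustrative assumptions; the source does not report the exact fold layout.

```python
from statistics import mean, pstdev


def rolling_windows(time_steps, train_size, val_size, test_size, stride):
    """Yield chronological (train, val, test) folds of time steps."""
    steps = sorted(set(time_steps))
    span = train_size + val_size + test_size
    folds, start = [], 0
    while start + span <= len(steps):
        train = steps[start:start + train_size]
        val = steps[start + train_size:start + train_size + val_size]
        test = steps[start + train_size + val_size:start + span]
        folds.append((train, val, test))  # test is strictly after train and val
        start += stride
    return folds


def fit_standardizer(train_values):
    """Fit mean/std on the training window only; reuse on later windows."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against constant features
    return lambda x: (x - mu) / sigma


def window_edges(edges, node_time, window):
    """Keep edges whose endpoints both fall inside the given window."""
    allowed = set(window)
    return [(u, v) for u, v in edges
            if node_time[u] in allowed and node_time[v] in allowed]
```

Fitting the standardizer on training values and applying the returned function to validation and test values is what prevents statistics from future windows from leaking into training, and `window_edges` mirrors the rule that no edge may connect two different windows.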

The evaluation results are summarized in Table 8. ChronoWave-GNN maintains stable and competitive performance when predicting on unseen time windows, indicating resilience to temporal drift. Accuracy, precision, recall, and F1 remain tightly concentrated across chronological splits, with limited performance variability. Figure 11 further illustrates the per-fold F1 trajectory, showing steady improvement as the model encounters later time periods.
