
In the present study, we developed a sensitive genotyping method based on a peer-to-peer network-derived identifier for error reduction in amplicon sequencing (SPIDER-seq). In this amplicon sequencing method, errors are reduced by generating a consensus sequence for each cluster identifier (CID). As proof of concept, we applied SPIDER-seq to a model oligonucleotide and a set of mock ctDNA references. We clustered daughter strands by constructing a peer-to-peer network and demonstrated that generating a CID-based consensus effectively reduces errors; the sensitivity of the approach enabled the detection of a 0.125% ctDNA allele frequency (AF) with high accuracy and reproducibility. By determining the lineage of amplification, we found that the majority of sequencing errors were corrected, but polymerase errors introduced in the early cycles of amplification were not. We also demonstrated that SPIDER-seq can be applied to multiple targets via multiplex PCR [30,31].
We hypothesized that amplicon sequencing would be more efficient for personalized monitoring of multiple target mutations because the assays can be completed rapidly and easily with a high on-target ratio using inexpensive PCR reagents. Amplicon sequencing is also suitable for low amounts of input material. However, molecular tagging with a UID in PCR is more complex than ligation reactions (Fig. 1a). When incorporating a UID sequence via PCR primers, the UID sequences are overwritten over repeated PCR cycles (Fig. 1b); thus, multiple UID-pairs must be generated from the starting molecule. To the best of our knowledge, only two types of methods currently enable molecular tagging of an amplicon library. One type of method involves limiting the number of amplifications (e.g., 2-3 PCR cycles) to prevent overwriting of the UID. However, restricting the number of PCR cycles increases the difficulty of preparing libraries of inhibitor-treated clinical samples. The second type of method involves the use of linear amplification followed by single-stranded ligation for attaching the UID. Although this type of method enables detection of mutant alleles with a frequency of <0.01%, the complex experimental process, which includes a 24-hour ligation and three rounds of PCR, makes the approach laborious.
To overcome the abovementioned drawbacks, we evaluated the suitability of a general PCR approach. The recent development of blockchain technology motivated us to combine multiple UID-pairs generated from each individual starting molecule strand into a single identity. Blockchain technology employs a distributed networking strategy to record information so that the resulting data can be identified by integrating all individual records searched through a peer-to-peer network. Similarly, integrated information can be obtained from peer-to-peer networks constructed using all daughter molecules derived from the first-copied strand from the original molecule. The daughter strand resulting from each amplification cycle with a UID-containing primer will contain two UIDs. The first is the overwritten UID incorporated by the primer, and the other is the replicated UID derived from the parental strand. Both the parental and daughter strands thus share one UID (Fig. 1b). We hypothesized that we could create a link between the two strands using the shared UID and that the created linkage could be extended to the granddaughter strand. The overwritten UID in the daughter strand could be used again as a shared UID with a granddaughter strand so that a connection could be achieved with all descendant strands derived from the first-copied strand (Fig. 1c). The UIDs in the network could therefore be considered a cluster (i.e., grouped UIDs) that can serve as an integrated identifier (designated a cluster identifier [CID]). We also hypothesized that the rate of sequencing and polymerase errors could be reduced if the consensus sequence was generated based on the CID.
In addition to building a consensus sequence, we expected that the linkage could be used to construct a lineage of amplification. By placing the first-attached UID from a cluster in the top position, daughter strands can be appended in order of generation similar to a rooted phylogenetic tree. This arrangement can be used to characterize the error pattern by investigating the continuity of errors along the branch. Sporadic errors, such as sequencing errors, should be focused at nodes, whereas polymerase errors should be conserved along the branch. We postulated that this pattern could be used to review the effectiveness of error correction efforts.
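The lineage idea described above can be illustrated with a short sketch. It assumes that the UID-pairs of one cluster are available as tuples of hashable labels, that left and right UIDs carry distinct labels, and that the first-attached (root) UID is already known. The function name `build_lineage` and the input layout are hypothetical and are not taken from the study's software.

```python
from collections import defaultdict, deque


def build_lineage(uid_pairs, root):
    """Arrange the UIDs of one cluster as a rooted tree.

    uid_pairs: iterable of (uid_a, uid_b) tuples, one per observed strand.
    root: the UID assumed to be the first-attached UID of the cluster.
    Returns (parent, generation), two dicts keyed by UID.
    """
    # Each UID-pair links the two UIDs carried by one strand.
    neighbors = defaultdict(set)
    for uid_a, uid_b in uid_pairs:
        neighbors[uid_a].add(uid_b)
        neighbors[uid_b].add(uid_a)

    parent = {root: None}
    generation = {root: 0}
    queue = deque([root])
    # Breadth-first traversal: UIDs reached one link later are placed
    # one generation further from the root.
    while queue:
        uid = queue.popleft()
        for partner in sorted(neighbors[uid]):
            if partner not in generation:
                parent[partner] = uid
                generation[partner] = generation[uid] + 1
                queue.append(partner)
    return parent, generation


# Toy cluster: L0 is paired with R1 and R2; R1 is later paired with L3.
pairs = [("L0", "R1"), ("L0", "R2"), ("L3", "R1")]
parent, generation = build_lineage(pairs, root="L0")
```

In this toy cluster, `R1` and `R2` are placed one generation below `L0`, and `L3` is placed below `R1`. Errors could then be compared along each parent-to-child path to see whether they persist down a branch or appear at a single node only.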
To assess the feasibility of constructing a cluster based on a peer-to-peer network, we conducted a model experiment using an oligonucleotide containing a barcode consisting of 12-nt degenerate bases. The oligonucleotide was then amplified along with a pair of UID-containing primers via six rounds of thermal cycling using KAPA HiFi polymerase (Materials and methods, Fig. 2a). An amplicon library was then prepared, and paired-end sequencing was performed. We hypothesized that the barcode content would be identical across the reads in a single CID obtained by linking the UID-pairs.
We initially investigated the characteristics of the UID-pairs using the sequencing data. Assuming an ideal case in which each strand is used repeatedly as a template across cycles, each strand would be expected to produce multiple daughter strands in the amplification experiment, such that the parental strand could be linked directly to the multiple daughter strands (Supplementary Fig. 1). The possible number of daughter strands obtained from each parental strand was estimated at a maximum of five, assuming that the first-copied strand was synthesized in the first amplification cycle and daughter strands would be produced in the second through sixth cycles. Any parental strand synthesized later than the first cycle would therefore produce fewer than five daughter strands. In other words, a UID could have a maximum of five paired-UIDs from different daughter strands. As expected, we found that most UIDs had five or fewer paired-UIDs.
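The counting argument above can be written out as a small calculation. The sketch assumes the ideal case described in the text, in which a strand synthesized in a given cycle serves as a template once in every later cycle. The function name `max_daughters` is hypothetical.

```python
def max_daughters(synthesis_cycle, total_cycles):
    """Upper bound on daughter strands copied directly from one strand.

    A strand synthesized in cycle k can serve as a template in cycles
    k+1 through total_cycles, i.e. at most total_cycles - k times.
    """
    if not 1 <= synthesis_cycle <= total_cycles:
        raise ValueError("synthesis_cycle must lie within 1..total_cycles")
    return total_cycles - synthesis_cycle


# Six cycles, as in the model experiment: strands made in cycles 1..6.
bounds = [max_daughters(k, 6) for k in range(1, 7)]  # [5, 4, 3, 2, 1, 0]
```

Under this assumption, only a strand made in the first of six cycles can reach five daughter strands, which matches the maximum of five paired-UIDs per UID stated above.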
Although most UIDs had no more than five paired-UIDs, 8.41% of UIDs produced more than five paired-UIDs (Fig. 2b). We concluded that UIDs linked to more than five UIDs were associated with the high melting temperatures of high-GC sequences. Graphs of the distribution of observed GC content exhibited a distinct right tailing indicative of high GC content (Fig. 2c, d), which was not observed in an ideal distribution of a randomly generated UID set. We presume that a primer with a high-GC UID may preferentially reattach, leading to the initiation of a new lineage independent of the original one. We also found that more daughter-UIDs tended to be produced from parental UIDs with a GC content of ≥80% (Fig. 2e). Because UIDs with more than five paired-UIDs could cause over-collapsing, they could yield a false consensus. In particular, the sensitivity of detecting mutations in ctDNA would decrease if UID-pairs derived from normal DNA were collapsed with UID-pairs derived from ctDNA. Therefore, we filtered out UIDs for which the number of paired-UIDs was higher than the number of cycles or for which the GC content was ≥80%.
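The filtering rule above can be sketched as follows. The sketch assumes that UID-pairs are given as (left UID, right UID) tuples of non-empty sequence strings, and that every UID-pair containing a flagged UID is discarded; the latter is an assumption about how the filter is applied. The function names and the `gc_cutoff` parameter are hypothetical.

```python
from collections import defaultdict


def gc_fraction(uid):
    """Fraction of G or C bases in a non-empty UID sequence."""
    return sum(base in "GC" for base in uid.upper()) / len(uid)


def filter_uid_pairs(uid_pairs, n_cycles, gc_cutoff=0.8):
    """Drop UID-pairs that contain a suspicious UID.

    A UID is flagged if it has more distinct paired-UIDs than there were
    PCR cycles, or if its GC content is at or above gc_cutoff.
    Returns (kept_pairs, flagged_uids).
    """
    # Left and right UIDs are tracked separately, keyed by (side, sequence).
    partners = defaultdict(set)
    for left, right in uid_pairs:
        partners[("L", left)].add(right)
        partners[("R", right)].add(left)

    flagged = {
        key
        for key, paired in partners.items()
        if len(paired) > n_cycles or gc_fraction(key[1]) >= gc_cutoff
    }
    kept = [
        (left, right)
        for left, right in uid_pairs
        if ("L", left) not in flagged and ("R", right) not in flagged
    ]
    return kept, flagged


pairs = [
    ("ACGTACGTAC", "TTGACATGCA"),
    ("GGCCGGCCGC", "TTGACATGCA"),  # left UID is 100% GC
    ("ACGTACGTAC", "AATTCAGTCA"),
]
kept, flagged = filter_uid_pairs(pairs, n_cycles=6)
```

With six cycles, only the all-GC left UID is flagged in this toy input, so the first and third UID-pairs are kept.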
Parent and daughter strands were then linked via the peer-to-peer network (Supplementary Fig. 2). Extension of the linkage between strands was carried out in a manner similar to de novo assembly. To simplify the computational process, individual UIDs were used as vertices (Clustering via construction of a peer-to-peer network in the Materials and methods). Starting from a seed UID randomly selected from the observed UIDs, paired-UIDs were recursively added to the opposite side of the strand until no new paired-UIDs remained to be added. The list of paired-UIDs was considered a cluster, and a CID was assigned to each cluster. This process resulted in the formation of 58,114 clusters consisting of various sets of UID-pairs (Fig. 2f). For each cluster, the UIDs on each side (left and right sides of the amplicon, designated "left UID" and "right UID") were used in a balanced manner; the maximum total number of left and right UIDs per cluster (i.e., number of left UIDs + number of right UIDs, designated "cluster size") was 37.
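The clustering step can be illustrated as finding connected components in a graph whose vertices are UIDs and whose edges are UID-pairs. The sketch below is a minimal illustration under stated assumptions, not the implementation used in the study: seeds are taken in input order rather than at random (the resulting clusters are the same either way), the traversal is iterative rather than recursive, and the function names are hypothetical.

```python
from collections import defaultdict


def cluster_uid_pairs(uid_pairs):
    """Assign a cluster identifier (CID) to every UID.

    uid_pairs: iterable of (left_uid, right_uid) tuples.
    Returns a dict mapping (side, uid) to an integer CID.
    """
    # Left and right UIDs are distinct vertices even if sequences coincide.
    adjacency = defaultdict(set)
    for left, right in uid_pairs:
        left_vertex, right_vertex = ("L", left), ("R", right)
        adjacency[left_vertex].add(right_vertex)
        adjacency[right_vertex].add(left_vertex)

    cid_of = {}
    next_cid = 0
    for seed in adjacency:
        if seed in cid_of:
            continue
        # Grow the cluster from the seed by repeatedly adding paired-UIDs
        # on the opposite side until no new UID can be reached.
        cid_of[seed] = next_cid
        stack = [seed]
        while stack:
            vertex = stack.pop()
            for partner in adjacency[vertex]:
                if partner not in cid_of:
                    cid_of[partner] = next_cid
                    stack.append(partner)
        next_cid += 1
    return cid_of


def cluster_sizes(cid_of):
    """Cluster size = number of left UIDs + number of right UIDs."""
    sizes = defaultdict(int)
    for cid in cid_of.values():
        sizes[cid] += 1
    return dict(sizes)


pairs = [("A", "X"), ("A", "Y"), ("B", "Y"), ("C", "Z")]
cid_of = cluster_uid_pairs(pairs)
sizes = cluster_sizes(cid_of)  # {0: 4, 1: 2}
```

In the toy input, left UIDs `A` and `B` are joined through the shared right UID `Y`, giving one cluster of size 4, while the isolated pair (`C`, `Z`) forms a cluster of size 2.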
The clusters were characterized by first determining how many reads and UID-pairs supported each CID. On average, each CID was composed of 6.283 paired-reads (Fig. 2g). In contrast, fewer reads (2.955 on average) supported each UID-pair than each cluster. This suggests that using CIDs is more advantageous than using UID-pairs because more reads are available to generate the consensus for error reduction. In terms of UID-pairs, clusters with a size of 2 that consisted of only one UID-pair accounted for 66.05% of all clusters (Fig. 2h). The remaining clusters (33.95% of the total) were supported by 95,920 UID-pairs (68.94% of all UID-pairs). The majority of UID-pairs were therefore linked with other UID-pairs.
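To illustrate why more supporting reads help, the sketch below builds a consensus by per-position majority vote over the reads assigned to one CID. The majority-vote rule, the `min_fraction` threshold, and the use of `N` for unresolved positions are assumptions made for illustration; the text does not specify how the consensus was computed.

```python
from collections import Counter


def consensus(reads, min_fraction=0.5):
    """Per-position majority-vote consensus of equal-length reads.

    A position is called only if the most frequent base is supported by
    more than min_fraction of the reads; otherwise 'N' is emitted.
    """
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")

    called = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        called.append(base if count / len(reads) > min_fraction else "N")
    return "".join(called)


three_reads = consensus(["ACGT", "ACGT", "ACTT"])  # "ACGT"
two_reads = consensus(["ACGT", "ACTT"])            # "ACNT"
```

With three reads, the single discordant base is outvoted; with only two reads, the same position cannot be resolved. This is the sense in which grouping reads by CID rather than by UID-pair gives the consensus more evidence to work with.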
To assess the accuracy of cluster construction, the specificity of the barcode content per cluster was determined by calculating the frequency of major barcode content per cluster. As expected, most of the clusters contained the same barcode content regardless of cluster size (Fig. 2i). Even in cases in which the specificity was not 100%, the barcode content sequences were very similar, and only 1 to 2 content mismatches were observed. After correction of these mismatches, 99.09% of the clusters exhibited identical barcodes (i.e., specificity of 100%). These mismatches were assumed to have arisen during PCR and sequencing.
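The specificity measure described above can be sketched as the fraction of reads in a cluster that carry the most frequent (major) barcode. The correction step is an assumption for illustration: barcodes within a small Hamming distance of the major barcode are counted as matching it. The function names and the `max_mismatches` parameter are hypothetical.

```python
from collections import Counter


def barcode_specificity(barcodes):
    """Return (major_barcode, fraction of reads carrying it)."""
    major, count = Counter(barcodes).most_common(1)[0]
    return major, count / len(barcodes)


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def corrected_specificity(barcodes, max_mismatches=2):
    """Specificity after treating near-identical barcodes as the major one."""
    major, _ = barcode_specificity(barcodes)
    matching = sum(
        len(barcode) == len(major) and hamming(barcode, major) <= max_mismatches
        for barcode in barcodes
    )
    return matching / len(barcodes)


cluster = ["ACGTACGTACGT"] * 3 + ["ACGTACGTACGA"]  # one read with 1 mismatch
major, raw = barcode_specificity(cluster)           # raw == 0.75
corrected = corrected_specificity(cluster)          # corrected == 1.0
```

In the toy cluster, one of four reads differs from the major barcode at a single position, so the raw specificity is 0.75 and the corrected specificity is 1.0.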
We then determined how many clusters are generated from a starting molecule. In PCR assays, one starting molecule can produce a copy molecule (i.e., first-copied strand) in each cycle. Thus, multiple clusters can be initiated from a single molecule in every PCR cycle (Supplementary Fig. 1). Theoretically, therefore, a maximum of five clusters could be generated from a template oligonucleotide through six amplification cycles. As expected, most of the barcode contents were redundantly observed in multiple clusters (Fig. 2j). However, some barcode contents were observed in more than five clusters. This was attributed to breakage of the clusters into multiple pieces due to missing UID-pairs during the purification or sequencing steps (Supplementary Fig. 3). This cluster breakage was also considered to be the cause of the high proportion of clusters with a size of 2 (Fig. 2h). Indeed, the redundancy decreased when we discarded clusters with a size less than 3. Importantly, we hypothesize that the redundant clusters would be helpful in attempts to detect extremely low amounts of ctDNA because multiple consensus sequences can be obtained from one molecule.
The pattern of errors introduced into the barcode contents was investigated by constructing lineages for individual clusters. We hypothesized that the origin UID would have the most paired-UIDs because the first-copied strand has the highest probability of forming a daughter strand over the entire course of PCR. We then repeatedly arranged the linked UIDs to complete the path (Supplementary Fig. 4). The first-copied strand was defined as the ancestor, and the subsequently produced molecules with UID-pairs were defined as descendants. We initially examined whether errors are conserved along generations. To elucidate the error pattern, we focused on clusters containing barcode content in which one or two mismatches were introduced at a non-negligible frequency (at least one cluster for the barcode content has a specificity of

