
Cloud computing enables data storage and application deployment over the internet, offering benefits such as mobility, resource pooling, and scalability. However, it also presents major challenges, particularly in managing shared resources, ensuring data security, and controlling distributed applications in the absence of centralized oversight. One key issue is data duplication, which leads to inefficient storage, increased costs, and potential privacy and security risks. To address these challenges, this study proposes a post-quantum mechanism that enhances both cloud security and deduplication efficiency. The proposed SALIGP method leverages Genetic Programming and a Geometric Approach, integrating Bloom Filters for efficient duplication detection. The Cryptographic Deduplication Authentication Scheme (CDAS) is introduced, which utilizes blockchain technology to securely store and retrieve files, while ensuring that encrypted access is limited to authorized users. This dual-layered approach effectively resolves the issue of redundant data in dynamic, distributed cloud environments. Experimental results demonstrate that the proposed method significantly reduces computation and communication times at various network nodes, particularly in key generation and group operations. Encrypting user data prior to outsourcing ensures enhanced privacy protection during the deduplication process. Overall, the proposed system leads to substantial improvements in cloud data security, reliability, and storage efficiency, offering a scalable and secure framework for modern cloud computing environments.
Cloud computing is an efficient computational paradigm that leverages centralized bandwidth and memory processing resources. A centralized remote server maintains data and applications, making them accessible via the internet. It operates effectively in conjunction with grid computing concepts and functions as a subscription-based service model. Through cloud services, users can access applications over an interconnected network from any location, enabling shared access to resource repositories, software, and services. Virtualization is a foundational element of cloud computing, managing different versions of software, network resources, and hardware platforms. It enables the creation of virtual entities, allowing a single server to function as multiple servers or a single microcomputer to run multiple software systems simultaneously. Virtualization also offers substantial storage capacity that appears readily available to users. This is achieved through both software and hardware, generating entities that may not physically exist. Common forms of virtualization include server virtualization, virtual networks, desktop virtualization, and virtual memory, all of which are integral to cloud computing infrastructures.
Cloud providers implement various capacity optimization techniques to conserve storage space by identifying and eliminating redundant data using similarity detection methods. This optimization conflicts with the confidentiality needs of users, as it may expose sensitive information stored in the cloud. Encryption is widely adopted as a primary method for preserving data confidentiality. Robust encryption produces ciphertext with high entropy, making similarity-based storage optimization largely ineffective. For example, when a frequently duplicated and shared file is encrypted, comparing it for similarity becomes impractical. A demonstration was conducted using a record encrypted and analyzed for similarity through the default MW 2013 online tool and binary encryption comparison. As illustrated in Fig. 1, the results underscore the diminished effectiveness of similarity-based techniques in encrypted environments. Data deduplication is a process in which a storage provider retains only a single instance of a file or its components, even if the same data is stored by multiple users. To avoid the inefficiency of comparing entire files, deduplication techniques use indexes, often referred to as tags or identifiers. Typically, a file is processed through an indexing function — most commonly a collision-resistant hash function — to generate a unique identifier.
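The tag-based indexing described above can be sketched in a few lines. The following is a minimal, illustrative file-level deduplication index (the `DedupStore` class and its method names are hypothetical, not part of the proposed system), using SHA-256 as the collision-resistant indexing function:

```python
import hashlib


class DedupStore:
    """Minimal file-level deduplication index: one stored copy per unique content."""

    def __init__(self):
        self.blobs = {}   # tag -> content (single stored instance)
        self.owners = {}  # tag -> set of users referencing that instance

    @staticmethod
    def tag(data: bytes) -> str:
        # A collision-resistant hash of the file serves as its identifier (tag)
        return hashlib.sha256(data).hexdigest()

    def put(self, user: str, data: bytes) -> bool:
        """Store data for a user; return True if it was a duplicate (no new copy kept)."""
        t = self.tag(data)
        duplicate = t in self.blobs
        if not duplicate:
            self.blobs[t] = data
        self.owners.setdefault(t, set()).add(user)
        return duplicate
```

When two users upload identical content, only the first upload consumes storage; the second merely adds an owner reference to the existing instance.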
In client-side deduplication, the file hash is calculated on the user’s end, and only the hash is transmitted to the storage service. This approach helps conserve network bandwidth by avoiding the transfer of redundant data. Conversely, in server-side deduplication, the user sends the complete file to the storage provider, which then processes it to detect and eliminate duplicates.
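The client-side variant can be sketched under the simplifying assumption of an in-memory server index (the `server_index` dictionary and the function names below are illustrative only): the client transmits the hash first and uploads the full file only when the server does not already hold it.

```python
import hashlib

# Hypothetical stand-in for the storage provider's tag index
server_index = {}


def server_has(tag: str) -> bool:
    return tag in server_index


def server_store(tag: str, data: bytes) -> None:
    server_index[tag] = data


def client_upload(data: bytes) -> str:
    """Client-side deduplication: hash locally, upload only if the server lacks it."""
    tag = hashlib.sha256(data).hexdigest()
    if server_has(tag):      # duplicate: only the hash crossed the network
        return "skipped"
    server_store(tag, data)  # first copy: full transfer required
    return "uploaded"
```

Server-side deduplication would instead receive the full file unconditionally and perform the same tag lookup only after the transfer, trading bandwidth for a simpler client.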
Data, a collection of values and variables, originates from human-readable information. With the advent of computer systems, this information has been digitized into binary code consisting of 0s and 1s. Digitized data is stored in computer systems and encompasses various formats such as audio, video, images, and text. Modern data storage primarily relies on internal and external hard disks, which have replaced older storage media like magnetic tapes and floppy disks. Figure 2 illustrates the data processing lifecycle, where data is collected from users and sent to a processing unit. The unit processes the data according to specific requirements, and the output is stored in a designated storage unit, often a cloud server in today's computing environment. A reliable input-based processing system integrates both hardware and software components to efficiently manage this lifecycle.
Data security plays a vital role in ensuring user privacy and in protecting the hardware and software involved in data processing. It ensures that only authorized individuals can access or modify the data. With the advancement of Information and Communication Technology (ICT), electronic data storage and processing have become more secure, with protocols achieving data safety levels above 99%. Nevertheless, hackers continue to develop new methods to compromise data integrity, driving the need for continuous improvements in security measures, policies, and protocols. The primary goal of data security is to offer the highest level of protection for personal and organizational information, aiming to minimize risks while maximizing safety. Figure 3 presents the core components of a robust data security framework. Existing encryption techniques are typically categorized into hashing, symmetric encryption, and asymmetric encryption. Hashing transforms data into a fixed-length string or number, providing a one-way function that supports data integrity checks without needing the original data. Symmetric and asymmetric encryption differ primarily in their implementation and security levels. Asymmetric encryption, which offers enhanced security, typically requires more memory and processing power than symmetric methods. The performance and strength of any encryption algorithm are influenced by key size and the specific algorithm used. Shorter encryption keys are more vulnerable to brute-force attacks, whereas longer keys significantly reduce the risk by increasing computational effort. In many security-sensitive applications, data is converted into ciphertext, an unreadable format that ensures protection against unauthorized access.
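As an illustration of hashing as a one-way function for integrity checking, the sketch below fingerprints data with SHA-256 and verifies it without needing the original content (the function names are hypothetical; a real system would pair such a check with symmetric or asymmetric encryption for confidentiality):

```python
import hashlib
import hmac


def fingerprint(data: bytes) -> str:
    # One-way: the digest confirms integrity but cannot be inverted to recover the data
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected: str) -> bool:
    # Constant-time comparison avoids leaking information through timing
    return hmac.compare_digest(fingerprint(data), expected)
```

Note that key length drives brute-force cost exponentially: each added key bit doubles the search space, which is why longer keys are markedly harder to attack.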
Data security can be categorized into three levels: low, medium, and high. During data transmission between two terminals, various network-based attacks may occur, posing significant risks to the confidentiality and integrity of data. To address these threats, two primary strategies are commonly employed: software-based protection and hardware-based protection. Software-based protection involves the use of firewalls, antivirus programs, and other security tools to safeguard the operating system and system files. Many operating systems are equipped with built-in features that include specialized firewalls, which help prevent data breaches and unauthorized access. In contrast, hardware-based protection employs dedicated security devices to safeguard the entire network infrastructure. For instance, firewall appliances are widely used to provide enhanced security at the hardware level. In critical domains such as banking and finance, organizations often prioritize hardware-based security equipment due to its robustness and reliability. It is important to note that hardware-based solutions are typically more expensive than software-based systems, making them less accessible for smaller organizations or personal use. Despite the cost, their effectiveness in ensuring comprehensive data protection makes them essential in high-security environments.
Deduplication is a key method used to eliminate redundant data copies in distributed cloud environments, thereby improving storage efficiency and reducing operational costs. Various deduplication techniques, ranging from file-level to block-level approaches, have been developed, each with its own advantages and limitations. In particular, block-level deduplication methods often rely on labeling mechanisms to detect redundancy, which can result in missed duplicates that are semantically or textually similar but not identically labeled. These limitations are especially evident in scenarios involving unstructured or semi-structured text data, where existing deduplication techniques fail to recognize contextual similarities. As a result, there is a growing need for more intelligent and context-aware deduplication strategies that go beyond syntactic comparisons. To address these challenges, advanced machine learning-based and security-oriented methods are being explored to enhance duplication detection, particularly in cloud settings where data is highly dynamic and decentralized. This study introduces a novel post-quantum secure framework that combines SALIGP and Geometric Analysis, supported by Bloom Filter-based detection. To further strengthen data integrity and user authentication, the system integrates the CDAS, leveraging blockchain technology for secure access and storage control.
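The Bloom Filter-based pre-check mentioned above can be sketched as follows. This is a generic, minimal Bloom filter (the parameters `m_bits` and `k_hashes` are illustrative defaults, not values from the proposed system) that reports either definite absence or probable presence, so a negative answer lets the system skip the more expensive duplicate lookup entirely:

```python
import hashlib


class BloomFilter:
    """Probabilistic set membership: no false negatives, tunable false-positive rate."""

    def __init__(self, m_bits: int = 1024, k_hashes: int = 3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # single integer used as an m-bit array

    def _positions(self, item: bytes):
        # Derive k independent bit positions by salting the hash with the index
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: bytes) -> bool:
        # False -> definitely never added; True -> probably present (verify further)
        return all((self.bits >> p) & 1 for p in self._positions(item))
```

A `False` answer is authoritative, while a `True` answer may be a false positive, so a positive pre-check would still be confirmed against the full tag index before deduplicating.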
While cloud computing offers scalability, cost efficiency, and flexible resource access, it also introduces critical challenges related to data privacy, security, and deduplication. Users increasingly store sensitive information on cloud platforms, relying on service providers for protection. Existing security models often fall short in addressing dynamic and distributed threats, particularly when data is outsourced to virtual servers. Existing deduplication techniques, especially block-level and file-level approaches, fail to detect semantically similar or contextually related duplicates, resulting in inefficient storage and increased security risks. Most existing methods lack the ability to handle privacy-preserving deduplication in multi-tenant environments where data ownership is distributed among different entities. To address these gaps, a hybrid solution is necessary: one that combines advanced deduplication detection, strong cryptographic guarantees, and efficient identity verification. This study introduces a robust and scalable framework integrating SALIGP and Geometric Analysis, enhanced with Bloom Filter-based detection. To ensure integrity and authorized access, the model is reinforced with CDAS using blockchain technology, thereby addressing both security and deduplication efficiency in modern cloud environments.

