The revolution in the Internet of Things (IoT) has led to a surge in data generated by sensors, prompting urgent challenges in efficient data collection, transmission, and storage. One significant issue is the redundant data transmission from sensors, especially those covering overlapping areas, which inflates both communication and storage costs.
In the realm of existing solutions, the Asymmetric Extremum (AE) and Rapid Asymmetric Maximum (RAM) schemes employ fixed and variable-sized windows during chunking. However, these approaches struggle with selecting index values to determine the variable window size, often resulting in insufficient deduplication. To address these limitations, the Controlled Cut-point Identification Algorithm (CCIA) has been devised. This algorithm restricts the variable-sized window to a certain threshold, ensuring the index value for the threshold is always larger than half the size of the fixed window. As a result, CCIA finds more duplicates while applying an upper limit offset to avoid excessively large windows that could cause high computation costs.
Introduction
The Internet of Things (IoT) is a network of smart devices with applications across various sectors, including healthcare, education, sports, transportation, and smart energy management. Smart devices in IoT communicate to share environmental data, receive information, and distribute relevant insights as needed. This communication extends the cyber world’s boundaries using real-world objects and digital components. For instance, sensors attached to a patient can monitor health metrics like temperature, blood pressure, and heartbeat, forwarding this data to a fog server for analysis and processing. IoT-enabled Wireless Sensor Networks (WSNs) further enhance this by sensing, processing, and communicating information through diverse media.
Despite these capabilities, IoT sensor networks face challenges such as limited resources, redundant data transmission, and storage inefficiencies. Accurate data measurement is crucial for many applications, yet sensors often report redundant data, increasing storage and communication demands. Deduplication techniques can mitigate this issue by removing redundant data and conserving resources.
Data Deduplication Challenges in IoT
In IoT environments, redundant data stored on fog or cloud servers consumes substantial resources, making it essential to eliminate redundant values before storing or exchanging information. Various deduplication schemes focus on chunking data, but existing methods like AE and RAM face several issues. In particular, AE and RAM often fail to keep chunk sizes within a useful range, producing either excessively large or very small chunks with poor deduplication rates.
Asymmetric Extremum (AE) Algorithm
The AE algorithm employs two windows: a variable-sized window (VSW) and a fixed-sized window (FSW). AE scans each byte of the FSW to identify the cut-point, but the VSW it produces can be empty or very small, which increases computation costs and reduces deduplication efficiency.
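To make the two windows concrete, the following is a minimal Python sketch of AE-style chunking. It follows the commonly described form of the algorithm (track the maximum byte, then cut once a fixed window of `w` bytes after it contains no larger byte) rather than anything specified in this article. The function names, the parameter `w`, and the choice to cut just after the last byte of the fixed window are assumptions made for illustration.

```python
def ae_cut_point(data: bytes, start: int, w: int) -> int:
    """Return the exclusive end index of the chunk beginning at `start`.

    Sketch only: `w` is the assumed fixed-sized window (FSW) length.
    The bytes before the maximum form the variable-sized window (VSW).
    """
    max_val = data[start]
    max_pos = start
    i = start + 1
    while i < len(data):
        if data[i] <= max_val:
            # No larger byte seen for w positions after the maximum: cut here.
            if i == max_pos + w:
                return i + 1
        else:
            # A new maximum restarts the fixed window.
            max_val = data[i]
            max_pos = i
        i += 1
    return len(data)  # tail of the stream becomes the last chunk


def ae_chunks(data: bytes, w: int) -> list[bytes]:
    """Split `data` into chunks using the AE-style cut-point rule above."""
    chunks = []
    start = 0
    while start < len(data):
        end = ae_cut_point(data, start, w)
        chunks.append(data[start:end])
        start = end
    return chunks
```

For the input `bytes([5, 1, 2, 3, 9, 1, 1, 1, 4])` with `w = 3`, the first byte of each of the first two chunks is already the maximum, so the VSW in front of it is empty. This is the zero-size case described above.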
Rapid Asymmetric Maximum (RAM) Algorithm
RAM aims to improve upon AE by identifying the byte with the most significant (maximum) value in each FSW. However, it often leaves the VSW at a minimal size, which reduces the chance of detecting duplicates. These limitations call for a more efficient approach to deduplication.
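A minimal Python sketch of RAM-style chunking is given below, again based on the commonly described form of the algorithm and not on details from this article: the maximum byte of a fixed window at the start of the chunk is found first, and the chunk ends at the first later byte that is at least as large. The names `ram_cut_point`, `ram_chunks`, and `w` are hypothetical.

```python
def ram_cut_point(data: bytes, start: int, w: int) -> int:
    """Return the exclusive end index of the chunk beginning at `start`.

    Sketch only: `w` is the assumed fixed-sized window (FSW) length.
    The bytes scanned after the FSW form the variable-sized window (VSW).
    """
    n = len(data)
    window_end = min(start + w, n)
    max_val = max(data[start:window_end])  # maximum byte inside the FSW
    for i in range(window_end, n):
        if data[i] >= max_val:
            return i + 1  # the matching byte closes the chunk
    return n  # no such byte: the rest of the stream is one chunk


def ram_chunks(data: bytes, w: int) -> list[bytes]:
    """Split `data` into chunks using the RAM-style cut-point rule above."""
    chunks = []
    start = 0
    while start < len(data):
        end = ram_cut_point(data, start, w)
        chunks.append(data[start:end])
        start = end
    return chunks
```

With `bytes([1, 2, 3, 7, 0, 0, 9, 1])` and `w = 3`, the very first byte after the fixed window already exceeds its maximum, so the VSW of the first chunk is a single byte. This illustrates how the VSW can collapse to a minimal size.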
System Model and Deduplication Needs
In a smart healthcare system, numerous patients have monitoring sensor devices attached to their bodies. These sensors collect health data and forward it to an Aggregator Node (AN). The AN aggregates this data and transmits it to a fog server, which then forwards it to a cloud server. Redundant data transmission increases storage and communication costs at both the fog and cloud servers. The proposed data deduplication method aims to reduce these costs at both tiers.
Data Collection and Transmission Flow
- Data Collection: Sensor nodes continuously monitor patient health parameters and transmit the data to the AN.
- Data Aggregation: The AN aggregates data from multiple sensors.
- Data Transmission: The aggregated data is sent to a fog server for further deduplication.
- Data Deduplication: The fog server performs deduplication using CCIA.
- Data Storage: The deduplicated data is stored in the cloud server to reduce storage costs.
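The deduplication and storage steps in the flow above can be illustrated with a small fingerprint index. The sketch below is an assumption-laden illustration, not the system described here: the class name `ChunkStore`, its methods, and the use of SHA-256 as the chunk fingerprint are all hypothetical choices, and the chunks are assumed to come from whichever chunking algorithm the fog server runs.

```python
import hashlib


class ChunkStore:
    """Hypothetical in-memory store that keeps one copy of each distinct chunk."""

    def __init__(self) -> None:
        self._chunks: dict[str, bytes] = {}  # fingerprint -> chunk bytes

    def put(self, chunk: bytes) -> str:
        """Store `chunk` if unseen and return its fingerprint."""
        fingerprint = hashlib.sha256(chunk).hexdigest()  # assumed hash choice
        if fingerprint not in self._chunks:
            self._chunks[fingerprint] = chunk
        return fingerprint

    def store_record(self, chunks: list[bytes]) -> list[str]:
        """Store all chunks of one record; the returned recipe rebuilds it."""
        return [self.put(chunk) for chunk in chunks]

    def restore(self, recipe: list[str]) -> bytes:
        """Reassemble a record from its list of fingerprints."""
        return b"".join(self._chunks[fp] for fp in recipe)

    def stored_bytes(self) -> int:
        """Total bytes actually kept after deduplication."""
        return sum(len(chunk) for chunk in self._chunks.values())
```

Storing the chunks `[b"temp=37", b"bp=120/80", b"temp=37"]` keeps only two distinct chunks, while the returned recipe of three fingerprints is enough to restore the original record.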
Proposed Controlled Cut-point Identification Algorithm (CCIA)
CCIA is designed to address the challenges of existing deduplication schemes by keeping the variable-sized window (VSW) at an appropriate size. The algorithm uses a 2-dimensional array to determine the controlled cut-point, which must be larger than the fixed-sized window (FSW). This approach avoids generating extremely small or large chunks, improving deduplication efficiency and reducing computational costs.
CCIA Process
- Initialization: Set the size of the FSW and define the starting and ending points for scanning.
- VSW Size Determination: Calculate the lower and upper thresholds for the VSW size.
- Cut-point Identification: Identify a controlled cut-point within the acceptable range.
- Chunk Formation: Form chunks based on the identified cut-points.
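The four steps above can be sketched in Python as follows. This is only one possible reading of the description, not the authors' algorithm: it does not reproduce the 2-dimensional array, and it assumes the lower VSW threshold is `fsw // 2 + 1` (just over half the FSW) and that the upper limit is a caller-supplied `max_vsw`. The function names and both parameters are hypothetical.

```python
def ccia_like_cut_point(data: bytes, start: int, fsw: int, max_vsw: int) -> int:
    """Return the exclusive end index of the chunk beginning at `start`.

    Hypothetical sketch: the maximum byte may only be chosen at an offset
    between the lower threshold and `max_vsw`, so the VSW is never tiny
    and never unbounded.
    """
    min_vsw = fsw // 2 + 1  # assumed lower threshold: more than half the FSW
    if fsw < 1 or max_vsw < min_vsw:
        raise ValueError("need fsw >= 1 and max_vsw >= fsw // 2 + 1")
    n = len(data)
    first = start + min_vsw       # earliest position allowed as the maximum
    if first >= n:
        return n                  # too little data left: emit the tail
    max_pos, max_val = first, data[first]
    last_allowed = start + max_vsw  # latest position allowed as the maximum
    for i in range(first + 1, n):
        if data[i] > max_val and i <= last_allowed:
            max_pos, max_val = i, data[i]
        elif i == max_pos + fsw:
            return i + 1          # fixed window after the maximum is complete
    return n


def ccia_like_chunks(data: bytes, fsw: int, max_vsw: int) -> list[bytes]:
    """Split `data` into chunks using the bounded cut-point rule above."""
    chunks = []
    start = 0
    while start < len(data):
        end = ccia_like_cut_point(data, start, fsw, max_vsw)
        chunks.append(data[start:end])
        start = end
    return chunks
```

Under these assumptions every chunk except possibly the last has a length between `min_vsw + fsw + 1` and `max_vsw + fsw + 1` bytes, which is the bounded-size behaviour the process aims for.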
Performance Analysis
In experiments run with Windows Communication Foundation services on the Azure cloud, CCIA outperformed AE and RAM across various metrics. These metrics include the total number of chunks, average chunk size, minimum and maximum chunk sizes, and the probability of successful cut-point identification. CCIA produced fewer, better-sized chunks, leading to enhanced deduplication efficiency.
Results
- Total Chunks: CCIA generated fewer chunks than AE and RAM.
- Average Chunk Size: CCIA maintained an optimal average chunk size, avoiding excessively small or large chunks.
- Minimum/Maximum Chunk Sizes: The algorithm avoided extreme chunk sizes, focusing on mid-range sizes for better deduplication.
- Probability of Successful Cut-point Identification: CCIA showed a higher success rate in identifying cut-points, reducing computational costs.
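The chunk-size metrics listed above are straightforward to compute for any chunking output. The helper below is a sketch for illustration; the name `chunk_metrics`, the dictionary keys, and the `duplicate_fraction` measure (share of bytes belonging to repeated chunks) are assumptions and are not taken from the reported experiments.

```python
from statistics import mean


def chunk_metrics(chunks: list[bytes]) -> dict[str, float]:
    """Summarise a list of chunks with basic size and duplication metrics."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    sizes = [len(chunk) for chunk in chunks]
    total_bytes = sum(sizes)
    if total_bytes == 0:
        raise ValueError("chunks must contain at least one byte")
    # Bytes that remain if every distinct chunk is stored exactly once.
    unique_bytes = sum(len(chunk) for chunk in set(chunks))
    return {
        "total_chunks": len(sizes),
        "average_size": mean(sizes),
        "min_size": min(sizes),
        "max_size": max(sizes),
        "duplicate_fraction": 1 - unique_bytes / total_bytes,
    }
```

Running the same input through different chunkers and comparing these dictionaries gives a like-for-like view of chunk counts and size extremes.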
Conclusion
The explosion of the Internet of Things (IoT) has resulted in a massive increase in data generated by sensors, presenting pressing challenges in efficient data collection, transmission, and storage. A notable problem is the redundant data transmitted by sensors, particularly those monitoring overlapping regions, leading to inflated communication and storage expenses.
Existing solutions like the Asymmetric Extremum (AE) and Rapid Asymmetric Maximum (RAM) schemes use fixed and variable-sized windows for chunking. However, they often falter in selecting index values to determine variable window sizes, causing insufficient deduplication. To overcome these issues, the Controlled Cut-point Identification Algorithm (CCIA) has been introduced. This algorithm limits the variable-sized window to a specific threshold, ensuring that the index value for this threshold is greater than half the size of the fixed window. Consequently, CCIA detects more duplicates while applying an upper limit offset, preventing excessively large windows that could escalate computation costs. The approach aims to make data management in IoT systems more efficient.