As the
amount of data continues to grow exponentially, there has been increased focus
on stored data reduction methods. Data compression, single instance store and
data deduplication are among the common techniques employed for stored data
reduction.
Deduplication often refers to the elimination of redundant subfiles (also known as chunks, blocks, or extents). Unlike compression, the data itself is not altered; instead, the storage capacity consumed by identical data is eliminated. Data deduplication offers significant advantages in terms of reduced storage and network bandwidth, and promises increased scalability.
From a simple use-case perspective, we can see the technique applied to removing duplicates in Call Detail Records (CDRs) for a telecom carrier. Similarly, we may apply it to optimize network traffic carrying the same data packets.
Some of the common methods for data deduplication in storage architectures include hashing, binary comparison and delta differencing. In this post, we focus on how MapReduce and HDFS can be leveraged for eliminating duplicate data. (The approaches listed below include some experimental work by students, so they may be best termed tactics.)
Tactic 1: Using HDFS and MapReduce only
Owen O’Malley has suggested the following approach in one of the forums:

"Keep your historic data sorted by md5. Run a MapReduce job to sort your new data into md5 order. Note that you want a total order, but because the md5's are evenly spaced across the key space this is easy. Basically, you pick a number of reduces (eg. 256) and then use the top N bits of the MD5 to pick your reduce. Since this job is only processing your new data, it is very fast.

Next you do a map-side join where each input split consists of an md5 range. The RecordReader reads from the historic and new datasets merging them as they go. (You can use the map-side join library for this.) Your map does the merge of the new and old. This is a map-only job, so it also is very fast.

Of course if the new data is small enough you can read all of the new input in each of the maps and just keep (and sort in ram) the new records that are in the right range and do the merge from ram. This lets you avoid the step where you sort the new data. This kind of merge optimization is where Pig and Hive hide lots of the details from the developer."
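To make the first step concrete, here is a minimal sketch (not Owen's code) of the "sort the new data into md5 order" job in Java. The class and job names are illustrative; the partitioner routes each key by the top bits of its MD5 so that, taken together, the reducer outputs form a total order over the MD5 key space.

```java
// A minimal sketch of the "sort new data into md5 order" step from the quoted
// approach. The mapper below is a hypothetical one that keys each record by
// the hex-encoded MD5 of the record itself.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Md5SortJob {

  /** Emits (md5(record), record) so the shuffle sorts records by their MD5. */
  public static class RecordMd5Mapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context ctx)
        throws java.io.IOException, InterruptedException {
      try {
        byte[] digest = MessageDigest.getInstance("MD5")
            .digest(record.toString().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }
        ctx.write(new Text(hex.toString()), record);
      } catch (java.security.NoSuchAlgorithmException e) {
        throw new java.io.IOException(e);
      }
    }
  }

  /**
   * Because MD5 values are evenly distributed, routing each key by its top
   * bits gives a total order across reducers: reducer 0 holds the smallest
   * MD5 range, reducer N-1 the largest.
   */
  public static class Md5RangePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text md5Hex, Text value, int numPartitions) {
      int topByte = Integer.parseInt(md5Hex.toString().substring(0, 2), 16);
      return topByte * numPartitions / 256;
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sort-new-data-by-md5");
    job.setJarByClass(Md5SortJob.class);
    job.setMapperClass(RecordMd5Mapper.class);
    job.setPartitionerClass(Md5RangePartitioner.class);
    job.setNumReduceTasks(256);                 // e.g. 256 ranges, as in the quote
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The subsequent map-side merge against the sorted historic data can then be expressed with Hadoop's map-side join support (e.g. CompositeInputFormat), as the quote suggests.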
Tactic 2: Using HDFS and HBase
In the paper “A novel approach to data deduplication over the engineering-oriented cloud systems”, Zhe Sun, Jun Shen and Jianming Young suggest a method involving HDFS and HBase, which includes:
- Use MD5 and SHA-1 hash functions to calculate the file's hash value and then pass the value to HBase.
- Compare the new hash value with the existing values. If it already exists in the HBase deduplication table, HDFS will check the number of links, and if the number is not zero, the counter will be incremented by one. If the number is zero or the hash value did not exist earlier, HDFS will ask the client to upload the file and will update the logical path.
- HDFS will store the source files, which are uploaded by users, and the corresponding link files, which are automatically generated. Link files record the source file's hash value and the logical path of the source file.
Some of the key items to note in this approach include:
- File-level deduplication is used to keep the index as small as possible in order to achieve high lookup efficiency.
- MD5 and SHA-1 values are merged together to avoid accidental collisions.
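As a rough illustration of the index lookup described above, here is a hedged sketch against the standard HBase client API. The table name ("dedup"), the column family and column names, and the merged-hash convention are assumptions made for the example, not details taken from the paper.

```java
// A minimal sketch, assuming a "dedup" HBase table with column family "f",
// a "links" counter column, and a "path" column; these names are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HashIndexLookup {
  private static final byte[] FAMILY = Bytes.toBytes("f");
  private static final byte[] LINKS  = Bytes.toBytes("links");
  private static final byte[] PATH   = Bytes.toBytes("path");

  /**
   * Returns true if the file identified by the merged MD5+SHA-1 value is
   * already stored (and bumps its link counter); returns false if the client
   * should upload the file, in which case a new index row is created.
   */
  public static boolean isDuplicate(String mergedHash, String logicalPath) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("dedup"))) {

      byte[] row = Bytes.toBytes(mergedHash);
      Result existing = table.get(new Get(row));

      if (!existing.isEmpty()) {
        long links = Bytes.toLong(existing.getValue(FAMILY, LINKS));
        if (links > 0) {
          // Hash already known and still referenced: just add a reference.
          table.incrementColumnValue(row, FAMILY, LINKS, 1L);
          return true;
        }
      }

      // Hash unknown (or no live references): record it and ask for an upload.
      Put put = new Put(row);
      put.addColumn(FAMILY, LINKS, Bytes.toBytes(1L));
      put.addColumn(FAMILY, PATH, Bytes.toBytes(logicalPath));
      table.put(put);
      return false;
    }
  }
}
```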
Tactic 3: Using HDFS, MapReduce and a Storage Controller
In the paper “Distributed Duplicate Detection in Post-Process Data De-duplication” by NetApp engineers Ashish Kathpal and Gaurav Makkar, and by Mathew John, the authors propose to replace NetApp's duplicate detection stage with a duplicate detection mechanism that uses Hadoop MapReduce. The proposed Hadoop-based duplicate detection workflow includes the following stages:
- Move the fingerprints from the storage controller to HDFS.
- Generate the fingerprint database and store it persistently on the Hadoop Distributed File System (HDFS).
- Generate the duplicate records from the fingerprints using MapReduce and send them back to the storage controller.
Fingerprints are computed hash indexes of data chunks in the storage system. Since the fingerprints are typically much smaller than the data chunks they represent, duplicate detection requires less data movement across the network.
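As a hedged sketch of what the MapReduce duplicate-detection stage might look like (the input layout and class names below are assumptions, not the authors' implementation), the map re-keys each record by its fingerprint and the reduce emits any fingerprint shared by more than one chunk:

```java
// Assumes the fingerprints arrive as text lines of the form
// "<fingerprint>\t<chunkId>"; this layout is an assumption for illustration.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FingerprintDuplicateDetection {

  /** Re-keys each record by its fingerprint so identical chunks meet in one reducer. */
  public static class FingerprintMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length == 2) {
        ctx.write(new Text(parts[0]), new Text(parts[1]));   // fingerprint -> chunkId
      }
    }
  }

  /** Emits a record only when two or more chunks share the same fingerprint. */
  public static class DuplicateReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text fingerprint, Iterable<Text> chunkIds, Context ctx)
        throws IOException, InterruptedException {
      List<String> chunks = new ArrayList<>();
      for (Text id : chunkIds) {
        chunks.add(id.toString());
      }
      if (chunks.size() > 1) {
        // These chunk ids are candidates to be collapsed by the storage controller.
        ctx.write(fingerprint, new Text(String.join(",", chunks)));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "fingerprint-duplicate-detection");
    job.setJarByClass(FingerprintDuplicateDetection.class);
    job.setMapperClass(FingerprintMapper.class);
    job.setReducerClass(DuplicateReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```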
Tactic 4: Using Streaming, HDFS and MapReduce
There are two possible scenarios for integrating Hadoop with a streaming application. As demonstrated in the IBM InfoSphere Streams and BigInsights integration, these scenarios could be:
(a) Streams to Hadoop flow: Using a control flow that utilizes the Hadoop MapReduce model as part of stream analysis, the operators on Streams observe and deduplicate the incoming data to update and validate the MapReduce model.
Since it is most efficient to remove duplicate records as the data is being ingested, records are deduplicated for a particular time range, count or defined delta in InfoSphere Streams. The deduplicated data is then written (sunk) to Hadoop BigInsights for building a fresh model.
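The following is not InfoSphere Streams code, just a plain-Java sketch of the time-window deduplication idea described above: records whose key has already been seen within the window are suppressed before being handed off to the batch side.

```java
// A sketch of sliding-time-window deduplication; the class name and the
// choice of a LinkedHashMap keyed by record key are assumptions for illustration.
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class WindowedDeduplicator {
  private final long windowMillis;
  // Insertion-ordered map of record key -> first-seen timestamp.
  private final LinkedHashMap<String, Long> seen = new LinkedHashMap<>();

  public WindowedDeduplicator(long windowMillis) {
    this.windowMillis = windowMillis;
  }

  /** Returns true if the record should be forwarded, false if it is a duplicate within the window. */
  public synchronized boolean accept(String recordKey, long nowMillis) {
    // Drop entries that have aged out of the window (oldest entries come first).
    Iterator<Map.Entry<String, Long>> it = seen.entrySet().iterator();
    while (it.hasNext() && nowMillis - it.next().getValue() > windowMillis) {
      it.remove();
    }
    if (seen.containsKey(recordKey)) {
      return false;                      // duplicate inside the window
    }
    seen.put(recordKey, nowMillis);
    return true;                         // first occurrence: forward downstream
  }
}
```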
(b) Hadoop to Streams flow: In this approach, Hadoop MapReduce is used to remove duplicates from the historical data, and the model is then updated. The model is integrated as part of the Streams flow, and an operator can be configured to pick up the model mid-stream and start applying it to incoming data.
Tactic 5: Using MapReduce with Blocking techniques
In Dedoop (Deduplication with Hadoop), a prototype tool developed at the University of Leipzig, MapReduce is used for entity resolution of large datasets. This tool shows by far the most mature use of MapReduce for data deduplication.
Blocking-based entity matching is used to semantically partition the input data into blocks of similar records and to restrict matching to entities within the same block.
Entity resolution processing is split into two MapReduce jobs: an Analysis job, used primarily to count occurrences, and a Match job for load balancing and similarity computation. Dedoop also uses greedy load balancing, where match tasks are sorted in descending order by their size and assigned to the least loaded reduce task.
Dedoop also employs effective techniques for avoiding redundant pair comparisons. It claims to enable the MapReduce program to unambiguously determine which reduce task is responsible for any given pair comparison, eliminating the need for the same comparison to happen on multiple nodes.
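To illustrate the general blocking idea (a simplified sketch, not Dedoop's implementation): the map step derives a cheap blocking key for each entity and the reduce step compares only the pairs that fall into the same block. The blocking key and the similarity test below are placeholders.

```java
// A simplified sketch of blocking-based matching in MapReduce; the input
// format (comma-separated entities with a leading name field), the blocking
// key and the similarity test are assumptions for illustration.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BlockingMatch {

  /** Groups entities by a cheap blocking key (here: the first three characters of the name field). */
  public static class BlockingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text entity, Context ctx)
        throws IOException, InterruptedException {
      String name = entity.toString().split(",", 2)[0].toLowerCase();
      String blockKey = name.substring(0, Math.min(3, name.length()));
      ctx.write(new Text(blockKey), entity);
    }
  }

  /** Compares all pairs inside one block and emits the matches. */
  public static class MatchReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text blockKey, Iterable<Text> entities, Context ctx)
        throws IOException, InterruptedException {
      List<String> block = new ArrayList<>();
      for (Text e : entities) {
        block.add(e.toString());
      }
      for (int i = 0; i < block.size(); i++) {
        for (int j = i + 1; j < block.size(); j++) {
          if (similar(block.get(i), block.get(j))) {
            ctx.write(new Text(block.get(i)), new Text(block.get(j)));
          }
        }
      }
    }

    // Placeholder similarity test; Dedoop supports configurable matchers.
    private boolean similar(String a, String b) {
      return a.equalsIgnoreCase(b);
    }
  }
}
```

These mapper and reducer classes would be wired into a Job in the same way as the earlier examples; Dedoop additionally redistributes large blocks across reduce tasks for load balancing, which this sketch omits.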