Recently Apache Hama
team released official 0.7.0 version. According to the release announcement, there were
big improvements in Graph package. In this article, we provide an overview of
the newly improved Graph package of Apache Hama, and the benchmark results that
performed by cloud platform team at Samsung Electronics.
Large
scale datasets are being increasingly used in many fields. Graph algorithms are
becoming important for analyzing big data. Data scientists are able to predict
the behavior of the customer, the trends of the market, and make a decision by
analyzing the graph structure and characteristics. Currently there are a
variety of open source graph analytic frameworks, such as Google’s Pregel[1],
Apache Giraph[2], GraphLab[3] and GraphX[4]. These
frameworks are aimed at computations varying from classical graph traversal
algorithms to graph statistics calculations such as triangle counting to
complex machine learning algorithms. However these frameworks have been
developed each offering a solution with different programming models and
targeted at different users. In this article, we introduce the Apache Hama[5]
and its graph package.
What is Apache Hama?
Apache
Hama is a general-purpose Bulk Synchronous Parallel (BSP)[6]
computing engine on top of Hadoop. It provides a parallel processing framework
for massive scientific and iterative algorithms. BSP is an easy and flexible
programming model, as compared with traditional models of Message Passing, as
shown in Figure 1.
![]() |
Figure 1. Bulk Synchronos Parallel (BSP) Model. |
Hama
performs a series of supersteps based on BSP. A superstep consist of three
stages: local computation, message communication, and barrier synchronization.
Hama is suitable for iterative computation since it is possible that input data
which can be saved in memory is able to transfer between supersteps. However,
MapReduce must scan input data in each iteration, and then output data must be
saved in file system, such as HDFS. Hence, Hama can solve the problems which
MapReduce cannot handle easily.
Graph Package of Apache Hama
Apache
Hama also supports a graph package which allows users to program applications
for graph-parallel computations[7]. The vertex-centric model is
suggestive of MapReduce[8] in that users focus on a local action,
processing each item independently, and the system compose these actions to
lift computation to a large graph dataset. It is easy to implement and prove to
be useful for many graph algorithms.
![]() |
Figure 2. Finding the maximum value in a graph example.
Dotted lines are messages. Grey vertices have voted to halt. |
Figure
2 illustrates the concept of graph processing in Hama using a simple example:
given a connected graph where each vertex contains a value, it propagates the
latest value to every vertex. Any vertex that has known a larger value from its
messages sends it to all its neighbors in each superstep. When no more vertices
change in a superstep, the algorithm terminates. If you want to know graph
package in Hama, please refer to Apache
Hama Programming[7].
Hama in Action
We
performed the performance of the version of 0.7.0 of Hama, which is recently
release, on Amazon Web Service Elastic Map Reduce(EMR) instances. PageRank[9]
algorithm is used for benchmark test of the graph package. PageRank is an
algorithm that is used to rank web pages according to their popularity. This
algorithm calculates the probability that a random walk would end in a particular
vertex of a graph. This application computes the page rank of every vertex in a
directed graph iteratively. At every iteration t, each vertex computes the following:
where
r is the probability of the a random
jump, E is the set of directed edge
in the graph and PRt(j) denotes the page rank of the vertex at iteration t.
We
compared the performance of Hama with Giraph which is a graph framework that is
used in Facebook. PageRank algorithm was already implemented in both of the
frameworks. Default input types in Hama is text type and in Giraph is JSON. We
implement input format of JSON to run in Hama for a fair comparison. JSON type
format is as follows:
Above
datasets are generated using fastgen example program which generates random
graph datasets. The way fastgen generate the datasets on Hama cluster is as
follows:
%
bin/hama jar hama-examples-x.x.x.jar gen fastgen -v 1000 -e 100 -of json -o
/randomgraph -t 40
|
The following is the meaning of the options:
Experimental Results
In
this section, we describes our experimental setup with details about datasets
and experimental platform. We conducted various experiments with PageRank
algorithm on EMR cluster. We used graph datasets randomly generated using
fastgen example in Hama. Also we run benchmarks on EMR cluster which consists
of instance type of r3.xlarge[10]. This instance type is
memory-optimized instance type which has 30GB of RAM, and 4 vCPU. The cluster
run on the hadoop 1.0.3(Amazon Machine Image version 2.4.11). That is because
Giraph didn’t work well on hadoop 2.
Hama
provides pure BSP model via own cluster. Furthermore, it works on both Hadoop
YARN[11] and Apache Mesos[12]. Although Giraph has same
programming model like Hama, it works as a MapReduce job which means that it
requires MapReduce framework. So Hama performed PageRank algorithm making its own
cluster. On the other hands, Giraph perform MapReduce application on hadoop.
For
benchmarking of scalability, we increased the number of machines from 5 to 30
on a fixed dataset which has one billion edges (Figure 4). Both frameworks
present significant scalability for same datasets. However, Hama has much lower
execution time than Giraph for the same data set.
![]() |
Figure 3. The execution time on same data set, depending on the size of machines. |
From
the computing performance’s point of view, Figure 4 shows execution time on
same nodes, depending on the size of dataset. Hama also shows more powerful
performance than Giraph for the same machines.
The
main reason that the result of benchmark shows different performance in spite
of the same programming model is that we use the advanced PageRank algorithm
which uses aggregators for detecting the convergence condition and the BSP
framework’s efficient messaging system. Hama uses own outgoing/incoming message
manager instead of Java's built-in queues. It stores messages in serialized
form in a set of bundles (or a single bundle) to reduce the memory usage and
RPC overhead. Also unsafe serialization is used to serialize Vertex and its
message objects more quickly. Instead of sending each message individually,
Hama packages the messages per vertex at once and sends a packaged message to
their assigned destination nodes. With this Hama v0.7 achieved significant
improvement in the performance of graph applications.
![]() |
Figure 4. The execution time on same machine, depending on the size of dataset.
|
Conclusion and future work
In
this article, we presented a graph package of Hama. We also performed the
performance of graph package, with Hama. The performance compared with Giraph,
in respect of computing and scalability. As a result, the performance and
scalability are already satisfactory for graphs with billions of vertices.
However,
there are also a lot of improvement to be done based on the current version.
Efficient load balancing, spillable vertices storage and message serialization
are challenging issues. We look forward to add these features and see our
community growth.
-
References
[1] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. Principles Of Distributed Computing (PODC), 2009.
[2] Apache Giraph v 1.1.0 http://giraph.apache.org/
[3] GraphLab https://dato.com/products/create/open_source.html
[4] Apache Spark’s GraphX https://spark.apache.org/graphx/
[5] Apache Hama v 0.7.0. http://hama.apache.org/
[6] Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33 Issue 8, Aug. 1990
[7] Apache Hama BSP programming, http://people.apache.org/~tjungblut/downloads/hamadocs/
ApacheHamaBSPProgrammingmodel_06.pdf
[8] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
[9] Page, Larry, et al. PageRank: Bringing order to the web. Vol. 72. Stanford Digital Libraries Working Paper, 1997.
[10] Amazon EC2 instances, http://aws.amazon.com/ec2/instance-types/
[11] Apache Hadoop YARN arhitecture, https://hadoop.apache.org/docs/stable/hadoop-yarn/
hadoop-yarn-site/YARN.html
[12] Apache Mesos, http://mesos.apache.org/
[13] Satish, Nadathur, et al. "Navigating the maze of graph analytics frameworks using massive graph datasets." Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM, 2014.
[14] General dynamics, High Performance Data Analytics, http://gdmissionsystems.com/wp-content/uploads/2015/05/GDMS_White_Paper_20151.pdf
[2] Apache Giraph v 1.1.0 http://giraph.apache.org/
[3] GraphLab https://dato.com/products/create/open_source.html
[4] Apache Spark’s GraphX https://spark.apache.org/graphx/
[5] Apache Hama v 0.7.0. http://hama.apache.org/
[6] Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33 Issue 8, Aug. 1990
[7] Apache Hama BSP programming, http://people.apache.org/~tjungblut/downloads/hamadocs/
ApacheHamaBSPProgrammingmodel_06.pdf
[8] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
[9] Page, Larry, et al. PageRank: Bringing order to the web. Vol. 72. Stanford Digital Libraries Working Paper, 1997.
[10] Amazon EC2 instances, http://aws.amazon.com/ec2/instance-types/
[11] Apache Hadoop YARN arhitecture, https://hadoop.apache.org/docs/stable/hadoop-yarn/
hadoop-yarn-site/YARN.html
[12] Apache Mesos, http://mesos.apache.org/
[13] Satish, Nadathur, et al. "Navigating the maze of graph analytics frameworks using massive graph datasets." Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM, 2014.
[14] General dynamics, High Performance Data Analytics, http://gdmissionsystems.com/wp-content/uploads/2015/05/GDMS_White_Paper_20151.pdf
Minho Kim is a software engineer at Samsung Electronics in the Cloud Technology Lab. Minho is a committer of Apache Hama project, which is a general BSP computing engine. He has been developing and contributing to Apache Hama. Minho’s research interests include deep learning such as DNN, and CNN. |
for a chance to get a free invite to
best classes and sessions on
Spark, Tajo, Flink and more...
This sounds excellent and worth an evaluation .... I had been struggling with other graph db/frameworks.
ReplyDeleteMinho, did you benchmark Giraph with fault tolerance enabled? Because the execution overhead in time you see comes from spilling stuff to disk.
ReplyDeleteOtherwise this is a bit like apples vs. oranges.
Thomas,
ReplyDeleteI agree with you that Giraph with fault tolerance affects results of benchmark, if it enabled.
But I compared performance between Hama and Giraph without fault tolerance enabled.
Thank you for your opinion.