Apache Zeppelin provides a web UI in which you can iteratively build Spark scripts in Scala, Python, and other languages (with autocomplete support), run Spark SQL queries against Hive or another store, and visualize the results of a query or the contents of a Spark DataFrame. This is somewhat akin to what IPython notebooks do for Python. Spark developers know that building, testing, and fixing errors in Spark scripts can be a lengthy process, and a dull one because it is not interactive. With Apache Zeppelin you can build and test portions of your script iteratively, which can improve your productivity significantly.
Installing and Configuring Apache Zeppelin
Ensure the following prerequisites are installed:
- Java 8:
su -c "yum install java-1.8.0-openjdk-devel"
- Maven 3.1.x+:
sudo yum install apache-maven
and then link it:
sudo ln -s /usr/share/apache-maven/bin/mvn /usr/bin/mvn
If this does not work for you, you can install it the following way:
wget http://www.eu.apache.org/dist/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
sudo tar -zxf apache-maven-3.3.3-bin.tar.gz -C /usr/local/
sudo ln -s /usr/local/apache-maven-3.3.3/bin/mvn /usr/local/bin/mvn
- Git:
sudo yum install git
- NPM:
sudo yum install nodejs npm
- Either download the source code or clone the Git repository into a folder:
git clone https://github.com/apache/incubator-zeppelin.git
- Build from source. Go to the incubator-zeppelin directory and run the following command:
mvn clean package -Pspark-1.5 -Ppyspark -Dhadoop.version=2.6.0-cdh5.5.0 -Phadoop-2.6 -Dmaven.test.skip=true
This command works for version 5.5 of the Cloudera distribution; make sure the Hadoop and Spark versions match the ones you run. In addition to installing support for Spark, this command configures Zeppelin with support for PySpark as well.
- To configure access to the Hive metastore, copy hive-site.xml to the conf directory under zeppelin.
- In the conf folder, create copies of the files zeppelin-env.sh.template and zeppelin-site.xml.template named zeppelin-env.sh and zeppelin-site.xml, respectively.
- If you would like to change the port for Zeppelin, change the following property in zeppelin-site.xml:
<property>
  <name>zeppelin.server.port</name>
  <value>8999</value>
  <description>Server port.</description>
</property>
- To start Zeppelin, run the following command from the bin directory:
./zeppelin-daemon.sh start
You can then access the Zeppelin UI at http://localhost:8999 [1]
- To stop Zeppelin, use the command
./zeppelin-daemon.sh stop
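The configuration steps above (copying the two templates and, optionally, hive-site.xml) can be scripted. What follows is a minimal sketch rather than anything shipped with Zeppelin: the function name setup_zeppelin_conf and the example paths are hypothetical, and it assumes the conf directory layout described above. It never overwrites a configuration file that already exists.

```shell
#!/usr/bin/env bash
# Sketch: prepare Zeppelin's conf directory from the shipped templates.
# The function name and the example paths are assumptions, not part of Zeppelin.

setup_zeppelin_conf() {
    local zeppelin_home="$1"   # directory that contains conf/, e.g. the git checkout
    local hive_site="${2:-}"   # optional path to an existing hive-site.xml
    local conf_dir="$zeppelin_home/conf"
    local name

    if [ ! -d "$conf_dir" ]; then
        echo "conf directory not found: $conf_dir" >&2
        return 1
    fi

    # Copy each template to its active name, keeping any file already in place.
    for name in zeppelin-env.sh zeppelin-site.xml; do
        if [ -f "$conf_dir/$name" ]; then
            echo "keeping existing $name"
        elif [ -f "$conf_dir/$name.template" ]; then
            cp "$conf_dir/$name.template" "$conf_dir/$name" || return 1
            echo "created $name"
        else
            echo "missing template: $name.template" >&2
            return 1
        fi
    done

    # Copy hive-site.xml only when a path to it was supplied.
    if [ -n "$hive_site" ]; then
        cp "$hive_site" "$conf_dir/hive-site.xml" || return 1
        echo "copied hive-site.xml"
    fi
}

# Example call (both paths are placeholders; adjust them to your machine):
#   setup_zeppelin_conf "$HOME/incubator-zeppelin" /etc/hive/conf/hive-site.xml
```

Because existing files are kept, the function can be re-run after a rebuild without losing a customized port or environment setting.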
Running Spark SQL queries against Hive and Visualizing Results
In a Zeppelin cell, type %hive to activate the interpreter with HiveQL support. You can then run the query, and visualization support is automatically activated in the output. To execute the cell, press Shift+Enter.
Building Scala scripts and plotting model outputs
You can also code in Scala or Python by activating the corresponding interpreter. The Scala and Spark interpreter is active by default for a cell.
To visualize a Spark DataFrame, use the z.show(df) command.
Writing documentation
Activate Markdown support in a cell by using %md. You can then add documentation along with your code. Unfortunately, LaTeX is not yet supported, but it should be in future releases.
What's missing?
Unlike IPython notebooks, there is no option to export to HTML or PDF (using LaTeX). Support for embedding LaTeX expressions is also missing, but these features should be added in future releases.
Conclusion
Although certain features are missing, Apache Zeppelin helps increase your productivity by reducing the time required for the build, test, and fix cycle. It also provides nice visualization capabilities for your queries and DataFrames.
[1] If you changed the port as described above; otherwise use the default port, 8080.