Apache Zeppelin Notebooks ~ Technical Essentials

Apache Zeppelin provides a Web-UI where you can iteratively build spark scripts in Scala, Python, etc. (It also provides autocomplete support), run Sparkql queries against Hive or other store and visualize the results from the query or spark dataframes. This is somewhat akin to what Ipython notebooks do for python. Spark developers know that building, testing and fixing errors in spark scripts can be a lengthy process (It is also dull because it is not interactive), but if you use Apache Zeppelin, you can iteratively buld and test portions of your script and this will enhance your productivity significantly.

Installing and Configuring Apache Zeppelin

Ensure following prerequisites are installed

Java 8: su -c yum install java-1.8.0-openjdk-devel

Maven 3.1.x+: sudo yum install apache-maven and then link it sudo ln -s /usr/share/apache-maven/bin/mvn /usr/bin/mvn. If this does not work for you, you can install it the following way.

wget http://www.eu.apache.org/dist/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
sudo tar -zxf apache-maven-3.3.3-bin.tar.gz -C /usr/local/
sudo ln -s /usr/local/apache-maven-3.3.3/bin/mvn /usr/local/bin/mvn

Git: sudo yum install git
NPM: yum install nodejs npm
Either download the source code from here or clone the git repository in a folder as git clone https://github.com/apache/incubator-zeppelin.git
Build from source, Go to the incubator-zeppelin directory and run the following command from it.

mvn clean package -Pspark-1.5 -Ppyspark -Dhadoop.version=2.6.0-cdh5.5.0 -Phadoop-2.6 -Dmaven.test.skip=true

This command works for version 5.5 of cloudera distribution, make sure your versions of hadoop and spark are correct. In addtion to installing support for spark, this command will configure zeppelin with support for pyspark as well.
To configure access for hive metastore copy the hive-site.xml to conf directory under zeppelin.
In the conf folder create copies of files zeppelin-env.sh.template and zeppelin-site.xml.template as zeppelin-env.sh and zeppelin-site.xml respectively.

If you would like to change the port for zeppelin, change the following property in zeppelin-site.xml.

<property>
  <name>zeppelin.server.port</name>
  <value>8999</value>
  <description>Server port.</description>
</property>

To start zeppelin use the command ./zeppelin-daemon.sh start. Then you can access zeppelin ui at http://localhost:8999 [1]
To stop zeppelin use the command ./zeppelin-daemon.sh stop

Running SparkQL queries against Hive and Visualizing Results

In a cell in zeppelin type %hive to activate interpreter with hive ql support. After you do this, you can then run the query and the visualization support is automatically activated in the output. To execute the cell use Shift+Enter key.

Bulding scala scripts and plotting model outputs

You can also code in scala or python by activating the interpreter. Scala and Spark interpreter is activated by default for a cell.

To visualize the spark dataframe just use z.show(df) command.

Writing documentation

Activate the markdown support in a cell by using %md. You can then add documentation along with your code. Unfortunately, the support for latex is still not there, but it should be there in future releases.

What's missing ?

Unlike ipython notebooks, there is no option to export to html or pdf(using latex). Also, the support for embedding latex expressions is missing, but these features should be added in future releases.

Conclusion

Although certain features are missing, Apache Zeppelin surely helps you in increasing your productivity by reducing the time required for build, test and fix cycle. Also, it provides nice visualization capabilities for your queries and dataframes.

[1]	If you changed the port.

Technical Essentials

Java, ADF, Android, Identity Management, Data Science, Machine Learning, Fusion Middleware, Linux, Counter Strike 1.6, BSD, Windows, Programming, Search Engines

Jun 26, 2016