We are a heavy adopter of Apache Hadoop with a large set of data that resides in its clusters, so it’s important for us to understand how these resources are utilized. At our July Hack Week, we experimented with developing HDFS-DU to provide us with an interactive visualization of the underlying Hadoop Distributed File System (HDFS). The project aims to monitor different snapshots of the entire file system interactively, showing the size of the folders and the rate at which their size changes. It can also distinguish efficient from inefficient file storage and highlight the nodes in the file system where inefficient storage occurs.
HDFS-DU provides the following in a web user interface:
- A TreeMap visualization where each node is a folder in HDFS and the area of each node reflects either its size or its number of descendants
- A tree visualization showing the topology of the file system
HDFS-DU is built using the following front-end technologies:
Below is a screenshot of the HDFS-DU user interface (directory names scrubbed). The user interface is made up of two linked visualizations. The left visualization is a TreeMap and shows parent-child relationships through containment. The right visualization is a tree layout, which displays two levels of depth from the current selected node in the file system. The tree visualization displays extra information for each node on hover.
You can drill down on the TreeMap by clicking on a node, which has the same effect as clicking on a node in the tree visualization. There are two possible layouts for the TreeMap. The default layout encodes file size in the area of each node. The second encodes the number of descendants in the area of each node, which makes it easier to spot nodes where storage is inefficient.
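To make the two layouts concrete, here is a minimal Python sketch of the two values a TreeMap node could be sized by: total bytes and number of descendants. This is not HDFS-DU's actual code. The `Node` class, the `aggregate` and `bytes_per_descendant` helpers, and the sample tree are all hypothetical, and the idea that a low bytes-per-descendant ratio hints at inefficient storage (for instance, many small files) is our assumption for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical folder or file entry; not HDFS-DU's real data model."""
    name: str
    size: int = 0  # bytes stored directly in this entry (non-zero for files)
    children: list[Node] = field(default_factory=list)


def aggregate(node: Node) -> tuple[int, int]:
    """Return (total_bytes, descendant_count) for the subtree rooted at node."""
    total_bytes = node.size
    descendants = 0
    for child in node.children:
        child_bytes, child_descendants = aggregate(child)
        total_bytes += child_bytes
        # Count the child itself plus everything below it.
        descendants += 1 + child_descendants
    return total_bytes, descendants


def bytes_per_descendant(node: Node) -> float:
    """Average bytes per descendant; a low value suggests many small entries."""
    total_bytes, descendants = aggregate(node)
    return total_bytes / descendants if descendants else float(total_bytes)


# Made-up sample tree: two folders holding the same number of bytes,
# one split across four small files and one stored as a single file.
logs = Node("logs", children=[Node(f"part-{i}", size=1_000) for i in range(4)])
archive = Node("archive", children=[Node("bundle", size=4_000)])
root = Node("/", children=[logs, archive])

for folder in (logs, archive):
    total, count = aggregate(folder)
    print(folder.name, total, count, bytes_per_descendant(folder))
```

In this sample, both folders would get the same area in the size-based layout, while the descendant-based layout would draw `logs` four times larger than `archive`, which is the kind of contrast the second view is meant to surface.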
We initially envisioned the TreeMap as a Voronoi TreeMap, but our current implementation of that code ran too slowly to be practical. We would love to get the Voronoi TreeMap code to work fast enough. We would also like to add the option to use different values to size and color the TreeMap areas, such as change in size, creation time, last access time, or frequency of access.
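As a rough illustration of the "change in size" value, the sketch below compares two snapshots and computes a per-folder growth rate. It assumes each snapshot is a plain mapping from folder path to total bytes, which is a simplification and not necessarily how HDFS-DU stores snapshots. The `growth_rates` function and the sample paths are hypothetical.

```python
def growth_rates(
    earlier: dict[str, int],
    later: dict[str, int],
    elapsed_seconds: float,
) -> dict[str, float]:
    """Return bytes-per-second change for every path seen in either snapshot.

    A path missing from one snapshot is treated as having size 0 there,
    so new folders show positive growth and deleted ones negative growth.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    paths = earlier.keys() | later.keys()
    return {
        path: (later.get(path, 0) - earlier.get(path, 0)) / elapsed_seconds
        for path in sorted(paths)
    }


# Made-up snapshots taken 60 seconds apart.
snapshot_a = {"/logs": 4_000, "/archive": 4_000}
snapshot_b = {"/logs": 10_000, "/archive": 4_000, "/tmp": 600}

rates = growth_rates(snapshot_a, snapshot_b, elapsed_seconds=60)
print(rates)
```

A value like this could then drive either the area or the color of a TreeMap node, so that fast-growing folders stand out.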
HDFS-DU was primarily authored by Travis Crawford (@tc), Nicolas Garcia Belmonte (@philogb) and Robert Harris (@trebor). Given that this is a young project, we always appreciate bug fixes, features and documentation improvements. Feel free to fork the project and send us a pull request on GitHub to say hello. Finally, if you’re interested in visualization and distributed file systems like Hadoop, we’re always looking for engineers to join the flock.
Follow @hdfsdu on Twitter to stay in touch!
- Chris Aniszczyk, Manager of Open Source (@cra)