HDFS Disk Usage Increases Due to Spark Checkpoints

Part of the classification of NetFlow data uses a Spark feature called Checkpointing. By design, Spark does not clean up these Checkpoints, so HDFS usage can grow without bound.
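
To confirm that checkpoint data is what is consuming the space, you can check the size of the checkpoint directory in HDFS. The path below is the one used by the cleanup step later in this article; adjust it if your deployment writes checkpoints to a different location:

# Report the total size of the Spark checkpoint directory
hadoop fs -du -s -h /user/spark/checkpoint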

The following modification to analytics.sh forces a cleanup of these directories on each run and prevents the unbounded disk usage.

Change:

...
$DIR/sanity.sh --check $1 && $DIR/aggregate.sh $1 && $DIR/training.sh $1 && $DIR/scoring.sh $1 && $DIR/sync.sh $1

if [ $? -eq 0 ]; then
...

To:

...
$DIR/sanity.sh --check $1 && $DIR/aggregate.sh $1 && $DIR/training.sh $1 && $DIR/scoring.sh $1 && $DIR/sync.sh $1

# Clean up Spark Checkpoints
hadoop fs -rm -R -skipTrash /user/spark/checkpoint/*

if [ $? -eq 0 ]; then
...
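
Note that with this change the cleanup command runs between the analytics pipeline and the if [ $? -eq 0 ] check, so $? at that point reflects the exit status of the hadoop command rather than of the pipeline. If the elided portion of analytics.sh depends on that status, a variant along these lines (ANALYTICS_RC is a hypothetical variable name used here only for illustration) captures the pipeline's status before the cleanup:

...
$DIR/sanity.sh --check $1 && $DIR/aggregate.sh $1 && $DIR/training.sh $1 && $DIR/scoring.sh $1 && $DIR/sync.sh $1
# Capture the pipeline's exit status before running any other command
ANALYTICS_RC=$?

# Clean up Spark Checkpoints
hadoop fs -rm -R -skipTrash /user/spark/checkpoint/*

if [ $ANALYTICS_RC -eq 0 ]; then
...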

 

IMPORTANT NOTE: If Analytics is being run in a shared environment (i.e., other applications are using YARN, and/or multiple tenants are configured within Interset), do NOT use this method, as you may inadvertently delete Checkpoints belonging to other applications or tenants running at the same time. If this scenario applies to you, please contact Interset Support (support@interset.com) for resolution options.
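
If the warning above does not apply to your deployment and you make this change, you can confirm after the next Analytics run that Checkpoints are being removed by listing the directory, which should be empty (or contain only Checkpoints from a run still in progress):

hadoop fs -ls /user/spark/checkpoint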
