Preparing the NCDC Weather Data for Hadoop
I’m exploring Hadoop with the book Hadoop: The Definitive Guide. Appendix A shows how to download NCDC weather data from S3 and load it into Hadoop. I didn’t want to download from S3 or load the entire dataset, so here’s what I did instead.
Here’s a little bash script I used to download the data. You might want to do this if you want more up-to-date data, or if you only want to work with a subset. If you only want data for a certain year, just append that year to the URL in $source_url (see the example after the script).
#!/bin/bash
source_url="ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/"
download_to="$HOME/ncdc_data"
# Create the download directory if it doesn't already exist
if [ ! -d "$download_to" ]; then
  mkdir -p "$download_to"
fi
# Recursively mirror the FTP directory, resuming any partial downloads
wget -r -c --progress=bar --no-parent -P "$download_to" "$source_url"
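For example, to grab only the 2012 files (the year I work with below), something like this restricts the mirror to that year’s directory on the FTP server:

# Same mirror options as above, but limited to a single year's directory
wget -r -c --progress=bar --no-parent -P "$download_to" "${source_url}2012/"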
I’ve modified the script from the Hadoop book to work with local files. I’m just working with files from 2012. Modify the path in $target if you want a different year.
#!/usr/bin/env bash
# NCDC weather files to load into Hadoop
target="/home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012"
# Un-gzip each station file and concatenate into one file
echo "reporter:status:Un-gzipping $target" >&2
for file in "$target"/*; do
  gunzip -c "$file" >> "$target.all"
  echo "reporter:status:Processed $file" >&2
done
# Put a gzipped version into HDFS
echo "reporter:status:Gzipping $target and putting in HDFS" >&2
gzip -c "$target.all" | $HADOOP_INSTALL/bin/hadoop fs -put - gz/$target.gz
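One thing the script doesn’t do is clean up after itself: the uncompressed concatenation in $target.all is left on local disk. Once the upload has been verified (see below), something like this reclaims the space, assuming you no longer need the local copy:

# Remove the intermediate concatenated file once it's safely in HDFS
rm "$target.all"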
The script will unzip all the files and combine them; as it runs you should see output similar to this:
reporter:status:Processed /home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012/999999-94996-2012.gz
reporter:status:Processed /home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012/999999-96404-2012.gz
reporter:status:Processed /home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012/999999-99999-2012.gz
When it’s finished combining all the files, it will store the data in HDFS:
reporter:status:Gzipping /home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012 and putting in HDFS
13/01/11 21:37:52 INFO util.NativeCodeLoader: Loaded the native-hadoop library
Once the process has completed, you can confirm the data is stored in Hadoop with the following command:
rhys@linux-g1rx:~/hadoop_scripts> hadoop fs -ls gz/home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012.gz
Found 1 items
-rwxrwxrwx 1 rhys users 4870924294 2013-01-11 23:11 /home/rhys/hadoop_scripts/gz/home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012.gz
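Before writing any jobs, it’s worth a quick sanity check on the contents: hadoop fs -cat streams the file back out of HDFS, and piping it through gunzip shows the first few raw NCDC records (using the same path as the listing above):

# Stream the gzipped file out of HDFS and peek at the first few records
hadoop fs -cat gz/home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012.gz | gunzip -c | head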
Now that I have data in Hadoop, it’s time to start writing MapReduce jobs!