Thursday, August 29, 2013

Hadoop for .NET Developers: Understanding HDFS

From a data storage perspective, you can think of Hadoop as simply a big file server.  Through the name node, the Hadoop cluster presents itself as a single file system accepting basic Linux file system commands such as ls, rmr, mkdir, and others.  To see the list of available commands with explainations and syntax, you can launch the Hadoop command prompt (a shortcut to which is placed on the desktop with the default HDInsight install) on your desktop development environment and issue the following command:

hadoop fs -help

Here are some of the results of this command:

Inline image 1


Data are loaded to the Hadoop file system as files and can be loaded either manually or programmatically.  The steps involved with each are addressed in the next few posts of this blog series.

When a file is loaded into a Hadoop cluster using HDFS, the file is broken up into blocks.  The default size of a block is 64MB though HDFS does not allocate a full block when a partial block is needed. To illustrate this, consider a 100MB file being loaded to HDFS.  Assuming a default configuration, this file would be broken into a 64MB block and a 36MB.

These blocks are stored on the data nodes and distributed in a manner that attempts to balance the placement of blocks across the cluster.  The name node keeps track of which blocks comprise which files and on which data nodes these blocks are stored.  To the user, none of this is apparent so that the file system presents a given file as if it were a singular entity within the file system regardless of the number of blocks involved and their placement on the data nodes.

Read more: Data Otaku
QR: Inline image 2