MapR provides compression for files stored in the cluster. Compression is applied automatically to otherwise uncompressed files unless you turn it off. The advantages of compression are:
- Compressed data uses less bandwidth on the network than uncompressed data.
- Compressed data uses less disk space.
Choosing a Compression Setting
MapR supports three compression algorithms:
- lzf
- lz4 (default)
- zlib
Compression algorithms can be evaluated by compression ratio (higher compression means less disk space used), compression speed, and decompression speed. The following table compares the three supported algorithms; the figures are based on a single thread on a Core 2 Duo at 3 GHz.
Note that compression speed depends on various factors including:
- block size (the smaller the block size, the faster the compression speed)
- single-thread vs. multi-thread system
- single-core vs. multi-core system
- the type of codec used
Setting Compression on Files
Compression is set at the directory level. Any files written by a Hadoop application, whether via the file APIs or over NFS, are compressed according to the settings for the directory where the file is written. Sub-directories on which compression has not been explicitly set inherit the compression settings of the directory that contains them.
If you change a directory's compression settings after writing a file, the file keeps its old compression settings: if you write a file in an uncompressed directory and then turn compression on, the file does not automatically become compressed, and vice versa. Further writes to the file use the file's existing compression setting.
File Extensions of Compressed Files
By default, MapR does not compress files whose filename extensions indicate they are already compressed. The default list of filename extensions is as follows:
The list of filename extensions not to compress is stored as comma-separated values in the `mapr.fs.nocompression` configuration parameter. It can be modified with the `config save` command, for example to add `parquet` to the default list, and the current list can be viewed with the `config load` command.
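A sketch of both operations, assuming the standard `maprcli` front end for these commands; the extension list shown is abbreviated for illustration and is not the actual default list:

```shell
# config save replaces the parameter's value, so include the existing
# extensions (abbreviated here) along with the new one:
maprcli config save -values '{"mapr.fs.nocompression":"gz,zip,bz2,parquet"}'

# View the current value of the parameter:
maprcli config load -keys mapr.fs.nocompression
```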
Turning Compression On or Off on Directories
You can turn compression on or off for a given directory in two ways:
- Set the value of the `Compression` attribute in the `.dfs_attributes` file at the top level of the directory:
  - `Compression=lzf|lz4|zlib` turns compression on for the directory.
  - `Compression=false` turns compression off for the directory.
- Use the command `hadoop mfs -setcompression on|off|lzf|lz4|zlib <dir>`.
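As an illustration of the first method, a minimal `.dfs_attributes` file that turns on zlib compression for a directory might contain the following (other attributes may also appear in this file):

```
# .dfs_attributes at the top level of the directory
Compression=zlib
```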
If you choose `-setcompression on` without specifying an algorithm, lz4 is used by default. This algorithm offers improved compression speed for MapR's block size of 64 KB.
Suppose the volume `test` is NFS-mounted at `/mapr/my.cluster.com/projects/test`. You can turn off compression by editing the file `/mapr/my.cluster.com/projects/test/.dfs_attributes` and setting `Compression=false`. To accomplish the same thing from the hadoop shell, use the `hadoop mfs -setcompression` command.
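Assuming `/projects/test` is the volume's path within the cluster namespace (the path that the NFS mount point maps to), the command would be:

```shell
hadoop mfs -setcompression off /projects/test
```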
You can view the compression settings for directories using the `hadoop mfs -ls` command.
The symbols for the various compression settings are explained here:
- Uncompressed, or previously compressed by another algorithm
Setting Compression During Shuffle
By default, MapReduce uses compression during the Shuffle phase. You can use the `-Dmapreduce.maprfs.use.compression` switch to turn compression off during the Shuffle phase of a MapReduce job.
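A sketch of such an invocation; the jar name, class name, and input/output paths are hypothetical placeholders, and the switch is assumed to take the value `false` to disable compression:

```shell
hadoop jar myjob.jar MyJobClass -Dmapreduce.maprfs.use.compression=false /input /output
```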