This is documentation for MapR Version 5.0. You can also refer to MapR documentation for the latest release.

MapR provides compression for files stored in the cluster. Compression is applied automatically to uncompressed files unless you turn compression off. The advantages of compression are:

  • Compressed data uses less bandwidth on the network than uncompressed data.
  • Compressed data uses less disk space.

Choosing a Compression Setting

MapR supports three different compression algorithms:

  • lz4 (default)
  • lzf
  • zlib

Compression algorithms can be evaluated on compression ratio (a higher ratio means less disk space used), compression speed, and decompression speed. The following table compares the three supported algorithms. The figures were measured single-threaded on a Core 2 Duo at 3 GHz.

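| Compression Type | Compression Ratio | Compression Speed | Decompression Speed |
|------------------|-------------------|-------------------|---------------------|
| lz4              | 2.084             | 330 MB/s          | 915 MB/s            |
| lzf              | 2.076             | 197 MB/s          | 465 MB/s            |
| zlib             | 3.095             | 14 MB/s           | 210 MB/s            |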

Note that compression speed depends on various factors including:

  • block size (the smaller the block size, the faster the compression speed)
  • single-thread vs. multi-thread system
  • single-core vs. multi-core system
  • the type of codec used
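
To make the trade-off in the comparison table concrete, the following sketch estimates the stored size and single-threaded compression time for a hypothetical 1000 MB of input under each algorithm. It uses only the ratios and speeds from the table; the input size is an assumption chosen for illustration, and real results vary with the factors listed above.

```shell
#!/bin/sh
# Hypothetical input size in MB (an illustrative assumption, not a MapR figure).
INPUT_MB=1000

# Each input row: algorithm, compression ratio, compression speed in MB/s,
# copied from the comparison table.
ESTIMATES=$(printf '%s\n' 'lz4 2.084 330' 'lzf 2.076 197' 'zlib 3.095 14' |
  awk -v size="$INPUT_MB" '{
    # stored size = input / ratio; compression time = input / speed
    printf "%s: about %.0f MB stored, about %.1f s to compress\n", $1, size / $2, size / $3
  }')

echo "$ESTIMATES"
```

Under these numbers, zlib stores the least data but takes more than twenty times as long to compress as lz4, which is consistent with lz4 being the default.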

Setting Compression on Files

Compression is set at the directory level. Any files written by a Hadoop application, whether via the file APIs or over NFS, are compressed according to the settings for the directory where the file is written. Sub-directories on which compression has not been explicitly set inherit the compression settings of the directory that contains them.

If you change a directory's compression settings after writing a file, the file keeps its old compression settings. For example, if you write a file in an uncompressed directory and then turn compression on, the file is not automatically compressed, and vice versa. Further writes to the file use the file's existing compression setting.

Note: Only the owner of a directory can change its compression settings or other attributes. Write permission is not sufficient.

File Extensions of Compressed Files

By default, MapR does not compress files whose filename extensions indicate they are already compressed. The default list of filename extensions is as follows:

  • bz2
  • gz
  • lzo
  • snappy
  • tgz
  • tbz2
  • zip
  • z
  • Z
  • mp3
  • jpg
  • jpeg
  • mpg
  • mpeg
  • avi
  • gif
  • png

The list of filename extensions not to compress is stored as comma-separated values in the mapr.fs.nocompression configuration parameter, and can be modified with the config save command. For example, you can add parquet to the default list:
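
A minimal sketch of adding parquet to the list follows. It assumes the config save command is invoked through maprcli with a JSON -values argument, and that the saved value replaces the whole list, so the default entries are repeated. The command is printed rather than executed so that it can be reviewed first.

```shell
#!/bin/sh
# Default extension list, copied from the list above.
DEFAULT_EXTS="bz2,gz,lzo,snappy,tgz,tbz2,zip,z,Z,mp3,jpg,jpeg,mpg,mpeg,avi,gif,png"

# Append parquet. Assumption: the parameter is overwritten as a whole,
# so the existing entries must be included again.
NEW_EXTS="${DEFAULT_EXTS},parquet"

# JSON object mapping the parameter name to the comma-separated list.
VALUES="{\"mapr.fs.nocompression\":\"${NEW_EXTS}\"}"

# Dry run: prints the command. Remove the leading "echo" to run it on a cluster node.
echo maprcli config save -values "$VALUES"
```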

The list can be viewed with the config load command. Example:
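
A sketch of viewing the list, assuming the config load command is invoked through maprcli and selects the parameter with -keys. The output is not reproduced here because it depends on the cluster's current setting.

```shell
#!/bin/sh
# Parameter name from the paragraph above.
KEY="mapr.fs.nocompression"

# Dry run: prints the command. Remove the leading "echo" to run it on a cluster node.
echo maprcli config load -keys "$KEY"
```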

Turning Compression On or Off on Directories

You can turn compression on or off for a given directory in two ways:

  • Set the value of the Compression attribute in the .dfs_attributes file at the top level of the directory.
    • Set Compression=lzf|lz4|zlib to turn compression on for a directory.
    • Set Compression=false to turn compression off for a directory.
  • Use the command hadoop mfs -setcompression on|off|lzf|lz4|zlib <dir>.

If you choose -setcompression on without specifying an algorithm, lz4 is used by default. This algorithm has improved compression speeds for MapR's block size of 64 KB.

Example

Suppose the volume test is NFS-mounted at /mapr/my.cluster.com/projects/test. You can turn off compression by editing the file /mapr/my.cluster.com/projects/test/.dfs_attributes and setting Compression=false. To accomplish the same thing from the hadoop shell, use the following command:
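
A sketch of the hadoop shell equivalent follows. It assumes the hadoop shell addresses the directory by its path within the cluster (/projects/test), not by the NFS mount path, so the mount prefix is stripped first. The command is printed rather than executed.

```shell
#!/bin/sh
# NFS mount prefix and directory path from the example above.
CLUSTER_MOUNT="/mapr/my.cluster.com"
NFS_PATH="/mapr/my.cluster.com/projects/test"

# Strip the NFS mount prefix to get the path as seen from the hadoop shell
# (assumption: the cluster root is what is mounted at $CLUSTER_MOUNT).
MFS_PATH=${NFS_PATH#"$CLUSTER_MOUNT"}

# Dry run: prints the command. Remove the leading "echo" to run it on a cluster node.
echo hadoop mfs -setcompression off "$MFS_PATH"
```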

You can view the compression settings for directories using the hadoop mfs -ls command. For example:
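
A sketch of the listing command, reusing the /projects/test directory from the earlier example as a hypothetical target. The listing itself is not reproduced here, since its contents depend on the cluster. The command is printed rather than executed.

```shell
#!/bin/sh
# Hypothetical directory to inspect; substitute any directory in the cluster.
DIR="/projects/test"

# Dry run: prints the command. Remove the leading "echo" to run it on a cluster node.
echo hadoop mfs -ls "$DIR"
```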

The symbols for the various compression settings are explained here:

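| Symbol | Compression Setting                                         |
|--------|-------------------------------------------------------------|
| Z      | lz4                                                         |
| z      | zlib                                                        |
| L      | lzf                                                         |
| U      | Uncompressed, or previously compressed by another algorithm |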
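
When checking many listings, the symbol table can be encoded as a small lookup. The function below is a hypothetical helper, not part of MapR. It takes a symbol as its argument and prints the corresponding setting; note that the symbols are case-sensitive (Z is lz4, z is zlib).

```shell
#!/bin/sh
# Hypothetical helper: map a compression symbol to its setting,
# following the symbol table above.
compression_setting() {
  case "$1" in
    Z) echo "lz4" ;;
    z) echo "zlib" ;;
    L) echo "lzf" ;;
    U) echo "uncompressed, or previously compressed by another algorithm" ;;
    *) echo "unknown symbol: $1" >&2; return 1 ;;
  esac
}

compression_setting Z
compression_setting z
```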

Setting Compression During Shuffle

By default, MapReduce uses compression during the Shuffle phase. You can use the -Dmapreduce.maprfs.use.compression switch to turn compression off during the Shuffle phase of a MapReduce job. For example:
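
A sketch of passing the switch on the command line follows. The jar name, main class, and input and output paths are hypothetical placeholders. It assumes that setting the property to false is what disables shuffle compression, and that the job parses generic -D options (for example through ToolRunner), which requires the switch to come before the job's own arguments. The command is printed rather than executed.

```shell
#!/bin/sh
# Hypothetical job details; substitute your own jar, class, and paths.
JOB_JAR="my-job.jar"
MAIN_CLASS="com.example.MyJob"
INPUT_DIR="/input"
OUTPUT_DIR="/output"

# Switch named in the paragraph above, set to false to turn shuffle compression off.
SHUFFLE_OPT="-Dmapreduce.maprfs.use.compression=false"

# Dry run: prints the command. Remove the leading "echo" to run it on a cluster node.
echo hadoop jar "$JOB_JAR" "$MAIN_CLASS" "$SHUFFLE_OPT" "$INPUT_DIR" "$OUTPUT_DIR"
```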
