This is documentation for MapR Version 5.0. You can also refer to MapR documentation for the latest release.

You can replicate changes (puts and deletes) to the data in one table to another table that is in a separate cluster or within the same cluster. You can replicate entire tables, specific column families, or specific columns.

Basic components

Tables that data is replicated from are called source tables, while tables that data is replicated to are called replicas.

The maximum number of replicas that a source table can replicate to is 64.

The maximum number of source tables that a replica can accept updates from is 64.

Clusters that data is replicated from are called source clusters. Clusters that data is replicated to are called destination clusters. A single cluster can be both a source cluster and a destination cluster, depending on the replication configuration in which the cluster participates.

Replication takes place between source and destination clusters. However, source clusters do not send data to nodes in the destination cluster directly. The replication stream (the data being pushed to the replicas) is consumed by one or more MapR gateways in the destination cluster. The gateways receive the updates from the source cluster, batch them, and apply them to the replica tables. Multiple gateways serve the purpose of both load balancing and failover. For more about gateways, see the topic “MapR gateways”.
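The gateway behavior described above (receive, batch, apply, with several gateways sharing load and covering for one another) can be pictured with a small simulation. This is only a sketch under stated assumptions: the Gateway class, the send function, the round-robin routing, and the batch size of 2 are all hypothetical and are not taken from MapR-DB, whose actual routing and batching policies are not described in this topic.

```python
from itertools import cycle


class Gateway:
    """Hypothetical stand-in for a gateway in the destination cluster."""

    def __init__(self, name, batch_size=2):
        self.name = name
        self.batch_size = batch_size   # assumed batch size, for illustration only
        self.up = True
        self.pending = []              # updates received but not yet applied
        self.applied_batches = []      # size of each batch applied to the replica

    def receive(self, update, replica):
        """Buffer one (key, value) update and apply a batch once it is full."""
        self.pending.append(update)
        if len(self.pending) >= self.batch_size:
            replica.update(self.pending)          # apply the whole batch at once
            self.applied_batches.append(len(self.pending))
            self.pending = []


def send(update, gateways, rotation, replica):
    """Route an update to the next available gateway (assumed round-robin)."""
    for _ in range(len(gateways)):
        gateway = next(rotation)
        if gateway.up:                            # skip failed gateways (failover)
            gateway.receive(update, replica)
            return gateway.name
    raise RuntimeError("no gateway available")


replica = {}                                      # the replica table, as a dict
gateways = [Gateway("g1"), Gateway("g2")]
rotation = cycle(gateways)

# Both gateways are up: updates alternate between them (load balancing).
routed = [send((f"row{i}", i), gateways, rotation, replica) for i in range(4)]

# g2 fails: the remaining updates all go through g1 (failover).
gateways[1].up = False
routed += [send((f"row{i}", i), gateways, rotation, replica) for i in range(4, 6)]

print(routed)
print(sorted(replica))
```

In this toy run the first four updates alternate between the two gateways, and the last two are both handled by g1 after g2 is marked as failed.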

Modes of replication

You can replicate table data in one of two replication modes. You specify the mode per source-replica pair.

Asynchronous replication

In this replication mode, MapR-DB confirms to client applications that operations are complete after the operations are performed on source tables. Updates are replicated in the background. Therefore, the latency of updates from client applications is not affected by the time required for the network roundtrip between the source cluster and the destination cluster.

This type of replication is well-suited for clusters that are geographically separated in wide-area networks.

Asynchronous replication is the default replication mode.

Synchronous replication

In this replication mode, MapR-DB confirms to client applications that changes have been applied to a source table only after the changes are sent to a gateway in the destination cluster.

Because of the confirmations that MapR-DB receives on source clusters, synchronous replication is especially well-suited for creating a backup of your data for disaster recovery.

When the latency of a replication stream is high, MapR-DB switches to asynchronous replication temporarily so that client applications are not blocked indefinitely. After the latency is sufficiently reduced, MapR-DB switches back to synchronous replication. The same switching occurs when a gateway fails, and MapR-DB does not resume synchronous replication until a new gateway is established or the failed gateway is restarted.
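The temporary switch between modes can be sketched as a small state machine. This is an illustrative model only: the ReplicationStream class, its observe method, and both latency thresholds are hypothetical, and this topic does not state how MapR-DB measures latency or what values it treats as high or sufficiently reduced.

```python
from dataclasses import dataclass


@dataclass
class ReplicationStream:
    """Toy model of one source-replica pair configured for synchronous replication."""

    high_latency_ms: float = 500.0   # hypothetical threshold for falling back
    low_latency_ms: float = 100.0    # hypothetical threshold for resuming
    mode: str = "sync"

    def observe(self, latency_ms: float, gateway_up: bool = True) -> str:
        """Update the effective mode from one latency sample and the gateway state."""
        if not gateway_up or latency_ms >= self.high_latency_ms:
            self.mode = "async"      # fall back so that clients are not blocked
        elif latency_ms <= self.low_latency_ms:
            self.mode = "sync"       # latency has recovered and a gateway is available
        # Between the two thresholds the current mode is kept unchanged.
        return self.mode


stream = ReplicationStream()
samples = [(50, True), (800, True), (300, True), (80, True), (60, False), (60, True)]
modes = [stream.observe(latency, up) for latency, up in samples]
print(modes)
```

Using two thresholds instead of one is a design choice of this sketch: it keeps the stream from flapping between modes when latency hovers near a single cutoff.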

Supported replication topologies

Two basic topologies are available for your replication scenarios: master-slave replication, from which you can construct several more complex topologies, and multi-master replication.

Master-slave replication

Several different topologies are possible for master-slave replication:

Replication from one source table to one or more replica tables

In this topology, updates on a source table are replicated to one or more replicas, but updates to the replicas are not replicated back to the source table.

For example, in this diagram updates to the customers table in the cluster sanfrancisco are being replicated to the newyork and hyderabad clusters. The circles marked G each represent a MapR gateway.

However, changes to the table in the newyork and hyderabad clusters are not replicated back to the table in the sanfrancisco cluster.

You can also replicate within a single cluster. In this example, the cluster sanfrancisco contains both the source table and the replica.

Many-to-one replication

Multiple source tables can replicate to a single replica. In this diagram, operations on customers tables in three different clusters are replicated via gateways to the customers table in the newyork cluster.

One-to-many replication

A single source table can replicate to multiple replicas. In this diagram, operations on the customers table in the sanfrancisco cluster are replicated via gateways to replicas in three other clusters.

Replication loops

When three or more tables need to be kept in sync, you can set up master-slave replication between pairs of them to form a replication loop. Operations on a table are propagated to the other clusters in the loop, but there is no attempt to reapply the operations at the originating table. This is because the operations are tagged with a universally unique identifier (UUID) that identifies the table where the operations originated.

In this diagram, for example, operations on the customers table in the hyderabad cluster are replicated first to the customers table in the tokyo cluster. The operations are then replicated from the tokyo cluster to the customers table in the sanfrancisco cluster. Finally, the operations are replicated from the sanfrancisco cluster to the customers table in the newyork cluster. The newyork cluster does not replicate the operations to the customers table in the hyderabad cluster.
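The loop in this example can be modeled in a few lines to show why the origin identifier stops an operation from coming back to the table where it started. This is a minimal sketch, not MapR-DB code: the Table class, its fields, and the forwarding logic are hypothetical, and the only detail taken from this topic is that each operation carries a UUID identifying its originating table.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Table:
    """Hypothetical model of a customers table in one cluster of the loop."""

    cluster: str
    table_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    rows: dict = field(default_factory=dict)
    downstream: Optional["Table"] = None   # next table in the loop
    applied: int = 0                       # number of operations applied here

    def put(self, key, value):
        """Client write: the operation is tagged with this table's UUID."""
        self._apply(key, value, origin=self.table_id)

    def _apply(self, key, value, origin):
        self.rows[key] = value
        self.applied += 1
        nxt = self.downstream
        # Forward the operation unless the next table is where it originated.
        if nxt is not None and nxt.table_id != origin:
            nxt._apply(key, value, origin)


names = ["hyderabad", "tokyo", "sanfrancisco", "newyork"]
tables = [Table(name) for name in names]
for i, table in enumerate(tables):
    table.downstream = tables[(i + 1) % len(tables)]   # close the loop

tables[0].put("c1", "Asha")                            # write at hyderabad
print({t.cluster: (t.rows, t.applied) for t in tables})
```

Each table ends up with the row applied exactly once, and the operation is not forwarded from newyork back to hyderabad.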

Master-slave replication in two directions

You can combine master-slave replication configurations to replicate data between clusters. Two clusters engaged in replication can each act as a source cluster and a destination cluster.

In this example, the data in the customers table in the cluster sanfrancisco is replicated to the customers table in the cluster newyork. At the same time, the data in the products table in the newyork cluster is replicated to the products table in the cluster sanfrancisco.

In all master-slave configurations, changes made to replica tables are not replicated back to source tables. Therefore, if the replicated data is modified at the replica by client applications, the replica will become out of sync with the source table.

For example, you might replicate the two column families personal and purchases from the customers table in the sanfrancisco cluster to the customers table in the newyork cluster, as in this diagram. (For simplicity, the blue circle labeled G represents two or more gateways, rather than one as in the other diagrams in this topic.)

Any updates that applications make to those two column families in the customers table in the newyork cluster are not replicated to the customers table in the sanfrancisco cluster.

However, you don’t have to protect a replica from all updates that are not due to replication. For example, the customers table in the newyork cluster might have an additional column family that is not populated with replicated data: reviews.

Multi-master replication

In this configuration, two clusters have identical copies of a table. Applications can change any copy of the table, and MapR-DB replicates the changes to the other copy.

In this schematic diagram, there are two clusters: sanfrancisco and newyork. The table customers is updated on the sanfrancisco cluster by client applications on that cluster, and the same table is updated on the newyork cluster by client applications on that cluster. Updates made in either cluster are replicated to the other cluster.

Order of writes at replicas

Replicated operations can arrive at a replica, and be written to it, in an order different from the order in which they were written to the source table.

In this diagram, the values “foo” and then “bar” are written to the source table. However, due to network issues, the values are written to the replica in the reverse order: “bar”, then “foo”.

Client applications on the destination cluster should not depend on updates being written to the replica in the same order in which they were written to the source table.

Conflict resolution

MapR-DB supports multiple versions of the values for each cell in a table. If a cell is updated both at the source table and at a replica, and the corresponding column family is not configured to support multiple versions of column data, then only the update with the later timestamp is retained. However, if the column family is configured to support multiple versions, then both versions will be saved.
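The two outcomes described above can be illustrated with a toy cell that keeps a bounded number of timestamped versions. This is a sketch under assumptions, not the MapR-DB implementation: the Cell class, its max_versions argument, and the integer timestamps are hypothetical names and values chosen for the example.

```python
class Cell:
    """Hypothetical cell that retains at most max_versions timestamped values."""

    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        self.versions = []   # list of (timestamp, value), newest first

    def write(self, timestamp, value):
        """Record a write, then keep only the newest max_versions entries."""
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: tv[0], reverse=True)
        del self.versions[self.max_versions:]

    def latest(self):
        return self.versions[0][1] if self.versions else None


# Column family keeps a single version: the later timestamp wins.
single = Cell(max_versions=1)
single.write(200, "updated-at-replica")
single.write(100, "updated-at-source")   # older write, discarded
print(single.versions)

# Column family keeps multiple versions: both updates are saved.
multi = Cell(max_versions=3)
multi.write(200, "updated-at-replica")
multi.write(100, "updated-at-source")
print(multi.versions)
```

In the single-version case only the write with timestamp 200 survives; in the multi-version case both writes are retained, newest first.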

Security

Security is configured at all locations in the replication stream.

On clusters

You can replicate between clusters that are all secure.

At source tables

The -replperm parameter lets you specify an access control expression (ACE) to declare who has permission to replicate data from a table. This parameter is available in the maprcli table create and maprcli table edit commands.

Across a network

You can send data encrypted or unencrypted when replicating between secure clusters by using the -networkencryption parameter when adding a replica to a source table.

At gateways

Gateways ensure that replicas receive updates only from source tables that are designated as upstream sources.

Moreover, gateways handle authentication with secure destination clusters.

At replicas

Because of these upstream security checks, no parameters are needed for setting ACEs to declare who has permission to update a replica through a replication stream. However, before replication begins, replicas can be loaded with a snapshot of the data in the corresponding source tables. Permission to perform such a load is controlled by the ACE that you set in the -bulkLoad parameter for a replica. You can set the ACE with either the maprcli table create or maprcli table edit command.

All other ACEs defined for a replica still apply for local updates and reads.

Licensing

Table replication requires a license for MapR Enterprise Database Edition (M7) on source and destination clusters.
