Pipeline configuration file

Learn how to specify the main configuration details for an RDI pipeline.

The main configuration details for an RDI pipeline are in the config.yaml file. This file specifies the connection details for the source and target databases, and also the set of tables you want to capture. You can also add one or more job files if you want to apply custom transformations to the captured data.

Example

Below is an example of a config.yaml file. Note that the values of the form "${name}" refer to secrets that you should set as described in Set secrets. In particular, you should normally use secrets as shown to set the source and target username and password rather than storing them in plain text in this file.

sources:
  mysql:
    type: cdc
    logging:
      level: info
    connection:
      type: mysql
      host: <DB_HOST> # e.g. localhost
      port: 3306
      # User and password are injected from the secrets.
      user: ${SOURCE_DB_USERNAME}
      password: ${SOURCE_DB_PASSWORD}
    # Additional properties for the source collector:
    # List of databases to include (optional).
    # databases:
    #   - database1
    #   - database2

    # List of tables to be synced (optional).
    # tables:
    #   If only one database is specified in the databases property above,
    #   then tables can be defined without the database prefix.
    #   <DATABASE_NAME>.<TABLE_NAME>:
    #     List of columns to be synced (optional).
    #     columns:
    #       - <COLUMN_NAME>
    #       - <COLUMN_NAME>
    #     List of columns to be used as keys (optional).
    #     keys:
    #       - <COLUMN_NAME>

    # Example: Sync specific tables.
    # tables:
    #   Sync a specific table with all its columns:
    #   redislabscdc.account: {}
    #   Sync a specific table with selected columns:
    #   redislabscdc.emp:
    #     columns:
    #       - empno
    #       - fname
    #       - lname

    # Advanced collector properties (optional):
    # advanced:
    #   Sink collector properties - see the full list at
    #     https://debezium.io/documentation/reference/stable/operations/debezium-server.html#_redis_stream
    #   sink:
    #     Optional hard limits on memory usage of RDI streams.
    #     redis.memory.limit.mb: 300
    #     redis.memory.threshold.percentage: 85

    #     Uncomment for production so RDI Collector will wait on replica
    #     when writing entries.
    #     redis.wait.enabled: true
    #     redis.wait.timeout.ms: 1000
    #     redis.wait.retry.enabled: true
    #     redis.wait.retry.delay.ms: 1000

    #   Source specific properties - see the full list at
    #     https://debezium.io/documentation/reference/stable/connectors/
    #   source:
    #     snapshot.mode: initial
    #     Uncomment if you want a snapshot to include only a subset of the rows
    #     in a table. This property affects snapshots only.
    #     snapshot.select.statement.overrides: <DATABASE_NAME>.<TABLE_NAME>
    #     The specified SELECT statement determines the subset of table rows to
    #     include in the snapshot.
    #     snapshot.select.statement.overrides.<DATABASE_NAME>.<TABLE_NAME>: <SELECT_STATEMENT>

    #     Example: Snapshot filtering by order status.
    #     To include only orders with non-pending status from the
    #     customers.orders table:
    #     snapshot.select.statement.overrides: customers.orders
    #     snapshot.select.statement.overrides.customers.orders: SELECT * FROM customers.orders WHERE status != 'pending' ORDER BY order_id DESC

    #   Quarkus framework properties - see the full list at
    #     https://quarkus.io/guides/all-config
    #   quarkus:
    #     banner.enabled: "false"

    #   `java_options` (for RDI 1.15.1 and above) controls the JAVA_OPTS
    #   environment variable. Use it to modify the default values for the
    #   Java heap size and other Java options for the Debezium server.
    #   java_options: "-Xmx2g -Xms512m"

targets:
  # Redis target database connections.
  # The default connection must be named 'target'. It is used when a job
  # does not specify a connection or when no jobs are deployed. However,
  # you can define multiple connections here (e.g. target1,
  # my-cloud-redis-db2, etc.) and use them in the output blocks of the
  # job definitions.
  target:
    connection:
      type: redis
      # Host of the Redis database to which RDI will
      # write the processed data.
      host: <REDIS_TARGET_DB_HOST> # e.g. localhost
      # Port for the Redis database to which RDI will
      # write the processed data.
      port: <REDIS_TARGET_DB_PORT> # e.g. 12000
      # User of the Redis database to which RDI will write the processed data.
      # Uncomment if you are not using the default user.
      # user: ${TARGET_DB_USERNAME}
      # Password for Redis target database.
      password: ${TARGET_DB_PASSWORD}
      # SSL/TLS configuration: Uncomment to enable secure connections.
      # key: ${TARGET_DB_KEY}
      # key_password: ${TARGET_DB_KEY_PASSWORD}
      # cert: ${TARGET_DB_CERT}
      # cacert: ${TARGET_DB_CACERT}
processors:
  # Interval (in seconds) on which to perform retry on failure.
  # on_failed_retry_interval: 5
  # The batch size for reading data from the source database.
  # read_batch_size: 2000
  # Time (in ms) after which data will be read from stream even if
  # read_batch_size was not reached.
  # duration: 100
  # The batch size for writing data to the target Redis database. Should be
  # less than or equal to the read_batch_size.
  # write_batch_size: 200
  # Enable deduplication mechanism (default: false).
  # dedup: <DEDUP_ENABLED>
  # Max size of the deduplication set (default: 1024).
  # dedup_max_size: <DEDUP_MAX_SIZE>
  # Error handling strategy: ignore - skip, dlq - store rejected messages
  # in a dead letter queue.
  # error_handling: dlq
  # Dead letter queue max messages per stream.
  # dlq_max_messages: 1000
  # Data type to use in Redis target database: `hash` for Redis Hash,
  # `json` for JSON (which requires the RedisJSON module).
  # target_data_type: hash
  # Number of processes to use when syncing initial data.
  # initial_sync_processes: 4
  # Checks if the batch has been written to the replica shard.
  # wait_enabled: false
  # Timeout in milliseconds when checking write to the replica shard.
  # wait_timeout: 1000
  # Ensures that a batch has been written to the replica shard and keeps
  # retrying if not.
  # retry_on_replica_failure: true
  # Enable merge as the default strategy to writing JSON documents.
  # json_update_strategy: merge
  # Use native JSON merge if the target RedisJSON module supports it.
  # use_native_json_merge: true

Sections

The main sections of the file configure sources, targets, and processors.

Sources

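The sources section has a subsection for the source that you need to configure. The source subsection starts with a unique name to identify the source (the example uses the name mysql, but you can choose any name you like). The example configuration contains the following data: the collector type (cdc), the logging level, the connection details for the source database (type, host, port, and the user and password secrets), optional lists of the databases and tables to capture (with optional columns and keys for each table), and optional advanced properties for the sink, the source connector, the Quarkus framework, and the Java options.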
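
As an illustration, the sketch below shows a sources section with the optional databases and tables filters uncommented. It reuses the database and table names from the commented example above (redislabscdc, account, emp). The host name mysql.example.com and the choice of empno as the key column are assumptions made only for illustration.

```yaml
sources:
  mysql:
    type: cdc
    connection:
      type: mysql
      host: mysql.example.com # hypothetical host
      port: 3306
      user: ${SOURCE_DB_USERNAME}
      password: ${SOURCE_DB_PASSWORD}
    # Only one database is listed, so the table names below
    # do not need the database prefix.
    databases:
      - redislabscdc
    tables:
      # Capture all columns of the account table.
      account: {}
      # Capture only the selected columns of the emp table.
      emp:
        columns:
          - empno
          - fname
          - lname
        # Assumption: empno uniquely identifies a row.
        keys:
          - empno
```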

Targets

Use this section to provide the connection details for the target Redis database(s). As with the sources, you should start each target section with a unique name that you are free to choose, although the default connection must be named target, as in the example. In the connection section, you can specify the type of the target database, which must be redis, along with connection details such as host, port, and credentials (username and password). If you use TLS or mTLS to connect to the target database, you must specify the CA certificate (for TLS), and the client certificate and private key (for mTLS) in cacert, cert, and key. Note that these certificates must be references to secrets that you should set as described in Set secrets (it is not possible to include these certificates as plain text in the file).
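
The following sketch shows how a targets section might look with mTLS enabled and a second connection defined alongside the default one. The connection name target1 comes from the comments in the example above. The host names, the port 12001, and the TARGET1_DB_PASSWORD secret name are hypothetical placeholders, so check Set secrets for the secret names that your installation actually accepts.

```yaml
targets:
  # Default connection, used when a job does not name one.
  target:
    connection:
      type: redis
      host: redis-target.example.com # hypothetical host
      port: 12000
      password: ${TARGET_DB_PASSWORD}
      # mTLS: client private key, client certificate, and CA certificate,
      # all supplied as secrets rather than plain text.
      key: ${TARGET_DB_KEY}
      key_password: ${TARGET_DB_KEY_PASSWORD}
      cert: ${TARGET_DB_CERT}
      cacert: ${TARGET_DB_CACERT}
  # Additional connection that a job can select in its output block.
  target1:
    connection:
      type: redis
      host: redis-target1.example.com # hypothetical host
      port: 12001 # hypothetical port
      password: ${TARGET1_DB_PASSWORD} # hypothetical secret name
```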

Note:

If you specify localhost as the address of either the source or target server during installation, the connection will fail if the actual IP address of the local VM changes. For this reason, we recommend that you don't use localhost for the address. If you do encounter this problem, you can fix it by running the following commands on the VM that is running RDI:

sudo k3s kubectl delete nodes --all
sudo service k3s restart

Processors

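The processors section configures the behavior of the pipeline. The example configuration above contains the following properties, all shown commented out: the retry interval on failure (on_failed_retry_interval), the read and write batch sizes (read_batch_size and write_batch_size), the maximum time to wait before reading an incomplete batch (duration), deduplication (dedup and dedup_max_size), error handling (error_handling and dlq_max_messages), the target data type (target_data_type), the number of processes for the initial sync (initial_sync_processes), the replica write checks (wait_enabled, wait_timeout, and retry_on_replica_failure), and the JSON update behavior (json_update_strategy and use_native_json_merge).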

See also the RDI configuration file reference for full details of the other available properties.
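
For illustration, a processors section that writes JSON documents and sends rejected records to a dead letter queue might look like the sketch below. Every property name and value is taken from the commented example above, except target_data_type: json, which replaces the hash value shown there. Which properties you uncomment in practice depends on your pipeline.

```yaml
processors:
  # Read up to 2000 records per batch, or whatever has arrived after 100 ms.
  read_batch_size: 2000
  duration: 100
  # Must be less than or equal to read_batch_size.
  write_batch_size: 200
  # Store rejected messages in a dead letter queue instead of skipping them.
  error_handling: dlq
  dlq_max_messages: 1000
  # Write JSON documents (requires the RedisJSON module on the target).
  target_data_type: json
  json_update_strategy: merge
```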
