Replication

Replication is the process that enables developers to retrieve a partial or complete copy of a Web API resource and subsequently receive updates to that resource as they occur. This process is useful for developers who need to maintain a local copy of a resource in near real-time, for example to power a search feature on a website or to analyze the data for insights.

The replication process is typically broken down into two steps: the initial download and ongoing updates. During the initial download, developers retrieve a copy of the resource, often using pagination techniques to manage the data size. Ongoing updates are then retrieved periodically by polling the Web API for changes since a specified timestamp.

tip

Replication can be a complex process, requiring careful planning and implementation.

Initial Download

The initial download is the first step in setting up replication and involves copying part or all of a resource from the source to the target data store.

To perform an initial download of a resource, developers should use the timestamp and key fields to retrieve sequential batches of data. In the case of the Property resource, the timestamp field is called ModificationTimestamp and the key field is called ListingKey.

Start by retrieving the oldest records first, and then work your way forward to the most recent records. This ensures that all records are retrieved and that none are missed.

Continue retrieving batches of records until the number of records returned is less than the batch size. This indicates that the initial download is complete.

tip

A full example with source code can be found on the Replication Source Code page.

Using a Timestamp and Key

When using a timestamp and key for replication, records are retrieved in batches in the order they were modified, starting from the least-recently modified and ending with the most-recently modified. The key is used to ensure that records with the same timestamp are retrieved in the same order each time.

Any records that are modified during the initial download will be retrieved again in the final batch (or batches).

Because of this, you must be prepared for individual records to appear in more than one batch during a download. When this happens, you may choose to discard or overwrite the prior version of the record.
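
One way to handle repeated records is to key the local store by ListingKey, so that a later copy of a record overwrites the earlier one. The sketch below assumes an in-memory dict as a stand-in for the target data store; the name apply_batch is illustrative and not part of the API.

```python
from typing import Dict, List

Record = Dict[str, str]


def apply_batch(store: Dict[str, Record], batch: List[Record]) -> None:
    """Upsert each record into the store, keyed by ListingKey."""
    for record in batch:
        # A record seen in an earlier batch is overwritten by the newer copy.
        store[record["ListingKey"]] = record


store: Dict[str, Record] = {}
apply_batch(store, [{"ListingKey": "A1", "ModificationTimestamp": "2024-01-01T00:00:00Z"}])
# The same listing arrives again in a later batch after being modified.
apply_batch(store, [{"ListingKey": "A1", "ModificationTimestamp": "2024-01-01T00:05:00Z"}])
print(len(store), store["A1"]["ModificationTimestamp"])
```

Because batches arrive in ascending timestamp order, the copy that arrives last is the most recent one, so a plain overwrite is enough in this sketch.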

Unique Key

Because timestamps are not guaranteed to be unique, in addition to the timestamp you will need to include a unique key in the $filter and $orderby options of your query. It's recommended to use the resource's key field as it's guaranteed to be unique. In the examples below, the ListingKey field is used, as it's the Property resource's key field.

Example query template:

?$filter=ModificationTimestamp gt <last_timestamp_value>
or (ModificationTimestamp eq <last_timestamp_value>
and ListingKey gt '<last_key_value>')
&$orderby=ModificationTimestamp,ListingKey
&$top=100

First batch query:

?$filter=ModificationTimestamp gt 1970-01-01T00:00:00Z
or (ModificationTimestamp eq 1970-01-01T00:00:00Z
and ListingKey gt '0')
&$orderby=ModificationTimestamp,ListingKey
&$top=100
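
The filter portion of the template can be generated from the two values it depends on. This is a minimal sketch; build_filter is a hypothetical helper name, and the doubling of single quotes follows the usual OData rule for string literals.

```python
def build_filter(last_timestamp: str, last_key: str) -> str:
    """Return the $filter expression that selects records after the given position."""
    # In OData string literals, a single quote is escaped by doubling it.
    escaped_key = last_key.replace("'", "''")
    return (
        f"ModificationTimestamp gt {last_timestamp} "
        f"or (ModificationTimestamp eq {last_timestamp} "
        f"and ListingKey gt '{escaped_key}')"
    )


first_filter = build_filter("1970-01-01T00:00:00Z", "0")
print(first_filter)
```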

Last Timestamp and Last Key Variables

For the first batch, the variable last_timestamp_value is initialized to 1970-01-01T00:00:00Z and the variable last_key_value is initialized to 0. Substitute these variables into the initial query URL.

After each successful batch, read the timestamp and key values from the last record returned in the batch, store the values in the last_timestamp_value and last_key_value variables, and substitute the updated variables into the next batch's query URL.

It's recommended to persist these variables in the filesystem or a database after each batch, so an initial download can be restarted from where it left off, and it will not have to start over from the beginning. This is sometimes called a checkpoint.
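
A checkpoint can be as simple as a small JSON file. The sketch below assumes a local filesystem path chosen by the caller; load_checkpoint, save_checkpoint, and the JSON layout are illustrative choices rather than anything prescribed by the API.

```python
import json
import os
import tempfile

START = {"last_timestamp_value": "1970-01-01T00:00:00Z", "last_key_value": "0"}


def load_checkpoint(path: str) -> dict:
    """Return the saved checkpoint, or the starting values if none exists yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return dict(START)


def save_checkpoint(path: str, last_timestamp_value: str, last_key_value: str) -> None:
    """Write the checkpoint to a temporary file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(
            {
                "last_timestamp_value": last_timestamp_value,
                "last_key_value": last_key_value,
            },
            handle,
        )
    # Replacing the file in one step means a reader sees either the old
    # checkpoint or the new one, not a partially written file.
    os.replace(tmp_path, path)
```

Saving only after a batch has been fully written to the target data store keeps the checkpoint from running ahead of the data.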

Filtering

Sometimes you may need to retrieve only a subset of the records available, based on a filter. For example, you may only want to retrieve listings that have a ContractStatus of Available. To do this, add a filter to the query. In the example below, the filter is added to the $filter option of the batch query.

&$filter=ContractStatus eq 'Available'
and (ModificationTimestamp gt <last_timestamp_value>
or (ModificationTimestamp eq <last_timestamp_value>
and ListingKey gt '<last_key_value>'))
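
The parentheses around the timestamp and key condition matter, because in OData "and" binds more tightly than "or". A small sketch of combining the two parts follows; combine_filters is a hypothetical helper name.

```python
def combine_filters(subset_filter: str, replication_filter: str) -> str:
    """AND a subset filter onto the replication filter, keeping the latter grouped."""
    # Without the parentheses, "and" would bind tighter than "or" and the
    # subset filter would apply to only one branch of the replication filter.
    return f"{subset_filter} and ({replication_filter})"


replication_filter = (
    "ModificationTimestamp gt 1970-01-01T00:00:00Z "
    "or (ModificationTimestamp eq 1970-01-01T00:00:00Z and ListingKey gt '0')"
)
combined = combine_filters("ContractStatus eq 'Available'", replication_filter)
print(combined)
```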

Field Selection

If you do not require all of the fields in the resource, be sure to list only the fields you require in the $select option of the query. This will reduce the amount of data transferred and will allow your download to complete in less time. At a minimum you must include the timestamp and key fields so they may be used as the last timestamp and last key variables and as a key in the target database.

$select=ListingKey,ModificationTimestamp,StandardStatus,ListPrice

Batch Size

Use a batch size that the server is able to return in a reasonable amount of time — perhaps a few seconds. This will depend on the number of fields in the $select option and the size of the values for each field. If the batch size is too large, the request might time out while the server attempts to gather all the data. If the batch size is too small, the additional overhead and latency of each request could cause your download to take much longer than necessary.

Batch Query

Combining uniqueness, the last timestamp and last key variables, filtering, field selection, and the batch size, we arrive at this query:

https://query.ampre.ca/odata/Property
?$select=ModificationTimestamp,ListingKey,StandardStatus,ListPrice
&$filter=ContractStatus eq 'Available'
and (ModificationTimestamp gt <last_timestamp_value>
or (ModificationTimestamp eq <last_timestamp_value>
and ListingKey gt '<last_key_value>'))
&$orderby=ModificationTimestamp,ListingKey
&$top=100

note

Remember to URL Encode query string values prior to sending HTTP requests. To aid with readability, the URLs in this document are displayed without URL encoding and wrapped over multiple lines.
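
As an illustration of the encoding step, the sketch below assembles the batch query with Python's standard urllib.parse module. The function name build_batch_url is hypothetical, and leaving "$" and "," unencoded is a readability choice made here, not a requirement stated by the API.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://query.ampre.ca/odata/Property"


def build_batch_url(last_timestamp: str, last_key: str, top: int = 100) -> str:
    """Build a URL-encoded batch query from the last timestamp and key."""
    escaped_key = last_key.replace("'", "''")  # OData escapes ' by doubling it
    params = {
        "$select": "ModificationTimestamp,ListingKey,StandardStatus,ListPrice",
        "$filter": (
            "ContractStatus eq 'Available' "
            f"and (ModificationTimestamp gt {last_timestamp} "
            f"or (ModificationTimestamp eq {last_timestamp} "
            f"and ListingKey gt '{escaped_key}'))"
        ),
        "$orderby": "ModificationTimestamp,ListingKey",
        "$top": str(top),
    }
    # quote_via=quote encodes spaces as %20; "$" and "," are left readable.
    return BASE_URL + "?" + urlencode(params, quote_via=quote, safe="$,")


url = build_batch_url("1970-01-01T00:00:00Z", "0")
print(url)
```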

The same query used for the initial batch can be used with all subsequent batches. Remember to substitute the last timestamp and last key variables into each batch's query URL. Place them where you see <last_timestamp_value> and <last_key_value> in the above example query.

When the number of records returned from the query is less than the batch size (the number requested in the $top option), the initial download is complete and you can stop the download process.
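
The batch procedure can be sketched as a loop that advances the last timestamp and last key after every batch and stops on a short batch. In this sketch the HTTP call is abstracted behind a fetch_batch callable, and an in-memory fake source stands in for the server so the loop can be run on its own; all function names here are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Fetch = Callable[[str, str, int], List[Record]]


def initial_download(
    fetch_batch: Fetch,
    store: Dict[str, Record],
    batch_size: int = 100,
    last_timestamp: str = "1970-01-01T00:00:00Z",
    last_key: str = "0",
) -> Tuple[str, str]:
    """Fetch batches until a short batch arrives; return the final position."""
    while True:
        batch = fetch_batch(last_timestamp, last_key, batch_size)
        for record in batch:
            store[record["ListingKey"]] = record  # upsert by key
        if batch:
            last_timestamp = batch[-1]["ModificationTimestamp"]
            last_key = batch[-1]["ListingKey"]
        if len(batch) < batch_size:
            return last_timestamp, last_key


def make_fake_source(records: List[Record]) -> Fetch:
    """Stand-in for the server: applies the same filter, ordering, and $top."""
    def fetch(last_timestamp: str, last_key: str, top: int) -> List[Record]:
        position = (last_timestamp, last_key)
        ordered = sorted(
            records, key=lambda r: (r["ModificationTimestamp"], r["ListingKey"])
        )
        # Tuple comparison mirrors: ts gt last or (ts eq last and key gt last_key)
        after = [
            r for r in ordered
            if (r["ModificationTimestamp"], r["ListingKey"]) > position
        ]
        return after[:top]
    return fetch


source = [
    {"ListingKey": "A1", "ModificationTimestamp": "2024-01-01T00:00:00Z"},
    {"ListingKey": "A2", "ModificationTimestamp": "2024-01-01T00:00:00Z"},
    {"ListingKey": "A3", "ModificationTimestamp": "2024-01-01T00:00:00Z"},
    {"ListingKey": "A0", "ModificationTimestamp": "2024-01-02T00:00:00Z"},
    {"ListingKey": "A4", "ModificationTimestamp": "2024-01-03T00:00:00Z"},
]
store: Dict[str, Record] = {}
position = initial_download(make_fake_source(source), store, batch_size=2)
print(len(store), position)
```

In a real client, fetch_batch would issue the HTTP request for the batch query and return the parsed records; the loop itself would stay the same.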

Using $top and $skip

caution

Do not use $top and $skip for replication. $top and $skip are commonly used in Web API queries to limit the amount of data returned by the server. While they can be useful for improving the performance of individual requests, they should not be used for replication purposes. When replicating data, it's important to ensure that all records are transferred and that none are missed. Using $top and $skip can introduce the risk of missing records, especially if records are created or modified while the replication is in progress. Instead, replication should be performed using reliable methods such as timestamp and key-based techniques, which guarantee that all records are transferred in a consistent and accurate manner.

The Problem with $top and $skip

When utilizing $top and $skip, an issue arises when records are updated while multiple pages are requested sequentially. These updates can result in records being shifted into preceding pages, leading to them being overlooked by the ongoing requests. See the diagram below:

Figure: Problem with Top and Skip
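
The shifting effect can be reproduced with a small in-memory simulation. The sketch below uses integers in place of timestamps and a dict in place of the server; the page function is a hypothetical stand-in for a query with an ascending $orderby, $skip, and $top.

```python
from typing import Dict, List


def page(records: Dict[str, int], skip: int, top: int) -> List[str]:
    """Return the keys on one page, ordered by (modification time, key)."""
    ordered = sorted(records.items(), key=lambda item: (item[1], item[0]))
    return [key for key, _ in ordered[skip:skip + top]]


# key -> modification time (integers stand in for timestamps)
records = {"K1": 1, "K2": 2, "K3": 3, "K4": 4, "K5": 5, "K6": 6}
seen = set()

seen.update(page(records, skip=0, top=2))  # first page: K1, K2

# K1 is modified between requests, so it moves to the end of the ordering
# and every later record shifts one position toward the start.
records["K1"] = 7

seen.update(page(records, skip=2, top=2))  # second page: K4, K5
seen.update(page(records, skip=4, top=2))  # third page: K6, K1

missed = set(records) - seen
print(missed)  # {'K3'}
```

K3 moved into the range already covered by the first page, so no later request returns it.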

If your implementation requires you to use $top and $skip

  • Use both a timestamp and a unique key in the $orderby option. For example: $orderby=ModificationTimestamp,ListingKey
  • Use a descending $orderby: $orderby=ModificationTimestamp desc,ListingKey desc. Note that desc applies to each field separately. This will ensure that when records are modified during the download, this does not shift the $skip window in a way that causes records to be missed.
  • Be prepared for individual records to appear in more than one batch, as this will happen when a record is modified while the download is in progress, which causes the $skip window to shift back.
  • Do not filter by the timestamp field used in $orderby (e.g. ModificationTimestamp), as this will cause records to be missed when modifications happen while the download is in progress.
  • The $skip option is limited to 100,000 records, so $top and $skip cannot be used to retrieve more than 100,000 records.

Using a Single Query

If the resource or subset of a resource that you intend to replicate contains fewer than 10,000 records, the initial download can be accomplished with a single query. Do not use this strategy unless the resource will never grow beyond 10,000 records.

Updates

Once the initial download is complete and the replication process is underway, the system must constantly monitor the source database for changes and retrieve the updates as they occur. The process of retrieving updates typically involves comparing timestamps to determine which data has been modified since the last synchronization. The system then retrieves the updated data and applies it to the target database, ensuring that both databases remain in sync.

In some cases, updates may be retrieved in bulk or incremental batches to minimize the impact on system performance. The frequency of updates and the method of retrieval will depend on the specific requirements of the application and the system resources available.

Using a Timestamp and Key

The procedure to retrieve updates using a timestamp and key is similar to the procedure for the initial download. A modification timestamp can be used to retrieve updates as they occur. The primary difference is the value of the last_timestamp_value variable used to initiate the process.

  • Use the same batch query as the initial download, with the same unique key and last timestamp and last key variables.
  • Use the same field selection.
  • Use the same batch size.
  • Initialize the last_timestamp_value and last_key_value variables to the values from the last record stored in the target database. Alternatively you may retrieve these variables from the filesystem or a database if you have persisted them there after the initial download and after each completed update batch.
  • Follow the same batch query procedure from the initial download.
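
Deriving the starting point from the target data store can be sketched as follows. This assumes the local store is a dict of records whose ModificationTimestamp values are ISO 8601 strings in the same UTC format, so that string order matches chronological order; it also assumes local string ordering of keys matches the server's ordering. The name resume_point is illustrative.

```python
from typing import Dict, Tuple

Record = Dict[str, str]


def resume_point(store: Dict[str, Record]) -> Tuple[str, str]:
    """Return (last_timestamp_value, last_key_value) for the next update run."""
    if not store:
        # Nothing stored yet: fall back to the initial download's starting values.
        return "1970-01-01T00:00:00Z", "0"
    latest = max(
        store.values(),
        key=lambda r: (r["ModificationTimestamp"], r["ListingKey"]),
    )
    return latest["ModificationTimestamp"], latest["ListingKey"]


store = {
    "A1": {"ListingKey": "A1", "ModificationTimestamp": "2024-01-01T00:00:00Z"},
    "A7": {"ListingKey": "A7", "ModificationTimestamp": "2024-01-03T00:00:00Z"},
    "A4": {"ListingKey": "A4", "ModificationTimestamp": "2024-01-03T00:00:00Z"},
}
print(resume_point(store))
```

If either ordering assumption does not hold for your data, reading the persisted checkpoint is the safer option.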

Using $top and $skip

It is not recommended to retrieve updates using $top and $skip, because doing so requires filtering by the timestamp field (e.g. ModificationTimestamp), which causes records to be missed when they are modified during a run. Instead, use a single query, perform a full download using $top and $skip, or switch to using a timestamp and key.

Using a Single Query

If the resource or subset of a resource that you intend to replicate contains fewer than 10,000 updated records, then all updates can be retrieved with a single query. Do not use this strategy unless the number of updated records is less than or equal to 10,000 records. If the number is greater than 10,000 records, you'll need to use a timestamp and key or perform a full download.