Clock-Bound Wait


Consider a key-value store where values are stored with a timestamp
to designate each version. Any cluster node that handles the client request
can read the latest version using the current timestamp
on the request-processing node.

In the following example, the value ‘Before Dawn’ is updated
to “After Dawn” at time 2, as per Green’s clock.
Both Alice and Bob are trying to read the latest value for ‘title’.
Alice’s request is processed by cluster node Amber, while Bob’s request is
processed by cluster node Blue.
Amber’s clock is lagging at 1, which means that
when Alice reads the latest value, it returns ‘Before Dawn’.
Blue’s clock is at 2; when Bob reads the latest value,
it returns “After Dawn”.

This violates a consistency guarantee known as external consistency.
If Alice and Bob now make a phone call, Alice will be confused: Bob will
tell her that the latest value is “After Dawn”, while her cluster node is
showing “Before Dawn”.

The same is true if Green’s clock is fast and the writes happen in the ‘future’
compared to Amber’s clock.

This is a problem if the system’s timestamp is used as a version for storing values,
because wall clocks are not monotonic.
Clock values from two different servers cannot and should not be compared.
When a Hybrid Clock is used as a version in
Versioned Value, it allows values to be ordered
on a single server as well as on different servers that
are causally related.
However, Hybrid Clocks (or any Lamport Clock based clocks)
can only give a partial order.
This means that any values which are not causally related and are stored by
two different clients across different nodes cannot be ordered.
This creates a problem when using a timestamp to read the
values across cluster nodes.
If the read request originates on a cluster node with a lagging clock,
it probably won’t be able to read the most up-to-date versions of
the given values.


Cluster nodes wait until the clock values
on every node in the cluster are guaranteed to be above the timestamp
assigned to the value while reading or writing.

If the difference between clocks is very small,
write requests can wait without adding too much overhead.
For example, assume the maximum clock offset across cluster nodes is 10ms.
(This means that, at any given point in time t,
the slowest clock in the cluster reads no earlier than t - 10ms.)
To guarantee that every other cluster node has its clock set past t,
the cluster node that handles a write operation
must wait until its own clock reads t + 10ms before storing the value.
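
The arithmetic behind that wait can be sketched as follows. This is only an illustration under stated assumptions: the `CommitWait` class and its method names are hypothetical, timestamps are plain milliseconds, and the maximum offset is taken as a known constant.

```java
// Hypothetical helper: the names here are illustrative, not from any real library.
final class CommitWait {
    // Assumed upper bound on the clock offset between any two cluster nodes.
    private final long maxOffsetMs;

    CommitWait(long maxOffsetMs) {
        this.maxOffsetMs = maxOffsetMs;
    }

    // Local clock reading at which a value stamped writeTimestamp may be stored.
    long visibleAt(long writeTimestamp) {
        return writeTimestamp + maxOffsetMs;
    }

    // Milliseconds the writer still has to wait, given its current local clock reading.
    long remainingWaitMs(long writeTimestamp, long localNow) {
        return Math.max(0, visibleAt(writeTimestamp) - localNow);
    }
}
```

With a 10ms offset, a write stamped at t = 1000 is stored once the writer's clock reads 1010; by then even a node lagging by the full 10ms reads at least 1000.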

Consider a key-value store with Versioned Value where
each update is added as a new value, with a timestamp used as a version.
In the Alice and Bob example mentioned above, the write operation
storing title@2 will wait until all the clocks in the cluster are at 2.
This makes sure that Alice will always see the latest value of the title
even if the clock on Alice’s cluster node is lagging behind.

Consider a slightly different scenario.
Philip is updating the title to ‘After Dawn’. Green’s clock has its
time at 2. But Green knows that there might be a server with a clock
lagging behind by up to 1 unit. It will therefore have to
wait in the write operation for a duration of 1 unit.

While Philip is updating the title, Bob’s read request is handled
by server Blue. Blue’s clock is at 2, so it tries to read the title at
timestamp 2. At this point Green has not yet made the value available.
This means Bob gets the value at the highest timestamp lower than 2,
which is ‘Before Dawn’.

Alice’s read request is handled
by server Amber. Amber’s clock is at 1, so it tries to read the title at
timestamp 1. Alice gets the value ‘Before Dawn’.

Once Philip’s write request completes (after the wait of max_diff is over),
if Bob now sends a new read request, server Blue will try to read the latest
value according to its clock (which has advanced to 3); this will return
the value “After Dawn”.

If Alice initiates a new read request, server Amber will try to read the
latest value as per its clock, which is now at 2. It will therefore
also return the value “After Dawn”.
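
The reads in this scenario can be modelled with a small versioned map. The sketch below is illustrative only: `TitleStore` and its methods are hypothetical names, it holds a single key, and it assumes ‘Before Dawn’ was stored at version 1 (the text only implies a version at or below 1).

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative model of the scenario above; class and method names are hypothetical.
final class TitleStore {
    // version timestamp -> value, for the single key 'title'
    private final TreeMap<Long, String> versions = new TreeMap<>();

    // Called only after the writer has finished waiting for the maximum clock offset.
    void store(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Returns the value with the highest version at or below readTimestamp, or null.
    String read(long readTimestamp) {
        Map.Entry<Long, String> entry = versions.floorEntry(readTimestamp);
        return entry == null ? null : entry.getValue();
    }
}
```

While Green is still waiting, `read(2)` for Bob and `read(1)` for Alice both find only the older version. Once `store(2, "After Dawn")` has run, Bob's `read(3)` and Alice's `read(2)` both return the new title.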

The main problem when trying to implement this solution is that
getting the exact time difference across cluster nodes
is simply not possible with the date/time hardware and operating system APIs
that are currently available.
Such is the nature of the challenge that Google has its own specialized date-time API
called True Time.
Similarly, Amazon has
AWS Time Sync Service and a library called ClockBound.
However, these APIs are very specific to Google and Amazon,
so they can’t really be scaled beyond the confines of those organizations.

Typically, key-value stores use Hybrid Clock to
implement Versioned Value.
While it is not possible to get the exact difference between clocks,
a sensible default value can be chosen based
on historical observations.
Observed values for maximum clock drift on servers across
datacenters are generally 200 to 500ms.

The key-value store waits for the configured max-offset before storing the value.

class KVStore…

  int maxOffset = 200;
  NavigableMap<HybridClockKey, String> kv = new ConcurrentSkipListMap<>();
  public void put(String key, String value) {
      // `clock` is assumed to be this node's hybrid clock; the call was lost in the source
      HybridTimestamp writeTimestamp = clock.now();
      waitTillSlowestClockCatchesUp(writeTimestamp);
      kv.put(new HybridClockKey(key, writeTimestamp), value);
  }

  private void waitTillSlowestClockCatchesUp(HybridTimestamp writeTimestamp) {
      var waitUntilTimestamp = writeTimestamp.add(maxOffset, 0);
      sleepUntil(waitUntilTimestamp);
  }

  private void sleepUntil(HybridTimestamp waitUntil) {
      HybridTimestamp now = clock.now();
      while (now.before(waitUntil)) {
          var waitTime = (waitUntil.getWallClockTime() - now.getWallClockTime());
          Uninterruptibles.sleepUninterruptibly(waitTime, TimeUnit.MILLISECONDS);
          now = clock.now();
      }
  }

  public String get(String key, HybridTimestamp readTimestamp) {
      return kv.get(new HybridClockKey(key, readTimestamp));
  }
Read Restart

200ms is too high an interval to wait for every write request.
This is why databases like CockroachDB or YugabyteDB
implement a check in the read requests instead.

While serving a read request, cluster nodes check if there is a version
available in the interval between readTimestamp and readTimestamp + maximum clock drift.
If such a version is available then, assuming the reader’s clock might be lagging,
the reader is asked to restart the read request with that version.

class KVStore…

  public void put(String key, String value) {
      // `clock` is assumed to be this node's hybrid clock; the call was lost in the source
      HybridTimestamp writeTimestamp = clock.now();
      kv.put(new HybridClockKey(key, writeTimestamp), value);
  }

  public String get(String key, HybridTimestamp readTimestamp) {
      checksIfVersionInUncertaintyInterval(key, readTimestamp);
      return kv.floorEntry(new HybridClockKey(key, readTimestamp)).getValue();
  }

  private void checksIfVersionInUncertaintyInterval(String key, HybridTimestamp readTimestamp) {
      HybridTimestamp uncertaintyLimit = readTimestamp.add(maxOffset, 0);
      HybridClockKey versionedKey = kv.floorKey(new HybridClockKey(key, uncertaintyLimit));
      if (versionedKey == null) {
          return;
      }
      HybridTimestamp maxVersionBelowUncertainty = versionedKey.getVersion();
      // a version newer than the read timestamp but inside the uncertainty window forces a restart
      if (maxVersionBelowUncertainty.after(readTimestamp)) {
          throw new ReadRestartException(readTimestamp, maxOffset, maxVersionBelowUncertainty);
      }
  }

class Client…

  String read(String key) {
      int attemptNo = 1;
      int maxAttempts = 5;
      while (attemptNo <= maxAttempts) {
          try {
              // `clock` is assumed to be the client's hybrid clock; the call was lost in the source
              HybridTimestamp now = clock.now();
              return kvStore.get(key, now);
          } catch (ReadRestartException e) {
              // `logger` is assumed; only the message survived in the source
              logger.info("Got read restart error " + e + " Attempt No. " + attemptNo);
              Uninterruptibles.sleepUninterruptibly(e.getMaxOffset(), TimeUnit.MILLISECONDS);
              attemptNo++;
          }
      }
      throw new ReadTimeoutException("Unable to read after " + maxAttempts + " attempts.");
  }

In the Alice and Bob example above, if there is a version for “title”
available at timestamp 2, and Alice sends a read request with read timestamp 1,
a ReadRestartException will be thrown asking Alice to restart the read request
at readTimestamp 2.

Read restarts only happen if there is a version written in the
uncertainty interval. Write requests do not need to wait.
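
The uncertainty-interval check can also be shown in a self-contained form. The sketch below is a simplified, hypothetical model rather than how any particular database implements it: timestamps are plain longs, there is a single key, and the reader retries at the timestamp carried by the restart signal instead of sleeping and re-reading its clock.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model: single key, timestamps as plain longs.
final class RestartingReader {
    static final class ReadRestart extends RuntimeException {
        final long restartAt;

        ReadRestart(long restartAt) {
            super("restart read at " + restartAt);
            this.restartAt = restartAt;
        }
    }

    private final TreeMap<Long, String> versions = new TreeMap<>();
    private final long maxOffset; // assumed maximum clock offset

    RestartingReader(long maxOffset) {
        this.maxOffset = maxOffset;
    }

    // Writes are stored immediately; no commit wait.
    void write(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Throws ReadRestart if a version exists in (readTimestamp, readTimestamp + maxOffset].
    String readOnce(long readTimestamp) {
        Long newest = versions.floorKey(readTimestamp + maxOffset);
        if (newest != null && newest > readTimestamp) {
            throw new ReadRestart(newest);
        }
        Map.Entry<Long, String> entry = versions.floorEntry(readTimestamp);
        return entry == null ? null : entry.getValue();
    }

    // Retries at the version timestamp reported by each restart signal.
    String read(long readTimestamp) {
        long ts = readTimestamp;
        while (true) {
            try {
                return readOnce(ts);
            } catch (ReadRestart e) {
                ts = e.restartAt; // strictly higher each time, so the loop terminates
            }
        }
    }
}
```

With a maximum offset of 1 and versions at timestamps 1 and 2, a read at timestamp 1 is restarted at 2 and then returns the newer value. A version that lies beyond the uncertainty window does not trigger a restart.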

It’s important to remember that the configured value for maximum clock drift
is an assumption; it is not guaranteed. In some cases,
a bad server can have a clock drift greater than the assumed value. In such cases,
the problem will persist.

Using Clock Bound APIs

Cloud providers like Google and Amazon implement clock machinery with
atomic clocks and GPS to make sure that the clock drift across cluster nodes
is kept below a few milliseconds. As we’ve just discussed, Google has
True Time. AWS has
AWS Time Sync Service and ClockBound.

There are two key requirements for cluster nodes to make sure these waits
are implemented correctly.

  • The clock drift across cluster nodes is kept to a minimum.
    Google’s True Time keeps it below 1ms in most cases (7ms in the worst cases).
  • The possible clock drift is always
    available in the date-time API; this ensures programmers don't need
    to guess the value.

The clock machinery on cluster nodes computes error bounds for
date-time values. Considering there is a possible error in timestamps
returned by the local system clock, the API makes the error explicit.
It will give the lower as well as the upper bound on clock values.
The real time value is guaranteed to be within this interval.

public class ClockBound {
    public final long earliest;
    public final long latest;

    public ClockBound(long earliest, long latest) {
        this.earliest = earliest;
        this.latest = latest;
    }

    // true if the given timestamp is definitely in the past
    public boolean before(long timestamp) {
        return timestamp < earliest;
    }

    // true if the given timestamp is definitely in the future
    public boolean after(long timestamp)   {
        return timestamp > latest;
    }
}

As explained in this AWS blog, the error is
calculated at each cluster node as ClockErrorBound.
The real time value will always be somewhere between
local clock time - ClockErrorBound and local clock time + ClockErrorBound.

The error bounds are returned whenever date-time
values are requested.

public ClockBound now() {
    return now;
}

There are two properties guaranteed by the clock-bound API:

  • Clock bounds should overlap across cluster nodes
  • For two time values t1 and t2, if t1 is less than t2,
    then clock_bound(t1).earliest is less than clock_bound(t2).latest
    across all cluster nodes

Imagine we have three cluster nodes: Green, Blue and Orange.
Each node might have a different error bound.
Let's say the error on Green is 1, Blue is 2 and Orange is 3. At time=4,
the clock bound across cluster nodes will look like this:
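
The intervals can be worked out directly from those error values. The sketch below assumes each node's local clock reads 4; `Bounds` is a hypothetical helper written for this illustration and is not the API of the ClockBound library.

```java
// Hypothetical helper for illustration; not the ClockBound library's API.
final class Bounds {
    final long earliest;
    final long latest;

    Bounds(long earliest, long latest) {
        this.earliest = earliest;
        this.latest = latest;
    }

    // Interval assumed to contain the true time, given a local reading and its error.
    static Bounds around(long localTime, long error) {
        return new Bounds(localTime - error, localTime + error);
    }

    // Two closed intervals overlap when each starts no later than the other ends.
    boolean overlaps(Bounds other) {
        return earliest <= other.latest && other.earliest <= latest;
    }
}
```

Under that assumption, `Bounds.around(4, 1)` gives [3, 5] for Green, `Bounds.around(4, 2)` gives [2, 6] for Blue, and `Bounds.around(4, 3)` gives [1, 7] for Orange. All three intervals overlap, which is the first property listed above.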

In this scenario, two rules must be followed to implement the commit-wait.

  • For any write operation, the clock bound’s latest value
    should be picked as the timestamp.
    This will ensure that it is always higher than any timestamp assigned
    to previous write operations (considering the second rule below).
  • The system must wait until the write timestamp is less than
    the clock bound’s earliest value, before storing the value.

    This is because the earliest value is guaranteed to be lower than
    the clock bound’s latest values across all cluster nodes.
    This write operation will be accessible
    to anyone reading with the clock bound’s latest value in future. Also,
    this value is guaranteed to be ordered before any other write operation
    that happens in future.

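class KVStore…

  public void put(String key, String value) {
      // `boundedClock` is an assumed name for this node's clock-bound API; the call was lost in the source
      ClockBound now = boundedClock.now();
      long writeTimestamp = now.latest;
      addPending(writeTimestamp);
      waitUntilTimeInPast(writeTimestamp);
      kv.put(new VersionedKey(key, writeTimestamp), value);
      removePending(writeTimestamp);
  }

  private void waitUntilTimeInPast(long writeTimestamp) {
      ClockBound now = boundedClock.now();
      // wait until even the earliest possible current time is past the timestamp
      while (now.earliest <= writeTimestamp) {
          long waitTime = Math.max(1, writeTimestamp - now.earliest);
          Uninterruptibles.sleepUninterruptibly(waitTime, TimeUnit.MILLISECONDS);
          now = boundedClock.now();
      }
  }

  private void removePending(long writeTimestamp) {
      pendingWriteTimestamps.remove(writeTimestamp);
      lock.lock();
      try {
          cond.signalAll(); // wake readers blocked in waitForPendingWrites
      } finally {
          lock.unlock();
      }
  }

  private void addPending(long writeTimestamp) {
      pendingWriteTimestamps.add(writeTimestamp);
  }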

If we return to the Alice and Bob example above, when the new value for
“title” (“After Dawn”) is written by Philip on server Green,
the put operation on Green waits until the chosen write timestamp is
below the earliest value of the clock bound.
This guarantees that every other cluster node
has a higher timestamp for the latest value of the
clock bound.
To illustrate, consider this scenario. Green has an error bound of
+-1. So, with a put operation which starts at time 4,
when it stores the value, Green will pick up the latest value of the clock
bound, which is 5. It then waits until the earliest value of the clock
bound is more than 5. Essentially, Green waits for the uncertainty
interval before actually storing the value in the key-value store.
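
Green's wait in this example can be traced with a small simulation. This is a sketch under stated assumptions: `BoundedClockSim` and its `tick()` method are hypothetical stand-ins for a real bounded clock, time advances in whole units, and the error bound stays constant during the wait.

```java
// Hypothetical simulation; the clock and its tick() method stand in for a real bounded clock.
final class BoundedClockSim {
    private long localTime;
    private final long error;

    BoundedClockSim(long localTime, long error) {
        this.localTime = localTime;
        this.error = error;
    }

    long earliest() { return localTime - error; }
    long latest()   { return localTime + error; }
    void tick()     { localTime++; }

    // Picks latest as the write timestamp, then advances the simulated clock until
    // earliest is strictly past it. Returns {writeTimestamp, localTimeWhenStored}.
    long[] commitWait() {
        long writeTimestamp = latest();
        while (earliest() <= writeTimestamp) {
            tick(); // a real node would sleep here instead
        }
        return new long[] { writeTimestamp, localTime };
    }
}
```

For Green (local time 4, error 1) this picks write timestamp 5 and stores the value once the local clock reads 7, when the earliest bound is 6. In these discrete units the wait is twice the error plus one tick, so a node with an error of 3 starting at time 4 picks 7 and stores at local time 11.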

When the value is made available in the key-value store,
the clock bound’s latest value is guaranteed to be higher than 5
on every cluster node.
This means that Bob’s request handled by Blue as well as Alice’s request
handled by Amber are guaranteed to get the latest value of the title.

We will get the same result if Green has ‘wider’ time bounds.
The greater the error bound, the longer the wait. If Green’s error bound
is at its maximum, it will continue to wait before making the values available in
the key-value store. Neither Amber nor Blue will be able to get
the value until their latest time value is past 7. When Alice gets the
most up-to-date value of title at latest time 7,
every other cluster node will be guaranteed to get it at its latest time value.


When reading the value, the client will always pick the maximum value
from the clock bound from its cluster node.

The cluster node that is receiving the request needs to make sure that once
a response is returned at the specific request timestamp, there are
no values written at that timestamp or a lower timestamp.

If the timestamp in the request is higher than the
timestamp at the server, the cluster node will wait until
the clock catches up
before returning the response.

It will then check if there are any pending write requests at a lower timestamp
which are not yet stored. If there are, then the
read request will pause until those requests are complete.

The server will then read the values at the request timestamp and return the value.
This ensures that once a response is returned at a particular timestamp,
no values will ever be written at a lower timestamp.
This guarantee is called Snapshot Isolation.

class KVStore…

  final Lock lock = new ReentrantLock();
  Queue<Long> pendingWriteTimestamps = new ArrayDeque<>();
  final Condition cond  = lock.newCondition();

  public Optional<String> read(long readTimestamp) {
      waitUntilTimeInPast(readTimestamp);
      waitForPendingWrites(readTimestamp);
      Optional<VersionedKey> max = kv.keySet().stream().max(Comparator.naturalOrder());
      if (max.isPresent()) {
          return Optional.of(kv.get(max.get()));
      }
      return Optional.empty();
  }

  private void waitForPendingWrites(long readTimestamp) {
      lock.lock();
      try {
          // block while any write at or below the read timestamp is still pending
          while (pendingWriteTimestamps.stream().anyMatch(ts -> ts <= readTimestamp)) {
              cond.awaitUninterruptibly();
          }
      } finally {
          lock.unlock();
      }
  }

Consider this final scenario: Alice’s read request is handled by
server Amber with an error bound of 3. It picks up the latest time as 7 to
read the title. Meanwhile, Philip’s write request is handled by Green
(with an error bound of +-1), which picks up 5 to store the value.
Alice’s read request waits until the earliest time at Green is past 7
and the pending write request has completed. It then returns the latest value with
a timestamp below 7.
