
Status Checks

Status checks are a feature that lets you turn a measurement into a basis for an alert, allowing you to use measurements to gauge the health of your network in a variety of ways.

This document is meant to be used as both a primer and tutorial on the new status checks system, so we'll be covering both the simple method that uses the default options, and a more complex example that makes use of the various customisations.

Quick Start

If you just want to get started using Status Checks, you need only do the following:

  1. Create a RIPE Atlas ping measurement using either the website or the API.
    • You may use up to 1024 probes.
    • Note the newly created measurement ID.
  2. Hit this URL: https://atlas.ripe.net/api/v1/status-checks/MEASUREMENT_ID/ where MEASUREMENT_ID is the ID of your newly created measurement. If the measurement in question is not public, you'll also need to include your API key via the ?key= argument.
  3. Hit the URL again later and see if anything has changed.
  4. Define your alerts accordingly.
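
If you'd rather script steps 2 and 3 than paste the URL into a browser, a minimal sketch in Python might look like the following. It assumes the requests library is installed; the measurement ID and API key are placeholders that you would replace with your own.

import requests

MEASUREMENT_ID = 123456789   # placeholder: use your own measurement ID
API_KEY = None               # only needed if the measurement is not public

url = "https://atlas.ripe.net/api/v1/status-checks/%d/" % MEASUREMENT_ID
params = {"key": API_KEY} if API_KEY else {}

# Fetch the status check and print the global alert flag
response = requests.get(url, params=params)
response.raise_for_status()
print("global_alert:", response.json()["global_alert"])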

Example for Use in Monitoring Systems

Icinga

This example simply checks the output for global_alert":false and makes use of the built-in check_http plugin.

define service {
    use generic-service
    host_name myhostname
    service_description Test_Atlas
    check_command check_http!-I atlas.ripe.net  -r 'global_alert":false' --ssl=1 -u /api/v1/status-checks/1040425/?permitted_total_alerts=1
}
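
If you'd prefer a small custom check to check_http, a plugin along these lines is one possibility. This is only a sketch, not an officially supported plugin: it assumes the requests library and reuses the measurement ID from the example above.

#!/usr/bin/env python
# Minimal Nagios/Icinga-style check: exit 0 (OK) when global_alert is false,
# 2 (CRITICAL) when it is true, and 3 (UNKNOWN) if the service can't be reached.

import sys
import requests

MEASUREMENT_ID = 1040425  # placeholder: use your own measurement ID

url = "https://atlas.ripe.net/api/v1/status-checks/%d/" % MEASUREMENT_ID

try:
    status = requests.get(url, params={"permitted_total_alerts": 1}, timeout=10).json()
except Exception as error:
    print("UNKNOWN - could not query status check: %s" % error)
    sys.exit(3)

if status["global_alert"]:
    print("CRITICAL - RIPE Atlas status check reports a global alert")
    sys.exit(2)

print("OK - no global alert")
sys.exit(0)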

A Simple Example

Your website, www.example.com, is hosted on servers in Europe and North America. As you're interested in availability and response time, you create a RIPE Atlas ping measurement from 5 locations around the globe and begin to see results coming back that look something like this:

  • Rotterdam, Netherlands: id: 123, rtt: 9ms
  • Athens, Greece: id: 234, rtt: 12ms
  • Vancouver, Canada: id: 345, rtt: 13ms
  • São Paulo, Brazil: id: 456, rtt: 55ms
  • Brisbane, Australia: id: 567, rtt: 312ms

The ID for your new measurement is 123456789, so you can get basic information about your measurement by querying this URL:

https://atlas.ripe.net/api/v1/measurement/123456789/

The new Status Checks system can be found at a similar URL:

https://atlas.ripe.net/api/v1/status-checks/123456789/

Querying this URL alone should give you basic dashboard values for your server, enough for you to plug into a monitoring engine like Nagios. The output should look something like this:

# Request
GET https://atlas.ripe.net/api/v1/status-checks/123456789/

# Response
HTTP/1.1 200 OK
Date: Tue, 29 Oct 2013 14:37:37 GMT
X-RIPE-Atlas-Global-Alert: 0
Content-Type: text/plain
Cache-Control: no-cache

{
    "global_alert": false,
    "probes": {
      "123": {
          "alert": false,
          "last": 107.296,
          "last_packet_loss": 0.0,
          "source": "Country: NL"
      },
      "234": {
          "alert": false,
          "last": 14.152,
          "last_packet_loss": 0.0,
          "source": "Country: GR"
      },
      "345": {
          "alert": false,
          "last": 9.328,
          "last_packet_loss": 0.0,
          "source": "Country: CA"
      },
      "456": {
          "alert": false,
          "last": 21.761,
          "last_packet_loss": 0.0,
          "source": "Country: BR"
      },
      "567": {
          "alert": false,
          "last": 28.281,
          "last_packet_loss": 0.0,
          "source": "Country: AU"
      }
    }
}

Note that in the case of every probe above, alert is set to false. This is because your network is presently healthy. Of course, if connectivity between your server and Brisbane, Australia were to degrade suddenly, the output might look something like this:

# Request
GET https://atlas.ripe.net/api/v1/status-checks/123456789/

# Response
HTTP/1.1 200 OK
Date: Tue, 29 Oct 2013 14:37:37 GMT
X-RIPE-Atlas-Global-Alert: 1
Content-Type: text/plain
Cache-Control: no-cache

{
  "global_alert": true,
  "probes": {
    "123": {
      "alert": false,
      "last": 107.296,
      "last_packet_loss": 0.0,
      "source": "Country: NL"
    },
    "234": {
      "alert": false,
      "last": 14.152,
      "last_packet_loss": 0.0,
      "source": "Country: GR"
    },
    "345": {
      "alert": false,
      "last": 9.328,
      "last_packet_loss": 0.0,
      "source": "Country: CA"
    },
    "456": {
      "alert": false,
      "last": 21.761,
      "last_packet_loss": 0.0,
      "source": "Country: BR"
    },
    "567": {
      "alert": true,
      "alert_reasons": [
        "loss"
      ],
      "all": [
        null,
        null,
        null
      ],
      "last": null,
      "last_packet_loss": 100.0,
      "source": "Country: AU"
    }
  }
}

Note that probe 567 (the id for the probe that you're using in Brisbane) has somehow lost the ability to ping your server. This has resulted in the following changes to the output of your Status Check:

  • The last property (the last attempt to ping your server) has a null value
  • The last_packet_loss value is set to 100%
  • As the last attempt could not get even one packet through, the alert property was set to true
  • As one of the probes has now triggered an alert, the global_alert property is set to true
  • The X-RIPE-Atlas-Global-Alert header is set to 1.
  • Two additional values were added to the probe definition in question: all and alert_reasons:
    • all is a list of all packet results used to calculate last. There's more on this later.
    • alert_reasons is a list of reasons why this alert was triggered. Typically this will only have one value: loss, but as we'll see later on, it may also contain latency.

The idea is to have your monitoring software parse this output and act accordingly. How you parse it, though, is up to you. A simple approach would be to grep the output for global_alert":true and trigger your alerts based on that, while a more nuanced one might parse the JSON and look for values relevant to different users in order to page the appropriate contact.
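
As an illustration of the more nuanced approach, the sketch below parses the JSON and reports which probes are alerting and why. It assumes the requests library and uses the placeholder measurement ID from this example; how you route the resulting alerts is, again, up to you.

import requests

url = "https://atlas.ripe.net/api/v1/status-checks/123456789/"
status = requests.get(url).json()

# Walk the per-probe results and report anything that is alerting
for probe_id, result in status["probes"].items():
    if result["alert"]:
        reasons = ", ".join(result.get("alert_reasons", []))
        print("Probe %s (%s) is alerting: %s" % (probe_id, result["source"], reasons))

if status["global_alert"]:
    print("Global alert is set -- page the appropriate contact here")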

If you're not keen on parsing the output, or want to save bandwidth by using a simpler test, we also allow you to abuse the HTTP response code system by setting the flag change_http_status=1. In these cases, the above response would change to the following:

# Request
HEAD https://atlas.ripe.net/api/v1/status-checks/123456789/?change_http_status=1

# Response
HTTP/1.1 418 UNKNOWN STATUS CODE
Date: Tue, 29 Oct 2013 14:37:37 GMT
X-RIPE-Atlas-Global-Alert: 1
Content-Type: text/plain
Cache-Control: no-cache

Note that the only HTTP codes currently in use are 200 and 418. There are no plans to expand the abuse of the HTTP status code system at present, as this would make it difficult to indicate whether there is a problem with the measurement in question, or the status-check system itself.

With these sorts of changes, you can write server-side scripts to capture and parse the JSON output, or just note the HTTP response code and take whatever action you see fit. To use Nagios as an example, you could use the check_http plugin to alert if the HTTP response is anything other than 200. There's no need to write any custom code if you don't want to.
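
If you do end up writing something yourself, a sketch of the status-code approach might look like this. As before, the requests library and the measurement ID are assumptions.

import requests

# With change_http_status=1, a healthy measurement returns 200 and an
# alerting one returns 418, so a HEAD request is all we need.
url = "https://atlas.ripe.net/api/v1/status-checks/123456789/"
response = requests.head(url, params={"change_http_status": 1})

if response.status_code == 200:
    print("All clear")
elif response.status_code == 418:
    print("Global alert -- take whatever action you see fit")
else:
    print("Unexpected status code: %d" % response.status_code)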

A Complex Example

The simple example above should be good enough for most people, but if you're dealing with a large number of probes (we support up to 1024), or if you're interested in comparing the current RTT value to past values, then this section is for you.

You can control how the alerts are triggered based on a few arguments in the URL:

Argument                Default  Description
max_packet_loss         75       The acceptable percentage packet loss per probe
show_all                false    Show all RTT responses. The default is to only show all responses for alerting probes
permitted_total_alerts  0        The total number of probes you would permit to respond with an alert before a global alert is issued
lookback                1        The total number of measurement results to compare to generate a median RTT value
median_rtt_threshold    N/A      The threshold at which an alert should be issued when you compare the latest RTT value to the median values (based on the lookback)

These arguments can be combined to produce interesting results, so we'll break them down one by one and then give you some examples of combinations and the resulting output.

max_packet_loss

By default, we don't set alert: true unless the packet loss percentage exceeds 75%. If you'd like to adjust this threshold, you can pass max_packet_loss to the URL. Expanding on our simple example above, this request would require that all packets be lost before an alert is set on a probe:

https://atlas.ripe.net/api/v1/status-checks/123456789/?max_packet_loss=95

Note however that if you set max_packet_loss to 100, no alert will ever be set for lost packets.

Similarly, you can make the check more sensitive by tweaking the max_packet_loss value downward:

https://atlas.ripe.net/api/v1/status-checks/123456789/?max_packet_loss=0

This would set an alert if even one packet was lost.

show_all

In the simple example, the sample output listed only basic probe information:

# Request
GET https://atlas.ripe.net/api/v1/status-checks/123456789/

# Response
...
"234": {
  "alert": false,
  "last": 14.152,
  "last_packet_loss": 0.0,
  "source": "Country: GR"
},
...

If ever an alert is triggered though, the all attribute is included so that you can do some semblance of an investigation:

# Request
GET https://atlas.ripe.net/api/v1/status-checks/123456789/

# Response
...
"234": {
  "alert": true,
  "alert_reasons": [
    "loss"
  ],
  "all": [
    null,
    null,
    null
  ],
  "last": null,
  "last_packet_loss": 100.0,
  "source": "Country: GR"
},
...

By setting show_all, you're asking the server to always include the all attribute in the output, regardless of whether or not there's an alert issued, so you'd change the output of an error-free result to:

# Request
GET https://atlas.ripe.net/api/v1/status-checks/123456789/?show_all=1

# Response
...
"234": {
  "alert": false,
  "all": [
    12.123,
    14.152,
    17.321
  ],
  "last": 14.152,
  "last_packet_loss": 0.0,
  "source": "Country: GR"
},
...

permitted_total_alerts

By default, we assume that one probe failing to meet expected thresholds is cause for alarm. If you feel this is too sensitive, you can increase this value. This won't change the alert value for each probe, but it will determine whether or not global_alert is set to true and, if change_http_status is set to 1, whether the HTTP status is changed to 418.

The following will allow for a maximum of 3 probes to alert before the global alert is set:

https://atlas.ripe.net/api/v1/status-checks/123456789/?permitted_total_alerts=3

lookback and median_rtt_threshold

Sometimes the current median RTT isn't enough information with which to make an alert decision. Sometimes, you need a little history to determine whether an alert is warranted. This is where lookback and median_rtt_threshold come in.

Let's use our example again. Say that you've been running this measurement for a few hours now and each of our 5 probes has collected at least 10 results:

Probe      Results
Rotterdam  5 5 6 6 5 4 4 100 5 7
Athens     12 14 13 11 12 15 17 12 13 15
Vancouver  13 13 13 13 14 13 15 12 17 8
São Paulo  32 33 34 35 36 37 38 39 40 41
Brisbane   312 333 380 400 331 301 310 312 313 311

Based on these results, we can calculate a median value:

Probe      Median
Rotterdam  5
Athens     15
Vancouver  14
São Paulo  37
Brisbane   310

The lookback value mentioned above determines the total number of past measurement results we take into account to generate these median values. Values can range from 1 to 10 and the default is 1.

Once we have a median value, the next part of the equation, your specified median_rtt_threshold, comes into play. We compare our calculated median value to the current value, and if the difference exceeds your threshold value, we post an alert.
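
To make that comparison concrete, here's a rough sketch of the logic in Python. This is an illustration only, not the actual implementation behind the API: it treats the threshold as an absolute value and ignores null (lost) results, as described in the note further below.

from statistics import median

def latency_alert(history, latest, threshold):
    """Compare the latest RTT against the median of the lookback window and
    alert if the difference exceeds an absolute threshold (illustrative only)."""
    usable = [rtt for rtt in history if rtt is not None]  # ignore lost packets
    if not usable or latest is None:
        return True  # no usable data; loss-based alerting applies here anyway
    return (latest - median(usable)) > threshold

# Hypothetical numbers: a probe whose RTT has hovered around 13ms
# suddenly reports 32ms, checked against an absolute threshold of 10.
print(latency_alert([12, 14, 13, 11, 12, 15, 17, 12, 13], 32, 10))  # True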

To continue with our example, say that you've decided that you want to be alerted if any probe exceeds its median RTT by 10. Your query would look like this:

# Request
GET https://atlas.ripe.net/api/v1/status-checks/123456789/?lookback=10&median_rtt_threshold=10

# Response
...
"234": {
  "alert": true,
  "alert_reasons": [
    "latency"
  ],
  "all": [
    43.103,
    43.363,
    43.517,
    45.254,
    45.303,
    45.714,
    45.72,
    46.045,
    46.907,
    46.92,
    47.338,
    48.843,
    49.831,
    50.598,
    50.834,
    55.644,
    65.612,
    73.656,
    78.739,
    81.618,
    101.793,
    105.107,
    111.606,
    138.973,
    144.736,
    154.633,
    159.825,
    199.248,
    206.075,
    314.524
  ],
  "last": 111.606,
  "last_packet_loss": 0.0,
  "median": 55.644,
  "source": "Country: GR"
},
...

You'll note that not only has an alert been triggered due to the disparity between median and last, but also that alert_reasons now contains latency instead of what you may have seen until now: loss. It's possible that in some cases, you could have a sufficient number of dropped packets to trigger an alert and a sufficient amount of latency, so this property will help you figure out which is which.

You can vary the lookback value if you like, and this will adjust the number of samples used to establish a median.

A note about the lookback value

Median calculations are based only on the non-null values available. This means that if lookback=10 and of those 10 results only 2 of them are non-null, only those two results will be used to calculate the median.

Supported median_rtt_thresholds include both percentages and integers, positive and negative. Some examples:

https://atlas.ripe.net/api/v1/status-checks/123456789/?lookback=10&median_rtt_threshold=10
https://atlas.ripe.net/api/v1/status-checks/123456789/?lookback=10&median_rtt_threshold=10%
https://atlas.ripe.net/api/v1/status-checks/123456789/?lookback=10&median_rtt_threshold=-10
https://atlas.ripe.net/api/v1/status-checks/123456789/?lookback=10&median_rtt_threshold=-10%

Note however that you should be careful when using integers, as there's always likely to be a strong variance for probes located a long distance from their target.

Sanity Filter

In the case of very low median values, a sanity check is applied to prevent alerts from being issued for no reason. An example of this might be a probe with a median RTT of 2.3 and a latest RTT of 4.6. That's a 100% increase, but not a meaningful one, so our sanity filter will not consider it worthy of an alert.

At present, the sanity filter ignores any delta within ±5ms.
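
Putting the percentage form of the threshold and the sanity filter together, a rough sketch of the decision might look like this. Again, this is only an illustration under the assumptions stated in the comments, not the actual server-side code.

from statistics import median

SANITY_DELTA_MS = 5.0  # deltas within +/-5ms are ignored, per the text above

def exceeds_threshold(history, latest, threshold):
    """Illustrative check of a percentage (e.g. "10%") or absolute threshold,
    with the small-delta sanity filter applied first."""
    base = median([rtt for rtt in history if rtt is not None])
    delta = latest - base
    if abs(delta) <= SANITY_DELTA_MS:  # sanity filter: change too small to matter
        return False
    if str(threshold).endswith("%"):   # percentage form, e.g. "10%"
        return delta > base * float(str(threshold).rstrip("%")) / 100.0
    return delta > float(threshold)    # absolute form, e.g. 10

# The example above: a jump from 2.3ms to 4.6ms never triggers an alert, even
# with a very aggressive percentage threshold, because the delta is under 5ms.
print(exceeds_threshold([2.3, 2.3, 2.3], 4.6, "10%"))  # False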

Combinations

So now that we've covered all of the different options, you can try combining them to see what kind of results you might get.

This will only alert on probes that exceed a packet loss of 50%, and will only post a global alert if more than 3 probes are alerting:

https://atlas.ripe.net/api/v1/status-checks/123456789/?permitted_total_alerts=3&max_packet_loss=50

Same thing, but this will always show the RTT values:

https://atlas.ripe.net/api/v1/status-checks/123456789/?show_all=1&permitted_total_alerts=3&max_packet_loss=50

Looking back over the last 7 results, this will show alerts for probes exceeding the median RTT by 30:

https://atlas.ripe.net/api/v1/status-checks/123456789/?lookback=7&median_rtt_threshold=30&show_all=1&permitted_total_alerts=3&max_packet_loss=50

The same thing, and again we include all RTT values:

https://atlas.ripe.net/api/v1/status-checks/123456789/?lookback=7&median_rtt_threshold=30%&show_all=1

The same thing again, but this time only sound a global alert if more than 5 probes are alerting:

https://atlas.ripe.net/api/v1/status-checks/123456789/?lookback=7&median_rtt_threshold=30%&show_all=1&permitted_total_alerts=5

And finally, a great big one that will:

  • Establish a median for each probe based on the past 10 results
  • Alert on any probe whose latest RTT exceeds its median by 20%
  • Show all RTTs, regardless of alert status
  • Only issue a global alert if more than 7 probes are alerting
  • Mark a probe as alerting if the packet loss on that probe exceeds 50%

https://atlas.ripe.net/api/v1/status-checks/123456789/?lookback=10&median_rtt_threshold=20%&show_all=1&permitted_total_alerts=7&max_packet_loss=50
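
If you're scripting these combinations, it may be easier to build the query string from a dictionary than to assemble it by hand. A small sketch, again assuming the requests library and the placeholder measurement ID:

import requests

url = "https://atlas.ripe.net/api/v1/status-checks/123456789/"

# The "great big one" above, expressed as a parameter dictionary
params = {
    "lookback": 10,
    "median_rtt_threshold": "20%",
    "show_all": 1,
    "permitted_total_alerts": 7,
    "max_packet_loss": 50,
}

status = requests.get(url, params=params).json()
print("global_alert:", status["global_alert"])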
    

Frequently Asked Questions

What if I want to measure the activity of just one probe in a measurement?

The idea behind this feature is to measure the condition of the target and its relation to the probes, rather than to check the status of an individual probe.

If your concern is the performance of a particular probe (or set of probes), you must measure that probe's connectivity against a known target. For these purposes, we recommend the use of a RIPE Atlas Anchor or some other target with a predictable status.

Do you implement any caching?

In an effort to reduce the load on our servers, the results of status-check requests are cached for 1 minute. So, if you tailor your monitoring software to query this service at intervals of less than 1 minute, you'll be wasting bandwidth and resources on both sides.

Where can I find more examples for using Status Checks in monitoring systems?

A GitHub repository is maintained by the community with examples for different monitoring systems. Feel free to add your own.