A Simple Example

Let's say that your website, www.example.com, is hosted on servers in Europe and North America. You're interested in availability and response time, so you create a RIPE Atlas ping measurement from five locations around the globe and begin seeing results coming back that look something like this:

  • Rotterdam, Netherlands: id: 123, rtt: 9ms
  • Athens, Greece: id: 234, rtt: 12ms
  • Vancouver, Canada: id: 345, rtt: 13ms
  • São Paulo, Brazil: id: 456, rtt: 55ms
  • Brisbane, Australia: id: 567, rtt: 312ms

The ID for your new measurement is 123456789, so you can get basic information about your measurement by querying this URL:

https://atlas.ripe.net/api/v2/measurements/123456789/

The new status checks system can be found at a similar URL:

https://atlas.ripe.net/api/v2/measurements/123456789/status-check

Querying this URL alone should give you basic dashboard values for your server, which is enough for you to plug into a monitoring engine like Nagios. The output should look something like this:

# Request
GET https://atlas.ripe.net/api/v2/measurements/123456789/status-check

# Response
HTTP/1.1 200 OK
Date: Tue, 29 Oct 2013 14:37:37 GMT
X-RIPE-Atlas-Global-Alert: 0
Content-Type: text/plain
Cache-Control: no-cache

{
  "global_alert": false,
  "probes": {
    "123": {
      "alert": false,
      "last": 107.296,
      "last_packet_loss": 0.0,
      "source": "Country: NL"
    },
    "234": {
      "alert": false,
      "last": 14.152,
      "last_packet_loss": 0.0,
      "source": "Country: GR"
    },
    "345": {
      "alert": false,
      "last": 9.328,
      "last_packet_loss": 0.0,
      "source": "Country: CA"
    },
    "456": {
      "alert": false,
      "last": 21.761,
      "last_packet_loss": 0.0,
      "source": "Country: BR"
    },
    "567": {
      "alert": false,
      "last": 28.281,
      "last_packet_loss": 0.0,
      "source": "Country: AU"
    }
  }
}
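As a minimal sketch of how a script might consume this output, the Python below fetches the status check and reduces it to the global alert flag plus the last round-trip time per probe. The function names parse_status and fetch_status are hypothetical helpers, not part of any RIPE Atlas library, and the measurement ID is the illustrative one used on this page.

```python
import json
import urllib.request

# URL pattern copied from the examples on this page.
STATUS_CHECK_URL = "https://atlas.ripe.net/api/v2/measurements/{msm_id}/status-check/"


def parse_status(body):
    """Return (global_alert, {probe_id: last_rtt}) from a status-check body."""
    data = json.loads(body)
    probes = {
        probe_id: info.get("last")
        for probe_id, info in data.get("probes", {}).items()
    }
    return bool(data.get("global_alert")), probes


def fetch_status(msm_id, timeout=10):
    """Fetch and parse the status check for one measurement (network call)."""
    url = STATUS_CHECK_URL.format(msm_id=msm_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_status(response.read().decode("utf-8"))
```

Calling fetch_status(123456789) would return a tuple such as (False, {"123": 107.296, ...}) for the healthy response shown above. A probe whose last ping failed would appear with a value of None.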

Note that in the case of every probe above, alert is set to false. This is because your network is presently healthy. If, however, connectivity between your server and Brisbane, Australia were to degrade suddenly, for example, the output might look something like this:

# Request
GET https://atlas.ripe.net/api/v2/measurements/123456789/status-check/

# Response
HTTP/1.1 200 OK
Date: Tue, 29 Oct 2013 14:37:37 GMT
X-RIPE-Atlas-Global-Alert: 1
Content-Type: text/plain
Cache-Control: no-cache

{
  "global_alert": true,
  "probes": {
    "123": {
      "alert": false,
      "last": 107.296,
      "last_packet_loss": 0.0,
      "source": "Country: NL"
    },
    "234": {
      "alert": false,
      "last": 14.152,
      "last_packet_loss": 0.0,
      "source": "Country: GR"
    },
    "345": {
      "alert": false,
      "last": 9.328,
      "last_packet_loss": 0.0,
      "source": "Country: CA"
    },
    "456": {
      "alert": false,
      "last": 21.761,
      "last_packet_loss": 0.0,
      "source": "Country: BR"
    },
    "567": {
      "alert": true,
      "alert_reasons": [
        "loss"
      ],
      "all": [
        null,
        null,
        null
      ],
      "last": null,
      "last_packet_loss": 100.0,
      "source": "Country: AU"
    }
  }
}

Note that probe 567 (the ID for the probe that you're using in Brisbane) has somehow lost the ability to ping your server. This has resulted in the following changes to the output of your status check:

  • The last property (the last attempt to ping your server) has a null value
  • The last_packet_loss value is set to 100%
  • As the last attempt could not get even one packet through, the alert property was set to true
  • As one of the probes has now triggered an alert, the global_alert property is set to true
  • The X-RIPE-Atlas-Global-Alert header is set to 1
  • Two additional values were added to the probe definition in question: all and alert_reasons:
    • all is a list of all packet results used to calculate last. There will be more explanation about this later.
    • alert_reasons is a list of reasons why this alert was triggered. Typically this will only have one value: loss, but as we'll see later on, it may also contain latency.
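The changes listed above can be picked out programmatically. The following is a rough sketch, assuming the response has the shape shown in the example; alerting_probes is a hypothetical helper name, and SAMPLE is a trimmed copy of the alerting response above.

```python
import json


def alerting_probes(status):
    """Map each alerting probe ID to its alert reasons (empty list if absent)."""
    return {
        probe_id: info.get("alert_reasons", [])
        for probe_id, info in status.get("probes", {}).items()
        if info.get("alert")
    }


# Trimmed copy of the alerting response shown above.
SAMPLE = json.loads("""
{
  "global_alert": true,
  "probes": {
    "456": {"alert": false, "last": 21.761, "last_packet_loss": 0.0,
            "source": "Country: BR"},
    "567": {"alert": true, "alert_reasons": ["loss"],
            "all": [null, null, null], "last": null,
            "last_packet_loss": 100.0, "source": "Country: AU"}
  }
}
""")

print(alerting_probes(SAMPLE))  # prints {'567': ['loss']}
```

Probes that are not alerting carry no alert_reasons key in the example, which is why the sketch only reads it for probes whose alert property is true.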

The idea is to have your monitoring software parse this output and act accordingly. How you parse it is up to you. The simplest approach is to grep the output for "global_alert": true (allowing for optional whitespace after the colon) and trigger your alerts based on that. A more nuanced approach is to parse the JSON and look for the values relevant to different users, so that you can page the appropriate contact.

If you're not keen on parsing the output, or want to save bandwidth by using a simpler test, we also allow you to abuse the HTTP response code system by setting the flag change_http_status=1. In this case, the above response would change to the following:

# Request
HEAD https://atlas.ripe.net/api/v2/measurements/123456789/status-check/?change_http_status=1

# Response
HTTP/1.1 418 UNKNOWN STATUS CODE
Date: Tue, 29 Oct 2013 14:37:37 GMT
X-RIPE-Atlas-Global-Alert: 1
Content-Type: text/plain
Cache-Control: no-cache

Note that the only HTTP codes currently in use are 200 and 418. There are no plans to expand the abuse of the HTTP status code system at present, as this would make it difficult to indicate whether there is a problem with the measurement in question, or the status check system itself.

With these sorts of changes, you can write server-side scripts to capture and parse the JSON output, or just note the HTTP response code and take whatever action you see fit. To use Nagios as an example, you could use the check_http script to alert if the HTTP response is anything other than 200. There's no need to write any custom code if you don't want to. Please make sure that your system sets the HTTP Host header properly, i.e. it sends a Host: atlas.ripe.net line with the HTTP request. In Nagios this is achieved by using the option -H atlas.ripe.net.
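If you do prefer a custom check, the status-code approach could be sketched in Python as follows. This is an illustration under assumptions rather than an official plugin: nagios_state and check are hypothetical names, the exit codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN), and mapping 418 to CRITICAL is a choice made for this example.

```python
import urllib.error
import urllib.request

# Conventional Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def nagios_state(http_status):
    """Translate a status-check HTTP code into a Nagios exit code."""
    if http_status == 200:
        return OK        # no probe is alerting
    if http_status == 418:
        return CRITICAL  # at least one probe is alerting (assumed severity)
    return UNKNOWN       # a problem with the request or the service itself


def check(msm_id, timeout=10):
    """Send a HEAD request with change_http_status=1 (network call)."""
    url = (
        "https://atlas.ripe.net/api/v2/measurements/"
        f"{msm_id}/status-check/?change_http_status=1"
    )
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return nagios_state(response.status)
    except urllib.error.HTTPError as error:
        # urllib raises HTTPError for non-2xx codes such as 418.
        return nagios_state(error.code)
    except OSError:
        # Connection failures and timeouts say nothing about the measurement.
        return UNKNOWN
```

A wrapper script would then end with sys.exit(check(123456789)). Note that urllib derives the Host header from the URL, so no extra step is needed to send Host: atlas.ripe.net.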
