How to install okagent

First you need to create a project.
If you are using a firewall, add these IP addresses to your whitelist: ip.txt.

Built-in plugins

Okmeter agent will automatically monitor:
  • CPU usage
  • Load average
  • Memory
  • Swap: usage, I/O
  • Disks: usage, I/O
  • All processes: CPU, memory, swap, disk I/O, open files
  • TCP connections: states, ack backlog, RTT
  • Memcached
  • Redis
  • Nginx access logs
  • RAID
  • Zookeeper

Configuring Nginx

Okmeter agent needs additional information in the Nginx access log. Here is how to configure it:

  1. Add a new log_format (or modify your own) in /etc/nginx/nginx.conf:

     http {
             ...
             log_format combined_plus '$remote_addr - $remote_user [$time_local]'
                                      ' "$request" $status $body_bytes_sent "$http_referer"'
                                      ' "$http_user_agent" $request_time $upstream_cache_status'
                                      ' [$upstream_response_time]';
             ...
     }

  2. Specify this format for each access_log directive in your Nginx configuration.
     In simple cases, you only need to do this in /etc/nginx/nginx.conf:

     http {
             ...
             access_log /var/log/nginx/access.log combined_plus;
             ...
     }

  3. Reload Nginx: sudo /etc/init.d/nginx reload
Please note that you should not use different log formats for the same access log.
Also note that if no format is specified, the predefined combined format is used, and it does not contain the $request_time, $upstream_cache_status and $upstream_response_time variables. So you should find all your access_log directives and specify a format with these variables.
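
To check that every access_log directive uses the new format and that the configuration is valid, you can grep your Nginx config and run the built-in syntax check (a quick sketch; config paths may differ on your system):

$ grep -Rn "access_log" /etc/nginx/
$ sudo nginx -t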

PostgreSQL

If you're using PostgreSQL on Amazon AWS RDS | AWS Aurora Postgres | Google Cloud SQL | Azure Database for PostgreSQL — check out these setup instructions, and then return here.
To monitor PostgreSQL, you need to create a user for the okmeter agent and a helper function in the postgres database, so that the okmeter agent is able to get stats:
$ sudo su postgres -c "psql -d postgres"
CREATE ROLE okagent WITH LOGIN PASSWORD 'EXAMPLE_PASSWORD_DONT_USE_THAT_please_!)';
CREATE SCHEMA okmeter; -- So that helper won't mix with anything else.
GRANT USAGE ON SCHEMA okmeter TO okagent; -- So okmeter agent will have access to it.
CREATE OR REPLACE FUNCTION okmeter.pg_stats(text) -- For okagent to get stats.
RETURNS SETOF RECORD AS
$$
DECLARE r record;
BEGIN
    FOR r IN EXECUTE 'SELECT r FROM pg_' || $1 || ' r' LOOP  -- To get pg_settings, pg_stat_activity etc.
        RETURN NEXT r;
    END LOOP;
    RETURN;
END
$$ LANGUAGE plpgsql SECURITY DEFINER;
Then, add the okagent user to pg_hba.conf (see the pg_hba.conf docs):
host all okagent 127.0.0.1/32 md5 
local all okagent md5
For PostgreSQL on Amazon AWS RDS | AWS Aurora Postgres | Google Cloud SQL, change 127.0.0.1 in pg_hba.conf to the IP address of the server where Okagent is running.


And finally, reload pg_hba.conf:
$ sudo su postgres -c "psql -d postgres"
SELECT pg_reload_conf();
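To verify the setup, you can try connecting as the new user (a quick check; the host and database match the pg_hba.conf entries above):
$ psql -h 127.0.0.1 -U okagent -d postgres -c "SELECT 1"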
All set!
If you're using PostgreSQL on Amazon AWS RDS | AWS Aurora Postgres | Google Cloud SQL | Azure Database for PostgreSQL — check out these setup instructions.
PostgreSQL query statistics
To collect SQL statement / query runtime and execution statistics, you need to enable the pg_stat_statements extension.
It's a standard extension developed by the Postgres community, and it is well tested. It's also available in some Database-as-a-Service solutions such as AWS RDS and AWS Aurora Postgres.
First, if you're using Postgres version 9.6 or earlier, install the postgresql-contrib package from your Linux distribution or from postgresql.org.
Then, configure Postgres to load this extension by adding this to your postgresql.conf:
shared_preload_libraries = 'pg_stat_statements'   # change requires DB restart.
pg_stat_statements.max = 500
pg_stat_statements.track = top
pg_stat_statements.track_utility = true
pg_stat_statements.save = false
        
# Also consider enabling I/O timing tracking by uncommenting this:
#track_io_timing = on
But maybe read this section on runtime statistics first.
Then restart PostgreSQL: /etc/init.d/postgresql restart
And enable that extension via psql:
$ sudo su postgres -c "psql -d postgres"
CREATE EXTENSION pg_stat_statements;
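Once the extension is enabled, a quick sanity check in the same psql session should return a growing number of tracked statements:
SELECT count(*) FROM pg_stat_statements;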

PgBouncer

To monitor PgBouncer, add an okagent user to /etc/pgbouncer/userlist.txt (or another file referred to by the auth_file directive in pgbouncer.ini):
"okagent" "EXAMPLE_PASSWORD_DONT_USE_THAT_please_!)"
Then configure stats_users in pgbouncer.ini:
; comma-separated list of users who are just allowed to use SHOW command
stats_users = okagent
And reload PgBouncer:
/etc/init.d/pgbouncer reload
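To verify, you can connect to the PgBouncer admin console as the stats user (assuming the default PgBouncer port 6432):
$ psql -h 127.0.0.1 -p 6432 -U okagent pgbouncer -c "SHOW STATS;"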

JVM

To monitor a JVM, you need to enable JMX. You can do it by adding the following arguments to the JVM command line:
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.host=127.0.0.1 -Djava.rmi.server.hostname=127.0.0.1 -Dcom.sun.management.jmxremote.port=9099 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false 

Q: What if I have two or more JVMs running on the same server?
A: You can specify a different JMX port for each JVM. As long as it has `authenticate=false` and `ssl=false`, Okagent will automatically start gathering JVM data.
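
For example, a second JVM on the same host could use port 9100 (an illustrative sketch; the port is arbitrary as long as it's unique per JVM, and app.jar stands in for your application):
java -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.host=127.0.0.1 -Djava.rmi.server.hostname=127.0.0.1 -Dcom.sun.management.jmxremote.port=9100 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -jar app.jar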

Php-fpm

To monitor php-fpm, you need to enable the status page for each pool. Uncomment the pm.status_path directive in each pool's .conf file and set the status URL:
pm.status_path = /status ; you can use /status or any other URL, okagent will work with that
Then restart php-fpm: service php-fpm restart or docker restart some-php-container, and the okmeter agent will start collecting pool metrics.
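
To check the status page without going through a web server, you can query the pool socket directly with cgi-fcgi (provided by the fcgi tools package on most distributions; adjust 127.0.0.1:9000 to match your pool's listen directive, this is only a sketch):
$ SCRIPT_NAME=/status SCRIPT_FILENAME=/status REQUEST_METHOD=GET cgi-fcgi -bind -connect 127.0.0.1:9000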

RabbitMQ

To monitor RabbitMQ, you need to enable the rabbitmq_management plugin and create an Okagent user. Run the following commands on each RabbitMQ server:
rabbitmq-plugins enable rabbitmq_management
rabbitmqctl add_user okagent 'EXAMPLE_PASSWORD_DONT_USE_THAT_please_!)'
rabbitmqctl set_user_tags okagent monitoring
And grant permissions to okagent for vhosts:
rabbitmqctl set_permissions -p / okagent ".*" ".*" ".*"
rabbitmqctl set_permissions -p /vhost1 okagent ".*" ".*" ".*"
You can list vhosts with:
rabbitmqctl list_vhosts
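Since the monitoring tag grants read access to the management API, you can verify the setup with a request against it (assuming the default management port 15672):
$ curl -u okagent:'EXAMPLE_PASSWORD_DONT_USE_THAT_please_!)' http://127.0.0.1:15672/api/overview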

MySQL

To monitor MySQL, you need to create a monitoring user:
CREATE USER 'okagent'@'%' IDENTIFIED BY 'EXAMPLE_PASSWORD_DONT_USE_THAT_please_!)';
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'okagent'@'%';
GRANT SELECT ON `performance_schema`.* TO 'okagent'@'%';
FLUSH PRIVILEGES;
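You can double-check the grants afterwards:
mysql> SHOW GRANTS FOR 'okagent'@'%';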
To collect MySQL query stats, the okmeter agent uses the events_statements_summary_by_digest table from performance_schema, which is present in all modern MySQL, Percona and MariaDB engines. Please check your version:
mysql> SELECT 'OK' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA='performance_schema' AND TABLE_NAME='events_statements_summary_by_digest';
+----+
| OK |
+----+
| OK |
+----+
And if it returns OK, you can check whether performance_schema is already enabled and initialized successfully:
mysql> SHOW VARIABLES LIKE 'performance_schema';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| performance_schema | ON    |
+--------------------+-------+
If it is OFF, you can enable performance_schema in my.cnf and restart MySQL.
[mysqld]
performance_schema=ON
If you're using MySQL on Amazon AWS RDS | AWS Aurora | Google Cloud SQL | Azure | Oracle Cloud, check out these setup instructions!

Remote databases — AWS RDS | AWS Aurora | Cloud SQL | Azure Database for PostgreSQL | Elasticsearch | Redis

In these cases you can't install Okagent on the database host.

If you're using a remote / managed PostgreSQL, MySQL, Elasticsearch or Redis, you can set up Okagent to monitor the database from another server / AWS EC2 instance.

You need to create a server / virtual server / EC2 instance that has access to that remote DB instance, and install Okagent there.

You can also use an existing server / EC2 instance with DB access, such as your web application server, which probably already has DB access through a security group or something alike.

Then you need to create a config for the okmeter monitoring agent in the /usr/local/okagent/etc/config.d/ directory on that server / EC2 instance, for example /usr/local/okagent/etc/config.d/remote_db.yaml.

With the following content for MySQL or PostgreSQL:

plugin: postgresql # or mysql
config:
  host: db_ip # replace with your remote DB instance or cluster endpoint 
  #port: db_port # uncomment and replace with your remote DB instance port if it's non standard
  user: db_user # replace with your remote DB instance monitoring user 
  password: db_password # replace with your remote DB instance monitoring user password.

With the following content for Redis:

plugin: redis
config:
  host: db_ip # replace with your remote DB instance or cluster endpoint 
  #port: db_port # uncomment and replace with your remote DB instance port if it's non standard
  #password: db_password # uncomment and replace with your remote DB instance monitoring user password.

With the following content for Elasticsearch:

plugin: elasticsearch
config:
  host: elasticsearch_url # replace with your Elasticsearch URL, in the format http(s)://hostname
  #port: db_port # uncomment and replace with your remote Elasticsearch port if it's non standard (9200)
  #user: db_user # uncomment and replace with your remote Elasticsearch monitoring user 
  #password: db_password # uncomment and replace with your remote Elasticsearch monitoring user password.
  #insecureTls: true # uncomment if Elasticsearch configured to use self-signed certificate
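
You can validate any of these configs with the agent's dry-run mode (described in the "Check configs | dry run" section below):

$ /usr/local/okagent/okagent -dry-run=/usr/local/okagent/etc/config.d/remote_db.yaml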

And restart Okagent with $ sudo /etc/init.d/okagent restart (or $ sudo systemctl restart okagent.service).

Make sure that the monitoring user has sufficient permissions on the database — check out the MySQL plugin or PostgreSQL plugin docs.

Zookeeper

If you are using ZooKeeper 3.4.10 or higher, you need to add the stat and mntr commands to the whitelist in your zoo.cfg:
4lw.commands.whitelist=stat, mntr
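After restarting ZooKeeper, you can check that the commands are now allowed (a quick check with netcat; 2181 is the default client port):
$ echo mntr | nc 127.0.0.1 2181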

Sending custom metrics

In addition to built-in metrics, Okmeter can process custom metrics. There are several ways to send your own metrics:

  1. Write an SQL query returning numeric values from a database with the SQL query plugin.
  2. Parse log files with the Logparser plugin.
  3. Write a script dumping metrics to stdout and call it periodically via the Execute plugin.
  4. Parse a response from an HTTP endpoint with the HTTP plugin.
  5. Gather information from Redis command output with Redis query.
  6. Track metrics from your application with Statsd.
  7. Scrape Prometheus-compatible exporters with the Prometheus plugin.

Those plugins require some configuration. Okmeter reads configuration from the /usr/local/okagent/etc/config.d/ directory. The file name can be anything; the file extension is .yaml and the format is YAML.

Check configs | dry run

After adding the configuration, you can check the syntax (replace PLUGIN_CONFIG_FILE):

$ /usr/local/okagent/okagent -dry-run=/usr/local/okagent/etc/config.d/PLUGIN_CONFIG_FILE

And if everything seems normal — don't forget to restart Okagent with $ sudo /etc/init.d/okagent restart (or $ sudo systemctl restart okagent.service).

SQL query plugin

The SQL query plugin sends custom metrics based on periodic database queries.
It can work with PostgreSQL, MySQL, Microsoft SQL Server or ClickHouse.

Let's say you have the following article_updates table:

    update_type | character varying(16)
    updated     | timestamp without time zone
    ...
    

And you want to monitor how many new updates of each type arrive over time, with this query:

SELECT COUNT(*) AS value, update_type FROM article_updates WHERE updated BETWEEN NOW() - INTERVAL '60 seconds' AND NOW() GROUP BY update_type

Note: check the query execution plan before configuring Okagent. The query will be executed every minute, so it should not cause any performance issues.
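
For example, you can inspect the plan in psql before adding the config (note that EXPLAIN ANALYZE actually executes the query, so use it with care in production):

EXPLAIN ANALYZE SELECT COUNT(*) AS value, update_type FROM article_updates WHERE updated BETWEEN NOW() - INTERVAL '60 seconds' AND NOW() GROUP BY update_type;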

Okmeter will run the query periodically and chart the results.

Note: Okmeter uses the value query field as the (floating point) value of the metric. Any other query field (here: update_type) sets a metric label of the same name. Okmeter can then chart metrics with different label values separately.

All of this can be done by creating YAML file /usr/local/okagent/etc/config.d/article_updates.yaml:

plugin: postgresql_query # or mssql_query or mysql_query or clickhouse_query
config:
  host: '127.0.0.1'
  port: 5432
  db: some_db
  user: some_user
  password: secret
  query: "SELECT COUNT(*) AS value, update_type FROM labels WHERE updated BETWEEN NOW() - INTERVAL '60 seconds' AND NOW() GROUP BY update_type"
  metric_name: demo.documents.update_rate
This config will produce the demo.documents.update_rate metric.

Execute plugin

This plugin sends custom metrics produced by an external process that prints metrics to standard output.

Regexp

The regexp parser parses command output. For example, to monitor application log sizes, create the config /usr/local/okagent/etc/config.d/app_log_disk_usage.yaml:

plugin: execute
config:
  command: 'du /var/log/app/main.log'
  regexp: '(?P<value>\d+)'
  name: demo.app.log_size  # metric name
  value: value             # metric value
  labels:                  # metric labels
    log_name: main
This config will produce the demo.app.log_size metric.
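
For reference, du prints the size followed by the file path, and the regexp captures the leading number (exact output varies by system):
$ du /var/log/app/main.log
1234    /var/log/app/main.log
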
JSON

The JSON parser can be used for sending ready-made metrics, or batches of metrics.

A metric should contain a name, a value and optional labels:

{
    "name": "metric1",
    "labels": {"label1": "foo", "label2": "bar"},
    "value": 123.4
}

Create config /usr/local/okagent/etc/config.d/execute_json.yaml:

plugin: execute
config:
  command: /tmp/calc_metrics.sh
  parser: json

A dummy version of calc_metrics.sh:

echo '{"name": "metric1", "labels": {"label1": "foo", "label2": "bar"}, "value": 123.4}'

The command can also return a list of metrics:

echo '[{"name": "metric1", "value": 123.4}, {"name": "metric2", "value": 567}]'
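
A slightly fuller, still hypothetical calc_metrics.sh that derives a value at run time might look like this (the spool directory and metric name are made up for illustration):

#!/bin/sh
# Hypothetical example: report queue depth as the number of files in a spool directory
count=$(ls /var/spool/app 2>/dev/null | wc -l)
echo "{\"name\": \"app.spool.files\", \"value\": $count}"

Remember to make the script executable (chmod +x /tmp/calc_metrics.sh).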

Don't forget to check the config.

Logparser plugin

This plugin extracts metrics from custom log files.

The plugin key in the config file should be logparser.

Regexp

Use the config to specify the log file path (file), a Perl-compatible regular expression (regex), the name of the subpattern that contains the time (time_field), the time format (time_field_format) and a list of metric descriptions (metrics).

plugin: logparser
config:
  file: /var/log/app/stages.log
  regexes:
    # 2015-11-21 15:42:36,972 demo [DEBUG] page=item stages: db=0.007s, render=0.002s, total=0.010s count=1
    - regex: '(?P<datetime>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).+ page=(?P<page>\w+) stages: db=(?P<db>\d+\.\d+)s, render=(?P<render>\d+\.\d+)s, total=(?P<total>\d+\.\d+)s count=(?P<count>\d+)'
      time_field: datetime
      time_field_format: '2006-01-02 15:04:05'
      metrics:
        ...
JSON
plugin: logparser
config:
  file: /var/log/app/stages.log
  json: true
  # {"ts": "2015-11-21 15:42:36.000+0300", "user": "demo", "page": "item", "db": "0.007", "render": "0.002", "total": "0.010", "count": "1"}
  time_field: ts
  time_field_format: "2006-01-02 15:04:05.000-0700"
  metrics:
    ...
TOP
plugin: logparser
config:
    file: /var/log/app/stages.log
    #{"ts":"2018-09-12 13:07:11.500","logger":"requests","time":"33","method":"PUT","status":200,"uri":"/aaa/bbb","rid":"noRequestId","ip":"2.2.2.2"}
    #{"ts":"2019-11-20 18:32:49.851+0300","logger":"requests","time":"157","method":"PUT","status":200,"uri":"/foo/bar?from=header_new","rid":"11","ip":"1.1.1.1"}
    json: true
    time_field: ts
    time_field_format: "2006-01-02 15:04:05.000-0700"
    top_vars:
      topurl:
        source: uri
        weight: 1
        threshold_percent: 1
        window_minutes: 10
    metrics:
      - type: rate
        name: service.requests.rate
        labels:
          method: =method
          url: =topurl
          status: =status
      - type: percentiles
        name: service.response_time.percentiles
        value: ms2sec:time
        args: [50, 75, 95, 99]
      - type: percentiles
        name: service.response_time.percentile-by-url
        value: ms2sec:time
        args: [95]
        labels:
          url: =topurl

In top_vars:

  • topurl is a new metric label holding the top-N URIs (taken from the JSON field uri, via source: uri).

  • weight – increment size for the metric counter.
  • threshold_percent – values whose share of the total sum is below this threshold are merged into a special cumulative value (~other).
  • window_minutes – sliding window size in minutes.

Metric type can be one of the following:

  • rate — by default, collects the rate of matched log entries. If value is provided, the actual rate will be the sum of values.
  • percentiles — collects the n-th percentiles of value per minute. The key args should be an array of percentage values, e.g. [50, 95, 99].
  • max or min — collects the max or min value per minute.
  • threshold — collects the rate of value hits in different intervals (e.g. (-∞, 0.5], (0.5, 1], ...).

time_field – the field containing the metric timestamp

time_field_format can be one of the following:

  • unix_timestamp – a floating point number with a Unix time timestamp
  • common_log_format – parses times like this one: 2/Jan/2006:15:04:05 -0700
  • time_iso8601 – RFC 3339 format (also known as ISO 8601)
  • A custom time format, where you show how the reference time – exactly Mon Jan 2 15:04:05 -0700 MST 2006 – would be formatted in the time format of your log. It serves as an example for the logparser plugin.
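
For example, if your log writes times like 21.11.2015 15:42:36, the corresponding format string would be 02.01.2006 15:04:05 (an illustrative pattern; build yours from the same reference time).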

In the absence of time zone information, logparser interprets time as UTC.

A metric can have a labels object, which can consist of static (stage: db) and dynamic (page: =page) labels. An equals sign = before a label value (e.g. page: =page above) means that the actual value should be taken from the regular expression named group (page in this example).

Let's say your application server logs different request processing stages:

2015-11-01 22:51:44,072 demo [DEBUG] page=item stages: db=0.005s, render=0.002s
2015-11-01 22:51:44,087 demo [DEBUG] page=list stages: db=0.003s, render=0.001s
...

The following config

plugin: logparser
config:
  file: /var/log/app/stages.log
  regexes:
    # 2015-11-21 15:42:36,972 demo [DEBUG] page=item stages: db=0.007s, render=0.002s, total=0.010s count=1
    - regex: '(?P<datetime>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).+ page=(?P<page>\w+) stages: db=(?P<db>\d+\.\d+)s, render=(?P<render>\d+\.\d+)s, total=(?P<total>\d+\.\d+)s count=(?P<count>\d+)'
      time_field: datetime
      time_field_format: '2006-01-02 15:04:05'
      metrics:
        - type: percentiles
          args: [50, 75, 95, 98, 99]
          value: db
          name: demo.stages.percentiles
          labels:
            stage: db
        - type: percentiles
          args: [50, 75, 95, 98, 99]
          value: render
          name: demo.stages.percentiles
          labels:
            stage: render
        - type: rate
          name: demo.requests.rate
          labels:
            page: =page
        - type: rate
          name: demo.documents.rate
          value: count
          labels:
            page: =page
        - type: threshold
          value: total
          args: [0.05, 0.1]
          name: demo.render.histogram
          labels:
            page: =page

parses those lines and produces the demo.requests.rate, demo.documents.rate, demo.stages.percentiles and demo.render.histogram metrics.

HTTP plugin

With the HTTP plugin you can check HTTP service availability and gather arbitrary metrics. For example, suppose we would like to collect metrics from the response of a service protected with HTTP Basic Auth:
curl --user name:password -H 'Foo: bar' -v 'http://127.0.0.1:8080/service/stat'
> GET /service/stat HTTP/1.1
> Host: 127.0.0.1
> Authorization: Basic bmFtZTpwYXNzd29yZA==
> Foo: bar
>
< HTTP/1.1 200 OK
<
online_users=140 active_users=10
An example config follows:
plugin: http
config:
  url: http://127.0.0.1:8080/service/stat
  username: name
  password: password
  #sslskip: on #optional, disable certificate verification, like curl --insecure
  headers:
    foo: bar
  metrics:
    - metric: users.online
      regex: 'online_users=(\d+)'
    - metric: users.active
      regex: 'active_users=(\d+)'
On each request we will get two metrics:
metric(name="users.online")
metric(name="users.active")
        
The plugin generates two additional metrics:
metric(name="status", plugin="http", instance="<config filename>")

# Request status (1 in case of success, 0 otherwise)

# Additionally there is a metric with the same name that contains
# an error string in case of error, and an empty string otherwise
        
metric(name="http.request_time", plugin="http", instance="<config filename>")

Redis query

With the RedisQuery plugin you can collect numeric metrics from Redis command results. Here is an example config placed in /usr/local/okagent/etc/config.d/redis_example.yaml:
plugin: redis_query
config:
  #host: 127.0.0.1 #optional
  #port: 6379 #optional
  #database: 0 #optional
  commands:
    - LLEN achievement_queue
    - SCARD some_queue
    - HGET stat active_connections
    - GET mail_sender_queue_len
will produce these 4 metrics:
metric(name="achievement_queue.llen")
metric(name="some_queue.scard")
metric(name="stat.hget", param="active_connections")
metric(name="mail_sender_queue_len.get")
Supported Redis commands: BITCOUNT, GET, GETBIT, GETRANGE, HGET, HLEN, HSTRLEN, LINDEX, LLEN, PTTL, SCARD, STRLEN, TTL, ZCARD, ZCOUNT, ZLEXCOUNT, ZRANK, ZREVRANK, ZSCORE.

Statsd / Application metrics

What is this for?

Your server applications can be instrumented using numerous StatsD client libraries to send statistics, like counters and timers, over UDP to Okagent. Okagent will then generate aggregated metrics and relay them to Okmeter for graphing and alerting.

Okagent listens on UDP port 8125 and accepts StatsD counts, timings and gauges.
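
You can send a test counter by hand with netcat (a quick smoke test; the metric name is arbitrary):
$ echo "demo.test.hit:1|c" | nc -u -w1 127.0.0.1 8125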

Example application setup

Here's an example of pystatsd usage in a Python webapp:

from statsd import StatsClient
statsd_client = StatsClient(host='127.0.0.1', port=8125, prefix='demo')

def item(*args, **kwargs):
    statsd_client.incr('view.get.demo.views.item.hit') #what a long metric name!
    return Items.get()

So you will get these metrics:

metric(source_hostname="backend1",
        name="demo.view.get.demo.views.item.hit", …) # equals the call count of the "item" function

metric(source_hostname="backend1",
        name="demo.view.get.demo.views.item.hit.rate", …) # and this is the calls-per-second rate

And now we can make some charts with this:

You can also measure the duration of some code blocks or functions. For that there are timers:

def list(*args, **kwargs):
    with statsd_client.timer('view.get.demo.views.list.total'):
        return get_list_with_some_work()

This will provide us with these metrics:

metric(name="demo.view.get.demo.views.list.total.mean", …)
metric(name="demo.view.get.demo.views.list.total.count", …)
metric(name="demo.view.get.demo.views.list.total.lower", …)
metric(name="demo.view.get.demo.views.list.total.upper", …)
metric(name="demo.view.get.demo.views.list.total.percentile", percentile="50", …)
metric(name="demo.view.get.demo.views.list.total.percentile", percentile="75", …)
metric(name="demo.view.get.demo.views.list.total.percentile", percentile="90", …)
metric(name="demo.view.get.demo.views.list.total.percentile", percentile="95", …)
metric(name="demo.view.get.demo.views.list.total.percentile", percentile="97", …)
metric(name="demo.view.get.demo.views.list.total.percentile", percentile="99", …)

And we can chart that:

StatsD usage guidelines

In a real project, it's quite common to have tens or hundreds of web app handlers. So you might find yourself with hundreds or thousands of metrics, and it might become a burden to recite all these myapp.some.really.long.metrics.names or some.other.long.metric.name every time you want to graph them or do something else.

We recommend using Metrics 2.0 naming, with orthogonal tags for every dimension.
So instead of the clumsy demo.view.get.demo.views.list.total.mean you'll get the sort of self-describing

metric(name="demoapp.view.timing.mean",
        phase="total",
        handler="search",
        method="get")

Here name states the distinct purpose of one particular measurement — the duration of view function execution in our view.timing example — while tags allow differentiating between measurements on various subsets.

Here's how that works:

stats = StatsClient(host='127.0.0.1', port=8125, prefix='demoapp')

def search(request):
    with stats.timer('view.timing.phase_is_total.handler_is_search.method_is_'+request.method):
        return get_list_with_some_work()

def get_item(request, *args, **kwargs):
    with stats.timer('view.timing.phase_is_total.handler_is_get_item.method_is_'+request.method):
        return get_list_with_some_work()

So in general, if you want to add a tag named tag_1 with value some_value and tag_2 with other_val to a metric named my.precious.metric, you can do it like this: my.precious.metric.tag_1_is_some_value.tag_2_is_other_val. Tag order doesn't matter, so my.precious.metric.tag_2_is_other_val.tag_1_is_some_value will work as well.

We recommend not using lots of tag values, like putting full HTTP URLs as values, especially ones with numerical ids — something like /path/url/123. That's because it will produce a great number of metrics, which makes chart rendering heavy, and they won't be very useful.
It should work fine with up to 5 different tags per metric, with fewer than 10 distinct values for each.

Advanced configuration

By default, Okagent listens for StatsD metrics on UDP port 8125.

If you would like to use a different address or port, just supply the following config in /usr/local/okagent/etc/config.d/statsd.yaml:

plugin: statsd
config:
    listen_address: "192.168.1.1:18125"

Don't forget to restart Okagent with $ sudo /etc/init.d/okagent restart (or $ sudo systemctl restart okagent.service).

Please note that after this change Okagent won't listen for stats on the default port 8125.

Prometheus

Okagent is able to scrape metrics from Prometheus-compatible exporters.

It scrapes each discovered exporter and produces metrics, for example, from the http_request_duration_seconds histogram:

metric(name="http_request_duration_seconds_count", handler="/", method="GET", code="200", ...)
metric(name="http_request_duration_seconds_sum", handler="/", method="GET", code="200", ...)
metric(name="http_request_duration_seconds_bucket", handler="/", method="GET", code="200", le="0.1", ...)
metric(name="http_request_duration_seconds_bucket", handler="/", method="GET", code="200", le="0.5", ...)
...

For applications running in Kubernetes, endpoint discovery is based on these annotations:

apiVersion: apps/v1
kind: Deployment #or StatefulSet, DaemonSet, CronJob, ...
metadata:
    name: my-app
    annotations:
        prometheus.io/scrape: "true"
        prometheus.io/scheme: "http"
        prometheus.io/port: "80"
        prometheus.io/path: "/metrics"
...

Okagent can also discover exporters running in Docker containers that are annotated via container labels:

docker run --name my-app \
    --label io.prometheus.scrape=true \
    --label io.prometheus.port=80 \
    --label io.prometheus.scheme=http \
    --label io.prometheus.path="/metrics" \
    ...

API

Okmeter has limited support for a Prometheus-like Query API. Only query and query_range are available at the moment.

You can configure Grafana data source like this:

The Basic Auth Password is the project access token. Please remember to set the Scrape interval to 60s.

After that, you can use the data source to query metrics from Okmeter.

Please note, the Okmeter API doesn't support PromQL, only the Okmeter Query Language.

Query Language

Line expression syntax:
lines:
- expression: metric(a='b', c='d*', e=['f', 'g*']) #example some load averages
selects all metrics with labels matching the given values:
a='b' matches metrics whose label a equals the fixed string 'b'
c='d*' matches metrics whose label c starts with 'd'
e=['f', 'g*'] matches metrics whose label e equals the fixed string 'f' or starts with 'g'
lines:
- expression: rate(EXPR) #example python cpu_user
the derivative of each metric in EXPR
lines:
- expression: counter_rate(EXPR) #example python cpu_user
the derivative for counters – like rate, but doesn't spike on counter resets
lines:
- expression: sum(EXPR [, ignore_nan=True|False]) #example all python's cpu_user and cpu_system
the sum of all metrics in EXPR
If ignore_nan=False, the result is NaN if any metric in EXPR was NaN. The default is ignore_nan=True
lines:
- expression: max(EXPR)
- expression: min(EXPR)
- expression: std(EXPR) #standard deviation
- expression: average(EXPR) #same as mean
- expression: mean(EXPR) #example mean load average
at each time point, applies the aggregation function to all metrics in EXPR
lines:
- expression: sum_by(label_name, [other_label,] EXPR) #example processes cpu usage
- expression: max_by(label_name, [other_label,] EXPR)
- expression: min_by(label_name, [other_label,] EXPR)
- expression: std_by(label_name, [other_label,] EXPR) #standard deviation
- expression: mean_by(label_name, [other_label,] EXPR) #same as average
- expression: average_by(label_name, [other_label,] EXPR) #example mean load average
groups all metrics in EXPR by the value of the label_name label and aggregates the metrics in each group into one metric
Accepts the parameter ignore_nan=False|True, just like the ordinary sum
lines:
- expression: win_sum(window_size_in_seconds, EXPR)
- expression: win_mean(window_size_in_seconds, EXPR) #same as win_avg
- expression: win_min(window_size_in_seconds, EXPR)
- expression: win_max(window_size_in_seconds, EXPR)
- expression: win_std(window_size_in_seconds, EXPR)
- expression: win_avg(window_size_in_seconds, EXPR) #example mean load average on hour window
Applies the specified function sum|mean|min|max|std to each metric in EXPR over a moving time window of window_size_in_seconds. See Moving average
lines:
- expression: cum_sum(EXPR) #example
The cumulative sum of each metric in EXPR.
lines:
- expression: top(N, EXPR[, include_other=true|false][, by="exp"|"sum"|"max"]) #example top 5 processes by CPU
- expression: bottom(N, EXPR[, by="exp"|"sum"|"max"])
shows the top|bottom N metrics from EXPR by ews|exp (exponentially weighted sum), sum or max in the current timespan
lines:
- expression: filter_with(EXPR, FILTER_EXPR) #example memory usage of long running processes
filters metrics in EXPR, returning only those for which FILTER_EXPR is not zero (or NaN)
lines:
- expression: const(v[, label="value", ...]) #example
a constant metric with value v and additional labels for the legend
lines:
- expression: time()
the timestamp from the x-axis as the y-value
lines:
- expression: from_string("1,2,3,3,2,1,", [,repeat=false] [,sep=' '] [,label="value", ...]) #example
constructs a metric from a string like "1,2,3,3,2,1,", where each number becomes the value of the metric for the corresponding minute
lines:
- expression: defined(EXPR) #example all processes
1 if there is data from EXPR at this time point, or 0 if there is NaN
lines:
- expression: replace(old_val, new_val, EXPR) #example
- expression: n2z(EXPR) #shortcut for "replace(nan, 0, EXPR)"
- expression: zero_if_none(EXPR) #shortcut for "replace(nan, 0, EXPR)"
- expression: z2n(EXPR) #shortcut for "replace(0, nan, EXPR)"
- expression: zero_if_negative(EXPR)
- expression: none_if_zero(EXPR) #shortcut for "replace(0, nan, EXPR)"
- expression: remove_below(EXPR, value)
- expression: remove_above(EXPR, value)
- expression: clamp_min(EXPR, min)
- expression: clamp_max(EXPR, max)
sets new_val instead of old_val
lines:
- expression: sum_by(label, [other_label,] metric(..)) / max_by(label, [other_label,] metric(.))
- expression: sum_by(label, [other_label,] metric(..)) * sum_by(label, [other_label,] metric(.))
- expression: sum_by(label, [other_label,] metric(..)) - min_by(label, [other_label,] metric(.))
if the labels for both sum_by are the same, this evaluates /, * or - for each pair of metrics (one from the left and one from the right)
lines:
- expression: sum_by(label, [other_label,] metric(..)) / EXPR
- expression: min_by(label, [other_label,] metric(..)) * EXPR
- expression: max_by(label, [other_label,] metric(..)) - EXPR
Applies / EXPR, * EXPR or - EXPR to each metric from the left XXX_by(label, ...)
Lines legend syntax:
lines:
- expression: metric(...)
  legend: '%s'
for each line, shows all label_name:label_value pairs in the legend
lines:
- expression: metric(...)
  legend: '%(label_name)s anything'
for each line, shows `label_value` anything in the legend
Colors syntax:
lines:
- expression: metric(...)
  color: '#81ff22'
- expression: metric(...)
  color: 'red'
color sets the line color
lines:
- expression: metric(...)
  colors: ['#80AB00', 'red', 'rgb(127,0,20)', 'hsla(100,10%,20%,0.8)']
will cycle through specified colors
lines:
- expression: metric(...)
  colors: 
    /regex.*/: '#fff'
    /regex2/: 'gold'

will match the legend against the regexes
lines:
- expression: metric(...)
options:
  colors: semaphore
  #OR
  colors: semaphore inv
will color all previously uncolored lines with a gradient from red to green,
or from green to red with semaphore inv
Sorting syntax:
lines:
- expression: metric(...)
options:
  sort: alpha|num
sorts all lines by legend in alphabetical or numeric (default) order
lines:
- expression: metric(...)
options:
  sort: ['fixed', 'order', 'for', 'legend', 'items']
fixed sort order by each item's legend
lines:
- expression: metric(...)
options:
  sort: ...
  order: DESC
changes the sort order to descending
Captions:
lines:
- expression: metric(...)
- expression: metric(...)
title: 'some %(label_name)s'
formats the chart title with labels from all expressions combined
lines:
- expression: metric(...)
- expression: metric(...)
options:
  y_title: 'some text'
Y-axis vertical title as plain text

Alerting

Alerts are an essential part of a monitoring system. You want to know if something bad is going on.

Okmeter provides you with a large list of predefined triggers. And of course, you can define your own.

Alerting consists of two parts: trigger configs (aka alerting rules) and notification configs.

It is highly recommended to test notifications after changing trigger or notification settings, to make sure that alerting works as expected. You can temporarily alter a trigger to make it fire, for example by changing its threshold.

Triggers

The following is an example of a trigger. It checks the error rate in Nginx access logs:

expression: 'sum(n2z(metric(name="nginx.requests.rate", status="5*")))'
threshold: '>= 1'
severity: critical
message: '5xx nginx %(value).1f per sec'
notify_after: 120
resolve_after: 180
notification_config: ops
expression

An expression describes how to calculate a monitored value.

An expression consists of a metric selector (metric(name="..", label_foo="..")) and some math operations. Please refer to the query language section for more info.

threshold

A threshold defines the condition under which the trigger fires. Valid operations are <, <=, >, >=.

severity

Possible values are critical, warning and info. Triggers with severity info aren't meant to be sent anywhere.

message

Message is a text describing the problem. You can include the expression value and labels using a simple formatting language. For example, if you want to know the particular error code, you can alter the nginx trigger above:

expression: 'sum_by(status, n2z(metric(name="nginx.requests.rate", status="5*")))'
message: '%(status)s nginx %(value).1f per sec'
notify_after / resolve_after

To ignore short spikes, you can define the notify_after and resolve_after settings. The expected value is a number of seconds.

notification_config

When the error rate is high enough, you will receive an alert according to the notification config ops. You don't have to specify a notification config for every trigger. You can configure one named default; it will be used by default.

You don't have to create any notification config at all if you want to be notified by email or SMS only.

To disable notifications for a particular trigger, you can set notification_config: off.

Notifications

The most comprehensive notification config can look like this:

renotify_interval: 600
notify_resolve: on

oncalls: # email and sms
  - admin_1
  - admin_2
  - boss

slack_url: https://hooks.slack.com/services/XXXX/XXXX/XXXXXXXX
slack_channel: '#ops'

telegram_chat_id: -XXXXXXXX

opsgenie:
  api_key: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

prometheus:
  url: http://HOST:PORT/api/v1/alerts
  basic_auth:
    username: XXXX
    password: XXXX
renotify_interval

Resend notifications if an alert isn't acknowledged or resolved. The expected value is a number of seconds.

notify_resolve

Send a notification when an outage is resolved.

oncalls

Email addresses and cell phone numbers can be defined on the contacts page. A notification config refers to contacts in the oncalls section by contact name.

If no notification configs are defined, alerts will be sent to every enabled contact.

slack

To enable Slack notifications, you should create an Incoming WebHook in your Slack settings and put the webhook URL into slack_url. Additionally, you can specify slack_channel.

telegram

Notifications to Telegram can be enabled by adding our @OkmeterBot to your chat or group. To determine telegram_chat_id, you can temporarily add another bot, @myidbot, and call the /getgroupid command.

opsgenie

To push alerts to Opsgenie, please create a new API (Rest API over JSON) integration in your Opsgenie settings, and configure api_key from that integration.

If you are using the EU Opsgenie instance, please add api_url: api.eu.opsgenie.com.

prometheus

To push alerts to Prometheus Alertmanager, please configure url and basic_auth (if applicable).

Uninstall Okagent

  1. Stop the agent process:
    • sudo /etc/init.d/okagent stop
    • or sudo service okagent stop
    • or sudo systemctl stop okagent.service && sudo systemctl disable okagent.service
  2. Remove the agent files and init script: sudo rm -rf /usr/local/okagent /etc/init.d/okagent
  3. Resolve the "no heartbeat" alert.
  4. After 24h you'll be able to remove this host from the host list.
  5. Done

Support

Feel free to ask us anything.