Built-in plugins
Okmeter agent will automatically monitor:
- CPU usage
- Load average
- Memory
- Swap: usage, I/O
- Disks: usage, I/O
- All processes: CPU, memory, swap, disk I/O, open files
- TCP connections: states, ack backlog, RTT
- Memcached
- Redis
- Nginx access logs
- RAID
- Zookeeper
Configuring Nginx
Okmeter agent needs additional information in the Nginx access log. Here is how you can configure it:

- Add a new log_format (or modify your own) in /etc/nginx/nginx.conf:

```
http {
    ...
    log_format combined_plus '$remote_addr - $remote_user [$time_local]'
                             ' "$request" $status $body_bytes_sent "$http_referer"'
                             ' "$http_user_agent" $request_time $upstream_cache_status'
                             ' [$upstream_response_time]';
    ...
}
```

- Specify this format for each access_log directive in the Nginx configuration. In simple cases, you only need to do this in /etc/nginx/nginx.conf:

```
http {
    ...
    access_log /var/log/nginx/access.log combined_plus;
    ...
}
```

- Reload Nginx:

```
sudo /etc/init.d/nginx reload
```
Also note that if no format is specified, the predefined combined format is used, which does not contain the variables $request_time, $upstream_cache_status and $upstream_response_time. So you should find all your access_log directives and specify a format with these variables.
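A quick way to locate them all (a sketch, assuming the standard /etc/nginx layout; adjust the path to your setup):

```
$ grep -rn 'access_log' /etc/nginx/
```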
PostgreSQL
If you're using PostgreSQL on Amazon AWS RDS | AWS Aurora Postgres | Google Cloud SQL | Azure Database for PostgreSQL, check out these setup instructions first, and then return here.

To monitor PostgreSQL, you need to create a user for the okmeter agent and a helper function in the postgres db, so that the okmeter agent is able to get stats:
```
$ sudo su postgres -c "psql -d postgres"

CREATE ROLE okagent WITH LOGIN PASSWORD 'EXAMPLE_PASSWORD_DONT_USE_THAT_please_!)';

CREATE SCHEMA okmeter; -- So that the helper won't mix with anything else.
GRANT USAGE ON SCHEMA okmeter TO okagent; -- So okmeter agent will have access to it.

CREATE OR REPLACE FUNCTION okmeter.pg_stats(text) -- For okagent to get stats.
RETURNS SETOF RECORD AS
$$
DECLARE r record;
BEGIN
    FOR r IN EXECUTE 'SELECT r FROM pg_' || $1 || ' r' LOOP -- To get pg_settings, pg_stat_activity etc.
        RETURN NEXT r;
    END LOOP;
    RETURN;
END
$$ LANGUAGE plpgsql SECURITY DEFINER;
```
Then, add the okagent user to pg_hba.conf (pg_hba.conf docs):

```
local all okagent md5
host  all okagent 127.0.0.1/32 md5
```

For PostgreSQL on Amazon AWS RDS | AWS Aurora Postgres | Google Cloud SQL, change 127.0.0.1 in pg_hba.conf to the IP address of the server where Okagent is running.

And finally, reload pg_hba.conf:

```
$ sudo su postgres -c "psql -d postgres"
SELECT pg_reload_conf();
```

All set!
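To verify, you can try connecting as the new user from the host where Okagent runs (a sketch; substitute the password you actually set):

```
$ PGPASSWORD='EXAMPLE_PASSWORD_DONT_USE_THAT_please_!)' psql -h 127.0.0.1 -U okagent -d postgres -c 'SELECT 1'
```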
PostgreSQL query statistics
To collect SQL statement / query runtime and execution statistics, you need to enable the pg_stat_statements extension. It's a standard extension developed by the Postgres community, and it is well tested. It's also available in some Database-as-a-Service solutions such as AWS RDS and AWS Aurora Postgres.
First, if you're using Postgres version 9.6 or less, install the postgres-contrib package from your Linux distribution or from postgresql.org.

Then, configure Postgres to load this extension by adding this to your postgresql.conf:
```
shared_preload_libraries = 'pg_stat_statements' # change requires DB restart
pg_stat_statements.max = 500
pg_stat_statements.track = top
pg_stat_statements.track_utility = true
pg_stat_statements.save = false

# Also consider enabling I/O timing tracking by uncommenting this:
#track_io_timing = on
```

But maybe read this section on runtime statistics first.
Then restart PostgreSQL:

```
/etc/init.d/postgresql restart
```

And enable the extension via psql:

```
$ sudo su postgres -c "psql -d postgres"
CREATE EXTENSION pg_stat_statements;
```
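To check that statistics are being collected, you can query the view directly (a sketch; the calls and query columns exist in all supported versions):

```
$ sudo su postgres -c "psql -d postgres"
SELECT calls, query FROM pg_stat_statements ORDER BY calls DESC LIMIT 5;
```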
PgBouncer
To monitor PgBouncer, add the okagent user to /etc/pgbouncer/userlist.txt (or another file referred to by the auth_file directive in pgbouncer.ini):

```
"okagent" "EXAMPLE_PASSWORD_DONT_USE_THAT_please_!)"
```

Then configure stats_users in pgbouncer.ini:

```
; comma-separated list of users who are just allowed to use SHOW command
stats_users = okagent
```

And reload PgBouncer:

```
/etc/init.d/pgbouncer reload
```
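To verify, connect to the PgBouncer admin console as okagent (a sketch, assuming the default port 6432):

```
$ psql -h 127.0.0.1 -p 6432 -U okagent pgbouncer -c 'SHOW STATS;'
```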
JVM
To monitor JVM, you need to enable JMX. You can do it by adding the following arguments to the JVM command line:

```
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.host=127.0.0.1
-Djava.rmi.server.hostname=127.0.0.1
-Dcom.sun.management.jmxremote.port=9099
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
```
Q: What if I have two or more JVMs running on the same server?
A: You can specify a different JMX port for each. As long as it has `authenticate=false` and `ssl=false`, Okagent will automatically start gathering JVM data.
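For example (a sketch; app-one.jar and app-two.jar are hypothetical applications):

```
# First JVM, JMX on port 9099
java -Dcom.sun.management.jmxremote \
     -Dcom.sun.management.jmxremote.host=127.0.0.1 \
     -Djava.rmi.server.hostname=127.0.0.1 \
     -Dcom.sun.management.jmxremote.port=9099 \
     -Dcom.sun.management.jmxremote.authenticate=false \
     -Dcom.sun.management.jmxremote.ssl=false \
     -jar app-one.jar

# Second JVM on the same host: only the port differs
java -Dcom.sun.management.jmxremote \
     -Dcom.sun.management.jmxremote.host=127.0.0.1 \
     -Djava.rmi.server.hostname=127.0.0.1 \
     -Dcom.sun.management.jmxremote.port=9100 \
     -Dcom.sun.management.jmxremote.authenticate=false \
     -Dcom.sun.management.jmxremote.ssl=false \
     -jar app-two.jar
```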
Php-fpm
To monitor php-fpm, you need to enable the status page for each pool. Uncomment the pm.status_path directive in each pool .conf file and set the status URL:

```
pm.status_path = /status ; you can use /status or any other URL, okagent will work with that
```

Then restart php-fpm:

```
service php-fpm restart
```

or

```
docker restart some-php-container
```

and okmeter agent will start collecting pool metrics.
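To check the status page, you can request it through your web server (a sketch, assuming your web server routes pm.status_path to the pool):

```
$ curl http://127.0.0.1/status
```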
RabbitMQ
To monitor RabbitMQ you need to enable the rabbitmq_management plugin and create an Okagent user. Run the following commands on each RabbitMQ server:

```
rabbitmq-plugins enable rabbitmq_management
rabbitmqctl add_user okagent EXAMPLE_PASSWORD_DONT_USE_THAT_please_!)
rabbitmqctl set_user_tags okagent monitoring
```

And grant permissions to okagent for vhosts:

```
rabbitmqctl set_permissions -p / okagent ".*" ".*" ".*"
rabbitmqctl set_permissions -p /vhost1 okagent ".*" ".*" ".*"
```

You can list vhosts by:

```
rabbitmqctl list_vhosts
```
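To verify the user and the management plugin, you can query the management HTTP API (a sketch, assuming the default management port 15672):

```
$ curl -s -u 'okagent:EXAMPLE_PASSWORD_DONT_USE_THAT_please_!)' http://127.0.0.1:15672/api/overview
```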
MySQL
To monitor MySQL you need to create a monitoring user:

```
CREATE USER 'okagent'@'%' IDENTIFIED BY 'EXAMPLE_PASSWORD_DONT_USE_THAT_please_!)';
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'okagent'@'%';
GRANT SELECT ON `performance_schema`.* TO 'okagent'@'%';
FLUSH PRIVILEGES;
```
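You can confirm the grants afterwards:

```
mysql> SHOW GRANTS FOR 'okagent'@'%';
```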
To collect MySQL query stats, okmeter agent uses the events_statements_summary_by_digest table from performance_schema, which is present in all modern MySQL, Percona and MariaDB engines. Please check your version:

```
mysql> SELECT 'OK' FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA='performance_schema' AND TABLE_NAME='events_statements_summary_by_digest';
+----+
| OK |
+----+
| OK |
+----+
```

And if it is OK, you can check whether performance_schema is already enabled and initialized successfully by:

```
mysql> SHOW VARIABLES LIKE 'performance_schema';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| performance_schema | ON    |
+--------------------+-------+
```

If it is OFF, you can enable performance_schema in my.cnf and restart MySQL:

```
[mysqld]
performance_schema=ON
```

If you're using MySQL on Amazon AWS RDS | AWS Aurora | Google Cloud SQL | Azure | Oracle Cloud, check out these setup instructions!
Remote databases — AWS RDS | AWS Aurora | Cloud SQL | Azure Database for PostgreSQL | Elasticsearch | Redis
In these cases you can't install Okagent on the database host.
If you're using a remote / managed PostgreSQL, MySQL, Elasticsearch or Redis, you can set up Okagent to monitor the database from another server / AWS EC2 instance.
You need to create a server / virtual server / EC2 instance that will have access to that remote DB instance, and install Okagent there.
You can use an existing server / EC2 instance with DB access, such as your web application's, which probably already has DB access through a security group or something alike.
Then you need to create a config for the okmeter monitoring agent in the /usr/local/okagent/etc/config.d/ directory on that server | EC2 instance where Okagent is running, for example /usr/local/okagent/etc/config.d/remote_db.yaml.
With such content for MySQL or PostgreSQL:

```
plugin: postgresql # or mysql
config:
  host: db_ip # replace with your remote DB instance or cluster endpoint
  #port: db_port # uncomment and replace with your remote DB instance port if it's non-standard
  user: db_user # replace with your remote DB instance monitoring user
  password: db_password # replace with your remote DB instance monitoring user password
  #database: mydb # replace with the database where you created the okmeter schema, user and function in PostgreSQL, if it differs from "postgres"
```
With such content for Redis:

```
plugin: redis
config:
  host: db_ip # replace with your remote DB instance or cluster endpoint
  #port: db_port # uncomment and replace with your remote DB instance port if it's non-standard
  #password: db_password # uncomment and replace with your remote DB instance monitoring user password
```

Or such content for Elasticsearch:

```
plugin: elasticsearch
config:
  host: elasticsearch_url # replace with your Elasticsearch URL using format: http(s)://elasticsearch
  #port: db_port # uncomment and replace with your remote Elasticsearch port if it's non-standard (9200)
  #user: db_user # uncomment and replace with your remote Elasticsearch monitoring user
  #password: db_password # uncomment and replace with your remote Elasticsearch monitoring user password
  #insecureTls: true # uncomment if Elasticsearch is configured to use a self-signed certificate
```
And restart Okagent with $ sudo /etc/init.d/okagent restart (or $ sudo systemctl restart okagent.service).
Make sure that the monitoring user has sufficient permissions on the database; check out the MySQL plugin or PostgreSQL plugin docs.
Zookeeper
If you are using Zookeeper 3.4.10 or higher, you need to add the stat and mntr commands to the whitelist in your zoo.cfg:

```
4lw.commands.whitelist=stat, mntr
```
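You can check that the commands are whitelisted with a four-letter-word probe (a sketch, assuming the default client port 2181):

```
$ echo mntr | nc 127.0.0.1 2181
```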
Sending custom metrics
In addition to built-in metrics, Okmeter can process custom metrics. There are several ways to send your own metrics:
- Write an SQL query returning numeric values from a database with the SQL query plugin.
- Parse log files with the Logparser plugin.
- Write a script dumping metrics to stdout and periodically call it via the Execute plugin.
- Parse a response from an HTTP endpoint with the HTTP plugin.
- Gather information from Redis command output with Redis query.
- Track metrics from your application with Statsd.
- Scrape Prometheus-compatible exporters with the Prometheus plugin.
Those plugins require some configuration. Okmeter reads configuration from the /usr/local/okagent/etc/config.d/ directory. The file name can be anything, the file extension must be .yaml, and the file format is YAML.
Check configs | dry run
After adding the configuration, you can check the syntax (replace PLUGIN_CONFIG_FILE):
```
$ /usr/local/okagent/okagent -dry-run=/usr/local/okagent/etc/config.d/PLUGIN_CONFIG_FILE
```
And if everything seems normal, don't forget to restart Okagent with $ sudo /etc/init.d/okagent restart (or $ sudo systemctl restart okagent.service).
SQL query plugin
The SQL query plugin sends custom metrics based on periodic database queries.
It can work with PostgreSQL, MySQL, Microsoft SQL Server or ClickHouse.
Let's say you have the following article_updates table:

```
update_type | character varying(16)
updated     | timestamp without time zone
...
```
And you want to monitor how many new updates of each type are coming in over time, with this query:

```
SELECT COUNT(*) AS value, update_type
FROM article_updates
WHERE updated BETWEEN NOW() - INTERVAL '60 seconds' AND NOW()
GROUP BY update_type
```
Note: check the query execution plan before configuring Okagent. The query will be executed every minute and should not cause any performance issues.
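For example (a sketch; some_db and article_updates come from the example above):

```
$ sudo su postgres -c "psql -d some_db"
EXPLAIN ANALYZE SELECT COUNT(*) AS value, update_type
FROM article_updates
WHERE updated BETWEEN NOW() - INTERVAL '60 seconds' AND NOW()
GROUP BY update_type;
```

An index on the updated column usually keeps such a query cheap.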
Okmeter will run the query periodically, and will give you this chart:
Note: Okmeter uses the value query field as the (floating point) value of the metric. Any other query field (here: update_type) will set a metric label of the same name. Okmeter can then chart metrics with different label values separately (like in the chart above).
All of this can be done by creating the YAML file /usr/local/okagent/etc/config.d/article_updates.yaml:

```
plugin: postgresql_query # or mssql_query or mysql_query or clickhouse_query
config:
  host: '127.0.0.1'
  port: 5432
  db: some_db
  user: some_user
  password: secret
  query: "SELECT COUNT(*) AS value, update_type FROM article_updates WHERE updated BETWEEN NOW() - INTERVAL '60 seconds' AND NOW() GROUP BY update_type"
  metric_name: demo_documents_update_rate
```

This config will produce the demo_documents_update_rate metric. A metric name may contain ASCII letters and digits, as well as underscores. It must match the regex [a-zA-Z_][a-zA-Z0-9_]*.
Execute plugin
This plugin sends custom metrics produced by an external process outputting metrics to standard output.
Regexp
The Regexp parser will parse command output. For example, for monitoring application log sizes, create the config /usr/local/okagent/etc/config.d/app_log_disk_usage.yaml:

```
plugin: execute
config:
  command: 'du /var/log/app/main.log'
  regexp: '(?P<value>\d+)'
  name: demo_app_log_size # metric name
  value: value # metric value
  labels: # metric labels
    log_name: main
```

This config will produce the demo_app_log_size metric. A metric name may contain ASCII letters and digits, as well as underscores. It must match the regex [a-zA-Z_][a-zA-Z0-9_]*.
JSON
The JSON parser can be used for sending ready-made metrics, or whole batches of them.
A metric should contain a name, a value and optional labels:

```
{ "name": "metric1", "labels": {"label1": "foo", "label2": "bar"}, "value": 123.4 }
```

Create the config /usr/local/okagent/etc/config.d/execute_json.yaml:

```
plugin: execute
config:
  command: /tmp/calc_metrics.sh
  parser: json
```

A dummy version of calc_metrics.sh:

```
echo '{"name": "metric1", "labels": {"label1": "foo", "label2": "bar"}, "value": 123.4}'
```

The command can also return a list of metrics:

```
echo '[{"name": "metric1", "value": 123.4}, {"name": "metric2", "value": 567}]'
```
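A slightly more realistic sketch of /tmp/calc_metrics.sh (the spool directory and metric name here are hypothetical; make the script executable with chmod +x):

```
#!/bin/sh
# Count files in a spool directory and emit the count as an okmeter metric.
count=$(ls /var/spool/app 2>/dev/null | wc -l)
echo "[{\"name\": \"app_spool_files\", \"value\": $count}]"
```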
Don't forget to check the config.
Logparser plugin
This plugin extracts metrics from custom log files.
The config file's plugin key should be logparser.
Regexp
Use config to specify the log file path (file), a Perl-compatible regular expression (regex), the name of the subpattern that contains time (time_field), the time format (time_field_format) and a list of metric descriptions (metrics).

```
plugin: logparser
config:
  file: /var/log/app/stages.log
  regexes:
    # 2015-11-21 15:42:36,972 demo [DEBUG] page=item stages: db=0.007s, render=0.002s, total=0.010s count=1
    - regex: '(?P<datetime>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).+ page=(?P<page>\w+) stages: db=(?P<db>\d+\.\d+)s, render=(?P<render>\d+\.\d+)s, total=(?P<total>\d+\.\d+)s count=(?P<count>\d+)'
      time_field: datetime
      time_field_format: '2006-01-02 15:04:05'
      metrics: ...
```
JSON
```
plugin: logparser
config:
  file: /var/log/app/stages.log
  json: true
  # {ts: "2015-11-21 15:42:36.000+0300", user: "demo", page: "item", db: "0.007", render: "0.002", total: "0.010", count: "1"}
  time_field: ts
  time_field_format: "2006-01-02 15:04:05.000-0700"
  metrics: ...
```
TOP
```
plugin: logparser
config:
  file: /var/log/app/stages.log
  # {"ts":"2018-09-12 13:07:11.500","logger":"requests","time":"33","method":"PUT","status":200,"uri":"/aaa/bbb","rid":"noRequestId","ip":"2.2.2.2"}
  # {"ts":"2019-11-20 18:32:49.851+0300","logger":"requests","time":"157","method":"PUT","status":200,"uri":"/foo/bar?from=header_new","rid":"11","ip":"1.1.1.1"}
  json: true
  time_field: ts
  time_field_format: "2006-01-02 15:04:05.000-0700"
  top_vars:
    topurl:
      source: uri
      weight: 1
      threshold_percent: 1
      window_minutes: 10
  metrics:
    - type: rate
      name: service_requests_rate
      labels:
        method: =method
        url: =topurl
        status: =status
    - type: percentiles
      name: service_response_time_percentiles
      value: ms2sec:time
      args: [50, 75, 95, 99]
    - type: percentiles
      name: service_response_time_percentile_by_url
      value: ms2sec:time
      args: [95]
      labels:
        url: =topurl
```
In top_vars:
- topurl is a new label for the metric with the top-N URIs (JSON field – source: uri).
- weight – the increment size for the metric counter.
- threshold_percent – metrics whose share of the total sum is below this threshold are joined into a special cumulative metric (~other).
- window_minutes – sliding window size in minutes.
The metric type can be one of the following:
- rate – collects the rate of matched log entries by default. If value is provided, the actual rate will be the sum of values.
- percentiles – collects the n-th percentiles of value per minute. The key args should be an array of percentage values, e.g. [50, 95, 99].
- max or min – collects the max or min value per minute.
- threshold – collects the rate of value hits in different intervals (e.g. (-∞, 0.5], (0.5, 1], ...).
time_field – the field with the metric timestamp.

time_field_format can be one of the following:
- unix_timestamp – a floating point number with a Unix time timestamp
- common_log_format – parses times like this one – 2/Jan/2006:15:04:05 -0700
- time_iso8601 – RFC 3339 format (also known as ISO 8601)
- a custom time format, where you show how the reference time – exactly Mon Jan 2 15:04:05 -0700 MST 2006 – would be formatted in the time format of your log. It serves as an example for the logparser plugin.

In the absence of time zone information, logparser interprets time as UTC.
A metric can have a labels object, which can consist of static (stage: db) and dynamic (page: =page) labels. An equals sign = before a label value (e.g. page: =page above) means that the actual value should be taken from the regular expression named group (page in this example).
Let's say your application server logs different request processing stages:

```
2015-11-01 22:51:44,072 demo [DEBUG] page=item stages: db=0.005s, render=0.002s
2015-11-01 22:51:44,087 demo [DEBUG] page=list stages: db=0.003s, render=0.001s
...
```
The following config

```
plugin: logparser
config:
  file: /var/log/app/stages.log
  regexes:
    # 2015-11-21 15:42:36,972 demo [DEBUG] page=item stages: db=0.007s, render=0.002s, total=0.010s count=1
    - regex: '(?P<datetime>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).+ page=(?P<page>\w+) stages: db=(?P<db>\d+\.\d+)s, render=(?P<render>\d+\.\d+)s, total=(?P<total>\d+\.\d+)s count=(?P<count>\d+)'
      time_field: datetime
      time_field_format: '2006-01-02 15:04:05'
      metrics:
        - type: percentiles
          args: [50, 75, 95, 98, 99]
          value: db
          name: demo_stages_percentiles
          labels:
            stage: db
        - type: percentiles
          args: [50, 75, 95, 98, 99]
          value: render
          name: demo_stages_percentiles
          labels:
            stage: render
        - type: rate
          name: demo_requests_rate
          labels:
            page: =page
        - type: rate
          name: demo_documents_rate
          value: count
          labels:
            page: =page
        - type: threshold
          value: total
          args: [0.05, 0.1]
          name: demo_render_histogram
          labels:
            page: =render
```

parses those lines and produces demo_requests_rate, demo_documents_rate, demo_stages_percentiles and demo_render_histogram.
HTTP plugin
With the HTTP plugin you can check HTTP service availability and gather arbitrary metrics. For example, we would like to collect metrics from the response of a service which is protected with HTTP Basic Auth:

```
curl --user name:password -H 'Foo: bar' -v 'http://127.0.0.1:8080/service/stat'
> GET /service/stat HTTP/1.1
> Host: 127.0.0.1
> Authorization: Basic bmFtZTpwYXNzd29yZA==
> Foo: bar
>
< HTTP/1.1 200 OK
<
online_users=140 active_users=10
```

An example config follows:

```
plugin: http
config:
  url: http://127.0.0.1:8080/service/stat
  username: name
  password: password
  #sslskip: on # optional, disables certificate verification, like curl --insecure
  headers:
    foo: bar
  metrics:
    - metric: users_online
      regex: 'online_users=(\d+)'
    - metric: users_active
      regex: 'active_users=(\d+)'
```

On each request we will get two metrics:

```
metric(name="users_online")
metric(name="users_active")
```

The plugin generates two additional metrics:

```
metric(name="status", plugin="http", instance="<config filename>")
# Request status (1 — in case of success, 0 — otherwise)
# Additionally there is a metric with the same name that contains
# an error string in case of error and an empty string otherwise

metric(name="http.request_time", plugin="http", instance="<config filename>")
```
Redis query
With the RedisQuery plugin you can collect numeric metrics from Redis command results. An example config placed in /usr/local/okagent/etc/config.d/redis_example.yaml:

```
plugin: redis_query
config:
  #host: 127.0.0.1 # optional
  #port: 6379 # optional
  #database: 0 # optional
  commands:
    - LLEN achievement_queue
    - SCARD some_queue
    - HGET stat active_connections
    - GET mail_sender_queue_len
```

will produce these 4 metrics:

```
metric(name="achievement_queue.llen")
metric(name="some_queue.scard")
metric(name="stat.hget", param="active_connections")
metric(name="mail_sender_queue_len.get")
```

Supported Redis commands: BITCOUNT, GET, GETBIT, GETRANGE, HGET, HLEN, HSTRLEN, LINDEX, LLEN, PTTL, SCARD, STRLEN, TTL, ZCARD, ZCOUNT, ZLEXCOUNT, ZRANK, ZREVRANK, ZSCORE.
Statsd / Application metrics
What is this for?
Your server applications can be instrumented using numerous statsd client libraries to send statistics, like counters and timers, over UDP to Okagent. Okagent will then generate aggregated metrics and relay them to Okmeter for graphing and alerting.
Okagent listens on UDP port 8125 and accepts StatsD counts, timings and gauges.
Example application setup
Here's an example of pystatsd usage in a Python webapp:

```
from statsd import StatsClient

statsd_client = StatsClient(host='127.0.0.1', port=8125, prefix='demo')

def item(*args, **kwargs):
    statsd_client.incr('view.get.demo.views.item.hit')  # what a long metric name!
    return Items.get()
```
So you will get these metrics:
```
metric(source_hostname="backend1", name="demo.view.get.demo.views.item.hit", …)      # equals the call count of the "item" function
metric(source_hostname="backend1", name="demo.view.get.demo.views.item.hit.rate", …) # and this is the calls-per-second rate
```
And now we can make some charts with this:
You can also measure the duration of some code blocks or functions. For that, there are Timers:
```
def list(*args, **kwargs):
    with statsd_client.timer('view.get.demo.views.list.total'):
        return get_list_with_some_work()
```
This will provide us with these metrics:
metric(name="demo.view.get.demo.views.list.total.mean", …) metric(name="demo.view.get.demo.views.list.total.count", …) metric(name="demo.view.get.demo.views.list.total.lower", …) metric(name="demo.view.get.demo.views.list.total.upper", …) metric(name="demo.view.get.demo.views.list.total.percentile", percentile="50", …) metric(name="demo.view.get.demo.views.list.total.percentile", percentile="75", …) metric(name="demo.view.get.demo.views.list.total.percentile", percentile="90", …) metric(name="demo.view.get.demo.views.list.total.percentile", percentile="95", …) metric(name="demo.view.get.demo.views.list.total.percentile", percentile="97", …) metric(name="demo.view.get.demo.views.list.total.percentile", percentile="99", …)
And we can chart that:
StatsD usage guidelines
In a real project, it's quite common to have tens or hundreds of web app handlers. So you might find yourself with hundreds or thousands of metrics, and it might become a burden to recite all these myapp.some.really.long.metrics.names or some.other.long.metric.name every time you want to graph them or do something else.
We recommend using Metrics 2.0 naming with orthogonal tags for every dimension.
So instead of the clumsy demo.view.get.demo.views.list.total.mean you'll get a sort of self-describing

```
metric(name="demoapp.view.timing.mean", phase="total", handler="search", method="get")
```

where name states the distinct purpose of one particular measurement – the duration of view function execution in our view.timing example – while tags allow differentiating between measurements on various subsets.
Here's how that works:

```
stats = StatsClient(host='127.0.0.1', port=8125, prefix='demoapp')

def search(request):
    with stats.timer('view.timing.phase_is_total.handler_is_search.method_is_' + request.method):
        return get_list_with_some_work()

def get_item(request, *args, **kwargs):
    with stats.timer('view.timing.phase_is_total.handler_is_get_item.method_is_' + request.method):
        return get_list_with_some_work()
```
So in general, if you want to add a tag named tag_1 with value some_value and tag_2 with other_val to a metric named my.precious.metric, you can do it like this: my.precious.metric.tag_1_is_some_value.tag_2_is_other_val.
Tag order doesn't matter, and my.precious.metric.tag_2_is_other_val.tag_1_is_some_value will work as well.
We recommend not using lots of tag values, like putting full HTTP URLs as values, especially with numerical ids – something like /path/url/123. That's because it will produce a great number of metrics, which will make chart rendering heavy, and they won't be very useful.
It should work fine with up to 5 different tags per metric, with fewer than 10 distinct values for each.
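You don't even need a client library to try this out; here's a sketch that sends one tagged timing to Okagent over UDP with netcat:

```
$ echo "demoapp.view.timing.phase_is_total.handler_is_search.method_is_get:42|ms" | nc -u -w1 127.0.0.1 8125
```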
Advanced configuration
By default Okagent listens for StatsD metrics on UDP port 8125.
If you would like to use a different port, just supply the following config in /usr/local/okagent/etc/config.d/statsd.yaml:

```
plugin: statsd
config:
  listen_address: "192.168.1.1:18125"
```
Don't forget to restart Okagent with $ sudo /etc/init.d/okagent restart (or $ sudo systemctl restart okagent.service).
Please note that after these changes Okagent won't listen for stats on the default port 8125.
Prometheus
Okagent is able to scrape metrics from Prometheus-compatible exporters.
It scrapes each discovered exporter and produces metrics, for example, for the http_request_duration_seconds histogram:

```
metric(name="http_request_duration_seconds_count", handler="/", method="GET", code="200", ...)
metric(name="http_request_duration_seconds_sum", handler="/", method="GET", code="200", ...)
metric(name="http_request_duration_seconds_bucket", handler="/", method="GET", code="200", le="0.1", ...)
metric(name="http_request_duration_seconds_bucket", handler="/", method="GET", code="200", le="0.5", ...)
...
```
For applications running in Kubernetes, endpoint discovery is based on annotations:

```
apiVersion: apps/v1
kind: Deployment # or StatefulSet, DaemonSet, CronJob, ...
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/scheme: "http"
    prometheus.io/port: "80"
    prometheus.io/path: "/metrics"
...
```
Okagent can also discover exporters running in Docker containers that are annotated via container labels:

```
docker run --name my-app \
  --label io.prometheus.scrape=true \
  --label io.prometheus.port=80 \
  --label io.prometheus.scheme=http \
  --label io.prometheus.path="/metrics" \
  ...
```
Okagent can also scrape metrics from custom targets defined in a configuration file. Targets will be scraped as long as Okagent receives a correct response. An example config follows:

```
plugin: prometheus
config:
  targets:
    - http://localhost:9100/metrics
  # max cardinality (default 5000)
  limit: 1200
  # additional labels for all scraped metrics
  labels:
    exporter: node
```
You can configure authorization credentials, basic:

```
plugin: prometheus
config:
  targets:
    - http://localhost:9100/metrics
  authorization:
    type: basic
    username: <user>
    password: <password>
```

or with a bearer token:

```
plugin: prometheus
config:
  targets:
    - http://localhost:9100/metrics
  authorization:
    type: bearer
    token: <token>
```
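To check that a target actually serves Prometheus metrics, you can fetch it by hand (a sketch; 9100 is the usual node_exporter port):

```
$ curl -s http://localhost:9100/metrics | head
```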
API
Okmeter has limited support for a Prometheus-like Query API. Only query and query_range are available at the moment.
You can configure Grafana data source like this:

The Basic Auth Password is the project access token. Please remember to set the Scrape interval to 60s.
After that, you can use the data source to query metrics from Okmeter.
Please note that the Okmeter API doesn't support PromQL, only the Okmeter Query Language.

Query Language
Lines expression syntax:
```
lines:
- expression: metric(a='b', c='d*', e=['f', 'g*']) #example some load averages
```

- selects all metrics with labels matching the values:
  - a='b' matches metrics with label a equal to the fixed string 'b'
  - c='d*' matches metrics with label c starting with 'd'
  - e=['f', 'g*'] matches metrics with label e equal to the fixed string 'f' or starting with 'g'

```
lines:
- expression: rate(EXPR) #example python cpu_user
```

- derivative for each metric in EXPR

```
lines:
- expression: counter_rate(EXPR) #example python cpu_user
```

- derivative for counters – like rate, but doesn't spike on counter reset

```
lines:
- expression: sum(EXPR [, ignore_nan=True|False]) #example all python's cpu_user and cpu_system
```

- sum of all metrics in EXPR
- if ignore_nan=False, the result is NaN if any one metric in EXPR was NaN; the default is ignore_nan=True

```
lines:
- expression: max(EXPR)
- expression: min(EXPR)
- expression: std(EXPR) #standard deviation
- expression: average(EXPR) #same as mean
- expression: mean(EXPR) #example mean load average
```

- at each time point, takes the aggregation function over all metrics in EXPR

```
lines:
- expression: sum_by(label_name, [other_label,] EXPR) #example processes cpu usage
- expression: max_by(label_name, [other_label,] EXPR)
- expression: min_by(label_name, [other_label,] EXPR)
- expression: std_by(label_name, [other_label,] EXPR) #standard deviation
- expression: mean_by(label_name, [other_label,] EXPR) #same as average
- expression: average_by(label_name, [other_label,] EXPR) #example mean load average
```

- groups all metrics in EXPR by the value of the label_name label and aggregates metrics in the same group into one metric
- accepts the parameter ignore_nan=False|True, just like ordinary sum

```
lines:
- expression: win_sum(window_size_in_seconds, EXPR)
- expression: win_mean(window_size_in_seconds, EXPR) #same as win_avg
- expression: win_min(window_size_in_seconds, EXPR)
- expression: win_max(window_size_in_seconds, EXPR)
- expression: win_std(window_size_in_seconds, EXPR)
- expression: win_avg(window_size_in_seconds, EXPR) #example mean load average on hour window
```

- applies the specified function sum|mean|min|max|std to each metric in EXPR over a moving time window of window_size_in_seconds. See Moving average.

```
lines:
- expression: cum_sum(EXPR) #example
```

- cumulative sum for each metric in EXPR

```
lines:
- expression: top(N, EXPR[, include_other=true|false][, by="exp"|"sum"|"max"]) #example top 5 processes by CPU
- expression: bottom(N, EXPR[, by="exp"|"sum"|"max"])
```

- shows the top|bottom N metrics from EXPR by ews|exp (exponentially weighted sum), sum or max over the current timespan

```
lines:
- expression: filter_with(EXPR, FILTER_EXPR) #example memory usage of long running processes
```

- filters metrics in EXPR, returning only those for which FILTER_EXPR is not zero (or NaN)

```
lines:
- expression: const(v[, label="value", ...]) #example
```

- constant metric with value v and additional labels for the legend

```
lines:
- expression: time()
```

- timestamp from the x-axis as the y-value

```
lines:
- expression: from_string("1,2,3,3,2,1,", [,repeat=false] [,sep=' '] [,label="value", ...]) #example
```

- constructs a metric from a string like "1,2,3,3,2,1,", where each number becomes the value of the metric for the corresponding minute

```
lines:
- expression: defined(EXPR) #example all processes
```

- 1 if there is data from EXPR at this time point, or 0 if there is NaN

```
lines:
- expression: replace(old_val, new_val, EXPR) #example
- expression: n2z(EXPR) #shortcut for "replace(nan, 0, EXPR)"
- expression: zero_if_none(EXPR) #shortcut for "replace(nan, 0, EXPR)"
- expression: z2n(EXPR) #shortcut for "replace(0, nan, EXPR)"
- expression: zero_if_negative(EXPR)
- expression: none_if_zero(EXPR) #shortcut for "replace(0, nan, EXPR)"
- expression: remove_below(EXPR, value)
- expression: remove_above(EXPR, value)
- expression: clamp_min(EXPR, min)
- expression: clamp_max(EXPR, max)
```

- sets new_val instead of old_val

```
lines:
- expression: sum_by(label, [other_label,] metric(..)) / max_by(label, [other_label,] metric(..))
- expression: sum_by(label, [other_label,] metric(..)) * sum_by(label, [other_label,] metric(..))
- expression: sum_by(label, [other_label,] metric(..)) - min_by(label, [other_label,] metric(..))
```

- if the labels for both XXX_by sides are the same, then it evaluates /, * or - for each pair of metrics (one from the left and one from the right metric)

```
lines:
- expression: sum_by(label, [other_label,] metric(..)) / EXPR
- expression: min_by(label, [other_label,] metric(..)) * EXPR
- expression: max_by(label, [other_label,] metric(..)) - EXPR
```

- applies / EXPR, * EXPR or - EXPR to each metric from the left XXX_by(label, ...)
Lines legend syntax:
```
lines:
- expression: metric(...)
  legend: '%s'
```

- for each line, shows all label_name:label_value pairs in the legend

```
lines:
- expression: metric(...)
  legend: '%(label_name)s anything'
```

- for each line, shows `label_value` anything in the legend
Colors syntax:
```
lines:
- expression: metric(...)
  color: '#81ff22'
- expression: metric(...)
  color: 'red'
```

- color is color

```
lines:
- expression: metric(...)
  colors: ['#80AB00', 'red', 'rgb(127,0,20)', 'hsla(100,10%,20%,0.8)']
```

- will cycle through the specified colors

```
lines:
- expression: metric(...)
  colors:
    /regex.*/: '#fff'
    /regex2/: 'gold'
```

- will match legends against the regexes

```
lines:
- expression: metric(...)
  options:
    colors: semaphore # OR colors: semaphore inv
```

- will color all previously uncolored lines with a gradient from red to green
- or from green to red if semaphore inv
Sorting syntax:
```
lines:
- expression: metric(...)
  options:
    sort: alpha # or num
```

- sorts all lines by legend in alphabetical or numeric (default) order

```
lines:
- expression: metric(...)
  options:
    sort: ['fixed', 'order', 'for', 'legend', 'items']
```

- fixed sort order by an item's legend

```
lines:
- expression: metric(...)
  options:
    sort: ...
    order: DESC
```

- changes the sort order to descending
Captions:
```
lines:
- expression: metric(...)
- expression: metric(...)
title: 'some %(label_name)s'
```

- formats the chart title with labels from all expressions combined

```
lines:
- expression: metric(...)
- expression: metric(...)
options:
  y_title: 'some text'
```

- Y-axis vertical title as plain text
Alerting
Alerts are an essential part of a monitoring system. You want to know if something bad is going on.
Okmeter provides you with a large list of predefined triggers. And of course, you can define your own.
Alerting consists of two parts: trigger configs (aka alerting rules) and notification configs.
It is highly recommended to test notifications after changing triggers or notifications settings, to make sure that alerting works as expected.
You can temporarily alter a trigger to make it fire, for example, by changing its threshold.
Triggers
The following is an example of a trigger. It checks the error rate in Nginx access logs:

```
expression: 'sum(n2z(metric(name="nginx.requests.rate", status="5*")))'
threshold: '>= 1'
severity: critical
message: '5xx nginx %(value).1f per sec'
notify_after: 120
resolve_after: 180
notification_config: ops
```
expression
The expression describes how to calculate the monitored value.
An expression consists of a metric selector (metric(name="..", label_foo="..")) and some math operations.
Please refer to the query language section for more info.
threshold
The threshold defines the condition in which the trigger should fire.
Valid operations are <, <=, > and >=.
severity
Possible values are critical, warning and info.
Triggers with severity info aren't meant to be sent anywhere.
message
The message is a text describing the problem. You can include the expression value and labels using a simple formatting language. For example, if you want to know the particular error code, you can alter the nginx trigger above:

```
expression: 'sum_by(status, n2z(metric(name="nginx.requests.rate", status="5*")))'
message: '%(status)s nginx %(value).1f per sec'
```
notify_after / resolve_after
To ignore short spikes, you can define the notify_after and resolve_after settings.
The expected value is a number of seconds.
notification_config
When the error rate is high enough, you will receive an alert according to the notification config ops.
You don't have to specify a notification config for every trigger. You can configure one named default; it will be used by default.
You don't have to create any notification config at all if you want to be notified by email or SMS only.
To disable notifications on a particular trigger, you can configure notification_config: off.
Notifications
The most comprehensive notification config can look like this:

```
renotify_interval: 600
notify_resolve: on
oncalls: # email and sms
  - admin_1
  - admin_2
  - boss
slack_url: https://hooks.slack.com/services/XXXX/XXXX/XXXXXXXX
slack_channel: '#ops'
telegram_chat_id: -XXXXXXXX
opsgenie:
  api_key: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
prometheus:
  url: http://HOST:PORT/api/v1/alerts
  basic_auth:
    username: XXXX
    password: XXXX
```
renotify_interval
Resend notifications if an alert isn't acknowledged or resolved. The expected value is a number of seconds.
notify_resolve
Send a notification when an outage is resolved.
oncalls
Email addresses and cell phone numbers can be defined on the contacts page.
A notification config refers to contacts in the oncalls section by contact Name.
If no notification configs are defined, alerts will be sent to every enabled contact.
slack
To enable Slack notifications you should create an Incoming WebHook in your Slack settings, and put the webhook URL into slack_url. Additionally, you can specify slack_channel.
telegram
Notifications to Telegram can be enabled by adding our @OkmeterBot to your chat or group.
To determine telegram_chat_id you can call the /chat_id@OkmeterBot command (the command will work only after @OkmeterBot has been added to your chat or group).
opsgenie
To push alerts to Opsgenie, please create a new API (Rest API over JSON) integration in your Opsgenie settings, and configure api_key from that integration.
If you are using the EU Opsgenie instance, please add api_url: api.eu.opsgenie.com.
prometheus
To push alerts to Prometheus Alertmanager, please configure url and basic_auth (if applicable).
Uninstall Okagent
- Stop the agent process:
  - sudo /etc/init.d/okagent stop, or
  - sudo service okagent stop, or
  - sudo systemctl stop okagent.service && sudo systemctl disable okagent.service
- Remove the agent files and init script:
  - sudo rm -rf /usr/local/okagent /etc/init.d/okagent
- Resolve the "no heartbeat" alert.
- After 24h you'll be able to remove this host from the host list.
- Done
Support
Feel free to ask us anything.