Basic prometheus setup

    I've been playing with prometheus for monitoring. It feels quite familiar to me because it's based on an internal Google technology called Borgmon, but I suspect that means it feels really weird to everyone else.

    The first thing to realize is that everything at Google is a web server. Your short-lived tool that copies some files around probably runs a web server. All of these web servers have built-in URLs which report the progress and status of the task at hand. Prometheus is built to: scrape those web servers; aggregate the data; store the data into a time series database; and then perform dashboarding, trending and alerting on that data.
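
    To make the "everything is a web server" idea concrete, here is a minimal sketch of a tool that exports its own progress over HTTP, using only the Python standard library. The metric name files_copied_total and the FILES_COPIED counter are made up for illustration; real programs would more likely use an official client library, so treat this as a sketch of the idea rather than the recommended way to do it.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical state for illustration: how many files this tool has copied.
FILES_COPIED = {"count": 0}


def render_metrics():
    """Render the current state in the Prometheus text exposition format."""
    return (
        "# HELP files_copied_total Files copied so far.\n"
        "# TYPE files_copied_total counter\n"
        "files_copied_total %d\n" % FILES_COPIED["count"]
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_metrics_server(port=0):
    """Serve metrics from a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

    The tool would call start_metrics_server() once at startup and then bump FILES_COPIED["count"] as it works. Anything that fetches /metrics sees the current value.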

    The most basic example is to just export metrics for each machine on my home network. This is the easiest first step, because we don't need to build any software to do this. First off, let's install node_exporter on each machine. node_exporter is the tool which runs a web server to export metrics for each node. Everything in prometheus land is written in Go, which is new to me. However, it does make running node_exporter easy -- just grab the relevant binary for your OS and architecture from https://prometheus.io/download/, untar, and run. Let's do it in a command line script example thing:

    $ wget https://github.com/prometheus/node_exporter/releases/download/v0.14.0-rc.1/node_exporter-0.14.0-rc.1.linux-386.tar.gz
    $ tar xvzf node_exporter-0.14.0-rc.1.linux-386.tar.gz
    $ cd node_exporter-0.14.0-rc.1.linux-386
    $ ./node_exporter
    


    That's all it takes to run the node_exporter. This runs a web server on port 9100, which exposes metrics like this one:

    $ curl -s http://localhost:9100/metrics | grep filesystem_free | grep 'mountpoint="/data"'
    node_filesystem_free{device="/dev/mapper/raidvg-srvlv",fstype="xfs",mountpoint="/data"} 6.811044864e+11
    


    Here you can see that the system I'm running on is exporting a filesystem_free value for the filesystem mounted at /data. There's a lot more than that exported, and I'd encourage you to poke around at that URL a little before continuing on.
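
    If you want to do more than grep, each sample line is easy to pull apart: a metric name, an optional set of labels in braces, and a value. The sketch below is my own simplified parser, not anything shipped with prometheus. It ignores comment lines, does not unescape backslashes in label values, and assumes one sample per line.

```python
import re

# Matches: metric_name{label="value",...} 1.23e+4   (the labels are optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition line into (name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a sample line: %r" % line)
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)
```

    Feeding it the node_filesystem_free line from the curl output above gives the name, a dictionary with the device, fstype and mountpoint labels, and a value of roughly 681 GB of free space in bytes.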

    So that's lovely, but we really want to record that over time. So let's assume that you have one of those running on each of your machines, and that you have it set up to start on boot. I'll leave the details of that out of this post, but let's just say I used my existing puppet infrastructure.

    Now we need the central process which collects and records the values. That's the actual prometheus binary. Installation is again trivial:

    $ wget https://github.com/prometheus/prometheus/releases/download/v1.5.0/prometheus-1.5.0.linux-386.tar.gz
    $ tar xvzf prometheus-1.5.0.linux-386.tar.gz
    $ cd prometheus-1.5.0.linux-386
    


    Now we need to move some things around to install this nicely. I did the puppet equivalent of:

    • Moving the prometheus file to /usr/bin
    • Creating an /etc/prometheus directory and moving console_libraries and consoles into it
    • Creating a /etc/prometheus/prometheus.yml config file, more on the contents of this one in a second
    • And creating an empty data directory, in my case at /data/prometheus


    The config file needs to list all of your machines. I am sure this could be generated with puppet templating or something like that, but for now here's my simple hard-coded one:

    # my global config
    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).
    
      # Attach these labels to any time series or alerts when communicating with
      # external systems (federation, remote storage, Alertmanager).
      external_labels:
          monitor: 'stillhq'
    
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first.rules"
      # - "second.rules"
    
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=` to any timeseries scraped from this config.
      - job_name: 'prometheus'
    
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
    
        static_configs:
          - targets: ['molokai:9090']
    
      - job_name: 'node'
        static_configs:
          - targets: ['molokai:9100', 'dell:9100', 'eeebox:9100']
    


    Here you can see that I want to scrape each of my metrics-exporting web servers every 15 seconds, and I also want to evaluate rules (such as the ones which fire alerts) every 15 seconds too. This might not scale if you have bajillions of processes or machines to monitor. I also label all of my values as coming from my domain, so that if I ever aggregate these values with another prometheus from somewhere else the origin will be clear.
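
    To get a feel for why the interval matters, here is a back-of-the-envelope helper. It only counts scrapes, and the function name is mine. How many samples each scrape returns depends on the exporter, so I've left that out rather than guess.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def scrapes_per_day(interval_seconds, target_count):
    """Number of HTTP scrapes per day for a given interval and target count."""
    return (SECONDS_PER_DAY // interval_seconds) * target_count
```

    With a 15 second interval and the four targets in the config above (prometheus itself plus three node_exporters), that is 5760 scrapes per target and 23040 scrapes in total each day. That's trivial for a home network, but it grows linearly with the number of targets.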

    The other interesting bit for now is the scrape configuration. This lists the metrics exporters to monitor. In this case it's prometheus itself (molokai:9090), and then each of my machines in the home lab (molokai, dell, and eeebox -- all on port 9100). Remember, port 9090 is the prometheus binary itself and port 9100 is that node_exporter binary we now have running on all of our machines.
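
    The hard-coded target lists are the part that would be worth generating if the number of machines grew. As a rough sketch of that (plain string formatting here, not the puppet templating I'd probably really use), the function below renders just the scrape_configs section from a list of hostnames. The function name and its default ports are my own choices for illustration.

```python
def render_scrape_configs(prometheus_host, node_hosts,
                          prometheus_port=9090, node_port=9100):
    """Render a scrape_configs section with one prometheus job and one node job."""
    node_targets = ", ".join("'%s:%d'" % (host, node_port) for host in node_hosts)
    lines = [
        "scrape_configs:",
        "  - job_name: 'prometheus'",
        "    static_configs:",
        "      - targets: ['%s:%d']" % (prometheus_host, prometheus_port),
        "",
        "  - job_name: 'node'",
        "    static_configs:",
        "      - targets: [%s]" % node_targets,
    ]
    return "\n".join(lines) + "\n"
```

    Calling render_scrape_configs('molokai', ['molokai', 'dell', 'eeebox']) produces the same two jobs as the hand-written config, minus the comments.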

    Now if we start prometheus, it will do its thing. There is some configuration which needs to be passed on the command line here (instead of in the configuration file), so my command line looks like this:

    /usr/bin/prometheus -config.file=/etc/prometheus/prometheus.yml \
        -web.console.libraries=/etc/prometheus/console_libraries \
        -web.console.templates=/etc/prometheus/consoles \
        -storage.local.path=/data/prometheus
    


    Prometheus also presents an interactive user interface on port 9090, which is handy. Here's an example of it graphing the load average on each of my machines (it was something which caused a nice jaggy line):



    You can see here that the user interface has a drop-down for selecting values that are known, and that the key at the bottom tells you things about each time series in the graph. So for example, if we added {instance="eeebox:9100"} to the end of the value in the text box at the top, then we'd be filtering for values with that label set, and would as a result only show one value in the graph (the one for eeebox).
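
    The same label filtering works outside the browser, because prometheus also serves queries over HTTP at /api/v1/query. Here is a small sketch which builds a selector and fetches the current value. I'm assuming node_load1 is the load average metric being graphed, and the helper names are mine. Label values are not escaped, so this is only safe for simple values like hostnames.

```python
import json
import urllib.parse
import urllib.request


def selector(metric, **labels):
    """Build a selector such as node_load1{instance="eeebox:9100"}."""
    if not labels:
        return metric
    pairs = ",".join('%s="%s"' % (key, value)
                     for key, value in sorted(labels.items()))
    return "%s{%s}" % (metric, pairs)


def query_url(base_url, expression):
    """URL for an instant query against the prometheus HTTP API."""
    return "%s/api/v1/query?%s" % (
        base_url.rstrip("/"),
        urllib.parse.urlencode({"query": expression}),
    )


def instant_query(base_url, expression):
    """Run the query and return the list of matching time series."""
    with urllib.request.urlopen(query_url(base_url, expression)) as response:
        return json.load(response)["data"]["result"]
```

    For my setup that would be instant_query('http://molokai:9090', selector('node_load1', instance='eeebox:9100')), which should return a single series for eeebox.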

    If you're interested in very simple dashboarding of basic system metrics, that's actually all you need to do. In my next post about prometheus I'm going to show how to write your own binary which exports values to be graphed. In my case, the temperature outside my house.

posted at: 21:23 | path: /prometheus
