Custom metrics on Datadog with Python

Datadog — Monitoring system

Datadog is a leader in the SRE field: it provides a SaaS monitoring platform useful for watching over infrastructures through its Datadog agent and native system integrations.

Monitoring systems need an agent installed on the cluster in order to collect and keep track of metrics and logs, and to execute aggregation operations before sending them out.

The idea behind a monitoring system such as Datadog is to deploy a component, the agent, on the machines we need to control. The agent automatically sends data to the Datadog platform, and organisations can later log in to the platform and build dashboards on the gathered data to summarise the system or check service status.

The principal entity and concept when dealing with monitoring is the metric, which represents a measurement.

There are several kinds of metrics in Datadog, and the concepts behind them can be quite confusing since they are used to represent different data. Furthermore, the process of identifying the important metrics for our system can be tricky and far from trivial.

However, in this article I'm not going to dive too deep into metric types; I would rather propose an overview and a simple use case on how to send custom metrics to Datadog from a Python application. Note that Datadog is a paid service, but you can sign up for a 14-day trial to get familiar with it.

Installing the agent

Of course, the first step is to install the agent. You can also try it locally: by default it listens on port 8125 on localhost to collect metrics from other applications. Detailed installation instructions can be found on the integration page once you are logged in with your Datadog account.

Custom metrics — Count metrics vs Gauge metrics

Once installed, the agent is able to collect system metrics by default, such as:

  • system.cpu.user
  • system.disk.free
  • system.network.rcvd

so in cases where only system metrics need to be monitored, we are done.

But what if we want to collect both system metrics and data about our running applications (application performance metrics)?

We need custom metrics! They are ad-hoc metrics that are sent to the Datadog agent and later sent out exactly as if they were pre-integrated system metrics! Of course, we need a bit of analysis and tuning to integrate our personalised metrics into our applications, such as understanding the kind of metric, the type of dashboard needed, and so on.

Common application performance metrics are:

- HTTP Error % – Number of web requests that ended in an error

- Request Rate

- Average Response Time

- Thrown Exceptions – Number of all exceptions that have been thrown

Two of the most used metric types are count and gauge. They represent quite similar kinds of data, in particular they are useful to picture numeric data at a certain point in time, but they differ in how they send data to Datadog.

We know the agent samples each measurement at scheduled intervals, but it sends the collected data to the backend only after a predefined flush interval, which is 10 seconds by default. So the difference between gauge and count metrics lies in how datapoints are considered and sent out within a flush interval (a small illustration follows the list below):

  • Count metric: datapoints are added together and the sum of the values inside each flush interval is sent. This is useful in use cases where we need to count the frequency of something, for example the number of times an endpoint or a function has been called. In a flush interval, an API could be called hundreds of times and we need the total number of times it has been invoked:

Ex: [2,2,2,2,2,2,2,2,2,2]=20

Each second the value is captured and stored; when the flush interval ends, the values are summed up and sent out.

  • Gauge metric: each sample of a measurement is considered as a standalone value and only the last sample inside an interval is sent. No additions are involved; this is useful to describe scenarios where we need a snapshot value of a certain measure and we don't care about summing samples up. For example, if we want to check the disk usage of a machine, this is the right metric to use.

Ex: [2,3,7,8,5,1,7,4,12,9]=9
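To make the difference concrete, here is a tiny plain-Python illustration of the aggregation arithmetic above (this is not Datadog code, just the sum and last-sample rules applied to the two example arrays):

# Samples collected within a single 10-second flush interval.
count_samples = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
gauge_samples = [2, 3, 7, 8, 5, 1, 7, 4, 12, 9]

print(sum(count_samples))   # count metric: samples are summed -> 20
print(gauge_samples[-1])    # gauge metric: only the last sample is sent -> 9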

Metrics in Python

So far, we have installed the agent and we know a bit about metrics, but how do we use them to measure our application? For example, what if we want to count the number of times a function has been called? Or the number of records that have been inserted?

There are several ways of submitting metrics to the agent. I'm using the one that interacts with DogStatsD, a component located inside the Datadog agent itself. Depending on the type of metric we are sampling, it exposes several functions, which are mainly based on the same parameters:

  • METRIC_NAME : represents the name of the metric we want to submit
  • METRIC_VALUE: the value associated with our metric
  • SAMPLE_RATE: (optional) used to tune the fraction of samples we want to send to DogStatsD
  • TAGS: (optional) tags of the metric

This is a very basic snippet explaining how to insert custom metrics into your Python code:

For count type metrics:
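A minimal sketch of such a snippet, assuming the datadog Python package (which ships the DogStatsD client), an agent listening on localhost:8125, and a placeholder metric name example_metric.increment with an environment:dev tag, could look like this:

# Count metric sketch: assumes the "datadog" package and a local agent with
# DogStatsD listening on 127.0.0.1:8125; metric name and tag are placeholders.
import time

from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

while True:
    # Add 1 to the count; all increments within a flush interval are summed.
    statsd.increment("example_metric.increment", tags=["environment:dev"])
    time.sleep(10)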

In this case, the interval at which our metric is sampled is given by the parameter:

time.sleep(10)

which is set to 10 seconds here since it coincides with the flush interval of the Datadog agent.

Try setting it to different values, such as 1, and you'll notice the metric is incremented 10 times within a single flush interval.

For gauge type metrics, we can instead use:
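Again a sketch under the same assumptions (datadog package, local agent, placeholder metric name and tag), this time incrementing a local counter and submitting its current value as a gauge every second:

# Gauge metric sketch: same assumptions as the count example above.
import time

from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

value = 1
while True:
    # Submit the current value; only the last sample of each flush interval is sent.
    statsd.gauge("example_metric.gauge", value, tags=["environment:dev"])
    value += 1
    time.sleep(1)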

which is going to shape the data like this:

at the beginning the metric's value is 1;

after one flush interval, the array of collected values is [2,3,4,5,6,7,8,9,10,11] and the corresponding value describing that interval is the last one, hence 11.

Conclusion

It is easy to notice that the count value is reset at each flush interval, so when plotted every bar reaches the same maximum value. The gauge value is instead plotted as a continuous function, with no separation between individual flush intervals.

In this short journey I've just arranged a simple demo of the principal concepts and how things work. In real use cases the sleep function won't be needed, since the metric will be incremented differently, as part of the application's own logic.
