# Monitoring
For a complex application like Dagger, monitoring plays an important role. Monitoring goes hand in hand with observability, which is a prerequisite for troubleshooting and performance tuning. This section gives a brief overview of Dagger's monitoring stack and explains some important panels from the pre-built dashboards which might help you run your Daggers in production.
## Metrics Stack
We use Flink's inbuilt metrics reporter to publish application metrics to one of the supported sinks. Other reporters, such as JMX, can also be enabled in Flink. Find more details on Flink's metrics reporting and the supported sinks here. To register new application metrics from the Dagger codebase, follow this.
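For illustration, here is a minimal sketch of how a custom metric could be registered from a Flink function. The class and metric names below are hypothetical and not part of Dagger's codebase; whichever reporter is configured in `flink-conf.yaml` (e.g. a Prometheus or statsd reporter) ships the metric to the sink.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

// Hypothetical operator: counts the records it processes and exposes the
// count as a custom metric via the operator's metric group.
public class ProcessedRecordsCounter extends RichMapFunction<String, String> {

    private transient Counter processedRecords;

    @Override
    public void open(Configuration parameters) {
        // Register the counter on the operator's metric group; the configured
        // metrics reporter picks it up automatically.
        processedRecords = getRuntimeContext()
                .getMetricGroup()
                .counter("processedRecords");
    }

    @Override
    public String map(String value) {
        processedRecords.inc();
        return value;
    }
}
```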
We have also included a custom Grafana dashboard for Dagger-related metrics. Follow this to import the dashboard.
## Dagger Dashboard
This section gives an overview of the important panels/titles in the Dagger dashboard. Find more about all the panels here.
### Overview
Shows the basic, mostly self-explanatory status of a Dagger: running since, number of restarts, downtime, and so on.
Some other important panels in this section are listed below.
- **Max Kafka Consumer Lag**: Shows the maximum consumer lag of the Kafka consumer.
- **Late Records Dropped**: The number of records a task has dropped due to arriving late.
- **Records Rate**: Input - the number of records the Dagger receives per second. Output - the number of records a task sends per second.
### Kafka Consumer
All the metrics related to the Kafka consumer. These help you debug issues on the consumer side. Some of the important panels here are:
- **Records Consumed Rate /second**: The average number of records consumed per second.
- **Bytes Consumed Rate /second**: The average number of bytes consumed per second.
- **Commit Rate /second**: The average number of offset commits per second.
### Input Stream
Shows some basic information about the input Kafka streams, such as topics, protos, and Kafka clusters.
### Exception
Shows the warnings/errors that could be causing issues in the Dagger jobs.
- **Fatal Exceptions**: Cause the job to restart.
- **Warnings**: Potential issues that do not restart the job.
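How quickly a job comes back after a fatal exception is governed by Flink's restart strategy. A rough sketch using the plain Flink API (not Dagger-specific configuration; the attempt count and delay are placeholder values):

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Retry the job up to 3 times, waiting 10 seconds between attempts.
        // Once the attempts are exhausted the job fails for good and shows up
        // as downtime in the Overview panel.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));
    }
}
```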
### Output Stream
Shows some basic information about the output Kafka stream, such as topic, proto, and Kafka cluster. Only gets populated for a Kafka-sink Dagger.
### UDF
Lists the names of the UDFs used in the SQL query. Also has the Dart-related metrics for a Dart Dagger.
### Post Processors
Lists all the Post Processors in a Dagger and some of the crucial metrics related to Post Processors.
- **Total External Calls Rate**: Total calls to the external source per minute. The source can be an HTTP endpoint, Elasticsearch, or PostgresDB.
- **Success Response Time**: Time taken by the external client to send the response back to the Dagger for a successful event.
- **Success Rate**: Number of successful events per minute.
- **Total Failure Request Rate**: Number of failed events per minute.
- **Failure Rate on XXX calls**: Number of failed events per minute due to an XXX response code.
- **Timeouts Rate**: Number of timeouts per minute from the external source (an HTTP endpoint, Elasticsearch, or PostgresDB, depending on the Post Processor type).
- **Close Connection On Client Rate**: Number of times the connection to the external client is closed per minute.
### Longbow
Lists some important information related to the Longbow Read and Write Daggers.
### Checkpointing
To make its state fault-tolerant, Flink needs to checkpoint the state. Checkpoints allow Flink to recover state and positions in the streams, giving the application the same semantics as a failure-free execution. A Dagger job periodically checkpoints to a GCS bucket. This section has all the checkpointing details; a minimal configuration sketch follows the panel list below.
- **Number of checkpoints**: Total number of checkpoints triggered since the Dagger started.
- **Number of failed checkpoints**: Total number of failed checkpoints since the Dagger started. If the number of failed checkpoints is high, the Dagger will also restart.
- **Number of in-progress checkpoints**: Number of checkpoints currently in progress.
- **Last checkpoint size**: The total size of the state that needs to be checkpointed. For a larger state size (> 500 MB), one needs to increase the parallelism.
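As a reference, the sketch below shows how checkpointing is typically enabled in a plain Flink job using the checkpoint storage API available in recent Flink versions (1.13+). The interval and bucket path are placeholders, not Dagger's actual configuration, and the `gs://` scheme assumes the GCS filesystem plugin is installed.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 30 seconds with exactly-once guarantees.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

        // Persist checkpoints to an object store; the bucket path below is a
        // placeholder for illustration only.
        env.getCheckpointConfig().setCheckpointStorage("gs://my-dagger-checkpoints/my-job");
    }
}
```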