These are automatic changes from importing/exporting from Grafana
12.3.1.
In order to verify that I'm not sneaking in any changes, you can follow
these steps to get the same output.
Reproduction instructions:
1. Start [Grafana](https://hub.docker.com/r/grafana/grafana)
```
docker run -d --name=grafana \
  --add-host host.docker.internal:host-gateway \
  -p 3000:3000 \
  grafana/grafana
```
1. Visit the Grafana dashboard, http://localhost:3000/ (Credentials:
`admin`/`admin`)
1. Import the Synapse dashboard: `contrib/grafana/synapse.json`
1. Export the Synapse dashboard. On the dashboard page -> **Export** ->
**Export as code** -> Using the **Classic** model -> Check **Export for
sharing externally** -> Copy
1. Paste into `contrib/grafana/synapse.json`
1. `git status`/`git diff` to check if there is any diff
I sanity-checked the dashboard itself by importing it on
https://grafana.matrix.org/ (Grafana 10.4.1 according to
https://grafana.matrix.org/api/health). The process-level metrics won't
work there because https://github.com/element-hq/synapse/pull/19337 only
just merged and isn't deployed on `matrix.org` yet. More generally, this
dashboard works for me locally with the
[load-tests](https://github.com/element-hq/synapse-rust-apps/pull/397)
I've been doing.
### Motivation
There are a few fixes I want to make to the Grafana dashboard, and it sucks
having to manually translate everything back over because our checked-in
JSON is formatted differently from what Grafana exports.
Hopefully, after this bulk change, future exports will produce diffs that
contain exactly the changes we intend to make.
- Update `synapse_xxx` (server-level) metrics to use
`server_name="$server_name",` instead of `instance="$instance"`
- Add `synapse_server_name_info` metric to map Synapse `server_name`s to
the `instance`s they're hosted on.
- For process-level metrics, update to use `xxx * on (instance, job,
index) group_left(server_name)
synapse_server_name_info{server_name="$server_name"}` (see the example
queries below)
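To make the intended query shapes concrete, here is a rough PromQL sketch. `synapse_xxx` and `process_cpu_seconds_total` are stand-ins for the server-level and process-level metrics respectively, and `$bucket_size` is assumed to be the dashboard's usual range variable; the authoritative expressions are the ones in `contrib/grafana/synapse.json`.
```
# Before: server-level metrics selected via the (mis)used `instance` label
rate(synapse_xxx{instance="$instance"}[$bucket_size])

# After: server-level metrics selected via the homeserver's `server_name` label
rate(synapse_xxx{server_name="$server_name"}[$bucket_size])

# After: process-level metrics (which only carry `instance`/`job`/`index`) are
# joined against `synapse_server_name_info` to pick out the processes hosting
# the selected homeserver
process_cpu_seconds_total
  * on (instance, job, index) group_left (server_name)
    synapse_server_name_info{server_name="$server_name"}
```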
All of the changes here are backwards compatible with whatever people
were doing before with their Prometheus/Grafana dashboards.
Previously, the recommendation was to use the `instance` label to group
everything under the same server (803e4b4d88/docs/metrics-howto.md (L93-L147)).
But the `instance` label has a special meaning in Prometheus, and we're
abusing it by using it that way:
> `instance`: The `<host>:<port>` part of the target's URL that was scraped.
>
> *-- https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series*
Since https://github.com/element-hq/synapse/issues/18592 (Synapse
`v1.139.0`), we now have the `server_name` label to use instead.
---
Additionally, the assumption that a single process is serving a single
server is no longer true with [Synapse Pro for small
hosts](https://docs.element.io/latest/element-server-suite-pro/synapse-pro-for-small-hosts/overview/).
Part of https://github.com/element-hq/synapse-small-hosts/issues/106
### Motivating use case
Although this change also benefits [Synapse Pro for small
hosts](https://docs.element.io/latest/element-server-suite-pro/synapse-pro-for-small-hosts/overview/)
(https://github.com/element-hq/synapse-small-hosts/issues/106), this is
actually spawning from adding Prometheus metrics to our workerized
Docker image (https://github.com/element-hq/synapse/pull/19324,
https://github.com/element-hq/synapse/pull/19336) with a more correct
label setup (without `instance`) and wanting the dashboard to be better.
### Testing strategy
1. Make sure your firewall allows the Docker containers to communicate
with the host (`host.docker.internal`) so they can reach exposed ports of
other Docker containers. We want the Prometheus container to be able to
reach Synapse's metrics listener, and the Grafana container to be able to
reach Prometheus.
- `sudo ufw allow in on docker0 comment "Allow traffic from the default
Docker network to the host machine (host.docker.internal)"`
- `sudo ufw allow in on br-+ comment "(from Matrix Complement testing)
Allow traffic from custom Docker networks to the host machine
(host.docker.internal)"`
- [Complement firewall
docs](ee6acd9154/README.md (potential-conflict-with-firewall-software))
1. Build the Docker image for Synapse: `docker build -t
matrixdotorg/synapse -f docker/Dockerfile .`
([docs](7a24fafbc3/docker/README-testing.md (building-and-running-the-images-manually)))
1. Generate config for Synapse:
```
docker run -it --rm \
--mount type=volume,src=synapse-data,dst=/data \
-e SYNAPSE_SERVER_NAME=my.docker.synapse.server \
-e SYNAPSE_REPORT_STATS=yes \
-e SYNAPSE_ENABLE_METRICS=1 \
matrixdotorg/synapse:latest generate
```
1. Start Synapse:
```
docker run -d --name synapse \
--mount type=volume,src=synapse-data,dst=/data \
-p 8008:8008 \
-p 19090:19090 \
matrixdotorg/synapse:latest
```
1. You should be able to see metrics from Synapse at
http://localhost:19090/_synapse/metrics
1. Create a Prometheus config (`prometheus.yml`)
```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: prometheus
    scrape_interval: 15s
    metrics_path: /_synapse/metrics
    scheme: http
    static_configs:
      - targets:
          # This should point to the Synapse metrics listener (we're using
          # `host.docker.internal` because this is from within the Prometheus
          # container)
          - host.docker.internal:19090
```
1. Start Prometheus (update the volume bind mount to point at the config
you just saved):
```
docker run \
--detach \
--name=prometheus \
--add-host host.docker.internal:host-gateway \
-p 9090:9090 \
-v ~/Documents/code/random/prometheus-config/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
```
1. Make sure you're seeing some data in Prometheus. On
http://localhost:9090/query, search for `synapse_build_info`
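If you also want to confirm the mapping metric that the process-level panels rely on, you can run queries like these in the Prometheus query UI (the `server_name` value assumes the `SYNAPSE_SERVER_NAME` from the generate step above, and `synapse_server_name_info` requires a recent enough Synapse, see above):
```
# Confirm Synapse is being scraped at all
synapse_build_info

# Confirm the `server_name` -> `instance` mapping metric is exposed
synapse_server_name_info{server_name="my.docker.synapse.server"}
```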
1. Start [Grafana](https://hub.docker.com/r/grafana/grafana)
```
docker run -d --name=grafana \
  --add-host host.docker.internal:host-gateway \
  -p 3000:3000 \
  grafana/grafana
```
1. Visit the Grafana dashboard, http://localhost:3000/ (Credentials:
`admin`/`admin`)
1. **Connections** -> **Data Sources** -> **Add data source** ->
**Prometheus**
- Prometheus server URL: `http://host.docker.internal:9090`
1. Import the Synapse dashboard: `contrib/grafana/synapse.json`
To test workers, you can use the testing strategy from
https://github.com/element-hq/synapse/pull/19336 (assumes both changes
from this PR and the other PR are combined)
---

Bulk refactor `Gauge` metrics to be homeserver-scoped. We also add lints
to make sure that new `Gauge` metrics don't sneak in without using the
`server_name` label (`SERVER_NAME_LABEL`).
Part of https://github.com/element-hq/synapse/issues/18592
### Testing strategy
1. Add a `metrics` listener to your `homeserver.yaml`:
```yaml
listeners:
  # This is just showing how to configure metrics either way
  #
  # `http` `metrics` resource
  - port: 9322
    type: http
    bind_addresses: ['127.0.0.1']
    resources:
      - names: [metrics]
        compress: false
  # `metrics` listener
  - port: 9323
    type: metrics
    bind_addresses: ['127.0.0.1']
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` and/or
`http://localhost:9323/metrics`
1. Observe that the response includes the refactored `Gauge` metrics with
the `server_name` label
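For example, a homeserver-scoped `Gauge` should now show up in the output along these lines (the metric name is purely illustrative; the `server_name` value will be whatever is configured in your `homeserver.yaml`):
```
# TYPE synapse_example_pending_things gauge
synapse_example_pending_things{server_name="example.com"} 3.0
```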
### Pull Request Checklist
<!-- Please read
https://element-hq.github.io/synapse/latest/development/contributing_guide.html
before submitting your pull request -->
* [x] Pull request is based on the develop branch
* [x] Pull request includes a [changelog
file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog).
The entry should:
- Be a short description of your change which makes sense to users.
"Fixed a bug that prevented receiving messages from other servers."
instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
- Use markdown where necessary, mostly for `code blocks`.
- End with either a period (.) or an exclamation mark (!).
- Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by
@github_username." or "Contributed by [Your Name]." to the end of the
entry.
* [x] [Code
style](https://element-hq.github.io/synapse/latest/code_style.html) is
correct (run the
[linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))
(Applies to the Grafana graphs)
As discovered by @devonh, we use `synapse_storage_events_persisted_events_total` (which tracks *all* persisted events) for the "Events" rate in the "Event Send Time Quantiles" graph. This is pretty misleading, as the graph title suggests it should be the rate of events being *sent*.
Since the event persistence queues are shared between local events and remote events from federation (so remote traffic can block local events from being sent), I think it still makes sense to show the event persist rate here. I've updated the graph to include both the rate of "Local events being persisted" and the rate of "All events being persisted". I think this properly disambiguates and clarifies what the graph is trying to show.
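As a rough PromQL sketch of the two series (the exact expressions live in the dashboard JSON; the metric used for the local-only series and its `origin_type` label are assumptions on my part, and the `$instance`/`$bucket_size` selectors follow the dashboard's existing conventions):
```
# "All events being persisted"
sum(rate(synapse_storage_events_persisted_events_total{instance="$instance"}[$bucket_size]))

# "Local events being persisted" (assumed per-origin variant of the metric above)
sum(rate(synapse_storage_events_persisted_events_sep_total{origin_type="local",instance="$instance"}[$bucket_size]))
```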
Fix https://github.com/matrix-org/synapse/issues/15662
This manifests as purple annotation lines on all of the time series panels;
you can hover over one to see which version was deployed.
I also added a new "Deployed Synapse versions over time" panel where the
color block changes with each version, and mixed this color block into the
"Up" time series panel.
To get the Grafana dashboard JSON to copy here: use the **Share** icon at the top -> **Export** -> check the **Export for sharing externally** option -> **View JSON** or **Save to file**