Refactor `GaugeBucketCollector` metrics to be homeserver-scoped
Part of https://github.com/element-hq/synapse/issues/18592
### Testing strategy
1. Add the `metrics` listener in your `homeserver.yaml`
```yaml
listeners:
# This is just showing how to configure metrics either way
#
# `http` `metrics` resource
- port: 9322
type: http
bind_addresses: ['127.0.0.1']
resources:
- names: [metrics]
compress: false
# `metrics` listener
- port: 9323
type: metrics
bind_addresses: ['127.0.0.1']
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` and/or
`http://localhost:9323/metrics`
1. Adjust the number of [`msecs` in the `looping_call` so that
`_read_forward_extremities`](a82b8a966a/synapse/storage/databases/main/metrics.py (L79))
runs immediately instead of after an hour.
1. Observe response includes the `synapse_forward_extremities` and
`synapse_excess_extremity_events` metrics with the `server_name` label
Bulk refactor `Gauge` metrics to be homeserver-scoped. We also add lints
to make sure that new `Gauge` metrics don't sneak in without using the
`server_name` label (`SERVER_NAME_LABEL`).
Part of https://github.com/element-hq/synapse/issues/18592
### Testing strategy
1. Add the `metrics` listener in your `homeserver.yaml`
```yaml
listeners:
# This is just showing how to configure metrics either way
#
# `http` `metrics` resource
- port: 9322
type: http
bind_addresses: ['127.0.0.1']
resources:
- names: [metrics]
compress: false
# `metrics` listener
- port: 9323
type: metrics
bind_addresses: ['127.0.0.1']
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` and/or
`http://localhost:9323/metrics`
1. Observe response includes the TODO metrics with the `server_name`
label
### Pull Request Checklist
<!-- Please read
https://element-hq.github.io/synapse/latest/development/contributing_guide.html
before submitting your pull request -->
* [x] Pull request is based on the develop branch
* [x] Pull request includes a [changelog
file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog).
The entry should:
- Be a short description of your change which makes sense to users.
"Fixed a bug that prevented receiving messages from other servers."
instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
- Use markdown where necessary, mostly for `code blocks`.
- End with either a period (.) or an exclamation mark (!).
- Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by
@github_username." or "Contributed by [Your Name]." to the end of the
entry.
* [x] [Code
style](https://element-hq.github.io/synapse/latest/code_style.html) is
correct (run the
[linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))
Bulk refactor `Counter` metrics to be homeserver-scoped. We also add
lints to make sure that new `Counter` metrics don't sneak in without
using the `server_name` label (`SERVER_NAME_LABEL`).
All of the "Fill in" commits are just bulk refactor.
Part of https://github.com/element-hq/synapse/issues/18592
### Testing strategy
1. Add the `metrics` listener in your `homeserver.yaml`
```yaml
listeners:
# This is just showing how to configure metrics either way
#
# `http` `metrics` resource
- port: 9322
type: http
bind_addresses: ['127.0.0.1']
resources:
- names: [metrics]
compress: false
# `metrics` listener
- port: 9323
type: metrics
bind_addresses: ['127.0.0.1']
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` and/or
`http://localhost:9323/metrics`
1. Observe response includes the `synapse_user_registrations_total`,
`synapse_http_server_response_count_total`, etc metrics with the
`server_name` label
Default values will be 1 room per minute, with a burst count of 10.
It's hard to imagine most users will be affected by this default rate,
but it's intentionally non-invasive in case of bots or other users that
need to create rooms at a large rate.
Server admins might want to down-tune this on their deployments.
---------
Signed-off-by: Olivier 'reivilibre <oliverw@matrix.org>
Best reviewed commit by commit.
With the new dedicated MAS API
(https://github.com/element-hq/synapse/pull/18520), it's possible that
deactivation starts off the main process, which was not possible because
of a few calls.
I basically looked at everything that the deactivation handler was
doing, reviewed whether it could run on workers or not, and find a
workaround when possible
---------
Co-authored-by: Eric Eastwood <erice@element.io>
Part of https://github.com/element-hq/synapse/issues/18592
Separated out of https://github.com/element-hq/synapse/pull/18656
because it's a bigger, unique piece of the refactor
### Testing strategy
1. Add the `metrics` listener in your `homeserver.yaml`
```yaml
listeners:
# This is just showing how to configure metrics either way
#
# `http` `metrics` resource
- port: 9322
type: http
bind_addresses: ['127.0.0.1']
resources:
- names: [metrics]
compress: false
# `metrics` listener
- port: 9323
type: metrics
bind_addresses: ['127.0.0.1']
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` and/or
`http://localhost:9323/metrics`
1. Observe response includes the background processs metrics
(`synapse_background_process_start_count`,
`synapse_background_process_db_txn_count_total`, etc) with the
`server_name` label
Spawning from getting `HMAC incorrect` errors that seem unexplainable
except for the `registration_shared_secret` being misconfigured. It's
also possible my HMAC calculation is incorrect but every time I
double-check the result with the [known-good Python
example](553e124f76/docs/admin_api/register_api.md)
(which matches [Synapse's
source](24e849e483/synapse/rest/admin/users.py (L618-L633))),
it's as expected.
With these logs, we can actually debug whether
`registration_shared_secret` is being configured correctly or not.
It also helps specifically when using `registration_shared_secret_path`
since the default Synapse behavior (of creating the file and secret if
it doesn't exist) can mask deployment race condition where we would
start up Synapse before the `registration_shared_secret_path` file was
put in place:
> **`registration_shared_secret_path`**
>
> [...]
>
> If this file does not exist, Synapse will create a new shared secret
on startup and store it in this file.
>
> *-- [Synapse config
docs](6521406a37/docs/usage/configuration/config_documentation.md (registration_shared_secret_path))*
This only applies to the [`POST
/_synapse/admin/v1/register`](553e124f76/docs/admin_api/register_api.md)
endpoint but does log very sensitive information so we've made it so you
have to explicitly enable the logs by configuring
`synapse.rest.admin.users.registration_debug` (does not inherit root log
level) (via our new `ExplicitlyConfiguredLogger`)
`homeserver.yaml`
```yaml
log_config: "/myserver.log.config.yaml"
```
`myserver.log.config.yaml`
```yaml
version: 1
formatters:
precise:
format: '%(asctime)s - %(name)s - %(lineno)d - %(levelname)s - %(request)s - %(message)s'
handlers:
# ... file/buffer handler (see `sample_log_config.yaml`)
# A handler that writes logs to stderr. Unused by default, but can be used
# instead of "buffer" and "file" in the logger handlers.
console:
class: logging.StreamHandler
formatter: precise
loggers:
synapse.storage.SQL:
# beware: increasing this to DEBUG will make synapse log sensitive
# information such as access tokens.
level: INFO
# Has to be explicitly configured as such. Will not inherit from the root level even if it's set to DEBUG
synapse.rest.admin.users.registration_debug:
level: DEBUG
root:
level: INFO
handlers: [console]
disable_existing_loggers: false
```
The case where a consumer stops downloading media that is currently
being streamed is now able to be handled explicitly.
That scenario isn't really an error, it is expected behaviour.
This PR adds a custom exception which allows us to drop the log level
for this specific case from `WARNING` to `INFO`.
### Pull Request Checklist
<!-- Please read
https://element-hq.github.io/synapse/latest/development/contributing_guide.html
before submitting your pull request -->
* [X] Pull request is based on the develop branch
* [X] Pull request includes a [changelog
file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog).
The entry should:
- Be a short description of your change which makes sense to users.
"Fixed a bug that prevented receiving messages from other servers."
instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
- Use markdown where necessary, mostly for `code blocks`.
- End with either a period (.) or an exclamation mark (!).
- Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by
@github_username." or "Contributed by [Your Name]." to the end of the
entry.
* [X] [Code
style](https://element-hq.github.io/synapse/latest/code_style.html) is
correct (run the
[linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))
---------
Co-authored-by: Eric Eastwood <erice@element.io>
This introduces a dedicated API for MAS to consume. Companion PR on the
MAS side: element-hq/matrix-authentication-service#4801
This has a few advantages over the previous admin API:
- it works on workers (this will be documented once we stabilise MSC3861
as a whole)
- it is more efficient because more focused
- it propagates trace contexts from MAS
- it is only accessible to MAS (through the shared secret) and will let
us remove the weird hack that made this token 'admin' with a ghost
'@__oidc_admin:' user
The next MAS version should support it, but will be opt-in. The version
after that should use this new API by default
---------
Co-authored-by: Eric Eastwood <erice@element.io>
The main goal of this PR is to handle device list changes onto multiple
writers, off the main process, so that we can have logins happening
whilst Synapse is rolling-restarting.
This is quite an intrusive change, so I would advise to review this
commit by commit; I tried to keep the history as clean as possible.
There are a few things to consider:
- the `device_list_key` in stream tokens becomes a
`MultiWriterStreamToken`, which has a few implications in sync and on
the storage layer
- we had a split between `DeviceHandler` and `DeviceWorkerHandler` for
master vs. worker process. I've kept this split, but making it rather
writer vs. non-writer worker, using method overrides for doing
replication calls when needed
- there are a few operations that need to happen on a single worker at a
time. Instead of using cross-worker locks, for now I made them run on
the first writer on the list
---------
Co-authored-by: Eric Eastwood <erice@element.io>
Clean up `MetricsResource`, Prometheus hacks
(`_set_prometheus_client_use_created_metrics`), and better document why
we care about having a separate `metrics` listener type.
These clean-up changes have been split out from
https://github.com/element-hq/synapse/pull/18584 since that PR was
closed.
Fixes https://github.com/element-hq/synapse/issues/18659
This changes the Tokio runtime to be attached to the Twisted reactor.
This way, the Tokio runtime starts when the Twisted reactor starts, and
*not* when the module gets loaded.
This is important as starting the runtime on module load meant that it
broke when Synapse was started with `daemonize`/`synctl`, as forks only
retain the calling threads, breaking the Tokio runtime.
This also changes so that the HttpClient gets the Twisted reactor
explicitly as parameter instead of loading it from
`twisted.internet.reactor`
Refactor `Measure` block metrics to be homeserver-scoped (add
`server_name` label to block metrics).
Part of https://github.com/element-hq/synapse/issues/18592
### Testing strategy
#### See behavior of previous `metrics` listener
1. Add the `metrics` listener in your `homeserver.yaml`
```yaml
listeners:
- port: 9323
type: metrics
bind_addresses: ['127.0.0.1']
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9323/metrics`
1. Observe response includes the block metrics
(`synapse_util_metrics_block_count`,
`synapse_util_metrics_block_in_flight`, etc)
#### See behavior of the `http` `metrics` resource
1. Add the `metrics` resource to a new or existing `http` listeners in
your `homeserver.yaml`
```yaml
listeners:
- port: 9322
type: http
bind_addresses: ['127.0.0.1']
resources:
- names: [metrics]
compress: false
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` (it's just a `GET`
request so you can even do in the browser)
1. Observe response includes the block metrics
(`synapse_util_metrics_block_count`,
`synapse_util_metrics_block_in_flight`, etc)
Fixes: #18491
Fix hotlooping due to skipped PDUs if there is still no progress to be
made.
This could bite if the event was purged since being skipped during
catch-up.
Signed-off-by: Olivier 'reivilibre <oliverw@matrix.org>
Another config option on my quest to a `*_path` variant for every
secret. Adds the config options `recaptcha_private_key_path` and
`recaptcha_public_key_path`. Tests and docs are included.
A public key is of course no secret, but it is closely related to the
private key, so it’s still useful to have a `*_path` variant for it.
This should be reviewed commit by commit.
Nowadays it's trivial to propagate cache invalidations, which means we
can move some things off the main process, and not go through HTTP
replication.
`ReplicationGetQueryRestServlet` appeared to be unused, and was very
weird, as it was being called if the current instance is the main one…
to RPC to the main one (if no instance is set on a replication client,
it makes it to the main process)
The other two handlers could be relatively trivially moved to any
workers, moving some methods to the worker store.
**I've intentionally not removed the replication servlets yet** so that
it's safe to rollout, and will do another PR that clean those up to
remove on the N+1 version
You can now configure how much media can be uploaded by a user in a
given time period.
Note the first commit here is a refactor of create/upload content
function
This implements
https://github.com/matrix-org/matrix-spec-proposals/pull/3765 which is
already merged and, therefore, can use stable identifiers.
For `/publicRooms` and `/hierarchy`, the topic is read from the
eponymous field of the `current_state_events` table. Rather than
introduce further columns in this table, I changed the insertion /
update logic to write the plain-text topic from the rich topic into the
existing field. This will not take effect for existing rooms unless
their topic is changed. However, existing rooms shouldn't have rich
topics to begin with.
Similarly, for server-side search, I changed the insertion logic of the
`event_search` table to prefer the value from the rich topic. Again,
existing events shouldn't have rich topics and, therefore, don't need to
be migrated in the table.
Spec doc: https://spec.matrix.org/v1.15/client-server-api/#mroomtopic
Part of supporting Matrix v1.15:
https://spec.matrix.org/v1.15/client-server-api/#mroomtopic
Signed-off-by: Johannes Marbach <n0-0ne+github@mailbox.org>
Co-authored-by: Eric Eastwood <erice@element.io>
This is to handle the case of deleting lots of "bot" devices at once.
Reviewable commit-by-commit
---------
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>