Hytale Server Monitoring with Prometheus & Grafana

HytaleONE Team · 15 min read

Running a Hytale server without monitoring means you have no idea what’s going wrong until it’s already gone wrong. You won’t know you have a memory leak until players start complaining about lag, and by then you’re already behind. A proper observability setup catches problems before your players do - and gives you the data to actually fix them instead of guessing.

This guide walks through setting up Prometheus and Grafana for a Hytale dedicated server, covering system metrics, JVM metrics, dashboards, and alerting. We assume you already have a Hytale server running on Linux - the examples use Debian, but everything here works on any systemd-based distro.


In this article: What to Monitor · The Monitoring Stack · Installing Prometheus · JVM Metrics · System Metrics · Installing Grafana · Firewall Rules · Alerting · Profiling with spark · Tying It All Together


What to Monitor on a Hytale Server

Server monitoring isn’t just about watching CPU go up and down. For a Hytale server, you want visibility into three distinct layers - each telling you something different about what’s happening.

System Metrics

The foundation. These tell you whether your hardware is keeping up:

  • CPU usage - sustained high CPU usually means either too many players for your hardware, or a plugin doing something expensive every tick
  • Memory (RAM) - how much your OS is actually using vs. what’s available, independent of the JVM
  • Disk I/O - world saves and chunk loading hammer the disk. NVMe helps, but you still want to know when I/O wait spikes
  • Network throughput - bandwidth and packet rates, especially important since Hytale uses QUIC (UDP)

JVM Metrics

Hytale runs on the JVM, so you need JVM-specific telemetry on top of system metrics. This is the layer where most game server admins stop short:

  • Heap usage - how much of your allocated -Xmx is actually in use. Normal heap usage follows a sawtooth pattern - climbing as objects are allocated, then dropping when GC runs. What matters is the post-GC floor: if the heap doesn’t drop below 70% after a full GC, you’re either undersized or leaking objects
  • Garbage collection - GC pause duration and frequency. Long GC pauses are the number one cause of server “hitches” that players feel as lag spikes
  • Thread count - helps spot thread leaks from plugins or runaway async tasks
  • Class loading - useful for debugging plugin load/unload cycles

Game Metrics

The layer closest to player experience:

  • TPS (ticks per second) - the server’s heartbeat. Hytale targets 20 TPS. Drops below that mean the server can’t keep up and players feel lag
  • MSPT (milliseconds per tick) - how long each tick takes to process. Needs to stay under 50 ms to maintain 20 TPS. Spikes here point to specific bottlenecks
  • Player count - trend over time, not just a snapshot. Correlate player spikes with resource usage to plan capacity
  • Server status - is the server reachable? How long has it been up? Tools like OneQuery expose this data over UDP for lightweight polling

These three layers together give you the full picture. System metrics alone won’t tell you about GC pressure, JVM metrics alone won’t show disk saturation, and neither will tell you how many players were online when things went sideways.


The Monitoring Stack

We’re using four components, all open-source:

| Component | Role | Port |
|---|---|---|
| Prometheus | Scrapes and stores time-series metrics | 9090 |
| Node Exporter | Exports system metrics (CPU, RAM, disk, network) | 9100 |
| JMX Exporter | Exports JVM metrics from the Hytale server process | 9225 |
| Grafana | Dashboards and visualization | 3000 |

Prometheus pulls metrics from exporters on a schedule (every 15 seconds by default), stores them in its time-series database, and evaluates alerting rules. Grafana connects to Prometheus as a data source and renders dashboards. No agents, no cloud dependencies, no vendor lock-in.

Where to run what: Node Exporter and JMX Exporter belong on the game server - they’re lightweight and need local access to system and JVM metrics. Prometheus and Grafana should ideally run on a separate machine. Prometheus writes to disk constantly (WAL, TSDB compaction) and Grafana adds its own memory footprint. On a game server where you want every CPU cycle and every disk IOPS going to the game, that’s unwanted contention. A cheap VPS or a spare machine on your network works fine - Prometheus just needs network access to scrape the exporters. If you only have one machine, the setup still works, but be aware of the tradeoff and keep an eye on disk I/O.

If you’re already running Prometheus and Grafana for other services, you can skip the installation sections and jump straight to JVM metrics and scrape configuration.


Installing Prometheus

Create a system user and grab the latest release:

useradd --no-create-home --shell /bin/false prometheus
mkdir -p /etc/prometheus /var/lib/prometheus
chown prometheus:prometheus /var/lib/prometheus

Download and extract (check prometheus.io/download for the latest version):

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v3.2.1/prometheus-3.2.1.linux-amd64.tar.gz
tar xzf prometheus-3.2.1.linux-amd64.tar.gz
cp prometheus-3.2.1.linux-amd64/prometheus /usr/local/bin/
cp prometheus-3.2.1.linux-amd64/promtool /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool

Create the config at /etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]

  - job_name: "hytale"
    static_configs:
      - targets: ["localhost:9225"]

Set up the systemd service:

# /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Enable and start:

systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus

Prometheus is now live at http://your-server:9090. It’s scraping itself, but it won’t find the other two targets until we set up the exporters.

Scrape Configuration

The config above uses static_configs - you hardcode the targets. This is fine for a single server. If you’re running multiple Hytale servers, you can use file_sd_configs to manage targets in a separate JSON file that Prometheus reloads automatically:

  - job_name: "hytale"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/hytale.json
        refresh_interval: 30s

And the target file itself, /etc/prometheus/targets/hytale.json:

[
  {
    "targets": ["10.0.0.2:9225", "10.0.0.3:9225"],
    "labels": {
      "env": "production"
    }
  }
]

This way you add or remove servers without restarting Prometheus.
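
Whenever you touch prometheus.yml - static targets or file-based - it’s worth validating it before a restart. promtool ships alongside Prometheus and catches YAML and syntax mistakes before they take the service down:

promtool check config /etc/prometheus/prometheus.yml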


Exposing JVM Metrics with JMX Exporter

This is the most important piece for Hytale specifically. The JMX Exporter runs as a Java agent inside the same JVM as your server - no separate process, no JMX remote ports to open.

Download the agent jar:

mkdir -p /home/hytale/monitoring
cd /home/hytale/monitoring
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/1.2.0/jmx_prometheus_javaagent-1.2.0.jar
chown hytale:hytale jmx_prometheus_javaagent-1.2.0.jar

Create the exporter config at /home/hytale/monitoring/jmx-config.yml. This file only maps JMX MBeans to Prometheus metric names - the listen address and port are set in the -javaagent argument in the next step:

rules:
  # GC metrics - the most important ones for game servers
  - pattern: "java.lang<type=GarbageCollector, name=(.+)><>(CollectionCount|CollectionTime)"
    name: "jvm_gc_$2_total"
    labels:
      collector: "$1"
    type: COUNTER

  # Memory pools
  - pattern: "java.lang<type=MemoryPool, name=(.+)><>(Usage|PeakUsage|CollectionUsage)\\.(.+)"
    name: "jvm_memory_pool_$3_bytes"
    labels:
      pool: "$1"
      metric: "$2"

  # Thread count
  - pattern: "java.lang<type=Threading><>(ThreadCount|DaemonThreadCount|PeakThreadCount)"
    name: "jvm_threads_$1"
    type: GAUGE

  # Heap summary
  - pattern: "java.lang<type=Memory><HeapMemoryUsage>(\\w+)"
    name: "jvm_heap_$1_bytes"
    type: GAUGE

  # Non-heap (metaspace, code cache)
  - pattern: "java.lang<type=Memory><NonHeapMemoryUsage>(\\w+)"
    name: "jvm_nonheap_$1_bytes"
    type: GAUGE

  # CPU load
  - pattern: "java.lang<type=OperatingSystem><>(ProcessCpuLoad|SystemCpuLoad)"
    name: "jvm_$1"
    type: GAUGE

  # Class loading
  - pattern: "java.lang<type=ClassLoading><>(LoadedClassCount|TotalLoadedClassCount|UnloadedClassCount)"
    name: "jvm_classloading_$1"
    type: GAUGE

Now modify your Hytale server launch command to include the agent. If you’re using the systemd service from our Debian guide, update the ExecStart line:

ExecStart=/usr/bin/java \
  -Xms4G -Xmx4G \
  -javaagent:/home/hytale/monitoring/jmx_prometheus_javaagent-1.2.0.jar=9225:/home/hytale/monitoring/jmx-config.yml \
  -jar HytaleServer.jar --assets ../Assets.zip --bind 5520

Restart the server:

systemctl daemon-reload
systemctl restart hytale

Verify the exporter is working:

curl -s http://localhost:9225/metrics | head -20

You should see Prometheus-formatted metrics. Prometheus will start scraping them on the next cycle (within 15 seconds).
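
To confirm Prometheus has actually picked the target up, open Status > Targets in the Prometheus web UI, or hit its HTTP API - the hytale job should report "health":"up":

curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'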


System Metrics with Node Exporter

Node Exporter gives Prometheus access to system-level metrics - CPU, memory, disk, network, and more. It’s a single static binary with zero dependencies.

cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.9.0/node_exporter-1.9.0.linux-amd64.tar.gz
tar xzf node_exporter-1.9.0.linux-amd64.tar.gz
cp node_exporter-1.9.0.linux-amd64/node_exporter /usr/local/bin/
useradd --no-create-home --shell /bin/false node_exporter
chown node_exporter:node_exporter /usr/local/bin/node_exporter

Systemd service:

# /etc/systemd/system/node_exporter.service

[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Enable and start:

systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter

That’s it. Node Exporter runs on port 9100 by default and Prometheus is already configured to scrape it from the config we set up earlier.
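
As with the JMX Exporter, a quick curl confirms it’s serving data:

curl -s http://localhost:9100/metrics | grep "^node_cpu_seconds_total" | head -5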


Installing Grafana

Grafana turns your Prometheus data into dashboards you can actually read at a glance. Install from the official APT repo:

apt-get install -y apt-transport-https software-properties-common wget
wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main" \
  | tee /etc/apt/sources.list.d/grafana.list
apt-get update
apt-get install -y grafana

Start and enable:

systemctl daemon-reload
systemctl enable grafana-server
systemctl start grafana-server

Grafana is now accessible at http://your-server:3000. Default credentials are admin / admin - you’ll be prompted to change the password on first login.

Connect Prometheus as a data source:

  1. Go to Connections > Data sources > Add data source
  2. Select Prometheus
  3. Set the URL to http://localhost:9090
  4. Click Save & test
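
If you manage the box with config files rather than clicking through the UI, Grafana can also pick up the data source from a provisioning file. A minimal sketch - the filename is arbitrary, the directory is Grafana’s standard provisioning path:

# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true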

Building a Hytale Server Dashboard

Create a new dashboard and add panels for the metrics that matter most. Here are the PromQL queries for the key panels:

Heap usage (percentage):

jvm_heap_used_bytes / jvm_heap_max_bytes * 100

GC pause time (rate per second):

rate(jvm_gc_CollectionTime_total[5m]) / 1000

CPU usage (system-wide):

100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory usage:

(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

Disk I/O (read/write bytes per second):

rate(node_disk_read_bytes_total{device="nvme0n1"}[5m])
rate(node_disk_written_bytes_total{device="nvme0n1"}[5m])

Network traffic (bytes per second):

rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])

Thread count:

jvm_threads_ThreadCount

Adjust the device labels to match your actual hardware (nvme0n1, sda, eth0, ens18, etc). You can find your device names by browsing http://your-server:9100/metrics and searching for node_disk or node_network.
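
Alternatively, a one-liner lists every device label the exporter knows about:

curl -s http://localhost:9100/metrics | grep -o 'device="[^"]*"' | sort -u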


Firewall Rules for Monitoring

Exporters serve metrics over HTTP with no authentication by default. Anyone who can reach port 9100 or 9225 can read your system and JVM metrics. That’s fine on localhost - not fine on the public internet. How you lock this down depends on whether you’re running the full stack on one machine or split across two.

Single Machine Setup

If Prometheus, Grafana, and the game server are all on the same box, the exporters only need to listen on localhost. Nothing should be reachable from outside.

Node Exporter supports a --web.listen-address flag:

ExecStart=/usr/local/bin/node_exporter --web.listen-address=127.0.0.1:9100

For the JMX Exporter, the bind address is part of the -javaagent argument rather than the config file - prepend the host to the port:

-javaagent:/home/hytale/monitoring/jmx_prometheus_javaagent-1.2.0.jar=127.0.0.1:9225:/home/hytale/monitoring/jmx-config.yml

Prometheus and Grafana, on the other hand, listen on all interfaces by default. Either bind them to localhost (a sketch follows below) or leave ports 9090 and 3000 closed in your firewall, and access Grafana through an SSH tunnel:

ssh -L 3000:localhost:3000 your-server

Then open http://localhost:3000 in your browser. No ports exposed, no firewall rules needed beyond what you already have for SSH and the game port.
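
If you’d rather bind Prometheus and Grafana to localhost explicitly instead of relying on closed ports, both support it - Prometheus via a flag appended to its ExecStart line, Grafana via /etc/grafana/grafana.ini:

# Prometheus: add to ExecStart in prometheus.service
--web.listen-address=127.0.0.1:9090

# Grafana: /etc/grafana/grafana.ini, [server] section
http_addr = 127.0.0.1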

Two-Machine Setup

When Prometheus runs on a different machine, the exporters on the game server need to accept connections from that machine’s IP - but nobody else. Use ufw to allow only the monitoring server:

On the game server (where the exporters run):

ufw allow from <monitoring-ip> to any port 9100 proto tcp  # Node Exporter
ufw allow from <monitoring-ip> to any port 9225 proto tcp  # JMX Exporter

Replace <monitoring-ip> with the IP of your Prometheus machine. These rules allow Prometheus to scrape the exporters while blocking everyone else.

If you prefer nftables directly:

nft add rule inet filter input ip saddr <monitoring-ip> tcp dport { 9100, 9225 } accept
nft add rule inet filter input tcp dport { 9100, 9225 } drop

On the monitoring machine (where Prometheus and Grafana run):

ufw allow 2222/tcp   # SSH
ufw allow from <your-ip> to any port 3000 proto tcp  # Grafana - your IP only
ufw enable

Don’t open Grafana to 0.0.0.0 unless you’ve set up HTTPS and changed the default credentials. A reverse proxy (Caddy, nginx) with TLS in front of Grafana is the right approach if you need access from multiple locations. For personal use, the SSH tunnel approach works just as well here as on a single machine.
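
If you do go the reverse-proxy route, here’s a minimal sketch with Caddy on the monitoring machine - grafana.example.com is a placeholder for a DNS record pointing at that machine, and Caddy obtains the TLS certificate automatically:

# /etc/caddy/Caddyfile
grafana.example.com {
    reverse_proxy localhost:3000
}

You’d also need to open ports 80 and 443 on the monitoring machine for certificate issuance and the proxy itself.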

What Not to Expose

Quick reference for which ports should be reachable from where:

| Port | Service | Game server | Monitoring machine | Public internet |
|---|---|---|---|---|
| 5520/udp | Hytale (QUIC) | - | No | Yes |
| 9100/tcp | Node Exporter | Localhost or monitoring IP | - | Never |
| 9225/tcp | JMX Exporter | Localhost or monitoring IP | - | Never |
| 9090/tcp | Prometheus | - | Localhost | Never |
| 3000/tcp | Grafana | - | Your IP / SSH tunnel | Never (unless behind reverse proxy with TLS) |

Alerting

Dashboards are great for when you’re looking. Alerts are for when you’re not. Prometheus has a built-in alerting engine that evaluates rules on the evaluation interval (15 seconds in our config) and fires alerts to Alertmanager, which handles routing to Discord, email, Slack, or whatever you use.

Create an alerting rules file at /etc/prometheus/alerts.yml:

groups:
  - name: hytale
    rules:
      - alert: HighHeapUsage
        expr: (jvm_heap_used_bytes / jvm_heap_max_bytes) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Hytale server heap usage above 85%"
          description: "Heap has been above 85% for 5 minutes. Current: {{ $value | humanizePercentage }}"

      - alert: LongGCPauses
        expr: rate(jvm_gc_CollectionTime_total[5m]) / 1000 > 0.1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Hytale server spending >10% of time in GC"
          description: "GC overhead is {{ $value | humanizePercentage }}. Players are likely experiencing lag."

      - alert: HighCPU
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 90% for 10 minutes"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Less than 10% disk space remaining"

Tell Prometheus about the rules file - add this to /etc/prometheus/prometheus.yml:

rule_files:
  - alerts.yml

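Validate the rules before restarting - promtool catches YAML mistakes and malformed PromQL expressions:

promtool check rules /etc/prometheus/alerts.yml
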
Restart Prometheus:

systemctl restart prometheus

Example Alert Rules

The four rules above cover the most common failure modes:

| Alert | Trigger | Why it matters |
|---|---|---|
| HighHeapUsage | Heap > 85% for 5 min | Precursor to OutOfMemoryError crashes |
| LongGCPauses | GC > 10% of CPU time for 3 min | Players experience lag spikes during GC pauses |
| HighCPU | CPU > 90% for 10 min | Server can’t keep up with tick rate |
| DiskSpaceLow | < 10% disk free for 5 min | World saves will fail, server may crash |

For routing alerts to Discord, set up Alertmanager - recent versions ship a native Discord receiver that takes a regular Discord webhook URL. Installing Alertmanager itself is out of scope here - the Prometheus docs cover it well.
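
As a rough sketch of what the config looks like, assuming Alertmanager 0.25 or newer (which has the native Discord receiver) running on the monitoring machine - the webhook URL placeholder is yours to fill in:

# /etc/alertmanager/alertmanager.yml
route:
  receiver: discord
receivers:
  - name: discord
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/<id>/<token>"

Then point Prometheus at Alertmanager by adding an alerting block to prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]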


Profiling with spark

Prometheus and Grafana give you continuous, infrastructure-level monitoring. But when you need to dig into why the server is lagging right now - which mod, which method, which tick - you need a profiler.

spark is an open-source performance profiler built by lucko (of LuckPerms fame) that runs as a mod inside your Hytale server. It’s lightweight enough to leave running in production and gives you tools that Prometheus simply can’t: CPU sampling down to individual method calls, per-tick analysis, and interactive flame graphs.

Installing spark

Download the Hytale version from CurseForge and drop the jar into your server’s mods/ directory. Restart the server. No configuration needed - spark is ready to use immediately. Note that the Hytale build is still in beta as of February 2026, so expect rough edges.

What spark Gives You

spark covers a different angle than Prometheus. It can’t do continuous recording or alerting - there’s no 30-day history and no way to fire a notification at 3 AM. But it does things Prometheus can’t touch: CPU sampling down to individual method calls with /spark profiler, per-tick analysis with /spark tickmonitor, live TPS and MSPT via /spark tps, heap breakdowns with /spark heapsummary, and GC monitoring with /spark gc. Prometheus tells you that something went wrong and when. spark tells you what and where.

Key Commands

Check server health at a glance:

/spark health

Reports TPS, CPU usage, memory, disk, and GC activity in one summary. This is the first thing to run when someone says “the server feels laggy.”

Monitor TPS and tick duration:

/spark tps

Shows current TPS rate plus MSPT statistics - min, max, median, and 95th percentile. A healthy server shows 20 TPS with MSPT well under 50 ms. If your 95th percentile MSPT is above 50, some ticks are running long enough to cause noticeable hitches.

Catch individual slow ticks:

/spark tickmonitor --threshold 100

This watches every tick and reports any that take more than double the average duration (100% over baseline). When a tick spikes, spark logs exactly when it happened and how long it took. Pair this with your Grafana dashboards to correlate tick spikes with GC pauses or CPU load.

Profile CPU usage:

/spark profiler start

Starts sampling which methods are consuming CPU time. Let it run for a minute or two during normal gameplay, then stop it:

/spark profiler stop

spark uploads the results and gives you a link to an interactive viewer. The viewer shows a flame graph - a visual breakdown of where CPU time is going, from the server tick loop down to individual method calls. If a specific mod or plugin is eating cycles, you’ll see it immediately.

Profile only during lag:

/spark profiler start --only-ticks-over 50

This only samples during ticks that exceed 50 ms - filtering out all the normal ticks and focusing exclusively on what’s causing lag. This is the most useful profiling mode for tracking down intermittent performance issues.

Inspect memory:

/spark heapsummary

Generates a summary of what’s consuming heap memory, grouped by type. If you see a specific object type dominating the heap, that’s your leak candidate. For deeper analysis, /spark heapdump creates a full .hprof file you can open in tools like Eclipse MAT or VisualVM.

When to Use Which

Use Prometheus + Grafana for:

  • Continuous 24/7 monitoring and historical trends
  • Automated alerting (high heap, long GC, disk full)
  • Capacity planning based on weeks of data
  • Monitoring multiple servers from one dashboard

Use spark for:

  • Diagnosing active performance problems in real time
  • Finding which mod or method is causing lag
  • Profiling specific scenarios (PvP events, world gen, chunk loading)
  • Quick health checks without opening a browser

A practical workflow: your Prometheus alert fires for high GC overhead. You check Grafana and confirm GC pauses are spiking during peak hours. You SSH in, run /spark profiler start --only-ticks-over 50, play for a few minutes, stop it, and the flame graph shows a specific mod allocating objects in a hot loop. Problem identified.


Tying It All Together

Here’s what the full observability setup looks like once everything is running:

  Hytale Server (JVM)
    ├── JMX Exporter (:9225) ──┐
    ├── spark (in-process) ────┼── profiling, TPS, flame graphs
    └── OneQuery (UDP :5520) ──┼── external status queries
                               │
  Node Exporter (:9100) ───────┤
                               │
                        Prometheus (:9090)
                          ├── time-series storage
                          ├── alerting rules
                          └── PromQL queries
                               │
                          Grafana (:3000)
                          └── dashboards & visualization

Each layer covers a different question. OneQuery handles “is the server up, who’s playing.” Prometheus and its exporters handle “how is the server performing over time.” spark handles “why is this specific tick slow right now.” Grafana ties the continuous metrics together in one dashboard.

A few tips once you’re running this stack:

  • Set Prometheus retention to match your needs. The --storage.tsdb.retention.time=30d flag in our config keeps 30 days. For a single server, this uses about 1-2 GB of disk. Increase it if you want longer historical data for capacity planning.
  • Watch GC metrics first. If you only look at one thing, make it garbage collection. Long GC pauses are the most common cause of player-visible lag on JVM-based game servers, and they don’t show up in basic CPU/RAM monitoring.
  • Correlate metrics with events. When players report lag at 8 PM, check your Grafana dashboards for that exact time. Was it a GC pause? CPU spike? Disk I/O? Having the data retroactively is the whole point.

If you’re running Prometheus and Grafana on the same box as the game server, keep the retention short (15d instead of 30d) and monitor Prometheus’s own resource usage - it’s a bit ironic, but your monitoring stack can become the thing that needs monitoring if it starts competing with the game for disk I/O. On a dedicated monitoring machine, none of this is a concern. The exporters themselves (Node Exporter and JMX Exporter) are negligible - a few megabytes of RAM each.
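
Since Prometheus scrapes itself (the prometheus job in our config), a few PromQL queries are enough to keep an eye on its footprint - these are standard Prometheus self-metrics, though the numbers you see will depend on retention and scrape interval:

# Resident memory of the Prometheus process
process_resident_memory_bytes{job="prometheus"}

# CPU usage of the Prometheus process, in percent
rate(process_cpu_seconds_total{job="prometheus"}[5m]) * 100

# On-disk size of the TSDB blocks
prometheus_tsdb_storage_blocks_bytes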