Backups and Pool Health

Storage for the whole homelab lives on a RAIDZ1 pool on the Ugreen DXP 8800 Plus. If the pool is unhealthy, everything stops — Jellyfin, photos, *arr imports, SMB shares. This page is the runbook for keeping it alive and knowing what to do when it is not.

Pool layout

Setting	Value
Platform	Ugreen DXP 8800 Plus @ `192.168.2.203`
OS	TrueNAS
Pool type	RAIDZ1 (single parity)
Drives today	7× 16 TB
Planned	8× 16 TB (8th drive ordered / pending install)
Admin UI	dsm.saxobroko.com → Storage

RAIDZ1 can survive one drive failure without data loss. A second drive failing before the resilver completes means data loss, so treat degraded status as urgent.

Usable capacity is less than raw drive total (parity + metadata overhead). Check the TrueNAS dashboard for live numbers; do not rely on back-of-envelope maths.
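As a sanity check on the ceiling only, the parity cost can be worked out by hand. The sketch below assumes each drive holds exactly 16 decimal TB and ignores metadata, padding, and reserved space, so the real figure on the dashboard will be lower. The function name is illustrative, not part of any TrueNAS tool.

```python
TB = 10**12   # drive vendors quote decimal terabytes
TIB = 2**40   # ZFS tools typically report binary units


def raidz_upper_bound_tib(drives: int, drive_tb: float, parity: int = 1) -> float:
    """Upper bound on usable space in TiB: data disks only, before ZFS overhead."""
    if drives <= parity:
        raise ValueError("need more drives than parity disks")
    data_bytes = (drives - parity) * drive_tb * TB
    return data_bytes / TIB


if __name__ == "__main__":
    # Today's 7-drive layout and the planned 8-drive layout.
    for n in (7, 8):
        print(f"{n} x 16 TB RAIDZ1: at most {raidz_upper_bound_tib(n, 16):.1f} TiB")
```

Most of the gap between "7 × 16 TB" and the dashboard figure comes from the parity disk and the TB-to-TiB conversion; the rest is ZFS overhead that this sketch does not model.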

What lives on the pool

Dataset / use	Consumers
Media (`Shows/`, `Movies/`, music)	Jellyfin, Navidrome, *arr stack — arr-stack
Photos	photos.saxobroko.com — Photos
App configs	Docker volumes for Sonarr, Radarr, qBittorrent, etc.
SMB shares	Windows `A:` drive — Windows network drive
Backups / snapshots	ZFS snapshots (see below)

Scrubs

ZFS scrubs read every block and verify checksums. They catch silent bit rot before it becomes unrecoverable.

Task	Guidance
Schedule	Monthly scrub on the pool (TrueNAS default or custom schedule)
Duration	Hours on multi-TB RAIDZ1 — run off-peak if performance matters
During scrub	Pool stays online; may be slower
Check status	TrueNAS → Storage → pool → Scrub tab

If a scrub reports checksum errors, note which drive logged them and run SMART tests on that disk before deciding whether to replace it.
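If shell access is available, the same information is in the `zpool status` device table. A minimal sketch of picking out drives with non-zero error counters, assuming the usual `NAME STATE READ WRITE CKSUM` layout; the pool name `tank`, the device names, and the sample text are made up for illustration.

```python
# Illustrative sample only; real output uses this pool's own device names.
SAMPLE = """\
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 05:12:33 with 0 errors on Sun Mar  2 05:12:34 2025
config:

\tNAME        STATE     READ WRITE CKSUM
\ttank        ONLINE       0     0     0
\t  raidz1-0  ONLINE       0     0     0
\t    sda     ONLINE       0     0     0
\t    sdb     ONLINE       0     0     3

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True
            continue
        if in_config:
            if not fields:
                break  # a blank line ends the device table
            if len(fields) >= 5:
                name, counters = fields[0], tuple(fields[2:5])
                # Counters stay as strings because large counts may be abbreviated.
                if counters != ("0", "0", "0"):
                    flagged[name] = counters
    return flagged


if __name__ == "__main__":
    print(devices_with_errors(SAMPLE))
```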

SMART monitoring

TrueNAS surfaces drive SMART data:

Storage → Disks — review temperature and SMART status
Enable S.M.A.R.T. Tests (short monthly, long quarterly) if not already scheduled
Replace a drive showing reallocated sectors, pending sectors, or critical warnings — do not wait for total failure
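The replace-early rule above can be sketched as a check over `smartctl` JSON output. This assumes smartctl 7 or newer (which has a `--json` flag) and ATA drives; the field names follow what that JSON normally contains, but treat them as assumptions and compare against real output from these disks. The sample report is invented.

```python
import json

# SMART attribute IDs for the warning signs named above (ATA drives).
WATCHED = {5: "reallocated sectors", 197: "pending sectors"}


def smart_warnings(report: dict) -> list:
    """Return human-readable warnings from one parsed smartctl JSON report."""
    warnings = []
    if report.get("smart_status", {}).get("passed") is False:
        warnings.append("overall SMART status: FAILED")
    for attr in report.get("ata_smart_attributes", {}).get("table", []):
        label = WATCHED.get(attr.get("id"))
        raw = attr.get("raw", {}).get("value", 0)
        if label and raw > 0:
            warnings.append(f"{label}: raw value {raw}")
    return warnings


# Invented sample in the assumed shape of `smartctl --json -a <device>`.
SAMPLE = json.loads("""
{
  "smart_status": {"passed": true},
  "ata_smart_attributes": {"table": [
    {"id": 5,   "name": "Reallocated_Sector_Ct",  "raw": {"value": 8}},
    {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 0}}
  ]}
}
""")

if __name__ == "__main__":
    for warning in smart_warnings(SAMPLE):
        print(warning)
```

Any non-empty result is a prompt to look at the drive in the TrueNAS UI, not an automatic verdict.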

Email or notification hooks for SMART failures: TODO: document alert destination if configured.

Snapshots mindset

ZFS snapshots are cheap point-in-time copies on the same pool. They help with:

Accidental deletes (roll back a dataset or recover files from the snapshot browser)
Before risky changes (app upgrades, bulk renames)
Quick recovery without a second copy of the data

Snapshots are not a substitute for off-site backup — fire, theft, or multiple drive loss still kills the pool.

Practice	Notes
Snapshot important datasets	Media, photos, app config volumes
Retention	e.g. daily × 7, weekly × 4 — tune to pool space
Off-site	TODO: document if cloud or external USB backup exists
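To make the "daily × 7, weekly × 4" example concrete, here is one possible reading of it: keep the newest snapshot from each of the 7 most recent days and the newest from each of the 4 most recent ISO weeks. TrueNAS periodic snapshot tasks handle retention themselves, so this is only a sketch for reasoning about how many snapshots a policy leaves behind. The function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def select_keep(snapshots, daily=7, weekly=4):
    """Return the set of snapshot timestamps to keep under a daily/weekly policy."""
    keep = set()
    seen_days, seen_weeks = [], []
    for ts in sorted(snapshots, reverse=True):  # newest first
        day = ts.date()
        week = tuple(ts.isocalendar())[:2]  # (ISO year, ISO week number)
        if day not in seen_days and len(seen_days) < daily:
            seen_days.append(day)
            keep.add(ts)
        if week not in seen_weeks and len(seen_weeks) < weekly:
            seen_weeks.append(week)
            keep.add(ts)
    return keep


if __name__ == "__main__":
    # Thirty hypothetical nightly snapshots, 1 to 30 January 2024.
    snaps = [datetime(2024, 1, 1) + timedelta(days=i) for i in range(30)]
    kept = select_keep(snaps)
    print(f"{len(kept)} of {len(snaps)} snapshots kept")
```

Daily and weekly picks overlap for the most recent weeks, so the total kept is usually less than 7 + 4.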

8th drive expansion

Plan when the 8th 16 TB drive arrives:

Physically install the drive in the DXP 8800 Plus (power down if hot-swap is not confirmed for that bay)
TrueNAS → Storage → pool → Expand (or replace/expand wizard depending on UI version)
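RAIDZ1 widens from 7 to 8 disks: pool capacity increases and parity stays at one disk. Data written before the expansion keeps its old data-to-parity ratio until it is rewritten, so the initial gain is somewhat less than one full drive
Run a scrub after the expansion finishes and confirm it reports no errors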
Update SaxDocs — TrueNAS, Server overview hardware section

Before expanding, ensure no drive is already degraded. Fix or replace failing disks first.
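One way to script that pre-check is `zpool status -x`, which prints a short "healthy" message when nothing is wrong and the full status otherwise. A minimal sketch, assuming shell access to the NAS; the helper names are hypothetical, and the UI remains the authoritative view.

```python
import subprocess


def read_status_x() -> str:
    """Run `zpool status -x` on the NAS and return its text output."""
    result = subprocess.run(
        ["zpool", "status", "-x"], capture_output=True, text=True, check=True
    )
    return result.stdout


def pool_is_healthy(status_x_output: str) -> bool:
    """Interpret `zpool status -x` output: True only for the 'healthy' messages."""
    text = status_x_output.strip()
    if text == "all pools are healthy":
        return True
    # With a pool name argument the message is: pool '<name>' is healthy
    return text.startswith("pool '") and text.endswith("' is healthy")


if __name__ == "__main__":
    # Canned strings so the sketch runs without a pool present.
    print(pool_is_healthy("all pools are healthy\n"))
    print(pool_is_healthy("  pool: tank\n state: DEGRADED\n"))
```

If this returns `False`, stop and work through the degraded-pool steps before adding the new drive.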

Degraded pool — what to do

If TrueNAS shows DEGRADED or a drive is FAULTED:

Immediate steps

Stop heavy writes — pause qBittorrent imports, large photo uploads if possible
Identify the failed drive — Storage → pool status; note serial and slot
Do not yank the wrong disk — confirm by slot LED or serial in UI
Replace the drive — same or larger capacity; TrueNAS will resilver
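For the identification step, the failed member can also be read from `zpool status` text. The sketch below lists every disk row that is not `ONLINE`, skipping the pool and vdev rows. The pool name `tank` and the `sdX` names are placeholders; TrueNAS often shows partition UUIDs instead, so match the result against the serial shown in the UI before pulling anything.

```python
# Illustrative sample only; device names on the real pool will differ.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

\tNAME        STATE     READ WRITE CKSUM
\ttank        DEGRADED     0     0     0
\t  raidz1-0  DEGRADED     0     0     0
\t    sda     ONLINE       0     0     0
\t    sdb     FAULTED      0     0     0  too many errors
\t    sdc     ONLINE       0     0     0

errors: No known data errors
"""


def unhealthy_disks(status_text):
    """Return [(name, state)] for disk rows whose state is not ONLINE."""
    pool_name = None
    in_config = False
    bad = []
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:1] == ["pool:"] and len(fields) > 1:
            pool_name = fields[1]
        elif fields[:2] == ["NAME", "STATE"]:
            in_config = True
        elif in_config:
            if not fields:
                break  # a blank line ends the device table
            if len(fields) < 2:
                continue
            name, state = fields[0], fields[1]
            # Skip the pool row and grouping vdev rows such as raidz1-0.
            if name == pool_name or name.startswith(("raidz", "mirror")):
                continue
            if state != "ONLINE":
                bad.append((name, state))
    return bad


if __name__ == "__main__":
    print(unhealthy_disks(SAMPLE))
```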

During resilver

Pool remains usable but slow; avoid unnecessary load
Resilver time scales with the amount of data stored on the pool, not the raw disk size; it may take a day or more on 7×16 TB
Monitor progress in the UI; do not reboot unless instructed
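Progress can also be polled from the `scan:` section of `zpool status`. The exact wording of those lines varies between OpenZFS versions, so this sketch only looks for the "resilver in progress" marker and the "% done" figure; the sample text is invented.

```python
import re

_DONE = re.compile(r"([\d.]+)% done")

# Invented sample; figures and layout differ between OpenZFS versions.
SAMPLE = """\
  scan: resilver in progress since Sun Mar  2 10:00:00 2025
\t1.20T scanned at 500M/s, 800G issued at 300M/s, 60.1T total
\t130G resilvered, 1.30% done, 2 days 06:12:11 to go
"""


def resilver_percent(status_text):
    """Return resilver completion as a float, or None if no resilver is running."""
    if "resilver in progress" not in status_text:
        return None
    match = _DONE.search(status_text)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    print(resilver_percent(SAMPLE))
```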

After resilver

Run a manual scrub
Check SMART on the replacement drive
Review any checksum errors logged during the incident

If two drives fail

RAIDZ1 cannot recover. Restore from off-site backup if it exists. Otherwise this is a data-loss event — TODO: confirm backup strategy and document restore steps here.

Windows and SMB health

Degraded pool performance shows up as a slow `A:` drive or failed copies from the Windows PC to the NAS at `192.168.2.203`. If SMB is fine on LAN but apps fail, the problem may be Docker — see Common Issues.
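A quick way to separate "SMB is down" from "the pool is slow" is to test whether the SMB port answers at all. A minimal sketch, assuming SMB listens on the standard TCP port 445 on the NAS; the function name is illustrative, and a successful connect says nothing about pool health or share permissions.

```python
import socket


def tcp_port_open(host: str, port: int = 445, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, unreachable hosts
        return False


if __name__ == "__main__":
    # NAS address taken from the pool layout table.
    nas = "192.168.2.203"
    print(f"SMB port on {nas} reachable: {tcp_port_open(nas)}")
```

If the port answers but copies are slow or fail, look at pool status next; if it does not answer, the problem is networking or the SMB service rather than the disks.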

Monitoring

External uptime checks: status.saxobroko.com — Monitoring. That tells me if services respond; it does not replace pool health checks inside TrueNAS.

TrueNAS — admin access and app host
arr-stack — media paths on the pool
Photos — photo storage
Common Issues — service-level problems
Monitoring — external uptime