
This page covers the most common problems people hit on Fly.io and how to fix them. If your problem isn’t here, check the community forum.
Try this first
If the error isn’t obvious, start here.
- Update flyctl:
fly version update— outdated versions cause weird failures. - Run diagnostics:
fly doctor— checks WireGuard, IPs, and Docker. - Review fly.toml: Run
fly config validateto catch syntax and configuration errors. Double-check formatting, port numbers, and recent changes against the configuration reference. - Check logs:
fly logsin one terminal while running your command in another. For more detail:LOG_LEVEL=debug fly deploy. - SSH in:
fly ssh console(use-sto pick a specific Machine).
Find your problem
- I’m getting an error code
- 502 Bad Gateway — app didn’t respond to the proxy
- 503 Service Unavailable — no healthy Machines
- 401 Unauthorized — registry auth failure during deploy
- 403 Forbidden — usually your app’s CORS config or a third-party block
- 520 with Cloudflare — Cloudflare doesn’t like the response
- My deploy failed
- My app is slow or timing out
- I can’t connect to something
- My Machine is stuck or behaving unexpectedly
- I can’t access the dashboard or my account
- My app is down in a specific region
Error codes
502 Bad Gateway
The Fly proxy reached your Machine, but your app didn’t respond correctly. Common causes:
- Your app crashed mid-request
- Your app is listening on the wrong port
- The Machine is mid-deploy and your app isn’t ready yet
Check fly logs first. If you see 502s right after deploy, your app probably needs more startup time — increase your health check grace period. If it’s intermittent, check for OOM kills in your logs or SSH in and check memory usage with free or htop.
OOM kills look like crashes to the proxy. If your Machine is running out of memory, add more RAM.
503 Service Unavailable
No healthy Machines are available. Either all your Machines are stopped, they’re failing health checks, or there’s a regional issue.
If Machines show started but you’re still getting 503s, health checks are probably failing. Run fly checks list to see health check results. Errors like “connection refused” won’t appear in fly logs. Try checking both. See Health checks failing.
If all Machines are stopped and you expect auto-start to wake them, verify your [[services]] or [http_service] config — auto-start only works when the proxy knows where to route traffic. See Autostart and autostop.
Registry 401 errors
failed to push registry: 401 Unauthorized
This shows up during fly deploy. Two possible causes:
-
Your auth token is stale. Fix it:
fly auth logout fly auth login -
A Fly registry incident. Check status.flyio.net. If there’s an active incident, wait it out or subscribe for updates.
Note the image size limits: 8GB for standard Machines, 50GB for GPU Machines. If your image exceeds these limits, the push fails.
403 Forbidden
A 403 can come from different places:
From your app’s CORS configuration: The Fly proxy does not enforce CORS or act as a WAF. If you’re seeing 403 Invalid CORS request, that’s coming from your application’s CORS middleware, not from Fly. Check your app’s CORS configuration and make sure the Origin header your client sends is in your allowed origins list.
From third-party APIs (outbound): If your app calls external APIs and gets 403s, the third party may be blocking Fly’s IP ranges. This is common with Cloudflare-protected services. Fix: allocate an app-scoped egress IP with fly ips allocate-egress so your outbound traffic comes from a consistent IP you can allowlist, or contact the third-party service. You can read more about app-scoped egress IPs, as well as some caveats.
From object storage: S3-compatible storage returns 403 on permission issues. Double-check your bucket policy, access keys, and region configuration.
520 errors with Cloudflare
520 is a Cloudflare-specific code: “web server returned an unexpected response.” When using Cloudflare in front of Fly, this usually means Fly’s proxy sent a response header that Cloudflare doesn’t understand. The TE: trailers header is a known culprit.
If you’re using Cloudflare:
- Set SSL mode to Full (strict)
- Check your Cloudflare proxy settings
- If 520s are intermittent, they may correlate with specific response headers from your app
Note: if Cloudflare itself goes down, your Fly-hosted apps behind Cloudflare go down too. Fly is still running — the CDN in front of it isn’t.
Deployment failures
Build hangs: Waiting for depot builder…
Remote builds use Depot. When Depot is having issues, fly deploy hangs.
Quick fix — switch to the legacy remote builder:
This bypasses Depot but still builds remotely. If remote builds are down entirely, build on your own machine:
This requires Docker installed locally. Slower on upload, but doesn’t depend on any remote build infrastructure.
If your build fails with exit code: 1, that’s your Dockerfile failing — not a Fly problem. Debug it locally:
Release command failures
Release commands run in an ephemeral Machine before your app starts.
error running release_command machine: machine not found
This is usually a platform timing issue. Retry the deploy. If it persists, check fly logs to see why the release command Machine is exiting early.
The image hasn’t propagated to the registry yet. This happens with two-stage deploys (build + push in one command, deploy in another). Wait about a minute between stages, or retry.
Container registry rate limits
Fly has a caching proxy for Docker Hub pulls, so Docker Hub rate limits rarely affect builds. However, images hosted on other registries (like GitHub Container Registry) don’t go through this cache and can hit rate limits.
Options:
- Build locally:
fly deploy --local-only - Use the legacy builder:
fly deploy --depot=false - Push your image to a private registry or Docker Hub (which benefits from the cache), then deploy with
fly deploy --image <your-registry/image:tag>
Missing secrets or environment variables
If your app crashes on startup complaining about missing config:
fly secrets list
fly config env
Secrets set with fly secrets set are available as environment variables at runtime. They’re not available at build time. If you need build-time values, use [build.args] in fly.toml. Find out more about build-time secrets here.
Image size limit
Standard (non-GPU) Machines have an 8GB rootfs limit. GPU Machines allow up to 50GB.
If your image is too large:
- Use multi-stage Docker builds to drop build dependencies
- Move large assets to a volume or object storage
- Check for accidentally included files — add a
.dockerignore
Buildpack deploys
Buildpacks work but Dockerfiles are more reliable and give you more control. If you’re hitting buildpack issues, consider switching. The fly launch command generates a Dockerfile for most frameworks.
Your app isn’t listening on the right address
You’ll see this during deploy:
WARNING The app is not listening on the expected address
and will not be reachable by fly-proxy.
Your app must listen on 0.0.0.0 (not localhost, not 127.0.0.1) on the port specified by internal_port in your fly.toml.
If your fly.toml says:
[http_service]
internal_port = 8080
Then your app must listen on 0.0.0.0:8080.
Common mistakes:
- Listening on
127.0.0.1orlocalhost— this only accepts connections from inside the Machine. The Fly proxy connects from outside, so it can’t reach your app. Some frameworks (Rails, Django, Next.js) default to localhost. Set the host to0.0.0.0explicitly. - Port mismatch — your app listens on 3000, but
internal_portis 8080. Pick one and make them match.
Framework examples:
Rails:
bin/rails server -b 0.0.0.0 -p 8080
Express / Fastify (Node.js):
// Express
app.listen(8080, '0.0.0.0')
// Fastify
fastify.listen({ port: 8080, host: '0.0.0.0' })
Flask / Django (via Gunicorn):
gunicorn --bind 0.0.0.0:8080 myapp:app
Don’t use Flask’s or Django’s built-in dev servers in production. Use Gunicorn or another WSGI server.
FastAPI (Uvicorn):
uvicorn main:app --host 0.0.0.0 --port 8080
Health checks failing
Health checks tell the Fly proxy whether your Machine is ready to receive traffic. If a Machine fails its health checks, the proxy stops routing requests to it. If all your Machines fail health checks, your users get 503s. For the full picture on how health checks work, see Health checks.
Out of memory or high CPU
If your app OOMs, the Machine crashes and health checks fail by definition.
fly machine status <machine-id>
Look for OOM kill events. Fix: add memory.
For CPU-intensive apps, make sure you’ve selected an appropriate Machine size. CPU and RAM scale together in preset combinations.
Grace period
Your app needs time to start before health checks begin. Failed health checks are retried, but each failure adds backoff before the next attempt. If your app takes too long to become healthy, the deploy can fail.
Set a grace period to delay the first check:
[[services.tcp_checks]]
grace_period = "10s"
For apps with slow startup (Rails, Django, large JVM apps), you may need 15-30 seconds. If you’re not sure, start with 10s and increase if deploys keep failing.
Other health check failures
- Blocked accept loop: Your app’s main thread is busy and can’t accept new connections. Offload CPU work to background threads/workers.
- Non-200 responses: HTTP health checks expect a 200. If your health check endpoint returns redirects, auth challenges, or errors, the check fails. Use a dedicated
/healthzendpoint that always returns 200. - App panics on startup: Check
fly logsfor stack traces. Fix the crash. If it only happens on Fly (not locally), check your secrets and env vars.
Define an explicit HTTP health check rather than relying on the implicit one:
[[services.http_checks]]
grace_period = "10s"
interval = "15s"
method = "GET"
path = "/healthz"
timeout = "5s"
Cold starts
After a deploy or when a stopped Machine wakes up, the first request is slow. This is expected — the Machine needs to boot and your app needs to initialize.
Reduce cold start impact:
- Set a grace period on your health check so the proxy waits for your app. See Grace period.
- Keep a Machine warm with
min_machines_running = 1in your[http_service]config. This ensures at least one machine is always running. - Use
stopinstead ofsuspendif cold start latency matters more than wake-up speed.suspendis faster to resume but has clock issues. - Lighten your startup. For heavy frameworks, defer non-essential initialization. Make your health check endpoint respond before the full app is ready.
If the first request after deploy always fails (not just slow), your grace period is probably too short. The proxy sends the request, your app isn’t ready, and the request times out.
Machine lifecycle issues
Stuck Machines
Machines occasionally get stuck in a state (replacing, starting, created) and stop responding to commands.
Try these in order:
-
Restart it:
fly machine restart <machine-id> -
Force an update (any metadata change can unstick the platform state)::
fly machine update <machine-id> --yes --metadata foo=bar -
Force destroy (nuclear option — destroys the Machine):
fly machine destroy --force <machine-id>
After force-destroying, scale back up to replace it:
Machines stop immediately after starting
If your Machine starts and immediately stops, your app’s process is exiting. The Machine has nothing to run, so it shuts down.
- Make sure your Dockerfile has an explicit
CMD. Don’t rely on the base image default. - Test locally:
docker run <your-image>. If it exits immediately in Docker, it’ll exit immediately on Fly. - Check
fly logsfor your app’s exit code and any error output.
Suspend vs stop
stop shuts down the VM. suspend snapshots memory to disk and resumes later — faster wake-up, but with a tradeoff.
The clock problem: When a Machine resumes from suspend, the system clock is wrong for a brief period. It thinks it’s still the time when the Machine was suspended. This breaks:
- JWT validation — tokens appear to be issued in the future (
nbfclaim fails) - Cron jobs — scheduled tasks fire at the wrong time
- Cache TTLs — expiration times are off
- TLS certificate validation — cert timestamps don’t match
The clock corrects itself quickly, but if your app checks timestamps during the first moments after resume, things break.
Fix: If your app uses JWTs, time-sensitive scheduling, or certificate validation on startup, use stop instead of suspend:
[http_service]
auto_stop_machines = "stop"
Or add clock-skew tolerance to your JWT validation (a few seconds of leeway).
The init process
Fly injects a lightweight init process at runtime when your Machine starts. It doesn’t modify your image — it runs in front of your app inside the VM.
This init handles:
- Reaping orphaned child processes (PID 1 responsibilities)
- Forwarding signals from the host to your app
- Setting up networking and volume mounts
- Coordinating clean shutdowns
You don’t need tini, dumb-init, or s6-overlay in your Dockerfile. Fly’s init covers these responsibilities. It’s not a problem to keep them if they’re already there — they’ll just be redundant.
You can’t disable or replace Fly’s init. If you need setup scripts before your app starts, use a Docker ENTRYPOINT script that runs your setup and then execs your app.
Networking and connectivity
Custom domains and TLS
If your custom domain shows TLS errors, do an active check:
fly certs show <hostname>
You need either A and AAAA records, or a single CNAME record pointing to Fly (don’t mix CNAME with A/AAAA)
Using Cloudflare? Most TLS issues on Fly involve domains behind Cloudflare. Read Understanding Cloudflare before debugging further. For all other setups, see Custom domains.
Flycast (internal load balancing)
Flycast routes traffic between your Fly apps over the private network. Two gotchas:
force_https must be false. Flycast is HTTP-only. Don’t use force_https:
# Wrong for Flycast
[http_service]
force_https = true
# Right for Flycast
[http_service]
force_https = false
Plain TCP services need [[services]] with protocol = "tcp", not [http_service]:
[[services]]
internal_port = 8080
protocol = "tcp"
[[services.ports]]
handlers = []
port = 4321
Outbound connections
Raw TCP over shared IPv4 doesn’t work. Fly’s shared IPv4 addresses use the proxy, which needs SNI (from TLS) or a Host header (from HTTP) to route virtual host traffic. Non-HTTP, non-TLS TCP connections, such as unencrypted Redis, SMTP on port 25, or raw socket connections, fail on shared IPs because the proxy can’t identify which app to route to.
Fixes:
- Allocate a dedicated IPv4:
fly ips allocate-v4— gives your app its own IP, no virtual host routing needed - Use
.internaladdresses for services on Fly’s private network — these bypass the proxy entirely
SMTP: If you’re having trouble with outbound email, we recommend using a transactional email service (like Postmark, Resend, or SendGrid) rather than sending directly from your Machines.
CORS issues
If POST requests to your app return 403, it’s almost certainly your app’s CORS middleware. The Fly proxy does not have a WAF and does not enforce CORS.
- Check that the
Originheader your client sends is in your app’s allowed origins list - Make sure your app returns the correct
Access-Control-Allow-Origin0 headers on preflight (OPTIONS) responses - If it works via
curlbut fails in the browser, that confirms it’s a CORS issue in your app, not a Fly issue
Database connections
Managed Postgres
MPG clusters run on Fly’s private network and aren’t accessible over the public internet. Connection strings use .flympg.net domains, which resolve to private network addresses. See Create and connect to MPG for full details.
To connect from your local machine::
- Interactive psql:
fly mpg connect - Proxy to localhost:
fly mpg proxy— forwards a local port to your database - WireGuard: connect to your org’s private network, then use the
.flympg.netconnection string directly. Read more in this reference guide.
If fly mpg proxy times out, try fly mpg connect first to verify the cluster is healthy.
Redis and Valkey
IPv6 is required on Fly’s private network. Most Redis clients default to IPv4. If your connection fails with I/O errors:
// ioredis — set family: 6
const redis = new Redis(process.env.REDIS_URL, {
family: 6,
maxRetriesPerRequest: null,
enableReadyCheck: false,
});
For Upstash Redis on Fly, use the internal endpoint over IPv6, not the public TLS endpoint.
Volumes and disk errors
If you see filesystem errors like unable to read superblock, your volume is corrupted. This is rare but can happen after a hard crash.
If you have snapshots enabled:
fly volumes list
fly volumes snapshots list <volume-id>
fly volumes create <name> --snapshot-id <snapshot-id> --region <region>
If you don’t have snapshots, the data may be unrecoverable. Always enable snapshots for volumes with data you care about. See Volume snapshots.
Dashboard and account access
Can’t log in (GitHub SSO)
If GitHub SSO stops working and you can’t access the dashboard:
- Try
fly auth logoutthenfly auth loginfrom the CLI - If you need SSO removed from your account, email billing@fly.io — they verify ownership before making SSO changes
Token issues
If you can’t create or manage tokens:
Token management bugs occasionally appear in specific flyctl versions. Update flyctl first. If fly tokens create fails, check the community forum for known issues with your version.
Regional issues
Fly runs on bare metal in 17 regions. Individual hosts or regions can have issues independent of the rest.
Check status first: status.flyio.net
If your app is down in one region but the status page is clear, the issue might be specific to your host. Run:
Check which region your Machines are in and whether they’re healthy.
Mitigation: deploy to multiple regions. If all your Machines are in iad and iad has problems, your app is down. Spread across regions:
fly scale count 2 --region iad,ord
For databases, keep read replicas in a second region. For apps where latency matters, pick regions close to your users — lhr and ams for Western Europe, nrt and sin for Asia-Pacific.
If a region is down and you need to deploy urgently, scale into a healthy region:
fly scale count 1 --region ord