Blindness
Production. A user writes: "the site is laggy". You open the site - everything's fast. You open the browser console - clean. And that's it, a dead end. You don't know what was slow for them, when, on which page, or why.
That's blindness. And the most frustrating part is that a Next.js app is a server. Page rendering, server actions, backend calls - all of it is code running on your machine, and you could know everything about it. But you don't, because you don't log it.
At some point I got tired of guessing and wired up logging: pino, Loki, Grafana. What follows is about what it gave me, why it's built this way, and whether it was worth it at all.
A note on the code. There will be a few snippets ahead, but they're pseudo-code - simplified down to the idea. Real versions are longer: types, error handling, edge cases. This is just the concept, not something to paste into a project.
What it gave me
Let me start from the end - with what I can see now and couldn't before.
- How long each backend endpoint takes. Not "the backend is slow", but "the prices endpoint responds in 1.2s at the 95th percentile, and it got worse after yesterday's deploy".
- Where exactly things break. A backend timeout is no longer "oops, the user couldn't see the prices", but a concrete line: which service, which request, how long we waited, why it failed.
- How fast the site opens for real people. Not on my laptop with fiber, but on a user's cheap Android phone - their LCP, their percentiles.
- The whole path of a single request. One filter and you see the entire chain: the page rendered, three backend endpoints were called, one of them lagged, the user got such-and-such LCP.
Here's a real (anonymized) line from Loki - the backend didn't respond within 5 seconds:
{
  "level": "error",
  "event": "fetch",
  "service": "catalog_prices",
  "url": "https://catalog-api.internal/v1/prices?city=almaty",
  "duration_ms": 5004,
  "aborted": true,
  "err_name": "TimeoutError",
  "requestId": "5f2869b1...",
  "release": "a1b2c3d4",
  "msg": "fetch_failed"
}
One line and the whole incident is clear: which endpoint, which city, how long we waited, what failed, on which build. Before, an investigation like this took half a day of back-and-forth: "what happened, when, for whom".
Why pino, Loki, Grafana
console.log writes strings. A human can read them, a machine can't. And when there are thousands of log lines per minute, reading them with your eyes is pointless: you need graphs, filters, percentiles. For that, a log line has to be structured - not text, but an object with fields.
That's exactly what pino does: every entry is a JSON object. It's fast, adds almost no overhead, and writes to stdout - the rest is up to the infrastructure.
Loki is the log storage, Grafana is where you look at the logs: graphs, dashboards, alerts. The combo is popular and cheap: Loki indexes only labels, not the content of lines (unlike Elasticsearch), so storing a lot in it is affordable.
The key idea: you're not "writing logs". You're collecting structured events you can later ask questions of. To make those questions easy, every event has a few service fields:
- event - what happened: fetch, page_render, client_web_vital. The main filtering axis.
- release - the build's git SHA. After a deploy you immediately see whether the new version started slowing down. Invaluable.
- requestId - it gets its own section, it's the real killer feature.
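To make the shape concrete, here is a dependency-free stand-in for what a JSON logger such as pino emits. It is only a sketch: the formatLine helper, the type names and the sample values are made up for illustration, and pino itself adds more (its own time and level handling, for one).

```typescript
// Hypothetical sketch: build one structured log line with the service fields.
type LogEventName = "fetch" | "page_render" | "client_web_vital";
type Level = "info" | "warn" | "error";

interface BaseFields {
  release: string; // git SHA of the build
  requestId?: string; // end-to-end id, when known
}

function formatLine(
  level: Level,
  event: LogEventName,
  fields: Record<string, unknown>,
  base: BaseFields,
): string {
  // Base fields are spread last so a call site cannot overwrite them by accident.
  return JSON.stringify({ level, time: Date.now(), event, ...fields, ...base });
}

const line = formatLine(
  "info",
  "fetch",
  { service: "catalog_prices", status: 200, duration_ms: 84 },
  { release: "a1b2c3d4", requestId: "req-1" },
);
console.log(line);
```

Every line is one JSON object, so event, release and requestId can be filtered on later without parsing free text.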
The end-to-end requestId - what made it all click
Individual logs are just a pile of lines. The magic starts when you can gather the lines of one request together.
For that you need an end-to-end identifier - a requestId. It's born at the start of the request (set by nginx or the ingress, or generated by us) and then ends up in every line: the render log, the logs of every backend call, even the client metrics - the server embeds the requestId in the HTML, and the client attaches it to its own logs.
Threading it as an argument through every function is hell. So it lives in AsyncLocalStorage - a "context" tied to the request's lifecycle:
// at the start of the request - once
runWithContext({ requestId, route }, () => renderPage())
// anywhere deeper, with no argument threading
const log = getLogger() // mixes in requestId from the context itself
log.info({ event: "fetch", ... })
What this gives you in practice: in Grafana you write one filter by requestId and get the whole history of the request in chronological order. It turns "logs" into "tracing". The difference is roughly like "I have photos" versus "I have video".
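For the curious, a slightly fuller sketch of the same idea on top of Node's AsyncLocalStorage. The names mirror the pseudo-code, but the bodies are my guess at a minimal version: the x-request-id header name is an assumption, and the logger here is a console.log stand-in, not pino.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

interface RequestContext {
  requestId: string;
  route: string;
}

const storage = new AsyncLocalStorage<RequestContext>();

// Reuse the id set by a proxy if there is one, otherwise generate it.
// The "x-request-id" header name is an assumption.
function resolveRequestId(headers: Headers): string {
  return headers.get("x-request-id") ?? randomUUID();
}

// Everything called inside fn (including awaited code) sees this context.
function runWithContext<T>(ctx: RequestContext, fn: () => T): T {
  return storage.run(ctx, fn);
}

// Returns a logger that mixes the current context into every line.
function getLogger() {
  const write = (level: string, fields: Record<string, unknown>): string => {
    const line = JSON.stringify({ level, ...fields, ...storage.getStore() });
    console.log(line);
    return line;
  };
  return {
    info: (fields: Record<string, unknown>) => write("info", fields),
    error: (fields: Record<string, unknown>) => write("error", fields),
  };
}

const requestId = resolveRequestId(new Headers({ "x-request-id": "5f2869b1" }));
const logged = runWithContext({ requestId, route: "/catalog" }, () =>
  getLogger().info({ event: "fetch", service: "catalog_prices" }),
);
```

Outside of runWithContext the store is undefined, so the same logger still works, it just writes lines without a requestId.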
Sampling - or Loki will ruin you
An unpleasant truth: in production there are LOTS of logs. Backend calls are about 90% of all lines, and almost all of them are boring: 200 OK, fast. Storing all of it is money spent for nothing.
But you can't just "log less" either - you'll lose exactly the one request that broke.
The solution is head-based sampling. At the start of the request you flip a coin once: log this request in full or not. The decision goes into the same context. After that: lucky (say, 10% of requests) - keep all of its events, the full trace; unlucky - drop the boring fetch logs of that request.
An important detail: errors, non-2xx responses and slow requests are always kept, bypassing all sampling. We only sample the green noise:
shouldDrop(line):
  if line.event is not fetch/axios: keep   // don't touch anything else
  if line.status >= 400: keep              // errors - always
  if line.duration_ms >= 1s: keep          // slowdowns - always
  else: drop, if the request didn't win the coin flip
Technically this is one hook on the logger itself - it decides whether to write a line or silently swallow it. Nothing changes at the call sites.
The result: the volume of fetch logs drops roughly 10x, while everything important - errors, slowdowns, and the full traces of 10% of requests - stays. The dashboards don't lie, incidents get analyzed.
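The same rules as a small typed function, as a sketch. The 10% rate and the 1s threshold come from the text; the function and constant names are illustrative, and the level check is an extra that the pseudo-code doesn't have, added so that failed calls with no status (a fast network error, say) are kept too.

```typescript
interface LogLine {
  event: string;
  level?: "info" | "warn" | "error";
  status?: number;
  duration_ms?: number;
}

const SAMPLE_RATE = 0.1; // share of requests logged in full
const SLOW_THRESHOLD_MS = 1000; // "slow" cutoff

// The coin flip: called once per request, the result lives in the request context.
function decideSampled(random: () => number = Math.random): boolean {
  return random() < SAMPLE_RATE;
}

// Returns true when the line should be silently dropped.
function shouldDrop(line: LogLine, sampled: boolean): boolean {
  if (line.event !== "fetch" && line.event !== "axios") return false; // only backend calls are sampled
  if (line.level === "error") return false; // failed calls: always keep (extra check)
  if (line.status !== undefined && line.status >= 400) return false; // error responses: always keep
  if ((line.duration_ms ?? 0) >= SLOW_THRESHOLD_MS) return false; // slowdowns: always keep
  return !sampled; // boring line: keep only for sampled requests
}

console.log(shouldDrop({ event: "fetch", status: 200, duration_ms: 40 }, false));
console.log(shouldDrop({ event: "fetch", status: 502, duration_ms: 40 }, false));
```

Passing the random source in as an argument is only there to make the coin flip testable.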
What exactly we log
Three types of events, in descending order of usefulness.
Backend calls. The main one. We wrap fetch in a function that measures the time and writes a log line. It also sets a timeout, so a hung backend doesn't hold our request forever:
loggedFetch(service, url, opts):
  start = now()
  try:
    res = await fetch(url, { ...opts, timeout: 5s })
    log.info({ event: "fetch", service, status: res.status, duration_ms: now()-start })
    return res
  catch err:
    log.error({ event: "fetch", service, aborted: isTimeout(err), duration_ms: now()-start })
    throw err
What it gives: the timings of every endpoint, p95, and - most importantly - timeouts. That line from the start of the article was born right here.
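One detail the pseudo-code hides: fetch has no timeout option, so in real code the timeout goes through an abort signal. Below is a sketch that uses AbortSignal.timeout; the injectable log and fetchImpl parameters are my addition for testability, and the logger is again a console.log stand-in.

```typescript
type Log = (level: "info" | "error", fields: Record<string, unknown>) => void;

const TIMEOUT_MS = 5000;

// Stand-in for the real logger: one JSON object per line on stdout.
const stdoutLog: Log = (level, fields) => console.log(JSON.stringify({ level, ...fields }));

// AbortSignal.timeout() makes fetch reject with an error named "TimeoutError".
function isTimeout(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { name?: unknown }).name === "TimeoutError";
}

async function loggedFetch(
  service: string,
  url: string,
  opts: RequestInit = {},
  log: Log = stdoutLog,
  fetchImpl: typeof fetch = fetch,
): Promise<Response> {
  const start = performance.now();
  const elapsed = () => Math.round(performance.now() - start);
  try {
    // Note: this replaces any signal the caller passed in opts.
    const res = await fetchImpl(url, { ...opts, signal: AbortSignal.timeout(TIMEOUT_MS) });
    log("info", { event: "fetch", service, url, status: res.status, duration_ms: elapsed() });
    return res;
  } catch (err) {
    log("error", {
      event: "fetch",
      service,
      url,
      duration_ms: elapsed(),
      aborted: isTimeout(err),
      err_name: err instanceof Error ? err.name : "unknown",
      msg: "fetch_failed",
    });
    throw err; // the caller still decides what to do with the failure
  }
}
```

A call site then looks like loggedFetch("catalog_prices", url) in place of fetch(url). If callers need their own abort signals as well, the two would have to be combined rather than overwritten.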
Page rendering. A wrapper around the render (in the App Router - around the server component, in the Pages Router - around getServerSideProps). It's also a convenient point where the requestId and the request context are born. What it gives: you see which page renders slowly and whether the render degraded after a deploy.
Web Vitals from the client. The only thing measured in the browser: LCP, CLS, INP - how the site feels to a live user. The Next.js useReportWebVitals hook hands over the metrics, a client-side logger buffers them and sends them to the server in a batch - on a timer or when the user leaves the page (via sendBeacon, so the send survives the tab closing). What it gives: real performance for real people, not for you on your laptop.
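The buffering part is small enough to sketch. This is an illustration, not the production logger: createVitalsBuffer and the sample shape are made-up names, the /api/vitals endpoint in the comments is an assumption, and the sender is injected so the snippet runs outside a browser.

```typescript
interface VitalSample {
  name: string; // "LCP", "CLS", "INP", ...
  value: number;
  requestId?: string; // taken from the HTML the server rendered
}

type Send = (body: string) => void;

// Collects metrics and hands them to `send` in one batch on flush().
function createVitalsBuffer(send: Send) {
  let queue: VitalSample[] = [];

  const push = (sample: VitalSample): void => {
    queue.push(sample);
  };

  const flush = (): void => {
    if (queue.length === 0) return; // nothing to send
    const batch = queue;
    queue = [];
    send(JSON.stringify({ event: "client_web_vital", metrics: batch }));
  };

  return { push, flush };
}

// Assumed browser wiring (not executed here):
//   const buffer = createVitalsBuffer((body) => navigator.sendBeacon("/api/vitals", body));
//   useReportWebVitals((metric) => buffer.push({ name: metric.name, value: metric.value }));
//   setInterval(buffer.flush, 10_000);
//   document.addEventListener("visibilitychange", () => {
//     if (document.visibilityState === "hidden") buffer.flush();
//   });

// Demo with a fake sender standing in for sendBeacon.
const sent: string[] = [];
const buffer = createVitalsBuffer((body) => sent.push(body));
buffer.push({ name: "LCP", value: 2310, requestId: "req-1" });
buffer.push({ name: "CLS", value: 0.04, requestId: "req-1" });
buffer.flush();
console.log(sent[0]);
```

Batching keeps the number of requests down: a page view produces several metrics, and they all leave in one payload.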
Grafana: questions that now have answers
Logs are in Loki - now it's down to LogQL. A few queries that genuinely paid off.
Slow backend endpoints:
{app="storefront"} | json | event="fetch" | duration_ms > 1000
All timeouts:
{app="storefront"} | json | event="fetch" | aborted="true"
The whole path of a single request - that same tracing:
{app="storefront"} | json | requestId="5f2869b1..."
These queries go onto a dashboard: p95 per endpoint, a timeout counter, the LCP distribution. Set it up once - and you stop guessing "is it slow or does it just feel slow", you look at the numbers.
Practical pitfalls
Briefly - the things I got stuck on.
- Middleware runs in the Edge Runtime. pino doesn't work there: the Node.js APIs it needs aren't available. If you log in middleware, it's console.log with hand-built JSON only.
- Delivery to Loki is a fork in the road. The pino-loki transport ships logs straight from the app (quick to set up, but pino transports run in worker threads and the Next.js bundler breaks their paths). Or the app writes to stdout and an agent next to it (Grafana Alloy, Promtail, Vector) ships them. In a cluster I went with the second: the app shouldn't be responsible for delivery - if Loki goes down, the agent handles buffering and retries.
- Don't make unique fields Loki labels. Using requestId, duration_ms or url as a label is an index explosion. Labels are only for low-cardinality values: app, env, release. The rest is extracted from the JSON right in the query.
- redact is mandatory. Tokens, cookies, phone numbers must not leak into the logs. Configured once in the pino config.
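The first and last pitfalls meet in middleware: with no pino there, there's no pino redact either, so hand-built JSON needs its own masking. A rough sketch of what that could look like; the key list, the edgeLog name and the "middleware" event name are all assumptions for the example.

```typescript
// Assumed list of keys whose values must never reach the logs.
const SENSITIVE_KEYS = new Set(["authorization", "cookie", "token", "phone"]);

// Recursively masks values of sensitive keys; arrays and nested objects are walked.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([key, inner]) => [
        key,
        SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redact(inner),
      ]),
    );
  }
  return value;
}

// Uses only JSON and console, both of which exist in the Edge Runtime.
function edgeLog(level: "info" | "error", fields: Record<string, unknown>): string {
  const line = JSON.stringify({ level, time: Date.now(), ...(redact(fields) as object) });
  console.log(line);
  return line;
}

const out = edgeLog("info", {
  event: "middleware",
  path: "/catalog",
  headers: { cookie: "session=abc", "user-agent": "test" },
});
```

Matching by key name is crude: it won't catch a token hidden inside a URL query string, so values like url still need a look of their own.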
Was it worth it
Honestly - yes, a lot.
Not because of pretty dashboards. But because the whole mode of work changed. Debugging production used to be archaeology: you gather fragments, interview people, guess. Now it's a question with an answer - open Grafana, filter, see.
Time-wise, the skeleton - the logger, the context, the fetch wrapper - is a couple of days. Sampling and client metrics were tuned in later, as Loki started bloating and the questions piled up.
Who shouldn't bother: if you have a landing page or a small site with no server logic - this is overkill, your host's built-in analytics is enough. Logging at this level pays off where Next.js really works as a backend: calls a pile of services, renders dynamic content, and the cost of a minute of downtime isn't zero.
For a project like that, this isn't "one more trendy thing in the stack". It's the difference between "fixing things blind" and "seeing".