2026-04-08

What you’ll learn

  • The mental model behind Loki (labels, streams, chunks) and why it’s cost-effective
  • How to structure logs at the source (JSON, stable fields)
  • How to ingest logs with Promtail and parse them
  • How to query with LogQL to find, aggregate, and turn logs into metrics
  • How to build Grafana dashboards, link traces, and alert on log-derived metrics

Why Loki (and how it works)

Traditional log stacks index full text. Loki only indexes labels you choose, and stores raw log lines compressed in chunks. This makes it cheaper to run and scale.

Key concepts:

  • Log stream: a set of logs that share the exact same label set (e.g., {app="payments", env="prod", pod="p-123"})
  • Labels: key/value pairs used for indexing and filtering; keep these low-cardinality
  • Chunks: compressed blocks of log lines stored in object storage; queries scan only the relevant chunks
  • Queries: first filter by labels, then parse/filter lines, then optionally aggregate
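These pieces compose in a fixed order in every LogQL query: select streams by labels, optionally filter raw lines, then parse and filter fields. A sketch (label and field values here are illustrative; LogQL allows # comments):

```logql
{app="payments", env="prod"}   # label filter: picks streams from the index
  |= "timeout"                 # line filter: scans only the matching chunks
  | json                       # parser: extracts fields from each line
  | duration_ms > 500          # field filter: keeps only slow requests
```

The label filter is the only part that touches the index; everything after it runs over the selected chunks, which is why tight label selectors matter so much for speed.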

Quickstart: Loki, Promtail, Grafana (Docker Compose)

Create three files side-by-side and run docker compose up -d.

  1. docker-compose.yml:
version: "3.8"
services:
  loki:
    image: grafana/loki:2.9.4
    command: -config.file=/etc/loki/config.yml
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/config.yml
      - ./loki-data:/loki

  promtail:
    image: grafana/promtail:2.9.4
    command: -config.file=/etc/promtail/config.yml
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    depends_on:
      - loki

  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - loki
  2. loki-config.yml (single-node, local storage):
auth_enabled: false
server:
  http_listen_port: 3100
common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2023-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
ruler:
  alertmanager_url: http://localhost:9093
  3. promtail-config.yml (scrape local files and Docker logs):
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Example: your app JSON logs
  - job_name: app-logs
    static_configs:
      - targets: [localhost]
        labels:
          job: app
          app: payments
          env: dev
          __path__: /var/log/app/*.log
    pipeline_stages:
      - json:
          expressions:
            ts:
            level:
            msg:
            req_id:
            user_id:
            duration_ms:
      # Label only stable, low-cardinality fields
      - labels:
          level:
      # Drop noisy lines (example: health checks)
      - match:
          selector: '{app="payments"}'
          stages:
            - drop:
                source: msg
                expression: 'healthcheck'

  # Example: nginx access log
  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          env: dev
          __path__: /var/log/nginx/access.log
    pipeline_stages:
      - regex:
          expression: '^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" (?P<status>\d{3}) (?P<size>\d+)'
      # Label only small-cardinality fields
      - labels:
          status:
          method:

Start the stack:

docker compose up -d

Then open Grafana at http://localhost:3000 (user: admin, pass: admin) and add a Loki data source pointing to http://loki:3100.
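To sanity-check ingestion without waiting on Promtail, you can also POST a line straight to Loki's push API (/loki/api/v1/push). A minimal sketch of the payload it expects, in Python (the labels match this quickstart; actually sending it requires the stack to be up):

```python
import json
import time

def loki_push_payload(labels: dict, line: str) -> str:
    """Build the JSON body for Loki's push API:
    streams -> [{stream: <labels>, values: [[<ns timestamp>, <line>]]}]."""
    ts_ns = str(time.time_ns())  # Loki wants nanosecond epoch timestamps as strings
    return json.dumps({
        "streams": [
            {"stream": labels, "values": [[ts_ns, line]]}
        ]
    })

payload = loki_push_payload({"app": "payments", "env": "dev"}, "hello loki")
print(payload)
# POST to http://localhost:3100/loki/api/v1/push with Content-Type: application/json
```

If the push succeeds (HTTP 204), the line shows up in Explore under {app="payments", env="dev"} immediately.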


Structure your logs at the source (so queries are easy)

  • Prefer JSON logs. Consistent keys beat regex.
  • Include stable fields as labels via Promtail (env, app, service, region). Avoid labeling high-cardinality values such as user_id, req_id, path with IDs.
  • Record durations, status, and identifiers in JSON fields (not labels). Parse and filter them at query time.

Example JSON log line (file under /var/log/app/app.log):

{"ts":"2026-04-08T12:34:56Z","level":"error","app":"payments","msg":"charge failed","req_id":"abc-123","user_id":"u-42","order_id":"o-77","duration_ms":352,"err":"insufficient funds"}
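Emitting lines in exactly this shape from application code is a one-liner per event. A minimal sketch in Python (the field names follow the example above; the helper name is illustrative, not a standard API):

```python
import json
from datetime import datetime, timezone

def log_event(level: str, msg: str, **fields) -> str:
    """Serialize one structured log line: stable keys first, extras after.
    Identifiers (req_id, user_id) stay as JSON fields, never Loki labels."""
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "app": "payments",
        "msg": msg,
        **fields,
    }
    return json.dumps(record)

line = log_event("error", "charge failed",
                 req_id="abc-123", user_id="u-42", duration_ms=352)
print(line)  # append this to /var/log/app/app.log for Promtail to pick up
```

Because every line has the same keys, the Promtail json stage and LogQL's | json parser need no per-message special cases.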

LogQL by example

Open Explore in Grafana, pick your Loki data source, and try these.

  1. Find recent errors for the payments app
{app="payments", env="dev"} | json | level="error"
  2. Count errors over 5 minutes (per app)
sum by (app) (
  count_over_time({app="payments", env="dev"} | json | level="error" [5m])
)
  3. Error rate (lines/sec) over 5 minutes
sum by (app) (
  rate({app="payments", env="dev"} | json | level="error" [5m])
)
  4. Show only the fields you care about
{app="payments"} | json | level="error" |
  line_format "{{.ts}} {{.req_id}} {{.user_id}} {{.msg}}"
  5. Top 5 error messages in the last 10 minutes
topk(5, sum by (msg) (
  count_over_time({app="payments"} | json | level="error" | label_format msg="{{.msg}}" [10m])
))
  6. Numeric aggregations from logs (unwrap a number field)
  • Average latency over 5 minutes:
avg_over_time({app="payments"} | json | unwrap duration_ms [5m])
  • P99 latency over 10 minutes:
quantile_over_time(0.99, {app="payments"} | json | unwrap duration_ms [10m])
  7. Nginx: count 5xx by path. If you didn’t parse in Promtail, parse at query time with a pattern:
sum by (path) (
  count_over_time(
    {job="nginx"} |
    pattern "<ip> - - [<ts>] \"<method> <path> <proto>\" <status> <_>" |
    status =~ "5.." [10m]
  )
)
  8. Nginx: 5xx error rate percentage
(
  sum by (job)(rate({job="nginx"} | pattern "<ip> - - [<ts>] \"<method> <path> <proto>\" <status> <_>" | status =~ "5.." [5m]))
)
)
/
(
  sum by (job)(rate({job="nginx"} [5m]))
)

Tips:

  • Start with the tightest label filter you can, then parse. Labels shrink the search space.
  • Prefer | pattern or | json over heavy regex.
  • unwrap converts a parsed field into a sample for numeric functions.

From logs to dashboards, links, and alerts in Grafana

  1. Dashboards
  • Use a Logs panel to show parsed fields. Add Stat/Time series panels with queries that aggregate logs (e.g., error rate, P99 from unwrap).
  2. Derived fields (click-to-trace)
  • Settings > Data sources > Loki > Derived fields. Example:
    • Name: trace_id
    • Regex: "trace_id":"([a-f0-9-]+)"
    • URL: link to your tracing system (e.g., Tempo). Now clicking a log with a trace_id opens the trace.
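The derived-field regex runs against the raw log line, so it is worth verifying it captures what you expect before wiring up the link. A quick check of the regex above (the sample line and trace id are made up):

```python
import re

# Same regex as the derived-field config above
DERIVED = re.compile(r'"trace_id":"([a-f0-9-]+)"')

sample = '{"level":"error","msg":"charge failed","trace_id":"4bf92f35-77b3-4da6"}'
match = DERIVED.search(sample)
trace_id = match.group(1) if match else None
print(trace_id)  # → 4bf92f35-77b3-4da6
```

Note the single capture group: Grafana uses group 1 as the link value, so keep exactly one group in the expression.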
  3. Alerts on logs
  • Create a rule (Grafana Alerting or Loki ruler) based on a metrics-style LogQL expression. Example: high error ratio for payments:
groups:
- name: payments
  rules:
  - alert: HighErrorRate
    expr: |
      (
        sum(rate({app="payments"} | json | level="error" [5m]))
      )
      /
      (
        sum(rate({app="payments"} [5m]))
      ) > 0.05
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "Payments error rate >5% for 10m"

Best practices (save money, gain signal)

  • Labels: only stable, bounded-cardinality. Good: env, app, job, namespace, pod, instance. Avoid: user_id, req_id, raw path with IDs.
  • Prefer structured logs (JSON) from the app. If not possible, use Promtail pipelines to parse.
  • Drop noise at the edge. Promtail drop and match stages can remove health checks and debug spam.
  • Storage: for production, use object storage (S3/GCS) with the TSDB index (the successor to boltdb-shipper). Set retention per tenant.
  • Query efficiency: narrow time ranges, filter by labels first, then parse, then aggregate. Avoid unbounded regex.
  • Governance: scrub PII, set sensible log levels, rotate and retain based on compliance needs.
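For the PII point, Promtail’s replace pipeline stage can mask values at the edge, before they ever reach Loki. A sketch (the email regex is deliberately simple; tune it for your data):

```yaml
pipeline_stages:
  - replace:
      # Mask anything that looks like an email address in the raw line
      expression: '([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'
      replace: '***'
```

Scrubbing at ingestion means the sensitive value never lands in storage, which is far easier than redacting chunks after the fact.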

Troubleshooting

  • No logs in Grafana Explore:
    • Check Promtail targets at http://localhost:9080/targets and positions file.
    • Verify clients.url points to Loki and Loki is reachable.
    • Ensure your time range covers when logs were written.
  • Slow queries or 429s:
    • Tighten label selectors and time ranges.
    • Reduce cardinality (check the label browser in Explore).
    • Prefer json/pattern over complex regex.
  • Unexpectedly high cardinality:
    • Audit Promtail labels stage. Remove dynamic fields.

Cheat sheet

  • Filter by labels and substring:
{app="api", env="prod"} |= "timeout"
  • Parse JSON and filter on a field:
{job="app"} | json | level="warn"
  • Count lines over a window:
count_over_time({app="api"} |= "ERROR" [10m])
  • Rate (lines/sec):
rate({app="api"} [5m])
  • Numeric from field:
max_over_time({app="api"} | json | unwrap duration_ms [15m])

With the right labels, structured logs, and a handful of LogQL patterns, Loki turns raw text into actionable dashboards and alerts without breaking the bank.