2026-04-14

SQL remains the fastest way to turn raw data into insight. This guide gives developers practical SQL patterns you can paste into your warehouse to answer real analytics questions.

Assumed schema (adapt as needed):

  • users(user_id, signup_at, marketing_channel)
  • events(user_id, event_name, event_at, metadata)
  • orders(order_id, user_id, amount, status, created_at)
  • experiment_assignments(user_id, experiment, variant, assigned_at)

Notes:

  • Examples aim for PostgreSQL/ANSI style. For BigQuery: use TIMESTAMP/DATE functions and APPROX_* functions. For Snowflake: DATE_TRUNC, APPROX_PERCENTILE, QUALIFY (sketched below), etc.
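
A minimal sketch of QUALIFY (supported by Snowflake and BigQuery), applied to the deduplication pattern from section 8; it filters on a window function without a subquery:

SELECT *
FROM events
WHERE TRUE  -- some engines (e.g., BigQuery) want a WHERE/GROUP BY/HAVING alongside QUALIFY; harmless elsewhere
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY user_id, event_name, event_at ORDER BY event_at
) = 1;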

1) Quick EDA and Data Quality Checks

Sanity-check the shape of your data before analysis.

Count rows by table:

SELECT 'users'  AS table_name, COUNT(*) AS row_count FROM users
UNION ALL
SELECT 'events', COUNT(*) FROM events
UNION ALL
SELECT 'orders', COUNT(*) FROM orders;

Date ranges and nulls:

-- Date coverage
SELECT MIN(signup_at) AS min_signup, MAX(signup_at) AS max_signup FROM users;

-- Null checks (ANSI)
SELECT
  SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS null_user_id,
  SUM(CASE WHEN event_name IS NULL THEN 1 ELSE 0 END) AS null_event_name
FROM events;

Detect duplicates (exact same event tuple):

SELECT user_id, event_name, event_at, COUNT(*) AS dupes
FROM events
GROUP BY 1,2,3
HAVING COUNT(*) > 1
ORDER BY dupes DESC
LIMIT 50;

Top categories and distributions:

-- Top traffic sources
SELECT marketing_channel, COUNT(*) AS users
FROM users
GROUP BY 1
ORDER BY users DESC
LIMIT 10;

-- Order amount histogram (Postgres width_bucket; amounts equal to the upper bound of 500 land in overflow bucket 11)
SELECT width_bucket(amount, 0, 500, 10) AS bucket, COUNT(*) AS n
FROM orders
WHERE amount BETWEEN 0 AND 500
GROUP BY 1
ORDER BY 1;

2) Core Time-Series Metrics

Daily Active Users (DAU):

SELECT DATE_TRUNC('day', event_at) AS day,
       COUNT(DISTINCT user_id)     AS dau
FROM events
WHERE event_name = 'session_start'
GROUP BY 1
ORDER BY 1;

7-day moving average of DAU (smooths volatility):

WITH daily AS (
  SELECT DATE_TRUNC('day', event_at) AS day,
         COUNT(DISTINCT user_id)     AS dau
  FROM events
  WHERE event_name = 'session_start'
  GROUP BY 1
)
SELECT day,
       dau,
       AVG(dau) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS dau_ma7
FROM daily
ORDER BY day;

New vs. returning users by day:

WITH first_seen AS (
  SELECT user_id, MIN(event_at) AS first_event_at
  FROM events
  GROUP BY 1
)
SELECT DATE_TRUNC('day', e.event_at) AS day,
       COUNT(DISTINCT CASE WHEN e.event_at::date = f.first_event_at::date THEN e.user_id END) AS new_users,
       COUNT(DISTINCT CASE WHEN e.event_at::date >  f.first_event_at::date THEN e.user_id END) AS returning_users
FROM events e
JOIN first_seen f USING (user_id)
GROUP BY 1
ORDER BY 1;

3) Funnel Analysis (View → Add to Cart → Purchase)

Find first timestamp per user for each step and measure drop-offs:

WITH steps AS (
  SELECT
    user_id,
    MIN(CASE WHEN event_name = 'view_product' THEN event_at END) AS step1_view,
    MIN(CASE WHEN event_name = 'add_to_cart' THEN event_at END)  AS step2_cart,
    MIN(CASE WHEN event_name = 'purchase' THEN event_at END)     AS step3_buy
  FROM events
  WHERE event_at >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY 1
), eligible AS (
  SELECT * FROM steps WHERE step1_view IS NOT NULL
)
SELECT
  COUNT(*)                                                        AS users_at_step1,
  COUNT(*) FILTER (WHERE step2_cart IS NOT NULL AND step2_cart >= step1_view) AS users_at_step2,
  COUNT(*) FILTER (WHERE step2_cart IS NOT NULL AND step2_cart >= step1_view
                     AND step3_buy IS NOT NULL AND step3_buy >= step2_cart) AS users_at_step3
FROM eligible;

Add conversion rates:

WITH base AS (
  SELECT
    COUNT(*)::float AS s1,
    COUNT(*) FILTER (WHERE step2_cart IS NOT NULL AND step2_cart >= step1_view)::float AS s2,
    COUNT(*) FILTER (WHERE step2_cart IS NOT NULL AND step2_cart >= step1_view
                       AND step3_buy IS NOT NULL AND step3_buy >= step2_cart)::float AS s3
  FROM (
    SELECT
      user_id,
      MIN(CASE WHEN event_name = 'view_product' THEN event_at END) AS step1_view,
      MIN(CASE WHEN event_name = 'add_to_cart' THEN event_at END)  AS step2_cart,
      MIN(CASE WHEN event_name = 'purchase' THEN event_at END)     AS step3_buy
    FROM events
    WHERE event_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY 1
  ) s
  WHERE step1_view IS NOT NULL
)
SELECT s1, s2, s3,
       s2 / NULLIF(s1, 0) AS view_to_cart,
       s3 / NULLIF(s2, 0) AS cart_to_purchase,
       s3 / NULLIF(s1, 0) AS view_to_purchase
FROM base;

Tip: The timestamp comparisons above enforce ordering between each user’s first occurrence of consecutive steps (first add_to_cart at or after first view, first purchase at or after first cart). They can still exclude a user whose first add_to_cart predates their first view even if a later cart followed a view; for strict sequencing, look up the earliest qualifying event after the prior step, as sketched below.
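
A sketch for the first two steps over the same 30-day window (extend the same pattern for the purchase step):

WITH first_view AS (
  SELECT user_id, MIN(event_at) AS viewed_at
  FROM events
  WHERE event_name = 'view_product'
    AND event_at >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY 1
), first_cart_after_view AS (
  -- earliest add_to_cart strictly after the user's first view
  SELECT e.user_id, MIN(e.event_at) AS carted_at
  FROM events e
  JOIN first_view v USING (user_id)
  WHERE e.event_name = 'add_to_cart'
    AND e.event_at > v.viewed_at
  GROUP BY 1
)
SELECT COUNT(*)         AS users_viewed,
       COUNT(c.user_id) AS users_carted_after_view
FROM first_view v
LEFT JOIN first_cart_after_view c USING (user_id);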

4) Cohorts and Retention

Cohort by signup week; compute retention by weeks since signup.

WITH user_cohort AS (
  SELECT user_id,
         DATE_TRUNC('week', signup_at)::date AS cohort_week
  FROM users
), activity AS (
  SELECT e.user_id,
         DATE_TRUNC('week', e.event_at)::date AS active_week
  FROM events e
  GROUP BY 1,2
), joined AS (
  SELECT c.cohort_week,
         a.active_week,
         (a.active_week - c.cohort_week) / 7 AS weeks_since,  -- date - date is an integer day count in Postgres
         a.user_id
  FROM user_cohort c
  JOIN activity a USING (user_id)
  WHERE a.active_week >= c.cohort_week
), cohort_size AS (
  SELECT cohort_week, COUNT(DISTINCT user_id) AS users_in_cohort
  FROM user_cohort
  GROUP BY 1
)
SELECT j.cohort_week,
       j.weeks_since,
       COUNT(DISTINCT j.user_id)                             AS active_users,
       MAX(cs.users_in_cohort)                               AS cohort_users,
       COUNT(DISTINCT j.user_id)::decimal / NULLIF(MAX(cs.users_in_cohort),0) AS retention_rate
FROM joined j
JOIN cohort_size cs USING (cohort_week)
GROUP BY 1,2
ORDER BY 1,2;

This returns a tidy table you can pivot to a cohort heatmap.
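
To pivot in SQL itself, conditional aggregation is portable across engines. A sketch, assuming the query above has been saved as a view named weekly_retention (an illustrative name) with the same output columns:

SELECT cohort_week,
       MAX(CASE WHEN weeks_since = 0 THEN retention_rate END) AS week_0,
       MAX(CASE WHEN weeks_since = 1 THEN retention_rate END) AS week_1,
       MAX(CASE WHEN weeks_since = 2 THEN retention_rate END) AS week_2,
       MAX(CASE WHEN weeks_since = 3 THEN retention_rate END) AS week_3
FROM weekly_retention
GROUP BY 1
ORDER BY 1;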

5) Customer LTV (90-day) by Cohort

WITH orders_90 AS (
  SELECT o.user_id,
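         -- revenue booked within 90 days of signup; consider also filtering
         -- o.status (e.g., completed only), depending on your revenue definition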
         SUM(CASE WHEN o.created_at <= u.signup_at + INTERVAL '90 days' THEN o.amount ELSE 0 END) AS rev_90
  FROM orders o
  JOIN users u USING (user_id)
  GROUP BY 1
), cohorts AS (
  SELECT user_id, DATE_TRUNC('month', signup_at)::date AS cohort_month FROM users
)
SELECT c.cohort_month,
       COUNT(*)                                AS users,
       SUM(COALESCE(o.rev_90,0))               AS total_rev_90,
       AVG(COALESCE(o.rev_90,0))               AS avg_ltv_90
FROM cohorts c
LEFT JOIN orders_90 o USING (user_id)
GROUP BY 1
ORDER BY 1;

6) A/B Testing Basics in SQL

Compute conversion and its standard error per variant:

WITH conversions AS (
  SELECT a.variant,
         a.user_id,
         MAX(CASE WHEN o.order_id IS NOT NULL THEN 1 ELSE 0 END) AS converted
  FROM experiment_assignments a
  LEFT JOIN orders o ON o.user_id = a.user_id
                    AND o.created_at >= a.assigned_at
                    AND o.created_at <  a.assigned_at + INTERVAL '7 days'
  WHERE a.experiment = 'checkout_redesign'
  GROUP BY 1,2
), agg AS (
  SELECT variant,
         COUNT(*)::float AS n,
         AVG(converted::float) AS cr
  FROM conversions
  GROUP BY 1
)
SELECT variant,
       cr AS conversion_rate,
       SQRT(cr * (1 - cr) / n) AS se_conversion_rate
FROM agg;

To compare two variants, compute the difference in conversion and its standard error using pooled variance or a simple delta method. For rigorous inference, consider a stats library outside SQL (e.g., Welch’s t-test, CUPED, or Bayesian methods).
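
As a rough in-SQL sketch of that comparison, here is the difference in conversion with an unpooled (delta-method) standard error and z-score. It assumes exactly two variants labeled 'control' and 'treatment' (adjust to your naming); |z| > 1.96 roughly corresponds to p < 0.05 two-sided:

WITH conversions AS (
  SELECT a.variant,
         a.user_id,
         MAX(CASE WHEN o.order_id IS NOT NULL THEN 1 ELSE 0 END) AS converted
  FROM experiment_assignments a
  LEFT JOIN orders o ON o.user_id = a.user_id
                    AND o.created_at >= a.assigned_at
                    AND o.created_at <  a.assigned_at + INTERVAL '7 days'
  WHERE a.experiment = 'checkout_redesign'
  GROUP BY 1,2
), agg AS (
  SELECT variant, COUNT(*)::float AS n, AVG(converted::float) AS cr
  FROM conversions
  GROUP BY 1
)
SELECT t.cr - c.cr AS lift,
       SQRT(t.cr * (1 - t.cr) / t.n + c.cr * (1 - c.cr) / c.n) AS se_diff,
       (t.cr - c.cr)
         / NULLIF(SQRT(t.cr * (1 - t.cr) / t.n + c.cr * (1 - c.cr) / c.n), 0) AS z_score
FROM agg t
JOIN agg c ON t.variant = 'treatment' AND c.variant = 'control';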

7) Sessionization (30-minute gap)

WITH lagged AS (
  SELECT
    user_id,
    event_at,
    LAG(event_at) OVER (PARTITION BY user_id ORDER BY event_at) AS prev_event_at
  FROM events
), flagged AS (
  SELECT user_id,
         event_at,
         -- a row opens a new session if it is the user's first event or
         -- follows a gap of more than 30 minutes
         CASE WHEN prev_event_at IS NULL
                OR event_at - prev_event_at > INTERVAL '30 minutes'
              THEN 1 ELSE 0 END AS new_session
  FROM lagged
), numbered AS (
  SELECT user_id,
         event_at,
         SUM(new_session) OVER (PARTITION BY user_id ORDER BY event_at ROWS UNBOUNDED PRECEDING) AS session_num
  FROM flagged
)
)
SELECT user_id, session_num,
       MIN(event_at) AS session_start,
       MAX(event_at) AS session_end,
       EXTRACT(EPOCH FROM (MAX(event_at) - MIN(event_at))) / 60.0 AS session_minutes,
       COUNT(*) AS events
FROM numbered
GROUP BY 1,2
ORDER BY user_id, session_num;

8) Reusable Building Blocks

Date spine (keeps a time series dense even on days with no data):

-- Postgres
WITH dates AS (
  SELECT generate_series(
           (SELECT MIN(DATE_TRUNC('day', signup_at))::date FROM users),
           (SELECT MAX(DATE_TRUNC('day', event_at))::date   FROM events),
           INTERVAL '1 day'
         )::date AS d
)
SELECT d FROM dates;
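
A common follow-up: left-join daily metrics onto the spine so days with no events appear as zeros instead of vanishing. A sketch over a fixed 30-day window:

WITH dates AS (
  SELECT generate_series(CURRENT_DATE - INTERVAL '30 days',
                         CURRENT_DATE,
                         INTERVAL '1 day')::date AS d
), daily AS (
  SELECT DATE_TRUNC('day', event_at)::date AS day,
         COUNT(DISTINCT user_id)           AS dau
  FROM events
  WHERE event_at >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY 1
)
SELECT s.d AS day,
       COALESCE(dl.dau, 0) AS dau  -- zero-fill days with no events
FROM dates s
LEFT JOIN daily dl ON dl.day = s.d
ORDER BY 1;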

Percentiles:

-- Postgres exact percentile
SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY amount) AS p95
FROM orders;

-- BigQuery approx
-- SELECT APPROX_QUANTILES(amount, 100)[OFFSET(95)] AS p95 FROM orders;

Deduplicating events:

WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
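           -- ORDER BY is an arbitrary tie-break here (the partition keys
           -- already match); add another column if you need a deterministic pick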
           PARTITION BY user_id, event_name, event_at ORDER BY event_at
         ) AS rn
  FROM events
)
SELECT * FROM ranked WHERE rn = 1;

Anti-join to find orphan rows (e.g., orders without users):

SELECT o.*
FROM orders o
LEFT JOIN users u USING (user_id)
WHERE u.user_id IS NULL;

9) Performance and Correctness Tips

  • Filter early by date partitions (e.g., WHERE event_at >= CURRENT_DATE - INTERVAL '90 days').
  • Index/join keys: user_id, event_at (or event_date), order_id.
  • Pre-aggregate large event tables into daily summaries for most dashboards (see the sketch after this list).
  • Avoid COUNT(DISTINCT ...) over huge windows; consider approximate functions (e.g., HyperLogLog, APPROX_COUNT_DISTINCT).
  • Be explicit about time zones; compute event_date using the same canonical TZ.
  • Watch CTE behavior: in some engines they’re optimization fences; in others they inline. Use them for readability and test performance.
  • Late-arriving data: design reprocessing by date partition; avoid mutable backfills that break dashboards.
  • Validate joins: compare row counts before/after; sample anomalies.
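
For the pre-aggregation tip, a minimal sketch of a daily rollup (table and column names are illustrative; rebuild it by date partition to absorb late-arriving data):

CREATE TABLE events_daily AS
SELECT DATE_TRUNC('day', event_at)::date AS event_date,
       event_name,
       COUNT(*)                AS events,
       COUNT(DISTINCT user_id) AS users
FROM events
GROUP BY 1, 2;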

10) A Practical Workflow Checklist

  1. Define the business question and metric definition in plain language.
  2. Run EDA and data quality checks (ranges, nulls, duplicates).
  3. Build a date spine and choose consistent time zone/granularity.
  4. Write minimal, testable CTEs; validate each step’s row counts (see the sketch after this list).
  5. Add metrics (cohorts, funnels, retention) and cross-check with a second method.
  6. Optimize (filters, partitions, approximations) only after correctness.
  7. Package as a view or incremental model for reuse.
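
For step 4, a lightweight pattern is to count each CTE side by side while you build (the CTE bodies here are illustrative):

WITH raw AS (
  SELECT * FROM events
  WHERE event_at >= CURRENT_DATE - INTERVAL '7 days'
), deduped AS (
  SELECT DISTINCT user_id, event_name, event_at FROM raw
)
SELECT 'raw' AS step, COUNT(*) AS row_count FROM raw
UNION ALL
SELECT 'deduped', COUNT(*) FROM deduped;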

Wrap-up

Mastering a handful of SQL patterns unlocks most product analytics: time series, funnels, cohorts, retention, LTV, and experiments. Start with correctness, then scale with partitions, summaries, and approximations.