LiveTemplate Scaling Guide

Target Audience: DevOps engineers, SREs, and developers deploying LiveTemplate applications at scale.

Last Updated: 2025-11-01


Overview

This guide covers scaling LiveTemplate applications from single-host prototypes to production systems handling millions of concurrent WebSocket connections.


Scaling Tiers

Tier 1: Single Host (Hobby)

Capacity: <1,000 concurrent connections
Infrastructure: 1 instance, no Redis
Cost: $5-20/month (VPS)

Recommended For:

Configuration:

sessionStore := livetemplate.NewMemorySessionStore()
handler := livetemplate.Mount(rootStore,
    livetemplate.WithSessionStore(sessionStore),
    livetemplate.WithMaxConnections(1000),
)

Infrastructure:

Limitations:


Tier 2: Small Production (Startup)

Capacity: 1K-10K concurrent connections
Infrastructure: 1-2 instances, Redis optional
Cost: $50-200/month

Recommended For:

Configuration:

// Option A: Still single-host with Redis for persistence
redisClient := redis.NewClient(&redis.Options{
    Addr: os.Getenv("REDIS_URL"),
})
sessionStore := livetemplate.NewRedisSessionStore(redisClient,
    livetemplate.WithFallbackToMemory(true),
)

handler := livetemplate.Mount(rootStore,
    livetemplate.WithSessionStore(sessionStore),
    livetemplate.WithMaxConnections(5000),
)

Infrastructure:

Benefits Over Tier 1:


Tier 3: Production Scale (SaaS)

Capacity: 10K-100K concurrent connections
Infrastructure: 2-10 instances, Redis Sentinel
Cost: $500-2000/month

Recommended For:

Configuration:

// Multi-instance with Redis Sentinel for HA
redisClient := redis.NewFailoverClient(&redis.FailoverOptions{
    MasterName:    "mymaster",
    SentinelAddrs: []string{"sentinel1:26379", "sentinel2:26379"},
})

sessionStore := livetemplate.NewRedisSessionStore(redisClient)
pubsubBroadcaster := livetemplate.NewRedisBroadcaster(redisClient)

handler := livetemplate.Mount(rootStore,
    livetemplate.WithSessionStore(sessionStore),
    livetemplate.WithPubSubBroadcaster(pubsubBroadcaster),
    livetemplate.WithMaxConnections(10000),
    livetemplate.WithMaxConnectionsPerGroup(500),
)

Infrastructure:

Deployment Pattern:

Key Metrics to Monitor:


Tier 4: Enterprise Scale

Capacity: 100K-1M+ concurrent connections
Infrastructure: 10-100+ instances, Redis Cluster
Cost: $5K-50K+/month

Recommended For:

Configuration:

// Redis Cluster for horizontal sharding
redisClient := redis.NewClusterClient(&redis.ClusterOptions{
    Addrs: []string{
        "redis-node1:6379",
        "redis-node2:6379",
        "redis-node3:6379",
    },
})

sessionStore := livetemplate.NewRedisSessionStore(redisClient)
pubsubBroadcaster := livetemplate.NewRedisBroadcaster(redisClient)

handler := livetemplate.Mount(rootStore,
    livetemplate.WithSessionStore(sessionStore),
    livetemplate.WithPubSubBroadcaster(pubsubBroadcaster),
    livetemplate.WithMaxConnections(50000),
    livetemplate.WithMaxConnectionsPerGroup(1000),
    livetemplate.WithWebSocketCompression(true), // M3 feature
)

Infrastructure:

Architecture Patterns:


Migration Guide: Memory to Redis Session Store

This guide walks through migrating from in-memory session storage to Redis-backed storage for horizontal scaling.

When to Migrate

Stay with MemorySessionStore if:

Migrate to RedisSessionStore when:

Prerequisites

  1. Redis Server: Deploy Redis (Standalone, Sentinel, or Cluster)
  2. Go Redis Client: Install github.com/redis/go-redis/v9
  3. State Serialization: Ensure all State types are gob-serializable

Step-by-Step Migration

Step 1: Set Up Redis

Development (Docker):

docker run -d \
  --name livetemplate-redis \
  -p 6379:6379 \
  redis:7-alpine \
  redis-server --appendonly yes

Production (Managed Service):

Step 2: Register State Types for Serialization

LiveTemplate uses Go's encoding/gob for serialization, which requires registering custom types.

Before (works with MemorySessionStore):

// State holds data (cloned per session)
type TodoState struct {
    Items []Todo
}

// Controller holds dependencies (singleton)
type TodoController struct {
    DB *sql.DB
}

// Action method
func (c *TodoController) Add(state TodoState, ctx *livetemplate.Context) (TodoState, error) {
    todo := Todo{Title: ctx.GetString("title")}
    state.Items = append(state.Items, todo)
    return state, nil
}

After (required for RedisSessionStore):

type TodoState struct {
    Items []Todo
}

type TodoController struct {
    DB *sql.DB
}

func (c *TodoController) Add(state TodoState, ctx *livetemplate.Context) (TodoState, error) {
    todo := Todo{Title: ctx.GetString("title")}
    state.Items = append(state.Items, todo)
    return state, nil
}

// Register all State types in init()
func init() {
    gob.Register(&TodoState{})
    gob.Register(&Todo{})  // Register nested types too
}

Why? Gob preserves concrete type information behind interface values only for registered types. Without registration, serializing session state fails with an error like "gob: type not registered for interface".

Step 3: Update Application Code

Before (MemorySessionStore):

package main

import (
    "net/http"

    "github.com/livetemplate/livetemplate"
)

func main() {
    // In-memory session store (default)
    sessionStore := livetemplate.NewMemorySessionStore()

    controller := &AppController{}
    state := &AppState{}
    handler := livetemplate.Mount(controller, livetemplate.AsState(state),
        livetemplate.WithSessionStore(sessionStore),
        livetemplate.WithMaxConnections(1000),
    )

    http.Handle("/", handler)
    http.ListenAndServe(":8080", nil)
}

After (RedisSessionStore):

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "time"

    "github.com/livetemplate/livetemplate"
    "github.com/redis/go-redis/v9"
)

func main() {
    // Connect to Redis
    redisClient := redis.NewClient(&redis.Options{
        Addr:     os.Getenv("REDIS_URL"), // e.g., "localhost:6379"
        Password: os.Getenv("REDIS_PASSWORD"),
        DB:       0,
    })

    // Verify Redis connection
    if err := redisClient.Ping(context.Background()).Err(); err != nil {
        log.Fatalf("Failed to connect to Redis: %v", err)
    }

    // Create Redis session store with fallback
    sessionStore := livetemplate.NewRedisSessionStore(redisClient,
        livetemplate.WithSessionTTL(24*time.Hour),
        livetemplate.WithFallbackToMemory(true), // Graceful degradation
    )

    controller := &AppController{}
    state := &AppState{}
    handler := livetemplate.Mount(controller, livetemplate.AsState(state),
        livetemplate.WithSessionStore(sessionStore),
        livetemplate.WithMaxConnections(10000), // Can handle more now
    )

    http.Handle("/", handler)
    http.ListenAndServe(":8080", nil)
}

Step 4: Configure Environment Variables

Development (.env):

REDIS_URL=localhost:6379
REDIS_PASSWORD=

Production (Kubernetes Secret):

apiVersion: v1
kind: Secret
metadata:
  name: redis-credentials
type: Opaque
stringData:
  redis-url: "redis.production.svc.cluster.local:6379"
  redis-password: "your-secure-password"

Deployment:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: app
        env:
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: redis-credentials
              key: redis-url
        - name: REDIS_PASSWORD
          valueFrom:
            secretKeyRef:
              name: redis-credentials
              key: redis-password

Step 5: Update Health Checks

Add Redis health check to ensure instance is ready before accepting traffic.

import (
    "net/http"

    "github.com/livetemplate/livetemplate"
)

func main() {
    // ... Redis setup ...

    sessionStore := livetemplate.NewRedisSessionStore(redisClient)

    // Health check endpoints
    http.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("OK"))
    })

    http.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
        // Check Redis connectivity
        if err := sessionStore.Ping(); err != nil {
            http.Error(w, "Redis unavailable", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("READY"))
    })

    http.Handle("/", handler)
    http.ListenAndServe(":8080", nil)
}

Kubernetes Probe Configuration:

Configure liveness and readiness probes to ensure Kubernetes can properly manage your application lifecycle.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: livetemplate-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: livetemplate
  template:
    metadata:
      labels:
        app: livetemplate
    spec:
      containers:
      - name: app
        image: your-registry/livetemplate-app:latest
        ports:
        - containerPort: 8080
          name: http
        env:
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: redis-credentials
              key: redis-url

        # Liveness Probe: Is the application running?
        # Failure = Restart container
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10   # Wait 10s after container starts
          periodSeconds: 30          # Check every 30s
          timeoutSeconds: 5          # Request timeout
          successThreshold: 1        # 1 success = healthy
          failureThreshold: 3        # 3 failures = restart (90s total)

        # Readiness Probe: Can the application accept traffic?
        # Failure = Remove from service endpoints
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5     # Start checking after 5s
          periodSeconds: 10          # Check every 10s
          timeoutSeconds: 5          # Request timeout
          successThreshold: 1        # 1 success = ready
          failureThreshold: 2        # 2 failures = not ready (20s total)

        # Startup Probe: Has the application finished starting?
        # Use for slow-starting applications
        startupProbe:
          httpGet:
            path: /health/live
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 0     # Start immediately
          periodSeconds: 5           # Check every 5s
          timeoutSeconds: 3          # Request timeout
          successThreshold: 1        # 1 success = started
          failureThreshold: 30       # 30 failures = give up (150s total)

        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
          limits:
            memory: "8Gi"
            cpu: "4000m"

        # Graceful shutdown: allow connections to drain
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]

Probe Configuration Guidelines:

Probe Type | Purpose | Failure Action | Recommended Settings
Liveness | Detect deadlocks, hung processes | Restart container | periodSeconds: 30, failureThreshold: 3
Readiness | Detect temporary unavailability (Redis down, DB issues) | Remove from load balancer | periodSeconds: 10, failureThreshold: 2
Startup | Handle slow application startup | Delay liveness checks | periodSeconds: 5, failureThreshold: 30

When to Use Each Probe:

  1. Liveness Probe (/health/live):

    • Always use for all deployments
    • Should check only if application process is responsive
    • Do NOT check external dependencies (Redis, DB)
    • Fast check (<100ms response time)
  2. Readiness Probe (/health/ready):

    • Always use for all deployments
    • Should check external dependencies (Redis, DB)
    • Allows application to temporarily become "not ready" without restart
    • Example: Redis connection lost → readiness fails → no new connections → Redis recovers → readiness passes → traffic resumes
  3. Startup Probe (/health/live):

    • Use if application takes >30s to start (database migrations, cache warming)
    • Skip if application starts quickly (<10s)
    • Prevents liveness probe from restarting slow-starting apps

Health Check Implementation Best Practices:

func setupHealthChecks(sessionStore *livetemplate.RedisSessionStore, db *sql.DB) {
    // Liveness: Just check if HTTP server is responding
    // Do NOT check external dependencies
    http.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("OK"))
    })

    // Readiness: Check all critical dependencies
    http.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
        ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
        defer cancel()

        // Check Redis
        if err := sessionStore.Ping(); err != nil {
            log.Printf("Readiness: Redis unhealthy: %v", err)
            http.Error(w, "Redis unavailable", http.StatusServiceUnavailable)
            return
        }

        // Check database (optional, if using database)
        if db != nil {
            if err := db.PingContext(ctx); err != nil {
                log.Printf("Readiness: Database unhealthy: %v", err)
                http.Error(w, "Database unavailable", http.StatusServiceUnavailable)
                return
            }
        }

        // All checks passed
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("READY"))
    })

    // Optional: Detailed health check for monitoring (not for k8s probes)
    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        health := struct {
            Status      string            `json:"status"`
            Checks      map[string]string `json:"checks"`
            Timestamp   time.Time         `json:"timestamp"`
            Connections int               `json:"active_connections"`
        }{
            Status:    "healthy",
            Checks:    make(map[string]string),
            Timestamp: time.Now(),
        }

        // Check Redis
        if err := sessionStore.Ping(); err != nil {
            health.Status = "unhealthy"
            health.Checks["redis"] = fmt.Sprintf("error: %v", err)
        } else {
            health.Checks["redis"] = "ok"
        }

        // Check database
        if db != nil {
            if err := db.Ping(); err != nil {
                health.Status = "unhealthy"
                health.Checks["database"] = fmt.Sprintf("error: %v", err)
            } else {
                health.Checks["database"] = "ok"
            }
        }

        // Return JSON response
        w.Header().Set("Content-Type", "application/json")
        if health.Status != "healthy" {
            w.WriteHeader(http.StatusServiceUnavailable)
        }
        json.NewEncoder(w).Encode(health)
    })
}

Advanced Database Health Checks:

For production deployments with databases, implement comprehensive health checks that verify not just connectivity, but also connection pool health and query performance.

package main

import (
    "context"
    "database/sql"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "os"
    "time"

    _ "github.com/lib/pq" // PostgreSQL driver
    "github.com/livetemplate/livetemplate"
    "github.com/redis/go-redis/v9"
)

type DatabaseHealthChecker struct {
    db      *sql.DB
    timeout time.Duration
}

func NewDatabaseHealthChecker(db *sql.DB) *DatabaseHealthChecker {
    return &DatabaseHealthChecker{
        db:      db,
        timeout: 3 * time.Second,
    }
}

// Check performs comprehensive database health check
func (d *DatabaseHealthChecker) Check(ctx context.Context) error {
    ctx, cancel := context.WithTimeout(ctx, d.timeout)
    defer cancel()

    // 1. Ping: Verify basic connectivity
    if err := d.db.PingContext(ctx); err != nil {
        return fmt.Errorf("ping failed: %w", err)
    }

    // 2. Simple query: Verify database is responsive
    var result int
    if err := d.db.QueryRowContext(ctx, "SELECT 1").Scan(&result); err != nil {
        return fmt.Errorf("query failed: %w", err)
    }

    return nil
}

// Stats returns database connection pool statistics
func (d *DatabaseHealthChecker) Stats() sql.DBStats {
    return d.db.Stats()
}

// Detailed health check endpoint with database metrics
func setupDatabaseHealthCheck(db *sql.DB, sessionStore *livetemplate.RedisSessionStore) {
    dbChecker := NewDatabaseHealthChecker(db)

    // Simple readiness check for Kubernetes
    http.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
        ctx := r.Context()

        // Check Redis
        if err := sessionStore.Ping(); err != nil {
            log.Printf("Readiness: Redis unhealthy: %v", err)
            http.Error(w, "Redis unavailable", http.StatusServiceUnavailable)
            return
        }

        // Check database
        if err := dbChecker.Check(ctx); err != nil {
            log.Printf("Readiness: Database unhealthy: %v", err)
            http.Error(w, "Database unavailable", http.StatusServiceUnavailable)
            return
        }

        w.WriteHeader(http.StatusOK)
        w.Write([]byte("READY"))
    })

    // Detailed health check with metrics
    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        ctx := r.Context()

        health := struct {
            Status    string                 `json:"status"`
            Checks    map[string]interface{} `json:"checks"`
            Timestamp time.Time              `json:"timestamp"`
        }{
            Status:    "healthy",
            Checks:    make(map[string]interface{}),
            Timestamp: time.Now(),
        }

        // Redis health
        if err := sessionStore.Ping(); err != nil {
            health.Status = "unhealthy"
            health.Checks["redis"] = map[string]interface{}{
                "status": "error",
                "error":  err.Error(),
            }
        } else {
            health.Checks["redis"] = map[string]interface{}{
                "status": "ok",
            }
        }

        // Database health with detailed metrics
        dbHealth := map[string]interface{}{
            "status": "ok",
        }

        if err := dbChecker.Check(ctx); err != nil {
            health.Status = "unhealthy"
            dbHealth["status"] = "error"
            dbHealth["error"] = err.Error()
        } else {
            // Add connection pool statistics
            stats := dbChecker.Stats()
            dbHealth["connection_pool"] = map[string]interface{}{
                "open_connections":  stats.OpenConnections,
                "in_use":            stats.InUse,
                "idle":              stats.Idle,
                "max_open":          stats.MaxOpenConnections,
                "wait_count":        stats.WaitCount,
                "wait_duration_ms":  stats.WaitDuration.Milliseconds(),
                "max_idle_closed":   stats.MaxIdleClosed,
                "max_idle_time_closed": stats.MaxIdleTimeClosed,
                "max_lifetime_closed":  stats.MaxLifetimeClosed,
            }

            // Calculate pool utilization
            utilization := float64(0)
            if stats.MaxOpenConnections > 0 {
                utilization = float64(stats.OpenConnections) / float64(stats.MaxOpenConnections) * 100
            }
            dbHealth["pool_utilization_percent"] = utilization

            // Warn if pool is >80% utilized
            if utilization > 80 {
                dbHealth["warning"] = "connection pool utilization high"
            }
        }

        health.Checks["database"] = dbHealth

        // Return response
        w.Header().Set("Content-Type", "application/json")
        if health.Status != "healthy" {
            w.WriteHeader(http.StatusServiceUnavailable)
        }
        json.NewEncoder(w).Encode(health)
    })
}

// Configure database connection pool for production
func configureDatabasePool(db *sql.DB) {
    // Maximum number of open connections
    // Rule of thumb: (CPU cores × 2) + disk spindles
    // Example: 8 cores + 2 disks = 18 connections
    db.SetMaxOpenConns(25)

    // Maximum number of idle connections in pool
    // Should be same as MaxOpenConns for consistent performance
    db.SetMaxIdleConns(25)

    // Maximum lifetime of a connection
    // Helps with connection refresh and load balancer rotation
    db.SetConnMaxLifetime(5 * time.Minute)

    // Maximum idle time for a connection
    // Connections idle longer than this are closed
    db.SetConnMaxIdleTime(1 * time.Minute)
}

// Example main function with database health checks
func main() {
    // Setup database
    db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
    if err != nil {
        log.Fatalf("Failed to open database: %v", err)
    }
    defer db.Close()

    // Configure connection pool
    configureDatabasePool(db)

    // Verify database is reachable on startup
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    if err := db.PingContext(ctx); err != nil {
        log.Fatalf("Failed to ping database: %v", err)
    }
    log.Println("Database connection established")

    // Setup Redis session store
    redisClient := redis.NewClient(&redis.Options{
        Addr: os.Getenv("REDIS_URL"),
    })
    sessionStore := livetemplate.NewRedisSessionStore(redisClient)

    // Setup health checks
    setupDatabaseHealthCheck(db, sessionStore)

    // ... rest of application setup
}

Database Connection Pool Best Practices:

  1. Set Appropriate Connection Limits:

    // Too low: Queries queue, high latency
    db.SetMaxOpenConns(5)  // DON'T: Too few for production
    
    // Too high: Resource exhaustion, database overload
    db.SetMaxOpenConns(1000)  // DON'T: Way too many
    
    // Just right: Based on workload and database capacity
    db.SetMaxOpenConns(25)  // DO: Reasonable for most apps
    

    Formula:

    MaxOpenConns = (CPU cores × 2) + disk spindles
    
    Examples:
    - 8 core server + SSD: 8×2 + 1 = 17 ≈ 20 connections
    - 16 core server + RAID: 16×2 + 4 = 36 ≈ 40 connections
    
  2. Match Idle and Max Connections:

    // Inconsistent: Idle connections close/reopen frequently
    db.SetMaxOpenConns(25)
    db.SetMaxIdleConns(5)   // DON'T: Creates connection churn
    
    // Consistent: Connections stay open and ready
    db.SetMaxOpenConns(25)
    db.SetMaxIdleConns(25)  // DO: No connection churn
    
  3. Set Connection Lifetimes:

    // Infinite lifetime: Stale connections, load balancer issues
    // (default: no limit)
    
    // Reasonable lifetime: Fresh connections, LB-friendly
    db.SetConnMaxLifetime(5 * time.Minute)     // DO: Rotate connections
    db.SetConnMaxIdleTime(1 * time.Minute)     // DO: Close idle connections
    
  4. Monitor Connection Pool Metrics:

    // Log pool stats periodically
    go func() {
        ticker := time.NewTicker(30 * time.Second)
        for range ticker.C {
            stats := db.Stats()
            log.Printf("DB Pool: open=%d in_use=%d idle=%d wait_count=%d",
                stats.OpenConnections, stats.InUse, stats.Idle, stats.WaitCount)
    
            // Alert if pool is starved
            if stats.WaitCount > 100 {
                log.Printf("WARNING: High connection wait count: %d", stats.WaitCount)
            }
        }
    }()
    

Prometheus Metrics for Database Health:

import (
    "database/sql"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    dbConnectionsOpen = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "db_connections_open",
        Help: "Number of open database connections",
    })

    dbConnectionsInUse = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "db_connections_in_use",
        Help: "Number of database connections currently in use",
    })

    dbConnectionsIdle = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "db_connections_idle",
        Help: "Number of idle database connections",
    })

    dbConnectionWaitCount = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "db_connection_wait_count_total",
        Help: "Total number of times a connection was waited for",
    })

    dbConnectionWaitDuration = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "db_connection_wait_duration_seconds",
        Help: "Total time blocked waiting for connections",
    })
)

// Export database pool metrics to Prometheus
func exportDatabaseMetrics(db *sql.DB) {
    go func() {
        ticker := time.NewTicker(10 * time.Second)
        for range ticker.C {
            stats := db.Stats()
            dbConnectionsOpen.Set(float64(stats.OpenConnections))
            dbConnectionsInUse.Set(float64(stats.InUse))
            dbConnectionsIdle.Set(float64(stats.Idle))
            dbConnectionWaitCount.Set(float64(stats.WaitCount))
            dbConnectionWaitDuration.Set(stats.WaitDuration.Seconds())
        }
    }()
}

Alerting Rules:

# Prometheus alerting rules for database health
groups:
- name: database_health
  rules:
  # Database connectivity
  - alert: DatabaseDown
    expr: up{job="database"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Database is down"

  # Connection pool exhaustion
  - alert: DatabasePoolExhausted
    expr: (db_connections_in_use / db_connections_open) > 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Database connection pool >90% utilized"

  # High wait count (connection starvation)
  - alert: DatabaseConnectionStarvation
    expr: rate(db_connection_wait_count_total[5m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High database connection wait rate"

  # Slow queries
  - alert: DatabaseSlowQueries
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{handler="/health"}[5m])) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Health check queries are slow (p95 > 1s)"

Testing Database Health Checks:

# Test health endpoint locally
curl -v http://localhost:8080/health | jq .

# Expected response:
{
  "status": "healthy",
  "checks": {
    "database": {
      "status": "ok",
      "connection_pool": {
        "open_connections": 10,
        "in_use": 2,
        "idle": 8,
        "max_open": 25,
        "wait_count": 0,
        "pool_utilization_percent": 40
      }
    },
    "redis": {
      "status": "ok"
    }
  },
  "timestamp": "2025-11-02T10:30:00Z"
}

# Simulate database failure (kill database container)
docker stop postgres-db

# Health check should fail
curl -v http://localhost:8080/health/ready
# Expected: HTTP 503 Service Unavailable

# Kubernetes should remove pod from service
kubectl get pods
# READY column shows 0/1

# Restore database
docker start postgres-db

# Health check should recover
curl -v http://localhost:8080/health/ready
# Expected: HTTP 200 OK

Common Database Health Check Mistakes:

DON'T: Use complex queries in health checks

// WRONG: Slow, locks tables
_, err := db.Query("SELECT * FROM users WHERE status = 'active' ORDER BY created_at DESC LIMIT 1000")

DO: Use simple, fast queries

// CORRECT: Fast, no locks
var result int
err := db.QueryRow("SELECT 1").Scan(&result)

DON'T: Ignore connection pool exhaustion

// WRONG: Health check passes but app is slow
if err := db.Ping(); err != nil {
    return err
}
// Missing: Check if pool is exhausted (high wait count)

DO: Check both connectivity and pool health

// CORRECT: Verify connectivity AND pool capacity
if err := db.Ping(); err != nil {
    return err
}
stats := db.Stats()
if stats.WaitCount > 100 {
    return fmt.Errorf("connection pool exhausted: wait_count=%d", stats.WaitCount)
}

DON'T: Set unlimited connection pool

// WRONG: Can exhaust database resources
db.SetMaxOpenConns(0)  // 0 = unlimited

DO: Set explicit, reasonable limits

// CORRECT: Explicit limit based on capacity
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(25)

Common Probe Configuration Mistakes:

DON'T: Check external dependencies in liveness probe

livenessProbe:
  httpGet:
    path: /health/ready  # WRONG: Checks Redis, DB

Why: If Redis is down temporarily, liveness fails → container restarts → Redis still down → restart loop

DO: Check only process health in liveness probe

livenessProbe:
  httpGet:
    path: /health/live  # CORRECT: Only checks if process responds

DON'T: Set aggressive failure thresholds

readinessProbe:
  periodSeconds: 5
  failureThreshold: 1  # WRONG: 1 failure = immediately removed

Why: Temporary network blip → immediately removed from load balancer → unnecessary disruption

DO: Allow for temporary failures

readinessProbe:
  periodSeconds: 10
  failureThreshold: 2  # CORRECT: 2 consecutive failures (20s) before removal

DON'T: Forget graceful shutdown

# No preStop hook = immediate termination

Why: WebSocket connections get abruptly closed → bad user experience

DO: Drain connections before shutdown

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]  # Give connections time to close

Testing Health Checks:

# Test liveness probe locally
curl -v http://localhost:8080/health/live
# Expected: HTTP 200 OK

# Test readiness probe locally
curl -v http://localhost:8080/health/ready
# Expected: HTTP 200 OK (if Redis is up)
# Expected: HTTP 503 Service Unavailable (if Redis is down)

# Test in Kubernetes
kubectl get pods
# Check "READY" column: should show 1/1

kubectl describe pod livetemplate-app-xxx
# Check "Conditions" section for probe failures

# Simulate Redis failure
kubectl exec -it redis-0 -- redis-cli shutdown
# Watch readiness probe fail
kubectl get pods -w
# Should see READY change from 1/1 to 0/1

# Restore Redis
kubectl rollout restart statefulset/redis
# Watch readiness probe recover
# Should see READY change from 0/1 to 1/1

Monitoring Probe Health:

Query Kubernetes events to detect probe failures:

# Recent probe failures
kubectl get events --field-selector reason=Unhealthy

# Probe failures for specific pod
kubectl describe pod livetemplate-app-xxx | grep -A 5 "Liveness\|Readiness"

Prometheus metrics for probe failures:

# Liveness probe failures (container restarts)
rate(kube_pod_container_status_restarts_total{pod=~"livetemplate-app-.*"}[5m]) > 0

# Readiness probe failures (not ready)
kube_pod_status_ready{pod=~"livetemplate-app-.*", condition="false"} == 1

Step 6: Test the Migration

Local Testing:

# Start Redis
docker run -d -p 6379:6379 redis:7-alpine

# Run application
REDIS_URL=localhost:6379 go run main.go

# Test session persistence
curl -c cookies.txt http://localhost:8080/
# Restart application
pkill -9 main && REDIS_URL=localhost:6379 go run main.go &
# Verify session persisted
curl -b cookies.txt http://localhost:8080/

Integration Test:

func TestRedisSessionPersistence(t *testing.T) {
    // Setup Redis and handler
    redisClient := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
    sessionStore := livetemplate.NewRedisSessionStore(redisClient)

    controller := &TestController{}
    state := &TestState{Value: 0}
    handler := livetemplate.Mount(controller, livetemplate.AsState(state),
        livetemplate.WithSessionStore(sessionStore),
    )

    // Create session
    w := httptest.NewRecorder()
    r := httptest.NewRequest("GET", "/", nil)
    handler.ServeHTTP(w, r)

    // Extract session cookie
    cookies := w.Result().Cookies()
    sessionCookie := cookies[0]

    // Simulate restart by creating new handler
    handler2 := livetemplate.Mount(controller, livetemplate.AsState(&TestState{Value: 0}),
        livetemplate.WithSessionStore(sessionStore),
    )

    // Verify session persisted
    w2 := httptest.NewRecorder()
    r2 := httptest.NewRequest("GET", "/", nil)
    r2.AddCookie(sessionCookie)
    handler2.ServeHTTP(w2, r2)

    // Session should exist (no new session created)
    assert.Equal(t, sessionCookie.Value, w2.Result().Cookies()[0].Value)
}

Step 7: Deploy to Production

Deployment Strategy:

  1. Blue-Green Deployment (Recommended for first migration):

    # Deploy new version with Redis to "green" environment
    kubectl apply -f deployment-green.yaml
    
    # Verify health checks pass
    kubectl get pods -l version=green
    
    # Switch traffic to green
    kubectl patch service app -p '{"spec":{"selector":{"version":"green"}}}'
    
    # Monitor for 24 hours (session TTL)
    
    # Decommission blue environment
    kubectl delete -f deployment-blue.yaml
    
  2. Rolling Update (For subsequent deployments):

    apiVersion: apps/v1
    kind: Deployment
    spec:
      replicas: 3
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1
          maxUnavailable: 0  # Zero downtime
    

Migration Checklist


Common Migration Issues

Issue: "gob: name not registered for interface type"

Cause: State type not registered with gob.

Solution:

func init() {
    gob.Register(&YourStateType{})
}

Issue: "Sessions lost after migration"

Cause: MemorySessionStore sessions cannot be migrated to Redis.

Solution: Sessions will be recreated on next user visit. For critical sessions:

  1. Set migration window during low-traffic period
  2. Export sessions before migration: sessionStore.List()
  3. Import to Redis after migration
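The export/import step can be sketched as below. Only `List()` is documented here, so the `Store` interface with `Get`/`Set` is an assumption for illustration, and `memStore` is a toy stand-in for the real stores:

```go
package main

import "fmt"

// Store is a minimal stand-in for the session store API; only List is
// mentioned in this guide, Get and Set are assumed for illustration.
type Store interface {
	List() []string
	Get(id string) ([]byte, bool)
	Set(id string, data []byte)
}

// memStore is a toy in-memory implementation used to demonstrate the copy.
type memStore struct{ m map[string][]byte }

func newMemStore() *memStore { return &memStore{m: map[string][]byte{}} }

func (s *memStore) List() []string {
	ids := make([]string, 0, len(s.m))
	for id := range s.m {
		ids = append(ids, id)
	}
	return ids
}
func (s *memStore) Get(id string) ([]byte, bool) { b, ok := s.m[id]; return b, ok }
func (s *memStore) Set(id string, data []byte)   { s.m[id] = data }

// migrateSessions copies every session from src to dst and reports the count.
func migrateSessions(src, dst Store) int {
	n := 0
	for _, id := range src.List() {
		if data, ok := src.Get(id); ok {
			dst.Set(id, data)
			n++
		}
	}
	return n
}

func main() {
	src, dst := newMemStore(), newMemStore()
	src.Set("sess-1", []byte("state-1"))
	src.Set("sess-2", []byte("state-2"))
	fmt.Println(migrateSessions(src, dst)) // prints 2
}
```

In practice, run the copy during the low-traffic window, before switching traffic to the Redis-backed deployment.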

Issue: "Redis connection timeout in production"

Cause: Network policy blocking Redis access.

Solution:

# Kubernetes NetworkPolicy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-redis
spec:
  podSelector:
    matchLabels:
      app: livetemplate
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379

Issue: "High Redis memory usage"

Cause: Sessions not expiring (TTL issue).

Solution:

// Set an appropriate TTL so idle sessions expire automatically
sessionStore := livetemplate.NewRedisSessionStore(redisClient,
    livetemplate.WithSessionTTL(24*time.Hour),
)

Then monitor Redis memory from the CLI:

redis-cli INFO memory

Performance Comparison

| Metric | MemorySessionStore | RedisSessionStore (Local) | RedisSessionStore (Remote) |
|---|---|---|---|
| Get Latency (p50) | <1μs | ~500μs | 1-5ms |
| Get Latency (p99) | <10μs | ~1ms | 5-20ms |
| Set Latency (p50) | <1μs | ~500μs | 1-5ms |
| Set Latency (p99) | <10μs | ~1ms | 5-20ms |
| Memory Overhead | None | Serialization | Serialization + Network |
| Persistence | No | Yes (RDB/AOF) | Yes (RDB/AOF) |
| Horizontal Scaling | No | Yes | Yes |

Note: Latencies are approximate and depend on network, Redis configuration, and data size.

Rollback Plan

If issues arise after migration:

  1. Immediate rollback (if within deployment window):

    kubectl rollout undo deployment/app
    
  2. Graceful rollback (after deployment complete):

    • Deploy old version with MemorySessionStore
    • Users will lose sessions (expected behavior)
    • Communicate downtime if necessary
  3. Partial rollback (keep Redis for some instances):

    // Hybrid approach: Use Redis but fallback to memory on errors
    sessionStore := livetemplate.NewRedisSessionStore(redisClient,
        livetemplate.WithFallbackToMemory(true),
    )
    

Next Steps After Migration

  1. Add distributed pub/sub for multi-instance server-initiated actions:

    pubsubBroadcaster := livetemplate.NewRedisBroadcaster(redisClient)
    handler := livetemplate.Mount(rootStore,
        livetemplate.WithSessionStore(sessionStore),
        livetemplate.WithPubSubBroadcaster(pubsubBroadcaster), // Enable cross-instance updates
    )
    
  2. Configure monitoring for Redis metrics

  3. Set up alerting for Redis connectivity issues

  4. Review capacity planning for expected load

See SESSION.md for the Session API guide on server-initiated actions.


Capacity Planning

This section provides formulas and guidelines for estimating resource requirements based on your expected load.

Memory Estimation

Application Instance Memory

Per WebSocket Connection:

Conservative: 150-200 KB (worst case; the example table below uses 200 KB)
Realistic (medium complexity): 50 KB
Optimized (M3): 30 KB

Components of Connection Memory:

Example Calculations:

| Connections | Memory (Conservative) | Memory (Realistic) | Instances (16 GB RAM) |
|---|---|---|---|
| 1,000 | 200 MB | 50 MB | 1 |
| 10,000 | 2 GB | 500 MB | 1 |
| 50,000 | 10 GB | 2.5 GB | 1-2 |
| 100,000 | 20 GB | 5 GB | 2-4 |
| 1,000,000 | 200 GB | 50 GB | 10-20 |

Application Instance Overhead:

Formula for Instance Memory:

Total Memory = (Connections × Memory per Connection) + Overhead + Spike Buffer

Example:

10,000 connections × 50 KB = 500 MB
Overhead (OS + DB + Redis + App) = 4 GB
Spike Buffer (30%) = 1.35 GB
Total Memory Required = 5.85 GB ≈ 6-8 GB instance
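The formula can be checked with a few lines of Go. This is a hedged sketch of the guide's arithmetic (decimal GB, KB = 1/10^6 GB), not a library API:

```go
package main

import "fmt"

// instanceMemoryGB applies the formula:
// Total = connections × perConnKB + overheadGB, plus a spike buffer on top.
func instanceMemoryGB(connections int, perConnKB, overheadGB, spikeBuffer float64) float64 {
	base := float64(connections)*perConnKB/1e6 + overheadGB // KB -> GB (decimal)
	return base * (1 + spikeBuffer)
}

func main() {
	// The worked example: 10,000 connections at 50 KB, 4 GB overhead, 30% buffer.
	fmt.Printf("%.2f GB\n", instanceMemoryGB(10000, 50, 4.0, 0.30)) // prints "5.85 GB"
}
```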

Redis Session Store Memory

Per Session (Session Group):

Base session metadata: 500 bytes - 1 KB
Serialized State: Varies by application (1-100 KB typical)
Redis overhead: 20% (data structure overhead, fragmentation)

Example State Sizes:

// Small: ~2 KB
type TodoState struct {
    Items []Todo  // 10 items × 200 bytes
}

// Medium: ~20 KB
type DashboardState struct {
    Metrics   map[string]int     // 100 metrics × 50 bytes
    Alerts    []Alert            // 10 alerts × 500 bytes
    UserPrefs UserPreferences    // 1 KB
}

// Large: ~100 KB
type ChatState struct {
    Messages []Message  // 100 messages × 1 KB
    Users    []User     // 50 users × 100 bytes
}

Redis Memory Formula:

Redis Memory = (Active Sessions × Avg State Size × 1.2) + Redis Overhead

Redis Overhead:

Example Calculations:

| Active Sessions | Avg State Size | Redis Memory (No HA) | Redis Memory (Sentinel) |
|---|---|---|---|
| 1,000 | 5 KB | 6 MB + 50 MB = 56 MB | 156 MB |
| 10,000 | 10 KB | 120 MB + 50 MB = 170 MB | 270 MB |
| 100,000 | 20 KB | 2.4 GB + 100 MB = 2.5 GB | 3.5 GB |
| 1,000,000 | 30 KB | 36 GB + 500 MB = 36.5 GB | 37.5 GB |
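The Redis formula can be checked the same way; this is a sketch of the arithmetic, not a library function:

```go
package main

import "fmt"

// redisMemoryMB applies the formula:
// Redis Memory = sessions × avgStateKB × 1.2 (overhead factor) + fixed overhead.
func redisMemoryMB(sessions int, avgStateKB, overheadMB float64) float64 {
	return float64(sessions)*avgStateKB*1.2/1e3 + overheadMB // KB -> MB (decimal)
}

func main() {
	// Second row of the table: 10,000 sessions at 10 KB plus 50 MB overhead.
	fmt.Printf("%.0f MB\n", redisMemoryMB(10000, 10, 50)) // prints "170 MB"
}
```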

Redis Memory Recommendations:

Session TTL Impact:

# Shorter TTL = lower memory usage (figures include the 1.2x Redis overhead factor)
24 hour TTL:  100K sessions × 20 KB × 1.2 = 2.4 GB
6 hour TTL:   25K sessions × 20 KB × 1.2 = 600 MB   # 4x reduction
1 hour TTL:   4K sessions × 20 KB × 1.2 = 96 MB     # 25x reduction

Monitoring Redis Memory:

# Check current memory usage
redis-cli INFO memory

# Key metrics to monitor:
# - used_memory_human: Total memory used
# - used_memory_rss_human: OS-reported memory
# - mem_fragmentation_ratio: Should be 1.0-1.5
# - evicted_keys: Should be 0 (we use TTL, not eviction)

# Session count
redis-cli DBSIZE
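These checks can be automated. Below is a hedged sketch that parses INFO-style `key:value` output and applies the thresholds above; note that `evicted_keys` lives in the stats section of INFO rather than memory, so fetch both sections in practice:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseInfo extracts numeric fields from "key:value" lines as returned by
// the Redis INFO command (non-numeric values are skipped).
func parseInfo(raw string) map[string]float64 {
	out := map[string]float64{}
	for _, line := range strings.Split(raw, "\n") {
		parts := strings.SplitN(strings.TrimSpace(line), ":", 2)
		if len(parts) != 2 {
			continue
		}
		if v, err := strconv.ParseFloat(parts[1], 64); err == nil {
			out[parts[0]] = v
		}
	}
	return out
}

// memoryAlerts applies this guide's thresholds: fragmentation ratio above
// 1.5 and any evicted keys are worth flagging.
func memoryAlerts(info map[string]float64) []string {
	var alerts []string
	if r, ok := info["mem_fragmentation_ratio"]; ok && r > 1.5 {
		alerts = append(alerts, fmt.Sprintf("fragmentation ratio %.2f > 1.5", r))
	}
	if e, ok := info["evicted_keys"]; ok && e > 0 {
		alerts = append(alerts, fmt.Sprintf("%d keys evicted (TTL misconfigured?)", int(e)))
	}
	return alerts
}

func main() {
	raw := "used_memory:1048576\nmem_fragmentation_ratio:1.8\nevicted_keys:0"
	for _, a := range memoryAlerts(parseInfo(raw)) {
		fmt.Println(a) // prints "fragmentation ratio 1.80 > 1.5"
	}
}
```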

CPU Estimation

Per Instance:

| Load Type | CPU per 1K Connections | CPU per 10K Connections |
|---|---|---|
| Idle connections | 0.1-0.2 cores | 1-2 cores |
| Active browsing (1 action/min) | 0.5-1 cores | 5-10 cores |
| Heavy interaction (10 actions/min) | 2-4 cores | 20-40 cores |

Redis CPU:

Recommendation:

Network Bandwidth

WebSocket Traffic:

Per Instance Bandwidth:

10K connections × 100 KB/min avg = 1 GB/min = 16.7 MB/s

Redis Pub/Sub Bandwidth:

Message size × Broadcast rate × Instance count
Example: 5 KB message × 100 broadcasts/sec × 10 instances = 5 MB/s

Recommendation:

Connection Distribution

Rule of Thumb:

Load Balancer Strategy:

Session Distribution:

Scaling Decision Matrix

Use this table to determine when to scale horizontally (add instances) vs vertically (larger instances):

| Scenario | Current State | Recommended Action |
|---|---|---|
| Memory at 80% | Single instance | Add more instances (horizontal scale) |
| CPU at 80% | Single instance | Add more instances or upgrade instance size |
| High broadcast latency | Multiple instances | Add more Redis resources or optimize broadcasts |
| Uneven load | Multiple instances | Enable connection migration (M3) or adjust LB algorithm |
| Session store slow | Redis at capacity | Upgrade Redis instance or switch to Cluster |

Capacity Planning Example

Scenario: E-commerce platform with 50,000 concurrent users

Requirements:

Calculations:

1. Application Instances:

Connection memory: 50,000 × 50 KB = 2.5 GB
Overhead: 4 GB
Spike buffer (30%): 2 GB
Total per instance: 8.5 GB

Instance size: 16 GB (provides headroom)
Connections per instance: 10,000 (60% capacity)
Required instances: 50,000 / 10,000 = 5 instances

CPU per instance (moderate load): 4-6 cores
Recommended: 5× instances with 8 cores, 16 GB RAM

2. Redis Session Store:

Active sessions: 50,000
Avg state size: 30 KB
Redis memory: 50,000 × 30 KB × 1.2 = 1.8 GB
Add overhead: 100 MB
Total: 2 GB

Recommended: Redis Sentinel (3 nodes, 4 GB each)

3. Load Balancer:

Sticky sessions enabled
Algorithm: Least-connections fallback
Health checks: /health/ready (every 10s)
Connection draining: 30s timeout

4. Total Infrastructure:

Application: 5 instances × $50/month = $250
Redis Sentinel: 3 nodes × $30/month = $90
Load Balancer: $40/month
Database: $100/month
Total: ~$480/month (Tier 2-3 scale)

Capacity Planning Tools

Formula Spreadsheet:

Target Connections: [input]
Memory per Connection: 50 KB (default)
Sessions per Connection: 1 (default)
State Size per Session: 20 KB (default)

→ Application Memory: [calculated]
→ Redis Memory: [calculated]
→ Instance Count: [calculated]
→ Estimated Cost: [calculated]
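That spreadsheet can be reproduced as a small Go program using the memory formulas from this guide. The per-instance target of 10,000 connections is an assumption taken from the worked example, and cost is omitted since pricing varies by provider:

```go
package main

import (
	"fmt"
	"math"
)

// plan derives application memory per instance, Redis memory, and instance
// count from the guide's formulas. Defaults mirror the listed ones
// (50 KB per connection, 20 KB state per session, 4 GB overhead, 30% buffer).
func plan(connections int) (appGB, redisGB float64, instances int) {
	const (
		perConnKB        = 50.0
		stateKB          = 20.0
		overheadGB       = 4.0
		spike            = 0.30
		connsPerInstance = 10000 // assumed per-instance target
	)
	appGB = (float64(connections)*perConnKB/1e6 + overheadGB) * (1 + spike)
	redisGB = float64(connections) * stateKB * 1.2 / 1e6
	instances = int(math.Ceil(float64(connections) / connsPerInstance))
	return
}

func main() {
	appGB, redisGB, n := plan(50000)
	fmt.Printf("app memory/instance: %.1f GB, redis: %.1f GB, instances: %d\n",
		appGB, redisGB, n)
}
```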

Monitoring Capacity:

# Connection capacity utilization
(livetemplate_connections_active / livetemplate_connections_max) > 0.7

# Memory capacity utilization
(process_resident_memory_bytes / node_memory_MemTotal_bytes) > 0.8

# Redis memory utilization
(redis_memory_used_bytes / redis_memory_max_bytes) > 0.8

Right-Sizing Recommendations

When to Scale Up (Vertical):

When to Scale Out (Horizontal):

When to Use Redis Cluster (vs Sentinel):


Scaling Checklist

Before Scaling to Tier 2 (Redis + 2 Instances)


Before Scaling to Tier 3 (Production HA)


Before Scaling to Tier 4 (Enterprise)


Performance Benchmarks

Single Instance (16 GB RAM, 8 vCPUs)

| Metric | M1 | M2 | M3 |
|---|---|---|---|
| Max Connections | 10K | 20K | 50K+ |
| Action Latency (p95) | <100ms | <50ms | <20ms |
| Broadcast Latency (p95) | <50ms | <100ms | <50ms |
| Memory per Connection | 100 KB | 70 KB | 30 KB |
| Goroutines per Connection | 1 | 1 | 0.5 |

Multi-Instance (10 instances)

| Metric | M2 | M3 |
|---|---|---|
| Total Connections | 200K | 500K+ |
| Broadcast Fan-out Time (10K users) | 200ms | 100ms |
| Session Lookup Latency (Redis) | <5ms | <2ms |
| Cross-Instance Broadcast Latency | <100ms | <50ms |

Note: Benchmarks are approximate and depend on hardware, network, and workload characteristics.


Common Scaling Issues

Issue: Connection Limit Reached

Symptoms:

Solutions:

  1. Horizontal scale: Add more instances
  2. Vertical scale: Increase instance RAM
  3. Optimize: Review connection lifecycle, reduce memory per connection
  4. Limit: Set MaxConnectionsPerGroup to prevent single-user exhaustion
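The per-group limit (item 4) can be sketched as a small concurrent counter. This illustrates the mechanism behind `MaxConnectionsPerGroup`, not the library's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// groupLimiter caps how many concurrent connections a single session group
// (e.g. one user) may hold, so one client cannot exhaust the instance-wide
// connection budget.
type groupLimiter struct {
	mu     sync.Mutex
	counts map[string]int
	max    int
}

func newGroupLimiter(max int) *groupLimiter {
	return &groupLimiter{counts: map[string]int{}, max: max}
}

// Acquire returns false when the group is already at its cap.
func (l *groupLimiter) Acquire(group string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.counts[group] >= l.max {
		return false
	}
	l.counts[group]++
	return true
}

// Release must be called when a connection closes.
func (l *groupLimiter) Release(group string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.counts[group] > 0 {
		l.counts[group]--
	}
}

func main() {
	lim := newGroupLimiter(2)
	fmt.Println(lim.Acquire("user-1"), lim.Acquire("user-1"), lim.Acquire("user-1")) // true true false
	lim.Release("user-1")
	fmt.Println(lim.Acquire("user-1")) // true
}
```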

Issue: High Broadcast Latency

Symptoms:

Solutions:

  1. Redis latency: Check redis-cli --latency and network latency
  2. Fan-out size: Limit group sizes or shard groups
  3. Local optimization: Ensure local broadcasts skip Redis (M2 feature)
  4. Compression: Enable WebSocket compression (M3)

Issue: Sessions Not Persisting

Symptoms:

Solutions:

  1. Check Redis: Verify Redis persistence (RDB/AOF) enabled
  2. Check TTL: Ensure session TTL configured correctly
  3. Check serialization: Verify custom State types are serializable
  4. Fallback: Ensure WithFallbackToMemory not masking issues

Issue: Uneven Load Distribution

Symptoms:

Solutions:

  1. Sticky sessions: Verify cookie-based affinity working
  2. Long-lived connections: WebSockets can cause imbalance over time
  3. Rebalancing: Implement periodic connection migration (M3 feature)
  4. Algorithm: Try least-connections instead of round-robin

Monitoring and Alerting

Critical Metrics to Monitor

Connection Health:

livetemplate_connections_active{instance="host1"} > 8000  # 80% of 10K limit
livetemplate_connections_rejected_total > 100

Performance:

livetemplate_action_duration_seconds{quantile="0.95"} > 0.200  # 200ms
rate(livetemplate_broadcasts_sent_total[5m]) > 10000  # High broadcast rate

Resource Usage:

process_resident_memory_bytes > 13e9  # 13 GB of 16 GB
redis_connected_clients{instance="redis1"} > 9000  # 90% of Redis max clients

Critical (page on-call):

Warning (Slack notification):

Info (metrics only):


Next Steps


Questions? Open an issue on GitHub or join the discussion.