Target Audience: DevOps engineers, SREs, and developers deploying LiveTemplate applications at scale.
Last Updated: 2025-11-01
This guide covers scaling LiveTemplate applications from single-host prototypes to production systems handling millions of concurrent WebSocket connections.
Capacity: <1,000 concurrent connections Infrastructure: 1 instance, no Redis Cost: $5-20/month (VPS)
Recommended For:
Configuration:
sessionStore := livetemplate.NewMemorySessionStore()
handler := livetemplate.Mount(rootStore,
livetemplate.WithSessionStore(sessionStore),
livetemplate.WithMaxConnections(1000),
)
Infrastructure:
Limitations:
Capacity: 1K-10K concurrent connections Infrastructure: 1-2 instances, Redis optional Cost: $50-200/month
Recommended For:
Configuration:
// Option A: Still single-host with Redis for persistence
redisClient := redis.NewClient(&redis.Options{
Addr: os.Getenv("REDIS_URL"),
})
sessionStore := livetemplate.NewRedisSessionStore(redisClient,
livetemplate.WithFallbackToMemory(true),
)
handler := livetemplate.Mount(rootStore,
livetemplate.WithSessionStore(sessionStore),
livetemplate.WithMaxConnections(5000),
)
Infrastructure:
Benefits Over Tier 1:
Capacity: 10K-100K concurrent connections Infrastructure: 2-10 instances, Redis Sentinel Cost: $500-2000/month
Recommended For:
Configuration:
// Multi-instance with Redis Sentinel for HA
redisClient := redis.NewFailoverClient(&redis.FailoverOptions{
MasterName: "mymaster",
SentinelAddrs: []string{"sentinel1:26379", "sentinel2:26379"},
})
sessionStore := livetemplate.NewRedisSessionStore(redisClient)
pubsubBroadcaster := livetemplate.NewRedisBroadcaster(redisClient)
handler := livetemplate.Mount(rootStore,
livetemplate.WithSessionStore(sessionStore),
livetemplate.WithPubSubBroadcaster(pubsubBroadcaster),
livetemplate.WithMaxConnections(10000),
livetemplate.WithMaxConnectionsPerGroup(500),
)
Infrastructure:
Deployment Pattern:
Key Metrics to Monitor:
livetemplate_connections_active per instance
livetemplate_connections_rejected_total (backpressure)
livetemplate_action_duration_seconds (p95, p99)
Capacity: 100K-1M+ concurrent connections Infrastructure: 10-100+ instances, Redis Cluster Cost: $5K-50K+/month
Recommended For:
Configuration:
// Redis Cluster for horizontal sharding
redisClient := redis.NewClusterClient(&redis.ClusterOptions{
Addrs: []string{
"redis-node1:6379",
"redis-node2:6379",
"redis-node3:6379",
},
})
sessionStore := livetemplate.NewRedisSessionStore(redisClient)
pubsubBroadcaster := livetemplate.NewRedisBroadcaster(redisClient)
handler := livetemplate.Mount(rootStore,
livetemplate.WithSessionStore(sessionStore),
livetemplate.WithPubSubBroadcaster(pubsubBroadcaster),
livetemplate.WithMaxConnections(50000),
livetemplate.WithMaxConnectionsPerGroup(1000),
livetemplate.WithWebSocketCompression(true), // M3 feature
)
Infrastructure:
Architecture Patterns:
This guide walks through migrating from in-memory session storage to Redis-backed storage for horizontal scaling.
Stay with MemorySessionStore if:
Migrate to RedisSessionStore when:
>1,000 concurrent connections expected
github.com/redis/go-redis/v9
Development (Docker):
docker run -d \
--name livetemplate-redis \
-p 6379:6379 \
redis:7-alpine \
redis-server --appendonly yes
Production (Managed Service):
LiveTemplate uses Go's encoding/gob for serialization, which requires registering custom types.
Before (works with MemorySessionStore):
// State holds data (cloned per session)
type TodoState struct {
Items []Todo
}
// Controller holds dependencies (singleton)
type TodoController struct {
DB *sql.DB
}
// Action method
func (c *TodoController) Add(state TodoState, ctx *livetemplate.Context) (TodoState, error) {
todo := Todo{Title: ctx.GetString("title")}
state.Items = append(state.Items, todo)
return state, nil
}
After (required for RedisSessionStore):
type TodoState struct {
Items []Todo
}
type TodoController struct {
DB *sql.DB
}
func (c *TodoController) Add(state TodoState, ctx *livetemplate.Context) (TodoState, error) {
todo := Todo{Title: ctx.GetString("title")}
state.Items = append(state.Items, todo)
return state, nil
}
// Register all State types in init()
func init() {
gob.Register(&TodoState{})
gob.Register(&Todo{}) // Register nested types too
}
Why? Gob encoding preserves type information only for registered types. Without registration, deserialization fails.
Before (MemorySessionStore):
package main
import (
"net/http"

"github.com/livetemplate/livetemplate"
)
func main() {
// In-memory session store (default)
sessionStore := livetemplate.NewMemorySessionStore()
controller := &AppController{}
state := &AppState{}
handler := livetemplate.Mount(controller, livetemplate.AsState(state),
livetemplate.WithSessionStore(sessionStore),
livetemplate.WithMaxConnections(1000),
)
http.Handle("/", handler)
http.ListenAndServe(":8080", nil)
}
After (RedisSessionStore):
package main
import (
"context"
"log"
"net/http"
"os"
"time"

"github.com/livetemplate/livetemplate"
"github.com/redis/go-redis/v9"
)
func main() {
// Connect to Redis
redisClient := redis.NewClient(&redis.Options{
Addr: os.Getenv("REDIS_URL"), // e.g., "localhost:6379"
Password: os.Getenv("REDIS_PASSWORD"),
DB: 0,
})
// Verify Redis connection
if err := redisClient.Ping(context.Background()).Err(); err != nil {
log.Fatalf("Failed to connect to Redis: %v", err)
}
// Create Redis session store with fallback
sessionStore := livetemplate.NewRedisSessionStore(redisClient,
livetemplate.WithSessionTTL(24*time.Hour),
livetemplate.WithFallbackToMemory(true), // Graceful degradation
)
controller := &AppController{}
state := &AppState{}
handler := livetemplate.Mount(controller, livetemplate.AsState(state),
livetemplate.WithSessionStore(sessionStore),
livetemplate.WithMaxConnections(10000), // Can handle more now
)
http.Handle("/", handler)
http.ListenAndServe(":8080", nil)
}
Development (.env):
REDIS_URL=localhost:6379
REDIS_PASSWORD=
Production (Kubernetes Secret):
apiVersion: v1
kind: Secret
metadata:
name: redis-credentials
type: Opaque
stringData:
redis-url: "redis.production.svc.cluster.local:6379"
redis-password: "your-secure-password"
Deployment:
apiVersion: apps/v1
kind: Deployment
spec:
template:
spec:
containers:
- name: app
env:
- name: REDIS_URL
valueFrom:
secretKeyRef:
name: redis-credentials
key: redis-url
- name: REDIS_PASSWORD
valueFrom:
secretKeyRef:
name: redis-credentials
key: redis-password
Add Redis health check to ensure instance is ready before accepting traffic.
import (
"net/http"

"github.com/livetemplate/livetemplate"
)
func main() {
// ... Redis setup ...
sessionStore := livetemplate.NewRedisSessionStore(redisClient)
// Health check endpoints
http.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
w.Write([]byte("OK"))
})
http.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
// Check Redis connectivity
if err := sessionStore.Ping(); err != nil {
http.Error(w, "Redis unavailable", http.StatusServiceUnavailable)
return
}
w.WriteHeader(http.StatusOK)
w.Write([]byte("READY"))
})
http.Handle("/", handler)
http.ListenAndServe(":8080", nil)
}
Kubernetes Probe Configuration:
Configure liveness and readiness probes to ensure Kubernetes can properly manage your application lifecycle.
apiVersion: apps/v1
kind: Deployment
metadata:
name: livetemplate-app
spec:
replicas: 3
selector:
matchLabels:
app: livetemplate
template:
metadata:
labels:
app: livetemplate
spec:
containers:
- name: app
image: your-registry/livetemplate-app:latest
ports:
- containerPort: 8080
name: http
env:
- name: REDIS_URL
valueFrom:
secretKeyRef:
name: redis-credentials
key: redis-url
# Liveness Probe: Is the application running?
# Failure = Restart container
livenessProbe:
httpGet:
path: /health/live
port: 8080
scheme: HTTP
initialDelaySeconds: 10 # Wait 10s after container starts
periodSeconds: 30 # Check every 30s
timeoutSeconds: 5 # Request timeout
successThreshold: 1 # 1 success = healthy
failureThreshold: 3 # 3 failures = restart (90s total)
# Readiness Probe: Can the application accept traffic?
# Failure = Remove from service endpoints
readinessProbe:
httpGet:
path: /health/ready
port: 8080
scheme: HTTP
initialDelaySeconds: 5 # Start checking after 5s
periodSeconds: 10 # Check every 10s
timeoutSeconds: 5 # Request timeout
successThreshold: 1 # 1 success = ready
failureThreshold: 2 # 2 failures = not ready (20s total)
# Startup Probe: Has the application finished starting?
# Use for slow-starting applications
startupProbe:
httpGet:
path: /health/live
port: 8080
scheme: HTTP
initialDelaySeconds: 0 # Start immediately
periodSeconds: 5 # Check every 5s
timeoutSeconds: 3 # Request timeout
successThreshold: 1 # 1 success = started
failureThreshold: 30 # 30 failures = give up (150s total)
resources:
requests:
memory: "4Gi"
cpu: "2000m"
limits:
memory: "8Gi"
cpu: "4000m"
# Graceful shutdown: allow connections to drain
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"]
Probe Configuration Guidelines:
| Probe Type | Purpose | Failure Action | Recommended Settings |
|---|---|---|---|
| Liveness | Detect deadlocks, hung processes | Restart container | periodSeconds: 30, failureThreshold: 3 |
| Readiness | Detect temporary unavailability (Redis down, DB issues) | Remove from load balancer | periodSeconds: 10, failureThreshold: 2 |
| Startup | Handle slow application startup | Delay liveness checks | periodSeconds: 5, failureThreshold: 30 |
When to Use Each Probe:
Liveness Probe (/health/live):
Readiness Probe (/health/ready):
Startup Probe (/health/live):
Health Check Implementation Best Practices:
func setupHealthChecks(sessionStore *livetemplate.RedisSessionStore, db *sql.DB) {
// Liveness: Just check if HTTP server is responding
// Do NOT check external dependencies
http.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
w.Write([]byte("OK"))
})
// Readiness: Check all critical dependencies
http.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
defer cancel()
// Check Redis
if err := sessionStore.Ping(); err != nil {
log.Printf("Readiness: Redis unhealthy: %v", err)
http.Error(w, "Redis unavailable", http.StatusServiceUnavailable)
return
}
// Check database (optional, if using database)
if db != nil {
if err := db.PingContext(ctx); err != nil {
log.Printf("Readiness: Database unhealthy: %v", err)
http.Error(w, "Database unavailable", http.StatusServiceUnavailable)
return
}
}
// All checks passed
w.WriteHeader(http.StatusOK)
w.Write([]byte("READY"))
})
// Optional: Detailed health check for monitoring (not for k8s probes)
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
health := struct {
Status string `json:"status"`
Checks map[string]string `json:"checks"`
Timestamp time.Time `json:"timestamp"`
Connections int `json:"active_connections"`
}{
Status: "healthy",
Checks: make(map[string]string),
Timestamp: time.Now(),
}
// Check Redis
if err := sessionStore.Ping(); err != nil {
health.Status = "unhealthy"
health.Checks["redis"] = fmt.Sprintf("error: %v", err)
} else {
health.Checks["redis"] = "ok"
}
// Check database
if db != nil {
if err := db.Ping(); err != nil {
health.Status = "unhealthy"
health.Checks["database"] = fmt.Sprintf("error: %v", err)
} else {
health.Checks["database"] = "ok"
}
}
// Return JSON response
w.Header().Set("Content-Type", "application/json")
if health.Status != "healthy" {
w.WriteHeader(http.StatusServiceUnavailable)
}
json.NewEncoder(w).Encode(health)
})
}
Advanced Database Health Checks:
For production deployments with databases, implement comprehensive health checks that verify not just connectivity, but also connection pool health and query performance.
package main
import (
"context"
"database/sql"
"encoding/json"
"fmt"
"log"
"net/http"
"os"
"time"

_ "github.com/lib/pq" // PostgreSQL driver
"github.com/livetemplate/livetemplate"
"github.com/redis/go-redis/v9"
)
type DatabaseHealthChecker struct {
db *sql.DB
timeout time.Duration
}
func NewDatabaseHealthChecker(db *sql.DB) *DatabaseHealthChecker {
return &DatabaseHealthChecker{
db: db,
timeout: 3 * time.Second,
}
}
// Check performs comprehensive database health check
func (d *DatabaseHealthChecker) Check(ctx context.Context) error {
ctx, cancel := context.WithTimeout(ctx, d.timeout)
defer cancel()
// 1. Ping: Verify basic connectivity
if err := d.db.PingContext(ctx); err != nil {
return fmt.Errorf("ping failed: %w", err)
}
// 2. Simple query: Verify database is responsive
var result int
if err := d.db.QueryRowContext(ctx, "SELECT 1").Scan(&result); err != nil {
return fmt.Errorf("query failed: %w", err)
}
return nil
}
// Stats returns database connection pool statistics
func (d *DatabaseHealthChecker) Stats() sql.DBStats {
return d.db.Stats()
}
// Detailed health check endpoint with database metrics
func setupDatabaseHealthCheck(db *sql.DB, sessionStore *livetemplate.RedisSessionStore) {
dbChecker := NewDatabaseHealthChecker(db)
// Simple readiness check for Kubernetes
http.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
ctx := r.Context()
// Check Redis
if err := sessionStore.Ping(); err != nil {
log.Printf("Readiness: Redis unhealthy: %v", err)
http.Error(w, "Redis unavailable", http.StatusServiceUnavailable)
return
}
// Check database
if err := dbChecker.Check(ctx); err != nil {
log.Printf("Readiness: Database unhealthy: %v", err)
http.Error(w, "Database unavailable", http.StatusServiceUnavailable)
return
}
w.WriteHeader(http.StatusOK)
w.Write([]byte("READY"))
})
// Detailed health check with metrics
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
ctx := r.Context()
health := struct {
Status string `json:"status"`
Checks map[string]interface{} `json:"checks"`
Timestamp time.Time `json:"timestamp"`
}{
Status: "healthy",
Checks: make(map[string]interface{}),
Timestamp: time.Now(),
}
// Redis health
if err := sessionStore.Ping(); err != nil {
health.Status = "unhealthy"
health.Checks["redis"] = map[string]interface{}{
"status": "error",
"error": err.Error(),
}
} else {
health.Checks["redis"] = map[string]interface{}{
"status": "ok",
}
}
// Database health with detailed metrics
dbHealth := map[string]interface{}{
"status": "ok",
}
if err := dbChecker.Check(ctx); err != nil {
health.Status = "unhealthy"
dbHealth["status"] = "error"
dbHealth["error"] = err.Error()
} else {
// Add connection pool statistics
stats := dbChecker.Stats()
dbHealth["connection_pool"] = map[string]interface{}{
"open_connections": stats.OpenConnections,
"in_use": stats.InUse,
"idle": stats.Idle,
"max_open": stats.MaxOpenConnections,
"wait_count": stats.WaitCount,
"wait_duration_ms": stats.WaitDuration.Milliseconds(),
"max_idle_closed": stats.MaxIdleClosed,
"max_idle_time_closed": stats.MaxIdleTimeClosed,
"max_lifetime_closed": stats.MaxLifetimeClosed,
}
// Calculate pool utilization
utilization := float64(0)
if stats.MaxOpenConnections > 0 {
utilization = float64(stats.OpenConnections) / float64(stats.MaxOpenConnections) * 100
}
dbHealth["pool_utilization_percent"] = utilization
// Warn if pool is >80% utilized
if utilization > 80 {
dbHealth["warning"] = "connection pool utilization high"
}
}
health.Checks["database"] = dbHealth
// Return response
w.Header().Set("Content-Type", "application/json")
if health.Status != "healthy" {
w.WriteHeader(http.StatusServiceUnavailable)
}
json.NewEncoder(w).Encode(health)
})
}
// Configure database connection pool for production
func configureDatabasePool(db *sql.DB) {
// Maximum number of open connections
// Rule of thumb: (CPU cores × 2) + disk spindles
// Example: 8 cores + 2 disks = 18 connections
db.SetMaxOpenConns(25)
// Maximum number of idle connections in pool
// Should be same as MaxOpenConns for consistent performance
db.SetMaxIdleConns(25)
// Maximum lifetime of a connection
// Helps with connection refresh and load balancer rotation
db.SetConnMaxLifetime(5 * time.Minute)
// Maximum idle time for a connection
// Connections idle longer than this are closed
db.SetConnMaxIdleTime(1 * time.Minute)
}
// Example main function with database health checks
func main() {
// Setup database
db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
if err != nil {
log.Fatalf("Failed to open database: %v", err)
}
defer db.Close()
// Configure connection pool
configureDatabasePool(db)
// Verify database is reachable on startup
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if err := db.PingContext(ctx); err != nil {
log.Fatalf("Failed to ping database: %v", err)
}
log.Println("Database connection established")
// Setup Redis session store
redisClient := redis.NewClient(&redis.Options{
Addr: os.Getenv("REDIS_URL"),
})
sessionStore := livetemplate.NewRedisSessionStore(redisClient)
// Setup health checks
setupDatabaseHealthCheck(db, sessionStore)
// ... rest of application setup
}
Database Connection Pool Best Practices:
Set Appropriate Connection Limits:
// Too low: Queries queue, high latency
db.SetMaxOpenConns(5) // DON'T: Too few for production
// Too high: Resource exhaustion, database overload
db.SetMaxOpenConns(1000) // DON'T: Way too many
// Just right: Based on workload and database capacity
db.SetMaxOpenConns(25) // DO: Reasonable for most apps
Formula:
MaxOpenConns = (CPU cores × 2) + disk spindles
Examples:
- 8 core server + SSD: 8×2 + 1 = 17 ≈ 20 connections
- 16 core server + RAID: 16×2 + 4 = 36 ≈ 40 connections
Match Idle and Max Connections:
// Inconsistent: Idle connections close/reopen frequently
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(5) // DON'T: Creates connection churn
// Consistent: Connections stay open and ready
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(25) // DO: No connection churn
Set Connection Lifetimes:
// Infinite lifetime: Stale connections, load balancer issues
// (default: no limit)
// Reasonable lifetime: Fresh connections, LB-friendly
db.SetConnMaxLifetime(5 * time.Minute) // DO: Rotate connections
db.SetConnMaxIdleTime(1 * time.Minute) // DO: Close idle connections
Monitor Connection Pool Metrics:
// Log pool stats periodically
go func() {
ticker := time.NewTicker(30 * time.Second)
for range ticker.C {
stats := db.Stats()
log.Printf("DB Pool: open=%d in_use=%d idle=%d wait_count=%d",
stats.OpenConnections, stats.InUse, stats.Idle, stats.WaitCount)
// Alert if pool is starved
if stats.WaitCount > 100 {
log.Printf("WARNING: High connection wait count: %d", stats.WaitCount)
}
}
}()
Prometheus Metrics for Database Health:
import (
"database/sql"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
)
var (
dbConnectionsOpen = promauto.NewGauge(prometheus.GaugeOpts{
Name: "db_connections_open",
Help: "Number of open database connections",
})
dbConnectionsInUse = promauto.NewGauge(prometheus.GaugeOpts{
Name: "db_connections_in_use",
Help: "Number of database connections currently in use",
})
dbConnectionsIdle = promauto.NewGauge(prometheus.GaugeOpts{
Name: "db_connections_idle",
Help: "Number of idle database connections",
})
dbConnectionWaitCount = promauto.NewGauge(prometheus.GaugeOpts{
Name: "db_connection_wait_count_total",
Help: "Total number of times a connection was waited for",
})
dbConnectionWaitDuration = promauto.NewGauge(prometheus.GaugeOpts{
Name: "db_connection_wait_duration_seconds",
Help: "Total time blocked waiting for connections",
})
)
// Export database pool metrics to Prometheus
func exportDatabaseMetrics(db *sql.DB) {
go func() {
ticker := time.NewTicker(10 * time.Second)
for range ticker.C {
stats := db.Stats()
dbConnectionsOpen.Set(float64(stats.OpenConnections))
dbConnectionsInUse.Set(float64(stats.InUse))
dbConnectionsIdle.Set(float64(stats.Idle))
dbConnectionWaitCount.Set(float64(stats.WaitCount))
dbConnectionWaitDuration.Set(stats.WaitDuration.Seconds())
}
}()
}
Alerting Rules:
# Prometheus alerting rules for database health
groups:
- name: database_health
rules:
# Database connectivity
- alert: DatabaseDown
expr: up{job="database"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Database is down"
# Connection pool exhaustion
- alert: DatabasePoolExhausted
expr: (db_connections_in_use / db_connections_open) > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "Database connection pool >90% utilized"
# High wait count (connection starvation)
- alert: DatabaseConnectionStarvation
expr: rate(db_connection_wait_count_total[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High database connection wait rate"
# Slow queries
- alert: DatabaseSlowQueries
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{handler="/health"}[5m])) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "Health check queries are slow (p95 > 1s)"
Testing Database Health Checks:
# Test health endpoint locally
curl -v http://localhost:8080/health | jq .
# Expected response:
{
"status": "healthy",
"checks": {
"database": {
"status": "ok",
"connection_pool": {
"open_connections": 10,
"in_use": 2,
"idle": 8,
"max_open": 25,
"wait_count": 0,
"pool_utilization_percent": 40
}
},
"redis": {
"status": "ok"
}
},
"timestamp": "2025-11-02T10:30:00Z"
}
# Simulate database failure (kill database container)
docker stop postgres-db
# Health check should fail
curl -v http://localhost:8080/health/ready
# Expected: HTTP 503 Service Unavailable
# Kubernetes should remove pod from service
kubectl get pods
# READY column shows 0/1
# Restore database
docker start postgres-db
# Health check should recover
curl -v http://localhost:8080/health/ready
# Expected: HTTP 200 OK
Common Database Health Check Mistakes:
❌ DON'T: Use complex queries in health checks
// WRONG: Slow, locks tables
_, err := db.Query("SELECT * FROM users WHERE status = 'active' ORDER BY created_at DESC LIMIT 1000")
✅ DO: Use simple, fast queries
// CORRECT: Fast, no locks
var result int
err := db.QueryRow("SELECT 1").Scan(&result)
❌ DON'T: Ignore connection pool exhaustion
// WRONG: Health check passes but app is slow
if err := db.Ping(); err != nil {
return err
}
// Missing: Check if pool is exhausted (high wait count)
✅ DO: Check both connectivity and pool health
// CORRECT: Verify connectivity AND pool capacity
if err := db.Ping(); err != nil {
return err
}
stats := db.Stats()
if stats.WaitCount > 100 {
return fmt.Errorf("connection pool exhausted: wait_count=%d", stats.WaitCount)
}
❌ DON'T: Set unlimited connection pool
// WRONG: Can exhaust database resources
db.SetMaxOpenConns(0) // 0 = unlimited
✅ DO: Set explicit, reasonable limits
// CORRECT: Explicit limit based on capacity
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(25)
Common Probe Configuration Mistakes:
❌ DON'T: Check external dependencies in liveness probe
livenessProbe:
httpGet:
path: /health/ready # WRONG: Checks Redis, DB
Why: If Redis is down temporarily, liveness fails → container restarts → Redis still down → restart loop
✅ DO: Check only process health in liveness probe
livenessProbe:
httpGet:
path: /health/live # CORRECT: Only checks if process responds
❌ DON'T: Set aggressive failure thresholds
readinessProbe:
periodSeconds: 5
failureThreshold: 1 # WRONG: 1 failure = immediately removed
Why: Temporary network blip → immediately removed from load balancer → unnecessary disruption
✅ DO: Allow for temporary failures
readinessProbe:
periodSeconds: 10
failureThreshold: 2 # CORRECT: 2 consecutive failures (20s) before removal
❌ DON'T: Forget graceful shutdown
# No preStop hook = immediate termination
Why: WebSocket connections get abruptly closed → bad user experience
✅ DO: Drain connections before shutdown
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"] # Give connections time to close
Testing Health Checks:
# Test liveness probe locally
curl -v http://localhost:8080/health/live
# Expected: HTTP 200 OK
# Test readiness probe locally
curl -v http://localhost:8080/health/ready
# Expected: HTTP 200 OK (if Redis is up)
# Expected: HTTP 503 Service Unavailable (if Redis is down)
# Test in Kubernetes
kubectl get pods
# Check "READY" column: should show 1/1
kubectl describe pod livetemplate-app-xxx
# Check "Conditions" section for probe failures
# Simulate Redis failure
kubectl exec -it redis-0 -- redis-cli shutdown
# Watch readiness probe fail
kubectl get pods -w
# Should see READY change from 1/1 to 0/1
# Restore Redis
kubectl rollout restart statefulset/redis
# Watch readiness probe recover
# Should see READY change from 0/1 to 1/1
Monitoring Probe Health:
Query Kubernetes events to detect probe failures:
# Recent probe failures
kubectl get events --field-selector reason=Unhealthy
# Probe failures for specific pod
kubectl describe pod livetemplate-app-xxx | grep -A 5 "Liveness\|Readiness"
Prometheus metrics for probe failures:
# Liveness probe failures (container restarts)
rate(kube_pod_container_status_restarts_total{pod=~"livetemplate-app-.*"}[5m]) > 0
# Readiness probe failures (not ready)
kube_pod_status_ready{pod=~"livetemplate-app-.*", condition="false"} == 1
Local Testing:
# Start Redis
docker run -d -p 6379:6379 redis:7-alpine
# Run application
REDIS_URL=localhost:6379 go run main.go
# Test session persistence
curl -c cookies.txt http://localhost:8080/
# Restart application
pkill -9 main && REDIS_URL=localhost:6379 go run main.go &
# Verify session persisted
curl -b cookies.txt http://localhost:8080/
Integration Test:
func TestRedisSessionPersistence(t *testing.T) {
// Setup Redis and handler
redisClient := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
sessionStore := livetemplate.NewRedisSessionStore(redisClient)
controller := &TestController{}
state := &TestState{Value: 0}
handler := livetemplate.Mount(controller, livetemplate.AsState(state),
livetemplate.WithSessionStore(sessionStore),
)
// Create session
w := httptest.NewRecorder()
r := httptest.NewRequest("GET", "/", nil)
handler.ServeHTTP(w, r)
// Extract session cookie
cookies := w.Result().Cookies()
sessionCookie := cookies[0]
// Simulate restart by creating new handler
handler2 := livetemplate.Mount(controller, livetemplate.AsState(&TestState{Value: 0}),
livetemplate.WithSessionStore(sessionStore),
)
// Verify session persisted
w2 := httptest.NewRecorder()
r2 := httptest.NewRequest("GET", "/", nil)
r2.AddCookie(sessionCookie)
handler2.ServeHTTP(w2, r2)
// Session should exist (no new session created)
assert.Equal(t, sessionCookie.Value, w2.Result().Cookies()[0].Value)
}
Deployment Strategy:
Blue-Green Deployment (Recommended for first migration):
# Deploy new version with Redis to "green" environment
kubectl apply -f deployment-green.yaml
# Verify health checks pass
kubectl get pods -l version=green
# Switch traffic to green
kubectl patch service app -p '{"spec":{"selector":{"version":"green"}}}'
# Monitor for 24 hours (session TTL)
# Decommission blue environment
kubectl delete -f deployment-blue.yaml
Rolling Update (For subsequent deployments):
apiVersion: apps/v1
kind: Deployment
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # Zero downtime
Cause: State type not registered with gob.
Solution:
func init() {
gob.Register(&YourStateType{})
}
Cause: MemorySessionStore sessions cannot be migrated to Redis.
Solution: Sessions will be recreated on next user visit. For critical sessions:
sessionStore.List()
Cause: Network policy blocking Redis access.
Solution:
# Kubernetes NetworkPolicy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-redis
spec:
podSelector:
matchLabels:
app: livetemplate
egress:
- to:
- podSelector:
matchLabels:
app: redis
ports:
- protocol: TCP
port: 6379
Cause: Sessions not expiring (TTL issue).
Solution:
// Set appropriate TTL
sessionStore := livetemplate.NewRedisSessionStore(redisClient,
livetemplate.WithSessionTTL(24*time.Hour),
)
// Monitor Redis memory
redis-cli INFO memory
| Metric | MemorySessionStore | RedisSessionStore (Local) | RedisSessionStore (Remote) |
|---|---|---|---|
| Get Latency (p50) | <1μs | ~500μs | 1-5ms |
| Get Latency (p99) | <10μs | ~1ms | 5-20ms |
| Set Latency (p50) | <1μs | ~500μs | 1-5ms |
| Set Latency (p99) | <10μs | ~1ms | 5-20ms |
| Memory Overhead | None | Serialization | Serialization + Network |
| Persistence | No | Yes (RDB/AOF) | Yes (RDB/AOF) |
| Horizontal Scaling | No | Yes | Yes |
Note: Latencies are approximate and depend on network, Redis configuration, and data size.
If issues arise after migration:
Immediate rollback (if within deployment window):
kubectl rollout undo deployment/app
Graceful rollback (after deployment complete):
Partial rollback (keep Redis for some instances):
// Hybrid approach: Use Redis but fallback to memory on errors
sessionStore := livetemplate.NewRedisSessionStore(redisClient,
livetemplate.WithFallbackToMemory(true),
)
Add distributed pub/sub for multi-instance server-initiated actions:
pubsubBroadcaster := livetemplate.NewRedisBroadcaster(redisClient)
handler := livetemplate.Mount(rootStore,
livetemplate.WithSessionStore(sessionStore),
livetemplate.WithPubSubBroadcaster(pubsubBroadcaster), // Enable cross-instance updates
)
Configure monitoring for Redis metrics
Set up alerting for Redis connectivity issues
Review capacity planning for expected load
See SESSION.md for the Session API guide on server-initiated actions.
This section provides formulas and guidelines for estimating resource requirements based on your expected load.
Per WebSocket Connection:
Conservative: 15-200 KB
Realistic (medium complexity): 50 KB
Optimized (M3): 30 KB
Components of Connection Memory:
Example Calculations:
| Connections | Memory (Conservative) | Memory (Realistic) | Instances (16GB RAM) |
|---|---|---|---|
| 1,000 | 200 MB | 50 MB | 1 |
| 10,000 | 2 GB | 500 MB | 1 |
| 50,000 | 10 GB | 2.5 GB | 1-2 |
| 100,000 | 20 GB | 5 GB | 2-4 |
| 1,000,000 | 200 GB | 50 GB | 10-20 |
Application Instance Overhead:
Formula for Instance Memory:
Total Memory = (Connections × Memory per Connection) + Overhead + Spike Buffer
Example:
10,000 connections × 50 KB = 500 MB
Overhead (OS + DB + Redis + App) = 4 GB
Spike Buffer (30%) = 1.35 GB
Total Memory Required = 5.85 GB ≈ 6-8 GB instance
Per Session (Session Group):
Base session metadata: 500 bytes - 1 KB
Serialized State: Varies by application (1-100 KB typical)
Redis overhead: 20% (data structure overhead, fragmentation)
Example State Sizes:
// Small: ~2 KB
type TodoState struct {
Items []Todo // 10 items × 200 bytes
}
// Medium: ~20 KB
type DashboardState struct {
Metrics map[string]int // 100 metrics × 50 bytes
Alerts []Alert // 10 alerts × 500 bytes
UserPrefs UserPreferences // 1 KB
}
// Large: ~100 KB
type ChatState struct {
Messages []Message // 100 messages × 1 KB
Users []User // 50 users × 100 bytes
}
Redis Memory Formula:
Redis Memory = (Active Sessions × Avg State Size × 1.2) + Redis Overhead
Redis Overhead:
Example Calculations:
| Active Sessions | Avg State Size | Redis Memory (No HA) | Redis Memory (Sentinel) |
|---|---|---|---|
| 1,000 | 5 KB | 6 MB + 50 MB = 56 MB | 156 MB |
| 10,000 | 10 KB | 120 MB + 50 MB = 170 MB | 270 MB |
| 100,000 | 20 KB | 2.4 GB + 100 MB = 2.5 GB | 3.5 GB |
| 1,000,000 | 30 KB | 36 GB + 500 MB = 36.5 GB | 37.5 GB |
Redis Memory Recommendations:
Session TTL Impact:
# Shorter TTL = Lower memory usage
24 hour TTL: 100K sessions × 20 KB = 2.4 GB
6 hour TTL: 25K sessions × 20 KB = 600 MB # 4x reduction
1 hour TTL: 4K sessions × 20 KB = 96 MB # 25x reduction
Monitoring Redis Memory:
# Check current memory usage
redis-cli INFO memory
# Key metrics to monitor:
# - used_memory_human: Total memory used
# - used_memory_rss_human: OS-reported memory
# - mem_fragmentation_ratio: Should be 1.0-1.5
# - evicted_keys: Should be 0 (we use TTL, not eviction)
# Session count
redis-cli DBSIZE
Per Instance:
| Load Type | CPU per 1K Connections | CPU per 10K Connections |
|---|---|---|
| Idle connections | 0.1-0.2 cores | 1-2 cores |
| Active browsing (1 action/min) | 0.5-1 cores | 5-10 cores |
| Heavy interaction (10 actions/min) | 2-4 cores | 20-40 cores |
Redis CPU:
Recommendation:
WebSocket Traffic:
Per Instance Bandwidth:
10K connections × 100 KB/min avg = 1 GB/min = 16.7 MB/s
Redis Pub/Sub Bandwidth:
Message size × Broadcast rate × Instance count
Example: 5 KB message × 100 broadcasts/sec × 10 instances = 5 MB/s
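Both bandwidth figures above can be checked with simple arithmetic; this sketch uses decimal units (1 MB = 1000 KB) to match the numbers in the examples:

```go
package main

import "fmt"

// perInstanceMBps: connections × average KB/min per connection, as MB/s.
func perInstanceMBps(connections int, kbPerMinPerConn float64) float64 {
	return float64(connections) * kbPerMinPerConn / 1000 / 60
}

// pubsubMBps: message size × broadcast rate × subscribing instance count, as MB/s.
func pubsubMBps(msgKB, broadcastsPerSec float64, instances int) float64 {
	return msgKB * broadcastsPerSec * float64(instances) / 1000
}

func main() {
	fmt.Printf("WebSocket: %.1f MB/s\n", perInstanceMBps(10_000, 100)) // 10K conns × 100 KB/min
	fmt.Printf("Pub/Sub:   %.1f MB/s\n", pubsubMBps(5, 100, 10))       // 5 KB × 100/s × 10 instances
}
```

Note that every instance subscribed to a channel receives every published message, so pub/sub bandwidth grows linearly with instance count even when few connections care about the update.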
Recommendation:
Rule of Thumb:
Load Balancer Strategy:
- `livetemplate-id` (LiveTemplate session ID)
Session Distribution:
Use this table to determine when to scale horizontally (add instances) vs vertically (larger instances):
| Scenario | Current State | Recommended Action |
|---|---|---|
| Memory at 80% | Single instance | Add more instances (horizontal scale) |
| CPU at 80% | Single instance | Add more instances or upgrade instance size |
| High broadcast latency | Multiple instances | Add more Redis resources or optimize broadcasts |
| Uneven load | Multiple instances | Enable connection migration (M3) or adjust LB algorithm |
| Session store slow | Redis at capacity | Upgrade Redis instance or switch to Cluster |
Scenario: E-commerce platform with 50,000 concurrent users
Requirements:
Calculations:
1. Application Instances:
Connection memory: 50,000 × 50 KB = 2.5 GB
Overhead: 4 GB
Spike buffer (30%): 2 GB
Total application memory (fleet-wide): 8.5 GB
Instance size: 16 GB (provides headroom)
Connections per instance: 10,000 (60% of capacity)
Required instances: 50,000 / 10,000 = 5 instances
CPU per instance (moderate load): 4-6 cores
Recommended: 5× instances with 8 cores, 16 GB RAM
2. Redis Session Store:
Active sessions: 50,000
Avg store size: 30 KB
Redis memory: 50,000 × 30 KB × 1.2 = 1.8 GB
Add overhead: 100 MB
Total: 2 GB
Recommended: Redis Sentinel (3 nodes, 4 GB each)
3. Load Balancer:
Sticky sessions enabled
Algorithm: Least-connections fallback
Health checks: /health/ready (every 10s)
Connection draining: 30s timeout
4. Total Infrastructure:
Application: 5 instances × $50/month = $250
Redis Sentinel: 3 nodes × $30/month = $90
Load Balancer: $40/month
Database: $100/month
Total: ~$480/month (Tier 2-3 scale)
Formula Spreadsheet:
Target Connections: [input]
Memory per Connection: 50 KB (default)
Sessions per Connection: 1 (default)
State Size per Session: 20 KB (default)
→ Application Memory: [calculated]
→ Redis Memory: [calculated]
→ Instance Count: [calculated]
→ Estimated Cost: [calculated]
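A minimal sketch of that calculator in Go, using the spreadsheet's defaults (50 KB per connection, 20 KB state per session) and the worked example's assumptions (10K connections per instance, 4 GB base overhead, 30% spike buffer); all constants are assumptions you should replace with measured values:

```go
package main

import (
	"fmt"
	"math"
)

// Plan holds the derived capacity figures.
type Plan struct {
	AppMemoryGB   float64 // fleet-wide application memory, incl. overhead and spike buffer
	RedisMemoryGB float64 // session store memory before baseline Redis overhead
	Instances     int
}

func capacityPlan(targetConns int) Plan {
	const (
		memPerConnKB     = 50.0   // spreadsheet default
		stateSizeKB      = 20.0   // spreadsheet default per session
		connsPerInstance = 10_000 // per-instance target from the worked example
	)
	connMemGB := float64(targetConns) * memPerConnKB / 1e6
	appMemGB := (connMemGB + 4.0) * 1.3 // +4 GB overhead, +30% spike buffer
	redisGB := float64(targetConns) * stateSizeKB * 1.2 / 1e6
	instances := int(math.Ceil(float64(targetConns) / connsPerInstance))
	return Plan{AppMemoryGB: appMemGB, RedisMemoryGB: redisGB, Instances: instances}
}

func main() {
	p := capacityPlan(50_000)
	fmt.Printf("instances: %d, app memory: %.1f GB, redis: %.1f GB\n",
		p.Instances, p.AppMemoryGB, p.RedisMemoryGB)
}
```

With the worked example's 30 KB average state size instead of the 20 KB default, the Redis figure rises to the 1.8 GB computed above.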
Monitoring Capacity:
# Connection capacity utilization
(livetemplate_connections_active / livetemplate_connections_max) > 0.7
# Memory capacity utilization
(process_resident_memory_bytes / node_memory_MemTotal_bytes) > 0.8
# Redis memory utilization
(redis_memory_used_bytes / redis_memory_max_bytes) > 0.8
When to Scale Up (Vertical):
When to Scale Out (Horizontal):
- More than 10K concurrent connections
When to Use Redis Cluster (vs Sentinel):
- More than 100K active sessions
| Metric | M1 | M2 | M3 |
|---|---|---|---|
| Max Connections | 10K | 20K | 50K+ |
| Action Latency (p95) | <100ms | <50ms | <20ms |
| Broadcast Latency (p95) | <50ms | <100ms | <50ms |
| Memory per Connection | 100 KB | 70 KB | 30 KB |
| Goroutines per Connection | 1 | 1 | 0.5 |
| Metric | M2 | M3 |
|---|---|---|
| Total Connections | 200K | 500K+ |
| Broadcast Fan-out Time (10K users) | 200ms | 100ms |
| Session Lookup Latency (Redis) | <5ms | <2ms |
| Cross-Instance Broadcast Latency | <100ms | <50ms |
Note: Benchmarks are approximate and depend on hardware, network, and workload characteristics.
Symptoms:
- `livetemplate_connections_rejected_total` increasing
Solutions:
- Set `MaxConnectionsPerGroup` to prevent single-user exhaustion
Symptoms:
Solutions:
- Check `redis-cli --latency` and network latency
Symptoms:
Solutions:
- Verify `WithFallbackToMemory` is not masking issues
Symptoms:
Solutions:
Connection Health:
livetemplate_connections_active{instance="host1"} > 8000 # 80% of 10K limit
livetemplate_connections_rejected_total > 100
Performance:
livetemplate_action_duration_seconds{quantile="0.95"} > 0.200 # 200ms
rate(livetemplate_broadcasts_sent_total[5m]) > 10000  # High broadcast rate
Resource Usage:
process_resident_memory_bytes > 13e9 # 13 GB of 16 GB
redis_connected_clients{instance="redis1"} > 9000 # 90% of Redis max clients
Critical (page on-call):
Warning (Slack notification):
Info (metrics only):
Questions? Open an issue on GitHub or join the discussion.