Observability & Operations

This document defines how Grove is monitored, how incidents are handled, and what operational procedures ensure platform reliability.

Observability Strategy

Monitoring Stack

Current Implementation

Layer	Tool	Status
Error Tracking	Sentry	Planned
Analytics	PostHog	Planned
Logs	Supabase Dashboard	Active
Metrics	Supabase Dashboard	Active
Uptime	Supabase Status	Active

Planned Stack

Key Metrics

Application Health Metrics

Metric	Target	Critical Threshold
App Crash Rate	< 0.1%	> 1%
ANR Rate (Android)	< 0.1%	> 0.5%
API Error Rate	< 1%	> 5%
API Latency (p95)	< 500ms	> 2000ms
Auth Success Rate	> 99%	< 95%
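
The thresholds above can be turned into a small classifier that dashboards or alert jobs could share. This is a minimal sketch: the type names, metric keys, and the intermediate "warning" status are assumptions made for illustration, and only the numeric values come from the table.

```typescript
// All names here are hypothetical; only the numbers come from the table above.
type MetricName =
  | 'crashRate'
  | 'anrRate'
  | 'apiErrorRate'
  | 'apiLatencyP95Ms'
  | 'authSuccessRate';

interface HealthThreshold {
  target: number; // boundary of the healthy range
  critical: number; // boundary past which the metric is critical
  higherIsBetter: boolean; // true only for success-style metrics
}

type HealthStatus = 'ok' | 'warning' | 'critical';

// Rates are percentages; latency is in milliseconds.
const thresholds: Record<MetricName, HealthThreshold> = {
  crashRate: { target: 0.1, critical: 1, higherIsBetter: false },
  anrRate: { target: 0.1, critical: 0.5, higherIsBetter: false },
  apiErrorRate: { target: 1, critical: 5, higherIsBetter: false },
  apiLatencyP95Ms: { target: 500, critical: 2000, higherIsBetter: false },
  authSuccessRate: { target: 99, critical: 95, higherIsBetter: true },
};

function classify(value: number, t: HealthThreshold): HealthStatus {
  // Flip the sign for "higher is better" metrics so one comparison covers both cases.
  const sign = t.higherIsBetter ? -1 : 1;
  if (sign * value > sign * t.critical) return 'critical';
  if (sign * value < sign * t.target) return 'ok';
  return 'warning';
}
```

Values that sit between the target and the critical threshold are reported as "warning" in this sketch, which is one possible reading of the gap between the two columns.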

Business Metrics

Metric	Description	Tracking
Daily Active Users	Unique users per day	PostHog
Weekly Active Users	Unique users per week	PostHog
Communities Created	New communities per week	Custom
Events Created	Events per week	Custom
Message Volume	Messages per day	Database
User Retention (D1, D7, D30)	Cohort analysis	PostHog

Infrastructure Metrics

Metric	Source	Alert On
Database Size	Supabase	> 400 MB (free tier)
Storage Usage	Supabase	> 800 MB
API Requests	Supabase	> 1.5M/month
Realtime Connections	Supabase	> 150 concurrent
Edge Function Invocations	Supabase	> 400K/month

Error Tracking

Sentry Configuration

import * as Sentry from '@sentry/react-native';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: __DEV__ ? 'development' : 'production',
  tracesSampleRate: 0.2, // trace 20% of transactions
  profilesSampleRate: 0.1, // profile 10% of the traced transactions
  attachStacktrace: true,
  enableAutoSessionTracking: true,
  sessionTrackingIntervalMillis: 30000, // a session ends after 30 s in the background
  beforeSend(event) {
    // Scrub sensitive data before the event leaves the device
    if (event.user) {
      delete event.user.email;
    }
    return event;
  },
});
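
The scrubbing rule in beforeSend could also live in a standalone pure function, which makes it easy to unit-test without initializing the SDK. The sketch below is an assumption-laden illustration: ScrubbableEvent and EventUser are hypothetical, simplified shapes and not Sentry's own event types.

```typescript
// Hypothetical, simplified shapes; the real SDK event type has many more fields.
interface EventUser {
  id?: string;
  email?: string;
  username?: string;
}

interface ScrubbableEvent {
  message?: string;
  user?: EventUser;
}

// Returns a copy of the event with the user's email removed.
// The input object is left untouched.
function scrubEvent(event: ScrubbableEvent): ScrubbableEvent {
  if (!event.user) return event;
  const user: EventUser = { ...event.user };
  delete user.email;
  return { ...event, user };
}
```

Working on a copy rather than mutating the argument is a design choice of this sketch; the hook shown above mutates the event in place, which is also fine.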

Error Categories

Error Severity Levels

Level	Definition	Response Time	Examples
Critical	Service down, data loss	Immediate	Auth broken, DB unreachable
High	Major feature broken	4 hours	Chat not working, events fail
Medium	Minor feature issue	24 hours	Image upload fails sometimes
Low	Cosmetic, edge case	1 week	UI glitch on specific device
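
To make the response times actionable, a triage tool could compute a response deadline from the severity level. The following is a minimal sketch under the assumption that "Immediate" means a zero-length window and "1 week" means seven days; the function and type names are illustrative.

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low';

const HOUR_MS = 60 * 60 * 1000;

// Response windows from the table above; "Immediate" is modelled as zero delay.
const responseWindowMs: Record<Severity, number> = {
  critical: 0,
  high: 4 * HOUR_MS,
  medium: 24 * HOUR_MS,
  low: 7 * 24 * HOUR_MS,
};

// Latest time by which someone should have responded to the error.
function responseDeadline(severity: Severity, reportedAt: Date): Date {
  return new Date(reportedAt.getTime() + responseWindowMs[severity]);
}
```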

Analytics

Event Taxonomy

User Events:

// Authentication (method: 'email' | 'google' | 'apple')
analytics.track('user_signed_up', { method: 'email' });
analytics.track('user_signed_in', { method: 'google' });
analytics.track('user_signed_out');

// Communities (join method: 'invite' | 'request' | 'public')
analytics.track('community_created', { category, join_mode });
analytics.track('community_joined', { method: 'invite' });
analytics.track('community_left', { reason });

// Events (status: 'going' | 'maybe' | 'not_going')
analytics.track('event_created', { is_recurring, has_location });
analytics.track('event_rsvp', { status: 'going' });

// Engagement (has_media is a boolean)
analytics.track('message_sent', { has_media: true });
analytics.track('post_created', { has_media: false });
analytics.track('comment_added');

// Finance (type: 'income' | 'expense')
analytics.track('transaction_added', { type: 'expense' });
analytics.track('contribution_created', { amount, member_count });
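
One way to keep event names and property shapes consistent with the taxonomy above is a typed wrapper around the analytics client. This is a sketch only: EventMap covers a subset of the events, and AnalyticsClient with its capture method is a hypothetical interface that a real SDK would need to be adapted to.

```typescript
type AuthMethod = 'email' | 'google' | 'apple';

// Subset of the taxonomy above, expressed as an event-name to properties map.
interface EventMap {
  user_signed_up: { method: AuthMethod };
  user_signed_in: { method: AuthMethod };
  community_joined: { method: 'invite' | 'request' | 'public' };
  event_rsvp: { status: 'going' | 'maybe' | 'not_going' };
  message_sent: { has_media: boolean };
  transaction_added: { type: 'income' | 'expense' };
}

// Hypothetical client interface, not the API of any specific analytics SDK.
interface AnalyticsClient {
  capture(event: string, properties: object): void;
}

// Returns a track function that only accepts known events with matching properties.
function createTracker(client: AnalyticsClient) {
  return function track<E extends keyof EventMap>(
    event: E,
    properties: EventMap[E],
  ): void {
    client.capture(event, properties);
  };
}
```

With this wrapper, a call such as track('event_rsvp', { status: 'yes' }) would be rejected at compile time because 'yes' is not one of the allowed status values.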

Funnel Analysis

Core User Journey:

Expected Conversion Rates:

Step	Target
Install → Sign Up	> 60%
Sign Up → Create Community	> 40%
Create Community → Invite	> 70%
Invite → First Event	> 50%
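
Step conversion is the number of users reaching a stage divided by the number who reached the previous one. The sketch below computes each step and compares it with the targets in the table; the stage counts are assumed inputs (for example, an export from the analytics tool), and all names are illustrative.

```typescript
interface StageCount {
  stage: string;
  users: number;
}

interface StepResult {
  step: string;
  rate: number; // fraction between 0 and 1
  target: number; // fraction between 0 and 1
  meetsTarget: boolean;
}

// Targets from the table above, in funnel order, as fractions.
const funnelTargets: number[] = [0.6, 0.4, 0.7, 0.5];

function stepConversions(stages: StageCount[], targets: number[]): StepResult[] {
  const results: StepResult[] = [];
  for (let i = 1; i < stages.length; i++) {
    const from = stages[i - 1];
    const to = stages[i];
    const target = targets[i - 1];
    // Stop when a stage or its target is missing.
    if (!from || !to || target === undefined) break;
    const rate = from.users === 0 ? 0 : to.users / from.users;
    results.push({
      step: `${from.stage} -> ${to.stage}`,
      rate,
      target,
      meetsTarget: rate > target,
    });
  }
  return results;
}
```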

Retention Cohorts

Metric	Target	Healthy
Day 1 Retention	> 40%	> 50%
Day 7 Retention	> 20%	> 30%
Day 30 Retention	> 10%	> 15%
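
Day N retention is commonly computed as the share of a signup cohort that is active again N days after signing up. A minimal sketch, assuming user IDs are available as sets (the function name and inputs are illustrative, and the analytics tool may use a slightly different definition):

```typescript
// Share of a signup cohort that was active again on day N.
function retentionRate(cohort: Set<string>, activeOnDayN: Set<string>): number {
  if (cohort.size === 0) return 0;
  let retained = 0;
  cohort.forEach((userId) => {
    if (activeOnDayN.has(userId)) retained += 1;
  });
  return retained / cohort.size;
}
```

Users who were active on day N but are not part of the cohort are ignored, so the result always stays between 0 and 1.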

Logging

Log Levels

Level	Usage	Example
ERROR	Unexpected failures	API call failed, exception
WARN	Degraded operation	Retry succeeded, rate limited
INFO	Significant events	User action, state change
DEBUG	Development only	Internal state, calculations

Structured Logging

// Log format
const log = {
  timestamp: new Date().toISOString(),
  level: 'INFO',
  message: 'Event created',
  context: {
    user_id: 'uuid',
    community_id: 'uuid',
    event_id: 'uuid',
    is_recurring: true,
  },
  metadata: {
    app_version: '1.0.0',
    platform: 'ios',
    device: 'iPhone 14',
  },
};
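
A small logger factory can produce entries in the format above and drop anything below a minimum level. This is a sketch under assumptions: createLogger, the numeric level ordering, and the sink callback are illustrative and not part of an existing Grove module.

```typescript
type LogLevel = 'DEBUG' | 'INFO' | 'WARN' | 'ERROR';

// Numeric ordering used only to compare levels.
const levelOrder: Record<LogLevel, number> = {
  DEBUG: 10,
  INFO: 20,
  WARN: 30,
  ERROR: 40,
};

interface LogEntry {
  timestamp: string;
  level: LogLevel;
  message: string;
  context: Record<string, unknown>;
  metadata: Record<string, string>;
}

// The sink receives one JSON line per entry (console, file, remote collector, ...).
function createLogger(
  minLevel: LogLevel,
  metadata: Record<string, string>,
  sink: (line: string) => void,
) {
  return function log(
    level: LogLevel,
    message: string,
    context: Record<string, unknown> = {},
  ): void {
    if (levelOrder[level] < levelOrder[minLevel]) return;
    const entry: LogEntry = {
      timestamp: new Date().toISOString(),
      level,
      message,
      context,
      metadata,
    };
    sink(JSON.stringify(entry));
  };
}
```

In production the minimum level would typically be INFO, which keeps DEBUG output limited to development builds as described in the log level table.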

Log Retention

Log Type	Retention	Purpose
Error logs	90 days	Debugging
Auth logs	30 days	Security
API logs	7 days	Performance
Debug logs	1 day	Development

Alerting

Alert Rules

Alert Configurations

Alert	Condition	Severity	Channel
High Error Rate	> 5% for 5 min	Critical	PagerDuty
API Latency Spike	p95 > 2s for 5 min	High	Slack
Auth Failures	> 10% for 10 min	High	Slack
Database Near Limit	> 90% capacity	High	Email
Low Disk Space	< 10% remaining	Medium	Email
Unusual Traffic	> 3x normal	Medium	Slack
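
Conditions of the form "> 5% for 5 min" only fire when the breach is sustained. One possible way to evaluate them over a series of samples is sketched below; the Sample shape, the minSamples guard, and the function name are assumptions, and a hosted alerting tool would normally provide this logic itself.

```typescript
interface Sample {
  timestampMs: number;
  value: number;
}

// True when every sample in the trailing window exceeds the threshold.
function sustainedBreach(
  samples: Sample[],
  threshold: number,
  windowMs: number,
  nowMs: number,
  minSamples: number,
): boolean {
  const windowStart = nowMs - windowMs;
  const recent = samples.filter(
    (s) => s.timestampMs >= windowStart && s.timestampMs <= nowMs,
  );
  // Too few data points: do not alert on a gap in reporting.
  if (recent.length < minSamples) return false;
  return recent.every((s) => s.value > threshold);
}
```

For the High Error Rate rule this would be called with a threshold of 5 (percent) and a window of 5 * 60 * 1000 milliseconds.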

Alert Response

Incident Management

Incident Severity

Level	Definition	Response	Communication
SEV1	Complete outage	All hands	Status page + Social
SEV2	Major degradation	Primary oncall	Status page
SEV3	Minor issue	Best effort	Internal only
SEV4	Cosmetic	Next sprint	Internal only

Incident Response Process

On-Call Rotation

Role	Responsibility	Escalation
Primary	First responder	15 min response
Secondary	Backup if primary unavailable	30 min response
Engineering Lead	Major incidents	As needed

Dashboards

Operations Dashboard

Sections:

Health Overview
- Current error rate
- API latency (p50, p95, p99)
- Active users
- System status
Trends
- Error rate over time
- Request volume
- User activity
Alerts
- Active alerts
- Recent incidents
- Threshold status

Product Dashboard

Sections:

User Metrics
- DAU/WAU/MAU
- New signups
- Retention cohorts
Feature Usage
- Communities created
- Events scheduled
- Messages sent
- Transactions logged
Funnels
- Onboarding completion
- Activation rate
- Feature adoption

Health Checks

Automated Checks

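// Health check endpoint
// checkDatabase, checkAuth, checkStorage and checkRealtime are assumed to
// resolve to an object of the form { status: 'ok' | 'error' }.
async function healthCheck() {
  // Run the checks in parallel so one slow dependency does not delay the rest
  const [database, auth, storage, realtime] = await Promise.all([
    checkDatabase(),
    checkAuth(),
    checkStorage(),
    checkRealtime(),
  ]);
  const checks = { database, auth, storage, realtime };

  const healthy = Object.values(checks).every(c => c.status === 'ok');

  return {
    status: healthy ? 'healthy' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  };
}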

Check Schedule

Check	Frequency	Timeout
Database connectivity	1 min	5 sec
Auth service	5 min	10 sec
Storage access	5 min	10 sec
Realtime connection	1 min	5 sec
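
The timeouts in the schedule can be enforced by racing each check against a timer. The sketch below shows one way to do that; CheckResult, withTimeout, and checkSchedule are hypothetical names, and a check that exceeds its timeout is reported as an error rather than thrown.

```typescript
type CheckResult = { status: 'ok' } | { status: 'error'; reason: string };

// Frequencies and timeouts from the table above, in milliseconds.
const checkSchedule = {
  database: { everyMs: 60000, timeoutMs: 5000 },
  auth: { everyMs: 300000, timeoutMs: 10000 },
  storage: { everyMs: 300000, timeoutMs: 10000 },
  realtime: { everyMs: 60000, timeoutMs: 5000 },
};

// Resolves with the check result, or with an error result if the check
// rejects or does not settle within timeoutMs. Never rejects.
function withTimeout(
  check: () => Promise<CheckResult>,
  timeoutMs: number,
): Promise<CheckResult> {
  return new Promise<CheckResult>((resolve) => {
    const timer = setTimeout(() => {
      resolve({ status: 'error', reason: `timed out after ${timeoutMs} ms` });
    }, timeoutMs);

    // Start inside a promise chain so synchronous throws are also caught.
    Promise.resolve()
      .then(() => check())
      .then(
        (result) => {
          clearTimeout(timer);
          resolve(result);
        },
        (err: unknown) => {
          clearTimeout(timer);
          resolve({ status: 'error', reason: String(err) });
        },
      );
  });
}
```

A scheduler could then call, for example, withTimeout(checkDatabase, checkSchedule.database.timeoutMs) every checkSchedule.database.everyMs milliseconds, assuming checkDatabase returns a CheckResult.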

Runbooks

Common Issues

High Error Rate

SYMPTOMS: Error rate > 5%
POSSIBLE CAUSES:
- API changes without client update
- Database connection issues
- Third-party service outage

STEPS:
1. Check Sentry for error distribution
2. Review recent deployments
3. Check Supabase status
4. Roll back if deployment-related
5. Escalate to engineering lead if unresolved

Slow API Response

SYMPTOMS: p95 latency > 2 seconds
POSSIBLE CAUSES:
- Database query performance
- High traffic spike
- N+1 queries

STEPS:
1. Check Supabase query performance
2. Review slow query logs
3. Check for traffic anomalies
4. Enable query caching if needed
5. Scale resources if traffic-related

Auth Failures

SYMPTOMS: Auth success rate < 95%
POSSIBLE CAUSES:
- OAuth provider issues
- Token validation errors
- Rate limiting

STEPS:
1. Check OAuth provider status
2. Review auth error messages
3. Check for rate limiting
4. Verify API keys are valid
5. Contact the provider if the issue is on their side

Service Level Objectives (SLOs)

Metric	SLO	Error Budget
Availability	99.9%	43 min/month
API Latency (p95)	< 500ms	N/A
Error Rate	< 1%	N/A
Push Delivery	95% within 10s	5% delayed
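
The availability error budget follows directly from the SLO: it is the fraction of the period that is allowed to be unavailable. A short sketch of the arithmetic, assuming a 30-day month (function names are illustrative):

```typescript
// Allowed downtime in minutes for a given availability SLO and period length.
function errorBudgetMinutes(sloPercent: number, periodDays: number): number {
  const totalMinutes = periodDays * 24 * 60;
  return totalMinutes * (1 - sloPercent / 100);
}

// Budget left after the downtime already recorded in the period.
// A negative result means the SLO has been missed.
function remainingBudgetMinutes(
  sloPercent: number,
  periodDays: number,
  downtimeMinutes: number,
): number {
  return errorBudgetMinutes(sloPercent, periodDays) - downtimeMinutes;
}
```

For 99.9% over 30 days this gives 43,200 minutes × 0.001 ≈ 43.2 minutes, which matches the 43 min/month figure in the table.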

SLO Tracking

Capacity Planning

Current Limits

Resource	Limit (Free)	Current	Runway
Database	500 MB	50 MB	10x
Storage	1 GB	100 MB	10x
API Requests	2M/month	100K	20x
Realtime	200 concurrent	10	20x
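
The Runway column is the limit divided by current usage. The sketch below reproduces that ratio and adds a rough projection of how many months remain before a limit is hit; the compound monthly growth model and the function names are assumptions for illustration, not measured growth data.

```typescript
interface ResourceUsage {
  limit: number;
  current: number;
}

// How many times current usage fits into the limit (the Runway column).
function headroom(r: ResourceUsage): number {
  return r.limit / r.current;
}

// Months until the limit is reached, assuming usage grows by a fixed
// percentage each month (compound growth). Illustrative model only.
function monthsUntilLimit(r: ResourceUsage, monthlyGrowthRate: number): number {
  if (r.current >= r.limit) return 0;
  if (monthlyGrowthRate <= 0) return Infinity;
  return Math.log(r.limit / r.current) / Math.log(1 + monthlyGrowthRate);
}
```

For example, headroom({ limit: 500, current: 50 }) returns 10 for the database row, and at an assumed 20% monthly growth that headroom would last roughly a year.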

Scaling Triggers

Trigger	Priority	Action
Database > 400 MB	High	Upgrade to Pro
Storage > 800 MB	High	Upgrade to Pro
API > 1.5M/month	Medium	Monitor closely
Users > 1000 DAU	Low	Plan upgrade
