Observability & Operations

This document defines how Grove is monitored, how incidents are handled, and what operational procedures ensure platform reliability.

Observability Strategy


Monitoring Stack

Current Implementation

| Layer | Tool | Status |
| --- | --- | --- |
| Error Tracking | Sentry | Planned |
| Analytics | PostHog | Planned |
| Logs | Supabase Dashboard | Active |
| Metrics | Supabase Dashboard | Active |
| Uptime | Supabase Status | Active |

Planned Stack


Key Metrics

Application Health Metrics

| Metric | Target | Critical Threshold |
| --- | --- | --- |
| App Crash Rate | < 0.1% | > 1% |
| ANR Rate (Android) | < 0.1% | > 0.5% |
| API Error Rate | < 1% | > 5% |
| API Latency (p95) | < 500ms | > 2000ms |
| Auth Success Rate | > 99% | < 95% |
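
The thresholds above can be checked mechanically. The following is a minimal sketch, not an existing Grove module: the metric keys, the `classifyMetric` helper, and the intermediate `warning` state (between target and critical) are assumptions introduced for illustration.

```typescript
// Whether larger values are worse (crash rate) or better (auth success rate).
type Direction = 'lower_is_better' | 'higher_is_better';
type HealthState = 'ok' | 'warning' | 'critical';
type HealthMetric =
  | 'app_crash_rate_pct'
  | 'anr_rate_pct'
  | 'api_error_rate_pct'
  | 'api_latency_p95_ms'
  | 'auth_success_rate_pct';

interface HealthThreshold {
  target: number;   // value the metric should stay within
  critical: number; // value beyond which the metric is critical
  direction: Direction;
}

// Values copied from the table above; the key names are hypothetical.
const HEALTH_THRESHOLDS: Record<HealthMetric, HealthThreshold> = {
  app_crash_rate_pct: { target: 0.1, critical: 1, direction: 'lower_is_better' },
  anr_rate_pct: { target: 0.1, critical: 0.5, direction: 'lower_is_better' },
  api_error_rate_pct: { target: 1, critical: 5, direction: 'lower_is_better' },
  api_latency_p95_ms: { target: 500, critical: 2000, direction: 'lower_is_better' },
  auth_success_rate_pct: { target: 99, critical: 95, direction: 'higher_is_better' },
};

// 'warning' (between target and critical) is an assumed intermediate state.
function classifyMetric(value: number, t: HealthThreshold): HealthState {
  if (t.direction === 'lower_is_better') {
    if (value > t.critical) return 'critical';
    return value < t.target ? 'ok' : 'warning';
  }
  if (value < t.critical) return 'critical';
  return value > t.target ? 'ok' : 'warning';
}
```

For example, a p95 latency of 1200 ms misses the 500 ms target but stays under the 2000 ms critical threshold, so this sketch reports it as `warning`.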

Business Metrics

| Metric | Description | Tracking |
| --- | --- | --- |
| Daily Active Users | Unique users per day | PostHog |
| Weekly Active Users | Unique users per week | PostHog |
| Communities Created | New communities per week | Custom |
| Events Created | Events per week | Custom |
| Message Volume | Messages per day | Database |
| User Retention (D1, D7, D30) | Cohort analysis | PostHog |

Infrastructure Metrics

| Metric | Source | Alert On |
| --- | --- | --- |
| Database Size | Supabase | > 400MB (free tier) |
| Storage Usage | Supabase | > 800MB |
| API Requests | Supabase | > 1.5M/month |
| Realtime Connections | Supabase | > 150 concurrent |
| Edge Function Invocations | Supabase | > 400K/month |

Error Tracking

Sentry Configuration

import * as Sentry from '@sentry/react-native';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: __DEV__ ? 'development' : 'production',
  tracesSampleRate: 0.2, // trace 20% of transactions
  profilesSampleRate: 0.1, // profile 10% of the traced transactions
  attachStacktrace: true,
  enableAutoSessionTracking: true,
  sessionTrackingIntervalMillis: 30000, // 30 seconds
  beforeSend(event) {
    // Scrub sensitive data before the event leaves the device
    if (event.user) {
      delete event.user.email;
    }
    return event;
  },
});
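
The scrubbing step in `beforeSend` could be extended beyond the email field. The sketch below is an assumption-laden illustration: `ScrubbableEvent` is a local stand-in rather than the Sentry SDK's event type, and the `SENSITIVE_KEYS` list is hypothetical.

```typescript
// Local stand-in for the parts of an event this sketch touches.
// This is NOT the Sentry SDK's Event type.
interface ScrubbableEvent {
  user?: { id?: string; email?: string; ip_address?: string };
  extra?: Record<string, unknown>;
}

// Substrings treated as sensitive in `extra` keys (assumed list).
const SENSITIVE_KEYS = ['email', 'password', 'token', 'phone'];

// Removes user email and IP, and redacts sensitive-looking extra fields.
// Mutates and returns the same object, as a beforeSend hook typically does.
function scrubEvent(evt: ScrubbableEvent): ScrubbableEvent {
  if (evt.user) {
    delete evt.user.email;
    delete evt.user.ip_address;
  }
  if (evt.extra) {
    for (const key of Object.keys(evt.extra)) {
      const lower = key.toLowerCase();
      if (SENSITIVE_KEYS.some((s) => lower.includes(s))) {
        evt.extra[key] = '[redacted]';
      }
    }
  }
  return evt;
}
```

A helper like this could be called from inside `beforeSend`, keeping the scrubbing rules testable without the SDK.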

Error Categories

Error Severity Levels

| Level | Definition | Response Time | Examples |
| --- | --- | --- | --- |
| Critical | Service down, data loss | Immediate | Auth broken, DB unreachable |
| High | Major feature broken | 4 hours | Chat not working, events fail |
| Medium | Minor feature issue | 24 hours | Image upload fails sometimes |
| Low | Cosmetic, edge case | 1 week | UI glitch on specific device |
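
The response times above translate directly into deadlines. A minimal sketch, assuming "Immediate" is modeled as zero hours and that the helper names (`responseDeadline`, `isOverdue`) are hypothetical:

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low';

// Response windows in hours, from the table above. "Immediate" is modeled as 0.
const RESPONSE_HOURS: Record<Severity, number> = {
  critical: 0,
  high: 4,
  medium: 24,
  low: 7 * 24,
};

// Latest time by which someone should have responded to the error.
function responseDeadline(severity: Severity, reportedAt: Date): Date {
  const windowMs = RESPONSE_HOURS[severity] * 60 * 60 * 1000;
  return new Date(reportedAt.getTime() + windowMs);
}

// True once the response window has elapsed.
function isOverdue(severity: Severity, reportedAt: Date, now: Date): boolean {
  return now.getTime() > responseDeadline(severity, reportedAt).getTime();
}
```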

Analytics

Event Taxonomy

User Events:

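// Notation: each call shows an event name and the shape of its properties,
// not a literal invocation. 'a' | 'b' lists allowed values; boolean names a type.

// Authentication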
analytics.track('user_signed_up', { method: 'email' | 'google' | 'apple' });
analytics.track('user_signed_in', { method: 'email' | 'google' | 'apple' });
analytics.track('user_signed_out');

// Communities
analytics.track('community_created', { category, join_mode });
analytics.track('community_joined', { method: 'invite' | 'request' | 'public' });
analytics.track('community_left', { reason });

// Events
analytics.track('event_created', { is_recurring, has_location });
analytics.track('event_rsvp', { status: 'going' | 'maybe' | 'not_going' });

// Engagement
analytics.track('message_sent', { has_media: boolean });
analytics.track('post_created', { has_media: boolean });
analytics.track('comment_added');

// Finance
analytics.track('transaction_added', { type: 'income' | 'expense' });
analytics.track('contribution_created', { amount, member_count });
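
One way to enforce this taxonomy at compile time is a typed event map. This is a sketch only: the property types marked "assumed" are not specified in the taxonomy, and `createTracker` and `Sink` are hypothetical names rather than part of any analytics SDK.

```typescript
type AuthMethod = 'email' | 'google' | 'apple';
type NoProps = Record<string, never>;

// Event names and property shapes from the taxonomy above.
interface GroveEvents {
  user_signed_up: { method: AuthMethod };
  user_signed_in: { method: AuthMethod };
  user_signed_out: NoProps;
  community_created: { category: string; join_mode: string }; // assumed strings
  community_joined: { method: 'invite' | 'request' | 'public' };
  community_left: { reason: string }; // assumed string
  event_created: { is_recurring: boolean; has_location: boolean }; // assumed booleans
  event_rsvp: { status: 'going' | 'maybe' | 'not_going' };
  message_sent: { has_media: boolean };
  post_created: { has_media: boolean };
  comment_added: NoProps;
  transaction_added: { type: 'income' | 'expense' };
  contribution_created: { amount: number; member_count: number }; // assumed numbers
}

// Anything that can receive an event, e.g. a thin wrapper around an analytics client.
interface Sink {
  capture(name: string, props: object): void;
}

// Returns a track function that only accepts known events with matching properties.
function createTracker(sink: Sink) {
  return function trackEvent<K extends keyof GroveEvents>(name: K, props: GroveEvents[K]): void {
    sink.capture(name, props);
  };
}
```

With this in place, `trackEvent('event_rsvp', { status: 'going' })` compiles, while a misspelled event name or an unknown status value is rejected by the type checker.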

Funnel Analysis

Core User Journey:

Expected Conversion Rates:

| Step | Target |
| --- | --- |
| Install → Sign Up | > 60% |
| Sign Up → Create Community | > 40% |
| Create Community → Invite | > 70% |
| Invite → First Event | > 50% |
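
Step-to-step conversion can be computed from the number of users reaching each step and compared against these targets. A minimal sketch; the `FunnelStep` shape and `funnelReport` helper are assumptions, and the counts would come from the analytics tool.

```typescript
interface FunnelStep {
  label: string;
  users: number; // users who reached this step
}

interface StepReport {
  transition: string;
  rate: number;   // conversion from the previous step, 0..1
  target: number; // target from the table above, 0..1
  meetsTarget: boolean;
}

// Targets for each transition, in funnel order, from the table above.
const FUNNEL_TARGETS = [0.6, 0.4, 0.7, 0.5];

function funnelReport(steps: FunnelStep[], targets: number[] = FUNNEL_TARGETS): StepReport[] {
  const report: StepReport[] = [];
  for (let i = 1; i < steps.length; i++) {
    const prev = steps[i - 1];
    const curr = steps[i];
    const target = targets[i - 1];
    if (!prev || !curr || target === undefined) break;
    // Guard against division by zero when nobody reached the previous step.
    const rate = prev.users === 0 ? 0 : curr.users / prev.users;
    report.push({
      transition: `${prev.label} → ${curr.label}`,
      rate,
      target,
      meetsTarget: rate > target,
    });
  }
  return report;
}
```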

Retention Cohorts

| Metric | Target | Healthy |
| --- | --- | --- |
| Day 1 Retention | > 40% | > 50% |
| Day 7 Retention | > 20% | > 30% |
| Day 30 Retention | > 10% | > 15% |
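
As a sketch of how these figures might be derived, the following computes day-N retention as the share of a signup cohort that is active on day N, then grades it against the table. The definition of "active", the status labels, and the helper names are assumptions; PostHog's own cohort calculation may differ.

```typescript
type RetentionDay = 'd1' | 'd7' | 'd30';
type RetentionStatus = 'below_target' | 'on_target' | 'healthy';

// Thresholds from the table above, as fractions.
const RETENTION_TARGET: Record<RetentionDay, number> = { d1: 0.4, d7: 0.2, d30: 0.1 };
const RETENTION_HEALTHY: Record<RetentionDay, number> = { d1: 0.5, d7: 0.3, d30: 0.15 };

// Share of the signup cohort that appears in the set of users active on day N.
function retentionRate(cohortIds: string[], activeOnDayN: Set<string>): number {
  if (cohortIds.length === 0) return 0;
  const retained = cohortIds.filter((id) => activeOnDayN.has(id)).length;
  return retained / cohortIds.length;
}

// The table uses strict ">" comparisons, so exactly 40% on day 1 is still below target.
function retentionStatus(day: RetentionDay, rate: number): RetentionStatus {
  if (rate > RETENTION_HEALTHY[day]) return 'healthy';
  if (rate > RETENTION_TARGET[day]) return 'on_target';
  return 'below_target';
}
```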

Logging

Log Levels

| Level | Usage | Example |
| --- | --- | --- |
| ERROR | Unexpected failures | API call failed, exception |
| WARN | Degraded operation | Retry succeeded, rate limited |
| INFO | Significant events | User action, state change |
| DEBUG | Development only | Internal state, calculations |

Structured Logging

// Log format
const log = {
  timestamp: new Date().toISOString(),
  level: 'INFO',
  message: 'Event created',
  context: {
    user_id: 'uuid',
    community_id: 'uuid',
    event_id: 'uuid',
    is_recurring: true,
  },
  metadata: {
    app_version: '1.0.0',
    platform: 'ios',
    device: 'iPhone 14',
  },
};
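
A small logger factory can produce entries in this format and drop levels below a configured minimum (for example, DEBUG outside development). This is a sketch under assumptions: `createLogger` and its `write` callback are hypothetical, and where the entries are shipped is left open.

```typescript
type LogLevel = 'DEBUG' | 'INFO' | 'WARN' | 'ERROR';

interface LogEntry {
  timestamp: string;
  level: LogLevel;
  message: string;
  context: Record<string, unknown>;
  metadata: Record<string, unknown>;
}

// Numeric ordering so levels can be compared.
const LEVEL_ORDER: Record<LogLevel, number> = { DEBUG: 0, INFO: 1, WARN: 2, ERROR: 3 };

// metadata: static fields such as app_version and platform.
// write: destination for finished entries (console, file, remote collector).
function createLogger(
  metadata: Record<string, unknown>,
  minLevel: LogLevel,
  write: (entry: LogEntry) => void,
) {
  return function logMessage(
    level: LogLevel,
    message: string,
    context: Record<string, unknown> = {},
  ): void {
    if (LEVEL_ORDER[level] < LEVEL_ORDER[minLevel]) return; // below minimum, drop
    write({ timestamp: new Date().toISOString(), level, message, context, metadata });
  };
}
```

Passing the `write` function in keeps the format independent of the transport, so the same logger can print to the console in development and forward entries elsewhere in production.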

Log Retention

| Log Type | Retention | Purpose |
| --- | --- | --- |
| Error logs | 90 days | Debugging |
| Auth logs | 30 days | Security |
| API logs | 7 days | Performance |
| Debug logs | 1 day | Development |
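
If retention were enforced by a cleanup job rather than by the hosting platform, the check per log entry might look like the sketch below. The `LogType` keys and the `isExpired` helper are assumptions for illustration.

```typescript
type LogType = 'error' | 'auth' | 'api' | 'debug';

// Retention periods in days, from the table above.
const RETENTION_DAYS: Record<LogType, number> = { error: 90, auth: 30, api: 7, debug: 1 };

const DAY_MS = 24 * 60 * 60 * 1000;

// True when the entry is older than its retention period and can be deleted.
function isExpired(type: LogType, loggedAt: Date, now: Date): boolean {
  return now.getTime() - loggedAt.getTime() > RETENTION_DAYS[type] * DAY_MS;
}
```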

Alerting

Alert Rules

Alert Configurations

| Alert | Condition | Severity | Channel |
| --- | --- | --- | --- |
| High Error Rate | > 5% for 5 min | Critical | PagerDuty |
| API Latency Spike | p95 > 2s for 5 min | High | Slack |
| Auth Failures | > 10% for 10 min | High | Slack |
| Database Near Limit | > 90% capacity | High | Email |
| Low Disk Space | < 10% remaining | Medium | Email |
| Unusual Traffic | > 3x normal | Medium | Slack |
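
Conditions of the form "above X for N minutes" need a sustained-breach check rather than a single-sample comparison, so that brief spikes do not page anyone. The sketch below shows one way to evaluate that; the `Sample` shape, the `sustainedBreach` helper, and the one-minute sampling interval are assumptions.

```typescript
interface Sample {
  at: number;    // epoch milliseconds
  value: number; // metric value, e.g. error rate in percent
}

// True when every sample in the trailing window exceeds the threshold and the
// samples reach back to the start of the window (within one sampling interval).
function sustainedBreach(
  samples: Sample[],
  threshold: number,
  windowMs: number,
  now: number,
  sampleIntervalMs: number = 60 * 1000, // assumed one sample per minute
): boolean {
  const windowStart = now - windowMs;
  const inWindow = samples.filter((s) => s.at >= windowStart && s.at <= now);
  if (inWindow.length === 0) return false;
  const earliest = Math.min(...inWindow.map((s) => s.at));
  const coversWindow = earliest - windowStart <= sampleIntervalMs;
  return coversWindow && inWindow.every((s) => s.value > threshold);
}
```

For the High Error Rate rule this would be called with a threshold of 5 and a window of five minutes.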

Alert Response


Incident Management

Incident Severity

| Level | Definition | Response | Communication |
| --- | --- | --- | --- |
| SEV1 | Complete outage | All hands | Status page + Social |
| SEV2 | Major degradation | Primary oncall | Status page |
| SEV3 | Minor issue | Best effort | Internal only |
| SEV4 | Cosmetic | Next sprint | Internal only |

Incident Response Process

On-Call Rotation

| Role | Responsibility | Escalation |
| --- | --- | --- |
| Primary | First responder | 15 min response |
| Secondary | Backup if primary unavailable | 30 min response |
| Engineering Lead | Major incidents | As needed |

Dashboards

Operations Dashboard

Sections:

  1. Health Overview

    • Current error rate
    • API latency (p50, p95, p99)
    • Active users
    • System status
  2. Trends

    • Error rate over time
    • Request volume
    • User activity
  3. Alerts

    • Active alerts
    • Recent incidents
    • Threshold status

Product Dashboard

Sections:

  1. User Metrics

    • DAU/WAU/MAU
    • New signups
    • Retention cohorts
  2. Feature Usage

    • Communities created
    • Events scheduled
    • Messages sent
    • Transactions logged
  3. Funnels

    • Onboarding completion
    • Activation rate
    • Feature adoption

Health Checks

Automated Checks

// Health check endpoint
async function healthCheck() {
  const checks = {
    database: await checkDatabase(),
    auth: await checkAuth(),
    storage: await checkStorage(),
    realtime: await checkRealtime(),
  };

  const healthy = Object.values(checks).every(c => c.status === 'ok');

  return {
    status: healthy ? 'healthy' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  };
}

Check Schedule

| Check | Frequency | Timeout |
| --- | --- | --- |
| Database connectivity | 1 min | 5 sec |
| Auth service | 5 min | 10 sec |
| Storage access | 5 min | 10 sec |
| Realtime connection | 1 min | 5 sec |
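
A scheduler only needs to know which checks are due at a given moment. The following sketch encodes the schedule above and selects the due checks; the `CheckSpec` shape and `dueChecks` helper are assumptions, and actually running each check with its timeout is left out.

```typescript
type CheckId = 'database' | 'auth' | 'storage' | 'realtime';

interface CheckSpec {
  id: CheckId;
  everyMs: number;   // how often the check runs
  timeoutMs: number; // how long to wait before treating the check as failed
}

// Frequencies and timeouts from the table above.
const CHECK_SCHEDULE: CheckSpec[] = [
  { id: 'database', everyMs: 60 * 1000, timeoutMs: 5 * 1000 },
  { id: 'auth', everyMs: 5 * 60 * 1000, timeoutMs: 10 * 1000 },
  { id: 'storage', everyMs: 5 * 60 * 1000, timeoutMs: 10 * 1000 },
  { id: 'realtime', everyMs: 60 * 1000, timeoutMs: 5 * 1000 },
];

// lastRun maps each check to the epoch-ms time it last ran; missing means never.
function dueChecks(lastRun: Partial<Record<CheckId, number>>, now: number): CheckId[] {
  return CHECK_SCHEDULE.filter((spec) => {
    const last = lastRun[spec.id];
    return last === undefined || now - last >= spec.everyMs;
  }).map((spec) => spec.id);
}
```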

Runbooks

Common Issues

High Error Rate

SYMPTOMS: Error rate > 5%
POSSIBLE CAUSES:
1. API changes without client update
2. Database connection issues
3. Third-party service outage

STEPS:
1. Check Sentry for error distribution
2. Review recent deployments
3. Check Supabase status
4. Roll back if deployment-related
5. Escalate to engineering lead if unresolved

Slow API Response

SYMPTOMS: p95 latency > 2 seconds
POSSIBLE CAUSES:
1. Database query performance
2. High traffic spike
3. N+1 queries

STEPS:
1. Check Supabase query performance
2. Review slow query logs
3. Check for traffic anomalies
4. Enable query caching if needed
5. Scale resources if traffic-related

Auth Failures

SYMPTOMS: Auth success rate < 95%
POSSIBLE CAUSES:
1. OAuth provider issues
2. Token validation errors
3. Rate limiting

STEPS:
1. Check OAuth provider status
2. Review auth error messages
3. Check for rate limiting
4. Verify API keys are valid
5. Contact the provider if the issue is on their side

Service Level Objectives (SLOs)

| Metric | SLO | Error Budget |
| --- | --- | --- |
| Availability | 99.9% | 43 min/month |
| API Latency (p95) | < 500ms | N/A |
| Error Rate | < 1% | N/A |
| Push Delivery | 95% within 10s | 5% delayed |
SLO Tracking


Capacity Planning

Current Limits

| Resource | Limit (Free) | Current | Runway |
| --- | --- | --- | --- |
| Database | 500 MB | 50 MB | 10x |
| Storage | 1 GB | 100 MB | 10x |
| API Requests | 2M/month | 100K | 20x |
| Realtime | 200 concurrent | 10 | 20x |
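
The Runway column is the ratio of the free-tier limit to current usage. The sketch below recomputes it from the figures in the table; the resource keys are hypothetical, and 1 GB is taken as 1000 MB so that the storage row yields the stated 10x.

```typescript
interface ResourceUsage {
  resource: string;
  limit: number;   // free-tier limit
  current: number; // current usage, same unit as limit
}

// Figures from the table above; key names are hypothetical.
const FREE_TIER_USAGE: ResourceUsage[] = [
  { resource: 'database_mb', limit: 500, current: 50 },
  { resource: 'storage_mb', limit: 1000, current: 100 }, // 1 GB taken as 1000 MB
  { resource: 'api_requests_per_month', limit: 2000000, current: 100000 },
  { resource: 'realtime_concurrent', limit: 200, current: 10 },
];

// How many times current usage fits into the limit.
function headroomMultiple(usage: ResourceUsage): number {
  return usage.current === 0 ? Infinity : usage.limit / usage.current;
}
```

Recomputing the multiples periodically from live usage gives an early signal before the alert thresholds are reached.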

Scaling Triggers

| Trigger | Priority | Action |
| --- | --- | --- |
| Database > 400 MB | High | Upgrade to Pro |
| Storage > 800 MB | High | Upgrade to Pro |
| API > 1.5M/month | Medium | Monitor closely |
| Users > 1000 DAU | Low | Plan upgrade |