Low-Level Design: Customer Support Ticketing System — Routing, SLA, Escalation, and Knowledge Base

Core Entities

Ticket: ticket_id, customer_id, subject, description, priority (LOW, MEDIUM, HIGH, URGENT), status (OPEN, IN_PROGRESS, WAITING_ON_CUSTOMER, RESOLVED, CLOSED), category, assigned_agent_id, created_at, first_response_at, resolved_at.

Agent: agent_id, name, team_id, skills[], max_tickets, current_ticket_count, is_available.

SLAPolicy: policy_id, priority, first_response_sla_minutes, resolution_sla_minutes.

Message: message_id, ticket_id, sender_id, sender_type (CUSTOMER, AGENT, BOT), content, created_at.

KBArticle: article_id, title, content, tags[], helpful_count, not_helpful_count, view_count.
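
As a rough illustration, the Ticket and Agent entities above could be modeled with Python dataclasses. The integer ID types, the default values (MEDIUM priority, max_tickets of 10), and the has_capacity helper are assumptions made for this sketch and are not part of the entity list.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional


class Priority(Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"
    URGENT = "URGENT"


class Status(Enum):
    OPEN = "OPEN"
    IN_PROGRESS = "IN_PROGRESS"
    WAITING_ON_CUSTOMER = "WAITING_ON_CUSTOMER"
    RESOLVED = "RESOLVED"
    CLOSED = "CLOSED"


@dataclass
class Ticket:
    ticket_id: int
    customer_id: int
    subject: str
    description: str
    created_at: datetime
    priority: Priority = Priority.MEDIUM      # assumed default
    status: Status = Status.OPEN
    category: Optional[str] = None            # filled in by routing
    assigned_agent_id: Optional[int] = None
    first_response_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


@dataclass
class Agent:
    agent_id: int
    name: str
    team_id: int
    skills: List[str] = field(default_factory=list)
    max_tickets: int = 10                     # assumed default
    current_ticket_count: int = 0
    is_available: bool = True

    def has_capacity(self) -> bool:
        # Hypothetical helper: available and below the ticket cap.
        return self.is_available and self.current_ticket_count < self.max_tickets
```

Keeping the timestamps Optional makes "not yet responded" and "not yet resolved" explicit, which the SLA checks later rely on.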

Ticket Routing

Automated routing on ticket creation: (1) Category detection: an NLP classifier runs on subject + description and assigns a category (e.g. BILLING, TECHNICAL, RETURNS, ACCOUNT). (2) Priority assignment: rule-based signals (keywords like “urgent” or “critical”, a VIP customer tag) combined with an ML model that predicts severity. (3) Agent assignment: among available agents with the matching skill, prefer those with the most spare capacity and the strongest skill in this category; average handle time for the category can be added as a further term or used as a tie-breaker. Use a weighted score: score = (available_capacity_weight * available_capacity) + (skill_match_weight * skill_score), where available_capacity is the fraction of the agent's ticket slots still free and skill_score is normalized to the range 0 to 1. Assign the highest-scoring available agent.

from typing import Optional

class TicketRouter:
    def assign_agent(self, ticket: Ticket) -> Optional[Agent]:
        # Candidates: agents on the owning team with the matching skill and spare capacity.
        candidates = self.db.get_agents(
            skill=ticket.category,
            has_capacity=True,
            team=self.get_team_for_category(ticket.category)
        )
        if not candidates:
            return None  # queue the ticket, assign when an agent becomes available

        def score(agent: Agent) -> float:
            # Fraction of the agent's slots still free (0 = full, 1 = idle).
            # has_capacity=True is assumed to imply max_tickets > 0.
            capacity_ratio = 1 - (agent.current_ticket_count / agent.max_tickets)
            skill_level = agent.skill_level(ticket.category)  # 1-5
            return 0.6 * capacity_ratio + 0.4 * (skill_level / 5)

        return max(candidates, key=score)
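
The rule-based half of priority assignment (step 2 of the routing flow) might look like the sketch below. The keywords come from the text; the mapping from signals to priority levels (keyword plus VIP gives URGENT, either one alone gives HIGH, otherwise MEDIUM) is an assumption chosen for illustration, and the ML severity model is left out.

```python
import re

# Keywords named in the routing description.
URGENT_KEYWORDS = ("urgent", "critical")


def rule_based_priority(subject: str, description: str, is_vip: bool = False) -> str:
    """Return a priority label from keyword and VIP signals (assumed mapping)."""
    text = f"{subject} {description}".lower()
    # Match whole words only, so "insurgent" does not count as "urgent".
    keyword_hit = any(
        re.search(rf"\b{re.escape(word)}\b", text) for word in URGENT_KEYWORDS
    )
    if keyword_hit and is_vip:
        return "URGENT"
    if keyword_hit or is_vip:
        return "HIGH"
    return "MEDIUM"
```

In a fuller design the model's predicted severity could raise this result, with the rules acting as a floor.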

SLA Tracking and Escalation

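An SLA (Service Level Agreement) is defined per priority. First-response targets: URGENT within 1 hour, HIGH within 4 hours, MEDIUM within 8 hours, LOW within 24 hours. Track first_response_at (set when an agent first replies) and resolved_at (set when the ticket moves to RESOLVED). SLA breach check: a background job runs every 5 minutes. For each open ticket: sla_deadline = created_at + first_response_sla_minutes, where the value comes from the SLAPolicy for the ticket's priority and is added as a duration in minutes. If sla_deadline < NOW() and first_response_at is NULL, the SLA is breached and the ticket is escalated. Because the job runs every 5 minutes, a breach can be detected up to 5 minutes late. Escalation: reassign to a senior agent, notify the team lead via Slack/email, and mark ticket.sla_breached = true for reporting. Resolution SLA breach: same pattern, using resolution_sla_minutes and checking that resolved_at is NULL.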
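
The first-response breach check can be sketched as a pure function that the background job would call for each open ticket. The function name and the in-memory policy table are hypothetical; the minute values are the per-priority targets from the text.

```python
from datetime import datetime, timedelta
from typing import Optional

# First-response targets in minutes: 1 h, 4 h, 8 h, 24 h.
FIRST_RESPONSE_SLA_MINUTES = {"URGENT": 60, "HIGH": 240, "MEDIUM": 480, "LOW": 1440}


def first_response_breached(
    priority: str,
    created_at: datetime,
    first_response_at: Optional[datetime],
    now: datetime,
) -> bool:
    """True if the deadline has passed and no agent has replied yet."""
    deadline = created_at + timedelta(minutes=FIRST_RESPONSE_SLA_MINUTES[priority])
    return first_response_at is None and deadline < now
```

Passing now as an argument instead of reading the clock inside the function keeps the check easy to test. This sketch follows the check as described and only flags unanswered tickets; a reply that arrived after the deadline would need a separate comparison if it should also count as a breach in reports.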

Knowledge Base Integration

Deflect tickets with self-service: (1) Before submission: as the customer types the subject, query the KB for relevant articles (Elasticsearch full-text search) and show the top 3. If the customer finds their answer, no ticket is created (deflection). (2) On ticket creation: suggest KB articles to the agent to speed up resolution. (3) On resolution: prompt the agent to link the KB article used, which builds the connection between ticket categories and articles for future routing. Track KB effectiveness per article: helpful_count, not_helpful_count, and deflection rate. Archive articles with a high not-helpful rate or zero views in the last 90 days.
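
The archiving rule can be expressed as a small predicate. The text does not say what counts as a "high" not-helpful rate, so the 50% threshold and the minimum of 10 votes below are assumptions, as is the views_last_90_days input (the KBArticle entity only stores a total view_count).

```python
def should_archive(
    helpful_count: int,
    not_helpful_count: int,
    views_last_90_days: int,
    max_not_helpful_rate: float = 0.5,  # assumed threshold
    min_votes: int = 10,                # assumed: ignore rates based on few votes
) -> bool:
    """Flag an article for archiving based on views and feedback."""
    if views_last_90_days == 0:
        return True
    votes = helpful_count + not_helpful_count
    if votes < min_votes:
        return False  # not enough feedback to judge the article
    return not_helpful_count / votes > max_not_helpful_rate
```

The minimum-vote guard avoids archiving a new article after one or two negative ratings.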

Canned Responses and Macros

Agents frequently send the same response to common issues. Canned responses: pre-written templates with {{customer_name}}, {{ticket_id}}, {{order_number}} placeholders. Macros: a set of actions (set category, assign to team, add tag, send canned response) triggered by one click. Example macro “Shipping Delay”: sets category=SHIPPING, tags=delay, sends the shipping delay canned response, sets status=WAITING_ON_CUSTOMER. Macros save agents 30-60 seconds per ticket and ensure consistent messaging.
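
A minimal sketch of placeholder substitution and the "Shipping Delay" macro follows. The Macro class, the dict-based ticket, and the render and apply_macro functions are hypothetical names for illustration; a real system would also persist the changes and send the message.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Matches {{name}} placeholders such as {{customer_name}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render(template: str, values: Dict[str, str]) -> str:
    """Fill in known placeholders; leave unknown ones untouched."""
    return PLACEHOLDER.sub(
        lambda m: str(values.get(m.group(1), m.group(0))), template
    )


@dataclass
class Macro:
    name: str
    category: str
    tags: List[str]
    response_template: str
    status: str


def apply_macro(macro: Macro, ticket: dict, values: Dict[str, str]) -> str:
    """Apply every macro action to the ticket and return the rendered reply."""
    ticket["category"] = macro.category
    ticket.setdefault("tags", []).extend(macro.tags)
    ticket["status"] = macro.status
    return render(macro.response_template, values)


shipping_delay = Macro(
    name="Shipping Delay",
    category="SHIPPING",
    tags=["delay"],
    response_template="Hi {{customer_name}}, order {{order_number}} is delayed.",
    status="WAITING_ON_CUSTOMER",
)
```

Leaving unknown placeholders in the output makes a missing value visible to the agent before the message goes out, rather than silently sending a blank.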

Analytics and Reporting

Key metrics: Average First Response Time (FRT) by priority, team, and agent. Average Handle Time (AHT). Resolution rate by category. SLA compliance rate (% of tickets meeting SLA). Customer Satisfaction (CSAT): send a survey after resolution. NPS (Net Promoter Score) for long-term loyalty. Agent utilization: current_ticket_count / max_tickets. Ticket volume trends: detect spikes (product outage, bad batch of orders) by comparing hourly volume to the same hour last week. The dashboard updates in real time for current queue status; daily reports are emailed to team leads.
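
Two of these metrics reduce to simple arithmetic, sketched below. The spike rule compares the current hour with the same hour last week as described; the 2x ratio threshold and the minimum volume of 20 tickets are assumptions for illustration.

```python
def sla_compliance_rate(tickets_meeting_sla: int, total_tickets: int) -> float:
    """Percentage of tickets that met their SLA (0.0 when there are no tickets)."""
    if total_tickets == 0:
        return 0.0
    return 100.0 * tickets_meeting_sla / total_tickets


def is_volume_spike(
    current_hour_count: int,
    same_hour_last_week: int,
    ratio_threshold: float = 2.0,  # assumed: alert at twice last week's volume
    min_volume: int = 20,          # assumed: ignore noise at very low volume
) -> bool:
    """Flag an hour whose ticket count far exceeds the same hour last week."""
    if current_hour_count < min_volume:
        return False
    baseline = max(same_hour_last_week, 1)  # avoid dividing by zero
    return current_hour_count / baseline >= ratio_threshold
```

For example, 45 of 50 tickets meeting SLA gives a compliance rate of 90%, and 120 tickets in an hour that saw 50 last week is flagged as a spike.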
