Collaborative Filtering at Scale

Building recommendation systems for sparse data environments with city-specific personalization

Building recommendation systems for high-consideration purchases presents unique challenges: users interact infrequently, booking patterns are sparse, and traditional collaborative filtering approaches struggle with limited signal. Yet users expect personalized, relevant results when browsing hundreds to thousands of available options.

This is the technical story of a recommendation engine that optimized click-through rates across city-specific models, using collaborative filtering with embeddings to extract meaningful patterns from sparse user behavior data.

The Sparse Data Challenge

Unlike Netflix (where users consume many movies) or Amazon (frequent purchases), car-sharing represents a low-frequency, high-consideration domain. Most users book vehicles only occasionally, creating a classic sparse data problem for recommendation systems.

Core Challenge: Generate personalized recommendations for 200-1000 cars per city when most users have minimal direct booking history, requiring inference from weak signals like clicks, search patterns, and demographic attributes.

The system optimized for click-through rate as the primary metric, operating under the principle that relevance drives engagement, which ultimately leads to bookings and revenue.

Scale and Constraints:

Multi-City Architecture Strategy

Geographic context fundamentally shapes user preferences and inventory characteristics. Rather than building a single global model, the system employs city-specific recommendation models that account for local market dynamics.

City-Specific Model Rationale

Key Insight: "Hyderabad is polar opposite of Delhi" - Different cities exhibit distinct user behaviors, vehicle types, pricing patterns, and usage contexts that require tailored recommendation strategies.

City-specific models capture local patterns that global models would average away:

City-Specific Routing Algorithm:
1. Extract user location from the search context
2. Route to the appropriate city-specific model (Delhi/Mumbai/Bangalore/Hyderabad/Pune)
3. Apply city-specific recommendation logic
4. Return personalized results tuned to local market dynamics
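The routing step above can be sketched as a simple model registry keyed by city. This is a hypothetical illustration: the `CityModel` class, the registry dict, and the Delhi fallback are assumptions, not the production API.

```python
# Hypothetical sketch of city-specific model routing; names are illustrative.
from dataclasses import dataclass

@dataclass
class CityModel:
    city: str

    def recommend(self, user_id: str, limit: int = 20) -> list:
        # Placeholder for the per-city recommendation logic.
        return [f"{self.city}-vehicle-{i}" for i in range(limit)]

# One model per supported city, trained independently on local data.
CITY_MODELS = {c: CityModel(c) for c in
               ["delhi", "mumbai", "bangalore", "hyderabad", "pune"]}

def route_request(user_id: str, search_city: str) -> list:
    """Route a search to its city-specific model, with a safe fallback."""
    model = CITY_MODELS.get(search_city.lower())
    if model is None:
        # Unknown city: fall back to the largest market rather than failing.
        model = CITY_MODELS["delhi"]
    return model.recommend(user_id)
```

Keeping routing as a thin dispatch layer means each city's model can be retrained and deployed independently.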

Segment-Based Collaborative Filtering

Traditional collaborative filtering relies on sufficient user-item interactions. With sparse booking data, the system employs segment-based collaborative filtering that groups users with similar characteristics to amplify weak signals.

User Segmentation Strategy

User segments form the foundation for both cold-start handling and collaborative filtering enhancement. Segmentation operates hierarchically within each city:

Primary Segmentation Dimensions:

Cold Start Solution

New users present an immediate recommendation challenge. The segment-based approach provides an elegant cold-start solution:

Cold Start Process: New user arrives → Immediate segment assignment based on initial search parameters → Leverage segment-average preferences for initial recommendations → Gradually personalize as user interaction data accumulates.
Segmentation Algorithm:
1. Extract features: duration preference, temporal patterns, acquisition channel
2. Classify the user into primary segments using decision tree logic
3. Assign a composite segment ID combining multiple dimensions
4. For new users: use segment-average preferences as a baseline
5. Gradually learn individual preferences through interaction feedback
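A minimal sketch of the segment assignment and cold-start baseline, assuming illustrative dimension names and thresholds (the actual decision tree and segment taxonomy are not given in the text):

```python
# Hypothetical segment assignment; thresholds and labels are assumptions.
def assign_segment(duration_hours: float, hour_of_day: int, channel: str) -> str:
    """Compose a segment ID from duration, daypart, and acquisition channel."""
    duration = "short" if duration_hours <= 8 else "long"
    daypart = "business" if 6 <= hour_of_day < 18 else "leisure"
    return f"{duration}|{daypart}|{channel}"

def cold_start_prefs(segment_id: str, segment_avg_prefs: dict,
                     default_prefs: dict) -> dict:
    """New users inherit their segment's average preferences as a baseline,
    falling back to global defaults for unseen segments."""
    return segment_avg_prefs.get(segment_id, default_prefs)
```

As the user interacts, these segment-level priors would be blended with (and eventually dominated by) individually learned preferences.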

Collaborative Filtering with Embeddings

The core recommendation algorithm employs collaborative filtering enhanced with embedding techniques to handle sparse data effectively.

User-Item Interaction Matrix

Given the sparse booking data, the system constructs interaction matrices using multiple signal types with different weights:

Interaction Matrix Construction:
• Bookings: weight = 10.0 (strongest signal, actual conversions)
• Clicks: weight = 1.0 (abundant but weaker signal)
• Extended Views: weight = 2.0 (engaged browsing behavior)
• Filter Applications: weight = 1.5 (preference indicators)

Final Matrix: sparse representation of weighted user-vehicle interactions
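The weighted construction above can be sketched as follows. The signal weights come from the text; the event tuple format and dict-of-dicts sparse representation are assumptions for illustration.

```python
# Weights from the text; the event format is an assumption.
SIGNAL_WEIGHTS = {"booking": 10.0, "click": 1.0,
                  "extended_view": 2.0, "filter_apply": 1.5}

def build_interaction_matrix(events):
    """events: iterable of (user_id, vehicle_id, signal) tuples.
    Returns a dict-of-dicts sparse matrix of accumulated weights."""
    matrix = {}
    for user, vehicle, signal in events:
        weight = SIGNAL_WEIGHTS.get(signal, 0.0)
        if weight:
            matrix.setdefault(user, {})
            matrix[user][vehicle] = matrix[user].get(vehicle, 0.0) + weight
    return matrix
```

Accumulating weights (rather than keeping only the strongest signal) lets repeated clicks on the same vehicle approach the evidential value of a single booking.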

Embedding Architecture

The system learns dense vector representations for users and vehicles that capture latent preference patterns despite sparse interactions:

Embedding Learning Process:
1. Initialize user embeddings (128-dimensional vectors)
2. Initialize vehicle embeddings (128-dimensional vectors)
3. Learn representations through matrix factorization
4. Optimize for weighted interaction prediction
5. Include bias terms for popularity adjustments

Prediction: Interaction_Score = dot(user_vector, vehicle_vector) + biases

Multi-Signal Feature Engineering

Beyond collaborative filtering, the system incorporates rich contextual features to enhance recommendation quality and handle edge cases where collaborative signals are insufficient.

Signal Types and Weighting

User Behavior Signals:

Booking Parameter Analysis:

Contextual Features:

Feature Engineering Pipeline:
1. Historical Behavior → Extract price preferences, duration patterns, car type affinity
2. Current Session → Analyze scroll depth, time spent, filter usage, search refinements
3. Contextual Data → Time of day, day of week, booking timing, seasonal factors
4. Geographic Data → User location, vehicle proximity, area type matching
5. Composite Features → Combine multiple signals into recommendation scores
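The final combining step of the pipeline might look like a weighted blend of the feature groups. The field names and linear weights below are assumptions for illustration, not the production scoring function.

```python
# Hypothetical composite scoring; weights and field names are assumptions.
def composite_score(cf_score: float, price_match: float, proximity_km: float,
                    session_engagement: float, temporal_boost: float) -> float:
    """Blend normalized [0, 1] feature-group scores into one ranking score."""
    proximity = max(0.0, 1.0 - proximity_km / 10.0)  # decays to 0 at 10 km
    return (0.5 * cf_score            # collaborative filtering signal
            + 0.2 * price_match       # historical price preference fit
            + 0.15 * proximity        # geographic convenience
            + 0.1 * session_engagement  # current-session behavior
            + 0.05 * temporal_boost)    # time-of-day / seasonal adjustment
```

A linear blend keeps the ranking interpretable and lets the "daily feature weight adjustments" mentioned later be applied without retraining the collaborative filtering model.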

Geographic and Temporal Personalization

Effective recommendations must account for both geographic and temporal context that significantly influence user preferences and vehicle availability.

Geographic Personalization

Beyond city-specific models, the system incorporates fine-grained geographic personalization:

Geographic Personalization Algorithm:
1. Distance-based scoring: penalize vehicles far from the user's location
2. Area type matching: business district users prefer business area vehicles
3. Traffic accessibility: factor in traffic patterns and commute convenience
4. Neighborhood preferences: match the user's area type with the vehicle's location type
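The distance-based scoring step can be sketched with a great-circle distance and a linear decay. The 10 km decay radius is an assumption for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def proximity_score(user_loc, vehicle_loc, decay_km=10.0):
    """1.0 at the user's location, decaying linearly to 0 at decay_km."""
    d = haversine_km(*user_loc, *vehicle_loc)
    return max(0.0, 1.0 - d / decay_km)
```

Area-type and traffic-accessibility matching would layer additional multiplicative or additive adjustments on top of this base proximity score.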

Temporal Personalization

Temporal Patterns: Weekday morning searches often indicate business travel (sedan preference, airport proximity), while weekend evening searches suggest leisure activities (larger vehicles, entertainment areas).
Temporal Personalization Logic:
• Business hours + weekday → boost sedans/luxury vehicles
• Evening + weekend → boost SUVs/family vehicles
• Peak demand periods → prioritize high-availability vehicles
• Holiday seasons → apply seasonal preference adjustments
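The first two rules can be sketched as a boost function; the boost magnitudes and category labels are assumptions, not production values.

```python
# Hypothetical temporal boost rules; magnitudes and labels are assumptions.
def temporal_boost(category: str, hour: int, is_weekend: bool) -> float:
    """Multiplicative score boost based on the temporal rules above."""
    business_hours = 9 <= hour < 18
    if not is_weekend and business_hours and category in ("sedan", "luxury"):
        return 1.2   # weekday business hours favour sedans/luxury
    if is_weekend and hour >= 17 and category in ("suv", "family"):
        return 1.2   # weekend evenings favour SUVs/family vehicles
    return 1.0       # neutral otherwise
```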

Batch Processing Architecture

Given the low-frequency nature of bookings and the high computational cost of model training, the system employs a batch processing architecture with strategic refresh cycles.

Architectural Decision: Batch processing over real-time model updates. For high-consideration purchases with low booking frequency, real-time model updates provide minimal benefit while adding significant computational overhead.

Model Training and Deployment Pipeline

RECOMMENDATION SYSTEM ARCHITECTURE

BATCH PROCESSING PIPELINE
+----------------+    +----------------+    +----------------+    +----------------+
|      Data      |--->|    Feature     |--->|     Model      |--->|     Model      |
|   Collection   |    |  Engineering   |    |    Training    |    |   Validation   |
|    (Daily)     |    |    Pipeline    |    |  (City-wise)   |    |  & Deployment  |
+----------------+    +----------------+    +----------------+    +----------------+
        |                     |                     |                     |
        v                     v                     v                     v
+----------------+    +----------------+    +----------------+    +----------------+
|      User      |    |  Interaction   |    | Collaborative  |    |     Model      |
|  Interactions  |    |     Matrix     |    |   Filtering    |    |   Artifacts    |
|  Clicks/Books  |    |  Construction  |    |    Training    |    |   (Per City)   |
+----------------+    +----------------+    +----------------+    +----------------+

REAL-TIME SERVING LAYER
+----------------+    +----------------+    +----------------+    +----------------+
|  User Search   |--->|    Feature     |--->|     Model      |--->|     Ranked     |
|    Request     |    |   Extraction   |    |   Inference    |    |    Results     |
+----------------+    +----------------+    +----------------+    +----------------+
        |                     |                     |                     |
        v                     v                     v                     v
+----------------+    +----------------+    +----------------+    +----------------+
|  User Context  |    |   Real-time    |    | City-specific  |    |  Personalized  |
|  Session Data  |    |    Features    |    |     Model      |    |  Vehicle List  |
+----------------+    +----------------+    +----------------+    +----------------+

Model Refresh Strategy

The system balances model freshness with computational efficiency through strategic refresh cycles:

Model Refresh Schedule:
• Collaborative filtering models: weekly full retrain
• User segmentation: monthly refresh
• City-specific models: bi-weekly updates
• Feature weights: daily adjustments

Trigger Conditions:
• Time-based: regular scheduled updates
• Performance-based: when metrics drop below thresholds
• Data-based: when significant new interaction volume has accumulated
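The three trigger conditions can be combined into a single check. The thresholds below (7-day maximum age, 10% CTR drop, 50,000 new interactions) are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical retrain trigger; all thresholds are assumptions.
def should_retrain(last_trained: datetime, now: datetime,
                   current_ctr: float, baseline_ctr: float,
                   new_interactions: int,
                   max_age: timedelta = timedelta(days=7),
                   ctr_drop_threshold: float = 0.10,
                   interaction_threshold: int = 50_000) -> bool:
    """Return True if any refresh trigger condition fires."""
    if now - last_trained >= max_age:                          # time-based
        return True
    if current_ctr < baseline_ctr * (1 - ctr_drop_threshold):  # performance-based
        return True
    if new_interactions >= interaction_threshold:              # data-based
        return True
    return False
```

Evaluating the cheap time-based check first means the scheduler only needs fresh metric and volume counters when a model is approaching staleness anyway.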

A/B Testing and Evaluation

Continuous improvement of recommendation quality requires rigorous A/B testing and evaluation frameworks that balance multiple objectives.

Primary and Secondary Metrics

Primary Optimization Target:

Secondary Metrics:

A/B Test Evaluation Framework:
1. Primary metric: CTR lift > 5% with statistical significance
2. Revenue constraint: no more than a 2% revenue drop allowed
3. Secondary metrics: monitor engagement, diversity, booking conversion
4. Deployment decision: deploy if the primary metric improves and constraints are satisfied

Statistical Methods:
• Chi-square tests for CTR significance
• Bootstrap sampling for confidence intervals
• Multi-armed bandits for continuous optimization
  • Revenue Impact: constraint ensuring recommendation changes don't negatively impact revenue
  • User Engagement: time spent on recommended vehicle pages, scroll depth
  • Recommendation Diversity: ensuring recommendations don't become overly narrow
class RecommendationEvaluator:
    def __init__(self):
        self.primary_metrics = ['ctr']
        self.secondary_metrics = ['click_to_booking_rate',
                                  'revenue_per_user', 'engagement_time']
        self.constraint_metrics = ['revenue_impact']

    def evaluate_ab_test(self, control_group, treatment_group, test_duration):
        """Evaluate A/B test results for recommendation changes."""
        results = {}

        # Primary metric evaluation
        control_ctr = calculate_ctr(control_group, test_duration)
        treatment_ctr = calculate_ctr(treatment_group, test_duration)
        results['ctr_lift'] = (treatment_ctr - control_ctr) / control_ctr
        results['ctr_significance'] = calculate_statistical_significance(
            control_group.clicks, control_group.impressions,
            treatment_group.clicks, treatment_group.impressions
        )

        # Secondary metrics
        for metric in self.secondary_metrics:
            control_value = calculate_metric(control_group, metric, test_duration)
            treatment_value = calculate_metric(treatment_group, metric, test_duration)
            results[f'{metric}_lift'] = (treatment_value - control_value) / control_value
            results[f'{metric}_significance'] = calculate_significance(
                control_group, treatment_group, metric
            )

        # Constraint validation: at most a 2% revenue drop
        revenue_impact = self.validate_revenue_constraint(
            control_group, treatment_group
        )
        results['revenue_constraint_satisfied'] = revenue_impact >= -0.02

        return results

    def make_deployment_decision(self, test_results):
        """Decide whether to deploy based on test results."""
        # Primary metric must show a significant improvement
        if (test_results['ctr_lift'] > 0.05
                and test_results['ctr_significance'] < 0.05):
            # Revenue constraint must be satisfied
            if test_results['revenue_constraint_satisfied']:
                return 'DEPLOY'
            return 'REJECT_REVENUE_CONSTRAINT'
        return 'REJECT_INSUFFICIENT_IMPROVEMENT'
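The CTR significance test named in the framework can be sketched as a two-proportion z-test, which is mathematically equivalent to a 2x2 chi-square test. This is a self-contained sketch, not the production statistics stack.

```python
import math

def ctr_significance(clicks_a: int, impressions_a: int,
                     clicks_b: int, impressions_b: int) -> float:
    """Two-sided p-value for the difference between two CTRs
    (two-proportion z-test, equivalent to a 2x2 chi-square test)."""
    p_a = clicks_a / impressions_a
    p_b = clicks_b / impressions_b
    pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = math.sqrt(pooled * (1 - pooled)
                   * (1 / impressions_a + 1 / impressions_b))
    if se == 0:
        return 1.0  # no variance: treat as not significant
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal tail, via erfc.
    return math.erfc(abs(z) / math.sqrt(2))
```

With 10,000 impressions per arm, a CTR move from 1% to 2% yields a p-value far below 0.05, while identical CTRs yield p = 1.0.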

Performance Results and Business Impact

Primary Achievement: Optimized click-through rates through personalized recommendations while maintaining revenue constraints and improving user engagement across city-specific models.

System Performance:

User Experience Impact:

Key Engineering Insights

Sparse data requires creative signal amplification: In low-frequency domains, traditional collaborative filtering must be enhanced with segment-based approaches and multi-signal feature engineering to extract meaningful patterns.

Geographic context is crucial for local marketplaces: City-specific models dramatically outperform global approaches when local market dynamics, inventory, and user preferences vary significantly.

Batch processing is optimal for high-consideration purchases: Real-time model updates provide minimal benefit for infrequent, deliberate purchasing decisions while adding unnecessary computational complexity.

Evolution and Scaling Considerations

Model Sophistication: As user interaction data grows, the system can evolve from segment-based collaborative filtering toward more sophisticated deep learning approaches such as neural collaborative filtering or transformer-based recommendation models.

Real-Time Personalization: Future iterations could incorporate real-time session behavior for dynamic re-ranking within browsing sessions, balancing computational cost against personalization gains.

Cross-City Learning: Advanced approaches could leverage transfer learning to bootstrap recommendation models in new cities using patterns learned from established markets.
