Graph-Based Fraud Detection at Scale

Home | ← Writing | Comics | Patents

Traditional fraud detection looks at transactions in isolation. But fraud doesn't happen in isolation - it spreads through networks of connected users, shared devices, and coordinated attacks.

This is the technical story of building a graph-based fraud detection system that processed 110+ million nodes in real-time, achieving 70%+ high-risk detection accuracy while maintaining sub-second response times for booking decisions.

The Fraud Network Problem

Fraud in digital marketplaces follows predictable patterns: stolen credit cards get shared among networks of bad actors, device fingerprints appear across multiple fake accounts, and fraudulent booking behaviors cluster around specific geographic or temporal patterns.

Core Insight: Fraudsters don't operate in isolation. They form networks connected by shared devices, payment methods, phone numbers, locations, and referral chains. Traditional rule-based systems miss these network effects entirely.

Our approach models the entire user ecosystem as a graph where:

Nodes represent users with rich attribute sets (user_id, phone, email, cards, device fingerprints, location data)
Edges represent relationships (shared devices, referrals, similar booking patterns, geographic proximity)
Node classification evolves in real-time as fraud patterns emerge and propagate through the network

110M+ Total Nodes

10M Registered Users

100M First-time Users

70%+ High-Risk Detection

Sub-second Response Time

Graph Construction and Data Model

Building a fraud detection graph at this scale requires careful consideration of both data model design and performance characteristics.

Node Attributes and Feature Engineering

Each user node contains multiple layers of identity and behavioral signals:

// Node structure in Neo4j
CREATE (user:User {
    user_id: "abc123",
    phone_hash: "ph_89a7b2c1",
    email_hash: "em_4f5a6b7c", 
    device_fingerprint: "dev_aa1bb2cc",
    sim_fingerprint: "sim_8877aabb",
    registration_location: [lat, lng],
    signup_timestamp: datetime(),
    account_age_days: 45,
    booking_count: 12,
    cancellation_rate: 0.08,
    risk_score: 0.23,
    classification: "unknown"  // red, green, unknown
})

Edge Relationships: Connections form based on shared attributes and behavioral patterns:

// Relationship types
(:User)-[:SHARES_DEVICE]->(:User)      // Same device fingerprint
(:User)-[:SHARES_PHONE]->(:User)       // Same phone number  
(:User)-[:REFERRED_BY]->(:User)        // Referral chain
(:User)-[:SIMILAR_LOCATION]->(:User)   // Geographic proximity
(:User)-[:SHARES_PAYMENT]->(:User)     // Same payment method
(:User)-[:BOOKING_PATTERN]->(:User)    // Similar timing/behavior

The New User Challenge

100 million first-time users present the greatest fraud risk - they lack historical behavior data but represent the primary attack vector for stolen credentials and coordinated fraud attempts.

Behavioral Bootstrapping for New Users:

Device fingerprinting: Primary connection signal for graph insertion
Geographic anomalies: Registration location vs booking location analysis
Booking patterns: Short-term bookings with immediate start times = high risk
App usage analysis: Rushed behavior vs normal exploration patterns
Channel analysis: Referral source and acquisition pathway
Temporal signals: Time of day, day of week patterns

Multi-Algorithm Fraud Detection

Rather than relying on a single detection method, the system employs multiple graph algorithms simultaneously, each targeting different fraud patterns:

Algorithm Ensemble

1. Community Detection

Identifies tightly connected clusters of users who may be coordinating fraudulent activities. Fraud rings often exhibit high intra-cluster connectivity through shared devices, payment methods, or referral patterns.

2. Shortest Path Analysis

Measures degrees of separation from known fraudulent nodes. Users connected to confirmed fraudsters through short paths receive elevated risk scores based on relationship proximity.

3. Centrality Measures

Identifies key players in potential fraud networks using betweenness centrality and PageRank-style algorithms. Users who serve as bridges between multiple suspicious clusters warrant investigation.

4. Connected Components

Discovers isolated fraud clusters that operate independently from the main user graph. These often represent organized fraud attempts using completely fabricated identities.

Ensemble Scoring and Weight Learning

Individual algorithm outputs combine through learned weights based on historical fraud data:

def calculate_ensemble_risk_score(user_node):
    # Individual algorithm scores
    community_score = detect_fraud_communities(user_node)
    path_score = shortest_path_to_fraudsters(user_node) 
    centrality_score = calculate_centrality_risk(user_node)
    isolation_score = connected_component_analysis(user_node)
    
    # Learned weights from historical fraud cases
    weights = {
        'community': 0.35,
        'path': 0.25, 
        'centrality': 0.20,
        'isolation': 0.20
    }
    
    # Weighted ensemble
    final_score = (
        community_score * weights['community'] +
        path_score * weights['path'] +
        centrality_score * weights['centrality'] + 
        isolation_score * weights['isolation']
    )
    
    return normalize_score(final_score)

Weight Learning: Supervised learning on historical theft data determines optimal algorithm weights. The system continuously updates these weights as new fraud patterns emerge and algorithm effectiveness evolves.

Performance Architecture at Scale

Achieving sub-second response times on graphs with 110+ million nodes requires sophisticated performance optimization strategies.

Score-Based Partitioning Strategy

Key Insight: "Thieves have a habit of repeat offense." This behavioral pattern enables performance optimization through risk-based data partitioning.

The system partitions the graph based on user risk scores rather than traditional sharding approaches:

LFU Caching with Fraud Pattern Awareness

Traditional LRU caching doesn't optimize for fraud detection patterns. The system implements Least Frequently Used (LFU) caching with domain-specific optimizations:

class FraudAwareLFUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}
        self.frequencies = {}
        self.fraud_boost_factor = 2.0
        
    def get_user_risk(self, user_id):
        if user_id in self.cache:
            # Boost frequency for high-risk users
            base_freq = self.frequencies[user_id]
            risk_score = self.cache[user_id]['risk_score']
            
            if risk_score > HIGH_RISK_THRESHOLD:
                self.frequencies[user_id] = base_freq * self.fraud_boost_factor
            else:
                self.frequencies[user_id] = base_freq + 1
                
            return self.cache[user_id]
        
        # Cache miss - fetch from graph database
        return self.fetch_and_cache(user_id)

Neo4j Performance Optimization

Graph database optimization for real-time fraud detection requires multiple strategies:

// Pre-computed risk scores as node properties
MATCH (u:User)
SET u.risk_score = calculate_ensemble_score(u)
SET u.last_updated = datetime()

// Strategic indexing for hot queries
CREATE INDEX user_device_idx FOR (u:User) ON (u.device_fingerprint)
CREATE INDEX user_risk_idx FOR (u:User) ON (u.risk_score)
CREATE INDEX user_phone_idx FOR (u:User) ON (u.phone_hash)

// Optimized fraud detection query
MATCH (suspect:User {user_id: $userId})
MATCH (suspect)-[:SHARES_DEVICE|SHARES_PHONE*1..2]-(connected:User)
WHERE connected.risk_score > 0.7
RETURN suspect.risk_score, 
       collect(connected.user_id) as high_risk_connections
LIMIT 50

System Architecture Overview

The fraud detection system integrates multiple components for real-time risk assessment and graph maintenance:

FRAUD DETECTION SYSTEM ARCHITECTURE USER EVENTS & DATA INGESTION +-------------+ +-------------+ +-------------+ +-------------+ | New Booking |--->| User |--->| Kafka |--->| Feature | | Events | | Behavior | | Streaming | | Extraction | +-------------+ | Tracking | +-------------+ +-------------+ +-------------+ | | | v +-------------+ +-------------+ | +-------------+ | Device |--->| Identity | | | Graph Node | | Fingerprint | | Resolution | | | Properties | +-------------+ +-------------+ | +-------------+ | | +-------------+ +-------------+ | v | External |--->| Fraud | | +-------------+ | Fraud | | Reports | | | Neo4j | | Reports | +-------------+ | | Database | +-------------+ | +-------------+ | | REAL-TIME PROCESSING | | v v +-------------+ +-------------+ +-------------+ | Risk |<---| Multi-Algo |<---| Graph | | Assessment | | Ensemble | | Queries | | Engine | +-------------+ +-------------+ +-------------+ | ^ | | | v v | +-------------+ +-------------+ | | Booking | | Graph |------------+ | Decision | | Updates | +-------------+ +-------------+ PERFORMANCE LAYER +-------------------+ +-------------------+ +-------------------+ | HOT PARTITION | | WARM PARTITION | | COLD PARTITION | | High-Risk Users | | Medium-Risk Users | | Clean Users | | In-Memory Cache | | SSD Storage | | Cold Storage | | <10ms response | | ~50ms response | | ~200ms response | +-------------------+ +-------------------+ +-------------------+

Real-Time Graph Updates and Streaming

The system maintains graph consistency while processing continuous updates from booking events, fraud discoveries, and user behavior changes.

Event-Driven Graph Maintenance

def handle_fraud_discovery(fraudster_user_id):
    """Handle real-time fraud discovery and graph updates"""
    
    # 1. Update fraudster node properties
    update_node_classification(fraudster_user_id, "red")
    
    # 2. Find all connected nodes within 2 degrees
    connected_users = find_connected_nodes(
        fraudster_user_id, 
        max_depth=2
    )
    
    # 3. Re-compute risk scores for connected nodes
    for user_id in connected_users:
        new_risk_score = calculate_ensemble_risk_score(user_id)
        update_node_risk_score(user_id, new_risk_score)
        
        # 4. Update partitioning if risk level changed significantly
        if risk_level_changed(user_id, new_risk_score):
            migrate_partition(user_id, new_risk_score)
    
    # 5. Invalidate relevant cache entries
    invalidate_cache_for_users(connected_users)
    
    # 6. Trigger alerts for newly high-risk users
    trigger_review_queue(get_high_risk_users(connected_users))

Streaming Feature Updates

User behavior changes continuously update graph features through Kafka streams:

@kafka_consumer('user-behavior-events')
def update_behavioral_features(event):
    user_id = event['user_id']
    
    behavioral_updates = {
        'last_booking_time': event['timestamp'],
        'booking_count': increment_counter(user_id, 'bookings'),
        'cancellation_rate': calculate_cancellation_rate(user_id),
        'location_variance': update_location_variance(user_id, event['location'])
    }
    
    # Update Neo4j node properties
    update_user_properties(user_id, behavioral_updates)
    
    # Re-evaluate risk if significant behavior change
    if significant_change_detected(behavioral_updates):
        new_risk_score = calculate_ensemble_risk_score(user_id)
        propagate_risk_updates(user_id, new_risk_score)

Fraud Pattern Recognition

Device Sharing Networks

Pattern: Multiple accounts using the same device fingerprint, often with rapid account creation and immediate booking attempts.

Detection: Graph analysis reveals star-pattern connectivity where one device node connects to many user nodes with recent registration timestamps.

Referral Chain Exploitation

Pattern: Fraudsters create fake referral chains to exploit signup bonuses or referral rewards, often using slight variations of personal information.

Detection: Connected component analysis identifies isolated clusters of users with suspicious referral patterns and minimal external connectivity.

Geographic Clustering

Pattern: Coordinated attacks from specific geographic regions, often targeting high-value inventory during peak demand periods.

Detection: Spatial clustering algorithms identify unusual concentrations of new accounts with similar booking patterns in localized geographic areas.

Performance Results and Business Impact

Detection Accuracy: 70%+ high-risk detection rate with false positive rates maintained below 5% through ensemble algorithm balancing and continuous threshold tuning.

System Performance Characteristics:

Query Response Time: Sub-second risk assessment for 95% of booking requests
Graph Update Latency: New fraud discoveries propagate to connected nodes within 30 seconds
Throughput: Processing 50,000+ booking risk assessments per hour during peak periods
Scalability: Linear performance scaling through score-based partitioning and caching strategies

Business Impact:

Fraud Loss Reduction: Prevented estimated 85% of potential fraud losses through proactive detection
Investigation Efficiency: Reduced manual review burden by 60% through automated high-confidence classifications
User Experience: Maintained seamless booking experience for legitimate users while blocking fraudulent attempts
Network Effect Discovery: Uncovered previously unknown fraud networks through graph analysis, leading to systematic cleanup of organized fraud attempts

Key Engineering Insights

Graph-based approaches reveal fraud patterns invisible to traditional methods: Individual transaction analysis misses coordinated attacks and shared infrastructure that becomes obvious in graph representation.

Performance optimization must align with domain patterns: Score-based partitioning and fraud-aware caching leverage behavioral insights ("repeat offenders") for system optimization.

Real-time graph updates require careful consistency management: Balancing immediate fraud response with system performance requires sophisticated event-driven architectures and update propagation strategies.

Evolution and Future Considerations

Algorithm Adaptation: Fraud patterns evolve continuously, requiring dynamic algorithm weights and new detection methods. The ensemble approach enables adding new algorithms without disrupting existing detection capabilities.

Privacy and Compliance: Graph-based analysis of user relationships raises privacy considerations that require careful balance between fraud detection effectiveness and user data protection.

Adversarial Resistance: As fraudsters adapt to detection methods, the system requires continuous evolution to stay ahead of new attack patterns and evasion techniques.

← Back to All Writing