Graph-Based Fraud Detection at Scale

Real-time risk analysis using multi-algorithm graph processing on 110M+ node networks

Traditional fraud detection looks at transactions in isolation. But fraud doesn't happen in isolation - it spreads through networks of connected users, shared devices, and coordinated attacks.

This is the technical story of building a graph-based fraud detection system that processed 110+ million nodes in real-time, achieving 70%+ high-risk detection accuracy while maintaining sub-second response times for booking decisions.

The Fraud Network Problem

Fraud in digital marketplaces follows predictable patterns: stolen credit cards get shared among networks of bad actors, device fingerprints appear across multiple fake accounts, and fraudulent booking behaviors cluster around specific geographic or temporal patterns.

Core Insight: Fraudsters don't operate in isolation. They form networks connected by shared devices, payment methods, phone numbers, locations, and referral chains. Traditional rule-based systems miss these network effects entirely.

Our approach models the entire user ecosystem as a graph where:

110M+ Total Nodes
10M Registered Users
100M First-time Users
70%+ High-Risk Detection
Sub-second Response Time

Graph Construction and Data Model

Building a fraud detection graph at this scale requires careful consideration of both data model design and performance characteristics.

Node Attributes and Feature Engineering

Each user node contains multiple layers of identity and behavioral signals:

// Node structure in Neo4j CREATE (user:User { user_id: "abc123", phone_hash: "ph_89a7b2c1", email_hash: "em_4f5a6b7c", device_fingerprint: "dev_aa1bb2cc", sim_fingerprint: "sim_8877aabb", registration_location: [lat, lng], signup_timestamp: datetime(), account_age_days: 45, booking_count: 12, cancellation_rate: 0.08, risk_score: 0.23, classification: "unknown" // red, green, unknown })

Edge Relationships: Connections form based on shared attributes and behavioral patterns:

// Relationship types (:User)-[:SHARES_DEVICE]->(:User) // Same device fingerprint (:User)-[:SHARES_PHONE]->(:User) // Same phone number (:User)-[:REFERRED_BY]->(:User) // Referral chain (:User)-[:SIMILAR_LOCATION]->(:User) // Geographic proximity (:User)-[:SHARES_PAYMENT]->(:User) // Same payment method (:User)-[:BOOKING_PATTERN]->(:User) // Similar timing/behavior

The New User Challenge

100 million first-time users present the greatest fraud risk - they lack historical behavior data but represent the primary attack vector for stolen credentials and coordinated fraud attempts.

Behavioral Bootstrapping for New Users:

Multi-Algorithm Fraud Detection

Rather than relying on a single detection method, the system employs multiple graph algorithms simultaneously, each targeting different fraud patterns:

Algorithm Ensemble

1. Community Detection

Identifies tightly connected clusters of users who may be coordinating fraudulent activities. Fraud rings often exhibit high intra-cluster connectivity through shared devices, payment methods, or referral patterns.

2. Shortest Path Analysis

Measures degrees of separation from known fraudulent nodes. Users connected to confirmed fraudsters through short paths receive elevated risk scores based on relationship proximity.

3. Centrality Measures

Identifies key players in potential fraud networks using betweenness centrality and PageRank-style algorithms. Users who serve as bridges between multiple suspicious clusters warrant investigation.

4. Connected Components

Discovers isolated fraud clusters that operate independently from the main user graph. These often represent organized fraud attempts using completely fabricated identities.

Ensemble Scoring and Weight Learning

Individual algorithm outputs combine through learned weights based on historical fraud data:

def calculate_ensemble_risk_score(user_node): # Individual algorithm scores community_score = detect_fraud_communities(user_node) path_score = shortest_path_to_fraudsters(user_node) centrality_score = calculate_centrality_risk(user_node) isolation_score = connected_component_analysis(user_node) # Learned weights from historical fraud cases weights = { 'community': 0.35, 'path': 0.25, 'centrality': 0.20, 'isolation': 0.20 } # Weighted ensemble final_score = ( community_score * weights['community'] + path_score * weights['path'] + centrality_score * weights['centrality'] + isolation_score * weights['isolation'] ) return normalize_score(final_score)

Weight Learning: Supervised learning on historical theft data determines optimal algorithm weights. The system continuously updates these weights as new fraud patterns emerge and algorithm effectiveness evolves.

Performance Architecture at Scale

Achieving sub-second response times on graphs with 110+ million nodes requires sophisticated performance optimization strategies.

Score-Based Partitioning Strategy

Key Insight: "Thieves have a habit of repeat offense." This behavioral pattern enables performance optimization through risk-based data partitioning.

The system partitions the graph based on user risk scores rather than traditional sharding approaches:

+-------------------+ +-------------------+ +-------------------+ | HOT PARTITION | | WARM PARTITION | | COLD PARTITION | | High-Risk Users | | Medium-Risk Users | | Clean Users | +-------------------+ +-------------------+ +-------------------+ | - Known fraudsters| | - New users | | - Verified users | | - Repeat offenders| | - Partial matches | | - Long history | | - Recent flags | | - Moderate scores | | - Low risk scores | | | | | | | | In-Memory Caching | | SSD Storage | | Cold Storage | | Sub-10ms queries | | ~50ms queries | | ~200ms queries | +-------------------+ +-------------------+ +-------------------+

LFU Caching with Fraud Pattern Awareness

Traditional LRU caching doesn't optimize for fraud detection patterns. The system implements Least Frequently Used (LFU) caching with domain-specific optimizations:

class FraudAwareLFUCache: def __init__(self, capacity): self.capacity = capacity self.cache = {} self.frequencies = {} self.fraud_boost_factor = 2.0 def get_user_risk(self, user_id): if user_id in self.cache: # Boost frequency for high-risk users base_freq = self.frequencies[user_id] risk_score = self.cache[user_id]['risk_score'] if risk_score > HIGH_RISK_THRESHOLD: self.frequencies[user_id] = base_freq * self.fraud_boost_factor else: self.frequencies[user_id] = base_freq + 1 return self.cache[user_id] # Cache miss - fetch from graph database return self.fetch_and_cache(user_id)

Neo4j Performance Optimization

Graph database optimization for real-time fraud detection requires multiple strategies:

// Pre-computed risk scores as node properties MATCH (u:User) SET u.risk_score = calculate_ensemble_score(u) SET u.last_updated = datetime() // Strategic indexing for hot queries CREATE INDEX user_device_idx FOR (u:User) ON (u.device_fingerprint) CREATE INDEX user_risk_idx FOR (u:User) ON (u.risk_score) CREATE INDEX user_phone_idx FOR (u:User) ON (u.phone_hash) // Optimized fraud detection query MATCH (suspect:User {user_id: $userId}) MATCH (suspect)-[:SHARES_DEVICE|SHARES_PHONE*1..2]-(connected:User) WHERE connected.risk_score > 0.7 RETURN suspect.risk_score, collect(connected.user_id) as high_risk_connections LIMIT 50

System Architecture Overview

The fraud detection system integrates multiple components for real-time risk assessment and graph maintenance:

FRAUD DETECTION SYSTEM ARCHITECTURE USER EVENTS & DATA INGESTION +-------------+ +-------------+ +-------------+ +-------------+ | New Booking |--->| User |--->| Kafka |--->| Feature | | Events | | Behavior | | Streaming | | Extraction | +-------------+ | Tracking | +-------------+ +-------------+ +-------------+ | | | v +-------------+ +-------------+ | +-------------+ | Device |--->| Identity | | | Graph Node | | Fingerprint | | Resolution | | | Properties | +-------------+ +-------------+ | +-------------+ | | +-------------+ +-------------+ | v | External |--->| Fraud | | +-------------+ | Fraud | | Reports | | | Neo4j | | Reports | +-------------+ | | Database | +-------------+ | +-------------+ | | REAL-TIME PROCESSING | | v v +-------------+ +-------------+ +-------------+ | Risk |<---| Multi-Algo |<---| Graph | | Assessment | | Ensemble | | Queries | | Engine | +-------------+ +-------------+ +-------------+ | ^ | | | v v | +-------------+ +-------------+ | | Booking | | Graph |------------+ | Decision | | Updates | +-------------+ +-------------+ PERFORMANCE LAYER +-------------------+ +-------------------+ +-------------------+ | HOT PARTITION | | WARM PARTITION | | COLD PARTITION | | High-Risk Users | | Medium-Risk Users | | Clean Users | | In-Memory Cache | | SSD Storage | | Cold Storage | | <10ms response | | ~50ms response | | ~200ms response | +-------------------+ +-------------------+ +-------------------+

Real-Time Graph Updates and Streaming

The system maintains graph consistency while processing continuous updates from booking events, fraud discoveries, and user behavior changes.

Event-Driven Graph Maintenance

def handle_fraud_discovery(fraudster_user_id): """Handle real-time fraud discovery and graph updates""" # 1. Update fraudster node properties update_node_classification(fraudster_user_id, "red") # 2. Find all connected nodes within 2 degrees connected_users = find_connected_nodes( fraudster_user_id, max_depth=2 ) # 3. Re-compute risk scores for connected nodes for user_id in connected_users: new_risk_score = calculate_ensemble_risk_score(user_id) update_node_risk_score(user_id, new_risk_score) # 4. Update partitioning if risk level changed significantly if risk_level_changed(user_id, new_risk_score): migrate_partition(user_id, new_risk_score) # 5. Invalidate relevant cache entries invalidate_cache_for_users(connected_users) # 6. Trigger alerts for newly high-risk users trigger_review_queue(get_high_risk_users(connected_users))

Streaming Feature Updates

User behavior changes continuously update graph features through Kafka streams:

@kafka_consumer('user-behavior-events') def update_behavioral_features(event): user_id = event['user_id'] behavioral_updates = { 'last_booking_time': event['timestamp'], 'booking_count': increment_counter(user_id, 'bookings'), 'cancellation_rate': calculate_cancellation_rate(user_id), 'location_variance': update_location_variance(user_id, event['location']) } # Update Neo4j node properties update_user_properties(user_id, behavioral_updates) # Re-evaluate risk if significant behavior change if significant_change_detected(behavioral_updates): new_risk_score = calculate_ensemble_risk_score(user_id) propagate_risk_updates(user_id, new_risk_score)

Fraud Pattern Recognition

Device Sharing Networks

Pattern: Multiple accounts using the same device fingerprint, often with rapid account creation and immediate booking attempts.

Detection: Graph analysis reveals star-pattern connectivity where one device node connects to many user nodes with recent registration timestamps.

Referral Chain Exploitation

Pattern: Fraudsters create fake referral chains to exploit signup bonuses or referral rewards, often using slight variations of personal information.

Detection: Connected component analysis identifies isolated clusters of users with suspicious referral patterns and minimal external connectivity.

Geographic Clustering

Pattern: Coordinated attacks from specific geographic regions, often targeting high-value inventory during peak demand periods.

Detection: Spatial clustering algorithms identify unusual concentrations of new accounts with similar booking patterns in localized geographic areas.

Performance Results and Business Impact

Detection Accuracy: 70%+ high-risk detection rate with false positive rates maintained below 5% through ensemble algorithm balancing and continuous threshold tuning.

System Performance Characteristics:

Business Impact:

Key Engineering Insights

Graph-based approaches reveal fraud patterns invisible to traditional methods: Individual transaction analysis misses coordinated attacks and shared infrastructure that becomes obvious in graph representation.

Performance optimization must align with domain patterns: Score-based partitioning and fraud-aware caching leverage behavioral insights ("repeat offenders") for system optimization.

Real-time graph updates require careful consistency management: Balancing immediate fraud response with system performance requires sophisticated event-driven architectures and update propagation strategies.

Evolution and Future Considerations

Algorithm Adaptation: Fraud patterns evolve continuously, requiring dynamic algorithm weights and new detection methods. The ensemble approach enables adding new algorithms without disrupting existing detection capabilities.

Privacy and Compliance: Graph-based analysis of user relationships raises privacy considerations that require careful balance between fraud detection effectiveness and user data protection.

Adversarial Resistance: As fraudsters adapt to detection methods, the system requires continuous evolution to stay ahead of new attack patterns and evasion techniques.