Prabina Pokharel / Jingyu Zhang / Yandong (Dennis) Xiang
Detecting Fraud With Oversampling Techniques and Sparsity Constraints
Tell Me More


How did this project come to be?


In recent years, there has been a huge surge in the usage of online platforms like Reddit, Yelp, and Amazon. While these platforms have positively influenced our daily lives, fraudulent activity has surged alongside this rapid growth in digital data volume, aided by the anonymity of online networks. Fraudulent behaviors are executed and hidden in plain sight within this vast amount of online records. Enhancing fraud detection is therefore crucial to protect users from harmful scams and even criminal activities, and by tackling this task we hope to contribute to that protection.

Need For Better Fraud Detection Method

Many techniques exist to combat fraud; however, they often fail to capture the imbalanced class structure of data involving fraudulent activities. Addressing this imbalance is essential for correctly predicting anomalies. So, the question remains: how can we effectively detect and mitigate fraudulent activities, especially when faced with imbalanced datasets? Our research contributes to this problem with a model that harnesses the strengths of existing techniques from different domains.

Our Proposed Solution

Our proposed solution combines two models: GraphSMOTE and SparseGAD. GraphSMOTE, although not commonly used for fraud detection, is an oversampling technique that identifies minority-class nodes and creates new nodes that resemble those minority-class data points. Since fraud-detection datasets are hugely imbalanced, GraphSMOTE helps balance the class distribution in the graph. We then feed the output of GraphSMOTE into SparseGAD, an anomaly-detection method that introduces sparsity constraints. These constraints highlight significant connections, and anything that substantially deviates from them is examined further for fraud. On top of these techniques, we use Graph Neural Networks (GNNs) as our classifiers.


Details on our dataset

Our model utilizes three datasets - Amazon, Yelp, and Reddit. These datasets are obtained from the Deep Graph Library (DGL) and PyGOD in graph format. Each dataset exhibits class imbalance:

  • Amazon Dataset: Features product reviews under the Musical Instruments category. It's a binary classification task with a 1:10.5 positive (fraudulent) to negative (benign) ratio.
  • Yelp Dataset: Consists of hotel and restaurant reviews, with a 1:5.9 imbalance between spam and legitimate reviews.
  • Reddit Dataset: Comprises community posts from September 2014, with 3.3% anomalies detected.
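The imbalance figures above can be computed directly from a label array, as in this rough sketch. The toy node counts below are illustrative only, not the real dataset sizes; in practice the graphs themselves are loaded from DGL or PyGOD.

```python
# Sketch: computing a class-imbalance ratio from a label array.
# Toy counts only -- the real datasets come from DGL / PyGOD.
from collections import Counter

def imbalance_ratio(labels):
    """Return negatives per positive, e.g. 10.5 for a 1:10.5 ratio."""
    counts = Counter(labels)
    return counts[0] / counts[1]

# 2 fraudulent (1) nodes for every 21 benign (0) nodes -> 1:10.5
labels = [1] * 2 + [0] * 21
print(round(imbalance_ratio(labels), 1))  # 10.5
```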


Timeline of development for our model.

Details on the Implementation

Models Details

How GraphSMOTE works

In the synthetic node generation process, the SMOTE method is used to amplify the minority class, typically anomalous users. Synthetic nodes are created by replicating existing nodes and their connections while maintaining heterogeneity in the links. A homophily-based ("homey") adjacency matrix replaces the original adjacency matrix, enabling the calculation of cosine similarity between adjacent node pairs. This diagram summarizes GraphSMOTE: [GraphSMOTE Diagram] In short:

  • Generates synthetic nodes for minority class
  • Upsampling method “replicates” nodes and connections
  • Maintains heterogeneity in links
  • Adds minor differences for distinction
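The interpolation step at the heart of the points above can be sketched as follows. This is a minimal NumPy illustration of SMOTE-style feature synthesis, not the project's actual implementation; edge generation for the new node is omitted.

```python
import numpy as np

def smote_node(x_i, x_nn, rng):
    """SMOTE-style synthesis: interpolate between a minority node and
    its nearest minority neighbor, x_new = x_i + a * (x_nn - x_i),
    with a ~ U(0, 1) providing the 'minor differences for distinction'."""
    a = rng.random()
    return x_i + a * (x_nn - x_i)

rng = np.random.default_rng(0)
x_i = np.array([0.0, 1.0])   # minority node features (toy values)
x_nn = np.array([2.0, 3.0])  # its nearest minority neighbor
x_new = smote_node(x_i, x_nn, rng)
# x_new lies strictly between x_i and x_nn, component-wise
```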

How SparseGAD works

Sparsification techniques further refine the graph by filtering unnecessary connections using a threshold δ and limiting the number of connections through KNN. Graph Anomaly Detection (GAD)-oriented regularization is applied to prevent faulty links, enhancing the model's accuracy in distinguishing anomalous users. This diagram summarizes SparseGAD: [SparseGAD Diagram] In short:

  • Fraudulent users tend to link with heterophilic users
  • SparseGAD adjusts adjacency matrix to separate homophily and heterophily links
  • Filters unnecessary elements in the homophily-based ("homey") adjacency matrix
  • Uses threshold delta and KNN to remove unnecessary links
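A minimal sketch of this sparsification idea, assuming a plain NumPy feature matrix (the real SparseGAD operates on learned representations): build a cosine-similarity matrix, drop entries below the threshold δ, then keep only each node's top-k neighbors.

```python
import numpy as np

def sparsify(X, delta=0.5, k=2):
    """Toy sparsification: cosine-similarity adjacency, zero out
    entries below delta, then keep each row's k strongest links (KNN)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, 0.0)        # ignore self-similarity
    S[S < delta] = 0.0              # threshold delta filters weak links
    for i in range(len(S)):         # KNN: keep only the top-k per row
        weak = np.argsort(S[i])[:-k]
        S[i, weak] = 0.0
    return S

# Two near-duplicate pairs of nodes: (0, 1) and (2, 3)
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
A = sparsify(X)
# A keeps the strong within-pair links and drops the cross-pair ones
```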

How GNNs work

  • GCN aggregates information from neighboring nodes to update node representations, capturing local graph structures effectively.
  • GAT employs attention mechanisms to dynamically weigh the importance of neighboring nodes during information aggregation, enhancing the model's ability to capture complex relationships in the graph.
  • GraphSAGE generates node embeddings by sampling and aggregating features from a fixed number of neighbors, enabling scalable and robust representation learning on large and heterogeneous graphs.
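To make the aggregation idea concrete, here is a minimal NumPy sketch of a single GraphSAGE-style layer using mean aggregation. The weight matrices and toy graph are illustrative placeholders, not trained parameters.

```python
import numpy as np

def sage_layer(X, neighbors, W_self, W_neigh):
    """GraphSAGE-style update: combine each node's own features with the
    mean of its (sampled) neighbors' features, then apply ReLU:
        h_v = ReLU(W_self @ x_v + W_neigh @ mean(x_u for u in N(v)))"""
    H = np.zeros((X.shape[0], W_self.shape[0]))
    for v, nbrs in neighbors.items():
        agg = X[nbrs].mean(axis=0) if nbrs else np.zeros(X.shape[1])
        H[v] = np.maximum(0.0, W_self @ X[v] + W_neigh @ agg)
    return H

# Toy 3-node graph with identity weights for readability
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = {0: [1, 2], 1: [0], 2: [0, 1]}
H = sage_layer(X, neighbors, np.eye(2), np.eye(2))
```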

Three Main Paths

Our approach to detecting fraud involves three main paths, each of which applies a different preprocessing technique before the chosen GNN model (GCN, GAT, or GraphSAGE). The following diagram visualizes the workflow: [Data Workflow] In short:

  • Baseline Path: Here, we directly input the dataset into the chosen GNN model. This path serves as our baseline for comparison against the other paths.
  • GraphSMOTE Path: In this path, we apply the GraphSMOTE technique discussed in the Methods section before feeding it into the chosen GNN model. Again, GraphSMOTE is an oversampling technique, which helps address the imbalance in the dataset by generating synthetic nodes.
  • GraphSMOTE with New Adjacency Matrix Path: In this path, we utilize GraphSMOTE together with the principles of SparseGAD. Here, SparseGAD receives the balanced dataset generated by GraphSMOTE and incorporates a cosine-similarity matrix and sparsity constraints, which help highlight significant connections and anomalies in the graph.
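The control flow of the three paths can be sketched with hypothetical stub functions; the names and return values below are placeholders for illustration, not the project's actual API.

```python
# Hypothetical stubs -- real code would operate on DGL graphs and
# trained GNNs; this only illustrates which stages each path runs.
def graphsmote_stub(g):  return g | {"balanced"}    # oversampling
def sparsegad_stub(g):   return g | {"sparse"}      # sparsity constraints
def gnn_stub(g):         return sorted(g)           # GCN / GAT / GraphSAGE

def run_path(path, g):
    if path == "baseline":
        return gnn_stub(g)
    if path == "graphsmote":
        return gnn_stub(graphsmote_stub(g))
    if path == "graphsmote+sparse":
        return gnn_stub(sparsegad_stub(graphsmote_stub(g)))
    raise ValueError(path)

print(run_path("graphsmote+sparse", {"raw"}))  # ['balanced', 'raw', 'sparse']
```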


What did we find in this project?

Our Findings

Now, let’s compare the models’ performance on each of these datasets.

As we can see in the table, the GraphSAGE model performs best on the Yelp and Reddit datasets, and GraphSAGE with Modified GraphSMOTE performs best on the Amazon dataset.

We first aim to explain why GraphSAGE outperforms GCN and GAT on the Yelp and Reddit datasets while achieving a similar ROC-AUC score on the Amazon dataset. Our intuition is that this discrepancy arises from structural differences between the datasets. Both the Yelp and Reddit datasets exhibit relatively sparse structures, whereas the Amazon dataset is more densely connected. This denser structure allows the attention weights that the GAT model places on each neighbor's features to carry greater significance than in the Yelp and Reddit datasets. GraphSAGE, on the other hand, samples and aggregates from neighboring nodes; on the sparser Yelp and Reddit datasets, the simplicity of this method facilitates better model tuning and improved convergence, enabling GraphSAGE to outperform GAT there.

The main reasons why we believe our GraphSAGE performs well on the Yelp and Reddit datasets, and GraphSAGE with Modified GraphSMOTE on the Amazon dataset, are related to the degrees of connectivity and the number of connected nodes. Below, we present the degree distribution of the datasets.

In the Yelp and Reddit datasets, the median degree is lower (6 and 1, respectively) than in Amazon (42). This discrepancy may reflect variations in the connectivity and structural complexity of the datasets, which directly influence the performance of fraud detection models. Furthermore, the Yelp and Reddit datasets have more nodes with only self-loops (106 and 625, respectively) compared to Amazon (6). This lack of connectivity in the Yelp and Reddit datasets limits the potential for synthetic nodes generated by the SMOTE method to establish meaningful connections within the graph. As a result, the SMOTE method's ability to effectively balance the class distribution while maintaining the dataset's structure may be constrained in the Yelp and Reddit datasets, whereas it flourishes in the Amazon dataset. Thus, the GraphSMOTE and Modified GraphSMOTE models were unable to surpass the performance of the baseline GraphSAGE model on Yelp and Reddit.
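The two statistics discussed here, median degree and the number of nodes whose only edge is a self-loop, can be computed from an edge list as in this small sketch (toy edges, not the real datasets).

```python
import numpy as np

def degree_stats(num_nodes, edges):
    """Median degree and the count of nodes whose only edge is a
    self-loop -- the two statistics compared across datasets above."""
    deg = [0] * num_nodes
    only_self = [True] * num_nodes
    for u, v in edges:
        deg[u] += 1
        if u != v:                 # a real edge disqualifies both ends
            only_self[u] = False
            deg[v] += 1
            only_self[v] = False
    self_loop_only = sum(1 for n in range(num_nodes)
                         if only_self[n] and deg[n] > 0)
    return float(np.median(deg)), self_loop_only

edges = [(0, 1), (1, 2), (3, 3)]   # node 3 has only a self-loop
print(degree_stats(4, edges))      # (1.0, 1)
```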

This disparity in performance highlights the importance of tailoring the model to the characteristics of the dataset.


What did we learn in this project?

Our Project

Fraud detection is more important now than ever due to the massive surge in the usage of online platforms. Our research contributes to this problem with a model that harnesses the benefits of many existing models. We proposed a solution that utilizes a combination of oversampling techniques and sparsity constraints to balance fraud data and predict fraud.

Limitations & Future

In the future, we would like to refine our model to accommodate variability across datasets, such as degrees of connectivity and the number of connected nodes. We would also like to seek partnerships for deployment and cybersecurity collaboration. As mentioned, online fraud can be catastrophic for many and creates a negative impression of the internet. Our goal is to contribute to bridging the gap in cybersecurity fraud detection.

The Models

The fusion of GraphSMOTE and SparseGAD showed promising results, particularly on the Amazon dataset. However, many more models exist, and we would need to find the combination of them that is most powerful at detecting fraud. Leveraging subject-matter expertise and domain knowledge will be an invaluable asset in research in this domain.

Project Material

Our Deliverables


Our published paper with detailed information.


Our poster with a brief overview of our research.

Our Team

Meet our Team!


Prabina Pokharel

UCSD '24, Data Science and Business-Economics


Jingyu Zhang

UCSD '24, Data Science


Yandong Xiang

UCSD '24, Data Science