Build a machine learning model using SEON data

Updated on 22.06.23

Overview

This whitepaper will showcase how you can build a machine learning model using data from SEON's data enrichment process. Due to the nature of data science, we can only offer suggestions. Consider your specific use case and any additional data available to your company. You may have to go through a few steps not detailed below.

Clear goals, focus, and domain

Model building can call for different approaches depending on whether you want to focus on, for example, account takeovers (ATO), identity theft, or bonus abuse.

Learn more: See more resources on specific use cases here.

For example, in the case of ATO detection, you'll have to use velocity features (such as the count of different IP addresses per user in the last 24 hours).

For fake user detection, building on social data is more important (such as the number of registered accounts or the earliest date that the person fell victim to a data breach).

Create different models for different customer groups/clusters (e.g., EMEA/APAC or returning/new customers) and/or different use cases (registration/login/deposit).

Data collection

To collect data, call SEON APIs and save the responses into a database. Please refer to the API reference to understand the full scope of our API fields.
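
As a minimal sketch of the storage side only, the snippet below keeps every raw response in a local SQLite table so that failed calls can be filtered out later. The table layout, the `request_id` key, and the `example_field` payload are illustrative assumptions; the actual API call and its fields are described in the API reference and are not reproduced here.

```python
import json
import sqlite3


def init_store(conn):
    """Create a table that keeps every raw API response, successful or not."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS api_responses (
            request_id  TEXT PRIMARY KEY,
            http_status INTEGER NOT NULL,
            received_at TEXT NOT NULL,
            body        TEXT
        )
        """
    )


def save_response(conn, request_id, http_status, received_at, body):
    """Store one response; `body` is the parsed JSON payload (a dict) or None."""
    payload = json.dumps(body) if body is not None else None
    conn.execute(
        "INSERT OR REPLACE INTO api_responses VALUES (?, ?, ?, ?)",
        (request_id, http_status, received_at, payload),
    )
    conn.commit()


def load_responses(conn):
    """Return all stored responses, oldest first, with the JSON body decoded."""
    rows = conn.execute(
        "SELECT request_id, http_status, received_at, body "
        "FROM api_responses ORDER BY received_at"
    ).fetchall()
    return [
        {
            "request_id": r[0],
            "http_status": r[1],
            "received_at": r[2],
            "body": json.loads(r[3]) if r[3] is not None else None,
        }
        for r in rows
    ]


# Hypothetical responses: one successful call and one timeout.
conn = sqlite3.connect(":memory:")
init_store(conn)
save_response(conn, "req-1", 200, "2023-06-01T10:00:00", {"example_field": 1})
save_response(conn, "req-2", 504, "2023-06-01T10:00:05", None)
stored = load_responses(conn)
```

Keeping the HTTP status next to the payload makes the later cleaning step a simple filter rather than a guess about which rows are incomplete.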

Data preparation

Clean your data
- Drop responses with non-successful HTTP response codes or empty rows
  These may be caused by timeouts, bad or invalid requests, or other errors. Refer to our error codes for more details.
- Drop columns with a high NaN/unknown ratio
  These occur when a field is not filled (e.g., billing_country) or the field is not relevant to the customers (e.g., social media site not used in your region or country).

Note: You can adjust your timeouts on the Settings page.
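
The two cleaning rules above could be sketched with pandas as follows. The column names, the sample rows, and the 50% missing-value cut-off are assumptions chosen for illustration; pick a threshold that suits your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical flattened responses; column names are illustrative only.
raw = pd.DataFrame(
    {
        "http_status": [200, 200, 504, 200, 200],
        "ip_country": ["HU", "DE", None, "HU", None],
        "billing_country": [None, None, None, "HU", None],
        "email_profile_count": [3, 0, np.nan, 5, np.nan],
    }
)

MAX_MISSING_RATIO = 0.5  # assumed cut-off, not a recommended value


def clean(df, max_missing_ratio=MAX_MISSING_RATIO):
    # Keep only successful responses, then drop the status column itself.
    ok = df[df["http_status"] == 200].drop(columns=["http_status"])
    # Drop rows in which every remaining field is empty.
    ok = ok.dropna(how="all")
    # Drop columns whose share of missing values exceeds the cut-off.
    missing_ratio = ok.isna().mean()
    keep = missing_ratio[missing_ratio <= max_missing_ratio].index
    return ok[keep].reset_index(drop=True)


cleaned = clean(raw)
```

In this toy data, `billing_country` is removed because it is missing in two of the three remaining rows.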

Add target labels
- In machine learning, it's vital that you mark fraudulent or unwanted events and actors. These will be your target labels. It's best to send us these feedback labels using SEON's Label API. Doing so is free of charge and enables SEON to improve its accuracy based on your feedback.
- Use the most recent and well-labeled data you have to achieve the best results and catch the latest fraud patterns.
- Find the right amount of data and the correct ratio of positive labels (fraudsters) to build a model.
  In anomaly detection scenarios, datasets can be highly imbalanced, with fraudsters making up only a small share of all events. We therefore recommend labeling the positive class (fraudsters) explicitly rather than relying on unlabeled data. You can also oversample the positive labels or undersample the negative ones.
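
Random oversampling and undersampling can be sketched in plain Python as below. The `is_fraud` key, the 20% target ratio, and the toy rows are assumptions; dedicated resampling libraries offer more refined strategies.

```python
import random


def split_by_label(rows, label_key):
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    if not positives:
        raise ValueError("at least one positive label is required")
    return positives, negatives


def undersample_negatives(rows, label_key, target_positive_ratio, seed=42):
    """Randomly drop negatives until positives reach roughly the target share."""
    positives, negatives = split_by_label(rows, label_key)
    wanted = round(len(positives) * (1 - target_positive_ratio) / target_positive_ratio)
    wanted = min(wanted, len(negatives))
    rng = random.Random(seed)
    sampled = positives + rng.sample(negatives, wanted)
    rng.shuffle(sampled)
    return sampled


def oversample_positives(rows, label_key, target_positive_ratio, seed=42):
    """Duplicate random positives until they reach roughly the target share."""
    positives, negatives = split_by_label(rows, label_key)
    wanted = round(len(negatives) * target_positive_ratio / (1 - target_positive_ratio))
    extra = max(0, wanted - len(positives))
    rng = random.Random(seed)
    duplicated = [rng.choice(positives) for _ in range(extra)]  # with replacement
    sampled = positives + duplicated + negatives
    rng.shuffle(sampled)
    return sampled


# Toy data: 5 fraudsters among 100 events (5% positives).
rows = [{"id": i, "is_fraud": 1 if i < 5 else 0} for i in range(100)]
under = undersample_negatives(rows, "is_fraud", target_positive_ratio=0.2)
over = oversample_positives(rows, "is_fraud", target_positive_ratio=0.2)
```

Undersampling discards information from the majority class, while oversampling repeats the same few fraud cases, so it is worth comparing both on a validation set.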

Merge SEON data with other data
- Your company likely has additional data, which is not sent to SEON but is useful for your model. This is the right time to merge the two data sets.
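
A minimal sketch of such a merge with pandas is shown below. The `transaction_id` join key and all column names are hypothetical, and treating unlabeled events as non-fraud is an assumption that may not hold for your data.

```python
import pandas as pd

# Hypothetical frames; column names and the join key are illustrative.
seon = pd.DataFrame(
    {"transaction_id": ["t1", "t2", "t3"], "email_profile_count": [4, 0, 7]}
)
internal = pd.DataFrame(
    {"transaction_id": ["t1", "t2", "t4"], "lifetime_deposits": [120.0, 0.0, 55.0]}
)
labels = pd.DataFrame({"transaction_id": ["t2"], "is_fraud": [1]})

# Left joins keep every enriched event; `validate` raises if a key is duplicated.
merged = seon.merge(internal, on="transaction_id", how="left", validate="one_to_one")
merged = merged.merge(labels, on="transaction_id", how="left", validate="one_to_one")

# Assumption: events without a fraud label are treated as legitimate.
merged["is_fraud"] = merged["is_fraud"].fillna(0).astype(int)
```

Event `t3` has no internal record, so its `lifetime_deposits` stays missing and is handled later during data conversion.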

Feature engineering

You can create more valuable features by aggregating or comparing your data. For example, a user not being registered on a specific social media platform, such as Facebook, is not a strong marker of them being a fraudster. However, it's highly suspicious if they have 0 social and online profiles out of all the 50+ platforms checked by SEON; this indicates a freshly minted email address. Here are a few more parameters you may want to combine, to get a better understanding of users:

Exact data matches and string similarities or geographical distances between:
- names (user full name, card full name, email address prefix, and social media profile names)
- countries (IP, phone, roaming carrier, carrier, shipping, billing, user countries, and countries coming from social media enrichment)
- cities (IP, shipping, billing, user cities, and cities coming from social media enrichment)
- addresses (user, billing, shipping street)
- carriers (phone carrier, carrier name, original carrier, ported carrier, roaming carrier)
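
The comparisons listed above can be sketched with the Python standard library. Here `difflib.SequenceMatcher` stands in for whichever string similarity measure you prefer, the haversine formula gives the great-circle distance, and the sample names and coordinates are made up for illustration.

```python
import math
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not count."""
    return " ".join(text.lower().split())


def name_similarity(a, b):
    """Similarity between two names in the range 0..1 (1 means identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Hypothetical event: field names and values are illustrative only.
event = {
    "user_fullname": "John Smith",
    "card_fullname": "JOHN  SMITH",
    "ip_country": "HU",
    "billing_country": "AT",
    "ip_lat": 47.4979, "ip_lon": 19.0402,          # Budapest
    "shipping_lat": 48.2082, "shipping_lon": 16.3738,  # Vienna
}

features = {
    "name_similarity": name_similarity(event["user_fullname"], event["card_fullname"]),
    "country_match": event["ip_country"] == event["billing_country"],
    "ip_to_shipping_km": haversine_km(
        event["ip_lat"], event["ip_lon"], event["shipping_lat"], event["shipping_lon"]
    ),
}
```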

Velocity checks:
- Maximum count of distinct values for the same id over a period of time (e.g., count of distinct browser hash values for the same user ID in the last 1 day or count of distinct user ID values for the same browser hash in the last 1 hour)
- This can be as simple as the count of devices the user used in the last month or as complicated as how far the current transaction amount is from the min-max interval of the user's previous transaction amounts.
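
Both kinds of velocity features described above can be sketched in plain Python. The event layout (`user_id`, `browser_hash`, `timestamp`) is an assumption, and a production system would more likely compute these in the database or a streaming job.

```python
from datetime import datetime, timedelta


def distinct_count_in_window(events, key_field, value_field, key, now, window):
    """Count distinct `value_field` values seen for `key` in (now - window, now]."""
    start = now - window
    return len(
        {
            e[value_field]
            for e in events
            if e[key_field] == key and start < e["timestamp"] <= now
        }
    )


def amount_outside_history(amount, previous_amounts):
    """Distance of `amount` from the min-max range of earlier amounts (0 if inside)."""
    if not previous_amounts:
        return 0.0
    low, high = min(previous_amounts), max(previous_amounts)
    if amount < low:
        return float(low - amount)
    if amount > high:
        return float(amount - high)
    return 0.0


# Hypothetical event log.
events = [
    {"user_id": "u1", "browser_hash": "c", "timestamp": datetime(2023, 5, 30, 8, 0)},
    {"user_id": "u1", "browser_hash": "a", "timestamp": datetime(2023, 6, 1, 9, 0)},
    {"user_id": "u1", "browser_hash": "b", "timestamp": datetime(2023, 6, 1, 20, 0)},
    {"user_id": "u1", "browser_hash": "b", "timestamp": datetime(2023, 6, 2, 8, 0)},
    {"user_id": "u2", "browser_hash": "b", "timestamp": datetime(2023, 6, 2, 8, 30)},
]
now = datetime(2023, 6, 2, 8, 30)

browsers_per_user_1d = distinct_count_in_window(
    events, "user_id", "browser_hash", "u1", now, timedelta(days=1)
)
users_per_browser_1h = distinct_count_in_window(
    events, "browser_hash", "user_id", "b", now, timedelta(hours=1)
)
amount_gap = amount_outside_history(500, [20, 50, 120])
```

When building training data, compute each feature only from events that happened before the event being scored, otherwise information from the future leaks into the model.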

Others:
- Count of social media registrations with the email address
- Count of social media registrations with the phone number
- Email minimum age: The difference between the date of the transaction and the earliest email-related timestamp of the data (first data breach, Airbnb created date, etc.) in months.
- Or any other feature that makes sense based on domain knowledge.
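
The email minimum age and the registration counts could be computed along these lines. The input shapes (a list of optional dates and a platform-to-flag mapping) are assumptions about how you have flattened the enrichment data, and counting whole months is one of several reasonable conventions.

```python
from datetime import date


def months_between(earlier, later):
    """Whole months elapsed from `earlier` to `later` (never negative)."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the last month is not complete yet
    return max(months, 0)


def email_min_age_months(transaction_date, email_dates):
    """Months since the earliest known email-related date, or None if none is known."""
    known = [d for d in email_dates if d is not None]
    if not known:
        return None
    return months_between(min(known), transaction_date)


def count_registered(flags):
    """Count platforms where a registration was found; unknown (None) is not counted."""
    return sum(1 for value in flags.values() if value is True)


# Hypothetical inputs: a first-breach date, a missing date, and a profile creation date.
age = email_min_age_months(
    date(2023, 6, 22), [date(2019, 3, 10), None, date(2021, 1, 5)]
)
profiles = count_registered(
    {"platform_a": True, "platform_b": False, "platform_c": None, "platform_d": True}
)
```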

See the data cleaning section above for recommendations on cleaning the data and choosing the right features to build a model.

Feature selection

Choose the features that are useful for the model and drop unnecessary columns such as IDs, phone numbers, IP addresses, and apartment numbers, as well as features with low variance.
String fields and IDs with many unique values can cause overfitting. Also drop columns that only ever have one value (e.g., if VPN is always False).
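
These rules could be sketched with pandas as follows. The column names and the 0.9 uniqueness cut-off are assumptions; a constant column is treated here as the extreme case of low variance.

```python
import pandas as pd


def select_features(df, id_columns, max_unique_ratio=0.9):
    # Drop explicit identifier columns.
    out = df.drop(columns=[c for c in id_columns if c in df.columns])
    # Drop columns that hold a single value (zero variance).
    constant = [c for c in out.columns if out[c].nunique(dropna=False) <= 1]
    out = out.drop(columns=constant)
    # Drop non-numeric columns in which almost every row has its own value.
    text_cols = [c for c in out.columns if not pd.api.types.is_numeric_dtype(out[c])]
    too_unique = [c for c in text_cols if out[c].nunique() / len(out) > max_unique_ratio]
    return out.drop(columns=too_unique)


# Hypothetical feature table.
df = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u3", "u4"],
        "ip_address": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"],
        "is_vpn": [False, False, False, False],
        "ip_country": ["HU", "HU", "DE", "HU"],
        "email_profile_count": [4, 0, 7, 2],
        "free_text_note": ["a", "b", "c", "d"],
    }
)
selected = select_features(df, id_columns=["user_id", "ip_address"])
```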

Choose the model

Model selection

We have seen the best results using a Gradient Boosting Machine, or at least a Random Forest. While the performance of deep learning algorithms is constantly improving, gradient boosting algorithms are still widely regarded as one of the most suitable choices for tabular data.

Data conversion

Transform the data to the shape and data types the chosen algorithm requires. For example, CatBoost can consume string categorical features directly, whereas algorithms such as LightGBM or XGBoost expect numerical input, so string and categorical features need label encoding or one-hot encoding first. You'll also have to fill in missing values unless the chosen algorithm handles them natively.
Ensure and test that the data preprocessing and feature calculations work in the same way if the training and prediction are implemented in different programming languages.
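
One way to keep training and prediction consistent is to learn the encoding once from the training data and reuse it everywhere. The sketch below does this with pandas for one-hot encoding and median imputation; the column names are hypothetical, and median filling is only one possible choice.

```python
import pandas as pd


def fit_encoder(train, categorical, numeric):
    """Learn category levels and fill values from the training data only."""
    return {
        "levels": {c: sorted(train[c].dropna().unique()) for c in categorical},
        "medians": {c: float(train[c].median()) for c in numeric},
    }


def transform(df, encoder):
    """Apply the learned encoding; output columns are identical for any input."""
    parts = []
    for col, median in encoder["medians"].items():
        parts.append(df[col].fillna(median).astype(float))
    for col, levels in encoder["levels"].items():
        # A fixed category list yields the same dummy columns every time;
        # missing and unseen values become all-zero rows.
        cat = pd.Series(pd.Categorical(df[col], categories=levels), index=df.index)
        parts.append(pd.get_dummies(cat, prefix=col, dtype=float))
    return pd.concat(parts, axis=1)


# Hypothetical training data and one new event with an unseen country.
train = pd.DataFrame(
    {
        "ip_country": ["HU", "DE", None, "HU"],
        "email_profile_count": [4.0, None, 7.0, 1.0],
    }
)
encoder = fit_encoder(train, categorical=["ip_country"], numeric=["email_profile_count"])
train_matrix = transform(train, encoder)

new_event = pd.DataFrame({"ip_country": ["FR"], "email_profile_count": [float("nan")]})
new_matrix = transform(new_event, encoder)
```

Because `encoder` is a plain dictionary of lists and numbers, it can be exported (for example as JSON) and reimplemented in another language for prediction.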

Training

At this point, the prepared data with all the necessary features, rock-solid target labels, and the chosen model are available.

Dataset splitting

Split the dataset into training, validation, and test sets, and find the right ratio of positive labels for each set. It can be the same as the original ratio, or you can oversample positives in the training set.
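
A stratified 60/20/20 split that preserves the original ratio of positive labels could look like this with scikit-learn. The synthetic data and the split proportions are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1,000 events, 5% of them labeled as fraud.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:50] = 1

# First split off 40%, then halve it into validation and test sets.
# `stratify` keeps the share of positives roughly equal in every part.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

positive_share = {
    "train": y_train.mean(),
    "validation": y_val.mean(),
    "test": y_test.mean(),
}
```

If you oversample positives, apply it to the training set only, after the split, so that the validation and test sets still reflect real-world ratios. For time-dependent fraud patterns, a split by date may be more realistic than a random one.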

Hyperparameters

You must understand the parameters of the model and find the correct settings for them. Fixing random_seed can be useful to ensure reproducibility. Early stopping and regularization based on model size are also helpful.

In the case of the suggested tree-based models, the maximum tree depth and the right learning_rate or other regularization settings (number of leaves, L2 regularization, minimum child weight) are vital to avoid overfitting.

We suggest controlling speed with the number of iterations/estimators and, depending on the chosen model, the parameter for random subspace method, fractions, or subsampling.
Since the training can be pretty quick, it's easier to do hyperparameter tuning.
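
As a hedged illustration, the sketch below sets the kinds of parameters discussed above on scikit-learn's `GradientBoostingClassifier`, chosen here only because it is widely available. Parameter names differ between libraries (L2 regularization and minimum child weight, for instance, are exposed by other gradient boosting implementations), and the values and synthetic data are assumptions, not recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: the label depends on two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
score = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000)
y = (score > 1.5).astype(int)

model = GradientBoostingClassifier(
    n_estimators=300,         # upper bound on iterations
    learning_rate=0.05,       # smaller steps usually need more iterations
    max_depth=3,              # limits the size of each tree
    min_samples_leaf=20,      # avoids leaves fitted to a handful of events
    subsample=0.8,            # row subsampling per iteration
    max_features=0.8,         # random subspace: share of features per split
    validation_fraction=0.2,  # held out internally for early stopping
    n_iter_no_change=20,      # stop when the validation score stalls
    random_state=42,          # fixed seed for reproducibility
)
model.fit(X, y)

fraud_probability = model.predict_proba(X)[:, 1]
trees_used = model.n_estimators_  # iterations actually kept after early stopping
```

Comparing `trees_used` with the upper bound is a quick check on whether early stopping was triggered or the iteration limit should be raised.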

Learn more: Dive into hyperparameters with a detailed overview from Towards Data Science.

Evaluation

Choosing the correct metric for your use case is an essential step in improving your model. The most widely used options are:

- ROC curve and AUC: The Receiver Operating Characteristic curve and the Area Under the Curve are commonly used to assess how well the model separates the two classes, independent of any single threshold.
- Precision, recall, accuracy, F1 score: The model predicts probabilities, so a threshold is necessary to convert a probability into a class value.
  The threshold usually defaults to 0.5, which means that below 0.5 the outcome is negative, and from 0.5 to 1 the outcome is positive. With this conversion, the metrics can be calculated from the counts of true positives, false positives, false negatives, and true negatives.
  Because fraud data is highly unbalanced (the number of positive labels is fairly small), precision, recall, and F1 scores may be lower than those seen in other areas of machine learning. A well-tuned model can still deliver useful results on unbalanced data.
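
To make the threshold conversion concrete, here is a small dependency-free sketch that derives the confusion counts, precision, recall, F1, and a rank-based AUC from predicted probabilities. The labels and probabilities are invented for illustration; in practice a metrics library would normally be used.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, fn, tn) after converting probabilities to classes."""
    tp = fp = fn = tn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def roc_auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 2 fraudsters among 10 events.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.48, 0.60, 0.90]

default_counts = confusion_counts(y_true, y_prob)            # threshold 0.5
lowered_counts = confusion_counts(y_true, y_prob, 0.45)      # catches both fraudsters
default_metrics = precision_recall_f1(*default_counts[:3])
auc = roc_auc(y_true, y_prob)
```

Lowering the threshold from 0.5 to 0.45 raises recall at the cost of one more false positive, which is the trade-off to tune for your own cost of missed fraud versus blocked customers.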

Understanding and explaining the model

Calculating and weighing feature importance is usually good for feedback and as a source of insights.
SHAP is a highly recommended library for interpreting tree-based models. An excellent example of how PayPal uses it in its fraud prevention model is available online.
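
As a simpler alternative to SHAP that needs only scikit-learn, permutation importance measures how much a held-out score drops when one feature is shuffled. Note that this is a different technique from SHAP: it gives a global ranking rather than per-prediction explanations. The feature names and synthetic data below are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: two informative features and one pure-noise feature.
feature_names = ["distinct_ips_24h", "email_profile_count", "noise"]  # hypothetical
rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=1500) > 1.0).astype(int)

# Train on the first 1,000 rows and measure importance on the held-out rest.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X[:1000], y[:1000])

result = permutation_importance(
    model, X[1000:], y[1000:], scoring="roc_auc", n_repeats=5, random_state=0
)

# Features sorted from most to least important.
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda item: item[1],
    reverse=True,
)
```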

To better understand how valuable SEON data is and how much it can help your company, evaluate batch tests not on the raw scores alone but with a model built from the batch test results, together with the explanation of that model.

Deployment and prediction

Once you've completed training and the model is ready to use, you'll need a production environment in which everything works the same way as during training, so that all data cleaning, conversion, and feature calculations are identical.
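
One common way to keep preprocessing and model together, sketched here under the assumption that both training and prediction run in Python with scikit-learn, is to wrap them in a single pipeline and serialize that object. The synthetic data, the median imputation step, and the use of `pickle` are illustrative choices.

```python
import pickle

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic training data with some missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0.8).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # simulate missing enrichment fields

# Preprocessing and model travel together as one object.
pipeline = Pipeline(
    [
        ("impute", SimpleImputer(strategy="median")),
        ("model", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ]
)
pipeline.fit(X, y)

# Serialize after training; deserialize in the serving environment.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# A new event goes through exactly the same imputation before scoring.
new_event = np.array([[1.2, np.nan, 0.3, -0.5]])
same_prediction = np.allclose(
    pipeline.predict_proba(new_event), restored.predict_proba(new_event)
)
```

Only load serialized models from sources you trust, and pin the library versions used for training, since pickled objects are not guaranteed to load across versions.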

Final Comments

We hope you found this guide helpful.

Our clients have had great success incorporating our data into their machine learning models, and we aim to help you achieve the same. If you need further assistance, contact our Customer Success Team.

About the author

Gellért Nacsa is the Data Science Lead at SEON. He studied applied mathematics at university and worked as a data analyst, algorithm designer, and data scientist. He has been enjoying data and machine learning for over six years.