Top features to truly enrich your machine learning model
Updated on 07.02.23
Data enrichment and machine learning are becoming buzzwords in fraud prevention. As a result, it's easy to get overwhelmed by the nuances and the abundance of expanding possibilities these technologies can offer.
Beyond providing a conceptual overview of data enrichment and the use of machine learning in fraud detection, we also want to share how you can improve your data models with SEON. Add the recommended features below to your existing machine learning model so that your business gets the most out of it and your fraud prevention is as effective as possible.
It only takes a few
Needless to say, the more data you have, the more educated decisions you can make. Through data enrichment, SEON can turn a handful of data points into hundreds: an email or IP address alone is enough to get plenty of extra information on any user, which, of course, comes in handy when calculating risks.
We gather this information via an automated solution from various sources and publicly available databases. While the data collected is primarily utilized for fighting fraud, you can theoretically use it for other purposes as well. However, as a SEON customer, you must ensure that your use of this enriched data complies with all data protection regulations in your local jurisdiction.
So, how is the enriched data put to good use? That's where machine learning comes into the picture. Machine learning, a subset of artificial intelligence (AI), uses algorithms to identify patterns behind fraudulent transactions and create data models. It then suggests risk rules for you to implement so that you can catch suspicious activities earlier.
It's important to note that machine learning is indeed all about learning: the more information you "feed" it and the more training it gets by accepting/flagging its suggested risk rules, the more accurate it gets. Not only does this make fighting fraud easier and faster, but it also enables you to benefit from the data models in other areas, such as alternative credit scoring, customer segmentation, or loan default risk calculations.
Where to start
While the best features for your machine learning model are specific to you and highly depend on your industry, business, and individual needs, we'd like to give you a head start with a generic list of suggested features from which you can benefit.
A pillar of SEON's scoring system and logic, the fraud_score field might be the easiest to rely on, as it accumulates all the default rules and scores. However, these default settings might not entirely cover your specific needs, so it's worth refining the default scores and adding custom rules.
Email, phone, and IP scores
While the all-in-one fraud_score covers the basics, it might be beneficial to dig deeper and break it down into email, phone, and IP scores.
The same rule applies as above: fine-tuning the default settings to match your use case is always a good way to go.
Number of social media accounts
Summing up social media registrations (email/phone/both) can accurately indicate whether we are facing a real person or a fake identity. You'll have to parse your fraud API response and count the number of 'true' values in the email and phone modules.
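The counting step described above can be sketched in a few lines of Python. This is only an illustration: the key names used here (`email_details`, `phone_details`, `account_details`, `registered`) are assumptions about how the response is laid out, so adjust them to match the fraud API response you actually receive.

```python
from typing import Any, Dict, Mapping


def count_registered_accounts(module: Mapping[str, Any]) -> int:
    """Count the platforms whose 'registered' flag is exactly True.

    Assumes (hypothetically) that the module holds an 'account_details'
    mapping of platform name -> {'registered': True/False/None}.
    """
    accounts = module.get("account_details") or {}
    return sum(
        1
        for details in accounts.values()
        if isinstance(details, Mapping) and details.get("registered") is True
    )


def social_account_features(response: Mapping[str, Any]) -> Dict[str, int]:
    """Build email, phone, and combined social media counts as features."""
    email_count = count_registered_accounts(response.get("email_details") or {})
    phone_count = count_registered_accounts(response.get("phone_details") or {})
    return {
        "email_social_count": email_count,
        "phone_social_count": phone_count,
        "total_social_count": email_count + phone_count,
    }


# Made-up sample response, for illustration only.
sample_response = {
    "email_details": {
        "account_details": {
            "facebook": {"registered": True},
            "twitter": {"registered": False},
            "linkedin": {"registered": None},  # unknown is not counted
        }
    },
    "phone_details": {
        "account_details": {
            "whatsapp": {"registered": True},
            "telegram": {"registered": True},
        }
    },
}

features = social_account_features(sample_response)
```

Keeping the email and phone counts as separate features, next to the total, lets the model weigh them differently. Unknown (`None`) values are deliberately not counted as registrations.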
Email is older than…
An estimate of when an email address was created has turned out to be an essential piece of information in many of our models. It indicates whether the email address is real and whether its owner has used it elsewhere.
You can use the following fields to calculate how old the email address might be:
Take the earliest timestamp from the fields mentioned above and subtract it from the timestamp of the transaction, then convert it to an integer of months.
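The calculation can be sketched as follows. This is a minimal example that assumes the candidate fields have already been read from the response as Unix timestamps in seconds (UTC), with `None` for any field that is missing. The function name and the choice to count whole calendar months are illustrative assumptions, not a prescribed method.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def email_age_in_months(
    candidate_timestamps: Iterable[Optional[int]],
    transaction_timestamp: int,
) -> Optional[int]:
    """Return the email's estimated age in whole months at transaction time.

    Takes the earliest known timestamp, subtracts it from the transaction
    timestamp, and expresses the difference as an integer number of months.
    Returns None when no candidate timestamp is available.
    """
    known = [ts for ts in candidate_timestamps if ts is not None]
    if not known:
        return None

    earliest = datetime.fromtimestamp(min(known), tz=timezone.utc)
    transaction = datetime.fromtimestamp(transaction_timestamp, tz=timezone.utc)

    months = (transaction.year - earliest.year) * 12
    months += transaction.month - earliest.month
    # Do not count the current month until its matching day is reached.
    if transaction.day < earliest.day:
        months -= 1
    # Guard against timestamps that are later than the transaction.
    return max(months, 0)


# Illustrative values only.
first_seen = int(datetime(2020, 1, 15, tzinfo=timezone.utc).timestamp())
later_seen = int(datetime(2021, 6, 1, tzinfo=timezone.utc).timestamp())
transaction_time = int(datetime(2023, 2, 7, tzinfo=timezone.utc).timestamp())

age_months = email_age_in_months([later_seen, None, first_seen], transaction_time)
```

Returning `None` when no timestamp is known keeps "no evidence" distinct from "brand-new address". How you encode that missing value for your model is a separate design choice.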
Number of data breaches
While a data breach is never good news, it can show that an email address exists and has been used elsewhere.
A data breach is an event where privately held information is made public. The most common type of data breach tends to affect user records, which are exchanged or sold on online marketplaces. If we can find an email in such user records, it's safer to assume it's been around for a while.
You might also want to consider the probability that SEON's Blackbox Machine Learning model provides.
IP address-related features
Take a look at the following IP-related features:
You can also check if the IP address belongs to a data center and if it's blacklisted.
There are additional features that might not be much help on their own but can play a crucial role in "making the final call," and in a way the model can learn. As a result, they are often among the most significant features when training the machine learning model.
You should therefore keep an eye on these things, too:
- Whether a rule has been triggered by an email address similar/not similar to the user's full name. You can use Default rule E123 to check this:
- The type of IP address, based on the internet service provider (ip_details.type). The IP address belonging to a data center, library, educational institute, organization, government, mobile or fixed line ISP, etc., can make a huge difference.
- The email domain's creation date and time (UTC timezone) (the year and month value of
- Whether the email's domain is a free provider such as Gmail, Hotmail, etc. (
- Whether the email's domain is disposable or has been proven fraudulent before (
- The battery level of the used device (device_details.battery_level). You can only access this data when you use Device Fingerprinting with SEON's iOS or Android SDK.
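As a rough sketch of how two of the items above could be turned into model inputs, the Python below one-hot encodes the IP type and handles a battery level that may be absent. The category labels in `IP_TYPES`, the `-1.0` placeholder, and the output feature names are assumptions made for illustration. Only `ip_details.type` and `device_details.battery_level` come from the list above.

```python
from typing import Any, Dict, Mapping, Optional

# Hypothetical category labels; replace them with the values you observe.
IP_TYPES = (
    "data_center",
    "mobile_isp",
    "fixed_line_isp",
    "organization",
    "government",
    "education",
    "library",
)


def encode_ip_type(ip_type: Optional[str]) -> Dict[str, int]:
    """One-hot encode the IP type, with an 'other' bucket for unknown values."""
    normalized = (ip_type or "").strip().lower().replace(" ", "_")
    features = {f"ip_type_{name}": int(normalized == name) for name in IP_TYPES}
    features["ip_type_other"] = int(normalized not in IP_TYPES)
    return features


def encode_battery_level(response: Mapping[str, Any]) -> Dict[str, float]:
    """Return the battery level plus a flag marking it as missing.

    The level is only available with device fingerprinting via the mobile
    SDKs, so a missing value is expected for many transactions.
    """
    device = response.get("device_details") or {}
    level = device.get("battery_level")
    return {
        "battery_level": float(level) if level is not None else -1.0,
        "battery_level_missing": float(level is None),
    }


def build_feature_row(response: Mapping[str, Any]) -> Dict[str, float]:
    """Combine the encoded IP type and battery features into one flat row."""
    ip_details = response.get("ip_details") or {}
    row: Dict[str, float] = {}
    row.update(encode_ip_type(ip_details.get("type")))
    row.update(encode_battery_level(response))
    return row


# Made-up response fragment, for illustration only.
example_row = build_feature_row(
    {"ip_details": {"type": "Data Center"}, "device_details": {"battery_level": 42}}
)
```

The explicit "missing" flag lets the model tell a web session without SDK data apart from a device that reports a low battery.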
Only the beginning
It might seem like a lot, but this was merely an introduction to the myriad of possibilities machine learning offers. These generic features can already get you far when fighting fraud. Still, we encourage you to dive deeper, check out further recommendations for feature engineering, and find the best set of features based on your specific needs to get the best possible results.