Top features to truly enrich your machine learning model

Updated on 12.01.23

Overview

Data enrichment and machine learning are becoming buzzwords in fraud prevention. As a result, it's easy to get overwhelmed by the nuances and the abundance of expanding possibilities these technologies can offer. 

Beyond providing a conceptual overview of data enrichment and the use of machine learning in fraud detection, we also want to share how you can improve your data models with SEON. Add these recommended features to your existing machine learning model to get the most out of it and make your fraud prevention as effective as possible.

 

It only takes a few

Needless to say, the more data you have, the more educated decisions you can make. Through data enrichment, SEON can turn a handful of data points into hundreds: from just an email or IP address, you can get plenty of extra information on any user, which comes in handy when calculating risk.

We gather this information via an automated solution from various sources and publicly available databases. While the data collected is primarily utilized for fighting fraud, you can theoretically use it for other purposes as well. However, as a SEON customer, you must ensure that your use of this enriched data complies with all data protection regulations in your local jurisdiction.

 

Stay data-hungry

So, how's the enriched data put to good use? That's where machine learning comes into the picture. Machine learning, a subset of artificial intelligence (AI), uses algorithms to identify patterns behind fraudulent transactions and create data models. It then suggests risk rules for you to implement so that you can catch suspicious activities earlier.

It's important to note that machine learning is indeed all about learning: the more information you "feed" it and the more training it gets when you accept or flag its suggested risk rules, the more accurate it gets. Not only does this make fighting fraud easier and faster, but it also lets you benefit from the data models in other areas, such as alternative credit scoring, customer segmentation, or loan default risk calculations.

 

Where to start

While the best features for your machine learning model are specific to you and highly depend on your industry, business, and individual needs, we'd like to give you a head start with a generic list of suggested features from which you can benefit.

 

Fraud score

fraud_score

A pillar of SEON's scoring system and logic, this field might be the easiest to rely on, as it accumulates all the default rules and scores. However, these default settings might not entirely cover your specific needs, so it's worth refining the default scores and adding custom rules.

Email, phone, and IP scores

email_details.score, phone_details.score, ip_details.score

While the all-in-one fraud_score covers the basics, it might be beneficial to dig deeper and break it down into email, phone, and IP scores.

The same rule applies as above: fine-tuning the default settings to match your use case is always a good way to go.
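As a minimal sketch, the score fields above could be pulled out of a parsed API response like this. The field paths come from this article; the helper names (get_path, score_features) and the sample response are hypothetical, and missing fields are simply returned as None.

```python
from typing import Any

# Dotted field paths named in this article.
SCORE_PATHS = [
    "fraud_score",
    "email_details.score",
    "phone_details.score",
    "ip_details.score",
]


def get_path(response: dict, path: str, default: Any = None) -> Any:
    """Walk a nested dict along a dot-separated path.

    Returns `default` as soon as a key is missing or a non-dict is reached.
    """
    node: Any = response
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def score_features(response: dict) -> dict:
    """Collect the score fields into a flat feature dict keyed by path."""
    return {path: get_path(response, path) for path in SCORE_PATHS}


# Hypothetical response fragment, for illustration only.
sample_response = {
    "fraud_score": 12.5,
    "email_details": {"score": 3.0},
    "ip_details": {"score": 8.0},
}
features = score_features(sample_response)
```

Keeping a missing score as None, rather than 0, leaves the choice of imputation to the model pipeline.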

 

Number of social media accounts

all_social_media_profile_count, email_social_media_profile_count, phone_social_media_profile_count

Summing up social media registrations (by email, by phone, or both) can be a strong indicator of whether you are dealing with a real person or a fake identity. To build these features, parse the Fraud API response and count the 'true' values in the email and phone modules.
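The counting step might look like the sketch below. It assumes that each module exposes an account_details mapping of platform name to an object with a boolean registered flag; the registered key and the sample platforms are assumptions for illustration, so adjust them to the actual response you receive. Defining the "all" count as the sum of the email and phone counts is also a choice made here, not something this article specifies.

```python
from typing import Optional


def count_registered(module: Optional[dict]) -> int:
    """Count platforms whose (assumed) `registered` flag is exactly True."""
    accounts = (module or {}).get("account_details") or {}
    return sum(
        1
        for info in accounts.values()
        if isinstance(info, dict) and info.get("registered") is True
    )


def social_media_counts(response: dict) -> dict:
    """Build the three profile-count features from a parsed response."""
    email_count = count_registered(response.get("email_details"))
    phone_count = count_registered(response.get("phone_details"))
    return {
        "email_social_media_profile_count": email_count,
        "phone_social_media_profile_count": phone_count,
        # Assumption: "all" is the plain sum of both modules.
        "all_social_media_profile_count": email_count + phone_count,
    }


# Hypothetical response fragment, for illustration only.
sample_response = {
    "email_details": {
        "account_details": {
            "ok": {"registered": True},
            "airbnb": {"registered": False},
            "other": {"registered": None},  # unknown is not counted
        }
    },
    "phone_details": {
        "account_details": {
            "platform_a": {"registered": True},
            "platform_b": {"registered": True},
        }
    },
}
counts = social_media_counts(sample_response)
```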

 

Email is older than…

email_is_older_than_n_months

Having an estimate of when an email address was created has turned out to be an essential piece of information in many of our models. It indicates whether the email address is real and whether its owner has used it elsewhere.

You can use the following fields to calculate how old the email address might be:

  • email_details.breach_details.first_breach
  • email_details.account_details.ok.date_joined
  • email_details.account_details.airbnb.created_at
  • email_details.history.first_seen

Take the earliest timestamp from the fields mentioned above, subtract it from the timestamp of the transaction, and convert the difference to a whole number of months.
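The calculation above can be sketched as follows. This article does not state the raw format of these timestamp fields, so the sketch assumes they have already been parsed into timezone-aware datetime objects, with None for any field that is missing. The function names and sample dates are hypothetical.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def months_between(start: datetime, end: datetime) -> int:
    """Number of full calendar months from start to end (never negative)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the last month is not complete yet
    return max(months, 0)


def email_age_months(
    candidates: Iterable[Optional[datetime]], transaction_time: datetime
) -> Optional[int]:
    """Age in whole months, measured from the earliest known timestamp."""
    known = [c for c in candidates if c is not None]
    if not known:
        return None  # no evidence about the address's age
    return months_between(min(known), transaction_time)


def email_is_older_than(age_months: Optional[int], n: int) -> bool:
    """True when the address is known to be at least n full months old."""
    return age_months is not None and age_months >= n


# Hypothetical, already-parsed values for the four fields listed above.
candidates = [
    datetime(2019, 3, 15, tzinfo=timezone.utc),   # breach_details.first_breach
    None,                                         # account_details.ok.date_joined
    datetime(2020, 1, 1, tzinfo=timezone.utc),    # account_details.airbnb.created_at
    datetime(2018, 11, 20, tzinfo=timezone.utc),  # history.first_seen
]
transaction_time = datetime(2023, 1, 12, tzinfo=timezone.utc)

age = email_age_months(candidates, transaction_time)  # 49
is_older_than_12 = email_is_older_than(age, 12)       # True
```

Returning None when no timestamp is available keeps "unknown age" distinct from "brand-new address", which are very different signals.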

 

Number of data breaches 

email_details.breach_details.number_of_breaches

While a data breach is never good news, it can prove that an email address exists and has been used elsewhere.

A data breach is an event where privately held information is made public. The most common type of data breach tends to affect user records, which are exchanged or sold on online marketplaces. If we can find an email in such user records, it's safer to assume it's been around for a while. 

 

Blackbox score

blackbox_score

You might also want to consider the probability that SEON's Blackbox Machine Learning model provides.

IP address-related features 

Take a look at the following IP-related features:

  • ip_details.web_proxy 
  • ip_details.public_proxy 
  • ip_details.tor
  • ip_details.vpn
  • ip_details.open_ports_number

You can also check if the IP address belongs to a data center and if it's blacklisted. 
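One possible way to turn the IP fields listed above into numeric model inputs is sketched here. It assumes that the four proxy/Tor/VPN fields are booleans and that open_ports_number is an integer; the function name and sample values are hypothetical.

```python
# Boolean IP flags named in this article.
IP_FLAGS = ["web_proxy", "public_proxy", "tor", "vpn"]


def ip_features(response: dict) -> dict:
    """Encode the IP flags as 0/1 and pass the open-port count through."""
    ip = response.get("ip_details") or {}
    features = {f"ip_details.{flag}": int(bool(ip.get(flag))) for flag in IP_FLAGS}
    # Assumption: a missing port count is treated as 0 open ports.
    features["ip_details.open_ports_number"] = ip.get("open_ports_number") or 0
    return features


# Hypothetical response fragment, for illustration only.
sample_response = {
    "ip_details": {
        "web_proxy": False,
        "public_proxy": False,
        "tor": True,
        "vpn": True,
        "open_ports_number": 3,
    }
}
encoded = ip_features(sample_response)
```

Note that this encoding maps a missing flag to 0, the same as an explicit false. If missing values are common in your data, a separate "unknown" indicator may be worth adding.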

 

Important nice-to-haves

There are additional features that might not be much help on their own but can play a crucial role in "making the final call," and do so in a way a model can learn. As a result, they are often among the most significant features when training the machine learning model.

You should therefore keep an eye on these things, too:

  • Whether a rule has been triggered by an email address similar/not similar to the user's full name. You can use Default rule E123 to check this.
  • The type of IP address, based on the internet service provider (ip_details.type). Whether the IP address belongs to a data center, library, educational institute, organization, government, mobile or fixed line ISP, etc., can make a huge difference.
  • The email domain's creation date and time (UTC timezone) (the year and month value of email_details.domain_details.created).
  • Whether the email's domain is a free provider such as Gmail, Hotmail, etc. (email_details.domain_details.free).
  • Whether the email's domain is disposable or has been proven fraudulent before (email_details.domain_details.disposable).
  • The battery level of the device used (device_details.battery_level). You can only access this data when you use Device Fingerprinting with SEON's iOS or Android SDK.
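The email domain items in the list above could be derived roughly as follows. The date format of email_details.domain_details.created is not given in this article, so the "%Y-%m-%d %H:%M:%S" pattern is an assumption exposed as a parameter. The output feature names and the sample response are hypothetical too.

```python
from datetime import datetime


def domain_features(response: dict, created_format: str = "%Y-%m-%d %H:%M:%S") -> dict:
    """Derive year/month of domain creation plus free/disposable flags."""
    domain = (response.get("email_details") or {}).get("domain_details") or {}
    created_raw = domain.get("created")
    # Assumption: `created` is a string matching `created_format`.
    created = datetime.strptime(created_raw, created_format) if created_raw else None
    return {
        "domain_created_year": created.year if created else None,
        "domain_created_month": created.month if created else None,
        "domain_is_free": int(bool(domain.get("free"))),
        "domain_is_disposable": int(bool(domain.get("disposable"))),
    }


# Hypothetical response fragment, for illustration only.
sample_response = {
    "email_details": {
        "domain_details": {
            "created": "2015-06-01 08:30:00",
            "free": True,
            "disposable": False,
        }
    }
}
domain = domain_features(sample_response)
```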

Only the beginning

It might seem like a lot, but this was merely an introduction to the myriad of possibilities machine learning offers. These generic features can already get you far when fighting fraud. Still, we encourage you to dive deeper, check out further recommendations for feature engineering, and find the best set of features based on your specific needs to get the best possible results.

 

About the author

 

Gellért Nacsa is the Data Science Lead at SEON. He studied applied mathematics at university and worked as a data analyst, algorithm designer, and data scientist. He has been enjoying data and machine learning for over six years.
