Data Deduplication with ML

Problem Statement

Briefly: we work with a company that lets its customers sign up for accounts. The company has many branches, and one customer can open one or more accounts at any branch. As a result, duplicate customer records appear, and that is the problem we address here.

When signing up, a customer provides the following information:

  • first name
  • last name
  • email
  • date of birth
  • phone number

Sample data:

first name | last name | email                | date of birth | phone number
Ana        | Laurel    | ana_laurel@yahoo.com | 02/01/1990    | 3102105770

Approaches:

  • Hard coding: compare each record with every other record (a pandas sketch of this baseline appears after this list), and check:
    • if all fields match exactly → a match
    • otherwise → distinct
  • Dedupe (Python library): active learning from user labelling.
    • string distance: affine gap
    • model: regularized logistic regression
    • results
  • Research a new approach
    • data: labelled record pairs (40 match pairs and 460 distinct pairs)
    • idea:
      • the data has five fields, and two of them, name (first and last) and email, are critical text fields that can be vectorized (words converted to numeric vectors) and used as features to train an ML model.
      • date of birth and phone number are numeric values and can be used as features as well.
    • steps:
      • clean data (a cleaning and vectorization sketch appears after this list):
        • replace NaN values in date of birth with “01-01-1800”
        • remove “-” and convert to an integer (“01-01-1800” → 01011800)
        • replace NaN values in phone number with “99999999999”
      • Vectorize:
        • use sklearn's TfidfVectorizer
        • the final feature matrix has shape 40 × 198 for match pairs and 460 × 198 for distinct pairs (quite imbalanced)
      • Train and test models: because a separate test set is not available yet, models are trained and tested on the same data (a training sketch appears after this list).
        • Logistic Regression: Recall 0.8, Precision 0.9831932773109244
        • Linear SVM: Recall 0.8, Precision 0.9831932773109244
        • XGBoost: Recall 1.0, Precision 1.0
      • Comments:
        • XGBoost appears to be overfitting, which is understandable because the dataset is small and imbalanced. We can verify whether it is overfitting once we have a separate test set.
        • Logistic Regression and SVM look promising. We may be able to improve performance by increasing the data size and the number of features.
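
To make the hard-coded baseline above concrete, here is a minimal sketch using pandas. The column names and the extra rows are hypothetical; only the exact-match rule from the first approach is implemented.

    import pandas as pd

    # Assumed column names for the sign-up fields described above.
    COLUMNS = ["first_name", "last_name", "email", "date_of_birth", "phone_number"]

    def exact_duplicates(df: pd.DataFrame) -> pd.DataFrame:
        """Return every record that matches another record exactly on all fields."""
        # keep=False flags all members of a duplicated group, not just the later copies.
        return df[df.duplicated(subset=COLUMNS, keep=False)]

    # Example: the sample record above, a duplicate of it, and a hypothetical third record.
    customers = pd.DataFrame(
        [
            ["Ana", "Laurel", "ana_laurel@yahoo.com", "02/01/1990", "3102105770"],
            ["Ana", "Laurel", "ana_laurel@yahoo.com", "02/01/1990", "3102105770"],
            ["Bao", "Tran", "bao.tran@example.com", "07/15/1988", "3125550100"],
        ],
        columns=COLUMNS,
    )
    print(exact_duplicates(customers))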
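
The cleaning and vectorization steps could look roughly like the sketch below. The column names are assumed, and the write-up does not spell out how each labelled pair is combined into a single feature row, so this only shows the per-record cleaning rules and the TF-IDF step on the name and email text.

    import pandas as pd
    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Assumed column names; the text fields are expected to be stored as strings.
    TEXT_COLS = ["first_name", "last_name", "email"]

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        """Apply the cleaning rules listed in the steps above."""
        df = df.copy()
        # Missing dates of birth get the sentinel "01-01-1800"; the dashes are
        # stripped and the result converted to an integer ("01-01-1800" -> 1011800).
        df["date_of_birth"] = (
            df["date_of_birth"]
            .fillna("01-01-1800")
            .str.replace("-", "", regex=False)
            .astype(int)
        )
        # Missing phone numbers get the sentinel "99999999999".
        df["phone_number"] = df["phone_number"].fillna("99999999999").astype(int)
        return df

    def vectorize(df: pd.DataFrame):
        """TF-IDF on the name/email text, with the numeric fields appended as extra columns."""
        text = df[TEXT_COLS].agg(" ".join, axis=1)
        vectorizer = TfidfVectorizer()
        tfidf = vectorizer.fit_transform(text)
        numeric = csr_matrix(df[["date_of_birth", "phone_number"]].to_numpy(dtype=float))
        return hstack([tfidf, numeric]).tocsr(), vectorizer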
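
Finally, a minimal sketch of how the three models might be trained and scored. As in the results above, training and evaluation use the same data until a held-out test set exists, so the numbers will be optimistic; the hyperparameters shown are assumptions.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score
    from sklearn.svm import LinearSVC
    from xgboost import XGBClassifier

    def train_and_report(X, y):
        """Fit each candidate model and print recall/precision for the match class.

        y is assumed to be 1 for match pairs and 0 for distinct pairs; X is the
        feature matrix produced by the vectorization step.
        """
        models = {
            "Logistic Regression": LogisticRegression(max_iter=1000),
            "Linear SVM": LinearSVC(),
            "XGBoost": XGBClassifier(),
        }
        for name, model in models.items():
            model.fit(X, y)
            pred = model.predict(X)  # same data used for training, as noted above
            print(
                f"{name}: Recall {recall_score(y, pred):.3f} "
                f"Precision {precision_score(y, pred):.3f}"
            )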
