Channel: Recent Questions - Stack Overflow

Remove duplicates without exact match


I need to remove duplicates from my dataframe. A row counts as a duplicate if its values in columns A and C match another row's, or if its values in columns B and C match another row's. My issue is that there are null values in column B and some of the "duplicates" aren't exact matches. Column B contains names, which for some rows are just last and first, but for other rows include last, first, middle and degrees. Any row with a matching last and first name counts as a duplicate.

Starting dataframe:

import pandas as pd

d = {'A': [123498, 123498, 234875, 457898, 'SMITHJ', 'DOEJ'],
     'B': ['SIMON, PAUL JD', None, 'DOE, JANE MARY PHD', 'MERCURY, FREDRICK MS', None, 'DOE, JANE'],
     'C': ['red', 'red', 'green', 'red', 'blue', 'green']}
df = pd.DataFrame(data=d)
df

    A        B                      C
0   123498   SIMON, PAUL JD         red
1   123498   None                   red
2   234875   DOE, JANE MARY PHD     green
3   457898   MERCURY, FREDRICK MS   red
4   SMITHJ   None                   blue
5   DOEJ     DOE, JANE              green

Final dataframe:

    A        B                      C
0   123498   SIMON, PAUL JD         red
3   457898   MERCURY, FREDRICK MS   red
4   SMITHJ   None                   blue
5   DOEJ     DOE, JANE              green

I used df.drop_duplicates(['A', 'C']) to remove the rows that repeat on columns A and C, and a mask to remove the exact duplicates on column B while keeping null values. That only catches exact matches, so the partial name matches are still left in.
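For reference, the attempt described above looks roughly like the sketch below. The mask is a reconstruction, since the exact code isn't shown here, and the names step1, exact_dup and attempt are only illustrative.

```python
import pandas as pd

d = {'A': [123498, 123498, 234875, 457898, 'SMITHJ', 'DOEJ'],
     'B': ['SIMON, PAUL JD', None, 'DOE, JANE MARY PHD', 'MERCURY, FREDRICK MS', None, 'DOE, JANE'],
     'C': ['red', 'red', 'green', 'red', 'blue', 'green']}
df = pd.DataFrame(data=d)

# Step 1: drop rows that repeat an (A, C) pair, keeping the first occurrence.
step1 = df.drop_duplicates(['A', 'C'])

# Step 2 (assumed form of the mask): flag exact (B, C) repeats,
# but never flag rows where B is null.
exact_dup = step1.duplicated(['B', 'C']) & step1['B'].notna()
attempt = step1[~exact_dup]

print(attempt)
```

With this data, row 1 is removed by the first step, but rows 2 and 5 both survive, because 'DOE, JANE MARY PHD' and 'DOE, JANE' are not exactly equal.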

Also, it doesn't matter which of the duplicate rows I keep, so rows at index 0 and 5 could've been removed instead of the rows at index 1 and 2 and that would be acceptable.
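One possible direction, as a minimal sketch rather than a definitive solution: reduce column B to a "LAST, FIRST" key and check for duplicates on that key instead of the raw name. This assumes every non-null name has the form "LAST, FIRST ..." with a single comma after the last name. The regex and the names parts, name_key, dup_ac, dup_name and result are illustrative only.

```python
import pandas as pd

d = {'A': [123498, 123498, 234875, 457898, 'SMITHJ', 'DOEJ'],
     'B': ['SIMON, PAUL JD', None, 'DOE, JANE MARY PHD', 'MERCURY, FREDRICK MS', None, 'DOE, JANE'],
     'C': ['red', 'red', 'green', 'red', 'blue', 'green']}
df = pd.DataFrame(data=d)

# Capture the text before the comma (last name) and the first word after it
# (first name). Rows where B is null yield nulls in both columns.
parts = df['B'].str.extract(r'^\s*([^,]+?)\s*,\s*(\S+)')

# Join into a "LAST, FIRST" key; nulls propagate, so null names stay null.
name_key = parts[0].str.cat(parts[1], sep=', ')

# Rule 1: the same A and C as an earlier row.
dup_ac = df.duplicated(['A', 'C'])

# Rule 2: the same name key and C as an earlier row, ignoring null names.
dup_name = df.assign(name_key=name_key).duplicated(['name_key', 'C']) & name_key.notna()

result = df[~(dup_ac | dup_name)]
print(result)
```

Because duplicated() keeps the first occurrence by default, this keeps rows 0, 2, 3 and 4 rather than 0, 3, 4 and 5, which is fine given that either row of a duplicate pair may be kept.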

Thank you!

