I need to remove duplicates from my dataframe where the respective values in columns A and C match, and where the respective values in columns B and C match. My issue is that there are null values in column B, and some of the "duplicates" aren't exact matches. Column B contains names: some rows have just a last and first name, while others have last, first, middle name and degrees. Any rows with a matching last and first name count as duplicates.
Starting dataframe:

    import pandas as pd

    d = {'A': [123498, 123498, 234875, 457898, 'SMITHJ', 'DOEJ'],
         'B': ['SIMON, PAUL JD', None, 'DOE, JANE MARY PHD', 'MERCURY, FREDRICK MS', None, 'DOE, JANE'],
         'C': ['red', 'red', 'green', 'red', 'blue', 'green']}
    df = pd.DataFrame(data=d)
    df

            A                     B      C
    0  123498        SIMON, PAUL JD    red
    1  123498                  None    red
    2  234875    DOE, JANE MARY PHD  green
    3  457898  MERCURY, FREDRICK MS    red
    4  SMITHJ                  None   blue
    5    DOEJ             DOE, JANE  green

Final dataframe:

            A                     B      C
    0  123498        SIMON, PAUL JD    red
    3  457898  MERCURY, FREDRICK MS    red
    4  SMITHJ                  None   blue
    5    DOEJ             DOE, JANE  green

I used

    df.drop_duplicates(['A', 'C'])

to remove the duplicates on column A, and a mask to remove the exact duplicates in column B while keeping the null values.
Also, it doesn't matter which of the duplicate rows I keep, so the rows at index 0 and 5 could have been removed instead of the rows at index 1 and 2, and that would be acceptable.
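For reference, here is a rough sketch of the direction I am thinking in, not something I am sure is the right approach. It assumes every non-null name in column B looks like `LAST, FIRST [MIDDLE] [DEGREES]`. The names `step1`, `name_key`, `key_frame` and `is_dup` are made up for illustration, and `keep='last'` is only there so the output matches the final dataframe above.

```python
import pandas as pd

d = {
    'A': [123498, 123498, 234875, 457898, 'SMITHJ', 'DOEJ'],
    'B': ['SIMON, PAUL JD', None, 'DOE, JANE MARY PHD',
          'MERCURY, FREDRICK MS', None, 'DOE, JANE'],
    'C': ['red', 'red', 'green', 'red', 'blue', 'green'],
}
df = pd.DataFrame(data=d)

# Step 1: drop rows whose (A, C) pair has already been seen.
step1 = df.drop_duplicates(['A', 'C'])

# Step 2: pull "LAST" and "FIRST" out of column B.
# Assumption: the last name is everything before the comma and the
# first name is the first token after it. Null names stay null.
parts = step1['B'].str.extract(r'^\s*([^,]+),\s*(\S+)')
name_key = parts[0].str.strip().str.cat(parts[1], sep=', ')

# Step 3: flag rows whose (name_key, C) pair is repeated, but never
# flag a row whose name is null, so those rows are always kept.
key_frame = pd.DataFrame({'name': name_key, 'C': step1['C']})
is_dup = key_frame.duplicated(['name', 'C'], keep='last') & name_key.notna()

result = step1[~is_dup]
print(result)
```

With the sample data this keeps the rows at index 0, 3, 4 and 5. Using `keep='first'` in step 3 would keep the row at index 2 instead of the row at index 5, which would also be fine for my purposes.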
Thank you!