I ran distinct() on my dataset to remove duplicated respondents, but doing so also removes unique project IDs that I need. It is fine for project IDs to be duplicated, but RESPNO should not be duplicated.
I have a dataset of aid projects that I merged with survey data. I use the following line of code to merge the two datasets
Full_dataset <- merge(AidData, OpinionData, by = "Recipient")
which produces far too many observations, and I notice that the respondent IDs from the survey data are duplicated.
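To show the kind of row multiplication I mean, here is a minimal sketch. The data frames `aid` and `opinion` and their values are made up for illustration; they only stand in for AidData and OpinionData. As far as I can tell, merge() pairs every project with every respondent that shares the same Recipient, so the row count is a sum of products per country.

```r
# Toy stand-ins for AidData and OpinionData (hypothetical values).
aid <- data.frame(
  ID        = 1:3,
  Recipient = c("Mali", "Mali", "Peru")
)
opinion <- data.frame(
  RESPNO    = 101:104,
  Recipient = c("Mali", "Mali", "Peru", "Peru")
)

# merge() pairs every project with every respondent sharing a Recipient.
toy_merged <- merge(aid, opinion, by = "Recipient")

nrow(aid)                           # 3 projects
nrow(opinion)                       # 4 respondents
nrow(toy_merged)                    # 2*2 (Mali) + 1*2 (Peru) = 6 rows
any(duplicated(toy_merged$RESPNO))  # TRUE: a RESPNO appears once per matching project
```

With thousands of respondents and many projects per country, the same effect would explain the very large merged data frame.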
Here is an example dataset that shows what the data frame looks like after the merge. It includes 250 unique project IDs (ID) and 46860 unique respondent IDs (RESPNO). Further down, I use distinct() from the dplyr package to filter down to the unique RESPNO values.
library(dplyr)
set.seed(42)
# Number of rows
n_rows <- 386378
Full_dataset <- data.frame(
  "ID" = rep(1:250, length.out = n_rows, each = ceiling(n_rows/250)),
  "RESPNO" = rep(1:46860, length.out = n_rows),
  "Recipient" = sample(c("Angola", "Benin", "Peru", "UK", "South Africa", "Congo", "Mali", "India", "Greece"), n_rows, replace = TRUE),
  "Mitigation" = runif(n_rows, 0, 100),
  "Adaptation" = runif(n_rows, 0, 100),
  "Fossil_Fuel" = runif(n_rows, 0, 100)
)
I use the following dplyr code to confirm that I have 250 unique IDs
result <- Full_dataset %>% group_by(ID) %>% summarise(count = n()) %>% ungroup() %>% arrange(desc(count))
result
Yet when I use the distinct command on the full dataset, I drop down to just 31 project IDs even though I know I should have 250. I don't understand why this is happening or how to fix it.
Full_dataset <- Full_dataset %>% distinct(RESPNO, .keep_all = TRUE)
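The same drop shows up on a much smaller example. This is only a sketch: the data frame `toy` and its values are hypothetical and are not my real data. It counts the unique project IDs before and after the distinct() call.

```r
library(dplyr)

# Hypothetical toy data: 4 projects (ID) and 2 respondents (RESPNO),
# with every respondent paired with every project, as after the merge.
toy <- data.frame(
  ID     = rep(1:4, each = 2),
  RESPNO = rep(c(101, 102), times = 4)
)

n_distinct(toy$ID)         # 4 project IDs before

# With .keep_all = TRUE, distinct() keeps the first row found for each RESPNO.
toy_unique <- toy %>% distinct(RESPNO, .keep_all = TRUE)

nrow(toy_unique)           # 2 rows, one per respondent
n_distinct(toy_unique$ID)  # 1: both kept rows belong to project 1
```

So in this toy case the respondents are deduplicated, but three of the four project IDs disappear along with the dropped rows, which looks like what happens in my full dataset.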
How can I use distinct() to remove duplicated respondents (RESPNO) while keeping the correct number of unique project IDs?