I am accessing a Postgres database using Python and SQLAlchemy. I can't figure out how to delete rows in a timely manner: the query is fast, but the delete takes ~250,000x longer.
I have a table, `RP`, that has 92M rows, and I am trying to delete some of them.

I have code that finds the rows I want to delete; it works and runs fast. It is basically:
```python
import sqlalchemy as sa
from sqlalchemy.orm import Session

with Session(engine, autobegin=True) as session:
    r_count = 0
    for image in images:
        sub_r = sa.select(RP).filter_by(d_id=image.id)
        count = session.execute(
            sa.select(sa.func.count()).select_from(sub_r.subquery())
        ).scalar_one()
        r_count += count
        # print timing and count info here
```
This loop executes ~500 times in ~0.2 seconds, because `len(images)` ≈ 500; each pass counts ~10-200 individual rows, for a total of ~70,000 rows that I want to delete.
When I add a delete command below, it takes much longer. Each of the ~500 passes through the loop, which originally took <0.01 seconds, now takes ~250 seconds, meaning it will take 36+ hours to delete these 70,000 rows (which were found by a query in <0.5 seconds).
```python
import sqlalchemy as sa
from sqlalchemy.orm import Session

with Session(engine, autobegin=True) as session:
    r_count = 0
    for image in images:
        sub_r = sa.select(RP).filter_by(d_id=image.id)
        count = session.execute(
            sa.select(sa.func.count()).select_from(sub_r.subquery())
        ).scalar_one()
        r_count += count
        # print timing and count info here

    # New delete code below
    for image in images:
        session.query(RP).filter_by(d_id=image.id).delete()
        # session.commit()  # tried with this inserted and removed; doesn't seem to matter

        # Deleting using the session does not go faster:
        # del_rp = sa.delete(RP).where(RP.d_id == image.id)
        # session.execute(del_rp)

        # Deleting one at a time also doesn't seem to go faster:
        # rois = session.query(RP).filter_by(d_id=image.id).all()
        # for r in rois:
        #     session.delete(r)

        # print timing for each loop here: ~260+ seconds per iteration
```
To summarize, I tried three main strategies: `session.query().filter_by().delete()`, `session.execute(sa.delete().where())`, and looping over `session.query().filter_by().all()` and calling `session.delete()` on each row. I also tried including `session.commit()` at points in the middle.
I am expecting it to execute much faster. I see no reason why the query should go fast and the delete take many orders of magnitude longer.
I am the only user on the server, so there is no other possible bottleneck besides this code. The commented-out methods also seem to take many hundreds of seconds for a few dozen deletes (it would take me longer to gather precise timing info).
I am using the pgAdmin dashboard, and in the 'Tuples Out' view I see ~200-300 seconds of flatline, then 4,000 fetched and 12 billion returned. If I'm deleting ~140 objects in that pass through the loop, that corresponds to each individual delete fetching/returning the whole 92M-row table. Is there some way to tell the delete that I don't care about any return value (assuming that's what is happening)?
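One thing I considered to avoid any ORM-added return clause is bypassing the ORM entirely with a textual DELETE. A minimal sketch of what I mean, assuming `engine` and `images` from the code above (I have not verified this is any faster):

```python
import sqlalchemy as sa

# Sketch: a plain textual DELETE with no RETURNING clause.
# Assumes `engine` and `images` from the code above.
with engine.begin() as conn:
    for image in images:
        conn.execute(
            sa.text("DELETE FROM rp WHERE d_id = :d_id"),
            {"d_id": image.id},
        )
```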
I suppose I could try to be more clever and accumulate all 70,000 rows, then issue a single delete command (something like the sketch below). But based on the pgAdmin dashboard, it seems like this will not help, because the `session.query().filter_by().delete()` appears to be getting broken up into individual deletes anyway.
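To clarify what I mean by a single delete, here is a sketch. It assumes `engine`, `RP`, and `images` from the code above, and that collecting the ids first is cheap (which the fast counting loop suggests):

```python
import sqlalchemy as sa
from sqlalchemy.orm import Session

# Sketch: collect all target ids, then issue one DELETE ... WHERE d_id IN (...).
d_ids = [image.id for image in images]
with Session(engine) as session:
    session.execute(sa.delete(RP).where(RP.d_id.in_(d_ids)))
    session.commit()
```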
What am I supposed to be doing differently? I have tried `.delete(synchronize_session='fetch')` and it doesn't seem to help, but I don't remember if I tried it in every permutation (sketched below).
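Concretely, these are the permutations I mean, assuming `session` and `image` from the loop above (I have not re-timed every combination):

```python
# Sketch of the synchronize_session permutations:
session.query(RP).filter_by(d_id=image.id).delete(synchronize_session='fetch')
session.query(RP).filter_by(d_id=image.id).delete(synchronize_session=False)
session.execute(
    sa.delete(RP)
    .where(RP.d_id == image.id)
    .execution_options(synchronize_session=False)
)
```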
Maybe this should be a second question, but I also can't figure out a way to interrupt the code. If I send a Ctrl+C, it seems to wait until the ~250-second loop iteration is done. If I use the pgAdmin tool, I don't have permission to kill the session. I don't want a bunch of idle threads clogging up the server, so at this point I'm just being patient and waiting for the loop.
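For reference, this is what I would try from a second connection if I had the permissions; a sketch only, since `pg_cancel_backend()` requires superuser or the same role as the backend being cancelled:

```python
import sqlalchemy as sa

# Sketch: cancel the stuck DELETE from a second connection, if permitted.
with engine.connect() as conn:
    conn.execute(sa.text(
        "SELECT pg_cancel_backend(pid) "
        "FROM pg_stat_activity "
        "WHERE state = 'active' AND query LIKE 'DELETE FROM rp%'"
    ))
```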
**Edit:** I checked the server as suggested. I see that the actual SQL being executed is:
```sql
DELETE FROM rp WHERE rp.d_id = 5254591 RETURNING rp.id
```
That should just be returning a single `.id` per deleted row, correct?
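In case it helps diagnose, here is how I plan to capture the plan for one of these deletes. A sketch, assuming `engine` from above; note that `EXPLAIN ANALYZE` actually executes the statement, so I roll the transaction back afterwards:

```python
import sqlalchemy as sa

# Sketch: capture the plan for one representative delete, then roll back.
with engine.connect() as conn:
    result = conn.execute(sa.text(
        "EXPLAIN (ANALYZE, BUFFERS) "
        "DELETE FROM rp WHERE d_id = :d_id RETURNING rp.id"
    ), {"d_id": 5254591})
    for line, in result:
        print(line)
    conn.rollback()
```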