I have a dataset of points of different types. For every point in the dataset I want to find the closest point in every category. I can achieve this, but the compute time is very long, and I'm struggling to get the query to use a spatial index for the KNN search in tandem with the type information in an efficient way.
Sample data generation
    CREATE TYPE point_type AS ENUM ('1','2','3','4','5');

    CREATE TABLE points AS
    SELECT ST_MakePoint(1000*random(), 1000*random())::geometry(Point) AS geom,
           ((random()*3)::int+1)::text::point_type point_type,
           pk
    FROM generate_series(1,6000) pk;

    update points
    set point_type='5' where pk=999;
Index creation
    create index points_geom_idx on points using gist (geom);
    CREATE INDEX points_dual ON points (point_type, geom);
Query that works but is very slow
Is this because the KNN distances are pulled first and only afterwards filtered by the type constraint?
    explain analyse
    with types as (
        select column1::point_type point_type
        from (values ('1'), ('2'), ('3'), ('4'), ('5')) v
    )
    SELECT c1.point_type,
           c1.pk AS main_id,
           b.pk AS secondary_id,
           c1.secondary_point_type,
           b.secondary_point_type,
           b.distance
    FROM (
        SELECT c.point_type, c.pk, c.geom, types.point_type secondary_point_type
        FROM points c
        join types on true
    ) c1
    LEFT JOIN LATERAL (
        SELECT c2.point_type, c2.geom, c2.pk,
               c2.point_type secondary_point_type,
               c1.geom <-> c2.geom AS distance
        FROM points c2
        where c1.pk <> c2.pk
          and c1.secondary_point_type = c2.point_type
        ORDER BY distance
        LIMIT 1
    ) b on true;
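One way to test the suspicion above is to run the inner lookup for a single point and a single type, then read the `Rows Removed by Filter` line of the index scan. This is only a sketch against the sample table: `pk = 1` and type `'5'` are arbitrary choices (type `'5'` has a single point, so it should be the worst case if the scan really walks neighbours in distance order and filters afterwards). Which plan is actually chosen depends on the planner and on table statistics.

```sql
-- Hypothetical spot check: nearest type-'5' neighbour of the point with pk = 1.
-- If the GiST index is used for ordering, a large "Rows Removed by Filter"
-- would support the scan-then-filter explanation.
EXPLAIN (ANALYSE, BUFFERS)
SELECT c2.pk,
       c2.point_type,
       c2.geom <-> (SELECT geom FROM points WHERE pk = 1) AS distance
FROM points c2
WHERE c2.pk <> 1
  AND c2.point_type = '5'
ORDER BY distance
LIMIT 1;
```

Repeating the check with a common type such as `'2'` gives a baseline to compare against.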
Query that is very fast but doesn't provide correct results
I believe this is because it just gets the single closest point, and if that point isn't of the correct type the join condition fails, so nothing is joined, leaving NULLs for most results.
    explain analyse
    with types as (
        select column1::point_type point_type
        from (values ('1'), ('2'), ('3'), ('4'), ('5')) v
    )
    SELECT c1.point_type,
           c1.pk AS main_id,
           b.pk AS secondary_id,
           c1.secondary_point_type,
           b.secondary_point_type,
           b.distance
    FROM (
        SELECT c.point_type, c.pk, c.geom, types.point_type secondary_point_type
        FROM points c
        join types on true
    ) c1
    LEFT JOIN LATERAL (
        SELECT c2.point_type, c2.geom, c2.pk,
               c2.point_type secondary_point_type,
               c1.geom <-> c2.geom AS distance
        FROM points c2
        where c1.pk <> c2.pk
        ORDER BY distance
        LIMIT 1
    ) b on c1.secondary_point_type = b.secondary_point_type;
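The NULLs can be reproduced for a single point. The sketch below (again using the arbitrary point `pk = 1`) pairs every enum value with the one overall nearest neighbour, which is what the fast query does per point. Because the subquery returns a single row of a single type, at most one of the five output rows can have a match; the others come back with NULL in `secondary_id`.

```sql
-- For pk = 1, join each type against the single overall nearest neighbour.
-- enum_range(NULL::point_type) lists every value of the enum.
SELECT t.point_type AS wanted_type,
       b.pk         AS secondary_id,
       b.point_type AS found_type
FROM unnest(enum_range(NULL::point_type)) AS t(point_type)
LEFT JOIN (
    SELECT c2.pk, c2.point_type
    FROM points c2
    WHERE c2.pk <> 1
    ORDER BY c2.geom <-> (SELECT geom FROM points WHERE pk = 1)
    LIMIT 1
) b ON b.point_type = t.point_type;
```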
I'm trying to get the results of the first query quickly, using the spatial index for the KNN search across all types. Thanks!
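For checking whether any faster variant returns the right rows, a brute-force reference on a small subset may help. This is a sketch, not a proposed solution: it does not use the KNN operator at all, and the `pk <= 100` cut-off is an arbitrary limit chosen only to keep the self-join small.

```sql
-- Brute-force reference: for each of the first 100 points, keep the
-- closest other point of every type. DISTINCT ON keeps the first row
-- per (main point, type) after sorting by distance.
SELECT DISTINCT ON (a.pk, b.point_type)
       a.pk         AS main_id,
       b.point_type AS secondary_point_type,
       b.pk         AS secondary_id,
       ST_Distance(a.geom, b.geom) AS distance
FROM points a
JOIN points b ON a.pk <> b.pk
WHERE a.pk <= 100
ORDER BY a.pk, b.point_type, ST_Distance(a.geom, b.geom);
```

Note that this inner join emits no row when a type has no other point, whereas the LEFT JOIN LATERAL form emits a row with NULLs for that type.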
Outputs of explain analyse
First query:
    Sort  (cost=29155.39..29230.39 rows=30000 width=28) (actual time=24533.167..24543.539 rows=30000 loops=1)
      Output: c.point_type, c.pk, c2.pk, (("*VALUES*".column1)::point_type), c2.point_type, ((c.geom <-> c2.geom))
      Sort Key: c2.point_type DESC
      Sort Method: quicksort  Memory: 2409kB
      Buffers: shared hit=180999
      ->  Nested Loop Left Join  (cost=0.15..26924.49 rows=30000 width=28) (actual time=5.024..24430.122 rows=30000 loops=1)
            Output: c.point_type, c.pk, c2.pk, ("*VALUES*".column1)::point_type, c2.point_type, ((c.geom <-> c2.geom))
            Buffers: shared hit=180999
            ->  Nested Loop  (cost=0.00..499.07 rows=30000 width=72) (actual time=0.546..105.076 rows=30000 loops=1)
                  Output: c.point_type, c.pk, c.geom, "*VALUES*".column1
                  Buffers: shared hit=64
                  ->  Seq Scan on public.points c  (cost=0.00..124.00 rows=6000 width=40) (actual time=0.341..12.850 rows=6000 loops=1)
                        Output: c.geom, c.point_type, c.pk
                        Buffers: shared hit=64
                  ->  Materialize  (cost=0.00..0.09 rows=5 width=32) (actual time=0.001..0.006 rows=5 loops=6000)
                        Output: "*VALUES*".column1
                        ->  Values Scan on "*VALUES*"  (cost=0.00..0.06 rows=5 width=32) (actual time=0.034..0.141 rows=5 loops=1)
                              Output: "*VALUES*".column1
            ->  Limit  (cost=0.15..0.86 rows=1 width=52) (actual time=0.802..0.803 rows=1 loops=30000)
                  Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, ((c.geom <-> c2.geom))
                  Buffers: shared hit=180935
                  ->  Result  (cost=0.15..4249.52 rows=5999 width=52) (actual time=0.800..0.800 rows=1 loops=30000)
                        Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, (c.geom <-> c2.geom)
                        One-Time Filter: (("*VALUES*".column1)::point_type = ("*VALUES*".column1)::point_type)
                        Buffers: shared hit=180935
                        ->  Index Scan using points_geom_idx on public.points c2  (cost=0.15..500.15 rows=5999 width=40) (actual time=0.787..0.787 rows=1 loops=30000)
                              Output: c2.geom, c2.point_type, c2.pk
                              Order By: (c2.geom <-> c.geom)
                              Filter: (c.pk <> c2.pk)
                              Rows Removed by Filter: 1
                              Buffers: shared hit=180935
    Settings: search_path = 'public, topology, tiger'
    Planning Time: 4.964 ms
    Execution Time: 24553.107 ms
Second query:
    QUERY PLAN
    Nested Loop  (cost=0.88..1197.38 rows=30000 width=28) (actual time=3.535..4538.832 rows=30000 loops=1)
      Output: c.point_type, c.pk, b.pk, ("*VALUES*".column1)::point_type, b.secondary_point_type, b.distance
      Buffers: shared hit=36251
      ->  Seq Scan on public.points c  (cost=0.00..124.00 rows=6000 width=40) (actual time=0.095..4.897 rows=6000 loops=1)
            Output: c.geom, c.point_type, c.pk
            Buffers: shared hit=64
      ->  Hash Left Join  (cost=0.88..0.98 rows=5 width=48) (actual time=0.726..0.743 rows=5 loops=6000)
            Output: "*VALUES*".column1, b.pk, b.secondary_point_type, b.distance
            Hash Cond: (("*VALUES*".column1)::point_type = b.secondary_point_type)
            Buffers: shared hit=36187
            ->  Values Scan on "*VALUES*"  (cost=0.00..0.06 rows=5 width=32) (actual time=0.001..0.008 rows=5 loops=6000)
                  Output: "*VALUES*".column1
            ->  Hash  (cost=0.87..0.87 rows=1 width=16) (actual time=0.707..0.707 rows=1 loops=6000)
                  Output: b.pk, b.secondary_point_type, b.distance
                  Buckets: 1024  Batches: 1  Memory Usage: 9kB
                  Buffers: shared hit=36187
                  ->  Subquery Scan on b  (cost=0.15..0.87 rows=1 width=16) (actual time=0.701..0.703 rows=1 loops=6000)
                        Output: b.pk, b.secondary_point_type, b.distance
                        Buffers: shared hit=36187
                        ->  Limit  (cost=0.15..0.86 rows=1 width=52) (actual time=0.700..0.700 rows=1 loops=6000)
                              Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, ((c.geom <-> c2.geom))
                              Buffers: shared hit=36187
                              ->  Index Scan using points_geom_idx on public.points c2  (cost=0.15..4249.52 rows=5999 width=52) (actual time=0.695..0.695 rows=1 loops=6000)
                                    Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, (c.geom <-> c2.geom)
                                    Order By: (c2.geom <-> c.geom)
                                    Filter: (c.pk <> c2.pk)
                                    Rows Removed by Filter: 1
                                    Buffers: shared hit=36187
    Settings: search_path = 'public, topology, tiger'
    Planning Time: 3.206 ms
    Execution Time: 4549.481 ms