Tuesday 21 November 2017

sql - Optimize groupwise maximum query


select *
from records
where id in (select max(id) from records group by option_id);



This query works fine even on millions of rows. However, as you can see from the result of the explain statement:



                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=30218.84..31781.62 rows=620158 width=44) (actual time=1439.251..1443.458 rows=1057 loops=1)
   ->  HashAggregate  (cost=30218.41..30220.41 rows=200 width=4) (actual time=1439.203..1439.503 rows=1057 loops=1)
         ->  HashAggregate  (cost=30196.72..30206.36 rows=964 width=8) (actual time=1438.523..1438.807 rows=1057 loops=1)
               ->  Seq Scan on records records_1  (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.103..527.914 rows=1240315 loops=1)
   ->  Index Scan using records_pkey on records  (cost=0.43..7.80 rows=1 width=44) (actual time=0.002..0.003 rows=1 loops=1057)
         Index Cond: (id = (max(records_1.id)))
 Total runtime: 1443.752 ms


(cost=0.00..23995.15 rows=1240315 width=8) <- Here it says it is scanning all rows, and that is obviously inefficient.



I also tried reordering the query:



select r.*
from records r
inner join (select max(id) id from records group by option_id) r2
  on r2.id = r.id;


                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=30197.15..37741.04 rows=964 width=44) (actual time=835.519..840.452 rows=1057 loops=1)
   ->  HashAggregate  (cost=30196.72..30206.36 rows=964 width=8) (actual time=835.471..835.836 rows=1057 loops=1)
         ->  Seq Scan on records  (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.336..348.495 rows=1240315 loops=1)
   ->  Index Scan using records_pkey on records r  (cost=0.43..7.80 rows=1 width=44) (actual time=0.003..0.003 rows=1 loops=1057)
         Index Cond: (id = (max(records.id)))
 Total runtime: 840.809 ms



(cost=0.00..23995.15 rows=1240315 width=8) <- Still scanning all rows.



I tried with and without an index on (option_id), (option_id, id), and (option_id, id desc); none of them had any effect on the query plan.



Is there a way of executing a groupwise maximum query in Postgres without scanning all rows?



What I am looking for, programmatically, is an index which stores the maximum id for each option_id as they are inserted into the records table. That way, when I query for maximums of option_ids, I should only need to scan index records as many times as there are different option_ids.
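For illustration, this is roughly what I have in mind - a summary table kept up to date by a trigger. This is only a hypothetical sketch (table and function names option_max and records_track_max are made up, and it needs Postgres 9.5+ for ON CONFLICT), not something I have in my schema:

-- Hypothetical summary table: one row per option_id holding the current max id.
CREATE TABLE option_max (
  option_id int PRIMARY KEY
, max_id    int NOT NULL
);

-- Keep option_max current on every insert into records.
CREATE FUNCTION records_track_max()
  RETURNS trigger
  LANGUAGE plpgsql AS
$$
BEGIN
   INSERT INTO option_max (option_id, max_id)
   VALUES (NEW.option_id, NEW.id)
   ON CONFLICT (option_id)
   DO UPDATE SET max_id = GREATEST(option_max.max_id, EXCLUDED.max_id);
   RETURN NEW;
END
$$;

CREATE TRIGGER records_track_max
AFTER INSERT ON records
FOR EACH ROW EXECUTE PROCEDURE records_track_max();

With something like that, reading the per-option maximums would never have to touch the big records table at all.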



I've seen select distinct on answers all over SO from high ranking users (thanks to @Clodoaldo Neto for giving me the keywords to search for). Here's why it doesn't work:




create index index_name on records (option_id, id desc);

select distinct on (option_id) *
from records
order by option_id, id desc;

                                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.43..76053.10 rows=964 width=44) (actual time=0.049..1668.545 rows=1056 loops=1)
   ->  Index Scan using records_option_id_id_idx on records  (cost=0.43..73337.25 rows=1086342 width=44) (actual time=0.046..1368.300 rows=1086342 loops=1)
 Total runtime: 1668.817 ms


That's great, it's using an index. However, using an index to scan all ids doesn't really make much sense. According to my executions, it is in fact slower than a simple sequential scan.



Interestingly enough, MySQL 5.5 is able to optimize the query simply using an index on records(option_id, id):



mysql> select count(1) from records;
+----------+
| count(1) |
+----------+
|  1086342 |
+----------+
1 row in set (0.00 sec)

mysql> explain extended select * from records
       inner join ( select max(id) max_id from records group by option_id ) mr
       on mr.max_id = records.id;
+------+----------+--------------------------+
| rows | filtered | Extra                    |
+------+----------+--------------------------+
| 1056 |   100.00 |                          |
|    1 |   100.00 |                          |
|  201 |   100.00 | Using index for group-by |
+------+----------+--------------------------+
3 rows in set, 1 warning (0.02 sec)



Answer




Assuming relatively few rows in options for many rows in records.



Typically, you would have a look-up table options that is referenced from records.option_id, ideally with a foreign key constraint (http://www.postgresql.org/docs/current/interactive/ddl-constraints.html#DDL-CONSTRAINTS-FK). If you don't, I suggest creating one to enforce referential integrity:



CREATE TABLE options (
  option_id int PRIMARY KEY
, option    text UNIQUE NOT NULL
);

INSERT INTO options
SELECT DISTINCT option_id, 'option' || option_id  -- dummy option names
FROM   records;


Then we have no need to emulate a loose index scan any more and this becomes very simple and fast. Correlated subqueries can use a plain index on (option_id, id).



SELECT option_id
     , (SELECT max(id)
        FROM   records
        WHERE  option_id = o.option_id
       ) AS max_id
FROM   options o
ORDER  BY 1;


This includes options with no match in table records. You get NULL for max_id and you can easily remove such rows in an outer SELECT if needed.
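For example, a wrapper like the following would drop the unmatched options (a minimal sketch built on the query above):

SELECT *
FROM  (
   SELECT option_id
        , (SELECT max(id)
           FROM   records
           WHERE  option_id = o.option_id) AS max_id
   FROM   options o
   ) sub
WHERE  max_id IS NOT NULL
ORDER  BY option_id;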



Or (same result):




SELECT option_id
     , (SELECT id
        FROM   records
        WHERE  option_id = o.option_id
        ORDER  BY id DESC NULLS LAST
        LIMIT  1
       ) AS max_id
FROM   options o
ORDER  BY 1;



May be a bit faster. The subquery uses the sort order DESC NULLS LAST - same as the aggregate function max(), which ignores NULL values. Sorting just DESC would have NULL first:
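A minimal self-contained illustration of the difference (not from the original post):

SELECT x FROM (VALUES (1), (3), (NULL::int)) t(x)
ORDER  BY x DESC;             -- NULL, 3, 1  (Postgres default for DESC is NULLS FIRST)

SELECT x FROM (VALUES (1), (3), (NULL::int)) t(x)
ORDER  BY x DESC NULLS LAST;  -- 3, 1, NULL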





So, the perfect index for this:

CREATE INDEX ON records (option_id, id DESC NULLS LAST);



Doesn't matter much while columns are defined NOT NULL.



There can still be a sequential scan on the small table options; that's just the fastest way to fetch all rows. The ORDER BY may bring in an index (only) scan to fetch pre-sorted rows.
The big table records is only accessed via (bitmap) index scan - or, if possible, index-only scan.



SQL Fiddle showing two index-only scans for the simple case: http://sqlfiddle.com/#!15/9fd11/2



Or use LATERAL joins for a similar effect in Postgres 9.3+:
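A sketch of what that LATERAL variant could look like (my own assumption, not the example linked from the answer):

SELECT o.option_id, r.id AS max_id
FROM   options o
LEFT   JOIN LATERAL (
   SELECT id
   FROM   records
   WHERE  option_id = o.option_id
   ORDER  BY id DESC NULLS LAST
   LIMIT  1
   ) r ON true
ORDER  BY 1;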



