Tuesday 5 December 2017

database - SQLite insert speed slows as number of records increases due to an index


Original question



Background




It is well-known that SQLite needs to be fine tuned (https://stackoverflow.com/questions/1711631/how-do-i-improve-the-performance-of-sqlite) to achieve insert speeds on the order of 50k inserts/s. There are many questions here regarding slow insert speeds and a wealth of advice and benchmarks.



There are also claims that SQLite can handle large amounts of data (https://stackoverflow.com/questions/784173/what-are-the-performance-characteristics-of-sqlite-with-very-large-database-file?lq=1), with reports of 50+ GB not causing any problems with the right settings.



I have followed the advice here and elsewhere to achieve these speeds, and I'm happy with 35k-45k inserts/s. The problem I have is that all of the benchmarks only demonstrate fast insert speeds with < 1m records. What I am seeing is that insert speed seems to be inversely proportional to table size.



Issue



My use case requires storing 500m to 1b tuples ([x_id, y_id, z_id]) over a few years (1m rows / day) in a link table. The values are all integer IDs between 1 and 2,000,000. There is a single index on z_id.
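
For reference, here is a minimal sketch of that schema in Perl with DBI/DBD::SQLite. The column names and the single index on z_id come from the description above; the file, table, and index names are placeholders of my own:

    use strict;
    use warnings;
    use DBI;

    # Hypothetical database file; adjust to your environment.
    my $dbh = DBI->connect("dbi:SQLite:dbname=links.db", "", "", { RaiseError => 1 });

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS link (
            x_id INTEGER NOT NULL,
            y_id INTEGER NOT NULL,
            z_id INTEGER NOT NULL
        )
    });

    # The single index whose maintenance cost grows with the table.
    $dbh->do("CREATE INDEX IF NOT EXISTS idx_link_z ON link (z_id)");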




Performance is great for the first 10m rows, ~35k inserts/s, but by the time the table has ~20m rows, performance starts to suffer. I'm now seeing about 100 inserts/s.



The size of the table is not particularly large. With 20m rows, the size on disk is around 500MB.



The project is written in Perl.



Question



Is this the reality of large tables in SQLite, or are there any secrets to maintaining high insert rates for tables with > 10m rows?




Known workarounds which I'd like to avoid if possible




  • Drop the index, add the records, and re-index: This is fine as a workaround, but
    it doesn't work when the DB still needs to be usable during updates; taking the
    database completely offline for x minutes per day is not an option. (A sketch of
    this workaround follows the list.)

  • Break the table into smaller subtables / files: This will work in the short term
    and I have already experimented with it. The problem is that I need to be able to
    retrieve data from the entire history when querying, which means that eventually
    I'll hit the 62-database attachment limit. Attaching, collecting results in a temp
    table, and detaching hundreds of times per request seems to be a lot of work and
    overhead, but I'll try it if there are no other alternatives.

  • Set SQLITE_FCNTL_CHUNK_SIZE (http://www.sqlite.org/c3ref/c_fcntl_chunk_size.html#sqlitefcntlchunksize):
    I don't know C (?!), so I'd prefer not to learn it just to get this done. I can't
    see any way to set this parameter from Perl, though.
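
For completeness, a rough sketch of the first workaround (drop the index, bulk-load, rebuild), using the illustrative table and index names from the schema sketch above; load_new_rows() is a stand-in for whatever actually inserts the new records:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=links.db", "", "", { RaiseError => 1 });

    # Drop the index, bulk-load, then rebuild it. While the index is gone,
    # queries on z_id fall back to full scans, which is the drawback noted above.
    $dbh->do("DROP INDEX IF EXISTS idx_link_z");

    $dbh->begin_work;
    load_new_rows($dbh);    # stand-in for the actual bulk insert
    $dbh->commit;

    $dbh->do("CREATE INDEX idx_link_z ON link (z_id)");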



UPDATE




Following Tim's suggestion (https://stackoverflow.com/questions/15778716/sqlite-insert-speed-slows-as-number-of-records-increases-due-to-an-index#comment22443079_15778716) that an index was causing increasingly slow insert times despite SQLite's claims that it is capable of handling large data sets, I performed a benchmark comparison with the following settings (a sketch of applying them from Perl follows the list):




  • inserted rows: 14 million

  • commit batch size: 50,000 records

  • cache_size pragma: 10,000

  • page_size pragma: 4,096

  • temp_store pragma: memory

  • journal_mode pragma: delete

  • synchronous pragma: off
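
A rough sketch of how those settings translate into DBD::SQLite calls. The table name, connection details, and next_row() data source are placeholders, not the original project's code:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=links.db", "", "", { RaiseError => 1 });

    # Pragmas from the benchmark above. Note that page_size only takes
    # effect on a freshly created (or vacuumed) database file.
    $dbh->do("PRAGMA cache_size   = 10000");
    $dbh->do("PRAGMA page_size    = 4096");
    $dbh->do("PRAGMA temp_store   = MEMORY");
    $dbh->do("PRAGMA journal_mode = DELETE");
    $dbh->do("PRAGMA synchronous  = OFF");

    # Commit in batches of 50,000 rows, as in the benchmark.
    my $sth = $dbh->prepare("INSERT INTO link (x_id, y_id, z_id) VALUES (?, ?, ?)");
    my $n   = 0;
    $dbh->begin_work;
    while (my $row = next_row()) {    # next_row() is a stand-in for the data source
        $sth->execute(@$row);
        if (++$n % 50_000 == 0) {
            $dbh->commit;
            $dbh->begin_work;
        }
    }
    $dbh->commit;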



In my project, as in the benchmark results below, a file-based temporary table is created and SQLite's built-in support for importing CSV data is used. The temporary table is then attached to the receiving database and sets of 50,000 rows are inserted with an insert-select statement. Therefore, the insert times do not reflect file-to-database insert times, but rather table-to-table insert speed. Taking the CSV import time into account would reduce the speeds by 25-50% (a very rough estimate; it doesn't take long to import the CSV data).
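
Something along these lines (the staging database and table names are placeholders; the original post does not show its actual code):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=links.db", "", "", { RaiseError => 1 });

    # Attach the CSV-loaded staging database and copy rows across in
    # 50,000-row slices with INSERT ... SELECT.
    $dbh->do("ATTACH DATABASE 'csv_staging.db' AS staging");

    my $offset = 0;
    while (1) {
        my $copied = $dbh->do(qq{
            INSERT INTO link (x_id, y_id, z_id)
            SELECT x_id, y_id, z_id
            FROM staging.import
            LIMIT 50000 OFFSET $offset
        });
        last if $copied == 0;    # DBI's do() returns "0E0" (numerically zero) when no rows matched
        $offset += 50_000;
    }

    $dbh->do("DETACH DATABASE staging");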



Clearly having an index causes the slowdown in insert speed as table size increases.

[Plot of SQLite insert speed and table size: https://i.stack.imgur.com/UMDAM.png]



It's quite clear from the data above that the correct answer goes to Tim's answer (https://stackoverflow.com/a/15809806/918938) rather than to the assertions that SQLite just can't handle it. Clearly it can handle large datasets if indexing that dataset is not part of your use case. I have been using SQLite for just that for a while now, as a backend for a logging system that does not need to be indexed, so I was quite surprised at the slowdown I experienced.




Conclusion



If anyone finds themselves wanting to store a large amount of data using SQLite and have it indexed, using shards (https://stackoverflow.com/questions/128919/extreme-sharding-one-sqlite-database-per-user) may be the answer. I eventually settled on using the first three characters of an MD5 hash of a unique column in z to determine assignment to one of 4,096 databases. Since my use case is primarily archival in nature, the schema will not change and queries will never require shard walking. There is a limit to database size since extremely old data will be reduced and eventually discarded, so this combination of sharding, pragma settings, and even some denormalisation gives me a nice balance that will, based on the benchmarking above, maintain an insert speed of at least 10k inserts/second.
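
As a rough illustration of that shard assignment (the column value being hashed and the file layout are placeholders): the first three hex characters of an MD5 give 16^3 = 4,096 possible buckets, one database file per bucket.

    use strict;
    use warnings;
    use DBI;
    use Digest::MD5 qw(md5_hex);

    # Map a record to one of 4,096 shard databases via the first three
    # hex characters of the MD5 of its unique column value.
    sub shard_path {
        my ($unique_value) = @_;
        my $bucket = substr(md5_hex($unique_value), 0, 3);    # e.g. "a3f"
        return "shards/$bucket.db";
    }

    # Usage: open (or create) the shard database that owns this record.
    my $unique_value = 'some-unique-z-value';                 # placeholder
    my $dbh = DBI->connect(
        'dbi:SQLite:dbname=' . shard_path($unique_value),
        '', '', { RaiseError => 1 },
    );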



Answer




If your requirement is to find a particular z_id and the x_ids and y_ids linked to it (as distinct from quickly selecting a range of z_ids), you could look into a non-indexed hash-table nested-relational db that would allow you to instantly find your way to a particular z_id in order to get its y_ids and x_ids -- without the indexing overhead and the concomitant degraded performance during inserts as the index grows. In order to avoid clumping (aka bucket collisions), choose a key-hashing algorithm that puts greatest weight on the digits of z_id with greatest variation (right-weighted).
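
A toy illustration of what "right-weighted" means here, not taken from the answer itself: derive the bucket from the low-order digits of z_id, which vary the most, rather than the high-order ones.

    # Toy example of a right-weighted bucket choice: take z_id modulo the
    # bucket count, so the low-order (most variable) digits decide the bucket
    # and records spread evenly rather than clumping.
    my $num_buckets = 4096;
    my $z_id        = 1_234_567;               # placeholder value
    my $bucket      = $z_id % $num_buckets;    # 0 .. 4095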



P.S. A database that uses a b-tree may at first appear faster than a db that uses linear hashing, say, but insert performance with linear hashing remains level while performance on the b-tree begins to degrade.



P.P.S. To answer @kawing-chiu's question: the core feature relevant here is that such a database relies on so-called "sparse" tables in which the physical location of a record is determined by a hashing algorithm that takes the record key as input. This approach permits a seek directly to the record's location in the table without the intermediary of an index. As there is no need to traverse indexes or to re-balance indexes, insert times remain constant as the table becomes more densely populated. With a b-tree, by contrast, insert times degrade as the index tree grows. OLTP applications with large numbers of concurrent inserts can benefit from such a sparse-table approach. The records are scattered throughout the table. The downside of records being scattered across the "tundra" of the sparse table is that gathering large sets of records which have a value in common, such as a postal code, can be slower. The hashed sparse-table approach is optimized to insert and retrieve individual records, and to retrieve networks of related records, not large sets of records that have some field value in common.




A nested relational database is one that permits tuples within a column of a row.

