php - Traverse a big table in PostgreSQL with Propel
I had a task to iterate over a big table (~40 million records) in PostgreSQL using Propel, and I ran into performance issues with both the memory limit and execution speed. The script had been running for 22(!) hours.
The task was to retrieve records based on a criterion (not active for the last 6 months) and archive them (move them to another table) together with related entities from other tables.
The primary table the script works on has several columns: id, device_id, application_id, last_activity_date, and others that don't have significant meaning here. The table contains information about applications installed on a device and their last activity dates. There may be several records with the same device_id but different application_id. Here is a sample of the table:
 id | device_id | application_id |    last_init_date
----+-----------+----------------+---------------------
  1 |         1 |              1 | 2013-09-24 17:09:01
  2 |         1 |              2 | 2013-09-19 20:36:23
  3 |         1 |              3 | 2014-02-11 00:00:00
  4 |         2 |              4 | 2013-09-29 20:12:54
  5 |         3 |              5 | 2013-08-31 19:41:05
So, a device is considered old enough to be archived if the maximum last_activity_date for a particular device_id in the table is older than 6 months. Here is the query:
SELECT device_id
FROM device_applications
GROUP BY device_id
HAVING MAX(last_init_date) < '2014-06-16 08:00:00'
In Propel it looks like this:
\DeviceApplicationsQuery::create()
    ->select('DeviceId')
    ->groupByDeviceId()
    ->having('MAX(device_applications.last_init_date) < ?', $date->format('Y-m-d H:i:s'))
    ->find();
The resulting set, as I understand it, is too big to fit in memory, so I have to split it into chunks somehow.
The question is: what is the best strategy to choose in this situation to decrease memory consumption and speed up the script? In the answer I'll show what I've found so far.
I know of 3 strategies for traversing a big table.
1. Good old LIMIT/OFFSET
The problem with this approach is that the database examines all the records we want to skip with OFFSET. Here is a quote from the docs:
The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large OFFSET might be inefficient.
Here is a simple example (not the initial query):
EXPLAIN (ANALYZE)
SELECT *
FROM device_applications
ORDER BY device_id
LIMIT 100 OFFSET 300;
The execution plan:
Limit  (cost=37.93..50.57 rows=100 width=264) (actual time=0.630..0.835 rows=100 loops=1)
  ->  Index Scan using device_applications_device_id_application_id_unique on device_applications  (cost=0.00..5315569.97 rows=42043256 width=264) (actual time=0.036..0.806 rows=400 loops=1)
Total runtime: 0.873 ms
Pay special attention to the actual results in the Index Scan section. It shows that PostgreSQL worked through 400 records: the OFFSET (300) plus the LIMIT (100). This approach is quite inefficient, taking into consideration the complexity of the initial query.
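For illustration, chunked traversal with LIMIT/OFFSET could be driven from Propel roughly as in the sketch below. This is my own minimal sketch, not code from the original script; the chunk size and the ordering column are assumptions.

$chunkSize = 1000; // assumed chunk size
$offset = 0;

do {
    $deviceIds = \DeviceApplicationsQuery::create()
        ->select('DeviceId')
        ->groupByDeviceId()
        ->having('MAX(device_applications.last_init_date) < ?', $date->format('Y-m-d H:i:s'))
        ->orderByDeviceId() // a stable order so the pages don't overlap
        ->limit($chunkSize)
        ->offset($offset)
        ->find();

    foreach ($deviceIds as $deviceId) {
        // archive the device and its related records here
    }

    $offset += $chunkSize;
} while (count($deviceIds) === $chunkSize);

Even when split this way, every chunk still pays the OFFSET cost described above, which is exactly why the approach does not scale to ~40 million rows.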
2. Ranging over a column
We can avoid the limitations of the LIMIT/OFFSET approach by making the query work on ranges of the table, obtained by slicing the table over a column.
To clarify: imagine we have a table with 100 records; we can divide it into 5 ranges of 20 records each (0 - 20, 20 - 40, 40 - 60, 60 - 80, 80 - 100) and work with the smaller subsets. In our case the column we can range over is device_id. The query looks like this:
SELECT device_id
FROM device_applications
WHERE device_id >= 1 AND device_id < 1000
GROUP BY device_id
HAVING MAX(last_init_date) < '2014-06-16 08:00:00';
It groups the records by device_id, extracts the given range, and applies the condition on last_init_date. Of course, there may be (and in many cases will be) no records matching the condition. So the problem with this approach is that we still have to scan the whole table, even if the records we want to find make up only 5% of it.
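A Propel version of this range-based traversal could look like the following sketch. Again, this is my own sketch: the range width and the maximum device_id are placeholders, and the where() clauses follow the same raw-clause style the post already uses in having().

$rangeSize   = 1000;     // assumed width of one device_id range
$maxDeviceId = 40000000; // placeholder: fetch the real MAX(device_id) beforehand

for ($start = 1; $start <= $maxDeviceId; $start += $rangeSize) {
    $deviceIds = \DeviceApplicationsQuery::create()
        ->select('DeviceId')
        ->where('device_applications.device_id >= ?', $start)
        ->where('device_applications.device_id < ?', $start + $rangeSize)
        ->groupByDeviceId()
        ->having('MAX(device_applications.last_init_date) < ?', $date->format('Y-m-d H:i:s'))
        ->find();

    foreach ($deviceIds as $deviceId) {
        // archive the device and its related records here
    }
}

Each iteration keeps only one range in memory, but, as noted above, all the ranges together still scan the whole table.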
3. Using cursors
What we need is a cursor. Cursors allow iterating over a result set without fetching the whole data at once. In PHP we make use of cursors when we iterate over a PDOStatement. A simple example:
$stmt = $dbh->prepare("SELECT * FROM table");
$stmt->execute();

// iterate over the statement using a cursor
foreach ($stmt as $row) {
    // process the row
}
In Propel we can make use of this PDO feature with the PropelOnDemandFormatter class. So, the final code:
$devApps = \DeviceApplicationsQuery::create()
    ->setFormatter('\PropelOnDemandFormatter')
    ->select('DeviceId')
    ->groupByDeviceId()
    ->having('MAX(device_applications.last_init_date) < ?', $date->format('Y-m-d H:i:s'))
    ->find();

/** @var \DeviceApplications $devApp */
foreach ($devApps as $devApp) {
    // process the record here
}
Here the call to find() does not fetch the data; instead, it creates a collection that instantiates objects on demand.
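Since the original task is to archive the matched devices, one thing worth combining with the on-demand formatter (my own addition, not from the post) is batching the writes into periodic transactions, so memory stays flat and a crash does not throw away all completed work:

$con = \Propel::getConnection();
$batchSize = 500; // assumed batch size
$processed = 0;

$con->beginTransaction();
foreach ($devApps as $devApp) {
    // move the record and its related entities to the archive tables here
    $processed++;

    if ($processed % $batchSize === 0) {
        // commit the finished batch and start a new transaction
        $con->commit();
        $con->beginTransaction();
    }
}
$con->commit();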