A Short Guide to DBI
Perl's Database Interface Module
by Mark-Jason Dominus
October 22, 1999
Short guide to DBI (The Perl Database Interface Module)
General information about relational databases
Relational databases started to get to be a big deal in the 1970's, andthey're still a big deal today, which is a little peculiar, because they're a 1960's technology.
A relational database is a bunch of rectangular tables. Each row of a table is a record about one person or thing; the record contains several pieces of information called fields. Here is an example table:
LASTNAME FIRSTNAME ID POSTAL_CODE AGE SEX
Gauss Karl 119 19107 30 M
Smith Mark 3 T2V 3V4 53 M
Noether Emmy 118 19107 31 F
Smith Jeff 28 K2G 5J9 19 M
Hamilton William 247 10139 2 M
The names of the fields are LASTNAME, FIRSTNAME, ID, POSTAL_CODE, AGE, and SEX. Each line in the table is a record, or sometimes a row or tuple. For example, the first row of the table represents a 30-year-old male whose name is Karl Gauss, who lives at postal code 19107, and whose ID number is 119.
Sometimes this is a very silly way to store information. When the information naturally has a tabular structure it's fine. When it doesn't, you have to squeeze it into a table, and some of the techniques for doing that are more successful than others. Nevertheless, tables are simple and are easy to understand, and most of the high-performance database systems you can buy today operate under this 1960's model.
About SQL
SQL stands for Structured Query Language. It was invented at IBM in the 1970's. It's a language for describing searches and modifications to a relational database.
SQL was a huge success, probably because it's incredibly simple and anyone can pick it up in ten minutes. As a result, all the important database systems support it in some fashion or another. This includes the big players, like Oracle and Sybase, high-quality free or inexpensive database systems like MySQL, and funny hacks like Perl's DBD::CSV module, which we'll see later.
There are four important things one can do with a table:
SELECT
Find all the records that have a certain property
INSERT
Add new records
DELETE
Remove old records
UPDATE
Modify records that are already there
Those are the four most important SQL commands, also called queries. Suppose that the example table above is named people. Here are examples of each of the four important kinds of queries:
SELECT firstname FROM people WHERE lastname = 'Smith'
(Locate the first names of all the Smiths.)
DELETE FROM people WHERE id = 3
(Delete Mark Smith from the table)
UPDATE people SET age = age+1 WHERE id = 247
(William Hamilton just had a birthday.)
INSERT INTO people VALUES ('Euler', 'Leonhard', 248, NULL, 58, 'M')
(Add Leonhard Euler to the table.)
There are a bunch of other SQL commands for creating and discarding tables, for granting and revoking access permissions, for committing and abandoning transactions, and so forth. But these four are the important ones. Congratulations; you are now a SQL programmer. For the details, go to any reasonable bookstore and pick up a SQL quick reference.
Every database system is a little different. You talk to some databases over the network and make requests of the database engine; other databases you talk to through files or something else.
Typically when you buy a commercial database, you get a library with it. The vendor has written some functions for talking to the database in some language like C, compiled the functions, and the compiled code is the library. You can write a C program that calls the functions in the library when it wants to talk to the database.
Every vendor's library is different. The names of the functions vary, and the order in which you call them varies, and the details of passing queries to the functions and getting the data back out will vary. Some libraries, like Oracle's, are very thin—they just send the query over to the network to the real database and let the giant expensive real database engine deal with it directly. Other libraries will do more predigestion of the query, and more work afterwards to turn the data into a data structure. Some databases will want you to spin around three times and bark like a chicken; others want you to stand on your head and drink out of your sneaker.
What DBI is For
There's a saying that any software problem can be solved by adding a layer of indirection. That's what Perl's DBI (`Database Interface') module is all about. It was written by Tim Bunce.
DBI is designed to protect you from the details of the vendor libraries. It has a very simple interface for saying what SQL queries you want to make, and for getting the results back. DBI doesn't know how to talk to any particular database, but it does know how to locate and load in DBD (`Database Driver') modules. The DBD modules have the vendor libraries in them and know how to talk to the real databases; there is one DBD module for every different database.
When you ask DBI to make a query for you, it sends the query to the appropriate DBD module, which spins around three times or drinks out of its sneaker or whatever is necessary to communicate with the real database. When it gets the results back, it passes them to DBI. Then DBI gives you the results. Since your program only has to deal with DBI, and not with the real database, you don't have to worry about barking like a chicken.
Here's your program talking to the DBI library. You are using two databases at once. One is an Oracle database server on some other machine, and another is a DBD::CSV database that stores the data in a bunch of plain text files on the local disk.
Your program sends a query to DBI, which forwards it to the appropriate DBD module; let's say it's DBD::Oracle. DBD::Oracle knows how to translate what it gets from DBI into the format demanded by the Oracle library, which is built into it. The library forwards the request across the network, gets the results back, and returns them to DBD::Oracle. DBD::Oracle returns the results to DBI as a Perl data structure. Finally, your program can get the results from DBI.
On the other hand, suppose that your program was querying the text files. It would prepare the same sort of query in exactly the same way, and send it to DBI in exactly the same way. DBI would see that you were trying to talk to the DBD::CSV database and forward the request to the DBD::CSV module. The DBD::CSV module has Perl functions in it that tell it how to parse SQL and how to hunt around in the text files to find the information you asked for. It then returns the results to DBI as a Perl data structure. Finally, your program gets the results from DBI in exactly the same way that it would have if you were talking to Oracle instead.
There are two big wins that result from this organization. First, you don't have to worry about the details of hunting around in text files or talking on the network to the Oracle server or dealing with Oracle's library. You just have to know how to talk to DBI.
Second, if you build your program to use Oracle, and then the following week upper management signs a new Strategic Partnership with Sybase, it's easy to convert your code to use Sybase instead of Oracle. You change exactly one line in your program, the line that tells DBI to talk to DBD::Oracle, and have it use DBD::Sybase instead. Or you might build your program to talk to a cheap, crappy database like MS Access, and then next year when the application is doing well and getting more use than you expected, you can upgrade to a better database next year without changing any of your code.
There are DBD modules for talking to every important kind of SQL database. DBD::Oracle will talk to Oracle, and DBD::Sybase will talk to Sybase. DBD::ODBC will talk to any ODBC database including Microsoft Acesss. (ODBC is a Microsoft invention that is analogous to DBI itself. There is no DBD module for talking to Access directly.) DBD::CSV allows SQL queries on plain text files. DBD::mysql talks to the excellent MySQL database from TCX DataKonsultAB in Sweden. (MySQL is a tremendous bargain: It's $200 for commercial use, and free for noncommerical use.)
Example of How to Use DBI
Here's a typical program. When you run it, it waits for you to type a last name. Then it searches the database for people with that last name and prints out the full name and ID number for each person it finds. For example:
Enter name> Noether
118: Emmy Noether
Enter name> Smith
3: Mark Smith
28: Jeff Smith
Enter name> Snonkopus
No names matched `Snonkopus'.
Enter name> ^D
Here is the code:
use DBI;
my $dbh = DBI->connect('DBI:Oracle:payroll')
or die "Couldn't connect to database: " . DBI->errstr;
my $sth = $dbh->prepare('SELECT * FROM people WHERE lastname = ?')
or die "Couldn't prepare statement: " . $dbh->errstr;
print "Enter name> ";
while ($lastname = <>) { # Read input from the user
my @data;
chomp $lastname;
$sth->execute($lastname) # Execute the query
or die "Couldn't execute statement: " . $sth->errstr;
# Read the matching records and print them out
while (@data = $sth->fetchrow_array()) {
my $firstname = $data[1];
my $id = $data[2];
print "\t$id: $firstname $lastname\n";
}
if ($sth->rows == 0) {
print "No names matched `$lastname'.\n\n";
}
$sth->finish;
print "\n";
print "Enter name> ";
}
$dbh->disconnect;
use DBI;
This loads in the DBI module. Notice that we don't have to load in any DBD module. DBI will do that for us when it needs to.
my $dbh = DBI->connect('DBI:Oracle:payroll');
or die "Couldn't connect to database: " . DBI->errstr;
The connect call tries to connect to a database. The first argument, DBI:Oracle:payroll, tells DBI what kind of database it is connecting to. The Oracle part tells it to load DBD::Oracle and to use that to communicate with the database. If we had to switch to Sybase next week, this is the one line of the program that we would change. We would have to change Oracle to Sybase.
payroll is the name of the database we will be searching. If we were going to supply a username and password to the database, we would do it in the connect call:
my $dbh = DBI->connect('DBI:Oracle:payroll', 'username', 'password')
or die "Couldn't connect to database: " . DBI->errstr;
If DBI connects to the database, it returns a database handle object, which we store into $dbh. This object represents the database connection. We can be connected to many databases at once and have many such database connection objects.
If DBI can't connect, it returns an undefined value. In this case, we use die to abort the program with an error message. DBI->errstr returns the reason why we couldn't connect—``Bad password'' for example.
my $sth = $dbh->prepare('SELECT * FROM people WHERE lastname = ?')
or die "Couldn't prepare statement: " . $dbh->errstr;
The prepare call prepares a query to be executed by the database. The argument is any SQL at all. On high-end databases, prepare will send the SQL to the database server, which will compile it. If prepare is successful, it returns a statement handle object which represents the statement; otherwise it returns an undefined value and we abort the program. $dbh->errstr will return the reason for failure, which might be ``Syntax error in SQL''. It gets this reason from the actual database, if possible.
The ? in the SQL will be filled in later. Most databases can handle this. For some databases that don't understand the ?, the DBD module will emulate it for you and will pretend that the database understands how to fill values in later, even though it doesn't.
print "Enter name> ";
Here we just print a prompt for the user.
while ($lastname = <>) { # Read input from the user
...
}
This loop will repeat over and over again as long as the user enters a last name. If they type a blank line, it will exit. The Perl <> symbol means to read from the terminal or from files named on the command line if there were any.
my @data;
This declares a variable to hold the data that we will get back from the database.
chomp $lastname;
This trims the newline character off the end of the user's input.
$sth->execute($lastname) # Execute the query
or die "Couldn't execute statement: " . $sth->errstr;
execute executes the statement that we prepared before. The argument $lastname is substituted into the SQL in place of the ? that we saw earlier. execute returns a true value if it succeeds and a false value otherwise, so we abort if for some reason the execution fails.
while (@data = $sth->fetchrow_array()) {
...
}
fetchrow_array returns one of the selected rows from the database. You get back an array whose elements contain the data from the selected row. In this case, the array you get back has six elements. The first element is the person's last name; the second element is the first name; the third element is the ID, and then the other elements are the postal code, age, and sex.
Each time we call fetchrow_array, we get back a different record from the database. When there are no more matching records, fetchrow_array returns the empty list and the while loop exits.
my $firstname = $data[1];
my $id = $data[2];
These lines extract the first name and the ID number from the record data.
print "\t$id: $firstname $lastname\n";
This prints out the result.
if ($sth->rows == 0) {
print "No names matched `$lastname'.\n\n";
}
The rows method returns the number of rows of the database that were selected. If no rows were selected, then there is nobody in the database with the last name that the user is looking for. In that case, we print out a message. We have to do this after the while loop that fetches whatever rows were available, because with some databases you don't know how many rows there were until after you've gotten them all.
$sth->finish;
print "\n";
print "Enter name> ";
Once we're done reporting about the result of the query, we print another prompt so that the user can enter another name. finish tells the database that we have finished retrieving all the data for this query and allows it to reinitialize the handle so that we can execute it again for the next query.
$dbh->disconnect;
When the user has finished querying the database, they type a blank line and the main while loop exits. disconnect closes the connection to the database.
Cached Queries
Here's a function which looks up someone in the example table, given their ID number, and returns their age:
sub age_by_id {
# Arguments: database handle, person ID number
my ($dbh, $id) = @_;
my $sth = $dbh->prepare('SELECT age FROM people WHERE id = ?')
or die "Couldn't prepare statement: " . $dbh->errstr;
$sth->execute($id)
or die "Couldn't execute statement: " . $sth->errstr;
my ($age) = $sth->fetchrow_array();
return $age;
}
It prepares the query, executes it, and retrieves the result.
There's a problem here though. Even though the function works correctly, it's inefficient. Every time it's called, it prepares a new query. Typically, preparing a query is a relatively expensive operation. For example, the database engine may parse and understand the SQL and translate it into an internal format. Since the query is the same every time, it's wasteful to throw away this work when the function returns.
Here's one solution:
{ my $sth;
sub age_by_id {
# Arguments: database handle, person ID number
my ($dbh, $id) = @_;
if (! defined $sth) {
$sth = $dbh->prepare('SELECT age FROM people WHERE id = ?')
or die "Couldn't prepare statement: " . $dbh->errstr;
}
$sth->execute($id)
or die "Couldn't execute statement: " . $sth->errstr;
my ($age) = $sth->fetchrow_array();
return $age;
}
}
There are two big changes to this function from the previous version. First, the $sth variable has moved outside of the function; this tells Perl that its value should persist even after the function returns. Next time the function is called, $sth will have the same value as before.
Second, the prepare code is in a conditional block. It's only executed if $sth does not yet have a value. The first time the function is called, the prepare code is executed and the statement handle is stored into $sth. This value persists after the function returns, and the next time the function is called, $sth still contains the statement handle and the prepare code is skipped.
Here's another solution:
sub age_by_id {
# Arguments: database handle, person ID number
my ($dbh, $id) = @_;
my $sth = $dbh->prepare_cached('SELECT age FROM people WHERE id = ?')
or die "Couldn't prepare statement: " . $dbh->errstr;
$sth->execute($id)
or die "Couldn't execute statement: " . $sth->errstr;
my ($age) = $sth->fetchrow_array();
return $age;
}
Here the only change to to replace prepare with prepare_cached. The prepare_cached call is just like prepare, except that it looks to see if the query is the same as last time. If so, it gives you the statement handle that it gave you before.
Transactions
Many databases support transactions. This means that you can make a whole bunch of queries which would modify the databases, but none of the changes are actually made. Then at the end you issue the special SQL query COMMIT, and all the changes are made simultaneously. Alternatively, you can issue the query ROLLBACK, in which case all the queries are thrown away.
As an example of this, consider a function to add a new employee to a database. The database has a table called employees that looks like this:
FIRSTNAME LASTNAME DEPARTMENT_ID
Gauss Karl 17
Smith Mark 19
Noether Emmy 17
Smith Jeff 666
Hamilton William 17
and a table called departments that looks like this:
ID NAME NUM_MEMBERS
17 Mathematics 3
666 Legal 1
19 Grounds Crew 1
The mathematics department is department #17 and has three members: Karl Gauss, Emmy Noether, and William Hamilton.
Here's our first cut at a function to insert a new employee. It will return true or false depending on whether or not it was successful:
sub new_employee {
# Arguments: database handle; first and last names of new employee;
# department ID number for new employee's work assignment
my ($dbh, $first, $last, $department) = @_;
my ($insert_handle, $update_handle);
my $insert_handle =
$dbh->prepare_cached('INSERT INTO employees VALUES (?,?,?)');
my $update_handle =
$dbh->prepare_cached('UPDATE departments
SET num_members = num_members + 1
WHERE id = ?');
die "Couldn't prepare queries; aborting"
unless defined $insert_handle && defined $update_handle;
$insert_handle->execute($first, $last, $department) or return 0;
$update_handle->execute($department) or return 0;
return 1; # Success
}
We create two handles, one for an insert query that will insert the new employee's name and department number into the employees table, and an update query that will increment the number of members in the new employee's department in the department table. Then we execute the two queries with the appropriate arguments.
There's a big problem here: Suppose, for some reason, the second query fails. Our function returns a failure code, but it's too late, it has already added the employee to the employees table, and that means that the count in the departments table is wrong. The database now has corrupted data in it.
The solution is to make both updates part of the same transaction. Most databases will do this automatically, but without an explicit instruction about whether or not to commit the changes, some databases will commit the changes when we disconnect from the database, and others will roll them back. We should specify the behavior explicitly.
Typically, no changes will actually be made to the database until we issue a commit. The version of our program with commit looks like this:
sub new_employee {
# Arguments: database handle; first and last names of new employee;
# department ID number for new employee's work assignment
my ($dbh, $first, $last, $department) = @_;
my ($insert_handle, $update_handle);
my $insert_handle =
$dbh->prepare_cached('INSERT INTO employees VALUES (?,?,?)');
my $update_handle =
$dbh->prepare_cached('UPDATE departments
SET num_members = num_members + 1
WHERE id = ?');
die "Couldn't prepare queries; aborting"
unless defined $insert_handle && defined $update_handle;
my $success = 1;
$success &&= $insert_handle->execute($first, $last, $department);
$success &&= $update_handle->execute($department);
my $result = ($success ? $dbh->commit : $dbh->rollback);
unless ($result) {
die "Couldn't finish transaction: " . $dbh->errstr
}
return $success;
}
We perform both queries, and record in $success whether they both succeeded. $success will be true if both queries succeeded, false otherwise. If the queries succeded, we commit the transaction; otherwise, we roll it back, cancelling all our changes.
The problem of concurrent database access is also solved by transactions. Suppose that queries were executed immediately, and that some other program came along and examined the database after our insert but before our update. It would see inconsistent data in the database, even if our update would eventually have succeeded. But with transactions, all the changes happen simultaneously when we do the commit, and the changes are committed automatically, which means that any other program looking at the database either sees all of them or none.
do
If you're doing an UPDATE, INSERT, or DELETE there is no data that comes back from the database, so there is a short cut. You can say
$dbh->do('DELETE FROM people WHERE age > 65');
for example, and DBI will prepare the statement, execute it, and finish it. do returns a true value if it succeeded, and a false value if it failed. Actually, if it succeeds it returns the number of affected rows. In the example it would return the number of rows that were actually deleted. (DBI plays a magic trick so that the value it turns is true even when it is 0. This is bizarre, because 0 is usually false in Perl. But it's convenient because you can use it either as a number or as a true-or-false success code, and it works both ways.)
AutoCommit
If your transactions are simple, you can save yourself the trouble of having to issue a lot of commits. When you make the connect call, you can specify an AutoCommit option which will perform an automatic commit operation after every successful query. Here's what it looks like:
my $dbh = DBI->connect('DBI:Oracle:payroll',
{AutoCommit => 1},
)
or die "Couldn't connect to database: " . DBI->errstr;
Automatic Error Handling
When you make the connect call, you can specify a RaiseErrors option that handles errors for you automatically. When an error occurs, DBI will abort your program instead of returning a failure code. If all you want is to abort the program on an error, this can be convenient:
my $dbh = DBI->connect('DBI:Oracle:payroll',
{RaiseError => 1},
)
or die "Couldn't connect to database: " . DBI->errstr;
Don't do This
People are always writing code like this:
while ($lastname = <>) {
my $sth = $dbh->prepare("SELECT * FROM people
WHERE lastname = '$lastname'");
$sth->execute();
# and so on ...
}
Here we interpolated the value of $lastname directly into the SQL in the prepare call.
This is a bad thing to do for three reasons.
First, prepare calls can take a long time. The database server has to compile the SQL and figure out how it is going to run the query. If you have many similar queries, that is a waste of time.
Second, it will not work if $lastname contains a name like O'Malley or D'Amico or some other name with an '. The ' has a special meaning in SQL, and the database will not understand when you ask it to prepare a statement that looks like
SELECT * FROM people WHERE lastname = 'O'Malley'
It will see that you have three 's and complain that you don't have a fourth matching ' somewhere else.
Finally, if you're going to be constructing your query based on a user input, as we did in the example program, it's unsafe to simply interpolate the input directly into the query, because the user can construct a strange input in an attempt to trick your program into doing something it didn't expect. For example, suppose the user enters the following bizarre value for $input:
x' or lastname = lastname or lastname = 'y
Now our query has become something very surprising:
SELECT * FROM people WHERE lastname = 'x'
or lastname = lastname or lastname = 'y'
The part of this query that our sneaky user is interested in is the second or clause. This clause selects all the records for which lastname is equal to lastname; that is, all of them. We thought that the user was only going to be able to see a few records at a time, and now they've found a way to get them all at once. This probably wasn't what we wanted.
Postgresql8.1
的索引
索引是提高数据库性能的常用方法。索引可以大大加速对特定字段机器组合的检索或查找速度,但与此同时又会增加数据库的负荷,所以如何有效的利用索引成为了一名
DBA
必须掌握的技术。
介绍
对于下面这样一个表:
Create table test1(
Id integer,
Content varchar
)
在该表上将大量采用如下查询:
Select content from test1 where id=constant;
(
constant
表示一个常熟)
通常数据库不得不一行一行地扫描整个
test1
表来寻找所有匹配的数据。如果,最终得到的数据只占
test1
中很小的一部分,那么使用如上的方法显然很低效。如果我们在
id
列上维护一个索引用于定位匹配的行,那么数据库将只需通过检索搜查树上的若干层就可以得到所需的结果。
下面的命令将在
id
上建立索引:
Create index test1_id_index on test1(id);
当然,我们可以自由地给索引取一个名字,只要之后可以方便地想起来就行。
删除索引地命令:
Drop index test1_id_index;
很大程度上,一个有效的索引是建立在事前分析的基础上的,一旦建立了一个索引,就不需要再有太多地干预了,不过也可以定期使用
analyze
命令来更新统计数据,使数据库查询规划器能做出有效的判断。
除了需要避免不必要的索引对系统带来过多的负荷,索引无论使对查找、删除还是更新,都将带来一定程度上性能的提升。
索引的类型
关键字:操作符集
一定类型的索引只能在相关操作符集中触发
Postgresql8.1
提供
b-tree
,
r-tree
,
hash
,
gist
这几种索引类型。每一种索引,因为所使用的算法不同,所以适用的场合也有所不同。缺省的情况下,系统将使用
b-tree
索引,它也可以适用于大多数的情况。
b-tree
可以处理按照某种顺序排列的数据的等值或范围查询,当索引所在列使用一下几种比较符号时,查询规划器将考虑使用索引。
>
,
<
,
>=
,
<=
,
=
构造等效于这些符号的组合也将会启动索引,如
between
,
in
等。需要注意的使
is null
不等于
=
,所以不会启动索引。当匹配模式是一个常量并且常量位于开头处时,如:
col like ‘foo%’
、
col ~ ‘^foo’
,也将启动索引。
r-tree
适用与处理二维空间的数据时,要创建一个
r-tree
索引可以使用如下命令:
create indes INDEX_NAME on TABLE_NAME using rtree (COLUMN_NAME);
当一个索引了的列涉及到使用下列操作符之一进行比较的时候,
PostgreSQL
的查询规划器都会考虑使用
R-tree
索引∶
<<
,
&<
,
&>
,
>>
,
<<|
,
&<|
,
|>&
,
|>>
,
~
,
@
,
~=
,
&&
散列(
hash
)索引只能处理简单的等值查询。就目前而言,由于
hash
索引的性能不比
b-tree
的好,再加上它的尺寸较大、制作时间较长,并且不支持
wal
日志,所以不建议使用。
Gist
不是一种单独的索引类型,而是一种架构,可以在这种架构上实现不同的索引策略。根据策略的不同,可以使用使用
gist
索引的操作符也不尽相同。需要注意的是,
gist
上有实现
r-tree
的一组操作符,而且支持
wal
日志,因此比
r-tree
更加适用。
多字段日志
关键字:前导字段
索引的前导字段对查询性能起着至关重要的作用
一个索引可以定义在表中的多个字段上。比如,你有如下的表:
Create table test2(
Major int,
Minor int,
Name varchar
);
并且经常使用如下的查询语句:
Select * from test2 where major=X and minor=Y;
那么在
major
和
mino
上定义一个索引会比较合适,
Create index multi_index on test2(major,minor);
目前只有
b-tree
和
gist
支持多字段索引,缺省只有
32
个字段。可以在编译
postgresql
之前,通过修改
pg_config_manual.h
中的相关参数来改变。
只要在查询条件中包含了多字段索引的子集或者全部,那么就有可能触发索引,但是只有在查询条件中使用索引的最前端字段(前导字段)时,索引才最为有效。准确的规则是在索引的前导字段中使用等值约束,再在第二个字段上加上一个非等值查询,这样将限制需要扫描索引的范围。在这两个字段之后的条件将在索引中查询,这样就减少了对表的操作,但这不会减少需要扫描的索引。
比如:有一个建立在
(a,b,c)
上的索引,有一个查询的条件:
a=5 and b>44 and c<=77
。索引将从符合
a=5 and b=44
条件的第一个条目开始扫描,直到所有符合
a=5
条件的条目。此时,将跳过对
c>=77
的条目的扫描,但之后还是会扫描的。
这个索引也可以用在没有对
a
进行约束的查询上,但这势必会大大降低性能,因为这将扫描整个索引。
只有当前导字段出现在查询条件中时,才会启用
gist
索引。附加的条件将限制返回的索引条目,但前导字段还是起了决定性的作用,它决定需要扫描字段的数量。如果前导索引相应的字段中有较多的相同值,那么即使在附加索引字段上添加查询条件,查询的整体性能依然将不太理想。
必须很谨慎地使用多字段索引。在大多数情况下,单字段索引已足够了,而且还节省了生成时间和数据库空间。除非对某个表的查许非常固定,一般不要使用多余
3
个字段的索引。
多个索引的组合
关键字:
bitmap bitmap
保存按物理位置排序的表中行的入口
总结上面两节,单个索引只有适用于以下这种情况:
查询语句中使用了与索引相关的操作符,并且包含索引对应字段的子集字段的交集。举例说就是,如(
a
,
b
)的索引可以在如查询条件
where a=50 and b=70
的情况下使用,但如果查询条件时
where a=50 or b=70
就无法使用该索引了。
从
8.1
版本开始,
postgresql
就增加了组合多索引的功能来解决单个索引无法处理的情况。如
where x=44 or x=88 or x=43 or x=12
这样的查询,可以分解为
4
个单独的查询,
postgresql
将
4
个单独查询的结果用
or
运算组合在一起得到最终的结果。再来一个例子是,
where x=45 and y=46
,如果再
x
和
y
上各有一个索引,那么
postgresql
将分别在这两个索引上扫描,然后将得到的进行与运算,得到最终的结果。
为了合并多个索引,
postgresql
扫描需要的索引,将符合索引条件的表中行标识在内存中的
bitmap
中,
bitmap
进行与、或运算得到最终的结果。
Bitmap
中记载了按照磁盘顺序排序的表中行的物理位置(查询的结果将对行进行标识),这就意味着任何使用合并的查询将丢失在索引上的排序。如果有
order by
这样的排序语句,那么排序的行为将在查询的最后进行,这就降低了性能。所以,有时候规划器会忽略多索引的存在,而只使用单索引。
在大多数最简单的应用中,可能会有多种索引的组合可供使用,数据库的开发员必须作出合理的选择以优化数据库的性能。有时候,使用多字段索引会很有效,但有时候建立单字段索引再将它们组合起来会更灵活。例如在一表上有对
x
、以、
x
和
y
的三种查询。可以分别在
x
、
y
上建立单独的索引,利用索引组合可以实现对三种查询的优化。也可以使用一个(
x
,
y
)的多字段索引,单它对
y
的查询无能为力。综合两种方法,可以建立一个(
x
,
y
)多字段索引,再加上一个建立在
y
字段上的索引,唯一的缺点是在
x
上的查询性能将弱于建立在
x
上的单独索引。最后一种选择是,建立三种查询的各个索引,只是这会降低更新数据库的性能,因此只能用在查询远多于更新,且三种查询都很普遍的情况下。如果其中有一个字段的索查询量远大于其它的字段及其组合,那么建立单独的两个索引会比较合适。
唯一索引
关键字:实现唯一性的细节
索引可以强制其所对应字段或字段集合的唯一性。
Create unique index NAME on TABLE (COLUMN…)
;
目前,只有
b-tree
支持唯一索引。一旦索引被声明为唯一,那么所对应的字段(集合)就不允许存在重复的值。
Postgresql
在用户为一字段声明唯一或主约束时,自动生成一个唯一索引。
事实上,应当只将唯一索引当作一种实现唯一性的细节,一般情况下使用如下:
Alter table …add constraint…
来实现唯一性约束,而不直接创建唯一索引。
建立在表达式上的索引
关键字:计算值
索引中存储计算值,与单索引的查询速度相同
索引不仅可以建立在对印表的某些字段上,还可以建立在以某些字段的值为参数的函数和表达式上。这个特性对于优化以计算值为查询条件的搜索很有效。
例如,查询一个与大小写无关的比较
:
Select * from test1 wherelower(clo1)=’value’;
相应的,可以建立如下的索引:
Create index test_lower_col1_idx on test1(lower(col1));
如果我们将这个索引声明为唯一索引,那么就可以实现与大小写无关的唯一性约束了。
另一个例子是,如果经常使用如下的查询:
Select * from people where ( first_name||’’||last_name)=’jone smith’;
可以建立如下的索引:
Create index full_name_idx on people((first_name || ‘’ || lastname));
在
create index
语法中,除了单个函数外,对应的字段表达式都要有圆括号包起来。
维护表达式索引的代价相对较大,因为索引中存储的是真正的计算值而不是在查询中计算,虽然这大大提升了相关的查询速度(与单字段索引的速度相同),但每次执行更新或者插入都需要计算出相应的值。所以,这种索引只用在追求查询速度,而对插入和更新不太计较的情况下。
部分索引
关键字:索引条件
查询只有在匹配索引条件时才可能被触发
部分索引是建立在表的子集上的索引,该子集有索引中的条件表达式决定。只有符合表达式条件的表的行才会被建立索引。部分索引是一种特殊的特征,但在一些特定的场合中会很有用。
建立部分索引的主要理由是避免对普便的数据建立索引。因为一个重复的普遍数据可能在表中占据较可观的比例,在其上建立索引不会起到优化性能的结果,事实上规划器甚至不会启动索引。建立部分索引,不但减小了索引的大小,从而提高了查询的速度,而且加速了更新的速度。
假设,在数据库中存储了网络服务器的登陆记录。大多数的记录显示来自于内网的访问。如果我们需要检索的仅仅是来自外网的记录,那么就不必在整张表上建立索引,只需建立部分索引。
Create table access_log(
url varchar,
client_ip inet,
…
);
可建立如下的部分索引:
Create index access_log_client_ip_idx on access_log (client_ip)
Where not (client_ip > inet ‘192.168.100.0’ and client_ip < inet ‘192.168.100.255’);
一个典型的可以使用以上索引的查询是:
Select * from access_log where url = ‘/index.htm’ and client_ip = ‘212.78.10.32’::inet;
一个不能使用该索引的例子:
Select * from access_log where client_ip = inet ‘192.168.100.23’;
注意到像这样的部分索引需要可预计的普遍数值。如果数值的分布因为应用程序的性质而是固定的,或者是不随时间变化的静态值,那么建立这样的部分索引并不困难。但如果普遍数据来自于不固定的大量一致性数据(拥有相同的索引对应字段的值)装载,那么就需要管理员不断地修改索引。
另一个部分索引地应用是剔除某些典型查询不感兴趣的数据。在能收到于以上情况相同的效果的同时,它还避免进入不干兴趣的数据,即使索引对访问这些不感兴趣的数据有好处。建立这种索引必须很小心,应该事先做大量的试验。
例子:
如果有这么一个表,包含了已付款和未付款的订单,未付款的订单只占总数据很小的一部分,但却被很频繁地访问,为优化性能,可以在该表上建立为付款订单的部分索引。索引如下:
Create index orders_unbilled_index on orders(order_nr)
Where billed is not true;
一个可能启动该索引的查询是:
Select * from orders where billed is not true and order<10000;
然而,该索引还可以被不涉及到
order_nr
的查询触发:
Select * from orders where billed is not true and amount>500.00;
该查询显然不会比直接建立在
amount
上的索引高效,因为这相当于在检索一个表,但如果
unbilled
的数据量较少的话,它还是一个不错的选择。
这样的查询将不会使用以上的索引:
Select * from orders where order_nr=3501;
这个例子说明了索引字段不必于条件中涉及的字段相同。
Postgresql
支持任意条件字段的部分索引,只要这些字段出现在表中。要记住的一点是,只有在索引中的条件符合查询条件时,部分索引才会触发。更准确地说,只有在查询条件中,以数学上地意义包含了部分索引的条件,该索引才会起效。至今,
postgresql
还没有一套复杂的理论来保证系统可以识别形式不同、但却有相同意义的数学表达式。(这样功能的实现过于复杂,而且在真正应用中会很慢。)系统可以自动识别简单的不等式包含,比如:“
x<1
”包含在“
x<2
”中,除此之外必须提供精确匹配索引条件的查询条件。
第三种情况不需要部分索引被查询使用,主要用于在字段的部分数据上建立唯一约束。
Create table tests(
Subject text,
target text,
success Boolean,
…..
)
Create unique index tests_success_constraint on tests (subject, target) where success;
最后,部分索引可以重载系统的查询规划选择。这发生在因为数据分布较为特殊,而使系统在不该使用索引的时候使用了索引。这时可以建立部分索引使得它在此类查询中不可用。
需要注意的是,建立部分索引意味着索引建立者必须知道查询,特别是要知道索引在什么情况下是有用的。要完全掌握需要丰富的经验,并且要理解
postgresql
索引的工作原理。大多数情况下,部分索引不比一个普通的索引更显优势。