219

Chapter 15:

Data Storage

"Where is the wisdom? Lost in the knowledge.
Where is the knowledge? Lost in the information."

- T.S. Eliot

"Where is the information? Lost in the data.
Where is the data? Lost in the #@$%?!& database."

- Joe Celko

Perl has enough kinds of internal data structures to accomodate almost any
needs a progammer might have. Similarly, it offers a variety of waysto
access external ("persistent") data storage:

  • close, getc, open, print, printf, read, seek, <>,...

These routines allow files to be written (read, etc.) in a buffered
manner, as by the Unix
stdio(standard I/O) library.

  • format, formline, write,...

These routines also do buffered I/O, but use a controlling format specifi-
cation. They have several features that make them handy for creating
multi-page business reports, etc.

tie, untie, ...

These routines allow Perl data elements (e.g., scalars, lists, hashes) to
be "tied" to arbitrary routines. They are often used to deal with access
methods such as DBM (Data Base Management) or ISAM (Indexed Se-
quential Access Method).


IMAGE imgs/330.Data_Storage01.gif
  • dbmopen, dbmclose, ...

These routines allow hashes to be tied to Unix-style dbm(3)data files.
When the program exits, the files (and thus the hashes) persist. The
Camel book deprecates these calls (in favor of using
tie()/ untie()
with DBM modules), but we find them to be simple and convenient.

  • sysclose, sysopen, sysread, syswrite, ...

These routines, which are seldom needed, allow the use of unbuffered
("raw") I/O, as by the Unix
read(2)system call.

Any of these forms of storage may be used to create a database (a grouping
of related information in a computer-readable format). An arbitrary file,
however, does not qualify as a database unless it is organized in a manner
that assists the computer in using its contents.

Most of the time, for instance, an article from the New York Times could not
be considered a database. Even if the article contained a list of facts, the
computer (for the most part) could not read those facts and use them in an
organized fashion. A file containing several articles could be a database,
however, if the articles were indexed by author, topic, etc.

As the size and complexity of databases grow, hand-crafted solutions can
become unwieldy. At some point, a programmer may decide to turn to a data
base management system
(DBMS). Alternatively, it may be necessary to
interact with a DBMS for other reasons. For instance, the database may
already be stored under FileMaker or Oracle. Fortunately, Perl is very good
at dealing with data base management systems, of whatever type:

  • big, expensive servers (e.g., Oracle and Sybase)
  • smaller-scale SQL servers (e.g., mSQL)
  • end-user solutions (e.g., 4D and FileMaker)

We won't get into the differences between these systems, which are many
and detailed, but we can and will discuss some of their common features.

Database Basics

Databases tend to have certain things in common. Most databases are
organized into records(collections of data describing a given entity). Each
record may have one or morefields (single data values).


IMAGE imgs/330.Data_Storage02.gif

Let's consider a database of information on pets.1Each pet's name occupies a
field, of course, but we also want to know the pet's type, breed, and weight
(in pounds). So, we add three more fields:

NameTypeBreedWeight
---------------------------------
BanjodogLabrador110
FestercatAbyssinian12
IrisdogRetriever45

In many databases, you must select a key for the database. A key is a spe-
cial field that must always be unique; no two records can have the same
value for that field.
2Key choice can be tricky: if we use the Namefield as
the key, we can't allow more than one pet to have the same name.

Let's add a field for details regarding each type of pet. Putting this infor-
mation into each pet's record wouldbe a waste of space, but we can add
another tableto the database, or create a new database entirely.
3

TypeFoodDesignation
-----------------------------------
dogdog foodCanis familiaris
catcat foodFelis domestica

In Perl, these tables might be represented as multi-dimensional hashes:

$by_name{'Banjo'}{'Type'}= 'dog';
$by_type{'dog'}{'Designation'} = 'Canis familiaris';
...
$type = $by_name{'Banjo'}{'Type'};
print "$by_type{$type}{'Designation'}\n";

Displays:

Canis familiaris

IMAGE imgs/330.Data_Storage03.gif

1This is only a representation of a database, not necessarily an example of what a
database file might look like internally.
2The resemblance to a Perl hash is not coincidental. In fact, hashes may be tied to DBM
files, yielding a simple form of database.
3Some databases allow multiple tables in one database. Simpler databases, like text
files and DBMs, do not. But then, you can always have multiple files (TMTOWTDI :-).


IMAGE imgs/330.Data_Storage04.gif

That is, %by_nameis a hash whose indices are pet names and whose
elements are also (references to) hashes. We can thus index one of these
elements by the text string
'Type', storing 'dog'in that position. Using
similar techniques, we can store and retrieve other related data.
4

But how does this data get saved to or read from a file? More to the point,
how can we get it into a DBMS? As it happens, that is our next topic.

Delimited Text

Different programs store their databases in different (often proprietary
and/or undocumented) formats. Consequently, it is seldom advisable for a
Perl program to read from (let alone write to) one of these databases.

Data exchange is a fact of life, however, so DBMS vendors tend to provide
ways to exportand importdata. This requires a file format which clearly
delineates the database's records and fields. By choosing delimiters (e.g.,
commas, newlines, and tabs) that are not found in the data, vendors can pro-
vide a format for export files, while keeping their own formats private.

If it were exported into delimited text, our pet database might look like:

Banjo,Dog,Labrador,110
Fester,Cat,Abyssinian,12
Iris,Dog,Retriever,45

This format is easy for Perl scripts to read and parse:

while (defined($line = <FILE>)) {
chomp($line);
@line = split(/,/, $line);
printf("My %s %s, a %s, weighs %s lbs.\n",
$line[1], $line[0], $line[2], $line[3]);
}

Displays:

My dog Banjo, a Labrador, weighs 110 lbs.
...

IMAGE imgs/330.Data_Storage03.gif

4See the perlref, perldsc, and perllolmanual pages for more information on multi-
dimensional data structures.


IMAGE imgs/330.Data_Storage06.gif

Because text fields might themselves contain delimiters, many export
formats use single or double quotes to enclose text fields, as:

"Banjo","Dog","Labrador",110
"Fester","Cat","Abyssinian",12
"Iris","Dog","Retriever",45

This solves a data representation problem, but it creates a programming
problem: our simple
split()no longer works. Fortunately, MacPerl comes
with a
quotewordsfunction (in Text::ParseWords) that splits text on a
specified delimiter, stripping out any quote marks. Using this function, we
can write something similar to what we used before:
5

use Text::ParseWords;
while (defined($line = <FILE>)) {
chomp($line);
@line = quotewords(',', 0, $line);
printf("My %s %s, a %s, weighs %s lbs.\n",
$line[1], $line[0], $line[2], $line[3]);
}

Array numbers aren't particularly memorable, however. Fortunately, this
is Perl, so There's More Than One Way To Do It:

($name, $type, $breed, $weight) =
quotewords(',', 0, $line);
printf("My %s %s, a %s, weighs %s lbs.\n",
$type, $name, $breed, $weight);

If you really need to keep the values in an array, there are still ways to
resolve the problem. Just create a named item - a scalar, subroutine, or hash
key - that corresponds to the array index:

sub name { return 0 }
$type = 1;
$pet{'breed'} = 2;
printf(
"My %s %s is a %s.\n",
$line[$type],
$line[name()],

# set up a function
# set up a scalar
# set up a hash

# index by a scalar
# index by a function

IMAGE imgs/330.Data_Storage03.gif

5If the fields were delimited by tabs instead of commas, we would pass "\t"to
quotewords()instead of a comma.


IMAGE imgs/330.Data_Storage08.gif

$line[$pet{'breed'}]
);

# index by a hash

Creating a comma-delimited text file is a simple matter:

printf NEWFILE ("%s,%s,%s,%s\n", @line);

Or, if we need embedded quotes:

printf NEWFILE (qq{"%s","%s","%s","%s"\n}, @line);

This type of database has a very flexible structure. You can choose any key
you want, multiple keys, or no keys at all. You can have any number of rec-
ords and fields, and any amount of data (limited only by your storage). Most
proprietary databases can export to this type of database format, and it is
portable and free to use.

Unfortunately, it is also slow and a bit awkward to use. So, for larger
databases that will be accessed frequently, we recommend the use of other
database types.

DBM

Data Base Management(DBM) routines are very efficient at finding a
given record in a database. Typically, they use only two disk accesses to
find a given key, no matter how large the DBM file.

DBM files are more restrictive than delimited text files, however. Like
hashes, they allow only two fields per record: a key and a value.

Note:The two-field limitation is not as severe as it might seem. You
can open more than one DBM, if you wish, and you can also encode
multiple scalars into each "value":

$hash{'Banjo'} = '"Dog","Labrador",110';
@fields = quotewords(',', 0, $hash{'Banjo'});

Some types of DBM files put restrictions on assorted (e.g., field, record, or
file) sizes. The material below, however, concentrates on the
DB_File
module, which does not impose size restrictions outside of those imposed by
your system.
6

IMAGE imgs/330.Data_Storage03.gif

6DBM access is always fast, but creating large DBMs with MacPerl can be slow.


IMAGE imgs/330.Data_Storage10.gif

AnyDBM_File

Only two DBM modules come with MacPerl: DB_Fileand NDBM_File.
NDBM_Filein MacPerl is an emulation layer on top of the same routines
DB_Fileis based on, but the two modules have different capabilities.
Since they use the same underlying routines, however, the files are com-
patible with each other.

Not all versions of Perl have every DBM module. All Unix versions of Perl
come with
SDBM_File, and many come with ODBM_Fileand GDBM_File.
MacPerl has none of these, however, and not all Unix versions of Perl have
DB_Fileor NDBM_File. So, how does one assure portability?

If a program depends on certain features of a given DBM type (such as arbi-
trarily large sizes), portability cannot be ensured. But for many cases, your
program can remain agnostic about the type of DBM being used.

AnyDBM_Fileoffers a true virtual base class, with no methods or data of
its own. It attempts to inherit from otherDBM modules:
NDBM_Fileis
tried first (for compatibility with
dbmopen()in Perl 4), then DB_File,
then by the others. If you use
AnyDBM_File, your code should work no
matter which DBM modules are present on a given machine. The search
order can be changed; see the pod documentation for details.

Note:This allows portability of code, but not of files produced by the
code. DBM files are not necessarily portable between machines!

tie()

Perl supplies a mechanism, called tie(), that can hide complex data rou-
tines behind a data structure.
7For example, the Mac::MoreFilesmodule
provides a tied hash called
%Application. When the hash is accessed,
Perl automatically (and invisibly) calls the
FSpDTGetAPPL()function to
find the path to the given application in the filesystem. So, the calling
program doesn't need to know about the function.

DBM files are accessed in a similar way. A hash is tied to a database file
via a DBM module. First, you have to
usethe module. Then, tie()is
passed the hash, the name of the module, the filename, the database

IMAGE imgs/330.Data_Storage03.gif

7Read perltiefor more information.


IMAGE imgs/330.Data_Storage12.gif

flags,8and the mode (Unix-style file permissions) value.9When we want to
close the file, we just
untie()the hash.

#!perl -w
use DB_File;
my(%hash1, %hash2);

tie(%hash1, 'DB_File', ':my_dbm', O_RDWR|O_CREAT,
0644);
$hash1{'key'} = 'value1';
untie(%hash1);

tie(%hash2, 'DB_File', ':my_dbm', O_RDONLY, 0644);
$hash2{'key'} = 'value2';
print $hash2{'key'};
untie(%hash2);

Displays:

value1

We are planning on writing to the database file for %hash1, so we use flags
that direct MacPerl to open it in read/write mode, creating the file if neces-
sary. Because we opened the file for
%hash2in read-only mode, we cannot
assign any values to
%hash2. So, when we try to assign a value to $hash2{
'key'}
, it fails (silently, unlike the behavior of most other DBMs).

Database Servers

High performance databases are usually implemented as servers. This lets
multiple users log in and perform functions on the data. Oracle might well
be the most popular database server; Informix, Ingres, Postgres, Sybase, and
others are also very popular.

Most of these servers use SQL(Structured Query Language) to perform oper-
ations on the data. Many databases, usually for Windows, also support
ODBC(Open Data Base Connectivity) to interact with external programs.

IMAGE imgs/330.Data_Storage03.gif

8These flags (e.g., O_RDWR) are implemented as function calls, but common practice for
constant usage is to leave the parentheses (e.g., O_RDWR()) off, except when necessary.
9Because Macs do not use file permissions, we can use 0 as the mode. If you want the
program to be portable to Unix machines, you should supply an appropriate value.


IMAGE imgs/330.Data_Storage14.gif

Most of this functionality is provided by a single module and set of drivers
known as
DBI(Data Base Interface).10DBIis a standardized API(Appli-
cation Programming Interface
) for accessing databases. Using driver mod-
ules called
DBDs(Data Base Drivers) , a programmer can connect with all
kinds of SQL and ODBC database servers.

If we are using mysql, a freely available SQL database server, we would
connect to it using
DBIand DBD::mysql. If we decide to upgrade to an Ora-
cle server later, we could change a few lines of code to use
DBD::Oracle,
and need few (or possibly no) other code changes.

DBImakes accessing databases of all kinds quite simple. Unfortunately,
DBIis not presently available for MacPerl. A client library has to be ported
to Mac OS, in order to access the server on another machine, then the corres-
ponding
DBDhas to be ported to MacPerl, as well.

Client libraries for several databases do exist on Mac OS, including Oracle
and mSQL. A port of
DBIand some DBDs might be coming soon. Until then,
there is no easy way with MacPerl to interact with database servers on
other machines. (TCP/IP and sockets are, however, a possibility ...)

If the server is on the same machine as MacPerl, you might be able to com-
municate with it using Apple Events. Oracle supports this capability.

Data Formats

Although database export formats are considerably easier to parse than
internal database formats, they can still contain some peculiarities. Worse,
these peculiarities are seldom documented in the database system's user
documentation. Here are some notes on FileMaker Pro, by way of example.

FileMaker Pro's export files use carriage returns and tabs to delimit records
and fields. So, the basic parsing code is pretty straightforward:

while (defined($line = <FILE>)) {
chomp($line);
@fields = split(/\t/, $line);
}

IMAGE imgs/330.Data_Storage03.gif

10For more information about DBI, see www.hermetica.com/technologia/DBI/.


IMAGE imgs/330.Data_Storage16.gif

If a field holds multiple values, however, a special delimiter (\035) is
used. Also, text fields may use some oddball characters (e.g., curly apostro-
phes) which don't work well with other applications or systems (e.g.,
HTML), so you may want to edit them out.

There are far too many Macintosh (and other!) applications, however, for
us to have any chance of covering them all. So, here's some general advice:

  • Examine the application's documentation. Although many programs do
    not document their export formats, some do, and you could get lucky.
  • Contact the vendor. The vendor may have developed a technical note
    that describes their export format; it certainly can't hurt to ask.
  • Ask around. If there is a USENET newsgroup or email list for the appli-
    cation, someone on it may have the information you need. Don't forget
    to ask the MacPerl list, while you're at it.
  • Export a file and examine it. BBEdit is surprisingly useful on arbitrary
    binary files. Alternatively, a small Perl script can tell you quite a lot.

Here is a sample file analysis script; tweak it to fit your needs:

#!perl
$/ = ''; $text = <>;
foreach $ord (unpack('C*', $text)) { $m[$ord]++; }

print("occursoctdec\n");
foreach $ord (0 .. 31, 127 .. 255) {
printf("%6d%03o%3d\n", $m[$ord], $ord, $ord)
if ($m[$ord] != 0);
}

Displays:

occursoctdec
249840119
5201311
...

Just remember, we cannot stress it enough (especially when dealing with
data storage): There's More Than One Way To Do It.

Copyright © 1997-1998 by Prime Time Freeware. All Rights Reserved.