
Chapter 2:

A Dip Into Data

Representation is the essence of programming.

-Fred Brooks, The Mythical Man-Month

The Macintosh user interface tends to blur the distinctions between data
files and programs. Compounding the problem, Apple uses terminology that
is at variance with that used by most other operating system vendors.
Consequently, it is not surprising that many Macintosh users get bewildered
when they step outside of this somewhat sheltered environment.

Double-click on a Mac OS document (data file) and the Finder will locate
and start up the appropriate application (program). Double-click on an
application and it will start up, often asking for a document. Double-click
on a folder (directory) and the Finder will show you a representation of the
enclosed items.

This is all very convenient, but it muddies the conventional distinctions
between nouns (data files) and verbs (programs). So, let's discuss some data-
related terminology and concepts. Don't worry about remembering every
detail; you can always refer back here if you get lost ...

Data[1] is an encoded form of information.[2] The symbols you are reading right
now (commonly known as the alphabet) are thus data elements, used to
encode the information we are trying to convey. In any encoding system, the
context determines whether and how the code should be interpreted.

The character sequences "plume" and "rouge", for instance, have different
(though related) meanings in the English and French languages. Even in
English, many character sequences (e.g., "lead" and "read") depend upon
context for both meaning and pronunciation.

[1] We use "data" for both singular and plural, eschewing "datum" (almost) entirely.
[2] That is, it has been turned into symbols for storage and/or transmission. The exact
smell of a rose, in contrast, is information that is seldom, if ever, encoded into data.

Bits And Bytes

In computerese, data generally consists of sequences of bits: binary (base-2)
numbers.[3] Thus, people speak of computers as understanding only ones and
zeros. But, given our previous discussion, what does it mean for a computer
to "understand" something? Predictably, the answer is: "It depends."

For engineering reasons, most computers store and manipulate bits in groups
of 8, 16, 32, etc. Bytes (groups of eight bits) are, in general, the smallest
useful aggregation. Because a byte can hold any of 256 unique values, bytes are
commonly used to represent text characters.[4]

Let's say that, in looking through a computer's memory, we encounter a pair
of 8-bit bytes: 01101001 01100110. Well, that's a little hard to read, but we
can make things easier by changing the base. Here are the same numbers, in
a variety of bases:

2 (binary)    8 (octal)    10 (decimal)    16 (hexadecimal)
01101001      151          105             69
01100110      146          102             66

Not that much easier to read, eh? OK, let's see what they look like if we
interpret them as a sequence of text characters. In the ASCII (American
Standard Code for Information Interchange) code, they translate to "if".
This is still not enough, however, to tell us what the sequence "means":

  • It might be an artificial character sequence, such as a control code.
  • It could be part of a longer string: nifty artifacts beautify California.
  • Even if the sequence is a "word", is it English, Perl, or ???

[3] In conventional place-value notation, each digit has a value which gets multiplied by
a power of ten (the base). Thus, decimal 101 is equal to 100*1 + 10*0 + 1*1. Binary
notation works the same way, save that the multipliers are powers of 2, rather than
10. Thus, binary 101 is equal to 4*1 + 2*0 + 1*1.
[4] A character may, however, occupy more than one byte of storage. Unicode
encodings, for instance, use up to four bytes per character, to support the
world's many writing systems.



In fact, the bit sequence could just as easily be an instruction to the computer,
part of a numeric value, or part of a bit-encoded image. Clearly, context is
critical to understanding any encoding of information.
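
To make this concrete, here is a minimal Perl sketch that takes the same two bytes and asks for several interpretations. The variable names are our own inventions, and the choice to treat the first byte as the "high" half of a 16-bit number is just one possible convention:

```perl
use strict;
use warnings;

# The two bytes from the text, written as binary literals.
my @bytes = (0b01101001, 0b01100110);

# Same values, different bases: only the notation changes.
my @rows = map { sprintf("%08b %03o %3d %02x", $_, $_, $_, $_) } @bytes;
print "$_\n" for @rows;

# Interpretation 1: a pair of ASCII characters.
my $as_text = join '', map { chr($_) } @bytes;     # "if"

# Interpretation 2: one 16-bit unsigned integer (first byte is the high half).
my $as_number = ($bytes[0] << 8) | $bytes[1];      # 26982

print "As text:   $as_text\n";
print "As number: $as_number\n";
```

Nothing in the bits themselves says which answer is "right"; the program supplies the context.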

To "understand" a bit sequence, computers apply a specified interpretation.
In most cases, the computer has no way to detect an incorrect specification or
faulty data. Hence, the expression: Garbage In, Garbage Out!

In Mac OS, a consistent set of user interface guidelines tells applications and
the Finder how to interpret user actions. Similarly, there are standards for
file encodings, programming interfaces, and other internal details. As you
develop programs, you may find it necessary to research one or more of these
standards, lest you generate the wrong encoding for a given context. For now,
however, just be aware that context is important!

In particular, recognize that a document can be edited as text, interpreted
and run as a program (using an interpreter such as MacPerl), and then (if
need be) edited again. The distinction between a "document" and a "program"
thus becomes a matter of (dare we say :-) interpretation; if a file is
being edited or read, it's data; if it's being run, it's a program ...
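
A small sketch may help here. It keeps one piece of text in a scalar, first handling it as data (measuring and editing it) and then handing it to Perl's string eval to be run. The text itself and the variable names are made up for illustration:

```perl
use strict;
use warnings;

# One string of characters, held in an ordinary scalar.
my $document = 'my $minutes = 25; $minutes * 60';

# Treated as data: we can measure it and edit it like any other text.
my $length = length $document;
(my $edited = $document) =~ s/25/3/;

# Treated as a program: string eval asks Perl to interpret the same text.
my $seconds       = eval $document;   # 1500
my $seconds_after = eval $edited;     # 180

print "The document is $length characters long.\n";
print "Run as written, it yields $seconds; after editing, $seconds_after.\n";
```

The characters never change their nature; only our use of them does.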

Quantifying Data Storage

Collections of data (applications, documents, data structures, etc.) and physical
storage devices (disk, RAM, etc.) are measured in terms of the number
of bytes they contain. Some of these collections can get very large, so they
tend to be measured in terms of kilobytes (thousands of bytes), megabytes
(millions of bytes), or gigabytes (billions[5] of bytes).

Here are some precise technical definitions:

2^10 bytes (1024)                  kilobyte[6]   KB
2^20 bytes (1024^2; 1048576)       megabyte      MB   (1024 KB)
2^30 bytes (1024^3; 1073741824)    gigabyte      GB   (1024 MB)


[5] That is, American billions; in the rest of the world, it would be "thousand-millions".
[6] Prefixes such as "kilo" are used differently in the computer industry than in common
usage. For reasons of engineering convenience, they are taken to mean powers of 1024,
which (by a happy accident) are fairly close to a thousand, a million, etc.



To get an idea of the scales involved, consider a single-spaced (American;
8.5" x 11") page of text. Assuming 60 lines of 80 characters, such a page can
hold 4800 characters, or about 5 KB of data. So, a 1 MB file can hold about
200 pages; a 600 MB CD-ROM can hold about 130,000 pages of text.
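
The arithmetic is easy to check in Perl. This sketch uses the exact figure of 4800 characters per page (rather than the rounded 5 KB), so its answers come out a bit higher than the round numbers above; the variable names are ours:

```perl
use strict;
use warnings;

my $MB = 2**20;                       # 1048576 bytes

my $chars_per_page = 60 * 80;         # 4800 characters, one byte each

# Whole pages that fit in 1 MB and on a 600 MB CD-ROM.
my $pages_per_mb = int($MB / $chars_per_page);          # 218
my $pages_per_cd = int(600 * $MB / $chars_per_page);    # 131072

print "1 MB holds about $pages_per_mb pages.\n";
print "A 600 MB CD-ROM holds about $pages_per_cd pages.\n";
```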

Mac OS documents commonly contain information other than text, however,
including images, formatting codes, etc. Consequently, a word processor will
typically use quite a bit more than a megabyte of disk space when storing a
200-page document. CD-ROMs, which often contain executable programs,
graphic images, and audio and video data, will generally contain far less
actual text than they otherwise might.

Values, Variables, And Calculation

Simple digits and characters (letters) cannot hold very much information.
So, we use them in sequences (e.g., numbers and words). For similar reasons,
computers combine bytes into larger-scale structures. The simplest of these
are scalar values, known in Perl as scalars.

A number (numeric scalar) should be able to handle any reasonable value
without overflowing.[7] In Perl, the standard number format supports integer
(e.g., 123) values of up to fifteen digits, with no loss of precision. Decimal
(e.g., 1.23) values have the same precision, but their magnitude can
range from very small to very large values (roughly, 10^-300 to 10^+300).[8]

A string (string scalar) is a sequence of bytes, generally interpreted as a
sequence of characters. Most data enters and leaves a Perl program in string
format, even if it is used within the program in a numeric context. As
demonstrated below, Perl is more than willing to coerce (convert) strings into
numbers (and vice versa), upon request.
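
As a quick illustration of coercion, the following sketch (with made-up values and variable names) lets the operator decide whether a scalar is treated as a number or as a string:

```perl
use strict;
use warnings;

my $text  = "42";               # a string, as it might arrive from a file
my $sum   = $text + 8;          # numeric context: 50
my $label = $sum . " eggs";     # string context: "50 eggs"

# The comparison operator chooses the interpretation, too.
my $same_number = ("10" == 10.0)   ? "yes" : "no";   # yes
my $same_string = ("10" eq "10.0") ? "yes" : "no";   # no

print "$label\n";
print "Equal as numbers? $same_number\n";
print "Equal as strings? $same_string\n";
```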

[7] If you put a four-digit number into a three-digit location, it won't fit. The result may
vary, but you probably won't like it! In short, an overflow is usually bad news ...
[8] Perl stores all numbers in double-precision floating-point format (even if they are
being used as integers). This is a computer-oriented variant of scientific notation,
using binary fractions and exponents. Perl also provides an arbitrary-precision
"package" (Math::BigFloat) to support very demanding numeric calculations.

It is very useful to be able to save and retrieve values by name, especially if
the value may vary during the operation of the program. For this reason,
most programming languages support the use of variables. Like their algebraic
counterparts (e.g., "let x = y + 1"), programming variables "stand in"
for values in calculations. Here are some examples of variable usage in Perl:

$a = "123";      # Set $a to "123"
$b = $a;         # Set $b to "123"
$c = $b + 1;     # Set $c to 124

Unlike its algebraic counterpart, the "equals sign" (=) in these expressions
does not stand for equality. Rather, it denotes a command (assignment): the variable
on the left-hand side (known in Perl as an lvalue) is set to the value of the
expression on the right-hand side (rvalue). Thus, it is quite legal to modify
(e.g., increment) a variable, using its current value as a starting point:

$c = $c + 1;     # Set $c to 125

In our egg-cooking example, scalar values are used in several places, as:

Wait 25 minutes.

A variable in this location might allow Rachel to cook either soft- or hard-boiled
eggs, using the same basic recipe (oops, program :-). Assuming that
the cooking time (in minutes) has been stored in $cook_time, we could use
the Perl sleep (go away for a specified number of seconds) function. A Perl
version of this line might thus look like:

sleep($cook_time * 60); # Wait for eggs to cook

Data Structures

Scalar variables are very useful, but they aren't good at expressing notions
about collections of things. In our example above, cartons and sauce pans are
both used to contain eggs. What kinds of collective data structures could let
us model this aspect of cartons and pans? (Sauce pans can also contain water
and sit on stoves, but these capabilities may not be critical to our model.)

We could model either of these containers as an ordered set of values, or
list. Lists contain items and have ways to insert and remove items. The
exact order of items in a list may not be critical in a given application. For
instance, eggs have no particular ordering in a carton or pan.

If, on the other hand, we were modelling a package-handling system, the
ordering of items in our lists might be very important, indeed. For this kind
of application, we probably would use some number of "First-In, First-Out"
(FIFO) lists, otherwise known as queues.
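
In Perl, an ordinary array can serve as a queue. Here is a minimal sketch, using invented parcel names, in which arrivals are added at the back with push and handled from the front with shift:

```perl
use strict;
use warnings;

my @queue;                              # hypothetical package queue

push @queue, "parcel A", "parcel B";    # arrivals join at the back
push @queue, "parcel C";

my $first   = shift @queue;             # "parcel A" (first in, first out)
my $second  = shift @queue;             # "parcel B"
my $waiting = scalar @queue;            # 1 parcel still waiting

print "Handled $first, then $second; $waiting still waiting.\n";
```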



In other applications, we might want to use "Last-In, First-Out" (LIFO)
lists (stacks), or numerically indexed lists (arrays). In each case, we would
pick a data structure that provided a good map (representation) of the
real-world collection we are modelling.
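
The same Perl array type covers these cases as well. In this sketch (the plates and the twelve-egg carton are our own examples), push and pop give stack behavior, while a numeric subscript picks out an item by position:

```perl
use strict;
use warnings;

# A stack: the last item pushed is the first one popped.
my @stack;
push @stack, "plate 1", "plate 2", "plate 3";
my $top = pop @stack;                   # "plate 3"

# A numerically indexed array: positions count from zero.
my @carton = ("egg") x 12;              # a carton of twelve eggs
my $count  = scalar @carton;            # 12
$carton[0] = "cracked egg";             # replace the first item
my $last   = $carton[-1];               # "egg" (a negative index counts from the end)

print "Popped $top; the carton holds $count items.\n";
```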

Perl programmers often use a hash (associative array), which "associates"
stored values with particular text strings (keys), rather than with numeric
positions. This can be useful, for instance, in managing personnel data:
"Raise Fred Smith's salary by $2 per hour."
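
Here is how that raise might look in Perl. The hash name, the second employee, and the pay rates are all hypothetical; only Fred Smith and his $2 raise come from the example above:

```perl
use strict;
use warnings;

# Hourly rates, keyed by employee name (illustrative figures only).
my %hourly_rate = (
    "Fred Smith"   => 12.50,
    "Rachel Jones" => 14.00,
);

# "Raise Fred Smith's salary by $2 per hour."
$hourly_rate{"Fred Smith"} += 2;

my $fred = $hourly_rate{"Fred Smith"};  # 14.5

printf "Fred Smith now earns \$%.2f per hour.\n", $fred;
```

Note that we found Fred by name, without needing to know where in the collection his record is stored.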

Finally, by combining basic forms, it is possible to create structures that are
even more complex. For instance, it is quite possible to consider using trees of
arrays, stacks of queues, hashes of arrays, and more.
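
As one possible sketch of a combined structure, the following hash of arrays models the carton and pan from our egg-cooking example. The names and the number of eggs moved are assumptions made for illustration:

```perl
use strict;
use warnings;

# A hash of arrays: each container name maps to a list of its contents.
my %kitchen = (
    carton => [ ("egg") x 12 ],
    pan    => [],
);

# Move three eggs from the carton to the pan.
push @{ $kitchen{pan} }, splice(@{ $kitchen{carton} }, 0, 3);

my $in_pan    = scalar @{ $kitchen{pan} };      # 3
my $in_carton = scalar @{ $kitchen{carton} };   # 9

print "Pan: $in_pan eggs; carton: $in_carton eggs.\n";
```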

The ability to create and manage complex data structures is critical to good
program design and implementation. If the data structures do not meet the
needs of the problem, the program logic will be needlessly convoluted. Perl,
fortunately for all of us, has a very rich set of data structuring operators.

Copyright © 1997-1998 by Prime Time Freeware. All Rights Reserved.