Monday, 10 August 2009

Other key non-database files

Although there are only three types of file which technically make up an Oracle database, there are other disk files floating around the place which are extremely important tot he correct operation of an Oracle server. They are not considered to be technically part of the database, though, because if they ever get deleted or damaged, no-one will know about it and you won't be told about it. Compare that with losing your control file or a data file: the database will definitely report that as an error or even crash as a consequnce. In short, the key difference between a database file and a non-database one is whether or not the file is continually being written to and monitored. The three types of file I'm going to mention here are not continually monitored.

Archived Redo Logs

The online redo logs contain a record of every transaction that has modified data in the database. That 'stream of redo' is there so that if we ever need to re-perform those transactions (during a database recovery), we can. The trouble is that there is only a limited number of online redo log groups and LGWR eventually fills up the last group... and then switches back to the first and over-writes its contents. That means LGWR eventually destroys redo... redo which was supposed to be there to protect our database by making it possible to recover it!

Archived redo logs are the necessary answer to this paradox. Every time LGWR stops writing to one online log and starts writing to another, the ARCn background process makes a replica copy of the just-filled log in some new location on disk. LGWR can eventually do its worst in over-writing the redo in the online log, because we have a copy of its contents stored elsewhere. The stream of redo produced by the database is thus made continuous and permanent by the Archiver process -which means that archived redo logs are essential to truly being able to say, 'I can always recover my database'.

A database therefore operates with just a handful of online logs -but with an ever-increasing number of archives. You do not need to retain archives for ever, however: once they are a week or more old, they can usually be aged out to tape and deleted from the disk to which they were originally written. This means, of course, that whilst archives are great for protecting your database they impose a management and maintenance burden on the DBA, who has to make sure there is enough disk space for the next archive to be written (implying bulk deletion of old archives) but enough logs available on disk to make any conceivable recovery do-able as quickly as possible (implying retaining as many archives as feasible on disk). It's a management balancing act, in short.

Just as we care about redo enough to multiplex our online redo logs, it stands to reason that we should think about multiplexing our archived redo log copies, too -and, indeed, it is possible for Enterprise Edition users to create archives in up to ten different locations. Standard Edition users can only create duplexed copies (that is, archivees can be written to only two different locations). It is not compulsory to multiplex your archives, but since they are the only thing (ultimately) standing between you and data loss, it would not usually be a sensible move to neglect to do so.

Password File

Logic dictates that the record of who is allowed to start up instances cannot be held inside the database -because to read data from inside a database you need an instance... but you can't create an instance until you've read the record of who is allowed to start up the instance, which you can't read until you've got an instance... and so ad infinitum!

To break this sort of Catch-22, the details of who is allowed to start an instance is held outside the database, in a small binary file called the password file. The password file permits external authentication of privileged users -where the term "privileged user" means 'someone who can startup and shutdown an instance, backup and recover a database, and create a database in the first place'. These are the five basic privileged actions, and if your username and password are found inside the password file, you can perform any of them. If your username and password are instead only stored inside the database itself, then you are merely an "ordinary user" (however many rights and powers you might be granted) and you will never be able to issue the startup or shutdown commands.

Password files need to be created in a given location and with a specific name. The location is $ORACLE_HOME/dbs and the file name must be orapwXXX, where "XXX" is the name of your instance, given by the value assigned to the ORACLE_SID environment variable in force at the time. If you set ORACLE_SID to FRED, for example, then the startup command will create you an instance called FRED, provided we can find a password file called orapwFRED in the ORACLE_HOME/dbs directory. (Note that Windows users have to store their password files in %ORACLE_HOME%\database and the file is called PWDXXX.ora). There is no leeway on the naming or location of the password file, though Unix/Linux users can always store it anywhere they like and call it anything they like and leave behind a symbolic link of the right name in the right place that points to where the file really is. Windows users don't have that sort of luxury, though.

Password files are created using a special Oracle-supplied utility called orapwd. You might issue this sort of command, for example:

C:\oracle\product\10.2.0\db_1\database> orapwd file=PWDtkit.ora password=tkit entries=25

C:\oracle\product\10.2.0\db_1\database> dir
Volume in drive C is W2KOS
Volume Serial Number is 24FE-B205

Directory of C:\oracle\product\10.2.0\db_1\database

07/10/2006 10:54a 61 inittkit.ora
17/10/2006 09:08a 4,608 PWDtkit.ora
2File(s) 39,997 bytes
3 Dir(s) 6,502,473,728 bytes free

This command creates a file of the right name and sets the default privileged user's password to be "dizwell". It also causes 24 other 'slots' to be created inside the new file so that details of other users who we later decide should become privileged users can be stored here. If you ever run out of slots and need to make one extra user a privileged user, you will have no choice but to delete the password file and re-create it with a larger value for the entries setting. Similarly, if you ever are unfortunate enough to lose your password file, you simply re-create it using this sort of command. It is clearly a major security problem is any old user can gain access to your file system and delete and re-create password files, so you take ordinary operating system-related precautions to make sure that only authorised users can delete and create files in the ORACLE_HOME/dbs (or \database) directory. That is a system administrator's function, though: there's nothing the database itself can do to protect the password file.

Not everyone will use a password file, because there are alternative ways of performing external authentication of privileged users. It is possible, for example, to assign an operating system user to membership of a special operating system group (on Unix, usually it's called dba; on Windows, it's ORA_DBA). Logging on to the server itself then means you acquire O/S group membership, and Oracle can check for that and allow you to do privileged actions accordingly. But you can only exercise O/S group privileges if you log on to the Oracle server directly. The password file is the only way of authenticating privileged users remotely.

Parameter File

I discussed earlier how the SGA consists of a Shared Pool, a Buffer Cache and so on. But the obvious question then arises: how big should the Shared Pool be? Should there be a Java Pool at all? Where should the database be storing the archived redo logs? These are important questions concerning database configuration and functionality, and the answers to those questions are contained within a special file called the parameter file by setting keyword-value pairs. For example, you might set SHARED_POOL_SIZE=600M, and then we know exactly how big your Shared Pool is going to be. If you set DB_CACHE_SIZE=348M, I know your Buffer Cache is going to be 348 MB in size, and if I see JAVA_POOL_SIZE=0, I know you don't want a Java Pool at all.

There are 257 different possible parameters that can be set in an Oracle 10g Release 2 database... but, fortunately, only about 20 or 30 of those possibilties commonly need to be set. If you don't set a parameter yourself, it usually has a vaguely-sensible default value. I would say that, to begin with, you absolutely must know (and be able to recite in 'parameter-speak'!) the following core parameters:

DB_BLOCK_SIZE

The size of the Oracle block used by default for every data file.

DB_NAME

The name of the database

SGA_TARGET

The target size of the SGA overall

SGA_MAX_SIZE

The absolute maximum size the SGA can grow to

SHARED_POOL_SIZE

A minimum size for the Shared Pool component of the SGA

DB_CACHE_SIZE

A minimum size for the Buffer Cache component of the SGA

LOG_BUFFER

The absolute size of the Log Buffer component of the SGA

UNDO_MANAGEMENT

Whether the database uses automatic undo management or not

UNDO_TABLESPACE

Where undo is written to if automatic undo management is switched on

UNDO_RETENTION

How long before undo can be over-written

DB_RECOVERY_FILE_DEST

The location of the Flash Recovery Area, where archive redo logs are written by default

DB_RECOVERY_FILE_DEST_SIZE

How much disk space the Flash Recovery Area can consume before declaring 'out of space'

Obviously, there are many other parameters to become familiar with over time, but these are absolutely core competency ones. A full list (together with their meaning and default values, if any) is available from the Oracle website. Note that new parameters get added to the list all the time as each new Oracle version (and its new functionality) gets released. Old parameters also disappear off the list as their functionality becomes deprecated or (finally) obsoleted. If you are trying to learn about what's truly new in a new Oracle version, it's usually a good idea to start by comparing lists of parameters from both versions and seeing what new arrivals there are, and what ones have disappeared.

Parameters can be stored in two totally different types of parameter file. You can either use a text file, editable with nothing more than notepad or vi, and if you do, we call that an init.ora, because the file must have a name of initXXX.ora, where "XXX" matches the value assigned to your ORACLE_SID environment variable. Additionally, the file must be stored in the ORACLE_HOME/dbs directory (ORACLE_HOME\database for Windows users). Alternatively, the parameters can be stored in a binary equivalent of the init.ora called the spfile.ora, because the file must have a name of spfileXXX.ora (the "XXX" again matches your ORACLE_SID). Like the init.ora, the spfile.ora must be located in the ORACLE_HOME/dbs directory. Because spfiles are binary files, stored in a proprietary Oracle format, only an instance can be used to edit them. You edit an spfile by connecting to a running instance and issuing commands such as:

alter system set undo_retention=1800 scope=spfile;

If you cannot get an instance running and yet still need to edit the spfile, you must convert the spfile to a pfile, edit the pfile with a standard text editor and then convert the text file back to the spfile, like so:

create pfile from spfile;

--edit the init.ora thus produced--

create spfile from pfile;

If both an init.ora and an spfile.ora exist in the ORACLE_HOME/dbs directory, the spfile.ora is used to configure the instance and the existence of the init.ora is simply ignored. For that reason, I generally make a point of always deleting one or other of the files, so that there's no ambiguity in my mind about which is the 'operative' file.

Spfiles were invented in Oracle 9i, and before then the init.ora was the only parameter file that was ever used. A lot of 9i users stuck with the init.ora (me included!), out of habit and old familiarity, but these days you should generally switch to using an spfile and become familiar with how you edit it.

Summary on non-Database Files

Archived redo logs are absolutely critical to being able to recover your database; the password file is essential for anyone wanting to remotely administer a database; the parameter file (of whatever type) is crucial for correctly configuring the nature and capabilities of an instance and database -so each of the three files I've discussed here are in no way trivial or irrelevant! They are all immensely important files, but they are not technically part of the dataabse, because if anything ever happened to any of them, you could simply ignore the loss or re-create the relevant file. You can't re-create archives, of course, but so long as you take a new backup, you shouldn't ever need to use or rely on a lost archive. If you lose an spfile, you can use vi to type up a replacement init.ora, and then create an spfile from that. If the password file is ever corrupted, you merely delete it and use orapwd to re-create it. In other words, loss of these files does not cause you to perform formal database recovery, which is definitely not the case if you lose your control file, one of your data files or your current redo log!

Sunday, 9 August 2009

The Database

Whilst an Oracle instance is made up of many different and various memory areas together with countless different background processes, each doing a specific task, an Oracle database consists of just three types of physical disk file. Theya re all binary files (so don't go opening them in vi or notepad!) and have an internal structure that only an Oracle instance can make sense of. Taking each of the file types in turn, therefore, we have:

Control Files

Every database must have at least one control file. It is a small binary file (typically, no more than about 10MB in size) and has two main jobs. In the first place, it points to the physical location of all the other files that make up the database. That is, it physically stores the full path and file name of every other constituent file. Secondly, it stores the latest checkpoint change number, which is the timestamp written by the CKPT process at the end of each checkpoint event. This is the "CKPT timestamp" I mentioned earlier.

Other files in the database get timestamped by CKPT at every checkpoint -but the control file knows how to find every other file in the database, it must be the one file which is read first whenever a database is opened. That means the timestamp stored in the control file is read first, and is assumed to be the timestamp to which all the other files in the database will agree. In a sense, therefore, you can say that the control file is the database's master clock.

In short, the control file tells us where everything else in the database can be found and how old it should be when we find it. If you lose your control file, therefore, you are in deep trouble: not only can your instance not find out where every other file is located, but it won't have any idea what checkpoint change number everything should be up to, and therefore SMON has no basis for working out whether the database is clean and consistent. For this reason, although you technically only need to have one control file, sensible people will multiplex their control files, so that the database sees two (or more) copies of it, and will therefore maintain both (or all) copies identically. At that point, if you lose one copy of the control file, there's another copy which you can use to recover with, and no major harm is done.

Note that multiplexing your control files is not the same thing as mirroring them. In a hardware mirror, the instance sees and maintains only one control file copy. Whatever is done to that one file, the hardware kicks in and replicates to a mirror copy. Multiplexing means telling the instance that there are two or more files to write to, and anything written by the instance to one copy is written afresh by the instance to the second copy. To really protect your control file, you really want to do both: multiplex two or three times, and then mirror: potentially, you'll have four or six copies of the control file, and if one copy is lost, there are three or five others you can rely on for recovery purposes.

Data Files

Data files are, funnily enough, where your data is stored, in tables, indexes, clusters and other more exotic kinds of storage structures. All Oracle databases start out with at least two data files: one called system and the other called sysaux (sysaux is new in 10g). Both these files store special system-related data, such as the data dictionary tables which tell us what users exist, what privileges they have, what tables have been created and what columns they contain. Although there's nothing stopping you from creating your own tables inside system or sysaux, it's a bad idea to do so: having good access to the data dictionary is extremely important, because it's needed every time a SQL statement gets parsed. Having your own tables being worked on inside the same file would cause I/O issues, and slow down parsing.

Therefore, although every database starts out with just two data files, you will soon begin to add more -and in fact you can add up to another 64,000 or so data files if you really needed to, each up to several Terabytes in size. The entire collection of data files in an Oracle database can therefore get significantly big!

Internally, data files are organised into discrete 'chunks' of data known as an Oracle block. Blocks are usually 2K, 4K, 8K, 16K or (on some O/Ses only) 32K in size. Once you've specified a block size, that's it: you can't change your mind and ask for an 8K-block data file to re-organise itself so that its blocks are now 16K in size. Whatever block size you decide to use, that is then the smallest unit of I/O that Oracle will use when reading or writing to the data file: ask to see one column of one row in a table, and an entire 16K (or 8K etc.) block is read instead.

The very first blocks of any data file are known as the data file header, and in the header of every data file, CKPT periodically writes the latest checkpoint change number. All data files in a healthy database must have the same checkpoint change number stored in their headers, and all must agree with whatever checkpoint change number CKPT also wrote into the control file. If these conditions are met, we say that the database is consistent. (Note that there one or two exceptions to the 'every file must have the same checkpoint number. Files which are marked read-only, for example, are allowed to be different).

Although a data file doesn't care what data is stored inside itself, it is common practice to make sure that each data file only stores one 'type' of data. For example, you will often see a data file which stores only index data, and another one which stores only tables -just as the system data file is supposed to store only the data dictionary information. Such 'specialisation' in the type of data stored in a data file is a management convenience only and it requires constant vigilence to make sure an index, for example, is not accidentally created inside the tables' usual data file (though no actual harm will arise if such a thing were to occur).

Online Redo Logs

There must be a minimum of two online redo logs, which are then each known as a redo log group. Each group is usually not that big (say, a couple of hundred megabytes), but that obviously depends on the specific requirements of each database installation.

Log Writer saves the contents of the Log Buffer to one of the log groups -which is therefore known as the current log group. When the current log is filled to capacity, Log Writer switches to writing to the other group. When that in turn is filled, Log Writer switches back to the first group and begins to over-write its contents. In this way, a couple of relatively small online logs are re-used in a cyclical fashion.

The contents of the online logs is known generically as redo. Redo lets the database re-perform transactions should the database ever need recovering. Therefore, losing redo is not good -in fact, it's about as close to a disaster as it ever gets with Oracle. Therefore, just as with the control files, each online redo log group should be multiplexed so that Log Writer writes into multiple copies of the same group. Lose one of the copies (known as a redo log member), and the other member is still able to be used and no redo has been irretrievably lost. Multiplexing of redo log groups is not actually compulsory... but if you care about not losing any of your data, it ought to be considered so.

Whereas data files are internally made up of Oracle blocks, redo logs are internally made up of file system blocks which are almost always 512 bytes in size. That size is a product of how a disk is formatted with a file system and is therefore not configurable from within Oracle at all. Some documentation available on the Internet refers occasionally to 'redo log blocks', but they're really just talking about 'file system blocks'.

Summary of the Oracle Database

Oracle databases can be huge: up to 8 Exabytes (that's eight million terabytes!) in fact. But however big they get, they are always only ever made up of three sorts of file: a 10MB-ish control file, two or more 100MB-ish online redo logs, and large numbers of extremely large data files. (The sizes mentioned here are just typical sorts of values. Specific requirements will mean each database is different, though).

Just as we can say 'An instance = SGA + Background Processes", so we can therefore say:

Database = Control Files + Data Files + Online Redo Logs

Each file type has a specific job to do. Data files store data (yours and the database's internal housekeeping variety). Redo logs store redo, to use in case of recovery of transactions. Control files store the location of every other file and the master timestamp for the database as a whole.

All database files are forever being read from, and written to, by background and other processes, and losing any one of them is not an option. If you lose just one of these files, your database must be considered damaged; it will probably cease functioning and the instance managing it will probably crash; and you will have to perform some sort of recovery. We can use multiplexing techniques to minimise the damage done by the loss of one file; we can also use hardware mirroring to protect against hardware failure. A truly safe database will be using both techniques.

Saturday, 8 August 2009

Summary of the Oracle Instance

When you run Oracle, you cause a large chunk of memory to be reserved for Oracle's use. That chunk of memory is called the System Global Area (SGA). It is composed internally of several discrete 'pools' of memory, such as the Buffer Cache and the Shared Pool. Each 'pool' is used to store a particular sort of data for particular purposes -sometimes, 'your' data, so you can see rows from a table; sometimes, Oracle's own internal 'housekeeping' data needed to recover a database or parse a SQL statement.

In addition to grabbing chunks of memory, running Oracle also causes many background processes to be launched. The processes run independently of what users happen to be doing (hence the use of the term background). They perform many different housekeeping tasks which are essential for the health and future recoverability of the database, such as writing your modified data back to disk or periodically 'time-stamping' the database so that we can tell it is healthy and consistent at any given point.

An SGA plus the background processes is known as an Oracle instance. The instance is the name we give to 'Oracle running in memory': it's both the memory the Oracle program consumes and the processes (or threads) it needs to carry out is work. If we were to express this concept in a formula, it might read:

Instance = SGA + Background Processes

One instance can only ever open, read, manage and manipulate the data from one Oracle database. But what exactly constitutes a database is something we need to look at next!

Friday, 7 August 2009

Processes or Threads?

I've been busy describing DBWn and LGWR as "background processes", but in fact they might not be processes at all. If you are running Oracle on Windows, for example, then they are technically 'threads' within the main process, not independent processes in their own right. On Unix and Linux, however, they are genuinely independent processes and can be listed just like you can list the existence of any operating system process:

[oracle@tkit ~]$ ps -ef | grep ora
oracle 6688 1 0 06:32 ? 00:00:00 ora_pmon_mydb
oracle 6690 1 0 06:32 ? 00:00:00 ora_psp0_
mydb
oracle 6692 1 0 06:32 ? 00:00:00 ora_mman_mydb
oracle 6694 1 0 06:32 ? 00:00:02 ora_dbw0_mydb
oracle 6696 1 0 06:32 ? 00:00:01 ora_lgwr_mydb
oracle 6698 1 0 06:32 ? 00:00:01 ora_ckpt_mydb
oracle 6700 1 0 06:32 ? 00:00:02 ora_smon_mydb
oracle 6702 1 0 06:32 ? 00:00:00 ora_reco_mydb
oracle 6704 1 0 06:32 ? 00:00:01 ora_cjq0_mydb
oracle 6706 1 0 06:32 ? 00:00:03 ora_mmon_mydb
oracle 6708 1 0 06:32 ? 00:00:00 ora_mmnl_mydb
oracle 6710 1 0 06:32 ? 00:00:00 ora_d000_mydb
oracle 6712 1 0 06:32 ? 00:00:00 ora_s000_mydb
oracle 6716 1 0 06:32 ? 00:00:00 ora_qmnc_mydb
oracle 6719 1 0 06:32 ? 00:00:00 ora_q000_mydb
oracle 6721 1 0 06:32 ? 00:00:00 ora_q001_mydb

The middle bit of the names in the right-hand column show you the process abbreviations I've been using -and you can see PMON, DBW0, LGWR and so on listed together with a bunch of those 'other' processes (like S000) I mentioned briefly in passing.

There is no equivalent on Windows to the 'ps' command, however -and even downloadable tools like CurrProcess Explorer won't show you items with identifiable names such as I've been using here. The best you can see is the existence of a single process, called oracle.exe, and that's about it. You can see the process names even on Windows, though, if you query the instance itself by selecting from V$PROCESS:

SQL> select program from v$process;

PROGRAM
-----------------------------------
PSEUDO
ORACLE.EXE (PMON)
ORACLE.EXE (PSP0)
ORACLE.EXE (MMAN)
ORACLE.EXE (DBW0)
ORACLE.EXE (LGWR)
ORACLE.EXE (CKPT)
ORACLE.EXE (SMON)

...but that's as good as it ever gets on Windows.

This difference between Unix and Windows is down to the underlying architecture of the operating system and has no impact on the fundamental capabilities and functions of the Oracle database itself, but it means that, technically, we should talk about "background threads" when discussing Oracle-on-Windows ...but no-one in their right mind ever does so! Everyone still calls them 'background processes', even on Windows. It might not be technically precise but everyone will know what you mean.

Thursday, 6 August 2009

The Background Processes

A memory area, like the SGA, is useless unless something (or some many things) can write data into the memory area and read data out of it. You need, in short, a set of processes which know how to interact with the various memory areas which make up an Oracle SGA. There are indeed several such processes, and they are collectively known as background processes, because they run and do their work in the background, not interfering in any obvious way with what the users of the database are trying to achieve in terms of practical work outcomes.

There are five compulsory background processes, and all the rest are optional. Taking them in turn...

SMON (System Monitor)

SMON checks the database for consistency and health every time you start up an instance and try and use it to open a database. If the database is, for whatever reason, inconsistent SMON will either fix up the problem entirely automatically if it is in its power to do so, or it will prompt you to rectify the problem if it can't do it itself. SMON, in short, either performs or prompts for the performance of some sort of recovery. A recovery that SMON can perform entirely automatically is known as an instance recovery. One which requires manual intervention from the DBA is known as a media recovery.

A reasonable question to ask is, 'On what basis does SMON assess a database as being healthy or in need of recovery?'. To which the answer is: checkpoint change numbers. Periodically, the database is "timestamped" with a new sequence number (the latest number from a pool of ever-increasing sequence numbers). So long as every file in the database has been stamped with the same number, we can say that the database is 'consistent to that point of time'. If all the checkpoint change numbers that SMON reads during the database opening procedure, and which are stored in the headers of every physical file constituting the database, are the same, the database is healthy. If they are not all the same, and if there are no good reasons for the difference, the database is presumed to have suffered some sort of failure and a recovery procedure is therefore invoked.

CKPT (Checkpoint)

It is another reasonable question to then ask, 'What writes the checkpoint change number, which SMON reads, into the headers of each database file?'. To which the answer is, 'Another background process called Checkpoint, abbreviated as CKPT'. The occasions when CKPT writes its information into the headers of all database files are known as checkpoint events. Every transaction that takes place on an Oracle database (that is, every insert, update or delete of a record) causes a system-wide sequence number called the System Change Number or SCN to be incremented. At a checkpoint event, the latest SCN is deemed to be the latest Checkpoint change number, and it is this CCN which is written by CKPT into the headers of all database files.

It is fair to say that, during prolonged database operation, CKPT will forever be writing new checkpoint change numbers into the headers of all the database files; it is SMON's job, at the opening of the database, to ensure that every constituent component of the database agrees on what the latest checkpoint change number should be. If they are all in agreement, the database opens. If there is disagreement, recovery is called for.

To make things even simpler: SMON reads what CKPT writes, and both processes are intimately involved in identifying when databases need recovery, and how much recovery they need, if any.

DBWn (Database Writer)

Database Writer is the process responsible for periodically saving the modified contents of the Buffer Cache back down to the physical database files. By writing modified buffers back to disk, Database Writer makes that data safe, immune from the vagaries of operating system crashes or power failures.

Since a buffer in memory that contains modified data is known as a dirty buffer, and because Database Writer's job is, in effect, to re-synchronise the version of data on disk with the version of data held in the Buffer Cache, we sometimes say that Database Writer's job is to clean buffers.

Furthermore, since a clean buffer only contains data which could simply be re-read off disk (if we had to), it is OK to over-write the contents of a clean buffer. By contrast, a dirty buffer is the only version of a modified buffer -it hasn't yet been saved to disk- so it would not be very sensible to let you over-write the contents of a dirty buffer. Therefore, we can also say that Dataabase Writer's job is to permit the recycling of the Buffer Cache: by 'cleansing' buffers, it makes them re-usable.

Although we call the process 'Database Writer', since 8.0 it's been possible to have multiple independent database writers, and that's why you see an 'n' in its abbreviated name: on a production system where several writers have been configured, you'd see processes DBW0, DBW1, DBW2 and so on.

LGWR (Log Writer)

As Database Writer takes the contents of the Buffer Cache and writes them to disk, so the Log Writer background process periodically saves the contents of the Log Buffer to disk. The Log Buffer, you will recall, stores redo -a series of instructions about how to re-perform (and hence recover) data modifications. Log Writer therefore makes the record of transactions that was stored originally in memory permanently safe by writing them to hard disk.

In the process, Log Writer permits the 'recycling' of the contents of the Log Buffer, just as Database Writer allows data buffers to be re-used: you can't over-write the contents of any part of memory if they're the only record you've got of something happening to the database, but once there's a copy of the old stuff on disk, the memory in the Log Buffer can be re-used by new redo entries.

There is only ever one Log Writer process. Unlike Database Writer, for example, which can have several 'clone' processes sharing the work burden, redo log records must be written to disk sequentially -and that means only one process can be in charge of working out what to write to disk. If that wasn't true, you might come to grief when a series of transactions have dependencies. If you first create a new product record and then raise a new sale record to indicate that the new product has just been sold, for example, you need the product transaction to be recorded before the sale transaction -otherwise, during a recovery, you might end up trying to sell a product that doesn't exist yet. To prevent that logical nonsense from happening, therefore, a single Log Writer process writes everything in the correct sequence.

Whilst it guarantees the correct order of events, though, only have one Log Writer does sometime mean that Log Writer becomes a bottleneck on the system: there's no way to speed him up by sharing his work amongst other processes, for example. That's why the files to which Log Writer saves data (called redo logs) should ideally be stored on the very fastest hard disks you have available. It's far more important to let Log Writer write as fast as possible than it is to let Database Writer write quickly, for example.

PMON (Process Monitor)

Process Monitor keeps an eye on all the other processes. If any of them hang or die, PMON can try to re-start them -and sometimes it actually succeeds! If it can't succeed, then it takes appropriate action -which is, quite often, to crash the entire instance. PMON also keeps an eye on users: if they are in the middle of a transaction, for example, and their PC hangs or the network cable gets disconnected, PMON will spot the abnormal disconnection of the user and rollback whatever transaction they happened to be in the middle of. That means locks on rows are removed, memory is freed and any other 'database resources' the user was consuming at the time get reclaimed.

ARCn (Archiver)

I've mentioned already that Log Writer allows us to re-use the Log Buffer by periodically saving its contents to disk, by writing them to one of the online redo logs. If it writes enough redo to these logs, Log Writer must eventually run out of space -and must, at that point, begin to overwrite the redo previously stored in the online logs. Put another way: online redo log contents are recycled. That in turn must mean there is a limit to how far back you can go and still hope to be able to re-perform transactions during a recovery -and that is indeed the case.

Log Writer on its own protects the database for only the relatively short amount of time it takes to fill up all online logs. If you want more protection than that, you need to copy the contents of the online logs out to some more permanent store before you over-write them... and that is exactly what the Archiver Process does. Once one online log has filled up, it copies its contents to an archive redo log. The online log can then be over-written whenever LGWR feels like it and yet there is a continual stream of redo available from the archives. Recoveries going back days, weeks or months therefore remain entirely possible.

Archiver therefore makes a permanent record of redo. With that permanent record, you can guarantee recovery of a database without loss of committed data. Without it, you have to pray the redo you want to re-play is still available in the online logs, and it might or might not be. Technically, therefore, although Archiver is an optional background process, every production database that cares about not losing committed data would have it switched on.

You'll note that the process name abbreviation includes an 'n', like Database Writer does. That's for the same reason, too: just as you can have up to ten Database Writers, so you can have up to ten Archivers, and hence the processes get named ARC0, ARC1, ARC2 and so on.

Others

I've just listed the six background processes which are most closely involved in saving your data, saving the instructions for how to modify your data and checking the health and consistency of your database. Five of them are compulsory and one (ARCn) should be considered almost so.

There are, however, dozens of additional background processes that might be running in any given Oracle instance, depending on what extra software has been installed or what additional database options have been implemented. If you configure an Oracle RAC, for example, you will immediately get five additional background processes (LMON, LMD, LMS, LCK, DIAG). If you start to schedule database jobs, you'll see a new process called CJQ0 ("coordinator of the job queue"). If you configure to run in what is known as Shared Server mode, you'll see a lot of processes called S000, S001 and so on (shared servers).

When you install or configure these 'exotic' configurations of Oracle, it's as well to know what the extra processes are called and what they do (and how they do it). But for now, you simply need to know that there can be many more background processes running than the quick half dozen I've described to you so far -but that half dozen are 'core', and every other background process you might see are configurable extras.

Wednesday, 5 August 2009

SGA Summary

When the Oracle program runs, it grabs a large chunk of memory in which it can do its work. That 'chunk', in its entirety, is known as the System Global Area, or SGA.

The SGA is not a monolithic chunk of memory, however: it has internal structure, and is internally chopped up into discrete areas of memory which perform specialised tasks. These sub-areas are known as the Shared Pool (which caches execution plans and the data dictionary tables), the Buffer Cache (which caches ordinary data), the Log Buffer (which records changes made to ordinary data), the Large Pool (which is needed for efficient memory allocations to parallel slaves and backup processes), the Java Pool (providing memory to Java stored procedures) and the Streams Pool (unique to 10g and above, and providing memory to replication services).

Tuesday, 4 August 2009

The Streams Pool

New in 10g, a special area of memory within the SGA is now dedicated to allowing a database to replicate itself to a different location (think of a head office database which needs to be 'copied' up to a branch office out in the boondocks). If you don't have a need to have your database replicate itself, you don't need a Streams Pool. Before 9i, replication took whatever memory it needed from the Shared Pool -and hence the business of preventing hard parses was compromised. Oracle version 10g has recognised the significant memory requirements for effective replication and has thus created the Streams Pool as a dedicated resource which alleviates the burden on the Shared Pool.


Monday, 3 August 2009

The Java Pool

Since Oracle 8i, it has been possible to write procedures, functions and other code which gets stored inside the database in the Java programming language. It has since turned out that you would be certifiably insane to do so but nevertheless, the capability remains. (Java is better utilised "in the middle tier" -that is, in an application server, rather than in the backend database itself). If you run Java code in the database itself, it needs its own unique area of memory in which to execute -and that area of memory is called the Java Pool. It can be enormous (in the gigabyte range, for example) -but if you don't run Java in the database at all (all your code is, say, PL/SQL), then it can be zero in size.

Sunday, 2 August 2009

The Large Pool

The Large Pool is an optional component of the SGA, though these days it is so useful to have that I doubt many production databases would not be using one: optional it may be in strict right, but I suspect most DBAs these days consider it functionally compulsory.

Whenever you parallelise an operation in Oracle, communication between the individual parallel slaves requires memory access. That memory is usually grabbed from the Shared Pool if nothing else is available: but if a Large Pool has been configured, it gets taken from that instead. You end up using parallel slaves whenever you do a parallel query, perform an RMAN backup or utilise multiple I/O slaves. Parallel query and multiple I/O slaves are slightly exotic (and cost money), but practically every Oracle database on the planet needs backing up with RMAN -so that means practically every Oracle database on the planet needs a Large Pool.

You can do all of these things without a Large Pool, but doing so means the Shared Pool is stressed. It's busy trying to minimise hard parses, and suddenly great gobbets of memory that it needs to do that job are pinched by a backup job or a parallel query! Not a good outcome, in short... and the existence of a Large Pool would mean that such a conflict never arises.

So let me make it simple for you: if you ever back up your database, you need a Large Pool. That should garner a near 100% response, I would have thought!!


Saturday, 1 August 2009

The Log Buffer

It is impossible to generalise, but if there is such a thing as a 'typical' Oracle database, you might measure the size of the Shared Pool and the Buffer Cache in the several hundred megabytes range, or bigger. In comparison, most Log Buffers anywhere in the World would not be much bigger than, tops, 10M or so.

The Log Buffer is thus a relatively tiny area of memory in which a record of the changes you make to data is stored. Every insert, update or delete that you perform will generate a record of what row was affected and how it was modified. That record is written, in the first instance, to the Log Buffer component of the SGA. From there it is periodically saved to special files on disk, thus making the record of what transactions have taken place on a database permanent.

The transaction record is generically known as redo, and we can therefore say that redo is first written (for reasons of speed) into the Log Buffer of the SGA for each and every transaction that users perform.

As the name implies, redo is generated in order to let us 'do again' those transactions which might otherwise have been lost because of hard disk failure or power supply failure or software error or user stuff-up. Redo, in short, is Oracle's recovery mechanism, and the Log Buffer is simply the first place redo is created.


Friday, 31 July 2009

The Buffer Cache

The Buffer Cache is a memory area in which the data physically stored in the database is read and modified. When you want to see the employee records in the EMP table, they are first read from disk and loaded into the Buffer Cache. Why not simply fetch the rows off disk and straight back to you and your client PC? Someone else might want to read the EMP records just after you. If we loaded the rows into Buffer Cache first time round, the second person won't have to read the rows from disk at all: they'll be able to access the copy of them direct from the Buffer Cache. Reading rows of data off disk involves performing a physical read. Reading rows out of the Buffer Cache copy of them involves performing a logical read. Logical reads don't require physical disk access and so are much quicker to perform than physical reads. The job of the Buffer Cache can be said to be, therefore, to minimise physical reads and to help maximise logical ones.

It's not just reading data, though. If you want to update a record, the modification is made to the version of the data stored in the Buffer Cache. Only later is this modified buffer (known as a dirty buffer) then saved back down to disk. When we write a modified buffer back down to disk, the contents of the in-memory buffer become the same as the on-disk data, and we therefore say that the buffer is now a clean buffer.

Oracle never reads individual rows, though. If you want to update Jim's phone number in the EMP table, we never just get Jim's record and let you at it. The minimum unit of reading and writing is an entire "collection" of rows -usually 2K, 4K, 8K or 16K in size. The idea is that if I am modifying Jim's phone number, you will probably want to update Bob's phone number similarly in just a moment or two... so my request to update Jim's record helps you out, because it causes the entire collection of rows around Jim's row to be loaded into memory at the same time. Your subsequent update to a different row hopefully therefore only involves logical I/O, not physical I/O -and you therefore benefit from my earlier work.

This 2K, 4K, 8K or 16K collection of data is known as the Oracle Block and is the minimum unit of I/O in an Oracle database. Request to see just one byte in a block, and the entire block is read into the Buffer Cache. The term "block" is actually used to describe the collection on disk; when a block from disk is read into memory, we call it a "buffer" instead. Essentially, though, a block and a buffer are the same thing, except that one physically exists and the other is only an in-memory thing. Some other database products (such as SQL Server) call the same thing a "page" of data. Blocks, buffers, pages... they're all really just different words for the same idea: a minimum 'chunk' of data read from disk into memory, and stored in memory for other users to find without having to re-visit the physical disk.


Thursday, 30 July 2009

The Shared Pool

When you issue a query such as select * from emp, the database must work out what that means. Do we have a table called EMP? What are its constituent columns? Where, physically, on disk, does the table live? Are there any indexes built on the table which might help us get the results back more quickly? Do you, the user, have rights to query the table? The process of answering all these sorts of questions is known as parsing a SQL statement, and it involves a lot of work going on 'behind the scenes'.

To work out whether a table exists at all, for example, we have to query an in-built system table called TAB$. To work out where it physically exists on disk, we have to query more in-built system tables including one called TS$ . To check if you have the rights to view the table, yet another query is made of a table called USER$ (amongst others). You issue a simple select statement, in other words, and the poor database has to issue a dozen queries of assorted system tables before it can even work out what your statement means. We call these 'System SQL statements issued in response to a user SQL statement' recursive SQL, and performing recursive SQL means performing lots of preparatory work before actually answering your SQL requests.

Parsing is therefore very expensive... so we'd really like to only have to do it once and have the results of the parse stored for you and other users to make re-use of every time you re-issue the same basic SQL statement in the future. The result of a parse is known as an execution plan, and it represents a package of pre-worked-out instructions as to how to answer a SQL statement. Each new SQL statement causes a new execution plan to be constructed and stored... and they are stored in a sub-section of the Shared Pool known as the Library Cache. If a second user issues exactly the same SQL statement as someone else happened to issue earlier, therefore, the second user can simply nip into the Library Cache and make use of the execution plan previously worked out for the first user.

If you have to go through the entire rigmarole of working out an execution plan from scratch, that's called 'doing a hard parse'. If you can merely re-use execution plans previously created by other users, that's called 'doing a soft parse', and soft parses are a lot cheaper and quicker to carry out than hard ones.

You can fairly say, therefore, that the job of the Library Cache is to help stop us having to do hard parses.

Incidentally, fetching the contents of system tables like TAB$, USER$ and so on is expensive, too: these tables are collectively known as the data dictionary, and the data dictionary is physically stored on disk in the system datafile. Forever having to access these tables off disk would not be a good idea: lots of disk access generally means "slow performance". Therefore, once we've read some rows from these tables, we store them for future access in another component of the Shared Pool called the Data Dictionary Cache. The data dictionary cache is where in-memory copies of the data dictionary tables, needed to carry out parsing operations, reside.

The job of the dictionary cache is therefore to stop us having to visit the system data file on disk every time we parse.

Since both the Library Cache and the Data Dictionary Cache are both sub-components of the Shared Pool, and since both are designed to minimise parsing activity, or at least make it an in-memory affair if it does have to happen, it is reasonable to conclude that the job of the Shared Pool overall is to make parsing SQL statements as cheap an affair as possible.


Wednesday, 29 July 2009

Oracle Server Components

What is an Oracle Server?

When you edit a document in Word (or whatever word processor you happen to prefer!), you don't type documents directly into a disk file, because doing so would be painfully slow. Instead, documents are initially 'typed into' and created in memory -your computer's RAM- because that works at nanosecond speed and is thus very fast and responsive to work with. Unfortunately, most of today's RAM is volatile: switch off the computer's power supply and you lose the lot -and if you've ever been typing a long, complicated document and had a power cut before you remembered to save it, you'll know the sinking feeling of your brush with volatility!

To protect against data loss like that is why you're supposed to do a periodic 'File -> Save', of course. Hard disks might be slow and painful to use for constant access, but they make a fine permanent store of things which have initially been created and worked on in RAM. Saving a document in Word simply means 'transfer a copy of it out of volatile memory and onto permanent disk storage'. In this way, your word processor lets you combine the speed of RAM with the permanence of hard disk storage. You therefore end up getting the best of both worlds: speed and safety.

In fact, that goes for any program you care to mention, be it a word processor, a video editing package, a media player, whatever: data is read from its permanent store on hard disk and manipulated and worked on in memory, because one is fast but volatile and the other is slow but safe.

The Oracle Relational Database Management System (RDBMS) is just another program in this respect, because it does exactly the same thing. The data it manages is stored permanently on disk but when someone wants to update it, view it or delete it, it is loaded from disk into memory and it is in memory that all the data manipulation takes place. The permanent store of data on disk is what we call the Oracle database. The chunk of RAM where the database's data is actually worked on is called the Oracle instance.

The relationship between an instance and a database is one-to-one: if an instance has accessed the data from database X, it cannot also access or work on the data from database Y. If you want to work with the data from database Y, you will need to use another instance entirely. This relationship -one instance manages only one database- is always and completely true.

The relationship the other way around is not always like that, though. In an advanced configuration of Oracle known as a Real Application Cluster (or RAC for short), it is possible for one database to be accessed by two or more instances, simultaneously. That's usually done to speed things up and get more work done. But even so, each of the participating instances in a RAC can only be used to store and work on data from one database. So the rule can be more fully stated as:

One instance can only work on one database,
but one database may be opened by many instances

An Oracle Server is nothing more, therefore, than a computer onto which the Oracle software has been installed; on which an Oracle instance is running in memory (providing fast but volatile data management capabilities); and on which an Oracle database has been permanently stored on disk.

Users connect to the instance (not the database, note) and make requests to see or to manipulate data from the database by issuing SQL statements. In response to these SQL statements, the Oracle program loads the requested data into the instance. The user continues to work on the data within the instance. Occasionally, to make sure a power failure doesn't cause the loss of all the work users have been doing, the Oracle software performs, in effect, an automatic 'File -> Save' and writes the data stored in the instance back to the physical database on disk. In this way, users of an Oracle database get all the performance advantages of working with things in fast RAM, but all the permanence and safety advantages of having saved stuff to disk.

The Structure of the Oracle Instance

An Oracle instance is merely an allocation of memory in which to do work plus a bunch of background processes which automate a lot of that work for us. The memory areas are collectively known as the System Global Area (or SGA for short). The constituent 'pools' of memory, which in the aggregate make up an SGA, are called The Shared Pool, The Buffer Cache and The Log Buffer. These three SGA components are compulsory and every SGA on the planet will have all three of them. There are various additional 'pools' which your SGA might or might not have configured, depending on how you are using your database, what software options you've enabled (and paid for!) and so on. You will commonly see, for example, a Java Pool and a Large Pool. You will less commonly see (in 10g and above) a Streams Pool.

No matter what fancy names these pools of memory are given, they are all merely chunks of memory in which different sorts of data are stored. The principle function of all of them is to try and 'cache' that data in memory so that you do not have to fetch it back off disk or work it out from scratch (both of which options are relatively slow and expensive).

Tuesday, 28 July 2009

Simple Block Layout

So, with this overhead, free space, and (wow) actual data, how is it all stored in the data block?

SIMPLE BLOCK LAYOUT

The data block header (fixed) is at the beginning of the block. A small tail is at the end of the block (the tail is for block consistency checking). Everything else (variable header and data rows) fits between the beginning and the end.

Variable header info grows from the top down (just below the fixed header) if necessary and rows are inserted from the bottom up (just above the tail).

Everything else is freespace. I will discuss free space and "freelists" in part 2 of the data block.

Ok, in part 2, I will show some more examples of the data block and how the other structures we've talked about use and sometimes mimic the data block structure.

Now it's time to see data blocks in action. For this example I am just going to consider a normal table with rows and columns and nothing more exotic (once you understand this, the various permutations (clusters, partitions, etc.) should be easy to understand).

So, we've done step 1; we've issued a CREATE TABLE statement. Whether we accepted the default storage settings (by not entering a storage clause explicity) or we tweaked it for this specific table, a few things happen during the execution of that CREATE table statement:

-Oracle creates a table segment in a tablespace and grabs some initial extents comprised of data blocks that sit physically in one or more datafiles. (reread the earlier posts if you don't get this part)

-Those previously empty datablocks are preloaded with some small amount of overhead information (if you're still wondering why this overhead is good or even needed, don't get discouraged).

-Oracle "registers" this segment in some data dictionary tables (adds rows) for database management purposes (it is a Relational Database Management System after all).

-----------------------------
That last bullet might sound interesting. We have not INSERTed a single row of data yet, but there appears to already be rows loaded in the database. That is correct.

One of the required tablespaces in Oracle is SYSTEM. The SYSTEM tablespace has a whole host of tables, views, indexes, and procedural objects that have nothing to do with your data, but everything to do with your metadata.

You see, even the Oracle code uses the Oracle database as part of normal operations. This is great for two reasons: it makes the system very scalable; and it means that almost any DBA task can be done via simple SQL commands.

------------------------------

So now we are ready to insert some data into our table (let's not worry about indexes right now). Let's say we have some data from an application or a flat file and we want to insert 1,000 rows into our new table.

After some parsing (we'll also talk about SQL later), Oracle is ready to store the rows in data blocks. How does Oracle determine which data blocks to use? Freespace is used first.

If Oracle can find some empty space in the segment, it will try and fill those empty spaces with the new data. If there is no freespace left, Oracle will try to add more extents to the segment and use those data blocks for storage.

Because we just created the table, all of the allocated space (minus overhead) should be available. We know it's new and empty, but how does Oracle know?

Well, Oracle could simply keep a list of all new tables that have never held rows, but that would only be of very limited use and could be a source of contention (if everyone inserting data had to check a global listing of empty tables, blah, blah, blah to see if their table was on the list).

Instead, Oracle leverages the overhead in data blocks and does something similar at the segment level. That something is -- SEGMENT HEADERS. From a logical perspective, segment headers are similar in many ways to data block headers: they contain some management data about the segment.

Segment headers contain a list of the extents in the segment (extents table), the FREE LISTS, and the HIGH WATER MARK. The segment header starts in the first block of the first extent in the segment.

FREE LIST

The freelist is simply a list of data blocks currently allocated to extents in the segment that have some free space in them for new rows. There are some interesting things to understand about freelists.

Contention- instead of having everyone in the database wait in line to find free space in segments, freelists moves that contention down to the individual segment so only people manipulating data on that segment (table) have to contend with each other to find free space. If you have a segment (table) that you expect to have a lot of insert and update activity on, you can create multiple freelists to reduce that contention even more. To avoid waits, you can determine how many people will be inserting into that table AT THE EXACT SAME TIME (on the count of three, push your button) and create that many freelists. Unfortunately, many people adjust this number too high (thinking if two is good, a hundred is great) and end up wasting disk space (I'll try to remember to explain this when we start talking about design issues). A good DBA will monitor any freelist contention and adjust AS NECESSARY.

Data block free space- in the first post on data blocks, I talked about free space inside data blocks. This free space is used to expand rows (via update) and for new rows (insert). In OLTP systems, almost every data block will have at least one lonely block that is not being used. Do we really want to see if we can squeeze a whole row into that single byte of space? No, it will never fit. We only want to try and put data where there is plenty of room. Fortunately, when we created the segment (CREATE TABLE), we told the segment (explicitly or by using defaults) how we wanted to report this free space inside of data blocks.

We did this using the keywords PCTFREE and PCTUSED. These are the keywords that determine if a table's individual data blocks shows up on the freelist group or not.

These percentages (PCTxxxx) are just that, parts of 100. PCTUSED puts blocks on the freelist, PCTFREE takes a block off of the freelist (it is no longer accepting new inserts, only updates (DELETE's are always accepted and have no interest in checking freelists)).

The default value for PCTUSED is 40(%) and PCTFREE is 10(%). That means if a block is only 39% used (<40%), it goes on the freelist as a candidate for new rows. It will stay on the freelist after new rows are inserted until the data block is 90% full (<10% free space remaining).

Inserts increase PCTUSED (thereby decreasing PCTFREE), deletes decrease PCTUSED (thereby increasing PCTFREE), and updates can do either depending on if the update is increasing or decreasing the size of the row(s).

Since our new segment currently holds no data, all of the data blocks in our INITIAL extent(s) are well below 40% used, so they are on our freelist. Any of those data blocks can be used to hold our 1,000 rows.

Now, what if we had 1,000,000 empty blocks and were only going to use maybe 100 for this insert, would we really want to keep track of the freespace in all 1,000,000 blocks individually? I wouldn't. It could take longer to maintain and search for free space than just allocating a new extent. But then, like I hinted at above, I would be saving time, but possibly wasting disk space.

So Oracle has another neat item that helps me out (not only here, but when doing things like FULL TABLE SCANS as well).

That thing is...

High Water Mark (HWM) - just like the name sounds, the HWM keeps track of how high the river of data has ever risen.

While just counting the number and size of extents in a segment will tell us how much space has been allocated, the HWM will tell us how much of that space has ever been used (until to rebuild or truncate it). Hopefully, you can already guess at some of the benefits of knowing that.

One of those benefits is in controlling the freelists. The freelist only tracks blocks that are below the HWM. That keeps things very manageable and reduces the chances of wasting space and resources (like by having 1,000,000 blocks with a single row each as opposed to having fewer, more densly packed data blocks).

Let's say we have 2 extents each with 10 data blocks that can each hold 50 of our 1,000 rows. Let's also keep or PCTUSED=40 and our PCTFREE=10

so our block count=20. All of them are empty.

Because we've never inserted rows into this segment, our HWM is 0.

When we insert the first row, the freelist will say "I don't have anything free, let me try to add some empty blocks." The freelist sees that there is a block already allocated to the segment (from one of our 2 extents) that is empty, so it advances the HWM from 0 to 1 and grabs one of those blocks. The row is then inserted.

When we insert the second row, the freelist says "I have a block waiting for data" and tells us where to stick the data (whether its being rude or not depends on how you treat the database --joke), decreases the PCTFREE in the block and increases the PCTUSED.

This goes on for the next few dozen rows until we try to insert our 46th row (we know that physically we have enough space for 50 rows in each block). By then, something has happened. Our PCTUSED has grown to 90%, but we don't really care about that because the block was already on the freelist (any of our blocks with PCTUSED<40 will be put on the freelist). What we do care about is that the PCTFREE has gone from almost 100 down to 10 when we inserted our 45th row. At that point, the data block said "I'm stuffed, leave me alone." So it was removed from the freelist.

Now the freelist says "I don't have anymore space below the HWM, I'd better see what's out there." It sees that there are still 19 data blocks that have been allocated but never used. So, it grabs the next one, puts it on its list and advances the HWM to 2.

If this sounds like overhead to you, you're right. If it sounds like unnecessary (or even undesirable overhead) you're wrong. When we start looking at the big picture, you'll learn to appreciate it.

So now we continue to insert rows in the same fashion, filling up block after block in the first extent (no flamers!) until all the blocks are full (or at least they're 90% full) and the HWM has been moved up to 10. At this point, we have used all of the 10 blocks in our first extent. Now the freelist has to grab some data blocks from another extent (see the no flamers comment).
The freelist checks with the segment and sees that we have more blocks allocated in our second extent that have free space.

We'll continue to insert rows like we have been, moving up the HWM until it gets to 20.

At this point, both of our extents' set of data blocks have been removed from the freelist because they are 90% full (only 10% free). Now when the freelist looks for allocated, but unused space, it won't find any, so it will extend the segment by allocating a new extent (with its set of 10 data blocks) and advance the HWN to start adding those newly allocated data blocks to the freelist.

We can then insert the remaining 100 rows.
--------------------------------------------
In this example, we've seen how freelists are used to insert data into data blocks. We have seen our segment size grow to the sum of the extent sizes allocated and how the HWM is used in this process. We have seen how data blocks accept new rows and when they reject new rows.

What we haven't seen is the affect this has on the tablesapce and the datafile(s).

Since a tablespace is similar in ways to a segment, we know that the tablespace has grown (logically) to equal the sum of the size of it segments and is (physically) still the sum of the size of its datafiles. But what has happened to the datafile(s) themselves? You'll remember, or can look it up, that in Chapter 1 of this series we said that a datafile had a predetermined size. When we create it, we told it what size to be.

That datafile is allocated to the tablespace. Allocated but not used (for data). As we create segments, we create extents which consist of data blocks from datafiles owned by the tablespace that the segment was created in. Whew! Say that 10 times fast.

As we allocate new extents for our segment, does the datafile get bigger? No, not necessarily. The allocation of that space goes from unused (free) to used as a segment needs it. So, both the tablespace and the datafile get filled up. When the datafiles get full (by allocation, not by inserting data) then you have to either add a new datafile or allow an existing one to grow. Tablespaces work in a similar fashion logically with extents. As you approach the MAXEXTENTS setting, you have to increase it or prevent the segment (table) from growing.

-------------------------------------------------

This is not the end of our discussion on data blocks. It is just the beginning. But it is very important that I do a good job of helping you understand the basic structure and role of data blocks (and their chain of custody up to the tablespace) before moving on to all of the simple tasks in Oracle. That's part of why I simplified some of this without trying to mislead you. This is the way it works and you will discover any shortcuts I took as we continue.

Understanding the data block is critical to mastering the Oracle database and all of the tasks that come with developing for and administering the Oracle database.

Monday, 27 July 2009

Concurrency - Interested Transaction List

Concurrency really just means how many people can be doing things at the same time in the same place. It directly affects scalability.

Many RDBMS products use a form of centralized "lock manager" to keep track of who's doing what to what. That is a valid solution, but as the number of concurrent users rises (not people just logged on to the database, but the people actively doing things) then a lock manager can and does become a point of contention. Oracle decentralizes almost all of this and let's the data blocks themselves deal with concurrent access (two or more people trying to manipulate rows in the same data block).

That is the reason why a well designed Oracle database can scale higher on less than any other database out there.

An interesting test I've performed (because my boss thought Oracle sucked and tried to prove it) is comparing Oracle to Brand X for a single user. Oracle was fine, but in numerous tests it lagged behind it's competitor. But, as I started to ramp up to dozens then hundreds of concurrent transactions (different sessions, etc.) I noticed something: the Oracle performance line stayed about the same, while Brand X started to fall behind more and more.

-------------------------------------

The ITL is controlled by the settings INITTRANS and MAXTRANS (just accept the defaults unless you really know what you're doing). As you can probably guess from those keywords, you set a starting number and then tell it how big it can grow (how many concurrent transactions you'll track before making people wait to get into the block).

There is also a row directory that contains data about each row in the block. One interesting thing to know about this (small) component is that once space has been allocated in the row directory, you don't get it back when you just delete a row. It's interesting, as we'll see later, but not a thing to worry about or to feel you're being robbed over.

All in all, the data block overhead is usually only about 100 bytes or more (let's round it up to 200 bytes and you can see that in a 4k block it's still a small percentage), so it shouldn't play too much of a role in your decision on a data block size.

There are about 2 dozen other things in the overhead that I haven't mentioned, but they are as simple as the rest of it and we will address those later.

I've mentioned the pieces I think you need to know for now.