Extract, Transform & Load – What is it and why is it important?

ETL – short for Extract, Transform and Load – refers to moving data from external sources into relational databases or data warehouses, and back out again.

Extract

Source systems may store data in a huge variety of formats.  Extracting involves getting that data into common files for moving to the destination system.  The CSV format, also known as comma separated values, is so named because each record is stored as one line in the file, with fields separated by commas and often surrounded by quotes as well.  In MySQL the SELECT ... INTO OUTFILE syntax can perform this function.  If you have a lot of tables to work with, you can script the process using the data dictionary as a lookup for table names, and generate a .sql script to run with the mysql command-line client.  In Oracle you would use the SPOOL command in SQL*Plus, the command-line shell.  SPOOL sends subsequent screen output to a file as well.
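Here's a rough sketch of what the extract might look like on the MySQL side – the table name, columns and file path are made up for illustration:

    -- Dump one table to a CSV-style file (hypothetical table and path).
    SELECT customer_id, first_name, last_name, email
    FROM customers
    INTO OUTFILE '/tmp/customers.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n';

    -- To script this for many tables, pull the table names from the data
    -- dictionary, e.g.:
    -- SELECT table_name FROM information_schema.tables WHERE table_schema = 'mydb';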

Transform

This step involves modifying the extracted data in preparation for moving it into the target database server.  It may involve sweeping out blank records, rearranging columns, or breaking files into smaller subsets of data.  You might also map values differently; for instance, if one column in the source database was gender with values M/F, you might transform those to the strings “Male” and “Female” if that is more useful for your target database server.  Or you might transform them to numerical values, for instance Male & Female might become 0/1 in your target database.
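If you do that mapping inside the database rather than with external tools, it might look something like this – the staging table and column names are hypothetical:

    -- Map a one-character gender code to strings during the transform step.
    SELECT user_id,
           CASE gender
             WHEN 'M' THEN 'Male'
             WHEN 'F' THEN 'Female'
             ELSE 'Unknown'          -- catch blanks and bad values
           END AS gender_label
    FROM staging_users
    WHERE user_id IS NOT NULL;       -- sweep out blank records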

Although a myriad of high-level GUI tools exist to perform these functions, the Unix operating system includes a plethora of very powerful tools that every experienced System Administrator is familiar with.  Those include grep & sed, which operate on regular expressions and can perform data transformation at lightning speed.  Then there is sort, which can sort data and send the results to stdout or the file of your choosing.  Other tools include wc (word count), cut (which can remove columns), and so forth.

Load

This final step involves moving the data into the database server, and its final target tables.  For instance in MySQL this might be done with the LOAD DATA INFILE syntax, while in Oracle you might use SQL*Loader, which is a very fast flat-file data loader.
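A minimal MySQL sketch of the load step, assuming a CSV like the one extracted earlier – the file path, table and column names are again just examples:

    -- Load the transformed CSV into its final target table.
    LOAD DATA INFILE '/tmp/customers.csv'
    INTO TABLE customers
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    (customer_id, first_name, last_name, email);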

Quora discussion by Sean Hull – What is ETL?

Point-in-time Recovery – What is it and why is it important?

Web-facing database servers receive a barrage of activity 24 hours a day.  Sessions are managed for users logging in, ratings are clicked and comments are added.  Even more complex are web-based ecommerce applications.  All of this activity is organized into small chunks called transactions.  They are discrete sets of changes.  If you’re editing a word processing document, it might autosave every five minutes.  If you’re doing something in Excel, it may provide a similar feature.  There is also a built-in mechanism for undo and redo of recent edits you have made.  These are all analogous to transactions in a database.

These transactions are important because they are all written to logfiles, which makes replication possible: the changes can be replayed on another database server downstream.

If you have lost your database server because of hardware failure or instance failure in EC2, you’ll be faced with the challenge of restoring your database server.  How is this accomplished?  Well the first step would be to restore from the last full backup you have, perhaps a full database dump that you perform every day late at night.  Great, now you’ve restored to 2am.  How do you get the rest of your data?

That is where point-in-time recovery comes in.  Since those transactions were being written to your transaction logs, all the changes made to your database since the last full backup must be reapplied.  In MySQL this transaction log is called the binlog, and there is a mysqlbinlog utility that reads the transaction log files, and replays those statements.  You’ll tell it the start time – in this case 2am when the backup happened.  And you’ll tell it the end time, which is the point-in-time you want to recover to.  That time will likely be the time you lost your database server hardware.
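Before you can replay anything, binary logging has to be enabled and the files have to exist.  A quick sanity check from the MySQL prompt might look like this – the binlog file name is hypothetical; the actual replay is then done with the mysqlbinlog utility as described above:

    SHOW VARIABLES LIKE 'log_bin';    -- should be ON
    SHOW BINARY LOGS;                 -- list of binlog files and sizes
    SHOW MASTER STATUS;               -- current binlog file and position

    -- Peek at the events recorded since the last full backup:
    SHOW BINLOG EVENTS IN 'mysql-bin.000042' LIMIT 10;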

Point-in-time recovery is crucial to high availability, so be sure to back up your binlogs right alongside the full database backups you take every night.  If you lose the server or disk that the database is hosted on, you’ll want an alternate copy of those binlogs available for recovery!

Quora discussion on Point-in-time Recovery by Sean Hull

Data warehousing – What is it and why is it important?

Have a question? Email me anytime [email protected]

A data warehouse is a special type of database.  It is used to store large amounts of data, such as analytics, historical, or customer data, and then run large reports and data mining against it.  It is markedly different from a web-facing or high-transaction database, which typically handles many, many small transactions or pieces of data that are constantly changing, across hundreds or thousands of small user sessions.  Those typically execute at speeds on the order of 1/100th of a second, while in a data warehouse you have fewer, larger queries which can take minutes to execute.

Need assistance? Check out my pricing page

Data warehouses are tuned for updates happening in bulk via batch jobs, and for large queries which need big chunks of memory to sort and cross-tabulate data from different tables.  Often full table scans are required because of the specialized one-off nature of these reports.  The same queries are not executed over and over.
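A typical warehouse-style report, sketched here with hypothetical tables, scans and cross-tabulates a large fact table rather than touching a handful of rows by primary key:

    -- Big aggregation: full scans, joins, and a large sort.
    SELECT c.region,
           p.category,
           SUM(s.amount) AS total_sales,
           COUNT(*)      AS order_count
    FROM sales s
    JOIN customers c ON c.customer_id = s.customer_id
    JOIN products  p ON p.product_id  = s.product_id
    WHERE s.sale_date >= DATE '2013-01-01'
    GROUP BY c.region, p.category
    ORDER BY total_sales DESC;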

Read: 5 Ways to get your data to Amazon Redshift warehouse database

It’s important not to mix data warehousing databases with transactional databases in the same instance, whether you are dealing with MySQL or Oracle.  That’s because they are tuned totally differently.  It would be like trying to use the same engine for commuting to work and for a container ship traveling around the world.  Different jobs require different databases, or databases with their dials set for different uses.

Also: Are SQL databases dead?

Quora discussion of data warehousing – Sean Hull

Relational Database – What is it and why is it important?

A relational database is the warehouse of your data.  Your crown jewels.  It’s your Excel spreadsheet or filing cabinet writ large.  You use them every day and may not know it.  Your smartphone stores its contacts in a relational database, most likely SQLite – the ever-present but ever-invisible embedded database.  Your online bank at Citibank or Chase stores all your financial history, statements, contact info, personal data and so forth, all in a relational database.

Relational databases are:

  • organized around records
  • data points are columns in a table
  • relationships are enforced with constraints
  • indexing data brings hi-speed access
  • SQL is used to get data in and out of the db
  • triggers, views, stored procs & materialized views may also be supported

Need help? Check out my pricing page and also How daily notes are helping me work better with clients

Like Excel, relational databases are organized around records.  A record is like a 3×5 card with a number of different data points on it.  Say you have 3×5 cards for your addressbook.  Each card holds one address, phone number, email, picture, notes and so forth.  By organizing things nicely on cards, with predictable fields on each card such as first name, last name, birthday and so on, you can then search on those data points.  Want all the people in your addressbook with a birthday of July 5th?  No problem.
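As a small sketch of that addressbook example (the table and column names are made up), the 3×5 card becomes a table, each card a row, and each field a column you can index and search – shown here in MySQL flavor:

    -- One row per "card".
    CREATE TABLE addressbook (
        id         INT PRIMARY KEY,
        first_name VARCHAR(50),
        last_name  VARCHAR(50),
        email      VARCHAR(100),
        birthday   DATE,
        notes      VARCHAR(500)
    );

    -- Index the fields you plan to search on for high-speed access.
    CREATE INDEX idx_addressbook_birthday ON addressbook (birthday);

    -- "All the people with a birthday of July 5th, no problem."
    -- (Wrapping the column in functions like this forces a scan; an exact
    -- lookup such as birthday = '1980-07-05' would use the index.)
    SELECT first_name, last_name, email
    FROM addressbook
    WHERE MONTH(birthday) = 7 AND DAY(birthday) = 5;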

Related: What is a database migration & why is it important?

While relational databases have great advantages, they require a lot of work to get all of your information into neatly organized tables.  What’s more, the method for getting things into and out of them – SQL – is a quirky and not very friendly language.  Relational databases also have trouble clustering and scaling horizontally.  NoSQL databases have made some headway in these departments, but at a cost to the consistency and reliability of data.

Read: 5 Ways to get your data to Amazon Redshift warehouse database

As servers continue to get larger, it becomes rarer that a single web-facing database really needs more than one server – if it’s tuned right, that is.  Looking to the future, the landscape will probably continue to be populated by a mix of traditional relational databases, newer NoSQL-type databases, key-value stores, and other technologies yet to be dreamed up.

Also: Are SQL databases dead?

Sean Hull asks on Quora – What is an rdbms and why are they important?

Query Profiling – What is it and why is it important?

Queries are so named because of the lovely language we call SQL – structured query language.  Calling it lovely is a bit of sarcasm on my part; I’ve never found it particularly simple or elegant.  Profiling queries involves finding out how they are spending their time, and what they are asking the server to do.  In this way you can make them faster, improve performance of the whole server, and thus your website.

At any rate, queries ask the database for information.  If they are on the simpler side – something like “give me all the users whose last name starts with ‘hu’” – and last name is indexed, that will run very fast.  The database will look up the last name field in the index, find the subset starting with those letters, then look up the records by id and return them to you.  Cache the index blocks, cache the data blocks.  Great!  However, say those are customers, and you want their cellphone calling history.  Now you have to join on another table, matching by some key, hold all those records in memory, shuffle them around, and so on.
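Sketched in SQL with hypothetical tables, the two cases above look like this:

    -- The fast case: an indexed prefix lookup on last_name.
    SELECT id, first_name, last_name
    FROM users
    WHERE last_name LIKE 'hu%';

    -- The heavier case: join the matching customers to their call history,
    -- which means matching keys, holding rows in memory, sorting, and so on.
    SELECT u.last_name, c.called_number, c.call_start, c.duration_seconds
    FROM users u
    JOIN call_history c ON c.user_id = u.id
    WHERE u.last_name LIKE 'hu%'
    ORDER BY c.call_start DESC;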

So queries are effectively little jobs or bits of work you ask your database server to perform on your behalf.  With websites you typically have hundreds of concurrently running sessions all sending their own little SQL jobs to the server to get processed, and records returned.  Any blip on the radar slows everyone down, so you want them all to run quickly.

That’s where profiling comes in.  MySQL, Oracle, and SQL Server alike all have EXPLAIN-type facilities for showing the plan the database will use to fetch your data.  It shows indexes, sorting, joins, order, and so forth.  All of this contributes to the overall execution time and resources used on the database server.
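In MySQL, for instance, profiling the lookup above is as simple as prefixing it with EXPLAIN; Oracle’s equivalent is EXPLAIN PLAN:

    -- Show the access plan: which index is used, estimated rows, join order.
    EXPLAIN
    SELECT id, first_name, last_name
    FROM users
    WHERE last_name LIKE 'hu%';

    -- Oracle: the plan goes into PLAN_TABLE, then you query it back out.
    -- EXPLAIN PLAN FOR SELECT ... ;
    -- SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);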

Quora discussion by Sean Hull – What is query profiling and why is it important?

Oracle to MySQL Migration Considerations

There are a lot of forms of transportation, from walking and bike riding to motorcycles and cars, buses, trains and airplanes.  Each mode of transport will get you from point A to point B, but one may be faster, another more comfortable, and another more cost effective.  It’s important to keep in mind when comparing databases like Oracle and MySQL that there are indeed a lot of feature differences, a lot of cultural differences, and a lot of cost differences.  There are also a lot of impassioned people on both sides arguing about the tomfoolery of the other.  Hopefully we can dispel some of the myths and discuss the topic fairly.

** See also: Migrating MySQL to Oracle Guide **

As a long time Oracle DBA turned MySQL expert, I’ve spent time with clients running both database engines and many migrating from one to the other.  I can speak to many of the differences between the two environments.  I’ll cover the following:

  1. Query & Optimizer Limitations
  2. Security Differences
  3. Replication & HA Are Done Differently
  4. Installation & Administration Simplicity
  5. Watch Out – Triggers, Stored Procedures, Materialized Views & Snapshots
  6. Huge Community Support – Open-source Add-ons
  7. Enter The Cloud With MySQL
  8. Backup and Recovery
  9. Miscellaneous Considerations

Check back again as we edit and publish the various sections above.

Oracle Expert – New York City

Looking for a top-flight DBA?  We are subject matter experts in Oracle and MySQL, especially integrating these database platforms with open source technologies.

We’ve always been about open-source technology, integration & mixing commercial technologies such as Oracle with open-source ones such as MySQL and Linux. As open-source becomes mainstream, and more shops consider moving critical services to these technologies, we continue to provide assistance and expertise for these transitions. Whether it is performance testing and tuning, benchmarking, high availability or recovery, we can provide services for your specific needs.


Oracle DBA Interview Questions

Oracle Database Administrators, or DBAs as they’re often called, are an indispensable part of your operations team. They manage the systems that house all your business data – your customers, products, transactions, and all that analytical data on what customers are actually doing. If you’ve ever been on the hunt, you may wonder: why the shortage of DBAs? To that we’ll answer, have you ever heard of Dustin Moskovitz?

So you certainly want to entrust all of that to someone who knows what they’re talking about. Enter the Oracle DBA interview, a process that some will see as a technical test, while others will see it as a test of fit – personality, behavior, and work ethic.

From the technical side we thought we’d bring you a quick and dirty checklist of questions. This isn’t an exhaustive list by any means, but is a good place to start and will certainly provide you with a glimpse of their knowledge.

Also, if you’re looking to hire a MySQL DBA here’s a guide, and also one for hiring an EC2 expert.

1. What is the difference between RMAN and a traditional hotbackup?

RMAN is faster, can do incremental (changes only) backups, and does not place tablespaces into hotbackup mode.

2. What are bind variables and why are they important?

With bind variables in SQL, Oracle can parse and cache a query a single time in the shared SQL area. This avoids a hard parse each time, which saves on the various locking and latching resources we use to check objects’ existence and so on. BONUS: For rarely run queries, especially BATCH queries, we explicitly DO NOT want to use bind variables, as they hide information from the Cost Based Optimizer.

BONUS BONUS: For batch queries from 3rd party apps like PeopleSoft, if we can’t remove bind variables, we can use bind variable peeking!
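A minimal illustration of the difference, using SQL*Plus bind variable syntax and a hypothetical orders table:

    -- Without bind variables: every distinct literal is a new statement,
    -- so each one gets its own hard parse.
    SELECT order_total FROM orders WHERE order_id = 101;
    SELECT order_total FROM orders WHERE order_id = 102;

    -- With a bind variable: one shared cursor, parsed once, reused.
    VARIABLE oid NUMBER
    EXEC :oid := 101
    SELECT order_total FROM orders WHERE order_id = :oid;
    EXEC :oid := 102
    SELECT order_total FROM orders WHERE order_id = :oid;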

3. In PL/SQL, what is bulk binding, and when/how would it help performance?

Oracle’s SQL and PL/SQL engines are separate parts of the kernel which require context switching, much like switching between Unix processes. This is slow, and uses up resources. If we loop over an SQL statement, we are implicitly flipping between these two engines. We can minimize this by loading our data into an array and using a PL/SQL bulk binding operation to do it all in one go!
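A rough PL/SQL sketch of the idea, with a hypothetical employees table: BULK COLLECT pulls the rows into a collection in one context switch, and FORALL pushes the changes back in one more, instead of flipping engines on every row.

    DECLARE
      TYPE id_tab IS TABLE OF employees.employee_id%TYPE;
      l_ids id_tab;
    BEGIN
      -- One context switch to fetch all the ids we care about.
      SELECT employee_id BULK COLLECT INTO l_ids
        FROM employees
       WHERE department_id = 50;

      -- One more context switch to apply all the updates.
      FORALL i IN 1 .. l_ids.COUNT
        UPDATE employees
           SET salary = salary * 1.05
         WHERE employee_id = l_ids(i);

      COMMIT;
    END;
    /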

4. Why is SQL*Loader direct path so fast?

SQL*Loader with direct path option can load data ABOVE the high water mark of a table, and DIRECTLY into the datafiles, without going through the SQL engine at all. This avoids all the locking, latching, and so on, and doesn’t impact the db (except possibly the I/O subsystem) at all.

5. What are the tradeoffs between many vs few indexes? When would you want to have many, and when would it be better to have fewer?

Fewer indexes on a table mean faster inserts/updates. More indexes mean faster, more selective WHERE-clause lookups, possibly without index merges.

6. What is the difference between RAID 5 and RAID 10? Which is better for Oracle?

RAID 5 is striping with an extra disk’s worth of parity. If we lose a disk we can reconstruct its contents from that parity.

RAID 10 is mirroring pairs of disks, and then striping across those sets.

RAID 5 was created when disks were expensive. Its purpose was to provide RAID on the cheap. If a disk fails, the IO subsystem will perform VERY slowly during the rebuild process. What’s more, your likelihood of failure increases dramatically during this period, with all the added weight of the rebuild. Even when it is operating normally, RAID 5 is slow for everything but reading. Given that, and knowing databases (especially Oracle’s redo logs) continue to experience write activity all the time, we should avoid RAID 5 in all but the rare database that is MOSTLY read activity. Don’t put redologs on RAID 5.

RAID 10 is just all-around goodness. If you lose one disk in a set of 10, for example, you could lose any one of eight other disks and have no trouble. What’s more, rebuilding does not impact performance at all, since you’re simply making a mirror copy. Lastly, RAID 10 performs exceedingly well in all types of databases.

7. When using Oracle export/import what character set concerns might come up? How do you handle them?

Be sure to set NLS_LANG, for example to “AMERICAN_AMERICA.WE8ISO8859P1”. If your source database is US7ASCII, beware of 8-bit characters. Also be wary of multi-byte character sets, as those may require extra attention. And watch export/import for messages about any “character set conversions” which may occur.

8. How do you use automatic PGA memory management with Oracle 9i and above?

Set the WORKAREA_SIZE_POLICY parameter to AUTO and set PGA_AGGREGATE_TARGET to an appropriate value.
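In SQL that amounts to the following – the 1G target is just an example value:

    ALTER SYSTEM SET workarea_size_policy = AUTO;
    ALTER SYSTEM SET pga_aggregate_target = 1G;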

9. Explain two easy SQL optimizations.

a. EXISTS can be better than IN under various conditions

b. UNION ALL is faster than UNION (it skips the sort used for duplicate elimination)
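Two quick illustrations with hypothetical tables:

    -- a. EXISTS can beat IN when the subquery would return a large set.
    SELECT c.customer_id, c.name
    FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);

    -- b. UNION ALL skips the duplicate-elimination (sort) step that UNION does.
    SELECT customer_id FROM online_orders
    UNION ALL
    SELECT customer_id FROM phone_orders;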

10. Name three SQL operations that perform a SORT.

a. CREATE INDEX

b. DISTINCT

c. GROUP BY

d. ORDER BY

e. INTERSECT

f. MINUS

g. UNION

h. UNINDEXED TABLE JOIN

11. What is your favorite tool for day-to-day Oracle operation?

Hopefully we hear some use of command line as the answer!

12. What is the difference between Truncate and Delete? Why is one faster?

Can we ROLLBACK both? How would a full table scan behave after?

Truncate is nearly instantaneous, cannot be rolled back, and is fast because Oracle simply resets the HWM. When a full table scan is performed on a table, such as for a sort operation, Oracle reads up to the HWM. So if you delete every single solitary row in a 10 million row table so that it is now empty, sorting on that table of 0 rows would still be extremely slow.
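A quick illustration with a hypothetical table:

    -- DELETE removes rows one at a time, generates undo/redo, can be rolled
    -- back, and leaves the high water mark where it was.
    DELETE FROM big_table;
    ROLLBACK;                -- the rows come back

    -- TRUNCATE is DDL: it resets the HWM, is nearly instantaneous,
    -- and cannot be rolled back.
    TRUNCATE TABLE big_table;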

13. What is the difference between a materialized view (snapshot) fast refresh versus complete refresh? When is one better, and when the other?

Fast refresh maintains a change log table, which records change vectors, not unlike how the redo logs work. There is overhead to this – much as with a table that has a LOT of indexes on it, inserts and updates on the source table will be slower. However, if you are performing refreshes often, like every few minutes, you want to do fast refresh so you don’t have to full-table-scan the source table. Complete refresh is good if you’re going to refresh once a day: it does a full table scan on the source table and recreates the snapshot/mview. Also, inserts/updates on the source table are NOT impacted where complete refresh snapshots have been created.
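A rough sketch of the two refresh styles, with a hypothetical sales table:

    -- Fast refresh needs a materialized view log on the source table to
    -- record the change vectors.
    CREATE MATERIALIZED VIEW LOG ON sales WITH PRIMARY KEY;

    CREATE MATERIALIZED VIEW sales_mv
      REFRESH FAST ON DEMAND
      AS SELECT * FROM sales;

    -- Apply just the logged changes (fast), or rebuild from scratch (complete).
    EXEC DBMS_MVIEW.REFRESH('SALES_MV', 'F')
    EXEC DBMS_MVIEW.REFRESH('SALES_MV', 'C')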

14. What does the NO LOGGING option do? Why would we use it? Why would we be careful of using it?

It disables the logging of changes to the redologs. It does not disable ALL logging, however – Oracle still records a minimal set of changes so the instance can recover if you pull the plug on the box, for instance. It will cause problems if you are using a standby database, though. Use it to speed up operations like an index rebuild or partition maintenance operations.
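For example, an index rebuild with minimal redo might look like this – the index name is hypothetical, and remember a standby or a media recovery will not have those changes:

    -- Rebuild generating minimal redo; much faster, but not recoverable from
    -- the archived logs, so take a fresh backup afterwards.
    ALTER INDEX orders_cust_idx REBUILD NOLOGGING;

    -- Put the index back to fully logged for future maintenance.
    ALTER INDEX orders_cust_idx LOGGING;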

15. Tell me about standby databases. What are some possible configurations? What should we watch out for?

Standby databases allow us to create a copy of our production db for disaster recovery. We merely switch the mode on the target db and bring it up as read/write. They can be set up as master->slave or master->master; the latter allows the former prod db to become the standby once the cause of the failure is remedied. Watch out for NO LOGGING!! Be sure we’re in archivelog mode.
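Quick checks for those last two points:

    -- Is the database in archivelog mode?
    SELECT log_mode FROM v$database;

    -- Force every operation to generate redo, so a stray NOLOGGING
    -- operation can't silently leave holes in the standby.
    ALTER DATABASE FORCE LOGGING;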

Hey you! If you made it this far, definitely grab our newsletter.

APress – Cost-Based Oracle by Jonathan Lewis

The beauty of reading a book by a publisher not sanctioned by Oracle, and by an author who doesn’t work for Oracle, is that they can openly mention bugs. And there are oh-so-many! This book is a superb introduction to the Cost Based Optimizer, and is not afraid to discuss its many shortcomings. In so doing it also explains how to patch up those shortcomings by giving the CBO more information, either by creating a histogram here and there, or by using the DBMS_STATS package to insert your own statistics in those specific cases where you need to.

Another interesting thing is how this book illustrates, though accidentally, the challenges of proprietary software systems. Much of this book, and the author’s time, is spent reverse engineering the CBO, Oracle’s bread-and-butter optimizing engine. Source code, and details about its inner workings, are not published or available. And of course that’s intentional. But what’s clear page after page in this book is that the DBA and system tuner, going about their day-to-day tasks, really need inside information about what the optimizer is doing, and so this book goes on a long journey to illuminate much of what the CBO is doing, or in some cases to provide very educated guesses and some speculation. In contrast, as we know and hear about often, the Open Source alternative provides free access to source code, though not necessarily to the goods themselves. What this means in a very real way is that a book like this would not need to be written for an alternative open source application, because the internal code would be a proverbial open book. That said, it remains difficult to imagine how a company like Oracle might pursue a more open strategy, given that their bread and butter really is the secrets hidden inside their Cost Based Optimizing engine. At any rate, let’s get back to Jonathan’s book.

Reading this book was like reading a scientist’s notebook. I found it:

o of inestimable value, but sometimes difficult to sift through

o very anecdotal in nature, debugging, and constantly demonstrating that the CBO is much more faulty and prone to errors than you might imagine

o not always easy to use for lookups – you can’t simply say “I have a query of type X and it is behaving funny, how do I look up information on this?”

o his discussion of the evolution of the product is so good I’ll quote it:

“A common evolutionary path in the optimizer code seems to be the following: hidden by undocumented parameter and disabled in first release; silently enabled but not costed in second release; enabled and costed in third release.”

o has excellent chapter summaries which were particularly good for sifting, and boiling down the previous pages into a few conclusions.

o it will probably be of particular value to Oracle’s own CBO development teams

Chapter highlights

CH2 – Tablescans

explains how to gather system stats, how to use dbms_stats to set ind. stats manually, bind variables can make the CBO blind, bind variable peeking may not help, partition exchange may break global stats for table, use CPU costing when possible

CH3 – Selectivity

big problem with IN lists in 8i, fixed in 9i/10g, but still prob. with NOT IN, uses very good example of astrological signs overlapping birth months, and associated CBO cardinality problems, reminds us that the optimizer isn’t actually intelligent per se, but merely a piece of software

CH4 BTree Access

cost based on depth, #leaf blocks, and clustering factor, try to use CPU costing (system statistics)

CH5 – Clustering Factor

mainly a measure of the degree of random distribution of your data, very important for costing indx scans, use dbms_stats to correct when necessary, just giving CBO better information, freelists (procID problem) + freelist groups discussion with RAC

CH6 – Selectivity Issues

there is a big problem with string selectivity, Oracle uses only first seven characters, will be even more trouble for urls all starting with “http://”, and multibyte charactersets, trouble when you have db ind. apps which use string for date, use histograms when you have problems, can use the tuning advisor for “offline optimization”, Oracle uses transitive closure to transform queries to more easily opt versions, moves predicates around, sometimes runs astray

CH7 – Histograms

height balanced > 255 buckets (outside Oracle called equi-depth),

otherwise frequency histograms, don’t use cursor sharing as it forces bind variables, blinds CBO, bind var peeking is only first call, Oracle doesn’t use histograms much, expensive to create, use sparingly, dist queries don’t pull hist from remote site, don’t work well with joins, no impact if you’re using bind vars, if using dbms_stats to hack certain stats be careful of rare codepaths

CH8 – Bitmap Indexes

don’t stop at just one, avoid updates like the plague as can cause deadlocking, opt assumes 80% data tightly packed, 20% widely scattered

CH9 – Query Transformation

partly rule based, peeling the onion w views to understand complex queries, natural language queries often not the most efficient, therefore this transformation process has huge potential upside for Oracle in overall optimization of app code behind the scenes by db engine, always remember Oracle may rewrite your query, sometimes want to block with hints, tell CBO about uniqueness, not NULL if you know this

CH10 – Join Cardinality

makes sensible guess at best first table, continues from there,

don’t hide useful information from the CBO, histograms may help with some difficult queries

CH11 – Nested Loops

fairly straightforward costing based on cardinality of each returned set multiplied together

CH12 – Hash Joins

Oracle executes as optimal (all in memory), onepass (doesn’t quite fit so dumped to disk for one pass) and multipass (least attractive sort to disk), avoid scripts writing scripts in prod, best option is to use workarea_size_policy=AUTO, set pga_aggregate_target & use CPU costing

CH 13 – Sorting + Merge Joins

also uses optimal, onepass, & multipass algorithms, need more than 4x dataset size for in memory sort, 8x on 64bit system, increasing sort_area_size will incr. CPU util so on CPU bottlenecked machines sorting to disk (onepass) may improve performance, must always use ORDER BY to guarantee sorted output, Oracle may not need to sort behind the scenes, Oracle very good at avoiding sorts, again try to use workarea_size_policy=AUTO

CH 14 – 10053 Trace

reviews various ways to enable, detailed rundown of trace with comments inline, and highlights; even mentions a VOL 2 + 3 of the book is coming!

Appendix A

be careful when switching from analyze to dbms_stats, in 10g some new hist will appear w/default dbms_stats options, 10g creates job to gather stats

Conclusion

I found this book to be full of gems of information that you won’t find anywhere else. If you’re at the more technical end of the spectrum, this is a one-of-a-kind Oracle book and a must-have for your collection. Keep in mind something Jonathan mentions in appendix A: “New features that improve 99% of all known queries may cripple your database because you fall into the remaining 1% of special cases”. If these cases are your concern, then this book will surely prove to be one-of-a-kind for you!

View this review on Amazon.com

APress – Expert Oracle DB Arch by Tom Kyte

I have a confession to make. I haven’t read an Oracle book cover-to-cover in almost three years. Sure, I skim through the latest titles for what I need, and of course check out the documentation of the latest releases. That’s what good docs provide: quick reference when you need to check syntax, or the details of a particular parameter or feature. But have you ever read some documentation, sifted through a paragraph or a page or two, and said to yourself, that’s great, but what about this situation I have right now? Unfortunately documentation doesn’t always speak to your real everyday needs. It is excellent for reference, but doesn’t have a lot of real-world test cases and practical usage examples. That’s where Tom Kyte’s new book comes in, and boy is it a killer.

I’ve read Tom’s books before, and always enjoyed them. But his new APress title really stands out as an achievement. Page after page and chapter after chapter he uses straightforward examples pasted right from the SQL*Plus prompt to illustrate, demonstrate, and illuminate concepts that he is explaining. It is this practical hands on, relentless approach that makes this book 700 pages of goodness.

Already an expert at Oracle? You’ll become more of one after reading this book. With reviewers like Jonathan Lewis, I have to admit I expected this book to be good from the outset. But each chapter delves into a bit more depth around subjects that are central to Oracle programming and administration.

No SCREEN SHOTS!

One of the things I loved about this book most of all is its complete lack of screenshots! But how does one illustrate a concept then, you might ask? These days, with graphical interfaces becoming more and more popular even among technical folks, I run into the question of the command line over and over again. How can you be doing sophisticated database administration of the latest servers running Oracle with the command line? Or another question I often get is, can you really do everything with the command line? The answer to both is a resounding yes; in fact you can do much more with the command line. Luckily for us, Tom is of this school too, and page after page of his book is full of real examples and commands that you can try for yourself, with specific instructions on setting up the environment, using statistics gathering packages, and so on. In an era of computing where GUIs seem to reign like magazines over the best literature of the day, it is refreshing to see some of the best and most technical minds around Oracle still advocate the best tool – the command line – as the interface of choice. In fact it is the command line examples, and happily the complete lack of screenshots, that indeed make this book a jewel of a find.

Audience

As a DBA you might wonder why I’m talking so highly of a book more focused towards developers. There are a couple of reasons. First, this book is about the Oracle architecture as it pertains to developers. In order for developers to best take advantage of the enterprise investment in Oracle, they need to thoroughly understand the architecture, how specific features operate, which features are appropriate, and how to optimize their code for best interaction with them. Of course a DBA who is trying to keep a database operating in tip-top shape needs to be aware of when developers are not using Oracle well, to identify and bring attention to bottlenecks and problem areas in the application. Second, it is often a DBA’s job to tune an existing database, and the very largest benefits come from tuning application SQL. For instance if a developer has chosen to use a bitmap index on an INSERT/UPDATE intensive table, they’re in for serious problems. Or if a developer forgot to index a foreign key column. This book directly addresses those types of questions, and when necessary does mention a thing or two of direct importance to DBAs as well.

Highlights

Chapter 2 has an excellent example of creating an Oracle database. You simply write one line to your init.ora – “db_name=sean” for example – and then from the SQL> prompt issue “startup nomount” and then “create database”. Looking at the processes Oracle starts, and the files that are created, can do wonders for your understanding of the database, the instance, and Oracle in general.
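That bare-bones sequence is roughly the following – the db_name is of course just an example:

    -- init.ora contains a single line:  db_name=sean
    -- Then from SQL*Plus, connected as SYSDBA:
    STARTUP NOMOUNT
    CREATE DATABASE;
    -- Now watch the background processes appear and the control files,
    -- redologs and datafiles get created with default names and sizes.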

Chapter 3 covers files, files, and more files. Spfile replaces a text init.ora, allowing parameters to be modified while an instance is running *AND* stored persistently. He covers redolog files, flashback logs, and change tracking files, as well as import/export dump files, and lastly datapump files.

Chapter 4 covers memory, and specifically some of the new auto-magic options, how they work, and what to watch out for.

Chapter 5 covers processes.

Chapters 6, 7, and 8 cover locking/latching, multiversioning, and transactions respectively. I mention them all here together because to me these chapters are the real meat of the book. And that’s coming from a vegetarian! Seriously, these topics are what I consider to be the most crucial to understanding Oracle, and modern databases in general, and the least understood. They are the darkest corners, but Tom illuminates them for us. You’ll learn about optimistic versus pessimistic locking, and about page level, row level, and block level locking in various modern databases such as SQL Server, Informix, Sybase, DB2 and Oracle. Note Oracle is by far in the lead in this department, never locking more than it needs to, which yields the best concurrency, with few situations where users block each other. Readers never block, for instance, because of the way Oracle implements all of this. He mentions latch spinning, which Oracle does to avoid a more expensive context switch, and how to detect and reduce this type of contention. You’ll learn about dirty reads, phantom reads, and non-repeatable reads, and about Oracle’s read-committed versus serializable modes. What’s more, you’ll learn about the implications of these various models on your applications, and what type of assumptions you may have to unlearn if you’re coming to Oracle from developing on another database. If I were to make any criticism at all, I might mention that in this area Tom becomes ever so slightly preachy about Oracle’s superb implementation of minimal locking and non-blocking reads. This is in large part due, I’m sure, to running into so many folks who are used to developing on databases which do indeed dumb you down *BECAUSE* of their implementation, encouraging bad habits with respect to transactions and auto-commit, for instance. One thing is for sure: you will learn a heck of a lot from these three chapters. I know I did.

Chapter 9 Redo & Undo describes what each is, how to avoid checkpoint not complete and why you want to, how to *MEASURE* undo so as to reduce it, how to avoid log file waits (are you on RAID5, are your redologs on a buffered filesystem?), and what block cleanouts are.

Chapter 10 covers tables. After reading it I’d say the most important types are normal (HEAP), Index Organized, Temporary, and External Tables. Use ASSM where possible as it will save you in many ways, use DBMS_METADATA to reverse engineer objects you’ve created to get all the options, don’t use TEMP tables to avoid inline views, or complex joins, your performance will probably suffer, and how to handle LONG/LOB data in tables.
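The DBMS_METADATA trick he mentions is a one-liner – the object name here is hypothetical:

    -- Reverse engineer the full DDL, including all the storage options
    -- Oracle filled in for you.
    SELECT DBMS_METADATA.GET_DDL('TABLE', 'ORDERS') FROM dual;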

Chapter 11 covers indexes, topics ranging from height, compression count, DESC sorted, colocated data, bitmap indexes and why you don’t want them in OLTP databases, function based indexes and how they’re most useful for user defined functions, why indexing foreign keys is important, and choosing the leading edge of an index. Plus when to rebuild or coalesce and why.

Chapter 12 covers datatypes, why never to use CHAR, using the NLS features, the CAST function, the number datatypes and precision versus performance, raw_to_hex, date arithmetic, handling LOB data and why not to use LONG, BFILEs and the new UROWID.

Chapter 13 discusses partitioning. What I like is he starts the chapter with the caveat that partitioning is not the FAST=TRUE option. That says it all. For OLTP databases you will achieve higher availability and ease of administration of large objects, as well as possibly reduced contention on larger objects, but it is NOT LIKELY that you will receive query performance improvements, because of the nature of OLTP. With a datawarehouse, you can use partition elimination on queries that do range or full table scans, which can speed up queries dramatically. He discusses range, list, hash, and composite partitioning, local indexing (prefixed & non-prefixed) and global indexing. Why datawarehouses tend to use local indexes and OLTP databases tend to use global indexes, and even how you can rebuild your global indexes as you’re doing partition maintenance, avoiding a costly rebuild of THE ENTIRE INDEX and the associated downtime. He also includes a great auditing example.

Chapter 14 covers parallel execution, such as parallel DML, DDL, and so on. Here is where a book like Tom’s is invaluable, as he comes straight out with his opinions on a weighty topic. He says these features are most relevant to DBAs doing one-off maintenance and data loading operations. That is because even in datawarehouses, today’s environments often have many, many users. The parallel features are designed to allow single-session jobs to utilize the entire system’s resources. He explains that Oracle’s real sweet spot in this realm is parallel DDL, such as CREATE INDEX, CREATE TABLE AS SELECT, ALTER INDEX REBUILD, ALTER TABLE MOVE, and so on.

Chapter 15, the final chapter, covers loading and unloading data. A significant portion of the chapter covers SQL*Loader for completeness, but he goes on to celebrate the wonders of external tables for loading data into Oracle. In particular there is an option in SQL*Loader to generate the CREATE statement for an external table that does the SAME load! This is great stuff. External tables provide advantages over SQL*Loader in almost every way, except perhaps loading over a network, concurrent user access, and handling LOB data. External tables can use complex where clauses, merge data, do fast code lookups, insert into multiple tables, and finally provide a simpler learning curve.

Conclusions

Yum. If you love Oracle, you’ll want to read this book. If you need to know more about Oracle, say for your job, that’s another reason you might read this book. Oracle is fascinating technology, and Tom’s passion for understanding every last bit of it makes this book both a necessary read and a very gratifying one.

View this review on Amazon.com