Tag Archives: data warehouse

Extract Transform & Load – What is it and why is it important?

ETL, short for extract, transform and load, refers to moving data from external sources into and out of relational databases or data warehouses.


Source systems may store data in a wide variety of formats.  The extract step involves getting that data into common files for moving to the destination system.  A CSV (comma-separated values) file is so named because each record is stored as one line in the file, with fields separated by commas and often surrounded by quotes as well.  In MySQL, the SELECT ... INTO OUTFILE syntax can perform this function.  If you have a lot of tables to work with, you can script the process using the data dictionary as a lookup for table names, and generate a .mysql script to then run with the mysql shell.  In Oracle you would use the spool command in SQL*Plus, the command-line shell.  Spool sends subsequent screen output to a file as well.
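As a rough sketch of the extract step, the Python below looks up table names in the database catalog and writes each table out as quoted CSV.  It uses an in-memory SQLite database as a stand-in for the real source system, and the table, data and function names (customers, list_tables, extract_table) are made up for illustration.

```python
import csv
import io
import sqlite3


def list_tables(conn):
    """Look up user table names in SQLite's catalog (its data dictionary)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def extract_table(conn, table, out):
    """Write every row of `table` to `out` as quoted CSV, header first."""
    # Table names cannot be bound as parameters, so only pass trusted names
    # (here they come from the catalog lookup above).
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)


# Hypothetical source data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ann Lee", "F"), (2, "Bob Ray", "M")],
)

for table in list_tables(conn):
    buffer = io.StringIO()  # swap for open(f"{table}.csv", "w", newline="")
    extract_table(conn, table, buffer)
    print(buffer.getvalue())
```

Against MySQL or Oracle the same loop would more likely generate INTO OUTFILE or spool statements, but the shape of the job is the same: one catalog lookup, then one export per table.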


The transform step involves modifying the extracted data in preparation for moving it into the target database server.  It may involve sweeping out blank records, rearranging columns, or breaking files into smaller subsets of data.  You might also map values differently.  For instance, if one column in the source database was gender with values M/F, you might transform those to the strings “Male” and “Female” if that is more useful for your target database server.  Or you might transform them to numerical values; for instance, Male and Female might be 0 and 1 in your target database.
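Here is a minimal sketch of that kind of transformation in Python, assuming the extracted file is a well-formed CSV with a header line and a gender column.  The sample data and the names GENDER_LABELS and transform are invented for the example.

```python
import csv
import io

GENDER_LABELS = {"M": "Male", "F": "Female"}  # or {"M": 0, "F": 1}


def transform(rows, gender_map=GENDER_LABELS):
    """Yield cleaned rows: skip blank records and remap the gender column."""
    for row in rows:
        # Sweep out records where every field is empty or whitespace.
        if not any((value or "").strip() for value in row.values()):
            continue
        cleaned = dict(row)
        code = (row.get("gender") or "").strip().upper()
        # Unknown codes pass through unchanged rather than being guessed at.
        cleaned["gender"] = gender_map.get(code, row.get("gender"))
        yield cleaned


raw = "id,name,gender\n1,Ann Lee,F\n,,\n2,Bob Ray,M\n"
for record in transform(csv.DictReader(io.StringIO(raw))):
    print(record)
```

Passing {"M": 0, "F": 1} as the second argument gives the numeric mapping instead.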

Although a myriad of high-level GUI tools exist to perform these functions, the Unix operating system includes a plethora of very powerful tools that every experienced system administrator is familiar with.  Those include grep and sed, which operate on regular expressions and can perform data transformation at lightning speed.  Then there is sort, which can sort data and send the results to stdout or the file of your choosing.  Other tools include wc, which counts lines, words and bytes, and cut, which can extract selected columns, and so forth.
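To give a feel for what these tools do, here are rough Python look-alikes chained together the way you would pipe the real commands.  They are simplified stand-ins rather than reimplementations, and the sample lines are invented.  Like the real cut, this version splits on the delimiter blindly and does not understand quoted fields.

```python
import re


def grep(lines, pattern):
    """Keep only lines matching the regular expression, like grep."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def sed(lines, pattern, replacement):
    """Apply a regex substitution to each line, like sed 's/x/y/g'."""
    regex = re.compile(pattern)
    return [regex.sub(replacement, line) for line in lines]


def cut(lines, fields, delimiter=","):
    """Keep only the listed 1-based fields of each line, like cut -d, -f."""
    return [
        delimiter.join(line.split(delimiter)[i - 1] for i in fields)
        for line in lines
    ]


lines = ["3,carol,NY", "1,alice,CA", "2,bob,ny", "# comment"]
# Drop the comment line, normalize the state code, keep name and state, sort.
result = sorted(cut(sed(grep(lines, r"^\d"), r"ny$", "NY"), [2, 3]))
print(result)
```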

The final step, load, involves moving the data into the database server and its final target tables.  For instance, in MySQL this might be done with the LOAD DATA INFILE syntax, while in Oracle you might use SQL*Loader, which is a very fast flat-file data loader.
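The load step can be sketched the same way.  The Python below reads a CSV file with a header line and bulk-inserts it into an existing table inside a single transaction.  SQLite again stands in for the target server, and load_csv and the customers table are hypothetical names; on MySQL or Oracle the native bulk loaders mentioned above would normally be the faster choice.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, columns, source):
    """Bulk-insert CSV rows (with a header line) into an existing table."""
    reader = csv.DictReader(source)
    placeholders = ", ".join("?" for _ in columns)
    column_list = ", ".join(f'"{c}"' for c in columns)
    sql = f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})'
    rows = [tuple(record[c] for c in columns) for record in reader]
    # One transaction for the whole batch: all rows load or none do.
    with conn:
        conn.executemany(sql, rows)
    return len(rows)


target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (id INTEGER, name TEXT, gender TEXT)")
data = io.StringIO('id,name,gender\n1,"Ann Lee",Female\n2,"Bob Ray",Male\n')
loaded = load_csv(target, "customers", ["id", "name", "gender"], data)
print(loaded, target.execute("SELECT * FROM customers").fetchall())
```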

Quora discussion by Sean Hull – What is ETL?

Data warehousing – What is it and why is it important?

A data warehouse is a special type of database.  It is used to store large amounts of data, such as analytics, historical, or customer data, and then run large reports and data mining against it.  It is markedly different from a web-facing or high-transaction database, which typically has a great many small transactions or pieces of data that are constantly changing, through many hundreds or thousands of small user sessions.  These typically execute in times on the order of 1/100th of a second, while in a data warehouse you have fewer, larger queries which can take minutes to execute.

Data warehouses are tuned for updates happening in bulk via batch jobs, and for large queries which need big chunks of memory to sort and cross-tabulate data from different tables.  Often full table scans are required because of the specialized one-off nature of these reports.  The same queries are not executed over and over.
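For a sense of what such a report looks like, here is a toy cross-tabulation of sales by region and year, written in Python against an in-memory SQLite database.  The customers and sales tables and their contents are invented for the example; a real warehouse query would run over millions of rows rather than four.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sales (customer_id INTEGER, year INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (3, 'East');
    INSERT INTO sales VALUES
        (1, 2010, 100.0), (1, 2011, 150.0),
        (2, 2010, 200.0), (3, 2011, 50.0);
    """
)

# Cross-tabulate revenue by region (rows) and year (columns).
report = conn.execute(
    """
    SELECT c.region,
           SUM(CASE WHEN s.year = 2010 THEN s.amount ELSE 0.0 END) AS y2010,
           SUM(CASE WHEN s.year = 2011 THEN s.amount ELSE 0.0 END) AS y2011
    FROM sales AS s
    JOIN customers AS c ON c.id = s.customer_id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()
print(report)
```

Note that the query has to read every row of sales to build its totals, which is the full-table-scan pattern described above.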

It’s important not to mix data warehousing databases with transactional databases in the same instance, whether you are dealing with MySQL or Oracle.  That’s because they are tuned totally differently.  It would be like trying to use the same engine for a car commuting to work and for a container ship traveling around the world.  Different jobs require different databases, or databases with their dials set for different uses.

Quora discussion of data warehousing – Sean Hull

Relational Database – What is it and why is it important?

A relational database is the warehouse of your data.  Your crown jewels.  It’s your Excel spreadsheet or filing cabinet writ large.  You use them every day and may not know it.  Your smartphone stores its contact database in a relational database, most likely SQLite, the ever-present but ever-invisible embedded database platform.  Your online bank at Citibank or Chase stores all your financial history, statements, contact info, personal data and so forth, all in a relational database.

  • organized around records
  • data points are columns in a table
  • relationships are enforced with constraints
  • indexing data brings high-speed access
  • SQL is used to get data in and out of the db
  • triggers, views, stored procs & materialized views may also be supported

Like Excel, relational databases are organized around records.  A record is like a 3×5 card with a number of different data points on it.  Say you have 3×5 cards for your address book.  Each card holds one address, phone number, email, picture, notes and so forth.  By organizing things nicely on cards, with the same predictable fields on each card such as first name, last name, birthday etc, you can then search on those data points.  Want all the people in your address book with a birthday of July 5th?  No problem.
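The address book example is easy to sketch in Python with SQLite, the embedded database mentioned above.  The contacts table, its columns and the sample names are invented for illustration, and birthdays are assumed to be stored as ISO dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE contacts (
        id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        birthday TEXT  -- ISO date, e.g. 1980-07-05
    )
    """
)
conn.executemany(
    "INSERT INTO contacts (first_name, last_name, birthday) VALUES (?, ?, ?)",
    [
        ("Ada", "Park", "1980-07-05"),
        ("Ben", "Ortiz", "1975-12-01"),
        ("Cy", "Nguyen", "1992-07-05"),
    ],
)


def born_on(conn, month, day):
    """Return (first, last) names of contacts born on the given month/day."""
    rows = conn.execute(
        "SELECT first_name, last_name FROM contacts "
        "WHERE strftime('%m-%d', birthday) = ? ORDER BY last_name",
        (f"{month:02d}-{day:02d}",),
    )
    return rows.fetchall()


print(born_on(conn, 7, 5))
```

Each row is one card, each column is one predictable field, and the WHERE clause does the searching.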

While relational databases have great advantages, they require a lot of work to get all of your information into neatly organized tables.  What’s more, SQL, the method for getting things into and out of them, is a quirky and not very friendly language.  Relational databases also have trouble clustering and scaling horizontally.  NoSQL databases have made some headway in these departments, but at costs to consistency and reliability of data.

As servers continue to get larger, it becomes rarer that a single web-facing database really needs more than a single server.  If it’s tuned right, that is.  Looking to the future, the landscape will probably continue to be populated by a mix of traditional relational databases, new NoSQL-type databases, key-value stores, and other new technologies yet to be dreamed up.

Sean Hull asks on Quora – What is an rdbms and why are they important?

Big Data – What is it and why is it important?

There’s lots of debate about exactly what constitutes “big” when talking about big data.  Technical folks may be inclined to want a specific number.

But when most CTOs and operations managers talk about big data, they mean data warehouse and analytics databases.  Data warehouses are unique in that they are tuned to run large reporting queries and churn through large multi-million row tables.  Here you load up on indexes to support those reports, because the data is not constantly changing as in a web-facing, transaction-oriented database.

More and more databases such as MySQL, which were originally built as web-facing databases, are being used to support big data analytics.  MySQL does have some advanced features to support large databases, such as partitioned tables, but many operations, such as table alters and index creation, still cannot be done *online*.  In these cases configuring MySQL in a master-master active/passive cluster provides higher availability.  Perform blocking operations on the inactive side of the cluster, and then switch the active node.

We’ve worked with MySQL databases as large as 750 GB in size and single user tables as large as 40 million records without problems.  Table size, however, has to be taken into consideration for many operations and queries.  But as long as your tables are indexed to fit the query, and you minimize table scans, especially on joins, your MySQL database server will happily support these huge datasets.
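One way to check whether a table is indexed to fit a query is to ask the database for its query plan before and after adding the index.  The sketch below does this in Python with SQLite's EXPLAIN QUERY PLAN as a stand-in for MySQL's EXPLAIN; the orders table, the index name and the query_plan helper are hypothetical, and the exact plan wording varies between SQLite versions.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description lines for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # last column holds the readable detail


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(n % 100, n * 1.5) for n in range(1000)],
)

sql = "SELECT total FROM orders WHERE user_id = ?"
before = query_plan(conn, sql, (42,))  # no index yet: a full table scan
conn.execute("CREATE INDEX idx_orders_user ON orders (user_id)")
after = query_plan(conn, sql, (42,))  # now a search using the new index
print(before, after, sep="\n")
```

The first plan should begin with SCAN and the second with SEARCH, naming the index.  On a 40 million row table that is the difference between reading every row and reading a handful.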

Sean Hull discusses on Quora – What is Big Data and why is it important?