So-called ETL relates to moving data from external sources into and out of relational databases or data warehouses.
Source systems may store data in an infinite variety of formats. Extracting involves getting that data into common files for moving to the destination system. CSV file also known as comma separated values is named because each of the records is stored as one line in the file, and fields are separated by commas, and often surrounded by quotes as well. In MySQL INTO OUTFILE syntax can perform this function. If you have a lot of tables to work with, you can script the process using the data dictionary as a lookup for table names, and create a .mysql script to then run with the mysql shell. In Oracle you would use the spool command in SQL*Plus the command line shell. Spool sends subsequent output from the screen also to a file.
This step involves modifying the extracted data in preparation for moving it into the target database server. It may involve sweeping out blank records, or rearranging columns, or breaking files into smaller subsets of data. You might also map values differently for instance if one column in the source database was gender with values M/F you might transform those to the strings “Male” and “Female” if that is more useful for your target database server. Or you might transform those to numerical values, for instance Male & Female might be 0/1 in your target database.
Although I myriad of high level GUI tools exist to perform these functions, the Unix operating system includes a plethora of very powerful tools that every experience System Administrator is familiar with. Those include grep & sed which operate on regular expressions and can perform data transformation at lightening speed. Then there is sort which can sort data and send the results to stdout or the file of your choosing. Other tools include wc – word count, cut which can remove columns and so forth.
This final step involves moving the data into the database server, and it’s final target tables. For instance in MySQL this might be done with the LOAD DATA INFILE syntax, while in Oracle you might use SQL*Loader, which is a very fast flat file dataloader.