ETL (Extract, Transform, Load) is the process of moving data from external sources into, and out of, relational databases or data warehouses.
Source systems may store data in an almost endless variety of formats. Extracting involves getting that data into common files for moving to the destination system. The CSV (comma-separated values) format is so named because each record is stored as one line in the file, with fields separated by commas and often surrounded by quotes as well. In MySQL the SELECT ... INTO OUTFILE syntax can perform this function. If you have a lot of tables to work with, you can script the process using the data dictionary as a lookup for table names, and create a .mysql script to then run with the mysql shell. In Oracle you would use the spool command in SQL*Plus, the command line shell. Spool sends subsequent screen output to a file as well.
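As a rough sketch of scripting the export for many tables, the Python below generates one SELECT ... INTO OUTFILE statement per table. The helper name build_outfile_statements, the table names and the /tmp/export directory are all hypothetical; in practice the table list would come from the data dictionary.

```python
def build_outfile_statements(tables, out_dir="/tmp/export"):
    """Return one SELECT ... INTO OUTFILE statement per table name."""
    statements = []
    for table in tables:
        # Names are pasted into SQL text, so accept simple identifiers only.
        if not table.replace("_", "").isalnum():
            raise ValueError(f"unexpected table name: {table!r}")
        statements.append(
            f"SELECT * FROM `{table}` "
            f"INTO OUTFILE '{out_dir}/{table}.csv' "
            "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
            "LINES TERMINATED BY '\\n';"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical table names; redirect the output to a file and run it
    # with the mysql shell.
    print("\n".join(build_outfile_statements(["customers", "orders"])))
```

Generating the statements as text, rather than running them directly, keeps the script easy to review before anything touches the database.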
The transform step involves modifying the extracted data in preparation for moving it into the target database server. It may involve sweeping out blank records, rearranging columns, or breaking files into smaller subsets of data. You might also map values differently. For instance, if one column in the source database was gender with values M/F, you might transform those to the strings “Male” and “Female” if that is more useful for your target database server. Or you might transform them to numerical values; for instance Male and Female might be 0 and 1 in your target database.
Although a myriad of high-level GUI tools exist to perform these functions, the Unix operating system includes a plethora of very powerful tools that every experienced system administrator is familiar with. Those include grep and sed, which operate on regular expressions and can perform data transformation at lightning speed. Then there is sort, which can sort data and send the results to stdout or the file of your choosing. Other tools include wc (word count) and cut, which can remove columns, and so forth.
The final step, loading, involves moving the data into the database server and its final target tables. For instance, in MySQL this might be done with the LOAD DATA INFILE syntax, while in Oracle you might use SQL*Loader, which is a very fast flat file data loader.
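The transformations described above can be sketched in a few lines of Python using the standard csv module. This is a minimal illustration that assumes a well-formed file with a header row and a column named gender; the function name transform_rows and the sample data are hypothetical.

```python
import csv
import io

GENDER_LABELS = {"M": "Male", "F": "Female"}


def transform_rows(src, dst, column="gender", mapping=GENDER_LABELS):
    """Copy CSV rows from src to dst, dropping blank records and remapping one column."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        # Sweep out records where every field is empty.
        if not any((value or "").strip() for value in row.values()):
            continue
        # Leave unknown codes untouched rather than guessing.
        row[column] = mapping.get(row[column], row[column])
        writer.writerow(row)
        kept += 1
    return kept


if __name__ == "__main__":
    source = io.StringIO("id,name,gender\n1,Ann,F\n,,\n2,Bob,M\n")
    target = io.StringIO()
    transform_rows(source, target)
    print(target.getvalue())
```

Passing a different mapping, such as {"M": "0", "F": "1"}, gives the numeric variant.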
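For the MySQL side of the load, a script might generate the LOAD DATA INFILE statement for each extracted file. The sketch below only builds the SQL text; the function name, file path and table name are hypothetical, and it assumes a comma-separated file with a header line.

```python
def build_load_statement(csv_path, table, skip_header=True):
    """Return a LOAD DATA INFILE statement for a comma-separated file."""
    # Both values are pasted into SQL text, so keep them conservative.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    if "'" in csv_path or "\\" in csv_path:
        raise ValueError(f"unexpected character in path: {csv_path!r}")
    parts = [
        f"LOAD DATA INFILE '{csv_path}'",
        f"INTO TABLE `{table}`",
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'",
        "LINES TERMINATED BY '\\n'",
    ]
    if skip_header:
        # Skip the column-name line at the top of the file.
        parts.append("IGNORE 1 LINES")
    return " ".join(parts) + ";"


if __name__ == "__main__":
    print(build_load_statement("/tmp/export/customers.csv", "customers"))
```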
Quora discussion by Sean Hull – What is ETL?
Data centers are complex beasts, and no amount of operator monitoring by itself can keep track of everything. That’s why automated monitoring is so important.
So what should you monitor? You can divide up your monitoring into a couple of strategic areas. Just as with metrics collection, there is business & application level monitoring and then there is lower level system monitoring which is also important.
Business & Application Monitoring
- If a user is getting an error page or cannot connect
- If an e-commerce transaction is failing
- General service outages
- If a business goal is met – or not
- Page timeouts or slowness
Systems Level Monitoring
- Backups completed successfully
- Error logs from database, webserver & other major services like email
- Database replication is running
- Webserver timeouts
- Database timeouts
- Replication failures – via error logs & checksum checks
- Memory, CPU, Disk I/O, Server load average
- Network latency
- Network security
Tools that can perform this type of monitoring include Nagios.
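To make the idea of automated checks concrete, here is a minimal sketch of threshold-based monitoring. It borrows the status codes that Nagios plugins conventionally return (0 for OK, 1 for WARNING, 2 for CRITICAL); the metric names and threshold values are invented for illustration, and treating a missing metric as CRITICAL is a design choice of this sketch.

```python
OK, WARNING, CRITICAL = 0, 1, 2
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def check_threshold(value, warn, crit):
    """Classify a metric for which higher values are worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def evaluate(metrics, thresholds):
    """Return (worst_status, messages) across every metric that has thresholds."""
    worst, messages = OK, []
    for name, (warn, crit) in sorted(thresholds.items()):
        if name not in metrics:
            # A check that cannot report is treated as a failure in itself.
            status, detail = CRITICAL, f"{name}=no data"
        else:
            status = check_threshold(metrics[name], warn, crit)
            detail = f"{name}={metrics[name]}"
        if status != OK:
            messages.append(f"{STATUS_NAMES[status]} {detail}")
        worst = max(worst, status)
    return worst, messages


if __name__ == "__main__":
    # Hypothetical readings and limits.
    readings = {"load_average": 9.5, "disk_used_pct": 70}
    limits = {"load_average": (4, 8), "disk_used_pct": (80, 90)}
    print(evaluate(readings, limits))
```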
Quora discussion – Web Operations Monitoring
Web-facing database servers receive a barrage of activity 24 hours a day. Sessions are managed for users logging in, ratings are clicked and comments are added. Even more complex are web-based ecommerce applications. All of this activity is organized into small chunks called transactions. They are discrete sets of changes. If you’re editing a word processing document, it might autosave every five minutes. If you’re doing something in Excel it may provide a similar feature. There is also a built-in mechanism for undo and redo of recent edits you have made. These are all analogous to transactions in a database.
These are important because all of these transactions are written to logfiles. They make replication possible, by replaying those changes on another database server downstream.
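A transaction as a discrete, all-or-nothing set of changes can be illustrated with a short sketch. It uses SQLite from the Python standard library purely as a stand-in for a database server; the accounts table and the transfer function are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move amount between two accounts as a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
```

When the second transfer in a sequence fails halfway, the first UPDATE inside it is undone as well, so the data never shows a half-finished change.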
If you have lost your database server because of hardware failure or instance failure in EC2, you’ll be faced with the challenge of restoring your database server. How is this accomplished? Well, the first step would be to restore from the last full backup you have, perhaps a full database dump that you perform every day late at night. Great, now you’ve restored to 2am. How do you get the rest of your data?
That is where point-in-time recovery comes in. Since those transactions were being written to your transaction logs, all the changes made to your database since the last full backup must be reapplied. In MySQL this transaction log is called the binlog, and there is a mysqlbinlog utility that reads the transaction log files, and replays those statements. You’ll tell it the start time – in this case 2am when the backup happened. And you’ll tell it the end time, which is the point-in-time you want to recover to. That time will likely be the time you lost your database server hardware.
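As a sketch of the replay step, the function below assembles a mysqlbinlog command line using its --start-datetime and --stop-datetime options. The binlog file names and timestamps are hypothetical, and the function only builds the argument list; actually piping the output into the mysql client is left to the operator.

```python
import shlex
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_binlog_replay(binlogs, backup_time, failure_time):
    """Return the mysqlbinlog argument list for replaying a time window."""
    if not binlogs:
        raise ValueError("at least one binlog file is required")
    # Parse both timestamps so malformed or reversed windows fail early.
    start = datetime.strptime(backup_time, TIME_FORMAT)
    stop = datetime.strptime(failure_time, TIME_FORMAT)
    if stop <= start:
        raise ValueError("failure time must be after the backup time")
    return [
        "mysqlbinlog",
        f"--start-datetime={backup_time}",
        f"--stop-datetime={failure_time}",
        *binlogs,
    ]


if __name__ == "__main__":
    # Hypothetical file names and times: backup at 2am, failure mid-afternoon.
    argv = build_binlog_replay(
        ["binlog.000042", "binlog.000043"],
        "2011-06-01 02:00:00",
        "2011-06-01 14:35:00",
    )
    print(shlex.join(argv) + " | mysql")
```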
Point-in-time recovery is crucial to high availability, so be sure to back up your binlogs right alongside the full database backups that you take every night. If you lose the server or disk that the database is hosted on, you’ll want an alternate copy of those binlogs available for recovery!
Quora discussion on Point-in-time Recovery by Sean Hull
A data warehouse is a special type of database. It is used to store large amounts of data, such as analytics, historical, or customer data, and then run large reports and data mining against it. It is markedly different from a web-facing or high-transaction database, which typically has many small transactions or pieces of data that are constantly changing, through hundreds or thousands of small user sessions. These typically execute in speeds on the order of 1/100th of a second, while in a data warehouse you have fewer large queries which can take minutes to execute.
Data warehouses are tuned for updates happening in bulk via batch jobs, and for large queries which need big chunks of memory to sort and cross-tabulate data from different tables. Often full table scans are required because of the specialized one-off nature of these reports. The same queries are not executed over and over.
It’s important not to mix data warehousing databases with transactional databases in the same instance, whether you are dealing with MySQL or Oracle. That’s because they are tuned totally differently. It would be like trying to use the same engine for commuting to work and for a container ship traveling around the world. Different jobs require different databases, or databases with their dials set for different uses.
Quora discussion of data warehousing – Sean Hull
A lot of technical forums and discussions have highlighted the limitations of EC2 and how it loses on performance when compared to physical servers of equal cost. They argue that you can get much more hardware and bigger iron for the same money. So it then seems foolhardy to turn to the cloud. Why this mad rush to the cloud then? Of course if all you’re looking at is performance, it might seem odd indeed. But another way of looking at it is, if performance is not as good, it’s clearly not the driving factor to cloud adoption.
CIOs and CTOs are often asking questions more along the lines of, “Can we deploy in the cloud and settle with the performance limitations, and if so how do we get there?”
Another question, “Is it a good idea to deploy your database in the cloud?” It depends! Let’s take a look at some of the strengths and weaknesses, then you decide.
8 big strengths of the cloud
- Flexibility in disaster recovery – it becomes a script, no need to buy additional hardware
- Easier roll out of patches and upgrades
- Reduced operational headache – scripting and automation becomes central
- Uniquely suited to seasonal traffic patterns – keep online only the capacity you’re using
- Low initial investment
- Auto-scaling – set thresholds and deploy new capacity automatically
- Easy compromise response – take the server offline and spin up a new one
- Easy setup of dev, qa & test environments
Some challenges with deploying in the cloud
- Big cultural shift in how operations is done
- Lower SLAs and less reliable virtual servers – mitigate with automation
- No perimeter security – new model for managing & locking down servers
- Where is my data? – concerns over compliance and privacy
- Variable disk performance – can be problematic for MySQL databases
- New procurement process can be a hurdle
Many of these challenges can be mitigated. The promise of infrastructure deployed in the cloud is huge, so gradual adoption is perhaps the best option for many firms. Mitigate the weaknesses of the cloud as follows:
- Use encrypted filesystems and backups where necessary
- Also keep offsite backups in-house or at an alternate cloud provider
- Mitigate against EBS performance – cache at every layer of your application stack
- Employ configuration management & automation tools such as Puppet & Chef
Quora discussion – Why or why not to migrate to the cloud?
Replication in MySQL allows you to copy and replay changes from your primary database to an alternate backup or slave database. This facility in MySQL is an asynchronous process, which means a change is not applied on the slave at the moment it is committed on the primary. It could happen a second later, or minutes later. In fact the secondary database can sometimes get bogged down by heavy load because transactions are applied serially there, while they execute in parallel sessions on production. You can find out how far behind the master you are by running SHOW SLAVE STATUS and looking at the Seconds_Behind_Master value.
If you are sending SELECT (read) traffic from your website to the slave database, you may experience stale reads. For instance, if the slave is 8 seconds behind and you comment on a blog posting, then refresh the page within those 8 seconds, the page would not display the comment just posted!
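One way an application might cope with lag is to check the Seconds_Behind_Master field from SHOW SLAVE STATUS before sending a read to the slave. The sketch below assumes the status row has already been fetched into a dict; the function names, the max_lag parameter and the just_wrote flag are hypothetical.

```python
def seconds_behind(slave_status):
    """Return the lag in seconds, or None when the slave reports NULL."""
    value = slave_status.get("Seconds_Behind_Master")
    return None if value is None else int(value)


def choose_server(slave_status, max_lag=0, just_wrote=False):
    """Pick 'master' or 'slave' for a read query."""
    lag = seconds_behind(slave_status)
    # Read from the master if this user just wrote something, if the lag is
    # unknown (NULL usually means replication is not running), or if the
    # slave is further behind than we are willing to tolerate.
    if just_wrote or lag is None or lag > max_lag:
        return "master"
    return "slave"


if __name__ == "__main__":
    print(choose_server({"Seconds_Behind_Master": 8}, max_lag=2))
```

Routing a user's own reads to the master for a short time after they write is a common compromise: everyone else still benefits from the slave, and the commenter sees their comment.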
As it turns out the Maatkit toolkit has a tool called mk-slave-prefetch which can help with slow performance of the slave. Since most of the work of doing inserts, updates and deletes involves fetching the right rows, running a similar SELECT query in advance of running the actual transaction will warm up the caches, and speed things up dramatically and may be enough for your needs. Test it first and find out.
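The prefetch idea can be illustrated with a deliberately simplified sketch that turns an UPDATE or DELETE into a SELECT touching the same rows. This is not how mk-slave-prefetch is implemented; it is a regex-based toy that only handles single-table statements with a plain WHERE clause, and the function name is hypothetical.

```python
import re

_UPDATE = re.compile(
    r"^\s*UPDATE\s+(?P<table>\S+)\s+SET\s+.+?\s+(?P<where>WHERE\s+.+?);?\s*$",
    re.IGNORECASE | re.DOTALL,
)
_DELETE = re.compile(
    r"^\s*DELETE\s+FROM\s+(?P<table>\S+)\s+(?P<where>WHERE\s+.+?);?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def to_prefetch_select(statement):
    """Return a SELECT that reads the rows a write would touch, or None."""
    for pattern in (_UPDATE, _DELETE):
        match = pattern.match(statement)
        if match:
            # Reading the rows first pulls the needed pages into the caches.
            return f"SELECT * FROM {match['table']} {match['where']}"
    # INSERTs and anything this toy parser does not understand are skipped.
    return None


if __name__ == "__main__":
    print(to_prefetch_select("UPDATE comments SET rating = rating + 1 WHERE post_id = 42;"))
```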
Semi-synchronous replication comes to the rescue if you really need this type of guarantee, but it comes at a cost. You enable it on the master, then on the slave, and restart the slave. Whenever the master commits a transaction, it will block until one of two things happens: either it gets an acknowledgement from at least one slave that the transaction’s events have been received and written to its relay log, or it reaches the timeout threshold.
This type of arrangement may sound fine in theory as such blocking would often be less than a second. However in the microscopic world of high speed, high transaction, high traffic websites, this may be an eternity, and one which can slow the database down substantially. So test first before assuming it’s a solution that will help you.
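The blocking behavior can be modeled with a small simulation: the committing side waits on an event that the slave side sets when it acknowledges, and gives up after a timeout. This is only a toy using Python threads, not MySQL code; the function names and delays are hypothetical.

```python
import threading


def commit_semi_sync(send_to_slave, timeout):
    """Ship a transaction, then block until a slave acknowledges or the timeout expires."""
    acked = threading.Event()
    # Hand the slave a callback it invokes once it has the transaction.
    send_to_slave(acked.set)
    if acked.wait(timeout):
        return "acknowledged"
    # No acknowledgement in time: the commit proceeds without the guarantee.
    return "timed out"


def slave_with_delay(delay):
    """Return a hypothetical slave that acknowledges after delay seconds."""
    def send(ack):
        timer = threading.Timer(delay, ack)
        timer.daemon = True
        timer.start()
    return send


if __name__ == "__main__":
    print(commit_semi_sync(slave_with_delay(0.01), timeout=1.0))
    print(commit_semi_sync(slave_with_delay(0.5), timeout=0.05))
```

Every commit pays at least the round trip to the slave, which is why the wait that looks negligible for one transaction adds up on a busy server.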
Quora discussion of Semi-synchronous Replication
MySQL is a relational database that backs many internet websites and enterprise applications. Like all enterprise software, it has a whole complement of features which are well documented, such as data types, storage engines, transactional behaviors and so forth. It also has a set of processes, many of which involve how software operates on Linux servers, such as how it gets installed, where binaries and libraries will get placed, where to find logfiles, and how to move directories and set permissions. Thirdly it is important to understand the culture, in this case Unix-based, forum discussions and community contributions as an open-source project.
MySQL can do much of the workhorse kind of stuff you see in databases like Oracle or SQL Server, but sometimes it achieves those goals in very different ways. For instance there are many open-source projects that support and surround the database, such as mysqltuner an analysis script, innotop a unix top-like utility for monitoring on-going activity in the database, and maatkit a whole suite of tools that build on and expand the features already present in the MySQL database.
Some Limitations in MySQL
- Complex queries and subqueries specifically can be problematic in MySQL. If you’re used to writing huge queries in Oracle, and having the CBO figure everything out for you, you’ll be in for a surprise with MySQL. Keep your queries simple, index the proper columns and avoid complex joins where possible. The EXPLAIN facility is at your disposal. Use it!
- Vertical scalability problems – primarily addressed in 5.5, the latest version of MySQL. Previously the database did not scale well on boxes with more than four processors. SMP or Symmetric Multiprocessing servers were less common 10-15 years ago when MySQL was in its infancy, and development is slowly catching up with the big iron of today.
- There is no flashback table, tablespace or database that you might find in other databases such as Oracle. You can achieve the same thing with point-in-time recovery, so keep regular backups of your database, and also backup the transaction logs.
- MySQL can do JOINs, but only with the nested loops algorithm. It can’t do sort merge join or hash join.
- In versions before 5.5, MyISAM is the default table type and storage engine. It is not crash safe and not transactional. On those installations it’s recommended that you change the default to InnoDB and use InnoDB for most if not all of your tables. It’s very reliable and very fast!
- There is a query cache, but it caches result sets not query plans! It also has some performance issues and shows some erratic behavior on larger SMP boxes. Query plans are cached on a session basis, but when a session is closed and reopened, MySQL must reparse and reexecute that query.
- MySQL does not have a facility like Oracle’s Real Application Clusters. It does have NDB Cluster, which is an all-in-memory clustering solution. Despite its promise, it tends to have very serious performance problems with any type of join, and is mainly good for single table index-based lookups. If managed well it can increase availability but will probably reduce performance.
- MySQL’s default replication solution is statement based. Although it is easy to set up, it breaks almost as easily, sometimes with resolvable errors, and sometimes silently. Consider row-based replication, and definitely make use of Maatkit’s mk-table-checksum and mk-table-sync tools. Also be sure to do thorough and regular monitoring of your replication setup.
- There are no in-built materialized views or snapshots in MySQL. There is an open-source project called Flexviews by Justin Swanhart that provides this facility to the MySQL community.
- MySQL provides stored procedures, triggers and functions as a regular feature to the database. However I would use them with caution. They are very difficult to edit, troubleshoot and diagnose when they are causing troubles. Also as with the query plan caching, stored procedures are cached at the session level, so they can be expensive to execute over and over again in different areas of your application. They can cause real performance problems.
- There is no in-built mechanism for auditing that you find in relational databases such as Oracle 11g.
- Only b-tree indexes are supported, no bitmap indexes, index-organized tables, clustered indexes or other more exotic index types.
- ALTER TABLE is generally a locking and blocking operation. For example, if you add a new column or change a column’s data type, the entire table will be locked for the duration of the operation. This will be a surprise coming from the Oracle world, where these types of operations can routinely be done online.
MySQL’s Strengths Are Numerous
- Install from packages using yum or apt-get. Fast & simple!
- Works great in the cloud, using MySQL Community distro, Percona distro, or Amazon’s own RDS solution.
- Comes out-of-the-box with an excellent command line shell providing all sorts of features and power that are constant frustrations on the Oracle side: command history, standard input/output redirection support, a full complement of features and options, and easy autologin with a user level my.cnf file, which fits in nicely with the global settings as well.
- A simpler mechanism to serve unique id columns with the AUTO_INCREMENT column attribute. Although Oracle’s sequence method is extremely scalable, for many developers it is troublesome and confusing.
- Good support of the LIMIT clause allowing an easier method for developers to fetch a subset of data.
- A huge community of users, forums, and support in third party applications such as monitoring (Nagios etc…) as well as metrics collection (Munin, Cacti, OpenNMS, Ganglia etc.)
- Great visibility of system variables with SHOW VARIABLES. Many can be changed dynamically as well, just like Oracle.
- Great visibility of internal system state with SHOW PROCESSLIST.
- System counters for all sorts of internal instrumentation data using SHOW STATUS and SHOW ENGINE INNODB STATUS. Ultimately it is not as comprehensive as Oracle’s own data dictionary and millions of instrumentation counts. However Oracle could take a huge page out of the MySQL book in terms of usability. The obfuscation of Oracle’s internal kernel state makes it all but unusable by most.
- innotop, a utility much like the Unix top facility that all Unix & Linux folks love, provides instant visibility into what queries are running, what work is being done, and what is blocking. Oracle could really take a page from this playbook, as this tool is invaluable.
- The incredible Maatkit, a veritable goldmine of great community contributed powertools. Query analyzers, profilers, log tools, replication tools, data archiver, a find facility, and a whole lot more!
Sean Hull discusses further on Quora – What considerations are important when migrating to MySQL?
Root Cause Analysis is the means to identify the ultimate source and cause of an outage. When an outage occurs that causes serious downtime of a website, organizations are typically in crisis mode. Urgency of resolution sometimes pushes aside due process, change management and general caution. Root Cause Analysis attempts, as much as possible, to isolate logfiles, configurations, and the current state of systems for later analysis.
With traditional physical servers, physical hardware failure, operator error, or a security breach can cause outages. Since you’re dealing with one physical machine, resolving that issue necessarily means moving around the things that broke. So caution and later analysis must be balanced with the immediate problem resolution.
Another silver lining in cloud hosted solutions is around root cause analysis. If a server was breached, for example, that server can immediately be shut down while preserving its current state as a disk or EBS snapshot. A new server can then be fired up from an AMI, your server rebuilt from scripts or a template, and you’re back up and running. Save the snapshot for later analysis.
This could be used for analysis of operator error related outages as well. Hardware failures are more expected and common in cloud hosted environments, so this should and really must push adoption of best practices around infrastructure, that is having scripts at hand that rebuild everything from bare metal.
More discussion of root cause analysis by Sean Hull on Quora.
Backups are obviously an important part of any managed infrastructure deployment. Computing systems are inherently fallible, through operator error or hardware failure. Existing systems must be backed up, from configurations, software and media files, to the backend data store.
In a managed hosting environment or cloud hosting environment, it is convenient to use various filesystem snapshot technologies to perform backups of entire disk volumes in one go. These are powerful, fast, reliable, and easy to execute. In Amazon EC2 for example these EBS snapshots are stored on S3. But what happens if your data center goes down – through network outage or power failure? Or further what happens if S3 goes offline? Similar failures can affect traditional managed hosting facilities as well.
This is where offsite backups come in handy. You would then be able to rebuild your application stack and infrastructure despite all of your production servers being offline. That’s peace of mind! Offsite backups can come in many different flavors:
- mysqldump of the entire database, performed daily and copied to alternate hosting facility
- semi-synchronous replication slave to alternate datacenter or region
- DRBD setup – distributed filesystem upon which your database runs
- replicated copy of version control repository – housing software, documentation & configurations
Offsite backups can also be coupled with a frequent sync of the binlog files (transaction logs). These in combination with your full database dump will allow you to perform point-in-time recovery to the exact point the outage began, further reducing potential data loss.
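A nightly job along these lines might be sketched as follows: one helper builds a mysqldump command for the full dump, and another works out which binlog files still need to be copied offsite. The directory, file names and function names are hypothetical, and the actual copy to the alternate facility is left out.

```python
from datetime import date


def dump_command(day, backup_dir="/backups"):
    """Return (argv, output_path) for a full logical dump taken on day."""
    output_path = f"{backup_dir}/full-{day.isoformat()}.sql"
    argv = [
        "mysqldump",
        "--all-databases",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        f"--result-file={output_path}",
    ]
    return argv, output_path


def binlogs_to_sync(local_files, offsite_files):
    """Return binlog files present locally but missing offsite, oldest first."""
    # Zero-padded numeric suffixes mean plain string sorting gives log order.
    return sorted(set(local_files) - set(offsite_files))


if __name__ == "__main__":
    print(dump_command(date(2011, 6, 1)))
    print(binlogs_to_sync(["binlog.000041", "binlog.000042"], ["binlog.000041"]))
```

Running the binlog sync far more often than the full dump is what narrows the window of potential data loss.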
Offsite Backups – What are they – discussed on Quora by Sean Hull
Database replication is a service for shipping changes to your database, off to a copy housed on another server, potentially even in another data center.
Changes to the records in your database, individually or in groups, are organized into transactions. Each transaction gets a unique identifier or change number. Those groupings of changes are logged to transaction logs, which are then moved across to the sibling database server and applied there.
Oracle offers this facility in the form of Standby Database, aka Data Guard. Although it can be messy to set up, it is fairly bulletproof. If you don’t receive any errors in your logfiles, you can rest assured that the data on your main or master database server looks exactly the same as what’s on your secondary server. With MySQL the situation can be a bit more complicated. MySQL replication was built based on statements, not changed data. So those same statements or instructions (SQL statements of DML & DDL) get rerun on the secondary database server. Some of the problems associated with this are:
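The mechanism can be modeled in a few lines: the primary appends each committed transaction to a log with an increasing change number, and the replica applies any entries newer than the last one it has seen. This is a toy in-memory model with hypothetical class names, not how any particular database implements it.

```python
class Primary:
    """Toy primary that records every committed transaction in a log."""

    def __init__(self):
        self.data = {}
        self.log = []  # list of (change_number, changes) in commit order

    def commit(self, changes):
        """Apply a group of key/value changes as one numbered transaction."""
        self.data.update(changes)
        number = len(self.log) + 1
        self.log.append((number, dict(changes)))
        return number


class Replica:
    """Toy replica that replays the primary's log in order."""

    def __init__(self):
        self.data = {}
        self.last_applied = 0

    def catch_up(self, log):
        """Apply entries newer than last_applied; return how many were applied."""
        applied = 0
        for number, changes in log:
            if number <= self.last_applied:
                continue  # already replayed on an earlier pass
            self.data.update(changes)
            self.last_applied = number
            applied += 1
        return applied
```

Tracking the last applied change number is what lets the replica resume after an interruption without replaying anything twice.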
- MySQL standby transactions may execute at different times than master
- Serialization is important to maintain order. Transactions need to get applied in the same order as on the primary. However, not all tables & storage engines in MySQL are transactional. This can cause troubles, and potentially break replication on the slave side.
- MySQL replication can fail silently, that is data can drift out of sync with the master without your knowing.
Caveats aside, MySQL replication is powerful, and quite fast. With the right caution, and correct tools, it can be as robust and reliable as Oracle’s own standby database technology.
- Use checksum tools like Maatkit to verify & compare tables on primary & secondary servers
- Use rigorous monitoring to watch error logs, and checksum logs hourly if necessary.
- Use Maatkit sync tool to resync tables if they get out of whack.
- Be especially vigilant if you’re taking backups off the secondary server
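The idea behind checksum verification can be shown with a small sketch: compute a checksum per table on each server and report the tables that differ. This is a generic illustration that works on rows already fetched into Python, not a description of how Maatkit's tools operate; the function names and sample data are hypothetical.

```python
import hashlib


def table_checksum(rows):
    """Checksum an iterable of row tuples, independent of row order."""
    digests = sorted(
        hashlib.sha1(repr(tuple(row)).encode("utf-8")).hexdigest() for row in rows
    )
    return hashlib.sha1("".join(digests).encode("ascii")).hexdigest()


def drifted_tables(primary, secondary):
    """Return names of tables that differ or exist on only one server."""
    drifted = []
    for name in sorted(set(primary) | set(secondary)):
        if name not in primary or name not in secondary:
            drifted.append(name)
        elif table_checksum(primary[name]) != table_checksum(secondary[name]):
            drifted.append(name)
    return drifted


if __name__ == "__main__":
    # Hypothetical table contents keyed by table name.
    master = {"users": [(1, "ann"), (2, "bob")], "posts": [(1, "hi")]}
    slave = {"users": [(2, "bob"), (1, "ann")], "posts": [(1, "hello")]}
    print(drifted_tables(master, slave))
```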
Replication can also be a powerful high availability tool in your arsenal. Set up as a master-master cluster, you can keep your application pointed to the “active” side or role while performing blocking operations on the inactive database server. Since a master-master cluster, also known as circular replication, applies changes to both databases, you’ll need to stop replication during the blocking operation. Once it is completed on the inactive side, redirect the webservers & application to the database where you completed the operation, then reenable replication and watch the databases catch up with each other!
Quora discussions – What is database replication?